Skip to content

Fixed profilier rules for empty strings and nulls#1037

Closed
howardwu-db wants to merge 2 commits intodatabrickslabs:mainfrom
howardwu-db:fix_profiler
Closed

Fixed profilier rules for empty strings and nulls#1037
howardwu-db wants to merge 2 commits intodatabrickslabs:mainfrom
howardwu-db:fix_profiler

Conversation

@howardwu-db
Copy link

Changes

Fixed profiler section for empty values and nulls.

Linked issues

Resolves #1036

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@howardwu-db howardwu-db requested a review from a team as a code owner February 17, 2026 22:51
@howardwu-db howardwu-db requested review from nehamilak-db and removed request for a team February 17, 2026 22:51
@github-actions
Copy link

All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

@ghanse ghanse self-requested a review February 17, 2026 22:51
Copy link
Contributor

@mwojtyczka mwojtyczka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to setup GPG key and sign commits, you can follow instructions here to repush the commits: https://databrickslabs.github.io/dqx/docs/dev/contributing/#first-contribution

null_percentage = 1 - (1.0 * count_non_null) / total_count
empty_percentage = 1 - (1.0 * count_non_empty) / total_count

if typ == T.StringType() and count_non_empty >= (total_count *(1 - max_empty_ratio)) and count_non_null >= (total_count * (1 - max_null_ratio)):
Copy link
Contributor

@mwojtyczka mwojtyczka Feb 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please refactor this to a separate methods for readability and document as per https://databrickslabs.github.io/dqx/docs/dev/contributing/#first-contribution

Copy link
Contributor

@mwojtyczka mwojtyczka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here is good — separating is_not_null, is_not_empty, and is_not_null_or_empty into explicit branches is cleaner than the previous approach. However there are several correctness bugs that need fixing before merge, and no tests are included for the new behaviour. See inline comments.


return {
"check": {"function": "is_not_empty", "arguments": {"column": column}},
"name": f"{column}_is_empty",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule name is wrong — will generate broken rule names

The generated rule name is f"{column}_is_empty" but the check function is is_not_empty. Every other generator method uses the check-function name as the suffix (e.g. dq_generate_is_not_null{column}_is_not_null). Should be:

"name": f"{column}_is_not_empty",

filter=opts.get("filter", None),
)

count_non_empty = dst.filter(F.col(field_name) == "").count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count_non_empty is a misnomer — it stores the count of EMPTY strings

dst.filter(F.col(field_name) == "") returns rows where the value is the empty string, so count_non_empty actually holds the empty-string count. This wrong name ripples into every downstream condition and description. Rename to count_empty:

count_empty = dst.filter(F.col(field_name) == "").count()

And update all references accordingly. This single naming error is the root cause of the two bugs described in the comments on lines 482 and 484.


count_non_empty = dst.filter(F.col(field_name) == "").count()
null_percentage = 1 - (1.0 * count_non_null) / total_count
empty_percentage = 1 - (1.0 * count_non_empty) / total_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty_percentage computes the NON-empty fraction, not the empty fraction

# current (wrong):
empty_percentage = 1 - (1.0 * count_non_empty) / total_count
# count_non_empty is actually count_empty, so this gives: 1 - (empty_count / total)
# = fraction of NON-empty rows

The description then displays this as "has X% of empty values", which shows the inverse. Should be:

empty_percentage = (1.0 * count_empty) / total_count  # fraction of empty rows

null_percentage = 1 - (1.0 * count_non_null) / total_count
empty_percentage = 1 - (1.0 * count_non_empty) / total_count

if typ == T.StringType() and count_non_empty >= (total_count *(1 - max_empty_ratio)) and count_non_null >= (total_count * (1 - max_null_ratio)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Condition for is_not_null_or_empty is inverted — rule is never emitted under normal conditions

# current (wrong):
if ... and count_non_empty >= (total_count * (1 - max_empty_ratio)) ...
# count_non_empty is actually count_empty, so with default max_empty_ratio=0:
# → count_empty >= total_count → only true when EVERY row is an empty string

The original intent (and what the old code did) was: emit the rule when the empty-string count is within the allowed threshold, so the rule can enforce that going forward. Should be:

if typ == T.StringType() and count_empty <= (total_count * max_empty_ratio) and count_non_null >= (total_count * (1 - max_null_ratio)):

Same inversion issue affects the third elif branch (line ~506): count_non_empty <= (metrics["count"] * max_empty_ratio) is also backwards once you rename count_non_emptycount_empty.

@ghanse
Copy link
Collaborator

ghanse commented Mar 5, 2026

We will handle this through #1059 .

@ghanse ghanse closed this Mar 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Fix profiling rules for empty string values

3 participants