Fixed profilier rules for empty strings and nulls#1037
Fixed profilier rules for empty strings and nulls#1037howardwu-db wants to merge 2 commits intodatabrickslabs:mainfrom
Conversation
|
All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits |
mwojtyczka
left a comment
There was a problem hiding this comment.
You need to setup GPG key and sign commits, you can follow instructions here to repush the commits: https://databrickslabs.github.io/dqx/docs/dev/contributing/#first-contribution
| null_percentage = 1 - (1.0 * count_non_null) / total_count | ||
| empty_percentage = 1 - (1.0 * count_non_empty) / total_count | ||
|
|
||
| if typ == T.StringType() and count_non_empty >= (total_count *(1 - max_empty_ratio)) and count_non_null >= (total_count * (1 - max_null_ratio)): |
There was a problem hiding this comment.
can you please refactor this to a separate methods for readability and document as per https://databrickslabs.github.io/dqx/docs/dev/contributing/#first-contribution
mwojtyczka
left a comment
There was a problem hiding this comment.
The intent here is good — separating is_not_null, is_not_empty, and is_not_null_or_empty into explicit branches is cleaner than the previous approach. However there are several correctness bugs that need fixing before merge, and no tests are included for the new behaviour. See inline comments.
|
|
||
| return { | ||
| "check": {"function": "is_not_empty", "arguments": {"column": column}}, | ||
| "name": f"{column}_is_empty", |
There was a problem hiding this comment.
Rule name is wrong — will generate broken rule names
The generated rule name is f"{column}_is_empty" but the check function is is_not_empty. Every other generator method uses the check-function name as the suffix (e.g. dq_generate_is_not_null → {column}_is_not_null). Should be:
"name": f"{column}_is_not_empty",| filter=opts.get("filter", None), | ||
| ) | ||
|
|
||
| count_non_empty = dst.filter(F.col(field_name) == "").count() |
There was a problem hiding this comment.
count_non_empty is a misnomer — it stores the count of EMPTY strings
dst.filter(F.col(field_name) == "") returns rows where the value is the empty string, so count_non_empty actually holds the empty-string count. This wrong name ripples into every downstream condition and description. Rename to count_empty:
count_empty = dst.filter(F.col(field_name) == "").count()And update all references accordingly. This single naming error is the root cause of the two bugs described in the comments on lines 482 and 484.
|
|
||
| count_non_empty = dst.filter(F.col(field_name) == "").count() | ||
| null_percentage = 1 - (1.0 * count_non_null) / total_count | ||
| empty_percentage = 1 - (1.0 * count_non_empty) / total_count |
There was a problem hiding this comment.
empty_percentage computes the NON-empty fraction, not the empty fraction
# current (wrong):
empty_percentage = 1 - (1.0 * count_non_empty) / total_count
# count_non_empty is actually count_empty, so this gives: 1 - (empty_count / total)
# = fraction of NON-empty rowsThe description then displays this as "has X% of empty values", which shows the inverse. Should be:
empty_percentage = (1.0 * count_empty) / total_count # fraction of empty rows| null_percentage = 1 - (1.0 * count_non_null) / total_count | ||
| empty_percentage = 1 - (1.0 * count_non_empty) / total_count | ||
|
|
||
| if typ == T.StringType() and count_non_empty >= (total_count *(1 - max_empty_ratio)) and count_non_null >= (total_count * (1 - max_null_ratio)): |
There was a problem hiding this comment.
Condition for is_not_null_or_empty is inverted — rule is never emitted under normal conditions
# current (wrong):
if ... and count_non_empty >= (total_count * (1 - max_empty_ratio)) ...
# count_non_empty is actually count_empty, so with default max_empty_ratio=0:
# → count_empty >= total_count → only true when EVERY row is an empty stringThe original intent (and what the old code did) was: emit the rule when the empty-string count is within the allowed threshold, so the rule can enforce that going forward. Should be:
if typ == T.StringType() and count_empty <= (total_count * max_empty_ratio) and count_non_null >= (total_count * (1 - max_null_ratio)):Same inversion issue affects the third elif branch (line ~506): count_non_empty <= (metrics["count"] * max_empty_ratio) is also backwards once you rename count_non_empty → count_empty.
|
We will handle this through #1059 . |
Changes
Fixed profiler section for empty values and nulls.
Linked issues
Resolves #1036
Tests