Fixed profilier rules for empty strings and nulls by howardwu-db · Pull Request #1037 · databrickslabs/dqx

howardwu-db · 2026-02-17T22:50:59Z

Changes

Fixed profiler section for empty values and nulls.

Linked issues

Resolves #1036

Tests

github-actions · 2026-02-17T22:51:27Z

All commits in PR should be signed ('git commit -S ...'). See https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits

mwojtyczka

You need to setup GPG key and sign commits, you can follow instructions here to repush the commits: https://databrickslabs.github.io/dqx/docs/dev/contributing/#first-contribution

mwojtyczka · 2026-02-18T18:20:08Z

src/databricks/labs/dqx/profiler/profiler.py

+        null_percentage = 1 - (1.0 * count_non_null) / total_count
+        empty_percentage = 1 - (1.0 * count_non_empty) / total_count
+
+        if typ == T.StringType() and count_non_empty >= (total_count *(1 - max_empty_ratio)) and count_non_null >= (total_count * (1 - max_null_ratio)):


can you please refactor this to a separate methods for readability and document as per https://databrickslabs.github.io/dqx/docs/dev/contributing/#first-contribution

mwojtyczka

The intent here is good — separating is_not_null, is_not_empty, and is_not_null_or_empty into explicit branches is cleaner than the previous approach. However there are several correctness bugs that need fixing before merge, and no tests are included for the new behaviour. See inline comments.

mwojtyczka · 2026-03-02T23:17:36Z

src/databricks/labs/dqx/profiler/generator.py

+
+        return {
+            "check": {"function": "is_not_empty", "arguments": {"column": column}},
+            "name": f"{column}_is_empty",


Rule name is wrong — will generate broken rule names

The generated rule name is f"{column}_is_empty" but the check function is is_not_empty. Every other generator method uses the check-function name as the suffix (e.g. dq_generate_is_not_null → {column}_is_not_null). Should be:

"name": f"{column}_is_not_empty",

mwojtyczka · 2026-03-02T23:17:36Z

src/databricks/labs/dqx/profiler/profiler.py

-                        filter=opts.get("filter", None),
-                    )
+
+        count_non_empty = dst.filter(F.col(field_name) == "").count()


count_non_empty is a misnomer — it stores the count of EMPTY strings

dst.filter(F.col(field_name) == "") returns rows where the value is the empty string, so count_non_empty actually holds the empty-string count. This wrong name ripples into every downstream condition and description. Rename to count_empty:

count_empty = dst.filter(F.col(field_name) == "").count()

And update all references accordingly. This single naming error is the root cause of the two bugs described in the comments on lines 482 and 484.

mwojtyczka · 2026-03-02T23:17:36Z

src/databricks/labs/dqx/profiler/profiler.py

+
+        count_non_empty = dst.filter(F.col(field_name) == "").count()
+        null_percentage = 1 - (1.0 * count_non_null) / total_count
+        empty_percentage = 1 - (1.0 * count_non_empty) / total_count


empty_percentage computes the NON-empty fraction, not the empty fraction

# current (wrong): empty_percentage = 1 - (1.0 * count_non_empty) / total_count # count_non_empty is actually count_empty, so this gives: 1 - (empty_count / total) # = fraction of NON-empty rows

The description then displays this as "has X% of empty values", which shows the inverse. Should be:

empty_percentage = (1.0 * count_empty) / total_count # fraction of empty rows

mwojtyczka · 2026-03-02T23:17:36Z

src/databricks/labs/dqx/profiler/profiler.py

+        null_percentage = 1 - (1.0 * count_non_null) / total_count
+        empty_percentage = 1 - (1.0 * count_non_empty) / total_count
+
+        if typ == T.StringType() and count_non_empty >= (total_count *(1 - max_empty_ratio)) and count_non_null >= (total_count * (1 - max_null_ratio)):


Condition for is_not_null_or_empty is inverted — rule is never emitted under normal conditions

# current (wrong): if ... and count_non_empty >= (total_count * (1 - max_empty_ratio)) ... # count_non_empty is actually count_empty, so with default max_empty_ratio=0: # → count_empty >= total_count → only true when EVERY row is an empty string

The original intent (and what the old code did) was: emit the rule when the empty-string count is within the allowed threshold, so the rule can enforce that going forward. Should be:

if typ == T.StringType() and count_empty <= (total_count * max_empty_ratio) and count_non_null >= (total_count * (1 - max_null_ratio)):

Same inversion issue affects the third elif branch (line ~506): count_non_empty <= (metrics["count"] * max_empty_ratio) is also backwards once you rename count_non_empty → count_empty.

ghanse · 2026-03-05T01:03:30Z

We will handle this through #1059 .

Fixed profilier rules for empty strings and nulls

6227f39

howardwu-db requested a review from a team as a code owner February 17, 2026 22:51

howardwu-db requested review from nehamilak-db and removed request for a team February 17, 2026 22:51

ghanse self-requested a review February 17, 2026 22:51

mwojtyczka requested changes Feb 18, 2026

View reviewed changes

Merge branch 'main' into fix_profiler

2b9ef15

mwojtyczka reviewed Mar 2, 2026

View reviewed changes

ghanse closed this Mar 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed profilier rules for empty strings and nulls#1037

Fixed profilier rules for empty strings and nulls#1037
howardwu-db wants to merge 2 commits intodatabrickslabs:mainfrom
howardwu-db:fix_profiler

howardwu-db commented Feb 17, 2026

Uh oh!

github-actions bot commented Feb 17, 2026

Uh oh!

mwojtyczka left a comment

Uh oh!

mwojtyczka Feb 18, 2026 •

edited

Loading

Uh oh!

mwojtyczka left a comment

Uh oh!

mwojtyczka Mar 2, 2026

Uh oh!

mwojtyczka Mar 2, 2026

Uh oh!

mwojtyczka Mar 2, 2026

Uh oh!

mwojtyczka Mar 2, 2026

Uh oh!

ghanse commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

howardwu-db commented Feb 17, 2026

Changes

Linked issues

Tests

Uh oh!

github-actions bot commented Feb 17, 2026

Uh oh!

mwojtyczka left a comment

Choose a reason for hiding this comment

Uh oh!

mwojtyczka Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mwojtyczka left a comment

Choose a reason for hiding this comment

Uh oh!

mwojtyczka Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

mwojtyczka Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

mwojtyczka Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

mwojtyczka Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

ghanse commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mwojtyczka Feb 18, 2026 •

edited

Loading