fix(sql): fix a bug when planning semi- or antijoins #20990

Open
aalexandrov wants to merge 1 commit into apache:main from aalexandrov:fix_semijoin_using_wildcard_planner_bug
Rationale for this change

The planner should be consistent with the expected SQL behavior: swapping the names of two structurally identical tables in a SQL query should not change the schema of that query.

What changes are included in this PR?

  • A fix in the exclude_using_columns helper method in datafusion/expr/src/utils.rs. On top of a semi- or antijoin, the helper no longer retains a column from the filtering side (the input whose columns the join does not output) when deciding which USING columns to exclude and which to keep.
  • Regression tests for the change in test_using_join_wildcard_schema_semi_anti.

Are these changes tested?

  • Added a regression test.

Are there any user-facing changes?

Yes, the change is user-facing, but I doubt that the current behavior is expected or documented anywhere.
If existing docs need to be updated, please point me to the concrete places and I can take a look.

Currently, the `exclude_using_columns` helper called from `expand_wildcard`
doesn't consider the filtering semantics of semi- and antijoins when
expanding wildcards on top of joins defined via the `USING(<columns>)`
syntax.

From each set of columns equated by a `USING(<column>)` expression, the
code currently (1) sorts the set entries, and (2) retains only the first
entry from each set.

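Because of that, the column surviving the `exclude_using_columns` call
might be wrongly chosen from the filtering side if the table qualifier
from that side sorts lexicographically before the qualifier of the
filtered side. The filtered side's own column is then excluded, even
though it is the only one present in the join output.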
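
To make the failure mode concrete, here is a minimal Rust sketch of the sort-and-keep-first logic described above. It is a simplified model, not DataFusion's actual code: the `Col` alias and both function names are hypothetical, and columns are reduced to `(qualifier, name)` pairs.

```rust
// Simplified model (assumption): a column is a (table qualifier, column name)
// pair. This is NOT DataFusion's `Column` type.
type Col = (&'static str, &'static str);

/// Models the behavior described above: sort each USING set and keep only
/// its first entry; all remaining entries are reported as "to exclude".
/// The function name is hypothetical.
fn excluded_by_sort_first(using_sets: &[Vec<Col>]) -> Vec<Col> {
    let mut excluded = Vec::new();
    for set in using_sets {
        let mut sorted = set.clone();
        // Tuples compare by qualifier first, then by column name.
        sorted.sort();
        // The first entry survives; everything after it is excluded.
        excluded.extend(sorted.into_iter().skip(1));
    }
    excluded
}

/// Expands `*` over a join's output schema, dropping excluded columns and
/// returning the unqualified names that remain. Also a hypothetical helper.
fn expand_wildcard_names(schema: &[Col], excluded: &[Col]) -> Vec<&'static str> {
    schema
        .iter()
        .filter(|c| !excluded.contains(*c))
        .map(|c| c.1)
        .collect()
}
```

In this model, the USING set `{s.x1, t.x1}` always excludes `t.x1`, regardless of which input the join actually outputs. A semi join whose output consists only of `t` columns therefore loses `x1`, while one whose output consists only of `s` columns keeps it.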

For example, given this schema of two identical tables:

```sql
create table s(x1 int, x2 int, x3 int);
create table t(x1 int, x2 int, x3 int);
```

One would expect that the schema of queries where the `s` and `t` names
are swapped will be identical. However, currently this is not the case:

```sql
-- Q1 schema: x1 int, x2 int, x3 int (because s < t)
select * from s left semi join t using (x1);

-- Q2 schema: x2 int, x3 int (because t < s)
select * from t left semi join s using (x1);
```

This commit fixes the issue and adds some regression tests.
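
The description above does not include the diff itself, so the following Rust sketch only illustrates one way such a fix could work, under the assumption that it is enough to ignore USING columns that are absent from the join's output schema before choosing a survivor. The `Col` alias and the function name are hypothetical and do not mirror DataFusion's API.

```rust
// Simplified model (assumption): a column is a (table qualifier, column name)
// pair. This is NOT DataFusion's `Column` type.
type Col = (&'static str, &'static str);

/// Hypothetical schema-aware variant: within each USING set, only columns
/// that the join actually outputs are candidates. The first of those (in
/// sorted order) survives and the rest are excluded.
fn excluded_schema_aware(using_sets: &[Vec<Col>], output_schema: &[Col]) -> Vec<Col> {
    let mut excluded = Vec::new();
    for set in using_sets {
        // Drop columns from the filtering side of a semi- or antijoin,
        // since they are not part of the output schema.
        let mut visible: Vec<Col> = set
            .iter()
            .copied()
            .filter(|c| output_schema.contains(c))
            .collect();
        visible.sort();
        // Keep the first visible column, exclude any remaining duplicates.
        excluded.extend(visible.into_iter().skip(1));
    }
    excluded
}
```

Under this model, both Q1 and Q2 exclude nothing and expand to `x1, x2, x3`, while an inner join on `USING (x1)`, whose output contains both `s.x1` and `t.x1`, still excludes one of the two.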
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions labels Mar 17, 2026
Labels

logical-expr Logical plan and expressions sql SQL Planner

Development

Successfully merging this pull request may close these issues.

exclude_using_columns might wrongly retain columns from a projected join input