Data duplication occurs when using CDC upsert mode #341

I'm currently using CDC upsert mode to load data into an Iceberg table. The catalog is AWS Glue Catalog.
I've found duplicate rows for the same id. (I believe there are no issues with the Kafka Connect configuration.)
Below is the query result for the affected table:
+-----+-----------+
|   id|customer_id|
+-----+-----------+
| 3000|      10000|
| 3000|      10000|
+-----+-----------+

To identify the cause of these duplicates, I examined the data files and delete files in the snapshot that loaded this data, since I suspected that missing delete files might be the cause.

In the snapshot, I found data files, positional delete files, and equality delete files containing the problematic id, and verified their contents:
- Data file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00001.parquet):
+-----+-----------+
|   id|customer_id|
+-----+-----------+
| 3000|      10000|
| 3000|      10000|
+-----+-----------+

- Equality delete file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00002.parquet):
+-----+
|   id|
+-----+
| 3000|
+-----+

- Positional delete file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00003.parquet):
+-------------------------------------------------------------------------------+---+
|                                                                      file_path|pos|
+-------------------------------------------------------------------------------+---+
|s3://.../00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00001.parquet|  0|
+-------------------------------------------------------------------------------+---+

I confirmed that all three files are included in the following query result:
```
SELECT *
FROM db.table.files
```

In this case, when executing a SELECT query, I expected that:
1. Equality deletes would exclude previously loaded rows with the corresponding id
2. Positional deletes would exclude the pos 0 row of the data file
3. Only the pos 1 row of the data file would be returned as the result

So I believe there should be no duplicates.
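
To make that expectation concrete, here is a minimal Python sketch of the rules I am assuming, based on my reading of the Iceberg spec: a positional delete applies to data files whose sequence number is less than or equal to its own, while an equality delete applies only to data files with a strictly lower sequence number. The class names, the sequence numbers, the shortened file names, and the previously loaded row in `old.parquet` are all hypothetical, and partitioning is ignored.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    seq: int     # data sequence number of the commit that added the file
    rows: list   # list of (id, customer_id) tuples


@dataclass
class EqualityDelete:
    seq: int
    ids: set     # deleted values of the equality field `id`


@dataclass
class PositionDelete:
    seq: int
    positions: set   # (file_path, pos) pairs


def scan(data_files, eq_deletes, pos_deletes):
    """Return the rows that survive after applying the delete files."""
    live = []
    for f in data_files:
        for pos, row in enumerate(f.rows):
            # Positional deletes apply when data seq <= delete seq,
            # so they can remove rows written in the same commit.
            if any(f.seq <= d.seq and (f.path, pos) in d.positions
                   for d in pos_deletes):
                continue
            # Equality deletes apply only when data seq < delete seq,
            # so they remove earlier rows but not rows in the same commit.
            if any(f.seq < d.seq and row[0] in d.ids for d in eq_deletes):
                continue
            live.append(row)
    return live


# Hypothetical setup: one earlier file plus the three files from the snapshot.
old = DataFile("old.parquet", seq=1, rows=[(3000, 9999)])
new = DataFile("00001.parquet", seq=2, rows=[(3000, 10000), (3000, 10000)])
eq_del = EqualityDelete(seq=2, ids={3000})
pos_del = PositionDelete(seq=2, positions={("00001.parquet", 0)})

result = scan([old, new], [eq_del], [pos_del])
print(result)  # [(3000, 10000)]
```

In this model, only the pos 1 row survives. If the positional delete is left out of the scan, both rows of the new data file come back, which is the shape of the duplicate I am seeing.
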
Is my expected behavior correct? Or what else should I check to identify the root cause?
