Description
I'm currently using CDC upsert mode to load data into an Iceberg table. The catalog is AWS Glue Catalog.
I've found duplicate rows that share the same id. (I believe there are no issues with the Kafka Connect configuration.)
Below is the query result for the table in question:
+-----+-----------+
|   id|customer_id|
+-----+-----------+
| 3000|      10000|
| 3000|      10000|
+-----+-----------+
To identify the cause of these duplicates, I examined the data files and delete files from the snapshot when this data was loaded (as I suspected missing delete files might be the cause).
In the snapshot, I found data files, positional delete files, and equality delete files containing the problematic id, and verified their contents:
- Data file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00001.parquet):
  +-----+-----------+
  |   id|customer_id|
  +-----+-----------+
  | 3000|      10000|
  | 3000|      10000|
  +-----+-----------+
- Equality delete file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00002.parquet):
  +-----+
  |   id|
  +-----+
  | 3000|
  +-----+
- Positional delete file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00003.parquet):
  +-------------------------------------------------------------------------------+---+
  |                                                                      file_path|pos|
  +-------------------------------------------------------------------------------+---+
  |s3://.../00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00001.parquet|  0|
  +-------------------------------------------------------------------------------+---+
I confirmed that all three files are included in the following query result:
SELECT *
FROM db.table.files
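As a rough illustration of how that result can be read: in the `files` metadata table, the `content` column distinguishes data files (0), positional delete files (1), and equality delete files (2). The sketch below groups rows by that column. The `rows` list is a hypothetical `(content, file_path)` projection of the query result, written by hand from the three files described above, not actual query output.

```python
# Meaning of the `content` column in the Iceberg `files` metadata table.
CONTENT_KINDS = {0: "data", 1: "position deletes", 2: "equality deletes"}


def group_by_content(rows):
    """Group (content, file_path) rows by the kind of file they describe."""
    grouped = {}
    for content, file_path in rows:
        grouped.setdefault(CONTENT_KINDS[content], []).append(file_path)
    return grouped


# Hypothetical projection of the query result for the three files above.
PREFIX = "s3://.../00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb"
rows = [
    (0, PREFIX + "-00001.parquet"),  # data file
    (2, PREFIX + "-00002.parquet"),  # equality delete file
    (1, PREFIX + "-00003.parquet"),  # positional delete file
]

grouped = group_by_content(rows)
for kind, paths in grouped.items():
    print(kind, len(paths))
```

If one of the three kinds were missing from the grouping, that would point to a missing delete file rather than a read-side problem.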
In this case, when executing a SELECT query, I expected that:
- Equality deletes would exclude previously loaded rows with the corresponding id
- Positional deletes would exclude the pos 0 row of the data file
- Only the pos 1 row of the data file would be returned as the result
So I believe there should be no duplicates.
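To make that expectation concrete, here is a minimal Python sketch of the delete applicability rules as I understand them from the Iceberg table spec: a positional delete applies to a data file whose data sequence number is less than or equal to the delete's, while an equality delete applies only to data files with a strictly lower data sequence number. The class names, the sequence numbers (1 and 2), and the `older.parquet` file are hypothetical and only for illustration; partition scoping is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    seq: int      # data sequence number (hypothetical values below)
    rows: tuple   # tuple of (id, customer_id)


@dataclass(frozen=True)
class EqualityDelete:
    seq: int
    ids: frozenset


@dataclass(frozen=True)
class PositionDelete:
    seq: int
    file_path: str
    pos: int


def scan(data_files, eq_deletes, pos_deletes):
    """Return the rows that survive after applying all delete files."""
    live = []
    for f in data_files:
        for pos, row in enumerate(f.rows):
            # Positional deletes also apply within the same commit (<=).
            by_pos = any(
                d.file_path == f.path and d.pos == pos and f.seq <= d.seq
                for d in pos_deletes
            )
            # Equality deletes apply only to strictly older data files (<).
            by_eq = any(row[0] in d.ids and f.seq < d.seq for d in eq_deletes)
            if not (by_pos or by_eq):
                live.append(row)
    return live


# Hypothetical earlier file holding a previous version of id 3000.
older = DataFile("older.parquet", 1, ((3000, 9999),))
# The data file from this issue: two rows with the same id in one commit.
new = DataFile("00001.parquet", 2, ((3000, 10000), (3000, 10000)))
eq_deletes = [EqualityDelete(2, frozenset({3000}))]
pos_deletes = [PositionDelete(2, "00001.parquet", 0)]

result = scan([older, new], eq_deletes, pos_deletes)
print(result)  # [(3000, 10000)]
```

Under these assumptions only the pos 1 row survives. Calling `scan([new], eq_deletes, [])` returns both rows, so the duplicate in the query result looks like what one would see if the positional delete were not being applied at read time.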
Is my expectation correct? If so, what else should I check to identify the root cause?