Data duplication occurs when using CDC upsert mode #341

@BadCandy

I’m currently using CDC upsert mode to load data into an Iceberg table. The catalog is AWS Glue Catalog.
I’ve discovered that row duplicates are occurring based on the id field. (I believe there are no issues with the Kafka Connect configuration.)
Below is the query result for the table in question:
+-----+-----------+
|   id|customer_id|
+-----+-----------+
| 3000|      10000|
| 3000|      10000|
+-----+-----------+

To identify the cause of these duplicates, I examined the data files and delete files in the snapshot that loaded this data (I suspected that missing delete files might be the cause).

In the snapshot, I found data files, positional delete files, and equality delete files containing the problematic id, and verified their contents:

  • Data file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00001.parquet):
    +-----+-----------+
    |   id|customer_id|
    +-----+-----------+
    | 3000|      10000|
    | 3000|      10000|
    +-----+-----------+

  • Equality delete file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00002.parquet):
    +-----+
    |   id|
    +-----+
    | 3000|
    +-----+

  • Positional delete file (00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00003.parquet):
    +-------------------------------------------------------------------------------+---+
    |                                                                      file_path|pos|
    +-------------------------------------------------------------------------------+---+
    |s3://.../00001-1749517607583-2920af6b-223c-4af4-b94d-51775fd9cedb-00001.parquet|  0|
    +-------------------------------------------------------------------------------+---+

I confirmed that all three files are included in the following query result:

SELECT *
FROM db.table.files

In this case, when executing a SELECT query, I expected that:

  1. Equality deletes would exclude previously loaded rows with the corresponding id
  2. Positional deletes would exclude the pos 0 row of the data file
  3. Only the pos 1 row of the data file would be returned as the result

So I believe there should be no duplicates.
Is my expectation correct? If so, what else should I check to identify the root cause?
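
To make the expectation above concrete, here is a minimal Python sketch of how I understand the delete rules in the Iceberg v2 spec: a positional delete applies to a data file whose data sequence number is less than or equal to the delete file's, while an equality delete applies only to data files with a strictly lower sequence number. Everything in it is an assumption for illustration: the class names, the `scan` function, the sequence numbers, the short file labels, and the older row with `customer_id` 9999 are hypothetical, and partition matching is ignored. It is a model of the read path, not the actual reader code.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str   # hypothetical short label, not a real S3 path
    seq: int    # data sequence number (assumed values)
    rows: list  # list of dict rows; list index = row position


@dataclass
class EqualityDelete:
    seq: int
    ids: set    # values of the equality field `id` to delete


@dataclass
class PositionDelete:
    seq: int
    file_path: str
    pos: int


def scan(data_files, eq_deletes, pos_deletes):
    """Return the rows that survive after applying both kinds of deletes."""
    live = []
    for f in data_files:
        for pos, row in enumerate(f.rows):
            # Positional delete: applies when delete seq >= data file seq.
            if any(d.file_path == f.path and d.pos == pos and d.seq >= f.seq
                   for d in pos_deletes):
                continue
            # Equality delete: applies only to strictly older data files.
            if any(row["id"] in d.ids and d.seq > f.seq for d in eq_deletes):
                continue
            live.append(row)
    return live


# Hypothetical older commit (seq 1) holding a previously loaded row.
old_file = DataFile("old.parquet", 1, [{"id": 3000, "customer_id": 9999}])
# The commit in question (seq 2): two rows with the same id in one data file.
new_file = DataFile("new.parquet", 2, [
    {"id": 3000, "customer_id": 10000},
    {"id": 3000, "customer_id": 10000},
])
eq = [EqualityDelete(2, {3000})]
pd_ = [PositionDelete(2, "new.parquet", 0)]

result = scan([old_file, new_file], eq, pd_)
print(result)

# If the positional delete were not applied, both new rows would survive,
# which matches the duplicate seen in the query result.
without_pos = scan([old_file, new_file], eq, [])
print(len(without_pos))
```

Under these assumptions, only the pos 1 row of the new data file survives when both delete files are applied. If the positional delete is skipped (for example, because its `file_path` does not match the data file path exactly), two rows come back, so that comparison may be worth checking.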
