@shaoting-huang
Contributor

@shaoting-huang shaoting-huang commented Oct 13, 2025

Although milvus-storage provides an FFI interface that allows Spark to read and write binlogs in the Storage v2 format, the current Milvus 2.6 version uses the milvus-storage packed format, which does not include a manifest file.

To ensure compatibility with the Milvus 2.6 binlog format when using the FFI reader interface, we need to implement a manifest builder that generates a manifest file based on the segment information and Milvus schema.
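As a rough illustration of what such a manifest builder does, the sketch below groups binlog paths by column group and emits a JSON document alongside the schema fields. This is not the milvus-storage manifest format or the code in this PR: the function name `build_manifest`, the input shapes, and every JSON key are hypothetical and chosen only for illustration.

```python
import json
from collections import defaultdict


def build_manifest(segment_id, binlog_paths, schema_fields):
    """Return a JSON manifest string for one segment.

    All names and shapes here are hypothetical:
      binlog_paths  -- list of (column_group_id, path) tuples
      schema_fields -- list of dicts such as {"id": 100, "name": "pk", "type": "int64"}
    """
    # Collect the files that belong to each column group.
    groups = defaultdict(list)
    for group_id, path in binlog_paths:
        groups[group_id].append(path)

    manifest = {
        "segment_id": segment_id,
        "fields": schema_fields,
        # Sort groups and files so the output is deterministic.
        "column_groups": [
            {"group_id": gid, "files": sorted(paths)}
            for gid, paths in sorted(groups.items())
        ],
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

A reader can then resolve every file of a segment from the manifest alone, instead of listing storage paths itself.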

This commit introduces a new Spark data source for reading Milvus Storage V2 binlog data. Key additions include:

  • MilvusStorageV2DataSource: Spark DataSourceV2 implementation for accessing V2 binlogs.
  • binlogv2 module: handles manifest generation, binlog grouping, and Parquet metadata reading.
  • FFI and JNI integration via milvus-storage to interface with native binlog parsing libraries.
  • Utility packages (schema, etc.) to support schema mapping and type conversion.
  • Test suites covering manifest building, native integration, and source loading with Spark SQL logic.
  • Updated build.sbt and added spark_submit_demo.sh for native linking and demo execution.
  • Registered new DataSource in META-INF/services.

This enhancement enables Spark to efficiently read Milvus V2 binlog files and makes the connector compatible with the latest Milvus storage format.

Supported capabilities:

  • column projection
  • filter pushdown
  • top-K vector search (currently cosine, L2, and inner product metrics)
  • Spark SQL
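For reference, the three metrics listed for top-K vector search rank rows as in the sketch below. It is a plain-Python illustration of the scoring, not the connector's implementation. The names `score` and `top_k` and the metric strings `"ip"`, `"cosine"`, and `"l2"` are assumptions made for this example.

```python
import heapq
import math


def score(metric, query, vector):
    """Return a similarity score where larger always means closer."""
    dot = sum(q * v for q, v in zip(query, vector))
    if metric == "ip":  # inner product
        return dot
    if metric == "cosine":
        norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(v * v for v in vector))
        return dot / norm if norm else 0.0
    if metric == "l2":
        # Smaller distance is better, so negate it to keep "larger is closer".
        return -math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vector)))
    raise ValueError(f"unsupported metric: {metric}")


def top_k(rows, query, k, metric="l2"):
    """rows is an iterable of (row_id, vector); returns the k closest row ids."""
    scored = ((score(metric, query, vec), row_id) for row_id, vec in rows)
    return [row_id for _, row_id in heapq.nlargest(k, scored)]
```

The metrics can disagree on ranking: a long vector pointing away from the query may win under inner product while losing under cosine and L2.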

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shaoting-huang
To complete the pull request process, please assign xiaofan-luan after the PR has been reviewed.
You can assign the PR to them by writing /assign @xiaofan-luan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaoting-huang shaoting-huang changed the title from "feat: add origin storage v2 binlog reader" to "feat: storage v2 binlog data source" on Oct 14, 2025
Although milvus-storage provides an FFI interface that lets Spark
read and write binlogs in the Storage v2 format, Milvus 2.6 currently
uses the milvus-storage packed format, which has no manifest file.

Therefore, to stay compatible with the Milvus 2.6 binlog format when
using the FFI reader interface, we need to implement a manifest
builder that builds a manifest file from the segment info and the
Milvus schema.

Introduces a new Spark data source for reading Milvus Storage V2 binlog data.

Key additions include:

  • MilvusStorageV2DataSource: Spark DataSourceV2 implementation for accessing V2 binlogs.
  • binlogv2 module: handles manifest generation, binlog grouping, and Parquet metadata reading.
  • FFI and JNI integration via milvus-storage to interface with native binlog parsing libraries.
  • Utility packages (serde, schema, etc.) to support serialization, schema mapping, and type conversion.
  • Test suites covering manifest building, native integration, and source loading with Spark SQL logic.
  • Updated build.sbt and added spark_submit_demo.sh for native linking and demo execution.
  • Registered new DataSource in META-INF/services.

This enhancement enables Spark to efficiently read Milvus V2 binlog files and makes the connector compatible with the latest Milvus storage format.

Signed-off-by: shaoting-huang <[email protected]>