feat: storage v2 binlog data source #52

shaoting-huang · 2025-10-13T07:14:53Z

Although milvus-storage provides an FFI interface that allows Spark to read and write binlogs in the Storage v2 format, the current Milvus 2.6 version uses the milvus-storage packed format, which does not include a manifest file.

To ensure compatibility with the Milvus 2.6 binlog format when using the FFI reader interface, we need to implement a manifest builder that generates a manifest file based on the segment information and Milvus schema.
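As a rough illustration of the manifest builder described above, the sketch below groups binlog paths by the ID found in their second-to-last path segment and emits a JSON document. Everything here is an assumption made for illustration: the `build_manifest` helper, the path layout, and the JSON layout are hypothetical, and the actual milvus-storage manifest format is not described in this PR.

```python
import json
from collections import defaultdict
from typing import Dict, List


def build_manifest(segment_id: int,
                   binlog_paths: List[str],
                   schema_fields: Dict[int, str]) -> str:
    """Group binlog paths by field ID and emit a JSON manifest.

    Hypothetical layout for illustration only; not the real
    milvus-storage manifest format.
    """
    files_by_field = defaultdict(list)
    for path in binlog_paths:
        # Assumed path layout: .../<segment_id>/<field_id>/<log_id>
        parts = path.rstrip("/").split("/")
        field_id = int(parts[-2])
        files_by_field[field_id].append(path)

    manifest = {
        "segment_id": segment_id,
        "columns": [
            {
                "field_id": field_id,
                # Raises KeyError if the schema does not know this field.
                "name": schema_fields[field_id],
                "files": sorted(files_by_field[field_id]),
            }
            for field_id in sorted(files_by_field)
        ],
    }
    return json.dumps(manifest, indent=2)
```

The point of the sketch is only that the manifest can be derived from two inputs the connector already has, the segment's binlog list and the collection schema, without reading the binlog contents.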

This commit introduces a new Spark data source for reading Milvus Storage V2 binlog data. Key additions include:

MilvusStorageV2DataSource: Spark DataSourceV2 implementation for accessing V2 binlogs.
binlogv2 module: handles manifest generation, binlog grouping, and Parquet metadata reading.
FFI and JNI integration via milvus-storage to interface with native binlog parsing libraries.
Utility packages (schema, etc.) to support schema mapping and type conversion.
Test suites covering manifest building, native integration, and source loading with Spark SQL logic.
Updated build.sbt and added spark_submit_demo.sh for native linking and demo execution.
Registered new DataSource in META-INF/services.

This enhancement enables Spark to efficiently read Milvus V2 binlog files and makes the connector compatible with the latest Milvus storage format.

Supported capabilities:

column projection
filter pushdown
vector search top K (currently: cosine, L2, inner product)
Spark SQL
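To make the three top-K metrics concrete, here is a minimal, self-contained sketch of brute-force top-K selection. It is not the connector's implementation: the `top_k` helper and the metric names `"L2"`, `"IP"`, and `"COSINE"` are assumptions chosen for illustration. The one detail worth noting is the ordering: L2 is a distance (smaller is better), while inner product and cosine are similarities (larger is better).

```python
import heapq
import math
from typing import Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Vector, b: Vector) -> float:
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    # Treat a zero-length vector as having no similarity to anything.
    return inner_product(a, b) / norm if norm else 0.0


# Metric name -> (scoring function, True if a larger score is better).
# The names are illustrative, not taken from the connector.
METRICS = {
    "L2": (l2_distance, False),
    "IP": (inner_product, True),
    "COSINE": (cosine_similarity, True),
}


def top_k(rows: Iterable[Tuple[int, Vector]],
          query: Vector,
          k: int,
          metric: str) -> List[Tuple[int, float]]:
    """Return the k best (row_id, score) pairs, best first."""
    score_fn, larger_is_better = METRICS[metric]
    scored = ((row_id, score_fn(vec, query)) for row_id, vec in rows)
    pick = heapq.nlargest if larger_is_better else heapq.nsmallest
    return pick(k, scored, key=lambda pair: pair[1])
```

For example, `top_k([(1, [1.0, 0.0]), (2, [0.0, 1.0])], [1.0, 0.0], 1, "L2")` selects row 1, whose distance to the query is zero.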

sre-ci-robot · 2025-10-13T07:15:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shaoting-huang
To complete the pull request process, please assign xiaofan-luan after the PR has been reviewed.
You can assign the PR to them by writing /assign @xiaofan-luan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Although milvus-storage provides an FFI interface so that Spark can read and write binlogs in the Storage v2 format, the current Milvus 2.6 uses the milvus-storage packed format, which has no manifest file. Therefore, to be compatible with the Milvus 2.6 binlog format through the FFI reader interface, we need to implement a manifest builder that builds a manifest file from the segment info and Milvus schema.

Introduces a new Spark data source for reading Milvus Storage V2 binlog data. Key additions include:

MilvusStorageV2DataSource: Spark DataSourceV2 implementation for accessing V2 binlogs.
binlogv2 module: handles manifest generation, binlog grouping, and Parquet metadata reading.
FFI and JNI integration via milvus-storage to interface with native binlog parsing libraries.
Utility packages (serde, schema, etc.) to support serialization, schema mapping, and type conversion.
Test suites covering manifest building, native integration, and source loading with Spark SQL logic.
Updated build.sbt and added spark_submit_demo.sh for native linking and demo execution.
Registered new DataSource in META-INF/services.

This enhancement enables Spark to efficiently read Milvus V2 binlog files and makes the connector compatible with the latest Milvus storage format.

Signed-off-by: shaoting-huang <[email protected]>

sre-ci-robot requested review from czs007 and xiaofan-luan October 13, 2025 07:14

sre-ci-robot added the size/XXL label Oct 13, 2025

shaoting-huang force-pushed the storagev2_binlog branch from 1ac7af1 to dc6b468 Compare October 13, 2025 10:07

shaoting-huang changed the title ~~feat: add origin storage v2 binlog reader~~ feat: storage v2 binlog data source Oct 14, 2025

shaoting-huang force-pushed the storagev2_binlog branch from a2889d6 to b755020 Compare October 14, 2025 14:36

shaoting-huang added 3 commits October 16, 2025 10:52

add storage v2 for milvus data reader

5a64f9b

Signed-off-by: shaoting-huang <[email protected]>

clean up serde useless mapping

ab0a6f5

Signed-off-by: shaoting-huang <[email protected]>

shaoting-huang force-pushed the storagev2_binlog branch from b755020 to ab0a6f5 Compare October 16, 2025 03:02

shaoting-huang added 8 commits October 20, 2025 16:54

column pruning

f359168

Signed-off-by: shaoting-huang <[email protected]>

add vector search topk

10fce6d

Signed-off-by: shaoting-huang <[email protected]>

support pushed down filters

145c887

Signed-off-by: shaoting-huang <[email protected]>

add writer and refactor to loon

a65821b

Signed-off-by: shaoting-huang <[email protected]>

fix writer

1606b68

Signed-off-by: shaoting-huang <[email protected]>

fix storage v2 writer writes to minio

b08fcbd

Signed-off-by: shaoting-huang <[email protected]>

define properties constants

dfb9ac9

Signed-off-by: shaoting-huang <[email protected]>

package to 0.2.1 version

df0dfd0

Signed-off-by: shaoting-huang <[email protected]>

shaoting-huang force-pushed the storagev2_binlog branch from da08492 to df0dfd0 Compare October 28, 2025 11:12
