Contributing

Thank you for contributing to Apache DataSketches!

The goal of this document is to provide everything you need to start contributing to this core Rust library.

Your First Contribution

Fork the DataSketches repository in your own GitHub account.
Create a new Git branch.
Make your changes.
Submit the branch as a pull request to the upstream repo. A DataSketches team member should comment and/or review your pull request within a few days. Although, depending on the circumstances, it may take longer.

Setup

This repo develops Apache® DataSketches™ Core Rust Library Component. To build this project, you will need to set up Rust development first. We highly recommend using rustup for the setup process.

For Linux or macOS users, use the following command:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

For Windows users, download rustup-init.exe from here instead.

Rustup will read the rust-toolchain.toml file and set up everything else automatically. To ensure that everything works correctly, run cargo version under the root directory:

cargo version
# cargo 1.85.0 (<hash> 2024-12-31)

To keep code style consistent, run cargo x lint --fix to automatically fix any style issues before committing your changes.

Build and Test

We recommend using cargo x as a single entrypoint (provided by the workspace xtask crate). This repo defines the cargo x alias in .cargo/config.toml, which maps to cargo run --package x -- ....

Build:

cargo build --workspace

Test:

cargo x test
# or
cargo test --workspace --no-default-features

Lint:

cargo x lint

Manual workflow (without xtask)

cargo x lint runs the following steps. Use these directly when you need more control or want to isolate failures:

cargo +nightly clippy --tests --all-features --all-targets --workspace -- -D warnings
cargo +nightly fmt --all --check
taplo format --check
typos
hawkeye check

Automatic fix commands:

cargo +nightly clippy --tests --all-features --all-targets --workspace --allow-staged --allow-dirty --fix
cargo +nightly fmt --all
taplo format
hawkeye format --fail-if-updated=false

Install the extra tools with:

cargo install taplo-cli typos-cli hawkeye

Serialization snapshots and test data generation

Some tests depend on snapshot files under datasketches/tests/serialization_test_data. If they are missing, tests will fail. Regenerate them with:

python3 ./tools/generate_serialization_test_data.py --all

The script pulls datasketches-java and datasketches-cpp and writes files to:

datasketches/tests/serialization_test_data/java_generated_files
datasketches/tests/serialization_test_data/cpp_generated_files

You can generate them separately:

python3 ./tools/generate_serialization_test_data.py --java
python3 ./tools/generate_serialization_test_data.py --cpp

The script requires these commands on PATH (and network access):

Java data: git, java, mvn
C++ data: git, cmake, ctest

The current datasketches-java generation flow requires JDK >= 25 and Maven >= 3.9.11, otherwise Maven Enforcer will fail.

Code of Conduct

We expect all community members to follow our Code of Conduct.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contributing

Your First Contribution

Setup

Build and Test

Manual workflow (without xtask)

Serialization snapshots and test data generation

Code of Conduct

FilesExpand file tree

CONTRIBUTING.md

Latest commit

History

CONTRIBUTING.md

File metadata and controls

Contributing

Your First Contribution

Setup

Build and Test

Manual workflow (without xtask)

Serialization snapshots and test data generation

Code of Conduct