Skip to content

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Dec 30, 2025

This PR will introduce dictionary encoding for 64bit types like int64/double.

Field Parquet (bytes) Lance 2.1 (bytes) Lance 2.2 before (bytes) Lance 2.2 after (bytes) vs Parquet vs Lance 2.1 vs Lance 2.2 before
token_count 806,050 350,852 351,208 312,168 -493,882 (-61.3%) -38,684 (-11.0%) -39,040 (-11.1%)
score 254,145 1,438,596 1,438,952 164,048 -90,097 (-35.5%) -1,274,548 (-88.6%) -1,274,904 (-88.6%)

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions github-actions bot added the enhancement New feature or request label Dec 30, 2025
@github-actions
Copy link
Contributor

Code Review

Summary

This PR adds dictionary encoding support for 64-bit types (int64/double), extending the existing 128-bit support. The compression gains look good based on the benchmarks provided.

P0/P1 Issues

P1: Potential correctness issue with float64 dictionary encoding

Using HashMap<u64, ...> for float64 values (line 129-148 in dict.rs) could cause issues with special floating point values. The code reinterprets f64 as u64 bytes, which means:

  • NaN \!= NaN in normal float comparison, but would match as equal bytes (if same bit pattern)
  • Different NaN bit patterns would be treated as different values
  • +0.0 and -0.0 have different bit patterns but are semantically equal

This may or may not be intentional behavior. If you want IEEE semantics where -0.0 == +0.0, this needs adjustment. If byte-level deduplication is the goal, this is fine but should be documented.

P1: Whitespace error in line 4425

There's an indentation issue at line 4425 in primitive.rs - the line has extra leading spaces that break consistent formatting:

        let task = spawn_cpu(move || {
-            let num_values = arrays.iter().map(|arr| arr.len() as u64).sum();
+                let num_values = arrays.iter().map(|arr| arr.len() as u64).sum();

This should be fixed to maintain consistent formatting (run cargo fmt).

Minor Observations

  • The test test_should_not_dictionary_encode_when_not_beneficial changes the test data from cardinality=10 to cardinality=1000 with num_values=1000. Since cardinality equals num_values, this makes the test less meaningful as it's testing a worst-case scenario. Consider if the original intent was to test high cardinality ratio rather than 100% cardinality.

@codecov
Copy link

codecov bot commented Dec 30, 2025

Codecov Report

❌ Patch coverage is 96.82540% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../lance-encoding/src/encodings/logical/primitive.rs 95.49% 5 Missing ⚠️
...e-encoding/src/encodings/logical/primitive/dict.rs 98.38% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@wjones127 wjones127 merged commit 3216350 into main Jan 1, 2026
30 checks passed
@wjones127 wjones127 deleted the xuanwo/lance-compress-int64 branch January 1, 2026 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants