feat: add dictionary encoding for 64bit types like int64/double #5594

Xuanwo · 2025-12-30T04:51:57Z

This PR will introduce dictionary encoding for 64bit types like int64/double.

Field	Parquet (bytes)	Lance 2.1 (bytes)	Lance 2.2 before (bytes)	Lance 2.2 after (bytes)	vs Parquet	vs Lance 2.1	vs Lance 2.2 before
`token_count`	806,050	350,852	351,208	312,168	-493,882 (-61.3%)	-38,684 (-11.0%)	-39,040 (-11.1%)
`score`	254,145	1,438,596	1,438,952	164,048	-90,097 (-35.5%)	-1,274,548 (-88.6%)	-1,274,904 (-88.6%)

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions · 2025-12-30T04:52:54Z

Code Review

Summary

This PR adds dictionary encoding support for 64-bit types (int64/double), extending the existing 128-bit support. The compression gains look good based on the benchmarks provided.

P0/P1 Issues

P1: Potential correctness issue with float64 dictionary encoding

Using HashMap<u64, ...> for float64 values (line 129-148 in dict.rs) could cause issues with special floating point values. The code reinterprets f64 as u64 bytes, which means:

NaN \!= NaN in normal float comparison, but would match as equal bytes (if same bit pattern)
Different NaN bit patterns would be treated as different values
+0.0 and -0.0 have different bit patterns but are semantically equal

This may or may not be intentional behavior. If you want IEEE semantics where -0.0 == +0.0, this needs adjustment. If byte-level deduplication is the goal, this is fine but should be documented.

P1: Whitespace error in line 4425

There's an indentation issue at line 4425 in primitive.rs - the line has extra leading spaces that break consistent formatting:

        let task = spawn_cpu(move || {
-            let num_values = arrays.iter().map(|arr| arr.len() as u64).sum();
+                let num_values = arrays.iter().map(|arr| arr.len() as u64).sum();

This should be fixed to maintain consistent formatting (run cargo fmt).

Minor Observations

The test test_should_not_dictionary_encode_when_not_beneficial changes the test data from cardinality=10 to cardinality=1000 with num_values=1000. Since cardinality equals num_values, this makes the test less meaningful as it's testing a worst-case scenario. Consider if the original intent was to test high cardinality ratio rather than 100% cardinality.

codecov · 2025-12-30T05:36:45Z

Codecov Report

❌ Patch coverage is 96.82540% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../lance-encoding/src/encodings/logical/primitive.rs	95.49%	5 Missing ⚠️
...e-encoding/src/encodings/logical/primitive/dict.rs	98.38%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

feat: add dictionary encoding for 64bit types like int64/double

e0bebfa

github-actions bot added the enhancement New feature or request label Dec 30, 2025

wjones127 approved these changes Jan 1, 2026

View reviewed changes

wjones127 merged commit 3216350 into main Jan 1, 2026
30 checks passed

wjones127 deleted the xuanwo/lance-compress-int64 branch January 1, 2026 20:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add dictionary encoding for 64bit types like int64/double #5594

feat: add dictionary encoding for 64bit types like int64/double #5594

Xuanwo commented Dec 30, 2025

Uh oh!

github-actions bot commented Dec 30, 2025

Uh oh!

codecov bot commented Dec 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: add dictionary encoding for 64bit types like int64/double #5594

feat: add dictionary encoding for 64bit types like int64/double #5594

Conversation

Xuanwo commented Dec 30, 2025

Uh oh!

github-actions bot commented Dec 30, 2025

Code Review

Summary

P0/P1 Issues

Minor Observations

Uh oh!

codecov bot commented Dec 30, 2025

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants