
Conversation


@OutisLi OutisLi commented Oct 24, 2025

Accelerates data loading in the training phase.

Up to 40x acceleration for small descriptors, with lower CPU usage.

Summary by CodeRabbit

  • New Features

    • Single-frame loading API and improved mixed-type atomic system support.
    • Platform-aware cache sizing option exposed.
  • Performance Improvements

    • Memory-mapped data access with cached file handles.
    • Concurrent loading of frame data for better multi-core utilization.
  • Bug Fixes

    • Standardized frame keys and reshaping for consistent downstream behavior.
  • Tests

    • Updated test seed declaration and batch key handling (includes polarizability).

Copilot AI review requested due to automatic review settings (October 24, 2025 08:01)

Copilot AI left a comment

Pull Request Overview

This PR accelerates data loading in the training phase by implementing memory-mapped file access and parallel loading. The changes claim up to 40x acceleration for small descriptors with reduced CPU usage.

Key changes:

  • Introduced memory-mapped file loading with LRU caching to reduce I/O overhead
  • Implemented parallel data loading using ThreadPoolExecutor for non-reduced data items
  • Refactored frame loading logic into a new get_single_frame method that both PyTorch and Paddle backends use

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

  • deepmd/utils/data.py: Core implementation of memory-mapped loading and parallel data fetching via the get_single_frame and _load_single_data methods
  • deepmd/pd/utils/dataset.py: Updates the Paddle backend to call the new get_frame_paddle method (the name appears to be a typo)
  • deepmd/env.py: Adds dynamic LRU cache size calculation based on system file descriptor limits



coderabbitai bot commented Oct 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a dynamic LRU_CACHE_SIZE in deepmd/env.py; implements memmap-backed, LRU-cached, parallel single-frame loading in deepmd/utils/data.py with new methods and routing changes; get_item_torch/get_item_paddle now delegate to get_single_frame; test adds local GLOBAL_SEED and includes "atom_polarizability" in batch handling.

Changes

  • Resource Management Configuration (deepmd/env.py): Adds _default_lru_cache_size = 512, imports platform, computes LRU_CACHE_SIZE on non-Windows by reading resource.RLIMIT_NOFILE with a safe-buffer heuristic and fallbacks, handles ImportError and the Windows fallback, and exposes LRU_CACHE_SIZE in __all__.
  • Data Access & Memmap Caching (deepmd/utils/data.py): Adds get_single_frame(index) and _load_single_data(set_dir, key, frame_idx, set_nframes) for single-frame loading; parallelizes non-reduced key loads using ThreadPoolExecutor; introduces _get_memmap and a cached static _create_memmap(path_str, mtime_str) (LRU cached) to reuse np.memmap objects and validate .npy headers; routes get_item_torch / get_item_paddle through get_single_frame; adds imports for concurrency, pathlib, and functools, and uses LRU_CACHE_SIZE.
  • Tests — Seed & Batch Keys (source/tests/pt/test_loss_tensor.py): Adds a local GLOBAL_SEED constant and extends get_single_batch key handling to include "atom_polarizability" when reshaping/assembling batch outputs.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant DeepmdData
    participant ThreadPool as ThreadPoolExecutor
    participant Loader as _load_single_data
    participant MemmapFactory as _create_memmap
    participant Reducer

    Caller->>DeepmdData: get_single_frame(index)
    DeepmdData->>DeepmdData: prepare keys & indices

    rect rgb(235,243,255)
        Note over DeepmdData,ThreadPool: parallel load of non-reduced keys
        DeepmdData->>ThreadPool: submit Loader(set_dir, key, frame_idx)
        ThreadPool->>Loader: execute _load_single_data
        Loader->>MemmapFactory: request memmap(path_str, mtime)
        MemmapFactory-->>Loader: return cached np.memmap
        Loader-->>ThreadPool: return (key, data_slice)
        ThreadPool-->>DeepmdData: collect loaded slices
    end

    rect rgb(240,255,240)
        Note over DeepmdData,Reducer: synchronous reduce/compute keys
        DeepmdData->>Reducer: compute reduced entries (aggregates)
        Reducer-->>DeepmdData: reduced results
    end

    rect rgb(255,245,235)
        Note over DeepmdData: optional mixed-type handling
        alt mixed-type dataset
            DeepmdData->>MemmapFactory: load real_atom_types.npy memmap
            MemmapFactory-->>DeepmdData: atom-type memmap
            DeepmdData->>DeepmdData: derive natoms per type and type selection
        end
    end

    DeepmdData-->>Caller: assembled single-frame dict

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect _create_memmap header parsing, dtype/shape validation, and correctness of LRU cache keying (path + mtime).
  • Verify thread-safety and lifecycle of cached np.memmap objects and that file descriptors are managed safely across threads.
  • Confirm get_single_frame preserves previous semantics (ordering, dtypes, shapes) for get_item_torch/get_item_paddle.
  • Validate mixed-type handling, natoms/type selection logic, and the added "atom_polarizability" batch path in tests.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 63.16%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title "perf: accelerate data loading in training" aligns directly with the PR's main objective: the changes across deepmd/env.py and deepmd/utils/data.py introduce concurrent single-frame loading, memory-mapped data access, and LRU cache sizing, all focused on accelerating data loading during training. The title is concise, clear, and uses the conventional "perf:" prefix appropriately.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6d4638 and 40e7a12.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

383-383: Avoid specifying long messages outside the exception class

(TRY003)


441-443: Avoid specifying long messages outside the exception class

(TRY003)


863-863: Avoid specifying long messages outside the exception class

(TRY003)


905-907: Abstract raise to an inner function

(TRY301)


905-907: Avoid specifying long messages outside the exception class

(TRY003)


945-945: Avoid specifying long messages outside the exception class

(TRY003)


983-983: Unused static method argument: mtime_str

(ARG004)


1001-1001: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (9)
deepmd/utils/data.py (9)

80-80: LGTM: Improved readability.

Extracting the long error message into a variable improves code readability and maintainability.


105-111: LGTM: Efficient vectorized type mapping.

The vectorized approach using a lookup array is significantly more efficient than iterating over atoms, especially for systems with large atom counts.


252-271: LGTM: Clean consolidation of single-frame loading.

Delegating both PyTorch and Paddle backends to get_single_frame eliminates code duplication and maintains a single source of truth for single-frame loading logic.


380-475: Well-structured single-frame loader with good optimizations.

The method correctly:

  • Validates frame index bounds
  • Maps global index to set directory and local frame index
  • Uses parallel loading for independent keys
  • Handles mixed types efficiently with memmap
  • Uses vectorized operations (e.g., bincount, isin)

The ThreadPoolExecutor usage is appropriate given the small number of keys (as confirmed in past discussions).


494-501: LGTM: Vectorized type selection.

Using np.isin instead of explicit loops improves performance for atom type filtering.
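
For a concrete sense of the vectorized selection, a tiny numpy illustration (the atom types and selection here are made up, not the PR's data):

import numpy as np

atom_types = np.array([0, 1, 1, 2, 0])
type_sel = np.array([1, 2])

mask = np.isin(atom_types, type_sel)         # [False  True  True  True False]
print(np.where(mask)[0])                     # selected atom indices: [1 2 3]
print(np.bincount(atom_types, minlength=3))  # atoms per type: [2 2 1]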


721-723: LGTM: Consistent vectorized operations.

The vectorized type selection approach matches the optimization applied elsewhere in the codebase.


511-518: LGTM: Proper cache invalidation with mtime.

Using file modification time as part of the cache key ensures that the memmap cache is automatically invalidated when data files are updated.


812-945: Well-implemented single-frame data loader.

The method correctly replicates _load_data logic for single frames:

  • Properly handles default values when files don't exist (lines 860-873)
  • Single-frame corner case correctly addressed (lines 878-879) by ensuring memmap has frame dimension before slicing
  • Type selection logic matches the batch loading approach
  • Hessian special case properly handled (lines 909-920)
  • Exception handling uses log.exception to preserve traceback (lines 938-944), as per coding guidelines

The use of pre-computed set_nframes avoids redundant file I/O compared to calling _get_nframes().


981-1007: LGTM: Sophisticated memmap caching with proper invalidation.

The implementation correctly:

  • Uses LRU cache to limit open file handles
  • Manually parses .npy headers to support versions 1, 2, and 3
  • Creates read-only memmaps with correct dtype, shape, and offset
  • Uses mtime_str parameter for cache key to invalidate when files change

Note: The static analysis warning about unused mtime_str is a false positive—it's intentionally part of the cache key for the @lru_cache decorator.
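
For readers unfamiliar with the pattern, a minimal sketch of how an otherwise-unused argument participates in cache keying (names are illustrative, not the PR's exact code):

import os
from functools import lru_cache

import numpy as np


class Loader:
    @staticmethod
    @lru_cache(maxsize=512)
    def _create_memmap(path_str: str, mtime_str: str) -> np.memmap:
        # mtime_str is never read in the body, but it is part of the
        # lru_cache key: rewriting the file changes its mtime, the key
        # stops matching, and a fresh memmap is created.
        return np.memmap(path_str, dtype=np.float64, mode="r")

    def get(self, path_str: str) -> np.memmap:
        return self._create_memmap(path_str, str(os.path.getmtime(path_str)))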




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
deepmd/utils/data.py (4)

137-139: Remove unused memmap_cache attribute.

It’s never read. Keep surface minimal; rely on the LRU-cached _create_memmap.

-        # Add a cache for memory-mapped files
-        self.memmap_cache = {}

406-408: Use logging.exception to keep traceback.

Preserve stack trace when surfacing loader errors.

-                        log.error(f"{key!r} generated an exception: {exc}")
+                        log.exception("%r generated an exception: %s", key, exc)

946-971: Static LRU-cached memmap helper looks good; minor nits.

  • Good header parsing for v1–v3.
  • Consider catching OSError from np.memmap to rethrow with path context.
-        return np.memmap(
-            path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
-        )
+        try:
+            return np.memmap(
+                path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
+            )
+        except OSError as e:
+            raise OSError(f"memmap failed for {path_str}: {e}") from e

904-910: Use logging.exception in error path.

Keep traceback for diagnostics.

-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception("%s. %s", err_message, explanation)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9764f8 and b460dc1.

📒 Files selected for processing (3)
  • deepmd/env.py (3 hunks)
  • deepmd/pd/utils/dataset.py (1 hunks)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/pd/utils/dataset.py
  • deepmd/env.py
  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

407-407: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


842-842: Avoid specifying long messages outside the exception class

(TRY003)


883-885: Abstract raise to an inner function

(TRY301)


883-885: Avoid specifying long messages outside the exception class

(TRY003)


908-908: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


909-909: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


964-964: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C++ (cuda, cuda)
🔇 Additional comments (9)
deepmd/env.py (2)

4-4: Import looks fine.
No concerns.


16-24: Exporting LRU_CACHE_SIZE is good.
Public constant is now discoverable.

deepmd/utils/data.py (7)

452-465: Shape post-processing is fine; keep reformatting consistent.

Non-atomic arrays flattened; atomic preserved as [natoms, ndof]. Matches reformat_data_torch. No change needed.


80-84: Clearer format error message is good.
Improves readability.


21-25: Importing LRU_CACHE_SIZE is appropriate.
Integrates env sizing into cache.


946-948: Decorator order is correct.
lru_cache wraps the function then staticmethod wraps the cached callable.


1-14: New imports are appropriate.
functools, ThreadPoolExecutor, Path are justified by new logic.


52-57: The review comment is incorrect or misleading.

The snippet shown at lines 52-57 of deepmd/utils/data.py contains only docstring content (parameter descriptions), which cannot have TRY400/TRY003 violations. While Ruff confirms that many TRY003 violations exist throughout the codebase (e.g., in git/backend/find_pytorch.py, git/deepmd/common.py, git/deepmd/dpmodel/... and many others), these violations are not present at the specific file and lines referenced in the review.

The snippet shown—documentation of the Data class parameters—is immutable docstring text and is not code with exception handling that would trigger TRY003 hints.

Likely an incorrect or invalid review comment.


259-271: No action required—delegation and per-frame shapes are correct.

The review concern is unfounded. Verification confirms that get_single_frame returns properly formatted per-frame data with atomic fields already shaped as [natoms, ndof] (via _load_single_data lines 851, 869, 875). Downstream consumers handle this correctly:

  • deepmd/pt/utils/dataset.py:40 uses the output directly, expecting per-frame [natoms, ndof] format ✓
  • Test files calling reformat_data_torch work fine since the reshape operation is idempotent on already-shaped data ✓

The delegation pattern is sound and requires no changes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
deepmd/utils/data.py (3)

502-505: MAJOR: Add fallback for non-filesystem paths (e.g., HDF5).

np.memmap requires a real OS file path. For HDF5-backed DPPath instances, Path(str(path)) won't resolve to an actual file, causing failures. Implement a fallback to path.load_numpy() for virtual filesystems.

Apply this diff to add the fallback:

-    def _get_memmap(self, path: DPPath) -> np.memmap:
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
         """Get or create a memory-mapped object for a given npy file."""
-        abs_path_str = str(Path(str(path)).absolute())
-        return self._create_memmap(abs_path_str)
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                return self._create_memmap(str(abs_path))
+        except Exception:
+            pass
+        # Fallback for non-OS paths (e.g., HDF5)
+        return path.load_numpy()

Based on past review comments.


377-385: CRITICAL: Index-to-set mapping has off-by-one error.

bisect_right returns the insertion point after all existing occurrences. To map a global frame index to its local position within a set directory, you must subtract the previous cumulative sum, not the current one. Without this fix, negative or out-of-bounds local indices will cause data corruption or crashes.

Apply this diff to fix the mapping and add bounds validation:

 def get_single_frame(self, index: int) -> dict:
     """Orchestrates loading a single frame efficiently using memmap."""
+    if index < 0 or index >= self.nframes:
+        raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
     # 1. Find the correct set directory and local frame index
     set_idx = bisect.bisect_right(self.prefix_sum, index)
     set_dir = self.dirs[set_idx]
     if not isinstance(set_dir, DPPath):
         set_dir = DPPath(set_dir)
     # Calculate local index within the set.* directory
-    local_idx = index - self.prefix_sum[set_idx]
+    prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+    local_idx = index - prev_sum

Based on past review comments.


801-903: CRITICAL: Missing repeat logic breaks compatibility with _load_data.

The original _load_data method applies the repeat field (lines 788-789, 797-798) to tile data along the appropriate axis. _load_single_data completely omits this, causing shape mismatches and incorrect results when repeat != 1.

Apply this diff to add repeat handling:

                     data = data[idx_map, :]
 
+            # Apply repeat to match _load_data behavior
+            rep = int(vv.get("repeat", 1) or 1)
+            if rep != 1:
+                if vv["atomic"]:
+                    # Tile per-atom features
+                    data = np.repeat(data, rep, axis=1).reshape([-1])
+                else:
+                    data = np.repeat(data, rep)
+ 
             return np.float32(1.0), data

Additionally, line 897 assigns ndof but never uses it afterward. Consider removing the dead assignment:

-                    data = data.reshape(-1)
-                    ndof = 3 * ndof * 3 * ndof  # size of hessian is 3Natoms * 3Natoms
+                    data = data.reshape(-1)

Based on past review comments.
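
For context on what the repeat field does, a minimal numpy illustration of the tiling the suggested diff above relies on (shapes are illustrative only):

import numpy as np

# Non-atomic key (ndof=2, repeat=3): each component is tiled in place.
global_data = np.array([1.0, 2.0])
print(np.repeat(global_data, 3))  # [1. 1. 1. 2. 2. 2.]

# Atomic key (one frame row, natoms*ndof columns): repeating along
# axis=1 tiles every per-atom value before the final flatten.
atomic_data = np.array([[1.0, 2.0, 3.0]])
print(np.repeat(atomic_data, 2, axis=1).reshape([-1]))  # [1. 1. 2. 2. 3. 3.]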

🧹 Nitpick comments (3)
deepmd/utils/data.py (3)

392-406: Cap thread pool size to avoid oversubscription.

Spawning one thread per key can create excessive threads when there are many data keys, leading to context-switching overhead and resource contention.

Apply this diff to cap the worker count:

+        import os
         if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+            max_workers = min(len(non_reduced_keys), max(4, os.cpu_count() or 8))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

Based on past review comments.
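
For reference, a minimal sketch of the capped submit/as_completed pattern this suggestion describes (the key list and loader callable are placeholders, not the PR's code):

import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_one(key: str) -> tuple[str, str]:
    # Stand-in for _load_single_data(set_dir, key, frame_idx, set_nframes).
    return key, f"loaded {key}"


non_reduced_keys = ["coord", "force", "virial"]
max_workers = min(len(non_reduced_keys), max(4, os.cpu_count() or 8))
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = {executor.submit(load_one, kk): kk for kk in non_reduced_keys}
    for future in as_completed(futures):
        # result() re-raises any exception from the worker thread.
        key, payload = future.result()
        print(key, payload)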


405-405: Prefer log.exception for exception context.

Using log.exception automatically includes the traceback, which is more informative for debugging parallel loading failures.

Apply this diff:

-                        log.error(f"{key!r} generated an exception: {exc}")
+                        log.exception(f"{key!r} generated an exception: {exc}")

905-909: Use log.exception for better error context.

Replace two log.error calls with a single log.exception to automatically include the traceback, making debugging easier.

Apply this diff:

         except ValueError as err_message:
             explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception(f"{err_message}. {explanation}")
             raise ValueError(str(err_message) + ". " + explanation) from err_message
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b460dc1 and 97ddb72.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/env.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

405-405: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


882-884: Abstract raise to an inner function

(TRY301)


882-884: Avoid specifying long messages outside the exception class

(TRY003)


907-907: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


908-908: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


963-963: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cpu, cpu)
🔇 Additional comments (4)
deepmd/utils/data.py (4)

4-13: LGTM: Imports support new memmap and parallel loading functionality.

The added imports are appropriate for implementing cached memmap creation (functools), parallel loading (ThreadPoolExecutor, as_completed), path handling (Path), and cache sizing (LRU_CACHE_SIZE).

Also applies to: 24-24


408-464: LGTM: Reduction, mixed-type handling, and output formatting are correct.

The logic properly:

  • Computes reduced items by summing atomic data
  • Uses memmap for efficient real_atom_types loading in mixed-type systems
  • Applies type mapping when enforced
  • Efficiently counts atoms per type with bincount
  • Standardizes keys and handles non-PBC systems

945-969: LGTM: Memmap creation is well-implemented.

The static cached method properly:

  • Uses @lru_cache with the environment's LRU_CACHE_SIZE to limit open file handles
  • Reads NumPy file headers for versions 1.0, 2.0, and 3.0
  • Creates read-only memmaps with correct dtype, shape, order, and offset
  • Raises clear errors for unsupported versions
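
For orientation, a minimal sketch of header parsing plus memmap creation using numpy's public helpers (a simplified stand-in for the PR's parser; format version 3.0 is omitted here):

import numpy as np
from numpy.lib import format as npy_format


def open_memmap_readonly(path_str: str) -> np.memmap:
    with open(path_str, "rb") as f:
        version = npy_format.read_magic(f)  # e.g. (1, 0) or (2, 0)
        if version == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        elif version == (2, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
        else:
            raise ValueError(f"unsupported .npy version {version}: {path_str}")
        offset = f.tell()  # payload starts immediately after the header
    return np.memmap(
        path_str,
        dtype=dtype,
        mode="r",
        shape=shape,
        order="F" if fortran_order else "C",
        offset=offset,
    )

(numpy also ships np.lib.format.open_memmap, which covers the same end-to-end path; a manual parser mainly buys explicit control over version handling.)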

1-969: Reminder: Run linters before committing.

The coding guidelines require running ruff check . and ruff format . before committing Python changes. Multiple Ruff violations are flagged by static analysis (TRY003, TRY301, TRY400) that should be addressed.

As per coding guidelines.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

392-399: Thread pool sizing: okay given prior clarification.

Given your note that non_reduced_keys is small, current max_workers=len(non_reduced_keys) is acceptable.


380-386: Fix index-to-set mapping and add bounds check (off-by-one bug).

Current local_idx uses the current prefix sum, yielding negatives; also missing bounds validation.

Apply:

-        set_idx = bisect.bisect_right(self.prefix_sum, index)
-        set_dir = self.dirs[set_idx]
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"frame index {index} out of range [0, {self.nframes - 1}]")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum

Based on past review comments.

🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

450-457: Nit: comment mismatches logic.

This loop reshapes non-atomic items, not atomic ones. Update the comment for clarity.

-        # 6. Reshape atomic data to match expected format [natoms, ndof]
+        # 6. Reshape non-atomic data to 1D [ndof]

783-788: Minor grammar + logging parity in _load_data error path.

Mirror the logging.exception/grammar improvements here for consistency.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97ddb72 and 2ce2ea4.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

405-405: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


887-889: Abstract raise to an inner function

(TRY301)


887-889: Avoid specifying long messages outside the exception class

(TRY003)


914-914: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


915-915: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


970-970: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (7)
deepmd/utils/data.py (7)

5-13: LGTM on imports.

Adds functools, concurrent.futures, and pathlib for new flow; appropriate.


24-24: LRU cache size import looks good.

Using LRU_CACHE_SIZE for the memmap cache is appropriate.


80-80: Message tweak looks fine.

Error string clarified; no functional impact.


257-257: Good reuse via get_single_frame.

Torch path delegating to get_single_frame simplifies code.


261-268: Paddle path reuse LGTM.

Keeps both backends consistent.


502-506: Memmap helper: looks good.

Directly delegating to cached _create_memmap matches scope (npy-only).


952-977: Memmap cache helper LGTM.

Header-parse + offset-based np.memmap with LRU cache is solid.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
deepmd/utils/data.py (1)

377-386: Critical: Fix off-by-one error in local index calculation.

The calculation at line 385 is incorrect. bisect_right(self.prefix_sum, index) returns the position where index should be inserted, which is the correct set_idx. However, local_idx must subtract the previous prefix sum, not the current one.

Example: If prefix_sum = [5, 10, 15] and index = 7:

  • bisect_right returns 1 (correct set_idx)
  • Current code: local_idx = 7 - 10 = -3
  • Correct: local_idx = 7 - 5 = 2

Apply this diff:

-        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
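
A quick runnable check of the corrected mapping (prefix_sum holds cumulative frame counts, as in the example above):

import bisect

prefix_sum = [5, 10, 15]  # three sets holding 5 frames each
for index in [0, 4, 5, 7, 14]:
    set_idx = bisect.bisect_right(prefix_sum, index)
    prev_sum = 0 if set_idx == 0 else prefix_sum[set_idx - 1]
    local_idx = index - prev_sum
    print(index, "->", set_idx, local_idx)
# 0 -> (0, 0), 4 -> (0, 4), 5 -> (1, 0), 7 -> (1, 2), 14 -> (2, 4)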
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

401-406: Tighten the exception log message.

The handler already uses log.exception, so the traceback is preserved automatically; the message itself can state the failure more directly.

Apply this diff:

                     try:
                         frame_data["find_" + key], frame_data[key] = future.result()
-                    except Exception:
-                        log.exception("Key %r generated an exception", key)
+                    except Exception:
+                        log.exception("Failed to load key %r", key)
                         raise

Based on learnings.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ce2ea4 and e19efb0.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


862-864: Avoid specifying long messages outside the exception class

(TRY003)


894-896: Abstract raise to an inner function

(TRY301)


894-896: Avoid specifying long messages outside the exception class

(TRY003)


933-933: Avoid specifying long messages outside the exception class

(TRY003)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
🔇 Additional comments (6)
deepmd/utils/data.py (6)

5-13: LGTM! Import additions are appropriate.

All new imports are necessary for the memmap-backed concurrent loading implementation.

Also applies to: 24-24


80-80: LGTM! Error message formatting improved.

The single-line format is cleaner and the grammar is correct.


249-268: LGTM! Backend methods correctly routed to new single-frame loader.

Both PyTorch and Paddle backends now consistently use the optimized get_single_frame method.


502-505: LGTM! Memmap path resolution is appropriate.

The method correctly converts DPPath to an absolute filesystem path for memmap creation. Error handling is delegated to _create_memmap.


921-933: LGTM! Error handling is robust.

The exception handler correctly uses log.exception to preserve the traceback and provides a clear error message with context.


969-993: LGTM! Memmap caching implementation is well-designed.

The static method with LRU caching efficiently limits file handles while supporting numpy format versions 1-3. The read-only mode is appropriate for training data.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

910-910: Minor: Remove unused variable assignment.

The ndof variable assigned on line 910 is never used. While the comment is helpful for documentation, the assignment itself is dead code.

Consider either removing the assignment or using it if needed:

                     data = data[idx_map_hess, :]
                     data = data[:, idx_map_hess]
                     data = data.reshape(-1)
-                    # size of hessian is 3Natoms * 3Natoms
-                    ndof = 3 * ndof * 3 * ndof
+                    # size of hessian is 3*natoms × 3*natoms

377-387: Critical: Fix off-by-one error in index-to-set mapping.

The current calculation of local_idx subtracts the wrong prefix sum. For example, if prefix_sum = [10, 25, 40] and index = 5 (should be frame 5 in set 0):

  • bisect_right([10, 25, 40], 5) returns 0
  • local_idx = 5 - prefix_sum[0] = 5 - 10 = -5

The local index should subtract the previous cumulative sum, not the current one.

Apply this diff to fix the mapping and add bounds checking:

     def get_single_frame(self, index: int) -> dict:
         """Orchestrates loading a single frame efficiently using memmap."""
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
         # 1. Find the correct set directory and local frame index
         set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
         # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e19efb0 and 3d75fa4.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


862-864: Avoid specifying long messages outside the exception class

(TRY003)


894-896: Abstract raise to an inner function

(TRY301)


894-896: Avoid specifying long messages outside the exception class

(TRY003)


933-933: Avoid specifying long messages outside the exception class

(TRY003)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
🔇 Additional comments (8)
deepmd/utils/data.py (8)

5-13: LGTM!

The new imports appropriately support the memmap caching (functools), parallel loading (ThreadPoolExecutor, as_completed), path operations (Path), and configurable cache size (LRU_CACHE_SIZE).

Also applies to: 24-24


80-80: LGTM!

The simplified error message is clear and concise.


249-268: LGTM!

Excellent refactoring that eliminates code duplication by routing both backends to the unified get_single_frame method.


392-406: LGTM!

The parallel loading approach is sound. Using len(non_reduced_keys) as worker count is acceptable given the author's confirmation that the key count is small. The error handling correctly uses log.exception to preserve stack traces.


502-505: LGTM!

The wrapper appropriately converts DPPath to an absolute path string for memmap caching. The scope is correctly limited to .npy files as per the original codebase design.


784-784: LGTM!

The grammar is correct: "its" is the proper possessive form.


969-993: LGTM!

Excellent implementation of cached memmap creation. The manual header parsing correctly handles all .npy format versions (1, 2, 3), and the LRU cache appropriately limits open file handles. The read-only mode prevents accidental data corruption.


921-933: LGTM!

The error handling correctly uses log.exception to preserve the stack trace and provides a clear explanation of common causes. The approach aligns with coding guidelines.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

395-401: Thread pool size: acceptable as-is if key count is small; consider a soft cap.

Given earlier note that non_reduced_keys is small, this is fine. If variability grows, cap to CPU count to avoid oversubscription.

-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+            max_workers = min(len(non_reduced_keys), (os.cpu_count() or 8))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

504-508: Memmap creation assumes OS-backed .npy; OK for current scope.

Given the stated scope is .npy on the filesystem, direct Path resolution is acceptable. If HDF5-backed DPPath support is added later, consider a DPPath-aware fallback.

🧹 Nitpick comments (4)
deepmd/utils/data.py (4)

450-461: Non-atomic reshaping: concise condition is fine; minor simplification optional.

Current loop is clear. If desired, the conditional can be simplified.

-        for kk in self.data_dict.keys():
-            if (
-                "find_" not in kk
-                and kk in frame_data
-                and not self.data_dict[kk]["atomic"]
-            ):
-                frame_data[kk] = frame_data[kk].reshape(-1)
+        for kk, meta in self.data_dict.items():
+            if kk in frame_data and not meta["atomic"] and not kk.startswith("find_"):
+                frame_data[kk] = frame_data[kk].reshape(-1)

786-790: Use logging.exception in legacy _load_data for consistency and better traces.

Aligns with your new path and satisfies TRY400.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).


971-996: Memmap header read: correct offset/order; minor improvement—validate dtype endianness.

Consider rejecting non-native byte order to avoid silent casts by NumPy when slicing the memmap. Note that .npy headers usually record an explicit byte order (e.g. <f8, which is native on little-endian machines), so check dtype.isnative rather than comparing byteorder against '=' only.

-            offset = f.tell()
+            offset = f.tell()
+        # Ensure native-endian dtype for memmap
+        if not dtype.isnative:
+            raise ValueError(f"Non-native-endian dtype in {path_str}: {dtype}")

377-387: Index mapping verified correct; boundary tests are optional but recommended.

Verification confirms the bisect_right + prefix subtraction logic handles all boundary and mid-range cases correctly. Adding explicit boundary tests (e.g., test_index_mapping_boundaries covering indices 0, set boundaries, and the last frame) would be good defensive practice for long-term maintainability, but is not blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3aeedc and 291cbdb.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


843-843: Avoid specifying long messages outside the exception class

(TRY003)


864-866: Avoid specifying long messages outside the exception class

(TRY003)


896-898: Abstract raise to an inner function

(TRY301)


896-898: Avoid specifying long messages outside the exception class

(TRY003)


935-935: Avoid specifying long messages outside the exception class

(TRY003)


989-989: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
🔇 Additional comments (4)
deepmd/utils/data.py (4)

403-408: Good: exception logging preserves traceback.

Switch to logging.exception here is correct; keeps diagnostics without extra cost on success paths.


803-871: Single-frame loader: robust frame slicing and dtype handling.

Nice job covering 0D/1D/multi-frame and enforcing frame_idx==0 on single-frame files; reshape(1, -1) prevents downstream shape errors.

Please confirm that keys with non-atomic 0D scalars actually occur; if not, consider simplifying the ndim==0 branch later.


872-922: Atomic type_sel and Hessian paths look consistent; keep returns shapes stable.

The type_sel branches and Hessian remapping are consistent with _load_data. Returning [natoms, ndof] (or flattened for Hessian) is correct given the later normalization.

Also applies to: 923-936


424-436: The review comment suggests defensive programming for a theoretically possible but practically non-existent scenario.

Based on verification:

  • File format is consistently 2D: All test data creation and documentation specify real_atom_types.npy has shape [nframes, natoms]
  • Current code is correct: Line 426's mmap_types[local_idx] works correctly when the file is 2D (returns 1D array)
  • Suggestion mirrors existing patterns: Lines 856-866 already validate ndim for other memmap operations, so the guard is stylistically consistent
  • Risk is theoretical: No evidence of 1D files being created or used

The ndim guard would improve error messages if a malformed 1D file were encountered, but based on all available evidence, this scenario does not occur in practice. The code's current behavior is appropriate for the documented and consistently-used format.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aafccb2 and 0c87625.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


843-843: Avoid specifying long messages outside the exception class

(TRY003)


887-889: Abstract raise to an inner function

(TRY301)


887-889: Avoid specifying long messages outside the exception class

(TRY003)


925-925: Avoid specifying long messages outside the exception class

(TRY003)


979-979: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
🔇 Additional comments (7)
deepmd/utils/data.py (7)

5-13: LGTM: Imports support the new parallel loading infrastructure.

The new imports are necessary for the memmap-backed parallel single-frame loading feature.

Also applies to: 24-24


249-268: LGTM: Clean delegation to the new single-frame loader.

Both get_item_torch and get_item_paddle correctly delegate to get_single_frame, maintaining backward compatibility.


377-466: LGTM: Well-structured parallel single-frame loading orchestration.

The method correctly does the following; a minimal sketch follows the list:

  • Validates bounds
  • Maps global index to set directory and local frame index
  • Loads data keys in parallel using ThreadPoolExecutor
  • Handles mixed_type efficiently with memmap
  • Computes reduced keys from loaded data
  • Preserves exception tracebacks with log.exception
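
A minimal sketch of that fan-out (load_one stands in for _load_single_data(set_dir, key, frame_idx, set_nframes); the surrounding wiring is assumed):

from concurrent.futures import ThreadPoolExecutor

def load_frame(non_reduced_keys, set_dir, frame_idx, set_nframes, load_one):
    frame_data = {}
    with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
        futures = {
            executor.submit(load_one, set_dir, key, frame_idx, set_nframes): key
            for key in non_reduced_keys
        }
        for future, key in futures.items():
            # result() re-raises any worker exception with its traceback
            frame_data[key] = future.result()
    return frame_data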

504-507: LGTM: Clean wrapper for memmap access.

The method correctly resolves the absolute path and delegates to the cached _create_memmap method.


863-911: LGTM: Correct data processing for single-frame loads.

The type_sel handling, hessian special case, and atomic/non-atomic branching correctly replicate the _load_data semantics for single frames.


913-925: LGTM: Improved error handling preserves tracebacks.

The error handling correctly uses log.exception to preserve the stack trace and provides a helpful explanation for common mismatches. The grammar has been corrected per previous review feedback.


961-985: LGTM: Efficient cached memmap creation with proper header parsing.

The static method with LRU cache correctly:

  • Limits open file handles using LRU_CACHE_SIZE
  • Manually parses .npy headers to extract dtype, shape, fortran_order, and offset
  • Supports .npy format versions 1, 2, and 3
  • Creates read-only memmaps for safe concurrent access

The caching strategy is essential for performance when repeatedly accessing the same data files during training.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
deepmd/env.py (1)

52-69: Harden LRU sizing: guard RLIM_INFINITY, clamp to sane bounds, broaden except.

Unbounded or “infinite” RLIMIT_NOFILE can yield an excessive cache size. Also, any failure should fall back safely. Clamp the final value and ensure it’s an int. This also addresses earlier feedback.

-# Dynamic calculation of cache size
-_default_lru_cache_size = 512
-LRU_CACHE_SIZE = _default_lru_cache_size
-
-if platform.system() != "Windows":
-    try:
-        import resource
-
-        soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
-        safe_buffer = 128
-        if soft_limit > safe_buffer + _default_lru_cache_size:
-            LRU_CACHE_SIZE = soft_limit - safe_buffer
-        else:
-            LRU_CACHE_SIZE = soft_limit // 2
-    except ImportError:
-        LRU_CACHE_SIZE = _default_lru_cache_size
-else:
-    LRU_CACHE_SIZE = _default_lru_cache_size
+# Dynamic calculation of cache size
+_default_lru_cache_size = 512
+_max_lru_cache_size = 4096  # upper bound to avoid excessive open FDs
+LRU_CACHE_SIZE = _default_lru_cache_size
+
+if platform.system() != "Windows":
+    try:
+        import resource
+        soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
+        rlim_inf = getattr(resource, "RLIM_INFINITY", -1)
+        safe_buffer = 128
+        if soft_limit in (-1, rlim_inf) or soft_limit <= 0:
+            LRU_CACHE_SIZE = _default_lru_cache_size
+        else:
+            target = soft_limit - safe_buffer if soft_limit > safe_buffer else max(soft_limit // 2, 128)
+            # Clamp and cast
+            LRU_CACHE_SIZE = int(max(128, min(target, _max_lru_cache_size)))
+    except Exception:
+        LRU_CACHE_SIZE = _default_lru_cache_size
+else:
+    LRU_CACHE_SIZE = _default_lru_cache_size
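
For context, a sketch of how the computed value is consumed (the PR wires it into the memmap factory in deepmd/utils/data.py; the function below is a placeholder):

from functools import lru_cache

from deepmd.env import LRU_CACHE_SIZE

@lru_cache(maxsize=LRU_CACHE_SIZE)
def open_cached(path: str) -> bytes:
    # Placeholder workload; in the PR each cached entry is a np.memmap,
    # i.e. one open file descriptor per entry, hence the RLIMIT_NOFILE bound
    with open(path, "rb") as f:
        return f.read(16)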
deepmd/utils/data.py (2)

861-865: Avoid loading coord.npy per call; infer single-frame from memmap shape.

Calling _get_nframes(set_dir) defeats the purpose of memmap and adds I/O. Inspect the loaded array’s ndim instead (handle ndim==0/1).

-        # corner case: single frame
-        if self._get_nframes(set_dir) == 1:
-            mmap_obj = mmap_obj[None, ...]
+        # Normalize to have a frame dimension without touching other files
+        while mmap_obj.ndim < 2:
+            mmap_obj = mmap_obj[None, ...]

504-512: Fix: memmap assumes OS path; add robust fallback for non-OS DPPath (e.g., HDF5).

Path(str(path)).stat() will fail for DPH5Path (e.g., “file.h5#dataset”). Fall back to DPPath.load_numpy() in such cases to preserve compatibility.

-    def _get_memmap(self, path: DPPath) -> np.memmap:
-        """Get or create a memory-mapped object for a given npy file.
-        Uses file path and modification time as cache keys to detect file changes
-        and invalidate cache when files are modified.
-        """
-        abs_path = Path(str(path)).absolute()
-        file_mtime = abs_path.stat().st_mtime
-        return self._create_memmap(str(abs_path), str(file_mtime))
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
+        """Get or create a memory-mapped object for a given npy file.
+        Falls back to loading into memory if the path is not a regular OS file.
+        """
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                file_mtime = abs_path.stat().st_mtime
+                return self._create_memmap(str(abs_path), str(file_mtime))
+        except Exception:
+            # Non-OS-backed path or resolution/stat failed
+            pass
+        # Fallback: DPPath-aware loading (e.g., HDF5 datasets)
+        return path.load_numpy()
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

394-408: Optional: cap thread pool size to avoid oversubscription.

If non_reduced_keys can grow, cap workers to CPU count or a small upper bound. If keys are always few, ignore.

-        if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+        if non_reduced_keys:
+            import os
+            max_workers = min(len(non_reduced_keys), max(4, (os.cpu_count() or 8)))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

969-993: Silence ruff ARG004: unused _create_memmap mtime parameter (used only for cache key).

Rename the parameter with an underscore prefix to mark it as intentionally unused while still contributing to the LRU cache key.

-    def _create_memmap(path_str: str, mtime_str: str) -> np.memmap:
+    def _create_memmap(path_str: str, _mtime_token: str) -> np.memmap:

Optionally add a brief comment above: “_mtime_token is used only to vary the lru_cache key; not read.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47845a4 and d877e5b.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/env.py
  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


847-847: Avoid specifying long messages outside the exception class

(TRY003)


891-893: Abstract raise to an inner function

(TRY301)


891-893: Avoid specifying long messages outside the exception class

(TRY003)


931-931: Avoid specifying long messages outside the exception class

(TRY003)


969-969: Unused static method argument: mtime_str

(ARG004)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. (29 checks — identical to the list above.)
🔇 Additional comments (1)
deepmd/utils/data.py (1)

249-269: Routing to get_single_frame looks good.

Delegating both backends to the same fast path keeps behavior consistent.

@codecov

codecov bot commented Oct 25, 2025

Codecov Report

❌ Patch coverage is 72.89157% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.18%. Comparing base (b9764f8) to head (40e7a12).

Files with missing lines Patch % Lines
deepmd/utils/data.py 73.02% 41 Missing ⚠️
deepmd/env.py 71.42% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #5023      +/-   ##
==========================================
- Coverage   84.23%   84.18%   -0.05%     
==========================================
  Files         709      709              
  Lines       70074    70214     +140     
  Branches     3618     3619       +1     
==========================================
+ Hits        59028    59112      +84     
- Misses       9880     9935      +55     
- Partials     1166     1167       +1     

☔ View full report in Codecov by Sentry.

@njzjz njzjz requested review from caic99, iProzd and njzjz October 26, 2025 03:49
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

80-80: Consider shortening the inline error message.

The error message is quite long and Ruff suggests avoiding long messages outside the exception class.

Apply this diff to address the style guideline:

-        error_format_msg = "if one of the set is of mixed_type format, then all of the sets in this system should be of mixed_type format!"
+        error_format_msg = (
+            "if one of the set is of mixed_type format, then all of the sets "
+            "in this system should be of mixed_type format!"
+        )

As per coding guidelines (ruff TRY003).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d877e5b and e6d4638.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


438-440: Avoid specifying long messages outside the exception class

(TRY003)


864-864: Avoid specifying long messages outside the exception class

(TRY003)


908-910: Abstract raise to an inner function

(TRY301)


908-910: Avoid specifying long messages outside the exception class

(TRY003)


948-948: Avoid specifying long messages outside the exception class

(TRY003)


986-986: Unused static method argument: mtime_str

(ARG004)


1004-1004: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. (29 checks — identical to the lists above.)
🔇 Additional comments (6)
deepmd/utils/data.py (6)

5-24: LGTM: Imports are appropriate for the performance optimization.

All new imports are properly utilized for the memory-mapped, parallelized single-frame loading implementation.


249-268: LGTM: Clean routing to the new parallelized single-frame loader.

Both PyTorch and Paddle backends now share the optimized get_single_frame implementation, which provides consistency and improved performance.


377-472: LGTM: Efficient single-frame orchestration with proper parallelism.

The implementation correctly does the following; see the index-mapping sketch after this list:

  • Maps global frame index to set directory and local index using bisect_right
  • Validates bounds upfront
  • Parallelizes non-reduced key loading with ThreadPoolExecutor
  • Uses log.exception for proper traceback preservation
  • Handles mixed_type efficiently with memmap
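
As a minimal sketch of that index mapping (the per-set frame counts are made-up example data):

import bisect
import itertools

set_nframes = [100, 50, 75]                       # frames per set directory
prefix = list(itertools.accumulate(set_nframes))  # [100, 150, 225]

def locate(index: int) -> tuple[int, int]:
    if not 0 <= index < prefix[-1]:
        raise IndexError(f"frame index {index} out of range [0, {prefix[-1]})")
    set_idx = bisect.bisect_right(prefix, index)
    local_idx = index - (prefix[set_idx - 1] if set_idx > 0 else 0)
    return set_idx, local_idx

# locate(0) -> (0, 0); locate(100) -> (1, 0); locate(224) -> (2, 74)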

510-517: LGTM: Memmap wrapper with modification-time-based cache invalidation.

The implementation correctly invalidates the LRU cache when files are modified by including st_mtime in the cache key.
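
A self-contained sketch of that pattern (np.load(mmap_mode="r") is used for brevity; the PR parses the .npy header itself):

from functools import lru_cache
from pathlib import Path

import numpy as np

@lru_cache(maxsize=512)
def _cached_memmap(path_str: str, mtime_str: str) -> np.memmap:
    # mtime_str is deliberately unused in the body: it only varies the
    # cache key, so a modified file yields a fresh memmap while the stale
    # entry ages out of the LRU cache
    return np.load(path_str, mmap_mode="r")

def get_memmap(path: Path) -> np.memmap:
    p = path.resolve()
    return _cached_memmap(str(p), str(p.stat().st_mtime))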


813-948: LGTM: Robust single-frame data loader with proper edge case handling.

The implementation correctly:

  • Handles missing files with appropriate defaults
  • Adds dimension to single-frame files before slicing (line 879-880)
  • Applies type_sel logic consistently with _load_data
  • Handles the hessian special case
  • Uses log.exception to preserve stack traces (line 941)

The set_nframes == 1 check protects against IndexError when accessing mmap_obj.shape[1].
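
For illustration, the type_sel filtering reduces to a row mask over an atomic quantity (illustrative helper, not the PR's exact code):

import numpy as np

def select_types(data: np.ndarray, atype: np.ndarray, type_sel: list[int]) -> np.ndarray:
    # data: [natoms, ndof] atomic quantity for one frame
    # atype: [natoms] integer atom types
    mask = np.isin(atype, type_sel)
    return data[mask]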


984-1010: LGTM: Well-designed cached memmap factory with proper header parsing.

The mtime_str parameter (flagged by Ruff as unused) is intentionally part of the cache key for automatic cache invalidation when files are modified. This is a correct design pattern for lru_cache.

The implementation properly does the following; a header-parsing sketch follows the list:

  • Parses .npy headers using NumPy's official format readers
  • Supports .npy versions 1, 2, and 3
  • Creates read-only memmaps with correct offset, dtype, shape, and order
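
A rough reimplementation of what those bullets describe (the PR's actual code may differ; numpy.lib.format.open_memmap(path, mode="r") achieves the same in one call):

import numpy as np
from numpy.lib import format as npy_format

def open_readonly_memmap(path: str) -> np.memmap:
    with open(path, "rb") as f:
        major, _minor = npy_format.read_magic(f)
        if major == 1:
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        else:
            # versions 2 and 3 share the 2.0 header layout
            # (v3 differs only in header string encoding)
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
        offset = f.tell()  # data begins immediately after the header
    return np.memmap(
        path,
        dtype=dtype,
        mode="r",
        offset=offset,
        shape=shape,
        order="F" if fortran_order else "C",
    )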
