perf: accelerate data loading in training #5023
base: devel
Conversation
Pull Request Overview
This PR accelerates data loading in the training phase by implementing memory-mapped file access and parallel loading. The changes claim up to 40x acceleration for small descriptors with reduced CPU usage.
Key changes:
- Introduced memory-mapped file loading with LRU caching to reduce I/O overhead
- Implemented parallel data loading using ThreadPoolExecutor for non-reduced data items
- Refactored frame loading logic into a new `get_single_frame` method that both PyTorch and Paddle backends use (a minimal sketch of the overall pattern follows below)
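To make the pattern concrete, here is a minimal, self-contained sketch of the technique described above — an LRU-cached np.memmap per file plus a thread pool that slices one frame from each array. The names (`_open_memmap`, `load_frame`, `paths`) are illustrative, not the PR's actual API:

    from concurrent.futures import ThreadPoolExecutor
    from functools import lru_cache

    import numpy as np


    @lru_cache(maxsize=512)  # bound the number of open file handles
    def _open_memmap(path: str) -> np.ndarray:
        # mmap_mode="r" maps the file read-only instead of copying it into RAM
        return np.load(path, mmap_mode="r")


    def load_frame(paths: dict, frame_idx: int) -> dict:
        """Fetch one frame's slice from each .npy file in parallel (sketch only)."""

        def fetch(key):
            # Slicing a memmap reads only the bytes belonging to this frame
            return key, np.array(_open_memmap(paths[key])[frame_idx])

        with ThreadPoolExecutor(max_workers=len(paths)) as pool:
            return dict(pool.map(fetch, paths))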
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| deepmd/utils/data.py | Core implementation of memory-mapped loading and parallel data fetching via get_single_frame and _load_single_data methods |
| deepmd/pd/utils/dataset.py | Updates Paddle backend to call the new get_frame_paddle method (appears to be a typo) |
| deepmd/env.py | Adds dynamic LRU cache size calculation based on system file descriptor limits |
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a dynamic LRU_CACHE_SIZE in deepmd/env.py; implements memmap-backed, LRU-cached, parallel single-frame loading in deepmd/utils/data.py with new methods and routing changes; get_item_torch/get_item_paddle now delegate to get_single_frame; a test adds a local GLOBAL_SEED and includes "atom_polarizability" in batch handling.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Caller
participant DeepmdData
participant ThreadPool as ThreadPoolExecutor
participant Loader as _load_single_data
participant MemmapFactory as _create_memmap
participant Reducer
Caller->>DeepmdData: get_single_frame(index)
DeepmdData->>DeepmdData: prepare keys & indices
rect rgb(235,243,255)
Note over DeepmdData,ThreadPool: parallel load of non-reduced keys
DeepmdData->>ThreadPool: submit Loader(set_dir, key, frame_idx)
ThreadPool->>Loader: execute _load_single_data
Loader->>MemmapFactory: request memmap(path_str, mtime)
MemmapFactory-->>Loader: return cached np.memmap
Loader-->>ThreadPool: return (key, data_slice)
ThreadPool-->>DeepmdData: collect loaded slices
end
rect rgb(240,255,240)
Note over DeepmdData,Reducer: synchronous reduce/compute keys
DeepmdData->>Reducer: compute reduced entries (aggregates)
Reducer-->>DeepmdData: reduced results
end
rect rgb(255,245,235)
Note over DeepmdData: optional mixed-type handling
alt mixed-type dataset
DeepmdData->>MemmapFactory: load real_atom_types.npy memmap
MemmapFactory-->>DeepmdData: atom-type memmap
DeepmdData->>DeepmdData: derive natoms per type and type selection
end
end
DeepmdData-->>Caller: assembled single-frame dict
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (4)
deepmd/utils/data.py (4)
137-139: Remove unused `memmap_cache` attribute.

It's never read. Keep the surface minimal; rely on the LRU-cached `_create_memmap`.

-        # Add a cache for memory-mapped files
-        self.memmap_cache = {}
406-408: Use `logging.exception` to keep the traceback.

Preserve the stack trace when surfacing loader errors.

-                    log.error(f"{key!r} generated an exception: {exc}")
+                    log.exception("%r generated an exception: %s", key, exc)
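For context, a standalone demonstration of the difference (nothing here is project code):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger(__name__)

    try:
        1 / 0
    except ZeroDivisionError:
        log.error("error() records the message only, unless exc_info is passed")
        log.exception("exception() logs at ERROR level and appends the traceback")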
946-971: Static LRU-cached memmap helper looks good; minor nits.

- Good header parsing for v1–v3.
- Consider catching `OSError` from `np.memmap` to rethrow with path context.

-        return np.memmap(
-            path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
-        )
+        try:
+            return np.memmap(
+                path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
+            )
+        except OSError as e:
+            raise OSError(f"memmap failed for {path_str}: {e}") from e
904-910: Use `logging.exception` in the error path.

Keep the traceback for diagnostics.

-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception("%s. %s", err_message, explanation)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- deepmd/env.py (3 hunks)
- deepmd/pd/utils/dataset.py (1 hunks)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
- deepmd/pd/utils/dataset.py
- deepmd/env.py
- deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)

deepmd/utils/data_system.py (1)

- get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
407-407: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
432-434: Avoid specifying long messages outside the exception class
(TRY003)
842-842: Avoid specifying long messages outside the exception class
(TRY003)
883-885: Abstract raise to an inner function
(TRY301)
883-885: Avoid specifying long messages outside the exception class
(TRY003)
908-908: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
909-909: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
964-964: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
- GitHub Check: Build C++ (cuda, cuda)
🔇 Additional comments (9)
deepmd/env.py (2)
4-4: Import looks fine.
No concerns.
16-24: Exporting LRU_CACHE_SIZE is good.
The public constant is now discoverable.

deepmd/utils/data.py (7)
452-465: Shape post-processing is fine; keep reformatting consistent.

Non-atomic arrays are flattened; atomic arrays are preserved as [natoms, ndof]. Matches `reformat_data_torch`. No change needed.
80-84: Clearer format error message is good.
Improves readability.
21-25: Importing `LRU_CACHE_SIZE` is appropriate.
Integrates env sizing into cache.
946-948: Decorator order is correct.
`lru_cache` wraps the function, then `staticmethod` wraps the cached callable.
1-14: New imports are appropriate.
`functools`, `ThreadPoolExecutor`, and `Path` are justified by the new logic.
52-57: The review comment is incorrect or misleading.

The snippet shown at lines 52-57 of deepmd/utils/data.py contains only docstring content (parameter descriptions), which cannot have TRY400/TRY003 violations. While Ruff confirms that many TRY003 violations exist throughout the codebase (e.g., in git/backend/find_pytorch.py, git/deepmd/common.py, git/deepmd/dpmodel/..., and many others), these violations are not present at the specific file and lines referenced in the review. The snippet shown—documentation of the Data class parameters—is immutable docstring text, not code with exception handling that would trigger TRY003 hints.

Likely an incorrect or invalid review comment.
259-271: No action required—delegation and per-frame shapes are correct.

The review concern is unfounded. Verification confirms that get_single_frame returns properly formatted per-frame data, with atomic fields already shaped as [natoms, ndof] (via _load_single_data lines 851, 869, 875). Downstream consumers handle this correctly:

- deepmd/pt/utils/dataset.py:40 uses the output directly, expecting per-frame [natoms, ndof] format ✓
- Test files calling reformat_data_torch work fine, since the reshape operation is idempotent on already-shaped data ✓

The delegation pattern is sound and requires no changes.
Actionable comments posted: 0
♻️ Duplicate comments (3)
deepmd/utils/data.py (3)
502-505: MAJOR: Add fallback for non-filesystem paths (e.g., HDF5).
`np.memmap` requires a real OS file path. For HDF5-backed `DPPath` instances, `Path(str(path))` won't resolve to an actual file, causing failures. Implement a fallback to `path.load_numpy()` for virtual filesystems.

Apply this diff to add the fallback:
-    def _get_memmap(self, path: DPPath) -> np.memmap:
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
         """Get or create a memory-mapped object for a given npy file."""
-        abs_path_str = str(Path(str(path)).absolute())
-        return self._create_memmap(abs_path_str)
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                return self._create_memmap(str(abs_path))
+        except Exception:
+            pass
+        # Fallback for non-OS paths (e.g., HDF5)
+        return path.load_numpy()

Based on past review comments.
377-385: CRITICAL: Index-to-set mapping has off-by-one error.
`bisect_right` returns the insertion point after all existing occurrences. To map a global frame index to its local position within a set directory, you must subtract the previous cumulative sum, not the current one. Without this fix, negative or out-of-bounds local indices will cause data corruption or crashes.

Apply this diff to fix the mapping and add bounds validation:
     def get_single_frame(self, index: int) -> dict:
         """Orchestrates loading a single frame efficiently using memmap."""
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
         # 1. Find the correct set directory and local frame index
         set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
         # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
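As a quick standalone check of the proposed mapping (the prefix-sum layout below is assumed for illustration, not taken from the PR):

    import bisect

    # Cumulative frame counts per set.* directory: set 0 holds frames 0-9,
    # set 1 holds frames 10-24, set 2 holds frames 25-39.
    prefix_sum = [10, 25, 40]

    def locate(index):
        set_idx = bisect.bisect_right(prefix_sum, index)
        prev_sum = 0 if set_idx == 0 else prefix_sum[set_idx - 1]
        return set_idx, index - prev_sum

    assert locate(0) == (0, 0)    # first frame overall
    assert locate(9) == (0, 9)    # last frame of set 0
    assert locate(10) == (1, 0)   # boundary rolls over into set 1
    assert locate(39) == (2, 14)  # last frame overall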
801-903: CRITICAL: Missing `repeat` logic breaks compatibility with `_load_data`.

The original `_load_data` method applies the `repeat` field (lines 788-789, 797-798) to tile data along the appropriate axis. `_load_single_data` completely omits this, causing shape mismatches and incorrect results when `repeat != 1`.
Apply this diff to add repeat handling:

         data = data[idx_map, :]
+        # Apply repeat to match _load_data behavior
+        rep = int(vv.get("repeat", 1) or 1)
+        if rep != 1:
+            if vv["atomic"]:
+                # Tile per-atom features
+                data = np.repeat(data, rep, axis=1).reshape([-1])
+            else:
+                data = np.repeat(data, rep)
         return np.float32(1.0), data

Additionally, line 897 assigns `ndof` but never uses it afterward. Consider removing the dead assignment:

-        data = data.reshape(-1)
-        ndof = 3 * ndof * 3 * ndof  # size of hessian is 3Natoms * 3Natoms
+        data = data.reshape(-1)

Based on past review comments.
🧹 Nitpick comments (3)
deepmd/utils/data.py (3)
392-406: Cap thread pool size to avoid oversubscription.

Spawning one thread per key can create excessive threads when there are many data keys, leading to context-switching overhead and resource contention.

Apply this diff to cap the worker count:
Apply this diff to cap the worker count:
+import os
 if non_reduced_keys:
-    with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+    max_workers = min(len(non_reduced_keys), max(4, os.cpu_count() or 8))
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:

Based on past review comments.
405-405: Prefer `logging.exception` for exception context.

Using `log.exception` instead of `log.error` automatically includes the traceback, which is more informative for debugging parallel loading failures.

Apply this diff:

-                    log.error(f"{key!r} generated an exception: {exc}")
+                    log.exception(f"{key!r} generated an exception: {exc}")
905-909: Use `log.exception` for better error context.

Replace the two `log.error` calls with a single `log.exception` to automatically include the traceback, making debugging easier.

Apply this diff:

         except ValueError as err_message:
             explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception(f"{err_message}. {explanation}")
             raise ValueError(str(err_message) + ". " + explanation) from err_message
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deepmd/env.py (3 hunks)
- deepmd/utils/data.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/env.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)

deepmd/utils/data_system.py (1)

- get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
405-405: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
430-432: Avoid specifying long messages outside the exception class
(TRY003)
841-841: Avoid specifying long messages outside the exception class
(TRY003)
882-884: Abstract raise to an inner function
(TRY301)
882-884: Avoid specifying long messages outside the exception class
(TRY003)
907-907: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
908-908: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
963-963: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test C++ (true)
- GitHub Check: Test C++ (false)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cpu, cpu)
🔇 Additional comments (4)
deepmd/utils/data.py (4)
4-13: LGTM: Imports support new memmap and parallel loading functionality.

The added imports are appropriate for implementing cached memmap creation (`functools`), parallel loading (`ThreadPoolExecutor`, `as_completed`), path handling (`Path`), and cache sizing (`LRU_CACHE_SIZE`).

Also applies to: 24-24
408-464: LGTM: Reduction, mixed-type handling, and output formatting are correct.

The logic properly:

- Computes reduced items by summing atomic data
- Uses memmap for efficient `real_atom_types` loading in mixed-type systems
- Applies type mapping when enforced
- Efficiently counts atoms per type with `bincount` (see the example after this list)
- Standardizes keys and handles non-PBC systems
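For reference, a tiny standalone example of the bincount step (the values are made up for illustration):

    import numpy as np

    atom_types = np.array([0, 2, 1, 0, 2, 2])  # per-atom type ids for one frame
    natoms_per_type = np.bincount(atom_types, minlength=3)
    print(natoms_per_type)  # [2 1 3]

minlength guarantees one slot per known type even when a type is absent from the frame.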
945-969: LGTM: Memmap creation is well-implemented.

The static cached method properly:

- Uses `@lru_cache` with the environment's `LRU_CACHE_SIZE` to limit open file handles
- Reads NumPy file headers for versions 1.0, 2.0, and 3.0 (sketched below)
- Creates read-only memmaps with correct dtype, shape, order, and offset
- Raises clear errors for unsupported versions
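A self-contained sketch of that header-parsing approach, assuming an on-disk .npy file; it uses numpy.lib.format's public read_magic/read_array_header_* helpers and covers only versions 1.0 and 2.0 for brevity:

    from functools import lru_cache

    import numpy as np
    from numpy.lib import format as npy_format


    @lru_cache(maxsize=512)
    def open_npy_memmap(path: str) -> np.memmap:
        with open(path, "rb") as f:
            version = npy_format.read_magic(f)  # e.g. (1, 0) or (2, 0)
            if version == (1, 0):
                shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
            elif version == (2, 0):
                shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
            else:
                raise ValueError(f"unsupported .npy version {version} in {path}")
            offset = f.tell()  # payload starts immediately after the header
        order = "F" if fortran_order else "C"
        return np.memmap(path, dtype=dtype, mode="r", shape=shape, order=order, offset=offset)

Note that np.load(path, mmap_mode="r") performs the same header handling internally; parsing manually only matters when you want explicit control over the offset and the cache key.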
1-969: Reminder: Run linters before committing.

The coding guidelines require running `ruff check .` and `ruff format .` before committing Python changes. Multiple Ruff violations are flagged by static analysis (TRY003, TRY301, TRY400) that should be addressed.

As per coding guidelines.
Actionable comments posted: 2
♻️ Duplicate comments (2)
deepmd/utils/data.py (2)
392-399: Thread pool sizing: okay given prior clarification.

Given your note that non_reduced_keys is small, the current max_workers=len(non_reduced_keys) is acceptable.
380-386: Fix index-to-set mapping and add bounds check (off-by-one bug).

The current local_idx uses the current prefix sum, yielding negative values; bounds validation is also missing.
Apply:
-        set_idx = bisect.bisect_right(self.prefix_sum, index)
-        set_dir = self.dirs[set_idx]
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"frame index {index} out of range [0, {self.nframes - 1}]")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum

Based on past review comments.
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)
450-457: Nit: comment mismatches logic.

This loop reshapes non-atomic items, not atomic ones. Update the comment for clarity.

-        # 6. Reshape atomic data to match expected format [natoms, ndof]
+        # 6. Reshape non-atomic data to 1D [ndof]
783-788: Minor grammar + logging parity in _load_data error path.

Mirror the logging.exception/grammar improvements here for consistency.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
405-405: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
430-432: Avoid specifying long messages outside the exception class
(TRY003)
841-841: Avoid specifying long messages outside the exception class
(TRY003)
887-889: Abstract raise to an inner function
(TRY301)
887-889: Avoid specifying long messages outside the exception class
(TRY003)
914-914: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
915-915: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
970-970: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test C++ (true)
- GitHub Check: Test C++ (false)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
🔇 Additional comments (7)
deepmd/utils/data.py (7)
5-13: LGTM on imports.

Adds functools, concurrent.futures, and pathlib for the new flow; appropriate.

24-24: LRU cache size import looks good.

Using LRU_CACHE_SIZE for the memmap cache is appropriate.

80-80: Message tweak looks fine.

Error string clarified; no functional impact.

257-257: Good reuse via get_single_frame.

The Torch path delegating to get_single_frame simplifies the code.

261-268: Paddle path reuse LGTM.

Keeps both backends consistent.

502-506: Memmap helper: looks good.

Directly delegating to the cached _create_memmap matches the scope (npy-only).

952-977: Memmap cache helper LGTM.

Header parsing plus offset-based np.memmap with an LRU cache is solid.
Actionable comments posted: 1
♻️ Duplicate comments (1)
deepmd/utils/data.py (1)
377-386: Critical: Fix off-by-one error in local index calculation.

The calculation at line 385 is incorrect. `bisect_right(self.prefix_sum, index)` returns the position where `index` should be inserted, which is the correct `set_idx`. However, `local_idx` must subtract the previous prefix sum, not the current one.

Example: If `prefix_sum = [5, 10, 15]` and `index = 7`:

- `bisect_right` returns `1` (correct set_idx)
- Current code: `local_idx = 7 - 10 = -3` ❌
- Correct: `local_idx = 7 - 5 = 2` ✓

Apply this diff:
-        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)
401-406: Prefer logging.exception to preserve the traceback.

Using `log.exception` instead of `log.error` inside the `except` block will automatically include the stack trace, making debugging easier.

Apply this diff:

                 try:
                     frame_data["find_" + key], frame_data[key] = future.result()
-                except Exception:
-                    log.exception("Key %r generated an exception", key)
+                except Exception:
+                    log.exception("Failed to load key %r", key)
                     raise

Based on learnings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
430-432: Avoid specifying long messages outside the exception class
(TRY003)
841-841: Avoid specifying long messages outside the exception class
(TRY003)
862-864: Avoid specifying long messages outside the exception class
(TRY003)
894-896: Abstract raise to an inner function
(TRY301)
894-896: Avoid specifying long messages outside the exception class
(TRY003)
933-933: Avoid specifying long messages outside the exception class
(TRY003)
987-987: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
🔇 Additional comments (6)
deepmd/utils/data.py (6)
5-13: LGTM! Import additions are appropriate.

All new imports are necessary for the memmap-backed concurrent loading implementation.

Also applies to: 24-24

80-80: LGTM! Error message formatting improved.

The single-line format is cleaner and the grammar is correct.

249-268: LGTM! Backend methods correctly routed to new single-frame loader.

Both PyTorch and Paddle backends now consistently use the optimized `get_single_frame` method.

502-505: LGTM! Memmap path resolution is appropriate.

The method correctly converts DPPath to an absolute filesystem path for memmap creation. Error handling is delegated to `_create_memmap`.

921-933: LGTM! Error handling is robust.

The exception handler correctly uses `log.exception` to preserve the traceback and provides a clear error message with context.

969-993: LGTM! Memmap caching implementation is well-designed.

The static method with LRU caching efficiently limits open file handles while supporting numpy format versions 1-3. The read-only mode is appropriate for training data.
Actionable comments posted: 1
♻️ Duplicate comments (2)
deepmd/utils/data.py (2)
910-910: Minor: Remove unused variable assignment.

The `ndof` variable assigned on line 910 is never used. While the comment is helpful for documentation, the assignment itself is dead code.

Consider either removing the assignment or using it if needed:

         data = data[idx_map_hess, :]
         data = data[:, idx_map_hess]
         data = data.reshape(-1)
-        # size of hessian is 3Natoms * 3Natoms
-        ndof = 3 * ndof * 3 * ndof
+        # size of hessian is 3*natoms × 3*natoms
377-387: Critical: Fix off-by-one error in index-to-set mapping.

The current calculation of `local_idx` subtracts the wrong prefix sum. For example, if `prefix_sum = [10, 25, 40]` and `index = 5` (which should be frame 5 in set 0):

- `bisect_right([10, 25, 40], 5)` returns `0`
- `local_idx = 5 - prefix_sum[0] = 5 - 10 = -5` ❌

The local index should subtract the previous cumulative sum, not the current one.

Apply this diff to fix the mapping and add bounds checking:

     def get_single_frame(self, index: int) -> dict:
         """Orchestrates loading a single frame efficiently using memmap."""
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
         # 1. Find the correct set directory and local frame index
         set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
         # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)

deepmd/utils/data_system.py (1)

- get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
430-432: Avoid specifying long messages outside the exception class
(TRY003)
841-841: Avoid specifying long messages outside the exception class
(TRY003)
862-864: Avoid specifying long messages outside the exception class
(TRY003)
894-896: Abstract raise to an inner function
(TRY301)
894-896: Avoid specifying long messages outside the exception class
(TRY003)
933-933: Avoid specifying long messages outside the exception class
(TRY003)
987-987: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test C++ (false)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Analyze (python)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Test C++ (true)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
🔇 Additional comments (8)
deepmd/utils/data.py (8)
5-13: LGTM!

The new imports appropriately support the memmap caching (`functools`), parallel loading (`ThreadPoolExecutor`, `as_completed`), path operations (`Path`), and configurable cache size (`LRU_CACHE_SIZE`).

Also applies to: 24-24

80-80: LGTM!

The simplified error message is clear and concise.

249-268: LGTM!

Excellent refactoring that eliminates code duplication by routing both backends to the unified `get_single_frame` method.

392-406: LGTM!

The parallel loading approach is sound. Using `len(non_reduced_keys)` as the worker count is acceptable given the author's confirmation that the key count is small. The error handling correctly uses `log.exception` to preserve stack traces.

502-505: LGTM!

The wrapper appropriately converts `DPPath` to an absolute path string for memmap caching. The scope is correctly limited to `.npy` files as per the original codebase design.

784-784: LGTM!

The grammar is correct: "its" is the proper possessive form.

969-993: LGTM!

Excellent implementation of cached memmap creation. The manual header parsing correctly handles all `.npy` format versions (1, 2, 3), and the LRU cache appropriately limits open file handles. The read-only mode prevents accidental data corruption.

921-933: LGTM!

The error handling correctly uses `log.exception` to preserve the stack trace and provides a clear explanation of common causes. The approach aligns with coding guidelines.
Actionable comments posted: 0
♻️ Duplicate comments (2)
deepmd/utils/data.py (2)
395-401: Thread pool size: acceptable as-is if key count is small; consider a soft cap.

Given the earlier note that non_reduced_keys is small, this is fine. If variability grows, cap to the CPU count to avoid oversubscription.

-        with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+        max_workers = min(len(non_reduced_keys), (os.cpu_count() or 8))
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
504-508: Memmap creation assumes OS-backed .npy; OK for current scope.

Given the stated scope is .npy on the filesystem, direct Path resolution is acceptable. If HDF5-backed DPPath support is added later, consider a DPPath-aware fallback.
🧹 Nitpick comments (4)
deepmd/utils/data.py (4)
450-461: Non-atomic reshaping: concise condition is fine; minor simplification optional.

The current loop is clear. If desired, the conditional can be simplified.

-        for kk in self.data_dict.keys():
-            if (
-                "find_" not in kk
-                and kk in frame_data
-                and not self.data_dict[kk]["atomic"]
-            ):
-                frame_data[kk] = frame_data[kk].reshape(-1)
+        for kk, meta in self.data_dict.items():
+            if kk in frame_data and not meta["atomic"] and not kk.startswith("find_"):
+                frame_data[kk] = frame_data[kk].reshape(-1)
786-790: Use logging.exception in legacy _load_data for consistency and better traces.

Aligns with your new path and satisfies TRY400.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).
971-996: Memmap header read: correct offset/order; minor improvement—validate dtype endianness.

Consider rejecting non-native byte order to avoid silent casts by NumPy when slicing the memmap.

-        offset = f.tell()
+        offset = f.tell()
+        # Ensure native-endian dtype for memmap
+        if dtype.byteorder not in ('=', '|'):  # not native and not not-applicable
+            raise ValueError(f"Unsupported non-native-endian dtype in {path_str}: {dtype}")
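A quick standalone illustration of the dtype.byteorder codes that the suggested guard relies on:

    import sys

    import numpy as np

    swapped_code = ">" if sys.byteorder == "little" else "<"

    print(np.dtype("f8").byteorder)                 # '=' (native)
    print(np.dtype("u1").byteorder)                 # '|' (byte order not applicable)
    print(np.dtype(swapped_code + "f8").byteorder)  # the non-native code, '>' or '<'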
377-387: Index mapping verified correct; boundary tests are optional but recommended.

Verification confirms the bisect_right + prefix subtraction logic handles all boundary and mid-range cases correctly. Adding explicit boundary tests (e.g., `test_index_mapping_boundaries` covering indices 0, set boundaries, and the last frame) would be good defensive practice for long-term maintainability, but is not blocking.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)

deepmd/utils/data_system.py (1)

- get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
380-380: Avoid specifying long messages outside the exception class
(TRY003)
432-434: Avoid specifying long messages outside the exception class
(TRY003)
843-843: Avoid specifying long messages outside the exception class
(TRY003)
864-866: Avoid specifying long messages outside the exception class
(TRY003)
896-898: Abstract raise to an inner function
(TRY301)
896-898: Avoid specifying long messages outside the exception class
(TRY003)
935-935: Avoid specifying long messages outside the exception class
(TRY003)
989-989: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Analyze (python)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
🔇 Additional comments (4)
deepmd/utils/data.py (4)
403-408: Good: exception logging preserves the traceback.

The switch to logging.exception here is correct; it keeps diagnostics without extra cost on success paths.
803-871: Single-frame loader: robust frame slicing and dtype handling.

Nice job covering 0D/1D/multi-frame arrays and enforcing frame_idx == 0 on single-frame files; reshape(1, -1) prevents downstream shape errors.

Please confirm that keys with non-atomic 0D scalars actually occur; if not, consider simplifying the ndim == 0 branch later.
872-922: Atomic type_sel and Hessian paths look consistent; keep return shapes stable.

The type_sel branches and Hessian remapping are consistent with _load_data. Returning [natoms, ndof] (or flattened for the Hessian) is correct given the later normalization.
Also applies to: 923-936
424-436: The review comment suggests defensive programming for a theoretically possible but practically non-existent scenario.

Based on verification:

- File format is consistently 2D: all test data creation and documentation specify `real_atom_types.npy` has shape [nframes, natoms]
- Current code is correct: line 426's `mmap_types[local_idx]` works correctly when the file is 2D (returns a 1D array)
- Suggestion mirrors existing patterns: lines 856-866 already validate `ndim` for other memmap operations, so the guard is stylistically consistent
- Risk is theoretical: no evidence of 1D files being created or used

The ndim guard would improve error messages if a malformed 1D file were encountered, but based on all available evidence, this scenario does not occur in practice. The code's current behavior is appropriate for the documented and consistently-used format.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)

deepmd/utils/data_system.py (1)

- get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
380-380: Avoid specifying long messages outside the exception class
(TRY003)
432-434: Avoid specifying long messages outside the exception class
(TRY003)
843-843: Avoid specifying long messages outside the exception class
(TRY003)
887-889: Abstract raise to an inner function
(TRY301)
887-889: Avoid specifying long messages outside the exception class
(TRY003)
925-925: Avoid specifying long messages outside the exception class
(TRY003)
979-979: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
🔇 Additional comments (7)
deepmd/utils/data.py (7)
5-13: LGTM: Imports support the new parallel loading infrastructure.

The new imports are necessary for the memmap-backed parallel single-frame loading feature.

Also applies to: 24-24

249-268: LGTM: Clean delegation to the new single-frame loader.

Both `get_item_torch` and `get_item_paddle` correctly delegate to `get_single_frame`, maintaining backward compatibility.

377-466: LGTM: Well-structured parallel single-frame loading orchestration.

The method correctly:

- Validates bounds
- Maps the global index to a set directory and local frame index
- Loads data keys in parallel using ThreadPoolExecutor
- Handles mixed_type efficiently with memmap
- Computes reduced keys from loaded data
- Preserves exception tracebacks with `log.exception`
504-507: LGTM: Clean wrapper for memmap access.

The method correctly resolves the absolute path and delegates to the cached `_create_memmap` method.

863-911: LGTM: Correct data processing for single-frame loads.

The type_sel handling, hessian special case, and atomic/non-atomic branching correctly replicate the `_load_data` semantics for single frames.

913-925: LGTM: Improved error handling preserves tracebacks.

The error handling correctly uses `log.exception` to preserve the stack trace and provides a helpful explanation for common mismatches. The grammar has been corrected per previous review feedback.

961-985: LGTM: Efficient cached memmap creation with proper header parsing.

The static method with LRU cache correctly:

- Limits open file handles using `LRU_CACHE_SIZE`
- Manually parses .npy headers to extract dtype, shape, fortran_order, and offset
- Supports .npy format versions 1, 2, and 3
- Creates read-only memmaps for safe concurrent access
The caching strategy is essential for performance when repeatedly accessing the same data files during training.
Actionable comments posted: 1
♻️ Duplicate comments (3)
deepmd/env.py (1)
52-69: Harden LRU sizing: guard RLIM_INFINITY, clamp to sane bounds, broaden except.

Unbounded or "infinite" RLIMIT_NOFILE can yield an excessive cache size. Also, any failure should fall back safely. Clamp the final value and ensure it's an int. This also addresses earlier feedback.
-# Dynamic calculation of cache size
-_default_lru_cache_size = 512
-LRU_CACHE_SIZE = _default_lru_cache_size
-
-if platform.system() != "Windows":
-    try:
-        import resource
-
-        soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
-        safe_buffer = 128
-        if soft_limit > safe_buffer + _default_lru_cache_size:
-            LRU_CACHE_SIZE = soft_limit - safe_buffer
-        else:
-            LRU_CACHE_SIZE = soft_limit // 2
-    except ImportError:
-        LRU_CACHE_SIZE = _default_lru_cache_size
-else:
-    LRU_CACHE_SIZE = _default_lru_cache_size
+# Dynamic calculation of cache size
+_default_lru_cache_size = 512
+_max_lru_cache_size = 4096  # upper bound to avoid excessive open FDs
+LRU_CACHE_SIZE = _default_lru_cache_size
+
+if platform.system() != "Windows":
+    try:
+        import resource
+
+        soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
+        rlim_inf = getattr(resource, "RLIM_INFINITY", -1)
+        safe_buffer = 128
+        if soft_limit in (-1, rlim_inf) or soft_limit <= 0:
+            LRU_CACHE_SIZE = _default_lru_cache_size
+        else:
+            target = soft_limit - safe_buffer if soft_limit > safe_buffer else max(soft_limit // 2, 128)
+            # Clamp and cast
+            LRU_CACHE_SIZE = int(max(128, min(target, _max_lru_cache_size)))
+    except Exception:
+        LRU_CACHE_SIZE = _default_lru_cache_size
+else:
+    LRU_CACHE_SIZE = _default_lru_cache_size

deepmd/utils/data.py (2)
861-865: Avoid loading coord.npy per call; infer single-frame from the memmap shape.

Calling _get_nframes(set_dir) defeats the purpose of memmap and adds I/O. Inspect the loaded array's ndim instead (handle ndim == 0/1).

-        # corner case: single frame
-        if self._get_nframes(set_dir) == 1:
-            mmap_obj = mmap_obj[None, ...]
+        # Normalize to have a frame dimension without touching other files
+        while mmap_obj.ndim < 2:
+            mmap_obj = mmap_obj[None, ...]
504-512: Fix: memmap assumes an OS path; add a robust fallback for non-OS DPPath (e.g., HDF5).

Path(str(path)).stat() will fail for DPH5Path (e.g., "file.h5#dataset"). Fall back to DPPath.load_numpy() in such cases to preserve compatibility.

-    def _get_memmap(self, path: DPPath) -> np.memmap:
-        """Get or create a memory-mapped object for a given npy file.
-        Uses file path and modification time as cache keys to detect file changes
-        and invalidate cache when files are modified.
-        """
-        abs_path = Path(str(path)).absolute()
-        file_mtime = abs_path.stat().st_mtime
-        return self._create_memmap(str(abs_path), str(file_mtime))
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
+        """Get or create a memory-mapped object for a given npy file.
+        Falls back to loading into memory if the path is not a regular OS file.
+        """
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                file_mtime = abs_path.stat().st_mtime
+                return self._create_memmap(str(abs_path), str(file_mtime))
+        except Exception:
+            # Non-OS-backed path or resolution/stat failed
+            pass
+        # Fallback: DPPath-aware loading (e.g., HDF5 datasets)
+        return path.load_numpy()
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)
394-408: Optional: cap thread pool size to avoid oversubscription.

If non_reduced_keys can grow, cap workers to the CPU count or a small upper bound. If keys are always few, ignore.

-        if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+        if non_reduced_keys:
+            import os
+            max_workers = min(len(non_reduced_keys), max(4, (os.cpu_count() or 8)))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:
969-993: Silence ruff ARG004: the unused _create_memmap mtime parameter (used only for the cache key).

Rename the parameter to an underscore-prefixed name to indicate intentional non-use while still varying the LRU key.

-    def _create_memmap(path_str: str, mtime_str: str) -> np.memmap:
+    def _create_memmap(path_str: str, _mtime_token: str) -> np.memmap:

Optionally add a brief comment above: "_mtime_token is used only to vary the lru_cache key; not read."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deepmd/env.py (3 hunks)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/env.pydeepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
380-380: Avoid specifying long messages outside the exception class
(TRY003)
432-434: Avoid specifying long messages outside the exception class
(TRY003)
847-847: Avoid specifying long messages outside the exception class
(TRY003)
891-893: Abstract raise to an inner function
(TRY301)
891-893: Avoid specifying long messages outside the exception class
(TRY003)
931-931: Avoid specifying long messages outside the exception class
(TRY003)
969-969: Unused static method argument: mtime_str
(ARG004)
987-987: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test C++ (true)
- GitHub Check: Test C++ (false)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
deepmd/utils/data.py (1)
249-269: Routing to get_single_frame looks good.

Delegating both backends to the same fast path keeps behavior consistent.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## devel #5023 +/- ##
==========================================
- Coverage 84.23% 84.18% -0.05%
==========================================
Files 709 709
Lines 70074 70214 +140
Branches 3618 3619 +1
==========================================
+ Hits 59028 59112 +84
- Misses 9880 9935 +55
- Partials 1166 1167 +1

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)
80-80: Consider shortening the inline error message.

The error message is quite long, and Ruff suggests avoiding long messages outside the exception class.
Apply this diff to address the style guideline:
-    error_format_msg = "if one of the set is of mixed_type format, then all of the sets in this system should be of mixed_type format!"
+    error_format_msg = (
+        "if one of the set is of mixed_type format, then all of the sets "
+        "in this system should be of mixed_type format!"
+    )

As per coding guidelines (ruff TRY003).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run `ruff check .` and `ruff format .` before committing changes to Python code
Files:
deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
- DPPath (28-158), is_file (111-112), is_file (245-247), is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py
380-380: Avoid specifying long messages outside the exception class
(TRY003)
438-440: Avoid specifying long messages outside the exception class
(TRY003)
864-864: Avoid specifying long messages outside the exception class
(TRY003)
908-910: Abstract raise to an inner function
(TRY301)
908-910: Avoid specifying long messages outside the exception class
(TRY003)
948-948: Avoid specifying long messages outside the exception class
(TRY003)
986-986: Unused static method argument: mtime_str
(ARG004)
1004-1004: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Analyze (python)
- GitHub Check: Test C++ (true)
- GitHub Check: Test C++ (false)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build wheels for cp311-macosx_x86_64
🔇 Additional comments (6)
deepmd/utils/data.py (6)
5-24: LGTM: Imports are appropriate for the performance optimization.

All new imports are properly utilized for the memory-mapped, parallelized single-frame loading implementation.
249-268: LGTM: Clean routing to the new parallelized single-frame loader.

Both PyTorch and Paddle backends now share the optimized `get_single_frame` implementation, which provides consistency and improved performance.
377-472: LGTM: Efficient single-frame orchestration with proper parallelism.

The implementation correctly:

- Maps the global frame index to a set directory and local index using `bisect_right`
- Validates bounds upfront
- Parallelizes non-reduced key loading with ThreadPoolExecutor
- Uses `log.exception` for proper traceback preservation
- Handles mixed_type efficiently with memmap
510-517: LGTM: Memmap wrapper with modification-time-based cache invalidation.

The implementation correctly invalidates the LRU cache when files are modified by including `st_mtime` in the cache key.
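In isolation, that pattern looks like the following sketch (an illustrative helper, not the PR's exact code): the mtime participates in the lru_cache key, so editing a file changes the key and forces a fresh memmap on the next call:

    from functools import lru_cache
    from pathlib import Path

    import numpy as np


    @lru_cache(maxsize=512)
    def _cached_memmap(path_str: str, mtime: float) -> np.ndarray:
        # mtime is never read in the body; it exists only to vary the cache key
        return np.load(path_str, mmap_mode="r")


    def get_memmap(path: Path) -> np.ndarray:
        p = path.resolve()
        return _cached_memmap(str(p), p.stat().st_mtime)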
813-948: LGTM: Robust single-frame data loader with proper edge case handling.

The implementation correctly:

- Handles missing files with appropriate defaults
- Adds a dimension to single-frame files before slicing (lines 879-880)
- Applies type_sel logic consistently with `_load_data`
- Handles the hessian special case
- Uses `log.exception` to preserve stack traces (line 941)

The `set_nframes == 1` check protects against IndexError when accessing `mmap_obj.shape[1]`.
984-1010: LGTM: Well-designed cached memmap factory with proper header parsing.

The `mtime_str` parameter (flagged by Ruff as unused) is intentionally part of the cache key, enabling automatic cache invalidation when files are modified. This is a correct design pattern for `lru_cache`.

The implementation properly:

- Parses .npy headers using NumPy's official format readers
- Supports .npy versions 1, 2, and 3
- Creates read-only memmaps with correct offset, dtype, shape, and order
Accelerate data loading in training phase.
Up to 40x acceleration for small descriptors with lower CPU usage.
Summary by CodeRabbit
New Features
Performance Improvements
Bug Fixes
Tests