…I/O handler, and fields

Key differences from RAMSES:
- Stream-based binary I/O (no Fortran record markers)
- info.txt has nfile first (not ncpu)
- File naming: amr.NNNNN, hydro.NNNNN (not amr_NNNNN.outNNNNN)
- Separate particle files per type (part, star, sink, tree, trac)
- Float32 primitive variables in hydro output
- Compact AMR format (ckey + refined per grid)
- Multi-integer Hilbert keys with different state diagrams

Co-authored-by: James11222 <49885659+James11222@users.noreply.github.com>
…parameter Co-authored-by: James11222 <49885659+James11222@users.noreply.github.com>
…d-reader Add mini-ramses frontend reader
…hydro, sinks, and stars
Pull request overview
This PR introduces a new mini_ramses frontend in yt, including dataset/index/IO plumbing, field definitions, and a synthetic-output test suite, and registers the frontend in yt.frontends.
Changes:
- Added MiniRAMSESDataset/MiniRAMSESIndex/domain subset + file sanitizer for mini-ramses outputs.
- Added an IO handler for fluid (hydro/gravity) and particle stream-binary reads, plus basic field definitions.
- Added pytest-based synthetic-output tests and registered mini_ramses in the frontend package list.
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| yt/frontends/mini_ramses/data_structures.py | Core dataset/index/octree construction, file validation, and fluid data extraction logic. |
| yt/frontends/mini_ramses/io.py | Fluid selection and particle IO routines for mini-ramses stream binary files. |
| yt/frontends/mini_ramses/fields.py | Declares known fluid/particle fields and a derived temperature field. |
| yt/frontends/mini_ramses/definitions.py | Filename patterns/regex and field alias groupings for mini-ramses. |
| yt/frontends/mini_ramses/api.py | Exposes the frontend's public API symbols. |
| yt/frontends/mini_ramses/tests/test_mini_ramses.py | Synthetic mini-ramses output writer and basic dataset/sanitizer tests. |
| yt/frontends/mini_ramses/tests/__init__.py | Test package marker for the new frontend. |
| yt/frontends/__init__.py | Registers mini_ramses in the frontend namespace. |
```python
import os
import struct
import tempfile
```
tempfile is imported but never used, which will fail Ruff (F401) under the repo's lint configuration. Please remove the unused import (or use it if intended).
Suggested change:
```diff
-import tempfile
```
```python
def fill(self, fn, fields, selector, field_list):
    """Read and fill field data from a mini-ramses binary file.

    Parameters
    ----------
```
This docstring block includes blank lines with trailing whitespace (e.g. immediately after the opening docstring). Ruff/pycodestyle will flag this as W293/W291. Please strip trailing whitespace throughout this file (running the repo formatter/linter should handle it).
```python
    self, base_region, domain, ds, over_refine_factor=1, num_ghost_zones=0
):
    super().__init__(
        base_region, domain, ds, over_refine_factor, num_ghost_zones
    )
```
MiniRAMSESDomainSubset.__init__ passes over_refine_factor positionally into OctreeSubset(..., num_zones=...). For RAMSES-style octs this should be num_zones=2 (2×2×2 cells per oct); passing 1 sets num_zones=1 and can break reshaping/ghost-zone logic. Consider renaming the parameter to num_zones (default 2) and/or passing by keyword to avoid this mismatch.
```python
# Separate fields by type
ftypes = {f[0] for f in fields}

for chunk in chunks:
```
There are blank lines containing whitespace in this block (e.g. after ftypes = ...). Ruff/pycodestyle will flag this as W293/W291. Please strip trailing whitespace (running the configured formatter/linter should fix it automatically).
```python
self.base_region,
self._base_domain,
self.ds,
self._num_ghost_zones,
```
retrieve_ghost_zones passes self._num_ghost_zones as the 4th positional argument to MiniRAMSESDomainSubset(...), which maps to num_zones in OctreeSubset.__init__. This will incorrectly set num_zones to the previous ghost-zone count. Pass the existing self._num_zones/num_zones=2 and set num_ghost_zones=ngz separately.
Suggested change:
```diff
-self._num_ghost_zones,
+self._num_zones,
```
```python
self.min_level = params.get("levelmin", 1)
self.max_level = params.get("levelmax", 1)
self.domain_dimensions = np.ones(3, dtype="int64") * 2 ** (
    self.min_level
```
MiniRAMSESDataset sets min_level/max_level directly from info.txt and then uses domain_dimensions = 2**min_level. In the RAMSES frontend, ds.min_level is stored as levelmin - 1 and ds.max_level is stored relative to ds.min_level to keep octree level math consistent. Aligning to that convention will make ds.index.max_level, level stats, and cell-size calculations consistent across octree frontends.
Suggested change:
```diff
-self.min_level = params.get("levelmin", 1)
-self.max_level = params.get("levelmax", 1)
-self.domain_dimensions = np.ones(3, dtype="int64") * 2 ** (
-    self.min_level
+levelmin = params.get("levelmin", 1)
+levelmax = params.get("levelmax", 1)
+# Align level bookkeeping with the RAMSES frontend convention:
+# - store min_level as levelmin - 1
+# - store max_level relative to min_level
+self.min_level = levelmin - 1
+self.max_level = levelmax - self.min_level
+# The coarse grid domain has 2**levelmin cells per dimension.
+# Since levelmin = self.min_level + 1, use (self.min_level + 1) here
+# to keep domain_dimensions consistent with the RAMSES frontend.
+self.domain_dimensions = np.ones(3, dtype="int64") * 2 ** (
+    self.min_level + 1
```
```python
def test_load_dataset(self, tmp_path):
    outdir = _create_mini_ramses_output(str(tmp_path))
    ds = MiniRAMSESDataset(outdir)
    assert ds is not None
    assert ds.dimensionality == 3
```
The tests validate dataset detection and basic attributes, but they never actually access a fluid/particle field through yt data objects to exercise the IO handler + selection plumbing end-to-end. Adding at least one assertion that ds.all_data()[('mini-ramses','Density')] (and a particle field like ('io','particle_mass')) can be read and returns an array of the expected size would significantly improve coverage of the new frontend.
```python
twotondim = 2**ndim

with open(domain.amr_fn, "rb") as f:
    ndim_file = struct.unpack("i", f.read(4))[0]
```
ndim_file is unpacked from the AMR header but never used. This will trigger Ruff F841, and also misses a cheap sanity check (e.g. verify it matches ds.dimensionality). Either validate it or unpack into _ to explicitly ignore it.
Suggested change:
```diff
 ndim_file = struct.unpack("i", f.read(4))[0]
+if ndim_file != ndim:
+    mylog.warning(
+        "Dimensionality mismatch in AMR header for domain %s: "
+        "file reports %dD, dataset expects %dD.",
+        domain.domain_id,
+        ndim_file,
+        ndim,
+    )
```
```python
)

# Read refinement map (twotondim int32 values)
refined = np.array(
```
refined is read for every grid but never used, which will fail Ruff (F841). If refinement info is intentionally ignored, unpack into _ (or delete the variable) to satisfy lint. Otherwise, consider using it for validation (e.g. ensure refinement flags are consistent with the octs present at the next level).
Suggested change:
```diff
-refined = np.array(
+_refined = np.array(
```
```python
for icell in range(twotondim):
    raw_data[fname][cell_offset + icell] = oct_data[fidx, icell]
```
The nested Python loops copying oct_data into raw_data (for fname... / for icell...) will be very slow for large outputs. This can be vectorized by slice-assigning oct_data[fidx, :] into raw_data[fname][cell_offset:cell_offset + twotondim], and ideally you should avoid reading/copying data for fields that weren't requested by the caller.
Suggested change:
```diff
-for icell in range(twotondim):
-    raw_data[fname][cell_offset + icell] = oct_data[fidx, icell]
+# Vectorized copy: assign all cells for this field in one slice
+raw_data[fname][cell_offset:cell_offset + twotondim] = oct_data[fidx, :]
```
cphyc
left a comment
This is great! I'm on holidays but ping me if I haven't reviewed this come early March, I personally really want this included and shipped with the next release of yt.
Also, do you have a small-ish dataset (<1GiB) you could share so that we could host it on https://yt-project.org/data/?
ya definitely, I can certainly get a nice small sample dataset to be hosted for you in the next week or so.
chrishavlin
left a comment
awesome! a couple of very quick notes on the new test file you added:
- super helpful to have the test itself mock up a dataset to use in the test. thank you!
- you'll want to exclude this test from the old nose testing framework that yt still uses for some of its tests: update the `nose_ignores` file at the yt root directory as well as `tests/tests.yaml` (add the filename to the end of `tests.yaml` where it has a list of files to ignore)
cphyc
left a comment
This is a great start!
However, I think there are a few pieces that would benefit from refactorisation so that the reader is more future-proof and flexible.
In addition, you're doing many small reads from the file + struct.unpack calls, which could (and probably should) be replaced by np.fromfile reads for performance.
One final decision is related to https://github.com/yt-project/yt/pull/5380/changes#r2864451870: do you want to have one octree per domain, or one global octree (as you currently have)?
The former option is more efficient, the second is easier and avoids a few minor plotting artefacts.
```python
self.amr_fn = os.path.join(basename, f"amr.{icpu_str}")
self.hydro_fn = os.path.join(basename, f"hydro.{icpu_str}")
self.grav_fn = os.path.join(basename, f"grav.{icpu_str}")
```
This specific pattern makes it slightly more complicated to add support for rt later.
In the RAMSES dataset, I've used a "FieldFileHandler" (see https://github.com/yt-project/yt/blob/main/yt/frontends/ramses/field_handlers.py) that is specialised for hydro, grav and rt files.
This allows us to handle each file type separately and to factorize code.
```python
def _read_amr_header(self):
    """Read the AMR header from the stream-based binary file."""
    if not os.path.exists(self.amr_fn):
```
Is this case really useful? As in, can a dataset be read without any AMR file?
```python
self._level_count = np.zeros(nlevels, dtype="int64")
for ilevel in range(levelmin - 1, nlevelmax):
    idx = ilevel - (self.ds.min_level - 1)
    if 0 <= idx < nlevels:
```
Can this ever be false if levelmin and nlevelmax are properly set for the dataset? This condition seems like a tautology.
```python
self._level_offsets = np.zeros(nlevels, dtype="int64")
cumsum = 0
for i in range(nlevels):
    self._level_offsets[i] = cumsum
    cumsum += self._level_count[i]
```
Suggested change:
```diff
-self._level_offsets = np.zeros(nlevels, dtype="int64")
-cumsum = 0
-for i in range(nlevels):
-    self._level_offsets[i] = cumsum
-    cumsum += self._level_count[i]
+self._level_offsets = self._level_count.cumsum() - self._level_count
```
```python
def _read_hydro_header(self):
    """Read the hydro header to get nvar."""
    if not os.path.exists(self.hydro_fn):
        return 0
    with open(self.hydro_fn, "rb") as f:
        ndim = struct.unpack("i", f.read(4))[0]
        nvar = struct.unpack("i", f.read(4))[0]
    return nvar

def _read_grav_header(self):
    """Read the gravity header to get nvar."""
    if not os.path.exists(self.grav_fn):
        return 0
    with open(self.grav_fn, "rb") as f:
        ndim = struct.unpack("i", f.read(4))[0]
        nvar = struct.unpack("i", f.read(4))[0]
    return nvar
```
Note: If you move the handling of hydro/grav to a class of its own, you'll be able to prevent the repetition here
```python
"particle_position_x": "pos_x",
"particle_position_y": "pos_y",
"particle_position_z": "pos_z",
"particle_velocity_x": "vel_x",
"particle_velocity_y": "vel_y",
"particle_velocity_z": "vel_z",
"particle_mass": "mass",
"particle_refinement_level": "level",
"particle_identity": "birth_id",
"particle_metallicity": "metallicity",
"particle_birth_time": "birth_date",
```
This definition (and the other mappings) belong in definitions.py
```python
    return [], {}


def _detect_optional_particle_fields(header_fn, prefix):
```
This is very similar to what you already used in _detect_particle_fields in data_structure.py. Could you refactor?
```python
# Select the requested cells
for ftype, fname in fields:
    if fname in all_data:
```
I would expect that _read_all_cell_data has the promise to read everything, so it shouldn't be necessary to check here that fname is in all_data.
```python
for igrid in range(ncache):
    # Each oct: nvar fields * twotondim cells * 4 bytes
    # Layout is (nvar, twotondim) - field-major
    oct_data = np.frombuffer(
        f.read(4 * twotondim * f_nvar),
        dtype="<f4"
    ).reshape(f_nvar, twotondim)

    # Store each field's data for this oct's cells
    for fname, fidx in field_indices.items():
        for icell in range(twotondim):
            raw_data[fname][cell_offset + icell] = oct_data[fidx, icell]

    cell_offset += twotondim
```
I'd suggest reading everything in one go.
Note that the code below is only valid if `field_indices` is sorted by values (such that the field names are in the same order as they appear on file).
```python
# Since we're reading everything, we can do it in one pass:
oct_data_t = np.dtype([
    (fname, "<f4", (twotondim,))
    for fname in field_indices.keys()
])
oct_data = np.fromfile(f, oct_data_t, ncache)
# Store each field's data for this oct's cells
for fname in field_indices:
    raw_data[fname] = oct_data[fname].flatten()
```
```python
root_nodes = sum(int(d.level_count[0]) for d in self.domains)
mylog.info("Allocating %s octs", domain_counts.sum())

self.oct_handler = RAMSESOctreeContainer(
```
RAMSESOctreeContainer is a specification for RAMSES so that it can handle missing octs at levelmin. However, here, since you're adding everything into a single oct, you can probably use the (slightly more efficient) OctreeContainer.
Alternatively, you can also use the same strategy as RAMSES and have each domain have its own octree, in which case you'd need to keep RAMSESOctreeContainer. The main advantage of such an approach is that, when you iterate over chunks (each of the subfiles), you only need to build the octree of that file and not the others. That will simplify parallelisation greatly (in a distant future!). Otherwise, every parallel task would need to read the entirety of the AMR files, which duplicates efforts and kills performance.
The main disadvantage is that operations that require interaction with the oct (such as particle deposition) will only have access to the oct of their domain, so you'll have small artefacts at the boundary between domains (I think this is what we're seeing e.g. in #1573).
PR Summary
This pull request includes a new, fully functional frontend for mini-ramses. With the assistance of GitHub Copilot, I was able to learn from Corentin's valuable insights and get a tested, fully functional (and, I think, just as fast as RAMSES) frontend working for mini-ramses. I tested on cosmological and non-cosmological simulations, and it appears to work identically to the RAMSES frontend. Currently it reads all particle data (dark matter, star, sink) and AMR data (grav, hydro). I have not yet ensured compatibility with MHD or RT mini-ramses simulations.
Example plots made with fixed resolution buffers and particle data showcase it working in action.
Fixes Issue: #5215
PR Checklist