Skip to content

Conversation

crisely09
Copy link

We would like to use Croissant recordsets to read FHIR (nested JSON Lines), wildly used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.

@crisely09 crisely09 requested a review from a team as a code owner May 28, 2025 09:54
Copy link

github-actions bot commented May 28, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ccl-core ccl-core self-requested a review May 28, 2025 10:01
@ccl-core
Copy link
Contributor

Hi @crisely09 , thank you for your contribution!
To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

@crisely09
Copy link
Author

Hi @crisely09 , thank you for your contribution! To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

I have added the example metadata into the datasets folder. I am not sure how to generate the output folder.
Also, I don't know what is the format error I am getting in the read.py file.
Thanks a lot for your help.

@ccl-core
Copy link
Contributor

Thanks! I'll review later on today.
You can generate the output records using this script: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/mlcroissant/scripts/load.py

@crisely09
Copy link
Author

I have noticed that the way the json is loaded is suuuuper slow, I am trying something to accelerate the Reading of Json files when jsonPath is used.

@crisely09
Copy link
Author

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

for field in fields:
json_path = field.source.extract.json_path
if json_path is None:
jp = field.source.extract.json_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favor of keeping the old variable names, for readibility (same below)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced jp with json_path.

"""Parse JSON operation."""

import json
import jmespath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know these libraries are not that big, but I was wondering whether we should rather lazily load them?

@ccl-core
Copy link
Contributor

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

Yeah, mypy is annoying :S So the logs point to:
mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment]
So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

For the formatting error, have you tried runnin black with the same specifications (--check --line-length 88 --preview etc) as we do in the tests? This should hopefully fix the tests.

"""
# raw JSON fallback: one‐cell DataFrame
fh.seek(0)
content = json.load(fh)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes total sense.

@ccl-core
Copy link
Contributor

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

@crisely09
Copy link
Author

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

Yeah, mypy is annoying :S So the logs point to: mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment] So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

I went back to the logs, and the errors seem to be related to files I haven't modified, base_node.py for instance.

@crisely09
Copy link
Author

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look.

@crisely09
Copy link
Author

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

@ccl-core
Copy link
Contributor

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace!

@ccl-core
Copy link
Contributor

Hi @crisely09 , the mypy errors were due to a new version of mypy, and were unrelated to your changes, as you already pointed out (the CI was failing since a few weeks anyways https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) )
I sent #890 that should hopefully fix the issue.


# Load entire JSON file (could be a list or a single dict).
raw = fh.read()
data = orjson.loads(raw)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see here an example of how to lazily load a library: 4fbd358

@ccl-core
Copy link
Contributor

ccl-core commented Jun 12, 2025

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

@crisely09
Copy link
Author

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

Thank you for the review!!
I will have a look at the parsing logic, to keep the expected behavior for this type of files.

@crisely09
Copy link
Author

Hi @ccl-core ,
I managed to fix things to make the test pass. Could you have another look, please?

There is something I would like to discuss with you, I made a change in the way the bounding boxes are loaded, in a way that one RecordSet contains all bounding boxes from the same contentUrl, this makes more sense to me than having one RecordSet per bounding box. If we can have a chat on zoom/meet/teams it would be much better.
Thank you!

@ccl-core ccl-core requested a review from marcenacp July 25, 2025 10:02
Copy link
Contributor

@marcenacp marcenacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! I have a few clarification questions about why we need to introduce new dependencies and how we could simplify the parsers/readers.

# simple JSONPath → JMESPath
jm = json_path.lstrip("$.") # drop the "$."
expr = jmespath.compile(jm)
engine = "jmespath"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OoC, why do we need both JSONPath and JMESPath? Why not use JSONPath everywhere?

EncodingFormat.FHIR,
):
# JSON_LINES and FHIR do the same thing
reader = JsonlReader(self.fields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing our own readers/parsers looks scary because it can quickly become a rabbit hole of bugs! So I have a few questions:

  • Is there an existing reader for FHIR files or do we really have to do it manually? Is fhir.resources an option?
  • Can we handle it as nested JSON, e.g. using pd.json_normalize?

"""
# Load entire JSON file (could be a list or a single dict).
raw = fh.read()
data = orjson.loads(raw) if orjson else json.loads(raw)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To understand: what cases does json.loads not handle that orjson handles?

# If the input is not a string or a list, or if it's a list with
# an invalid length (e.g., 5), we let _parse_one raise the
# appropriate, specific error.
if isinstance(value, list) and len(value) != 4:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you have a test case for those errors?

"""Module to manage "bounding boxes" annotations on images."""

from typing import Any
from typing import Any, List, Union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please split the PR into 2 PRs:

  • One for FHIR files (this one)
  • One for bounding boxes

?


def parse(value: Any) -> list[float]:
"""Parses a value to a machine-readable bounding box.
def _parse_one(value: Union[str, List[Any]]) -> List[float]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str | list[Any]

"outputs": [],
"source": [
"image_id, bbox = record[\"images_with_bounding_box/image_id\"], record[\"images_with_bounding_box/bbox\"]\n",
"image_id, bbox = (\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rollback the changes on the ipynb if they are noop?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants