Add EncodingFormat for FHIR files #883

crisely09 · 2025-05-28T09:54:25Z

We would like to use Croissant recordsets to read FHIR (nested JSON Lines), which is widely used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.

github-actions · 2025-05-28T09:54:38Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ccl-core · 2025-05-28T10:05:48Z

Hi @crisely09 , thank you for your contribution!
To increase our test coverage and enrich the example datasets for Croissant users, would you mind adding an example dataset which uses the new FHIR format to https://github.com/mlcommons/croissant/tree/main/datasets/1.1 ?

crisely09 · 2025-05-28T10:31:39Z

I have added the example metadata into the datasets folder. I am not sure how to generate the output folder.
Also, I don't know what the format error I am getting in the read.py file is.
Thanks a lot for your help.

datasets/1.1/pharmaccess-momcare-fhir/metadata.json

ccl-core · 2025-05-28T10:36:50Z

Thanks! I'll review later on today.
You can generate the output records using this script: https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/mlcroissant/scripts/load.py

crisely09 · 2025-05-28T10:57:57Z

I have noticed that the way the JSON is loaded is super slow. I am trying something to speed up the reading of JSON files when jsonPath is used.

crisely09 · 2025-05-28T13:32:02Z

OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows.

datasets/1.1/pharmaccess-momcare-fhir/metadata.json

ccl-core · 2025-05-29T20:54:09Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

    for field in fields:
-        json_path = field.source.extract.json_path
-        if json_path is None:
+        jp = field.source.extract.json_path


I would be in favor of keeping the old variable names, for readability (same below)

Replaced jp with json_path.

ccl-core · 2025-05-29T21:11:57Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

 """Parse JSON operation."""

+import json
+import jmespath


I know these libraries are not that big, but I was wondering whether we should rather lazily load them?
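
One way to defer such imports, as a minimal sketch: the helper name `lazy_import` is hypothetical and this is not necessarily how mlcroissant handles it. The module is imported on first use and cached afterwards, and `json` stands in for a heavier optional dependency.

```python
import functools
import importlib
from types import ModuleType


@functools.cache
def lazy_import(name: str) -> ModuleType:
    """Imports `name` on first call and caches the module (hypothetical helper)."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        raise ImportError(
            f"`{name}` is required for this operation. Try `pip install {name}`."
        ) from e


def parse_raw(raw: str):
    """Parses a JSON string, importing the parser only when first needed."""
    # The import cost is paid here, on first use, not at module import time.
    json = lazy_import("json")  # stand-in for a heavier optional dependency
    return json.loads(raw)
```

The trade-off is that a missing dependency only surfaces when the code path is actually exercised, so the error message should say what to install.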

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

ccl-core · 2025-05-29T21:52:32Z

Yeah, mypy is annoying :S So the logs point to:
mlcroissant/_src/operation_graph/operations/read.py:137: error: Incompatible types in assignment (expression has type "JsonlReader", variable has type "JsonReader") [assignment]
So it seems like MyPy believes the variable reader is expected to hold an object of type JsonReader -- I guess MyPy infers the type of reader from its first assignment reader = JsonReader(self.fields)? Have you tried to explicitly declare the possible types for reader, like with reader: JsonReader | JsonlReader before the conditional block? I guess another option could be to use a typing.Protocol, but I would give it a try with the first method first...

For the formatting error, have you tried running black with the same specifications (--check --line-length 88 --preview etc) as we do in the tests? This should hopefully fix the tests.

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

ccl-core · 2025-05-29T21:22:14Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+        """
+        # raw JSON fallback: one‐cell DataFrame
+        fh.seek(0)
+        content = json.load(fh)


I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?

Yes, makes total sense.

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

ccl-core · 2025-05-29T21:55:45Z

Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great!

I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed?

crisely09 · 2025-05-30T15:24:53Z

I went back to the logs, and the errors seem to be related to files I haven't modified, base_node.py for instance.

crisely09 · 2025-05-30T15:27:02Z

Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look.

crisely09 · 2025-06-06T09:43:48Z

Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look?

ccl-core · 2025-06-11T08:59:38Z

Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace!

ccl-core · 2025-06-12T15:00:52Z

Hi @crisely09 , the mypy errors were due to a new version of mypy, and were unrelated to your changes, as you already pointed out (the CI had been failing for a few weeks anyway https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) )
I sent #890 that should hopefully fix the issue.

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

ccl-core · 2025-06-12T15:11:51Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+
+        # Load entire JSON file (could be a list or a single dict).
+        raw = fh.read()
+        data = orjson.loads(raw)


You can see here an example of how to lazily load a library: 4fbd358

ccl-core · 2025-06-12T15:17:56Z

I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic.

crisely09 · 2025-06-23T07:18:55Z

Thank you for the review!!
I will have a look at the parsing logic, to keep the expected behavior for this type of files.

crisely09 · 2025-07-23T15:09:30Z

Hi @ccl-core ,
I managed to fix things to make the tests pass. Could you have another look, please?

There is something I would like to discuss with you: I changed the way the bounding boxes are loaded, so that one RecordSet contains all bounding boxes from the same contentUrl. This makes more sense to me than having one RecordSet per bounding box. It would be much better if we could have a chat on Zoom/Meet/Teams.
Thank you!

marcenacp

Thanks for your contribution! I have a few clarification questions about why we need to introduce new dependencies and how we could simplify the parsers/readers.

marcenacp · 2025-07-25T10:03:28Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+                # simple JSONPath → JMESPath
+                jm = json_path.lstrip("$.")  # drop the "$."
+                expr = jmespath.compile(jm)
+                engine = "jmespath"


OoC, why do we need both JSONPath and JMESPath? Why not use JSONPath everywhere?

marcenacp · 2025-07-25T10:04:58Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py

+                    EncodingFormat.FHIR,
+                ):
+                    # JSON_LINES and FHIR do the same thing
+                    reader = JsonlReader(self.fields)


Implementing our own readers/parsers looks scary because it can quickly become a rabbit hole of bugs! So I have a few questions:

Is there an existing reader for FHIR files or do we really have to do it manually? Is fhir.resources an option?

Can we handle it as nested JSON, e.g. using pd.json_normalize?
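
To make the "nested JSON" option concrete, here is a stdlib-only sketch that reads JSON Lines and flattens nested objects into dotted column names, similar in spirit to what `pd.json_normalize` does for nested dicts. The sample records are invented, simplified FHIR-like data, and `flatten` and `read_jsonl` are hypothetical helpers, not part of mlcroissant.

```python
import io
import json
from typing import Any, Iterator


def flatten(obj: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flattens nested dicts into dotted keys; lists are kept as-is."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def read_jsonl(fh: io.TextIOBase) -> Iterator[dict[str, Any]]:
    """Yields one flattened record per non-empty line."""
    for line in fh:
        if line.strip():
            yield flatten(json.loads(line))


# Invented sample data: two simplified, FHIR-like resources.
sample = io.StringIO(
    '{"resourceType": "Patient", "id": "p1", "name": {"family": "Doe"}}\n'
    '{"resourceType": "Patient", "id": "p2", "name": {"family": "Roe"}}\n'
)
records = list(read_jsonl(sample))
```

Real FHIR resources nest lists of objects as well, which this sketch leaves untouched; those would still need a path expression or an explicit explode step.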

marcenacp · 2025-07-25T10:06:19Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

+        """
+        # Load entire JSON file (could be a list or a single dict).
+        raw = fh.read()
+        data = orjson.loads(raw) if orjson else json.loads(raw)


To understand: what cases does json.loads not handle that orjson handles?

marcenacp · 2025-07-25T10:08:53Z

python/mlcroissant/mlcroissant/_src/core/ml/bounding_box.py

+    # If the input is not a string or a list, or if it's a list with
+    # an invalid length (e.g., 5), we let _parse_one raise the
+    # appropriate, specific error.
+    if isinstance(value, list) and len(value) != 4:


Can you add a test case for those errors?

marcenacp · 2025-07-25T10:09:37Z

python/mlcroissant/mlcroissant/_src/core/ml/bounding_box.py

 """Module to manage "bounding boxes" annotations on images."""

-from typing import Any
+from typing import Any, List, Union


Can you please split the PR into 2 PRs?

One for FHIR files (this one)

One for bounding boxes

marcenacp · 2025-07-25T10:10:32Z

python/mlcroissant/mlcroissant/_src/core/ml/bounding_box.py


-def parse(value: Any) -> list[float]:
-    """Parses a value to a machine-readable bounding box.
+def _parse_one(value: Union[str, List[Any]]) -> List[float]:


str | list[Any]

marcenacp · 2025-07-25T10:13:53Z

python/mlcroissant/recipes/bounding-boxes.ipynb

   "outputs": [],
   "source": [
-    "image_id, bbox = record[\"images_with_bounding_box/image_id\"], record[\"images_with_bounding_box/bbox\"]\n",
+    "image_id, bbox = (\n",


Can you please roll back the changes on the ipynb if they are no-ops?

add reading option for fhir

fe626df

crisely09 requested a review from a team as a code owner May 28, 2025 09:54

ccl-core self-requested a review May 28, 2025 10:01

crisely09 added 3 commits May 28, 2025 12:05

little reformatting

4501421

add fhir dataset example

5d82a63

small addition to metadata

1dd393f

ccl-core reviewed May 28, 2025

datasets/1.1/pharmaccess-momcare-fhir/metadata.json

crisely09 added 2 commits May 28, 2025 12:55

added output for serviceRequest loading record-set

92a9c75

simplify a bit the metadata file

cc18426

crisely09 added 6 commits May 28, 2025 14:00

Read JSON files faster

265e93f

bring back previous definition of the parse_json_content

1ee3986

few format fixes

30fde7c

align dataset metadata example

350e9a5

fall back to jsonpath_rw when there is recursive-descent

ed78906

fix flake8

a4dce21

ccl-core reviewed May 29, 2025

datasets/1.1/pharmaccess-momcare-fhir/metadata.json

ccl-core reviewed May 29, 2025

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

ccl-core reviewed May 29, 2025

crisely09 added 2 commits May 30, 2025 16:55

Black format fixes, add tests for classes, other suggested changes

3fb1277

updated output from dataset

bf76353

crisely09 added 3 commits May 30, 2025 17:05

fix isort

3504bf6

fix test expectations

5e5b9b2

fix format

062ab96

crisely09 added 3 commits May 30, 2025 17:31

fix flakes

5c790b0

fix expectation of tests

d0f36f6

if not replaced to if is None

c331ae3

ccl-core reviewed Jun 12, 2025

python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py

ccl-core reviewed Jun 12, 2025

crisely09 added 9 commits July 15, 2025 11:22

read bounding boxes all at once

238bedd

lazy load orjson

469e870

remove imports of orjson

d88f892

fix python format black

a351116

run black again

9b94d70

update bounding_box parsing to pass the test

7f73bc6

trying to include all cases for bounding boxes

741bdfa

fix format and pytype

18895c0

trying to fix format errors

15c49b0

ccl-core requested a review from marcenacp July 25, 2025 10:02

marcenacp reviewed Jul 25, 2025

stefanches7 mentioned this pull request Oct 9, 2025

Adding support for reading medical images #862

Open
