-
Notifications
You must be signed in to change notification settings - Fork 89
Add EncodingFormat for FHIR files #883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Hi @crisely09 , thank you for your contribution! |
I have added the example metadata into the datasets folder. I am not sure how to generate the output folder. |
Thanks! I'll review later on today. |
I have noticed that the way the json is loaded is suuuuper slow, I am trying something to accelerate the Reading of Json files when jsonPath is used. |
OK, I have fixed most of the issues, I really don't know how to fix the MyPy and Python format flows. |
for field in fields: | ||
json_path = field.source.extract.json_path | ||
if json_path is None: | ||
jp = field.source.extract.json_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of keeping the old variable names, for readibility (same below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced jp
with json_path
.
"""Parse JSON operation.""" | ||
|
||
import json | ||
import jmespath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know these libraries are not that big, but I was wondering whether we should rather lazily load them?
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
Yeah, mypy is annoying :S So the logs point to: For the formatting error, have you tried runnin black with the same specifications ( |
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
""" | ||
# raw JSON fallback: one‐cell DataFrame | ||
fh.seek(0) | ||
content = json.load(fh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it might make sense to use orjson.loads here as well? Wouldn't it maximise performance and be more consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes total sense.
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Show resolved
Hide resolved
Thank you @crisely09 for the PR! I like this approach, the refactoring of the JSON parsing logic into the two classes makes the codebase cleaner and more modular. And having support for FHIR-formatted data is great! I left a few comments, let me know if you have further problems with the tests. I'll be OOO next week, but maybe @marcenacp or @benjelloun can unblock you with the needed approvals if needed? |
I went back to the logs, and the errors seem to be related to files I haven't modified, |
Thanks a lot @ccl-core for the careful review ! I think I have addressed all your comments, feel free to have another look. |
Hello @ccl-core, I had to fix a few things, but some tests are still failing. I am not sure I am causing this to fail, could you have a look? |
Hi @crisely09 , sorry, I was OOO last week :) Let me try to see if I can reproduce the mypy errors in my workspace! |
Hi @crisely09 , the mypy errors were due to a new version of mypy, and were unrelated to your changes, as you already pointed out (the CI was failing since a few weeks anyways https://github.com/mlcommons/croissant/actions/workflows/ci.yml :) ) |
python/mlcroissant/mlcroissant/_src/operation_graph/operations/parse_json.py
Outdated
Show resolved
Hide resolved
|
||
# Load entire JSON file (could be a list or a single dict). | ||
raw = fh.read() | ||
data = orjson.loads(raw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see here an example of how to lazily load a library: 4fbd358
I believe the mypy tests should be fixed now. The failures in the notebook tests probably stem from the refactored JSON parsing logic. |
Thank you for the review!! |
Hi @ccl-core , There is something I would like to discuss with you, I made a change in the way the bounding boxes are loaded, in a way that one RecordSet contains all bounding boxes from the same contentUrl, this makes more sense to me than having one RecordSet per bounding box. If we can have a chat on zoom/meet/teams it would be much better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution! I have a few clarification questions about why we need to introduce new dependencies and how we could simplify the parsers/readers.
# simple JSONPath → JMESPath | ||
jm = json_path.lstrip("$.") # drop the "$." | ||
expr = jmespath.compile(jm) | ||
engine = "jmespath" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OoC, why do we need both JSONPath and JMESPath? Why not use JSONPath everywhere?
EncodingFormat.FHIR, | ||
): | ||
# JSON_LINES and FHIR do the same thing | ||
reader = JsonlReader(self.fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing our own readers/parsers looks scary because it can quickly become a rabbit hole of bugs! So I have a few questions:
- Is there an existing reader for FHIR files or do we really have to do it manually? Is fhir.resources an option?
- Can we handle it as nested JSON, e.g. using
pd.json_normalize
?
""" | ||
# Load entire JSON file (could be a list or a single dict). | ||
raw = fh.read() | ||
data = orjson.loads(raw) if orjson else json.loads(raw) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To understand: what cases does json.loads not handle that orjson handles?
# If the input is not a string or a list, or if it's a list with | ||
# an invalid length (e.g., 5), we let _parse_one raise the | ||
# appropriate, specific error. | ||
if isinstance(value, list) and len(value) != 4: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you have a test case for those errors?
"""Module to manage "bounding boxes" annotations on images.""" | ||
|
||
from typing import Any | ||
from typing import Any, List, Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please split the PR into 2 PRs:
- One for FHIR files (this one)
- One for bounding boxes
?
|
||
def parse(value: Any) -> list[float]: | ||
"""Parses a value to a machine-readable bounding box. | ||
def _parse_one(value: Union[str, List[Any]]) -> List[float]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str | list[Any]
"outputs": [], | ||
"source": [ | ||
"image_id, bbox = record[\"images_with_bounding_box/image_id\"], record[\"images_with_bounding_box/bbox\"]\n", | ||
"image_id, bbox = (\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please rollback the changes on the ipynb if they are noop?
We would like to use Croissant recordsets to read FHIR (nested JSON Lines), wildly used in the medical sector.
This PR is an "easy" approach to enable the support for FHIR (application/fhir+json) encoding format.