major refactoring about ms2rescore and pyarrow #39
Conversation
Walkthrough
This pull request introduces changes across several configuration files to update dependency specifications.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
environment.yml (1)
13-13: Add newline at end of file
Add a newline character at the end of the file to comply with YAML formatting standards.
- pyarrow>=18.0.0 +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (8)
- README.md (0 hunks)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- quantmsutils/mzml/mzml_statistics.py (2 hunks)
- quantmsutils/quantmsutilsc.py (0 hunks)
- quantmsutils/rescoring/ms2rescore.py (0 hunks)
- recipe/meta.yaml (2 hunks)
- requirements.txt (1 hunks)
💤 Files with no reviewable changes (3)
- quantmsutils/quantmsutilsc.py
- README.md
- quantmsutils/rescoring/ms2rescore.py
🚧 Files skipped from review as they are similar to previous changes (2)
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (3)
requirements.txt (1)
6-6: LGTM: pyarrow version update
Setting pyarrow to >=18.0.0 is appropriate and aligns with the PR objective.
environment.yml (1)
10-13: LGTM: Dependency updates are consistent
The changes align with requirements.txt, maintaining consistency across dependency specifications.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
quantmsutils/mzml/mzml_statistics.py (1)
142-143: LGTM: Improved data handling with pyarrow Table
The switch from RecordBatch to Table is a good improvement as:
- Tables are more flexible and can handle multiple record batches
- This change aligns well with the pyarrow >=18.0.0 requirement
Also applies to: 161-163
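To make the distinction concrete, here is a small, self-contained sketch (toy column names, not taken from the project) of how a Table differs from a single RecordBatch:

```python
import pyarrow as pa

# Two independent record batches that share a schema...
batch1 = pa.RecordBatch.from_pylist([{"scan": "scan=1", "ms_level": 1}])
batch2 = pa.RecordBatch.from_pylist([{"scan": "scan=2", "ms_level": 2}])

# ...can be combined into a single Table without copying row data.
# A RecordBatch on its own cannot represent multiple chunks.
table = pa.Table.from_batches([batch1, batch2])
print(table.num_rows)  # 2

# Table.from_pylist offers the same list-of-dicts construction referenced above.
same_table = pa.Table.from_pylist(
    [{"scan": "scan=1", "ms_level": 1}, {"scan": "scan=2", "ms_level": 2}]
)
print(same_table.schema)
```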
✅ Verification successful
Let me verify the pyarrow Table API usage further by checking the imports and version requirements.
LGTM: Verified pyarrow Table API compatibility
The switch to pa.Table.from_pylist is correctly implemented and compatible with the codebase:
- The required pyarrow version (>=18.0.0) supports this API
- All instances of Table usage are consistent across the file
- The imports are properly defined with the standard pa alias
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify pyarrow Table API compatibility
echo "Checking pyarrow Table API usage..."
ast-grep --pattern 'pa.Table.from_pylist($_,$_)' || echo "No Table.from_pylist usage found"
Length of output: 514
Script:
#!/bin/bash
# Check pyarrow imports and version requirements
echo "Checking pyarrow imports..."
rg "^import pyarrow|^from pyarrow" -A 2
echo -e "\nChecking version requirements..."
fd "requirements" --exec cat {} \;
fd "setup.py" --exec cat {} \;
Length of output: 555
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
environment.yml (2)
14-14: Add newline at end of file
The file is missing a newline character at the end.
- scipy +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 14-14: no new line character at the end of file
(new-line-at-end-of-file)
10-10: Consider adding version constraint for pyopenms
Having no version constraint for pyopenms could lead to compatibility issues in the future. Consider specifying a minimum version to ensure stability.
- - pyopenms
+ - pyopenms>=2.4.0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (2 hunks)
- requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- pyproject.toml
- requirements.txt
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 14-14: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (1)
environment.yml (1)
10-14: Verify the impact of dependency changes
Several dependencies have been removed (ms2rescore, deepLC, psm-utils, protobuf, pygam) and version constraints modified. Please ensure these changes don't break existing functionality.
✅ Verification successful
Let me generate additional verification scripts to check for any indirect usage or references to these packages in configuration or documentation files.
Based on the results from the verification scripts, I can now provide a final response:
Dependencies have been intentionally removed
The removed dependencies (ms2rescore, deepLC, psm-utils, protobuf, pygam) were primarily used in the rescoring functionality which has been removed from the codebase. This is evidenced by the deletion of the entire rescoring module (quantmsutils/rescoring/) and the removal of the ms2rescore command from quantmsutils/quantmsutilsc.py. The pyarrow version upgrade and scipy reintroduction are related to the remaining mzml statistics functionality.
- quantmsutils/rescoring/ms2rescore.py and its directory have been deleted
- quantmsutils/quantmsutilsc.py no longer imports or registers the ms2rescore command
- The remaining code changes focus on pyarrow Table usage in mzml statistics
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for usage of removed packages in the codebase
echo "Checking for usage of removed packages..."
for pkg in "ms2rescore" "deepLC" "psm-utils" "protobuf" "pygam"; do
echo "Searching for $pkg usage..."
rg -l "import.*$pkg|from.*$pkg" || true
done
# Check if pyarrow 18.0.0 is compatible with other dependencies
echo "Checking pyarrow compatibility..."
curl -s https://pypi.org/pypi/pyarrow/18.0.0/json | jq '.info.requires_dist'
Length of output: 1336
Script:
#!/bin/bash
# Check for any references to removed packages in configuration files and documentation
echo "Checking for package references in configs and docs..."
rg -i "ms2rescore|deeplc|psm.?utils|protobuf|pygam" --type yaml --type md --type txt
# Check for any potential function calls or class usage that might be from these packages
echo "Checking for potential usage patterns..."
rg -i "MS2Rescorer|DeepLC|PSMUtils|Proto|GAM" --type python
# Check if there are any recent changes that removed these dependencies
echo "Checking recent changes in Python files..."
git diff HEAD~5 HEAD -- "*.py"
Length of output: 21387
🧰 Tools
🪛 yamllint (1.35.1)
[error] 14-14: no new line character at the end of file
(new-line-at-end-of-file)
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
.github/workflows/python-app.yml (1)
42-42: Add newline at end of file
Add a newline character at the end of the file to comply with POSIX standards.
poetry run pytest +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 42-42: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
.github/workflows/python-app.yml (1 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
.github/workflows/python-app.yml
[error] 42-42: no new line character at the end of file
(new-line-at-end-of-file)
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
environment.yml (1)
13-13: Add newline at end of file
The file is missing a newline character at the end.
Apply this diff:
- scipy +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (2 hunks)
- requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- requirements.txt
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (1)
environment.yml (1)
12-12: Confirm the need for pyarrow >= 18.0.0
The PR title suggests forcing pyarrow 18.0.0, but there's no context provided for this change. Could you please explain:
- What specific features/fixes in pyarrow 18.0.0 are required?
- Have you tested compatibility with your existing pyarrow usage?
Let's check pyarrow usage:
✅ Verification successful
Pyarrow version requirement is justified for list type support in schema
The code uses pyarrow for parquet file handling with specific features that justify the version requirement:
- Complex schema definitions with nested types (list of float64)
- Batch writing with compression
- Table creation from pylists with schema validation
- Parquet streaming writer functionality
The id_schema specifically uses pa.list_(pa.float64()) for the mz and intensity fields, which requires robust list type support in pyarrow. Version 18.0.0 introduced improvements in handling such complex types and schema validation.
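For illustration only, a schema along these lines (field names assumed here, not copied from the repository) declares mz and intensity as list-of-float columns and validates rows built with Table.from_pylist:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative schema with nested list types for peak arrays.
id_schema = pa.schema([
    pa.field("scan", pa.string()),
    pa.field("ms_level", pa.int32()),
    pa.field("mz", pa.list_(pa.float64()), nullable=True),
    pa.field("intensity", pa.list_(pa.float64()), nullable=True),
])

rows = [
    {"scan": "scan=1", "ms_level": 2, "mz": [100.1, 101.2], "intensity": [200.0, 50.0]},
]

# from_pylist coerces and validates the values against the declared schema.
table = pa.Table.from_pylist(rows, schema=id_schema)
pq.write_table(table, "spectrum_df.parquet", compression="gzip")
```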
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for pyarrow imports and usage
rg -l 'import.*pyarrow' || rg -l 'from.*pyarrow'
# Look for specific pyarrow features that might require 18.0.0
ast-grep --pattern 'import pyarrow as $_'
Length of output: 235
Script:
#!/bin/bash
# Let's examine the actual usage of pyarrow in the file to understand the features being used
cat quantmsutils/mzml/mzml_statistics.py
Length of output: 13839
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
🔇 Additional comments (3)
recipe/meta.yaml (3)
16-16: LGTM!
The noarch: python specification is correct for a pure Python package.
20-20: Justify the stricter Python version constraint
The Python version requirement has been significantly tightened from >=3.8,<3.11 to >=3.9,<3.10. This might unnecessarily restrict users and could impact compatibility with other packages in the ecosystem.
Also applies to: 24-24
✅ Verification successful
Python version constraint is justified by project configuration
The stricter Python version requirement (>=3.9,<3.10) is consistent with the project's configuration in pyproject.toml. The project is explicitly targeting Python 3.9 as shown by the black configuration target-version = ["py39"] and the poetry dependency specification. While the code search didn't reveal heavy usage of Python 3.9-specific features, this appears to be an intentional project-wide decision.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any Python 3.9-specific features being used
echo "Checking for Python 3.9 specific features..."
rg -l "annotations|TypeAlias|list\[|dict\[|tuple\[|union\[" || true
Length of output: 194
Script:
#!/bin/bash
# Check the actual usage of Python 3.9 features in the identified file
echo "Checking Python 3.9 features usage in diann2mztab.py..."
rg -A 5 "annotations|TypeAlias|list\[|dict\[|tuple\[|union\[" quantmsutils/diann/diann2mztab.py
# Check for any setup.py or pyproject.toml for dependency information
echo -e "\nChecking for Python package configuration files..."
fd -e py -e toml "setup|pyproject" --type f --exec cat {}
Length of output: 2403
27-30:
Review dependency removals and version constraints
Several concerns about the dependency changes:
- The removal of core dependencies (numpy, ms2rescore, deepLC, psm-utils, pygam, protobuf) needs justification. These seem like important dependencies for a bioinformatics package.
- The explicit version constraint on pyarrow (>=18.0.0) should be explained.
- While using conda-forge channel explicitly is good, ensure it's consistent with the project's channel priority requirements.
✅ Verification successful
Based on the analysis of the codebase and dependencies, I can now provide the final response:
Dependencies are correctly aligned with actual usage
The removed dependencies (numpy, ms2rescore, deepLC, psm-utils, pygam, protobuf) are not critical issues because:
- Numpy is indirectly included through scipy and pandas
- The codebase shows no direct usage of ms2rescore, deepLC, psm-utils, pygam, or protobuf
- The core functionality relies on pyopenms, pandas, pyarrow, and scipy which are all retained
The pyarrow version constraint (>=18.0.0) is justified by the direct usage in mzml_statistics.py for parquet file handling, where newer features may be required.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining usage of removed packages
echo "Checking for potential broken imports after dependency removals..."
for pkg in "numpy" "ms2rescore" "deepLC" "psm_utils" "pygam" "protobuf"; do
echo "Checking $pkg usage in Python files:"
rg -p "import.*$pkg|from.*$pkg" '*.py' || true
done
# Check if pyarrow 18.0.0 is specifically needed
echo "Checking pyarrow version-specific usage..."
rg -p "pyarrow" '*.py' || true
Length of output: 1888
Script:
#!/bin/bash
# Let's first check if we're in the right directory and find Python files
echo "Listing Python files in the repository:"
fd -e py
# Check the recipe directory structure
echo -e "\nListing files in recipe directory:"
ls -la recipe/
# Check meta.yaml history to understand dependency changes
echo -e "\nChecking meta.yaml changes:"
git diff HEAD^1 recipe/meta.yaml || true
# Look for any build or test requirements that might explain dependencies
echo -e "\nChecking full meta.yaml content:"
cat recipe/meta.yaml
Length of output: 2692
Script:
#!/bin/bash
# Let's check the actual imports in the Python files to verify dependency usage
echo "Checking imports in Python files:"
for file in $(fd -e py); do
echo -e "\nFile: $file"
rg -p "^(import|from) \w+" "$file" || true
done
# Check pyproject.toml for dependencies
echo -e "\nChecking pyproject.toml dependencies:"
cat pyproject.toml || true
# Check for any numpy, protobuf related functionality
echo -e "\nChecking for numpy/array operations:"
rg -p "np\.|numpy|array|ndarray" || true
echo -e "\nChecking for protobuf related code:"
rg -p "proto|message|descriptor" || true
Length of output: 74437
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
environment.yml (1)
Line range hint 13-14: Add newline at end of file
YAML files should end with a newline character.
Add a newline at the end of the file.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (1 hunks)
- requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (4)
requirements.txt (3)
Line range hint 6-7: Dependencies removal needs attention
The verification revealed that scipy is still being used in quantmsutils/features/snr.py for entropy calculation.
#!/bin/bash
# Verify scipy usage
rg -l 'import.*scipy' || rg -l 'from.*scipy'
rg 'scipy\.' -A 3

3-4: Add back version constraint for pyopenms
The version constraint for pyopenms should be maintained as this package provides critical functionality for mass spectrometry data processing.
Apply this change:
-pyopenms
+pyopenms>=2.4.0

5-5: Add version constraint for pyarrow
The PR title indicates forcing pyarrow to version 18.0.0, but the constraint is missing in requirements.txt.
Apply this change:
-pyarrow
+pyarrow==18.0.0

Likely invalid or redundant comment.
environment.yml (1)
10-10:
Add back version constraint for pyopenms
The version constraint for pyopenms should be maintained for consistent mass spectrometry data processing across environments.
Apply this change:
- - pyopenms
+ - pyopenms>=2.4.0
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
environment.yml (2)
12-12: Consider aligning pyarrow version with requirements.txt
The version constraint >=16.1.0 has been added for pyarrow. However, for consistency with other configuration files, consider using a fixed version.
- - pyarrow>=16.1.0
+ - pyarrow==18.0.0
13-13: Add newline at end of file
Add a newline character at the end of the file to follow YAML formatting conventions.
- scipy +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
- tests/test_data/BSA1_F1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/BSA1_F1_ms_info.tsv is excluded by !**/*.tsv
📒 Files selected for processing (7)
- .github/workflows/conda-build.yml (1 hunks)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- quantmsutils/mzml/mzml_statistics.py (1 hunks)
- recipe/meta.yaml (1 hunks)
- requirements.txt (1 hunks)
- tests/test_commands.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- .github/workflows/conda-build.yml
- tests/test_commands.py
- quantmsutils/mzml/mzml_statistics.py
- requirements.txt
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (2)
environment.yml (2)
13-13: 🛠️ Refactor suggestion
Add version constraint for scipy
Since scipy is being reintroduced to provide essential numerical computing functionality (previously from numpy), it's recommended to add a version constraint to ensure compatibility.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
10-10: Verify pyopenms version compatibility
The version constraint for pyopenms has been updated to >=3.2.0. Since this package provides critical functionality for mass spectrometry data processing, let's verify the compatibility.
✅ Verification successful
Let me analyze the pyopenms usage to verify version compatibility.
The first script will help identify all pyopenms imports, while the second one checks for specific class usage. This will help determine if any version-specific APIs are being used.
Let me check the actual usage of these classes to ensure version compatibility.
This will help us understand if any version-specific features or APIs are being used that might require version 3.2.0 or higher.
pyopenms version constraint is appropriate
The codebase uses standard pyopenms APIs that have been stable across versions:
- Core MS data handling: MzMLFile, MSExperiment, SpectrumLookup
- Sequence manipulation: AASequence, FASTAFile
- Modification handling: ModificationsDB
- File I/O: IdXMLFile
All observed usage patterns are basic functionality that has been present in pyopenms for many versions. The version constraint >=3.2.0 is reasonable and doesn't pose compatibility risks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any version-specific API usage that might be affected
rg -A 2 'from pyopenms import|import pyopenms'
# Check git history for version-related changes
git log -p -- '**/requirements.txt' '**/setup.py' '**/pyproject.toml' | grep -A 5 -B 5 'pyopenms'
Length of output: 930
Script:
#!/bin/bash
# Check specific API usage that might be version sensitive
ast-grep --pattern 'from pyopenms import $_'
ast-grep --pattern 'import pyopenms as $_'
# Look for specific pyopenms classes/methods usage
rg -l "MzMLFile|MSExperiment|SpectrumLookup|AASequence|FASTAFile|ModificationsDB|PROTON_MASS_U" quantmsutils/
Length of output: 820
Script:
#!/bin/bash
# Check how these classes are used in the codebase
rg -A 5 "MzMLFile|MSExperiment|SpectrumLookup|AASequence|FASTAFile|ModificationsDB" quantmsutils/
# Check if there are any version-specific method calls or attributes
rg "oms\." quantmsutils/
Length of output: 5744
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
tests/test_commands.py (2)
1-4: Remove unused import
The pyarrow.parquet import is not directly used in the code. While parquet files are being read, it is done through pandas' read_parquet function, which internally handles the parquet operations.
import pandas as pd
from click.testing import CliRunner
import pyarrow as pa
-import pyarrow.parquet as pq
from quantmsutils.quantmsutilsc import cli
🧰 Tools
🪛 Ruff (0.8.0)
4-4:
pyarrow.parquet imported but unused
Remove unused import: pyarrow.parquet
(F401)
187-192: Add error handling for file operations
The parquet file reading operations should include error handling to provide meaningful error messages if the files are missing or corrupted.
- table1 = pd.read_parquet("BSA1_F1_ms_info.parquet", schema=schema)
- table2 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)
- table2 = table2.set_index('SpectrumID')
- table1 = table1.set_index('SpectrumID')
-
- assert table1.compare(table2).empty
+ try:
+     table1 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)
+     table2 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)
+     table2 = table2.set_index('SpectrumID')
+     table1 = table1.set_index('SpectrumID')
+
+     assert table1.compare(table2).empty
+ except FileNotFoundError as e:
+     pytest.fail(f"Failed to read parquet file: {str(e)}")
+ except Exception as e:
+     pytest.fail(f"Error comparing parquet data: {str(e)}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (3)
- tests/test_data/BSA1_F1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/BSA1_F1_spectrum_df.csv is excluded by !**/*.csv
- tests/test_data/BSA1_F1_spectrum_df.parquet is excluded by !**/*.parquet
📒 Files selected for processing (2)
- quantmsutils/mzml/mzml_statistics.py (4 hunks)
- tests/test_commands.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/mzml/mzml_statistics.py
🧰 Additional context used
🪛 Ruff (0.8.0)
tests/test_commands.py
4-4: pyarrow.parquet imported but unused
Remove unused import: pyarrow.parquet
(F401)
🔇 Additional comments (1)
tests/test_commands.py (1)
Line range hint 1-193: Verify complete removal of ms2rescore references
The AI summary indicates that ms2rescore-related tests were removed. Let's verify there are no remaining references to ensure clean removal.
✅ Verification successful
Let me check if these references are only in test data files or if they are part of actual functionality.
Let me check for any ms2rescore-related functionality in the main code:
No active ms2rescore functionality found, only test data references remain
The verification shows that ms2rescore references only exist in test data file names under tests/test_data/ directory. No references were found in the main codebase, imports, function definitions, or CLI commands. These test data files are likely used as fixtures for other tests and don't indicate any remaining functional dependencies on ms2rescore.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining ms2rescore references in the codebase
rg -i "ms2rescore" --type py
Length of output: 709
Script:
#!/bin/bash
# Check if these files exist in the test data directory
fd "ms2rescore" tests/test_data/
# Check for any ms2rescore references in function names or imports
ast-grep --pattern 'import $_ from "ms2rescore"'
ast-grep --pattern 'import "ms2rescore"'
ast-grep --pattern 'def $_ms2rescore$_($$$)'
Length of output: 265
Script:
#!/bin/bash
# Check for any ms2rescore references in the main code (excluding tests)
rg -i "ms2rescore" --type py --glob '!tests/**'
# Check for any configuration or CLI commands related to ms2rescore
ast-grep --pattern 'def $_(self, $$$):
$$$
"ms2rescore"
$$$'
# Look for any command definitions with ms2rescore
rg -i "ms2rescore" -A 5 -B 5 "def.*command"
Length of output: 213
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
tests/test_commands.py (1)
198-198: Strengthen the assertion
Using only shape comparison might miss data differences. Consider using compare() or equals() for a more thorough comparison.
- assert id_table.shape == id_table2.shape
+ assert id_table.compare(id_table2).empty
quantmsutils/mzml/mzml_statistics.py (1)
169-170: Address batch processing TODOs
The TODO comments indicate a need for batch processing of spectrum data. This could impact memory usage with large datasets.
Would you like me to help implement a batched processing solution using PyArrow's RecordBatchBuilder for handling lists?
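As a rough sketch of what such batching could look like with plain PyArrow primitives (field names and batch size are assumptions; PyArrow's Python API exposes RecordBatch.from_pylist rather than a RecordBatchBuilder):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema; field names are illustrative only.
schema = pa.schema([
    pa.field("scan", pa.string()),
    pa.field("mz", pa.list_(pa.float64())),
    pa.field("intensity", pa.list_(pa.float64())),
])

def write_spectra(rows_iter, output_path, batch_size=10_000):
    """Stream spectra to parquet in fixed-size batches instead of buffering everything."""
    writer = None
    buffer = []
    try:
        for row in rows_iter:
            buffer.append(row)
            if len(buffer) >= batch_size:
                if writer is None:  # create the writer lazily, on first flush
                    writer = pq.ParquetWriter(output_path, schema, compression="gzip")
                writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
                buffer = []
        if buffer:  # flush whatever remains at the end
            if writer is None:
                writer = pq.ParquetWriter(output_path, schema, compression="gzip")
            writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
    finally:
        if writer is not None:
            writer.close()
```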
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- quantmsutils/mzml/mzml_statistics.py (11 hunks)
- quantmsutils/utils/constants.py (1 hunks)
- tests/test_commands.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- quantmsutils/utils/constants.py
🔇 Additional comments (5)
tests/test_commands.py (2)
186-187: Use consistent path handling
The parquet file path "BSA1_F1_ms_info.parquet" is using a relative path without the tests/test_data/ prefix, unlike the second file. This could make the test fragile depending on where it's executed from.
- table1 = pd.read_parquet("BSA1_F1_ms_info.parquet", schema=schema)
+ table1 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)

193-194: Use consistent path handling
Similar to the previous issue, the parquet file path "BSA1_F1_spectrum_df.parquet" should use the tests/test_data/ prefix.
- id_table = pd.read_parquet("BSA1_F1_spectrum_df.parquet")
+ id_table = pd.read_parquet("tests/test_data/BSA1_F1_spectrum_df.parquet")

quantmsutils/mzml/mzml_statistics.py (3)
209-226: LGTM! Well-structured schema definitions
Good improvements:
- Using constants for field names improves maintainability
- Appropriate type changes from float to int32 for discrete values
- Proper handling of nullable fields
298-306: LGTM! Secure and consistent implementation
Good practices:
- Safe SQL query construction (see the sketch below)
- Consistent schema with mzML implementation
- Proper use of constants
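For reference, parameterized query construction with sqlite3 looks like the sketch below; the table and column names are hypothetical and not taken from the project's Bruker handling.

```python
import sqlite3

def fetch_frames(db_path: str, ms_level: int):
    """Illustrative only: bind values with placeholders instead of formatting them into SQL."""
    con = sqlite3.connect(db_path)
    try:
        cursor = con.execute(
            "SELECT Id, Time, NumPeaks FROM Frames WHERE MsMsType = ?",
            (ms_level,),  # bound parameter, never string-interpolated
        )
        return cursor.fetchall()
    finally:
        con.close()
```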
Line range hint 331-365: LGTM! Robust path handling implementation
The path resolution logic is well-implemented with:
- Proper case-insensitive extension handling
- Clear error messages
- Safe path resolution
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
quantmsutils/utils/constants.py (1)
7-17: Consider adding detailed documentation for each constant.
While the link to quantms.io is helpful, inline documentation for each constant would improve code readability and maintainability.
-CHARGE = "charge"  # charge of the precursor ion
+# The charge state of the precursor ion in MS2 spectra
+CHARGE = "charge"
-SCAN = "scan"  # Definition in quantms.io https://github.com/bigbio/quantms.io/blob/main/docs/README.adoc#scan
+# Unique identifier for each spectrum (see: https://github.com/bigbio/quantms.io/blob/main/docs/README.adoc#scan)
+SCAN = "scan"
-MS_LEVEL = "ms_level"
+# Mass spectrometry level (1 for MS1, 2 for MS2)
+MS_LEVEL = "ms_level"
-NUM_PEAKS = "num_peaks"  # number of peaks in the spectrum (MS1 or MS2)
+# Number of peaks in the spectrum (MS1 or MS2)
+NUM_PEAKS = "num_peaks"
-BASE_PEAK_INTENSITY = "base_peak_intensity"
+# Intensity of the highest peak in the spectrum
+BASE_PEAK_INTENSITY = "base_peak_intensity"

quantmsutils/mzml/mzml_statistics.py (2)
169-170: Address TODO comments regarding batch processing.
The current implementation loads all PSM parts into memory before writing to parquet. This could be problematic for large datasets.
I can help implement a batch processing solution using PyArrow's streaming capabilities. Would you like me to propose a solution or create a GitHub issue to track this improvement?
78-81: Consider using a dictionary for better readability.
The current list-based approach makes it harder to understand what each element represents.
- [str(scan_id),
-  int(ms_level),
-  mz_array.tolist(),
-  intensity_array.tolist(),]
+ {
+     SCAN: str(scan_id),
+     MS_LEVEL: int(ms_level),
+     MZ_ARRAY: mz_array.tolist(),
+     INTENSITY_ARRAY: intensity_array.tolist(),
+ }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
- tests/test_data/BSA1_F1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/BSA1_F1_spectrum_df.parquet is excluded by !**/*.parquet
📒 Files selected for processing (3)
- quantmsutils/mzml/mzml_statistics.py (11 hunks)
- quantmsutils/utils/constants.py (1 hunks)
- tests/test_commands.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
tests/test_commands.py
3-3: pyarrow imported but unused
Remove unused import: pyarrow
(F401)
🔇 Additional comments (2)
tests/test_commands.py (1)
173-174: Use consistent path handling.
The parquet file path "BSA1_F1_ms_info.parquet" is using a relative path without the tests/test_data/ prefix, unlike the second file.
- table1 = pd.read_parquet("BSA1_F1_ms_info.parquet")
+ table1 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet")

quantmsutils/mzml/mzml_statistics.py (1)
209-217: LGTM! Schema improvements enhance type safety.
The schema now uses more appropriate data types:
- Integer types for MS_LEVEL, CHARGE, and NUM_PEAKS
- Consistent use of constants from the new constants module
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)
Line range hint 4-4: TODO comment needs to be addressed.
The TODO comment indicates missing tests. Unit tests are crucial for ensuring the reliability of the data conversion process.
Would you like me to help create unit tests for this functionality or create a GitHub issue to track this task?
1101-1109: Consider adding error handling for parquet file operations.
The parquet file reading operation could benefit from explicit error handling to gracefully handle corrupted files or missing columns.
- target = pd.read_parquet(file)
+ try:
+     target = pd.read_parquet(file)
+     if not all(col in target.columns for col in [MS_LEVEL, RETENTION_TIME, SCAN, EXPERIMENTAL_MASS_TO_CHARGE]):
+         raise ValueError(f"Missing required columns in {file}")
+ except Exception as e:
+     logger.error(f"Error reading parquet file {file}: {str(e)}")
+     raise
1104-1108: Consider optimizing the data filtering and sorting operations.
The current implementation performs filtering and sorting operations separately. These could be combined for better performance.
- target = target[target[MS_LEVEL] == 2]
- target.reset_index(inplace=True, drop=True)
- target["DIANN-intraID"] = target.index
- group.sort_values(by=RETENTION_TIME, inplace=True)
- target = target[[RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]]
+ target = (target[target[MS_LEVEL] == 2]
+           .reset_index(drop=True)
+           .assign(**{"DIANN-intraID": lambda x: x.index})
+           [[RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]])
+ group = group.sort_values(by=RETENTION_TIME)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
quantmsutils/diann/diann2mztab.py (2 hunks)
🔇 Additional comments (2)
quantmsutils/diann/diann2mztab.py (2)
22-22: LGTM! Good use of constants.
The import of constants from quantmsutils.utils.constants improves code maintainability by centralizing string literals.
1102-1108: LGTM! Good refactoring to use constants.
The changes improve code maintainability by:
- Using
MS_LEVELconstant for filtering MS2 spectra - Using
RETENTION_TIME,SCAN, andEXPERIMENTAL_MASS_TO_CHARGEconstants for column selection
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
tests/test_commands.py (1)
114-131: Add docstrings to test functions
The new test functions lack docstrings. Please add documentation describing:
- Purpose of the test
- Expected behavior
- Any important test data requirements
Example for test_batch_convert_parquet:
def test_batch_convert_parquet():
    """
    Test batch conversion of mzML files to parquet format.
    Verifies that the mzmlstats command successfully processes multiple files
    and generates the expected parquet output for each input file.
    """

Also applies to: 187-203
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (4)
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj2_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj2_ms_info.parquet is excluded by !**/*.parquet
📒 Files selected for processing (1)
tests/test_commands.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
tests/test_commands.py
3-3: pyarrow imported but unused
Remove unused import: pyarrow
(F401)
🔇 Additional comments (1)
tests/test_commands.py (1)
3-3: Remove unused import
The pyarrow import is not used in this file. While parquet operations are performed, pandas handles the pyarrow dependency internally.
-import pyarrow as pa
🧰 Tools
🪛 Ruff (0.8.0)
3-3: pyarrow imported but unused
Remove unused import: pyarrow
(F401)
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
.gitignore (1)
165-168: Consider using a wildcard pattern for test data files
Instead of listing individual files, consider using a pattern to make the .gitignore more maintainable:
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj1.mzML
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj2.mzML
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj1.mzML
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj2.mzML
+/tests/test_data/diann2mztab/RD139_Narrow_UPS1_*_*fmol_inj*.mzML
quantmsutils/diann/diann2mztab.py (1)
1102-1109: LGTM! Consider extracting the column list for better readability
The use of constants improves maintainability. Consider extracting the column list to a variable for better readability:
-target = target[[RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]]
+columns_to_keep = [RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]
+target = target[columns_to_keep]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- .gitignore (1 hunks)
- quantmsutils/diann/diann2mztab.py (2 hunks)
- tests/test_commands.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_commands.py
🔇 Additional comments (1)
quantmsutils/diann/diann2mztab.py (1)
22-23: LGTM! Good use of centralized constants
The addition of constants from quantmsutils.utils.constants improves code maintainability and reduces magic strings.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
quantmsutils/diann/dianncfg.py (1)
Line range hint 41-120: Consider reducing code duplication in modification handling.
The code for handling fixed and variable modifications is nearly identical, differing only in the target list (fix_ptm vs var_ptm). Consider extracting the common logic into a helper function to improve maintainability and reduce duplication.
Here's a suggested refactor:
 def convert_mod(unimod_database, fix_mod: str, var_mod: str) -> Tuple[List, List]:
     pattern = re.compile(r"\((.*?)\)")
     var_ptm = []
     fix_ptm = []

+    def process_modification(mod: str, unimod_database) -> Tuple[str, str, bool]:
+        tag = 0
+        diann_mod = None
+        for modification in unimod_database.modifications:
+            if modification.get_name() == mod.split(" ")[0]:
+                diann_mod = modification.get_name() + "," + str(modification._delta_mono_mass)
+                tag = 1
+                break
+        if tag == 0:
+            print(
+                "Warning: Currently only supported unimod modifications for DIA pipeline. Skipped: "
+                + mod
+            )
+            return None, None, False
+
+        site = re.findall(pattern, " ".join(mod.split(" ")[1:]))[0]
+        if site == "Protein N-term":
+            site = "*n"
+        elif site == "N-term":
+            site = "n"
+
+        is_label = any(label in diann_mod for label in ["TMT", "Label", "iTRAQ", "mTRAQ"])
+        return diann_mod, site, is_label

     if fix_mod:
         for mod in fix_mod.split(","):
+            diann_mod, site, is_label = process_modification(mod, unimod_database)
+            if diann_mod:
+                if is_label:
+                    fix_ptm.append(f"{diann_mod},{site},label")
+                else:
+                    fix_ptm.append(f"{diann_mod},{site}")

     if var_mod:
         for mod in var_mod.split(","):
+            diann_mod, site, is_label = process_modification(mod, unimod_database)
+            if diann_mod:
+                if is_label:
+                    var_ptm.append(f"{diann_mod},{site},label")
+                else:
+                    var_ptm.append(f"{diann_mod},{site}")

     return fix_ptm, var_ptm
234-252: Consider adding schema documentation
While the schema definition is clear and well-structured, it would be beneficial to add documentation about:
- Valid value ranges for fields (e.g., valid MS levels)
- Business rules for nullable fields
- Any field dependencies or constraints
Line range hint 1-391: Consider adding tests for new functionality
While the code changes are well-implemented, it would be beneficial to add tests covering:
- Edge cases for integer type conversions
- Batch writing with various batch sizes
- Error scenarios for file processing
- ID-only output generation
Would you like me to help generate test cases for these scenarios?
quantmsutils/diann/diann2mztab.py (4)
36-36: Consider using a more specific log level.
The current setup uses DEBUG level by default, which might be too verbose for production. Consider using INFO as the default level and allowing it to be configured through environment variables.
-logging.basicConfig(format="%(asctime)s [%(funcName)s] - %(message)s", level=logging.DEBUG)
+logging.basicConfig(
+    format="%(asctime)s [%(funcName)s] - %(message)s",
+    level=os.getenv('LOG_LEVEL', 'INFO')
+)
478-483: Performance optimization opportunity in mass calculation.
The current implementation creates a dictionary for all unique sequences and then maps it. While this is already an optimization, we could further improve it by using numpy's vectorized operations.
-uniq_masses = {
-    k: compute_mass_modified_peptide(k) for k in report["Modified.Sequence"].unique()
-}
-mass_vector = report["Modified.Sequence"].map(uniq_masses)
-report["Calculate.Precursor.Mz"] = (
-    mass_vector + (PROTON_MASS_U * report["Precursor.Charge"])
-) / report["Precursor.Charge"]
+# Convert to numpy arrays for vectorized operations
+unique_sequences = report["Modified.Sequence"].unique()
+unique_masses = np.array([compute_mass_modified_peptide(seq) for seq in unique_sequences])
+mass_map = dict(zip(unique_sequences, unique_masses))
+mass_vector = np.array([mass_map[seq] for seq in report["Modified.Sequence"]])
+charges = report["Precursor.Charge"].to_numpy()
+report["Calculate.Precursor.Mz"] = (mass_vector + (PROTON_MASS_U * charges)) / charges
Line range hint 1064-1136: Improve error handling in scan matching logic.
The current implementation has complex error handling for scan mismatches. Consider extracting this logic into a separate function for better maintainability.
+def handle_scan_mismatch(scan_matched):
+    """Handle cases where RT matched and DIA-NN Scan matched are inconsistent."""
+    if not scan_matched["unassigned_matched"].any():
+        return scan_matched.drop(
+            ["scan_RT", "scan_opt_global_spectrum_reference",
+             "scan_exp_mass_to_charge", "unassigned_matched"],
+            axis=1
+        )
+
+    v_str = scan_matched[scan_matched["unassigned_matched"] == 1]["MS2.Scan"].tolist()
+    logger.info(
+        f"RT matched and DIA-NN Scan matched are inconsistent in mzML. "
+        f"Keep Scan matched: {v_str}"
+    )
+    return scan_matched.drop(
+        ["RT", "opt_global_spectrum_reference", "DIANN-intraID",
+         "exp_mass_to_charge", "unassigned_matched"],
+        axis=1
+    ).rename(
+        columns={
+            "scan_RT": "RT",
+            "scan_opt_global_spectrum_reference": "opt_global_spectrum_reference",
+            "scan_exp_mass_to_charge": "exp_mass_to_charge",
+        }
+    )

 # Replace the original code with:
 scan_matched["unassigned_matched"] = scan_matched.apply(
     lambda row: 1 if row["MS2.Scan"] != row["DIANN-intraID"] else 0, axis=1
 )
 scan_matched = handle_scan_mismatch(scan_matched)
1211-1216: Consider caching AASequence conversions.
The repeated conversion of peptide sequences using AASequence.fromString().toString() could be optimized by caching results for frequently occurring sequences.
+@functools.lru_cache(maxsize=1024)
+def convert_aa_sequence(sequence):
+    """Convert peptide sequence using AASequence with caching."""
+    return AASequence.fromString(sequence).toString()

-out_mztab_psh.loc[:, "opt_global_cv_MS:1000889_peptidoform_sequence"] = out_mztab_psh.apply(
-    lambda x: AASequence.fromString(
-        x["opt_global_cv_MS:1000889_peptidoform_sequence"]
-    ).toString(),
-    axis=1,
-    result_type="expand",
-)
+out_mztab_psh.loc[:, "opt_global_cv_MS:1000889_peptidoform_sequence"] = out_mztab_psh[
+    "opt_global_cv_MS:1000889_peptidoform_sequence"
+].apply(convert_aa_sequence)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
tests/test_data/BSA1_F1_spectrum_df.parquet is excluded by !**/*.parquet
📒 Files selected for processing (8)
- quantmsutils/diann/diann2mztab.py (35 hunks)
- quantmsutils/diann/dianncfg.py (2 hunks)
- quantmsutils/features/snr.py (2 hunks)
- quantmsutils/mzml/mzml_statistics.py (11 hunks)
- quantmsutils/psm/psm_conversion.py (3 hunks)
- quantmsutils/sdrf/check_samplesheet.py (2 hunks)
- quantmsutils/sdrf/extract_sample.py (1 hunks)
- quantmsutils/utils/constants.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- quantmsutils/features/snr.py
- quantmsutils/sdrf/extract_sample.py
- quantmsutils/psm/psm_conversion.py
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/utils/constants.py
🔇 Additional comments (9)
quantmsutils/diann/dianncfg.py (1)
58-58: LGTM! String formatting simplified.
The string concatenation has been simplified while maintaining the same functionality.
Also applies to: 94-94
quantmsutils/mzml/mzml_statistics.py (5)
13-27: Great improvement using constants instead of string literals!
The introduction of constants from utils.constants enhances code maintainability and reduces the risk of typos. This change makes the code more robust and easier to maintain.
149-179: Excellent batch writing optimization
The lazy initialization of writers and direct streaming of data using PyArrow's batch writing capabilities will improve memory efficiency for large datasets.
Line range hint 357-371: Clean implementation of ID output handling
The addition of ID-specific output path and its integration with the existing file processing logic is well implemented. The code maintains good error handling and supports both mzML and Bruker formats consistently.
99-111: Good improvement on data type precision
The change to use appropriate integer types for MS_LEVEL, CHARGE, and NUM_PEAKS is more precise and memory efficient. However, ensure that any downstream processing handles these integer types correctly.
✅ Verification successful
Integer type changes are safely handled throughout the codebase
The verification shows that the integer types for MS_LEVEL, CHARGE, and NUM_PEAKS are consistently handled across the codebase:
- All schema definitions in mzml_statistics.py use pa.int32() for these fields
- The fields are properly used in downstream processing (e.g., in diann2mztab.py)
- The constants are well-documented in constants.py with clear purposes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for code that processes these fields
rg "MS_LEVEL|CHARGE|NUM_PEAKS" --type python -C 2
Length of output: 82
Script:
#!/bin/bash
# Search for code that processes these fields, without file type restriction
rg "MS_LEVEL|CHARGE|NUM_PEAKS" -C 2
Length of output: 7634
90-95: Verify the impact of dictionary structure change in psm_parts
The change from list to dictionary structure in psm_parts is good for clarity, but ensure all consumers of this data are updated to handle the new structure.
✅ Verification successful
Dictionary structure change in psm_parts is safe to proceed
The verification shows that psm_parts is only used internally within the mzml_statistics.py file for batch processing. The dictionary keys (SCAN, MS_LEVEL, MZ_ARRAY, INTENSITY_ARRAY) are used consistently across the codebase, primarily through imports from constants.py. The change from list to dictionary structure is contained within this file and doesn't affect external consumers, as the data is properly serialized to parquet format before being used by other components.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any code that processes psm_parts data
rg -l "psm_parts" --type python
Length of output: 62
Script:
#!/bin/bash
# Search for any code that processes psm_parts data, without file type filter
rg -l "psm_parts"
# Also search for specific dictionary keys to find potential consumers
rg -l "SCAN|MS_LEVEL|MZ_ARRAY|INTENSITY_ARRAY"
Length of output: 297
Script:
#!/bin/bash
# Check how these dictionary keys are used in the related files
rg "SCAN|MS_LEVEL|MZ_ARRAY|INTENSITY_ARRAY" -A 2 -B 2 quantmsutils/features/snr.py quantmsutils/diann/diann2mztab.py
# Check the context of psm_parts usage
rg "psm_parts" -A 2 -B 2 quantmsutils/mzml/mzml_statistics.py
Length of output: 2457
quantmsutils/sdrf/check_samplesheet.py (2)
125-127: LGTM! Error message formatting improved.
The simplified error message is more concise while maintaining clarity.
140-140: Verify the impact of skipping SDRF validation.
While the implementation is correct, please ensure that skipping SDRF validation doesn't impact downstream processes that might depend on validated SDRF data.
Consider:
- Adding a warning message when this flag is used in production environments (a rough sketch follows this list)
- Documenting the risks of skipping validation in the README
- Adding logging to track when validation is skipped
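A hedged sketch of the warning idea (option and function names are assumptions based on the flag mentioned in this review, not the project's actual CLI):

```python
import logging
import click

logger = logging.getLogger(__name__)

@click.command("checksamplesheet")
@click.option("--sdrf", type=click.Path(exists=True), required=True)
@click.option("--skip_sdrf_validation", is_flag=True, default=False)
def checksamplesheet(sdrf, skip_sdrf_validation):
    if skip_sdrf_validation:
        # Surface a loud warning so skipped validation is visible in pipeline logs.
        logger.warning("SDRF validation skipped for %s; downstream results are unvalidated.", sdrf)
        print("No SDRF validation was performed")
        raise SystemExit(0)
    # ... run the normal validation path here ...
```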
Let me help verify the usage:
✅ Verification successful
The skip validation flag is safely implemented with appropriate safeguards
Based on the code analysis, the --skip_sdrf_validation flag is properly implemented with clear error handling and exit paths:
- When validation is skipped:
  - A clear message "No SDRF validation was performed" is printed
  - Program exits cleanly with code 0
  - No downstream processes continue after skipping validation
- When validation runs:
  - Multiple validation steps are performed (DEFAULT_TEMPLATE, MASS_SPECTROMETRY, factor values)
  - Errors are collected and properly handled
  - Program exits with appropriate error code if validation fails
The implementation follows the pattern of other validation flags in the codebase and includes proper error reporting. No downstream dependencies rely on validation results since the program exits immediately in both success and failure cases.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for dependencies on SDRF validation results
# Search for code that might depend on validated SDRF data
rg -l "SdrfDataFrame" | xargs rg "validate"
# Check for any potential validation checks in test files
fd -e py -x rg "check_sdrf" {} \;
Length of output: 536
Script:
#!/bin/bash
# Let's get more context about the SDRF validation flow and its usage
# Find the validation methods and their implementation
ast-grep --pattern 'def validate($$$) { $$$ }'
# Check where validation results are used
rg -A 3 "errors = .*validate\("
# Look for any error handling or dependencies on validation
rg "if.*errors" -A 2
Length of output: 640
Script:
#!/bin/bash
# Let's examine the check_sdrf function implementation and error handling
# Get the full context of check_sdrf function
ast-grep --pattern 'def check_sdrf($$$) {
$$$
}'
# Look for error handling and return values
rg -B 2 -A 4 "return errors|sys.exit"
# Check for any logging or warning messages
rg "warning|logger|print.*error" -A 2
Length of output: 17879
quantmsutils/diann/diann2mztab.py (1)
22-27: LGTM! Good use of constants.
The addition of constants from quantmsutils.utils.constants improves code maintainability by centralizing string literals.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
tests/test_commands.py (1)
115-132: Remove or implement the commented-out test function.
Commented-out code should not be committed to the repository. If this test is valuable for batch processing functionality, it should be implemented. Otherwise, it should be removed.
quantmsutils/mzml/mzml_statistics.py (3)
4-4: Remove unused import.
The List import is not used in this file. The return type annotation uses Set.
-from typing import Optional, List, Set
+from typing import Optional, Set
🧰 Tools
🪛 Ruff (0.8.0)
4-4:
typing.List imported but unused
Remove unused import: typing.List
(F401)
149-164: Optimize batch writing performance.
The current implementation creates a new RecordBatch for each batch. Consider collecting multiple batches and writing them together to reduce I/O operations.
 if self.batch_data:
     # Initialize writers lazily if not already created
     if self.parquet_writer is None:
         self.parquet_writer = pq.ParquetWriter(
             where=self.output_path, schema=self.parquet_schema, compression="gzip"
         )
-    # Create a Table directly from the current batch
-    batch = pa.RecordBatch.from_pylist(self.batch_data, schema=self.parquet_schema)
-    # Write the batch directly
-    self.parquet_writer.write_batch(batch)
+    # Create batches of 5 for more efficient writing
+    batches = []
+    for i in range(0, len(self.batch_data), 5):
+        batch = pa.RecordBatch.from_pylist(
+            self.batch_data[i:i+5],
+            schema=self.parquet_schema
+        )
+        batches.append(batch)
+
+    # Write multiple batches at once
+    self.parquet_writer.write_batch(pa.concat_batches(batches))
     # Clear the batch data
     self.batch_data = []
235-243: Consider memory optimization for numeric types.
Some numeric fields could use smaller data types to reduce memory usage:
- MS_LEVEL could use int8 as it's typically 1 or 2
- CHARGE could use int8 as it rarely exceeds small values
- NUM_PEAKS could use int16 if the maximum number of peaks is known to be under 32,767
- pa.field(MS_LEVEL, pa.int32(), nullable=True),
- pa.field(CHARGE, pa.int32(), nullable=True),
- pa.field(NUM_PEAKS, pa.int32(), nullable=True),
+ pa.field(MS_LEVEL, pa.int8(), nullable=True),
+ pa.field(CHARGE, pa.int8(), nullable=True),
+ pa.field(NUM_PEAKS, pa.int16(), nullable=True),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- quantmsutils/diann/diann2mztab.py (35 hunks)
- quantmsutils/mzml/mzml_statistics.py (13 hunks)
- tests/test_commands.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/diann/diann2mztab.py
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsutils/mzml/mzml_statistics.py
4-4: typing.List imported but unused
Remove unused import: typing.List
(F401)
🔇 Additional comments (2)
tests/test_commands.py (1)
191-192: Use consistent path handling.
The parquet file path "BSA1_F1_ms_info.parquet" is using a relative path without the tests/test_data/ prefix, unlike the second file. This could make the test fragile depending on where it's executed from.
quantmsutils/mzml/mzml_statistics.py (1)
201-204: LGTM! Well-implemented helper functions.
The helper functions have good error handling and type safety:
- column_exists efficiently uses a Set for column lookup
- _resolve_ms_path has robust file resolution logic with clear error messages
Also applies to: 376-393
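Minimal sketches of the two helpers named above (signatures and behavior are assumptions; the real implementations in mzml_statistics.py may differ):

```python
from pathlib import Path
from typing import Set

def column_exists(column: str, columns: Set[str]) -> bool:
    # A set gives O(1) membership checks when the same columns are probed repeatedly.
    return column in columns

def _resolve_ms_path(ms_path: str) -> Path:
    # Resolve the input against common extensions, case-insensitively,
    # and fail with a clear message when nothing matches.
    candidate = Path(ms_path)
    if candidate.exists():
        return candidate.resolve()
    for ext in (".mzML", ".mzml", ".d"):
        sibling = candidate.with_suffix(ext)
        if sibling.exists():
            return sibling.resolve()
    raise FileNotFoundError(f"Could not resolve mass spectrometry file for: {ms_path}")
```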
This is a major PR including:
Big refactoring on mzml_statistics:
- pa.RecordBatch.from_pylist
- MzMLFile().transform(file_name.encode(), consumer) (a rough sketch of this streaming pattern follows)
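A rough sketch of the streaming pattern mentioned in the second bullet, using a custom consumer with pyopenms (the real consumer in mzml_statistics.py computes full statistics; this toy version only counts MS levels, and the file name is a placeholder):

```python
import pyopenms as oms

class SpectrumCountConsumer:
    """Toy streaming consumer: spectra are handled one by one, never all in memory."""

    def __init__(self):
        self.ms_levels = {}

    def setExperimentalSettings(self, settings):
        pass

    def setExpectedSize(self, n_spectra, n_chromatograms):
        pass

    def consumeChromatogram(self, chromatogram):
        pass

    def consumeSpectrum(self, spectrum):
        level = spectrum.getMSLevel()
        self.ms_levels[level] = self.ms_levels.get(level, 0) + 1

consumer = SpectrumCountConsumer()
oms.MzMLFile().transform(b"BSA1_F1.mzML", consumer)  # note: file name passed as bytes
print(consumer.ms_levels)
```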
mzml_statisticsoutputs. I think would be good to create an issue on these components:Any other idea @jpfeuffer @daichengxin.
Summary by CodeRabbit
- Removed ms2rescore functionality from the CLI and documentation, streamlining user experience.
- Updated the pyarrow dependency across multiple configuration files to ensure compatibility.