major refactoring about ms2rescore and pyarrow #39
Conversation
Walkthrough
This pull request introduces changes across several configuration files to update dependency specifications.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
environment.yml (1)
13-13: Add newline at end of file
Add a newline character at the end of the file to comply with YAML formatting standards.
- pyarrow>=18.0.0 +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (8)
- README.md (0 hunks)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- quantmsutils/mzml/mzml_statistics.py (2 hunks)
- quantmsutils/quantmsutilsc.py (0 hunks)
- quantmsutils/rescoring/ms2rescore.py (0 hunks)
- recipe/meta.yaml (2 hunks)
- requirements.txt (1 hunks)
💤 Files with no reviewable changes (3)
- quantmsutils/quantmsutilsc.py
- README.md
- quantmsutils/rescoring/ms2rescore.py
🚧 Files skipped from review as they are similar to previous changes (2)
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (3)
requirements.txt (1)
6-6: LGTM: pyarrow version update
Setting pyarrow to >=18.0.0 is appropriate and aligns with the PR objective.
environment.yml (1)
10-13: LGTM: Dependency updates are consistent
The changes align with requirements.txt, maintaining consistency across dependency specifications.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
quantmsutils/mzml/mzml_statistics.py (1)
142-143: LGTM: Improved data handling with pyarrow Table
The switch from RecordBatch to Table is a good improvement as:
- Tables are more flexible and can handle multiple record batches
- This change aligns well with the pyarrow >=18.0.0 requirement
Also applies to: 161-163
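To make the distinction concrete, here is a small, self-contained sketch (toy column names, not taken from the project) of how a Table differs from a single RecordBatch:

```python
import pyarrow as pa

# Two independent record batches that share a schema...
batch1 = pa.RecordBatch.from_pylist([{"scan": "scan=1", "ms_level": 1}])
batch2 = pa.RecordBatch.from_pylist([{"scan": "scan=2", "ms_level": 2}])

# ...can be combined into a single Table without copying row data.
# A RecordBatch on its own cannot represent multiple chunks.
table = pa.Table.from_batches([batch1, batch2])
print(table.num_rows)  # 2

# Table.from_pylist offers the same list-of-dicts construction referenced above.
same_table = pa.Table.from_pylist(
    [{"scan": "scan=1", "ms_level": 1}, {"scan": "scan=2", "ms_level": 2}]
)
print(same_table.schema)
```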
✅ Verification successful
Let me verify the pyarrow Table API usage further by checking the imports and version requirements.
LGTM: Verified pyarrow Table API compatibility
The switch to pa.Table.from_pylist is correctly implemented and compatible with the codebase:
- The required pyarrow version (>=18.0.0) supports this API
- All instances of Table usage are consistent across the file
- The imports are properly defined with the standard pa alias
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify pyarrow Table API compatibility
echo "Checking pyarrow Table API usage..."
ast-grep --pattern 'pa.Table.from_pylist($_,$_)' || echo "No Table.from_pylist usage found"
Length of output: 514
Script:
#!/bin/bash
# Check pyarrow imports and version requirements
echo "Checking pyarrow imports..."
rg "^import pyarrow|^from pyarrow" -A 2
echo -e "\nChecking version requirements..."
fd "requirements" --exec cat {} \;
fd "setup.py" --exec cat {} \;
Length of output: 555
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
environment.yml (2)
14-14: Add newline at end of file
The file is missing a newline character at the end.
- scipy +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 14-14: no new line character at the end of file
(new-line-at-end-of-file)
10-10: Consider adding version constraint for pyopenms
Having no version constraint for pyopenms could lead to compatibility issues in the future. Consider specifying a minimum version to ensure stability.
- - pyopenms
+ - pyopenms>=2.4.0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (2 hunks)
- requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- pyproject.toml
- requirements.txt
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 14-14: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (1)
environment.yml (1)
10-14: Verify the impact of dependency changes
Several dependencies have been removed (ms2rescore, deepLC, psm-utils, protobuf, pygam) and version constraints modified. Please ensure these changes don't break existing functionality.
✅ Verification successful
Let me generate additional verification scripts to check for any indirect usage or references to these packages in configuration or documentation files.
Based on the results from the verification scripts, I can now provide a final response:
Dependencies have been intentionally removed
The removed dependencies (ms2rescore, deepLC, psm-utils, protobuf, pygam) were primarily used in the rescoring functionality which has been removed from the codebase. This is evidenced by the deletion of the entire rescoring module (quantmsutils/rescoring/) and the removal of the ms2rescore command from quantmsutils/quantmsutilsc.py. The pyarrow version upgrade and scipy reintroduction are related to the remaining mzml statistics functionality.
- quantmsutils/rescoring/ms2rescore.py and its directory have been deleted
- quantmsutils/quantmsutilsc.py no longer imports or registers the ms2rescore command
- The remaining code changes focus on pyarrow Table usage in mzml statistics
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for usage of removed packages in the codebase
echo "Checking for usage of removed packages..."
for pkg in "ms2rescore" "deepLC" "psm-utils" "protobuf" "pygam"; do
echo "Searching for $pkg usage..."
rg -l "import.*$pkg|from.*$pkg" || true
done
# Check if pyarrow 18.0.0 is compatible with other dependencies
echo "Checking pyarrow compatibility..."
curl -s https://pypi.org/pypi/pyarrow/18.0.0/json | jq '.info.requires_dist'
Length of output: 1336
Script:
#!/bin/bash
# Check for any references to removed packages in configuration files and documentation
echo "Checking for package references in configs and docs..."
rg -i "ms2rescore|deeplc|psm.?utils|protobuf|pygam" --type yaml --type md --type txt
# Check for any potential function calls or class usage that might be from these packages
echo "Checking for potential usage patterns..."
rg -i "MS2Rescorer|DeepLC|PSMUtils|Proto|GAM" --type python
# Check if there are any recent changes that removed these dependencies
echo "Checking recent changes in Python files..."
git diff HEAD~5 HEAD -- "*.py"
Length of output: 21387
🧰 Tools
🪛 yamllint (1.35.1)
[error] 14-14: no new line character at the end of file
(new-line-at-end-of-file)
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
.github/workflows/python-app.yml (1)
42-42: Add newline at end of file
Add a newline character at the end of the file to comply with POSIX standards.
poetry run pytest +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 42-42: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
.github/workflows/python-app.yml (1 hunks)
🧰 Additional context used
🪛 yamllint (1.35.1)
.github/workflows/python-app.yml
[error] 42-42: no new line character at the end of file
(new-line-at-end-of-file)
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
environment.yml (1)
13-13: Add newline at end of file
The file is missing a newline character at the end.
Apply this diff:
- scipy +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (2 hunks)
- requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- requirements.txt
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (1)
environment.yml (1)
12-12: Confirm the need for pyarrow >= 18.0.0
The PR title suggests forcing pyarrow 18.0.0, but there's no context provided for this change. Could you please explain:
- What specific features/fixes in pyarrow 18.0.0 are required?
- Have you tested compatibility with your existing pyarrow usage?
Let's check pyarrow usage:
✅ Verification successful
Pyarrow version requirement is justified for list type support in schema
The code uses pyarrow for parquet file handling with specific features that justify the version requirement:
- Complex schema definitions with nested types (list of float64)
- Batch writing with compression
- Table creation from pylists with schema validation
- Parquet streaming writer functionality
The id_schema specifically uses pa.list_(pa.float64()) for the mz and intensity fields, which requires robust list type support in pyarrow. Version 18.0.0 introduced improvements in handling such complex types and schema validation.
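For illustration only, a schema along these lines (field names assumed here, not copied from the repository) declares mz and intensity as list-of-float columns and validates rows built with Table.from_pylist:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative schema with nested list types for peak arrays.
id_schema = pa.schema([
    pa.field("scan", pa.string()),
    pa.field("ms_level", pa.int32()),
    pa.field("mz", pa.list_(pa.float64()), nullable=True),
    pa.field("intensity", pa.list_(pa.float64()), nullable=True),
])

rows = [
    {"scan": "scan=1", "ms_level": 2, "mz": [100.1, 101.2], "intensity": [200.0, 50.0]},
]

# from_pylist coerces and validates the values against the declared schema.
table = pa.Table.from_pylist(rows, schema=id_schema)
pq.write_table(table, "spectrum_df.parquet", compression="gzip")
```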
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for pyarrow imports and usage
rg -l 'import.*pyarrow' || rg -l 'from.*pyarrow'
# Look for specific pyarrow features that might require 18.0.0
ast-grep --pattern 'import pyarrow as $_'
Length of output: 235
Script:
#!/bin/bash
# Let's examine the actual usage of pyarrow in the file to understand the features being used
cat quantmsutils/mzml/mzml_statistics.py
Length of output: 13839
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
🔇 Additional comments (3)
recipe/meta.yaml (3)
16-16: LGTM!
The noarch: python specification is correct for a pure Python package.
20-20: Justify the stricter Python version constraint
The Python version requirement has been significantly tightened from >=3.8,<3.11 to >=3.9,<3.10. This might unnecessarily restrict users and could impact compatibility with other packages in the ecosystem.
Also applies to: 24-24
✅ Verification successful
Python version constraint is justified by project configuration
The stricter Python version requirement (>=3.9,<3.10) is consistent with the project's configuration in pyproject.toml. The project is explicitly targeting Python 3.9 as shown by the black configuration target-version = ["py39"] and the poetry dependency specification. While the code search didn't reveal heavy usage of Python 3.9-specific features, this appears to be an intentional project-wide decision.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any Python 3.9-specific features being used
echo "Checking for Python 3.9 specific features..."
rg -l "annotations|TypeAlias|list\[|dict\[|tuple\[|union\[" || true
Length of output: 194
Script:
#!/bin/bash
# Check the actual usage of Python 3.9 features in the identified file
echo "Checking Python 3.9 features usage in diann2mztab.py..."
rg -A 5 "annotations|TypeAlias|list\[|dict\[|tuple\[|union\[" quantmsutils/diann/diann2mztab.py
# Check for any setup.py or pyproject.toml for dependency information
echo -e "\nChecking for Python package configuration files..."
fd -e py -e toml "setup|pyproject" --type f --exec cat {}
Length of output: 2403
27-30:
Review dependency removals and version constraints
Several concerns about the dependency changes:
- The removal of core dependencies (numpy, ms2rescore, deepLC, psm-utils, pygam, protobuf) needs justification. These seem like important dependencies for a bioinformatics package.
- The explicit version constraint on pyarrow (>=18.0.0) should be explained.
- While using conda-forge channel explicitly is good, ensure it's consistent with the project's channel priority requirements.
✅ Verification successful
Based on the analysis of the codebase and dependencies, I can now provide the final response:
Dependencies are correctly aligned with actual usage
The removed dependencies (numpy, ms2rescore, deepLC, psm-utils, pygam, protobuf) are not critical issues because:
- Numpy is indirectly included through scipy and pandas
- The codebase shows no direct usage of ms2rescore, deepLC, psm-utils, pygam, or protobuf
- The core functionality relies on pyopenms, pandas, pyarrow, and scipy which are all retained
The pyarrow version constraint (>=18.0.0) is justified by the direct usage in mzml_statistics.py for parquet file handling, where newer features may be required.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining usage of removed packages
echo "Checking for potential broken imports after dependency removals..."
for pkg in "numpy" "ms2rescore" "deepLC" "psm_utils" "pygam" "protobuf"; do
echo "Checking $pkg usage in Python files:"
rg -p "import.*$pkg|from.*$pkg" '*.py' || true
done
# Check if pyarrow 18.0.0 is specifically needed
echo "Checking pyarrow version-specific usage..."
rg -p "pyarrow" '*.py' || true
Length of output: 1888
Script:
#!/bin/bash
# Let's first check if we're in the right directory and find Python files
echo "Listing Python files in the repository:"
fd -e py
# Check the recipe directory structure
echo -e "\nListing files in recipe directory:"
ls -la recipe/
# Check meta.yaml history to understand dependency changes
echo -e "\nChecking meta.yaml changes:"
git diff HEAD^1 recipe/meta.yaml || true
# Look for any build or test requirements that might explain dependencies
echo -e "\nChecking full meta.yaml content:"
cat recipe/meta.yaml
Length of output: 2692
Script:
#!/bin/bash
# Let's check the actual imports in the Python files to verify dependency usage
echo "Checking imports in Python files:"
for file in $(fd -e py); do
echo -e "\nFile: $file"
rg -p "^(import|from) \w+" "$file" || true
done
# Check pyproject.toml for dependencies
echo -e "\nChecking pyproject.toml dependencies:"
cat pyproject.toml || true
# Check for any numpy, protobuf related functionality
echo -e "\nChecking for numpy/array operations:"
rg -p "np\.|numpy|array|ndarray" || true
echo -e "\nChecking for protobuf related code:"
rg -p "proto|message|descriptor" || true
Length of output: 74437
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
environment.yml (1)
Line range hint 13-14: Add newline at end of file
YAML files should end with a newline character.
Add a newline at the end of the file.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- recipe/meta.yaml (1 hunks)
- requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (4)
requirements.txt (3)
Line range hint 6-7: Dependencies removal needs attention
The verification revealed that scipy is still being used in quantmsutils/features/snr.py for entropy calculation.
#!/bin/bash
# Verify scipy usage
rg -l 'import.*scipy' || rg -l 'from.*scipy'
rg 'scipy\.' -A 3

3-4: Add back version constraint for pyopenms
The version constraint for pyopenms should be maintained as this package provides critical functionality for mass spectrometry data processing.
Apply this change:
-pyopenms
+pyopenms>=2.4.0

5-5: Add version constraint for pyarrow
The PR title indicates forcing pyarrow to version 18.0.0, but the constraint is missing in requirements.txt.
Apply this change:
-pyarrow
+pyarrow==18.0.0

Likely invalid or redundant comment.
environment.yml (1)
10-10:
Add back version constraint for pyopenms
The version constraint for pyopenms should be maintained for consistent mass spectrometry data processing across environments.
Apply this change:
- - pyopenms
+ - pyopenms>=2.4.0
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
environment.yml (2)
12-12: Consider aligning pyarrow version with requirements.txt
The version constraint >=16.1.0 has been added for pyarrow. However, for consistency with other configuration files, consider using a fixed version.
- - pyarrow>=16.1.0
+ - pyarrow==18.0.0
13-13: Add newline at end of file
Add a newline character at the end of the file to follow YAML formatting conventions.
- scipy +
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
- tests/test_data/BSA1_F1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/BSA1_F1_ms_info.tsv is excluded by !**/*.tsv
📒 Files selected for processing (7)
- .github/workflows/conda-build.yml (1 hunks)
- environment.yml (1 hunks)
- pyproject.toml (2 hunks)
- quantmsutils/mzml/mzml_statistics.py (1 hunks)
- recipe/meta.yaml (1 hunks)
- requirements.txt (1 hunks)
- tests/test_commands.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- .github/workflows/conda-build.yml
- tests/test_commands.py
- quantmsutils/mzml/mzml_statistics.py
- requirements.txt
- pyproject.toml
- recipe/meta.yaml
🧰 Additional context used
🪛 yamllint (1.35.1)
environment.yml
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (2)
environment.yml (2)
13-13: 🛠️ Refactor suggestion
Add version constraint for scipy
Since scipy is being reintroduced to provide essential numerical computing functionality (previously from numpy), it's recommended to add a version constraint to ensure compatibility.
🧰 Tools
🪛 yamllint (1.35.1)
[error] 13-13: no new line character at the end of file
(new-line-at-end-of-file)
10-10: Verify pyopenms version compatibility
The version constraint for pyopenms has been updated to >=3.2.0. Since this package provides critical functionality for mass spectrometry data processing, let's verify the compatibility.
✅ Verification successful
Let me analyze the pyopenms usage to verify version compatibility.
The first script will help identify all pyopenms imports, while the second one checks for specific class usage. This will help determine if any version-specific APIs are being used.
Let me check the actual usage of these classes to ensure version compatibility.
This will help us understand if any version-specific features or APIs are being used that might require version 3.2.0 or higher.
pyopenms version constraint is appropriate
The codebase uses standard pyopenms APIs that have been stable across versions:
- Core MS data handling: MzMLFile, MSExperiment, SpectrumLookup
- Sequence manipulation: AASequence, FASTAFile
- Modification handling: ModificationsDB
- File I/O: IdXMLFile
All observed usage patterns are basic functionality that has been present in pyopenms for many versions. The version constraint >=3.2.0 is reasonable and doesn't pose compatibility risks.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any version-specific API usage that might be affected
rg -A 2 'from pyopenms import|import pyopenms'
# Check git history for version-related changes
git log -p -- '**/requirements.txt' '**/setup.py' '**/pyproject.toml' | grep -A 5 -B 5 'pyopenms'
Length of output: 930
Script:
#!/bin/bash
# Check specific API usage that might be version sensitive
ast-grep --pattern 'from pyopenms import $_'
ast-grep --pattern 'import pyopenms as $_'
# Look for specific pyopenms classes/methods usage
rg -l "MzMLFile|MSExperiment|SpectrumLookup|AASequence|FASTAFile|ModificationsDB|PROTON_MASS_U" quantmsutils/
Length of output: 820
Script:
#!/bin/bash
# Check how these classes are used in the codebase
rg -A 5 "MzMLFile|MSExperiment|SpectrumLookup|AASequence|FASTAFile|ModificationsDB" quantmsutils/
# Check if there are any version-specific method calls or attributes
rg "oms\." quantmsutils/
Length of output: 5744
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
tests/test_commands.py (2)
1-4: Remove unused import
The pyarrow.parquet import is not directly used in the code. While parquet files are being read, it is done through pandas' read_parquet function, which internally handles the parquet operations.
import pandas as pd
from click.testing import CliRunner
import pyarrow as pa
-import pyarrow.parquet as pq
from quantmsutils.quantmsutilsc import cli
🧰 Tools
🪛 Ruff (0.8.0)
4-4:
pyarrow.parquet imported but unused
Remove unused import: pyarrow.parquet
(F401)
187-192: Add error handling for file operations
The parquet file reading operations should include error handling to provide meaningful error messages if the files are missing or corrupted.
- table1 = pd.read_parquet("BSA1_F1_ms_info.parquet", schema=schema)
- table2 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)
- table2 = table2.set_index('SpectrumID')
- table1 = table1.set_index('SpectrumID')
-
- assert table1.compare(table2).empty
+ try:
+     table1 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)
+     table2 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)
+     table2 = table2.set_index('SpectrumID')
+     table1 = table1.set_index('SpectrumID')
+
+     assert table1.compare(table2).empty
+ except FileNotFoundError as e:
+     pytest.fail(f"Failed to read parquet file: {str(e)}")
+ except Exception as e:
+     pytest.fail(f"Error comparing parquet data: {str(e)}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (3)
- tests/test_data/BSA1_F1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/BSA1_F1_spectrum_df.csv is excluded by !**/*.csv
- tests/test_data/BSA1_F1_spectrum_df.parquet is excluded by !**/*.parquet
📒 Files selected for processing (2)
- quantmsutils/mzml/mzml_statistics.py (4 hunks)
- tests/test_commands.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/mzml/mzml_statistics.py
🧰 Additional context used
🪛 Ruff (0.8.0)
tests/test_commands.py
4-4: pyarrow.parquet imported but unused
Remove unused import: pyarrow.parquet
(F401)
🔇 Additional comments (1)
tests/test_commands.py (1)
Line range hint 1-193: Verify complete removal of ms2rescore references
The AI summary indicates that ms2rescore-related tests were removed. Let's verify there are no remaining references to ensure clean removal.
✅ Verification successful
Let me check if these references are only in test data files or if they are part of actual functionality.
Let me check for any ms2rescore-related functionality in the main code:
No active ms2rescore functionality found, only test data references remain
The verification shows that ms2rescore references only exist in test data file names under tests/test_data/ directory. No references were found in the main codebase, imports, function definitions, or CLI commands. These test data files are likely used as fixtures for other tests and don't indicate any remaining functional dependencies on ms2rescore.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining ms2rescore references in the codebase
rg -i "ms2rescore" --type py
Length of output: 709
Script:
#!/bin/bash
# Check if these files exist in the test data directory
fd "ms2rescore" tests/test_data/
# Check for any ms2rescore references in function names or imports
ast-grep --pattern 'import $_ from "ms2rescore"'
ast-grep --pattern 'import "ms2rescore"'
ast-grep --pattern 'def $_ms2rescore$_($$$)'
Length of output: 265
Script:
#!/bin/bash
# Check for any ms2rescore references in the main code (excluding tests)
rg -i "ms2rescore" --type py --glob '!tests/**'
# Check for any configuration or CLI commands related to ms2rescore
ast-grep --pattern 'def $_(self, $$$):
$$$
"ms2rescore"
$$$'
# Look for any command definitions with ms2rescore
rg -i "ms2rescore" -A 5 -B 5 "def.*command"
Length of output: 213
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
tests/test_commands.py (1)
198-198: Strengthen the assertion
Using only shape comparison might miss data differences. Consider using compare() or equals() for a more thorough comparison.
- assert id_table.shape == id_table2.shape
+ assert id_table.compare(id_table2).empty
quantmsutils/mzml/mzml_statistics.py (1)
169-170: Address batch processing TODOs
The TODO comments indicate a need for batch processing of spectrum data. This could impact memory usage with large datasets.
Would you like me to help implement a batched processing solution using PyArrow's RecordBatchBuilder for handling lists?
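As a rough sketch of what such batching could look like with plain PyArrow primitives (field names and batch size are assumptions; PyArrow's Python API exposes RecordBatch.from_pylist rather than a RecordBatchBuilder):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema; field names are illustrative only.
schema = pa.schema([
    pa.field("scan", pa.string()),
    pa.field("mz", pa.list_(pa.float64())),
    pa.field("intensity", pa.list_(pa.float64())),
])

def write_spectra(rows_iter, output_path, batch_size=10_000):
    """Stream spectra to parquet in fixed-size batches instead of buffering everything."""
    writer = None
    buffer = []
    try:
        for row in rows_iter:
            buffer.append(row)
            if len(buffer) >= batch_size:
                if writer is None:  # create the writer lazily, on first flush
                    writer = pq.ParquetWriter(output_path, schema, compression="gzip")
                writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
                buffer = []
        if buffer:  # flush whatever remains at the end
            if writer is None:
                writer = pq.ParquetWriter(output_path, schema, compression="gzip")
            writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
    finally:
        if writer is not None:
            writer.close()
```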
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- quantmsutils/mzml/mzml_statistics.py (11 hunks)
- quantmsutils/utils/constants.py (1 hunks)
- tests/test_commands.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- quantmsutils/utils/constants.py
🔇 Additional comments (5)
tests/test_commands.py (2)
186-187: Use consistent path handling
The parquet file path "BSA1_F1_ms_info.parquet" is using a relative path without the tests/test_data/ prefix, unlike the second file. This could make the test fragile depending on where it's executed from.
- table1 = pd.read_parquet("BSA1_F1_ms_info.parquet", schema=schema)
+ table1 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet", schema=schema)

193-194: Use consistent path handling
Similar to the previous issue, the parquet file path "BSA1_F1_spectrum_df.parquet" should use the tests/test_data/ prefix.
- id_table = pd.read_parquet("BSA1_F1_spectrum_df.parquet")
+ id_table = pd.read_parquet("tests/test_data/BSA1_F1_spectrum_df.parquet")

quantmsutils/mzml/mzml_statistics.py (3)
209-226: LGTM! Well-structured schema definitions
Good improvements:
- Using constants for field names improves maintainability
- Appropriate type changes from float to int32 for discrete values
- Proper handling of nullable fields
298-306: LGTM! Secure and consistent implementation
Good practices:
- Safe SQL query construction (see the sketch below)
- Consistent schema with mzML implementation
- Proper use of constants
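For reference, parameterized query construction with sqlite3 looks like the sketch below; the table and column names are hypothetical and not taken from the project's Bruker handling.

```python
import sqlite3

def fetch_frames(db_path: str, ms_level: int):
    """Illustrative only: bind values with placeholders instead of formatting them into SQL."""
    con = sqlite3.connect(db_path)
    try:
        cursor = con.execute(
            "SELECT Id, Time, NumPeaks FROM Frames WHERE MsMsType = ?",
            (ms_level,),  # bound parameter, never string-interpolated
        )
        return cursor.fetchall()
    finally:
        con.close()
```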
Line range hint 331-365: LGTM! Robust path handling implementation
The path resolution logic is well-implemented with:
- Proper case-insensitive extension handling
- Clear error messages
- Safe path resolution
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
quantmsutils/utils/constants.py (1)
7-17: Consider adding detailed documentation for each constant.
While the link to quantms.io is helpful, inline documentation for each constant would improve code readability and maintainability.
-CHARGE = "charge"  # charge of the precursor ion
+# The charge state of the precursor ion in MS2 spectra
+CHARGE = "charge"
-SCAN = "scan"  # Definition in quantms.io https://github.com/bigbio/quantms.io/blob/main/docs/README.adoc#scan
+# Unique identifier for each spectrum (see: https://github.com/bigbio/quantms.io/blob/main/docs/README.adoc#scan)
+SCAN = "scan"
-MS_LEVEL = "ms_level"
+# Mass spectrometry level (1 for MS1, 2 for MS2)
+MS_LEVEL = "ms_level"
-NUM_PEAKS = "num_peaks"  # number of peaks in the spectrum (MS1 or MS2)
+# Number of peaks in the spectrum (MS1 or MS2)
+NUM_PEAKS = "num_peaks"
-BASE_PEAK_INTENSITY = "base_peak_intensity"
+# Intensity of the highest peak in the spectrum
+BASE_PEAK_INTENSITY = "base_peak_intensity"

quantmsutils/mzml/mzml_statistics.py (2)
169-170: Address TODO comments regarding batch processing.
The current implementation loads all PSM parts into memory before writing to parquet. This could be problematic for large datasets.
I can help implement a batch processing solution using PyArrow's streaming capabilities. Would you like me to propose a solution or create a GitHub issue to track this improvement?
78-81: Consider using a dictionary for better readability.
The current list-based approach makes it harder to understand what each element represents.
- [str(scan_id),
-  int(ms_level),
-  mz_array.tolist(),
-  intensity_array.tolist(),]
+ {
+     SCAN: str(scan_id),
+     MS_LEVEL: int(ms_level),
+     MZ_ARRAY: mz_array.tolist(),
+     INTENSITY_ARRAY: intensity_array.tolist(),
+ }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
- tests/test_data/BSA1_F1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/BSA1_F1_spectrum_df.parquet is excluded by !**/*.parquet
📒 Files selected for processing (3)
- quantmsutils/mzml/mzml_statistics.py (11 hunks)
- quantmsutils/utils/constants.py (1 hunks)
- tests/test_commands.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
tests/test_commands.py
3-3: pyarrow imported but unused
Remove unused import: pyarrow
(F401)
🔇 Additional comments (2)
tests/test_commands.py (1)
173-174: Use consistent path handling.
The parquet file path "BSA1_F1_ms_info.parquet" is using a relative path without the tests/test_data/ prefix, unlike the second file.
- table1 = pd.read_parquet("BSA1_F1_ms_info.parquet")
+ table1 = pd.read_parquet("tests/test_data/BSA1_F1_ms_info.parquet")

quantmsutils/mzml/mzml_statistics.py (1)
209-217: LGTM! Schema improvements enhance type safety.
The schema now uses more appropriate data types:
- Integer types for MS_LEVEL, CHARGE, and NUM_PEAKS
- Consistent use of constants from the new constants module
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
quantmsutils/diann/diann2mztab.py (3)
Line range hint 4-4: TODO comment needs to be addressed.
The TODO comment indicates missing tests. Unit tests are crucial for ensuring the reliability of the data conversion process.
Would you like me to help create unit tests for this functionality or create a GitHub issue to track this task?
1101-1109: Consider adding error handling for parquet file operations.
The parquet file reading operation could benefit from explicit error handling to gracefully handle corrupted files or missing columns.
- target = pd.read_parquet(file)
+ try:
+     target = pd.read_parquet(file)
+     if not all(col in target.columns for col in [MS_LEVEL, RETENTION_TIME, SCAN, EXPERIMENTAL_MASS_TO_CHARGE]):
+         raise ValueError(f"Missing required columns in {file}")
+ except Exception as e:
+     logger.error(f"Error reading parquet file {file}: {str(e)}")
+     raise
1104-1108: Consider optimizing the data filtering and sorting operations.
The current implementation performs filtering and sorting operations separately. These could be combined for better performance.
- target = target[target[MS_LEVEL] == 2]
- target.reset_index(inplace=True, drop=True)
- target["DIANN-intraID"] = target.index
- group.sort_values(by=RETENTION_TIME, inplace=True)
- target = target[[RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]]
+ target = (target[target[MS_LEVEL] == 2]
+           .reset_index(drop=True)
+           .assign(**{"DIANN-intraID": lambda x: x.index})
+           [[RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]])
+ group = group.sort_values(by=RETENTION_TIME)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
quantmsutils/diann/diann2mztab.py (2 hunks)
🔇 Additional comments (2)
quantmsutils/diann/diann2mztab.py (2)
22-22: LGTM! Good use of constants.
The import of constants from quantmsutils.utils.constants improves code maintainability by centralizing string literals.
1102-1108: LGTM! Good refactoring to use constants.
The changes improve code maintainability by:
- Using
MS_LEVELconstant for filtering MS2 spectra - Using
RETENTION_TIME,SCAN, andEXPERIMENTAL_MASS_TO_CHARGEconstants for column selection
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
tests/test_commands.py (1)
114-131: Add docstrings to test functions
The new test functions lack docstrings. Please add documentation describing:
- Purpose of the test
- Expected behavior
- Any important test data requirements
Example for test_batch_convert_parquet:
def test_batch_convert_parquet():
    """
    Test batch conversion of mzML files to parquet format.
    Verifies that the mzmlstats command successfully processes multiple files
    and generates the expected parquet output for each input file.
    """

Also applies to: 187-203
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (4)
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj2_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj1_ms_info.parquet is excluded by !**/*.parquet
- tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj2_ms_info.parquet is excluded by !**/*.parquet
📒 Files selected for processing (1)
tests/test_commands.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
tests/test_commands.py
3-3: pyarrow imported but unused
Remove unused import: pyarrow
(F401)
🔇 Additional comments (1)
tests/test_commands.py (1)
3-3: Remove unused import
The pyarrow import is not used in this file. While parquet operations are performed, pandas handles the pyarrow dependency internally.
-import pyarrow as pa
🧰 Tools
🪛 Ruff (0.8.0)
3-3: pyarrow imported but unused
Remove unused import: pyarrow
(F401)
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
.gitignore (1)
165-168: Consider using a wildcard pattern for test data files
Instead of listing individual files, consider using a pattern to make the .gitignore more maintainable:
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj1.mzML
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj2.mzML
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj1.mzML
-/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj2.mzML
+/tests/test_data/diann2mztab/RD139_Narrow_UPS1_*_*fmol_inj*.mzML
quantmsutils/diann/diann2mztab.py (1)
1102-1109: LGTM! Consider extracting the column list for better readability
The use of constants improves maintainability. Consider extracting the column list to a variable for better readability:
-target = target[[RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]]
+columns_to_keep = [RETENTION_TIME, SCAN, "DIANN-intraID", EXPERIMENTAL_MASS_TO_CHARGE]
+target = target[columns_to_keep]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- .gitignore (1 hunks)
- quantmsutils/diann/diann2mztab.py (2 hunks)
- tests/test_commands.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_commands.py
🔇 Additional comments (1)
quantmsutils/diann/diann2mztab.py (1)
22-23: LGTM! Good use of centralized constants
The addition of constants from quantmsutils.utils.constants improves code maintainability and reduces magic strings.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
quantmsutils/diann/dianncfg.py (1)
Line range hint 41-120: Consider reducing code duplication in modification handling.
The code for handling fixed and variable modifications is nearly identical, differing only in the target list (fix_ptm vs var_ptm). Consider extracting the common logic into a helper function to improve maintainability and reduce duplication.
Here's a suggested refactor:
 def convert_mod(unimod_database, fix_mod: str, var_mod: str) -> Tuple[List, List]:
     pattern = re.compile(r"\((.*?)\)")
     var_ptm = []
     fix_ptm = []

+    def process_modification(mod: str, unimod_database) -> Tuple[str, str, bool]:
+        tag = 0
+        diann_mod = None
+        for modification in unimod_database.modifications:
+            if modification.get_name() == mod.split(" ")[0]:
+                diann_mod = modification.get_name() + "," + str(modification._delta_mono_mass)
+                tag = 1
+                break
+        if tag == 0:
+            print(
+                "Warning: Currently only supported unimod modifications for DIA pipeline. Skipped: "
+                + mod
+            )
+            return None, None, False
+
+        site = re.findall(pattern, " ".join(mod.split(" ")[1:]))[0]
+        if site == "Protein N-term":
+            site = "*n"
+        elif site == "N-term":
+            site = "n"
+
+        is_label = any(label in diann_mod for label in ["TMT", "Label", "iTRAQ", "mTRAQ"])
+        return diann_mod, site, is_label

     if fix_mod:
         for mod in fix_mod.split(","):
+            diann_mod, site, is_label = process_modification(mod, unimod_database)
+            if diann_mod:
+                if is_label:
+                    fix_ptm.append(f"{diann_mod},{site},label")
+                else:
+                    fix_ptm.append(f"{diann_mod},{site}")

     if var_mod:
         for mod in var_mod.split(","):
+            diann_mod, site, is_label = process_modification(mod, unimod_database)
+            if diann_mod:
+                if is_label:
+                    var_ptm.append(f"{diann_mod},{site},label")
+                else:
+                    var_ptm.append(f"{diann_mod},{site}")

     return fix_ptm, var_ptm
234-252: Consider adding schema documentation
While the schema definition is clear and well-structured, it would be beneficial to add documentation about:
- Valid value ranges for fields (e.g., valid MS levels)
- Business rules for nullable fields
- Any field dependencies or constraints
Line range hint 1-391: Consider adding tests for new functionality
While the code changes are well-implemented, it would be beneficial to add tests covering:
- Edge cases for integer type conversions
- Batch writing with various batch sizes
- Error scenarios for file processing
- ID-only output generation
Would you like me to help generate test cases for these scenarios?
quantmsutils/diann/diann2mztab.py (4)
36-36: Consider using a more specific log level.
The current setup uses DEBUG level by default, which might be too verbose for production. Consider using INFO as the default level and allowing it to be configured through environment variables.
-logging.basicConfig(format="%(asctime)s [%(funcName)s] - %(message)s", level=logging.DEBUG)
+logging.basicConfig(
+    format="%(asctime)s [%(funcName)s] - %(message)s",
+    level=os.getenv('LOG_LEVEL', 'INFO')
+)
478-483: Performance optimization opportunity in mass calculation.
The current implementation creates a dictionary for all unique sequences and then maps it. While this is already an optimization, we could further improve it by using numpy's vectorized operations.
-uniq_masses = {
-    k: compute_mass_modified_peptide(k) for k in report["Modified.Sequence"].unique()
-}
-mass_vector = report["Modified.Sequence"].map(uniq_masses)
-report["Calculate.Precursor.Mz"] = (
-    mass_vector + (PROTON_MASS_U * report["Precursor.Charge"])
-) / report["Precursor.Charge"]
+# Convert to numpy arrays for vectorized operations
+unique_sequences = report["Modified.Sequence"].unique()
+unique_masses = np.array([compute_mass_modified_peptide(seq) for seq in unique_sequences])
+mass_map = dict(zip(unique_sequences, unique_masses))
+mass_vector = np.array([mass_map[seq] for seq in report["Modified.Sequence"]])
+charges = report["Precursor.Charge"].to_numpy()
+report["Calculate.Precursor.Mz"] = (mass_vector + (PROTON_MASS_U * charges)) / charges
Line range hint 1064-1136: Improve error handling in scan matching logic.
The current implementation has complex error handling for scan mismatches. Consider extracting this logic into a separate function for better maintainability.
+def handle_scan_mismatch(scan_matched):
+    """Handle cases where RT matched and DIA-NN Scan matched are inconsistent."""
+    if not scan_matched["unassigned_matched"].any():
+        return scan_matched.drop(
+            ["scan_RT", "scan_opt_global_spectrum_reference",
+             "scan_exp_mass_to_charge", "unassigned_matched"],
+            axis=1
+        )
+
+    v_str = scan_matched[scan_matched["unassigned_matched"] == 1]["MS2.Scan"].tolist()
+    logger.info(
+        f"RT matched and DIA-NN Scan matched are inconsistent in mzML. "
+        f"Keep Scan matched: {v_str}"
+    )
+    return scan_matched.drop(
+        ["RT", "opt_global_spectrum_reference", "DIANN-intraID",
+         "exp_mass_to_charge", "unassigned_matched"],
+        axis=1
+    ).rename(
+        columns={
+            "scan_RT": "RT",
+            "scan_opt_global_spectrum_reference": "opt_global_spectrum_reference",
+            "scan_exp_mass_to_charge": "exp_mass_to_charge",
+        }
+    )

 # Replace the original code with:
 scan_matched["unassigned_matched"] = scan_matched.apply(
     lambda row: 1 if row["MS2.Scan"] != row["DIANN-intraID"] else 0, axis=1
 )
 scan_matched = handle_scan_mismatch(scan_matched)
1211-1216: Consider caching AASequence conversions.
The repeated conversion of peptide sequences using AASequence.fromString().toString() could be optimized by caching results for frequently occurring sequences.
+@functools.lru_cache(maxsize=1024)
+def convert_aa_sequence(sequence):
+    """Convert peptide sequence using AASequence with caching."""
+    return AASequence.fromString(sequence).toString()

-out_mztab_psh.loc[:, "opt_global_cv_MS:1000889_peptidoform_sequence"] = out_mztab_psh.apply(
-    lambda x: AASequence.fromString(
-        x["opt_global_cv_MS:1000889_peptidoform_sequence"]
-    ).toString(),
-    axis=1,
-    result_type="expand",
-)
+out_mztab_psh.loc[:, "opt_global_cv_MS:1000889_peptidoform_sequence"] = out_mztab_psh[
+    "opt_global_cv_MS:1000889_peptidoform_sequence"
+].apply(convert_aa_sequence)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
tests/test_data/BSA1_F1_spectrum_df.parquet is excluded by !**/*.parquet
📒 Files selected for processing (8)
- quantmsutils/diann/diann2mztab.py (35 hunks)
- quantmsutils/diann/dianncfg.py (2 hunks)
- quantmsutils/features/snr.py (2 hunks)
- quantmsutils/mzml/mzml_statistics.py (11 hunks)
- quantmsutils/psm/psm_conversion.py (3 hunks)
- quantmsutils/sdrf/check_samplesheet.py (2 hunks)
- quantmsutils/sdrf/extract_sample.py (1 hunks)
- quantmsutils/utils/constants.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- quantmsutils/features/snr.py
- quantmsutils/sdrf/extract_sample.py
- quantmsutils/psm/psm_conversion.py
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/utils/constants.py
🔇 Additional comments (9)
quantmsutils/diann/dianncfg.py (1)
58-58: LGTM! String formatting simplified.
The string concatenation has been simplified while maintaining the same functionality.
Also applies to: 94-94
quantmsutils/mzml/mzml_statistics.py (5)
13-27: Great improvement using constants instead of string literals!
The introduction of constants from utils.constants enhances code maintainability and reduces the risk of typos. This change makes the code more robust and easier to maintain.
149-179: Excellent batch writing optimization
The lazy initialization of writers and direct streaming of data using PyArrow's batch writing capabilities will improve memory efficiency for large datasets.
Line range hint 357-371: Clean implementation of ID output handling
The addition of ID-specific output path and its integration with the existing file processing logic is well implemented. The code maintains good error handling and supports both mzML and Bruker formats consistently.
99-111: Good improvement on data type precision
The change to use appropriate integer types for MS_LEVEL, CHARGE, and NUM_PEAKS is more precise and memory efficient. However, ensure that any downstream processing handles these integer types correctly.
✅ Verification successful
Integer type changes are safely handled throughout the codebase
The verification shows that the integer types for MS_LEVEL, CHARGE, and NUM_PEAKS are consistently handled across the codebase:
- All schema definitions in mzml_statistics.py use pa.int32() for these fields
- The fields are properly used in downstream processing (e.g., in diann2mztab.py)
- The constants are well-documented in constants.py with clear purposes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for code that processes these fields
rg "MS_LEVEL|CHARGE|NUM_PEAKS" --type python -C 2
Length of output: 82
Script:
#!/bin/bash
# Search for code that processes these fields, without file type restriction
rg "MS_LEVEL|CHARGE|NUM_PEAKS" -C 2
Length of output: 7634
90-95: Verify the impact of dictionary structure change in psm_parts
The change from list to dictionary structure in psm_parts is good for clarity, but ensure all consumers of this data are updated to handle the new structure.
✅ Verification successful
Dictionary structure change in psm_parts is safe to proceed
The verification shows that psm_parts is only used internally within the mzml_statistics.py file for batch processing. The dictionary keys (SCAN, MS_LEVEL, MZ_ARRAY, INTENSITY_ARRAY) are used consistently across the codebase, primarily through imports from constants.py. The change from list to dictionary structure is contained within this file and doesn't affect external consumers, as the data is properly serialized to parquet format before being used by other components.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any code that processes psm_parts data
rg -l "psm_parts" --type python
Length of output: 62
Script:
#!/bin/bash
# Search for any code that processes psm_parts data, without file type filter
rg -l "psm_parts"
# Also search for specific dictionary keys to find potential consumers
rg -l "SCAN|MS_LEVEL|MZ_ARRAY|INTENSITY_ARRAY"
Length of output: 297
Script:
#!/bin/bash
# Check how these dictionary keys are used in the related files
rg "SCAN|MS_LEVEL|MZ_ARRAY|INTENSITY_ARRAY" -A 2 -B 2 quantmsutils/features/snr.py quantmsutils/diann/diann2mztab.py
# Check the context of psm_parts usage
rg "psm_parts" -A 2 -B 2 quantmsutils/mzml/mzml_statistics.py
Length of output: 2457
quantmsutils/sdrf/check_samplesheet.py (2)
125-127: LGTM! Error message formatting improved.
The simplified error message is more concise while maintaining clarity.
140-140: Verify the impact of skipping SDRF validation.
While the implementation is correct, please ensure that skipping SDRF validation doesn't impact downstream processes that might depend on validated SDRF data.
Consider:
- Adding a warning message when this flag is used in production environments (a rough sketch follows this list)
- Documenting the risks of skipping validation in the README
- Adding logging to track when validation is skipped
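A hedged sketch of the warning idea (option and function names are assumptions based on the flag mentioned in this review, not the project's actual CLI):

```python
import logging
import click

logger = logging.getLogger(__name__)

@click.command("checksamplesheet")
@click.option("--sdrf", type=click.Path(exists=True), required=True)
@click.option("--skip_sdrf_validation", is_flag=True, default=False)
def checksamplesheet(sdrf, skip_sdrf_validation):
    if skip_sdrf_validation:
        # Surface a loud warning so skipped validation is visible in pipeline logs.
        logger.warning("SDRF validation skipped for %s; downstream results are unvalidated.", sdrf)
        print("No SDRF validation was performed")
        raise SystemExit(0)
    # ... run the normal validation path here ...
```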
Let me help verify the usage:
✅ Verification successful
The skip validation flag is safely implemented with appropriate safeguards
Based on the code analysis, the --skip_sdrf_validation flag is properly implemented with clear error handling and exit paths:
- When validation is skipped:
  - A clear message "No SDRF validation was performed" is printed
  - Program exits cleanly with code 0
  - No downstream processes continue after skipping validation
- When validation runs:
  - Multiple validation steps are performed (DEFAULT_TEMPLATE, MASS_SPECTROMETRY, factor values)
  - Errors are collected and properly handled
  - Program exits with appropriate error code if validation fails
The implementation follows the pattern of other validation flags in the codebase and includes proper error reporting. No downstream dependencies rely on validation results since the program exits immediately in both success and failure cases.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for dependencies on SDRF validation results
# Search for code that might depend on validated SDRF data
rg -l "SdrfDataFrame" | xargs rg "validate"
# Check for any potential validation checks in test files
fd -e py -x rg "check_sdrf" {} \;
Length of output: 536
Script:
#!/bin/bash
# Let's get more context about the SDRF validation flow and its usage
# Find the validation methods and their implementation
ast-grep --pattern 'def validate($$$) { $$$ }'
# Check where validation results are used
rg -A 3 "errors = .*validate\("
# Look for any error handling or dependencies on validation
rg "if.*errors" -A 2
Length of output: 640
Script:
#!/bin/bash
# Let's examine the check_sdrf function implementation and error handling
# Get the full context of check_sdrf function
ast-grep --pattern 'def check_sdrf($$$) {
$$$
}'
# Look for error handling and return values
rg -B 2 -A 4 "return errors|sys.exit"
# Check for any logging or warning messages
rg "warning|logger|print.*error" -A 2
Length of output: 17879
quantmsutils/diann/diann2mztab.py (1)
22-27: LGTM! Good use of constants.
The addition of constants from quantmsutils.utils.constants improves code maintainability by centralizing string literals.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
tests/test_commands.py (1)
115-132: Remove or implement the commented-out test function.
Commented-out code should not be committed to the repository. If this test is valuable for batch processing functionality, it should be implemented. Otherwise, it should be removed.
quantmsutils/mzml/mzml_statistics.py (3)
4-4: Remove unused import.
The List import is not used in this file. The return type annotation uses Set.
-from typing import Optional, List, Set
+from typing import Optional, Set
🧰 Tools
🪛 Ruff (0.8.0)
4-4:
typing.List imported but unused
Remove unused import: typing.List
(F401)
149-164: Optimize batch writing performance.
The current implementation creates a new RecordBatch for each batch. Consider collecting multiple batches and writing them together to reduce I/O operations.
 if self.batch_data:
     # Initialize writers lazily if not already created
     if self.parquet_writer is None:
         self.parquet_writer = pq.ParquetWriter(
             where=self.output_path, schema=self.parquet_schema, compression="gzip"
         )
-    # Create a Table directly from the current batch
-    batch = pa.RecordBatch.from_pylist(self.batch_data, schema=self.parquet_schema)
-    # Write the batch directly
-    self.parquet_writer.write_batch(batch)
+    # Create batches of 5 for more efficient writing
+    batches = []
+    for i in range(0, len(self.batch_data), 5):
+        batch = pa.RecordBatch.from_pylist(
+            self.batch_data[i:i+5],
+            schema=self.parquet_schema
+        )
+        batches.append(batch)
+
+    # Write multiple batches at once
+    self.parquet_writer.write_batch(pa.concat_batches(batches))
     # Clear the batch data
     self.batch_data = []
235-243: Consider memory optimization for numeric types.
Some numeric fields could use smaller data types to reduce memory usage:
- MS_LEVEL could use int8 as it's typically 1 or 2
- CHARGE could use int8 as it rarely exceeds small values
- NUM_PEAKS could use int16 if the maximum number of peaks is known to be under 32,767
- pa.field(MS_LEVEL, pa.int32(), nullable=True),
- pa.field(CHARGE, pa.int32(), nullable=True),
- pa.field(NUM_PEAKS, pa.int32(), nullable=True),
+ pa.field(MS_LEVEL, pa.int8(), nullable=True),
+ pa.field(CHARGE, pa.int8(), nullable=True),
+ pa.field(NUM_PEAKS, pa.int16(), nullable=True),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- quantmsutils/diann/diann2mztab.py (35 hunks)
- quantmsutils/mzml/mzml_statistics.py (13 hunks)
- tests/test_commands.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- quantmsutils/diann/diann2mztab.py
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsutils/mzml/mzml_statistics.py
4-4: typing.List imported but unused
Remove unused import: typing.List
(F401)
🔇 Additional comments (2)
tests/test_commands.py (1)
191-192: Use consistent path handling.
The parquet file path "BSA1_F1_ms_info.parquet" is using a relative path without the tests/test_data/ prefix, unlike the second file. This could make the test fragile depending on where it's executed from.
quantmsutils/mzml/mzml_statistics.py (1)
201-204: LGTM! Well-implemented helper functions.
The helper functions have good error handling and type safety:
- column_exists efficiently uses a Set for column lookup
- _resolve_ms_path has robust file resolution logic with clear error messages
Also applies to: 376-393
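Minimal sketches of the two helpers named above (signatures and behavior are assumptions; the real implementations in mzml_statistics.py may differ):

```python
from pathlib import Path
from typing import Set

def column_exists(column: str, columns: Set[str]) -> bool:
    # A set gives O(1) membership checks when the same columns are probed repeatedly.
    return column in columns

def _resolve_ms_path(ms_path: str) -> Path:
    # Resolve the input against common extensions, case-insensitively,
    # and fail with a clear message when nothing matches.
    candidate = Path(ms_path)
    if candidate.exists():
        return candidate.resolve()
    for ext in (".mzML", ".mzml", ".d"):
        sibling = candidate.with_suffix(ext)
        if sibling.exists():
            return sibling.resolve()
    raise FileNotFoundError(f"Could not resolve mass spectrometry file for: {ms_path}")
```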
This is a major PR including:
Big refactoring on mzml_statistics:
- pa.RecordBatch.from_pylist
- MzMLFile().transform(file_name.encode(), consumer) (a rough sketch of this streaming pattern follows)
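A rough sketch of the streaming pattern mentioned in the second bullet, using a custom consumer with pyopenms (the real consumer in mzml_statistics.py computes full statistics; this toy version only counts MS levels, and the file name is a placeholder):

```python
import pyopenms as oms

class SpectrumCountConsumer:
    """Toy streaming consumer: spectra are handled one by one, never all in memory."""

    def __init__(self):
        self.ms_levels = {}

    def setExperimentalSettings(self, settings):
        pass

    def setExpectedSize(self, n_spectra, n_chromatograms):
        pass

    def consumeChromatogram(self, chromatogram):
        pass

    def consumeSpectrum(self, spectrum):
        level = spectrum.getMSLevel()
        self.ms_levels[level] = self.ms_levels.get(level, 0) + 1

consumer = SpectrumCountConsumer()
oms.MzMLFile().transform(b"BSA1_F1.mzML", consumer)  # note: file name passed as bytes
print(consumer.ms_levels)
```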
mzml_statisticsoutputs. I think would be good to create an issue on these components:Any other idea @jpfeuffer @daichengxin.
Summary by CodeRabbit
- Removed ms2rescore functionality from the CLI and documentation, streamlining user experience.
- Updated the pyarrow dependency across multiple configuration files to ensure compatibility.