Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix ruff lint errors (unused imports, import ordering)
- Apply ruff formatting across all source and test files
- Fix mypy type errors in validators, stats, and pride modules
- Add types-requests to dev dependencies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests the full read -> validate -> stats pipeline for all four supported formats (Olink CSV, Parquet, SomaScan ADAT, SomaScan CSV). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review Summary by Qodo

Initial release of pyprideap library for PRIDE Affinity Proteomics data
Walkthrough

Description:
- Initial library release with complete project scaffolding
- Multi-format readers for Olink (CSV, Parquet, XLSX) and SomaScan (ADAT, CSV)
- Comprehensive validation framework with platform-specific validators
- Statistics computation and PRIDE API client integration
- Full test suite with CI/CD pipeline and documentation

Diagram:

```mermaid
flowchart LR
    A["Input Files<br/>CSV/Parquet/XLSX/ADAT"] -->|Auto-detect| B["Registry<br/>Format Detection"]
    B -->|Route| C["Platform Readers<br/>Olink/SomaScan"]
    C -->|Parse| D["AffinityDataset<br/>Core Data Model"]
    D -->|Validate| E["Validators<br/>Schema/QC/Range"]
    D -->|Analyze| F["Statistics<br/>Detection Rate/QC"]
    D -->|Fetch| G["PRIDE Client<br/>Project Metadata"]
    E -->|Results| H["ValidationResult<br/>Error/Warning/Info"]
    F -->|Output| I["DatasetStats<br/>Summary"]
```
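As a rough, runnable sketch of the Parse step in the flow above (toy values only; the long-format column names `SampleID`, `OlinkID`, and `NPX` follow the Olink conventions used elsewhere in this review), long-format rows are pivoted into a samples × features expression matrix, mirroring the `pivot_table` calls shown in the reader diffs later in the thread:

```python
import pandas as pd

# Toy long-format rows resembling an Olink NPX export (illustrative values only).
df = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2", "S2"],
    "OlinkID": ["OID1", "OID2", "OID1", "OID2"],
    "NPX": [1.5, 2.0, 1.1, 2.4],
})

# One row per sample, one column per feature.
expression = df.pivot_table(index="SampleID", columns="OlinkID", values="NPX", aggfunc="first")
print(expression.shape)  # (2, 2)
```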
File Changes

1. pyproject.toml
Code Review by Qodo
📝 Walkthrough

Adds a new pyprideap package surface: core datamodels, many format readers and registry, LOD analysis, filtering, stats, validators, QC compute/render/report, a PRIDE API client, top-level exports, CI config, tests, and test fixtures.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(200,200,255,0.5)
        participant User
        participant Registry as readers.registry.read()
        participant Detect as readers.registry.detect_format()
        participant Reader as Reader (CSV/Parquet/ADAT/XLSX)
    end
    rect rgba(200,255,200,0.5)
        participant Dataset as AffinityDataset
        participant Validator as validators.validate()
        participant LOD as lod.compute_lod_stats()
        participant Stats as stats.compute_stats()
    end
    User->>Registry: read(path)
    Registry->>Detect: detect_format(path)
    Detect-->>Registry: format_type
    Registry->>Reader: call reader_func(path)
    Reader->>Reader: parse file, assemble samples/features/expression
    Reader-->>Dataset: return AffinityDataset
    Registry-->>User: dataset
    User->>Validator: validate(dataset)
    Validator->>Validator: run platform checks
    Validator-->>User: list[ValidationResult]
    User->>LOD: compute_lod_stats(dataset)
    LOD->>LOD: derive LODs & per-sample rates
    LOD-->>User: LodStats
    User->>Stats: compute_stats(dataset)
    Stats->>Stats: aggregate metrics
    Stats-->>User: DatasetStats
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 7
🧹 Nitpick comments (12)
pyproject.toml (1)
5-18: Consider adding project metadata for PyPI publication.

The core configuration looks solid. For eventual PyPI publication, consider adding author information, classifiers, and project URLs:

```toml
authors = [
    {name = "Your Name", email = "email@example.com"}
]
classifiers = [
    "Development Status :: 3 - Alpha",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
]

[project.urls]
Homepage = "https://github.com/PRIDE-Archive/pyprideap"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 5 - 18, Add PyPI metadata under the existing [project] table: populate authors (e.g., authors = [{name = "Your Name", email = "you@example.com"}]), add a classifiers array (include Development Status and supported Python versions like "Programming Language :: Python :: 3.10"/3.11/3.12), and add a [project.urls] section with useful links (Homepage, Source, Documentation) so the package metadata for pyprideap is complete for PyPI publication; update the version/description if needed but keep the existing dependencies intact.

src/pyprideap/pride.py (1)
11-15: Consider adding context manager support for proper session cleanup.

The `requests.Session` is created but never explicitly closed, which could lead to resource leaks in long-running applications.

💡 Optional: Add context manager support

```diff
 class PrideClient:
     def __init__(self, base_url: str = _BASE_URL, timeout: int = _DEFAULT_TIMEOUT):
         self.base_url = base_url
         self.timeout = timeout
         self._session = requests.Session()
+
+    def close(self) -> None:
+        self._session.close()
+
+    def __enter__(self) -> "PrideClient":
+        return self
+
+    def __exit__(self, *args: object) -> None:
+        self.close()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/pride.py` around lines 11 - 15, The PrideClient currently creates a requests.Session in __init__ and never closes it; add explicit session cleanup by implementing context manager support (define __enter__ to return self and __exit__ to call self.close()) plus a close() method that calls self._session.close(); ensure existing callers can also call close() directly and consider adding a __del__ that calls close() as a fallback. Reference the PrideClient class, its __init__, the self._session attribute, and add methods __enter__, __exit__, and close to manage lifecycle.

.github/workflows/ci.yml (2)
10-19: Consider adding pip caching to speed up CI.

Adding pip caching can significantly reduce CI time:
💡 Optional: Add caching to workflow
```diff
 - uses: actions/setup-python@v5
   with:
     python-version: "3.12"
+    cache: 'pip'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 10 - 19, Add pip caching to the "lint" job by caching pip's wheel/cache directory so subsequent runs skip re-downloading ruff; modify the steps around uses: actions/setup-python@v5 and the run: pip install ruff command to include an actions/cache step (e.g., cache key based on runner, python-version and a static ruff-version or requirements hash) that caches ~/.cache/pip (or the platform-appropriate pip cache path) before installing, and restore that cache so the existing run: pip install ruff becomes fast and CI time is reduced.
23-32: Consider adding `fail-fast: false` to ensure all Python versions complete their tests.

This allows each version to run independently, so you can see test results across all Python versions even if one fails:
Example change
```diff
 strategy:
+  fail-fast: false
   matrix:
     python-version: ["3.10", "3.11", "3.12", "3.13"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 23 - 32, Update the GitHub Actions job strategy to disable fast-fail so all matrix entries run: under the existing `strategy` -> `matrix` configuration referencing `python-version`, add `fail-fast: false` (i.e., ensure `strategy` contains `matrix:` and `fail-fast: false`) so tests for "python-version" 3.10–3.13 all complete even if one fails.

src/pyprideap/validators/olink_target.py (1)
7-20: Consider simplifying rule prefix handling.

The `_rule_prefix` class attribute is only used within `validate()`. Since the replacement logic is tightly coupled to this single method, it could be simplified.

However, the current approach works correctly and allows for potential subclass customization. The implementation properly rewrites all rule strings from the parent validator.
♻️ Optional: Simplify if subclass customization is not needed
```diff
 class OlinkTargetValidator(OlinkExploreValidator):
-    _rule_prefix = "olink_target"
-
     def validate(self, dataset: AffinityDataset) -> list[ValidationResult]:
         results = super().validate(dataset)
         return [
             ValidationResult(
                 level=r.level,
-                rule=r.rule.replace("olink.", f"{self._rule_prefix}.", 1),
+                rule=r.rule.replace("olink.", "olink_target.", 1),
                 message=r.message,
                 details=r.details,
             )
             for r in results
         ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/validators/olink_target.py` around lines 7 - 20, The class-level attribute _rule_prefix is only used inside OlinkTargetValidator.validate; simplify by removing the _rule_prefix attribute and use a local variable or inline string in validate instead: update OlinkTargetValidator to delete the _rule_prefix declaration and change the replace call in validate to r.rule.replace("olink.", "olink_target.", 1) (or assign prefix = "olink_target" at the top of validate and use f"{prefix}.") while keeping the existing mapping logic that constructs ValidationResult from super().validate(dataset).

tests/test_validators.py (1)
147-176: Consider adding a test to verify rule prefix rewriting.

The tests validate that OlinkTargetValidator inherits validation behavior from OlinkExploreValidator, but they don't explicitly verify that rule names are rewritten from the `olink.` to the `olink_target.` prefix.

💚 Optional: Add test for rule prefix
```python
def test_rule_prefix_rewriting(self):
    ds = _make_olink_target_dataset(
        features=pd.DataFrame({"OlinkID": ["OID1"]}),
    )
    results = OlinkTargetValidator().validate(ds)
    errors = [r for r in results if r.level == Level.ERROR]
    # Verify rules use olink_target prefix, not olink
    assert all(r.rule.startswith("olink_target.") for r in errors)
    assert not any(r.rule.startswith("olink.") for r in errors)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_validators.py` around lines 147 - 176, Add a unit test that ensures OlinkTargetValidator rewrites inherited rule names from the olink. prefix to olink_target.; create a new test method (e.g., test_rule_prefix_rewriting) inside TestOlinkTargetValidator that builds a minimal dataset via _make_olink_target_dataset, runs OlinkTargetValidator().validate(ds), filters error-level results using Level, and asserts that every error.rule starts with "olink_target." and that none start with "olink."; this verifies the prefix rewriting behavior for inherited rules.

tests/test_registry.py (1)
35-52: Consider adding a test for reading SomaScan CSV files.

`TestDetectFormat` includes `test_somascan_csv_detected`, but `TestAutoRead` lacks a corresponding `test_read_somascan_csv`. For consistency and completeness, consider adding this test case to verify the auto-read functionality for SomaScan CSV format.

💡 Suggested test addition
```diff
 def test_read_xlsx(self, olink_xlsx_path):
     ds = read(olink_xlsx_path)
     assert isinstance(ds, AffinityDataset)
     assert ds.platform == Platform.OLINK_EXPLORE
+
+def test_read_somascan_csv(self, somascan_csv_path):
+    ds = read(somascan_csv_path)
+    assert isinstance(ds, AffinityDataset)
+    assert ds.platform == Platform.SOMASCAN
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_registry.py` around lines 35 - 52, Add a new test method test_read_somascan_csv to the TestAutoRead class that calls read(somascan_csv_path), asserts the returned object is the expected dataset type (e.g., AffinityDataset if applicable) and that ds.platform == Platform.SOMASCAN; reference the existing read() call pattern and the somascan_csv_path fixture used in TestDetectFormat to mirror other tests (e.g., test_read_somascan_adat) for consistency.

tests/test_pride.py (1)
36-43: Consider using a more specific exception type in the mock.

The test mocks `raise_for_status.side_effect` with a generic `Exception("404 Not Found")`. The `requests` library raises `requests.exceptions.HTTPError` for failed status codes. Using the correct exception type would make the test more realistic.

💡 Suggested improvement
```diff
+from requests.exceptions import HTTPError
+
 def test_get_project_not_found(self):
     mock_response = MagicMock()
     mock_response.status_code = 404
-    mock_response.raise_for_status.side_effect = Exception("404 Not Found")
+    mock_response.raise_for_status.side_effect = HTTPError("404 Not Found")
     with patch("pyprideap.pride.requests.Session.get", return_value=mock_response):
         client = PrideClient()
-        with pytest.raises(Exception, match="404"):
+        with pytest.raises(HTTPError, match="404"):
             client.get_project("PAD999999")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_pride.py` around lines 36 - 43, The test test_get_project_not_found should mock the same exception type Requests would raise: replace the generic Exception thrown from mock_response.raise_for_status.side_effect with requests.exceptions.HTTPError so the behavior matches real requests behavior; update the test to import requests (or reference requests.exceptions.HTTPError) and set mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError("404 Not Found"), keeping the patch on pyprideap.pride.requests.Session.get and the existing pytest.raises assertion targeting the HTTPError message.

src/pyprideap/readers/olink_csv.py (2)
18-32: Add validation for required columns before processing.

The code assumes columns `SampleID`, `OlinkID`, and `NPX` exist without validation. If any are missing, `pivot_table` or `drop_duplicates` will raise cryptic errors. Adding upfront validation would provide clearer error messages.

💡 Suggested validation
```diff
 df = pd.read_csv(path)
+required_cols = {"SampleID", "OlinkID", "NPX"}
+missing = required_cols - set(df.columns)
+if missing:
+    raise ValueError(f"Missing required columns: {missing}")
+
 sample_cols = [c for c in df.columns if c in _SAMPLE_COLS]
 samples = df[sample_cols].drop_duplicates(subset=["SampleID"]).reset_index(drop=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/olink_csv.py` around lines 18 - 32, After reading the CSV into df, validate that the required columns "SampleID", "OlinkID", and "NPX" are present before any downstream operations (before computing sample_cols/feature_cols or calling df.pivot_table). If any are missing, raise a clear ValueError listing the missing columns so callers get an actionable message; place this check immediately after df = pd.read_csv(path) and before usage of sample_cols, feature_cols, samples, features, or expression.
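A runnable version of the upfront column check suggested above, using a toy DataFrame (the required column set comes from the comment; the data are invented):

```python
import pandas as pd

# Toy frame that is deliberately missing the NPX column.
df = pd.DataFrame({"SampleID": ["S1"], "OlinkID": ["OID1"]})

required_cols = {"SampleID", "OlinkID", "NPX"}
missing = required_cols - set(df.columns)
if missing:
    # In the reader this would instead raise:
    # raise ValueError(f"Missing required columns: {missing}")
    print(f"Missing required columns: {sorted(missing)}")  # Missing required columns: ['NPX']
```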
26-31: Consider logging or warning when `aggfunc="first"` silently resolves duplicates.

Using `aggfunc="first"` in `pivot_table` silently picks the first value when there are duplicate `(SampleID, OlinkID)` combinations. This could mask data quality issues. Consider either validating uniqueness beforehand or logging when duplicates are encountered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/olink_csv.py` around lines 26 - 31, The pivot uses df.pivot_table(..., aggfunc="first") which silently drops duplicate (SampleID, OlinkID) rows; before calling df.pivot_table (the expression variable), detect duplicates with df.duplicated(subset=["SampleID","OlinkID"], keep=False), and if any are found emit a warning/log (include count and a small sample of offending rows) or optionally raise; then either deduplicate explicitly (e.g., groupby + aggregate) or proceed to pivot; update the code around df.pivot_table to perform this check and logging so duplicates are visible instead of silently ignored.

tests/test_integration.py (1)
6-12: Consider adding assertions on validation results content.

The test checks that `results` is a list but doesn't verify its contents. Integration tests would be more valuable if they asserted on expected validation outcomes (e.g., no errors for valid test data, or specific expected warnings).

💡 Example enhancement
```diff
 results = pyprideap.validate(dataset)
 stats = pyprideap.compute_stats(dataset)
 assert stats.n_samples > 0
 assert stats.n_features > 0
 assert isinstance(results, list)
+# Verify no critical validation errors for valid test data
+from pyprideap.core import Level
+critical_errors = [r for r in results if r.level == Level.ERROR]
+assert len(critical_errors) == 0, f"Unexpected validation errors: {critical_errors}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_integration.py` around lines 6 - 12, The test currently only asserts that results is a list; enhance it to verify expected validation outcomes by checking that results contains no errors for this valid fixture: after calling pyprideap.validate(dataset) assert either that results is empty or more robustly assert that no entry in results has severity 'error' (e.g., assert not any(r.get('severity') == 'error' for r in results)). Also consider adding an assertion on results structure (e.g., each item has keys like 'message' or 'severity') to ensure the validation output format from pyprideap.validate remains stable.

src/pyprideap/readers/registry.py (1)
37-44: Potential fall-through issue: unrecognized CSV files silently fail.

If a `.csv` file doesn't match SomaScan or Olink marker columns, the function falls through to the `.xlsx` check and eventually raises `ValueError`. However, this path is unclear from the code structure. Consider adding an explicit `else` or comment to clarify this intentional fall-through behavior.

💡 Suggested clarification
```diff
 if suffix == ".csv":
     df_head = pd.read_csv(path, nrows=1)
     cols = set(df_head.columns)
     has_seqid_cols = any(c.startswith("SeqId.") for c in cols)
     if has_seqid_cols or _SOMASCAN_MARKER_COLS.intersection(cols):
         return "somascan_csv"
     if _OLINK_MARKER_COLS.intersection(cols):
         return "olink_csv"
+    # Fall through to raise ValueError if no markers matched
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/registry.py` around lines 37 - 44, The CSV branch currently inspects df_head and checks _SOMASCAN_MARKER_COLS and _OLINK_MARKER_COLS but then silently falls through if no match; make this explicit by either adding an else to return a sentinel (e.g., None) or, to preserve current behavior, add a clear comment stating the fall-through is intentional so later .xlsx handling/ValueError will apply; update the block around suffix, pd.read_csv, df_head, has_seqid_cols, _SOMASCAN_MARKER_COLS and _OLINK_MARKER_COLS to include that explicit else or comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/pyprideap/__init__.py`:
- Around line 3-11: Wrap the importlib.metadata.version call in a try/except to
avoid raising PackageNotFoundError when the package metadata is missing: import
PackageNotFoundError from importlib.metadata and replace the direct assignment
to __version__ with a guarded assignment that sets __version__ =
version("pyprideap") inside the try block and sets a safe fallback (e.g.,
"0.0.0" or None) in the except PackageNotFoundError block; update the top-level
symbols (__version__, version, PackageNotFoundError) so importing pyprideap from
a source tree doesn't fail.
In `@src/pyprideap/pride.py`:
- Around line 37-40: The loop that sets urls[name] uses loc["value"] without
checking the key, which can raise KeyError; update the block that iterates over
locations (the for loc in locations loop) to safely access the value by either
checking "value" in loc or using loc.get("value") and only assign urls[name]
when a non-None value is present (and keep the existing break behavior after
assignment).
In `@src/pyprideap/readers/olink_parquet.py`:
- Around line 23-32: The expression matrix created by pivot_table is left with
columns in pivot order, causing mismatch with features; after building features
= df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True),
reindex the expression's columns to match features["OlinkID"] order (e.g. use
expression.reindex(columns=features["OlinkID"].values) to align columns with
the features DataFrame) and then reset_index(drop=True) as done for the sample
axis so that expression columns and features["OlinkID"] have the same order and
indexing.
In `@src/pyprideap/readers/olink_xlsx.py`:
- Around line 23-27: The pivot_table call sorts OlinkID columns and can misalign
with the first-seen feature order in feature_cols/features; to fix, after
building expression (pivot_table and reindexing rows) explicitly reindex its
columns to match the feature order from features["OlinkID"] (or the
feature_cols-derived OlinkID order) so that expression.columns align with
features rows (use expression.reindex(columns=features["OlinkID"].values) or
equivalent) before returning/resetting index.
In `@src/pyprideap/readers/somascan_csv.py`:
- Around line 17-24: The features DataFrame only contains "SeqId" which breaks
SomaScanValidator expecting {"SeqId","UniProt","Target","Dilution"} and the
expression columns keep the "SeqId." prefix while seq_ids are stripped; update
the logic around seq_cols/seq_ids/features to (1) populate features with the
required columns by extracting UniProt/Target/Dilution from the original df if
present or else adding these columns with NaN/None defaults, and (2) make
expression column names and features.SeqId consistent by either stripping the
"SeqId." prefix from expression column names or keeping the prefix in
features.SeqId so that seq_cols, seq_ids, expression, and features all refer to
the same identifiers (see symbols: seq_cols, seq_ids, expression, features,
SomaScanValidator).
In `@tests/create_fixtures.py`:
- Around line 30-46: The test fixture function create_olink_xlsx is missing the
PlateQC column required by the olink xlsx reader's _SAMPLE_COLS; update the
DataFrame constructed in create_olink_xlsx to include a "PlateQC" column (e.g.,
values like "PASS" or appropriate per-row QC statuses) so the produced
tests/data/olink_sample.xlsx contains PlateQC for each row and matches the
reader expectation in src/pyprideap/readers/olink_xlsx.py.
---
Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 10-19: Add pip caching to the "lint" job by caching pip's
wheel/cache directory so subsequent runs skip re-downloading ruff; modify the
steps around uses: actions/setup-python@v5 and the run: pip install ruff command
to include an actions/cache step (e.g., cache key based on runner,
python-version and a static ruff-version or requirements hash) that caches
~/.cache/pip (or the platform-appropriate pip cache path) before installing, and
restore that cache so the existing run: pip install ruff becomes fast and CI
time is reduced.
- Around line 23-32: Update the GitHub Actions job strategy to disable fast-fail
so all matrix entries run: under the existing `strategy` -> `matrix`
configuration referencing `python-version`, add `fail-fast: false` (i.e., ensure
`strategy` contains `matrix:` and `fail-fast: false`) so tests for
"python-version" 3.10–3.13 all complete even if one fails.
In `@pyproject.toml`:
- Around line 5-18: Add PyPI metadata under the existing [project] table:
populate authors (e.g., authors = [{name = "Your Name", email =
"you@example.com"}]), add a classifiers array (include Development Status and
supported Python versions like "Programming Language :: Python ::
3.10"/3.11/3.12), and add a [project.urls] section with useful links (Homepage,
Source, Documentation) so the package metadata for pyprideap is complete for
PyPI publication; update the version/description if needed but keep the existing
dependencies intact.
In `@src/pyprideap/pride.py`:
- Around line 11-15: The PrideClient currently creates a requests.Session in
__init__ and never closes it; add explicit session cleanup by implementing
context manager support (define __enter__ to return self and __exit__ to call
self.close()) plus a close() method that calls self._session.close(); ensure
existing callers can also call close() directly and consider adding a __del__
that calls close() as a fallback. Reference the PrideClient class, its __init__,
the self._session attribute, and add methods __enter__, __exit__, and close to
manage lifecycle.
In `@src/pyprideap/readers/olink_csv.py`:
- Around line 18-32: After reading the CSV into df, validate that the required
columns "SampleID", "OlinkID", and "NPX" are present before any downstream
operations (before computing sample_cols/feature_cols or calling
df.pivot_table). If any are missing, raise a clear ValueError listing the
missing columns so callers get an actionable message; place this check
immediately after df = pd.read_csv(path) and before usage of sample_cols,
feature_cols, samples, features, or expression.
- Around line 26-31: The pivot uses df.pivot_table(..., aggfunc="first") which
silently drops duplicate (SampleID, OlinkID) rows; before calling df.pivot_table
(the expression variable), detect duplicates with
df.duplicated(subset=["SampleID","OlinkID"], keep=False), and if any are found
emit a warning/log (include count and a small sample of offending rows) or
optionally raise; then either deduplicate explicitly (e.g., groupby + aggregate)
or proceed to pivot—update the code around df.pivot_table to perform this check
and logging so duplicates are visible instead of silently ignored.
In `@src/pyprideap/readers/registry.py`:
- Around line 37-44: The CSV branch currently inspects df_head and checks
_SOMASCAN_MARKER_COLS and _OLINK_MARKER_COLS but then silently falls through if
no match; make this explicit by either adding an else to return a sentinel
(e.g., None) or, to preserve current behavior, add a clear comment stating the
fall-through is intentional so later .xlsx handling/ValueError will apply;
update the block around suffix, pd.read_csv, df_head, has_seqid_cols,
_SOMASCAN_MARKER_COLS and _OLINK_MARKER_COLS to include that explicit else or
comment.
In `@src/pyprideap/validators/olink_target.py`:
- Around line 7-20: The class-level attribute _rule_prefix is only used inside
OlinkTargetValidator.validate; simplify by removing the _rule_prefix attribute
and use a local variable or inline string in validate instead: update
OlinkTargetValidator to delete the _rule_prefix declaration and change the
replace call in validate to r.rule.replace("olink.", "olink_target.", 1) (or
assign prefix = "olink_target" at the top of validate and use f"{prefix}.")
while keeping the existing mapping logic that constructs ValidationResult from
super().validate(dataset).
In `@tests/test_integration.py`:
- Around line 6-12: The test currently only asserts that results is a list;
enhance it to verify expected validation outcomes by checking that results
contains no errors for this valid fixture: after calling
pyprideap.validate(dataset) assert either that results is empty or more robustly
assert that no entry in results has severity 'error' (e.g., assert not
any(r.get('severity') == 'error' for r in results)). Also consider adding an
assertion on results structure (e.g., each item has keys like 'message' or
'severity') to ensure the validation output format from pyprideap.validate
remains stable.
In `@tests/test_pride.py`:
- Around line 36-43: The test test_get_project_not_found should mock the same
exception type Requests would raise: replace the generic Exception thrown from
mock_response.raise_for_status.side_effect with requests.exceptions.HTTPError so
the behavior matches real requests behavior; update the test to import requests
(or reference requests.exceptions.HTTPError) and set
mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError("404
Not Found"), keeping the patch on pyprideap.pride.requests.Session.get and the
existing pytest.raises assertion targeting the HTTPError message.
In `@tests/test_registry.py`:
- Around line 35-52: Add a new test method test_read_somascan_csv to the
TestAutoRead class that calls read(somascan_csv_path), asserts the returned
object is the expected dataset type (e.g., AffinityDataset if applicable) and
that ds.platform == Platform.SOMASCAN; reference the existing read() call
pattern and the somascan_csv_path fixture used in TestDetectFormat to mirror
other tests (e.g., test_read_somascan_adat) for consistency.
In `@tests/test_validators.py`:
- Around line 147-176: Add a unit test that ensures OlinkTargetValidator
rewrites inherited rule names from the olink. prefix to olink_target.; create a
new test method (e.g., test_rule_prefix_rewriting) inside
TestOlinkTargetValidator that builds a minimal dataset via
_make_olink_target_dataset, runs OlinkTargetValidator().validate(ds), filters
error-level results using Level, and asserts that every error.rule starts with
"olink_target." and that none start with "olink."; this verifies the prefix
rewriting behavior for inherited rules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 348a20d9-ebc1-4e1a-a5c1-1d90108116cd
⛔ Files ignored due to path filters (4)
- `tests/data/olink_sample.npx.csv` is excluded by `!**/*.csv`
- `tests/data/olink_sample.parquet` is excluded by `!**/*.parquet`
- `tests/data/olink_sample.xlsx` is excluded by `!**/*.xlsx`
- `tests/data/somascan_sample.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (31)
- .github/workflows/ci.yml
- .gitignore
- README.md
- pyproject.toml
- src/pyprideap/__init__.py
- src/pyprideap/core.py
- src/pyprideap/pride.py
- src/pyprideap/readers/__init__.py
- src/pyprideap/readers/olink_csv.py
- src/pyprideap/readers/olink_parquet.py
- src/pyprideap/readers/olink_xlsx.py
- src/pyprideap/readers/registry.py
- src/pyprideap/readers/somascan_adat.py
- src/pyprideap/readers/somascan_csv.py
- src/pyprideap/stats.py
- src/pyprideap/validators/__init__.py
- src/pyprideap/validators/base.py
- src/pyprideap/validators/olink_explore.py
- src/pyprideap/validators/olink_target.py
- src/pyprideap/validators/somascan.py
- tests/__init__.py
- tests/conftest.py
- tests/create_fixtures.py
- tests/data/somascan_sample.adat
- tests/test_core.py
- tests/test_integration.py
- tests/test_pride.py
- tests/test_readers.py
- tests/test_registry.py
- tests/test_stats.py
- tests/test_validators.py
```python
for loc in locations:
    if loc.get("name") == "FTP Protocol":
        urls[name] = loc["value"]
        break
```
Guard against missing value key to prevent KeyError.
Line 39 accesses `loc["value"]` without verifying the key exists. If the API response lacks a `value` field for an FTP location, this will raise a `KeyError`.
🛡️ Proposed defensive fix
```diff
 for loc in locations:
     if loc.get("name") == "FTP Protocol":
-        urls[name] = loc["value"]
+        if "value" in loc:
+            urls[name] = loc["value"]
         break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/pride.py` around lines 37 - 40, The loop that sets urls[name]
uses loc["value"] without checking the key, which can raise KeyError; update the
block that iterates over locations (the for loc in locations loop) to safely
access the value by either checking "value" in loc or using loc.get("value") and
only assign urls[name] when a non-None value is present (and keep the existing
break behavior after assignment).
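For illustration, the guarded-access pattern the review proposes can be sketched standalone; the `extract_ftp_urls` helper and the payload shape below are assumptions for the demo, not the library's actual API:

```python
def extract_ftp_urls(files: dict) -> dict:
    """Collect FTP URLs per file, skipping entries without a 'value' key."""
    urls = {}
    for name, locations in files.items():
        for loc in locations:
            if loc.get("name") == "FTP Protocol":
                value = loc.get("value")  # .get() avoids KeyError on malformed entries
                if value is not None:
                    urls[name] = value
                break
    return urls

# A location record missing "value" is skipped instead of raising KeyError.
files = {
    "raw.adat": [{"name": "FTP Protocol", "value": "ftp://example.org/raw.adat"}],
    "meta.csv": [{"name": "FTP Protocol"}],  # malformed: no "value"
}
print(extract_ftp_urls(files))  # → {'raw.adat': 'ftp://example.org/raw.adat'}
```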
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/pyprideap/__init__.py (1)
`3-13`: ⚠️ Potential issue | 🟠 Major

Guard `__version__` for source-tree imports.

`version("pyprideap")` raises `PackageNotFoundError` when the package is imported without installed metadata, so `import pyprideap` can fail in local-dev and test setups.

Suggested fix

```diff
-from importlib.metadata import version
+from importlib.metadata import PackageNotFoundError, version
@@
-__version__ = version("pyprideap")
+try:
+    __version__ = version("pyprideap")
+except PackageNotFoundError:
+    __version__ = "0+unknown"
```

🏁 Script executed:

```shell
#!/bin/bash
python - <<'PY'
from pathlib import Path
text = Path("src/pyprideap/__init__.py").read_text()
print("has PackageNotFoundError guard:", "PackageNotFoundError" in text)
print('has direct version() assignment:', '__version__ = version("pyprideap")' in text)
PY

python - <<'PY'
from importlib.metadata import PackageNotFoundError, version
try:
    version("definitely-not-installed-package-123456")
except PackageNotFoundError as exc:
    print(type(exc).__name__)
PY
```
-from importlib.metadata import version +from importlib.metadata import PackageNotFoundError, version @@ -__version__ = version("pyprideap") +try: + __version__ = version("pyprideap") +except PackageNotFoundError: + __version__ = "0+unknown"#!/bin/bash python - <<'PY' from pathlib import Path text = Path("src/pyprideap/__init__.py").read_text() print("has PackageNotFoundError guard:", "PackageNotFoundError" in text) print('has direct version() assignment:', '__version__ = version("pyprideap")' in text) PY python - <<'PY' from importlib.metadata import PackageNotFoundError, version try: version("definitely-not-installed-package-123456") except PackageNotFoundError as exc: print(type(exc).__name__) PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/__init__.py` around lines 3 - 13, The module-level assignment __version__ = version("pyprideap") can raise PackageNotFoundError when running from the source tree; wrap the call to importlib.metadata.version in a try/except that catches importlib.metadata.PackageNotFoundError (or the imported PackageNotFoundError symbol) and set a safe fallback (e.g. "__version__" = "0.0.0" or None) in the except branch so importing pyprideap in local-dev/tests won't fail; update the top-level import/assignment involving version and __version__ accordingly.
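The fallback pattern the fix uses can be exercised on its own; `resolve_version` below is a hypothetical helper written for this sketch, not part of pyprideap:

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "0+unknown") -> str:
    """Return the installed distribution version, or a fallback when metadata is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Happens when the package is imported from a source tree without
        # an (editable) install, e.g. in local-dev or ad-hoc test setups.
        return fallback


# A distribution that is certainly not installed falls back instead of raising.
print(resolve_version("definitely-not-installed-package-123456"))  # → 0+unknown
```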
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/pyprideap/filtering.py`:
- Around line 43-45: The code applies a boolean Series (keep_mask) from
dataset.samples directly to dataset.expression which risks label-aligned
selection if indexes diverge; change those selections to positional indexing
using .iloc with the boolean array (e.g., replace dataset.expression[keep_mask]
with dataset.expression.iloc[keep_mask.to_numpy()] and do the same for any other
occurrence such as the selection at lines referencing keep_mask/samples) and
then call .reset_index(drop=True) as before.
In `@src/pyprideap/lod.py`:
- Around line 56-69: The code currently falls back to any control types when no
negative controls exist; change this to require negative controls only by
setting control_mask = is_negative (remove the fallback to is_control) and then
raise a ValueError if not is_negative.any() or if control_mask.sum() <
_MIN_CONTROLS_FOR_LOD; update the check around
compute_lod_stats()/get_valid_proteins to rely solely on control_mask (negative
samples) and keep the existing error message using _MIN_CONTROLS_FOR_LOD and
control_mask.sum() to report insufficient negatives.
In `@src/pyprideap/validators/olink_explore.py`:
- Around line 65-74: The _check_qc_consistency method currently skips sample
rows when there is no corresponding expression row, allowing misaligned datasets
to pass; update _check_qc_consistency (in class handling AffinityDataset) to
detect a mismatch between ds.samples and ds.expression lengths up front and fail
fast by returning a ValidationResult (or raising a ValidationError) indicating a
fatal row-count mismatch instead of silently continuing; specifically, add a
check comparing len(ds.samples) and len(ds.expression) at the start of
_check_qc_consistency and emit/return a clear error ValidationResult referencing
the dataset and the mismatch so downstream code stops processing misaligned
tables.
---
Duplicate comments:
In `@src/pyprideap/__init__.py`:
- Around line 3-13: The module-level assignment __version__ =
version("pyprideap") can raise PackageNotFoundError when running from the source
tree; wrap the call to importlib.metadata.version in a try/except that catches
importlib.metadata.PackageNotFoundError (or the imported PackageNotFoundError
symbol) and set a safe fallback (e.g. "__version__" = "0.0.0" or None) in the
except branch so importing pyprideap in local-dev/tests won't fail; update the
top-level import/assignment involving version and __version__ accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8f31f39c-4945-440a-94b8-52ffaadf5309
📒 Files selected for processing (6)
- src/pyprideap/__init__.py
- src/pyprideap/filtering.py
- src/pyprideap/lod.py
- src/pyprideap/validators/olink_explore.py
- tests/test_filtering.py
- tests/test_lod.py
```python
keep_mask = ~is_control
samples = dataset.samples[keep_mask].reset_index(drop=True)
expression = dataset.expression[keep_mask].reset_index(drop=True)
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
python - <<'PY'
import pandas as pd
samples = pd.DataFrame({"SampleType": ["Sample", "Control"]}, index=[100, 101])
expression = pd.DataFrame({"OID0": [1.0, 2.0]}, index=[0, 1])
mask = samples["SampleType"].eq("Sample")
print("samples.index =", list(samples.index))
print("expression.index=", list(expression.index))
try:
    expression[mask]
except Exception as exc:
    print(type(exc).__name__, str(exc))
print("positional fix:")
print(expression.loc[mask.to_numpy()])
PY
```

Repository: PRIDE-Archive/pyprideap
Length of output: 185

🏁 Script executed:

```shell
cat -n src/pyprideap/filtering.py | head -80
```

🏁 Script executed:

```shell
cat -n src/pyprideap/core.py | grep -A 20 "class AffinityDataset"
```

🏁 Script executed:

```shell
rg "def samples|def expression" src/pyprideap/core.py -A 5
```

🏁 Script executed:

```shell
rg "AffinityDataset(" src/pyprideap/ -B 2 -A 5 | head -100
```

🏁 Script executed:

```shell
find src/pyprideap -name "*.py" -exec grep -l "samples.*expression" {} \;
```

🏁 Script executed:

```shell
grep -n "\.index\|reset_index\|set_index" src/pyprideap/filtering.py
```

🏁 Script executed:

```shell
rg "read_csv|read_parquet|pd\.DataFrame" src/pyprideap/ -B 1 -A 1 | head -80
```

🏁 Script executed:

```shell
find src/pyprideap -name "*.py" -exec grep -l "def __init__\|class.*:" {} \; | xargs ls -la
```

🏁 Script executed:

```shell
cat -n src/pyprideap/core.py | head -100
```

🏁 Script executed:

```shell
cat -n src/pyprideap/readers/somascan_csv.py
```

🏁 Script executed:

```shell
cat -n src/pyprideap/readers/somascan_adat.py
```

🏁 Script executed:

```shell
cat -n src/pyprideap/readers/olink_csv.py
```

🏁 Script executed:

```shell
cat -n src/pyprideap/readers/olink_parquet.py
```
Use positional masking when subsetting expression.
Lines 45 and 74 apply a boolean Series from dataset.samples directly to dataset.expression. While this currently works because readers ensure matching RangeIndex(0, n) for both DataFrames, the code relies on an undocumented invariant. If the indexes ever diverge, pandas' label-aligned indexing will select wrong rows or raise errors.
Suggested fix

```diff
-    expression = dataset.expression[keep_mask].reset_index(drop=True)
+    expression = dataset.expression.iloc[keep_mask.to_numpy()].reset_index(drop=True)
```

Also applies to: 73-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/filtering.py` around lines 43 - 45, The code applies a boolean
Series (keep_mask) from dataset.samples directly to dataset.expression which
risks label-aligned selection if indexes diverge; change those selections to
positional indexing using .iloc with the boolean array (e.g., replace
dataset.expression[keep_mask] with dataset.expression.iloc[keep_mask.to_numpy()]
and do the same for any other occurrence such as the selection at lines
referencing keep_mask/samples) and then call .reset_index(drop=True) as before.
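The alignment hazard described above is easy to reproduce standalone; the frames and column names below are invented for the demo:

```python
import pandas as pd

# Two frames describing the same rows, but whose indexes have drifted apart.
samples = pd.DataFrame({"SampleType": ["Sample", "Control"]}, index=[100, 101])
expression = pd.DataFrame({"OID0": [1.0, 2.0]}, index=[0, 1])

keep_mask = samples["SampleType"].eq("Sample")

# Label-aligned boolean indexing tries to match the mask's index (100, 101)
# against expression's index (0, 1) and fails outright.
try:
    expression[keep_mask]
    aligned_ok = True
except Exception:
    aligned_ok = False
print("label-aligned selection worked:", aligned_ok)  # → False

# Positional indexing ignores labels and keeps the intended first row.
filtered = expression.iloc[keep_mask.to_numpy()].reset_index(drop=True)
print(len(filtered))  # → 1
```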
src/pyprideap/lod.py
Outdated
```python
# Further narrow to negative controls only
is_negative = (
    dataset.samples["SampleType"]
    .astype(str)
    .str.lower()
    .str.strip()
    .isin({"negative", "negative control", "negative_control", "neg"})
)

control_mask = is_negative if is_negative.any() else is_control
if control_mask.sum() < _MIN_CONTROLS_FOR_LOD:
    raise ValueError(
        f"Need at least {_MIN_CONTROLS_FOR_LOD} control samples for LOD calculation, found {control_mask.sum()}"
    )
```
Require negative controls instead of falling back to any control.
The current fallback computes LOD from positive/QC/reference controls when no negative controls exist. That contradicts the docstring and can materially distort thresholds returned by compute_lod_stats() and get_valid_proteins().
Suggested fix

```diff
-    control_mask = is_negative if is_negative.any() else is_control
-    if control_mask.sum() < _MIN_CONTROLS_FOR_LOD:
+    if not is_negative.any():
+        raise ValueError("Negative control samples required for LOD calculation")
+
+    control_mask = is_negative
+    if control_mask.sum() < _MIN_CONTROLS_FOR_LOD:
         raise ValueError(
             f"Need at least {_MIN_CONTROLS_FOR_LOD} control samples for LOD calculation, found {control_mask.sum()}"
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/lod.py` around lines 56 - 69, The code currently falls back to
any control types when no negative controls exist; change this to require
negative controls only by setting control_mask = is_negative (remove the
fallback to is_control) and then raise a ValueError if not is_negative.any() or
if control_mask.sum() < _MIN_CONTROLS_FOR_LOD; update the check around
compute_lod_stats()/get_valid_proteins to rely solely on control_mask (negative
samples) and keep the existing error message using _MIN_CONTROLS_FOR_LOD and
control_mask.sum() to report insufficient negatives.
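A standalone sketch of the negative-controls-only gating the reviewer proposes; the threshold value of 3 here is illustrative, not the library's actual `_MIN_CONTROLS_FOR_LOD`:

```python
import pandas as pd

_MIN_CONTROLS_FOR_LOD = 3  # illustrative threshold for the demo


def negative_control_mask(samples: pd.DataFrame) -> pd.Series:
    """Boolean mask of negative-control rows; raises when there are too few."""
    is_negative = (
        samples["SampleType"]
        .astype(str)
        .str.lower()
        .str.strip()
        .isin({"negative", "negative control", "negative_control", "neg"})
    )
    # No fallback to other control types: LOD is only meaningful
    # when derived from genuine negative controls.
    if is_negative.sum() < _MIN_CONTROLS_FOR_LOD:
        raise ValueError(
            f"Need at least {_MIN_CONTROLS_FOR_LOD} negative control samples, "
            f"found {int(is_negative.sum())}"
        )
    return is_negative


# "NEG", "negative control", and " Negative " all normalize to negatives;
# "QC Control" does not and is excluded from the mask.
samples = pd.DataFrame(
    {"SampleType": ["Sample", "NEG", "negative control", " Negative ", "QC Control"]}
)
mask = negative_control_mask(samples)
print(int(mask.sum()))  # → 3
```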
```python
def _check_qc_consistency(self, ds: AffinityDataset) -> list[ValidationResult]:
    if "SampleQC" not in ds.samples.columns:
        return []
    results = []
    for idx in range(len(ds.samples)):
        row = ds.samples.iloc[idx]
        qc = row.get("SampleQC")
        if idx >= len(ds.expression):
            continue
        expr_row = ds.expression.iloc[idx]
```
Fail fast on sample/expression row-count mismatches.
Line 72 silently skips samples that have no corresponding expression row. That turns a corrupted dataset into a partially validated one and leaves downstream filtering/stats code operating on misaligned tables.
Suggested fix

```diff
 def _check_qc_consistency(self, ds: AffinityDataset) -> list[ValidationResult]:
     if "SampleQC" not in ds.samples.columns:
         return []
+    if len(ds.samples) != len(ds.expression):
+        return [
+            ValidationResult(
+                level=Level.ERROR,
+                rule="olink.expression.sample_count_mismatch",
+                message="Sample table and expression matrix must have the same number of rows",
+                details={
+                    "n_samples": len(ds.samples),
+                    "n_expression_rows": len(ds.expression),
+                },
+            )
+        ]
     results = []
     for idx in range(len(ds.samples)):
         row = ds.samples.iloc[idx]
         qc = row.get("SampleQC")
-        if idx >= len(ds.expression):
-            continue
         expr_row = ds.expression.iloc[idx]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _check_qc_consistency(self, ds: AffinityDataset) -> list[ValidationResult]:
    if "SampleQC" not in ds.samples.columns:
        return []
    if len(ds.samples) != len(ds.expression):
        return [
            ValidationResult(
                level=Level.ERROR,
                rule="olink.expression.sample_count_mismatch",
                message="Sample table and expression matrix must have the same number of rows",
                details={
                    "n_samples": len(ds.samples),
                    "n_expression_rows": len(ds.expression),
                },
            )
        ]
    results = []
    for idx in range(len(ds.samples)):
        row = ds.samples.iloc[idx]
        qc = row.get("SampleQC")
        expr_row = ds.expression.iloc[idx]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/validators/olink_explore.py` around lines 65 - 74, The
_check_qc_consistency method currently skips sample rows when there is no
corresponding expression row, allowing misaligned datasets to pass; update
_check_qc_consistency (in class handling AffinityDataset) to detect a mismatch
between ds.samples and ds.expression lengths up front and fail fast by returning
a ValidationResult (or raising a ValidationError) indicating a fatal row-count
mismatch instead of silently continuing; specifically, add a check comparing
len(ds.samples) and len(ds.expression) at the start of _check_qc_consistency and
emit/return a clear error ValidationResult referencing the dataset and the
mismatch so downstream code stops processing misaligned tables.
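The fail-fast idea can be sketched without the library's `ValidationResult` machinery; the helper name and plain-string error records below are stand-ins for the demo:

```python
import pandas as pd


def check_row_alignment(samples: pd.DataFrame, expression: pd.DataFrame) -> list[str]:
    """Return an error record when row counts diverge, instead of silently skipping."""
    if len(samples) != len(expression):
        return [
            f"row-count mismatch: {len(samples)} sample rows vs "
            f"{len(expression)} expression rows"
        ]
    return []


samples = pd.DataFrame({"SampleQC": ["PASS", "FAIL", "PASS"]})
expression = pd.DataFrame({"OID1": [1.0, 2.0]})  # one row short: corrupted input

errors = check_row_alignment(samples, expression)
print(errors[0])  # → row-count mismatch: 3 sample rows vs 2 expression rows
```

Surfacing the mismatch as a single fatal error stops downstream filtering/stats code from operating on misaligned tables.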
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nalysis Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, compute_cv_distribution, compute_detection_rate Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
♻️ Duplicate comments (3)
src/pyprideap/readers/somascan_csv.py (1)
`24-30`: ⚠️ Potential issue | 🟠 Major

Still unresolved: use the same `SeqId` values in `features` and `expression.columns`.

`features["SeqId"]` strips the `SeqId.` prefix, but the expression matrix keeps the prefixed names. Any downstream lookup keyed by `SeqId` will miss even though both structures describe the same assays.

Suggested fix

```diff
-    expression = df[seq_cols].astype(float).reset_index(drop=True)
-    seq_ids = [c.replace("SeqId.", "") for c in seq_cols]
+    expression = df[seq_cols].astype(float).reset_index(drop=True)
+    expression.columns = seq_ids
     features = pd.DataFrame(
         {
             "SeqId": seq_ids,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/somascan_csv.py` around lines 24 - 30, The features DataFrame is created with SeqId values that have the "SeqId." prefix stripped (seq_ids) while the expression DataFrame columns remain as the original prefixed names (expression.columns from seq_cols), causing mismatches; fix by making the two representations identical — either keep the prefix when building features (set features["SeqId"] to seq_cols or to the same strings used in expression.columns) or normalize expression.columns to seq_ids so that expression.columns and features["SeqId"] match (update where seq_ids, seq_cols, features, and expression are handled).

src/pyprideap/readers/olink_parquet.py (1)
`27-36`: ⚠️ Potential issue | 🟠 Major

Still unresolved: reindex the feature axis before returning the dataset.

This has the same silent misalignment risk as the XLSX reader: `features` keeps first-seen `OlinkID` order, while the pivoted matrix can use a different column order. Because `AffinityDataset` does not validate alignment, callers can get the wrong metadata for each expression column.

Suggested fix

```diff
 feature_cols = [c for c in df.columns if c in _FEATURE_COLS]
 features = df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)
 expression = df.pivot_table(
     index="SampleID",
     columns="OlinkID",
     values="NPX",
     aggfunc="first",
 )
-expression = expression.reindex(samples["SampleID"].values).reset_index(drop=True)
+expression = expression.reindex(
+    index=samples["SampleID"].values,
+    columns=features["OlinkID"].values,
+).reset_index(drop=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/olink_parquet.py` around lines 27 - 36, features and the pivoted expression can be misaligned because features is built from first-seen OlinkID order while expression.columns may be different; after creating expression (the pivot_table) reorder/reindex the features DataFrame to match expression.columns (use expression.columns / OlinkID values) so that the features rows align with expression columns before returning or passing into AffinityDataset (e.g., set features = features.set_index("OlinkID").reindex(expression.columns).reset_index()).

src/pyprideap/readers/olink_xlsx.py (1)
`27-31`: ⚠️ Potential issue | 🟠 Major

Still unresolved: keep the feature axis aligned with `features`.

`features` preserves first-seen `OlinkID` order, but `pivot_table()` can return columns in a different order. On unsorted XLSX input, `features.iloc[i]` no longer describes `expression.iloc[:, i]`, and `AffinityDataset` will pass that mismatch through unchecked.

Suggested fix

```diff
 feature_cols = [c for c in df.columns if c in _FEATURE_COLS]
 features = df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)
 expression = df.pivot_table(index="SampleID", columns="OlinkID", values="NPX", aggfunc="first")
-expression = expression.reindex(samples["SampleID"].values).reset_index(drop=True)
+expression = expression.reindex(
+    index=samples["SampleID"].values,
+    columns=features["OlinkID"].values,
+).reset_index(drop=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/olink_xlsx.py` around lines 27 - 31, The feature axis of expression can be out of order versus features because pivot_table may sort columns differently; after creating features (feature_cols and features = df[feature_cols]...), reindex the columns of expression to the OlinkID order in features (use features["OlinkID"] as the column index ordering) so that expression.iloc[:, i] corresponds to features.iloc[i]; perform this reindexing on expression.columns (before any reset_index that drops column alignment) and handle missing columns by allowing NaNs or filling as appropriate.
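The column-order hazard both reader comments describe can be reproduced in isolation; the tiny long-format frame below is invented for the demo:

```python
import pandas as pd

# Long-format NPX rows where OlinkID first appears out of alphabetical order.
df = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2", "S2"],
    "OlinkID": ["OID2", "OID1", "OID2", "OID1"],
    "NPX": [1.0, 2.0, 3.0, 4.0],
})

features = df[["OlinkID"]].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)
expression = df.pivot_table(index="SampleID", columns="OlinkID", values="NPX", aggfunc="first")

# pivot_table sorts the column labels (["OID1", "OID2"]), while features
# preserves first-seen order (["OID2", "OID1"]) -- a silent misalignment.
print(list(expression.columns), list(features["OlinkID"]))

# Reindexing the columns to the features order restores positional pairing.
expression = expression.reindex(columns=features["OlinkID"].values)
print(list(expression.columns) == list(features["OlinkID"]))  # → True
```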
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 55-56: Add a pytest pythonpath entry under the existing
[tool.pytest.ini_options] so pytest can locate the package in a src/ layout;
specifically, update the [tool.pytest.ini_options] block (the same section that
contains testpaths) to include a pythonpath setting that points to "src" so
tests won't raise ModuleNotFoundError when running pytest directly.
In `@src/pyprideap/readers/olink_csv.py`:
- Around line 27-36: The expression DataFrame columns can be alphabetically
sorted by pivot_table(), causing misalignment with features (built via
feature_cols and features = df[feature_cols].drop_duplicates(subset=["OlinkID"])
), so after creating expression (the pivot_table and reindex by SampleID)
reindex or reorder expression.columns to match the features["OlinkID"] order;
update the olink_csv.py (and analogously olink_xlsx.py and olink_parquet.py) to
set expression = expression.reindex(columns=features["OlinkID"].values) or
otherwise reorder expression.columns to features' OlinkID sequence so positional
pairing is preserved.
In `@src/pyprideap/readers/registry.py`:
- Around line 40-42: detect_format currently classifies a file as "somascan_csv"
when _SOMASCAN_MARKER_COLS is a subset of cols, but read_somascan_csv expects
wide-format files with SeqId.* columns and will fail for long-format inputs;
change the detection logic in detect_format() to only return "somascan_csv" when
has_seqid_cols (i.e., any column startswith "SeqId.") is true, removing the
_SOMASCAN_MARKER_COLS.issubset(cols) branch so only wide-format files are routed
to read_somascan_csv(); if long-format SomaScan CSV support is needed later, add
a separate reader instead of reusing read_somascan_csv().
---
Duplicate comments:
In `@src/pyprideap/readers/olink_parquet.py`:
- Around line 27-36: features and the pivoted expression can be misaligned
because features is built from first-seen OlinkID order while expression.columns
may be different; after creating expression (the pivot_table) reorder/reindex
the features DataFrame to match expression.columns (use expression.columns /
OlinkID values) so that the features rows align with expression columns before
returning or passing into AffinityDataset (e.g., set features =
features.set_index("OlinkID").reindex(expression.columns).reset_index()).
In `@src/pyprideap/readers/olink_xlsx.py`:
- Around line 27-31: The feature axis of expression can be out of order versus
features because pivot_table may sort columns differently; after creating
features (feature_cols and features = df[feature_cols]...), reindex the columns
of expression to the OlinkID order in features (use features["OlinkID"] as the
column index ordering) so that expression.iloc[:, i] corresponds to
features.iloc[i]; perform this reindexing on expression.columns (before any
reset_index that drops column alignment) and handle missing columns by allowing
NaNs or filling as appropriate.
In `@src/pyprideap/readers/somascan_csv.py`:
- Around line 24-30: The features DataFrame is created with SeqId values that
have the "SeqId." prefix stripped (seq_ids) while the expression DataFrame
columns remain as the original prefixed names (expression.columns from
seq_cols), causing mismatches; fix by making the two representations identical —
either keep the prefix when building features (set features["SeqId"] to seq_cols
or to the same strings used in expression.columns) or normalize
expression.columns to seq_ids so that expression.columns and features["SeqId"]
match (update where seq_ids, seq_cols, features, and expression are handled).
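A self-contained sketch of the column-name normalization the somascan_csv comments call for; the frame contents are invented for the demo:

```python
import pandas as pd

# Wide SomaScan-style frame: expression columns carry a "SeqId." prefix.
df = pd.DataFrame({
    "SampleId": ["S1", "S2"],
    "SeqId.10000-28": [1.5, 2.5],
    "SeqId.10001-7": [3.0, 4.0],
})

seq_cols = [c for c in df.columns if c.startswith("SeqId.")]
seq_ids = [c.replace("SeqId.", "") for c in seq_cols]

expression = df[seq_cols].astype(float).reset_index(drop=True)
# Normalize the expression columns to the stripped IDs so they match
# features["SeqId"]; without this, any lookup keyed by SeqId would miss.
expression.columns = seq_ids

features = pd.DataFrame({"SeqId": seq_ids})
print(list(expression.columns) == list(features["SeqId"]))  # → True
```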
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9dc01a85-4667-4b9b-a88d-7ae79f74a129
📒 Files selected for processing (12)
- .github/workflows/ci.yml
- pyproject.toml
- src/pyprideap/__init__.py
- src/pyprideap/pride.py
- src/pyprideap/readers/olink_csv.py
- src/pyprideap/readers/olink_parquet.py
- src/pyprideap/readers/olink_xlsx.py
- src/pyprideap/readers/registry.py
- src/pyprideap/readers/somascan_csv.py
- tests/test_integration.py
- tests/test_registry.py
- tests/test_validators.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/test_validators.py
- .github/workflows/ci.yml
- tests/test_integration.py
- tests/test_registry.py
- src/pyprideap/pride.py
```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
```
Add pythonpath for src-layout compatibility.
With a src/ layout, pytest won't find the pyprideap package without either an editable install or explicit path configuration. This causes ModuleNotFoundError when running pytest directly.
🔧 Proposed fix

```diff
 [tool.pytest.ini_options]
 testpaths = ["tests"]
+pythonpath = ["src"]
```

📝 Committable suggestion
```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 55 - 56, Add a pytest pythonpath entry under the
existing [tool.pytest.ini_options] so pytest can locate the package in a src/
layout; specifically, update the [tool.pytest.ini_options] block (the same
section that contains testpaths) to include a pythonpath setting that points to
"src" so tests won't raise ModuleNotFoundError when running pytest directly.
```python
has_seqid_cols = any(c.startswith("SeqId.") for c in cols)
if has_seqid_cols or _SOMASCAN_MARKER_COLS.issubset(cols):
    return "somascan_csv"
```
Make SomaScan CSV detection match the reader’s actual input contract.
detect_format() accepts any CSV with SeqId and SomaId, but read_somascan_csv() only supports wide files with SeqId.* expression columns. That means read() can classify a file as somascan_csv and then fail immediately with No SeqId.* columns found.
Suggested fix

```diff
-    if has_seqid_cols or _SOMASCAN_MARKER_COLS.issubset(cols):
+    if has_seqid_cols:
         return "somascan_csv"
```

If long-format SomaScan CSV support is intended later, it should get its own reader instead of being routed to the current wide-format implementation.
📝 Committable suggestion
```python
has_seqid_cols = any(c.startswith("SeqId.") for c in cols)
if has_seqid_cols:
    return "somascan_csv"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/readers/registry.py` around lines 40 - 42, detect_format
currently classifies a file as "somascan_csv" when _SOMASCAN_MARKER_COLS is a
subset of cols, but read_somascan_csv expects wide-format files with SeqId.*
columns and will fail for long-format inputs; change the detection logic in
detect_format() to only return "somascan_csv" when has_seqid_cols (i.e., any
column startswith "SeqId.") is true, removing the
_SOMASCAN_MARKER_COLS.issubset(cols) branch so only wide-format files are routed
to read_somascan_csv(); if long-format SomaScan CSV support is needed later, add
a separate reader instead of reusing read_somascan_csv().
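A minimal sketch of the tightened detection rule; the function name and the header lists below are hypothetical, written only to show the contract:

```python
def detect_somascan_csv(cols: list[str]) -> bool:
    """Route to the wide-format SomaScan reader only when SeqId.* columns exist."""
    return any(c.startswith("SeqId.") for c in cols)


# Wide-format header: has SeqId.* expression columns -> accepted.
print(detect_somascan_csv(["SampleId", "SeqId.10000-28", "SeqId.10001-7"]))  # → True

# Long-format header: marker columns only, no SeqId.* -> rejected, since the
# wide-format reader would otherwise fail with "No SeqId.* columns found".
print(detect_somascan_csv(["SeqId", "SomaId", "SampleId", "Value"]))  # → False
```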
…gator Merged PCA, correlation, missing values, CV distribution, and detection rate into compute.py alongside distribution, QC summary, and LOD analysis. Added compute_all() that returns all applicable plot data for a dataset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… public API - qc_report() assembles all computed plot data into a standalone HTML file with Plotly CDN, table of contents, and platform summary header - compute_qc() (alias for compute_all) and qc_report() added to __all__ - Supports both Olink and SomaScan platform reports Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion, missing frequency histogram, QC×LOD stacked bar, PCA with sample labels - Distribution: overlaid per-sample histograms instead of single combined - Missing Frequency: new plot — per-assay missing freq with 30% threshold - QC Summary: stacked bar crossing QC status with above/below LOD - PCA: use SampleQC for color when all same SampleType, show labels on points - Add MissingFrequencyData and QcLodSummaryData dataclasses Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- PCA: set explicit height=500px on all figures to prevent div collapse when using to_html(include_plotlyjs=False) - Missing Frequency: use MissingFreq column from Olink features metadata (per-assay metric from Olink) instead of computing NaN rate from pivoted expression matrix which is always 0% - olink_csv reader: add MissingFreq to extracted feature columns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI Feedback 🧐

(Feedback updated until commit 5cb8d56)

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
Actionable comments posted: 10
♻️ Duplicate comments (1)
src/pyprideap/filtering.py (1)
`43-45`: ⚠️ Potential issue | 🟠 Major

Use positional row filtering for `expression` and `lod_matrix`.

Lines 45, 53, 82, and 90 apply a boolean `Series` built from `dataset.samples` directly to other DataFrames. Pandas aligns that mask by index label, so any future index drift between these frames will mis-filter rows or raise. Convert the mask once with `keep_mask.to_numpy()` and use `.iloc[...]` for the row-aligned matrices.

🐛 Suggested fix

```diff
 keep_mask = ~is_control
-samples = dataset.samples[keep_mask].reset_index(drop=True)
-expression = dataset.expression[keep_mask].reset_index(drop=True)
+row_mask = keep_mask.to_numpy()
+samples = dataset.samples.loc[keep_mask].reset_index(drop=True)
+expression = dataset.expression.iloc[row_mask].reset_index(drop=True)
@@
 lod_df = metadata["lod_matrix"]
 if isinstance(lod_df, pd.DataFrame):
-    metadata["lod_matrix"] = lod_df[keep_mask].reset_index(drop=True)
+    metadata["lod_matrix"] = lod_df.iloc[row_mask].reset_index(drop=True)
```

Also applies to: 51-53, 81-82, 88-90
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/filtering.py` around lines 43 - 45, The boolean mask built from dataset.samples (keep_mask = ~is_control) is being applied directly to other DataFrames which causes label-based alignment bugs; convert the mask to a positional boolean array (e.g., keep_pos = keep_mask.to_numpy()) and use .iloc[...] to select rows for dataset.samples, dataset.expression and dataset.lod_matrix (and any other uses) — replace expressions like samples = dataset.samples[keep_mask].reset_index(...) and expression = dataset.expression[keep_mask].reset_index(...) with samples = dataset.samples.iloc[keep_pos].reset_index(drop=True), expression = dataset.expression.iloc[keep_pos].reset_index(drop=True), and similarly use .iloc[keep_pos] when slicing lod_matrix.
🧹 Nitpick comments (1)
src/pyprideap/qc/render.py (1)
`45-60`: Normalize the histogram traces if this chart is supposed to compare distributions.

The docstring says density/KDE-like output, but Lines 54-60 use `histnorm=""`, which plots raw counts. Samples with more non-missing measurements will therefore look "larger" even when the underlying shape is identical. If the intent is distribution comparison, switch to `histnorm="probability density"` / `"probability"` or update the copy to match the current behavior.

💡 Suggested fix

```diff
-    histnorm="",
+    histnorm="probability density",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/qc/render.py` around lines 45 - 60, The histogram traces in render_distribution currently use histnorm="" which shows raw counts and biases samples with more measurements; update the go.Histogram call in render_distribution to use an appropriate normalization such as histnorm="probability density" (or "probability") so the per-sample overlaid curves compare distributions rather than counts, and ensure the docstring still states density/KDE-like output (or adjust the docstring if you intentionally keep raw counts).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/pyprideap/__init__.py`:
- Around line 3-13: The import block is too long and triggers Ruff I001/E501;
reflow the pyprideap.lod import into a parenthesized, multi-line import and let
Ruff/isort reorder groups so line lengths and ordering meet style rules—e.g.
replace the single long line that imports LodMethod, LodStats,
compute_lod_from_controls, compute_nclod, compute_lod_stats,
get_bundled_fixed_lod_path, get_reported_lod, get_valid_proteins, load_fixed_lod
with a parenthesized import list (one name per line) and then run ruff/isort to
fix ordering and spacing.
In `@src/pyprideap/lod.py`:
- Around line 486-488: The comprehension creating panel_cols is too long and
exceeds the 120-char limit; split it across multiple lines to satisfy the linter
by breaking after the opening bracket and aligning the conditional onto its own
line. Update the expression that builds panel_cols (the list comprehension
referencing dataset.expression.columns, panel_map, panel_name, and
has_lod_per_col) so each logical part is on a separate line (e.g., start the
list on one line, put the for clause on the next line, and the if condition on
another) to keep line length <=120.
- Line 247: The hardcoded _CONFIGS_DIR path will break after pip installation
and causes get_bundled_fixed_lod_path() to return None; replace filesystem-based
resolution with importlib.resources (or importlib.resources.files/open_text) to
load the bundled CSVs from the package resources and update pyproject.toml to
include those CSVs as package data (or configure hatchling's include section) so
they are distributed; modify get_bundled_fixed_lod_path() (and any callers) to
use importlib.resources APIs to locate/read the CSVs rather than
Path(__file__).resolve().parent... and remove reliance on _CONFIGS_DIR.
In `@src/pyprideap/qc/compute.py`:
- Around line 98-99: The helper _sample_id_col currently returns "SampleId" even
when neither "SampleID" nor "SampleId" exists; change _sample_id_col to return
None (or raise a ValueError with a clear message) when no valid column is found,
and update callers such as compute_correlation to check the return value before
indexing dataset.samples (raise a clear error or handle the missing ID column
case) so a KeyError is avoided and errors are informative.
- Around line 148-149: The error arises because dataset.metadata is typed
dict[str, object] so dataset.metadata["MissingFreq"] is object and list(...)
fails; fix by casting that value to the expected iterable type before
converting/assigning to freq. In compute.py, import typing.cast and change the
assignment in the branch with "MissingFreq" to use cast(List[str] or
Iterable[str], dataset.metadata["MissingFreq"]) (or cast to the concrete type
you expect) so freq has the proper type; reference the variables/members
dataset.metadata, "MissingFreq", and freq when making this change.
- Around line 218-231: The loop over numeric.columns uses the index i to grab
assay IDs and panels from dataset.features which can misalign; instead, map by
column name (col) to the feature row: look up dataset.features rows where
dataset.features[id_col] == col and use that row's id and panel values (fall
back to col or empty string if no match). Update the assignments that populate
assay_ids and panels (references: numeric.columns loop, dataset.features,
id_col, panel_col, assay_ids, panels) to perform a name-based lookup (or an
indexed mapping built once outside the loop) rather than using i-based
positional indexing.
In `@src/pyprideap/qc/report.py`:
- Around line 26-30: The help text for the "missing_frequency" entry contains
escaped percent signs ("30%%") that render as "30%%" in the HTML; update the
string literal under the "missing_frequency" key in src/pyprideap/qc/report.py
to use single percent signs ("30%") instead of "30%%" so the HTML shows "30%".
- Line 335: The HTML file is written with output.write_text(html) using
platform-default encoding while the content declares UTF-8 and contains
non-ASCII characters; change the write to explicitly use UTF-8 by calling
output.write_text(html, encoding="utf-8") and update any tests that read this
file with Path.read_text() to call read_text(encoding="utf-8") so bytes and
declared charset always match (look for the output.write_text(html) call in
report generation and corresponding read_text() uses in tests).
- Around line 220-230: Narrow the type of plot_data entries before calling the
renderer by checking the runtime type: after retrieving data =
plot_data.get(key) and ensuring data is not None, use isinstance(data, _dtype)
(where _dtype and renderer are the two elements from the _RENDERERS tuple you
unpack) or cast to the dataclass so that renderer(data) and access like
data.title are type-safe; also fix the help string literals by replacing the
double-percent occurrences ("30%% threshold", "more than 30%% of") with single
percent signs ("30% threshold", "more than 30% of"); finally, write the HTML
with explicit UTF-8 encoding by changing output.write_text(html) to
output.write_text(html, encoding="utf-8").
In `@src/pyprideap/readers/olink_xlsx.py`:
- Line 10: The XLSX reader currently drops the MissingFreq column because
_FEATURE_COLS in src/pyprideap/readers/olink_xlsx.py lacks "MissingFreq"; update
the _FEATURE_COLS set to include "MissingFreq" (matching the entry in
src/pyprideap/readers/olink_csv.py) so feature metadata extraction in the Olink
XLSX reader (referencing _FEATURE_COLS) preserves that field and yields
consistent metadata across CSV and XLSX loaders.
---
Duplicate comments:
In `@src/pyprideap/filtering.py`:
- Around line 43-45: The boolean mask built from dataset.samples (keep_mask =
~is_control) is being applied directly to other DataFrames which causes
label-based alignment bugs; convert the mask to a positional boolean array
(e.g., keep_pos = keep_mask.to_numpy()) and use .iloc[...] to select rows for
dataset.samples, dataset.expression and dataset.lod_matrix (and any other uses)
— replace expressions like samples = dataset.samples[keep_mask].reset_index(...)
and expression = dataset.expression[keep_mask].reset_index(...) with samples =
dataset.samples.iloc[keep_pos].reset_index(drop=True), expression =
dataset.expression.iloc[keep_pos].reset_index(drop=True), and similarly use
.iloc[keep_pos] when slicing lod_matrix.
---
Nitpick comments:
In `@src/pyprideap/qc/render.py`:
- Around line 45-60: The histogram traces in render_distribution currently use
histnorm="" which shows raw counts and biases samples with more measurements;
update the go.Histogram call in render_distribution to use an appropriate
normalization such as histnorm="probability density" (or "probability") so the
per-sample overlaid curves compare distributions rather than counts, and ensure
the docstring still states density/KDE-like output (or adjust the docstring if
you intentionally keep raw counts).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 790cb379-de63-4432-b5f5-73c0b0ac749e
⛔ Files ignored due to path filters (3)
- configs/Explore-3072-Fixed-LOD-2024-12-19.csv is excluded by !**/*.csv
- configs/Explore-HT-Fixed-LOD.csv is excluded by !**/*.csv
- configs/Reveal-Fixed-LOD.csv is excluded by !**/*.csv
📒 Files selected for processing (15)
- .gitignore
- pyproject.toml
- src/pyprideap/__init__.py
- src/pyprideap/filtering.py
- src/pyprideap/lod.py
- src/pyprideap/qc/__init__.py
- src/pyprideap/qc/compute.py
- src/pyprideap/qc/render.py
- src/pyprideap/qc/report.py
- src/pyprideap/readers/olink_csv.py
- src/pyprideap/readers/olink_xlsx.py
- tests/test_lod.py
- tests/test_qc_compute.py
- tests/test_qc_render.py
- tests/test_qc_report.py
🚧 Files skipped from review as they are similar to previous changes (3)
- pyproject.toml
- tests/test_lod.py
- .gitignore
from importlib.metadata import PackageNotFoundError, version

from pyprideap.core import AffinityDataset, Level, Platform, ValidationResult
from pyprideap.filtering import filter_controls, filter_qc
from pyprideap.lod import LodMethod, LodStats, compute_lod_from_controls, compute_nclod, compute_lod_stats, get_bundled_fixed_lod_path, get_reported_lod, get_valid_proteins, load_fixed_lod
from pyprideap.pride import PrideClient
from pyprideap.qc.compute import compute_all as compute_qc
from pyprideap.qc.report import qc_report
from pyprideap.readers.registry import read
from pyprideap.stats import DatasetStats, compute_stats
from pyprideap.validators import validate
Reflow the import block so Ruff passes.
CI is already failing on I001/E501 for this block. Breaking the long pyprideap.lod import into a parenthesized block and letting Ruff/isort reorder the group should clear it.
💡 Suggested fix
-from pyprideap.lod import LodMethod, LodStats, compute_lod_from_controls, compute_nclod, compute_lod_stats, get_bundled_fixed_lod_path, get_reported_lod, get_valid_proteins, load_fixed_lod
+from pyprideap.lod import (
+ LodMethod,
+ LodStats,
+ compute_lod_from_controls,
+ compute_lod_stats,
+ compute_nclod,
+ get_bundled_fixed_lod_path,
+ get_reported_lod,
+ get_valid_proteins,
+ load_fixed_lod,
+)🧰 Tools
🪛 GitHub Actions: CI
[error] 3-13: I001 Import block is un-sorted or un-formatted.
[error] 7-7: E501 Line too long (188 > 120)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/__init__.py` around lines 3 - 13, The import block is too long
and triggers Ruff I001/E501; reflow the pyprideap.lod import into a
parenthesized, multi-line import and let Ruff/isort reorder groups so line
lengths and ordering meet style rules—e.g. replace the single long line that
imports LodMethod, LodStats, compute_lod_from_controls, compute_nclod,
compute_lod_stats, get_bundled_fixed_lod_path, get_reported_lod,
get_valid_proteins, load_fixed_lod with a parenthesized import list (one name
per line) and then run ruff/isort to fix ordering and spacing.
return pd.Series(lod_series.values, index=dataset.expression.columns[: len(lod_series)])

_CONFIGS_DIR = Path(__file__).resolve().parent.parent.parent / "configs"
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if configs directory is included in package data or exists at expected location
fd -t d "configs" --max-depth 4
# Check pyproject.toml for package data configuration
cat pyproject.toml | grep -A 10 "package-data\|include\|data-files" || echo "No package data config found"

Repository: PRIDE-Archive/pyprideap
🏁 Script executed:
# Check for setup.py and MANIFEST.in
fd -t f "setup.py|setup.cfg|MANIFEST.in" --max-depth 2
# Get the full pyproject.toml to understand distribution setup
cat pyproject.toml
🏁 Script executed:
# Find usage of _CONFIGS_DIR
rg "_CONFIGS_DIR" -n
# Check what files are in configs directory
fd -t f . configs/
# See how get_bundled_fixed_lod_path handles missing files
sed -n '250,270p' src/pyprideap/lod.py
🏁 Script executed:
# Check for tests related to bundled LOD
rg "get_bundled_fixed_lod_path|_BUNDLED_FIXED_LOD" tests/ --type py
# Check README for any mention of configs or bundled data
grep -i "config\|bundle\|fixed.lod" README.md
Path resolution will fail after pip installation; bundled LOD data will not be accessible.
The _CONFIGS_DIR path uses Path(__file__).resolve().parent.parent.parent / "configs", which works in development but breaks in distributed installations. The configs/ directory is not included in the package distribution (pyproject.toml has no package-data configuration and hatchling only includes Python packages by default). After pip install, this path resolution will point to the wrong location, causing get_bundled_fixed_lod_path() to silently return None for all platforms.
Use importlib.resources to access the bundled CSV files instead, ensuring they are included in the package distribution via pyproject.toml configuration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/lod.py` at line 247, The hardcoded _CONFIGS_DIR path will break
after pip installation and causes get_bundled_fixed_lod_path() to return None;
replace filesystem-based resolution with importlib.resources (or
importlib.resources.files/open_text) to load the bundled CSVs from the package
resources and update pyproject.toml to include those CSVs as package data (or
configure hatchling's include section) so they are distributed; modify
get_bundled_fixed_lod_path() (and any callers) to use importlib.resources APIs
to locate/read the CSVs rather than Path(__file__).resolve().parent... and
remove reliance on _CONFIGS_DIR.
| panel_cols = [ | ||
| c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False) | ||
| ] |
Fix line length to resolve pipeline failure.
Line 487 exceeds 120 characters (121 > 120), causing the linter to fail.
🔧 Proposed fix
panel_cols = [
- c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
+ c for c in dataset.expression.columns
+ if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
]

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
panel_cols = [
    c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
]

→

panel_cols = [
    c for c in dataset.expression.columns
    if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
]
🧰 Tools
🪛 GitHub Actions: CI
[error] 487-487: E501 Line too long (121 > 120)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/lod.py` around lines 486 - 488, The comprehension creating
panel_cols is too long and exceeds the 120-char limit; split it across multiple
lines to satisfy the linter by breaking after the opening bracket and aligning
the conditional onto its own line. Update the expression that builds panel_cols
(the list comprehension referencing dataset.expression.columns, panel_map,
panel_name, and has_lod_per_col) so each logical part is on a separate line
(e.g., start the list on one line, put the for clause on the next line, and the
if condition on another) to keep line length <=120.
| def _sample_id_col(dataset: AffinityDataset) -> str: | ||
| return "SampleID" if "SampleID" in dataset.samples.columns else "SampleId" |
_sample_id_col returns "SampleId" even when neither column exists.
When dataset.samples contains neither "SampleID" nor "SampleId" (common for SomaScan ADAT files per the reader implementation), this function returns "SampleId" unconditionally. Callers like compute_correlation (line 288) use this to index into dataset.samples, which would raise a KeyError.
Consider returning None or raising a clearer error when no valid column is found.
🛡️ Proposed fix
def _sample_id_col(dataset: AffinityDataset) -> str | None:
- return "SampleID" if "SampleID" in dataset.samples.columns else "SampleId"
+ if "SampleID" in dataset.samples.columns:
+ return "SampleID"
+ if "SampleId" in dataset.samples.columns:
+ return "SampleId"
+ return None

Then update callers to check for None before indexing.
📝 Committable suggestion
def _sample_id_col(dataset: AffinityDataset) -> str:
    return "SampleID" if "SampleID" in dataset.samples.columns else "SampleId"

→

def _sample_id_col(dataset: AffinityDataset) -> str | None:
    if "SampleID" in dataset.samples.columns:
        return "SampleID"
    if "SampleId" in dataset.samples.columns:
        return "SampleId"
    return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/compute.py` around lines 98 - 99, The helper _sample_id_col
currently returns "SampleId" even when neither "SampleID" nor "SampleId" exists;
change _sample_id_col to return None (or raise a ValueError with a clear
message) when no valid column is found, and update callers such as
compute_correlation to check the return value before indexing dataset.samples
(raise a clear error or handle the missing ID column case) so a KeyError is
avoided and errors are informative.
| elif "MissingFreq" in dataset.metadata: | ||
| freq = list(dataset.metadata["MissingFreq"]) |
Fix type error causing pipeline failure.
The pipeline fails with: No overload variant of "list" matches argument type "object". Since dataset.metadata is typed as dict[str, object], the value returned by dataset.metadata["MissingFreq"] has type object, which list() cannot accept without a cast.
🔧 Proposed fix
elif "MissingFreq" in dataset.metadata:
- freq = list(dataset.metadata["MissingFreq"])
+ mf = dataset.metadata["MissingFreq"]
+ freq = list(mf) if hasattr(mf, "__iter__") else []  # type: ignore[arg-type]

Or use an explicit cast:
+from typing import cast, Iterable
...
elif "MissingFreq" in dataset.metadata:
- freq = list(dataset.metadata["MissingFreq"])
+ freq = list(cast(Iterable[float], dataset.metadata["MissingFreq"]))

📝 Committable suggestion
elif "MissingFreq" in dataset.metadata:
    freq = list(dataset.metadata["MissingFreq"])

→

elif "MissingFreq" in dataset.metadata:
    mf = dataset.metadata["MissingFreq"]
    freq = list(mf) if hasattr(mf, "__iter__") else []  # type: ignore[arg-type]
🧰 Tools
🪛 GitHub Actions: CI
[error] 149-149: No overload variant of "list" matches argument type "object" [call-overload]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/compute.py` around lines 148 - 149, The error arises because
dataset.metadata is typed dict[str, object] so dataset.metadata["MissingFreq"]
is object and list(...) fails; fix by casting that value to the expected
iterable type before converting/assigning to freq. In compute.py, import
typing.cast and change the assignment in the branch with "MissingFreq" to use
cast(List[str] or Iterable[str], dataset.metadata["MissingFreq"]) (or cast to
the concrete type you expect) so freq has the proper type; reference the
variables/members dataset.metadata, "MissingFreq", and freq when making this
change.
"missing_frequency": (
    "Displays how many assays fall into each missing-frequency bin. The x-axis is the fraction of samples "
    "where a given assay has no measured value; the y-axis (log scale) is the count of assays. The red "
    "dashed line marks the 30%% threshold — assays above this line are missing in more than 30%% of "
    "samples and may need to be excluded from downstream analysis."
Fix the doubled percent signs in the help copy.
These strings are emitted verbatim into the HTML, so the report currently shows 30%% instead of 30%.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/report.py` around lines 26 - 30, The help text for the
"missing_frequency" entry contains escaped percent signs ("30%%") that render as
"30%%" in the HTML; update the string literal under the "missing_frequency" key
in src/pyprideap/qc/report.py to use single percent signs ("30%") instead of
"30%%" so the HTML shows "30%".
_RENDERERS = {
    "distribution": (DistributionData, R.render_distribution),
    "missing_frequency": (MissingFrequencyData, R.render_missing_frequency),
    "qc_summary": (QcLodSummaryData, R.render_qc_summary),
    "lod_analysis": (LodAnalysisData, R.render_lod_analysis),
    "pca": (PcaData, R.render_pca),
    "correlation": (CorrelationData, R.render_correlation),
    "missing_values": (MissingValuesData, R.render_missing_values),
    "cv_distribution": (CvDistributionData, R.render_cv_distribution),
    "detection_rate": (DetectionRateData, R.render_detection_rate),
}
🧩 Analysis chain
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | sed -n '220,270p'
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | sed -n '1,250p' | tail -n 50
🏁 Script executed:
cat -n src/pyprideap/qc/compute.py | sed -n '340,365p'
🏁 Script executed:
rg "30%%" src/pyprideap/qc/report.py
🏁 Script executed:
rg "write_text" src/pyprideap/qc/report.py
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | grep -A 5 -B 5 "write_text"
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | grep -B 10 "30%%" | head -n 20
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | sed -n '19,35p'
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | sed -n '300,336p'
🏁 Script executed:
python3 << 'EOF'
# Test what %% in a regular string literal produces
test_str = "This is 30%% of samples"
print("String value:", repr(test_str))
print("Rendered:", test_str)
# Compare with what should happen in format context
format_str = "This is 30%% of {}"
print("Format string rendered:", format_str.format("samples"))
EOF
🏁 Script executed:
rg "encoding" src/pyprideap/qc/report.py
🏁 Script executed:
python3 << 'EOF'
# Simulate what mypy sees
from typing import Dict, Any
# This is what compute_all returns
plot_data: Dict[str, object] = {"key": "value"}
# At line 249
data = plot_data.get("key") # type: object | None
# After the None check at line 250-251
if data is not None:
# data is still typed as object, not narrowed to the specific dataclass type
# because dict.get() with dict[str, object] returns object | None
# Accessing data.title would fail mypy type check
print(type(data)) # Would need isinstance check to narrow
EOF
Narrow plot_data before invoking the renderer registry, fix the %% escaping in help text, and specify UTF-8 encoding for HTML output.
compute_all() returns dict[str, object] (compute.py:347), so at line 249 data = plot_data.get(key) is typed as object | None. The if data is None: continue check at lines 250–251 narrows only to object, not to the specific dataclass type. Line 252 calls renderer(data) and line 256 accesses data.title—both fail mypy type checking because object has no title attribute and renderer expects a concrete dataclass type. Add an isinstance(data, _dtype) guard or cast after the None check to properly narrow the type.
The help text at lines 29–30 contains literal "30%% threshold" and "more than 30%% of". In a regular string literal (not f-string or format context), %% is not escaped and will appear as 30%% in the rendered HTML instead of 30%.
Line 335 calls output.write_text(html) without specifying an encoding. The HTML contains non-ASCII characters (em dash \u2014 at line 319), and the default platform encoding may vary. Use output.write_text(html, encoding="utf-8") for consistency and portability.
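The isinstance-based narrowing can be sketched like this; the dataclass and renderer names are simplified stand-ins for the module's real _RENDERERS table:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DistributionData:
    title: str


def render_distribution(data: DistributionData) -> str:
    return f"<h2>{data.title}</h2>"


# key -> (expected payload type, renderer); mirrors the _RENDERERS registry.
_RENDERERS: dict[str, tuple[type, Callable]] = {
    "distribution": (DistributionData, render_distribution),
}


def render_all(plot_data: dict[str, object]) -> list[str]:
    sections: list[str] = []
    for key, (dtype, renderer) in _RENDERERS.items():
        data = plot_data.get(key)
        # isinstance narrows `data` from object to the concrete dataclass,
        # making renderer(data) and data.title type-safe; missing or
        # wrong-typed entries are simply skipped.
        if not isinstance(data, dtype):
            continue
        sections.append(renderer(data))
    return sections


print(render_all({"distribution": DistributionData(title="NPX distribution")}))
# ['<h2>NPX distribution</h2>']
```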
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/report.py` around lines 220 - 230, Narrow the type of
plot_data entries before calling the renderer by checking the runtime type:
after retrieving data = plot_data.get(key) and ensuring data is not None, use
isinstance(data, _dtype) (where _dtype and renderer are the two elements from
the _RENDERERS tuple you unpack) or cast to the dataclass so that renderer(data)
and access like data.title are type-safe; also fix the help string literals by
replacing the double-percent occurrences ("30%% threshold", "more than 30%% of")
with single percent signs ("30% threshold", "more than 30% of"); finally, write
the HTML with explicit UTF-8 encoding by changing output.write_text(html) to
output.write_text(html, encoding="utf-8").
    "</body>\n</html>"
)

output.write_text(html)
🧩 Analysis chain
🏁 Script executed:
wc -l src/pyprideap/qc/report.py
🏁 Script executed:
cat -n src/pyprideap/qc/report.py | sed -n '330,340p'
🏁 Script executed:
# Search for where the HTML is generated and what special characters it contains
rg "—|≤|−" src/pyprideap/qc/report.py -B 3 -A 3
🏁 Script executed:
# Check for test files related to this module
find . -name "*test*report*" -o -name "*report*test*" | head -20
🏁 Script executed:
# Check the test file to see how it handles encoding
cat -n tests/test_qc_report.py
🏁 Script executed:
# Check Python's write_text() behavior - look at what encoding is used by default
python3 <<'EOF'
from pathlib import Path
import inspect
# Check write_text signature
sig = inspect.signature(Path.write_text)
print(f"write_text signature: {sig}")
# Check default encoding
print(f"\nDefault encoding for write_text: {inspect.signature(Path.write_text).parameters}")
EOF
🏁 Script executed:
# Search for any other places where output might be read back
rg "read_text|open.*encoding" src/pyprideap/qc/report.py
🏁 Script executed:
# Check the context of where output is defined - where the Path is created
cat -n src/pyprideap/qc/report.py | sed -n '300,340p'
🏁 Script executed:
# Check the qc_report function signature to understand output parameter type
cat -n src/pyprideap/qc/report.py | sed -n '260,280p'
🏁 Script executed:
# Check if there are any other write_text calls in the file that might already use encoding
rg "write_text|write_bytes" src/pyprideap/qc/report.py
🏁 Script executed:
# Get the full function signature
cat -n src/pyprideap/qc/report.py | sed -n '200,220p'
🏁 Script executed:
# Check if the output is a Path object
rg "def qc_report" src/pyprideap/qc/report.py -A 5
Specify UTF-8 encoding when writing HTML.
The generated HTML declares <meta charset="utf-8"> but contains non-ASCII characters (—, ≤, −) and is written with write_text() using the platform-default encoding. On non-UTF-8 locales, this causes a mismatch: the bytes won't be UTF-8 even though the charset is declared as UTF-8. Use output.write_text(html, encoding="utf-8") to ensure the file is always written in UTF-8. Additionally, the tests that read this file back should also specify encoding="utf-8" in their read_text() calls to maintain consistency.
💡 Suggested fix
- output.write_text(html)
+ output.write_text(html, encoding="utf-8")

📝 Committable suggestion
output.write_text(html)

→

output.write_text(html, encoding="utf-8")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/report.py` at line 335, The HTML file is written with
output.write_text(html) using platform-default encoding while the content
declares UTF-8 and contains non-ASCII characters; change the write to explicitly
use UTF-8 by calling output.write_text(html, encoding="utf-8") and update any
tests that read this file with Path.read_text() to call
read_text(encoding="utf-8") so bytes and declared charset always match (look for
the output.write_text(html) call in report generation and corresponding
read_text() uses in tests).
from pyprideap.core import AffinityDataset, Platform

_SAMPLE_COLS = {"SampleID", "PlateID", "WellID", "SampleType", "SampleQC", "PlateQC"}
_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}
Preserve MissingFreq in XLSX feature metadata.
src/pyprideap/readers/olink_csv.py:9-11 includes MissingFreq in _FEATURE_COLS, but this reader drops it entirely. The same Olink dataset will therefore expose different feature metadata depending on whether it is loaded from CSV or XLSX, which makes downstream QC/statistics format-dependent.
Proposed fix
-_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}
+_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD", "MissingFreq"}

📝 Committable suggestion
_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}

→

_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD", "MissingFreq"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/readers/olink_xlsx.py` at line 10, The XLSX reader currently
drops the MissingFreq column because _FEATURE_COLS in
src/pyprideap/readers/olink_xlsx.py lacks "MissingFreq"; update the
_FEATURE_COLS set to include "MissingFreq" (matching the entry in
src/pyprideap/readers/olink_csv.py) so feature metadata extraction in the Olink
XLSX reader (referencing _FEATURE_COLS) preserves that field and yields
consistent metadata across CSV and XLSX loaders.
Actionable comments posted: 5
♻️ Duplicate comments (4)
src/pyprideap/readers/olink_xlsx.py (1)
10-10: ⚠️ Potential issue | 🟠 Major

Add MissingFreq to the XLSX feature schema.

src/pyprideap/readers/olink_csv.py preserves MissingFreq, but this reader drops it entirely. The same Olink export will therefore expose different features metadata depending on whether it is loaded from CSV or XLSX.

Suggested fix

-_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}
+_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD", "MissingFreq"}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/readers/olink_xlsx.py` at line 10, The XLSX reader's feature schema currently omits the MissingFreq column causing inconsistent features metadata vs the CSV reader; update the _FEATURE_COLS set in src/pyprideap/readers/olink_xlsx.py to include "MissingFreq" (same symbol _FEATURE_COLS) so the XLSX parser preserves that column, and ensure any downstream usage in the reader logic that iterates over _FEATURE_COLS (e.g., feature extraction or validation code in the olink_xlsx reader) continues to handle MissingFreq the same way as the CSV reader.

src/pyprideap/lod.py (3)
96-103: ⚠️ Potential issue | 🟠 Major
Require negative controls for NCLOD.
This still falls back to positive/QC/reference controls when no negatives exist, which changes the NCLOD threshold instead of failing fast.
Suggested fix
     is_negative = st.isin(_NEGATIVE_CONTROL_TYPES)
     if is_negative.any():
         return is_negative
-
-    # Fallback: use any control-like sample
-    is_control = st.isin(_CONTROL_SAMPLE_TYPES)
-    if is_control.any():
-        return is_control
     raise ValueError("No negative control samples found in SampleType column")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/lod.py` around lines 96 - 103, The current logic uses _CONTROL_SAMPLE_TYPES fallback when no negatives are present which alters NCLOD; change the behavior so NCLOD requires negative controls only: in the function that computes is_negative (look for variables is_negative and _NEGATIVE_CONTROL_TYPES in src/pyprideap/lod.py) remove or disable the fallback branch that checks _CONTROL_SAMPLE_TYPES/is_control and instead raise a clear error (or return a failing status) when is_negative.any() is false so the caller fails fast if no negative controls are provided.
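The fail-fast behaviour the prompt describes can be sketched in a few lines. Note that the control-type values below are placeholders, not the module's actual constants:

```python
import pandas as pd

# Hypothetical control-type set; the real values live in src/pyprideap/lod.py.
_NEGATIVE_CONTROL_TYPES = {"NEGATIVE_CONTROL", "NC"}


def negative_control_mask(st: pd.Series) -> pd.Series:
    # Boolean mask of negative-control samples; fail fast when none exist
    # instead of silently falling back to other control types, which would
    # change the NCLOD threshold.
    is_negative = st.isin(_NEGATIVE_CONTROL_TYPES)
    if is_negative.any():
        return is_negative
    raise ValueError("No negative control samples found in SampleType column")


mask = negative_control_mask(pd.Series(["SAMPLE", "NC", "SAMPLE"]))
print(mask.tolist())  # [False, True, False]
```

Raising instead of falling back keeps the caller's error handling explicit: a dataset without negative controls simply cannot produce a negative-control LOD.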
488-490: ⚠️ Potential issue | 🟡 Minor
Wrap `panel_cols` to unblock the linter.
CI is still failing on line 489 with E501.
Suggested fix
 panel_cols = [
-    c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
+    c
+    for c in dataset.expression.columns
+    if panel_map.get(c) == panel_name
+    and has_lod_per_col.get(c, False)
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/lod.py` around lines 488 - 490, Line is too long (E501): break the comprehension for panel_cols into multiple lines to satisfy the linter. Edit the expression that builds panel_cols (the list comprehension using dataset.expression.columns, panel_map, panel_name, and has_lod_per_col) so that the opening bracket, the iterator expression, and the conditional are split across lines (or assign intermediate variables) to keep line length under limit and preserve the same logic.
249-269: ⚠️ Potential issue | 🟠 Major
Bundled FixedLOD resolution will break outside a source checkout.
`Path(__file__).resolve().parent.parent.parent / "configs"` depends on the repo layout. After installation, `get_bundled_fixed_lod_path()` can no longer reliably find the bundled CSVs.
#!/bin/bash
fd -t d '^configs$'
printf '\n--- pyproject packaging hints ---\n'
rg -n "package-data|include|artifacts|only-include" pyproject.toml
printf '\n--- current bundled LOD path logic ---\n'
sed -n '249,317p' src/pyprideap/lod.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/pyprideap/lod.py` around lines 249 - 269, The current _CONFIGS_DIR constant and get_bundled_fixed_lod_path() rely on a repo-relative path that breaks after installation; instead package the CSVs as package data and resolve them at runtime via package resources. Update packaging (pyproject.toml) to include the configs CSVs, remove/stop relying on _CONFIGS_DIR, and change get_bundled_fixed_lod_path(platform) to use importlib.resources (or importlib_resources for py<3.9) to locate the file inside the installed pyprideap package (e.g. importlib.resources.files("pyprideap").joinpath("configs", filename)) and return None if the resource is missing; keep the _BUNDLED_FIXED_LOD mapping and filename lookup logic the same, only replace the filesystem resolution with package-resource resolution.
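The package-resource lookup the prompt asks for can be sketched as follows. The `_BUNDLED_FIXED_LOD` mapping and filename are invented for illustration; only the `importlib.resources` mechanics are the point:

```python
from importlib import resources

# Hypothetical platform-to-filename mapping; the real one lives in lod.py.
_BUNDLED_FIXED_LOD = {"olink_explore_3072": "olink_explore_3072_fixed_lod.csv"}


def get_bundled_fixed_lod_path(platform: str):
    """Resolve a bundled FixedLOD CSV inside the installed package, or None."""
    filename = _BUNDLED_FIXED_LOD.get(platform)
    if filename is None:
        return None
    # resources.files() resolves package data wherever the package is
    # installed (site-packages, editable install, or even a zip).
    resource = resources.files("pyprideap").joinpath("configs", filename)
    return resource if resource.is_file() else None


# The same mechanism works for any package; stdlib `json` as a stand-in:
print(resources.files("json").joinpath("__init__.py").is_file())  # True
```

For this to work the `configs` directory must actually ship with the wheel, so the packaging change in `pyproject.toml` (include the CSVs as package data under `pyprideap/`) is the other half of the fix.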
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/pyprideap/lod.py`:
- Around line 125-127: In _nclod_base(), avoid using np.maximum on a pandas
Series (numeric_expr) because np.maximum returns Any; instead compute medians =
numeric_expr.median() and stds = numeric_expr.std() as before but replace the
np.maximum call with stds.multiply(3).clip(lower=_MIN_STD_FLOOR) (or equivalent
Series.clip usage) and return medians + that clipped Series so the function
retains a pd.Series return type and satisfies mypy; update references to
medians, stds, and _MIN_STD_FLOOR accordingly.
- Around line 198-204: The DataFrame construction and addition degrade type
precision because base.values and adj.values are typed as ndarray |
ExtensionArray; update the lod_matrix creation to use base.to_numpy() when
tiling and convert adj to a pandas.Series (or ensure adj is a Series) before
adding so mypy can infer a concrete DataFrame return type: use
np.tile(base.to_numpy(), (len(dataset.samples), 1)) for the data and ensure the
addition uses a Series (e.g., pd.Series(adj) or adj if already a Series) with
axis=0 when calling lod_matrix.add; keep the same column=index and
index=dataset.expression.index and return lod_matrix.
In `@src/pyprideap/qc/render.py`:
- Around line 110-123: Remove the dead variable cumulative in the loop inside
the render function (where cumulative = 0.0 is set and cumulative += pct is
applied); delete both the initialization and the increment since cumulative is
never read, leaving the loop that builds the go.Bar traces using pct,
data.categories, data.counts, and colors intact.
- Around line 31-42: The _QC_LOD_COLORS mapping assigns alarming red ("#e74c3c")
to "WARN & NPX ≤ LOD" while various "FAIL" keys use gray, which flips visual
severity; update the _QC_LOD_COLORS dict so FAIL-related keys (e.g., "FAIL",
"FAIL & NPX > LOD", "FAIL & NPX ≤ LOD") use a more alarming color (e.g., red)
and WARN-related keys (e.g., "WARN", "WARN & NPX > LOD", "WARN & NPX ≤ LOD") use
a less severe color (e.g., orange), or document the intended semantics if the
current mapping is deliberate; locate and change the color hex values in the
_QC_LOD_COLORS constant in render.py accordingly.
In `@src/pyprideap/readers/olink_csv.py`:
- Around line 47-56: The LOD pivot can drop assay columns with all-NA values
causing metadata["lod_matrix"] to not match expression.columns; after creating
lod_matrix with df.pivot_table (and before resetting index), reindex its columns
to the same assay order used for expression/features (the same reference used at
line where features are aligned) and fill missing columns with NaN so
lod_matrix.columns == expression.columns; apply the identical fix in
olink_xlsx.py where lod_matrix is built, ensuring you reference the same
sample_order and assay ordering variable (expression.columns or features) when
reindexing.
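The all-NA column drop described in the `olink_csv.py` comment is easy to reproduce: `pivot_table` silently omits columns whose values are entirely NA, so the LOD matrix can end up narrower than the expression matrix. A minimal illustration (column and sample names invented):

```python
import numpy as np
import pandas as pd

long_df = pd.DataFrame({
    "SampleID": ["s1", "s1", "s2", "s2"],
    "Assay": ["A", "B", "A", "B"],
    "LOD": [1.0, np.nan, 2.0, np.nan],  # assay B has no LOD values at all
})

# Default pivot_table (dropna=True) drops the all-NA assay column entirely.
lod_matrix = long_df.pivot_table(index="SampleID", columns="Assay", values="LOD")
print(list(lod_matrix.columns))  # ['A'] (assay B was dropped)

# Reindexing against the full assay order restores it as a NaN column,
# keeping lod_matrix.columns aligned with expression.columns.
assay_order = ["A", "B"]
lod_matrix = lod_matrix.reindex(columns=assay_order)
print(list(lod_matrix.columns))  # ['A', 'B']
```

Reindexing against the same assay ordering used for the expression and features tables guarantees the invariant `lod_matrix.columns == expression.columns` regardless of missing data.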
---
Duplicate comments:
In `@src/pyprideap/lod.py`:
- Around line 96-103: The current logic uses _CONTROL_SAMPLE_TYPES fallback when
no negatives are present which alters NCLOD; change the behavior so NCLOD
requires negative controls only: in the function that computes is_negative (look
for variables is_negative and _NEGATIVE_CONTROL_TYPES in src/pyprideap/lod.py)
remove or disable the fallback branch that checks
_CONTROL_SAMPLE_TYPES/is_control and instead raise a clear error (or return a
failing status) when is_negative.any() is false so the caller fails fast if no
negative controls are provided.
- Around line 488-490: Line is too long (E501): break the comprehension for
panel_cols into multiple lines to satisfy the linter. Edit the expression that
builds panel_cols (the list comprehension using dataset.expression.columns,
panel_map, panel_name, and has_lod_per_col) so that the opening bracket, the
iterator expression, and the conditional are split across lines (or assign
intermediate variables) to keep line length under limit and preserve the same
logic.
- Around line 249-269: The current _CONFIGS_DIR constant and
get_bundled_fixed_lod_path() rely on a repo-relative path that breaks after
installation; instead package the CSVs as package data and resolve them at
runtime via package resources. Update packaging (pyproject.toml) to include the
configs CSVs, remove/stop relying on _CONFIGS_DIR, and change
get_bundled_fixed_lod_path(platform) to use importlib.resources (or
importlib_resources for py<3.9) to locate the file inside the installed
pyprideap package (e.g.
importlib.resources.files("pyprideap").joinpath("configs", filename)) and return
None if the resource is missing; keep the _BUNDLED_FIXED_LOD mapping and
filename lookup logic the same, only replace the filesystem resolution with
package-resource resolution.
In `@src/pyprideap/readers/olink_xlsx.py`:
- Line 10: The XLSX reader's feature schema currently omits the MissingFreq
column causing inconsistent features metadata vs the CSV reader; update the
_FEATURE_COLS set in src/pyprideap/readers/olink_xlsx.py to include
"MissingFreq" (same symbol _FEATURE_COLS) so the XLSX parser preserves that
column, and ensure any downstream usage in the reader logic that iterates over
_FEATURE_COLS (e.g., feature extraction or validation code in the olink_xlsx
reader) continues to handle MissingFreq the same way as the CSV reader.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eb03d159-5d1d-4635-bd95-7a0843b5735b
📒 Files selected for processing (7)
src/pyprideap/lod.py
src/pyprideap/qc/compute.py
src/pyprideap/qc/render.py
src/pyprideap/qc/report.py
src/pyprideap/readers/olink_csv.py
src/pyprideap/readers/olink_parquet.py
src/pyprideap/readers/olink_xlsx.py
🚧 Files skipped from review as they are similar to previous changes (3)
- src/pyprideap/readers/olink_parquet.py
- src/pyprideap/qc/report.py
- src/pyprideap/qc/compute.py
medians = numeric_expr.median()
stds = numeric_expr.std()
return medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)
🧩 Analysis chain
🏁 Script executed:
cat -n src/pyprideap/lod.py | head -150
Repository: PRIDE-Archive/pyprideap
Length of output: 6256
🏁 Script executed:
# Check for mypy configuration and type checking setup
fd -t f "pyproject.toml|setup.cfg|mypy.ini|py.typed" .
🏁 Script executed:
cat pyproject.toml
🏁 Script executed:
# Check for GitHub Actions or CI configuration
fd -t f "\.github.*|\.travis|\.gitlab|circle|jenkins" .
🏁 Script executed:
ls -la .github/ 2>/dev/null || echo "No .github directory"
🏁 Script executed:
ls -la .github/workflows/
🏁 Script executed:
cat .github/workflows/ci.yml
🏁 Script executed:
# Let's check if there are any type checking hints or test cases for this
rg "np.maximum|\.clip" src/
🏁 Script executed:
# Let's create a simple test to understand type behavior
python3 << 'EOF'
import numpy as np
import pandas as pd
from typing import get_type_hints
# Simulate the scenario
numeric_expr = pd.DataFrame({
'col1': [1.0, 2.0, 3.0],
'col2': [4.0, 5.0, 6.0]
})
medians = numeric_expr.median()
stds = numeric_expr.std()
print("medians type:", type(medians))
print("stds type:", type(stds))
print("3 * stds type:", type(3 * stds))
# Test np.maximum
result1 = np.maximum(0.2, 3 * stds)
print("np.maximum result type:", type(result1))
print("np.maximum result:", result1)
# Test clip
result2 = (3 * stds).clip(lower=0.2)
print("clip result type:", type(result2))
print("clip result:", result2)
# Test final return
final1 = medians + np.maximum(0.2, 3 * stds)
print("medians + np.maximum type:", type(final1))
final2 = medians + (3 * stds).clip(lower=0.2)
print("medians + clip type:", type(final2))
# Check if they're equal
print("Are results equal?", np.allclose(final1, final2))
EOF
🏁 Script executed:
# Let me check the pandas type stubs for Series operations to confirm
# Look for any type annotations in the codebase
rg "@.*return|->.*Series" src/pyprideap/lod.py -A 2
🏁 Script executed:
# Let's check if there are any tests or type annotations related to this
rg "_nclod_base|nclod" tests/ --type py
🏁 Script executed:
cat tests/test_lod.py
Use Series.clip() to preserve type safety in _nclod_base()
The current code returns medians + np.maximum(_MIN_STD_FLOOR, 3 * stds), which widens the return type to Any because np.maximum() doesn't have proper type stubs for pandas Series. This triggers a mypy failure with warn_return_any = true. The Series.clip() method preserves the declared pd.Series return type and is mathematically equivalent:
Suggested fix
- return medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)
+ return medians + (3 * stds).clip(lower=_MIN_STD_FLOOR)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 medians = numeric_expr.median()
 stds = numeric_expr.std()
-return medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)
+return medians + (3 * stds).clip(lower=_MIN_STD_FLOOR)
🧰 Tools
🪛 GitHub Actions: CI
[error] 127-127: mypy: Returning Any from function declared to return "Series[Any]". (no-any-return)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/lod.py` around lines 125 - 127, In _nclod_base(), avoid using
np.maximum on a pandas Series (numeric_expr) because np.maximum returns Any;
instead compute medians = numeric_expr.median() and stds = numeric_expr.std() as
before but replace the np.maximum call with
stds.multiply(3).clip(lower=_MIN_STD_FLOOR) (or equivalent Series.clip usage)
and return medians + that clipped Series so the function retains a pd.Series
return type and satisfies mypy; update references to medians, stds, and
_MIN_STD_FLOOR accordingly.
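A condensed version of the equivalence check the bot's sandbox script runs above: the two expressions are numerically identical and only their static types differ (values are arbitrary test data):

```python
import numpy as np
import pandas as pd

medians = pd.Series([3.0, 4.0, 5.0])
stds = pd.Series([0.01, 0.5, 1.2])
_MIN_STD_FLOOR = 0.2

# np.maximum(scalar, Series) works at runtime, but pandas-stubs types it
# as Any, tripping mypy's warn_return_any in a function declared -> Series.
via_maximum = medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)

# Series.clip keeps the declared pd.Series return type.
via_clip = medians + (3 * stds).clip(lower=_MIN_STD_FLOOR)

print(np.allclose(via_maximum, via_clip))  # True
```

Because `clip(lower=x)` is exactly `max(x, value)` element-wise, the swap is purely a typing fix with no behavioural change.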
lod_matrix = pd.DataFrame(
    np.tile(base.values, (len(dataset.samples), 1)),
    columns=base.index,
    index=dataset.expression.index,
)
lod_matrix = lod_matrix.add(adj.values, axis=0)
return lod_matrix
🧩 Analysis chain
🏁 Script executed:
# First, let's explore the repository structure
git ls-files | head -20
🏁 Script executed:
# Find the file in question
fd "lod.py" --type f
🏁 Script executed:
# Read the relevant section of the file
cat -n src/pyprideap/lod.py | sed -n '190,210p'
🏁 Script executed:
# Get the full function context to understand the types involved
cat -n src/pyprideap/lod.py | sed -n '180,210p'
🏁 Script executed:
# Check the function signature to understand parameter types
cat -n src/pyprideap/lod.py | grep -B 20 "lod_matrix = pd.DataFrame" | head -25
🏁 Script executed:
# Get the full function definition that contains line 198-204
cat -n src/pyprideap/lod.py | sed -n '170,180p'
🏁 Script executed:
# Search for the function signature that contains this code
rg -B 30 "np.tile\(base.values" src/pyprideap/lod.py | head -40
🏁 Script executed:
# Find _nclod_base function
rg -A 10 "def _nclod_base" src/pyprideap/lod.py
🏁 Script executed:
# Find _intensity_adjustment function
rg -A 10 "def _intensity_adjustment" src/pyprideap/lod.py
🏁 Script executed:
# Check the imports and pandas type hints
rg "import pandas|from pandas" src/pyprideap/lod.py
🏁 Script executed:
# Check for mypy configuration
fd -e toml -e ini | xargs grep -l "mypy" 2>/dev/null
🏁 Script executed:
# Check pyproject.toml for type checking config
cat pyproject.toml | grep -A 10 mypy
🏁 Script executed:
# Get the complete definition of _nclod_base to see the exact return type
cat -n src/pyprideap/lod.py | sed -n '110,150p'
🏁 Script executed:
# Check what numpy.tile expects for its first argument and what it returns
python3 << 'EOF'
import inspect
import numpy as np
# Check np.tile signature
sig = inspect.signature(np.tile)
print("np.tile signature:", sig)
# Check return annotation
print("\nnp.tile return annotation:", sig.return_annotation)
EOF
🏁 Script executed:
# Look for any mypy test or type checking in the CI/tests
cat .github/workflows/ci.yml
🏁 Script executed:
# Check if there are any type stubs or annotations for Series.values
rg "Series.values" src/pyprideap/lod.py
🏁 Script executed:
# Let's look at the full context around the function to understand the types better
cat -n src/pyprideap/lod.py | sed -n '1,30p'
🏁 Script executed:
# Check the pandas imports and version specifications
rg "pandas" pyproject.toml
🏁 Script executed:
# Let's check if there are any tests for this function
cat -n tests/test_lod.py | head -80
🏁 Script executed:
# Look at the actual return type by checking what pd.Series.values returns according to pandas stubs
# and also check numpy.tile's behavior with Series.values
python3 << 'EOF'
# Since we can't import numpy/pandas directly in sandbox, let's check the types from pandas documentation
# pandas.Series.values returns ndarray | ExtensionArray in strict typing
# numpy.tile(arr, reps) expects array_like, which accepts both ndarray and ExtensionArray
# But the return type of np.tile is always ndarray
# Let's check if there's a type issue with pd.DataFrame.add when receiving values from another Series
print("Analysis of type chain:")
print("1. base: pd.Series (return from _nclod_base)")
print("2. base.values: ndarray | ExtensionArray (pandas type stubs definition)")
print("3. np.tile(base.values, ...): always returns ndarray")
print("4. pd.DataFrame(..., data=ndarray): creates DataFrame[Any] or DataFrame[Any] based on dtype")
print("5. lod_matrix.add(adj.values, axis=0): when adj.values is ndarray | ExtensionArray, result could be Any")
print("")
print("The issue is that when intermediate types are union types, mypy may lose track of the final type.")
EOF
🏁 Script executed:
# Check for type annotations or comments in the file
cat -n src/pyprideap/lod.py | sed -n '1,50p'
🏁 Script executed:
# Look for the exact types that pandas-stubs uses for Series
rg "def values" src/pyprideap/ || echo "No explicit values definition in source"
🏁 Script executed:
# Let's examine the entire _nclod_base function to verify its return type
cat -n src/pyprideap/lod.py | sed -n '108,128p'
🏁 Script executed:
# Let's verify if there are actual mypy errors by checking if mypy is run in CI
# and looking at any mypy output or error logs
grep -r "mypy" .github/workflows/
🏁 Script executed:
# Let's also check the full _intensity_adjustment function to see its complete return type
cat -n src/pyprideap/lod.py | sed -n '130,170p'
🏁 Script executed:
# Check if there's a setup.py or any build logs that might show mypy errors
fd "setup.py|setup.cfg|*.log" --type f
🏁 Script executed:
# Let's verify what the expression index length should be vs samples length
cat -n tests/test_lod.py | sed -n '70,150p'
🏁 Script executed:
# Check the AffinityDataset definition to understand expression vs samples
rg "class AffinityDataset" src/pyprideap/ -A 15
🏁 Script executed:
# Let's check the actual comparison between samples and expression indices
cat -n tests/test_lod.py | sed -n '23,80p'
🏁 Script executed:
# Let's verify that dataset.samples and dataset.expression have the same length
python3 << 'EOF'
# From test setup, we can see:
# - samples DataFrame is created with n_samples rows
# - expression DataFrame is created with n_samples rows
# - They should have the same number of rows (samples)
print("Dataset structure verification:")
print("- samples: rows = sample count, columns = sample metadata")
print("- expression: rows = samples (same as samples), columns = assays")
print("- base (from _nclod_base): Series indexed by assay names, one value per assay")
print("- adj (from _intensity_adjustment): Series indexed by sample index, one value per sample")
print("")
print("Code at line 199:")
print(" np.tile(base.values, (len(dataset.samples), 1))")
print(" - Tiles base (assays) across all samples")
print(" - Result shape: (num_samples, num_assays)")
print("")
print("Code at line 200-201:")
print(" columns=base.index -> one column per assay")
print(" index=dataset.expression.index -> one row per sample")
print("")
print("The suggested fix changes to:")
print(" np.tile(base_values, (len(dataset.expression.index), 1))")
print(" - Same result since len(dataset.expression.index) == len(dataset.samples)")
print(" - But more semantically explicit")
EOF
🏁 Script executed:
# Let's check pandas-stubs to understand the actual type issue better
# Search for any type: ignore comments that might exist
rg "type: ignore" src/pyprideap/lod.py
🏁 Script executed:
# Let's also look at what the actual code does with Series.add
cat -n src/pyprideap/lod.py | sed -n '195,210p'
🏁 Script executed:
# Check if there are any existing type hints or casts in the file
rg "cast\|TYPE_CHECKING" src/pyprideap/lod.py
🏁 Script executed:
# Let's look at whether the return type truly degrades to Any by examining the full function signature
cat -n src/pyprideap/lod.py | sed -n '170,175p'
🏁 Script executed:
# Check if there's a way to see actual mypy output or check for pandas-stubs behavior
# Let's verify the pandas-stubs typing for DataFrame.add with array-like arguments
python3 << 'EOF'
# According to pandas-stubs and pandas typing, when you call:
# DataFrame.add(other: ArrayLike, axis=0)
# where ArrayLike = ndarray | ExtensionArray | Index | Series
# The return type depends on the union narrowing
# The chain:
# 1. adj: pd.Series | None (checked against None, so pd.Series in this context)
# 2. adj.values: ndarray | ExtensionArray (per pandas-stubs)
# 3. lod_matrix.add(adj.values, axis=0): DataFrame
# When the argument is a union type (ndarray | ExtensionArray),
# mypy may not be able to narrow the return type and could conservatively
# assume Any or a less precise type.
# The suggested fix:
# - base_values: ndarray (explicit conversion)
# - adjustment: pd.Series (explicit wrapping)
# Both have clear, non-union types that mypy can track through the chain
print("Type narrowing analysis:")
print("Original code has union types (ndarray | ExtensionArray)")
print("at multiple points in the chain, causing potential type loss.")
print("")
print("Suggested fix converts these to concrete types:")
print("- to_numpy() -> ndarray")
print("- Explicit pd.Series wrapping -> Series")
print("")
print("This should preserve type information through the chain.")
EOF
🏁 Script executed:
# Let's verify the behavior difference between the two approaches
# by checking if there are any tests that might fail with the suggested changes
grep -n "compute_nclod\|plate_adjusted" tests/test_lod.py | head -20
🏁 Script executed:
# Check for plate-adjusted LOD tests
rg -A 20 "plate_adjusted.*True" tests/test_lod.py
🏁 Script executed:
# Let's examine the entire flow more carefully and check if there are any issues with the suggested fix
cat -n src/pyprideap/lod.py | sed -n '107,130p'
🏁 Script executed:
# Check what happens when Series are passed to np.tile and DataFrame constructor
python3 << 'EOF'
# Trace through the suggested fix vs original:
print("=== ORIGINAL CODE ===")
print("base: pd.Series -> base.values: ndarray | ExtensionArray")
print("np.tile(base.values, ...) -> ndarray")
print("pd.DataFrame(ndarray, ...) -> DataFrame")
print("adj: pd.Series | None -> adj.values: ndarray | ExtensionArray")
print("DataFrame.add(ndarray | ExtensionArray, axis=0) -> potentially loses type info with union")
print("return lod_matrix -> return type: DataFrame[Any] or imprecise type")
print("")
print("=== SUGGESTED FIX ===")
print("base_values = base.to_numpy(dtype=float, na_value=np.nan) -> ndarray[float]")
print("np.tile(base_values, ...) -> ndarray[float]")
print("pd.DataFrame(ndarray[float], ...) -> DataFrame[float]")
print("adjustment = pd.Series(adj.to_numpy(dtype=float), index=...) -> Series[float]")
print("DataFrame.add(Series[float], axis=0) -> DataFrame[float]")
print("return lod_matrix -> return type: DataFrame[float]")
print("")
print("Type preservation: ✓ Fixes union type issue")
print("Semantic correctness: ✓ Converts to float, handles NA values")
print("Index alignment: ✓ Explicitly sets index to dataset.expression.index")
EOF
🏁 Script executed:
# Let's verify there are no tests that would break with these changes
grep -B 5 -A 10 "plate_adjusted" tests/test_lod.py | grep -A 10 "def test"
🏁 Script executed:
# Check if there are tests specifically for plate-adjusted NCLOD
fd "test.*lod" --type f -x cat {} \; | grep -B 5 -A 10 "compute_nclod.*plate"
🏁 Script executed:
# Let's look for any tests that verify the return type or structure
rg "def test.*nclod" tests/test_lod.py -A 20 | head -80
🏁 Script executed:
# Let's verify the exact behavior of to_numpy and ensure the fix is sound
python3 << 'EOF'
# Check pandas Series.to_numpy signature and behavior
print("Verification of suggested fix:")
print("")
print("1. base.to_numpy(dtype=float, na_value=np.nan)")
print(" - Returns ndarray with float dtype")
print(" - Handles NA values properly")
print(" - base is pd.Series from _nclod_base, numeric data")
print("")
print("2. pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)")
print(" - Creates Series from ndarray with explicit index")
print(" - Ensures index alignment")
print(" - adj is pd.Series from _intensity_adjustment, indexed by samples")
print("")
print("3. np.tile(base_values, (len(dataset.expression.index), 1))")
print(" - base_values is ndarray[float] (no union type)")
print(" - Clear type throughout")
print("")
print("4. lod_matrix.add(adjustment, axis=0)")
print(" - Both operands have clear types")
print(" - DataFrame.add(Series, axis=0) preserves DataFrame type")
print("")
print("Result: Type chain is fully typed, no Any inference needed")
EOF
🏁 Script executed:
# Let's look for any compute_nclod tests that test plate_adjusted behavior
cat -n tests/test_lod.py | grep -B 5 -A 15 "def test" | grep -B 5 -A 15 "compute_nclod\|compute_lod_from_controls\|plate"
🏁 Script executed:
# Look for integration tests that actually call compute_nclod with plate_adjusted
rg "compute_nclod\|compute_lod_from_controls" tests/test_lod.py -B 2 -A 5
🏁 Script executed:
# Check the complete test file to see if plate_adjusted scenarios are tested
wc -l tests/test_lod.py && cat -n tests/test_lod.py | tail -100
🏁 Script executed:
# Let's search for any tests that specifically test the plate_adjusted=True path
rg "compute_nclod" tests/test_lod.py -B 2 -A 8
🏁 Script executed:
# Check if there's a test that sets up PlateID to test the plate_adjusted path
rg "PlateID\|plate_adjusted" tests/test_lod.py -B 3 -A 3
🏁 Script executed:
# Let's verify if the suggested fix maintains the correct behavior by checking the indices
python3 << 'EOF'
# Verify the index alignment in the suggested fix
print("Index alignment verification:")
print("")
print("Original code:")
print(" index=dataset.expression.index")
print("")
print("Suggested fix:")
print(" adjustment = pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)")
print("")
print("Key observation:")
print("- _intensity_adjustment returns Series indexed by dataset.samples.index")
print("- In the fix, adjustment is re-indexed to dataset.expression.index")
print("- This assumes dataset.samples.index == dataset.expression.index")
print("- From AffinityDataset definition, both have same row count")
print("")
print("Potential issue:")
print("- If indices don't match in order, re-indexing could cause NaN values")
print("- The original code uses adj.values which doesn't check index alignment")
print("- The suggested fix explicitly re-indexes, which is safer but might change behavior")
EOF
🏁 Script executed:
# Let's check if dataset.samples.index and dataset.expression.index are always aligned
grep -B 5 -A 10 "class AffinityDataset" src/pyprideap/core.py
🏁 Script executed:
# Let's look at how the datasets are created in tests to see if indices are aligned
cat -n tests/test_lod.py | sed -n '39,80p'
🏁 Script executed:
# Let's verify one more important thing - whether the indices actually match or if there's a behavioral change risk
python3 << 'EOF'
# Check the implications of the behavioral change
print("Behavioral analysis of the suggested fix:")
print("")
print("Original code:")
print(" lod_matrix.add(adj.values, axis=0)")
print(" - adj.values is ndarray, treated as positional broadcast")
print(" - Assumes values are in the same order as lod_matrix rows")
print("")
print("Suggested fix:")
print(" adjustment = pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)")
print(" lod_matrix.add(adjustment, axis=0)")
print(" - adjustment is Series with explicit index")
print(" - Pandas will align by index when adding Series")
print("")
print("Risk assessment:")
print("- If indices match by value and order: Behavior is identical")
print("- If indices don't match: Suggested fix could cause NaN due to index misalignment")
print("")
print("From code inspection:")
print("- Both samples and expression use default RangeIndex in tests")
print("- _intensity_adjustment returns Series(index=dataset.samples.index)")
print("- Default index is RangeIndex(0, n_samples)")
print("- Both should be identical in practice")
print("")
print("Conclusion: Suggested fix is safer (explicit index) and equivalent in typical cases")
EOF
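The positional-versus-label distinction discussed above can be demonstrated with a minimal runnable snippet (illustrative variable names, not from the library): `DataFrame.add` broadcasts a plain ndarray positionally, but aligns a Series by index label, so a mismatched index silently produces NaN rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=[0, 1])

# ndarray: broadcast positionally down the rows, no index check
by_position = df.add(np.array([10.0, 20.0]), axis=0)

# Series with a matching index: pandas aligns by label, same result
by_label = df.add(pd.Series([10.0, 20.0], index=[0, 1]), axis=0)
assert by_position.equals(by_label)

# Series with a mismatched index: alignment introduces NaN rows
shifted = df.add(pd.Series([10.0, 20.0], index=[1, 2]), axis=0)
assert np.isnan(shifted.loc[0, "a"])   # label 0 is missing from the Series
assert shifted.loc[1, "a"] == 12.0     # label 1 is present in both
```

This is why the suggested fix is equivalent to the original only when `dataset.samples.index` and `dataset.expression.index` match in both value and order.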
🏁 Script executed:
# Check if there are any mypy configuration details that would affect type narrowing
cat pyproject.toml | grep -A 5 "mypy"
🏁 Script executed:
# Look at the complete function to understand all type implications
cat -n src/pyprideap/lod.py | sed -n '170,205p'
Improve type narrowing for DataFrame return type.
base.values and adj.values are typed as ndarray | ExtensionArray by pandas-stubs, which prevents mypy from tracking a precise return type through the DataFrame construction and addition operations. Converting to explicit concrete types (ndarray via to_numpy() and Series via explicit construction) ensures the return type remains DataFrame rather than degrading to Any under warn_return_any mode.
Suggested fix
+ base_values = base.to_numpy(dtype=float, na_value=np.nan)
+ adjustment = pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)
lod_matrix = pd.DataFrame(
- np.tile(base.values, (len(dataset.samples), 1)),
+ np.tile(base_values, (len(dataset.expression.index), 1)),
columns=base.index,
index=dataset.expression.index,
)
- lod_matrix = lod_matrix.add(adj.values, axis=0)
- return lod_matrix
+ return lod_matrix.add(adjustment, axis=0)
🧰 Tools
🪛 GitHub Actions: CI
[error] 199-199: mypy: Argument 1 to "tile" has incompatible type "ndarray[tuple[int], dtype[Any]] | ExtensionArray"; expected "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]]" (arg-type)
[error] 204-204: mypy: Returning Any from function declared to return "DataFrame | Series[Any]". (no-any-return)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/lod.py` around lines 198 - 204, The DataFrame construction and
addition degrade type precision because base.values and adj.values are typed as
ndarray | ExtensionArray; update the lod_matrix creation to use base.to_numpy()
when tiling and convert adj to a pandas.Series (or ensure adj is a Series)
before adding so mypy can infer a concrete DataFrame return type: use
np.tile(base.to_numpy(), (len(dataset.samples), 1)) for the data and ensure the
addition uses a Series (e.g., pd.Series(adj) or adj if already a Series) with
axis=0 when calling lod_matrix.add; keep the same column=index and
index=dataset.expression.index and return lod_matrix.
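The `ndarray | ExtensionArray` distinction that trips up mypy here is also visible at runtime. The sketch below (standalone, not from the library) shows that `.values` on a nullable-dtype Series returns an ExtensionArray, while `.to_numpy()` with an explicit `dtype` and `na_value` always yields a concrete ndarray — which is what lets mypy keep a precise return type.

```python
import numpy as np
import pandas as pd

# A nullable-integer Series is backed by an ExtensionArray, not an ndarray
s = pd.Series([1, 2, None], dtype="Int64")

vals = s.values  # IntegerArray -- hence the ndarray | ExtensionArray stub type
assert not isinstance(vals, np.ndarray)

# to_numpy() with explicit dtype and na_value yields a concrete float ndarray
arr = s.to_numpy(dtype=float, na_value=np.nan)
assert isinstance(arr, np.ndarray)
assert arr.dtype == np.float64
assert np.isnan(arr[2])
```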
_QC_LOD_COLORS = {
    "PASS & NPX > LOD": "#2ecc71",
    "PASS & NPX ≤ LOD": "#3498db",
    "WARN & NPX > LOD": "#f39c12",
    "WARN & NPX ≤ LOD": "#e74c3c",
    "FAIL & NPX > LOD": "#95a5a6",
    "FAIL & NPX ≤ LOD": "#7f8c8d",
    "PASS": "#2ecc71",
    "WARN": "#f39c12",
    "FAIL": "#e74c3c",
    "NA": "#95a5a6",
}
Verify color semantics for QC categories.
The color assignment may cause confusion: "WARN & NPX ≤ LOD" uses red (#e74c3c), which is the same color as fallback "FAIL", while "FAIL & NPX > LOD" and "FAIL & NPX ≤ LOD" use gray tones. This makes WARN conditions appear more severe than FAIL conditions visually.
If intentional (e.g., "below LOD" is the primary concern), consider documenting the rationale. Otherwise, consider adjusting colors so FAIL-related categories use more alarming colors than WARN-related ones.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/render.py` around lines 31 - 42, The _QC_LOD_COLORS mapping
assigns alarming red ("#e74c3c") to "WARN & NPX ≤ LOD" while various "FAIL" keys
use gray, which flips visual severity; update the _QC_LOD_COLORS dict so
FAIL-related keys (e.g., "FAIL", "FAIL & NPX > LOD", "FAIL & NPX ≤ LOD") use a
more alarming color (e.g., red) and WARN-related keys (e.g., "WARN", "WARN & NPX
> LOD", "WARN & NPX ≤ LOD") use a less severe color (e.g., orange), or document
the intended semantics if the current mapping is deliberate; locate and change
the color hex values in the _QC_LOD_COLORS constant in render.py accordingly.
cumulative = 0.0
for cat, cnt, color in zip(data.categories, data.counts, colors):
    pct = cnt / total * 100 if total > 0 else 0
    fig.add_trace(
        go.Bar(
            x=["Samples"],
            y=[pct],
            name=f"{cat} {cnt} ({pct:.1f}%)",
            marker_color=color,
            text=f"{pct:.1f}%",
            textposition="inside",
        )
    )
    cumulative += pct
Remove unused cumulative variable.
The variable cumulative is incremented on line 123 but never read. This appears to be dead code, possibly a leftover from development.
🧹 Proposed fix
fig = go.Figure()
- cumulative = 0.0
for cat, cnt, color in zip(data.categories, data.counts, colors):
pct = cnt / total * 100 if total > 0 else 0
fig.add_trace(
go.Bar(
x=["Samples"],
y=[pct],
name=f"{cat} {cnt} ({pct:.1f}%)",
marker_color=color,
text=f"{pct:.1f}%",
textposition="inside",
)
)
- cumulative += pct
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for cat, cnt, color in zip(data.categories, data.counts, colors):
    pct = cnt / total * 100 if total > 0 else 0
    fig.add_trace(
        go.Bar(
            x=["Samples"],
            y=[pct],
            name=f"{cat} {cnt} ({pct:.1f}%)",
            marker_color=color,
            text=f"{pct:.1f}%",
            textposition="inside",
        )
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/qc/render.py` around lines 110 - 123, Remove the dead variable
cumulative in the loop inside the render function (where cumulative = 0.0 is set
and cumulative += pct is applied); delete both the initialization and the
increment since cumulative is never read, leaving the loop that builds the
go.Bar traces using pct, data.categories, data.counts, and colors intact.
# Build per-sample × per-assay LOD matrix if LOD column exists
if "LOD" in df.columns:
    lod_matrix = df.pivot_table(
        index="SampleID",
        columns="OlinkID",
        values="LOD",
        aggfunc="first",
    )
    lod_matrix = lod_matrix.reindex(sample_order).reset_index(drop=True)
    metadata["lod_matrix"] = lod_matrix
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the olink_csv.py file in full context around the flagged lines
cat -n src/pyprideap/readers/olink_csv.py | head -100
🏁 Script executed:
# Check the file size to understand its structure
wc -l src/pyprideap/readers/olink_csv.py
🏁 Script executed:
# Look at olink_xlsx.py to verify the same pattern exists
cat -n src/pyprideap/readers/olink_xlsx.py | head -100
🏁 Script executed:
# Check lod.py to understand how lod_matrix is consumed
cat -n src/pyprideap/lod.py
🏁 Script executed:
python3 << 'EOF'
import pandas as pd
import numpy as np
# Simulate data: one assay (OlinkID_A) has NPX values but all-null LOD
df = pd.DataFrame({
'SampleID': ['S1', 'S1', 'S2', 'S2'],
'OlinkID': ['OlinkID_A', 'OlinkID_B', 'OlinkID_A', 'OlinkID_B'],
'NPX': [1.0, 2.0, 3.0, 4.0],
'LOD': [0.1, 0.2, 0.3, np.nan], # OlinkID_B has one null LOD
})
print("Original DataFrame:")
print(df)
print("\n" + "="*60)
# Build expression (NPX pivot)
expression = df.pivot_table(
index='SampleID',
columns='OlinkID',
values='NPX',
aggfunc='first',
)
print("\nExpression (NPX pivot):")
print(expression)
print(f"Columns: {list(expression.columns)}")
print("\n" + "="*60)
# Build lod_matrix (LOD pivot) with some null values
lod_matrix = df.pivot_table(
index='SampleID',
columns='OlinkID',
values='LOD',
aggfunc='first',
)
print("\nLOD Matrix (LOD pivot):")
print(lod_matrix)
print(f"Columns: {list(lod_matrix.columns)}")
print("\n" + "="*60)
print("\nColumn alignment issue:")
print(f"expression has {len(expression.columns)} columns: {list(expression.columns)}")
print(f"lod_matrix has {len(lod_matrix.columns)} columns: {list(lod_matrix.columns)}")
print(f"Alignment problem: {set(expression.columns) != set(lod_matrix.columns)}")
# Test with completely null LOD for one OlinkID
print("\n" + "="*60)
print("\nTest 2: OlinkID with ALL-NULL LOD")
df2 = pd.DataFrame({
'SampleID': ['S1', 'S1', 'S2', 'S2'],
'OlinkID': ['OlinkID_A', 'OlinkID_B', 'OlinkID_A', 'OlinkID_B'],
'NPX': [1.0, 2.0, 3.0, 4.0],
'LOD': [np.nan, np.nan, np.nan, np.nan], # ALL null LOD
})
expression2 = df2.pivot_table(
index='SampleID',
columns='OlinkID',
values='NPX',
aggfunc='first',
)
lod_matrix2 = df2.pivot_table(
index='SampleID',
columns='OlinkID',
values='LOD',
aggfunc='first',
)
print(f"expression2 columns: {list(expression2.columns)}")
print(f"lod_matrix2 columns: {list(lod_matrix2.columns)}")
print(f"Alignment problem: {set(expression2.columns) != set(lod_matrix2.columns)}")
EOF
🏁 Script executed:
# Also verify the suggested fix would work
python3 << 'EOF'
import pandas as pd
import numpy as np
df = pd.DataFrame({
'SampleID': ['S1', 'S1', 'S2', 'S2'],
'OlinkID': ['OlinkID_A', 'OlinkID_B', 'OlinkID_A', 'OlinkID_B'],
'NPX': [1.0, 2.0, 3.0, 4.0],
'LOD': [np.nan, np.nan, np.nan, np.nan],
})
sample_order = df['SampleID'].unique()
expression = df.pivot_table(
index='SampleID', columns='OlinkID', values='NPX', aggfunc='first',
).reindex(sample_order).reset_index(drop=True)
lod_matrix = df.pivot_table(
index='SampleID', columns='OlinkID', values='LOD', aggfunc='first',
)
print("Before fix:")
print(f"expression columns: {list(expression.columns)}")
print(f"lod_matrix columns: {list(lod_matrix.columns)}")
# Apply suggested fix
lod_matrix_fixed = lod_matrix.reindex(
index=sample_order,
columns=expression.columns,
).reset_index(drop=True)
print("\nAfter fix:")
print(f"expression columns: {list(expression.columns)}")
print(f"lod_matrix_fixed columns: {list(lod_matrix_fixed.columns)}")
print(f"Aligned: {list(expression.columns) == list(lod_matrix_fixed.columns)}")
EOF
🏁 Script executed:
# Check if there are any tests in the codebase that verify lod_matrix alignment
find . -type f -name "*.py" -path "*/test*" | head -20
🏁 Script executed:
# Search for tests related to olink readers
find . -type f -name "*.py" | xargs grep -l "read_olink\|lod_matrix" | grep -i test
🏁 Script executed:
# Examine test_readers.py to see what's tested
cat -n tests/test_readers.py | head -200
🏁 Script executed:
# Examine test_lod.py
cat -n tests/test_lod.py | head -150
🏁 Script executed:
# Check conftest.py for fixtures
cat -n tests/conftest.py
🏁 Script executed:
# Look for fixture data that might test lod_matrix
find tests -type f \( -name "*.csv" -o -name "*.xlsx" \) | head -20
🏁 Script executed:
# Search specifically for any tests that verify lod_matrix has same columns as expression
rg "lod_matrix.*columns\|expression\.columns" tests/
🏁 Script executed:
# Check the olink sample data to see if it has LOD column and structure
head -20 tests/data/olink_sample.npx.csv
🏁 Script executed:
# Examine the structure more carefully
head -10 tests/data/olink_sample.npx.csv | cat -A
🏁 Script executed:
# Count rows and columns
wc -l tests/data/olink_sample.npx.csv
🏁 Script executed:
# Get the CSV structure to understand the test data
cut -d',' -f1-10 tests/data/olink_sample.npx.csv | head -15
🏁 Script executed:
# Check if there are any other usages of lod_matrix in the codebase
rg "lod_matrix" --type py -A 3 -B 3
Ensure lod_matrix columns match expression.columns at creation.
pivot_table() can drop assay columns where LOD values are entirely null, causing metadata["lod_matrix"] to be narrower than expression. While lod.py includes defensive alignment in _above_lod_matrix(), the mismatch should be prevented at creation time. This pattern is inconsistent with how features are explicitly aligned (line 43). The same issue exists in src/pyprideap/readers/olink_xlsx.py.
Suggested fix
if "LOD" in df.columns:
lod_matrix = df.pivot_table(
index="SampleID",
columns="OlinkID",
values="LOD",
aggfunc="first",
)
- lod_matrix = lod_matrix.reindex(sample_order).reset_index(drop=True)
+ lod_matrix = lod_matrix.reindex(
+ index=sample_order,
+ columns=expression.columns,
+ ).reset_index(drop=True)
metadata["lod_matrix"] = lod_matrix🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/pyprideap/readers/olink_csv.py` around lines 47 - 56, The LOD pivot can
drop assay columns with all-NA values causing metadata["lod_matrix"] to not
match expression.columns; after creating lod_matrix with df.pivot_table (and
before resetting index), reindex its columns to the same assay order used for
expression/features (the same reference used at line where features are aligned)
and fill missing columns with NaN so lod_matrix.columns == expression.columns;
apply the identical fix in olink_xlsx.py where lod_matrix is built, ensuring you
reference the same sample_order and assay ordering variable (expression.columns
or features) when reindexing.
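The column-dropping behavior behind this finding is easy to reproduce standalone (sample data below is illustrative, not from the test fixtures): `pivot_table` defaults to `dropna=True`, which silently omits result columns whose entries are all NaN, and reindexing against `expression.columns` restores them.

```python
import numpy as np
import pandas as pd

# Long-format rows where one assay (OID2) has no LOD values at all
df = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2", "S2"],
    "OlinkID":  ["OID1", "OID2", "OID1", "OID2"],
    "NPX":      [1.0, 2.0, 3.0, 4.0],
    "LOD":      [0.1, np.nan, 0.3, np.nan],
})
sample_order = ["S1", "S2"]

expression = df.pivot_table(index="SampleID", columns="OlinkID",
                            values="NPX", aggfunc="first")
lod = df.pivot_table(index="SampleID", columns="OlinkID",
                     values="LOD", aggfunc="first")

# pivot_table(dropna=True) silently drops the all-NaN OID2 column
assert list(lod.columns) == ["OID1"]

# Reindexing on expression.columns restores the dropped assay as NaN
lod_fixed = lod.reindex(index=sample_order, columns=expression.columns)
assert list(lod_fixed.columns) == list(expression.columns)
assert lod_fixed["OID2"].isna().all()
```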
This pull request introduces the initial implementation of the pyprideap Python library for handling affinity proteomics datasets from the PRIDE Affinity Archive (PAD). It includes core data structures, file readers for multiple platforms and formats, validation logic, statistics computation, and CI setup. The changes establish the foundation for reading, validating, and analyzing Olink and SomaScan datasets, and provide development tooling for linting, testing, and type checking.

Core library structure and data model:

- Defined core data structures (AffinityDataset, Platform, Level, ValidationResult) in src/pyprideap/core.py to represent datasets and validation results.
- Updated src/pyprideap/__init__.py to expose the top-level API, including reading, validation, statistics, and PRIDE client functionality.

File readers for supported platforms and formats:

- Added platform-specific readers, each producing an AffinityDataset. [1] [2] [3] [4] [5]
- Added src/pyprideap/readers/registry.py for automatic format detection and unified dataset reading.

Validation and statistics:

- Implemented a validate function to check datasets against PRIDE guidelines.
- Added src/pyprideap/stats.py with summary metrics for samples, features, detection rate, QC, and value distribution.

PRIDE Archive integration and development tooling:

- Added a PrideClient class in src/pyprideap/pride.py for fetching project metadata and files from the PRIDE Archive API.
- Added a CI workflow (.github/workflows/ci.yml) for linting, testing, and type checking, and defined project metadata and dev dependencies in pyproject.toml. [1] [2]

Documentation:

- Added README.md to document installation, usage, supported formats, validation rules, data model, and development instructions.

Summary by CodeRabbit
New Features
Documentation
Chores