feat: add bioRxiv submission package command (v1.19.0)#291
Conversation
New Features: - Add `rxiv biorxiv` command for complete submission package generation - Generate bioRxiv author template (TSV) with HTML entity encoding - Include manuscript PDF, source files, and ZIP archive - Support for custom directories and filenames (--biorxiv-dir, --zip-filename, --no-zip) - HTML entity encoding for special characters (António → António) - Automatic handling of multiple corresponding authors Code Improvements: - Centralize common submission logic in BaseCommand - Add shared helper methods: _clear_output_directory(), _ensure_pdf_built(), _set_submission_defaults() - Refactor ArxivCommand and BioRxivCommand to use shared methods - Eliminate ~64 lines of duplicated code Bug Fixes: - Fix special character handling for bioRxiv (use HTML entities instead of ASCII stripping) Tests: - Add 25 unit tests for bioRxiv functionality - Test HTML entity encoding, TSV generation, packaging - All tests passing (23 passed, 2 skipped) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Pull Request Review: bioRxiv Submission Package Command (v1.19.0)OverviewThis PR adds a new StrengthsCode Quality:
Testing:
Feature Implementation:
Issues and Recommendations1. Data Mutation - HIGH PRIORITY The code mutates the original config data which is a side effect that violates functional programming principles. 2. Inconsistent Error Handling - MEDIUM PRIORITY The code raises an error for multiple corresponding authors but then fixes it silently elsewhere. 3. Performance Issue - MEDIUM PRIORITY The dictionary is rebuilt on every function call which is expensive. 4. Missing ZIP Path Validation - MEDIUM PRIORITY Could fail silently or create files in unexpected locations. 5. Test Coverage Gaps
6. gitignore Changes - MINOR Security Considerations
Final VerdictRecommended Action: APPROVE WITH MINOR REVISIONS Priority fixes before merge:
Can be addressed in follow-up:
Excellent work on this feature! |
There was a problem hiding this comment.
Pull request overview
This PR adds a new rxiv biorxiv command that generates complete bioRxiv submission packages, including author templates with HTML entity encoding for international characters, manuscript PDFs, source files, and ZIP archives. The PR also refactors common submission logic into BaseCommand helper methods to reduce code duplication between the arxiv and biorxiv commands.
Changes:
- Added new
rxiv biorxivcommand with TSV author template generation using HTML entity encoding - Refactored submission command logic into shared BaseCommand helper methods (
_clear_output_directory,_ensure_pdf_built,_set_submission_defaults) - Added 25 unit tests for bioRxiv functionality covering HTML entity encoding, TSV generation, and validation
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/rxiv_maker/engines/operations/prepare_biorxiv.py | Core bioRxiv functionality: HTML entity encoding, TSV generation, package preparation, and ZIP creation |
| src/rxiv_maker/cli/commands/biorxiv.py | CLI command definition for bioRxiv submission package generation |
| src/rxiv_maker/cli/framework/base.py | Added shared helper methods for submission commands |
| src/rxiv_maker/cli/framework/workflow_commands.py | Refactored ArxivCommand to use shared helpers and added BioRxivCommand implementation |
| src/rxiv_maker/cli/main.py | Registered biorxiv command in CLI |
| src/rxiv_maker/cli/commands/init.py | Exported biorxiv command |
| tests/unit/test_prepare_biorxiv.py | Comprehensive unit tests for bioRxiv functionality |
| tests/unit/test_biorxiv_command.py | Basic CLI command test |
| src/rxiv_maker/version.py | Version bump to 1.19.0 |
| CHANGELOG.md | Documented new features and changes |
| .gitignore | Added development file patterns |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Handle multiple corresponding authors: keep only the last one | ||
| corresponding_indices = [i for i, author in enumerate(authors) if author.get("corresponding_author", False)] | ||
| if len(corresponding_indices) > 1: | ||
| # Unmark all but the last corresponding author | ||
| for idx in corresponding_indices[:-1]: | ||
| authors[idx]["corresponding_author"] = False | ||
| logger.warning( | ||
| f"Multiple corresponding authors found. Only keeping the last one: " | ||
| f"{authors[corresponding_indices[-1]].get('name', 'Unknown')}" | ||
| ) | ||
|
|
||
| # Validate author data | ||
| validate_author_data(authors) | ||
|
|
There was a problem hiding this comment.
The logic for handling multiple corresponding authors is confusing. Lines 187-196 automatically keep only the last corresponding author, but then lines 198-199 call validate_author_data() which checks for multiple corresponding authors and would raise an error. However, since the multiple authors have already been handled, the validation check for multiple corresponding authors becomes dead code.
Consider either: (1) moving the validation before the automatic handling, and only handle multiple authors if validation fails, or (2) removing the multiple-corresponding-author check from validate_author_data() since it's handled at the call site. The current approach makes the validation function's behavior inconsistent with its actual usage.
| # Handle multiple corresponding authors: keep only the last one | |
| corresponding_indices = [i for i, author in enumerate(authors) if author.get("corresponding_author", False)] | |
| if len(corresponding_indices) > 1: | |
| # Unmark all but the last corresponding author | |
| for idx in corresponding_indices[:-1]: | |
| authors[idx]["corresponding_author"] = False | |
| logger.warning( | |
| f"Multiple corresponding authors found. Only keeping the last one: " | |
| f"{authors[corresponding_indices[-1]].get('name', 'Unknown')}" | |
| ) | |
| # Validate author data | |
| validate_author_data(authors) | |
| # Validate author data; if validation fails due to multiple corresponding authors, | |
| # automatically keep only the last corresponding author and re-validate. | |
| try: | |
| validate_author_data(authors) | |
| except BioRxivAuthorError: | |
| corresponding_indices = [ | |
| i for i, author in enumerate(authors) if author.get("corresponding_author", False) | |
| ] | |
| if len(corresponding_indices) > 1: | |
| # Unmark all but the last corresponding author | |
| for idx in corresponding_indices[:-1]: | |
| authors[idx]["corresponding_author"] = False | |
| logger.warning( | |
| f"Multiple corresponding authors found. Only keeping the last one: " | |
| f"{authors[corresponding_indices[-1]].get('name', 'Unknown')}" | |
| ) | |
| # Re-validate after automatic correction | |
| validate_author_data(authors) | |
| else: | |
| # The validation error was not due to multiple corresponding authors | |
| raise |
| Path to the created ZIP file | ||
| """ | ||
| # Use manuscript-aware naming if manuscript path is provided | ||
| if manuscript_path and zip_filename == "biorxiv_submission.zip": |
There was a problem hiding this comment.
The condition on line 367 will never be true in practice. The default zip_filename is set by _set_submission_defaults() to include the full path (e.g., "/path/to/output/manuscript_biorxiv.zip"), not just "biorxiv_submission.zip". This makes lines 367-369 unreachable code.
The manuscript_path parameter is already used by the caller to set an appropriate default filename via _set_submission_defaults(), so these lines appear to be redundant. Consider removing them or adjusting the logic to check if the filename matches the expected pattern rather than an exact string match.
| if manuscript_path and zip_filename == "biorxiv_submission.zip": | |
| if manuscript_path and Path(zip_filename).name == "biorxiv_submission.zip": |
| def test_multiple_authors(self, tmp_path): | ||
| """Test TSV with multiple authors.""" | ||
| config_path = tmp_path / "00_CONFIG.yml" | ||
| config_content = """ | ||
| authors: | ||
| - name: John Smith | ||
| email: john@example.com | ||
| affiliations: [inst1] | ||
| corresponding_author: false | ||
| - name: Jane Doe | ||
| email: jane@example.com | ||
| affiliations: [inst2] | ||
| corresponding_author: true | ||
|
|
||
| affiliations: | ||
| - shortname: inst1 | ||
| full_name: University A | ||
| - shortname: inst2 | ||
| full_name: University B | ||
| """ | ||
| config_path.write_text(config_content) | ||
|
|
||
| output_path = tmp_path / "biorxiv_authors.tsv" | ||
| generate_biorxiv_author_tsv(config_path, output_path) | ||
|
|
||
| with open(output_path, newline="", encoding="utf-8") as f: | ||
| reader = csv.reader(f, delimiter="\t") | ||
| rows = list(reader) | ||
|
|
||
| assert len(rows) == 3 # Header + 2 authors | ||
| assert rows[1][6] == "" # First author not corresponding (empty string) | ||
| assert rows[2][6] == "Yes" # Second author is corresponding | ||
|
|
There was a problem hiding this comment.
Missing test coverage for the automatic handling of multiple corresponding authors. While the PR description mentions "Automatic handling of multiple corresponding authors (keeps last one)" and there's code to implement this (lines 187-196), there's no test that verifies this behavior works correctly.
Consider adding a test case where the config has multiple authors marked as corresponding authors to verify that: (1) the function succeeds without raising an error, (2) only the last author is marked as corresponding in the output TSV, and (3) a warning is logged.
Summary
This PR adds a new
rxiv biorxivcommand that generates a complete bioRxiv submission package, including author template (TSV), manuscript PDF, source files, and a ready-to-upload ZIP archive.New Features
rxiv biorxivgenerates complete submission package--biorxiv-dir,--zip-filename,--no-zipCode Improvements
_clear_output_directory(),_ensure_pdf_built(),_set_submission_defaults()helper methodsBug Fixes
Testing
Checklist
🤖 Generated with Claude Code