feat: Add page number tracking and fix PyMuPDF thread-safety issues#142
feat: Add page number tracking and fix PyMuPDF thread-safety issues#142KennethVrb wants to merge 5 commits intoKohulan:developmentfrom
Conversation
feat: added return_bboxes parameter to segment_chemical_structures_from_file
…handles per thread
There was a problem hiding this comment.
Pull request overview
This PR extends segment_chemical_structures_from_file() to optionally track 1-indexed source page numbers for each extracted segment (primarily for PDFs) while refactoring PDF rendering to avoid PyMuPDF cross-thread document sharing.
Changes:
- Added
return_page_numbersoption and updated return-shape handling insegment_chemical_structures_from_file(). - Refactored PDF loading into single-page vs multi-page paths; multi-page rendering uses separate PyMuPDF document handles per thread.
- Added/updated tests and fixtures for page-number alignment using generated PDFs and mocked detection.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
decimer_segmentation/decimer_segmentation.py |
Adds page-number return support and refactors PDF page rendering for thread safety. |
tests/test_segmentation.py |
Adds new integration tests covering return combinations and page-number correctness. |
tests/helpers.py |
Introduces a helper to generate temporary multi-page PDFs for tests. |
tests/conftest.py |
Adds a mocked detection fixture and updates model-weight path checks to use pystow. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Patch the get_mrcnn_results function | ||
| import decimer_segmentation.decimer_segmentation as ds | ||
|
|
||
| monkeypatch.setattr(ds, "get_mrcnn_results", mock_get_mrcnn_results) |
There was a problem hiding this comment.
mock_model_detection imports decimer_segmentation.decimer_segmentation to patch get_mrcnn_results, but that module eagerly initializes segmentation_model = get_model() at import time (which can download/load weights). This undermines the goal of running these tests without model weights/network and can make the suite slow/flaky. Consider making model initialization fully lazy in the library (e.g., call get_model() inside get_mrcnn_results rather than at module import) or otherwise provide a test hook to bypass model loading before importing the module.
There was a problem hiding this comment.
@Kohulan i also noticed you are loading the model the moment decimer gets imported instead of lazy loading it.
Is there any reason for this?
There was a problem hiding this comment.
I would like to keep the model loaded when it is initialised since lazy loading takes time and this affects performance, keeping it in memory allows one to run the segmentation faster.
| def render_page(page_num: int) -> Tuple[int, np.ndarray]: | ||
| page = pdf_document[page_num] | ||
| matrix = pymupdf.Matrix(300 / 72, 300 / 72) | ||
| pix = page.get_pixmap(matrix=matrix, alpha=False) | ||
| img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape( | ||
| pix.h, pix.w, pix.n | ||
| ) | ||
| if pix.n == 3: | ||
| img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR) | ||
| return page_num, img_array.copy() | ||
| # Each thread opens its own document handle for thread safety | ||
| with pymupdf.open(pdf_path) as doc: | ||
| page = doc[page_num] | ||
| matrix = pymupdf.Matrix(300 / 72, 300 / 72) | ||
| pix = page.get_pixmap(matrix=matrix, alpha=False) | ||
| img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape( | ||
| pix.h, pix.w, pix.n | ||
| ) | ||
| if pix.n == 3: | ||
| img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR) | ||
| return page_num, img_array.copy() |
There was a problem hiding this comment.
In _load_pdf_multipage(), each page render opens the PDF anew (pymupdf.open(pdf_path)) which means page_count opens for a multi-page document. This adds significant overhead for larger PDFs. If thread-safety requires separate handles, consider reusing one handle per worker thread (e.g., via threading.local() cache or ThreadPoolExecutor(initializer=...)) so each worker opens the document once and renders multiple pages.
There was a problem hiding this comment.
@Kohulan i can address this issue but ill have to tackle it somewhere in the coming weeks.
There was a problem hiding this comment.
I need to properly check what you are exactly doing since I don't want to break anything with new implementations. Once I have a clear overview, I can merge this.
Meanwhile, if you could fix this issue, then I can check over that. Thank you.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@copilot open a new pull request to apply changes based on the comments in this thread |
Summary
This PR adds page number tracking for PDF segmentation and fixes critical thread-safety issues in multi-page PDF processing.
Changes
New Features:
return_page_numbersparameter to segment_chemical_structures_from_file() to return 1-indexed page numbers for each detected structureBug Fixes:
pystowlocation instead of hardcoded project path)print()with properlogger.debug()callCode Quality Improvements:
withstatements) for all PyMuPDF document handlesTesting: