Skip to content

feat: Add page number tracking and fix PyMuPDF thread-safety issues#142

Open
KennethVrb wants to merge 5 commits intoKohulan:developmentfrom
KennethVrb:master
Open

feat: Add page number tracking and fix PyMuPDF thread-safety issues#142
KennethVrb wants to merge 5 commits intoKohulan:developmentfrom
KennethVrb:master

Conversation

@KennethVrb
Copy link

Summary

This PR adds page number tracking for PDF segmentation and fixes critical thread-safety issues in multi-page PDF processing.

Changes

New Features:

  • Added return_page_numbers parameter to segment_chemical_structures_from_file() to return 1-indexed page numbers for each detected structure
  • Supports flexible return combinations: segments only, segments + bboxes, segments + page numbers, or all three

Bug Fixes:

  • Fixed PyMuPDF thread-safety issue by using separate document handles per thread instead of sharing a single document across threads
  • Fixed incorrect model path check in test configuration (now uses pystow location instead of hardcoded project path)
  • Replaced debug print() with proper logger.debug() call

Code Quality Improvements:

  • Refactored PDF loading into three focused functions (_load_pdf_pages, _load_pdf_single_page, _load_pdf_multipage) for better separation of concerns
  • Consistent use of context managers (with statements) for all PyMuPDF document handles

Testing:

  • Added comprehensive unit tests with mocked model detection for fast, reliable testing without requiring model weights
  • Tests validate proper alignment of segments, bboxes, and page numbers
  • Created tests/helpers.py with create_test_pdf() utility for generating test PDFs
  • Added mock_model_detection fixture in conftest.py

@Kohulan Kohulan changed the base branch from master to development March 4, 2026 08:23
@Kohulan Kohulan requested a review from Copilot March 4, 2026 08:24
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends segment_chemical_structures_from_file() to optionally track 1-indexed source page numbers for each extracted segment (primarily for PDFs) while refactoring PDF rendering to avoid PyMuPDF cross-thread document sharing.

Changes:

  • Added return_page_numbers option and updated return-shape handling in segment_chemical_structures_from_file().
  • Refactored PDF loading into single-page vs multi-page paths; multi-page rendering uses separate PyMuPDF document handles per thread.
  • Added/updated tests and fixtures for page-number alignment using generated PDFs and mocked detection.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
decimer_segmentation/decimer_segmentation.py Adds page-number return support and refactors PDF page rendering for thread safety.
tests/test_segmentation.py Adds new integration tests covering return combinations and page-number correctness.
tests/helpers.py Introduces a helper to generate temporary multi-page PDFs for tests.
tests/conftest.py Adds a mocked detection fixture and updates model-weight path checks to use pystow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +118 to +121
# Patch the get_mrcnn_results function
import decimer_segmentation.decimer_segmentation as ds

monkeypatch.setattr(ds, "get_mrcnn_results", mock_get_mrcnn_results)
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock_model_detection imports decimer_segmentation.decimer_segmentation to patch get_mrcnn_results, but that module eagerly initializes segmentation_model = get_model() at import time (which can download/load weights). This undermines the goal of running these tests without model weights/network and can make the suite slow/flaky. Consider making model initialization fully lazy in the library (e.g., call get_model() inside get_mrcnn_results rather than at module import) or otherwise provide a test hook to bypass model loading before importing the module.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kohulan i also noticed you are loading the model the moment decimer gets imported instead of lazy loading it.
Is there any reason for this?

Copy link
Owner

@Kohulan Kohulan Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep the model loaded when it is initialised since lazy loading takes time and this affects performance, keeping it in memory allows one to run the segmentation faster.

Comment on lines 299 to +310
def render_page(page_num: int) -> Tuple[int, np.ndarray]:
page = pdf_document[page_num]
matrix = pymupdf.Matrix(300 / 72, 300 / 72)
pix = page.get_pixmap(matrix=matrix, alpha=False)
img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n
)
if pix.n == 3:
img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
return page_num, img_array.copy()
# Each thread opens its own document handle for thread safety
with pymupdf.open(pdf_path) as doc:
page = doc[page_num]
matrix = pymupdf.Matrix(300 / 72, 300 / 72)
pix = page.get_pixmap(matrix=matrix, alpha=False)
img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
pix.h, pix.w, pix.n
)
if pix.n == 3:
img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
return page_num, img_array.copy()
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _load_pdf_multipage(), each page render opens the PDF anew (pymupdf.open(pdf_path)) which means page_count opens for a multi-page document. This adds significant overhead for larger PDFs. If thread-safety requires separate handles, consider reusing one handle per worker thread (e.g., via threading.local() cache or ThreadPoolExecutor(initializer=...)) so each worker opens the document once and renders multiple pages.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kohulan i can address this issue but ill have to tackle it somewhere in the coming weeks.

Copy link
Owner

@Kohulan Kohulan Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to properly check what you are exactly doing since I don't want to break anything with new implementations. Once I have a clear overview, I can merge this.

Meanwhile, if you could fix this issue, then I can check over that. Thank you.

Kohulan and others added 2 commits March 11, 2026 22:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Kohulan
Copy link
Owner

Kohulan commented Mar 11, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants