feat: Add page number tracking and fix PyMuPDF thread-safety issues by KennethVrb · Pull Request #142 · Kohulan/DECIMER-Image-Segmentation

KennethVrb · 2026-02-27T08:19:24Z

Summary

This PR adds page number tracking for PDF segmentation and fixes critical thread-safety issues in multi-page PDF processing.

Changes

New Features:

Added return_page_numbers parameter to segment_chemical_structures_from_file() to return 1-indexed page numbers for each detected structure
Supports flexible return combinations: segments only, segments + bboxes, segments + page numbers, or all three
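The return combinations listed above could be assembled roughly as follows. This is a minimal sketch, not the code from this PR: `pack_results` is a hypothetical helper, and the output order (segments, then bboxes, then page numbers) is an assumption.

```python
from typing import Any, List, Tuple, Union


def pack_results(
    segments: List[Any],
    bboxes: List[Tuple[int, int, int, int]],
    page_numbers: List[int],
    return_bboxes: bool = False,
    return_page_numbers: bool = False,
) -> Union[List[Any], Tuple[List[Any], ...]]:
    """Hypothetical helper: build the return value from the two flags."""
    outputs: List[List[Any]] = [segments]
    if return_bboxes:
        outputs.append(bboxes)
    if return_page_numbers:
        outputs.append(page_numbers)
    # With no flag set, return the bare list of segments;
    # otherwise return a tuple in a fixed (assumed) order.
    return outputs[0] if len(outputs) == 1 else tuple(outputs)


segments = ["seg-a", "seg-b"]
bboxes = [(0, 0, 10, 10), (5, 5, 20, 20)]
pages = [1, 2]

only_segments = pack_results(segments, bboxes, pages)
with_pages = pack_results(segments, bboxes, pages, return_page_numbers=True)
everything = pack_results(
    segments, bboxes, pages, return_bboxes=True, return_page_numbers=True
)
```

Keeping a fixed order means callers can unpack the result positionally regardless of which flags they set.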

Bug Fixes:

Fixed PyMuPDF thread-safety issue by using separate document handles per thread instead of sharing a single document across threads
Fixed incorrect model path check in test configuration (now uses pystow location instead of hardcoded project path)
Replaced debug print() with proper logger.debug() call

Code Quality Improvements:

Refactored PDF loading into three focused functions (_load_pdf_pages, _load_pdf_single_page, _load_pdf_multipage) for better separation of concerns
Consistent use of context managers (with statements) for all PyMuPDF document handles

Testing:

Added comprehensive unit tests with mocked model detection for fast, reliable testing without requiring model weights
Tests validate proper alignment of segments, bboxes, and page numbers
Created tests/helpers.py with create_test_pdf() utility for generating test PDFs
Added mock_model_detection fixture in conftest.py
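The mocking approach described above can be illustrated with the standard library alone. This is a sketch under assumptions: `segmentation` is a stand-in namespace rather than the real module, and the `(masks, bboxes)` return shape of `get_mrcnn_results` is hypothetical. The PR itself uses pytest's `monkeypatch` fixture rather than `unittest.mock`.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module that owns the detection function.
segmentation = types.SimpleNamespace()


def real_get_mrcnn_results(image):
    # In this sketch the "real" detector is unusable without weights.
    raise RuntimeError("model weights are not available in this sketch")


segmentation.get_mrcnn_results = real_get_mrcnn_results


def segment(image):
    # Look the function up at call time so a patch takes effect.
    masks, bboxes = segmentation.get_mrcnn_results(image)
    return list(zip(masks, bboxes))


def fake_get_mrcnn_results(image):
    # Deterministic detections: two structures per image.
    return ["mask-0", "mask-1"], [(0, 0, 10, 10), (5, 5, 20, 20)]


def run_with_mock(image):
    # Swap in the fake detector only for the duration of the call.
    with mock.patch.object(segmentation, "get_mrcnn_results", fake_get_mrcnn_results):
        return segment(image)
```

Because the fake returns a known number of detections per page, a test can assert exact lengths and alignment without loading any weights.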


Copilot

Pull request overview

This PR extends segment_chemical_structures_from_file() to optionally track 1-indexed source page numbers for each extracted segment (primarily for PDFs) while refactoring PDF rendering to avoid PyMuPDF cross-thread document sharing.

Changes:

Added return_page_numbers option and updated return-shape handling in segment_chemical_structures_from_file().
Refactored PDF loading into single-page vs multi-page paths; multi-page rendering uses separate PyMuPDF document handles per thread.
Added/updated tests and fixtures for page-number alignment using generated PDFs and mocked detection.
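To make the page-number alignment concrete, here is a minimal sketch of flattening per-page detections into three parallel lists with 1-indexed page numbers. `flatten_with_page_numbers` and its input shape are hypothetical and only illustrate the idea, not the PR's implementation.

```python
from typing import Any, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]


def flatten_with_page_numbers(
    per_page: Sequence[Sequence[Tuple[Any, BBox]]],
) -> Tuple[List[Any], List[BBox], List[int]]:
    """Flatten per-page (segment, bbox) pairs into aligned lists."""
    segments: List[Any] = []
    bboxes: List[BBox] = []
    page_numbers: List[int] = []
    for page_index, detections in enumerate(per_page):
        for segment, bbox in detections:
            segments.append(segment)
            bboxes.append(bbox)
            # Pages are reported 1-indexed, so shift the 0-based index.
            page_numbers.append(page_index + 1)
    return segments, bboxes, page_numbers


# Page 1 has two structures, page 2 has none, page 3 has one.
detections = [
    [("s1", (0, 0, 1, 1)), ("s2", (2, 2, 3, 3))],
    [],
    [("s3", (4, 4, 5, 5))],
]
segs, boxes, pages = flatten_with_page_numbers(detections)
```

Entry `i` of each list refers to the same detected structure, which is the property the new tests check.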

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`decimer_segmentation/decimer_segmentation.py`	Adds page-number return support and refactors PDF page rendering for thread safety.
`tests/test_segmentation.py`	Adds new integration tests covering return combinations and page-number correctness.
`tests/helpers.py`	Introduces a helper to generate temporary multi-page PDFs for tests.
`tests/conftest.py`	Adds a mocked detection fixture and updates model-weight path checks to use `pystow`.


Copilot · 2026-03-04T08:28:40Z

tests/conftest.py

+    # Patch the get_mrcnn_results function
+    import decimer_segmentation.decimer_segmentation as ds
+
+    monkeypatch.setattr(ds, "get_mrcnn_results", mock_get_mrcnn_results)


mock_model_detection imports decimer_segmentation.decimer_segmentation to patch get_mrcnn_results, but that module eagerly initializes segmentation_model = get_model() at import time (which can download/load weights). This undermines the goal of running these tests without model weights/network and can make the suite slow/flaky. Consider making model initialization fully lazy in the library (e.g., call get_model() inside get_mrcnn_results rather than at module import) or otherwise provide a test hook to bypass model loading before importing the module.
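One way to reconcile lazy initialization with keeping the model in memory is to load on first use and cache the result. The sketch below assumes a hypothetical `load_model` stand-in for the expensive loader and a hypothetical `detect` caller; it is not the library's actual code.

```python
import threading
from typing import Any, Dict, List, Optional

LOAD_LOG: List[str] = []  # records each (expensive) load, for illustration
_model: Optional[Dict[str, Any]] = None
_model_lock = threading.Lock()


def load_model() -> Dict[str, Any]:
    """Hypothetical stand-in for an expensive model loader."""
    LOAD_LOG.append("loaded")
    return {"name": "dummy-model"}


def get_cached_model() -> Dict[str, Any]:
    """Load the model on first use, then reuse the same instance."""
    global _model
    if _model is None:
        with _model_lock:
            # Re-check inside the lock so only one thread loads.
            if _model is None:
                _model = load_model()
    return _model


def detect(image: str) -> str:
    # Nothing is loaded at import time; the first call pays the cost.
    model = get_cached_model()
    return f"{model['name']} processed {image}"
```

With this pattern, importing the module is cheap, tests can patch the detection function before anything is loaded, and every call after the first reuses the in-memory model.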

KennethVrb · Mar 12, 2026

@Kohulan I also noticed you are loading the model the moment decimer gets imported instead of lazy loading it.
Is there any reason for this?

Kohulan · Mar 12, 2026

I would like to keep the model loaded once it is initialised, since lazy loading takes time and this affects performance. Keeping it in memory allows one to run the segmentation faster.

Copilot · 2026-03-04T08:28:41Z

decimer_segmentation/decimer_segmentation.py

    def render_page(page_num: int) -> Tuple[int, np.ndarray]:
-        page = pdf_document[page_num]
-        matrix = pymupdf.Matrix(300 / 72, 300 / 72)
-        pix = page.get_pixmap(matrix=matrix, alpha=False)
-        img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
-            pix.h, pix.w, pix.n
-        )
-        if pix.n == 3:
-            img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
-        return page_num, img_array.copy()
+        # Each thread opens its own document handle for thread safety
+        with pymupdf.open(pdf_path) as doc:
+            page = doc[page_num]
+            matrix = pymupdf.Matrix(300 / 72, 300 / 72)
+            pix = page.get_pixmap(matrix=matrix, alpha=False)
+            img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
+                pix.h, pix.w, pix.n
+            )
+            if pix.n == 3:
+                img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
+            return page_num, img_array.copy()


In _load_pdf_multipage(), each page render opens the PDF anew (pymupdf.open(pdf_path)), so a multi-page document is opened page_count times. This adds significant overhead for larger PDFs. If thread-safety requires separate handles, consider reusing one handle per worker thread (e.g., via a threading.local() cache or ThreadPoolExecutor(initializer=...)) so each worker opens the document once and renders multiple pages.
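The per-worker handle idea can be sketched with `threading.local()`. To stay self-contained, the example uses a hypothetical `FakeDocument` in place of a real PDF handle; in the PR's setting the opener would presumably be `pymupdf.open(pdf_path)`, and whether handles may be closed from a different thread than the one that opened them is an assumption that would need checking against PyMuPDF.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


class FakeDocument:
    """Hypothetical stand-in for a PDF handle; counts how often it is opened."""

    open_count = 0
    _count_lock = threading.Lock()

    def __init__(self, path: str) -> None:
        with FakeDocument._count_lock:
            FakeDocument.open_count += 1
        self.path = path
        self.closed = False

    def render(self, page_num: int) -> str:
        return f"{self.path}:page-{page_num}"

    def close(self) -> None:
        self.closed = True


def render_all_pages(
    path: str, page_count: int, max_workers: int = 4
) -> List[Tuple[int, str]]:
    local = threading.local()          # one slot per worker thread
    handles: List[FakeDocument] = []   # tracked so they can be closed later
    handles_lock = threading.Lock()

    def get_doc() -> FakeDocument:
        doc = getattr(local, "doc", None)
        if doc is None:
            # First page rendered on this thread: open the document once.
            doc = FakeDocument(path)
            local.doc = doc
            with handles_lock:
                handles.append(doc)
        return doc

    def render_page(page_num: int) -> Tuple[int, str]:
        return page_num, get_doc().render(page_num)

    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() yields results in input order, so pages stay sorted.
            return list(pool.map(render_page, range(page_count)))
    finally:
        for doc in handles:
            doc.close()
```

With this layout the document is opened at most `max_workers` times instead of once per page, while no handle is ever shared between threads.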

KennethVrb · Mar 12, 2026

@Kohulan I can address this issue, but I'll have to tackle it sometime in the coming weeks.

Kohulan · Mar 12, 2026

I need to properly check what exactly you are doing, since I don't want to break anything with new implementations. Once I have a clear overview, I can merge this.

Meanwhile, if you could fix this issue, I can check over that. Thank you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Kohulan · 2026-03-11T21:08:24Z

@copilot open a new pull request to apply changes based on the comments in this thread

Kohulan and others added 3 commits February 25, 2026 15:54

Merge pull request Kohulan#140 from Kohulan/development

6914cac

feat: added return_bboxes parameter to segment_chemical_structures_from_file

feat: add return_page_numbers parameter to segment_chemical_structures_from_file

428cc2a

fix: improve thread safety in PDF loading by using separate document handles per thread

44c7fe5

Kohulan changed the base branch from master to development March 4, 2026 08:23

Kohulan requested a review from Copilot March 4, 2026 08:24

Copilot started reviewing on behalf of Kohulan March 4, 2026 08:24

Copilot AI reviewed Mar 4, 2026


Kohulan and others added 2 commits March 11, 2026 22:07

Update tests/helpers.py

72737f4

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update decimer_segmentation/decimer_segmentation.py

901117b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

KennethVrb wants to merge 5 commits into Kohulan:development from KennethVrb:master

3 participants
