ENH: Add page-range support to extract-images, extract-annotated-pages, and extract-text by tibisabau · Pull Request #215 · py-pdf/pdfly

tibisabau · 2025-12-08T13:54:17Z

Summary

This enhancement introduces consistent page-range filtering across three CLI subcommands: extract-images, extract-annotated-pages, and extract-text. Each now supports two new optional arguments, --from and --end, enabling selective processing of only a portion of the PDF rather than the entire document.

The Code was generated by GitHub Copilot. (GPT5)

e.g. Closes #194

Checklist:

A unit test is covering the code added / modified by this PR
In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder
A mention of the change is present in CHANGELOG.md
This PR is ready to be merged

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD 3-Clause license.

…s, and extract-text This enhancement introduces consistent page-range filtering across three CLI subcommands: extract-images, extract-annotated-pages, and extract-text. Each now supports two new optional arguments, --from-page and --to-page, enabling selective processing of only a portion of the PDF rather than the entire document.

Copilot

Pull request overview

This PR introduces page-range filtering functionality to three CLI subcommands (extract-images, extract-annotated-pages, and extract-text) by adding optional --from and --end parameters. These parameters enable users to selectively process specific portions of a PDF document rather than the entire file, using 0-based inclusive indexing.

Key Changes:

Added start and end optional parameters to all three extraction commands
Implemented filtering logic in each command's main function
Added comprehensive test coverage for the new range functionality
Updated documentation with usage examples for the new parameters

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
pdfly/extract_images.py	Added range filtering for image extraction using global image index
pdfly/extract_annotated_pages.py	Added range filtering for page extraction based on page index
pdfly/cli.py	Added `--from` and `--end` options to extract-images, extract-annotated-pages, and extract-text commands
tests/test_extract_images.py	Added tests for single-image and multi-image range extraction
tests/test_extract_annotated_pages.py	Added tests for page range filtering with annotations
tests/test_cli.py	Added tests for extract-text command with range parameters
docs/user/subcommand-extract-text.md	Updated documentation with range parameter usage examples
docs/user/subcommand-extract-images.md	Updated documentation with range parameter usage examples
docs/user/subcommand-extract-annotated-pages.md	Updated documentation with range parameter usage examples
resources/file-with-invalid-offsets.pdf	File appears modified but change seems unrelated to this PR

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pdfly/extract_images.py

pdfly/extract_annotated_pages.py

tests/test_extract_annotated_pages.py

pdfly/cli.py

tests/test_cli.py

tests/test_extract_images.py

tests/test_cli.py

pdfly/extract_annotated_pages.py

docs/user/subcommand-extract-images.md

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Lucas-C · 2026-01-19T06:36:11Z

pdfly/extract_images.py

+            if start is not None and global_index < start:
+                in_range = False
+            if end is not None and global_index > end:


It seems to me that the --start & --end arguments should be checked against the page_index, not a global_index that seems to be an index of all the images in the document.
Given that some pages in a document can have zero or several images on it, comparing from & start to this global_index won't work as expected IMHO.

Also, the same remark applies regarding "exclusive end indices" and the use of >= instead of >.

Lucas-C · 2026-01-19T06:36:58Z

tests/test_cli.py

+    from pathlib import Path
+    from .conftest import RESOURCES_ROOT, chdir


Why do you perform those imports inside this test function?

As can be seen in the current unit test files, we usually perform imports outside functions, at the top of files.

Lucas-C · 2026-01-19T06:39:28Z

pdfly/cli.py

+        if start is not None and i < start:
+            continue
+        if end is not None and i > end:


Usually in Python, when defining a range, the start index is inclusive but the end index is exclusive.

I think we should stick to this, and be clear about this in the docs.

There that would mean using if end is not None and i >= end insead of i > end.

What do you think? 🙂

Lucas-C · 2026-01-19T06:40:31Z

pdfly/extract_annotated_pages.py

+        if start is not None and i < start:
+            continue
+        if end is not None and i > end:


The same remark applies regarding "exclusive end indices" and the use of >= instead of >.

Lucas-C · 2026-01-19T06:40:59Z

resources/file-with-invalid-offsets.pdf

Why have you modified this file?

Lucas-C · 2026-01-19T06:42:11Z

tests/test_cli.py

+    captured = capsys.readouterr()
+    assert not captured.err
+    # We expect some non-empty text output for that page
+    assert captured.out.strip() != ""


It would be interesting to be more precise there, on asserting what the expected output is, especially given the fact that you are testing a very specific case with from == end == 0 , which should lead to zero output if we implement an exclusive end index.

Lucas-C · 2026-01-19T06:45:41Z

Hi @tibisabau 👋 🙂

I gave a few feedbacks regarding the current code.

I think once you have addressed them we could probably merge this!

Thank you for your work on this 🙏

Copilot AI review requested due to automatic review settings December 8, 2025 13:54

Copilot started reviewing on behalf of tibisabau December 8, 2025 13:54 View session

Update Changelog

7d13024

tibisabau changed the title ~~Draft: ENH: Add page-range support to extract-images, extract-annotated-pages, and extract-text~~ ENH: Add page-range support to extract-images, extract-annotated-pages, and extract-text Dec 8, 2025

Copilot AI reviewed Dec 8, 2025

View reviewed changes

Lucas-C reviewed Jan 19, 2026

View reviewed changes

resources/file-with-invalid-offsets.pdf

Copy link

Member

Lucas-C Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have you modified this file?

Lucas-C reviewed Jan 19, 2026

View reviewed changes

		from pathlib import Path
		from .conftest import RESOURCES_ROOT, chdir

Conversation

tibisabau commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026

Choose a reason for hiding this comment

Uh oh!

Lucas-C Jan 19, 2026

Choose a reason for hiding this comment

Uh oh!

Lucas-C commented Jan 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tibisabau commented Dec 8, 2025 •

edited

Loading

Lucas-C Jan 19, 2026 •

edited

Loading

Lucas-C Jan 19, 2026 •

edited

Loading

Lucas-C Jan 19, 2026 •

edited

Loading