
fix: preserve tail text in Selector.get_all_text #175

Closed
tsubasakong wants to merge 4 commits into D4Vinci:main from tsubasakong:fix/167-get-all-text-tail-nodes

@tsubasakong

Summary

  • preserve tail text nodes when collecting recursive text content
  • keep text order stable across nested children
  • add regression coverage for interleaved child/tail text and ignored-tag tails
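
To illustrate what "tail text" means in the summary above, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for the parser behind `Selector` (an assumption, chosen only because it exposes the same `.text` / `.tail` split). The helper names `text_only` and `text_with_tails` are hypothetical and are not taken from this PR.

```python
import xml.etree.ElementTree as ET


def text_only(element):
    """Recursively join each element's .text, ignoring .tail (the lossy variant)."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_only(child))
    return "".join(parts)


def text_with_tails(element):
    """Recursively join .text and each child's .tail, keeping document order."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_with_tails(child))
        # The tail is the text that follows the child's closing tag,
        # so it must be appended right after the child's own content.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring("<p>Hello <b>bold</b> world <i>it</i>!</p>")
print(text_only(root))        # Hello boldit
print(text_with_tails(root))  # Hello bold world it!
```

The second helper appends each child's tail immediately after that child's content, which is what keeps the order stable for interleaved child and tail text.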

Fixes #167

Validation

  • source .venv/bin/activate && pytest tests/parser/test_parser_advanced.py -q
  • source .venv/bin/activate && pytest tests/parser/test_general.py -q -k all_text
  • git diff --check

@D4Vinci added the PR-against-rules label ("This PR doesn't comply with one or more of the contribution rules.") on Mar 8, 2026
@D4Vinci
Owner

D4Vinci commented Mar 8, 2026

This PR duplicates #168 and does not comply with the contribution rules.

PR #168 is also the better approach because:

  1. Correctness: The ancestor-walk approach for ignored tags is the only one that handles arbitrarily nested ignored elements correctly. This PR still recurses into ignored elements, potentially leaking their content.
  2. Performance: Using a pre-compiled XPath(".//text()") at the module level is consistent with the existing codebase patterns (_find_all_elements, _find_all_elements_with_spaces are already pre-compiled the same way).
  3. Simplification: It correctly removes the unnecessary _find_all_elements expansion from ignored_elements — since it walks ancestors, it only needs the tag elements themselves in the set, which is simpler and faster.
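
The ancestor-walk idea from the points above can be sketched roughly as follows. This is not the code from PR #168: it uses the standard library's `xml.etree.ElementTree` with an explicit parent map instead of a pre-compiled `XPath(".//text()")`, and the names `IGNORED_TAGS`, `iter_text_nodes`, `has_ignored_ancestor`, and `get_all_text_sketch` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ignore list, for illustration only.
IGNORED_TAGS = frozenset({"script", "style"})


def iter_text_nodes(element):
    """Yield (owner, text) pairs in document order.

    The owner is the element that logically contains the text: the element
    itself for .text, and the parent for a child's .tail.
    """
    if element.text:
        yield element, element.text
    for child in element:
        yield from iter_text_nodes(child)
        if child.tail:
            yield element, child.tail


def has_ignored_ancestor(owner, parents, ignored_tags):
    """Walk from the owner up to the root, checking only tag names."""
    node = owner
    while node is not None:
        if node.tag in ignored_tags:
            return True
        node = parents.get(node)
    return False


def get_all_text_sketch(root, ignored_tags=IGNORED_TAGS):
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    return "".join(
        text
        for owner, text in iter_text_nodes(root)
        if not has_ignored_ancestor(owner, parents, ignored_tags)
    )


root = ET.fromstring("<div>a<script>x<b>y</b>z</script>b<style>s</style>c</div>")
print(get_all_text_sketch(root))  # abc
```

Because the check walks up from each text node, only the ignored tag names are needed; nested content such as `<b>y</b>` inside `<script>` is dropped, while the tails that follow `</script>` and `</style>` are kept.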

@D4Vinci closed this on Mar 8, 2026
Linked issue: Selector.get_all_text() doesn't get all text
