Skip to content

Preserve HTML formatting in bookmarklet text selection capture#90

Open
Copilot wants to merge 9 commits intotrunkfrom
copilot/fix-formatting-issues
Open

Preserve HTML formatting in bookmarklet text selection capture#90
Copilot wants to merge 9 commits intotrunkfrom
copilot/fix-formatting-issues

Conversation

Copy link

Copilot AI commented Mar 6, 2026

  • Understand the issue: bookmarklet captures selected text as plain text, losing HTML formatting
  • Update assets/bookmarklet.js to capture selected HTML using getRangeAt() / cloneContents() / innerHTML and send as sel_html alongside plain-text s
  • Bookmarklet try-catch: Wrap DOM calls in try-catch; falls back to plain-text only on error
  • Fix entity encoding comparison: Remove selectionHtml !== selection guard; always send sel_html when HTML capture succeeded
  • Add sanitizeInlineContent() with strict allowlist (strong, em, b, i, u, s, a[href], code, mark, sub, sup, span, br); unwraps all other elements; strips all attributes except href on <a>
  • Block data: and vbscript: URI schemes alongside javascript: in href checks
  • Add listElementToBlock() with depth limit (cap 10) to prevent stack overflow
  • Add htmlToBlocks(html) with blockquote recursion depth limit (cap 5)
  • Fix tempEl.firstChild.innerHTMLtempEl.innerHTML (heading/paragraph/blockquote handlers broke with allowlist sanitizer)
  • Update buildSuggestedContent() to fall back to description quote block when selectionHtml produces empty blocks
  • Update src/App.js to always compute description from plain-text sources (let content builder decide priority)
  • Update JSDoc for buildSuggestedContent to document selectionHtml and fallback behavior
  • Export htmlToBlocks from src/utils/index.js
  • Update assets/bookmarklet.min.js
  • 38 tests pass (added: data:/vbscript: blocking, non-allowlist unwrapping, attribute stripping, depth-limit safety, fallback behavior)
  • Linting clean
  • CodeQL: no security alerts
Original prompt

This section details on the original issue you should resolve

<issue_title>It doesn't retain formatting</issue_title>
<issue_description>Hi

I like this thing a lot but most of the time it doesn't retain any formatting which makes it useless a bit.

For instance ask ChatGPT a question that generates a bulleted list, then use PressT to capture it. It will capture as a simple text which makes it super hard to read in the entry. This is also an issue with StackOverflow etc type of places where there might be some formatted entries.

Sample

In Unreal Engine: How to Choose

  • Unreal Engine allows developers to switch between deferred and forward rendering based on project needs:
    • Deferred Rendering: Default for most Unreal projects, optimized for high-quality scenes.
    • Forward Rendering: Enabled in the project settings, often for VR projects or specific cases where its advantages outweigh its limitations.

Each technique aligns with different performance and visual requirements, so understanding the project's priorities is key to making the right choice.
</issue_description>

Comments on the Issue (you are @copilot in this section)

@kraftbj Thank you for raising this issue.

The 2.0.0 modernization improves formatting handling in the editor itself—content pasted directly into the Gutenberg block editor preserves formatting as expected.

However, you're right that the bookmarklet's text selection capture still converts selections to plain text. This is a limitation of how the bookmarklet extracts selected text (window.getSelection().toString()).

We're leaving this open as a future enhancement to explore capturing formatted selections. In the meantime, for formatted content, pasting directly into the editor is the best approach.</comment_new>


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 2 commits March 6, 2026 23:12
Co-authored-by: kraftbj <88897+kraftbj@users.noreply.github.com>
- Bookmarklet now captures selected HTML using getRangeAt()/cloneContents()
  and sends it as `sel_html` alongside the plain-text `s` field
- Added htmlToBlocks() utility that converts HTML to Gutenberg block markup
  (handles headings h1-h6, paragraphs, ul/ol lists with nesting, blockquotes,
  and pre/code blocks; strips scripts/event handlers for safety)
- buildSuggestedContent() now renders HTML selections as formatted blocks
  instead of plain-text in a quote block
- App.js postMessage handler passes sel_html through to content builder
- Added 20 new unit tests for htmlToBlocks and updated buildSuggestedContent
- Added bookmarklet test for HTML selection capture
- Rebuilt minified bookmarklet

Co-authored-by: kraftbj <88897+kraftbj@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix issue with retaining formatting in captured text Preserve HTML formatting in bookmarklet text selection capture Mar 6, 2026
kraftbj
kraftbj previously requested changes Mar 6, 2026
Copy link
Collaborator

@kraftbj kraftbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review

Critical Issues (3)

  1. Sanitizer uses denylist, not allowlist (html-parser.js:459-486) — The JSDoc promises an allowlist of safe inline elements, but the implementation only denylists specific dangerous patterns. Everything else passes through: <form>, <svg>, <input>, <button>, <textarea>, plus all attributes like style, class, formaction, etc.

  2. data: and vbscript: URI schemes not stripped (html-parser.js:476-484) — Only javascript: is blocked. data:text/html,... links execute scripts when clicked and survive sanitization into paragraph blocks.

  3. Silent content loss when selectionHtml produces empty blocks (html-parser.js:711-715, App.js:228-229) — When selectionHtml is truthy but htmlToBlocks() returns empty (whitespace-only HTML, only script tags, etc.), no content block is produced. The description fallback is also defeated because App.js blanks description when selectionHtml exists. The user's selection is silently discarded.

Important Issues (5)

  1. Bookmarklet HTML capture has no try-catch (bookmarklet.js:39-43) — If cloneContents() throws (cross-shadow-DOM ranges, browser extensions), the entire bookmarklet halts with no popup and no feedback.

  2. Async event listener swallows errors (App.js:189) — handleMessage is async but used as an event listener. Unhandled errors produce silent promise rejections. Consider wrapping the body in a top-level try-catch with console.error. (Pre-existing line, not in this diff, but this PR increases the risk by adding HTML parsing that can fail.)

  3. Unbounded recursion (html-parser.js:517, 643) — listElementToBlock and htmlToBlocks (blockquotes) recurse without depth limits. Crafted deeply nested HTML causes stack overflow.

  4. Entity encoding defeats plain-text optimization (bookmarklet.js:311-314) — innerHTML entity-encodes & etc. while toString() doesn't, so selectionHtml !== selection is true even for unformatted text like "Tom & Jerry".

  5. buildSuggestedContent JSDoc doesn't document selectionHtml — The new property changes function behavior significantly but isn't documented.

Strengths

  • Core design is sound — getRangeAt/cloneContents to Gutenberg blocks is the right approach.
  • DOMParser is the correct choice for untrusted HTML parsing.
  • Proper escapeHtml on text content within blocks.
  • Good test coverage for the happy path (all block types, nested lists, inline formatting, real-world example from the issue).
  • Clean backward compatibility — sel_html is additive, plain-text s still sent.

Critical fixes:
- Switch sanitizeInlineContent to allowlist (strong,em,b,i,u,s,a,code,
  mark,sub,sup,span,br); all other elements unwrapped, all attributes
  stripped except href on <a>
- Block data: and vbscript: URI schemes alongside javascript: in href
- Fix silent content loss: buildSuggestedContent falls back to
  description quote when selectionHtml produces empty blocks; App.js
  always computes description so the fallback is available

Important fixes:
- Wrap bookmarklet cloneContents() DOM calls in try-catch
- Add depth limits: listElementToBlock cap at 10, htmlToBlocks blockquote
  recursion cap at 5
- Remove unreliable selectionHtml !== selection comparison (innerHTML
  entity-encodes & to &amp;); always send sel_html when capture succeeded

Also fix tempEl.firstChild.innerHTML → tempEl.innerHTML (heading/paragraph/
blockquote handlers were broken by the allowlist unwrapping block containers)
Update JSDoc for buildSuggestedContent to document selectionHtml
Add tests: data:/vbscript: blocking, non-allowlist unwrapping, attribute
stripping, depth-limit safety, selectionHtml→empty→description fallback

Co-authored-by: kraftbj <88897+kraftbj@users.noreply.github.com>
Copy link
Author

Copilot AI commented Mar 6, 2026

PR Review

Critical Issues (3)...

All critical and important issues addressed in 8d3ae6c — summary of changes:

Critical (3/3 fixed):

  • Sanitizer switched to strict allowlist; all non-listed elements unwrapped, all attributes stripped except href on <a>
  • data: and vbscript: schemes now blocked alongside javascript:
  • Silent content loss: buildSuggestedContent falls back to description when htmlToBlocks returns empty; App.js always computes description

Important (4/4 fixed):

  • Bookmarklet DOM calls wrapped in try-catch with plain-text fallback
  • Depth limits added: listElementToBlock caps at 10, blockquote recursion in htmlToBlocks caps at 5
  • Entity encoding comparison removed; sel_html is always sent when capture succeeded

Item 5 (async event listener) is pre-existing and unrelated to these changes; left as-is.

@kraftbj kraftbj marked this pull request as ready for review March 16, 2026 22:06
@kraftbj
Copy link
Collaborator

kraftbj commented Mar 16, 2026

Manual Testing Steps

The bookmarklet now captures HTML formatting from the user's selection and converts it to Gutenberg blocks. Here's what to test.

Setup

  1. Build the plugin (npm run build)
  2. Activate on a test site
  3. Go to Tools → Press This and drag the bookmarklet to your bookmarks bar

Core Functionality: Formatted Selection

On a page with rich content (ChatGPT output, a StackOverflow answer, a Wikipedia article), select a chunk that includes varied formatting, then click the bookmarklet.

  • Bold/italic/inline formatting — Select text with <strong>, <em>, <code>, etc. These should appear as inline formatting inside paragraph blocks, not stripped to plain text.
  • Headings — Select content that includes <h2>/<h3> headings. They should become Heading blocks at the correct level.
  • Bulleted and numbered lists — Select a <ul> or <ol>. Should become a List block with correct items. Nested lists (sub-bullets) should be preserved up to reasonable depth.
  • Blockquotes — Select a <blockquote>. Should become a Quote block with its inner content as blocks.
  • Code blocks — Select a <pre><code> block. Should become a Code block.
  • Links — Select text containing <a href="..."> links. The link text and href should be preserved. Other attributes (class, style, target, etc.) should be stripped.
  • Mixed content — Select a section with headings, paragraphs, lists, and inline formatting together. Each should map to the appropriate block type.

Backward Compatibility

  • Plain text selection — Select plain text with no HTML formatting. Should still produce a Quote block with the text (same as before this PR).
  • No selection — Click the bookmarklet with nothing selected. Page metadata (title, description, embed URL) should still populate the editor as before.
  • Embed URLs — Use the bookmarklet on a YouTube video page. The embed block should still appear.

Security / Sanitization

  • Script injection — On a page you control, add <script>alert('xss')</script> inside a div. Select it and press the bookmarklet. The script should be stripped entirely (no alert, no visible text from the tag).
  • Dangerous attributes — Select content containing elements with onclick, onerror, style, or class attributes. All attributes should be stripped (except href on links).
  • Dangerous URI schemes — Select a link with href="javascript:alert(1)", href="data:text/html,...", or href="vbscript:...". The href should be removed; the link text may remain as plain text.
  • Non-allowlisted elements — Select content containing <form>, <input>, <svg>, <iframe>, or <button> elements. Dangerous ones (iframe, object, embed) should be fully removed. Others should be unwrapped (children preserved as text, element itself gone).

Edge Cases

  • Whitespace / empty selection — Select only whitespace or empty space. Should not produce a broken or empty block.
  • Script-only content — If somehow the selection is only <script> tags, htmlToBlocks returns empty and the editor should fall back to the plain-text description (or produce no content block—not crash).
  • Special characters — Select text with &, <, >, quotes, etc. These should be properly escaped in the resulting blocks.
  • Entity encoding — Select text like "Tom & Jerry" or "5 > 3". The HTML capture should still work (previous versions had a bug where entity encoding differences caused sel_html to not be sent).

Error Resilience

  • Bookmarklet on restricted pages — Try the bookmarklet on pages where DOM access might be limited (browser extension pages, PDFs). The bookmarklet should still open the popup—falling back to plain text or no selection rather than silently failing.

@kraftbj kraftbj modified the milestones: 2.1.0, 2.0.2 Mar 16, 2026
@kraftbj kraftbj self-requested a review March 16, 2026 22:18
@kraftbj kraftbj dismissed their stale review March 16, 2026 22:18

Changed addressed, needs manual testing

@kraftbj kraftbj mentioned this pull request Mar 16, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

It doesn't retain formatting

2 participants