Skip to content

Strip invisible Unicode from link hrefs for defense-in-depth#3299

Open
romanisa wants to merge 5 commits intomicrosoft:masterfrom
romanisa:romasha/strip-invisible-unicode
Open

Strip invisible Unicode from link hrefs for defense-in-depth#3299
romanisa wants to merge 5 commits intomicrosoft:masterfrom
romanisa:romasha/strip-invisible-unicode

Conversation

@romanisa
Copy link
Contributor

@romanisa romanisa commented Mar 5, 2026

Summary

Strips invisible Unicode characters (zero-width chars, bidirectional marks, Unicode Tags U+E0001-U+E007F, etc.) from link href attributes at multiple layers to prevent hidden content injection via mailto: links.

Bug: ADO #409639 - Rooster should strip or neutralize invisible Unicode when rendering drafts, especially in case of MailTo links

Problem

Bug explains the problem statement .. avoiding keeping it here for the sake of keeping internal details private.

Changes

Defense-in-depth - invisible Unicode stripped at 4 layers:

Layer File Coverage
Utility stripInvisibleUnicode.ts (new) Strips ~30 categories of invisible chars
HTML sanitization sanitizeElement.ts Paste, HTML-to-model conversion
XSS check checkXss.ts insertLink() API + prevents script: bypass
Format handler linkFormatHandler.ts DOM-to-model conversion

Characters stripped

  • Zero-width chars (U+200B-U+200F)
  • Bidirectional controls (U+202A-U+202E, U+2066-U+2069)
  • Unicode Tags (U+E0001-U+E007F) - specifically called out in the bug
  • Soft hyphen, BOM, word joiner, and other invisible formatting characters

Design decisions

  • Strips from href only (not all text) - avoids breaking zero-width joiners used in some languages
  • Strips rather than rejects - user-friendly for accidental BOM/soft-hyphens
  • Utility in roosterjs-content-model-dom to avoid circular dependencies

Testing

  • 28 new unit tests added across 4 test files
  • All 90 relevant tests pass (Chrome 145)

Strip invisible Unicode characters (zero-width chars, bidirectional marks,
Unicode Tags U+E0001-E007F, etc.) from link href attributes at multiple
layers to prevent hidden content injection via mailto: links.

Changes:
- Add stripInvisibleUnicode utility in roosterjs-content-model-dom
- Apply stripping in sanitizeElement.ts (HTML paste/sanitization path)
- Apply stripping in checkXss.ts (programmatic link insertion path)
- Apply stripping in linkFormatHandler.ts (DOM-to-model conversion path)
- Apply stripping in linkProcessor.ts (DOM-to-model conditional check)
- Add comprehensive unit tests for all changes

Bug: https://outlookweb.visualstudio.com/Outlook%20Web/_workitems/edit/409639

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanisa romanisa requested review from JiuqingSong and Copilot March 5, 2026 17:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Strips invisible Unicode characters from link href values across sanitization, model conversion, and XSS checking to prevent hidden-content injection (notably via mailto:).

Changes:

  • Introduces stripInvisibleUnicode() utility and exports it from roosterjs-content-model-dom.
  • Applies stripping during DOM→model conversion (linkProcessor, linkFormatHandler) and HTML sanitization (sanitizeElement).
  • Updates checkXss() to strip invisible Unicode before evaluating script: patterns and adds unit tests across layers.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts Adds the core stripping utility via a consolidated regex.
packages/roosterjs-content-model-dom/lib/index.ts Exposes stripInvisibleUnicode from the package entrypoint.
packages/roosterjs-content-model-dom/lib/formatHandlers/segment/linkFormatHandler.ts Strips invisibles when reading link formats from DOM attributes.
packages/roosterjs-content-model-dom/lib/domToModel/processors/linkProcessor.ts Strips invisibles while processing <a href> into the content model.
packages/roosterjs-content-model-core/lib/command/createModelFromHtml/sanitizeElement.ts Strips invisibles when sanitizing href attributes.
packages/roosterjs-content-model-dom/test/domUtils/stripInvisibleUnicodeTest.ts Adds focused unit coverage for stripping behavior across many code points.
packages/roosterjs-content-model-dom/test/domToModel/processors/linkProcessorTest.ts Ensures DOM→model link processing strips invisible Unicode in href.
packages/roosterjs-content-model-core/test/command/createModelFromHtml/sanitizeElementTest.ts Ensures sanitizer strips invisibles from href but not unrelated attributes.
packages/roosterjs-content-model-api/lib/publicApi/utils/checkXss.ts Strips invisibles before XSS detection and returns sanitized links.
packages/roosterjs-content-model-api/test/publicApi/utils/checkXssTest.ts Adds tests for invisible Unicode stripping + script: obfuscation detection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

romanisa and others added 2 commits March 5, 2026 13:07
…t/linkFormatHandler.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…, use strict equality

- Strip invisible Unicode from href BEFORE the script: regex check to
  prevent XSS bypass (e.g., s\u200Bcript: passing the check then being
  stripped to script:)
- Guard against empty href after stripping in linkFormatHandler (only
  set format.href when sanitizedHref is non-empty)
- Use strict equality (===) for attribute name checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanisa romanisa self-assigned this Mar 5, 2026
romanisa and others added 2 commits March 5, 2026 14:36
… comments

Extend the stripped character set to include Mongolian free variation
selectors (U+180B-180D), interlinear annotation anchors (U+FFF9-FFFB),
and extended Unicode Tags (U+E0080-E00FF). Add defense-in-depth comments
at each call site, document ZWJ/emoji and URL-encoding limitations, and
add 4 new tests for the expanded character ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add decodeURIComponent before stripping so that URL-encoded invisible
characters (e.g. %E2%80%8B for U+200B) are also caught. Falls back
gracefully on malformed percent-encoding. Adds 5 new tests for
URL-encoded scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants