Strip invisible Unicode from link hrefs for defense-in-depth by romanisa · Pull Request #3299 · microsoft/roosterjs

romanisa · 2026-03-05T07:19:25Z

Summary

Strips invisible Unicode characters (zero-width chars, bidirectional marks, Unicode Tags U+E0001-U+E007F, etc.) from link href attributes at multiple layers to prevent hidden content injection via mailto: links.

Bug: ADO #409639 - Rooster should strip or neutralize invisible Unicode when rendering drafts, especially in case of MailTo links

Problem

The linked bug describes the problem statement; the details are omitted here to keep internal information private.

Changes

Defense-in-depth - invisible Unicode stripped at 4 layers:

| Layer | File | Coverage |
| --- | --- | --- |
| Utility | `stripInvisibleUnicode.ts` (new) | Strips ~30 categories of invisible chars |
| HTML sanitization | `sanitizeElement.ts` | Paste, HTML-to-model conversion |
| XSS check | `checkXss.ts` | `insertLink()` API + prevents `script:` bypass |
| Format handler | `linkFormatHandler.ts` | DOM-to-model conversion |

Characters stripped

Zero-width chars (U+200B-U+200F)
Bidirectional controls (U+202A-U+202E, U+2066-U+2069)
Unicode Tags (U+E0001-U+E007F) - specifically called out in the bug
Soft hyphen, BOM, word joiner, and other invisible formatting characters
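
As a rough illustration of the ranges listed above, a consolidated regex could look like the sketch below. This is not the contents of `stripInvisibleUnicode.ts`: the names `INVISIBLE_SKETCH` and `stripInvisibleSketch` are hypothetical, the character set covers only the ranges named in this description, and matching the Unicode Tags block as UTF-16 surrogate pairs is an assumption made so the pattern works without the `u` flag.

```typescript
// Hypothetical sketch, not the PR's implementation.
// BMP characters: soft hyphen (U+00AD), zero-width and directional marks
// (U+200B-U+200F), bidi controls (U+202A-U+202E, U+2066-U+2069),
// word joiner (U+2060), BOM (U+FEFF).
// Unicode Tags (U+E0001-U+E007F) are matched as surrogate pairs:
// high surrogate \uDB40 followed by a low surrogate in \uDC01-\uDC7F.
const INVISIBLE_SKETCH = /[\u00AD\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\uFEFF]|\uDB40[\uDC01-\uDC7F]/g;

function stripInvisibleSketch(value: string): string {
    // Remove every match; visible characters are left untouched.
    return value.replace(INVISIBLE_SKETCH, '');
}
```

With this sketch, `stripInvisibleSketch('mailto:a\u200Bb@example.com')` returns `'mailto:ab@example.com'`.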

Design decisions

Strips from href only (not all text) - avoids breaking zero-width joiners used in some languages
Strips rather than rejects - user-friendly for accidental BOM/soft-hyphens
Utility in roosterjs-content-model-dom to avoid circular dependencies
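
The href-only decision can be pictured with a small attribute-level helper. This is a sketch under assumptions, not the code in `sanitizeElement.ts`: `sanitizeAttributeSketch` and `HREF_INVISIBLES` are hypothetical names, and the regex is a reduced stand-in for the real character set.

```typescript
// Hypothetical helper, not the PR's sanitizer.
// Reduced character set for illustration only.
const HREF_INVISIBLES = /[\u00AD\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\uFEFF]|\uDB40[\uDC01-\uDC7F]/g;

function sanitizeAttributeSketch(name: string, value: string): string {
    // Only href values are rewritten. Other attributes keep characters
    // such as the zero-width joiner (U+200D), which some scripts rely on.
    return name.toLowerCase() === 'href' ? value.replace(HREF_INVISIBLES, '') : value;
}
```

Here a stray BOM in an `href` value is dropped, while a zero-width joiner in a `title` value passes through unchanged.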

Testing

28 new unit tests added across 4 test files
All 90 relevant tests pass (Chrome 145)

Strip invisible Unicode characters (zero-width chars, bidirectional marks, Unicode Tags U+E0001-E007F, etc.) from link href attributes at multiple layers to prevent hidden content injection via mailto: links.

Changes:
- Add stripInvisibleUnicode utility in roosterjs-content-model-dom
- Apply stripping in sanitizeElement.ts (HTML paste/sanitization path)
- Apply stripping in checkXss.ts (programmatic link insertion path)
- Apply stripping in linkFormatHandler.ts (DOM-to-model conversion path)
- Apply stripping in linkProcessor.ts (DOM-to-model conditional check)
- Add comprehensive unit tests for all changes

Bug: https://outlookweb.visualstudio.com/Outlook%20Web/_workitems/edit/409639

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Strips invisible Unicode characters from link href values across sanitization, model conversion, and XSS checking to prevent hidden-content injection (notably via mailto:).

Changes:

Introduces stripInvisibleUnicode() utility and exports it from roosterjs-content-model-dom.
Applies stripping during DOM→model conversion (linkProcessor, linkFormatHandler) and HTML sanitization (sanitizeElement).
Updates checkXss() to strip invisible Unicode before evaluating script: patterns and adds unit tests across layers.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| packages/roosterjs-content-model-dom/lib/domUtils/stripInvisibleUnicode.ts | Adds the core stripping utility via a consolidated regex. |
| packages/roosterjs-content-model-dom/lib/index.ts | Exposes `stripInvisibleUnicode` from the package entrypoint. |
| packages/roosterjs-content-model-dom/lib/formatHandlers/segment/linkFormatHandler.ts | Strips invisibles when reading link formats from DOM attributes. |
| packages/roosterjs-content-model-dom/lib/domToModel/processors/linkProcessor.ts | Strips invisibles while processing `<a href>` into the content model. |
| packages/roosterjs-content-model-core/lib/command/createModelFromHtml/sanitizeElement.ts | Strips invisibles when sanitizing `href` attributes. |
| packages/roosterjs-content-model-dom/test/domUtils/stripInvisibleUnicodeTest.ts | Adds focused unit coverage for stripping behavior across many code points. |
| packages/roosterjs-content-model-dom/test/domToModel/processors/linkProcessorTest.ts | Ensures DOM→model link processing strips invisible Unicode in `href`. |
| packages/roosterjs-content-model-core/test/command/createModelFromHtml/sanitizeElementTest.ts | Ensures sanitizer strips invisibles from `href` but not unrelated attributes. |
| packages/roosterjs-content-model-api/lib/publicApi/utils/checkXss.ts | Strips invisibles before XSS detection and returns sanitized links. |
| packages/roosterjs-content-model-api/test/publicApi/utils/checkXssTest.ts | Adds tests for invisible Unicode stripping + `script:` obfuscation detection. |




…, use strict equality

- Strip invisible Unicode from href BEFORE the script: regex check to prevent XSS bypass (e.g., s\u200Bcript: passing the check then being stripped to script:)
- Guard against empty href after stripping in linkFormatHandler (only set format.href when sanitizedHref is non-empty)
- Use strict equality (===) for attribute name checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
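
The ordering issue in this commit can be shown with two small functions. This is a sketch under assumptions: `SCRIPT_PATTERN` is a simplified stand-in for whatever pattern `checkXss.ts` actually uses, returning an empty string for a rejected link is assumed, and all names below are hypothetical.

```typescript
// Reduced character set for illustration only.
const XSS_INVISIBLES = /[\u00AD\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\uFEFF]|\uDB40[\uDC01-\uDC7F]/g;
// Assumed, simplified pattern; the real check may differ.
const SCRIPT_PATTERN = /script:/i;

// Unsafe order: the raw value is checked, then stripped.
// 's\u200Bcript:' does not match the pattern, so it passes the check
// and only becomes 'script:' afterwards.
function checkThenStrip(link: string): string {
    return SCRIPT_PATTERN.test(link) ? '' : link.replace(XSS_INVISIBLES, '');
}

// Order described in the commit: strip first, then check the result.
function stripThenCheck(link: string): string {
    const stripped = link.replace(XSS_INVISIBLES, '');
    return SCRIPT_PATTERN.test(stripped) ? '' : stripped;
}
```

For the input `'s\u200Bcript:alert(1)'`, `checkThenStrip` returns `'script:alert(1)'`, while `stripThenCheck` returns `''`.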

… comments Extend the stripped character set to include Mongolian free variation selectors (U+180B-180D), interlinear annotation anchors (U+FFF9-FFFB), and extended Unicode Tags (U+E0080-E00FF). Add defense-in-depth comments at each call site, document ZWJ/emoji and URL-encoding limitations, and add 4 new tests for the expanded character ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add decodeURIComponent before stripping so that URL-encoded invisible characters (e.g. %E2%80%8B for U+200B) are also caught. Falls back gracefully on malformed percent-encoding. Adds 5 new tests for URL-encoded scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
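
A minimal sketch of the decode-then-strip idea with a fallback for malformed input follows. The names are hypothetical and the regex is a reduced stand-in. The commit message does not say whether the decoded or the original string is returned, so returning the decoded value here is a simplification; it also decodes any other percent-escapes in the link.

```typescript
// Reduced character set for illustration only.
const ENCODED_INVISIBLES = /[\u00AD\u200B-\u200F\u202A-\u202E\u2060\u2066-\u2069\uFEFF]|\uDB40[\uDC01-\uDC7F]/g;

function decodeAndStripSketch(href: string): string {
    let decoded: string;
    try {
        // '%E2%80%8B' decodes to U+200B, which the regex can then match.
        decoded = decodeURIComponent(href);
    } catch (e) {
        // decodeURIComponent throws URIError on malformed percent-encoding;
        // fall back to the raw value in that case.
        decoded = href;
    }
    return decoded.replace(ENCODED_INVISIBLES, '');
}
```

With this sketch, `decodeAndStripSketch('mailto:a%E2%80%8Bb@example.com')` returns `'mailto:ab@example.com'`, and a malformed value such as `'mailto:a%zz'` is returned unchanged.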

romanisa requested review from JiuqingSong and Copilot March 5, 2026 17:09

Copilot AI reviewed Mar 5, 2026


romanisa and others added 2 commits March 5, 2026 13:07

Update packages/roosterjs-content-model-dom/lib/formatHandlers/segmen…

f0da586

…t/linkFormatHandler.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

romanisa self-assigned this Mar 5, 2026

romanisa requested review from BryanValverdeU and juliaroldi March 5, 2026 21:39

romanisa and others added 2 commits March 5, 2026 14:36

romanisa wants to merge 5 commits into microsoft:master from romanisa:romasha/strip-invisible-unicode
