Punctuation characters aren't stripped from auto-generated heading ID attributes #22248

@matthewmcvickar

Issue Summary

The Problem

Headings automatically include an id attribute with a lowercased and dashed 'slug'-ish version of the heading text. E.g., the second-level heading 'My Favorite Book' will be rendered as <h2 id="my-favorite-book">.

This transformation also strips out a number of non-alphanumeric characters and percent-encodes non-ASCII ones. E.g., this heading:

"What's my favorite book?" you ask? Why, 'Moby Dick' of course! (I smile/laugh.)

…is turned into this HTML:

<h2 id="whats-my-favorite-book-you-ask-why-moby-dick-of-course-i-smilelaugh">"What's my favorite book?" you ask? Why, 'Moby Dick' of course! (I smile/laugh.)</h2>

As you can see, the punctuation is stripped out: single and double quotation marks, question marks, exclamation points, commas, parentheses, slashes, and periods.

But only some punctuation is stripped out. If I use curly/fancy/typographer's quotation marks or other punctuation or special characters, they are encoded instead of stripped. E.g., this heading:

“It’s me,” I said.

…is turned into this HTML:

<h2 id="%E2%80%9Cit%E2%80%99s-me%E2%80%9D-i-said">“It’s me,” I said.</h2>
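
To make the two outputs above easier to compare, here is a minimal sketch that reproduces them. It is only an approximation of the observed behavior, not Ghost's actual implementation: the function name `approximateHeadingId` is hypothetical, and the order of steps (lowercase, strip the ASCII punctuation listed in this issue, dash the whitespace, percent-encode what remains) is an assumption inferred from the rendered HTML.

```typescript
// Hypothetical helper: approximates the behavior observed in this issue.
// It is NOT Ghost's actual implementation.
function approximateHeadingId(text: string): string {
    // ASCII punctuation that was observed to be removed (list from this issue).
    const strippedAscii = /['";,.<>\/\\?!\[\](){}@#$%^&*=_+~]/g;

    const dashed = text
        .toLowerCase()
        .replace(strippedAscii, '')
        .trim()
        .replace(/\s+/g, '-'); // runs of whitespace become a single dash

    // Any non-ASCII character that survives is percent-encoded as UTF-8 bytes.
    return encodeURIComponent(dashed);
}

console.log(approximateHeadingId(`“It’s me,” I said.`));
// → %E2%80%9Cit%E2%80%99s-me%E2%80%9D-i-said
```

The comma and period disappear because they are in the ASCII list, while the curly quotation marks fall through to the encoding step, which is where the `%E2%80%…` sequences come from.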

Why It's a Problem

I see two problems here:

  1. The anchor URLs for linking to these headings are ugly and hard to read.
  2. The anchor URLs for these headings are not easy to guess, so editors who want to link to a heading further down the page can't predict what to use as the URL for an internal link.

The Request

Would you consider stripping more characters from the heading id attribute?

In my testing, the following characters are removed from heading id attributes:
' " ; , . < > / \ ? ! [ ] ( ) { } @ # $ % ^ & * = _ + ~

But these characters are not removed:
‘ ’ “ ” ` ¡ ¿ - – — •
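
As a rough illustration of the request, the sketch below strips the second list as well. Everything in it is an assumption for illustration: `proposedHeadingId` is a hypothetical name, the step order is inferred from the rendered output rather than taken from Ghost's source, and the ASCII hyphen is deliberately kept because it also serves as the word separator in the id.

```typescript
// Hypothetical sketch of the requested behavior; not Ghost's implementation.
function proposedHeadingId(text: string): string {
    // ASCII punctuation that is already removed today (list from this issue).
    const alreadyStripped = /['";,.<>\/\\?!\[\](){}@#$%^&*=_+~]/g;
    // Extra characters this issue asks to remove. The ASCII hyphen is left
    // alone (an assumption) because it is the word separator in the id.
    const alsoStrip = /[‘’“”`¡¿–—•]/g;

    const dashed = text
        .toLowerCase()
        .replace(alreadyStripped, '')
        .replace(alsoStrip, '')
        .trim()
        .replace(/\s+/g, '-'); // runs of whitespace become a single dash

    // Other non-ASCII characters (e.g. accented letters) are still encoded.
    return encodeURIComponent(dashed);
}

console.log(proposedHeadingId(`“It’s me,” I said.`));
// → its-me-i-said
```

One design question this leaves open: removing an en or em dash outright joins the words on either side of it when there are no surrounding spaces, so replacing dashes with a space before the whitespace step may be preferable.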

Related Tickets

This has been brought up before, in #13876 and #14179, but those tickets were closed because it is intentional that characters are encoded so that "when links or URLs are displayed by browsers they will appear as native characters."

I understand this goal, but I don't think punctuation should be preserved, and I think the characters listed above could safely be removed from these attribute values without causing problems or losing important information.

Steps to Reproduce

  1. In a post, make a new heading (e.g., a second-level heading).
  2. Use any of these special characters in a heading: ‘ ’ “ ” ` ¡ ¿ - – — • (e.g., “It’s me—” I said)
  3. Publish the post.
  4. In the published post, inspect the HTML for the heading.
  5. Note that the heading's id attribute is full of encoded punctuation characters.

Ghost Version

5.109.2

Node.js Version

18.20.5

How did you install Ghost?

macOS Sequoia 15.3.1, ghost-cli, ghost install local

Database type

MySQL 5.7

Browser & OS version

n/a

Relevant log / error output

n/a

Code of Conduct

  • I agree to be friendly and polite to people in this repository

Metadata

Assignees

No one assigned

    Labels

    bug: [triage] something behaving unexpectedly
    community: [triage] Community features and bugs
    stale: [triage] Issues that were closed due to lack of traction

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests