A powerful PHP library for converting HTML to semantic Markdown, preserving the structure and meaning of the original content.
This library is a PHP port of domscribe-python, which itself is based on dom-to-semantic-markdown.
- Semantic preservation: Maintains the semantic structure of HTML during conversion
- Complex structure handling: Handles nested lists, tables, and other complex HTML structures
- Highly customizable: Extensive options to tailor the conversion process
- Main content extraction: Automatically identifies and extracts the main content from web pages
- LLM-friendly output: Optimized for Language Model processing with special annotations
- Well-tested: Comprehensive test suite with PHPUnit
- Modern PHP: Uses PHP 8.0+ features with strict typing
Install via Composer:
composer require acseo/domscribe

<?php
use Domscribe\Converter;
// Simple conversion
$html = "<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>";
$markdown = Converter::htmlToMarkdown($html);
echo $markdown;
// Output:
// # Hello, World!
//
// This is a **test**.

use Domscribe\Converter;
use Domscribe\ConversionOptions;
$html = '<html><body><main><h1>Main Content</h1><p>Some text</p></main></body></html>';
// Using an array
$options = [
'extract_main_content' => true,
'refify_urls' => true,
'keep_html' => ['div', 'span'],
'debug' => false,
];
$markdown = Converter::htmlToMarkdown($html, $options);
// Or using ConversionOptions object
$options = new ConversionOptions();
$options->extractMainContent = true;
$options->refifyUrls = true;
$options->keepHtml = ['div', 'span'];
$markdown = Converter::htmlToMarkdown($html, $options);

| Option | Type | Default | Description |
|---|---|---|---|
| websiteDomain | ?string | null | Website domain to strip from URLs |
| extractMainContent | bool | false | Automatically extract main content |
| refifyUrls | bool | false | Convert to reference-style links |
| urlMap | array | [] | Map of URLs to replace |
| debug | bool | false | Enable debug logging |
| enableTableColumnTracking | bool | true | Add colId comments to table cells |
| keepHtml | array | [] | HTML tags to preserve |
| includeMetaData | string\|bool\|null | null | Include metadata from HTML head |
| overrideElementProcessing | callable\|null | null | Custom element processing callback |
| processUnhandledElement | callable\|null | null | Custom unhandled element callback |
| overrideNodeRenderer | callable\|null | null | Custom node renderer callback |
| renderCustomNode | callable\|null | null | Custom node renderer callback |
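
The array form shown earlier uses snake_case keys (extract_main_content, refify_urls, keep_html), while the table and the ConversionOptions object use camelCase names. The sketch below illustrates that naming correspondence for the three pairs that appear in this README. Note that snakeToCamel is a hypothetical helper, not part of Domscribe, and it is an assumption that the remaining options follow the same pattern.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (NOT part of Domscribe): turns a snake_case array key
 * into the camelCase property name listed in the options table.
 */
function snakeToCamel(string $key): string
{
    // ucwords() with '_' as the delimiter capitalises the first letter of
    // each underscore-separated word; the underscores are then removed and
    // the very first letter is lowercased again.
    return lcfirst(str_replace('_', '', ucwords($key, '_')));
}

// The three key/property pairs that both appear in this README.
foreach (['extract_main_content', 'refify_urls', 'keep_html'] as $key) {
    echo $key, ' => ', snakeToCamel($key), PHP_EOL;
}
```

When in doubt about a key name, the ConversionOptions object is the safer route, since its property names are the ones documented in the table.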
use Domscribe\Converter;
$html = <<<HTML
<div>
<h1>My Blog Post</h1>
<p>Here's a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
<ul>
<li>Item 1</li>
<li>Item 2
<ol>
<li>Subitem 2.1</li>
<li>Subitem 2.2</li>
</ol>
</li>
<li>Item 3</li>
</ul>
<blockquote>
<p>This is a quote.</p>
</blockquote>
</div>
HTML;
$markdown = Converter::htmlToMarkdown($html);
echo $markdown;

Output:
# My Blog Post
Here's a paragraph with **bold** and *italic* text.
- Item 1
- Item 2
  1. Subitem 2.1
  2. Subitem 2.2
- Item 3
> This is a quote.

use Domscribe\Converter;
$html = <<<HTML
<html>
<body>
<header>Header content</header>
<nav>Navigation</nav>
<main>
<h1>Main Article</h1>
<p>This is the main content.</p>
</main>
<footer>Footer content</footer>
</body>
</html>
HTML;
$options = ['extract_main_content' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:
# Main Article
This is the main content.

use Domscribe\Converter;
$html = <<<HTML
<p>
Check out <a href="https://example.com">this site</a> and
<a href="https://example.org">another site</a>.
Here's <a href="https://example.com">the first site</a> again.
</p>
HTML;
$options = ['refify_urls' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:
Check out [this site][1] and [another site][2].
Here's [the first site][1] again.
[1]: https://example.com
[2]: https://example.org

use Domscribe\Converter;
$html = '<p>This is <span class="highlight">highlighted</span> text.</p>';
$options = ['keep_html' => ['span']];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:
This is <span class="highlight">highlighted</span> text.

use Domscribe\Converter;
$html = <<<HTML
<table>
<thead>
<tr>
<th>Name</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>30</td>
</tr>
<tr>
<td>Bob</td>
<td>25</td>
</tr>
</tbody>
</table>
HTML;
$markdown = Converter::htmlToMarkdown($html);
echo $markdown;

Output:
| Name <!-- colId: 1 --> | Age <!-- colId: 2 --> |
| --- | --- |
| Alice <!-- colId: 1 --> | 30 <!-- colId: 2 --> |
| Bob <!-- colId: 1 --> | 25 <!-- colId: 2 --> |

Domscribe provides access to the Abstract Syntax Tree (AST) for advanced use cases:
use Domscribe\Converter;
$html = '<h1>Title</h1><p>Text with <a href="https://example.com">link</a></p>';
// Convert HTML to AST
$ast = Converter::htmlToMarkdownAst($html);
// Find specific nodes in the AST
$link = Converter::findInMarkdownAst($ast, function ($node) {
return isset($node['type']) && $node['type'] === 'link';
});
// Find all nodes of a certain type
$allLinks = Converter::findAllInMarkdownAst($ast, function ($node) {
return isset($node['type']) && $node['type'] === 'link';
});
// Convert AST back to Markdown string
$markdown = Converter::markdownAstToString($ast);

# Install dependencies
composer install
# Run tests
composer test
# Run with coverage
./vendor/bin/phpunit --coverage-html coverage
# Run static analysis
composer phpstan
# Check code style
composer cs-check
# Fix code style
composer cs-fix

The library is organized into several key components:
- Converter: Main entry point and orchestrator
- HtmlToMarkdownAst: Converts HTML DOM to Markdown AST
- MarkdownAstToString: Converts AST to Markdown string
- DomUtils: DOM manipulation and content extraction utilities
- UrlUtils: URL processing and reference-style conversion
- AstUtils: AST traversal and manipulation utilities
- ConversionOptions: Configuration object for customization
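
To make the two-stage design concrete, here is a toy sketch of the same flow (HTML DOM to AST, then AST to string) built only on PHP's bundled DOM extension. It is not Domscribe's implementation: the function names (toyDomToAst, toyAstToString, toyHtmlToMarkdown) and the array node shape are assumptions chosen for illustration, and only h1, p, and strong are handled.

```php
<?php
declare(strict_types=1);

/**
 * Toy stage 1 (stands in for the HtmlToMarkdownAst role): walk a DOM node
 * and return a list of array nodes. The node shape is an assumption.
 */
function toyDomToAst(DOMNode $node): array
{
    $ast = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Simplification: whitespace-only text nodes are dropped.
            if (trim($child->textContent) !== '') {
                $ast[] = ['type' => 'text', 'content' => $child->textContent];
            }
        } elseif ($child instanceof DOMElement) {
            $children = toyDomToAst($child);
            switch ($child->tagName) {
                case 'h1':
                    $ast[] = ['type' => 'heading', 'level' => 1, 'content' => $children];
                    break;
                case 'p':
                    $ast[] = ['type' => 'paragraph', 'content' => $children];
                    break;
                case 'strong':
                    $ast[] = ['type' => 'bold', 'content' => $children];
                    break;
                default:
                    // Unknown tags (html, body, div, ...) are flattened
                    // into their children.
                    $ast = array_merge($ast, $children);
            }
        }
    }
    return $ast;
}

/**
 * Toy stage 2 (stands in for the MarkdownAstToString role): render the
 * array nodes produced above as a Markdown string.
 */
function toyAstToString(array $ast): string
{
    $out = '';
    foreach ($ast as $node) {
        switch ($node['type']) {
            case 'text':
                $out .= $node['content'];
                break;
            case 'bold':
                $out .= '**' . toyAstToString($node['content']) . '**';
                break;
            case 'heading':
                $out .= str_repeat('#', $node['level']) . ' '
                    . toyAstToString($node['content']) . "\n\n";
                break;
            case 'paragraph':
                $out .= toyAstToString($node['content']) . "\n\n";
                break;
        }
    }
    return $out;
}

/** Toy orchestrator (stands in for the Converter role): chains both stages. */
function toyHtmlToMarkdown(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment so it has one root; silence libxml parse warnings.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);
    return trim(toyAstToString(toyDomToAst($dom->documentElement)));
}

echo toyHtmlToMarkdown('<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>'), PHP_EOL;
```

Keeping an intermediate AST between parsing and rendering is presumably what lets helpers such as findInMarkdownAst inspect or filter nodes before the final string is produced.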
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Original TypeScript library: dom-to-semantic-markdown
- Python port: domscribe-python
- PHP port by ACSEO
- domscribe-python - Python version
- dom-to-semantic-markdown - Original TypeScript version
For issues, questions, or contributions, please use the GitHub issue tracker.