docs: extension for llm.txt and llm-full.txt outputs #1166
Open
lbliii wants to merge 34 commits into NVIDIA-NeMo:main from lbliii:llane/llm-txt-extension
+2,308 −0
Changes from 31 commits
- 6b660ec updates (lbliii)
- d22db1f Fixing versions1.json to support nested folders (#1157) (aschilling-nv)
- 45eef7c Resolve a few warnings in the docs build (#1120) (jrbourbeau)
- 4ec94a6 Context manager support for ``RayClient`` (#1155) (jrbourbeau)
- cc42725 added hydra params as defaults (#1159) (Ssofja)
- aa9be41 Increase timeout for CPU tests (#1162) (sarahyurick)
- db6a64a Llane/docs readme updates (#1163) (lbliii)
- 0658cab add_codecov_badge (#1172) (pablo-garay)
- 0d77c3b Support more flags when building docs (#1175) (ayushdg)
- d569ef8 Fix bug for running filters on empty batches (#1173) (sarahyurick)
- 7c81ef8 docs: pyrpoject toml classifiers section (#1176) (lbliii)
- 386fdc4 docs: migration guide and faq (#1167) (lbliii)
- db0665a citation file (#1165) (lbliii)
- 3551968 add enhanced captioning to rn; bump versions on json files (#1154) (lbliii)
- f5b7710 Add large file splitting script (#1161) (jrbourbeau)
- 2705877 docs ruff (lbliii)
- 7c9618c fixes (lbliii)
- a4d28ce Update docs/_extensions/llm_txt_output/content_extractor.py (lbliii)
- 9d2d1fc Fix ``JsonlWriter`` / ``ParquetWriter`` output directory typo (#1180) (jrbourbeau)
- 16cd357 new extension: rich metadata for SEO (#1182) (lbliii)
- 10a579a Llama Nemotron Data Curation tutorial (#1063) (sarahyurick)
- 443e516 Fix flaky ``test_split_parquet_file_by_size`` failure (#1185) (jrbourbeau)
- a9050b3 Switch from pynvml to nvidia-ml-py (#1186) (jrbourbeau)
- 6610b67 ruff (lbliii)
- 67dbf1f improve ref target handling and ruff (lbliii)
- 5ab7adb Remove ftfy pin (#1189) (sarahyurick)
- 0341fb0 Increase UV_HTTP_TIMEOUT to reduce sporadic test failures (#1196) (jrbourbeau)
- 9e61d05 Link to documentation within text tutorials (#1190) (sarahyurick)
- 3859d5f docs: remove unused noqa directives for PLC0415 in llm_txt_output ext… (lbliii)
- 05f93e5 Merge branch 'main' into llane/llm-txt-extension (lbliii)
- cd8df09 Merge branch 'main' into llane/llm-txt-extension (lbliii)
- fb22e91 docs(llm-txt): refactor content_extractor with helper functions and f… (lbliii)
- 677fef6 docs(llm-txt): apply ruff linting and style fixes (lbliii)
- fef3fd0 docs(llm-txt): fix regex error in sentence boundary detection (lbliii)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,290 @@ | ||
| # LLM.txt Output Extension | ||
|
|
||
| Sphinx extension to generate `llm.txt` files for every documentation page in a standardized format optimized for Large Language Model consumption. | ||
|
|
||
| ## What is llm.txt? | ||
|
|
||
| The `llm.txt` format is a proposed standard designed to help LLMs like ChatGPT and Claude better understand and index website content. Each file contains: | ||
|
|
||
| - Document title and summary | ||
| - Clean overview text | ||
| - Key sections with descriptions | ||
| - Related resources/links | ||
| - Metadata | ||
|
|
||
| This extension generates: | ||
|
|
||
| 1. **Individual `.llm.txt` files**: One file for each documentation page (e.g., `index.llm.txt`, `getting-started/quickstart.llm.txt`) | ||
| 2. **Aggregated `llm-full.txt`**: A single file containing all documentation with a table of contents for complete site indexing | ||
|
|
||
| ## Features | ||
|
|
||
| - **Simple text format**: Plain markdown files that LLMs can easily parse | ||
| - **Clean content extraction**: Removes MyST directive artifacts, toctrees, and navigation elements | ||
| - **Structured sections**: Organized with headings, summaries, and key sections | ||
| - **Related links**: Automatically extracts internal links for context | ||
| - **Metadata support**: Includes frontmatter metadata and document classification | ||
| - **Content gating integration**: Respects content gating rules from the content_gating extension | ||
| - **Configurable**: Control content length, sections included, and more | ||
|
|
||
| ## Installation | ||
|
|
||
| The extension is included in the `_extensions` directory. Enable it in your `conf.py`: | ||
|
|
||
| ```python | ||
| extensions = [ | ||
| # ... other extensions | ||
| '_extensions.llm_txt_output', | ||
| ] | ||
| ``` | ||
|
|
||
| ## Configuration | ||
|
|
||
| ### Minimal Configuration (Recommended) | ||
|
|
||
| ```python | ||
| # conf.py | ||
| llm_txt_settings = { | ||
| 'enabled': True, # All other settings use defaults | ||
| } | ||
| ``` | ||
|
|
||
| ### Full Configuration Options | ||
|
|
||
| ```python | ||
| # conf.py | ||
| llm_txt_settings = { | ||
| 'enabled': True, # Enable/disable generation | ||
| 'exclude_patterns': [ # Patterns to exclude | ||
| '_build', | ||
| '_templates', | ||
| '_static', | ||
| 'apidocs' | ||
| ], | ||
| 'verbose': True, # Verbose logging | ||
| 'base_url': 'https://docs.example.com/latest', # Base URL for absolute links | ||
| 'max_content_length': 5000, # Max chars in overview (0 = no limit) | ||
| 'summary_sentences': 2, # Number of sentences for summary | ||
| 'include_metadata': True, # Include metadata section | ||
| 'include_headings': True, # Include key sections from headings | ||
| 'include_related_links': True, # Include internal links | ||
| 'max_related_links': 10, # Max related links to include | ||
| 'card_handling': 'simple', # 'simple' or 'smart' for grid cards | ||
| 'clean_myst_artifacts': True, # Remove MyST directive artifacts | ||
| 'generate_full_file': True, # Generate llm-full.txt with all docs | ||
| } | ||
| ``` | ||
|
|
||
| ### Important: Configure base_url for Proper Link Handling | ||
|
|
||
| To comply with the [llms.txt specification](https://llmstxt.org/), you **must** set the `base_url` option to your documentation's canonical URL. This ensures all internal links are converted to absolute URLs that LLMs can access: | ||
|
|
||
| ```python | ||
| llm_txt_settings = { | ||
| 'base_url': 'https://docs.nvidia.com/nemo/evaluator/latest', | ||
| } | ||
| ``` | ||
|
|
||
| Without `base_url`, links will remain relative (e.g., `../quickstart.html`) instead of absolute URLs (e.g., `https://docs.nvidia.com/nemo/evaluator/latest/quickstart.html`). | ||
|
|
||
| ## Output Files | ||
|
|
||
| The extension generates two types of output: | ||
|
|
||
| ### Individual Page Files | ||
|
|
||
| For a file `docs/get-started/install.md`, the extension generates `_build/html/get-started/install.llm.txt` | ||
|
|
||
| ### Aggregated Full Site File | ||
|
|
||
| The extension also generates `_build/html/llm-full.txt` containing all documentation in a single file with: | ||
|
|
||
| - **Header**: Project name and version | ||
| - **Table of Contents**: Complete list of all documentation pages with titles | ||
| - **Document Separators**: Clear HTML comment markers between documents (e.g., `<!-- Document 1/103: index -->`) | ||
| - **Sorted Order**: Documents alphabetically sorted by path for consistency | ||
|
|
||
| This aggregated file is ideal for: | ||
|
|
||
| - Training LLMs on your complete documentation | ||
| - Feeding entire documentation into context windows | ||
| - Creating embeddings of your documentation corpus | ||
| - Providing a single download for offline LLM consumption | ||
|
|
||
| To disable generation of the full file: | ||
|
|
||
| ```python | ||
| llm_txt_settings = { | ||
| 'generate_full_file': False, | ||
| } | ||
| ``` | ||
|
|
||
| ## Output Format | ||
|
|
||
| For a file `docs/get-started/install.md`, the extension generates `_build/html/get-started/install.llm.txt`: | ||
|
|
||
| ```markdown | ||
| # Installation Guide | ||
|
|
||
| > Learn how to install NeMo Evaluator on your system using pip, Docker, or from source. | ||
|
|
||
| ## Overview | ||
|
|
||
| NeMo Evaluator is a Python package for evaluating large language models. You can install it using pip for a quick setup, or use Docker for a containerized environment... | ||
|
|
||
| ## Key Sections | ||
|
|
||
| - **Prerequisites**: Required software and dependencies before installation | ||
| - **Installation Methods**: Different ways to install the package | ||
| - **Verify Installation**: Steps to confirm successful installation | ||
| - **Troubleshooting**: Common installation issues and solutions | ||
|
|
||
| ## Related Resources | ||
|
|
||
| - [Quickstart Guide](https://docs.nvidia.com/nemo/evaluator/latest/quickstart.html) | ||
| - [Configuration Guide](https://docs.nvidia.com/nemo/evaluator/latest/configuration.html) | ||
|
|
||
| ## Metadata | ||
|
|
||
| - Document Type: guide | ||
| - Categories: getting-started | ||
| - Last Updated: 2025-10-02 | ||
| ``` | ||
|
|
||
| **Note**: Links are converted to absolute URLs when `base_url` is configured, ensuring compatibility with the [llms.txt specification](https://llmstxt.org/). | ||
|
|
||
| ## MyST Markdown Handling | ||
|
|
||
| The extension intelligently handles MyST markdown directives: | ||
|
|
||
| ### What Gets Cleaned | ||
|
|
||
| - **Toctrees**: Hidden navigation removed | ||
| - **Directive markers**: `:::`, `:::{grid}`, etc. removed | ||
| - **Directive options**: `:hidden:`, `:caption:`, `:link:`, etc. removed | ||
| - **Icons**: `{octicon}` references removed | ||
| - **HTML tags**: `<br />`, `<div>`, `<hr>` removed | ||
| - **Escaped characters**: `\\` backslashes cleaned | ||
| - **Code fences**: Language indicators cleaned | ||
|
|
||
| ### What Gets Preserved and Enhanced | ||
|
|
||
| - **Grid Cards** (when `card_handling: 'smart'`): Converted to clean markdown lists with proper links | ||
| - **Badges**: Converted to parentheses (e.g., `{bdg-secondary}`cli`` → `(cli)`) | ||
| - **Headings**: All heading structure maintained | ||
| - **Links**: Internal and external links extracted and converted to absolute URLs | ||
|
|
||
| ### Smart Card Handling | ||
|
|
||
| When `card_handling` is set to `'smart'`, the extension parses MyST markdown source files directly to extract `{grid-item-card}` directives before Sphinx processes them. This approach is more reliable than trying to parse the complex doctree structure. | ||
|
|
||
| **Example MyST Input:** | ||
| ```markdown | ||
| :::{grid-item-card} {octicon}`rocket;1.5em` NeMo Evaluator Launcher | ||
| :link: nemo-evaluator-launcher/index | ||
| :link-type: doc | ||
|
|
||
| **Start here** - Unified CLI and Python API for running evaluations | ||
| +++ | ||
| {bdg-secondary}`CLI` | ||
| ::: | ||
| ``` | ||
|
|
||
| **Generated llm.txt Output:** | ||
| ```markdown | ||
| ## Available Options | ||
|
|
||
| - **[NeMo Evaluator Launcher](https://docs.nvidia.com/nemo/evaluator/latest/nemo-evaluator-launcher/index.html)** | ||
| Start here - Unified CLI and Python API for running evaluations | ||
| ``` | ||
|
|
||
| **What Gets Cleaned:** | ||
| - Octicon references: `{octicon}`icon`` removed from titles | ||
|
syntax: Backtick formatting is broken. Should be: |
||
| - Badge syntax: `{bdg-secondary}`cli`` removed from descriptions | ||
|
syntax: Backtick formatting is broken. Should be: |
||
| - Card footers: `+++` separator and everything after it | ||
| - Template variables: `{{ product_name_short }}` replaced with "NeMo Evaluator" | ||
| - Directive options: `:link:`, `:link-type:`, etc. removed | ||
|
|
||
| **How It Works:** | ||
| 1. Reads the raw `.md` source file | ||
| 2. Uses regex to find `:::{grid-item-card}...:::` blocks | ||
| 3. Extracts title (first line), link (from `:link:` option), and description (card body) | ||
| 4. Converts internal links to absolute URLs using `base_url` | ||
| 5. Keeps external links (GitHub, etc.) unchanged | ||
| 6. Formats as clean markdown list with proper links | ||
|
|
||
| ## Content Gating Integration | ||
|
|
||
| This extension automatically respects content gating rules: | ||
|
|
||
| - Documents excluded by the `content_gating` extension are not processed | ||
| - Respects Sphinx's `exclude_patterns` configuration | ||
| - Provides debug logging when content gating rules are applied | ||
|
|
||
| ## Example Usage | ||
|
|
||
| ### Build Documentation with llm.txt Files | ||
|
|
||
| ```bash | ||
| # Build HTML documentation (llm.txt files generated automatically) | ||
| make html | ||
|
|
||
| # Check generated files | ||
| ls _build/html/*.llm.txt | ||
| ls _build/html/**/*.llm.txt | ||
| ``` | ||
|
|
||
| ### Access llm.txt Files | ||
|
|
||
| After building: | ||
|
|
||
| - Root page: `_build/html/index.llm.txt` | ||
| - Regular pages: `_build/html/path/to/page.llm.txt` | ||
| - Directory indexes: `_build/html/directory/index.llm.txt` | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### No llm.txt files generated | ||
|
|
||
| Check that the extension is enabled: | ||
|
|
||
| ```python | ||
| llm_txt_settings = {'enabled': True} | ||
| ``` | ||
|
|
||
| ### Files missing content | ||
|
|
||
| Increase content length limit: | ||
|
|
||
| ```python | ||
| llm_txt_settings = {'max_content_length': 10000} | ||
| ``` | ||
|
|
||
| ### Too many MyST artifacts | ||
|
|
||
| Enable artifact cleaning: | ||
|
|
||
| ```python | ||
| llm_txt_settings = {'clean_myst_artifacts': True} | ||
| ``` | ||
|
|
||
| ### Content gated documents included | ||
|
|
||
| The extension respects `exclude_patterns`. Add patterns to exclude: | ||
|
|
||
| ```python | ||
| llm_txt_settings = { | ||
| 'exclude_patterns': ['_build', '_templates', '_static', 'apidocs', 'internal/*'] | ||
| } | ||
| ``` | ||
|
|
||
| ## Dependencies | ||
|
|
||
| - Sphinx >= 4.0 | ||
| - docutils >= 0.16 | ||
| - PyYAML (optional, for frontmatter extraction) | ||
|
|
||
| ## License | ||
|
|
||
| Same as the parent project. | ||
|
|
||
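The "How It Works" steps for smart card handling in the README can be sketched roughly as follows. This is a minimal illustration, not the code in `content_extractor.py`, which this view does not show. The names `CARD_RE`, `OPTION_RE`, `OCTICON_RE`, `to_absolute`, and `extract_cards` are hypothetical, and appending `.html` to the `:link:` target is an assumption based on the README's example output.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns; the PR's actual regexes are not shown in this view.
# One card: an opening ":::{grid-item-card} <title>" line up to a bare ":::" line.
CARD_RE = re.compile(
    r"^:::\{grid-item-card\}[ \t]*(?P<title>[^\n]*)\n(?P<body>.*?)^:::[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
# Directive options such as ":link: target" or ":link-type: doc".
OPTION_RE = re.compile(r"^:(?P<key>[\w-]+):[ \t]*(?P<value>.*)$")
# Inline icon roles such as {octicon}`rocket;1.5em`.
OCTICON_RE = re.compile(r"\{octicon\}`[^`]*`\s*")


def to_absolute(link: str, base_url: str) -> str:
    """Keep external links unchanged; resolve doc names against base_url."""
    if link.startswith(("http://", "https://")):
        return link
    # Assumption: a doc target "a/b" is published as "<base_url>/a/b.html".
    return urljoin(base_url.rstrip("/") + "/", link.lstrip("/") + ".html")


def extract_cards(source: str, base_url: str) -> list[str]:
    """Turn each grid-item-card block into a markdown list item."""
    items = []
    for match in CARD_RE.finditer(source):
        title = OCTICON_RE.sub("", match.group("title")).strip()
        link = None
        description = []
        for line in match.group("body").splitlines():
            stripped = line.strip()
            if stripped == "+++":
                # Card footer: drop the separator and everything after it.
                break
            option = OPTION_RE.match(stripped)
            if option:
                if option.group("key") == "link":
                    link = option.group("value").strip()
                continue  # other options are discarded
            if stripped:
                description.append(stripped.replace("**", ""))
        heading = f"[{title}]({to_absolute(link, base_url)})" if link else title
        items.append(f"- **{heading}**\n  {' '.join(description)}")
    return items
```

Run against the README's example card with `base_url` set to `https://docs.nvidia.com/nemo/evaluator/latest`, this sketch yields the same title link and description as the README's generated output. A bare `:::` line closes the card, so nested directives inside a card body would need a more careful parser.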
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| """ | ||
| Sphinx extension to generate llm.txt files for LLM consumption. | ||
|
|
||
| This extension creates parallel llm.txt files for each document in a standardized | ||
| markdown format that Large Language Models can easily parse and understand. | ||
|
|
||
| The llm.txt format includes: | ||
| - Document title and summary | ||
| - Clean overview text | ||
| - Key sections with descriptions | ||
| - Related resources/links | ||
| - Metadata | ||
|
|
||
| See README.md for detailed configuration options and usage examples. | ||
| """ | ||
|
|
||
| from typing import Any | ||
|
|
||
| from sphinx.application import Sphinx | ||
|
|
||
| from .config import get_default_settings, validate_config | ||
| from .processor import on_build_finished | ||
|
|
||
|
|
||
| def setup(app: Sphinx) -> dict[str, Any]: | ||
| """Setup function for Sphinx extension.""" | ||
| # Add configuration with default settings | ||
| default_settings = get_default_settings() | ||
| app.add_config_value("llm_txt_settings", default_settings, "html") | ||
|
|
||
| # Connect to build events | ||
| app.connect("config-inited", validate_config) | ||
| app.connect("build-finished", on_build_finished) | ||
|
|
||
| return { | ||
| "version": "1.0.0", | ||
| "parallel_read_safe": True, | ||
| "parallel_write_safe": True, | ||
| } |
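`get_default_settings` and `validate_config` come from `.config`, which is not shown in this view. One reason a `config-inited` handler is useful here: Sphinx replaces a config value's default wholesale when `conf.py` sets it, so the README's minimal `{'enabled': True}` only works if the remaining keys are filled in from the defaults afterwards. The sketch below shows one way that merge could look. The names `default_settings` and `merge_settings`, the default values, and the validation rules are assumptions for illustration; only the key names are taken from the README.

```python
from typing import Any


def default_settings() -> dict[str, Any]:
    """Illustrative defaults: keys are from the README, values are assumed."""
    return {
        "enabled": True,
        "base_url": "",
        "max_content_length": 5000,
        "summary_sentences": 2,
        "card_handling": "simple",
        "generate_full_file": True,
    }


def merge_settings(user: dict[str, Any]) -> dict[str, Any]:
    """Overlay user values on the defaults and reject obviously bad input."""
    unknown = set(user) - set(default_settings())
    if unknown:
        raise ValueError(f"unknown llm_txt_settings keys: {sorted(unknown)}")
    # User values win; any key the user omitted keeps its default.
    merged = {**default_settings(), **user}
    if merged["card_handling"] not in ("simple", "smart"):
        raise ValueError("card_handling must be 'simple' or 'smart'")
    if merged["max_content_length"] < 0:
        raise ValueError("max_content_length must be >= 0 (0 means no limit)")
    return merged
```

In a real handler the merged dictionary would be written back to `app.config.llm_txt_settings`, so that later stages see a complete set of keys.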
syntax: Backtick formatting is broken in this example. Should be:
`` {bdg-secondary}`cli` `` → `(cli)`
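For reference, the badge conversion that this example documents can be sketched with a single substitution. The pattern and the function name `badges_to_parens` are hypothetical, and the assumption that every `{bdg-*}` role is treated the same way is not confirmed by the PR.

```python
import re

# Hypothetical pattern: matches {bdg}`text` and {bdg-<variant>}`text` roles.
BADGE_RE = re.compile(r"\{bdg(?:-[\w-]+)?\}`([^`]*)`")


def badges_to_parens(text: str) -> str:
    """Replace each badge role with its label wrapped in parentheses."""
    return BADGE_RE.sub(r"(\1)", text)
```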