Changes from all commits (34 commits)
- 6b660ec: updates (lbliii, Oct 3, 2025)
- d22db1f: Fixing versions1.json to support nested folders (#1157) (aschilling-nv, Oct 1, 2025)
- 45eef7c: Resolve a few warnings in the docs build (#1120) (jrbourbeau, Oct 2, 2025)
- 4ec94a6: Context manager support for ``RayClient`` (#1155) (jrbourbeau, Oct 2, 2025)
- cc42725: added hydra params as defaults (#1159) (Ssofja, Oct 2, 2025)
- aa9be41: Increase timeout for CPU tests (#1162) (sarahyurick, Oct 3, 2025)
- db6a64a: Llane/docs readme updates (#1163) (lbliii, Oct 3, 2025)
- 0658cab: add_codecov_badge (#1172) (pablo-garay, Oct 7, 2025)
- 0d77c3b: Support more flags when building docs (#1175) (ayushdg, Oct 8, 2025)
- d569ef8: Fix bug for running filters on empty batches (#1173) (sarahyurick, Oct 8, 2025)
- 7c81ef8: docs: pyrpoject toml classifiers section (#1176) (lbliii, Oct 9, 2025)
- 386fdc4: docs: migration guide and faq (#1167) (lbliii, Oct 10, 2025)
- db0665a: citation file (#1165) (lbliii, Oct 13, 2025)
- 3551968: add enhanced captioning to rn; bump versions on json files (#1154) (lbliii, Oct 13, 2025)
- f5b7710: Add large file splitting script (#1161) (jrbourbeau, Oct 14, 2025)
- 2705877: docs ruff (lbliii, Oct 15, 2025)
- 7c9618c: fixes (lbliii, Oct 15, 2025)
- a4d28ce: Update docs/_extensions/llm_txt_output/content_extractor.py (lbliii, Oct 15, 2025)
- 9d2d1fc: Fix ``JsonlWriter`` / ``ParquetWriter`` output directory typo (#1180) (jrbourbeau, Oct 16, 2025)
- 16cd357: new extension: rich metadata for SEO (#1182) (lbliii, Oct 17, 2025)
- 10a579a: Llama Nemotron Data Curation tutorial (#1063) (sarahyurick, Oct 17, 2025)
- 443e516: Fix flaky ``test_split_parquet_file_by_size`` failure (#1185) (jrbourbeau, Oct 20, 2025)
- a9050b3: Switch from pynvml to nvidia-ml-py (#1186) (jrbourbeau, Oct 20, 2025)
- 6610b67: ruff (lbliii, Oct 21, 2025)
- 67dbf1f: improve ref target handling and ruff (lbliii, Oct 21, 2025)
- 5ab7adb: Remove ftfy pin (#1189) (sarahyurick, Oct 21, 2025)
- 0341fb0: Increase UV_HTTP_TIMEOUT to reduce sporadic test failures (#1196) (jrbourbeau, Oct 22, 2025)
- 9e61d05: Link to documentation within text tutorials (#1190) (sarahyurick, Oct 22, 2025)
- 3859d5f: docs: remove unused noqa directives for PLC0415 in llm_txt_output ext… (lbliii, Oct 22, 2025)
- 05f93e5: Merge branch 'main' into llane/llm-txt-extension (lbliii, Oct 22, 2025)
- cd8df09: Merge branch 'main' into llane/llm-txt-extension (lbliii, Oct 27, 2025)
- fb22e91: docs(llm-txt): refactor content_extractor with helper functions and f… (lbliii, Oct 28, 2025)
- 677fef6: docs(llm-txt): apply ruff linting and style fixes (lbliii, Oct 28, 2025)
- fef3fd0: docs(llm-txt): fix regex error in sentence boundary detection (lbliii, Oct 28, 2025)
17 changes: 17 additions & 0 deletions docs/_extensions/llm_txt_output/.ruff.toml
@@ -0,0 +1,17 @@
# Ruff configuration for llm_txt_output extension

[lint]
# Ignore complexity warnings for grid card extraction functions
# These functions handle complex nested structures from MyST markdown
# and are difficult to simplify further without losing functionality
ignore = [
"C901", # Complex function
"PLR0912", # Too many branches
"PLR0915", # Too many statements
"PLC0415", # Import not at top-level (conditional imports for optional dependencies)
]

[lint.per-file-ignores]
# Allow complex functions in content_extractor.py for parsing logic
"content_extractor.py" = ["C901", "PLR0912", "PLR0915", "PLC0415"]
Comment on lines +14 to +16

style: Per-file ignores at line 16 duplicate the global ignores from lines 7-12 for `content_extractor.py`. Remove this redundant section, since the global ignores already apply.


290 changes: 290 additions & 0 deletions docs/_extensions/llm_txt_output/README.md
@@ -0,0 +1,290 @@
# LLM.txt Output Extension

Sphinx extension to generate `llm.txt` files for every documentation page in a standardized format optimized for Large Language Model consumption.

## What is llm.txt?

The `llm.txt` format is a proposed standard designed to help LLMs like ChatGPT and Claude better understand and index website content. Each file contains:

- Document title and summary
- Clean overview text
- Key sections with descriptions
- Related resources/links
- Metadata

This extension generates:

1. **Individual `.llm.txt` files**: One file for each documentation page (e.g., `index.llm.txt`, `getting-started/quickstart.llm.txt`)
2. **Aggregated `llm-full.txt`**: A single file containing all documentation with a table of contents for complete site indexing

## Features

- **Simple text format**: Plain markdown files that LLMs can easily parse
- **Clean content extraction**: Removes MyST directive artifacts, toctrees, and navigation elements
- **Structured sections**: Organized with headings, summaries, and key sections
- **Related links**: Automatically extracts internal links for context
- **Metadata support**: Includes frontmatter metadata and document classification
- **Content gating integration**: Respects content gating rules from the content_gating extension
- **Configurable**: Control content length, sections included, and more

## Installation

The extension is included in the `_extensions` directory. Enable it in your `conf.py`:

```python
extensions = [
    # ... other extensions
    '_extensions.llm_txt_output',
]
```

## Configuration

### Minimal Configuration (Recommended)

```python
# conf.py
llm_txt_settings = {
    'enabled': True,  # All other settings use defaults
}
```

### Full Configuration Options

```python
# conf.py
llm_txt_settings = {
    'enabled': True,                     # Enable/disable generation
    'exclude_patterns': [                # Patterns to exclude
        '_build',
        '_templates',
        '_static',
        'apidocs'
    ],
    'verbose': True,                     # Verbose logging
    'base_url': 'https://docs.example.com/latest',  # Base URL for absolute links
    'max_content_length': 5000,          # Max chars in overview (0 = no limit)
    'summary_sentences': 2,              # Number of sentences for summary
    'include_metadata': True,            # Include metadata section
    'include_headings': True,            # Include key sections from headings
    'include_related_links': True,       # Include internal links
    'max_related_links': 10,             # Max related links to include
    'card_handling': 'simple',           # 'simple' or 'smart' for grid cards
    'clean_myst_artifacts': True,        # Remove MyST directive artifacts
    'generate_full_file': True,          # Generate llm-full.txt with all docs
}
```

### Important: Configure base_url for Proper Link Handling

To comply with the [llms.txt specification](https://llmstxt.org/), you **must** set the `base_url` option to your documentation's canonical URL. This ensures all internal links are converted to absolute URLs that LLMs can access:

```python
llm_txt_settings = {
    'base_url': 'https://docs.nvidia.com/nemo/evaluator/latest',
}
```

Without `base_url`, links will remain relative (e.g., `../quickstart.html`) instead of absolute URLs (e.g., `https://docs.nvidia.com/nemo/evaluator/latest/quickstart.html`).
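
For intuition, here is a minimal sketch (illustrative only, not the extension's actual code) of the kind of resolution that `base_url` enables, using Python's standard `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

base_url = "https://docs.nvidia.com/nemo/evaluator/latest/"

# A relative link found on get-started/install.html resolves against the page's
# absolute URL, yielding a link that an LLM can fetch directly.
page_url = urljoin(base_url, "get-started/install.html")
absolute = urljoin(page_url, "../quickstart.html")
print(absolute)  # https://docs.nvidia.com/nemo/evaluator/latest/quickstart.html
```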

## Output Files

The extension generates two types of output:

### Individual Page Files

For a file `docs/get-started/install.md`, the extension generates `_build/html/get-started/install.llm.txt`.
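
A rough sketch of that mapping (hypothetical helper, not the extension's actual code):

```python
from pathlib import Path


def llm_txt_path(docname: str, outdir: str) -> Path:
    """Map a Sphinx docname such as 'get-started/install' to its .llm.txt output path."""
    return Path(outdir) / f"{docname}.llm.txt"


print(llm_txt_path("get-started/install", "_build/html"))
# _build/html/get-started/install.llm.txt
```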

### Aggregated Full Site File

The extension also generates `_build/html/llm-full.txt` containing all documentation in a single file with:

- **Header**: Project name and version
- **Table of Contents**: Complete list of all documentation pages with titles
- **Document Separators**: Clear HTML comment markers between documents (e.g., `<!-- Document 1/103: index -->`)
- **Sorted Order**: Documents alphabetically sorted by path for consistency
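
For illustration only (titles and counts below are placeholders, not actual output), the aggregated file follows this general shape:

```markdown
# Project Documentation

Version: latest

## Table of Contents

- index: Project Overview
- get-started/install: Installation Guide

<!-- Document 1/103: index -->

# Project Overview
...

<!-- Document 2/103: get-started/install -->

# Installation Guide
...
```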

This aggregated file is ideal for:

- Training LLMs on your complete documentation
- Feeding entire documentation into context windows
- Creating embeddings of your documentation corpus
- Providing a single download for offline LLM consumption

To disable generation of the full file:

```python
llm_txt_settings = {
    'generate_full_file': False,
}
```

## Output Format

For a file `docs/get-started/install.md`, the extension generates `_build/html/get-started/install.llm.txt`:

```markdown
# Installation Guide

> Learn how to install NeMo Evaluator on your system using pip, Docker, or from source.

## Overview

NeMo Evaluator is a Python package for evaluating large language models. You can install it using pip for a quick setup, or use Docker for a containerized environment...

## Key Sections

- **Prerequisites**: Required software and dependencies before installation
- **Installation Methods**: Different ways to install the package
- **Verify Installation**: Steps to confirm successful installation
- **Troubleshooting**: Common installation issues and solutions

## Related Resources

- [Quickstart Guide](https://docs.nvidia.com/nemo/evaluator/latest/quickstart.html)
- [Configuration Guide](https://docs.nvidia.com/nemo/evaluator/latest/configuration.html)

## Metadata

- Document Type: guide
- Categories: getting-started
- Last Updated: 2025-10-02
```

**Note**: Links are converted to absolute URLs when `base_url` is configured, ensuring compatibility with the [llms.txt specification](https://llmstxt.org/).

## MyST Markdown Handling

The extension intelligently handles MyST markdown directives:

### What Gets Cleaned

- **Toctrees**: Hidden navigation removed
- **Directive markers**: `:::`, `:::{grid}`, etc. removed
- **Directive options**: `:hidden:`, `:caption:`, `:link:`, etc. removed
- **Icons**: `{octicon}` references removed
- **HTML tags**: `<br />`, `<div>`, `<hr>` removed
- **Escaped characters**: `\\` backslashes cleaned
- **Code fences**: Language indicators cleaned
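
A compact sketch of this cleanup pass (illustrative only; the real implementation lives in `content_extractor.py` and differs in detail):

```python
import re


def clean_myst_artifacts(text: str) -> str:
    """Strip MyST directive markers, options, icons, and stray HTML from extracted text."""
    text = re.sub(r"^:::\{?[\w-]*\}?.*$", "", text, flags=re.MULTILINE)  # ::: and :::{grid} markers
    text = re.sub(r"^:[\w-]+:.*$", "", text, flags=re.MULTILINE)         # :hidden:, :caption:, :link: options
    text = re.sub(r"\{octicon\}`[^`]*`", "", text)                       # {octicon} icon references
    text = re.sub(r"</?\w+[^>]*>", "", text)                             # <br />, <div>, <hr> tags
    text = text.replace("\\", "")                                        # escaped backslashes
    return re.sub(r"\n{3,}", "\n\n", text).strip()                       # collapse leftover blank lines
```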

### What Gets Preserved and Enhanced

- **Grid Cards** (when `card_handling: 'smart'`): Converted to clean markdown lists with proper links
- **Badges**: Converted to parentheses (e.g., `` {bdg-secondary}`cli` `` → `(cli)`)
- **Headings**: All heading structure maintained
- **Links**: Internal and external links extracted and converted to absolute URLs

### Smart Card Handling

When `card_handling` is set to `'smart'`, the extension parses MyST markdown source files directly to extract `{grid-item-card}` directives before Sphinx processes them. This approach is more reliable than trying to parse the complex doctree structure.

**Example MyST Input:**
```markdown
:::{grid-item-card} {octicon}`rocket;1.5em` NeMo Evaluator Launcher
:link: nemo-evaluator-launcher/index
:link-type: doc

**Start here** - Unified CLI and Python API for running evaluations
+++
{bdg-secondary}`CLI`
:::
```

**Generated llm.txt Output:**
```markdown
## Available Options

- **[NeMo Evaluator Launcher](https://docs.nvidia.com/nemo/evaluator/latest/nemo-evaluator-launcher/index.html)**
Start here - Unified CLI and Python API for running evaluations
```

**What Gets Cleaned:**
- Octicon references: `` {octicon}`icon` `` removed from titles
- Badge syntax: `` {bdg-secondary}`cli` `` removed from descriptions
- Card footers: `+++` separator and everything after it
- Template variables: `{{ product_name_short }}` replaced with "NeMo Evaluator"
- Directive options: `:link:`, `:link-type:`, etc. removed

**How It Works:**
1. Reads the raw `.md` source file
2. Uses regex to find `:::{grid-item-card}...:::` blocks
3. Extracts title (first line), link (from `:link:` option), and description (card body)
4. Converts internal links to absolute URLs using `base_url`
5. Keeps external links (GitHub, etc.) unchanged
6. Formats as clean markdown list with proper links
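
A minimal sketch of steps 1-6 (assumptions: hypothetical helper names, simplified regexes, and `.html` link suffixes; the real implementation in `content_extractor.py` differs):

```python
import re
from urllib.parse import urljoin

# Match :::{grid-item-card} blocks in raw MyST markdown source.
CARD_RE = re.compile(
    r"^:::\{grid-item-card\}[ \t]*(?P<title>[^\n]*)\n"  # opening line with title
    r"(?P<body>.*?)"                                     # options, description, footer
    r"^:::[ \t]*$",                                      # closing fence
    re.MULTILINE | re.DOTALL,
)


def extract_cards(source: str, base_url: str) -> list[str]:
    """Convert {grid-item-card} blocks into clean markdown list items."""
    items = []
    for match in CARD_RE.finditer(source):
        # Title: drop {octicon}`...` icon references.
        title = re.sub(r"\{octicon\}`[^`]*`", "", match.group("title")).strip()
        body = match.group("body")
        # Link: taken from the :link: option; internal targets become absolute URLs.
        link_match = re.search(r"^:link:\s*(\S+)", body, re.MULTILINE)
        link = link_match.group(1) if link_match else ""
        if link and not link.startswith("http"):
            link = urljoin(base_url.rstrip("/") + "/", link + ".html")
        # Description: drop the +++ footer and any remaining :option: lines.
        description = body.split("+++")[0]
        description = "\n".join(
            line for line in description.splitlines() if not line.startswith(":")
        ).strip()
        items.append(f"- **[{title}]({link})**\n  {description}")
    return items
```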

## Content Gating Integration

This extension automatically respects content gating rules:

- Documents excluded by the `content_gating` extension are not processed
- Respects Sphinx's `exclude_patterns` configuration
- Provides debug logging when content gating rules are applied

## Example Usage

### Build Documentation with llm.txt Files

```bash
# Build HTML documentation (llm.txt files generated automatically)
make html

# Check generated files
ls _build/html/*.llm.txt
ls _build/html/**/*.llm.txt
```

### Access llm.txt Files

After building:

- Root page: `_build/html/index.llm.txt`
- Regular pages: `_build/html/path/to/page.llm.txt`
- Directory indexes: `_build/html/directory/index.llm.txt`

## Troubleshooting

### No llm.txt files generated

Check that the extension is enabled:

```python
llm_txt_settings = {'enabled': True}
```

### Files missing content

Increase content length limit:

```python
llm_txt_settings = {'max_content_length': 10000}
```

### Too many MyST artifacts

Enable artifact cleaning:

```python
llm_txt_settings = {'clean_myst_artifacts': True}
```

### Content gated documents included

The extension respects `exclude_patterns`. Add patterns to exclude:

```python
llm_txt_settings = {
    'exclude_patterns': ['_build', '_templates', '_static', 'apidocs', 'internal/*']
}
```
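
To check whether a pattern will match a given document, a quick test along these lines can help (illustrative only, using Python's `fnmatch`; the extension's actual matching logic may differ):

```python
from fnmatch import fnmatch


def is_excluded(docname: str, exclude_patterns: list[str]) -> bool:
    """Return True if a Sphinx docname matches any configured exclude pattern."""
    return any(
        fnmatch(docname, pattern) or docname.startswith(pattern.rstrip("*/"))
        for pattern in exclude_patterns
    )


print(is_excluded("internal/design-notes", ["_build", "internal/*"]))  # True
print(is_excluded("get-started/install", ["_build", "internal/*"]))    # False
```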

## Dependencies

- Sphinx >= 4.0
- docutils >= 0.16
- PyYAML (optional, for frontmatter extraction)

## License

Same as the parent project.

38 changes: 38 additions & 0 deletions docs/_extensions/llm_txt_output/__init__.py
@@ -0,0 +1,38 @@
"""Sphinx extension to generate llm.txt files for LLM consumption.

This extension creates parallel llm.txt files for each document in a standardized
markdown format that Large Language Models can easily parse and understand.

The llm.txt format includes:
- Document title and summary
- Clean overview text
- Key sections with descriptions
- Related resources/links
- Metadata

See README.md for detailed configuration options and usage examples.
"""

from typing import Any

from sphinx.application import Sphinx

from .config import get_default_settings, validate_config
from .processor import on_build_finished


def setup(app: Sphinx) -> dict[str, Any]:
"""Set up Sphinx extension for llm.txt generation."""
# Add configuration with default settings
default_settings = get_default_settings()
app.add_config_value("llm_txt_settings", default_settings, "html")

# Connect to build events
app.connect("config-inited", validate_config)
app.connect("build-finished", on_build_finished)

return {
"version": "1.0.0",
"parallel_read_safe": True,
"parallel_write_safe": True,
}
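
`get_default_settings` and `validate_config` live in `config.py`, and `on_build_finished` in `processor.py`; none of those modules appear in this diff. As a rough, hypothetical sketch, the defaults would mirror the options documented in the README (actual values may differ):

```python
from typing import Any


def get_default_settings() -> dict[str, Any]:
    """Hypothetical defaults mirroring the README's documented options."""
    return {
        "enabled": False,  # assumed off by default; enable explicitly in conf.py
        "exclude_patterns": ["_build", "_templates", "_static", "apidocs"],
        "verbose": False,
        "base_url": "",
        "max_content_length": 5000,
        "summary_sentences": 2,
        "include_metadata": True,
        "include_headings": True,
        "include_related_links": True,
        "max_related_links": 10,
        "card_handling": "simple",
        "clean_myst_artifacts": True,
        "generate_full_file": True,
    }
```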