
Conversation

Contributor

@lbliii lbliii commented Oct 3, 2025

This PR adds a new Sphinx extension that makes the documentation easier for AI tools to consume by emitting a per-page .llm.txt file for every document plus a full-site llm-full.txt file as part of the build output.
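For reviewers, here is a minimal sketch of how the extension is wired into `docs/conf.py`. The key names shown (`enabled`, `generate_full_file`, `include_metadata`, and so on) come from the extension's defaults discussed later in review; the `base_url` value and the exact names of any length-limit options are assumptions, not the literal diff in this PR.

```python
# docs/conf.py (illustrative sketch, not the exact diff)
extensions = [
    # ... existing extensions ...
    "llm_txt_output",  # emits per-page .llm.txt files plus an aggregated llm-full.txt
]

llm_txt_settings = {
    "enabled": True,
    "generate_full_file": True,   # also write the aggregated llm-full.txt
    "base_url": "https://docs.nvidia.com/nemo/curator/latest/",  # assumed value
    "include_metadata": True,
    "include_headings": True,
    "include_related_links": True,
    "clean_myst_artifacts": True,  # strip MyST directive residue from extracted text
    "verbose": False,
}
```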

Example Output

Location: /about/concepts/text/data-curation-pipeline.llm.txt

# Text Data Curation Pipeline

> This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

## Overview

Text Data Curation Pipeline This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

Architecture Overview The following diagram provides a high-level outline of NeMo Curator’s text curation architecture: Pipeline Stages NeMo Curator’s text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training: 1.

Data Sources Multiple input sources provide the foundation for text curation: Cloud storage : (S3, GCS, Azure) Local workstation : files (JSONL, Parquet) 2.

Data Acquisition & Processing Raw data is downloaded, extracted, and converted into standardized formats: Download & Extraction : Retrieve and process remote data sources Cleaning & Pre-processing : Convert formats and normalize text DocumentBatch Creation : Standardize data into NeMo Curator’s core data structure 3.

Quality Assessment & Filtering Multiple filtering stages ensure data quality: Heuristic Quality Filtering : Rule-based filters for basic quality checks Model-based Quality Filtering : Classification models trained to identify high vs. low quality text 4.

Deduplication Remove duplicate and near-duplicate content: Exact Deduplication : Remove identical documents using MD5 hashing Fuzzy Deduplication : Remove near-duplicates using MinHash and LSH similarity Semantic Deduplication : Remove semantically similar content using embeddings 5.

Final Preparation Prepare the curated dataset for training: Format Standardization : Ensure consistent output format Infrastructure Foundation The entire pipeline runs on a robust, scalable infrastructure: Ray : Distributed computing framework for parallelization RAPIDS : GPU-accelerated data processing (cuDF, cuGraph, cuML) Flexible Deployment : CPU and GPU acceleration support Key Components The pipeline leverages several core component types: Data Loading Core concepts for loading and managing text datasets from local files about-concepts-text-data-loading Data Acquisition Components for downloading and extracting data from remote sources about-concepts-text-data-acquisition Data Processing Concepts for filtering, deduplication, and classification about-concepts-text-data-processing Processing Modes The pipeline supports different processing approaches: GPU Acceleration : Leverage NVIDIA GPUs for: High-throughput data processing ML model inference for classification Embedding generation for semantic operations CPU Processing : Scale across multiple CPU cores for: Text parsing and cleaning Rule-based filtering Large-scale data transformations Hybrid Workflows : Combine CPU and GPU processing for optimal performance based on the specific operation.

Scalability & Deployment The architecture scales from single machines to large clusters: Single Node : Process datasets on laptops or workstations Multi-Node : Distribute processing across cluster resources Cloud Native : Deploy on cloud platforms HPC Integration : Run on HPC supercomputing clusters For hands-on experience, see the Text Curation Getting Started Guide

## Available Options

- **[Data Loading](https://docs.nvidia.com/nemo/curator/latest/about-concepts-text-data-loading.html)**
  Core concepts for loading and managing text datasets from local files
- **[Data Acquisition](https://docs.nvidia.com/nemo/curator/latest/about-concepts-text-data-acquisition.html)**
  Components for downloading and extracting data from remote sources
- **[Data Processing](https://docs.nvidia.com/nemo/curator/latest/about-concepts-text-data-processing.html)**
  Concepts for filtering, deduplication, and classification

## Key Sections

- **Text Data Curation Pipeline**: This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.
- **Architecture Overview**: The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:
- **Pipeline Stages**: NeMo Curator’s text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:
- **1. Data Sources**: Multiple input sources provide the foundation for text curation:
- **2. Data Acquisition & Processing**: Raw data is downloaded, extracted, and converted into standardized formats:
- **3. Quality Assessment & Filtering**: Multiple filtering stages ensure data quality:
- **4. Deduplication**: Remove duplicate and near-duplicate content:
- **5. Final Preparation**: Prepare the curated dataset for training:
- **Infrastructure Foundation**: The entire pipeline runs on a robust, scalable infrastructure:
- **Key Components**: The pipeline leverages several core component types:
- **Processing Modes**: The pipeline supports different processing approaches:
- **Scalability & Deployment**: The architecture scales from single machines to large clusters:

## Metadata

- **Description**: Comprehensive overview of NeMo Curator's text curation pipeline architecture including data acquisition and processing
- **Content Type**: concept
- **Difficulty**: beginner
- **Tags**: pipeline, architecture, text-curation, distributed, gpu-accelerated, overview
- **Categories**: concepts-architecture
- **Target Audience**: data-scientist-focused, mle-focused
- **Last Updated**: 2025-10-15

@lbliii lbliii self-assigned this Oct 3, 2025

copy-pr-bot bot commented Oct 3, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@lbliii lbliii requested a review from Copilot October 15, 2025 18:46
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds a new Sphinx extension for generating LLM-friendly documentation output in the llm.txt format. The extension creates parallel .llm.txt files for each documentation page and an aggregated llm-full.txt file containing all documentation. The goal is to make documentation more accessible to Large Language Models by providing clean, structured text files optimized for AI consumption.

Key changes:

  • Complete implementation of the llm_txt_output Sphinx extension with modular architecture
  • Smart MyST directive parsing to extract grid cards and clean content artifacts
  • Integration with existing Sphinx configuration and content gating systems
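The registration itself follows the standard Sphinx extension pattern. A rough sketch of `__init__.py` is below; the hook names come from the sequence diagram later in this thread, while the defaults object and the returned metadata are assumptions:

```python
# docs/_extensions/llm_txt_output/__init__.py (sketch)
from sphinx.application import Sphinx

from .config import DEFAULT_SETTINGS, validate_config  # DEFAULT_SETTINGS name is assumed
from .processor import on_build_finished


def setup(app: Sphinx) -> dict:
    # Expose the settings dict so projects can override it from conf.py
    app.add_config_value("llm_txt_settings", DEFAULT_SETTINGS, "env")

    # Normalize and validate settings once configuration is loaded
    app.connect("config-inited", validate_config)

    # Generate the .llm.txt outputs after the HTML build completes
    app.connect("build-finished", on_build_finished)

    return {"version": "0.1", "parallel_read_safe": True}  # values assumed
```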

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/conf.py | Added extension registration and configuration settings for llm.txt output |
| docs/_extensions/llm_txt_output/__init__.py | Main extension entry point with Sphinx setup configuration |
| docs/_extensions/llm_txt_output/config.py | Configuration validation and default settings management |
| docs/_extensions/llm_txt_output/processor.py | Build orchestration and document processing pipeline |
| docs/_extensions/llm_txt_output/content_extractor.py | Core content extraction logic with MyST cleaning and grid card parsing |
| docs/_extensions/llm_txt_output/template_builder.py | LLM.txt format template generation and content structuring |
| docs/_extensions/llm_txt_output/writer.py | File writing operations for output generation |
| docs/_extensions/llm_txt_output/README.md | Comprehensive documentation for the extension |
Comments suppressed due to low confidence (1)

docs/_extensions/llm_txt_output/processor.py:1

  • Bare exception handling with a broad catch is generally discouraged. Consider catching specific exceptions like OSError, IOError, or yaml.YAMLError for better error handling and debugging.
"""Document processing and build orchestration for llm.txt output."""

@lbliii lbliii marked this pull request as ready for review October 15, 2025 19:29
lbliii and others added 25 commits October 22, 2025 13:17
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: James Bourbeau <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Context manager support for RayClient

Signed-off-by: James Bourbeau <[email protected]>

* Bump timeout

Signed-off-by: James Bourbeau <[email protected]>

---------

Signed-off-by: James Bourbeau <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Ssofja <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Increase timeout for CPU tests

Signed-off-by: Sarah Yurick <[email protected]>

* split stages tests into separate workflows

Signed-off-by: Sarah Yurick <[email protected]>

* fix slash issue

Signed-off-by: Sarah Yurick <[email protected]>

* try quotes

Signed-off-by: Sarah Yurick <[email protected]>

* another attempt..

Signed-off-by: Sarah Yurick <[email protected]>

* ..

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* readme updates

Signed-off-by: Lawrence Lane <[email protected]>

* readme overhaul

Signed-off-by: Lawrence Lane <[email protected]>

* marketing callout

Signed-off-by: Lawrence Lane <[email protected]>

* added >

Signed-off-by: Lawrence Lane <[email protected]>

* dask > ray

Signed-off-by: Lawrence Lane <[email protected]>

* Update README.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

* linkfixes; image classifier changed to filters; remove old pages

Signed-off-by: Lawrence Lane <[email protected]>

* link swap

Signed-off-by: Lawrence Lane <[email protected]>

* fixes

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Fix bug for running filters on empty batches

Signed-off-by: Sarah Yurick <[email protected]>

* is_actor_stage for models and tokenizers

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* docs: initial loading of migration guide and faq

Signed-off-by: Lawrence Lane <[email protected]>

* minor changes

Signed-off-by: Lawrence Lane <[email protected]>

* crosslinks

Signed-off-by: Lawrence Lane <[email protected]>

* links to migration guides

Signed-off-by: Lawrence Lane <[email protected]>

* wording

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/about/release-notes/index.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

* remove parenthetical

Signed-off-by: Lawrence Lane <[email protected]>

* gh link

Signed-off-by: Lawrence Lane <[email protected]>

* cleanup

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* text curation updates

Signed-off-by: Lawrence Lane <[email protected]>

* concepts

Signed-off-by: Lawrence Lane <[email protected]>

* remove synthetic docs not for this release

Signed-off-by: Lawrence Lane <[email protected]>

* updates

Signed-off-by: Lawrence Lane <[email protected]>

* text concepts and getting started changes

Signed-off-by: Lawrence Lane <[email protected]>

* links, concepts

Signed-off-by: Lawrence Lane <[email protected]>

* crosslinks

Signed-off-by: Lawrence Lane <[email protected]>

* quality assessment updates

Signed-off-by: Lawrence Lane <[email protected]>

* more cleanup

Signed-off-by: Lawrence Lane <[email protected]>

* semdedup

Signed-off-by: Lawrence Lane <[email protected]>

* example import cleanup

Signed-off-by: Lawrence Lane <[email protected]>

* concepts

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/about/concepts/text/data-acquisition-concepts.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: L.B. <[email protected]>

* feedback batch 1

Signed-off-by: Lawrence Lane <[email protected]>

* feedback batch 2

Signed-off-by: Lawrence Lane <[email protected]>

* file_paths="/path/to/jsonl_directory",

Signed-off-by: Lawrence Lane <[email protected]>

* revert removal of xenna for common crawl executors

Signed-off-by: Lawrence Lane <[email protected]>

* quickstart installation steps

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/about/concepts/text/data-acquisition-concepts.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

* data loading concepts updates / simplification

Signed-off-by: Lawrence Lane <[email protected]>

* data processing feedback

Signed-off-by: Lawrence Lane <[email protected]>

* read-existing pg updates

Signed-off-by: Lawrence Lane <[email protected]>

* add-id updates

Signed-off-by: Lawrence Lane <[email protected]>

* dedup updates

Signed-off-by: Lawrence Lane <[email protected]>

* feedback

Signed-off-by: Lawrence Lane <[email protected]>

* citation file

Signed-off-by: Lawrence Lane <[email protected]>

* fix

Signed-off-by: Lawrence Lane <[email protected]>

* updates

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
…Mo#1154)

* add enhanced captioning to rn; bump versions on json files

Signed-off-by: Lawrence Lane <[email protected]>

* remove ds store

Signed-off-by: Lawrence Lane <[email protected]>

* feedback

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: L.B. <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* docs(extensions): add rich metadata SEO extension with frontmatter integration

Add new Sphinx extension that injects SEO-optimized metadata into HTML head from frontmatter:

Core Features:
- Extract frontmatter (description, tags, personas, difficulty, content_type, modality)
- Generate standard meta tags (description, keywords, audience)
- Generate Open Graph tags for social sharing (Facebook, LinkedIn)
- Generate Twitter Card tags for enhanced previews
- Generate JSON-LD structured data (schema.org) for search engines
- Support product versioning via cascade.product fields

Components:
- rich_metadata/__init__.py: Main extension with config-inited and html-page-context hooks
- rich_metadata/templates/layout.html: Template override for metadata injection
- rich_metadata/README.md: Technical overview and features
- rich_metadata/USAGE.md: Complete usage guide with examples
- rich_metadata/SUMMARY.md: Quick reference guide
- rich_metadata/IMPLEMENTATION.md: Architecture and implementation details
- rich_metadata/verify_metadata.py: Automated verification script

Template handling follows search_assets pattern - templates live within extension folder
and are automatically added to Sphinx template search path via config-inited hook.

Extension enabled in conf.py and ready for use with existing frontmatter.

* docs(rich_metadata): add SEO metadata extension with organized output and enhanced titles

- Extract frontmatter from markdown files and inject SEO metadata
- Support standard meta tags (description, keywords)
- Support Open Graph tags (og:description, og:title, og:type, og:url)
- Support Twitter Card tags (twitter:description, twitter:title, twitter:card)
- Support custom content metadata (audience, difficulty, modality, content_type)
- Generate JSON-LD structured data (schema.org Article/TechArticle)
- Organize metadata with HTML comments for readability
- Enhanced page titles: 'Page: Section - Site | NVIDIA'
- Template override for clean title rendering
- Suppress warnings for generated pages (genindex, search, etc.)
- Add frontmatter to homepage (index.md)
- Fix invalid Jinja2 meta tag in search.html template

* docs(rich_metadata): fix duplicates and use enhanced titles for social sharing

- Remove duplicate metatags rendering (parent theme already renders it)
- Use enhanced structured title format for og:title and twitter:title
- All titles now consistent: 'Page: Section - Site | NVIDIA'
- Improves social sharing context on Facebook, Twitter, LinkedIn

* rich metadata

Signed-off-by: Lawrence Lane <[email protected]>

* docs: rich metadata for seo

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/_extensions/rich_metadata/__init__.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: L.B. <[email protected]>

* docs(extensions): refactor build_meta_tags to reduce complexity via helper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields

* docs(rich_metadata): fix linter errors in extension code

- Replace aliased errors with OSError in frontmatter extraction
- Refactor _add_custom_fields into smaller functions to reduce complexity
- Remove unnecessary variable assignment before return in build_meta_tags
- Break down verify_html_file into helper functions (_display_meta_tags, _display_json_ld, _display_no_metadata_help)
- Add return type annotation to main() function
- Remove trailing whitespace from blank lines throughout verify_metadata.py

* docs(rich_metadata): fix quote style to use double quotes per Ruff Q000

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* create llama nemotron tutorial script and readme

Signed-off-by: Sarah Yurick <[email protected]>

* enable all heuristic filters

Signed-off-by: Sarah Yurick <[email protected]>

* add tokenizer steps

Signed-off-by: Sarah Yurick <[email protected]>

* add noqa

Signed-off-by: Sarah Yurick <[email protected]>

* add full tutorial

Signed-off-by: Sarah Yurick <[email protected]>

* fix empty filter test

Signed-off-by: Sarah Yurick <[email protected]>

* remove use_fast=True

Signed-off-by: Sarah Yurick <[email protected]>

* Update tutorials/text/llama-nemotron-data-curation/main.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update tutorials/text/llama-nemotron-data-curation/main.py

Signed-off-by: Sarah Yurick <[email protected]>

* add comment, refresh files

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* typos

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Log more info on test_split_parquet_file_by_size failure

Signed-off-by: James Bourbeau <[email protected]>

* Sort files

Signed-off-by: James Bourbeau <[email protected]>

---------

Signed-off-by: James Bourbeau <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: James Bourbeau <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
sarahyurick and others added 4 commits October 22, 2025 13:17
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* link to docs in classifier tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* add links to semdedup tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* download and extract links

Signed-off-by: Sarah Yurick <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
@lbliii lbliii force-pushed the llane/llm-txt-extension branch from d7156d4 to 3859d5f on October 22, 2025 17:18

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR introduces a new Sphinx extension (llm_txt_output) that generates AI-consumable documentation outputs in two formats: per-page .llm.txt files and an aggregated llm-full.txt file. The extension hooks into Sphinx's build lifecycle, extracting structured content (titles, summaries, headings, grid cards, links, metadata) from documentation sources after HTML generation completes. It integrates with NeMo Curator's existing MyST markdown processing and content gating infrastructure by respecting exclude_patterns from both Sphinx's global config and the extension's own settings. The implementation consists of five coordinated modules: configuration management (config.py), document tree content extraction (content_extractor.py), markdown template building (template_builder.py), file writing (writer.py), and orchestration (processor.py). The extension is enabled in docs/conf.py with a base URL for absolute link generation, content length limits, and smart grid card handling. This fits into the project's broader documentation strategy by providing machine-readable outputs alongside human-readable HTML, enabling LLM assistants to consume documentation more effectively.
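The flow described above boils down to a single pass over the built documents. The sketch below mirrors the sequence diagram further down; the function names are taken from that diagram, and the control flow is a simplified assumption rather than the exact implementation:

```python
# docs/_extensions/llm_txt_output/processor.py (simplified sketch)
def on_build_finished(app, exception):
    if exception is not None:  # assumed: skip output generation for failed builds
        return

    settings = app.config.llm_txt_settings
    all_content = []

    for docname in app.env.all_docs:
        # Honor exclude_patterns and the content-gating rules
        if not should_generate_llm_txt(app.config, docname):
            continue
        if is_content_gated(app.config, docname):
            continue

        content_data = extract_document_content(app.env, docname, settings)
        llm_txt_content = build_llm_txt_content(content_data)
        write_llm_txt_file(app, docname, llm_txt_content)
        all_content.append(content_data)

    if settings.get("generate_full_file"):
        _write_full_llm_txt(app, all_content, settings)
```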


Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/_extensions/llm_txt_output/__init__.py | 5/5 | Registers new Sphinx extension with config validation and build-finished hooks |
| docs/_extensions/llm_txt_output/config.py | 4/5 | Defines configuration schema with validation; generate_full_file setting missing from boolean validation |
| docs/conf.py | 5/5 | Enables llm_txt_output extension with base URL, content limits, and smart card handling |
| docs/_extensions/llm_txt_output/writer.py | 3/5 | Writes llm.txt files to build directory; contains duplicate get_output_path helper function |
| docs/_extensions/llm_txt_output/processor.py | 3/5 | Orchestrates document processing; uses bare except Exception clauses and inconsistent pattern matching |
| docs/_extensions/llm_txt_output/template_builder.py | 3/5 | Builds markdown-formatted output; uses build-time timestamp instead of document modification date |
| docs/_extensions/llm_txt_output/README.md | 4/5 | Documents extension usage and configuration; contains markdown syntax errors in code examples at lines 173, 202-203 |
| docs/_extensions/llm_txt_output/content_extractor.py | 3/5 | Extracts content from doctree; complex dual-strategy card extraction with simplified ../ URL handling |

Confidence score: 3/5

  • This PR adds valuable functionality but has several implementation concerns that could cause issues in production environments.
  • Score reflects: (1) bare exception handlers that may hide critical errors during builds, (2) duplicate code (get_output_path in writer.py), (3) timestamp metadata that uses build-time rather than source modification time, (4) oversimplified URL path traversal logic that strips all ../ references without tracking depth, (5) markdown syntax errors in documentation, and (6) missing validation for the generate_full_file configuration setting.
  • Pay close attention to content_extractor.py (lines 786-789 for URL handling, lines 830-848 for exception handling), template_builder.py (line 152 for timestamp logic), writer.py (duplicate function), and config.py (lines 94-109 for missing boolean validation).

Sequence Diagram

sequenceDiagram
    participant Sphinx
    participant Extension as llm_txt_output Extension
    participant Config as config.py
    participant Processor as processor.py
    participant Extractor as content_extractor.py
    participant Template as template_builder.py
    participant Writer as writer.py
    participant FileSystem

    Sphinx->>Extension: setup(app)
    Extension->>Sphinx: add_config_value("llm_txt_settings", defaults)
    Extension->>Sphinx: connect("config-inited", validate_config)
    Extension->>Sphinx: connect("build-finished", on_build_finished)
    
    Sphinx->>Config: validate_config(app, config)
    Config->>Config: _ensure_settings_dict()
    Config->>Config: apply_config_defaults()
    Config->>Config: _validate_core_settings()
    Config->>Config: _validate_content_limits()
    Config->>Config: _validate_boolean_settings()
    
    Note over Sphinx: Sphinx builds HTML documentation
    
    Sphinx->>Processor: on_build_finished(app, exception)
    
    loop For each document in app.env.all_docs
        Processor->>Processor: should_generate_llm_txt(config, docname)
        Processor->>Processor: is_content_gated(config, docname)
        
        alt Document should be processed
            Processor->>Extractor: extract_document_content(env, docname, settings)
            Extractor->>Extractor: _extract_title(env, docname)
            Extractor->>Extractor: _extract_grid_cards_from_markdown(env, docname, base_url)
            Extractor->>FileSystem: Read raw markdown source
            FileSystem-->>Extractor: Markdown content
            Extractor->>Extractor: _extract_text_content(doctree)
            Extractor->>Extractor: clean_text_for_llm_txt(raw_text)
            Extractor->>Extractor: _extract_summary(doctree, num_sentences)
            Extractor->>Extractor: _extract_overview(raw_text, max_length)
            Extractor->>Extractor: _extract_headings(doctree)
            Extractor->>Extractor: _extract_internal_links(doctree, max_links, base_url)
            Extractor->>Extractor: _extract_metadata(env, docname)
            Extractor-->>Processor: content_data dict
            
            Processor->>Template: build_llm_txt_content(content_data)
            Template->>Template: _build_metadata_section(metadata)
            Template->>Template: _normalize_spacing(content)
            Template-->>Processor: llm_txt_content string
            
            Processor->>Writer: write_llm_txt_file(app, docname, llm_txt_content)
            Writer->>Writer: Determine output path
            Writer->>FileSystem: Create directory (if needed)
            Writer->>FileSystem: Write .llm.txt file
            FileSystem-->>Writer: Success
            
            Processor->>Processor: Collect content for full file
        end
    end
    
    alt generate_full_file enabled
        Processor->>Processor: _write_full_llm_txt(app, all_content, settings)
        Processor->>FileSystem: Write llm-full.txt
        FileSystem-->>Processor: Success
    end
    
    Processor-->>Sphinx: Generation complete

8 files reviewed, 15 comments


Comment on lines +96 to +103
bool_settings = [
    "enabled",
    "verbose",
    "include_metadata",
    "include_headings",
    "include_related_links",
    "clean_myst_artifacts",
]


logic: generate_full_file is missing from the bool_settings validation list, even though it is defined as a boolean in the defaults (line 27)
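A minimal fix is to add the missing key to the validation list (sketch of the suggested change; everything else stays as-is):

```python
# config.py, inside _validate_boolean_settings (sketch)
bool_settings = [
    "enabled",
    "verbose",
    "include_metadata",
    "include_headings",
    "include_related_links",
    "clean_myst_artifacts",
    "generate_full_file",  # previously missing from boolean validation
]
```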

Comment on lines 43 to 48
if docname == "index":
return outdir / "index.llm.txt"
elif docname.endswith("/index"):
return outdir / docname[:-6] / "index.llm.txt"
else:
return outdir / f"{docname}.llm.txt"


style: get_output_path duplicates the exact logic from write_llm_txt_file. Extract this to reduce code duplication:

Suggested change
if docname == "index":
return outdir / "index.llm.txt"
elif docname.endswith("/index"):
return outdir / docname[:-6] / "index.llm.txt"
else:
return outdir / f"{docname}.llm.txt"
def write_llm_txt_file(app: Sphinx, docname: str, content: str) -> None:
"""Write llm.txt file to output directory."""
try:
txt_path = get_output_path(app, docname)

Comment on lines +35 to +36
except Exception:
    logger.exception(f"Failed to write llm.txt for {docname}")


style: Bare except Exception swallows all errors and can mask real failures. Catch specific exceptions (e.g., OSError) or at minimum re-raise after logging
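Assuming the write path only performs filesystem I/O, one way to narrow the handler is to catch `OSError` and let everything else propagate (a sketch, not the exact code in the PR):

```python
# writer.py (sketch of narrower error handling)
try:
    txt_path = get_output_path(app, docname)
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(content, encoding="utf-8")
except OSError:
    # Filesystem problems are logged without aborting the whole build
    logger.exception(f"Failed to write llm.txt for {docname}")
```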

has_content = True

# Add last updated timestamp
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")


logic: timestamp generated at build time will be identical across all pages, not reflecting actual document updates. Should this use a per-document last-modified date from the source file's mtime or Sphinx's environment instead?
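If per-document dates are preferred, a sketch using the source file's modification time (via Sphinx's `env.doc2path`) could look like this; the helper name and where it would be called from are assumptions:

```python
# template_builder.py (sketch, assuming the build environment is available here)
from datetime import datetime, timezone
from pathlib import Path


def _last_updated(env, docname: str) -> str:
    """Return the source file's last-modified date as YYYY-MM-DD (UTC)."""
    source_path = Path(env.doc2path(docname))
    mtime = source_path.stat().st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
```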

    generated_count += 1
    logger.debug(f"Generated llm.txt for {docname}")

except Exception:


style: Bare except Exception silently swallows all errors. Consider re-raising or logging at error level to avoid losing critical failures during doc generation.

Comment on lines +830 to +834
if content.startswith("---"):
end_marker = content.find("\n---\n", 3)
if end_marker != -1:
frontmatter_text = content[3:end_marker]
return yaml.safe_load(frontmatter_text)


logic: frontmatter extraction searches for '\n---\n' starting at position 3, but if document has Windows line endings '\r\n---\r\n', this will fail - normalize line endings first
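Normalizing line endings before searching for the closing delimiter avoids the CRLF case; a later commit in this PR describes exactly this fix, so the sketch below is only illustrative:

```python
# content_extractor.py, _extract_frontmatter (sketch)
content = content.replace("\r\n", "\n")  # handle Windows line endings
if content.startswith("---"):
    end_marker = content.find("\n---\n", 3)
    if end_marker != -1:
        return yaml.safe_load(content[3:end_marker])
```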

…ix URL resolution

- Add helper functions: _clean_card_title(), _truncate_description(), _resolve_url()
- Fix URL path traversal using urllib.parse.urljoin for proper relative path resolution
- Improve sentence boundary detection to avoid splitting on abbreviations (Dr., U.S., etc.)
- Normalize line endings in frontmatter extraction to handle Windows CRLF
- Refactor _extract_grid_cards_from_markdown to use shared helpers (reduce complexity)
- Refactor _extract_grid_cards to use shared helpers (reduce complexity)
- Remove duplicate title cleaning and description truncation logic throughout
- Remove deprecated _make_absolute_url function
- Remove noqa complexity suppressions after refactoring
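For the URL fix listed above, a plausible shape of the new helper built on `urllib.parse.urljoin` (the commit message names the function and the approach; the exact signature is an assumption):

```python
# content_extractor.py (sketch of the urljoin-based helper)
from urllib.parse import urljoin


def _resolve_url(base_url: str, docname: str, target: str) -> str:
    """Resolve a relative doc link (e.g. '../concepts/index.html') against the page's URL."""
    page_url = urljoin(base_url, f"{docname}.html")
    return urljoin(page_url, target)
```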

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review: the addition of docs/get-started/quickstart-tutorial.md, a comprehensive new tutorial document. The file introduces a unified getting-started guide for NeMo Curator that bridges foundational concepts with hands-on examples. It follows a pedagogical structure: defining core abstractions (Pipeline, Stage, Task, Executor), walking through installation, and presenting a complete three-stage sentiment analysis pipeline demonstrating resource allocation, batch processing, and distributed execution. The tutorial integrates with the repository's llm.txt documentation extension by including rich frontmatter metadata (tags, personas, difficulty, content type) and MyST markdown features (grid cards, admonitions) that structure the content for both human readers and AI consumption. The file acts as an entry point linking to four modality-specific quickstart guides (text, image, video, audio) and demonstrates the canonical Task→Stage→Pipeline→Executor pattern that underlies all workflows in the codebase.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/get-started/quickstart-tutorial.md | 5/5 | New comprehensive tutorial document covering installation, core concepts, and a working sentiment analysis pipeline example |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk.
  • Score reflects purely additive documentation content with no code changes, no breaking changes, and no security or logic issues detected. The tutorial follows established patterns from the repository's documentation structure and MyST markdown conventions.
  • No files require special attention; this is a single new documentation file with no dependencies on existing code.

2 files reviewed, no comments


- Add .ruff.toml to suppress complexity warnings for parsing functions
- Replace open() with Path.read_text()/write_text() for better pathlib integration
- Fix docstring formatting (D212, D413, D205, D401)
- Convert one complex logging f-string to lazy % formatting
- Fix unnecessary elif after return (RET505)
- Auto-fix trailing commas and blank lines

Remaining G004 warnings are in config validation code and can be addressed separately.
Complexity warnings (C901, PLR0912, PLR0915) are inherent to MyST markdown parsing logic.
The negative lookbehind pattern with alternation of variable-width strings
caused 're.error: look-behind requires fixed-width pattern' during build.

Changed approach to:
1. Split with simple regex first
2. Post-process to merge sentences split after abbreviations

This avoids variable-width lookbehind limitations while still handling
common abbreviations (Dr., U.S., etc.) correctly.

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The recent commits focus exclusively on code style improvements across the llm_txt_output Sphinx extension with no functional changes. Key refactorings include: (1) standardizing docstrings to single-line opening format throughout __init__.py, template_builder.py, and content_extractor.py; (2) modernizing file I/O by replacing open().read() with Path.read_text() in writer.py; (3) extracting a _clean_card_title helper to eliminate repeated regex patterns in content_extractor.py; (4) simplifying if-elif chains into early returns in writer.py; (5) adding a local .ruff.toml configuration to suppress complexity warnings for the extension's parsing logic; (6) minor formatting fixes (trailing commas, logging syntax). The extension itself generates AI-consumable .llm.txt files from Sphinx documentation by hooking into the build-finished event, extracting content from doctrees and markdown sources, and producing both per-page and aggregated outputs with metadata and table of contents.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/_extensions/llm_txt_output/__init__.py | 5/5 | Reformatted module and setup function docstrings to single-line style; no logic changes |
| docs/_extensions/llm_txt_output/template_builder.py | 5/5 | Updated _normalize_spacing docstring format for consistency; all build logic unchanged |
| docs/_extensions/llm_txt_output/.ruff.toml | 4/5 | Added local linting config to suppress complexity/import warnings; contains redundant per-file ignores |
| docs/_extensions/llm_txt_output/writer.py | 4/5 | Extracted get_output_path helper but left duplicate path resolution in write_llm_txt_file; simplified if-elif chains and file I/O |
| docs/_extensions/llm_txt_output/processor.py | 4/5 | Minor docstring and formatting updates; core orchestration logic unchanged; bare except Exception blocks remain from previous review |
| docs/_extensions/llm_txt_output/content_extractor.py | 4/5 | Consolidated title cleaning into _clean_card_title helper; refactored docstrings and file I/O; improved sentence split regex; trailing comma and code duplication issues remain |

Confidence score: 4/5

  • This PR is safe to merge with minor technical debt that can be addressed in follow-up work
  • Score reflects solid code quality improvements (docstring consistency, modern idioms, reduced duplication) but incomplete refactoring in writer.py (extracted helper not used), redundant linting config in .ruff.toml, and carryover issues from previous reviews (bare exception handlers, duplicate path logic)
  • Pay close attention to docs/_extensions/llm_txt_output/writer.py and docs/_extensions/llm_txt_output/.ruff.toml for the incomplete refactoring and redundant configuration respectively

6 files reviewed, 4 comments


Comment on lines +14 to +16
[lint.per-file-ignores]
# Allow complex functions in content_extractor.py for parsing logic
"content_extractor.py" = ["C901", "PLR0912", "PLR0915", "PLC0415"]


style: Per-file ignores at line 16 duplicate the global ignores from lines 7-12 for content_extractor.py. Remove this redundant section since the global ignores already apply

    return False


def _write_full_llm_txt(app: Sphinx, all_content: list[dict[str, str]], _settings: dict) -> None:


style: the _settings parameter is unused - remove it, or keep the leading underscore only if it is intentionally reserved for future extensibility

Comment on lines +493 to +495
description = re.sub(
    r"\{\{[^}]+\}\}", "NeMo Evaluator", description,
)  # Replace {{ product_name_short }}


style: trailing comma after string argument on line 494 violates PEP 8 for single-element function calls - remove the comma:

Suggested change
    description = re.sub(
        r"\{\{[^}]+\}\}", "NeMo Evaluator", description,
    )  # Replace {{ product_name_short }}
    description = re.sub(
        r"\{\{[^}]+\}\}", "NeMo Evaluator", description
    )  # Replace {{ product_name_short }}

Comment on lines +829 to +832
import yaml
from pathlib import Path

content = Path(file_path).read_text(encoding="utf-8")


style: import yaml and Path at module level instead of inside _extract_frontmatter - same pattern needed in line 814 where str(source_path) suggests Path handling inconsistency
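The fix is mechanical; a sketch of hoisting the imports to the top of the module (nothing else changes):

```python
# content_extractor.py, module level (sketch)
from pathlib import Path

import yaml

# ... inside _extract_frontmatter, the body keeps using the hoisted names:
# content = Path(file_path).read_text(encoding="utf-8")
```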


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The recent updates address previous feedback on content_extractor.py by refactoring exception handling, removing redundant configuration, improving code organization, and fixing backtick formatting issues in directive examples. Specific improvements include moving imports to module-level, consolidating duplicate logic into helper functions, removing unused parameters, and tightening exception handling to catch specific error types rather than broad Exception classes. The changes align with Python best practices (PEP 8, Ruff linting standards) and make the code more maintainable for future contributors working with Sphinx documentation extensions.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/_extensions/llm_txt_output/content_extractor.py | 4/5 | Refactored exception handling to catch specific exceptions, moved imports to module level, removed unused parameters, consolidated duplicate title/description logic, and fixed directive formatting examples |
| docs/_extensions/llm_txt_output/.ruff.toml | 5/5 | Removed redundant per-file ignore configuration that duplicated global ignores for content_extractor.py |

Confidence score: 4/5

  • This PR is safe to merge with minimal risk; the changes are primarily code quality improvements that address previous review feedback
  • Score reflects solid refactoring that improves maintainability without introducing new functionality or breaking changes; one point deducted because exception handling could still be more granular (e.g., distinguishing between YAML parsing errors and file I/O errors) and the sentence splitting regex limitation noted in previous reviews remains unaddressed
  • Pay close attention to docs/_extensions/llm_txt_output/content_extractor.py to verify that specific exception types now caught are comprehensive enough for all failure modes in production documentation builds

1 file reviewed, 3 comments


Comment on lines +302 to +326
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", para)

# Common abbreviations that shouldn't trigger sentence splits
abbrevs = {
"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
"vs.", "etc.", "i.e.", "e.g.", "Fig.", "Vol.", "No.",
"Ph.D.", "U.S.", "U.K.", "Inc.", "Ltd.", "Corp.",
}

# Merge sentences that were split after abbreviations
merged_sentences = []
i = 0
while i < len(sentences):
current = sentences[i]
# Check if current ends with an abbreviation
if i + 1 < len(sentences):
ends_with_abbrev = any(current.rstrip().endswith(abbr) for abbr in abbrevs)
if ends_with_abbrev:
# Merge with next sentence
current = current + " " + sentences[i + 1]
i += 1
merged_sentences.append(current)
i += 1

sentences = merged_sentences


logic: sentences on line 302 could be empty list if para has no sentence boundaries matching the pattern - this would cause merged_sentences loop to append empty strings. Add guard: if not sentences: result.append(para); continue
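The suggested guard would sit directly after the split; a sketch using the names from the code quoted above (where `result` is the accumulator named in the comment):

```python
# content_extractor.py (sketch of the suggested guard)
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", para)
if not sentences:
    result.append(para)  # keep the paragraph as-is when nothing was split
    continue
```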

Comment on lines +305 to +309
abbrevs = {
    "Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
    "vs.", "etc.", "i.e.", "e.g.", "Fig.", "Vol.", "No.",
    "Ph.D.", "U.S.", "U.K.", "Inc.", "Ltd.", "Corp.",
}


style: abbreviation set is missing common tech/doc abbreviations that will appear in documentation: 'API', 'URL', 'GPU', 'CPU', 'RAM', 'KB', 'MB', 'GB', 'TB', 'MHz', 'GHz', 'NVIDIA', 'AWS', etc.

current = sentences[i]
# Check if current ends with an abbreviation
if i + 1 < len(sentences):
    ends_with_abbrev = any(current.rstrip().endswith(abbr) for abbr in abbrevs)


style: ends_with_abbrev check uses exact match on rstrip() but sentences may have multiple trailing spaces or punctuation - use current.rstrip().rstrip('.,!?').endswith(abbr) for robustness
