
Conversation

Contributor

@lbliii lbliii commented Oct 3, 2025

This PR adds a new Sphinx extension that makes the documentation easier for AI tools to consume by emitting a per-page .llm.txt file for every document plus a full-site llm-full.txt file as part of the build output.
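For reviewers, here is a minimal sketch of how the extension is wired into `docs/conf.py`. The key names shown (`enabled`, `generate_full_file`, `include_metadata`, and so on) come from the extension's defaults discussed later in review; the `base_url` value and the exact names of any length-limit options are assumptions, not the literal diff in this PR.

```python
# docs/conf.py (illustrative sketch, not the exact diff)
extensions = [
    # ... existing extensions ...
    "llm_txt_output",  # emits per-page .llm.txt files plus an aggregated llm-full.txt
]

llm_txt_settings = {
    "enabled": True,
    "generate_full_file": True,   # also write the aggregated llm-full.txt
    "base_url": "https://docs.nvidia.com/nemo/curator/latest/",  # assumed value
    "include_metadata": True,
    "include_headings": True,
    "include_related_links": True,
    "clean_myst_artifacts": True,  # strip MyST directive residue from extracted text
    "verbose": False,
}
```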

Example Output

Location: /about/concepts/text/data-curation-pipeline.llm.txt

# Text Data Curation Pipeline

> This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

## Overview

Text Data Curation Pipeline This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

Architecture Overview The following diagram provides a high-level outline of NeMo Curator’s text curation architecture: Pipeline Stages NeMo Curator’s text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training: 1.

Data Sources Multiple input sources provide the foundation for text curation: Cloud storage : (S3, GCS, Azure) Local workstation : files (JSONL, Parquet) 2.

Data Acquisition & Processing Raw data is downloaded, extracted, and converted into standardized formats: Download & Extraction : Retrieve and process remote data sources Cleaning & Pre-processing : Convert formats and normalize text DocumentBatch Creation : Standardize data into NeMo Curator’s core data structure 3.

Quality Assessment & Filtering Multiple filtering stages ensure data quality: Heuristic Quality Filtering : Rule-based filters for basic quality checks Model-based Quality Filtering : Classification models trained to identify high vs. low quality text 4.

Deduplication Remove duplicate and near-duplicate content: Exact Deduplication : Remove identical documents using MD5 hashing Fuzzy Deduplication : Remove near-duplicates using MinHash and LSH similarity Semantic Deduplication : Remove semantically similar content using embeddings 5.

Final Preparation Prepare the curated dataset for training: Format Standardization : Ensure consistent output format Infrastructure Foundation The entire pipeline runs on a robust, scalable infrastructure: Ray : Distributed computing framework for parallelization RAPIDS : GPU-accelerated data processing (cuDF, cuGraph, cuML) Flexible Deployment : CPU and GPU acceleration support Key Components The pipeline leverages several core component types: Data Loading Core concepts for loading and managing text datasets from local files about-concepts-text-data-loading Data Acquisition Components for downloading and extracting data from remote sources about-concepts-text-data-acquisition Data Processing Concepts for filtering, deduplication, and classification about-concepts-text-data-processing Processing Modes The pipeline supports different processing approaches: GPU Acceleration : Leverage NVIDIA GPUs for: High-throughput data processing ML model inference for classification Embedding generation for semantic operations CPU Processing : Scale across multiple CPU cores for: Text parsing and cleaning Rule-based filtering Large-scale data transformations Hybrid Workflows : Combine CPU and GPU processing for optimal performance based on the specific operation.

Scalability & Deployment The architecture scales from single machines to large clusters: Single Node : Process datasets on laptops or workstations Multi-Node : Distribute processing across cluster resources Cloud Native : Deploy on cloud platforms HPC Integration : Run on HPC supercomputing clusters For hands-on experience, see the Text Curation Getting Started Guide

## Available Options

- **[Data Loading](https://docs.nvidia.com/nemo/curator/latest/about-concepts-text-data-loading.html)**
  Core concepts for loading and managing text datasets from local files
- **[Data Acquisition](https://docs.nvidia.com/nemo/curator/latest/about-concepts-text-data-acquisition.html)**
  Components for downloading and extracting data from remote sources
- **[Data Processing](https://docs.nvidia.com/nemo/curator/latest/about-concepts-text-data-processing.html)**
  Concepts for filtering, deduplication, and classification

## Key Sections

- **Text Data Curation Pipeline**: This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.
- **Architecture Overview**: The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:
- **Pipeline Stages**: NeMo Curator’s text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:
- **1. Data Sources**: Multiple input sources provide the foundation for text curation:
- **2. Data Acquisition & Processing**: Raw data is downloaded, extracted, and converted into standardized formats:
- **3. Quality Assessment & Filtering**: Multiple filtering stages ensure data quality:
- **4. Deduplication**: Remove duplicate and near-duplicate content:
- **5. Final Preparation**: Prepare the curated dataset for training:
- **Infrastructure Foundation**: The entire pipeline runs on a robust, scalable infrastructure:
- **Key Components**: The pipeline leverages several core component types:
- **Processing Modes**: The pipeline supports different processing approaches:
- **Scalability & Deployment**: The architecture scales from single machines to large clusters:

## Metadata

- **Description**: Comprehensive overview of NeMo Curator's text curation pipeline architecture including data acquisition and processing
- **Content Type**: concept
- **Difficulty**: beginner
- **Tags**: pipeline, architecture, text-curation, distributed, gpu-accelerated, overview
- **Categories**: concepts-architecture
- **Target Audience**: data-scientist-focused, mle-focused
- **Last Updated**: 2025-10-15

@lbliii lbliii self-assigned this Oct 3, 2025

copy-pr-bot bot commented Oct 3, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@lbliii lbliii requested a review from Copilot October 15, 2025 18:46
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds a new Sphinx extension for generating LLM-friendly documentation output in the llm.txt format. The extension creates parallel .llm.txt files for each documentation page and an aggregated llm-full.txt file containing all documentation. The goal is to make documentation more accessible to Large Language Models by providing clean, structured text files optimized for AI consumption.

Key changes:

  • Complete implementation of the llm_txt_output Sphinx extension with modular architecture
  • Smart MyST directive parsing to extract grid cards and clean content artifacts
  • Integration with existing Sphinx configuration and content gating systems
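The registration itself follows the standard Sphinx extension pattern. A rough sketch of `__init__.py` is below; the hook names come from the sequence diagram later in this thread, while the defaults object and the returned metadata are assumptions:

```python
# docs/_extensions/llm_txt_output/__init__.py (sketch)
from sphinx.application import Sphinx

from .config import DEFAULT_SETTINGS, validate_config  # DEFAULT_SETTINGS name is assumed
from .processor import on_build_finished


def setup(app: Sphinx) -> dict:
    # Expose the settings dict so projects can override it from conf.py
    app.add_config_value("llm_txt_settings", DEFAULT_SETTINGS, "env")

    # Normalize and validate settings once configuration is loaded
    app.connect("config-inited", validate_config)

    # Generate the .llm.txt outputs after the HTML build completes
    app.connect("build-finished", on_build_finished)

    return {"version": "0.1", "parallel_read_safe": True}  # values assumed
```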

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/conf.py | Added extension registration and configuration settings for llm.txt output |
| docs/_extensions/llm_txt_output/__init__.py | Main extension entry point with Sphinx setup configuration |
| docs/_extensions/llm_txt_output/config.py | Configuration validation and default settings management |
| docs/_extensions/llm_txt_output/processor.py | Build orchestration and document processing pipeline |
| docs/_extensions/llm_txt_output/content_extractor.py | Core content extraction logic with MyST cleaning and grid card parsing |
| docs/_extensions/llm_txt_output/template_builder.py | LLM.txt format template generation and content structuring |
| docs/_extensions/llm_txt_output/writer.py | File writing operations for output generation |
| docs/_extensions/llm_txt_output/README.md | Comprehensive documentation for the extension |
Comments suppressed due to low confidence (1)

docs/_extensions/llm_txt_output/processor.py:1

  • Bare exception handling with a broad catch is generally discouraged. Consider catching specific exceptions like OSError, IOError, or yaml.YAMLError for better error handling and debugging.
"""Document processing and build orchestration for llm.txt output."""

@lbliii lbliii marked this pull request as ready for review October 15, 2025 19:29
lbliii and others added 25 commits October 22, 2025 13:17
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: James Bourbeau <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Context manager support for RayClient

Signed-off-by: James Bourbeau <[email protected]>

* Bump timeout

Signed-off-by: James Bourbeau <[email protected]>

---------

Signed-off-by: James Bourbeau <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Ssofja <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Increase timeout for CPU tests

Signed-off-by: Sarah Yurick <[email protected]>

* split stages tests into separate workflows

Signed-off-by: Sarah Yurick <[email protected]>

* fix slash issue

Signed-off-by: Sarah Yurick <[email protected]>

* try quotes

Signed-off-by: Sarah Yurick <[email protected]>

* another attempt..

Signed-off-by: Sarah Yurick <[email protected]>

* ..

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* readme updates

Signed-off-by: Lawrence Lane <[email protected]>

* readme overhaul

Signed-off-by: Lawrence Lane <[email protected]>

* marketing callout

Signed-off-by: Lawrence Lane <[email protected]>

* added >

Signed-off-by: Lawrence Lane <[email protected]>

* dask > ray

Signed-off-by: Lawrence Lane <[email protected]>

* Update README.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

* linkfixes; image classifier changed to filters; remove old pages

Signed-off-by: Lawrence Lane <[email protected]>

* link swap

Signed-off-by: Lawrence Lane <[email protected]>

* fixes

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Fix bug for running filters on empty batches

Signed-off-by: Sarah Yurick <[email protected]>

* is_actor_stage for models and tokenizers

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* docs: initial loading of migration guide and faq

Signed-off-by: Lawrence Lane <[email protected]>

* minor changes

Signed-off-by: Lawrence Lane <[email protected]>

* crosslinks

Signed-off-by: Lawrence Lane <[email protected]>

* links to migration guides

Signed-off-by: Lawrence Lane <[email protected]>

* wording

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/about/release-notes/index.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

* remove parenthetical

Signed-off-by: Lawrence Lane <[email protected]>

* gh link

Signed-off-by: Lawrence Lane <[email protected]>

* cleanup

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* text curation updates

Signed-off-by: Lawrence Lane <[email protected]>

* concepts

Signed-off-by: Lawrence Lane <[email protected]>

* remove synthetic docs not for this release

Signed-off-by: Lawrence Lane <[email protected]>

* updates

Signed-off-by: Lawrence Lane <[email protected]>

* text concepts and getting started changes

Signed-off-by: Lawrence Lane <[email protected]>

* links, concepts

Signed-off-by: Lawrence Lane <[email protected]>

* crosslinks

Signed-off-by: Lawrence Lane <[email protected]>

* quality assessment updates

Signed-off-by: Lawrence Lane <[email protected]>

* more cleanup

Signed-off-by: Lawrence Lane <[email protected]>

* semdedup

Signed-off-by: Lawrence Lane <[email protected]>

* example import cleanup

Signed-off-by: Lawrence Lane <[email protected]>

* concepts

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/about/concepts/text/data-acquisition-concepts.md

Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: L.B. <[email protected]>

* feedback batch 1

Signed-off-by: Lawrence Lane <[email protected]>

* feedback batch 2

Signed-off-by: Lawrence Lane <[email protected]>

* file_paths="/path/to/jsonl_directory",

Signed-off-by: Lawrence Lane <[email protected]>

* revert removal of xenna for common crawl executors

Signed-off-by: Lawrence Lane <[email protected]>

* quickstart installation steps

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/about/concepts/text/data-acquisition-concepts.md

Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>

* data loading concepts updates / simplification

Signed-off-by: Lawrence Lane <[email protected]>

* data processing feedback

Signed-off-by: Lawrence Lane <[email protected]>

* read-existing pg updates

Signed-off-by: Lawrence Lane <[email protected]>

* add-id updates

Signed-off-by: Lawrence Lane <[email protected]>

* dedup updates

Signed-off-by: Lawrence Lane <[email protected]>

* feedback

Signed-off-by: Lawrence Lane <[email protected]>

* citation file

Signed-off-by: Lawrence Lane <[email protected]>

* fix

Signed-off-by: Lawrence Lane <[email protected]>

* updates

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
…Mo#1154)

* add enhanced captioning to rn; bump versions on json files

Signed-off-by: Lawrence Lane <[email protected]>

* remove ds store

Signed-off-by: Lawrence Lane <[email protected]>

* feedback

Signed-off-by: Lawrence Lane <[email protected]>

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: L.B. <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* docs(extensions): add rich metadata SEO extension with frontmatter integration

Add new Sphinx extension that injects SEO-optimized metadata into HTML head from frontmatter:

Core Features:
- Extract frontmatter (description, tags, personas, difficulty, content_type, modality)
- Generate standard meta tags (description, keywords, audience)
- Generate Open Graph tags for social sharing (Facebook, LinkedIn)
- Generate Twitter Card tags for enhanced previews
- Generate JSON-LD structured data (schema.org) for search engines
- Support product versioning via cascade.product fields

Components:
- rich_metadata/__init__.py: Main extension with config-inited and html-page-context hooks
- rich_metadata/templates/layout.html: Template override for metadata injection
- rich_metadata/README.md: Technical overview and features
- rich_metadata/USAGE.md: Complete usage guide with examples
- rich_metadata/SUMMARY.md: Quick reference guide
- rich_metadata/IMPLEMENTATION.md: Architecture and implementation details
- rich_metadata/verify_metadata.py: Automated verification script

Template handling follows search_assets pattern - templates live within extension folder
and are automatically added to Sphinx template search path via config-inited hook.

Extension enabled in conf.py and ready for use with existing frontmatter.

* docs(rich_metadata): add SEO metadata extension with organized output and enhanced titles

- Extract frontmatter from markdown files and inject SEO metadata
- Support standard meta tags (description, keywords)
- Support Open Graph tags (og:description, og:title, og:type, og:url)
- Support Twitter Card tags (twitter:description, twitter:title, twitter:card)
- Support custom content metadata (audience, difficulty, modality, content_type)
- Generate JSON-LD structured data (schema.org Article/TechArticle)
- Organize metadata with HTML comments for readability
- Enhanced page titles: 'Page: Section - Site | NVIDIA'
- Template override for clean title rendering
- Suppress warnings for generated pages (genindex, search, etc.)
- Add frontmatter to homepage (index.md)
- Fix invalid Jinja2 meta tag in search.html template

* docs(rich_metadata): fix duplicates and use enhanced titles for social sharing

- Remove duplicate metatags rendering (parent theme already renders it)
- Use enhanced structured title format for og:title and twitter:title
- All titles now consistent: 'Page: Section - Site | NVIDIA'
- Improves social sharing context on Facebook, Twitter, LinkedIn

* rich metadata

Signed-off-by: Lawrence Lane <[email protected]>

* docs: rich metadata for seo

Signed-off-by: Lawrence Lane <[email protected]>

* Update docs/_extensions/rich_metadata/__init__.py

Co-authored-by: Copilot <[email protected]>
Signed-off-by: L.B. <[email protected]>

* docs(extensions): refactor build_meta_tags to reduce complexity via helper functions _add_basic_fields, _add_opengraph_fields, _add_twitter_fields, _add_custom_fields

* docs(rich_metadata): fix linter errors in extension code

- Replace aliased errors with OSError in frontmatter extraction
- Refactor _add_custom_fields into smaller functions to reduce complexity
- Remove unnecessary variable assignment before return in build_meta_tags
- Break down verify_html_file into helper functions (_display_meta_tags, _display_json_ld, _display_no_metadata_help)
- Add return type annotation to main() function
- Remove trailing whitespace from blank lines throughout verify_metadata.py

* docs(rich_metadata): fix quote style to use double quotes per Ruff Q000

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* create llama nemotron tutorial script and readme

Signed-off-by: Sarah Yurick <[email protected]>

* enable all heuristic filters

Signed-off-by: Sarah Yurick <[email protected]>

* add tokenizer steps

Signed-off-by: Sarah Yurick <[email protected]>

* add noqa

Signed-off-by: Sarah Yurick <[email protected]>

* add full tutorial

Signed-off-by: Sarah Yurick <[email protected]>

* fix empty filter test

Signed-off-by: Sarah Yurick <[email protected]>

* remove use_fast=True

Signed-off-by: Sarah Yurick <[email protected]>

* Update tutorials/text/llama-nemotron-data-curation/main.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update tutorials/text/llama-nemotron-data-curation/main.py

Signed-off-by: Sarah Yurick <[email protected]>

* add comment, refresh files

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

* typos

Signed-off-by: Sarah Yurick <[email protected]>

* add praateek's suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* ruff

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* Log more info on test_split_parquet_file_by_size failure

Signed-off-by: James Bourbeau <[email protected]>

* Sort files

Signed-off-by: James Bourbeau <[email protected]>

---------

Signed-off-by: James Bourbeau <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: James Bourbeau <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
sarahyurick and others added 4 commits October 22, 2025 13:17
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
* link to docs in classifier tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* add links to semdedup tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* download and extract links

Signed-off-by: Sarah Yurick <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
@lbliii lbliii force-pushed the llane/llm-txt-extension branch from d7156d4 to 3859d5f on October 22, 2025 17:18

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR introduces a new Sphinx extension (llm_txt_output) that generates AI-consumable documentation outputs in two formats: per-page .llm.txt files and an aggregated llm-full.txt file. The extension hooks into Sphinx's build lifecycle, extracting structured content (titles, summaries, headings, grid cards, links, metadata) from documentation sources after HTML generation completes. It integrates with NeMo Curator's existing MyST markdown processing and content gating infrastructure by respecting exclude_patterns from both Sphinx's global config and the extension's own settings. The implementation consists of five coordinated modules: configuration management (config.py), document tree content extraction (content_extractor.py), markdown template building (template_builder.py), file writing (writer.py), and orchestration (processor.py). The extension is enabled in docs/conf.py with a base URL for absolute link generation, content length limits, and smart grid card handling. This fits into the project's broader documentation strategy by providing machine-readable outputs alongside human-readable HTML, enabling LLM assistants to consume documentation more effectively.
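The flow described above boils down to a single pass over the built documents. The sketch below mirrors the sequence diagram further down; the function names are taken from that diagram, and the control flow is a simplified assumption rather than the exact implementation:

```python
# docs/_extensions/llm_txt_output/processor.py (simplified sketch)
def on_build_finished(app, exception):
    if exception is not None:  # assumed: skip output generation for failed builds
        return

    settings = app.config.llm_txt_settings
    all_content = []

    for docname in app.env.all_docs:
        # Honor exclude_patterns and the content-gating rules
        if not should_generate_llm_txt(app.config, docname):
            continue
        if is_content_gated(app.config, docname):
            continue

        content_data = extract_document_content(app.env, docname, settings)
        llm_txt_content = build_llm_txt_content(content_data)
        write_llm_txt_file(app, docname, llm_txt_content)
        all_content.append(content_data)

    if settings.get("generate_full_file"):
        _write_full_llm_txt(app, all_content, settings)
```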


Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/_extensions/llm_txt_output/__init__.py | 5/5 | Registers new Sphinx extension with config validation and build-finished hooks |
| docs/_extensions/llm_txt_output/config.py | 4/5 | Defines configuration schema with validation; generate_full_file setting missing from boolean validation |
| docs/conf.py | 5/5 | Enables llm_txt_output extension with base URL, content limits, and smart card handling |
| docs/_extensions/llm_txt_output/writer.py | 3/5 | Writes llm.txt files to build directory; contains duplicate get_output_path helper function |
| docs/_extensions/llm_txt_output/processor.py | 3/5 | Orchestrates document processing; uses bare except Exception clauses and inconsistent pattern matching |
| docs/_extensions/llm_txt_output/template_builder.py | 3/5 | Builds markdown-formatted output; uses build-time timestamp instead of document modification date |
| docs/_extensions/llm_txt_output/README.md | 4/5 | Documents extension usage and configuration; contains markdown syntax errors in code examples at lines 173, 202-203 |
| docs/_extensions/llm_txt_output/content_extractor.py | 3/5 | Extracts content from doctree; complex dual-strategy card extraction with simplified ../ URL handling |

Confidence score: 3/5

  • This PR adds valuable functionality but has several implementation concerns that could cause issues in production environments.
  • Score reflects: (1) bare exception handlers that may hide critical errors during builds, (2) duplicate code (get_output_path in writer.py), (3) timestamp metadata that uses build-time rather than source modification time, (4) oversimplified URL path traversal logic that strips all ../ references without tracking depth, (5) markdown syntax errors in documentation, and (6) missing validation for the generate_full_file configuration setting.
  • Pay close attention to content_extractor.py (lines 786-789 for URL handling, lines 830-848 for exception handling), template_builder.py (line 152 for timestamp logic), writer.py (duplicate function), and config.py (lines 94-109 for missing boolean validation).

Sequence Diagram

sequenceDiagram
    participant Sphinx
    participant Extension as llm_txt_output Extension
    participant Config as config.py
    participant Processor as processor.py
    participant Extractor as content_extractor.py
    participant Template as template_builder.py
    participant Writer as writer.py
    participant FileSystem

    Sphinx->>Extension: setup(app)
    Extension->>Sphinx: add_config_value("llm_txt_settings", defaults)
    Extension->>Sphinx: connect("config-inited", validate_config)
    Extension->>Sphinx: connect("build-finished", on_build_finished)
    
    Sphinx->>Config: validate_config(app, config)
    Config->>Config: _ensure_settings_dict()
    Config->>Config: apply_config_defaults()
    Config->>Config: _validate_core_settings()
    Config->>Config: _validate_content_limits()
    Config->>Config: _validate_boolean_settings()
    
    Note over Sphinx: Sphinx builds HTML documentation
    
    Sphinx->>Processor: on_build_finished(app, exception)
    
    loop For each document in app.env.all_docs
        Processor->>Processor: should_generate_llm_txt(config, docname)
        Processor->>Processor: is_content_gated(config, docname)
        
        alt Document should be processed
            Processor->>Extractor: extract_document_content(env, docname, settings)
            Extractor->>Extractor: _extract_title(env, docname)
            Extractor->>Extractor: _extract_grid_cards_from_markdown(env, docname, base_url)
            Extractor->>FileSystem: Read raw markdown source
            FileSystem-->>Extractor: Markdown content
            Extractor->>Extractor: _extract_text_content(doctree)
            Extractor->>Extractor: clean_text_for_llm_txt(raw_text)
            Extractor->>Extractor: _extract_summary(doctree, num_sentences)
            Extractor->>Extractor: _extract_overview(raw_text, max_length)
            Extractor->>Extractor: _extract_headings(doctree)
            Extractor->>Extractor: _extract_internal_links(doctree, max_links, base_url)
            Extractor->>Extractor: _extract_metadata(env, docname)
            Extractor-->>Processor: content_data dict
            
            Processor->>Template: build_llm_txt_content(content_data)
            Template->>Template: _build_metadata_section(metadata)
            Template->>Template: _normalize_spacing(content)
            Template-->>Processor: llm_txt_content string
            
            Processor->>Writer: write_llm_txt_file(app, docname, llm_txt_content)
            Writer->>Writer: Determine output path
            Writer->>FileSystem: Create directory (if needed)
            Writer->>FileSystem: Write .llm.txt file
            FileSystem-->>Writer: Success
            
            Processor->>Processor: Collect content for full file
        end
    end
    
    alt generate_full_file enabled
        Processor->>Processor: _write_full_llm_txt(app, all_content, settings)
        Processor->>FileSystem: Write llm-full.txt
        FileSystem-->>Processor: Success
    end
    
    Processor-->>Sphinx: Generation complete

8 files reviewed, 15 comments


Comment on lines +96 to +103
bool_settings = [
    "enabled",
    "verbose",
    "include_metadata",
    "include_headings",
    "include_related_links",
    "clean_myst_artifacts",
]


logic: generate_full_file is missing from the bool_settings validation list, even though it is defined as a boolean in the defaults (line 27)
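A minimal fix is to add the missing key to the validation list (sketch of the suggested change; everything else stays as-is):

```python
# config.py, inside _validate_boolean_settings (sketch)
bool_settings = [
    "enabled",
    "verbose",
    "include_metadata",
    "include_headings",
    "include_related_links",
    "clean_myst_artifacts",
    "generate_full_file",  # previously missing from boolean validation
]
```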

Comment on lines 43 to 48
if docname == "index":
return outdir / "index.llm.txt"
elif docname.endswith("/index"):
return outdir / docname[:-6] / "index.llm.txt"
else:
return outdir / f"{docname}.llm.txt"


style: get_output_path duplicates the exact logic from write_llm_txt_file. Extract this to reduce code duplication:

Suggested change
if docname == "index":
return outdir / "index.llm.txt"
elif docname.endswith("/index"):
return outdir / docname[:-6] / "index.llm.txt"
else:
return outdir / f"{docname}.llm.txt"
def write_llm_txt_file(app: Sphinx, docname: str, content: str) -> None:
"""Write llm.txt file to output directory."""
try:
txt_path = get_output_path(app, docname)

Comment on lines +35 to +36
except Exception:
    logger.exception(f"Failed to write llm.txt for {docname}")


style: Bare except Exception swallows all errors and can mask real failures. Catch specific exceptions (e.g., OSError) or at minimum re-raise after logging
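Assuming the write path only performs filesystem I/O, one way to narrow the handler is to catch `OSError` and let everything else propagate (a sketch, not the exact code in the PR):

```python
# writer.py (sketch of narrower error handling)
try:
    txt_path = get_output_path(app, docname)
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(content, encoding="utf-8")
except OSError:
    # Filesystem problems are logged without aborting the whole build
    logger.exception(f"Failed to write llm.txt for {docname}")
```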

has_content = True

# Add last updated timestamp
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")


logic: timestamp generated at build time will be identical across all pages, not reflecting actual document updates. Should this use a per-document last-modified date from the source file's mtime or Sphinx's environment instead?
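If per-document dates are preferred, a sketch using the source file's modification time (via Sphinx's `env.doc2path`) could look like this; the helper name and where it would be called from are assumptions:

```python
# template_builder.py (sketch, assuming the build environment is available here)
from datetime import datetime, timezone
from pathlib import Path


def _last_updated(env, docname: str) -> str:
    """Return the source file's last-modified date as YYYY-MM-DD (UTC)."""
    source_path = Path(env.doc2path(docname))
    mtime = source_path.stat().st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
```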

    generated_count += 1
    logger.debug(f"Generated llm.txt for {docname}")

except Exception:


style: Bare except Exception silently swallows all errors. Consider re-raising or logging at error level to avoid losing critical failures during doc generation.

Comment on lines +830 to +834
if content.startswith("---"):
end_marker = content.find("\n---\n", 3)
if end_marker != -1:
frontmatter_text = content[3:end_marker]
return yaml.safe_load(frontmatter_text)


logic: frontmatter extraction searches for '\n---\n' starting at position 3, but if document has Windows line endings '\r\n---\r\n', this will fail - normalize line endings first
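Normalizing line endings before searching for the closing delimiter avoids the CRLF case; a later commit in this PR describes exactly this fix, so the sketch below is only illustrative:

```python
# content_extractor.py, _extract_frontmatter (sketch)
content = content.replace("\r\n", "\n")  # handle Windows line endings
if content.startswith("---"):
    end_marker = content.find("\n---\n", 3)
    if end_marker != -1:
        return yaml.safe_load(content[3:end_marker])
```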

…ix URL resolution

- Add helper functions: _clean_card_title(), _truncate_description(), _resolve_url()
- Fix URL path traversal using urllib.parse.urljoin for proper relative path resolution
- Improve sentence boundary detection to avoid splitting on abbreviations (Dr., U.S., etc.)
- Normalize line endings in frontmatter extraction to handle Windows CRLF
- Refactor _extract_grid_cards_from_markdown to use shared helpers (reduce complexity)
- Refactor _extract_grid_cards to use shared helpers (reduce complexity)
- Remove duplicate title cleaning and description truncation logic throughout
- Remove deprecated _make_absolute_url function
- Remove noqa complexity suppressions after refactoring
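For the URL fix listed above, a plausible shape of the new helper built on `urllib.parse.urljoin` (the commit message names the function and the approach; the exact signature is an assumption):

```python
# content_extractor.py (sketch of the urljoin-based helper)
from urllib.parse import urljoin


def _resolve_url(base_url: str, docname: str, target: str) -> str:
    """Resolve a relative doc link (e.g. '../concepts/index.html') against the page's URL."""
    page_url = urljoin(base_url, f"{docname}.html")
    return urljoin(page_url, target)
```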

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review: the addition of docs/get-started/quickstart-tutorial.md, a comprehensive new tutorial document. The file introduces a unified getting-started guide for NeMo Curator that bridges foundational concepts with hands-on examples. It follows a pedagogical structure: defining core abstractions (Pipeline, Stage, Task, Executor), walking through installation, and presenting a complete three-stage sentiment analysis pipeline demonstrating resource allocation, batch processing, and distributed execution. The tutorial integrates with the repository's llm.txt documentation extension by including rich frontmatter metadata (tags, personas, difficulty, content type) and MyST markdown features (grid cards, admonitions) that structure the content for both human readers and AI consumption. The file acts as an entry point linking to four modality-specific quickstart guides (text, image, video, audio) and demonstrates the canonical Task→Stage→Pipeline→Executor pattern that underlies all workflows in the codebase.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/get-started/quickstart-tutorial.md | 5/5 | New comprehensive tutorial document covering installation, core concepts, and a working sentiment analysis pipeline example |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk.
  • Score reflects purely additive documentation content with no code changes, no breaking changes, and no security or logic issues detected. The tutorial follows established patterns from the repository's documentation structure and MyST markdown conventions.
  • No files require special attention; this is a single new documentation file with no dependencies on existing code.

2 files reviewed, no comments


- Add .ruff.toml to suppress complexity warnings for parsing functions
- Replace open() with Path.read_text()/write_text() for better pathlib integration
- Fix docstring formatting (D212, D413, D205, D401)
- Convert one complex logging f-string to lazy % formatting
- Fix unnecessary elif after return (RET505)
- Auto-fix trailing commas and blank lines

Remaining G004 warnings are in config validation code and can be addressed separately.
Complexity warnings (C901, PLR0912, PLR0915) are inherent to MyST markdown parsing logic.
The negative lookbehind pattern with alternation of variable-width strings
caused 're.error: look-behind requires fixed-width pattern' during build.

Changed approach to:
1. Split with simple regex first
2. Post-process to merge sentences split after abbreviations

This avoids variable-width lookbehind limitations while still handling
common abbreviations (Dr., U.S., etc.) correctly.

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The recent commits focus exclusively on code style improvements across the llm_txt_output Sphinx extension with no functional changes. Key refactorings include: (1) standardizing docstrings to single-line opening format throughout __init__.py, template_builder.py, and content_extractor.py; (2) modernizing file I/O by replacing open().read() with Path.read_text() in writer.py; (3) extracting a _clean_card_title helper to eliminate repeated regex patterns in content_extractor.py; (4) simplifying if-elif chains into early returns in writer.py; (5) adding a local .ruff.toml configuration to suppress complexity warnings for the extension's parsing logic; (6) minor formatting fixes (trailing commas, logging syntax). The extension itself generates AI-consumable .llm.txt files from Sphinx documentation by hooking into the build-finished event, extracting content from doctrees and markdown sources, and producing both per-page and aggregated outputs with metadata and table of contents.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/_extensions/llm_txt_output/__init__.py | 5/5 | Reformatted module and setup function docstrings to single-line style; no logic changes |
| docs/_extensions/llm_txt_output/template_builder.py | 5/5 | Updated _normalize_spacing docstring format for consistency; all build logic unchanged |
| docs/_extensions/llm_txt_output/.ruff.toml | 4/5 | Added local linting config to suppress complexity/import warnings; contains redundant per-file ignores |
| docs/_extensions/llm_txt_output/writer.py | 4/5 | Extracted get_output_path helper but left duplicate path resolution in write_llm_txt_file; simplified if-elif chains and file I/O |
| docs/_extensions/llm_txt_output/processor.py | 4/5 | Minor docstring and formatting updates; core orchestration logic unchanged; bare except Exception blocks remain from previous review |
| docs/_extensions/llm_txt_output/content_extractor.py | 4/5 | Consolidated title cleaning into _clean_card_title helper; refactored docstrings and file I/O; improved sentence split regex; trailing comma and code duplication issues remain |

Confidence score: 4/5

  • This PR is safe to merge with minor technical debt that can be addressed in follow-up work
  • Score reflects solid code quality improvements (docstring consistency, modern idioms, reduced duplication) but incomplete refactoring in writer.py (extracted helper not used), redundant linting config in .ruff.toml, and carryover issues from previous reviews (bare exception handlers, duplicate path logic)
  • Pay close attention to docs/_extensions/llm_txt_output/writer.py and docs/_extensions/llm_txt_output/.ruff.toml for the incomplete refactoring and redundant configuration respectively

6 files reviewed, 4 comments


Comment on lines +14 to +16
[lint.per-file-ignores]
# Allow complex functions in content_extractor.py for parsing logic
"content_extractor.py" = ["C901", "PLR0912", "PLR0915", "PLC0415"]


style: Per-file ignores at line 16 duplicate the global ignores from lines 7-12 for content_extractor.py. Remove this redundant section since the global ignores already apply

    return False


def _write_full_llm_txt(app: Sphinx, all_content: list[dict[str, str]], _settings: dict) -> None:


style: the _settings parameter is unused - remove it, or keep the leading underscore only if it is intentionally reserved for future extensibility

Comment on lines +493 to +495
description = re.sub(
    r"\{\{[^}]+\}\}", "NeMo Evaluator", description,
)  # Replace {{ product_name_short }}


style: trailing comma after string argument on line 494 violates PEP 8 for single-element function calls - remove the comma:

Suggested change
    description = re.sub(
        r"\{\{[^}]+\}\}", "NeMo Evaluator", description,
    )  # Replace {{ product_name_short }}
    description = re.sub(
        r"\{\{[^}]+\}\}", "NeMo Evaluator", description
    )  # Replace {{ product_name_short }}

Comment on lines +829 to +832
import yaml
from pathlib import Path

content = Path(file_path).read_text(encoding="utf-8")


style: import yaml and Path at module level instead of inside _extract_frontmatter - same pattern needed in line 814 where str(source_path) suggests Path handling inconsistency
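The fix is mechanical; a sketch of hoisting the imports to the top of the module (nothing else changes):

```python
# content_extractor.py, module level (sketch)
from pathlib import Path

import yaml

# ... inside _extract_frontmatter, the body keeps using the hoisted names:
# content = Path(file_path).read_text(encoding="utf-8")
```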


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The recent updates address previous feedback on content_extractor.py by refactoring exception handling, removing redundant configuration, improving code organization, and fixing backtick formatting issues in directive examples. Specific improvements include moving imports to module-level, consolidating duplicate logic into helper functions, removing unused parameters, and tightening exception handling to catch specific error types rather than broad Exception classes. The changes align with Python best practices (PEP 8, Ruff linting standards) and make the code more maintainable for future contributors working with Sphinx documentation extensions.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| docs/_extensions/llm_txt_output/content_extractor.py | 4/5 | Refactored exception handling to catch specific exceptions, moved imports to module level, removed unused parameters, consolidated duplicate title/description logic, and fixed directive formatting examples |
| docs/_extensions/llm_txt_output/.ruff.toml | 5/5 | Removed redundant per-file ignore configuration that duplicated global ignores for content_extractor.py |

Confidence score: 4/5

  • This PR is safe to merge with minimal risk; the changes are primarily code quality improvements that address previous review feedback
  • Score reflects solid refactoring that improves maintainability without introducing new functionality or breaking changes; one point deducted because exception handling could still be more granular (e.g., distinguishing between YAML parsing errors and file I/O errors) and the sentence splitting regex limitation noted in previous reviews remains unaddressed
  • Pay close attention to docs/_extensions/llm_txt_output/content_extractor.py to verify that specific exception types now caught are comprehensive enough for all failure modes in production documentation builds

1 file reviewed, 3 comments


Comment on lines +302 to +326
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", para)

# Common abbreviations that shouldn't trigger sentence splits
abbrevs = {
"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
"vs.", "etc.", "i.e.", "e.g.", "Fig.", "Vol.", "No.",
"Ph.D.", "U.S.", "U.K.", "Inc.", "Ltd.", "Corp.",
}

# Merge sentences that were split after abbreviations
merged_sentences = []
i = 0
while i < len(sentences):
current = sentences[i]
# Check if current ends with an abbreviation
if i + 1 < len(sentences):
ends_with_abbrev = any(current.rstrip().endswith(abbr) for abbr in abbrevs)
if ends_with_abbrev:
# Merge with next sentence
current = current + " " + sentences[i + 1]
i += 1
merged_sentences.append(current)
i += 1

sentences = merged_sentences


logic: sentences on line 302 could be empty list if para has no sentence boundaries matching the pattern - this would cause merged_sentences loop to append empty strings. Add guard: if not sentences: result.append(para); continue
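The suggested guard would sit directly after the split; a sketch using the names from the code quoted above (where `result` is the accumulator named in the comment):

```python
# content_extractor.py (sketch of the suggested guard)
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", para)
if not sentences:
    result.append(para)  # keep the paragraph as-is when nothing was split
    continue
```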

Comment on lines +305 to +309
abbrevs = {
    "Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
    "vs.", "etc.", "i.e.", "e.g.", "Fig.", "Vol.", "No.",
    "Ph.D.", "U.S.", "U.K.", "Inc.", "Ltd.", "Corp.",
}


style: abbreviation set is missing common tech/doc abbreviations that will appear in documentation: 'API', 'URL', 'GPU', 'CPU', 'RAM', 'KB', 'MB', 'GB', 'TB', 'MHz', 'GHz', 'NVIDIA', 'AWS', etc.

current = sentences[i]
# Check if current ends with an abbreviation
if i + 1 < len(sentences):
    ends_with_abbrev = any(current.rstrip().endswith(abbr) for abbr in abbrevs)


style: ends_with_abbrev check uses exact match on rstrip() but sentences may have multiple trailing spaces or punctuation - use current.rstrip().rstrip('.,!?').endswith(abbr) for robustness
