docs: extension for llm.txt and llm-full.txt outputs #1166
Conversation
Note: Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Pull Request Overview
This PR adds a new Sphinx extension for generating LLM-friendly documentation output in the llm.txt format. The extension creates parallel .llm.txt files for each documentation page and an aggregated llm-full.txt file containing all documentation. The goal is to make documentation more accessible to Large Language Models by providing clean, structured text files optimized for AI consumption.
Key changes:
- Complete implementation of the `llm_txt_output` Sphinx extension with a modular architecture
- Smart MyST directive parsing to extract grid cards and clean content artifacts
- Integration with existing Sphinx configuration and content gating systems
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| docs/conf.py | Added extension registration and configuration settings for llm.txt output |
| docs/_extensions/llm_txt_output/__init__.py | Main extension entry point with Sphinx setup configuration |
| docs/_extensions/llm_txt_output/config.py | Configuration validation and default settings management |
| docs/_extensions/llm_txt_output/processor.py | Build orchestration and document processing pipeline |
| docs/_extensions/llm_txt_output/content_extractor.py | Core content extraction logic with MyST cleaning and grid card parsing |
| docs/_extensions/llm_txt_output/template_builder.py | LLM.txt format template generation and content structuring |
| docs/_extensions/llm_txt_output/writer.py | File writing operations for output generation |
| docs/_extensions/llm_txt_output/README.md | Comprehensive documentation for the extension |
Comments suppressed due to low confidence (1)
docs/_extensions/llm_txt_output/processor.py:1
- Bare exception handling with a broad catch is generally discouraged. Consider catching specific exceptions like `OSError`, `IOError`, or `yaml.YAMLError` for better error handling and debugging.
```python
"""Document processing and build orchestration for llm.txt output."""
```
Force-pushed from d7156d4 to 3859d5f.
Greptile Overview
Greptile Summary
This PR introduces a new Sphinx extension (llm_txt_output) that generates AI-consumable documentation outputs in two formats: per-page .llm.txt files and an aggregated llm-full.txt file. The extension hooks into Sphinx's build lifecycle, extracting structured content (titles, summaries, headings, grid cards, links, metadata) from documentation sources after HTML generation completes. It integrates with NeMo Curator's existing MyST markdown processing and content gating infrastructure by respecting exclude_patterns from both Sphinx's global config and the extension's own settings.

The implementation consists of five coordinated modules: configuration management (config.py), document tree content extraction (content_extractor.py), markdown template building (template_builder.py), file writing (writer.py), and orchestration (processor.py). The extension is enabled in docs/conf.py with a base URL for absolute link generation, content length limits, and smart grid card handling. This fits into the project's broader documentation strategy by providing machine-readable outputs alongside human-readable HTML, enabling LLM assistants to consume documentation more effectively.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| docs/_extensions/llm_txt_output/__init__.py | 5/5 | Registers new Sphinx extension with config validation and build-finished hooks |
| docs/_extensions/llm_txt_output/config.py | 4/5 | Defines configuration schema with validation; `generate_full_file` setting missing from boolean validation |
| docs/conf.py | 5/5 | Enables llm_txt_output extension with base URL, content limits, and smart card handling |
| docs/_extensions/llm_txt_output/writer.py | 3/5 | Writes llm.txt files to build directory; contains duplicate `get_output_path` helper function |
| docs/_extensions/llm_txt_output/processor.py | 3/5 | Orchestrates document processing; uses bare `except Exception` clauses and inconsistent pattern matching |
| docs/_extensions/llm_txt_output/template_builder.py | 3/5 | Builds markdown-formatted output; uses build-time timestamp instead of document modification date |
| docs/_extensions/llm_txt_output/README.md | 4/5 | Documents extension usage and configuration; contains markdown syntax errors in code examples at lines 173, 202-203 |
| docs/_extensions/llm_txt_output/content_extractor.py | 3/5 | Extracts content from doctree; complex dual-strategy card extraction with simplified `../` URL handling |
Confidence score: 3/5
- This PR adds valuable functionality but has several implementation concerns that could cause issues in production environments.
- Score reflects: (1) bare exception handlers that may hide critical errors during builds, (2) duplicate code (`get_output_path` in writer.py), (3) timestamp metadata that uses build time rather than source modification time, (4) oversimplified URL path traversal logic that strips all `../` references without tracking depth, (5) markdown syntax errors in documentation, and (6) missing validation for the `generate_full_file` configuration setting.
- Pay close attention to `content_extractor.py` (lines 786-789 for URL handling, lines 830-848 for exception handling), `template_builder.py` (line 152 for timestamp logic), `writer.py` (duplicate function), and `config.py` (lines 94-109 for missing boolean validation).
Sequence Diagram
```mermaid
sequenceDiagram
    participant Sphinx
    participant Extension as llm_txt_output Extension
    participant Config as config.py
    participant Processor as processor.py
    participant Extractor as content_extractor.py
    participant Template as template_builder.py
    participant Writer as writer.py
    participant FileSystem
    Sphinx->>Extension: setup(app)
    Extension->>Sphinx: add_config_value("llm_txt_settings", defaults)
    Extension->>Sphinx: connect("config-inited", validate_config)
    Extension->>Sphinx: connect("build-finished", on_build_finished)
    Sphinx->>Config: validate_config(app, config)
    Config->>Config: _ensure_settings_dict()
    Config->>Config: apply_config_defaults()
    Config->>Config: _validate_core_settings()
    Config->>Config: _validate_content_limits()
    Config->>Config: _validate_boolean_settings()
    Note over Sphinx: Sphinx builds HTML documentation
    Sphinx->>Processor: on_build_finished(app, exception)
    loop For each document in app.env.all_docs
        Processor->>Processor: should_generate_llm_txt(config, docname)
        Processor->>Processor: is_content_gated(config, docname)
        alt Document should be processed
            Processor->>Extractor: extract_document_content(env, docname, settings)
            Extractor->>Extractor: _extract_title(env, docname)
            Extractor->>Extractor: _extract_grid_cards_from_markdown(env, docname, base_url)
            Extractor->>FileSystem: Read raw markdown source
            FileSystem-->>Extractor: Markdown content
            Extractor->>Extractor: _extract_text_content(doctree)
            Extractor->>Extractor: clean_text_for_llm_txt(raw_text)
            Extractor->>Extractor: _extract_summary(doctree, num_sentences)
            Extractor->>Extractor: _extract_overview(raw_text, max_length)
            Extractor->>Extractor: _extract_headings(doctree)
            Extractor->>Extractor: _extract_internal_links(doctree, max_links, base_url)
            Extractor->>Extractor: _extract_metadata(env, docname)
            Extractor-->>Processor: content_data dict
            Processor->>Template: build_llm_txt_content(content_data)
            Template->>Template: _build_metadata_section(metadata)
            Template->>Template: _normalize_spacing(content)
            Template-->>Processor: llm_txt_content string
            Processor->>Writer: write_llm_txt_file(app, docname, llm_txt_content)
            Writer->>Writer: Determine output path
            Writer->>FileSystem: Create directory (if needed)
            Writer->>FileSystem: Write .llm.txt file
            FileSystem-->>Writer: Success
            Processor->>Processor: Collect content for full file
        end
    end
    alt generate_full_file enabled
        Processor->>Processor: _write_full_llm_txt(app, all_content, settings)
        Processor->>FileSystem: Write llm-full.txt
        FileSystem-->>Processor: Success
    end
    Processor-->>Sphinx: Generation complete
```
8 files reviewed, 15 comments
```python
bool_settings = [
    "enabled",
    "verbose",
    "include_metadata",
    "include_headings",
    "include_related_links",
    "clean_myst_artifacts",
]
```
logic: `generate_full_file` is missing from the `bool_settings` validation list, even though it's defined as a boolean in the defaults (line 27)
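A minimal standalone sketch of the fix (the helper name and return shape are illustrative; the extension's actual `_validate_boolean_settings` signature is not shown in this diff):

```python
# Hypothetical standalone version of the boolean-settings check.
BOOL_SETTINGS = [
    "enabled",
    "verbose",
    "include_metadata",
    "include_headings",
    "include_related_links",
    "clean_myst_artifacts",
    "generate_full_file",  # previously missing from the validation list
]


def invalid_boolean_settings(settings: dict) -> list[str]:
    """Return names of settings present with a non-boolean value."""
    return [
        name
        for name in BOOL_SETTINGS
        if name in settings and not isinstance(settings[name], bool)
    ]


print(invalid_boolean_settings({"enabled": True, "generate_full_file": "yes"}))
# -> ['generate_full_file']
```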
```python
if docname == "index":
    return outdir / "index.llm.txt"
elif docname.endswith("/index"):
    return outdir / docname[:-6] / "index.llm.txt"
else:
    return outdir / f"{docname}.llm.txt"
```
style: `get_output_path` duplicates the exact logic from `write_llm_txt_file`. Extract this to reduce code duplication:

```python
def write_llm_txt_file(app: Sphinx, docname: str, content: str) -> None:
    """Write llm.txt file to output directory."""
    try:
        txt_path = get_output_path(app, docname)
```
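A self-contained sketch of the consolidated helper, assuming the path rules quoted above (the real function takes the Sphinx `app` and derives `outdir` from it, so the `outdir` parameter here is for illustration):

```python
from pathlib import Path


def get_output_path(outdir: Path, docname: str) -> Path:
    """Single source of truth for .llm.txt output paths."""
    if docname == "index":
        return outdir / "index.llm.txt"
    if docname.endswith("/index"):
        # e.g. "guides/index" -> <outdir>/guides/index.llm.txt
        return outdir / docname[: -len("/index")] / "index.llm.txt"
    return outdir / f"{docname}.llm.txt"


out = Path("_build/html")
print(get_output_path(out, "guides/index").as_posix())
# -> _build/html/guides/index.llm.txt
```

With the branching living only here, `write_llm_txt_file` shrinks to a call to `get_output_path` plus the actual write, and the two code paths can no longer drift apart.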
```python
except Exception:
    logger.exception(f"Failed to write llm.txt for {docname}")
```
style: a bare `except Exception` silently swallows every error raised during the write, including programming errors (though not `KeyboardInterrupt` or `SystemExit`, which derive from `BaseException`). Catch specific exceptions (e.g. `OSError`) or at minimum re-raise after logging
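A hedged sketch of the narrower pattern (function name and boolean return are illustrative, not the PR's API):

```python
import logging
from pathlib import Path

logger = logging.getLogger("llm_txt_output")


def write_text_file(path: Path, content: str) -> bool:
    """Write content, surfacing only filesystem errors; bugs still propagate."""
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
    except OSError:
        # Narrow catch: only I/O failures are expected here, so TypeError,
        # AttributeError, and other programming errors are not hidden.
        logger.exception("Failed to write llm.txt to %s", path)
        return False
    return True
```

Returning a status instead of swallowing the error lets the caller count failures and decide whether the build should be marked unsuccessful.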
```python
has_content = True

# Add last updated timestamp
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
```
logic: a timestamp generated at build time will be identical across all pages and will not reflect actual document updates. Should this use a per-document last-modified date from the source file's mtime or Sphinx's environment instead?
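One way to implement the per-document alternative, as a sketch (the `last_updated` helper is hypothetical, not code from the PR), is to read the source file's modification time:

```python
from datetime import datetime, timezone
from pathlib import Path


def last_updated(source_path: Path) -> str:
    """Format the source file's modification time, not the build time."""
    mtime = source_path.stat().st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
```

In a Sphinx extension the source path is available via `env.doc2path(docname)`, so each page's metadata would track its own last edit rather than the last build.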
```python
generated_count += 1
logger.debug(f"Generated llm.txt for {docname}")

except Exception:
```
style: Bare except Exception silently swallows all errors. Consider re-raising or logging at error level to avoid losing critical failures during doc generation.
```python
if content.startswith("---"):
    end_marker = content.find("\n---\n", 3)
    if end_marker != -1:
        frontmatter_text = content[3:end_marker]
        return yaml.safe_load(frontmatter_text)
```
logic: frontmatter extraction searches for `'\n---\n'` starting at position 3, but if the document has Windows line endings (`'\r\n---\r\n'`) this will fail; normalize line endings first
…ix URL resolution

- Add helper functions: `_clean_card_title()`, `_truncate_description()`, `_resolve_url()`
- Fix URL path traversal using `urllib.parse.urljoin` for proper relative path resolution
- Improve sentence boundary detection to avoid splitting on abbreviations (Dr., U.S., etc.)
- Normalize line endings in frontmatter extraction to handle Windows CRLF
- Refactor `_extract_grid_cards_from_markdown` to use shared helpers (reduce complexity)
- Refactor `_extract_grid_cards` to use shared helpers (reduce complexity)
- Remove duplicate title cleaning and description truncation logic throughout
- Remove deprecated `_make_absolute_url` function
- Remove noqa complexity suppressions after refactoring
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review: the addition of docs/get-started/quickstart-tutorial.md, a comprehensive new tutorial document. The file introduces a unified getting-started guide for NeMo Curator that bridges foundational concepts with hands-on examples. It follows a pedagogical structure: defining core abstractions (Pipeline, Stage, Task, Executor), walking through installation, and presenting a complete three-stage sentiment analysis pipeline demonstrating resource allocation, batch processing, and distributed execution.

The tutorial integrates with the repository's llm.txt documentation extension by including rich frontmatter metadata (tags, personas, difficulty, content type) and MyST markdown features (grid cards, admonitions) that structure the content for both human readers and AI consumption. The file acts as an entry point linking to four modality-specific quickstart guides (text, image, video, audio) and demonstrates the canonical Task→Stage→Pipeline→Executor pattern that underlies all workflows in the codebase.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| docs/get-started/quickstart-tutorial.md | 5/5 | New comprehensive tutorial document covering installation, core concepts, and a working sentiment analysis pipeline example |
Confidence score: 5/5
- This PR is safe to merge with minimal risk.
- Score reflects purely additive documentation content with no code changes, no breaking changes, and no security or logic issues detected. The tutorial follows established patterns from the repository's documentation structure and MyST markdown conventions.
- No files require special attention; this is a single new documentation file with no dependencies on existing code.
2 files reviewed, no comments
- Add .ruff.toml to suppress complexity warnings for parsing functions
- Replace `open()` with `Path.read_text()`/`write_text()` for better pathlib integration
- Fix docstring formatting (D212, D413, D205, D401)
- Convert one complex logging f-string to lazy % formatting
- Fix unnecessary elif after return (RET505)
- Auto-fix trailing commas and blank lines

Remaining G004 warnings are in config validation code and can be addressed separately. Complexity warnings (C901, PLR0912, PLR0915) are inherent to MyST markdown parsing logic.
The negative lookbehind pattern with alternation of variable-width strings caused `re.error: look-behind requires fixed-width pattern` during build. Changed approach to:

1. Split with a simple regex first
2. Post-process to merge sentences split after abbreviations

This avoids variable-width lookbehind limitations while still handling common abbreviations (Dr., U.S., etc.) correctly.
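The two-step approach described in this commit can be sketched as follows (the abbreviation set and function name are illustrative, not the PR's actual code):

```python
import re

# Small illustrative abbreviation set; the real extension may use a longer list.
ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "U.S.", "e.g.", "i.e.", "vs."}


def split_sentences(text: str) -> list[str]:
    """Split on end punctuation, then merge pieces broken after abbreviations."""
    # Step 1: simple fixed-width lookbehind split (no variable-width patterns).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    merged: list[str] = []
    for part in parts:
        # Step 2: if the previous chunk ends with a known abbreviation,
        # the split was spurious, so glue this chunk back on.
        if merged and merged[-1].rsplit(" ", 1)[-1] in ABBREVIATIONS:
            merged[-1] = f"{merged[-1]} {part}"
        else:
            merged.append(part)
    return merged


print(split_sentences("Dr. Smith works in the U.S. today. It is sunny."))
# -> ['Dr. Smith works in the U.S. today.', 'It is sunny.']
```

The split regex uses only a fixed-width lookbehind (`(?<=[.!?])`), which `re` accepts, while the merge pass handles the variable-width abbreviation cases that a lookbehind cannot express.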
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The recent commits focus exclusively on code style improvements across the llm_txt_output Sphinx extension with no functional changes. Key refactorings include:

1. Standardizing docstrings to single-line opening format throughout `__init__.py`, `template_builder.py`, and `content_extractor.py`
2. Modernizing file I/O by replacing `open().read()` with `Path.read_text()` in `writer.py`
3. Extracting a `_clean_card_title` helper to eliminate repeated regex patterns in `content_extractor.py`
4. Simplifying if-elif chains into early returns in `writer.py`
5. Adding a local `.ruff.toml` configuration to suppress complexity warnings for the extension's parsing logic
6. Minor formatting fixes (trailing commas, logging syntax)

The extension itself generates AI-consumable .llm.txt files from Sphinx documentation by hooking into the build-finished event, extracting content from doctrees and markdown sources, and producing both per-page and aggregated outputs with metadata and table of contents.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| docs/_extensions/llm_txt_output/__init__.py | 5/5 | Reformatted module and setup function docstrings to single-line style; no logic changes |
| docs/_extensions/llm_txt_output/template_builder.py | 5/5 | Updated _normalize_spacing docstring format for consistency; all build logic unchanged |
| docs/_extensions/llm_txt_output/.ruff.toml | 4/5 | Added local linting config to suppress complexity/import warnings; contains redundant per-file ignores |
| docs/_extensions/llm_txt_output/writer.py | 4/5 | Extracted get_output_path helper but left duplicate path resolution in write_llm_txt_file; simplified if-elif chains and file I/O |
| docs/_extensions/llm_txt_output/processor.py | 4/5 | Minor docstring and formatting updates; core orchestration logic unchanged; bare except Exception blocks remain from previous review |
| docs/_extensions/llm_txt_output/content_extractor.py | 4/5 | Consolidated title cleaning into _clean_card_title helper; refactored docstrings and file I/O; improved sentence split regex; trailing comma and code duplication issues remain |
Confidence score: 4/5
- This PR is safe to merge with minor technical debt that can be addressed in follow-up work
- Score reflects solid code quality improvements (docstring consistency, modern idioms, reduced duplication) but incomplete refactoring in writer.py (extracted helper not used), redundant linting config in .ruff.toml, and carryover issues from previous reviews (bare exception handlers, duplicate path logic)
- Pay close attention to docs/_extensions/llm_txt_output/writer.py and docs/_extensions/llm_txt_output/.ruff.toml for the incomplete refactoring and redundant configuration respectively
6 files reviewed, 4 comments
```toml
[lint.per-file-ignores]
# Allow complex functions in content_extractor.py for parsing logic
"content_extractor.py" = ["C901", "PLR0912", "PLR0915", "PLC0415"]
```
style: Per-file ignores at line 16 duplicate the global ignores from lines 7-12 for content_extractor.py. Remove this redundant section since the global ignores already apply
```python
    return False


def _write_full_llm_txt(app: Sphinx, all_content: list[dict[str, str]], _settings: dict) -> None:
```
style: `_settings` parameter is unused - remove it or prefix with underscore if intentionally ignored for future extensibility
```python
description = re.sub(
    r"\{\{[^}]+\}\}", "NeMo Evaluator", description,
)  # Replace {{ product_name_short }}
```
style: trailing comma after string argument on line 494 violates PEP 8 for single-element function calls - remove the comma:
Suggested change:

```diff
 description = re.sub(
-    r"\{\{[^}]+\}\}", "NeMo Evaluator", description,
+    r"\{\{[^}]+\}\}", "NeMo Evaluator", description
 )  # Replace {{ product_name_short }}
```
```python
import yaml
from pathlib import Path

content = Path(file_path).read_text(encoding="utf-8")
```
style: import `yaml` and `Path` at module level instead of inside `_extract_frontmatter` - the same pattern is needed at line 814, where `str(source_path)` suggests inconsistent Path handling
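A minimal sketch of the module-level-import pattern this comment recommends; to stay dependency-free, this version only isolates the raw frontmatter text and omits the `yaml.safe_load` parsing the extension presumably performs. The function name is hypothetical.

```python
from pathlib import Path


def extract_frontmatter_text(file_path):
    # Return the raw "---"-delimited YAML frontmatter block, or "" if absent.
    content = Path(file_path).read_text(encoding="utf-8")
    if not content.startswith("---\n"):
        return ""
    end = content.find("\n---", 4)
    if end == -1:
        return ""
    return content[4:end]
```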
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The recent updates address previous feedback on content_extractor.py by refactoring exception handling, removing redundant configuration, improving code organization, and fixing backtick formatting issues in directive examples. Specific improvements include moving imports to module-level, consolidating duplicate logic into helper functions, removing unused parameters, and tightening exception handling to catch specific error types rather than broad Exception classes. The changes align with Python best practices (PEP 8, Ruff linting standards) and make the code more maintainable for future contributors working with Sphinx documentation extensions.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| docs/_extensions/llm_txt_output/content_extractor.py | 4/5 | Refactored exception handling to catch specific exceptions, moved imports to module level, removed unused parameters, consolidated duplicate title/description logic, and fixed directive formatting examples |
| docs/_extensions/llm_txt_output/.ruff.toml | 5/5 | Removed redundant per-file ignore configuration that duplicated global ignores for content_extractor.py |
Confidence score: 4/5
- This PR is safe to merge with minimal risk; the changes are primarily code quality improvements that address previous review feedback
- Score reflects solid refactoring that improves maintainability without introducing new functionality or breaking changes; one point deducted because exception handling could still be more granular (e.g., distinguishing between YAML parsing errors and file I/O errors) and the sentence splitting regex limitation noted in previous reviews remains unaddressed
- Pay close attention to `docs/_extensions/llm_txt_output/content_extractor.py` to verify that the specific exception types now caught are comprehensive enough for all failure modes in production documentation builds
1 file reviewed, 3 comments
```python
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", para)

# Common abbreviations that shouldn't trigger sentence splits
abbrevs = {
    "Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
    "vs.", "etc.", "i.e.", "e.g.", "Fig.", "Vol.", "No.",
    "Ph.D.", "U.S.", "U.K.", "Inc.", "Ltd.", "Corp.",
}

# Merge sentences that were split after abbreviations
merged_sentences = []
i = 0
while i < len(sentences):
    current = sentences[i]
    # Check if current ends with an abbreviation
    if i + 1 < len(sentences):
        ends_with_abbrev = any(current.rstrip().endswith(abbr) for abbr in abbrevs)
        if ends_with_abbrev:
            # Merge with next sentence
            current = current + " " + sentences[i + 1]
            i += 1
    merged_sentences.append(current)
    i += 1

sentences = merged_sentences
```
logic: sentences on line 302 could be empty list if para has no sentence boundaries matching the pattern - this would cause merged_sentences loop to append empty strings. Add guard: if not sentences: result.append(para); continue
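For reference, a quick check of how `re.split` behaves on the edge cases this comment raises (standard library behavior, not code from the PR): a paragraph with no matching boundary comes back as a one-element list, and even an empty string yields `[""]`, so the suggested guard is defensive rather than strictly required.

```python
import re

SENTENCE_BOUNDARY = r"(?<=[.!?])\s+(?=[A-Z0-9])"

# No boundary: the whole paragraph is returned as a single element.
no_split = re.split(SENTENCE_BOUNDARY, "a paragraph with no terminal punctuation")

# An empty string yields [""], not an empty list.
empty = re.split(SENTENCE_BOUNDARY, "")

# A normal two-sentence paragraph splits as expected.
two = re.split(SENTENCE_BOUNDARY, "First sentence. Second one.")
```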
```python
abbrevs = {
    "Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
    "vs.", "etc.", "i.e.", "e.g.", "Fig.", "Vol.", "No.",
    "Ph.D.", "U.S.", "U.K.", "Inc.", "Ltd.", "Corp.",
}
```
style: abbreviation set is missing common tech/doc abbreviations that will appear in documentation: 'API', 'URL', 'GPU', 'CPU', 'RAM', 'KB', 'MB', 'GB', 'TB', 'MHz', 'GHz', 'NVIDIA', 'AWS', etc.
```python
current = sentences[i]
# Check if current ends with an abbreviation
if i + 1 < len(sentences):
    ends_with_abbrev = any(current.rstrip().endswith(abbr) for abbr in abbrevs)
```
style: ends_with_abbrev check uses exact match on rstrip() but sentences may have multiple trailing spaces or punctuation - use current.rstrip().rstrip('.,!?').endswith(abbr) for robustness
New Sphinx extension that enables AI tools to better consume documentation by providing per-page and full-site llm.txt files as part of the output deliverables.
Example Output
Location: `/about/concepts/text/data-curation-pipeline.llm.txt`