feat: dynamic error extraction #4196

nv-nmailhot · 2025-11-07T23:42:56Z

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

New Features
- Added automated error log extraction and analysis tool with intelligent fallback detection for improved log inspection.
Improvements
- Enhanced CI/CD workflow error reporting with GitHub annotations.
- Improved diagnostic capture (pod status, events) on deployment failures.
- Better structured error messaging and propagation across deployment and test processes.

coderabbitai · 2025-11-07T23:51:44Z

Walkthrough

A new Python script for extracting error messages from logs is introduced with dual extraction strategies (LogAI-based and regex fallback), and a GitHub Actions workflow is enhanced to integrate this error extraction, log collection, and GitHub annotation creation for deployment and test failures.

Changes

Cohort / File(s)	Summary
New Error Extraction Script `\.github/scripts/extract_log_errors\.py`	Introduces `LogErrorExtractor` class with LogAI-based and regex fallback extraction strategies. Defines `extract_with_logai()` and `extract_with_fallback()` methods, orchestrated by `extract_errors()`. Provides `get_summary()` and `get_primary_error()` formatters. Includes `ERROR_PATTERNS` and `CONTEXT_PATTERNS` class attributes. Implements CLI with `main()` entry point supporting `--json` output flag.
Enhanced Workflow Error Handling `\.github/workflows/container-validation-backends\.yml`	Adds per-step log file redirection (deploy-operator.log, test-output.log). Introduces Helm availability checks with version logging and structured error handling. Converts deployment steps to conditional gates with diagnostics capture on failure. Adds Python setup, LogAI installation, and error extraction steps triggered on deployment/test failures. Implements GitHub annotation creation workflow with check-run integration and extracted error logs. Extends failure handling with `ERROR_MESSAGE` environment variable propagation across both operator deployment and test paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

LogAI availability and fallback orchestration in extract_log_errors.py: Dual extraction strategies require verification of graceful degradation and error handling patterns.
Regex pattern definitions and deduplication logic: Error pattern matching, context extraction, and result limiting require careful validation.
Workflow failure path orchestration: Multiple conditional branches for deployment and test failures with error propagation need coordination verification.
GitHub annotation and check-run integration: Error message formatting, annotation JSON structure, and integration with extracted logs require attention to correctness.

Poem

🐰 A script hops forth with errors to find,
With LogAI's wisdom and patterns refined,
When pipelines tumble, annotations appear,
Guiding us gently through logs dark and dear! ✨📋

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is a blank template with no actual content provided. All sections (Overview, Details, Where should the reviewer start?) lack substantive information, and the Related Issues section contains only a placeholder.	Complete the description by filling in the Overview section explaining the purpose, the Details section describing specific changes, pointing reviewers to the key files (extract_log_errors.py and container-validation-backends.yml), and replacing the GitHub issue placeholder with an actual issue number.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: dynamic error extraction' accurately captures the main feature addition: introducing new error extraction logic via the LogErrorExtractor class and integration with the workflow.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93ada89 and 11e20fd.

📒 Files selected for processing (2)

.github/scripts/extract_log_errors.py (1 hunks)
.github/workflows/container-validation-backends.yml (5 hunks)

🧰 Additional context used

🪛 actionlint (1.7.8)

.github/workflows/container-validation-backends.yml

488-488: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

741-741: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.

.github/workflows/container-validation-backends.yml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit. Please re-run to ensure CI passes.

[error] 1-1: Check-shebang-scripts-are-executable: .github/scripts/extract_log_errors.py has a shebang but is not executable.

.github/scripts/extract_log_errors.py

[error] 14-15: Ruff: F401 import 'NNAnomalyDetector' and 'AnomalyDetectionWorkflow' imported but unused in extract_log_errors.py.

[error] 14-14: Ruff: F401 import 'NNAnomalyDetector' imported but unused.

[error] 15-15: Ruff: F401 import 'AnomalyDetectionWorkflow' imported but unused.

[error] 1-1: isort: Files were modified by this hook and need reformatting.

[error] 1-1: Black: 1 file reformatted; run pre-commit again to apply changes.

🪛 Ruff (0.14.3)

.github/scripts/extract_log_errors.py

1-1: Shebang is present but file is not executable

(EXE001)

29-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

45-50: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

67-67: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"

(S108)

100-100: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: operator (arm64)
GitHub Check: operator (amd64)
GitHub Check: trtllm (amd64)
GitHub Check: trtllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: vllm (arm64)
GitHub Check: vllm (amd64)
GitHub Check: Build and Test - dynamo

coderabbitai · 2025-11-07T23:51:47Z

.github/scripts/extract_log_errors.py

@@ -0,0 +1,225 @@
+#!/usr/bin/env python3


⚠️ Potential issue | 🟠 Major

Set the executable bit for this script

Because the file starts with a shebang, the hooks expect it to be executable. Right now it isn’t, so check-shebang-scripts-are-executable fails. Please mark it executable in git (e.g., git update-index --chmod=+x .github/scripts/extract_log_errors.py) so the lint gate passes.(docs.astral.sh)

🧰 Tools

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.

[error] 1-1: isort: Files were modified by this hook and need reformatting.

[error] 1-1: Black: 1 file reformatted; run pre-commit again to apply changes.

🪛 Ruff (0.14.3)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents

In .github/scripts/extract_log_errors.py at line 1: the script has a shebang but is not executable, causing the check-shebang-scripts-are-executable hook to fail; mark the file executable in git (for example run git update-index --chmod=+x .github/scripts/extract_log_errors.py) and commit the change so the executable bit is preserved in the repository.

coderabbitai · 2025-11-07T23:51:47Z

.github/scripts/extract_log_errors.py

+    from logai.applications.openset.anomaly_detection import AnomalyDetectionWorkflow
+    from logai.dataloader.data_loader import FileDataLoader
+    from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
+    from logai.preprocess.preprocessor import Preprocessor
+    from logai.information_extraction.log_parser import LogParser


⚠️ Potential issue | 🟠 Major

Remove unused LogAI imports

AnomalyDetectionWorkflow and NNAnomalyDetector are never referenced, which is why Ruff reports F401 and the lint gate fails. Drop the unused symbols (or use them) so the script passes lint. Example fix:

- from logai.applications.openset.anomaly_detection import AnomalyDetectionWorkflow - from logai.analysis.nn_anomaly_detector import NNAnomalyDetector from logai.preprocess.preprocessor import Preprocessor

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.

[error] 14-15: Ruff: F401 import 'NNAnomalyDetector' and 'AnomalyDetectionWorkflow' imported but unused in extract_log_errors.py.

[error] 14-14: Ruff: F401 import 'NNAnomalyDetector' imported but unused.

[error] 15-15: Ruff: F401 import 'AnomalyDetectionWorkflow' imported but unused.

🤖 Prompt for AI Agents

.github/scripts/extract_log_errors.py around lines 14 to 18: the imports AnomalyDetectionWorkflow and NNAnomalyDetector are unused and cause Ruff F401 lint failures; either remove those two symbols from the import line or reference them where intended (e.g., instantiate/use them) so they are no longer unused — update the import statement to only import FileDataLoader, Preprocessor, and LogParser (or implement the intended usage of the removed symbols) and run the linter to confirm the F401 error is resolved.

coderabbitai · 2025-11-07T23:51:47Z

.github/scripts/extract_log_errors.py

+
+    # Common error patterns for fallback
+    ERROR_PATTERNS = [
+        r"Error:?\s+(.+?)(?:\n|$)",
+        r"ERROR[:\s]+(.+?)(?:\n|$)",
+        r"Failed\s+(.+?)(?:\n|$)",
+        r"failed\s+(.+?)(?:\n|$)",
+        r"FAILED[:\s]+(.+?)(?:\n|$)",
+        r"Exception[:\s]+(.+?)(?:\n|$)",
+        r"Traceback \(most recent call last\):(.+?)(?:\n\n|$)",
+        r"fatal[:\s]+(.+?)(?:\n|$)",
+        r"FATAL[:\s]+(.+?)(?:\n|$)",
+        r"panic[:\s]+(.+?)(?:\n|$)",
+        r"timed out\s+(.+?)(?:\n|$)",
+        r"timeout\s+(.+?)(?:\n|$)",
+    ]
+
+    # Patterns to identify context around errors
+    CONTEXT_PATTERNS = [
+        r"(exit code \d+)",
+        r"(status code \d+)",
+        r"(HTTP \d{3})",
+        r"(line \d+)",
+    ]


⚠️ Potential issue | 🟠 Major

Annotate class-level patterns as ClassVar

Ruff flags ERROR_PATTERNS and CONTEXT_PATTERNS because they’re mutable class attributes without a ClassVar annotation. Add the import and annotate them so it’s explicit that they are shared across instances:

-from typing import List, Dict, Any +from typing import List, Dict, Any, ClassVar @@ - ERROR_PATTERNS = [ + ERROR_PATTERNS: ClassVar[List[str]] = [ @@ - CONTEXT_PATTERNS = [ + CONTEXT_PATTERNS: ClassVar[List[str]] = [

This silences RUF012 and documents the intended shared state.(docs.astral.sh)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.14.3)

29-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

45-50: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents

In .github/scripts/extract_log_errors.py around lines 27 to 50, the module-level/pseudo-class constants ERROR_PATTERNS and CONTEXT_PATTERNS are flagged by Ruff as mutable class attributes without a ClassVar annotation; import typing.ClassVar and annotate both variables as ClassVar[list[str]] (or ClassVar[List[str]] if using from typing import List) so they are explicitly declared as shared/immutable class-level patterns and Ruff RUF012 is silenced.

coderabbitai · 2025-11-07T23:51:48Z

.github/workflows/container-validation-backends.yml

+        echo "Waiting for deployments to be ready..."
+        if ! timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch 2>&1; then
+          ERROR_MSG="Deployment rollout timed out after 300 seconds"
+          echo "$ERROR_MSG"
+          
+          # Capture diagnostics
+          echo "=== Pod Status ==="
+          kubectl get pods -n ${NAMESPACE} -o wide || true
+          echo "=== Deployment Status ==="
+          kubectl get deployments -n ${NAMESPACE} || true
+          
+          echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV
+          exit 1
+        fi


⚠️ Potential issue | 🔴 Critical

Fix rollout status command to target real deployments

kubectl rollout status deployment -n "$NAMESPACE" exits with “exactly one NAME is required” because the resource name is missing. That means the step will always fail, even when the rollout succeeds. Please enumerate the deployments (or otherwise name the specific ones) and call kubectl rollout status on each so the rollout wait actually works. For example:

- if ! timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch 2>&1; then + if ! kubectl get deployments -n "$NAMESPACE" -o name | while read -r deployment; do + timeout 300s kubectl rollout status "$deployment" -n "$NAMESPACE" + done; then

Any solution that supplies valid resource names is fine; the key is to stop invoking kubectl rollout status without a target.(kubernetes.io)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

.github/workflows/container-validation-backends.yml lines 470-483: the kubectl rollout status call is missing a resource name and therefore always fails; replace the single call with logic that lists deployments in $NAMESPACE and iterates over each deployment name, calling kubectl rollout status deployment/<name> -n "$NAMESPACE" (with the existing timeout) for each; handle the case of zero deployments (skip or treat as success), and preserve the existing diagnostics and exit-on-failure behavior if any rollout status call times out.

coderabbitai · 2025-11-07T23:51:48Z

.github/workflows/container-validation-backends.yml

+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.10'
+
+    - name: Install LogAI
+      if: always() && steps.deploy-operator-step.outcome == 'failure'
+      run: |
+        pip install --upgrade pip
+        pip install logai || echo "LogAI installation failed, will use fallback"
+        chmod +x .github/scripts/extract_log_errors.py
+    
+    - name: Extract Errors from Logs using LogAI
+      if: always() && steps.deploy-operator-step.outcome == 'failure'
+      id: extract-errors


⚠️ Potential issue | 🔴 Critical

Upgrade setup-python to a Node20-compatible release

actions/setup-python@v4 still runs on the retired Node16 runtime, which is no longer provided on hosted runners. To avoid unexpected job failures, bump both occurrences to a Node20-based release (v5 or newer). For example:

- uses: actions/setup-python@v4 + uses: actions/setup-python@v5

Do the same in the later step so both paths stay supported.(github.blog)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

uses: actions/setup-python@v4

with:

python-version: '3.10'

- name: Install LogAI

if: always() && steps.deploy-operator-step.outcome == 'failure'

run: |

pip install --upgrade pip

pip install logai || echo "LogAI installation failed, will use fallback"

chmod +x .github/scripts/extract_log_errors.py

- name: Extract Errors from Logs using LogAI

if: always() && steps.deploy-operator-step.outcome == 'failure'

id: extract-errors

uses: actions/setup-python@v5

with:

python-version: '3.10'

- name: Install LogAI

if: always() && steps.deploy-operator-step.outcome == 'failure'

run: |

pip install --upgrade pip

pip install logai || echo "LogAI installation failed, will use fallback"

chmod +x .github/scripts/extract_log_errors.py

- name: Extract Errors from Logs using LogAI

if: always() && steps.deploy-operator-step.outcome == 'failure'

id: extract-errors

🧰 Tools

🪛 actionlint (1.7.8)

488-488: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents

.github/workflows/container-validation-backends.yml lines 488-501: the workflow uses actions/setup-python@v4 which runs on the retired Node16 runtime; update this action to a Node20-compatible release (e.g., actions/setup-python@v5 or newer) in this step and also update the later identical occurrence in the file so both paths use the Node20-based release to prevent runner failures.

nv-nmailhot added 2 commits November 7, 2025 15:09

add error message propogation

959ec2b

extract error with ai

11e20fd

nv-nmailhot requested a review from a team as a code owner November 7, 2025 23:42

pull-request-size bot added the size/XL label Nov 7, 2025

github-actions bot added the feat label Nov 7, 2025

coderabbitai bot reviewed Nov 7, 2025

View reviewed changes

always run deploy test to test temporarily

ada3992

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 18:27 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 18:28 Inactive

change rules to test

38cc68d

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 19:47 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 11, 2025 19:53 Inactive

force trigger always

4e01be8

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 18:06 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 18:08 Inactive

test log ai

50926d8

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 19:25 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 19:34 Inactive

logai fixes

556308d

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 21:28 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 21:29 Inactive

Merge branch 'main' into nmailhot/ai-logs

a591a54

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 21:39 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 21:40 Inactive

add more error identifiers

8704fef

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 21:46 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 21:50 Inactive

move annotation creation to within step

c723c8a

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 23:36 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 23:37 Inactive

add changes for deploy operator job

38aae36

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 23:40 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 23:42 Inactive

nv-nmailhot closed this Nov 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: dynamic error extraction #4196

feat: dynamic error extraction #4196

Uh oh!

nv-nmailhot commented Nov 7, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 7, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 7, 2025

Uh oh!

coderabbitai bot Nov 7, 2025

Uh oh!

coderabbitai bot Nov 7, 2025

Uh oh!

coderabbitai bot Nov 7, 2025

Uh oh!

coderabbitai bot Nov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: dynamic error extraction #4196

feat: dynamic error extraction #4196

Uh oh!

Conversation

nv-nmailhot commented Nov 7, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 7, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 7, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nv-nmailhot commented Nov 7, 2025 •

edited by coderabbitai bot

Loading