Skip to content

Conversation

@nv-nmailhot
Copy link
Contributor

@nv-nmailhot nv-nmailhot commented Nov 7, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added automated error log extraction and analysis tool with intelligent fallback detection for improved log inspection.
  • Improvements

    • Enhanced CI/CD workflow error reporting with GitHub annotations.
    • Improved diagnostic capture (pod status, events) on deployment failures.
    • Better structured error messaging and propagation across deployment and test processes.

@nv-nmailhot nv-nmailhot requested a review from a team as a code owner November 7, 2025 23:42
@github-actions github-actions bot added the feat label Nov 7, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

A new Python script for extracting error messages from logs is introduced with dual extraction strategies (LogAI-based and regex fallback), and a GitHub Actions workflow is enhanced to integrate this error extraction, log collection, and GitHub annotation creation for deployment and test failures.

Changes

Cohort / File(s) Summary
New Error Extraction Script
\.github/scripts/extract_log_errors\.py
Introduces LogErrorExtractor class with LogAI-based and regex fallback extraction strategies. Defines extract_with_logai() and extract_with_fallback() methods, orchestrated by extract_errors(). Provides get_summary() and get_primary_error() formatters. Includes ERROR_PATTERNS and CONTEXT_PATTERNS class attributes. Implements CLI with main() entry point supporting --json output flag.
Enhanced Workflow Error Handling
\.github/workflows/container-validation-backends\.yml
Adds per-step log file redirection (deploy-operator.log, test-output.log). Introduces Helm availability checks with version logging and structured error handling. Converts deployment steps to conditional gates with diagnostics capture on failure. Adds Python setup, LogAI installation, and error extraction steps triggered on deployment/test failures. Implements GitHub annotation creation workflow with check-run integration and extracted error logs. Extends failure handling with ERROR_MESSAGE environment variable propagation across both operator deployment and test paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • LogAI availability and fallback orchestration in extract_log_errors.py: Dual extraction strategies require verification of graceful degradation and error handling patterns.
  • Regex pattern definitions and deduplication logic: Error pattern matching, context extraction, and result limiting require careful validation.
  • Workflow failure path orchestration: Multiple conditional branches for deployment and test failures with error propagation need coordination verification.
  • GitHub annotation and check-run integration: Error message formatting, annotation JSON structure, and integration with extracted logs require attention to correctness.

Poem

🐰 A script hops forth with errors to find,
With LogAI's wisdom and patterns refined,
When pipelines tumble, annotations appear,
Guiding us gently through logs dark and dear! ✨📋

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is a blank template with no actual content provided. All sections (Overview, Details, Where should the reviewer start?) lack substantive information, and the Related Issues section contains only a placeholder. Complete the description by filling in the Overview section explaining the purpose, the Details section describing specific changes, pointing reviewers to the key files (extract_log_errors.py and container-validation-backends.yml), and replacing the GitHub issue placeholder with an actual issue number.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: dynamic error extraction' accurately captures the main feature addition: introducing new error extraction logic via the LogErrorExtractor class and integration with the workflow.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93ada89 and 11e20fd.

📒 Files selected for processing (2)
  • .github/scripts/extract_log_errors.py (1 hunks)
  • .github/workflows/container-validation-backends.yml (5 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/container-validation-backends.yml

488-488: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)


741-741: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.
.github/workflows/container-validation-backends.yml

[error] 1-1: Trailing whitespace detected and fixed by pre-commit. Please re-run to ensure CI passes.


[error] 1-1: Check-shebang-scripts-are-executable: .github/scripts/extract_log_errors.py has a shebang but is not executable.

.github/scripts/extract_log_errors.py

[error] 14-15: Ruff: F401 import 'NNAnomalyDetector' and 'AnomalyDetectionWorkflow' imported but unused in extract_log_errors.py.


[error] 14-14: Ruff: F401 import 'NNAnomalyDetector' imported but unused.


[error] 15-15: Ruff: F401 import 'AnomalyDetectionWorkflow' imported but unused.


[error] 1-1: isort: Files were modified by this hook and need reformatting.


[error] 1-1: Black: 1 file reformatted; run pre-commit again to apply changes.

🪛 Ruff (0.14.3)
.github/scripts/extract_log_errors.py

1-1: Shebang is present but file is not executable

(EXE001)


29-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


45-50: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


67-67: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"

(S108)


100-100: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo

@@ -0,0 +1,225 @@
#!/usr/bin/env python3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Set the executable bit for this script

Because the file starts with a shebang, the hooks expect it to be executable. Right now it isn’t, so check-shebang-scripts-are-executable fails. Please mark it executable in git (e.g., git update-index --chmod=+x .github/scripts/extract_log_errors.py) so the lint gate passes.(docs.astral.sh)

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.

[error] 1-1: isort: Files were modified by this hook and need reformatting.


[error] 1-1: Black: 1 file reformatted; run pre-commit again to apply changes.

🪛 Ruff (0.14.3)

1-1: Shebang is present but file is not executable

(EXE001)

🤖 Prompt for AI Agents
In .github/scripts/extract_log_errors.py at line 1: the script has a shebang but
is not executable, causing the check-shebang-scripts-are-executable hook to
fail; mark the file executable in git (for example run git update-index
--chmod=+x .github/scripts/extract_log_errors.py) and commit the change so the
executable bit is preserved in the repository.

Comment on lines 14 to 18
from logai.applications.openset.anomaly_detection import AnomalyDetectionWorkflow
from logai.dataloader.data_loader import FileDataLoader
from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
from logai.preprocess.preprocessor import Preprocessor
from logai.information_extraction.log_parser import LogParser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove unused LogAI imports

AnomalyDetectionWorkflow and NNAnomalyDetector are never referenced, which is why Ruff reports F401 and the lint gate fails. Drop the unused symbols (or use them) so the script passes lint. Example fix:

-    from logai.applications.openset.anomaly_detection import AnomalyDetectionWorkflow
-    from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
     from logai.preprocess.preprocessor import Preprocessor

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.

[error] 14-15: Ruff: F401 import 'NNAnomalyDetector' and 'AnomalyDetectionWorkflow' imported but unused in extract_log_errors.py.


[error] 14-14: Ruff: F401 import 'NNAnomalyDetector' imported but unused.


[error] 15-15: Ruff: F401 import 'AnomalyDetectionWorkflow' imported but unused.

🤖 Prompt for AI Agents
.github/scripts/extract_log_errors.py around lines 14 to 18: the imports
AnomalyDetectionWorkflow and NNAnomalyDetector are unused and cause Ruff F401
lint failures; either remove those two symbols from the import line or reference
them where intended (e.g., instantiate/use them) so they are no longer unused —
update the import statement to only import FileDataLoader, Preprocessor, and
LogParser (or implement the intended usage of the removed symbols) and run the
linter to confirm the F401 error is resolved.

Comment on lines 27 to 50

# Common error patterns for fallback
ERROR_PATTERNS = [
r"Error:?\s+(.+?)(?:\n|$)",
r"ERROR[:\s]+(.+?)(?:\n|$)",
r"Failed\s+(.+?)(?:\n|$)",
r"failed\s+(.+?)(?:\n|$)",
r"FAILED[:\s]+(.+?)(?:\n|$)",
r"Exception[:\s]+(.+?)(?:\n|$)",
r"Traceback \(most recent call last\):(.+?)(?:\n\n|$)",
r"fatal[:\s]+(.+?)(?:\n|$)",
r"FATAL[:\s]+(.+?)(?:\n|$)",
r"panic[:\s]+(.+?)(?:\n|$)",
r"timed out\s+(.+?)(?:\n|$)",
r"timeout\s+(.+?)(?:\n|$)",
]

# Patterns to identify context around errors
CONTEXT_PATTERNS = [
r"(exit code \d+)",
r"(status code \d+)",
r"(HTTP \d{3})",
r"(line \d+)",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Annotate class-level patterns as ClassVar

Ruff flags ERROR_PATTERNS and CONTEXT_PATTERNS because they’re mutable class attributes without a ClassVar annotation. Add the import and annotate them so it’s explicit that they are shared across instances:

-from typing import List, Dict, Any
+from typing import List, Dict, Any, ClassVar
@@
-    ERROR_PATTERNS = [
+    ERROR_PATTERNS: ClassVar[List[str]] = [
@@
-    CONTEXT_PATTERNS = [
+    CONTEXT_PATTERNS: ClassVar[List[str]] = [

This silences RUF012 and documents the intended shared state.(docs.astral.sh)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.3)

29-42: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


45-50: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In .github/scripts/extract_log_errors.py around lines 27 to 50, the
module-level/pseudo-class constants ERROR_PATTERNS and CONTEXT_PATTERNS are
flagged by Ruff as mutable class attributes without a ClassVar annotation;
import typing.ClassVar and annotate both variables as ClassVar[list[str]] (or
ClassVar[List[str]] if using from typing import List) so they are explicitly
declared as shared/immutable class-level patterns and Ruff RUF012 is silenced.

Comment on lines 470 to 483
echo "Waiting for deployments to be ready..."
if ! timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch 2>&1; then
ERROR_MSG="Deployment rollout timed out after 300 seconds"
echo "$ERROR_MSG"
# Capture diagnostics
echo "=== Pod Status ==="
kubectl get pods -n ${NAMESPACE} -o wide || true
echo "=== Deployment Status ==="
kubectl get deployments -n ${NAMESPACE} || true
echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV
exit 1
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix rollout status command to target real deployments

kubectl rollout status deployment -n "$NAMESPACE" exits with “exactly one NAME is required” because the resource name is missing. That means the step will always fail, even when the rollout succeeds. Please enumerate the deployments (or otherwise name the specific ones) and call kubectl rollout status on each so the rollout wait actually works. For example:

-        if ! timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch 2>&1; then
+        if ! kubectl get deployments -n "$NAMESPACE" -o name | while read -r deployment; do
+          timeout 300s kubectl rollout status "$deployment" -n "$NAMESPACE"
+        done; then

Any solution that supplies valid resource names is fine; the key is to stop invoking kubectl rollout status without a target.(kubernetes.io)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml lines 470-483: the kubectl
rollout status call is missing a resource name and therefore always fails;
replace the single call with logic that lists deployments in $NAMESPACE and
iterates over each deployment name, calling kubectl rollout status
deployment/<name> -n "$NAMESPACE" (with the existing timeout) for each; handle
the case of zero deployments (skip or treat as success), and preserve the
existing diagnostics and exit-on-failure behavior if any rollout status call
times out.

Comment on lines 488 to 501
uses: actions/setup-python@v4
with:
python-version: '3.10'

- name: Install LogAI
if: always() && steps.deploy-operator-step.outcome == 'failure'
run: |
pip install --upgrade pip
pip install logai || echo "LogAI installation failed, will use fallback"
chmod +x .github/scripts/extract_log_errors.py
- name: Extract Errors from Logs using LogAI
if: always() && steps.deploy-operator-step.outcome == 'failure'
id: extract-errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Upgrade setup-python to a Node20-compatible release

actions/setup-python@v4 still runs on the retired Node16 runtime, which is no longer provided on hosted runners. To avoid unexpected job failures, bump both occurrences to a Node20-based release (v5 or newer). For example:

-      uses: actions/setup-python@v4
+      uses: actions/setup-python@v5

Do the same in the later step so both paths stay supported.(github.blog)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install LogAI
if: always() && steps.deploy-operator-step.outcome == 'failure'
run: |
pip install --upgrade pip
pip install logai || echo "LogAI installation failed, will use fallback"
chmod +x .github/scripts/extract_log_errors.py
- name: Extract Errors from Logs using LogAI
if: always() && steps.deploy-operator-step.outcome == 'failure'
id: extract-errors
uses: actions/setup-python@v5
with:
python-version: '3.10'
- name: Install LogAI
if: always() && steps.deploy-operator-step.outcome == 'failure'
run: |
pip install --upgrade pip
pip install logai || echo "LogAI installation failed, will use fallback"
chmod +x .github/scripts/extract_log_errors.py
- name: Extract Errors from Logs using LogAI
if: always() && steps.deploy-operator-step.outcome == 'failure'
id: extract-errors
🧰 Tools
🪛 actionlint (1.7.8)

488-488: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml lines 488-501: the workflow
uses actions/setup-python@v4 which runs on the retired Node16 runtime; update
this action to a Node20-compatible release (e.g., actions/setup-python@v5 or
newer) in this step and also update the later identical occurrence in the file
so both paths use the Node20-based release to prevent runner failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants