-
Notifications
You must be signed in to change notification settings - Fork 693
feat: dynamic error extraction #4196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughA new Python script for extracting error messages from logs is introduced with dual extraction strategies (LogAI-based and regex fallback), and a GitHub Actions workflow is enhanced to integrate this error extraction, log collection, and GitHub annotation creation for deployment and test failures. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.github/scripts/extract_log_errors.py(1 hunks).github/workflows/container-validation-backends.yml(5 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/container-validation-backends.yml
488-488: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
741-741: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.
.github/workflows/container-validation-backends.yml
[error] 1-1: Trailing whitespace detected and fixed by pre-commit. Please re-run to ensure CI passes.
[error] 1-1: Check-shebang-scripts-are-executable: .github/scripts/extract_log_errors.py has a shebang but is not executable.
.github/scripts/extract_log_errors.py
[error] 14-15: Ruff: F401 import 'NNAnomalyDetector' and 'AnomalyDetectionWorkflow' imported but unused in extract_log_errors.py.
[error] 14-14: Ruff: F401 import 'NNAnomalyDetector' imported but unused.
[error] 15-15: Ruff: F401 import 'AnomalyDetectionWorkflow' imported but unused.
[error] 1-1: isort: Files were modified by this hook and need reformatting.
[error] 1-1: Black: 1 file reformatted; run pre-commit again to apply changes.
🪛 Ruff (0.14.3)
.github/scripts/extract_log_errors.py
1-1: Shebang is present but file is not executable
(EXE001)
29-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
45-50: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
67-67: Probable insecure usage of temporary file or directory: "/tmp/analysis.log"
(S108)
100-100: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: operator (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
| @@ -0,0 +1,225 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set the executable bit for this script
Because the file starts with a shebang, the hooks expect it to be executable. Right now it isn’t, so check-shebang-scripts-are-executable fails. Please mark it executable in git (e.g., git update-index --chmod=+x .github/scripts/extract_log_errors.py) so the lint gate passes.(docs.astral.sh)
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.
[error] 1-1: isort: Files were modified by this hook and need reformatting.
[error] 1-1: Black: 1 file reformatted; run pre-commit again to apply changes.
🪛 Ruff (0.14.3)
1-1: Shebang is present but file is not executable
(EXE001)
🤖 Prompt for AI Agents
In .github/scripts/extract_log_errors.py at line 1: the script has a shebang but
is not executable, causing the check-shebang-scripts-are-executable hook to
fail; mark the file executable in git (for example run git update-index
--chmod=+x .github/scripts/extract_log_errors.py) and commit the change so the
executable bit is preserved in the repository.
| from logai.applications.openset.anomaly_detection import AnomalyDetectionWorkflow | ||
| from logai.dataloader.data_loader import FileDataLoader | ||
| from logai.analysis.nn_anomaly_detector import NNAnomalyDetector | ||
| from logai.preprocess.preprocessor import Preprocessor | ||
| from logai.information_extraction.log_parser import LogParser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused LogAI imports
AnomalyDetectionWorkflow and NNAnomalyDetector are never referenced, which is why Ruff reports F401 and the lint gate fails. Drop the unused symbols (or use them) so the script passes lint. Example fix:
- from logai.applications.openset.anomaly_detection import AnomalyDetectionWorkflow
- from logai.analysis.nn_anomaly_detector import NNAnomalyDetector
from logai.preprocess.preprocessor import PreprocessorCommittable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4196/merge) by nv-nmailhot.
[error] 14-15: Ruff: F401 import 'NNAnomalyDetector' and 'AnomalyDetectionWorkflow' imported but unused in extract_log_errors.py.
[error] 14-14: Ruff: F401 import 'NNAnomalyDetector' imported but unused.
[error] 15-15: Ruff: F401 import 'AnomalyDetectionWorkflow' imported but unused.
🤖 Prompt for AI Agents
.github/scripts/extract_log_errors.py around lines 14 to 18: the imports
AnomalyDetectionWorkflow and NNAnomalyDetector are unused and cause Ruff F401
lint failures; either remove those two symbols from the import line or reference
them where intended (e.g., instantiate/use them) so they are no longer unused —
update the import statement to only import FileDataLoader, Preprocessor, and
LogParser (or implement the intended usage of the removed symbols) and run the
linter to confirm the F401 error is resolved.
|
|
||
| # Common error patterns for fallback | ||
| ERROR_PATTERNS = [ | ||
| r"Error:?\s+(.+?)(?:\n|$)", | ||
| r"ERROR[:\s]+(.+?)(?:\n|$)", | ||
| r"Failed\s+(.+?)(?:\n|$)", | ||
| r"failed\s+(.+?)(?:\n|$)", | ||
| r"FAILED[:\s]+(.+?)(?:\n|$)", | ||
| r"Exception[:\s]+(.+?)(?:\n|$)", | ||
| r"Traceback \(most recent call last\):(.+?)(?:\n\n|$)", | ||
| r"fatal[:\s]+(.+?)(?:\n|$)", | ||
| r"FATAL[:\s]+(.+?)(?:\n|$)", | ||
| r"panic[:\s]+(.+?)(?:\n|$)", | ||
| r"timed out\s+(.+?)(?:\n|$)", | ||
| r"timeout\s+(.+?)(?:\n|$)", | ||
| ] | ||
|
|
||
| # Patterns to identify context around errors | ||
| CONTEXT_PATTERNS = [ | ||
| r"(exit code \d+)", | ||
| r"(status code \d+)", | ||
| r"(HTTP \d{3})", | ||
| r"(line \d+)", | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotate class-level patterns as ClassVar
Ruff flags ERROR_PATTERNS and CONTEXT_PATTERNS because they’re mutable class attributes without a ClassVar annotation. Add the import and annotate them so it’s explicit that they are shared across instances:
-from typing import List, Dict, Any
+from typing import List, Dict, Any, ClassVar
@@
- ERROR_PATTERNS = [
+ ERROR_PATTERNS: ClassVar[List[str]] = [
@@
- CONTEXT_PATTERNS = [
+ CONTEXT_PATTERNS: ClassVar[List[str]] = [This silences RUF012 and documents the intended shared state.(docs.astral.sh)
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.3)
29-42: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
45-50: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
🤖 Prompt for AI Agents
In .github/scripts/extract_log_errors.py around lines 27 to 50, the
module-level/pseudo-class constants ERROR_PATTERNS and CONTEXT_PATTERNS are
flagged by Ruff as mutable class attributes without a ClassVar annotation;
import typing.ClassVar and annotate both variables as ClassVar[list[str]] (or
ClassVar[List[str]] if using from typing import List) so they are explicitly
declared as shared/immutable class-level patterns and Ruff RUF012 is silenced.
| echo "Waiting for deployments to be ready..." | ||
| if ! timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch 2>&1; then | ||
| ERROR_MSG="Deployment rollout timed out after 300 seconds" | ||
| echo "$ERROR_MSG" | ||
| # Capture diagnostics | ||
| echo "=== Pod Status ===" | ||
| kubectl get pods -n ${NAMESPACE} -o wide || true | ||
| echo "=== Deployment Status ===" | ||
| kubectl get deployments -n ${NAMESPACE} || true | ||
| echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix rollout status command to target real deployments
kubectl rollout status deployment -n "$NAMESPACE" exits with “exactly one NAME is required” because the resource name is missing. That means the step will always fail, even when the rollout succeeds. Please enumerate the deployments (or otherwise name the specific ones) and call kubectl rollout status on each so the rollout wait actually works. For example:
- if ! timeout 300s kubectl rollout status deployment -n $NAMESPACE --watch 2>&1; then
+ if ! kubectl get deployments -n "$NAMESPACE" -o name | while read -r deployment; do
+ timeout 300s kubectl rollout status "$deployment" -n "$NAMESPACE"
+ done; thenAny solution that supplies valid resource names is fine; the key is to stop invoking kubectl rollout status without a target.(kubernetes.io)
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml lines 470-483: the kubectl
rollout status call is missing a resource name and therefore always fails;
replace the single call with logic that lists deployments in $NAMESPACE and
iterates over each deployment name, calling kubectl rollout status
deployment/<name> -n "$NAMESPACE" (with the existing timeout) for each; handle
the case of zero deployments (skip or treat as success), and preserve the
existing diagnostics and exit-on-failure behavior if any rollout status call
times out.
| uses: actions/setup-python@v4 | ||
| with: | ||
| python-version: '3.10' | ||
|
|
||
| - name: Install LogAI | ||
| if: always() && steps.deploy-operator-step.outcome == 'failure' | ||
| run: | | ||
| pip install --upgrade pip | ||
| pip install logai || echo "LogAI installation failed, will use fallback" | ||
| chmod +x .github/scripts/extract_log_errors.py | ||
| - name: Extract Errors from Logs using LogAI | ||
| if: always() && steps.deploy-operator-step.outcome == 'failure' | ||
| id: extract-errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrade setup-python to a Node20-compatible release
actions/setup-python@v4 still runs on the retired Node16 runtime, which is no longer provided on hosted runners. To avoid unexpected job failures, bump both occurrences to a Node20-based release (v5 or newer). For example:
- uses: actions/setup-python@v4
+ uses: actions/setup-python@v5Do the same in the later step so both paths stay supported.(github.blog)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| uses: actions/setup-python@v4 | |
| with: | |
| python-version: '3.10' | |
| - name: Install LogAI | |
| if: always() && steps.deploy-operator-step.outcome == 'failure' | |
| run: | | |
| pip install --upgrade pip | |
| pip install logai || echo "LogAI installation failed, will use fallback" | |
| chmod +x .github/scripts/extract_log_errors.py | |
| - name: Extract Errors from Logs using LogAI | |
| if: always() && steps.deploy-operator-step.outcome == 'failure' | |
| id: extract-errors | |
| uses: actions/setup-python@v5 | |
| with: | |
| python-version: '3.10' | |
| - name: Install LogAI | |
| if: always() && steps.deploy-operator-step.outcome == 'failure' | |
| run: | | |
| pip install --upgrade pip | |
| pip install logai || echo "LogAI installation failed, will use fallback" | |
| chmod +x .github/scripts/extract_log_errors.py | |
| - name: Extract Errors from Logs using LogAI | |
| if: always() && steps.deploy-operator-step.outcome == 'failure' | |
| id: extract-errors |
🧰 Tools
🪛 actionlint (1.7.8)
488-488: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml lines 488-501: the workflow
uses actions/setup-python@v4 which runs on the retired Node16 runtime; update
this action to a Node20-compatible release (e.g., actions/setup-python@v5 or
newer) in this step and also update the later identical occurrence in the file
so both paths use the Node20-based release to prevent runner failures.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Improvements