Skip to content

Conversation

@aarshi0301
Copy link

@aarshi0301 aarshi0301 commented Oct 21, 2025

Change description

Description here

Add Observability Metrics for Bulk Operations

Adds comprehensive observability metrics for bulk create/update operations to help identify performance bottlenecks.

Features Added

  • Timing metrics for diff calculation, lineage calculation, validation, ingestion, notification, audit logging
  • Payload metrics (entity count and bytes)
  • Relationship and array attribute tracking
  • Operation counters and gauges
  • Error tracking with error type classification

Type of change

  • Bug fix (fixes an issue)
  • New feature (adds functionality)

Related issues

Fix #1

Helm Config Changes for Running Tests (Staging PR)

Does this PR require Helm config changes for testing?

  • Tests are NOT required for this commit. (You can proceed with the PR.) ✅
  • No, Helm config changes are not needed. (You can proceed with the PR.) ✅
  • Yes, I have already updated the config-values on enpla9up36. (You can proceed with the PR.) ✅
  • Yes, but I have NOT updated the config-values. (Please update them before proceeding; or, tests will run with default values.)⚠️

Checklists

Development

  • Lint rules pass locally
  • Application changes have been tested thoroughly
  • Automated tests covering modified code pass

Security

  • Security impact of change has been considered
  • Code follows company security practices and guidelines

Code review

  • Pull request has a descriptive title and context useful to a reviewer. Screenshots or screencasts are attached as necessary
  • "Ready for review" label attached and reviewers assigned
  • Changes have been reviewed by at least one other contributor
  • Pull request linked to task tracker where applicable

Note

Adds Micrometer-based observability for bulk create/update (payload, timing, relationships, gauges, errors) and lineage timing, with docs and minimal API hooks.

  • Observability (Micrometer):
    • Introduces AtlasObservabilityService and AtlasObservabilityData to emit timers/counters/distributions and gauges (operations_in_progress, total_operations).
    • Records duration, payload size/bytes, relationship/attribute counts, operation status/errors; structured error logs with MDC.
  • Payload analysis:
    • Adds PayloadAnalyzer to compute array_relationships, per-relationship counts, and array attribute counts from AtlasEntitiesWithExtInfo.
  • Create/Update instrumentation:
    • Instruments AtlasEntityStoreV2#createOrUpdate: operation start/end/failure, timings (validation, diff, ingestion), payload metrics, RequestContext counters; integrates PayloadAnalyzer and records Prometheus-safe tags.
    • Exposes AtlasEntityStream#getEntitiesWithExtInfo for analysis.
  • Lineage timing:
    • Captures and accumulates lineage calculation time in DeleteHandlerV1 and EntityGraphMapper; adds lineageCalcTime fields/methods to RequestContext.
  • Diff timing:
    • Wraps AtlasEntityComparator#getDiffResult with perf metric.
  • Docs:
    • Adds observability.mdc with full metric catalog, Grafana queries, dashboards, and alerting examples.

Written by Cursor Bugbot for commit b151f12. This will update automatically on new commits. Configure here.

- Add AtlasObservabilityData class with all required metrics fields
- Add AtlasObservabilityService using Micrometer for metrics recording
- Add PayloadAnalyzer for array relationship and attribute analysis
- Instrument AtlasEntityStoreV2.createOrUpdate() with timing and payload metrics
- Add observability implementation plan documentation
- Capture trace_id, client_origin, timing metrics, and array counts
- Support both relationship arrays and regular attribute arrays
- Remove traceId, agentId, vertexIds, assetGuids from Prometheus metrics to prevent cardinality explosion
- Keep high-cardinality fields only for error logging via logErrorDetails()
- Update observability implementation plan with cardinality management guidelines
- Add error handling around observability metrics recording
- Add array attributes analysis similar to relationship attributes
- Set MDC filter key 'atlas-observability' for error logging
- Use OBSERVABILITY logger for proper log routing
- Add proper MDC cleanup in finally block
- Follows same pattern as other Atlas loggers (audit, metrics, tasks)
- Replace manual MDC.put/remove with MDCScope.of()
- Follows Atlas best practices for MDC management
- Automatically restores previous MDC state on close
- Cleaner and more robust than try/finally approach
- Record individual relationship types (process, inputs, outputs, etc.) with counts
- Record individual attribute types with counts
- Use Counter metrics with relationship_name/attribute_name tags
- Enables detailed analysis: process:1, inputs:3, outputs:2, etc.
- Maintains existing total count metrics for aggregation
cursor[bot]

This comment was marked as outdated.

@aarshi0301 aarshi0301 changed the title Bulkobservability Observability | Bulk Analyzer Oct 21, 2025
@aarshi0301 aarshi0301 marked this pull request as draft October 21, 2025 05:08
@aarshi0301 aarshi0301 marked this pull request as ready for review October 31, 2025 05:07
Copilot finished reviewing on behalf of aarshi0301 November 11, 2025 08:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive Micrometer-based observability instrumentation for Atlas bulk create/update operations, introducing metrics tracking for payload analysis, timing breakdowns, and operational status monitoring.

Key Changes:

  • Introduced AtlasObservabilityService with Micrometer metrics (gauges, timers, counters, distribution summaries) for tracking operation performance, payload characteristics, and errors
  • Added timing instrumentation across entity pipeline stages (validation, diff calculation, ingestion, lineage calculation)
  • Created PayloadAnalyzer to compute payload metrics including sizes, relationships, and attributes

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
repository/src/main/java/org/apache/atlas/observability/AtlasObservabilityService.java New service providing Micrometer-based metrics recording with SLOs, percentiles, and low-cardinality tags for Prometheus compatibility
repository/src/main/java/org/apache/atlas/observability/PayloadAnalyzer.java New analyzer for extracting payload metrics including entity counts, relationships, and attributes
common/src/main/java/org/apache/atlas/observability/AtlasObservabilityData.java New data model holding observability metrics including timing, payload, and request metadata
repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityStoreV2.java Instrumented createOrUpdate method with timing capture, payload analysis, and metrics recording with error handling
server-api/src/main/java/org/apache/atlas/RequestContext.java Added lineage calculation timing accumulation fields and methods
repository/src/main/java/org/apache/atlas/repository/store/graph/v2/EntityGraphMapper.java Added lineage calculation timing in addHasLineage method
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/DeleteHandlerV1.java Added lineage calculation timing in deletion handlers
repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityComparator.java Wrapped diff calculation with performance metric recording
repository/src/main/java/org/apache/atlas/repository/store/graph/v2/AtlasEntityStream.java Added accessor method for entitiesWithExtInfo to enable payload analysis
observability.mdc Comprehensive documentation of all metrics with PromQL queries, dashboard recommendations, and alerting rules
.github/workflows/maven.yml Whitespace-only formatting changes to CI workflow

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aarshi0301 aarshi0301 requested a review from Copilot November 11, 2025 09:02
Copilot finished reviewing on behalf of aarshi0301 November 11, 2025 09:04
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

observabilityService.recordArrayRelationships(observabilityData);
observabilityService.recordArrayAttributes(observabilityData);
observabilityService.recordTimingMetrics(observabilityData);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Missing Performance Metrics for Key Operations

The observability metrics for notificationTime and auditLogTime are never captured in the createOrUpdate method. The entityChangeNotifier.onEntitiesMutated() and entityChangeNotifier.notifyDifferentialEntityChanges() calls at lines 1786-1788 represent notification processing, but their duration is not measured and set via observabilityData.setNotificationTime(). Similarly, audit logging time is never tracked. This means these metrics will always report 0, making them useless for identifying performance bottlenecks in notification and audit logging.

Fix in Cursor Fix in Web

Copilot finished reviewing on behalf of aarshi0301 November 11, 2025 09:47
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1649 to 1686
try {
// Timing: Lineage calculation
long lineageCalcStart = System.currentTimeMillis();

if (RequestContext.get().skipHasLineageCalculation()) {
return;
}
if (RequestContext.get().skipHasLineageCalculation()) {
return;
}

for (AtlasVertex vertexToBeDeleted : vertices) {
if (ACTIVE.equals(getStatus(vertexToBeDeleted))) {
AtlasEntityType entityType = typeRegistry.getEntityTypeByName(getTypeName(vertexToBeDeleted));
boolean isProcess = entityType.getTypeAndAllSuperTypes().contains(PROCESS_SUPER_TYPE);
boolean isCatalog = entityType.getTypeAndAllSuperTypes().contains(DATA_SET_SUPER_TYPE);
for (AtlasVertex vertexToBeDeleted : vertices) {
if (ACTIVE.equals(getStatus(vertexToBeDeleted))) {
AtlasEntityType entityType = typeRegistry.getEntityTypeByName(getTypeName(vertexToBeDeleted));
boolean isProcess = entityType.getTypeAndAllSuperTypes().contains(PROCESS_SUPER_TYPE);
boolean isCatalog = entityType.getTypeAndAllSuperTypes().contains(DATA_SET_SUPER_TYPE);

if (isCatalog || isProcess) {
if (isCatalog || isProcess) {

Iterator<AtlasEdge> edgeIterator = vertexToBeDeleted.getEdges(AtlasEdgeDirection.BOTH, PROCESS_EDGE_LABELS).iterator();
Iterator<AtlasEdge> edgeIterator = vertexToBeDeleted.getEdges(AtlasEdgeDirection.BOTH, PROCESS_EDGE_LABELS).iterator();

Set<AtlasEdge> edgesToBeDeleted = new HashSet<>();
Set<AtlasEdge> edgesToBeDeleted = new HashSet<>();

while (edgeIterator.hasNext()) {
AtlasEdge edge = edgeIterator.next();
if (ACTIVE.equals(getStatus(edge))) {
edgesToBeDeleted.add(edge);
while (edgeIterator.hasNext()) {
AtlasEdge edge = edgeIterator.next();
if (ACTIVE.equals(getStatus(edge))) {
edgesToBeDeleted.add(edge);
}
}
}

resetHasLineageOnInputOutputDelete(edgesToBeDeleted, vertexToBeDeleted);
resetHasLineageOnInputOutputDelete(edgesToBeDeleted, vertexToBeDeleted);
}
}
}

// Record lineage calculation time
long lineageCalcTime = System.currentTimeMillis() - lineageCalcStart;
RequestContext.get().addLineageCalcTime(lineageCalcTime);
} finally {
RequestContext.get().endMetricRecord(metricRecorder);
}
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lineage timing measurement wraps the entire method body in a try-finally block. However, if skipHasLineageCalculation() returns true (line 1653), the method returns early, and the timing recorded (difference of ~0ms) will be misleading since no actual lineage calculation occurred. This could skew the observability metrics.

Consider moving the timing start after the early return check, or add a flag to indicate whether lineage calculation was actually performed.

Copilot uses AI. Check for mistakes.
Comment on lines +1842 to +1856
} catch (AtlasBaseException e) {
// Record operation failure
if (!operationRecorded) {
String errorCode = e.getAtlasErrorCode() != null ? e.getAtlasErrorCode().getErrorCode() : "UNKNOWN_ERROR";
observabilityService.recordOperationFailure("createOrUpdate", errorCode);
operationRecorded = true;
}
throw e;
} catch (Exception e) {
// Record operation failure
if (!operationRecorded) {
observabilityService.recordOperationFailure("createOrUpdate", e.getClass().getSimpleName());
operationRecorded = true;
}
throw new AtlasBaseException(e);
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception handling logic has redundant catch blocks. When an AtlasBaseException is caught at line 1842, it's re-thrown after recording the failure. Then a generic Exception catch block at line 1850 wraps it in a new AtlasBaseException. This means an AtlasBaseException will be caught, re-thrown, caught again by the second catch block, and wrapped in another AtlasBaseException.

Consider removing the AtlasBaseException catch block and only keeping the generic Exception catch block, or ensure the second catch block doesn't catch AtlasBaseException by checking the exception type.

Copilot uses AI. Check for mistakes.
Comment on lines +1858 to +1866
// Ensure operationsInProgress is decremented even if recordOperationEnd/Failure wasn't called
if (!operationRecorded) {
try {
observabilityService.recordOperationFailure("createOrUpdate", "unexpected_error");
} catch (Exception e) {
// Log but don't throw - we're in finally block
LOG.warn("Failed to record operation failure in finally block", e);
}
}
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The finally block attempts to record an operation failure if operationRecorded is false, but this could fail and throw an exception. In a finally block, throwing an exception will suppress any exception that was being thrown from the try block, potentially masking the original error.

The code logs the exception but catching and logging in a finally block is risky. Consider ensuring that no exceptions escape from the finally block by wrapping all operations in a try-catch that swallows all exceptions (which is already done), but verify this pattern doesn't mask critical failures.

Copilot uses AI. Check for mistakes.
histogram_quantile(0.95, rate(atlas_observability_bulk_duration_seconds_bucket[5m])) < 5
```

#### Payload Size SLO Compliance (95% under 1MB)
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the PromQL query comment. The value 1048576 is 1 MiB (mebibyte, 1024^2 bytes), not 1MB (megabyte, 1000^2 bytes). The comment should say "1MiB" for accuracy.

Suggested change
#### Payload Size SLO Compliance (95% under 1MB)
#### Payload Size SLO Compliance (95% under 1MiB)

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


// Analyze array relationships
analyzeArrayRelationships(entitiesWithExtInfo, data);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Broken Payload Size Observability

The analyzePayload() method never sets payloadRequestBytes on the AtlasObservabilityData object. The field remains at its default value of 0, causing the payload_bytes metric to always report zero instead of the actual payload size in bytes. This breaks the payload size monitoring feature described in the observability documentation.

Fix in Cursor Fix in Web

@aarshi0301 aarshi0301 closed this Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants