Skip to content

Fix #729 and #731: Telemetry lifecycle management#734

Open
msrathore-db wants to merge 2 commits intomainfrom
fix/telemetry-lifecycle-issues-729-731
Open

Fix #729 and #731: Telemetry lifecycle management#734
msrathore-db wants to merge 2 commits intomainfrom
fix/telemetry-lifecycle-issues-729-731

Conversation

@msrathore-db
Copy link
Contributor

What type of PR is this?

  • Refactor
  • Feature
  • Bug Fix
  • Other

Description

Issue #729: Connection failures hang CICD jobs for 15+ minutes, cannot be cancelled

  • Telemetry forced on even with enable_telemetry=False
  • 900-second timeout causes long hangs
  • Blocking shutdown prevents process exit

Issue #731: Race condition causes AttributeError: 'NoneType' object has no attribute 'request'

  • Async telemetry executes after _http_client is garbage collected
  • No null check before calling pool_manager.request()

Changes

Issue #729 (3 fixes)

  1. Respect enable_telemetry parameter (telemetry_client.py:722-729, client.py:321)
    - Added parameter to connection_failure_log() with early return if disabled
    - Connection passes user's preference through
  2. Reduce timeout to 30s (telemetry_client.py:299)
    - Changed from 900s to 30s for faster failure
  3. Non-blocking shutdown (telemetry_client.py:707)
    - Changed executor.shutdown(wait=True) to wait=False

Issue #731 (3 fixes)

  1. Null-safety check (unified_http_client.py:296-298)
    - Check if pool_manager is None before calling .request()
    - Raise RequestError with clear message
  2. Add del cleanup (telemetry_client.py:439-451)
    - Close _http_client during garbage collection
    - Ensures eventual resource cleanup
  3. Document lifecycle (telemetry_client.py:430-435)
    - Added docstring explaining why _http_client isn't closed in close()
    - Import RequestError from databricks.sql.exc

Impact

  • Connection failures exit in <1s instead of 15+ minutes
  • CICD jobs no longer hang
  • No more race condition AttributeError
  • enable_telemetry=False now respected in all scenarios

Backward Compatibility

Fully backward compatible. No API changes, only behavior improvements.

Closes #729
Closes #731

How is this tested?

  • Unit tests
  • E2E Tests
  • Manually
  • N/A

Related Tickets & Documents

@github-actions
Copy link

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Copy link
Contributor

@samikshya-db samikshya-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
Per reviewer feedback on PR #734:

1. Revert timeout from 30s back to 900s (line 299)
   - Reviewer noted that with wait=False, timeout is not critical
   - The async nature and wait=False handle the exit speed

2. Revert telemetry_enabled parameter back to True (line 734)
   - Reviewer noted this is redundant given the early return
   - If enable_telemetry=False, we return early (line 729)
   - Line 734 only executes when enable_telemetry=True
   - Therefore using the parameter here is unnecessary

These changes address the reviewer's valid technical concerns while
keeping the core fixes intact:
- wait=False for non-blocking shutdown (critical for Issue #729)
- Early return when enable_telemetry=False (critical for Issue #729)
- All Issue #731 fixes (null-safety, __del__, documentation)

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>
@msrathore-db msrathore-db force-pushed the fix/telemetry-lifecycle-issues-729-731 branch from d4f9054 to 471a551 Compare February 5, 2026 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sporadic 'NoneType' object has no attribute 'request' error from telemetry — usage question ERROR: Rapid fire connection error without exit

2 participants