Skip to content

Add pyhive connection timeout#7300

Open
ryanbordo wants to merge 5 commits intoapache:masterfrom
ryanbordo:add-pyhive-connection-timeout
Open

Add pyhive connection timeout#7300
ryanbordo wants to merge 5 commits intoapache:masterfrom
ryanbordo:add-pyhive-connection-timeout

Conversation

@ryanbordo
Copy link

@ryanbordo ryanbordo commented Jan 11, 2026

Why are the changes needed?

Adds an optional parameter to pyhive hive connection class init to allow setting the socket connection timeout. By default, thrift socket connections have no timeout which causes connections to hang indefinitely when something goes wrong. This helps us in dbt by allowing for multi-threaded management of connections while continuing to rely on pyhive thrift setup.

Reference thrift lines:
https://github.com/apache/thrift/blob/0d18fb2e97a00fc56997fa059d6819e54cdff64b/lib/py/src/transport/THttpClient.py#L130
https://github.com/apache/thrift/blob/0d18fb2e97a00fc56997fa059d6819e54cdff64b/lib/py/src/transport/TSocket.py#L111

How was this patch tested?

Manually on TSocket path and added unit test

Was this patch authored or co-authored using generative AI tooling?

No

@ryanbordo ryanbordo marked this pull request as ready for review January 11, 2026 11:12
@cxzl25 cxzl25 requested a review from Copilot January 11, 2026 14:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional connection_timeout parameter to PyHive's Hive connection class to prevent connections from hanging indefinitely when issues occur. The timeout is specified in milliseconds and applies to both HTTP and TSocket transport types.

Changes:

  • Added connection_timeout parameter to the Connection.__init__ method signature and documentation
  • Implemented timeout setting for both HTTP (THttpClient) and TSocket transport paths
  • Added integration test to verify connections work with the timeout parameter

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
python/pyhive/hive.py Added connection_timeout parameter to Connection class, implemented setTimeout calls for both HTTP and socket transports, and added parameter documentation
python/pyhive/tests/test_hive.py Added integration test to verify connection works with timeout parameter

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (526a8aa) to head (cb26cb8).

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7300   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         698     698           
  Lines       43646   43646           
  Branches     5894    5894           
======================================
  Misses      43646   43646           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

auth = 'NONE'
socket = thrift.transport.TSocket.TSocket(host, port)
if connection_timeout:
socket.setTimeout(connection_timeout)
Copy link

@fbertsch fbertsch Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this sets both connection and socket timeout. There's a nice description of the difference here.

Suggested change
socket.setTimeout(connection_timeout)
socket.setConnectTimeout(connection_timeout)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good mention. My goal with this was originally to manage the socket timeout. However, I think I will leave this way to not get too granular.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java, a socket can have different connectTimeout, readTimeout, and writeTimeout values. Does Python have an equivalent? Generally, we want a small connectTimeout, and long readTimeout and writeTimeout, so it can fail fast on connecting dead server and failover to health one, if it has multiple server instances

Copy link
Author

@ryanbordo ryanbordo Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my short research, it looks like the short answer is no, but possible with an explicit create_connection At least, python thrift is only using supporting one timeout and not doing this.
https://github.com/apache/thrift/blob/c99d09a231648d72e05a89d80281b38c9d0d1b9a/lib/py/src/transport/TSocket.py#L145-L147

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can call socket.setTimeout after the socket is connected, right? if so, it's easy to implement connectTimeout, socketTimeout

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be possible. Are you suggesting to add in here or in the Thrift library? I also want to call out that a user could choose to do this already by passing a thrift transport object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user could choose to do this already by passing a thrift transport object.

the user passed thrift_transport should already be established? if so, the user takes responsibility to set connectTimeout.

I think what we should do here is

  1. add two args connect_timeout and socket_timeout
  2. for thrift_transport, suppose it is already established, just set socket_timeout
  3. for host/port case, use connect_timeout for the connecting phase, and change it to socket_timeout after the connection is established

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection from the thrift_transport actually doesn't need to be established. And in our dbt code, we have a path that relies on this connection being opened in hive.connect function. Unfortunately, the case for passing a transport object is ambiguous since thrift python only has one setTimeout for both. It might make sense to skip both timeouts when a transport is passed? Regardless, I'll get started on the proposal you sent but let me know if you have any additional thoughts.

Copy link
Member

@pan3793 pan3793 Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I read the code, self._transport.open() happens in Connection __init__, so we can call setTimeout(connect_time) before it, and call setTimeout(socket_time), same for both host/port and thrift_transport cases

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for the feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants