Skip to content

Conversation

@NamanMahor
Copy link
Contributor

@NamanMahor NamanMahor commented Oct 15, 2025

I have done few fixes for PLAT-236: Test clickhouse connector for wrong user inputs

  1. Added a TCP dial check to validate host and port reachability before connecting, providing a clear error message when either is incorrect.
  2. handling "operation timed out"
  3. EOF and handsake [21] issue when ssl is false. giving proper message.
  4. Prevented redundant retries by skipping the second connection attempt when the protocol is already HTTP

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@NamanMahor NamanMahor changed the title Fix for wrong host and port issue Handler failure cases for clickhouse connector Oct 15, 2025
@NamanMahor NamanMahor marked this pull request as ready for review October 15, 2025 19:03
@NamanMahor NamanMahor self-assigned this Oct 16, 2025
Comment on lines 753 to 760
if conf.Host != "" && conf.Port != 0 {
target := net.JoinHostPort(conf.Host, fmt.Sprintf("%d", conf.Port))
conn, err := net.DialTimeout("tcp", target, 5*time.Second)
if err != nil {
return nil, fmt.Errorf("Error: %w - please check that the host and port are correct %s", err, target)
}
conn.Close()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will add a redundant round-trip every time we open a Clickhouse handle. Can you look into ways where it only does extra checks/validation in failure scenarios (i.e. the non-happy path where the connection does not open successfully)? Ideally if the connector is configured correctly, we shouldn't be doing additional requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it’s redundant but quite fast (a few ms). If we lower the dial timeout to <30s, we can move this check to the error path so it only runs on failures/ or may not need it. We can also make dial timeout < 30 with check host has suffix clickHouse.cloud , but i think in-house clusters can scale to zero.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can reduce the timeout. But I don't understand why ping uses the full dial timeout when the host is unreachable? It feels like there's a difference between a) the dial taking a long time, b) the dial failing fast and getting retried. Is it because the Clickhouse driver treats those as the same? If that's the case, should we consider patching the Clickhouse driver to have two different settings for this?

return nil, err
// Detect SSL/TLS mismatch (common causes: "read: EOF" or TLS Alert [21])
if strings.Contains(err.Error(), "EOF") || strings.Contains(err.Error(), "[handshake] unexpected packet [21]") {
return nil, fmt.Errorf("Error: %w — this usually happens due to SSL/TLS mismatch", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In error wrapping in general, the %w should come at the end of the error message. It also shouldn't add an explicit Error: prefix in the error message – that kind of formatting should be done by the UI/CLI when printing the error (internally, we don't know if the error may get wrapped some more times by other callers, which would make the injected Error: look like broken formatting). E.g.:

			return nil, fmt.Errorf("handshake failed (this usually happens due to SSL/TLS mismatch): %w", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

conn, err := net.DialTimeout("tcp", target, 5*time.Second)
if err != nil {
return nil, fmt.Errorf("Error: %w - please check that the host and port are correct %s", err, target)
return nil, fmt.Errorf("please check that the host and port are correct %s: %w - ", target, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit redundant - at the end of the error message

Comment on lines 753 to 760
if conf.Host != "" && conf.Port != 0 {
target := net.JoinHostPort(conf.Host, fmt.Sprintf("%d", conf.Port))
conn, err := net.DialTimeout("tcp", target, 5*time.Second)
if err != nil {
return nil, fmt.Errorf("Error: %w - please check that the host and port are correct %s", err, target)
}
conn.Close()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can reduce the timeout. But I don't understand why ping uses the full dial timeout when the host is unreachable? It feels like there's a difference between a) the dial taking a long time, b) the dial failing fast and getting retried. Is it because the Clickhouse driver treats those as the same? If that's the case, should we consider patching the Clickhouse driver to have two different settings for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants