-
Notifications
You must be signed in to change notification settings - Fork 151
Handler failure cases for clickhouse connector #8133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Handler failure cases for clickhouse connector #8133
Conversation
| if conf.Host != "" && conf.Port != 0 { | ||
| target := net.JoinHostPort(conf.Host, fmt.Sprintf("%d", conf.Port)) | ||
| conn, err := net.DialTimeout("tcp", target, 5*time.Second) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("Error: %w - please check that the host and port are correct %s", err, target) | ||
| } | ||
| conn.Close() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will add a redundant round-trip every time we open a Clickhouse handle. Can you look into ways where it only does extra checks/validation in failure scenarios (i.e. the non-happy path where the connection does not open successfully)? Ideally if the connector is configured correctly, we shouldn't be doing additional requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it’s redundant but quite fast (a few ms). If we lower the dial timeout to <30s, we can move this check to the error path so it only runs on failures/ or may not need it. We can also make dial timeout < 30 with check host has suffix clickHouse.cloud , but i think in-house clusters can scale to zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can reduce the timeout. But I don't understand why ping uses the full dial timeout when the host is unreachable? It feels like there's a difference between a) the dial taking a long time, b) the dial failing fast and getting retried. Is it because the Clickhouse driver treats those as the same? If that's the case, should we consider patching the Clickhouse driver to have two different settings for this?
| return nil, err | ||
| // Detect SSL/TLS mismatch (common causes: "read: EOF" or TLS Alert [21]) | ||
| if strings.Contains(err.Error(), "EOF") || strings.Contains(err.Error(), "[handshake] unexpected packet [21]") { | ||
| return nil, fmt.Errorf("Error: %w — this usually happens due to SSL/TLS mismatch", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: In error wrapping in general, the %w should come at the end of the error message. It also shouldn't add an explicit Error: prefix in the error message – that kind of formatting should be done by the UI/CLI when printing the error (internally, we don't know if the error may get wrapped some more times by other callers, which would make the injected Error: look like broken formatting). E.g.:
return nil, fmt.Errorf("handshake failed (this usually happens due to SSL/TLS mismatch): %w", err)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| conn, err := net.DialTimeout("tcp", target, 5*time.Second) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("Error: %w - please check that the host and port are correct %s", err, target) | ||
| return nil, fmt.Errorf("please check that the host and port are correct %s: %w - ", target, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit redundant - at the end of the error message
| if conf.Host != "" && conf.Port != 0 { | ||
| target := net.JoinHostPort(conf.Host, fmt.Sprintf("%d", conf.Port)) | ||
| conn, err := net.DialTimeout("tcp", target, 5*time.Second) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("Error: %w - please check that the host and port are correct %s", err, target) | ||
| } | ||
| conn.Close() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can reduce the timeout. But I don't understand why ping uses the full dial timeout when the host is unreachable? It feels like there's a difference between a) the dial taking a long time, b) the dial failing fast and getting retried. Is it because the Clickhouse driver treats those as the same? If that's the case, should we consider patching the Clickhouse driver to have two different settings for this?
I have done few fixes for PLAT-236: Test clickhouse connector for wrong user inputs
Checklist: