Skip to content

Conversation

@nilekhc
Copy link

@nilekhc nilekhc commented Oct 30, 2025

This PR extends the work of CRL support by consuming the ca-crl config map in zTunnel and validates the request accordingly.

Fixes #1678

@istio-testing
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Oct 30, 2025
@istio-policy-bot
Copy link

😊 Welcome @nilekhc! This is either your first contribution to the Istio ztunnel repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 30, 2025
@nilekhc nilekhc changed the title feat: implements ca-crl in zTunnel feat: implements crl support in zTunnel Oct 30, 2025
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@howardjohn
Copy link
Member

I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 6cfcad3 to c522f5f Compare October 30, 2025 15:33
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this

Hi @howardjohn, Thanks for the comment. The original design I presented in one of the community meetings received some valuable feedback. Based on that, I pivoted the design to propagate CRL information through a ConfigMap, similar to how the ca-root-cert is propogated. I’ve captured these updates in the same design document, under the Implementation section.

This PR simply uses that CRL ConfigMap and validates the request accordingly.

cc @costinm @therealmitchconnors @keithmattix

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from c522f5f to ce5886a Compare October 30, 2025 16:20
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from ce5886a to 2402c4d Compare October 30, 2025 16:34
@istio-testing istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 30, 2025
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@nilekhc nilekhc marked this pull request as ready for review October 30, 2025 16:43
@nilekhc nilekhc requested a review from a team as a code owner October 30, 2025 16:43
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Oct 30, 2025
socket2::SockRef::from(&s).set_tcp_keepalive(&ka)
);
}
#[cfg(target_os = "linux")]
Copy link
Author

@nilekhc nilekhc Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting an error on macos due to cross compilation. I have to add this to some places to fix it. Please let me know if there is a better way to handle this.

@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/retest


/// Close all inbound and outbound connections
/// This is called when certificate revocations are detected
pub fn close_all_connections(&self) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really dangerous and bad behavior. How strongly do you feel about keeping this?

Does any other application do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I don't think we should be closing every single connection just because of one revocation. What does Envoy do?

Copy link
Author

@nilekhc nilekhc Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I addd this since http connections are long lived and does not check updated crl. Is there a better way to handle this? I know envoy has some timeout we can configure. Not comepletly sure how zTunnel handles it. Happy to tweak this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition is to do nothing. But at the very least, only close ones that are revoked

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing nothing feels wrong IMO; the purpose of a CRL is to be able to break a particular trust chain, and you don't want existing connections to continue. Only closing the ones that are revoked is the right solution to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is restart your clients if you want the CRL to take effect quickly the UX we want then?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, zTunnel should terminate that connection selectively and avoid restarting the workload. Since the user provides/updates the CRL file, they’re already aware of potentially revoked certificates, so why ask them to take action twice?
(1) Create or update the CRL, and
(2) Restart the workload pod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing advocate aside, I suspect that most folks would accept a limitation where existing connections are not re-evaluated on a CRL refresh. This seems to be a fairly common behavior. I think I would lean towards breaking support for re-evaluating existing connections into a separate PR (assuming we do want to explore that functionality).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate PR seems reasonable to me so we can evaluate the complexity on its own

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted the outbound related changes and now it only selectively terminates inbound connections. PTAL. Thanks

/cc @howardjohn @keithmattix @ilrudie

@therealmitchconnors
Copy link
Contributor

I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this

If you'd like to see the conversation where this was reviewed, it's in the 4/14 wg meeting at https://www.youtube.com/watch?v=welIiuqUvn0&t=118s

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Nov 18, 2025
@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 5509cf6 to 37f0ca0 Compare November 19, 2025 01:11
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Nov 19, 2025
@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch 2 times, most recently from 11a8b0b to 0f855bf Compare November 19, 2025 19:15
@nilekhc
Copy link
Author

nilekhc commented Nov 20, 2025

@howardjohn @ilrudie @keithmattix Could you PTAL when you get a chance? Thanks.

serial.len(),
serial
.iter()
.map(|b| format!("{:02x}", b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we converting to string here and on L372? it will be quite expensive, we can just compare the bytes directly?

Copy link
Contributor

@ilrudie ilrudie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me this bundles functionality that belongs in something like PolicyWatcher into the ConnectionManager. Please take a look at that, specifically how run works and is invoked and works for an example of how similar functionality was implemented.

Comment on lines +198 to +208
let rbac_ctx = ProxyRbacContext {
conn: conn.clone(),
dest_workload: destination_workload,
};
let mut tls_guard = pi
.connection_manager
.register_tls_connection(&rbac_ctx, client_cert_serials.clone());
debug!("TLS connection registered for revocation tracking");

// Get watcher BEFORE serving to listen for revocation-based termination
let tls_drain_watch = tls_guard.watcher();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serv_connect will already create a connection guard which will end the connection when asked. I don't think we need a second connection guard

Comment on lines +239 to +255
// This allows CRL revocation to terminate the connection immediately
tokio::select! {
result = serve => {
tls_guard.release();
result
}
_ = tls_drain_watch.wait_for_drain() => {
debug!("TLS connection terminated due to certificate revocation");
tls_guard.release();
// Return an error to signal abnormal termination
// This helps with proper cleanup and error reporting
Err(std::io::Error::new(
std::io::ErrorKind::ConnectionAborted,
"connection terminated due to certificate revocation"
).into())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to above, we already take a guard that closes the connection, I don't think we need

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add peer CA CRL support in zTunnel

7 participants