Skip to content

Conversation

@jonstacks
Copy link
Collaborator

Relates to:

What

Background

The ngrok-operator watches Service[type=Loadbalancer], Ingress, Gateway API resources, and ngrok CRD types. When it creates resources in the ngrok API for these, it adds the ngrok finalizer to these resources to make sure that those ngrok resources get cleaned up on deletion of the resources.
These controllers are responsible for removing the finalizers from the resources they manage after cleaning up the ngrok resources.

When the manager pods start terminating, their context is cancelled and they stop reconciling resources. Once these manager pods are terminated they can no longer cleanup resources or remove the finalizers.

Problem

When uninstalling the helm chart, there is a race condition where the manager pods are deleted before they have time to clean up the resources they owned. This leaves resources in the cluster with the ngrok finalizer present, but no way remove the finalizer. This makes the uninstall fail and wedges these resources in a bad state.

What is more, is that if you you run helm uninstall and some of the resources are cleaned up, but others like the KubernetesOperator don't get processed in time, when you try to re-install, the new manager pods will clean up the finalizer from the existing resources and then the CRD will be deleted after install, leaving you in a bad state. This is what may be happening happening in #597.

How

Helm Hook

This PR adds a helm pre-delete hook, initially disabled by default in the chart while we continue to test it. The pre-delete hook installs a cluster role to give the helm delete hook permissions to find resources with the ngrok finalizer and then annotation those resources with a k8s.ngrok.com/cleanup: "true" annotation.

k8s.ngrok.com/cleanup: "true" annotation support for controllers

This PR adds support for handling a k8s.ngrok.com/cleanup: "true" annotation to our controllers. When objects are annotated with this annotation:

  • The ngrok resources are deleted
  • Any internal annotations are cleaned up
  • The finalizer is removed
  • The status conditions are appropriately reset

Alternatives considered

I was contemplating just triggering deletes for all of the resources that have our finalizer and letting our controllers handle the cleanup, removing the annotation, and then letting them be garbage collected. The reason I didn't opt for this approach is that we don't create Ingress, Service, and Gateway API resources. They are created by users. It feels unexpected for us to delete them since we didn't create them.

Additional Concerns

This is currently only implemented as a helm pre-delete hook. This does not cover the case where the manager deployments or ngrok-operator namespace is deleted directly. That case is difficult to handle. When the namespace id deleted, it starts deleting all namespaced resources including the service account which grants the managers permissions to remove the finalizers. It also terminates the manager pods which prevent finalizers on resources in different namespaces from being cleaned up.

I have some other thoughts on how we can handle some of these cases that will make use of the annotation work, but need to do more testing. This also only works when using helm since this hook isn't supported by other tools like ArgoCD, for example. I think the annotation work can be kept, but we might need to supplement the pre-delete hook with other methods to make uninstall cleanup easier.

Breaking Changes

There should be no breaking changes. This feature is disabled by default currently in the helm chart and has to be enabled with cleanupHook.enabled=true (We do enable it for our devenv to test it). Nothing in the codebase annotates these currently, so those paths should be unaffected.

@jonstacks jonstacks self-assigned this Nov 7, 2025
@github-actions github-actions bot added size/XXL Denotes a PR that changes 1000+ lines documentation Improvements or additions to documentation area/controller Issues dealing with the controller area/helm-chart Issues dealing with the helm chart and removed size/XXL Denotes a PR that changes 1000+ lines labels Nov 7, 2025
Signed-off-by: Jonathan Stacks <[email protected]>
@jonstacks jonstacks force-pushed the stacks/helm-delete-hook branch from 195cd46 to 836f3d4 Compare November 7, 2025 20:42
@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines label Nov 7, 2025
@codecov
Copy link

codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 42.06897% with 84 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.98%. Comparing base (a13039f) to head (836f3d4).

Files with missing lines Patch % Lines
internal/controller/base_controller.go 0.00% 23 Missing ⚠️
internal/controller/ingress/ingress_controller.go 0.00% 14 Missing ⚠️
pkg/managerdriver/controller-handler.go 0.00% 10 Missing ⚠️
...nternal/controller/gateway/httproute_controller.go 25.00% 6 Missing and 3 partials ⚠️
internal/controller/gateway/tlsroute_controller.go 61.90% 4 Missing and 4 partials ⚠️
internal/controller/gateway/tcproute_controller.go 50.00% 4 Missing and 3 partials ⚠️
internal/controller/gateway/gateway_controller.go 50.00% 4 Missing and 2 partials ⚠️
.../controller/ngrok/kubernetesoperator_controller.go 0.00% 4 Missing ⚠️
internal/controller/service/controller.go 75.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #709      +/-   ##
==========================================
+ Coverage   48.53%   48.98%   +0.45%     
==========================================
  Files          95       95              
  Lines       10500    10575      +75     
==========================================
+ Hits         5096     5180      +84     
+ Misses       5049     5024      -25     
- Partials      355      371      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/controller Issues dealing with the controller area/helm-chart Issues dealing with the helm chart documentation Improvements or additions to documentation size/XXL Denotes a PR that changes 1000+ lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants