Summary
The OCI Secrets Store CSI Driver Provider exhibits a memory leak: HTTP connections opened during workload identity authentication are neither reused nor properly closed. In high-throughput environments where secrets are mounted frequently, this routinely results in OOMKilled provider pods.
Environment
- Provider version: v0.4.2
- Kubernetes: OKE (Oracle Kubernetes Engine)
- Authentication method: Workload Identity (x509FederationClientForOkeWorkloadIdentity)
- Secret rotation enabled: Yes (2m interval)
Symptoms
- Steady, linear memory growth over time (~5-6 Mi/hour under moderate load)
- Pods approaching 200Mi memory limit within ~24-36 hours
- Memory usage correlates directly with the volume of secret mount requests on the node
Root Cause Analysis
Using Go's pprof tooling, we identified the following:
Heap Profile Findings
Top memory consumers point to HTTP connection handling:
| Function | Memory | Percentage |
|---|---|---|
| bytes.growSlice | 3084 kB | 14.82% |
| bufio.NewWriterSize | 3084 kB | 14.82% |
| bufio.NewReaderSize | 3084 kB | 14.82% |
| compress/flate.NewWriter | 1805 kB | 8.68% |
| crypto/internal/fips140/aes/gcm.NewGCMForTLS12 | 1537 kB | 7.39% |

net/http allocates a bufio reader/writer pair for every persistent connection it opens, so the matching bufio.NewReaderSize and bufio.NewWriterSize totals are consistent with connections being created and never released.
Critically, this function appears in the profile:
512.20kB 2.46% 85.23% 3596.15kB 17.28% github.com/oracle/oci-go-sdk/v65/common/auth.(*x509FederationClientForOkeWorkloadIdentity).getSecurityToken
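For illustration only, here is a hypothetical sketch of the kind of pattern that would produce this signature (this is not the actual oci-go-sdk source; the function signature and the per-call TLS config are assumptions):

```go
package sketch

import (
	"crypto/tls"
	"io"
	"net/http"
	"time"
)

// Hypothetical sketch of the suspected anti-pattern; NOT the actual
// oci-go-sdk code. Building a new http.Transport per call (e.g. to
// attach a per-session TLS client certificate) gives every request
// its own idle-connection pool that nothing ever drains.
func getSecurityToken(endpoint string, tlsConf *tls.Config) ([]byte, error) {
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: tlsConf},
		Timeout:   30 * time.Second,
	}

	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Closing the body parks the connection in *this* transport's idle
	// pool. The transport is then dropped without CloseIdleConnections(),
	// so the connection's readLoop/writeLoop goroutines live on.
	return io.ReadAll(resp.Body)
}
```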
Goroutine Profile Findings
Leaking pod (node with high secret mount volume):
goroutine profile: total 968
478 @ ... net/http.(*persistConn).readLoop
478 @ ... net/http.(*persistConn).writeLoop
Healthy pod (node with zero secret mounts):
goroutine profile: total 9
The leaking pod holds 478 persistent HTTP connections that are never closed or reused; net/http runs a paired readLoop/writeLoop goroutine for each live connection, so these 478 pairs account for nearly all of the 968 goroutines.
Reproduction
- Deploy the OCI provider with workload identity authentication
- Schedule workloads that mount secrets from OCI Vault on a node
- Over time, observe memory growth in the provider pod on that node
- Collect a goroutine profile: curl http://localhost:6060/debug/pprof/goroutine?debug=1
- Compare goroutine counts between high-traffic and idle nodes
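To make the mechanism concrete outside the cluster, here is a small self-contained Go program (entirely illustrative; a local test server stands in for the OCI endpoint) that leaks roughly two goroutines per request by giving each request its own Transport:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"runtime"
	"time"
)

func main() {
	// Local test server standing in for the OCI identity endpoint
	// (the real case is over TLS, but the leak is protocol-independent).
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, "ok")
		}))
	defer srv.Close()

	before := runtime.NumGoroutine()

	for i := 0; i < 50; i++ {
		// The leak pattern: a fresh Transport per request, so each
		// request gets its own idle pool that is never reused or drained.
		client := &http.Client{Transport: &http.Transport{}}
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close() // parks the conn in a pool nobody will touch again
	}

	time.Sleep(100 * time.Millisecond) // let connection goroutines settle
	// Expect roughly +2 goroutines per request:
	// net/http.(*persistConn).readLoop and .writeLoop.
	fmt.Printf("goroutines before: %d, after: %d\n", before, runtime.NumGoroutine())
}
```

Running it prints a goroutine count that grows linearly with the loop count, mirroring the readLoop/writeLoop pairs seen in the provider's profile.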
Expected Behavior
HTTP connections to OCI endpoints should be:
- Reused via connection pooling, OR
- Properly closed after use
Suggested Fix
The issue likely resides in how x509FederationClientForOkeWorkloadIdentity.getSecurityToken creates HTTP clients. Potential fixes (a sketch follows below):
- Reuse a single http.Client instance instead of creating a new one per request
- Ensure resp.Body.Close() is called after reading responses
- Configure connection pool limits via http.Transport.MaxIdleConns and http.Transport.MaxIdleConnsPerHost
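A minimal sketch of the first suggestion, assuming the client construction can be hoisted to package or struct scope (names and pool sizes are illustrative, not the SDK's):

```go
package sketch

import (
	"io"
	"net/http"
	"time"
)

// One long-lived client shared by every token request, so connections
// are pooled and reused instead of accumulating. Field values are
// illustrative, not tuned recommendations.
var tokenClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        10,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second, // reap connections that go idle
	},
	Timeout: 30 * time.Second,
}

func fetchToken(endpoint string) ([]byte, error) {
	resp, err := tokenClient.Get(endpoint)
	if err != nil {
		return nil, err
	}
	// Fully reading and then closing the body lets the connection
	// return to the pool for reuse by the next request.
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}
```

If workload identity genuinely requires a per-session TLS configuration, the discarded transport should at least have CloseIdleConnections() called on it so its pooled connections (and their goroutines) are torn down.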
Impact
In environments with frequent secret mounts (e.g., CI/CD pipelines running short-lived jobs), the provider becomes unusable as pods OOM within hours to days depending on workload volume.
Workarounds
Currently, no effective workarounds exist other than:
- Increasing memory limits (this only delays the OOM; it does not prevent it)
- Using an alternative authentication method, since the leak is specific to the workload identity code path
Attachments
Happy to provide full heap and goroutine profile dumps if helpful.