Memory Leak using Workload Identity Authentication #50

@bassg0navy

Summary

The OCI Secrets Store CSI Driver Provider leaks memory because HTTP connections are neither closed nor reused when authenticating via workload identity. This routinely leads to OOMKilled provider pods in high-throughput environments where secrets are mounted frequently.

Environment

  • Provider version: v0.4.2
  • Kubernetes: OKE (Oracle Kubernetes Engine)
  • Authentication method: Workload Identity (x509FederationClientForOkeWorkloadIdentity)
  • Secret rotation enabled: Yes (2m interval)

Symptoms

  • Steady, linear memory growth over time (~5-6 Mi/hour under moderate load)
  • Pods approaching 200Mi memory limit within ~24-36 hours
  • Memory usage correlates directly with the volume of secret mount requests on the node

Root Cause Analysis

Using Go's pprof tooling, we identified the following:

Heap Profile Findings

Top memory consumers point to HTTP connection handling:

Function                                        Memory    Percentage
bytes.growSlice                                 3084 kB   14.82%
bufio.NewWriterSize                             3084 kB   14.82%
bufio.NewReaderSize                             3084 kB   14.82%
compress/flate.NewWriter                        1805 kB    8.68%
crypto/internal/fips140/aes/gcm.NewGCMForTLS12  1537 kB    7.39%

Critically, this function appears in the profile:

512.20kB  2.46%  85.23%  3596.15kB  17.28%  github.com/oracle/oci-go-sdk/v65/common/auth.(*x509FederationClientForOkeWorkloadIdentity).getSecurityToken
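
We have not confirmed this in the SDK source, but both profiles are consistent with a new HTTP client (and therefore a new transport) being constructed per token request. A minimal sketch of the suspected anti-pattern, with all names and the signature hypothetical:

```go
package sketch

import (
	"crypto/tls"
	"io"
	"net/http"
)

// getSecurityTokenSketch is a hypothetical reconstruction of the suspected
// anti-pattern, not the SDK's actual code. Building a new http.Client with
// its own http.Transport on every call gives each request a private
// idle-connection pool: connections are never reused across calls, and each
// idle connection pins a readLoop and a writeLoop goroutine, matching the
// 478/478 pairs in the goroutine profile below.
func getSecurityTokenSketch(endpoint string, tlsCfg *tls.Config) (string, error) {
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: tlsCfg}, // fresh pool per call
	}
	resp, err := client.Get(endpoint)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}
```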

Goroutine Profile Findings

Leaking pod (node with high secret mount volume):

goroutine profile: total 968

478 @ ... net/http.(*persistConn).readLoop
478 @ ... net/http.(*persistConn).writeLoop

Healthy pod (node with zero secret mounts):

goroutine profile: total 9

The leaking pod holds 478 idle persistent HTTP connections, each pinning a readLoop and a writeLoop goroutine (956 of the 968 goroutines total), that are never closed or reused.

Reproduction

  1. Deploy the OCI provider with workload identity authentication
  2. Schedule workloads that mount secrets from OCI Vault on a node
  3. Over time, observe memory growth in the provider pod on that node
  4. Collect a goroutine profile: curl http://localhost:6060/debug/pprof/goroutine?debug=1 (see the pprof sketch after this list)
  5. Compare goroutine count between high-traffic and idle nodes
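
For step 4, the profiles were pulled from Go's standard net/http/pprof handler. In case it isn't already wired up in your build, a minimal sketch of exposing it:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Bind to localhost only; profiles can then be pulled with the
	// curl command from step 4.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```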

Expected Behavior

HTTP connections to OCI endpoints should be:

  • Reused via connection pooling, OR
  • Properly closed after use

Suggested Fix

The issue likely resides in how x509FederationClientForOkeWorkloadIdentity.getSecurityToken creates HTTP clients. Potential fixes (a sketch combining 1 and 3 follows the list):

  1. Reuse a single http.Client instance instead of creating new ones per request
  2. Ensure resp.Body.Close() is called after reading responses
  3. Configure connection pool limits via http.Transport.MaxIdleConns and MaxIdleConnsPerHost
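
A minimal sketch of fixes 1 and 3 together; names and values are illustrative, and the real change would belong wherever the federation client builds its HTTP client today:

```go
package sketch

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newSharedHTTPClient builds one long-lived client intended to be
// constructed once and reused across getSecurityToken calls. A single
// http.Transport means a single connection pool, and the idle limits
// below bound how many persistConn goroutine pairs can accumulate.
func newSharedHTTPClient(tlsCfg *tls.Config) *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig:     tlsCfg,
			MaxIdleConns:        10,
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     90 * time.Second,
		},
	}
}
```

Note that fix 2 still matters even with a shared client: a response body that is not fully read and closed prevents the connection from being returned to the pool, forcing new dials.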

Impact

In environments with frequent secret mounts (e.g., CI/CD pipelines running short-lived jobs), the provider becomes unusable as pods OOM within hours to days depending on workload volume.

Workarounds

Currently, no effective workarounds exist other than:

  • Increasing memory limits (delays but doesn't prevent the issue)
  • Using an alternative authentication method, where feasible

Attachments

Happy to provide full heap and goroutine profile dumps if helpful.
