
Incident: Uncontrolled Read Throughput and High IOPS on Node under Pressure #490

@Kouzi99

Hello,

We had an incident with the certificate exporter, which runs as a DaemonSet, on one of our nodes. Following an earlier problem on that node (the details are still unclear), the exporter began generating extremely high disk activity: read throughput reached 1.75 GB/s at roughly 4,500 IOPS. We have not been able to reproduce the issue so far, but I am sharing the relevant part of our configuration and would appreciate advice on mitigation.

Relevant configuration:

  cache:
    # -- Enable caching of Kubernetes objects to prevent scraping timeouts
    enabled: true
    # -- Maximum time an object can stay in cache unrefreshed (seconds) - it will be at least half of that
    maxDuration: 300

  kubeApiRateLimits:
    # -- Should requests to the Kubernetes API server be rate-limited
    enabled: false
    # -- Maximum rate of queries sent to the API server (per second)
    queriesPerSecond: 5
    # -- Burst bucket size for queries sent to the API server
    burstQueries: 10

Question:
Would enabling kubeApiRateLimits help prevent this behavior when the node is under pressure? Or are the uncontrolled disk reads unrelated to API rate limiting?
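
To make the two kubeApiRateLimits values concrete, here is a minimal sketch of token-bucket limiting, assuming queriesPerSecond and burstQueries behave like a conventional token bucket (I have not confirmed how the exporter applies them). The tokenBucket type and its helpers are hypothetical names used for illustration only. A limiter of this kind only gates the calls that are routed through it, so as far as I can tell it would not by itself throttle local file reads.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a hypothetical, simplified limiter (not the exporter's code).
// It holds up to `burst` tokens and refills at `rate` tokens per second.
type tokenBucket struct {
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

// newTokenBucket starts with a full bucket, so an initial burst is allowed.
func newTokenBucket(qps float64, burst int, now time.Time) *tokenBucket {
	return &tokenBucket{rate: qps, burst: float64(burst), tokens: float64(burst), last: now}
}

// allow refills the bucket for the time elapsed since the last call and
// reports whether one request may proceed at time `now`.
func (b *tokenBucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// countAllowed fires `attempts` requests at the same instant and counts
// how many of them the bucket lets through.
func countAllowed(b *tokenBucket, now time.Time, attempts int) int {
	allowed := 0
	for i := 0; i < attempts; i++ {
		if b.allow(now) {
			allowed++
		}
	}
	return allowed
}

// burstThenRefill uses the values from the configuration above
// (queriesPerSecond: 5, burstQueries: 10) with a simulated clock.
func burstThenRefill() (int, int) {
	start := time.Unix(0, 0)
	b := newTokenBucket(5, 10, start)
	first := countAllowed(b, start, 20)                   // drains the full burst
	second := countAllowed(b, start.Add(time.Second), 20) // one second of refill
	return first, second
}

func main() {
	first, second := burstThenRefill()
	fmt.Println(first, second) // prints "10 5"
}
```

Of 20 simultaneous requests, 10 pass immediately (the burst), and one second later only 5 more pass (the refill rate).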

Additionally, I suspect that, under node pressure, the following part of the code may trigger unthrottled read operations:

// internal/certificate.go
func readFile(file string) ([]byte, error) {
	contents, err := os.ReadFile(file)
	if err == nil || !os.IsNotExist(err) {
		return contents, err
	}

	fsys := os.DirFS(".")
	if filepath.IsAbs(file) {
		fsys = os.DirFS("/")
	}

	realPath, err := resolveSymlink(fsys, file)
	if err != nil {
		return nil, err
	}

	return os.ReadFile(realPath)
}
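
One mitigation I have been considering, shown here only as a rough sketch and not as the exporter's actual behavior, is to skip re-reading files whose size and modification time have not changed. The fileCache type, its read method, and the demo helper are hypothetical names; the sketch also assumes that a cheap os.Stat call is acceptable on every pass.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// cachedFile is a hypothetical cache entry: the file contents plus the
// metadata used to detect that the file changed on disk.
type cachedFile struct {
	modTime time.Time
	size    int64
	data    []byte
}

// fileCache is a hypothetical wrapper around os.ReadFile (not exporter code).
type fileCache struct {
	mu      sync.Mutex
	entries map[string]cachedFile
	reads   int // number of full reads that actually went to disk
}

func newFileCache() *fileCache {
	return &fileCache{entries: make(map[string]cachedFile)}
}

// read returns cached contents when size and mtime are unchanged,
// and only falls back to a full os.ReadFile otherwise.
func (c *fileCache) read(path string) ([]byte, error) {
	info, err := os.Stat(path) // follows symlinks, like os.ReadFile
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[path]; ok && e.size == info.Size() && e.modTime.Equal(info.ModTime()) {
		return e.data, nil
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	c.reads++
	c.entries[path] = cachedFile{modTime: info.ModTime(), size: info.Size(), data: data}
	return data, nil
}

// demo reads a temporary file three times, rewrites it with a different
// size, reads it again, and returns the disk-read count at both points.
func demo() (afterRepeats, afterChange int, err error) {
	dir, err := os.MkdirTemp("", "certcache")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "tls.crt")
	if err := os.WriteFile(path, []byte("dummy"), 0o600); err != nil {
		return 0, 0, err
	}
	cache := newFileCache()
	for i := 0; i < 3; i++ {
		if _, err := cache.read(path); err != nil {
			return 0, 0, err
		}
	}
	afterRepeats = cache.reads
	if err := os.WriteFile(path, []byte("dummy-v2"), 0o600); err != nil {
		return 0, 0, err
	}
	if _, err := cache.read(path); err != nil {
		return 0, 0, err
	}
	return afterRepeats, cache.reads, nil
}

func main() {
	a, b, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(a, b) // prints "1 2"
}
```

The trade-off is memory for disk I/O: contents stay resident per path, and a change that preserves both size and modification time would go unnoticed.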

Any insight or recommendations to prevent this kind of incident in the future are appreciated.
