Bug: turning on "collect application metrics" can cause stability issues #6714

@yuvii

📜 Description

We recently moved some services to Kubernetes. These are customer-facing services, so we wanted to turn on application metrics, which attaches a small Envoy sidecar container to our main one. However, we soon ran into issues where the Envoy container got OOM killed. That took the entire pod down and caused an outage: our service became unavailable to customers.

👟 Reproduction steps

I'm not entirely sure how to reproduce this. I'm guessing it's a question of traffic. For what it's worth, adding the Envoy sidecar did not provide any metric information: our dashboards remained empty even as traffic came in, so maybe there was some kind of bug there that caused this.
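
For anyone trying to confirm the same symptom, here is a minimal sketch of how one might check which container in a pod was OOM killed. It only reads the standard `status.containerStatuses` fields of a pod object (as returned by `kubectl get pod <pod-name> -o json`). The function name, the container names `app` and `envoy`, and the sample pod data are hypothetical and for illustration only. They are not taken from our cluster.

```python
from typing import Any


def oom_killed_containers(pod: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one record per container whose current or last state is an OOM kill."""
    results = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        # A container that has already restarted reports the kill under
        # "lastState"; one that has not restarted yet reports it under "state".
        for key in ("lastState", "state"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                results.append({
                    "container": status["name"],
                    "restarts": status.get("restartCount", 0),
                    "exit_code": terminated.get("exitCode"),
                })
                break
    return results


# Hypothetical pod status, shaped like the JSON that kubectl returns.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "restartCount": 0, "state": {"running": {}}},
            {
                "name": "envoy",
                "restartCount": 3,
                "state": {"running": {}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

To run this against a real pod, the `sample_pod` dictionary could be replaced with the parsed output of `kubectl get pod <pod-name> -o json`, for example via `json.load(sys.stdin)`.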

👍 Expected behavior

The Envoy container should not affect the main container this way. Any failure in metrics collection should not impact resiliency.

👎 Actual Behavior

Described above.

☸ Kubernetes version

1.33

Cloud provider

AWS

🌍 Browser

Chrome

🧱 Your Environment

We're using a single ELB that handles TLS termination and then passes requests to Traefik, which works as a reverse proxy and can auto-discover services on its own.

✅ Proposed Solution

No response

👀 Have you spent some time to check if this issue has been raised before?

  • I checked and didn't find any similar issue

🏢 Have you read the Code of Conduct?

Metadata

Labels

bug: Something isn't working
needs-triage: Issue is not approved or ready-to-work on
