Description
Hi,
We have an issue upgrading from v2.9.5 to v3.x when using AWS IAM to access AWS MSK.
We use Redpanda Console against an AWS MSK cluster (Kafka v3.6.0). We have configured Console to use AWS IAM to connect to the MSK cluster. This works perfectly with v2.9.5 of RP Console. After upgrading RP Console to v3.x, we see errors and slow performance in RP Console.
We have updated the configuration of RP Console with the new settings (e.g. the Schema Registry section moved to its own top level). We are actually running RP Console in AWS ECS (containers) and use environment variables for configuration, so for now we have kept both the old and new configuration names.
Our IAM role is a little restrictive. We want RP Console to be read-only in most cases, since its main use is really just debugging. We also have some sensitive topics (prefixed with sensitive) in our cluster that we do not want users to be able to read by default.
The role is (with some redaction):
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"kafka-cluster:DescribeCluster*",
"kafka-cluster:Connect"
],
"Effect": "Allow",
"Resource": "arn:aws:kafka:ap-southeast-2:********:cluster/msk-cluster-name/AABBCCDD-EEFF-1122-3344-556677889900-4"
},
{
"Action": "kafka-cluster:DescribeGroup",
"Effect": "Allow",
"Resource": "arn:aws:kafka:ap-southeast-2:********:group/msk-cluster-name/*"
},
{
"Action": [
"kafka-cluster:ReadData",
"kafka-cluster:DescribeTransactionalId",
"kafka-cluster:DescribeTopic*"
],
"Effect": "Allow",
"Resource": "arn:aws:kafka:ap-southeast-2:********:topic/msk-cluster-name/*"
},
{
"Action": "kafka-cluster:ReadData",
"Effect": "Deny",
"Resource": "arn:aws:kafka:ap-southeast-2:********:topic/msk-cluster-name/*/sensitive*"
}
]
}
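To make the intent of the topic statements concrete, here is a rough sketch of how they should evaluate. It is a simplification, not real IAM evaluation: it only models the two topic statements, uses `fnmatchcase` as a stand-in for IAM wildcard matching, and applies the "explicit deny wins" rule. The account ID `111122223333` and the topic names `orders` and `sensitive-payments` are placeholders, not values from our cluster.

```python
from fnmatch import fnmatchcase

# Placeholder account ID; the real one is redacted in the policy above.
PREFIX = "arn:aws:kafka:ap-southeast-2:111122223333"
CLUSTER = "msk-cluster-name/AABBCCDD-EEFF-1122-3344-556677889900-4"

# Only the two topic statements from the policy, as (effect, actions, resource).
STATEMENTS = [
    ("Allow",
     ["kafka-cluster:ReadData",
      "kafka-cluster:DescribeTransactionalId",
      "kafka-cluster:DescribeTopic*"],
     f"{PREFIX}:topic/msk-cluster-name/*"),
    ("Deny",
     ["kafka-cluster:ReadData"],
     f"{PREFIX}:topic/msk-cluster-name/*/sensitive*"),
]


def topic_arn(topic):
    """Build a topic ARN in the cluster-name/cluster-uuid/topic-name form."""
    return f"{PREFIX}:topic/{CLUSTER}/{topic}"


def is_allowed(action, resource):
    """Simplified evaluation: explicit Deny wins, otherwise an Allow is needed."""
    effects = [
        effect
        for effect, actions, pattern in STATEMENTS
        if fnmatchcase(resource, pattern)
        and any(fnmatchcase(action, a) for a in actions)
    ]
    if "Deny" in effects:
        return False
    return "Allow" in effects


if __name__ == "__main__":
    for topic in ("orders", "sensitive-payments"):
        for action in ("kafka-cluster:ReadData", "kafka-cluster:DescribeTopic"):
            print(topic, action, is_allowed(action, topic_arn(topic)))
```

Under these assumptions, sensitive topics can still be described (so they appear in the topic list) but cannot be read, while all other topics can be both described and read.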
When we upgrade, some parts of RP Console seem to work, many are very slow, and several fail entirely. For example, we can see the topic list but cannot enumerate the messages in any topic.
We see a lot of errors from RP Console such as (in no particular order):
{"level":"warn","ts":"2025-04-29T22:41:28.878Z","msg":"","timestamp":"2025-04-29T22:41:23Z","procedure":"/redpanda.api.console.v1alpha1.ClusterStatusService/GetKafkaInfo","request_duration":"5.002129822s","status_code":"internal","request_size_bytes":0,"peer_address":"2406:da1c:****:****:****:****:****:4639","error":"internal: context deadline exceeded"}
{"level":"warn","ts":"2025-04-29T22:41:58.079Z","logger":"redpanda_cluster_status_service","msg":"failed to request kafka version","broker_id":1,"error":"the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists"}
{"level":"warn","ts":"2025-04-29T22:42:17.274Z","msg":"failed to retrieve log dirs by topic","error":"failed to retrieve metadata: context deadline exceeded"}
I am aware that the context deadline message indicates a timeout and/or a connectivity issue. We separately verified that access is fine (otherwise we would likely be seeing the same problem under 2.9.5 too). Ports are open (9098 for AWS IAM in MSK).
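For anyone wanting to repeat the port check from inside the container, a minimal sketch follows. The function name `port_is_open` and the broker hostname in the usage comment are hypothetical. Note that this only confirms TCP reachability; it says nothing about TLS or whether IAM authentication succeeds.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (hypothetical broker hostname, IAM listener port 9098):
# port_is_open("b-1.msk-cluster-name.example.kafka.ap-southeast-2.amazonaws.com", 9098)
```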
We have tiered topics, but being AWS MSK they go off to hidden S3 buckets (even we can't access them).
Some of the request durations are long too. For example, this one took more than 11 seconds:
{"level":"warn","ts":"2025-04-29T22:47:13.616Z","msg":"","timestamp":"2025-04-29T22:47:02Z","procedure":"/redpanda.api.console.v1alpha1.ClusterStatusService/GetKafkaInfo","request_duration":"11.351032601s","status_code":"internal","request_size_bytes":0,"peer_address":"2406:da1c:****:****:****:****:****:4639","error":"internal: context deadline exceeded"}
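In case it helps with triage, here is a small sketch for pulling the slow requests out of these JSON log lines. It assumes one JSON object per line and that `request_duration` is a plain seconds value such as `5.002129822s`; other duration formats are skipped. The helper names `duration_seconds` and `slow_requests` and the 5 second threshold are illustrative choices.

```python
import json
import re

# Three of the log lines quoted above, verbatim.
LOG_LINES = [
    '{"level":"warn","ts":"2025-04-29T22:41:28.878Z","msg":"","timestamp":"2025-04-29T22:41:23Z","procedure":"/redpanda.api.console.v1alpha1.ClusterStatusService/GetKafkaInfo","request_duration":"5.002129822s","status_code":"internal","request_size_bytes":0,"peer_address":"2406:da1c:****:****:****:****:****:4639","error":"internal: context deadline exceeded"}',
    '{"level":"warn","ts":"2025-04-29T22:42:17.274Z","msg":"failed to retrieve log dirs by topic","error":"failed to retrieve metadata: context deadline exceeded"}',
    '{"level":"warn","ts":"2025-04-29T22:47:13.616Z","msg":"","timestamp":"2025-04-29T22:47:02Z","procedure":"/redpanda.api.console.v1alpha1.ClusterStatusService/GetKafkaInfo","request_duration":"11.351032601s","status_code":"internal","request_size_bytes":0,"peer_address":"2406:da1c:****:****:****:****:****:4639","error":"internal: context deadline exceeded"}',
]

SECONDS_RE = re.compile(r"(\d+(?:\.\d+)?)s")


def duration_seconds(value):
    """Parse a plain seconds string like '5.002129822s'; return None otherwise."""
    match = SECONDS_RE.fullmatch(value)
    return float(match.group(1)) if match else None


def slow_requests(lines, threshold=5.0):
    """Return (procedure, seconds, error) for entries at or above threshold."""
    results = []
    for line in lines:
        entry = json.loads(line)
        seconds = duration_seconds(entry.get("request_duration", ""))
        if seconds is not None and seconds >= threshold:
            results.append((entry.get("procedure"), seconds, entry.get("error")))
    return results


if __name__ == "__main__":
    for procedure, seconds, error in slow_requests(LOG_LINES):
        print(f"{seconds:.3f}s  {procedure}  {error}")
```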
On a bit of a whim we switched back to SASL/SCRAM for authentication and RP Console worked fine. The SASL/SCRAM user has fairly liberal permissions (though it is still restricted on the sensitive topics). This seems to point to a problem with the AWS IAM integration in some way, perhaps a regression introduced in v3.x.