
Conversation

therealmitchconnors
Contributor

Description

Document the performance characteristics of Ambient Multi Cluster in preparation for Beta.

Reviewers

  • Ambient
  • Docs
  • Installation
  • Networking
  • Performance and Scalability
  • Extensions and Telemetry
  • Security
  • Test and Release
  • User Experience
  • Developer Infrastructure
  • Localization/Translation

@therealmitchconnors therealmitchconnors requested a review from a team as a code owner September 3, 2025 22:54
@istio-testing istio-testing added the do-not-merge/work-in-progress label Sep 3, 2025
@istio-testing istio-testing added the size/M label Sep 3, 2025
@therealmitchconnors therealmitchconnors changed the title from "WIP: add initial ambient mc perf doc" to "add initial ambient mc perf doc" Sep 29, 2025
@istio-testing istio-testing removed the do-not-merge/work-in-progress label Sep 29, 2025

@keithmattix keithmattix left a comment


Needs one sentence of explanation, but otherwise LGTM.


## Control plane performance

As documented [here](/docs/ops/deployment/performance-and-scalability), the Istio control plane generally scales as the product of deployment changes, configuration changes, and the number of connected proxies. Ambient Multi Cluster adds two new dimensions to the Control Plane scalability story: number of remote clusters, and number of remote services. This means that adding 10 remote services to the mesh has substantially lower impact on the control plane performance than adding 10 local services.

@keithmattix keithmattix Sep 29, 2025


Suggested change
As documented [here](/docs/ops/deployment/performance-and-scalability), the Istio control plane generally scales as the product of deployment changes, configuration changes, and the number of connected proxies. Ambient Multi Cluster adds two new dimensions to the Control Plane scalability story: number of remote clusters, and number of remote services. This means that adding 10 remote services to the mesh has substantially lower impact on the control plane performance than adding 10 local services.
As documented [here](/docs/ops/deployment/performance-and-scalability), the Istio control plane generally scales as the product of deployment changes, configuration changes, and the number of connected proxies. Ambient multicluster adds two new dimensions to the control plane scalability story: number of remote clusters, and number of remote services. Because the control plane is not programming proxies for remote clusters (assuming a multi-primary deployment topology), adding 10 remote services to the mesh has substantially lower impact on the control plane performance than adding 10 local services.


When traffic is routed to a remote cluster, the originating data plane establishes an encrypted tunnel to the destination cluster's east/west gateway. It then establishes a secondary encrypted tunnel inside the first, which is terminated at the destination data plane. This use of inner and outer tunnels allows the data plane to securely communicate with the remote cluster without knowing the details of which pod IPs represent which services.

This double-encryption does carry some overhead, however. The Data Plane Load test measures the response latency of traffic between pods in the same cluster, versus those in two different clusters, to understand the impact of double encryption on latency. Additionally, double encryption requires double handshakes, which disproportionately affects the latency of new connections to the remote cluster. As you can see below, our initial connections observed an average of 2.2 milliseconds(346%) additional latency, while requests using existing connections observed an increase of 0.13 milliseconds(72%). While these numbers appear significant, it is expected that most multicluster traffic will cross availability zones or regions, and the observed increase in overhead latency will be minimal compared to the overall transit latency between data centers.
Member


It's not the double encryption causing the 346% increase though, right? At minimum it's +1 TCP proxy hop, which is (likely) more expensive than the double TLS.

Is this also tested on a real cross-zone/cross-VPC/etc cloud? Or is the network path ~zero cost in the test? If not, that would be a major factor here as well

Contributor


I believe the tests have no network cost since they're run with kind /cc @Stevenjin8

Contributor


The +346% is due to having to re-establish an inner HBONE tunnel for every connection. These tests were all run locally in kind.

Contributor


So yeah, it's not due to double encryption alone.


@Stevenjin8 Stevenjin8 Sep 30, 2025


Also, double encryption does not require double handshakes; we were just a bit lazy in our implementation. But we can also make the point that we can speed this up in the future to be roughly on par with the request/response numbers.

Member


Ah right, forgot the 346 is for CRR not RR. RR is the more important number anyways
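
For context on the absolute numbers quoted in the paragraph under review (treating them as mean latencies from the kind-based, zero-network-cost run), a quick back-of-the-envelope recovers the implied same-cluster baselines:

```latex
% Connect + request/response (CRR): +2.2 ms at +346% implies a
% same-cluster baseline of 2.2/3.46 ~= 0.64 ms, so ~2.8 ms cross-cluster.
% Request/response on a warm connection (RR): +0.13 ms at +72% implies
% a baseline of 0.13/0.72 ~= 0.18 ms, so ~0.31 ms cross-cluster.
\[
t^{\mathrm{CRR}}_{\mathrm{base}} \approx \frac{2.2\ \mathrm{ms}}{3.46} \approx 0.64\ \mathrm{ms}
\qquad
t^{\mathrm{RR}}_{\mathrm{base}} \approx \frac{0.13\ \mathrm{ms}}{0.72} \approx 0.18\ \mathrm{ms}
\]
```

Even the cold-connection cross-cluster figure (roughly 2.8 ms) is small next to the inter-data-center transit latency the doc expects most multicluster traffic to incur, which supports the "minimal overhead" framing, especially for warm-connection (RR) traffic.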


As documented [here](/docs/ops/deployment/performance-and-scalability), the Istio control plane generally scales as the product of deployment changes, configuration changes, and the number of connected proxies. Ambient Multi Cluster adds two new dimensions to the Control Plane scalability story: number of remote clusters, and number of remote services. This means that adding 10 remote services to the mesh has substantially lower impact on the control plane performance than adding 10 local services.

Our Multicluster Control Plane Load test created 300 services with 4000 endpoints in each of 10 clusters, and added these clusters to the mesh one at a time. The approximate control plane impact of adding a remote cluster at this scale was **1% of a CPU core, and 180 MB of memory**. At this scale, it should be safe to scale well beyond 10 clusters in a mesh with a properly scaled control plane. One item to note is that for Multicluster scalability, horizontally scaling the control plane will not help, as each control plane instance maintains a complete cache of remote services. Instead, we recommend modifying the resource requests and limits of the control plane to scale vertically to meet the needs of your multicluster mesh.

@Stevenjin8 Stevenjin8 Sep 30, 2025


We should standardize on "multi cluster" vs. "multicluster".

Member


Multicluster as a single word is the standardized term
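
Regarding the vertical-scaling recommendation in the quoted paragraph, a minimal sketch of what that looks like with an IstioOperator-based install. The resource values are purely illustrative and should be sized from your own cluster and service counts, not read as a recommendation from this doc:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  components:
    pilot:
      k8s:
        # Scale istiod vertically: adding replicas does not help for
        # multicluster, because every istiod instance keeps a full cache
        # of remote services.
        resources:
          requests:
            cpu: "2"        # illustrative value only
            memory: 4Gi     # illustrative value only
          limits:
            memory: 8Gi
```

If you install with Helm instead, the equivalent knob is typically the `pilot.resources` values of the istiod chart.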


@dhawton dhawton left a comment


Some nits for consistency and conforming to the general casing used across the docs.


When traffic is routed to a remote cluster, the originating data plane establishes an encrypted tunnel to the destination cluster's east/west gateway. It then establishes a secondary encrypted tunnel inside the first, which is terminated at the destination data plane. This use of inner and outer tunnels allows the data plane to securely communicate with the remote cluster without knowing the details of which pod IPs represent which services.

This double-encryption does carry some overhead, however. The Data Plane Load test measures the response latency of traffic between pods in the same cluster, versus those in two different clusters, to understand the impact of double encryption on latency. Additionally, double encryption requires double handshakes, which disproportionately affects the latency of new connections to the remote cluster. As you can see below, our initial connections observed an average of 2.2 milliseconds(346%) additional latency, while requests using existing connections observed an increase of 0.13 milliseconds(72%). While these numbers appear significant, it is expected that most multicluster traffic will cross availability zones or regions, and the observed increase in overhead latency will be minimal compared to the overall transit latency between data centers.
Member


Suggested change
This double-encryption does carry some overhead, however. The Data Plane Load test measures the response latency of traffic between pods in the same cluster, versus those in two different clusters, to understand the impact of double encryption on latency. Additionally, double encryption requires double handshakes, which disproportionately affects the latency of new connections to the remote cluster. As you can see below, our initial connections observed an average of 2.2 milliseconds(346%) additional latency, while requests using existing connections observed an increase of 0.13 milliseconds(72%). While these numbers appear significant, it is expected that most multicluster traffic will cross availability zones or regions, and the observed increase in overhead latency will be minimal compared to the overall transit latency between data centers.
This double encryption does carry some overhead, however. The data plane load test measures the response latency of traffic between pods in the same cluster, versus those in two different clusters, to understand the impact of double encryption on latency. Additionally, double encryption requires double handshakes, which disproportionately affects the latency of new connections to the remote cluster. As you can see below, our initial connections observed an average of 2.2 milliseconds (346%) additional latency, while requests using existing connections observed an increase of 0.13 milliseconds (72%). While these numbers appear significant, it is expected that most multicluster traffic will cross availability zones or regions, and the observed increase in overhead latency will be minimal compared to the overall transit latency between data centers.


As documented [here](/docs/ops/deployment/performance-and-scalability), the Istio control plane generally scales as the product of deployment changes, configuration changes, and the number of connected proxies. Ambient Multi Cluster adds two new dimensions to the Control Plane scalability story: number of remote clusters, and number of remote services. This means that adding 10 remote services to the mesh has substantially lower impact on the control plane performance than adding 10 local services.

Our Multicluster Control Plane Load test created 300 services with 4000 endpoints in each of 10 clusters, and added these clusters to the mesh one at a time. The approximate control plane impact of adding a remote cluster at this scale was **1% of a CPU core, and 180 MB of memory**. At this scale, it should be safe to scale well beyond 10 clusters in a mesh with a properly scaled control plane. One item to note is that for Multicluster scalability, horizontally scaling the control plane will not help, as each control plane instance maintains a complete cache of remote services. Instead, we recommend modifying the resource requests and limits of the control plane to scale vertically to meet the needs of your multicluster mesh.
Member


Suggested change
Our Multicluster Control Plane Load test created 300 services with 4000 endpoints in each of 10 clusters, and added these clusters to the mesh one at a time. The approximate control plane impact of adding a remote cluster at this scale was **1% of a CPU core, and 180 MB of memory**. At this scale, it should be safe to scale well beyond 10 clusters in a mesh with a properly scaled control plane. One item to note is that for Multicluster scalability, horizontally scaling the control plane will not help, as each control plane instance maintains a complete cache of remote services. Instead, we recommend modifying the resource requests and limits of the control plane to scale vertically to meet the needs of your multicluster mesh.
Our multicluster control plane load test created 300 services with 4000 endpoints in each of 10 clusters, and added these clusters to the mesh one at a time. The approximate control plane impact of adding a remote cluster at this scale was **1% of a CPU core, and 180 MB of memory**. At this scale, it should be safe to scale well beyond 10 clusters in a mesh with a properly scaled control plane. One item to note is that for multicluster scalability, horizontally scaling the control plane will not help, as each control plane instance maintains a complete cache of remote services. Instead, we recommend modifying the resource requests and limits of the control plane to scale vertically to meet the needs of your multicluster mesh.

test: n/a
---

Multicluster deployments with Ambient mode enable you to offer truly globally resilient applications at scale with minimal overhead. In addition to its normal functions, the Istio control plane creates watches on all remote clusters to keep an up-to-date listing of what global services each cluster offers. The Istio dataplane can route traffic to these remote global services, either as a part of normal traffic distribution, or specifically when the local service is unavailable.
Member


Suggested change
Multicluster deployments with Ambient mode enable you to offer truly globally resilient applications at scale with minimal overhead. In addition to its normal functions, the Istio control plane creates watches on all remote clusters to keep an up-to-date listing of what global services each cluster offers. The Istio dataplane can route traffic to these remote global services, either as a part of normal traffic distribution, or specifically when the local service is unavailable.
Multicluster deployments with ambient mode enable you to offer truly globally resilient applications at scale with minimal overhead. In addition to its normal functions, the Istio control plane creates watches on all remote clusters to keep an up-to-date listing of what global services each cluster offers. The Istio data plane can route traffic to these remote global services, either as a part of normal traffic distribution, or specifically when the local service is unavailable.
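
As a concrete illustration of the "global services" mentioned in the intro, opting a service into cross-cluster availability is a per-service label. Note that the label key shown here (`istio.io/global`) and the Bookinfo-style names are assumptions for illustration only; verify the exact key against the ambient multicluster docs for the Istio release you are running.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: reviews          # example name, not from this PR
  namespace: bookinfo    # example namespace, not from this PR
  labels:
    # Assumed label key for marking the service global (routable from
    # remote clusters); confirm against your release's documentation.
    istio.io/global: "true"
spec:
  selector:
    app: reviews
  ports:
    - port: 9080
      targetPort: 9080
```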

Labels: area/ambient, area/perf and scalability, kind/docs, size/M

7 participants