e2e: PP: cover ExecCPUAffinity support in tests #1432
shajmakh wants to merge 2 commits into openshift:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: shajmakh. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Depends on #1426
6ef4f1a to beeea3d
565820a to 0af067c
regarding /test e2e-gcp-pao-workloadhints
GCP cluster profile uses the ipi-gcp flow, which by default uses 6 vCPUs for compute machines (see `step-registry/ipi/conf/gcp/ipi-conf-ref.yaml`). The performance profile suite configures a profile with `reserved: "0"` and `isolated: "1-3"` (see openshift/cluster-node-tuning-operator#909), unless environment variables are specified. In general it is good practice to include all of the node's CPUs in the PP cpu section, but the reason we need this now is that some new tests require most of the CPUs to be distributed via the PP (see openshift/cluster-node-tuning-operator#1432 (comment)). In this commit we start by updating only the affected job on which the test would run; later we will need to add this setting to all other jobs that consume the ipi-gcp cluster configuration. Note: this is subject to change should the CPU specifications on GCP be modified. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
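For reference, the cpu section described above would look roughly like this in a PerformanceProfile (a sketch, assuming all five non-reserved CPUs of the 6-vCPU GCP compute node are handed to the profile, as the commit recommends):

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: "0"    # kept for housekeeping
    isolated: "1-5"  # all remaining CPUs of the 6-vCPU compute node
```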
/retest
when I temporarily removed the failing test due to the node topology misaligning with the PP cpu section,
/hold
0af067c to 41afeca
/test e2e-aws-ovn
acdb51a to a8b7158
/retest

/unhold
SargunNarula
left a comment
Thanks for the tests. IMO some tests are redundant and can be removed.
}

var err error
testPod := pods.MakePodWithResources(ctx, workerRTNode, qos, containersResources)
That is an option, indeed, but since I want to use the same function in different suites (not only in mixed cpus), I added one that provides this functionality with QoS and multi-containers for the pod.
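For illustration, a minimal sketch of the shape of such a helper, using hypothetical simplified types (the real helper builds a corev1.Pod via the suite's pods package; names and the signature here are invented):

```go
package main

import "fmt"

// Resources is a hypothetical stand-in for corev1.ResourceList.
type Resources struct{ CPU, Memory string }

// Container and Pod are simplified stand-ins for the corev1 types.
type Container struct {
	Name     string
	Requests Resources
	Limits   Resources
}

type Pod struct {
	QoS        string
	Containers []Container
}

// makePodWithResources builds one container per resources entry; for
// guaranteed QoS it mirrors requests into limits, which is what makes
// kubelet classify the pod as Guaranteed.
func makePodWithResources(qos string, resources []Resources) *Pod {
	pod := &Pod{QoS: qos}
	for i, r := range resources {
		c := Container{Name: fmt.Sprintf("cnt-%d", i), Requests: r}
		if qos == "Guaranteed" {
			c.Limits = r
		}
		pod.Containers = append(pod.Containers, c)
	}
	return pod
}

func main() {
	// Two containers: one with explicit resources, one left empty,
	// mirroring the {} "cnt1 resources" entries in the test table.
	pod := makePodWithResources("Guaranteed", []Resources{
		{CPU: "2", Memory: "100Mi"},
		{},
	})
	fmt.Println(len(pod.Containers), pod.QoS)
}
```

The point of routing everything through one builder is that the same function can serve the mixed-cpus, cpu-management, and exec-affinity suites with different QoS classes and container counts.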
test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go — three outdated review threads, resolved
		sharedCpusResource: resource.MustParse("1"),
	},
}),
Entry("best-effort pod with shared CPU request",
I think it would be better to keep the best-effort scenario under cpu_management only, since it does not depend on shared CPUs.
I think we need to keep isolation and make sure that the test passes when mixed cpus is enabled too.
	// cnt1 resources
	{},
}),
Entry("burstable pod with shared CPU request",
Same case with burstable
I see your point, but when shared cpus is enabled the flow becomes different when execCPUAffinity is enabled as well. So I believe we should ensure that the same set of tests (BE and BU) passes also on a cluster with that config. WDYT?
test/e2e/performanceprofile/functests/1_performance/cpu_management.go — three outdated review threads, resolved
571923e to b8bd51e
@SargunNarula Thanks for your valuable review. I've addressed your comments. Let me know if the new version addresses your concerns. Thanks!
@shajmakh I think all the concerns are covered now and the tests look good to me. /lgtm
test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go — two outdated review threads, resolved
updatedIsolated = *mustParse(string(*profile.Spec.CPU.Isolated))
currentShared := mustParse(string(*profile.Spec.CPU.Shared))
if len(currentShared.List()) < 2 {
	testlog.Info("shared cpuset has less than 2 cpus; this test requires at least 2 shared cpus; update the profile")
shouldn't we abort/skip here?
The intention here was to allow this test to run even if it requires a PP update, as far as the cluster topology allows. I do not assume that the PP config at the time the test runs has the minimum required shared CPUs (2); in case it doesn't, the test will reconfigure the PP so this test can run. There is no point in running it when there is only 1 shared CPU, which is how the suite setup configures this setting in the profile:
https://github.com/shajmakh/cluster-node-tuning-operator/blob/b8bd51eab141d263aa3d95aa37b36f1e6cfece5c/test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go#L770
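The check under discussion reduces to parsing the profile's shared cpuset string and counting CPUs. A minimal stdlib sketch of that logic (the real code uses the Kubernetes cpuset package's Parse and List):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUSet expands a kernel-style cpuset string such as "0-3,5"
// into a slice of CPU ids. An empty string is a valid, empty set.
func parseCPUSet(s string) ([]int, error) {
	var cpus []int
	if strings.TrimSpace(s) == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return nil, err
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(strings.TrimSpace(hi)); err != nil {
				return nil, err
			}
		}
		for c := start; c <= end; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	// The suite setup gives the profile a single shared CPU ("2"),
	// so the test must first grow the shared set before it can run.
	shared, _ := parseCPUSet("2")
	if len(shared) < 2 {
		fmt.Println("shared cpuset has", len(shared), "cpu(s); reconfiguring profile for 2 shared cpus")
	}
	grown, _ := parseCPUSet("2-3")
	fmt.Println("new shared set:", grown)
}
```

With this in hand, `len(shared) < 2` is the condition that triggers the in-test profile update rather than a skip.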
test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go — two outdated review threads, resolved
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go — two outdated review threads, resolved
By("Run exec command on the pod and verify the process is pinned not only to the first exclusive CPU")

for i := 0; i < retries; i++ {
	cmd := []string{"/bin/bash", "-c", "sleep 10 & SLPID=$!; ps -o psr -p $SLPID;"}
We really need a more robust approach here.
val, ok := profile.Annotations[performancev2.PerformanceProfileExecCPUAffinityAnnotation]
if ok && val == performancev2.PerformanceProfileExecCPUAffinityDisable {
	// fail loudly because the default should be enabled
	Fail("exec-cpu-affinity is disabled in the profile")
Why not use Expect, as we always do?
cpusIncludingShared, err := cpuset.Parse(cpusetCfg.Cpus)
Expect(err).ToNot(HaveOccurred(), "Failed to parse cpuset config for test pod cpus=%q", cpusetCfg.Cpus)
testlog.Infof("all CPUs dedicated for the container (including shared if requested): %s", cpusIncludingShared.String())
firstCPU := cpusIncludingShared.List()[0]
We should check that it is not empty first; cpuset.Parse(cpusetCfg.Cpus) can succeed (no error) and return an empty CPUSet.
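A sketch of the guard being asked for, with a plain int slice standing in for the result of cpuset.CPUSet's List():

```go
package main

import "fmt"

// firstCPU returns the first CPU id, guarding against the empty set
// that cpuset.Parse can legally return with a nil error.
func firstCPU(cpus []int) (int, bool) {
	if len(cpus) == 0 {
		return -1, false
	}
	return cpus[0], true
}

func main() {
	if _, ok := firstCPU(nil); !ok {
		fmt.Println("empty cpuset: fail the assertion instead of panicking on index 0")
	}
	if cpu, ok := firstCPU([]int{2, 3}); ok {
		fmt.Println("first exclusive CPU:", cpu)
	}
}
```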
isSharedCPUsRequested := container.Resources.Limits.Name(sharedCpusResource, resource.DecimalSI).Value() > 0
if isSharedCPUsRequested {
	cntShared := cpusIncludingShared.Difference(updatedIsolated)
	firstCPU = cntShared.List()[0]
AfterEach(func() {
	deleteTestPod(context.TODO(), testpod)
	Expect(pods.Delete(context.TODO(), testpod)).To(BeTrue(), "Failed to delete pod")
If you're already adding an assertion, I would add the pod name as well.
val, ok := profile.Annotations[performancev2.PerformanceProfileExecCPUAffinityAnnotation]
if ok && val == performancev2.PerformanceProfileExecCPUAffinityDisable {
	// fail loudly because the default should be enabled
	Fail("exec-cpu-affinity is disabled in the profile")
You can use Expect here, as we always do.
Expect(getter).ToNot(BeNil())

By("Checking if exec-cpu-affinity is disabled, if not disable it")
initialProfile, _ = profiles.GetByNodeLabels(testutils.NodeSelectorLabels)
retries := 20
By("Run exec command on the pod and verify the process is pinned not only to the first exclusive CPU")

for i := 0; i < retries; i++ {
Why not use Eventually? You get better control over the timing and you'd be using a built-in method.
}

func Delete(ctx context.Context, pod *corev1.Pod) bool {
	err := testclient.DataPlaneClient.Get(ctx, client.ObjectKeyFromObject(pod), pod)
I would pass the client as an argument, just for the case that we might need to use a different one in the future.
Add main e2e tests that check the behavior of performance-profile with `ExecCPUAffinity: first` and without it (legacy). Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add unit tests for functions in resources helper package for tests. Assisted-by: Cursor v1.2.2 AI-Attribution: AIA Entirely AI, Human-initiated, Reviewed, Cursor v1.2.2 v1.0 Signed-off-by: Shereen Haj <shajmakh@redhat.com>
b8bd51e to 305d0c1
@shajmakh: The following tests failed, say

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Add basic e2e tests that check the default behavior of performance-profile with the default-enabled `ExecCPUAffinity: first`.