Conversation

@arthurlapertosa (Contributor)

No description provided.

@arthurlapertosa arthurlapertosa marked this pull request as ready for review August 28, 2025 03:59
@arthurlapertosa arthurlapertosa requested review from a team, apeabody and ericyz as code owners August 28, 2025 03:59
@arthurlapertosa (Contributor Author) commented Aug 28, 2025

@apeabody could you please run the build for this PR?

@apeabody (Collaborator)

/gcbrun

@arthurlapertosa (Contributor Author)

@apeabody I don't have access to the GCP cloud build project. Could you please send me the error?

@apeabody (Collaborator)

> @apeabody I don't have access to the GCP cloud build project. Could you please send me the error?

Error: Reference to undeclared input variable

  on ../../modules/beta-autopilot-private-cluster/cluster.tf line 72, in resource "google_container_cluster" "primary":
  72:       confidential_instance_type = lookup(var.node_pools[0], "confidential_instance_type", null)

An input variable with the name "node_pools" has not been declared. This
variable can be declared with a variable "node_pools" {} block.
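Terraform's literal suggestion is to declare the missing variable. A minimal sketch of what such a declaration looks like in this module family (hypothetical here — the autopilot submodules normally don't declare node_pools at all, which is why the cleaner fix may be to drop the stray reference from cluster.tf instead):

```hcl
# Hypothetical variables.tf entry for the beta-autopilot-private-cluster
# submodule, mirroring the declaration used by the node-pool-based modules.
variable "node_pools" {
  type        = list(map(any))
  description = "List of maps containing node pools"
  default     = [{ name = "default-node-pool" }]
}
```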

@arthurlapertosa (Contributor Author)

@apeabody could you please re-run the build?

@apeabody (Collaborator)

/gcbrun

@arthurlapertosa (Contributor Author)

@apeabody I think the build wasn't properly triggered, could you please take a look?

@apeabody (Collaborator) commented Sep 2, 2025

/gcbrun

@apeabody (Collaborator) commented Sep 2, 2025

> @apeabody I think the build wasn't properly triggered, could you please take a look?

Might have been too quick after the merge, it's running now.

@apeabody (Collaborator) left a comment


@apeabody (Collaborator)

/gemini review

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request introduces a new example for creating a GKE cluster with confidential nodes and GPUs. This is a valuable addition. The changes include modifications to several Terraform modules to support confidential_instance_type and guest_accelerator configurations, along with the new example files and corresponding integration tests. The implementation is mostly correct, but I've found a few issues related to version constraints, external dependencies, and a bug in the for_each logic that need to be addressed.

@arthurlapertosa (Contributor Author)

@apeabody I don't have access to the build error, could you please send it to me?

@arthurlapertosa (Contributor Author)

@apeabody could you please rerun the build and the checks?

@apeabody (Collaborator) commented Nov 4, 2025

/gcbrun

@apeabody (Collaborator) commented Nov 4, 2025

> @apeabody I don't have access to the build error, could you please send it to me?

It's expired, triggered a fresh test

@apeabody (Collaborator) commented Nov 4, 2025

> @apeabody I don't have access to the build error, could you please send it to me?
>
> It's expired, triggered a fresh test

Thanks - This change looks pretty good, but we still need to enable the new test. Can you please add it to the Cloud Build config, similar to https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2458/files#diff-35f3e17c28a8a8de710f4be35fb5448d0e33c2d9b89fb2be7499b9830f890d12
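The linked diff isn't quoted here, but in this repo enabling a new example test means adding build steps for it to the integration Cloud Build config. A rough sketch of such steps, with the file path, builder image, and substitution names all assumed rather than taken from the PR:

```yaml
# Hypothetical steps in build/int.cloudbuild.yaml; step ids, image, and
# substitutions are assumptions. Verify/teardown steps follow the same pattern.
- id: init confidential-gpu-public
  waitFor:
    - prepare
  name: 'gcr.io/$PROJECT_ID/cft/developer-tools:$_DOCKER_TAG_VERSION_DEVELOPER_TOOLS'
  args: ['/bin/bash', '-c', 'cft test run TestConfidentialGpuPublic --stage init --verbose']
- id: apply confidential-gpu-public
  waitFor:
    - init confidential-gpu-public
  name: 'gcr.io/$PROJECT_ID/cft/developer-tools:$_DOCKER_TAG_VERSION_DEVELOPER_TOOLS'
  args: ['/bin/bash', '-c', 'cft test run TestConfidentialGpuPublic --stage apply --verbose']
```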

@arthurlapertosa (Contributor Author)

@apeabody sorry for the delay. Can you run the build, please?

@apeabody (Collaborator)

/gcbrun

@apeabody (Collaborator)

/gemini review

@gemini-code-assist (Contributor, bot) left a comment


Code Review

This pull request introduces a new example for creating a GKE cluster with confidential nodes and GPUs. The implementation is solid, including updates to various modules to support confidential_instance_type and GPU configurations, and the addition of a comprehensive integration test. My review focuses on the new example files, where I've identified opportunities to simplify the Terraform code, remove unused resources, and improve documentation and test clarity. Overall, these are great additions that enhance the module's capabilities.

    cluster_type           = "confidential-gpu"
    network_name           = "confidential-gpu-network-${random_string.suffix.result}"
    subnet_name            = "confidential-gpu-subnet"
    master_auth_subnetwork = "confidential-gpu-master-subnet"

Severity: medium

The local variable master_auth_subnetwork is defined but never used in this public cluster example. This should be removed to avoid confusion. The corresponding subnet resource in network.tf should also be removed.

    region     = var.region
    zones      = var.zones
    network    = module.gcp-network.network_name
    subnetwork = local.subnet_names[index(module.gcp-network.subnets_names, local.subnet_name)]

Severity: medium

This expression to look up the subnetwork name is unnecessarily complex. It can be simplified by using the local.subnet_name variable directly.

  subnetwork                        = local.subnet_name

Comment on lines +32 to +37
    {
      subnet_name           = local.master_auth_subnetwork
      subnet_ip             = "10.60.0.0/17"
      subnet_region         = var.region
      subnet_private_access = true
    },

Severity: medium

This subnet for master_auth_subnetwork is created but not used in this public cluster example. It should be removed to avoid provisioning unnecessary resources.

Comment on lines +28 to +30
    output "location" {
      value = module.gke.location
    }

Severity: medium

The location output is missing a description. Please add one for clarity and consistency with other outputs.

    output "location" {
      description = "The location (region or zone) in which the cluster resides."
      value       = module.gke.location
    }

    "nodeConfig.diskType",
    "nodeConfig.enableConfidentialStorage",
    "nodeConfig.machineType",
    "nodeConfig.diskType",

Severity: medium

The JSON path nodeConfig.diskType is duplicated in the validateJSONPaths slice (it also appears on line 72). Please remove this duplicate entry to improve code clarity.

@apeabody (Collaborator)

Step #99 - "apply test-confidential-gpu-public":         	Error:      	Received unexpected error:
Step #99 - "apply test-confidential-gpu-public":         	            	FatalError{Underlying: error while running command: exit status 1; 
Step #99 - "apply test-confidential-gpu-public":         	            	Error: Error waiting for creating GKE cluster: Insufficient quota to satisfy the request: Not all instances running in IGM after 29.827554958s. Expected 1, running 0, transitioning 1. Current errors: [GCE_QUOTA_EXCEEDED]: Instance 'gke-confidential-gpu-clu-default-pool-ad60a6ba-xnqt' creation failed: Quota 'GPUS_PER_GPU_FAMILY' exceeded.  Limit: 0.0 in region us-central1.
Step #99 - "apply test-confidential-gpu-public":         	            	
Step #99 - "apply test-confidential-gpu-public":         	            	  with module.gke.google_container_cluster.primary,
Step #99 - "apply test-confidential-gpu-public":         	            	  on ../../modules/beta-public-cluster/cluster.tf line 22, in resource "google_container_cluster" "primary":
Step #99 - "apply test-confidential-gpu-public":         	            	  22: resource "google_container_cluster" "primary" {
Step #99 - "apply test-confidential-gpu-public":         	            	}
Step #99 - "apply test-confidential-gpu-public":         	Test:       	TestConfidentialGpuPublic

@arthurlapertosa (Contributor Author)

@apeabody
Looks like the project used for testing has a GPUS_PER_GPU_FAMILY quota of 0. This quota is required for this example, since it provisions a node with a GPU attached.
Do you think we could request an increase for this quota?
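For reference, the current GPU quota in the region can be inspected from the CLI before filing an increase request; a sketch assuming access to the test project (the project ID below is a placeholder):

```shell
# List all quotas for us-central1 (placeholder project ID) as JSON;
# look for the GPUS_PER_GPU_FAMILY metric, its limit, and current usage.
gcloud compute regions describe us-central1 \
  --project=test-project-id \
  --format="json(quotas)"
```

Increases for GPU quotas are then requested through the IAM & Admin > Quotas page in the Cloud console.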
