[CI] Update documents around the GCP runners #367
Merged
# LLVM Premerge infra - GCP runners

This document describes how the GCP-based presubmit infra works, and
explains common maintenance actions.

---
NOTE: As of today, only Googlers can administrate the cluster.
---

## Overview

Presubmit tests use GitHub workflows. A GitHub workflow can execute in two
ways:
- on GitHub-provided runners.
- on self-hosted runners.

GitHub-provided runners are not very powerful and have limitations, but they
are **FREE**.
Self-hosted runners can be large virtual machines running on GCP: very
powerful, but **expensive**.

To balance cost and performance, we keep both types:
- simple jobs like `clang-format` shall run on GitHub runners.
- building & testing LLVM shall be done on self-hosted runners.

LLVM has several flavors of self-hosted runners:
- libcxx runners.
- MacOS runners managed by Microsoft.
- GCP Windows/Linux runners managed by Google.

This document only focuses on Google's GCP-hosted runners.

The runner a workflow uses is selected in the workflow definition:

```yaml
jobs:
  my_job_name:
    # Runs on expensive GCP VMs.
    runs-on: llvm-premerge-linux-runners
```

Our self-hosted runners come in two flavors:
- Linux
- Windows

## GCP runners - Architecture overview

Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
The cluster has 3 node pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows

**llvm-premerge-linux-service** is a fixed node pool, only used to host the
services required to manage the premerge infra (controller, listeners,
monitoring). Today, this pool has only one `e2-small` machine.

**llvm-premerge-linux** is an auto-scaling node pool with large
`c2d-highcpu-56` VMs. This pool runs the Linux workflows.

**llvm-premerge-windows** is an auto-scaling node pool with large
`c2d-highcpu-56` VMs. Similar to the Linux pool, but it runs the Windows
workflows.

### Service node: llvm-premerge-linux-service

This node pool runs all the services managing the presubmit infra:
- the Action Runner Controller.
- 1 listener for the Linux runners.
- 1 listener for the Windows runners.
- Grafana Alloy to gather metrics.

The Action Runner Controller listens on the LLVM repository job queue.
Individual jobs are then handled by the listeners.

How a job is run:
- The controller informs GitHub that the self-hosted runners are live.
- A PR is uploaded on GitHub.
- The listener finds a Linux job to run.
- The listener creates a new runner pod, to be scheduled by Kubernetes.
- Kubernetes adds one instance to the Linux pool to schedule the new pod.
- The runner starts executing on the new node.
- Once finished, the runner dies, meaning the pod dies.
- If the instance is not reused within the next 10 minutes, Kubernetes scales
  down the instance.

### Worker nodes: llvm-premerge-linux, llvm-premerge-windows

To make sure each runner pod is scheduled on the correct node pool (Linux or
Windows, avoiding the service pool), we use labels & taints.
Those taints are configured in the
[ARC runner templates](linux_runners_values.yaml).
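
As a rough sketch of how labels and taints fit together (the
`premerge-platform` key and its values are hypothetical here, not taken from
the actual configuration), a runner pod template combines a node selector
with a toleration matching the worker pool's taint:

```yaml
# Hypothetical sketch of runner pod scheduling constraints; the real keys
# and values are defined in linux_runners_values.yaml.
template:
  spec:
    nodeSelector:
      premerge-platform: linux     # only schedule on Linux worker nodes
    tolerations:
      - key: premerge-platform     # tolerate the taint that keeps all
        operator: Equal            # other pods off the worker pool
        value: linux
        effect: NoSchedule
```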
|
|
||
| The other constraints we define are the resource requirements. Without | ||
| information, Kubernetes is allowed to schedule multiple pods on the instance. | ||
| This becomes very important with the container/runner tandem: | ||
| - the container HAS to run on the same instance as the runner. | ||
Keenuts marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| - the runner itself doesn't request many resources. | ||
| So if we do not enforce limits, the controller could schedule 2 runners on | ||
| the same instance, forcing containers to share resources. | ||
| Resource limits are defined in 2 locations: | ||
| - [runner configuration](linux_runners_values.yaml) | ||
| - [container template](linux_container_pod_template.yaml) | ||
Keenuts marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
# Cluster configuration

The cluster is managed using Terraform. The main configuration is
[main.tf](main.tf).

---
NOTE: As of today, only Googlers can administrate the cluster.
---

Terraform is a tool to automate infrastructure deployment. The basic usage
is to change this configuration and to call `terraform apply` to make the
required changes.
Terraform won't recreate the whole cluster from scratch every time; instead,
it tries to only apply the new changes. To do so, **Terraform needs a state**.

**If you apply changes without this state, you might break the cluster.**

The current configuration stores its state in a GCP bucket.
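
Such a state bucket is declared with a `gcs` backend block in the Terraform
configuration. A minimal sketch, where the bucket name and prefix are
placeholders (the real values are in main.tf):

```hcl
# Minimal sketch of a GCS state backend; see main.tf for the actual bucket.
terraform {
  backend "gcs" {
    bucket = "some-terraform-state-bucket"  # placeholder name
    prefix = "terraform/state"              # placeholder prefix
  }
}
```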

## Accessing Google Cloud Console

This web interface is the easiest way to get a quick look at the infra.

---
IMPORTANT: cluster state is managed with Terraform. Please DO NOT change
shapes/scaling, or other settings using the cloud console. Any change not
done through Terraform will be at best overridden by Terraform, and in the
worst case cause an inconsistent state.
---

The main part you want to look into is `Menu > Kubernetes Engine > Clusters`.

Currently, we have 3 clusters:
- `llvm-premerge-checks`: the cluster hosting BuildKite Linux runners.
- `windows-cluster`: the cluster hosting BuildKite Windows runners.
- `llvm-premerge-prototype`: the cluster for those GCP-hosted runners.

Yes, it's called `prototype`, but that's the production cluster.

To add a VM to the cluster, the VM has to come from a `pool`. A `pool` is
a group of nodes within a cluster that all have the same configuration.

For example, a pool can say it contains at most 10 nodes, each using the
`c2d-highcpu-32` configuration (32 cores, 64GB RAM).
In addition, a pool can `autoscale` ([docs](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler)).
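
The example above can be sketched with Terraform's
`google_container_node_pool` resource; the resource and cluster names below
are illustrative, not the cluster's actual configuration:

```hcl
# Illustrative pool: at most 10 c2d-highcpu-32 nodes, scaled down to 0
# when idle. Resource names are hypothetical.
resource "google_container_node_pool" "example_pool" {
  name    = "example-pool"
  cluster = google_container_cluster.example_cluster.id

  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }

  node_config {
    machine_type = "c2d-highcpu-32"  # 32 cores, 64GB RAM
  }
}
```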

If you click on `llvm-premerge-prototype`, and go to the `Nodes` tab, you
will see 3 node pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows

The definition of each pool is in the [architecture overview](architecture.md).

If you click on a pool, for example `llvm-premerge-linux`, you will see one
instance group, and maybe several nodes.

Each created node must be attached to an instance group, which is used to
manage a group of instances. Because we use automated autoscaling, and we
have a basic cluster, we have a single instance group per pool.

Then, we have the nodes. If you are looking at the panel during off hours,
you might see no nodes at all: when no presubmit is running, no VM is up.
If you are looking at the panel at peak time, you should see 4 instances
(today, autoscaling is capped at 4 instances).

If you click on a node, you'll see the CPU usage, memory usage, and can
access the logs for each instance.

As long as you don't click on actions like `Cordon`, `Edit`, `Delete`, etc.,
navigating the GCP panel should not cause any harm, so feel free to look
around to familiarize yourself with the interface.

## Setup

- Install Terraform ([install guide](https://developer.hashicorp.com/terraform/install?product_intent=terraform)).
- Get the GCP tokens: `gcloud auth application-default login`.
- Initialize Terraform: `terraform init`.

To apply any changes to the cluster:
- Run `terraform apply`.
- Terraform will list the proposed changes.
- Enter `yes` when prompted.

## Setting the cluster up for the first time

Setting the cluster up for the first time is more involved, as there are
certain resources for which Terraform is unable to handle explicit
dependencies. This means that we have to set up the GKE cluster before we
set up any of the Kubernetes resources, as otherwise the Terraform
Kubernetes provider will error out.

```
terraform apply -target google_container_node_pool.llvm_premerge_linux_service
terraform apply -target google_container_node_pool.llvm_premerge_linux
terraform apply -target google_container_node_pool.llvm_premerge_windows
terraform apply
```