| 1 | +# E2E Test Failure Investigation Guide |
| 2 | + |
| 3 | +This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project. |
| 4 | + |
| 5 | +## Understanding E2E Tests |
| 6 | + |
| 7 | +Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions: |
| 8 | + |
| 9 | +- **Cluster Class Import Tests**: Validate the cluster class import functionality |
| 10 | +- **Import Tests**: Validate the general import functionality |
| 11 | +- **Import RKE2 Tests**: Validate import functionality specific to RKE2 clusters |
| 12 | + |
| 13 | +Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility. |
| 14 | + |
| 15 | +## Accessing Test Artifacts |
| 16 | + |
| 17 | +When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using [crust-gather](https://github.com/crust-gather/crust-gather), a tool that captures the state of Kubernetes clusters. |
| 18 | + |
| 19 | +### Finding the Artifact URL |
| 20 | + |
| 21 | +1. Navigate to the failed GitHub Actions workflow run |
| 22 | +2. Scroll down to the "Artifacts" section |
| 23 | +3. Find the artifact corresponding to the failed test (e.g., `artifacts-cluster-class-import-stable`) |
| 24 | +4. Copy the artifact URL (right-click on the artifact link and copy the URL) |
| 25 | + |
| 26 | +## Using the serve-artifact.sh Script |
| 27 | + |
| 28 | +The `serve-artifact.sh` script allows you to download and serve the test artifacts locally, providing access to the Kubernetes contexts from the test environment. |
| 29 | + |
| 30 | +### Prerequisites |
| 31 | + |
| 32 | +- A GitHub token with `repo` read permissions, set as the `GITHUB_TOKEN` environment variable |
| 33 | +- `kubectl` and `krew` installed |
| 34 | +- [crust-gather](https://github.com/crust-gather/crust-gather) installed. The setup can also be reproduced with nix, if available. |
| 35 | + |
| 36 | +### Serving Artifacts |
| 37 | + |
| 38 | +Fetch the `serve-artifact.sh` script from the [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather): |
| 39 | + |
| 40 | +```bash |
| 41 | +curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh |
| 42 | +``` |
| 43 | + |
| 44 | +```bash |
| 45 | +# Using the full artifact URL |
| 46 | +./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095 |
| 47 | + |
| 48 | +# OR using individual components |
| 49 | +./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095 |
| 50 | +``` |
| 51 | + |
| 52 | +This will: |
| 53 | +1. Download the artifact from GitHub |
| 54 | +2. Extract its contents |
| 55 | +3. Start a local server that provides access to the Kubernetes contexts from the test environment |
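
The two invocation forms above carry the same information: the `-o`, `-r`, and `-a` values correspond to the owner, repository, and trailing artifact ID in the artifact URL. A minimal sketch of that mapping, assuming the URL has the shape shown in the example; `parse_artifact_url` is a hypothetical helper for illustration and is not part of `serve-artifact.sh`:

```bash
# Hypothetical helper: derive the -o/-r/-a arguments from a GitHub artifact URL.
# Assumes the form https://github.com/<owner>/<repo>/actions/runs/<run>/artifacts/<id>.
parse_artifact_url() {
  url="$1"
  case "$url" in
    https://github.com/*/*/actions/runs/*/artifacts/*) ;;
    *) echo "unrecognized artifact URL: $url" >&2; return 1 ;;
  esac
  path="${url#https://github.com/}"  # <owner>/<repo>/actions/runs/<run>/artifacts/<id>
  owner="${path%%/*}"                # text before the first slash
  rest="${path#*/}"                  # drop the owner segment
  repo="${rest%%/*}"                 # next segment is the repository
  artifact_id="${url##*/}"           # text after the last slash
  echo "-o $owner -r $repo -a $artifact_id"
}
```

For example, `parse_artifact_url "$URL"` prints the component flags that could be passed to `./serve-artifact.sh` together with `-s 0.0.0.0:9095`.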
| 56 | + |
| 57 | +## Investigating Failures |
| 58 | + |
| 59 | +Once the artifact server is running, you can use various tools to investigate the failure: |
| 60 | + |
| 61 | +### Using k9s |
| 62 | + |
| 63 | +[k9s](https://k9scli.io/) provides a terminal UI to interact with Kubernetes clusters: |
| 64 | + |
| 65 | +1. Open a new terminal |
| 66 | +2. Run `k9s` |
| 67 | +3. Press `:` to open the command prompt |
| 68 | +4. Type `ctx` and press Enter |
| 69 | +5. Select the context from the test environment. There may be multiple contexts; the e2e tests use `dev`. |
| 70 | +6. Navigate through resources to identify issues: |
| 71 | + - Check pods for crash loops or errors |
| 72 | + - Examine events for warnings or errors |
| 73 | + - Review logs from relevant components |
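
If k9s is not at hand, the same first-pass checks can be approximated with plain `kubectl`. This is a sketch under assumptions: it presumes the artifact server has registered the `dev` context in your kubeconfig, and the function names are hypothetical. Filtering is done client-side with `awk` on the default table output rather than relying on server-side selectors.

```bash
# Assumption: the served artifact is reachable through the `dev` context.
CONTEXT="${CONTEXT:-dev}"

# List pods whose STATUS column is neither Running nor Completed.
# Columns with -A: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl --context "$CONTEXT" get pods -A --no-headers \
    | awk '$4 != "Running" && $4 != "Completed"'
}

# List Warning events across all namespaces, sorted by last timestamp.
# Columns with -A: NAMESPACE LAST-SEEN TYPE REASON OBJECT MESSAGE.
warning_events() {
  kubectl --context "$CONTEXT" get events -A --no-headers \
    --sort-by=.lastTimestamp \
    | awk '$3 == "Warning"'
}
```

Run `kubectl config get-contexts` first to confirm which contexts the server exposed.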
| 74 | + |
| 75 | +### Common Investigation Paths |
| 76 | + |
| 77 | +1. **Check Fleet Resources**: |
| 78 | + - `FleetAddonConfig` resources |
| 79 | + - Fleet `Cluster` resource |
| 80 | + - CAPI `ClusterGroup` resources |
| 81 | + - Ensure all relevant labels are present on the resources above. |
| 82 | + - Check that the created `Fleet` namespace `cluster-<ns>-<cluster name>-<random-prefix>` is consistent with the namespace in the Cluster `.status.namespace`. |
| 83 | + - Check for `ClusterRegistrationToken` in the cluster namespace. |
| 84 | + - Check for `BundleNamespaceMapping` in the `ClusterClass` namespace if a cluster references a `ClusterClass` in a different namespace |
| 85 | + |
| 86 | +2. **Check CAPI Resources**: |
| 87 | + - Cluster resource |
| 88 | + - Check that the `ControlPlaneInitialized` condition is `true` |
| 89 | + - ClusterClass resources: confirm they are present and that `status.observedGeneration` is consistent with `metadata.generation` |
| 90 | + - Continue on a per-cluster basis |
| 91 | + |
| 92 | +3. **Check Controller Logs**: |
| 93 | + - Look for error messages or warnings in the controller logs in the `caapf-system` namespace. |
| 94 | + - Check for reconciliation failures in the `manager` container. For an upstream installation, check the `helm-manager` container logs. |
| 95 | + |
| 96 | +4. **Check Kubernetes Events**: |
| 97 | + - Events often contain information about failures. In addition, `CAAPF` publishes an event for each resource it applies from a CAPI `Cluster`, including Fleet `Cluster` in the cluster namespace, `ClusterGroup` and `BundleNamespaceMapping` in the `ClusterClass` namespace. These events are created by the `caapf-controller` component. |
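
Several of the checks above can be scripted against the served context. The following is a minimal sketch, not a definitive procedure: the function names are hypothetical, the `dev` context is assumed, and the `ControlPlaneInitialized` lookup assumes the condition is reported under `.status.conditions`, which may differ between CAPI API versions.

```bash
# Assumption: the served artifact is reachable through the `dev` context.
CONTEXT="${CONTEXT:-dev}"

# Succeeds when the CAPI Cluster reports ControlPlaneInitialized=True.
control_plane_initialized() {
  ns="$1"; name="$2"
  status="$(kubectl --context "$CONTEXT" -n "$ns" get clusters.cluster.x-k8s.io "$name" \
    -o jsonpath='{.status.conditions[?(@.type=="ControlPlaneInitialized")].status}')"
  [ "$status" = "True" ]
}

# Succeeds when the ClusterClass status.observedGeneration matches metadata.generation.
clusterclass_in_sync() {
  ns="$1"; name="$2"
  gens="$(kubectl --context "$CONTEXT" -n "$ns" get clusterclasses.cluster.x-k8s.io "$name" \
    -o jsonpath='{.metadata.generation} {.status.observedGeneration}')"
  generation="${gens%% *}"   # text before the first space
  observed="${gens##* }"     # text after the last space; empty if never observed
  [ -n "$observed" ] && [ "$generation" = "$observed" ]
}

# Prints the namespace recorded in the Fleet Cluster .status.namespace.
fleet_cluster_namespace() {
  ns="$1"; name="$2"
  kubectl --context "$CONTEXT" -n "$ns" get clusters.fleet.cattle.io "$name" \
    -o jsonpath='{.status.namespace}'
}
```

Each function takes a namespace and a resource name, for example `control_plane_initialized default my-cluster`; a non-zero exit status points at the resource to inspect further.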
| 98 | + |
| 99 | +## Common Failure Patterns |
| 100 | + |
| 101 | +### Import Failures |
| 102 | + |
| 103 | +- **Symptom**: Fleet `Cluster` not created or in error state |
| 104 | +- **Investigation**: Check the controller logs in the `cattle-fleet-system` namespace for errors during import processing. Check the `CAAPF` logs for errors about a missing cluster definition. |
| 105 | +- **Common causes**: |
| 106 | + - The Fleet cluster import process is serial, so a hot loop in one cluster import blocks further cluster imports. This is a Fleet issue. |
| 107 | + - The CAPI `Cluster` is not ready and does not have the `ControlPlaneInitialized` condition. Either this is a CAPI issue or the cluster requires more time to become ready. |
| 108 | + - Otherwise, a `CAAPF` issue. |
| 109 | + |
| 110 | +### Cluster Class Failures |
| 111 | + |
| 112 | +- **Symptom**: ClusterClass not properly imported or not evaluated as a target |
| 113 | +- **Investigation**: Check for the `BundleNamespaceMapping` in the `ClusterClass` namespace named after the `Cluster` resource. Check the controller logs in the `caapf-system` namespace for errors during ClusterClass processing. Check `ClusterGroup` resource in the `Cluster` namespace. |
| 114 | +- **Common causes**: |
| 115 | + - A `Cluster` references a `ClusterClass` in a different namespace. |
| 116 | + - Missing resources indicate a `CAAPF`-related error. |
| 117 | + |
| 118 | +## Reference |
| 119 | + |
| 120 | +- [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather) |
| 121 | +- [k9s documentation](https://k9scli.io/topics/commands/) |