Commit d00f312
Add e2e and failure investigation docs (#325)
Signed-off-by: Danil-Grigorev <[email protected]>
1 parent c04ce1d commit d00f312

File tree

2 files changed: +122 −1 lines changed

docs/src/05_developers/03_e2e.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# E2E Test Failure Investigation Guide

This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project.

## Understanding E2E Tests

Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions:

- **Cluster Class Import Tests**: Validate the cluster class import functionality
- **Import Tests**: Validate the general import functionality
- **Import RKE2 Tests**: Validate import functionality specific to RKE2 clusters

Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility.

## Accessing Test Artifacts

When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using [crust-gather](https://github.com/crust-gather/crust-gather), a tool that captures the state of Kubernetes clusters.

### Finding the Artifact URL

1. Navigate to the failed GitHub Actions workflow run
2. Scroll down to the "Artifacts" section
3. Find the artifact corresponding to the failed test (e.g., `artifacts-cluster-class-import-stable`)
4. Copy the artifact URL (right-click on the artifact link and copy the URL)
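
If you prefer the command line, the same artifacts can be listed and downloaded with the GitHub CLI. A sketch, using the example run ID and artifact name from this guide as placeholders:

```bash
# Show the workflow run, including its artifacts (run ID is a placeholder)
gh run view 15737662078 --repo rancher/cluster-api-addon-provider-fleet

# Download a single artifact by name into the current directory
gh run download 15737662078 \
  --repo rancher/cluster-api-addon-provider-fleet \
  -n artifacts-cluster-class-import-stable
```

This requires an authenticated `gh` session with access to the repository.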

## Using the serve-artifact.sh Script

The `serve-artifact.sh` script allows you to download and serve the test artifacts locally, providing access to the Kubernetes contexts from the test environment.

### Prerequisites

- A GitHub token with `repo` read permissions (set as the `GITHUB_TOKEN` environment variable)
- `kubectl` and `krew` installed
- [crust-gather](https://github.com/crust-gather/crust-gather) installed. The setup can also be reproduced with nix, if available.

### Serving Artifacts

Fetch the `serve-artifact.sh` script from the [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather):

```bash
curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh
```

```bash
# Using the full artifact URL
./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095

# OR using individual components
./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095
```
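
The two invocations above are equivalent: the owner, repository, and artifact ID can be read straight off the artifact URL. A minimal sketch of that mapping in plain shell, using the example URL from above:

```bash
# Example artifact URL from a GitHub Actions run
url="https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059"

# Strip the scheme and host, then split the remaining path on "/"
path="${url#https://github.com/}"
owner="${path%%/*}"          # first path segment
rest="${path#*/}"
repo="${rest%%/*}"           # second path segment
artifact_id="${url##*/}"     # last path segment

echo "$owner $repo $artifact_id"
# -> rancher cluster-api-addon-provider-fleet 3356068059
```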

This will:

1. Download the artifact from GitHub
2. Extract its contents
3. Start a local server that provides access to the Kubernetes contexts from the test environment

## Investigating Failures

Once the artifact server is running, you can use various tools to investigate the failure:

### Using k9s

[k9s](https://k9scli.io/) provides a terminal UI to interact with Kubernetes clusters:

1. Open a new terminal
2. Run `k9s`
3. Press `:` to open the command prompt
4. Type `ctx` and press Enter
5. Select the context from the test environment (there may be multiple contexts; use `dev` for the e2e tests)
6. Navigate through resources to identify issues:
   - Check pods for crash loops or errors
   - Examine events for warnings or errors
   - Review logs from relevant components
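
If you prefer plain `kubectl`, the contexts served by the artifact server can be inspected directly. A sketch; the `dev` context name comes from the e2e setup described above:

```bash
# List the contexts exposed by the artifact server
kubectl config get-contexts

# Point at the e2e context and look for unhealthy pods and recent events
kubectl --context dev get pods -A
kubectl --context dev get events -A --sort-by=.lastTimestamp
```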

### Common Investigation Paths

1. **Check Fleet Resources**:
   - `FleetAddonConfig` resources
   - Fleet `Cluster` resource
   - Fleet `ClusterGroup` resources
   - Ensure all relevant labels are present on the above.
   - Check that the created Fleet namespace `cluster-<ns>-<cluster name>-<random-prefix>` is consistent with the namespace in the Cluster `.status.namespace`.
   - Check for `ClusterRegistrationToken` in the cluster namespace.
   - Check for `BundleNamespaceMapping` in the `ClusterClass` namespace if a cluster references a `ClusterClass` in a different namespace.
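
The Fleet checks above can be driven from the command line as well. A sketch, assuming Fleet's `fleet.cattle.io` API group and using `dev`/`default` as placeholder context and namespace names:

```bash
# Fleet resources created for an imported CAPI cluster
kubectl --context dev get fleetaddonconfig -A
kubectl --context dev get clusters.fleet.cattle.io -n default
kubectl --context dev get clustergroups.fleet.cattle.io -n default
kubectl --context dev get clusterregistrationtokens.fleet.cattle.io -n default
kubectl --context dev get bundlenamespacemappings.fleet.cattle.io -A
```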

2. **Check CAPI Resources**:
   - Cluster resource
   - Check that the `ControlPlaneInitialized` condition is `true`
   - ClusterClass resources: these are present and have `status.observedGeneration` consistent with `metadata.generation`
   - Continue on a per-cluster basis
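
Both CAPI checks can be scripted with `jsonpath`. A sketch; `my-cluster`, `my-class`, and `default` are placeholders:

```bash
# Read the ControlPlaneInitialized condition of a CAPI Cluster
kubectl --context dev get cluster my-cluster -n default \
  -o jsonpath='{.status.conditions[?(@.type=="ControlPlaneInitialized")].status}'

# Compare generations on a ClusterClass (should print two equal numbers)
kubectl --context dev get clusterclass my-class -n default \
  -o jsonpath='{.metadata.generation} {.status.observedGeneration}'
```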

3. **Check Controller Logs**:
   - Look for error messages or warnings in the controller logs in the `caapf-system` namespace.
   - Check for reconciliation failures in the `manager` container. In case of an upstream installation, check the `helm-manager` container logs as well.

4. **Check Kubernetes Events**:
   - Events often contain information about failures. In addition, `CAAPF` publishes events for each resource it applies from a CAPI `Cluster`: the Fleet `Cluster` in the cluster namespace, and the `ClusterGroup` and `BundleNamespaceMapping` in the `ClusterClass` namespace. These events are created by the `caapf-controller` component.
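
A sketch of the log and event checks; the deployment name is a placeholder, so list the deployments first to find the actual one:

```bash
# Find the CAAPF controller deployment, then tail its manager container
kubectl --context dev -n caapf-system get deploy
kubectl --context dev -n caapf-system logs deploy/<caapf-deployment> -c manager --tail=200

# Events mentioning the caapf-controller component
kubectl --context dev get events -A | grep caapf-controller
```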

## Common Failure Patterns

### Import Failures

- **Symptom**: Fleet `Cluster` not created or in an error state
- **Investigation**: Check the controller logs in the `cattle-fleet-system` namespace for errors during import processing. Check the `CAAPF` logs for errors about a missing cluster definition.
- **Common causes**:
  - The Fleet cluster import process is serial, so a hot loop in one cluster's import blocks further cluster imports. This is a Fleet issue.
  - The CAPI `Cluster` is not ready and does not have the `ControlPlaneInitialized` condition. This is a CAPI issue, or the cluster simply needs more time to become ready.
  - Otherwise, it is likely a `CAAPF` issue.

### Cluster Class Failures

- **Symptom**: ClusterClass not properly imported or not evaluated as a target.
- **Investigation**: Check for the `BundleNamespaceMapping` in the `ClusterClass` namespace named after the `Cluster` resource. Check the controller logs in the `caapf-system` namespace for errors during ClusterClass processing. Check the `ClusterGroup` resource in the `Cluster` namespace.
- **Common causes**:
  - A `Cluster` referencing a `ClusterClass` in a different namespace.
  - In the event of missing resources, a `CAAPF`-related error.

## Reference

- [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather)
- [k9s documentation](https://k9scli.io/topics/commands/)

src/metrics.rs

Lines changed: 1 addition & 1 deletion

```diff
@@ -93,7 +93,7 @@ impl Default for Diagnostics {
     fn default() -> Self {
         Self {
             last_event: Utc::now(),
-            reporter: "doc-controller".into(),
+            reporter: "caapf-controller".into(),
         }
     }
 }
```
