Add e2e and failure investigation docs (#325)

Danil-Grigorev · web-flow · commit d00f31209345 · 2025-06-20T13:12:48.000Z
Signed-off-by: Danil-Grigorev &lt;danil.grigorev@suse.com&gt;
diff --git a/docs/src/05_developers/03_e2e.md b/docs/src/05_developers/03_e2e.md
@@ -0,0 +1,121 @@
+# E2E Test Failure Investigation Guide
+
+This guide provides a structured approach to investigating end-to-end (e2e) test failures in the cluster-api-addon-provider-fleet project.
+
+## Understanding E2E Tests
+
+Our CI pipeline runs several e2e tests to validate functionality across different Kubernetes versions:
+
+- **Cluster Class Import Tests**: Validate the cluster class import functionality
+- **Import Tests**: Validate the general import functionality
+- **Import RKE2 Tests**: Validate import functionality specific to RKE2 clusters
+
+Each test runs on multiple Kubernetes versions (stable and latest) to ensure compatibility.
+
+## Accessing Test Artifacts
+
+When e2e tests fail, the CI pipeline automatically collects and uploads artifacts containing valuable debugging information. These artifacts are created using [crust-gather](https://github.com/crust-gather/crust-gather), a tool that captures the state of Kubernetes clusters.
+
+### Finding the Artifact URL
+
+1. Navigate to the failed GitHub Actions workflow run
+2. Scroll down to the "Artifacts" section
+3. Find the artifact corresponding to the failed test (e.g., `artifacts-cluster-class-import-stable`)
+4. Copy the artifact URL (right-click on the artifact link and copy the URL)
+
+## Using the serve-artifact.sh Script
+
+The `serve-artifact.sh` script allows you to download and serve the test artifacts locally, providing access to the Kubernetes contexts from the test environment.
+
+### Prerequisites
+
+- A GitHub token with `repo` read permissions (set as `GITHUB_TOKEN` environment variable)
+- `kubectl` installed, `krew` installed.
+- [crust-gather](https://github.com/crust-gather/crust-gather) installed. Can be replicated with nix, if available.
+
+### Serving Artifacts
+
+Fetch the `serve-artifact.sh` script from the [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather):
+
+```bash
+curl -L https://raw.githubusercontent.com/crust-gather/crust-gather/refs/heads/main/serve-artifact.sh -o serve-artifact.sh && chmod +x serve-artifact.sh
+```
+
+```bash
+# Using the full artifact URL
+./serve-artifact.sh -u https://github.com/rancher/cluster-api-addon-provider-fleet/actions/runs/15737662078/artifacts/3356068059 -s 0.0.0.0:9095
+
+# OR using individual components
+./serve-artifact.sh -o rancher -r cluster-api-addon-provider-fleet -a 3356068059 -s 0.0.0.0:9095
+```
+
+This will:
+1. Download the artifact from GitHub
+2. Extract its contents
+3. Start a local server that provides access to the Kubernetes contexts from the test environment
+
+## Investigating Failures
+
+Once the artifact server is running, you can use various tools to investigate the failure:
+
+### Using k9s
+
+[k9s](https://k9scli.io/) provides a terminal UI to interact with Kubernetes clusters:
+
+1. Open a new terminal
+2. Run `k9s`
+3. Press `:` to open the command prompt
+4. Type `ctx` and press Enter
+5. Select the context from the test environment (there may be multiple contexts). `dev` for the e2e tests.
+6. Navigate through resources to identify issues:
+   - Check pods for crash loops or errors
+   - Examine events for warnings or errors
+   - Review logs from relevant components
+
+### Common Investigation Paths
+
+1. **Check Fleet Resources**:
+   - `FleetAddonConfig` resources
+   - Fleet `Cluster` resource
+   - CAPI `ClusterGroup` resources
+   - Ensure all relevant labels are present on above.
+   - Check for created `Fleet` namespace `cluster-<ns>-<cluster name>-<random-prefix>` that it is consitent with the NS in the Cluster `.status.namespace`.
+   - Check for `ClusterRegistrationToken` in the cluster namespace.
+   - Check for `BundleNamespaceMapping` in the `ClusterClass` namespace if a cluster references a `ClusterClass` in a different namespace
+
+2. **Check CAPI Resources**:
+   - Cluster resource
+   - Check for `ControlPlaneInitialized` condition to be `true`
+   - ClusterClass resources, these are present and have `status.observedGeneration` consistent with the `metadata.generation`
+   - Continue on a per-cluster basis
+
+3. **Check Controller Logs**:
+   - Look for error messages or warnings in the controller logs in the `caapf-system` namespace.
+   - Check for reconciliation failures in `manager` container. In case of upstream installation, check for `helm-manager` container logs.
+
+4. **Check Kubernetes Events**:
+   - Events often contain information about failures, otherwise `CAAPF` publishes events for each resource apply from CAPI `Cluster`, including Fleet `Cluster` in the cluster namespace, `ClusterGroup` and `BundleNamespaceMapping` in the `ClusterClass` namespace. These events are created by `caapf-controller` component.
+
+## Common Failure Patterns
+
+### Import Failures
+
+- **Symptom**: Fleet `Cluster` not created or in error state
+- **Investigation**: Check the controller logs in the `cattle-fleet-system` namespace for errors during import processing. Check for errors in the `CAAPF` logs for missing cluster definition.
+- **Common causes**:
+  - Fleet cluster import process is serial, and hot loop in other cluster import blocks further cluster imports. Fleet issue.
+  - CAPI `Cluster` is not ready and does not have `ControlPlaneInitialized` condition. Issue with CAPI or requires more time to be ready.
+  - Otherwise `CAAPF` issue.
+
+### Cluster Class Failures
+
+- **Symptom**: ClusterClass not properly imported or is not evaluated as a target.
+- **Investigation**: Check for the `BundleNamespaceMapping` in the `ClusterClass` namespace named after the `Cluster` resource. Check the controller logs in the `caapf-system` namespace for errors during ClusterClass processing. Check `ClusterGroup` resource in the `Cluster` namespace.
+- **Common causes**:
+  - Check for `Cluster` referencing `ClusterClass` in a different namespace.
+  - In the event of missing resources, `CAAPF` related error.
+
+## Reference
+
+- [crust-gather GitHub repository](https://github.com/crust-gather/crust-gather)
+- [k9s documentation](https://k9scli.io/topics/commands/)
diff --git a/src/metrics.rs b/src/metrics.rs
@@ -93,7 +93,7 @@ impl Default for Diagnostics {
     fn default() -> Self {
         Self {
             last_event: Utc::now(),
-            reporter: "doc-controller".into(),
+            reporter: "caapf-controller".into(),
         }
     }
 }

Original file line number	Diff line number	Diff line change
`@@ -93,7 +93,7 @@ impl Default for Diagnostics {`
`93`	`93`	`fn default() -> Self {`
`94`	`94`	`Self {`
`95`	`95`	`last_event: Utc::now(),`
`96`		`- reporter: "doc-controller".into(),`
	`96`	`+ reporter: "caapf-controller".into(),`
`97`	`97`	`}`
`98`	`98`	`}`
`99`	`99`	`}`