Add KKP Backup documentation#2106

Open
csengerszabo wants to merge 21 commits into main from kkp-backup

@csengerszabo (Contributor) commented Mar 9, 2026

This document outlines the backup and restore procedures for the Kubermatic Kubernetes Platform, emphasizing the importance of a comprehensive backup strategy, recovery objectives, and a multi-layered backup approach.

Fixes kubermatic/product-strategy#22

Signed-off-by: csengerszabo <csenger@kubermatic.com>
@kubermatic-bot added labels Mar 9, 2026: dco-signoff: yes (denotes that all commits in the pull request have the valid DCO signoff message), size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
Signed-off-by: csengerszabo <csenger@kubermatic.com>
Updated the link format for Integrated User Cluster Backup documentation and added references for KubeOne cluster backup and restore strategies.

Signed-off-by: csengerszabo <csenger@kubermatic.com>
Signed-off-by: csengerszabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Updated language to indicate that the cronjob and tools can be used for backups, rather than stating they must be used.

Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>

#### MinIO
* MinIO serves as a cluster-internal, central datastore for all Kubernetes system-related backups.
* All data stored within MinIO should be synchronized every 30 minutes to an external object storage solution (e.g., Azure Blob Storage, AWS S3) via a Kubernetes cronjob. This process utilizes the `rclone` command-line tool, which enables delta synchronization to S3-compatible datastores.
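
The synchronization job described above could be sketched as follows. This is a minimal illustration, not the implementation used by KKP: the CronJob name, the `rclone/rclone` container image, and the remote names `minio:backups` and `offsite:kkp-backups` are assumptions, and the rclone remotes would still need to be configured (for example through a mounted `rclone.conf`).

```python
import json


def rclone_sync_cronjob(source: str, dest: str, every_minutes: int = 30) -> dict:
    """Build a Kubernetes CronJob manifest (as a dict) that runs `rclone sync`."""
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "minio-offsite-sync"},  # hypothetical name
        "spec": {
            # Standard cron syntax: run every N minutes.
            "schedule": f"*/{every_minutes} * * * *",
            # Do not start a new sync while the previous one is still running.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{
                    "name": "rclone",
                    "image": "rclone/rclone",  # assumed image
                    # `rclone sync SOURCE DEST` makes DEST match SOURCE,
                    # transferring only the files that changed.
                    "args": ["sync", source, dest],
                }],
            }}}},
        },
    }


# Remote names are placeholders for illustration only.
manifest = rclone_sync_cronjob("minio:backups", "offsite:kkp-backups")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.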

Member

Contributor Author

Added the link.

Member

The link should be placed at the rclone part, not at the top.

* An etcd "ring" of N members can tolerate the loss of up to (N-1)/2 members (rounded down) and remain healthy. However, if more members are lost, the database must be restored from a backup. A snapshot from a single member of the etcd ring is sufficient to restore the entire cluster.
* The Public Key Infrastructure (PKI) encompasses the Certificate Authority (CA), certificates, and keys required for Kubernetes authentication. Backing up the PKI is equally critical for a swift recovery.
* We recommend backing up etcd snapshots and the PKI every 30 minutes and storing these backups outside the cluster.
* A Kubernetes cronjob should handle this process: it runs every 30 minutes, collects the PKI data, captures an etcd snapshot, and can use the `restic` command-line tool to upload the data to the cluster-internal MinIO storage.
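
The (N-1)/2 tolerance rule above follows from etcd's majority quorum. A small worked example, with function names chosen here for illustration:

```python
def etcd_quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def etcd_failure_tolerance(members: int) -> int:
    """Members that can be lost while a majority is still available."""
    return members - etcd_quorum(members)


for n in (1, 3, 4, 5):
    print(f"members={n} quorum={etcd_quorum(n)} tolerated_failures={etcd_failure_tolerance(n)}")
```

Note that a fourth member does not raise the tolerance above that of three members, which is why odd cluster sizes are typical.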

Member

Contributor Author

I already added this link further down the page, with the other references.

#### Kubernetes Objects
* While etcd and PKI backups are sufficient for restoring a broken cluster within the same environment, it is often necessary to restore a previous state within an otherwise functional cluster, or to migrate a previous state to an entirely new cluster.
* This is where Velero excels. Velero captures a snapshot of all objects within the cluster, enabling targeted state restoration (similar to executing `kubectl get <crd> <crd-name> -o yaml > my-object.yaml`).
* We recommend running Velero on a regular schedule, for example every 6 hours.
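
A recurring Velero backup like the one recommended above is usually expressed as a Velero `Schedule` resource. The sketch below builds such a manifest as a Python dict. The resource name and the `velero` namespace are assumptions, and the 168-hour TTL is taken from the backup schedule table in this document rather than from any KKP default.

```python
import json


def velero_schedule(name: str, every_hours: int = 6, ttl_hours: int = 168) -> dict:
    """Build a Velero Schedule manifest (as a dict) for a recurring backup."""
    if every_hours < 1 or 24 % every_hours != 0:
        raise ValueError("every_hours must divide 24 evenly")
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},  # assumed namespace
        "spec": {
            # Cron expression: at minute 0 of every Nth hour.
            "schedule": f"0 */{every_hours} * * *",
            # The template is a Velero backup spec; ttl controls retention.
            "template": {"ttl": f"{ttl_hours}h0m0s"},
        },
    }


schedule = velero_schedule("cluster-objects")  # hypothetical name
print(json.dumps(schedule, indent=2))
```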

Member

Contributor Author

Added the link there.

* Therefore, only the Prometheus database requires backing up. To balance performance and usability, we recommend backing up the database every 6 hours.
* Velero, in conjunction with its `restic` integration, can be utilized for this task.
* Velero extracts a dump of the Prometheus database and securely syncs it to the cluster-internal MinIO datastore.
* This process can be seamlessly integrated into the standard Velero backup cycle.

Member

Also link to https://github.com/kubermatic/kubermatic/tree/main/charts/backup/velero and explain how to enable it for MLA; some values need to be set.

Contributor Author

@toschneck can you elaborate, what exactly should be added here?


### User Clusters

#### etcd

Member

Contributor Author

I added the link to the first line.

* The Kubermatic Kubernetes Platform (KKP) provides a fully automated and integrated mechanism with Velero on user clusters to manage these backups, storing them on dedicated cloud storage.
* You can learn more about our Integrated User Cluster Backup feature here: [documentation of Integrated User Cluster Backup in KKP](cluster-backup/)

#### Data Replication

Member

Link again to this example, which you can reference as an example implementation: https://github.com/kubermatic/community-components/tree/master/components/rclone-s3-syncer

Contributor Author

I linked this once at the top of this section; there is no need to link it multiple times.


| Backup Job | Schedule | TTL |
| :--- | :--- | :--- |
| KKP master control plane VM backups | Once daily | 3 days |

Member

If you have etcd restic snapshots on KubeOne, this is not required.

Contributor Author

Added the comment.

| Backup Job | Schedule | TTL |
| :--- | :--- | :--- |
| KKP master control plane VM backups | Once daily | 3 days |
| MLA data (KKP master cluster objects + Prometheus data) | Every 6 hours | 168 hours (7 days) |
| KKP master etcd and PKI | Every 30 minutes | 24 hours |
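
As a quick sanity check on the schedule and TTL columns, the number of backup copies held at any time is roughly the TTL divided by the interval. The sketch below works this out for the three rows above; the dictionary keys are shortened labels, and the calculation assumes every scheduled run succeeds.

```python
from datetime import timedelta

# (interval between backups, time-to-live) per job, taken from the table.
BACKUP_JOBS = {
    "control plane VM backups": (timedelta(days=1), timedelta(days=3)),
    "MLA data": (timedelta(hours=6), timedelta(hours=168)),
    "etcd and PKI": (timedelta(minutes=30), timedelta(hours=24)),
}


def retained_copies(interval: timedelta, ttl: timedelta) -> int:
    """Approximate number of unexpired backups in steady state."""
    return ttl // interval


for job, (interval, ttl) in BACKUP_JOBS.items():
    print(f"{job}: about {retained_copies(interval, ttl)} copies, newest at most {interval} old")
```

The interval also bounds how old the newest backup can be, which is the figure to compare against the recovery point objective.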

Member

This must be done per seed as well.

Contributor Author

Added a new line for that.

## Process

* To accelerate the recovery process and minimize human error, a comprehensive disaster recovery runbook must be documented.
* This runbook should provide explicit, step-by-step instructions detailing the appropriate recovery strategies for various failure scenarios.

Member

This is not a step-by-step guide; it is more of an overview.

* These tests must be conducted at least annually and should be executed by various team members.
* This practice ensures that the documentation remains current and prevents knowledge silos within the team.

## References for KubeOne cluster backup and restore

Member

Also add the KKP backup links, plus this one, which you can reference as an example implementation: https://github.com/kubermatic/community-components/tree/master/components/rclone-s3-syncer

Contributor Author

KKP Backup is already linked. I also linked this once at the top of this section; there is no need to link it multiple times.

@toschneck
Member

Review

Image is not correct

csengerszabo and others added 7 commits March 17, 2026 10:13
Added example implementation link for backup strategy and clarified backup job descriptions.

Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
Updated image source for kkp_backup in backup tutorial.

Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
@csengerszabo
Contributor Author

/retest

csengerszabo and others added 4 commits March 18, 2026 11:44
…_edited.png

Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
…_tuned_matched copy2.png

Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
@mfahlandt
Member

/lgtm
/approve

@kubermatic-bot added the lgtm label (indicates that a PR is ready to be merged) Mar 19, 2026
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: 7471e541d66565ce850c3b655a71d683d4eb7e96

@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mfahlandt
Once this PR has been reviewed and has the lgtm label, please assign dakraus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
@kubermatic-bot removed the lgtm label (indicates that a PR is ready to be merged) Mar 23, 2026
@kubermatic-bot
Contributor

New changes are detected. LGTM label has been removed.

Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>