+++
title = "KKP Backup"
date = 2026-03-09T19:45:00+02:00
weight = 151
+++

## Introduction

The following is a recommended procedure for backing up and restoring the Kubermatic Kubernetes Platform (KKP). A comprehensive backup strategy is essential to guarantee recovery from any potential disaster scenario. Consequently, all system components must be backed up regularly to separate storage media, ensuring rapid restoration when necessary.

The primary objective during a disaster is to restore the system to the most recent state possible (RPO) as quickly as possible (RTO).

* **Recovery Point Objective (RPO):** Defines the maximum acceptable amount of data loss a customer can tolerate during a disaster.
* **Recovery Time Objective (RTO):** Defines the maximum acceptable amount of time it takes to recover a system or application after a disaster.

## Overview


We propose a multi-layered approach to backups. Backing up the underlying virtual machine (VM) infrastructure and its storage enables the rapid restoration of core system components, thereby reducing the overall RTO.

Furthermore, at the application layer, Kubernetes tools — most notably Velero — are utilized to maintain low RPO and RTO for application state and functionality.

<img width="2240" src="kkpbackupstrategy.png" style="height: auto;" alt="kkp_backup" />


## Recommended backup and disaster recovery strategy

### Infrastructure
* Control plane VMs must be backed up regularly (ideally daily) at the virtual machine level.
* Backups must be exact clones of the VMs, including copies of all attached volumes.
* These backups must reside on a separate datastore.
* In the event of a node failure, the underlying VM must be fully restorable from its backup.
* Tools such as Veeam, or native cloud provider VM backup solutions, can be utilized to automate these regular backups.

### Kubermatic Master / Seed Level

#### MinIO
* MinIO serves as a cluster-internal, central datastore for all Kubernetes system-related backups.
* All data stored within MinIO should be synchronized every 30 minutes to an external object storage solution (e.g., Azure Blob Storage, AWS S3) via a Kubernetes cronjob. This process utilizes the `rclone` command-line tool, which enables delta synchronization to S3-compatible datastores. Here is an [example implementation](https://github.com/kubermatic/community-components/tree/master/components/rclone-s3-syncer).
* Aggregating all backup data in a cluster-internal datastore before external synchronization provides significant benefits regarding redundancy, performance, and ease of use.
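The synchronization job described above can be sketched as a Kubernetes CronJob. This is a minimal illustration, not the maintained implementation: the bucket names, the `minio` and `external-s3` remote names, and the `rclone-config` Secret are all assumptions (see the linked rclone-s3-syncer example for a working setup).

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rclone-s3-syncer          # illustrative name
  namespace: minio                # assumes MinIO runs in this namespace
spec:
  schedule: "*/30 * * * *"        # every 30 minutes, as recommended above
  concurrencyPolicy: Forbid       # never run two syncs concurrently
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: rclone
              image: rclone/rclone:latest
              # "minio" and "external-s3" are remotes defined in the mounted
              # rclone.conf; the bucket names are placeholders.
              args: ["sync", "minio:kkp-backups", "external-s3:kkp-backups-offsite"]
              volumeMounts:
                - name: rclone-config
                  mountPath: /config/rclone
          volumes:
            - name: rclone-config
              secret:
                secretName: rclone-config   # holds rclone.conf with both remotes
```

Because `rclone sync` transfers only deltas, the 30-minute cycle stays cheap even for large backup buckets.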

#### etcd / PKI
* Check the documentation of KKP's [etcd backup feature](https://docs.kubermatic.com/kubermatic/main/tutorials-howtos/etcd-backups/).
* The etcd database contains the complete state of a Kubernetes cluster.
* An etcd "ring" can tolerate the loss of up to (N-1)/2 nodes and remain healthy. However, if more nodes are lost, the database must be restored from a backup. A snapshot from a single member of the etcd ring is sufficient to restore the entire cluster.
* The Public Key Infrastructure (PKI) encompasses the Certificate Authority (CA), certificates, and keys required for Kubernetes authentication. Backing up the PKI is equally critical for a swift recovery.
* We recommend backing up etcd snapshots and the PKI every 30 minutes and storing these backups outside the cluster.
* A Kubernetes cronjob should handle this process: it runs every 30 minutes, collects the PKI data, captures an etcd snapshot, and can use the `restic` command-line tool to upload the data to the cluster-internal MinIO storage.
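Such a CronJob could look like the following sketch. The container image, etcd endpoint, PKI mount path, and bucket name are assumptions for illustration; the container is expected to receive `RESTIC_PASSWORD` and MinIO S3 credentials via environment variables.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-pki-backup           # illustrative name
spec:
  schedule: "*/30 * * * *"        # every 30 minutes, as recommended above
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              # hypothetical image bundling etcdctl and restic
              image: registry.example.com/etcd-restic-backup:latest
              command: ["/bin/sh", "-ec"]
              args:
                - |
                  # A snapshot from a single healthy member is sufficient (see above).
                  etcdctl snapshot save /backup/etcd-snapshot.db \
                    --endpoints=https://etcd-0.etcd.svc:2379 \
                    --cacert=/pki/ca.crt --cert=/pki/client.crt --key=/pki/client.key
                  # Collect the PKI (CA, certificates, keys) alongside the snapshot.
                  cp -r /pki /backup/pki
                  # Upload to the cluster-internal MinIO via restic's S3 backend;
                  # credentials and RESTIC_PASSWORD come from the environment.
                  restic -r s3:http://minio.minio.svc:9000/etcd-backups backup /backup
```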
#### Kubernetes Objects
* While etcd and PKI backups are sufficient for restoring a broken cluster within the same environment, it is often necessary to restore a previous state within an otherwise functional cluster, or to migrate a previous state to an entirely new cluster.
* This is where Velero excels. Velero captures a snapshot of all objects within the cluster, enabling targeted state restoration (similar to executing `kubectl get <crd> <crd-name> -o yaml > my-object.yaml`).
* We recommend running Velero backups every 6 hours, for example. Sensible defaults are already configured; see our [Velero chart](https://github.com/kubermatic/kubermatic/tree/main/charts/backup/velero).
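The six-hour cycle can be expressed as a Velero `Schedule` resource. The sketch below is illustrative: the name is made up, and the KKP Velero chart may already create an equivalent schedule for you.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-objects           # illustrative name
  namespace: velero
spec:
  schedule: "0 */6 * * *"         # every 6 hours
  template:
    includedNamespaces: ["*"]     # snapshot objects from all namespaces
    ttl: 168h0m0s                 # keep each backup for 7 days
```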

#### MLA Data
* All Monitoring, Logging, and Alerting (MLA) configurations, including alerts and dashboards, are stored as infrastructure-as-code within the platform owner's codebase.
* Therefore, only the Prometheus database requires backing up. To balance performance and usability, we recommend backing up the database every 6 hours.
* Velero, in conjunction with its `restic` integration, can be utilized for this task.
* Velero extracts a dump of the Prometheus database and securely syncs it to the cluster-internal MinIO datastore.
* This process integrates seamlessly into the standard Velero backup cycle; the defaults are already configured in our [Velero chart](https://github.com/kubermatic/kubermatic/tree/main/charts/backup/velero).
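With Velero's restic-based file-system backup, volumes are opted in per pod via an annotation on the pod template. A minimal sketch, assuming the Prometheus data volume is named `prometheus-data` (an assumption; use the actual volume name from your StatefulSet):

```yaml
# Excerpt of the Prometheus pod template: the annotation tells Velero to
# back up the named pod volume with its restic integration.
metadata:
  annotations:
    backup.velero.io/backup-volumes: prometheus-data   # assumed volume name
```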

### User Clusters

#### etcd
* To restore the state of user clusters during disaster recovery, etcd database snapshots must be available.
* The Kubermatic Kubernetes Platform (KKP) provides a fully automated mechanism for the regular backup and restoration of user clusters via snapshots.
* This configuration must be applied on a per-cluster basis.
* We strongly recommend enabling this feature and scheduling it to run every 20 minutes. These snapshots are subsequently stored on the cluster-internal MinIO.
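The per-cluster configuration can be sketched as an `EtcdBackupConfig` resource in the cluster's namespace on the seed. The names and the cluster ID below are placeholders, and the exact schema may differ between KKP versions; consult KKP's etcd backup documentation for the authoritative reference.

```yaml
apiVersion: kubermatic.k8c.io/v1
kind: EtcdBackupConfig
metadata:
  name: frequent-etcd-backup      # illustrative name
  namespace: cluster-zvc78xnnz7   # the cluster's namespace on the seed (placeholder ID)
spec:
  cluster:
    apiVersion: kubermatic.k8c.io/v1
    kind: Cluster
    name: zvc78xnnz7              # placeholder cluster ID
  schedule: "*/20 * * * *"        # every 20 minutes, as recommended above
  keep: 20                        # retain the last 20 snapshots (~6.5 hours)
```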

#### Workload
* Stateless user cluster workloads can be easily restored from etcd backups.
* However, once a user cluster runs stateful applications, additional measures are required.
* The Kubermatic Kubernetes Platform (KKP) provides a fully automated and integrated mechanism with Velero on user clusters to manage these backups, storing them on dedicated cloud storage.
* You can learn more in the [Integrated User Cluster Backup documentation](cluster-backup/).

#### Data Replication
* Furthermore, we recommend designing any stateful application with built-in replication mechanisms. Storage should be replicated across multiple nodes, and ideally, across different datacenters or availability zones.


## Backup Schedule

Below is the suggested schedule configuration, categorized by backup job, frequency, and retention period (Time to Live - TTL).

| Backup Job | Schedule | TTL |
| :--- | :--- | :--- |
| KKP master control plane VM backups * | Once daily | 3 days |
| MLA data (KKP master cluster objects + Prometheus data) | Every 6 hours | 168 hours (7 days) |
| KKP master etcd and PKI | Every 30 minutes | 24 hours |
| KKP seed etcd and PKI | Every 30 minutes | 24 hours |
| User-cluster control-planes | Every 20 minutes | Last 20 (~6.5 hours) |
| Cluster-internal backups | Every 30 minutes | Configured via External Storage Provider |

`*` If etcd snapshots are already captured via KubeOne's restic-based backups addon, this is not required.


## Process

* To accelerate the recovery process and minimize human error, a comprehensive disaster recovery runbook must be documented.
* This runbook should provide explicit, step-by-step instructions detailing the appropriate recovery strategies for various failure scenarios.
* To validate both the recovery procedures and the runbook itself, regular disaster recovery drills are essential.
* The staging environment serves as an ideal platform for simulating these diverse disaster scenarios.
* These tests must be conducted at least annually and should be executed by various team members.
* This practice ensures that the documentation remains current and prevents knowledge silos within the team.

## References for KubeOne cluster backup and restore

If KKP master and seed cluster lifecycle management is handled by KubeOne, you can find further information on backup strategies for KubeOne at the following links:

* [Backups Addon in KubeOne](../kubeone/main/examples/addons-backup/)
* [KubeOne Manual Cluster Recovery](../kubeone/main/guides/manual-cluster-recovery/)
* [KubeOne Manual Cluster Repair](../kubeone/main/guides/manual-cluster-repair/)

