This document outlines the backup and restore procedures for the Kubermatic Kubernetes Platform, emphasizing the importance of a comprehensive backup strategy, recovery objectives, and a multi-layered backup approach.
Signed-off-by: csengerszabo <csenger@kubermatic.com>
Updated the link format for Integrated User Cluster Backup documentation and added references for KubeOne cluster backup and restore strategies.
Signed-off-by: csengerszabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Updated language to indicate that the cronjob and tools can be used for backups, rather than stating they must be used.
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
#### MinIO
* MinIO serves as a cluster-internal, central datastore for all Kubernetes system-related backups.
* All data stored within MinIO should be synchronized every 30 minutes to an external object storage solution (e.g., Azure Blob Storage, AWS S3) via a Kubernetes cronjob. This process utilizes the `rclone` command-line tool, which enables delta synchronization to S3-compatible datastores.
https://github.com/kubermatic/community-components/tree/master/components/rclone-s3-syncer you can ref this as example implementation
Added the link.
the link should be at the rclone part not on the top
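For readers following this thread, the 30-minute sync job described in the MinIO excerpt could look roughly like the sketch below. It is not the linked rclone-s3-syncer implementation. The object name `minio-offsite-sync`, the `minio` namespace, and the rclone remote names `minio:backups` and `offsite:backups` are hypothetical, and the sketch assumes the `rclone/rclone` image uses `rclone` as its entrypoint. The rclone remote configuration (credentials, endpoints) is deliberately left out.

```python
import json


def rclone_sync_cronjob(source: str, destination: str,
                        schedule: str = "*/30 * * * *") -> dict:
    """Build a Kubernetes CronJob manifest that runs `rclone sync`.

    All names below are illustrative assumptions, not values from the PR.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "minio-offsite-sync", "namespace": "minio"},
        "spec": {
            "schedule": schedule,            # every 30 minutes by default
            "concurrencyPolicy": "Forbid",   # do not start a run while one is active
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{
                    "name": "rclone",
                    "image": "rclone/rclone",  # assumed to have `rclone` as entrypoint
                    "args": ["sync", source, destination],
                }],
            }}}},
        },
    }


manifest = rclone_sync_cronjob("minio:backups", "offsite:backups")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped into `kubectl apply -f -` once the remote configuration is mounted into the container.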
* An etcd "ring" of N members can tolerate the loss of up to ⌊(N-1)/2⌋ members and remain healthy. However, if more members are lost, the database must be restored from a backup. A snapshot from a single member of the etcd ring is sufficient to restore the entire cluster.
* The Public Key Infrastructure (PKI) encompasses the Certificate Authority (CA), certificates, and keys required for Kubernetes authentication. Backing up the PKI is equally critical for a swift recovery.
* We recommend backing up etcd snapshots and the PKI every 30 minutes and storing these backups outside the cluster.
* A Kubernetes cronjob should handle this process: it runs every 30 minutes, collects the PKI data, captures an etcd snapshot, and can use the `restic` command-line tool to upload the data to the cluster-internal MinIO storage.
https://docs.kubermatic.com/kubeone/main/examples/addons-backup/
Link to the built-in backup solution
I already added this link down on the page with other references.
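As a quick check of the (N-1)/2 rule quoted in the etcd excerpt above, the arithmetic can be sketched as follows. The function name is made up for illustration. The division is an integer (floor) division, which matters for even ring sizes.

```python
def etcd_fault_tolerance(members: int) -> int:
    """Return how many members a ring of this size can lose and keep quorum."""
    if members < 1:
        raise ValueError("an etcd ring needs at least one member")
    quorum = members // 2 + 1   # majority required to commit writes
    return members - quorum     # same value as (members - 1) // 2


for n in (1, 3, 4, 5, 7):
    print(f"{n} members -> tolerates {etcd_fault_tolerance(n)} failure(s)")
```

A 4-member ring tolerates only one failure, the same as a 3-member ring, which is why odd ring sizes are the usual choice.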
#### Kubernetes Objects
* While etcd and PKI backups are sufficient for restoring a broken cluster within the same environment, it is often necessary to restore a previous state within an otherwise functional cluster, or to migrate a previous state to an entirely new cluster.
* This is where Velero excels. Velero captures a snapshot of all objects within the cluster, enabling targeted state restoration (similar to executing `kubectl get <crd> <crd-name> -o yaml > my-object.yaml`).
* We recommend running Velero on a regular schedule, for example every 6 hours.
We deliver a default for this; please link https://github.com/kubermatic/kubermatic/tree/main/charts/backup/velero
- potentially also a documentation link
Added the link there.
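To make the "every 6 hours" recommendation from the Velero excerpt concrete, a Velero `Schedule` object could be generated along these lines. This is a sketch under assumptions, not the content of the linked chart: the schedule name `kkp-objects`, the `velero` namespace, the all-namespaces selector, and the 168-hour TTL are illustrative choices.

```python
import json


def velero_schedule(name: str, cron: str = "0 */6 * * *",
                    ttl: str = "168h0m0s") -> dict:
    """Build a Velero Schedule manifest; name, namespace and TTL are assumptions."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,                  # at minute 0 of every 6th hour
            "template": {
                "ttl": ttl,                    # how long each backup is retained
                "includedNamespaces": ["*"],   # back up objects from all namespaces
            },
        },
    }


schedule = velero_schedule("kkp-objects")
print(json.dumps(schedule, indent=2))
```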
* Therefore, only the Prometheus database requires backing up. To balance performance and usability, we recommend backing up the database every 6 hours.
* Velero, in conjunction with its `restic` integration, can be utilized for this task.
* Velero extracts a dump of the Prometheus database and securely syncs it to the cluster-internal MinIO datastore.
* This process can be seamlessly integrated into the standard Velero backup cycle.
Also link to https://github.com/kubermatic/kubermatic/tree/main/charts/backup/velero and describe how to enable it for MLA; some values need to be set.
@toschneck can you elaborate, what exactly should be added here?
### User Clusters

#### etcd
I added the link into the first line.
* The Kubermatic Kubernetes Platform (KKP) provides a fully automated and integrated mechanism with Velero on user clusters to manage these backups, storing them on dedicated cloud storage.
* You can learn more about our Integrated User Cluster Backup feature here: [documentation of Integrated User Cluster Backup in KKP](cluster-backup/)

#### Data Replication
link again to this example https://github.com/kubermatic/community-components/tree/master/components/rclone-s3-syncer you can ref this as example implementation
I linked this once on the top of this section, no need to link it multiple times.
| Backup Job | Schedule | TTL |
| :--- | :--- | :--- |
| KKP master control plane VM backups | Once daily | 3 days |
If you have the etcd restic snapshot on KubeOne, this is not required.
Added the comment.
| Backup Job | Schedule | TTL |
| :--- | :--- | :--- |
| KKP master control plane VM backups | Once daily | 3 days |
| MLA data (KKP master cluster objects + Prometheus data) | Every 6 hours | 168 hours (7 days) |
| KKP master etcd and PKI | Every 30 minutes | 24 hours |
- This must be done per seed as well.
Added a new line for that.
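Reading the Schedule and TTL columns of the table excerpt above together, the number of backups kept per job at any one time can be estimated as TTL divided by interval. The sketch below assumes every scheduled run succeeds and that expiry happens exactly at the TTL; the dictionary keys are shortened job names from the table.

```python
from datetime import timedelta

# (schedule interval, TTL) per backup job, taken from the table in the excerpt.
BACKUP_JOBS = {
    "KKP master control plane VM backups": (timedelta(days=1), timedelta(days=3)),
    "MLA data": (timedelta(hours=6), timedelta(hours=168)),
    "KKP master etcd and PKI": (timedelta(minutes=30), timedelta(hours=24)),
}


def retained_copies(interval: timedelta, ttl: timedelta) -> int:
    """Estimate how many backups exist at steady state for one job."""
    return ttl // interval  # timedelta // timedelta yields an int


for job, (interval, ttl) in BACKUP_JOBS.items():
    print(f"{job}: about {retained_copies(interval, ttl)} copies retained")
```

These counts are a useful input when sizing the MinIO volume and the external object storage.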
## Process

* To accelerate the recovery process and minimize human error, a comprehensive disaster recovery runbook must be documented.
* This runbook should provide explicit, step-by-step instructions detailing the appropriate recovery strategies for various failure scenarios.
this is not a step-by-step guide, it's more an overview
* These tests must be conducted at least annually and should be executed by various team members.
* This practice ensures that the documentation remains current and prevents knowledge silos within the team.

## References for KubeOne cluster backup and restore
Also add the KKP backup links + https://github.com/kubermatic/community-components/tree/master/components/rclone-s3-syncer you can ref this as example implementation
KKP Backup is already linked. Plus I linked this once on the top of this section, no need to link it multiple times.
Added example implementation link for backup strategy and clarified backup job descriptions.
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
Updated image source for kkp_backup in backup tutorial. Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
/retest
…_edited.png Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
…_tuned_matched copy2.png Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
/lgtm
LGTM label has been added.
Git tree hash: 7471e541d66565ce850c3b655a71d683d4eb7e96
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mfahlandt
Signed-off-by: Csenger Szabo <szabo.csenger@gmail.com>
New changes are detected. LGTM label has been removed.
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>
Signed-off-by: Csenger Szabo <csenger@kubermatic.com>

This document outlines the backup and restore procedures for the Kubermatic Kubernetes Platform, emphasizing the importance of a comprehensive backup strategy, recovery objectives, and a multi-layered backup approach.
Fixes kubermatic/product-strategy#22