docs: CE-924 Update Disaster recovery overview #22756
base: main
Conversation
Looking great!
Besides the suggestions, please check my comment asking you to confirm that my edits are factually accurate.
## Workflow

Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The tutorial [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) covers this in further detail.
Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.
Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.
The workflow to prepare a Consul datacenter for a disaster recovery scenario requires the following:
1. Create the necessary backups
1. Have a clear restoration process
1. Test your restoration process regularly to ensure that it works as expected
Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The tutorial [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) covers this in further detail.
Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.

When you prepare your disaster recovery strategy, consider the following recommendations.
When you prepare your disaster recovery strategy, consider the following recommendations.
The exact steps in your restoration process can vary according to your unique networking needs. To prepare a clear restoration process, consider the following recommendations.
When you prepare your disaster recovery strategy, consider the following recommendations.

1. **Define your RPO and RTO**: Understand the maximum acceptable downtime and data loss for your applications.
1. **Implement regular backups**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.
1. **Implement regular backups**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.
1. **Create backups on a regular schedule**: Use the Consul snapshot feature to create regular backups of your cluster state, or use the Consul Enterprise Snapshot agent.

## Where to start

Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The [Backup and restore a Consul datacenter page](/consul/docs/manage/disaster-recovery/backup-restore) covers this in further detail. [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation) is similar, but requires additional considerations for data replication and consistency across regions.
Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The [Backup and restore a Consul datacenter page](/consul/docs/manage/disaster-recovery/backup-restore) covers this in further detail. [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation) is similar, but requires additional considerations for data replication and consistency across regions.
We recommend using [Consul's built-in snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. Refer to [Backup and restore a Consul datacenter](/consul/docs/manage/disaster-recovery/backup-restore) to learn more about taking snapshots and using them to recover a Consul datacenter.
If your Consul datacenter is part of a WAN-federated deployment, refer to [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation). WAN-federated datacenters require additional consideration to ensure data replication and consistency across regions.
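
For readers who want a concrete picture of the snapshot workflow this paragraph points to, here is a minimal Python sketch that saves a snapshot through the HTTP API's `/v1/snapshot` endpoint. The agent address, token value, output path, and both function names are assumptions chosen for illustration; they are not part of this PR or of the linked pages.

```python
import os
import shutil
import urllib.request
from typing import Optional


def build_snapshot_request(agent_addr: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the /v1/snapshot endpoint of a Consul agent."""
    url = agent_addr.rstrip("/") + "/v1/snapshot"
    req = urllib.request.Request(url, method="GET")
    if token:
        # ACL token header; only needed when ACLs are enabled.
        req.add_header("X-Consul-Token", token)
    return req


def save_snapshot(agent_addr: str, out_path: str, token: Optional[str] = None,
                  timeout: float = 60.0) -> int:
    """Stream the snapshot to out_path and return the number of bytes written."""
    req = build_snapshot_request(agent_addr, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return os.path.getsize(out_path)


if __name__ == "__main__":
    # Hypothetical local agent address and file name.
    size = save_snapshot("http://127.0.0.1:8500", "backup.snap")
    print(f"wrote {size} bytes")
```

The CLI route is `consul snapshot save <file>` to take the snapshot and `consul snapshot restore <file>` to restore it.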
</Warning>

You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage. Avoid local or ephemeral storage.
You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend to avoid local or ephemeral storage, and we suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage.
You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend to avoid local or ephemeral storage, and we suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage.
You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend that you avoid local or ephemeral storage as well as block- or file-based storage, and instead use object storage such as Azure blobs, Google Cloud storage, or AWS S3 storage.
Did I get the facts right? Trying to make it clear what to use/what not to use.
- If client agents were not configured in a way that persists access to a token, then client agents will not resume function after the restore because they do not have permissions to register with the new Consul cluster. This situation applies when the ACL token was set using the API or CLI, or if the ACL token was set in an environment variable. Use the [`consul acl set-agent-token` command](/consul/commands/acl/set-agent-token#agent) or the `CONSUL_HTTP_TOKEN` variable to update the token on client agents before you restore a cluster with a snapshot.
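
As an illustration of the token update described in this bullet, the sketch below wraps the `consul acl set-agent-token` command in Python. It assumes the `consul` binary is on the `PATH` of the client host and that the caller already has permission to set agent tokens; the function names and the placeholder token are hypothetical.

```python
import subprocess
from typing import List


def set_agent_token_cmd(token: str, token_type: str = "agent") -> List[str]:
    """Return the argv for `consul acl set-agent-token <type> <token>`."""
    if not token:
        raise ValueError("token must not be empty")
    return ["consul", "acl", "set-agent-token", token_type, token]


def set_agent_token(token: str, token_type: str = "agent") -> None:
    """Run the command on the local client agent; raises if it exits non-zero."""
    subprocess.run(set_agent_token_cmd(token, token_type), check=True)


if __name__ == "__main__":
    # Placeholder value; supply a real token from your secrets store.
    set_agent_token("REPLACE-WITH-AGENT-TOKEN")
```

Passing the token as a command-line argument makes it visible in the process list on that host, so treat this as a sketch rather than a hardened procedure.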

## Service failure recommendations
### Service failure recommendations
### Service failure recommendations
## Service failure recommendations
To architect against outages caused by disasters that impact services registered with Consul, use [cluster peering failover with sameness groups](/consul/docs/multi-tenant/sameness-group/vm). With this setup, Consul can transparently fail over requests for an unhealthy service to the same service in a different region and datacenter.

## Region failure recommendations
### Region failure recommendations
### Region failure recommendations
## Region failure recommendations
In the event of a total region failure, Consul and your services are likely down. To architect against this situation, [deploy Consul and your services in multiple regions with a global failover policy](/consul/tutorials/operate-consul/redundancy-zones) so that Consul reroutes network traffic to the alternate region during a disaster. Deploying identical Consul servers and services across multiple cloud regions satisfies datacenter latency requirements and limits the blast radius during large-scale disasters.

## Multi-cluster disaster recovery considerations
### Multi-cluster disaster recovery considerations
### Multi-cluster disaster recovery considerations
## Multi-cluster disaster recovery considerations
It is important to consider both placement of the primary Consul datacenter as well as the steps required to recover from a disaster. The recommended approach is reviewed in detail below.

### Clientless primary Consul datacenter
#### Clientless primary Consul datacenter
#### Clientless primary Consul datacenter
### Clientless primary Consul datacenter
- In the event of a disaster, the additional Consul datacenters can still continue to function independently of the primary Consul datacenter although functionality will be reduced until the primary Consul datacenter is brought back online. See the table below for more details.

### Primary Consul datacenter outage behaviors
#### Primary Consul datacenter outage behaviors
#### Primary Consul datacenter outage behaviors
### Primary Consul datacenter outage behaviors
Description
The “Consul multi-cluster disaster recovery considerations” tutorial becomes the Disaster recovery overview page
Testing & Reproduction steps
None
Links
https://hashicorp.atlassian.net/browse/CE-924
PR Checklist
PCI review checklist
I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.