krastin
Contributor

@krastin krastin commented Sep 12, 2025

Description

The “Consul multi-cluster disaster recovery considerations” tutorial becomes the Disaster recovery overview page

Testing & Reproduction steps

None

Links

https://hashicorp.atlassian.net/browse/CE-924

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

@krastin krastin requested a review from boruszak September 12, 2025 19:52
@krastin krastin self-assigned this Sep 12, 2025
@krastin krastin requested review from a team as code owners September 12, 2025 19:52
@krastin krastin added type/docs Documentation needs to be created/updated/clarified pr/no-changelog PR does not need a corresponding .changelog entry pr/no-metrics-test backport/1.21 Changes are backported to 1.21 labels Sep 12, 2025
Contributor

@boruszak boruszak left a comment

Looking great!

Besides the suggestions, please check my comment asking you to confirm that my edits are factually accurate.

## Workflow

Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The tutorial [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) covers this in further detail.
Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.

Suggested change
Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.
The workflow to prepare a Consul datacenter for a disaster recovery scenario requires the following:
1. Create the necessary backups.
1. Have a clear restoration process.
1. Test your restoration process regularly to ensure that it works as expected.

Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The tutorial [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) covers this in further detail.
Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.

When you prepare your disaster recovery strategy, consider the following recommendations.

Suggested change
When you prepare your disaster recovery strategy, consider the following recommendations.
The exact steps in your restoration process can vary according to your unique networking needs. To prepare a clear restoration process, consider the following recommendations.

When you prepare your disaster recovery strategy, consider the following recommendations.

1. **Define your RPO and RTO**: Understand the maximum acceptable downtime and data loss for your applications.
1. **Implement regular backups**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.

Suggested change
1. **Implement regular backups**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.
1. **Create backups on a regular schedule**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.


## Where to start

Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The [Backup and restore a Consul datacenter page](/consul/docs/manage/disaster-recovery/backup-restore) covers this in further detail. [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation) is similar, but requires additional considerations for data replication and consistency across regions.

Suggested change
Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The [Backup and restore a Consul datacenter page](/consul/docs/manage/disaster-recovery/backup-restore) covers this in further detail. [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation) is similar, but requires additional considerations for data replication and consistency across regions.
We recommend using [Consul's built-in snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. Refer to [Backup and restore a Consul datacenter](/consul/docs/manage/disaster-recovery/backup-restore) to learn more about taking snapshots and using them to recover a Consul datacenter.
If your Consul datacenter is part of a WAN-federated deployment, refer to [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation). WAN-federated datacenters require additional consideration to ensure data replication and consistency across regions.
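
For reference, the HTTP API route mentioned above can be sketched roughly as follows. This is a minimal Python illustration, assuming a reachable Consul agent and the `/v1/snapshot` endpoint with an optional `X-Consul-Token` header. The agent address, destination path, and the helper names `build_snapshot_request` and `save_snapshot` are hypothetical and not part of the page under review.

```python
import shutil
import urllib.request
from typing import Optional


def build_snapshot_request(consul_addr: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the snapshot endpoint of a Consul agent.

    consul_addr is an assumed base URL such as "http://127.0.0.1:8500".
    """
    url = consul_addr.rstrip("/") + "/v1/snapshot"
    req = urllib.request.Request(url, method="GET")
    if token:
        # ACL token with permission to take snapshots (assumed to be required
        # only when ACLs are enabled).
        req.add_header("X-Consul-Token", token)
    return req


def save_snapshot(consul_addr: str, dest_path: str,
                  token: Optional[str] = None, timeout: float = 60.0) -> None:
    """Stream the snapshot archive returned by the agent to dest_path."""
    req = build_snapshot_request(consul_addr, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

A scheduled job could call `save_snapshot("http://127.0.0.1:8500", "/mnt/backups/consul.snap")` and then upload the file to object storage; both values are placeholders.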

</Warning>

You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage. Avoid local or ephemeral storage.
You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend to avoid local or ephemeral storage, and we suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage.

Suggested change
You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend to avoid local or ephemeral storage, and we suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage.
You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend that you avoid local or ephemeral storage as well as block or file based storage, and instead use object storage such as Azure blobs, Google Cloud storage, or AWS S3 storage.

Did I get the facts right? Trying to make it clear what to use/what not to use.

- If client agents were not configured in a way that persists access to a token, then client agents will not resume function after the restore because they do not have permissions to register with the new Consul cluster. This situation applies when the ACL token was set using the API or CLI, or if the ACL token was set in an environment variable. Use the [`consul acl set-agent-token` command](/consul/commands/acl/set-agent-token#agent) or the `CONSUL_HTTP_TOKEN` variable to update the token on client agents before you restore a cluster with a snapshot.
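
As an illustration of the token update described above, the following Python sketch builds the HTTP equivalent of `consul acl set-agent-token agent <token>`. It assumes the agent token endpoint `PUT /v1/agent/token/agent` with a JSON body of the form `{"Token": "..."}`. The helper name `build_set_agent_token_request`, the address, and the token values are hypothetical.

```python
import json
import urllib.request


def build_set_agent_token_request(agent_addr: str, new_agent_token: str,
                                  auth_token: str) -> urllib.request.Request:
    """Build a PUT request that sets the agent token on one client agent.

    agent_addr is an assumed base URL such as "http://127.0.0.1:8500".
    auth_token is a token allowed to modify the agent (assumption).
    """
    url = agent_addr.rstrip("/") + "/v1/agent/token/agent"
    body = json.dumps({"Token": new_agent_token}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="PUT")
    req.add_header("Content-Type", "application/json")
    req.add_header("X-Consul-Token", auth_token)
    return req


def set_agent_token(agent_addr: str, new_agent_token: str, auth_token: str,
                    timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    req = build_set_agent_token_request(agent_addr, new_agent_token, auth_token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Running this against each client agent before the restore mirrors the CLI step. Whether the token survives an agent restart depends on the agent's token persistence setting, which this sketch does not change.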

## Service failure recommendations
### Service failure recommendations

Suggested change
### Service failure recommendations
## Service failure recommendations

To architect against outages caused by disasters that impact services registered with Consul, use [cluster peering failover with sameness groups](/consul/docs/multi-tenant/sameness-group/vm). With this setup, Consul can transparently fail over requests for an unhealthy service to the same service in a different region and datacenter.

## Region failure recommendations
### Region failure recommendations

Suggested change
### Region failure recommendations
## Region failure recommendations

In the event of a total region failure, Consul and your services are likely down. To architect against this situation, [deploy Consul and your services in multiple regions with a global failover policy](/consul/tutorials/operate-consul/redundancy-zones) so that Consul reroutes network traffic to the alternate region during a disaster. Deploying identical Consul servers and services across multiple cloud regions satisfies datacenter latency requirements and limits the blast radius during large-scale disasters.

## Multi-cluster disaster recovery considerations
### Multi-cluster disaster recovery considerations

Suggested change
### Multi-cluster disaster recovery considerations
## Multi-cluster disaster recovery considerations

It is important to consider both placement of the primary Consul datacenter as well as the steps required to recover from a disaster. The recommended approach is reviewed in detail below.

### Clientless primary Consul datacenter
#### Clientless primary Consul datacenter

Suggested change
#### Clientless primary Consul datacenter
### Clientless primary Consul datacenter

- In the event of a disaster, the additional Consul datacenters can still continue to function independently of the primary Consul datacenter although functionality will be reduced until the primary Consul datacenter is brought back online. See the table below for more details.

### Primary Consul datacenter outage behaviors
#### Primary Consul datacenter outage behaviors

Suggested change
#### Primary Consul datacenter outage behaviors
### Primary Consul datacenter outage behaviors
