docs: CE-924 Update Disaster recovery overview #22756

Open

krastin wants to merge 1 commit into main from krastin/ce-924

Contributor

krastin commented Sep 12, 2025

Description

The “Consul multi-cluster disaster recovery considerations” tutorial becomes the Disaster recovery overview page

Testing & Reproduction steps

None

Links

https://hashicorp.atlassian.net/browse/CE-924

PR Checklist

updated test coverage
external facing docs updated
appropriate backport labels added
not a security concern

PCI review checklist

I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.


Commit 67ad058: initial edits

krastin requested a review from boruszak

September 12, 2025 19:52

krastin self-assigned this

krastin requested review from a team as code owners

September 12, 2025 19:52

krastin added type/docs pr/no-changelog pr/no-metrics-test backport/1.21 labels

boruszak approved these changes

Contributor

boruszak left a comment

Looking great!

Besides the suggestions, please check my comment asking you to confirm that my edits are factually accurate.

website/content/docs/manage/disaster-recovery/index.mdx

    
              ## Workflow

              Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The tutorial [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) covers this in further detail.

              Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.

Contributor

boruszak Sep 15, 2025

Suggested change

      
            Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.
          
            The workflow to prepare a Consul datacenter for a disaster recovery scenario requires the following steps:
          
            1. Create the necessary backups.
          
            1. Have a clear restoration process.
          
            1. Test your restoration process regularly to ensure that it works as expected.

website/content/docs/manage/disaster-recovery/index.mdx

    
              Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The tutorial [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) covers this in further detail.

              Disaster recovery for Consul should typically involve creating the necessary backups, having a clear restoration process, and testing this process regularly to ensure that it works as expected.

              When you prepare your disaster recovery strategy, consider the following recommendations.

Contributor

boruszak Sep 15, 2025

Suggested change

      
            When you prepare your disaster recovery strategy, consider the following recommendations.
          
            The exact steps in your restoration process can vary according to your unique networking needs. To prepare a clear restoration process, consider the following recommendations.

website/content/docs/manage/disaster-recovery/index.mdx

    
              When you prepare your disaster recovery strategy, consider the following recommendations.

              1. **Define your RPO and RTO**: Understand the maximum acceptable downtime and data loss for your applications.

              1. **Implement regular backups**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.

Contributor

boruszak Sep 15, 2025

Suggested change

      
            1. **Implement regular backups**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.
          
            1. **Create backups on a regular schedule**: Use the Consul snapshot feature to create regular backups of your cluster state, or utilise the Consul Enterprise Snapshot agent.
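
To illustrate how the RPO recommendation ties into the backup schedule, here is a minimal sketch. It assumes that the data at risk is whatever changed since the newest snapshot was taken, so a schedule only honors the RPO if the newest snapshot is never older than the RPO. The function name `snapshot_within_rpo` and the example values are hypothetical and do not come from the Consul docs.

```python
from datetime import datetime, timedelta, timezone


def snapshot_within_rpo(last_snapshot: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the newest snapshot is no older than the RPO.

    Restoring from a snapshot loses every change made after it was taken,
    so the snapshot's age is the worst-case data loss at time `now`.
    """
    return now - last_snapshot <= rpo


if __name__ == "__main__":
    # Hypothetical values: a 1-hour RPO and a snapshot taken 30 minutes ago.
    now = datetime.now(timezone.utc)
    last = now - timedelta(minutes=30)
    print(snapshot_within_rpo(last, now, rpo=timedelta(hours=1)))
```

A check like this could run as part of monitoring, so that a stalled backup job is noticed before the RPO is breached rather than during a restore.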

website/content/docs/manage/disaster-recovery/index.mdx

    
              ## Where to start

              Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The [Backup and restore a Consul datacenter page](/consul/docs/manage/disaster-recovery/backup-restore) covers this in further detail. [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation) is similar, but requires additional considerations for data replication and consistency across regions.

Contributor

boruszak Sep 15, 2025

Suggested change

      
            Our recommended method for backing up Consul state uses the [built-in Consul snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. The [Backup and restore a Consul datacenter page](/consul/docs/manage/disaster-recovery/backup-restore) covers this in further detail. [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation) is similar, but requires additional considerations for data replication and consistency across regions.
          
            We recommend using [Consul's built-in snapshot feature](/consul/commands/snapshot), which is available through the HTTP API or CLI. Refer to [Backup and restore a Consul datacenter](/consul/docs/manage/disaster-recovery/backup-restore) to learn more about taking snapshots and using them to recover a Consul datacenter.
          
            If your Consul datacenter is part of a WAN-federated deployment, refer to [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation). WAN-federated datacenters require additional consideration to ensure data replication and consistency across regions.
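
For readers who want to see what using the snapshot feature through the CLI can look like, the following is a minimal sketch that wraps `consul snapshot save` in Python. It assumes the `consul` binary is on the `PATH` and that the agent address and ACL token are supplied through the usual environment variables (`CONSUL_HTTP_ADDR`, `CONSUL_HTTP_TOKEN`). The helper names, the backup directory, and the file naming scheme are hypothetical choices for illustration, not something this PR or the Consul docs prescribe.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_path(backup_dir: Path, datacenter: str, now: datetime) -> Path:
    """Build a timestamped snapshot file name (hypothetical naming scheme)."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return backup_dir / f"consul-{datacenter}-{stamp}.snap"


def snapshot_save_command(target: Path) -> list[str]:
    """Return the argument list for `consul snapshot save <file>`."""
    return ["consul", "snapshot", "save", str(target)]


def take_snapshot(backup_dir: Path, datacenter: str) -> Path:
    """Save a snapshot with the Consul CLI and return the file that was written.

    Raises subprocess.CalledProcessError if the CLI exits with a non-zero status.
    """
    target = snapshot_path(backup_dir, datacenter, datetime.now(timezone.utc))
    subprocess.run(snapshot_save_command(target), check=True)
    return target


if __name__ == "__main__":
    # Hypothetical mount point and datacenter name.
    print(take_snapshot(Path("/mnt/consul-backups"), "dc1"))
```

Keeping the path and command construction in separate functions makes them easy to check without a running Consul agent. Uploading the resulting file to object storage is left out of this sketch.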

website/content/docs/manage/disaster-recovery/index.mdx

    
              </Warning>

              You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage.  We suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage. Avoid local or ephemeral storage.

              You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend to avoid local or ephemeral storage, and we suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage.

Contributor

boruszak Sep 15, 2025

Suggested change

      
            You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend to avoid local or ephemeral storage, and we suggest the use of object storage versus block or file based storage, such as Azure blobs, Google Cloud storage, or AWS S3 storage. 
          
            You should take snapshots of Consul clusters on a regular basis and store them on mounted or external storage. We recommend that you avoid local or ephemeral storage as well as block- or file-based storage, and instead use object storage such as Azure blobs, Google Cloud storage, or AWS S3 storage.

Did I get the facts right? Trying to make it clear what to use/what not to use.

website/content/docs/manage/disaster-recovery/index.mdx

    
              - If client agents were not configured in a way that persists access to a token, then client agents will not resume function after the restore because they do not have permissions to register with the new Consul cluster. This situation applies when the ACL token was set using the API or CLI, or if the ACL token was set in an environment variable. Use the [`consul acl set-agent-token` command](/consul/commands/acl/set-agent-token#agent) or the `CONSUL_HTTP_TOKEN` variable to update the token on client agents before you restore a cluster with a snapshot.

              ## Service failure recommendations

              ### Service failure recommendations

Contributor

boruszak Sep 15, 2025

Suggested change

      
            ### Service failure recommendations
          
            ## Service failure recommendations

website/content/docs/manage/disaster-recovery/index.mdx

    
              To architect against outages caused by disasters that impact services registered with Consul, use [cluster peering failover with sameness groups](/consul/docs/multi-tenant/sameness-group/vm). With this setup, Consul can transparently fail over requests for an unhealthy service to the same service in a different region and datacenter.

              ## Region failure recommendations

              ### Region failure recommendations

Contributor

boruszak Sep 15, 2025

Suggested change

      
            ### Region failure recommendations
          
            ## Region failure recommendations

website/content/docs/manage/disaster-recovery/index.mdx

    
              In the event of a total region failure, Consul and your services are likely down. To architect against this situation, [deploy Consul and your services in multiple regions with a global failover policy](/consul/tutorials/operate-consul/redundancy-zones) so that Consul reroutes network traffic to the alternate region during a disaster. Deploying identical Consul servers and services across multiple cloud regions satisfies datacenter latency requirements and limits the blast radius during large-scale disasters.

              ## Multi-cluster disaster recovery considerations

              ### Multi-cluster disaster recovery considerations

Contributor

boruszak Sep 15, 2025

Suggested change

      
            ### Multi-cluster disaster recovery considerations
          
            ## Multi-cluster disaster recovery considerations

website/content/docs/manage/disaster-recovery/index.mdx

    
              It is important to consider both placement of the primary Consul datacenter as well as the steps required to recover from a disaster. The recommended approach is reviewed in detail below.

              ### Clientless primary Consul datacenter

              #### Clientless primary Consul datacenter

Contributor

boruszak Sep 15, 2025

Suggested change

      
            #### Clientless primary Consul datacenter
          
            ### Clientless primary Consul datacenter

website/content/docs/manage/disaster-recovery/index.mdx

    
              - In the event of a disaster, the additional Consul datacenters can still continue to function independently of the primary Consul datacenter although functionality will be reduced until the primary Consul datacenter is brought back online. See the table below for more details.

              ### Primary Consul datacenter outage behaviors

              #### Primary Consul datacenter outage behaviors

Contributor

boruszak Sep 15, 2025

Suggested change

      
            #### Primary Consul datacenter outage behaviors
          
            ### Primary Consul datacenter outage behaviors

danielehc mentioned this pull request

Update "Disaster recovery for WAN-federated datacenters" #22834

Open
