Update "Disaster recovery for WAN-federated datacenters" #22834
@@ -0,0 +1,166 @@ | ||||||||||||||
--- | ||||||||||||||
layout: docs | ||||||||||||||
page_title: Disaster preparation strategy | ||||||||||||||
description: >- | ||||||||||||||
Prepare for Consul disaster recovery using best practice recommendations. Implement a backup plan and a disaster recovery plan (DRP) to minimize downtime in case a disaster event happens in your deployment. | ||||||||||||||
--- | ||||||||||||||
|
||||||||||||||
# Disaster preparation strategy | ||||||||||||||
|
||||||||||||||
This topic provides an overview of the best practices for preparing a disaster recovery strategy for your Consul cluster. | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
## Introduction | ||||||||||||||
|
||||||||||||||
Disaster recovery is an important part of business continuity planning. | ||||||||||||||
|
||||||||||||||
When defining a disaster preparation strategy, you should take into account the following two parameters: | ||||||||||||||
|
||||||||||||||
- **Recovery point objective (RPO)** - The maximum amount of data loss that can be incurred from a disaster, failure, or comparable event. RPO is measured in units of time and usually corresponds directly to backup frequency: if you back up every hour, you can lose up to one hour of data.
- **Recovery time objective (RTO)** - The amount of time that passes between application failure and full restoration of availability. You can keep RTO relatively short by having another datacenter location available for disaster recovery, where services and data are replicated on a regular basis.
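
To make the link between RPO and backup frequency concrete, the following sketch checks whether a backup schedule can satisfy a target RPO. It rests on a simplifying assumption: the worst-case data loss equals the interval between two consecutive backups. The function names are illustrative and are not part of Consul.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Return the worst-case data loss for a periodic backup schedule.

    Simplifying assumption: a disaster that strikes just before the next
    backup loses everything written since the previous one.
    """
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Check whether the backup interval keeps data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


if __name__ == "__main__":
    # Hourly backups against a 4-hour RPO, then 6-hourly backups.
    print(meets_rpo(timedelta(hours=1), timedelta(hours=4)))  # True
    print(meets_rpo(timedelta(hours=6), timedelta(hours=4)))  # False
```

In practice, the time needed to take and ship a backup also counts against the RPO, so treat the interval as a lower bound on the data you can lose.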
|
||||||||||||||
Restoring a Consul cluster after a disastrous event, such as the complete loss of one or more datacenters or regions, typically requires a full redeployment of a new Consul datacenter to replace the lost one. Following best practices for deployment and automation greatly reduces the time it takes to perform these steps:
|
||||||||||||||
- [Use a recommended architecture](#use-a-recommended-architecture) for your datacenter. | ||||||||||||||
- [Automate your deployment](#automate-your-deployment) process to reduce deploy times and human errors. | ||||||||||||||
- [Implement a backup strategy](#implement-a-backup-strategy) to reduce RPO. | ||||||||||||||
- Have a [TLS certificate distribution process](#tls-certificate-distribution-process) in place. | ||||||||||||||
- Adopt an adequate [ACL down policy](#acl-down-policy). | ||||||||||||||
- Use [federation strategies to mitigate outages](#federation-strategies-to-mitigate-outages). | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
## Use a recommended architecture | ||||||||||||||
|
||||||||||||||
Not every outage has the same level of impact. Much of the resiliency of your Consul datacenter relies on proper configuration and the adoption of a recommended architecture. Following a standard architecture also makes the deployment process consistent across your organization, which makes it easier to automate.
|
||||||||||||||
Refer to [Consul Reference Architecture](/consul/tutorials/production-deploy/reference-architecture) to learn about the recommended configurations for your Consul datacenter. | ||||||||||||||
|
||||||||||||||
If you are using Kubernetes, refer to [Consul on Kubernetes reference architecture](/consul/tutorials/production-kubernetes/kubernetes-reference-architecture).
|
||||||||||||||
Enterprise users can use [redundancy zones](/consul/tutorials/operate-consul/redundancy-zones) to provide fault tolerance even in case of a total region failure. | ||||||||||||||
|
||||||||||||||
## Automate your deployment | ||||||||||||||
|
||||||||||||||
The amount of downtime you experience from the loss of your Consul datacenter is directly proportional to the amount of time it takes you to deploy a new datacenter. | ||||||||||||||
|
||||||||||||||
Redeploying an entire datacenter after an outage is a non-trivial operation that might require a considerable amount of time. You can reduce your recovery time (RTO) by following Infrastructure as Code (IaC) principles and using tools such as [Terraform](/terraform/intro) and [Vault](/vault/docs/about-vault/what-is-vault) to help you in the deployment and recovery process.
|
||||||||||||||
Refer to the following documentation to set up your datacenters:
|
||||||||||||||
- [Deployment Guide](/consul/tutorials/production-deploy/deployment-guide) | ||||||||||||||
- [Securing Consul with ACLs](/consul/docs/secure/acl) | ||||||||||||||
|
||||||||||||||
If you are using Kubernetes, refer to the following documentation:

- [Consul and Kubernetes deployment guide](/consul/tutorials/production-kubernetes/kubernetes-deployment-guide)
|
||||||||||||||
To learn more, refer to HashiCorp's [Well-Architected Framework](/well-architected-framework/what-is) documentation for best practices that can help you define and automate your processes, optimize your resources and costs, design reliable systems, and secure your infrastructure and services.
|
||||||||||||||
|
||||||||||||||
|
||||||||||||||
## Implement a backup strategy | ||||||||||||||
|
||||||||||||||
A Consul datacenter's state is more than just the initial configuration. It includes data that is generated during normal operations, such as KV entries, ACL tokens, and intentions. When your datacenter fails, this information is lost and cannot be recreated manually without a backup.
|
||||||||||||||
|
||||||||||||||
Restoring from a snapshot ensures all the intentions, KV entries and ACL tokens are reintroduced. | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
Follow [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) to learn how to take a snapshot of your Consul datacenter that you can restore from in case of disaster.
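
As a minimal sketch of how snapshot creation could be scripted, the following helper builds a `consul snapshot save` invocation with a timestamped file name so that successive backups do not overwrite each other. The helper name, the backup directory, and the file naming scheme are assumptions made for illustration. Only the `consul snapshot save <file>` command itself comes from the Consul CLI.

```python
from datetime import datetime, timezone


def snapshot_command(now: datetime, backup_dir: str) -> list[str]:
    """Build a `consul snapshot save` command with a timestamped file name.

    `backup_dir` and the `consul-<timestamp>.snap` naming scheme are
    hypothetical; adapt them to your own backup layout.
    """
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return ["consul", "snapshot", "save", f"{backup_dir}/consul-{stamp}.snap"]


if __name__ == "__main__":
    cmd = snapshot_command(datetime.now(timezone.utc), "/var/backups/consul")
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(cmd))
```

To actually take the snapshot, you could pass the returned list to `subprocess.run(cmd, check=True)` on a host that can reach a Consul agent. When ACLs are enabled, the command also needs a token with sufficient privileges.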
|
||||||||||||||
|
||||||||||||||
|
||||||||||||||
## TLS certificate distribution process | ||||||||||||||
|
||||||||||||||
Certificates are stored on the agent disk and are not saved in a snapshot. This means you must regenerate them if you lose access to the agent's data.
|
||||||||||||||
|
||||||||||||||
Consul comes equipped with a command, [`consul tls cert create`](/consul/commands/tls/cert), that permits you to generate TLS certificates for the agents. This simplifies the automation of deployment by giving you the ability to generate a CA and TLS certificates as part of the process. | ||||||||||||||
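
The following sketch shows one way a deployment script might assemble the certificate bootstrap steps. It assumes the `consul tls ca create` subcommand and the `-server` and `-dc` flags of `consul tls cert create`; verify them against the CLI reference for your Consul version. The function name and the one-certificate-per-server layout are illustrative choices, not a prescribed procedure.

```python
def tls_bootstrap_commands(datacenter: str, server_count: int) -> list[list[str]]:
    """Return the CLI invocations that create a CA and one server certificate per server.

    Assumption: certificates are generated in the directory that holds the
    CA files, and one server certificate is issued per server agent.
    """
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    commands = [["consul", "tls", "ca", "create"]]
    for _ in range(server_count):
        commands.append(
            ["consul", "tls", "cert", "create", "-server", "-dc", datacenter]
        )
    return commands


if __name__ == "__main__":
    for command in tls_bootstrap_commands("dc1", 3):
        print(" ".join(command))
```

Because the generated CA key is what lets you issue new certificates after a disaster, store it somewhere that survives the loss of the datacenter.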
|
||||||||||||||
|
||||||||||||||
Alternatively, you can use Vault as a CA and TLS certificate generator to help you automate the process. Refer to [Generate mTLS Certificates for Consul with Vault](/consul/docs/automate/consul-template/vault/mtls) to learn how to automate certificate generation and distribution for your Consul server agents.
|
||||||||||||||
|
||||||||||||||
|
||||||||||||||
## ACL down policy | ||||||||||||||
|
||||||||||||||
When your primary datacenter is down, you lose the ability to validate ACL policies. To mitigate this, Consul provides the [`acl.down_policy`](/consul/docs/reference/agent/configuration-file/acl#acl_down_policy) configuration parameter, which tells Consul which strategy to follow if ACLs cannot be validated against the primary datacenter.
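
As an illustration, the following sketch renders an agent configuration fragment in JSON, which Consul accepts alongside HCL, with the down policy set explicitly. The set of accepted values reflects the agent configuration reference as understood here and should be checked against your Consul version. The helper name is hypothetical.

```python
import json

# Assumed set of accepted values; confirm against the configuration reference.
VALID_DOWN_POLICIES = {"allow", "deny", "extend-cache", "async-cache"}


def acl_config(down_policy: str = "extend-cache") -> str:
    """Render an `acl` stanza as JSON with an explicit down policy."""
    if down_policy not in VALID_DOWN_POLICIES:
        raise ValueError(f"unsupported down_policy: {down_policy!r}")
    return json.dumps({"acl": {"enabled": True, "down_policy": down_policy}}, indent=2)


if __name__ == "__main__":
    print(acl_config())
```

Setting the value explicitly, even when it matches the default, documents the intended outage behavior in the same place as the rest of your agent configuration.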
|
||||||||||||||
|
||||||||||||||
By default, Consul adopts the `extend-cache` approach: during an outage, Consul allows cached ACL objects to be used and ignores their TTL values. If a request uses an ACL object that is not cached, `extend-cache` acts like `deny`.
|
||||||||||||||
|
||||||||||||||
If you changed `down_policy` to the more restrictive value of `deny`, the outage affects you more severely, because all ACL-protected operations in the secondary datacenter are denied until the primary datacenter is restored.
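
The difference between the two policies can be sketched with a small model. This is a simplified illustration of the behavior described above, not Consul's implementation: the `CachedDecision` type, the cache dictionary, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedDecision:
    allowed: bool   # decision recorded while the primary was reachable
    expired: bool   # True when the cache entry's TTL has elapsed


def authorize_during_outage(
    down_policy: str, token: str, cache: dict[str, CachedDecision]
) -> bool:
    """Model how a secondary datacenter answers while the primary is down."""
    if down_policy == "deny":
        # Every ACL-protected operation is rejected until the primary returns.
        return False
    if down_policy == "extend-cache":
        entry = cache.get(token)
        # Cached entries are honored even if their TTL has expired;
        # tokens that were never cached are treated as denied.
        return entry.allowed if entry is not None else False
    raise ValueError(f"policy not covered by this sketch: {down_policy!r}")


if __name__ == "__main__":
    cache = {"web-token": CachedDecision(allowed=True, expired=True)}
    print(authorize_during_outage("extend-cache", "web-token", cache))  # True
    print(authorize_during_outage("extend-cache", "new-token", cache))  # False
    print(authorize_during_outage("deny", "web-token", cache))          # False
```

The model shows why `extend-cache` preserves existing workloads during an outage while still rejecting tokens the secondary datacenter has never seen.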
|
||||||||||||||
|
||||||||||||||
|
||||||||||||||
### Client ACL tokens reconfiguration | ||||||||||||||
|
||||||||||||||
When you restore a snapshot to a new Consul cluster, depending on the initial configuration, you might need to reconfigure the ACL tokens for the client agents.
|
||||||||||||||
|
||||||||||||||
- If token persistence was enabled before the snapshot was captured, using the [`enable_token_persistence`](/consul/docs/reference/agent/configuration-file/acl#acl_enable_token_persistence) configuration flag, then the client agents resume functioning after the snapshot is restored on the cluster's server agents, without the need for reconfiguration.
|
||||||||||||||
- If the ACL tokens for the agents were specified directly in the client agent configuration before the snapshot was captured, using the [`acl.tokens.agent`](/consul/docs/reference/agent/configuration-file/acl#acl_tokens_agent) parameter, then the client agents resume functioning after the snapshot is restored on the cluster's server agents, without the need for reconfiguration.
|
||||||||||||||
- If neither of the previous options was enabled, then ACL tokens are not persisted after a restore. Consul clients cannot rejoin the datacenter because they lack the required permissions, and they need additional configuration to be restored. Use the [`consul acl set-agent-token` command](/consul/commands/acl/set-agent-token#agent), the [`acl.tokens.agent`](/consul/docs/reference/agent/configuration-file/acl#acl_tokens_agent) configuration parameter, or the `CONSUL_HTTP_TOKEN` environment variable to update the token on client agents.
|
||||||||||||||
|
||||||||||||||
The following table summarizes the possible configurations and whether Consul clients need to be reconfigured.
|
||||||||||||||
| Token persistence enabled | ACL token provided in Consul client config | Consul client requires reconfiguration |
| --- | --- | --- |
| Yes | No | No |
| No | Yes | No |
| Yes | Yes | No |
| No | No | Yes |
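
The table reduces to a single rule, which a recovery runbook or pre-flight check could encode as follows. The function name is illustrative.

```python
def client_requires_reconfiguration(
    token_persistence_enabled: bool, token_in_client_config: bool
) -> bool:
    """Return True when a client agent needs its ACL token set again after a restore.

    Reconfiguration is needed only when the token was neither persisted
    by the agent nor provided in the client configuration.
    """
    return not (token_persistence_enabled or token_in_client_config)


if __name__ == "__main__":
    for persisted in (True, False):
        for in_config in (True, False):
            print(persisted, in_config, client_requires_reconfiguration(persisted, in_config))
```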
|
||||||||||||||
|
||||||||||||||
## Federation strategies to mitigate outages | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
Introducing Consul federation in your environment, by having multiple Consul datacenters federated using WAN or cluster peering, can increase your resilience to disruptive events by replicating services across multiple datacenters, regions, and cloud providers. | ||||||||||||||
|
||||||||||||||
|
||||||||||||||
When you implement federation in your environment, you can leverage Consul features that increase resilience to service failures:
|
||||||||||||||
|
||||||||||||||
- Within a single datacenter, Consul provides automatic failover for services by omitting failed service instances from DNS lookups. | ||||||||||||||
- WAN federated clusters can use [prepared queries](/consul/docs/manage-traffic/failover/prepared-query) to let users define failover policies in a centralized way. | ||||||||||||||
- Cluster-peered federated datacenters can use [sameness groups](/consul/docs/manage-traffic/failover/sameness-group) to automatically redirect service traffic to healthy instances in failover scenarios. | ||||||||||||||
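
For the prepared query case, a failover policy is expressed as a query definition. The following sketch builds such a definition as a Python dictionary. The field names (`Name`, `Service`, `OnlyPassing`, `Failover`, `NearestN`, `Datacenters`) follow the prepared query API as understood here, so confirm them against the API reference before use. The query name, service name, and datacenter names are placeholders.

```python
import json


def failover_query(name: str, service: str, nearest_n: int, datacenters: list[str]) -> dict:
    """Build a prepared query definition that fails over to other datacenters.

    The query first tries up to `nearest_n` of the nearest datacenters, then
    the explicitly listed `datacenters`, when no healthy local instance exists.
    """
    return {
        "Name": name,
        "Service": {
            "Service": service,
            "OnlyPassing": True,  # skip instances with warning or critical checks
            "Failover": {"NearestN": nearest_n, "Datacenters": datacenters},
        },
    }


if __name__ == "__main__":
    # Placeholder names; replace with your own service and datacenters.
    print(json.dumps(failover_query("web-failover", "web", 2, ["dc2", "dc3"]), indent=2))
```

The resulting JSON document is the kind of body you would submit when creating the query. Clients then resolve the query by name instead of the service name, so the failover policy stays defined in one place.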
|
||||||||||||||
To deploy a multi-datacenter federated Consul cluster, refer to the following documentation:
|
||||||||||||||
- [Basic Federation with WAN Gossip](/consul/docs/east-west/wan-federation/vms) | ||||||||||||||
- [ACL Replication for Multiple Datacenters](/consul/docs/secure/acl/token/federation) | ||||||||||||||
|
||||||||||||||
If you are using Consul's service mesh in your WAN-Federated environment, you should also set [`enable_central_service_config = true`](/consul/docs/reference/agent/configuration-file/general#enable_central_service_config) on your Consul clients, which allows you to centrally configure the sidecar and mesh gateway proxies. | ||||||||||||||
|
||||||||||||||
To make use of the mesh gateway functionality, refer to the [Mesh gateways](/consul/docs/east-west/mesh-gateway) documentation.
|
||||||||||||||
|
||||||||||||||
|
||||||||||||||
## Primary Consul datacenter outage impact | ||||||||||||||
|
||||||||||||||
When you design and architect your WAN-federated Consul environment, it is important to consider the critical role of the primary datacenter in the multi-cluster deployment. The primary Consul datacenter serves as the source of truth for the following data:
- ACL operations, including tokens and policies.
- Service intentions for secure service-to-service communication.
- Certificate Authority management, if you use the built-in Consul CA. The root CA resides in the primary Consul datacenter and must sign the certificates for the additional Consul datacenters.
Once you establish a primary Consul datacenter for your federated deployment, you cannot migrate, change, or move it.

One effective pattern for large Consul multi-cluster deployments is to have a dedicated primary Consul datacenter with the sole purpose of serving as the primary datacenter. You would only include Consul servers in this primary datacenter and not connect any client nodes or services. This primary Consul datacenter can then be federated normally with other Consul datacenters, which contain both servers and clients.
- It becomes easier to move the primary Consul datacenter. For example, you may want to migrate it from an on-premises datacenter to a cloud environment. Typically, this process entails performing a backup and restore of the primary Consul datacenter to the alternate location. For more information, refer to [Backup and restore a Consul datacenter](/consul/docs/manage/disaster-recovery/backup-restore).
- If your primary datacenter experiences a disaster, the other Consul datacenters can continue to function independently. They operate with reduced functionality until the primary Consul datacenter is brought back online.