4 changes: 3 additions & 1 deletion website/content/docs/error-messages/consul.mdx
@@ -165,12 +165,14 @@ When restoring a Consul datacenter with a snapshot on new infrastructure, Consul

This error means that in the new datacenter there is at least one node with the same `node_name` as a node in the snapshot's datacenter, but with a different `node_id`. This represents a consistency issue.

There are two possible workarounds:
There are three possible workarounds:

1. Save the UUID from the previous node’s data directory. Then re-use that same UUID when you first start the agent on the new node. You can configure node IDs for your Consul agent nodes with the [`node_id` configuration parameter](/consul/docs/reference/agent/configuration-file/node#_node_id).

1. Always use unique node names for your Consul datacenters so that there is no risk of conflicts. You can configure node names for your Consul agent nodes using the [`node_name`](/consul/docs/reference/agent/configuration-file/node#_node) configuration parameter.

1. Perform a [`consul leave`](/consul/commands/leave) on each server and then start the server again. Do this one server at a time. After the servers restart, their node IDs are set to the expected values, which resolves the errors in the logs.

## ACL not found

If Consul returns the following error, it indicates that you have ACLs enabled in your cluster but you are not passing a valid token.
@@ -35,17 +35,16 @@ To reduce the burden on the leader, it is possible to [run the snapshot command

However, we still recommend you take `consistent` snapshots for write-heavy production use cases, or when you want to snapshot a cluster state immediately after a specific change.


## Workflow

1. **Back up the Consul datacenter**: Use the `consul snapshot save` command to create a backup of the Consul datacenter.
1. **Verify the backup**: Inspect the backup file to ensure it was created successfully.
1. **Restore from snapshot**: Use the `consul snapshot restore` command to restore the Consul datacenter from the backup.



## Backup a Consul datacenter


Run the basic snapshot command on one of the servers. Because it uses the default settings, this request runs in `consistent` mode.

```shell-session
@@ -68,6 +67,11 @@ Version 1

For more information about the `snapshot inspect` sub-command and its output, refer to the [`consul snapshot inspect` CLI documentation](/consul/commands/snapshot/inspect).

<Warning heading="Security warning">

Consul snapshots contain extremely sensitive data, such as credentials in recoverable form. Store snapshots on an encrypted medium with sufficiently strict access controls in place.

</Warning>


## Restore a Consul datacenter
@@ -83,10 +87,9 @@ $ consul snapshot restore backup.snap
Restored snapshot
```


## Additional guidance

For more information on disaster recovery, including detailed instructions on how to backup and restore Consul datacenters, refer to the following resources:

- [Consul Disaster Recovery](/consul/docs/manage/disaster-recovery)
- [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/federation)
- [Disaster recovery for WAN-federated datacenters](/consul/docs/manage/disaster-recovery/restore-federated)
166 changes: 166 additions & 0 deletions website/content/docs/manage/disaster-recovery/disaster-preparation.mdx
@@ -0,0 +1,166 @@
---
layout: docs
page_title: Disaster preparation strategy
description: >-
Prepare for Consul disaster recovery using best practice recommendations. Implement a backup plan and a disaster recovery plan (DRP) to minimize downtime in case a disaster event happens in your deployment.
---

# Disaster preparation strategy

This topic provides an overview of the best practices for preparing a disaster recovery strategy for your Consul cluster.

## Introduction

Disaster recovery is an important part of business continuity planning.

When defining a disaster preparation strategy, you should take into account the following two parameters:

- **Recovery point objective (RPO)** - The maximum amount of data loss that can be incurred from a disaster, failure, or comparable event. RPO is measured as a unit of time and there is usually a 1-to-1 correlation between RPO and backup frequency.
- **Recovery time objective (RTO)** - The amount of time that passes between application failure and full availability restoration. You can keep RTO relatively short by having another datacenter location available for disaster recovery purposes, with services and data replicated to it on a regular basis.
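
To make the relationship between RPO and backup frequency concrete, the following sketch checks whether a snapshot interval satisfies a target RPO. It assumes the worst case, where the failure happens just before the next snapshot would run, so the data lost equals the full interval. The function name `meets_rpo` and the numbers are illustrative and are not part of Consul.

```shell
#!/usr/bin/env bash
# Hypothetical helper: does a snapshot interval meet a target RPO?
# Worst case, a failure occurs right before the next snapshot runs,
# so the data lost equals the full interval between snapshots.
meets_rpo() {
  local interval_minutes="$1"  # time between snapshots
  local rpo_minutes="$2"       # maximum tolerated data loss
  if [ "$interval_minutes" -le "$rpo_minutes" ]; then
    echo "ok: worst-case loss ${interval_minutes}m is within RPO ${rpo_minutes}m"
  else
    echo "too infrequent: worst-case loss ${interval_minutes}m exceeds RPO ${rpo_minutes}m"
    return 1
  fi
}
```

For example, `meets_rpo 30 60` reports that a 30-minute snapshot interval fits within a 60-minute RPO, while `meets_rpo 120 60` returns a non-zero status.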

Restoring a Consul cluster after a disastrous event, such as the complete loss of one or more datacenters or regions, typically requires the full redeployment of a new Consul datacenter to replace the lost one. Using the following best practices for deployment and automation greatly reduces the time it takes to perform these steps.

- [Use a recommended architecture](#use-a-recommended-architecture) for your datacenter.
- [Automate your deployment](#automate-your-deployment) process to reduce deploy times and human errors.
- [Implement a backup strategy](#implement-a-backup-strategy) to reduce RPO.
- Have a [TLS certificate distribution process](#tls-certificate-distribution-process) in place.
- Adopt an adequate [ACL down policy](#acl-down-policy).
- Use [federation strategies to mitigate outages](#federation-strategies-to-mitigate-outages).


## Use a recommended architecture

Not every outage has the same level of impact. Much of the resiliency of your Consul datacenter relies on proper configuration and the adoption of a recommended architecture. Following a standard architecture makes the deployment process consistent across your organization and makes it easier to automate.

Refer to [Consul Reference Architecture](/consul/tutorials/production-deploy/reference-architecture) to learn about the recommended configurations for your Consul datacenter.

If you are using Kubernetes you can refer to [Consul on Kubernetes reference architecture](/consul/tutorials/production-kubernetes/kubernetes-reference-architecture).

Enterprise users can use [redundancy zones](/consul/tutorials/operate-consul/redundancy-zones) to provide fault tolerance even in case of a total region failure.

## Automate your deployment

The amount of downtime you experience from the loss of your Consul datacenter is directly proportional to the amount of time it takes you to deploy a new datacenter.

Re-deploying an entire datacenter after an outage is a non-trivial operation that might require a considerable amount of time. You can reduce the time to recover, _RTO_, by following Infrastructure as Code (IaC) principles and using tools such as [Terraform](/terraform/intro) and [Vault](/vault/docs/about-vault/what-is-vault) to help you in the deployment and recovery process.

Refer to the following documentation to set up your datacenters:

- [Deployment Guide](/consul/tutorials/production-deploy/deployment-guide)
- [Securing Consul with ACLs](/consul/docs/secure/acl)

If you are using Kubernetes, refer to the following documentation:

- [Consul and Kubernetes deployment guide](/consul/tutorials/production-kubernetes/kubernetes-deployment-guide)

To learn more about best practices for your deployments, refer to HashiCorp's [Well-Architected Framework](/well-architected-framework/what-is) documentation. It can help you define and automate your processes, optimize your resources and costs, design reliable systems, and secure your infrastructure and services.


## Implement a backup strategy

A Consul datacenter's state is more than just the initial configuration. It includes data that is generated during normal operations, such as KV entries, ACL tokens, and intentions. When your datacenter fails, this information is lost and cannot be manually recreated without a backup.

Restoring from a snapshot ensures all the intentions, KV entries and ACL tokens are reintroduced.

You can follow [Backup Consul Data and State](/consul/tutorials/production-deploy/backup-and-restore) to learn how to perform a snapshot of your Consul datacenter to use in case of disaster.
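
As a rough illustration of what a scheduled backup step could look like, the sketch below wraps `consul snapshot save` and `consul snapshot inspect` in a shell function. It assumes that the `consul` binary is on the `PATH` and that any required address and token environment variables are already set. The function name `backup_consul`, the default directory, and the file naming scheme are hypothetical choices, not Consul conventions.

```shell
#!/usr/bin/env bash
# Sketch of a scheduled backup step (assumptions: `consul` is on PATH and
# the environment already points at a reachable server with a valid token).
# The default directory is a placeholder; prefer an encrypted volume.
backup_consul() {
  local backup_dir="${1:-/var/backups/consul}"
  local snapshot
  snapshot="${backup_dir}/consul-$(date -u +%Y%m%dT%H%M%SZ).snap"

  mkdir -p "$backup_dir" || return 1
  # Snapshots contain sensitive data, so create the file owner-readable only.
  # Command output goes to stderr so that stdout carries only the file path.
  ( umask 077 && consul snapshot save "$snapshot" >&2 ) || return 1
  # Check that the file can be read as a snapshot before relying on it.
  consul snapshot inspect "$snapshot" >/dev/null || return 1
  echo "$snapshot"
}
```

Running this function from a scheduler at an interval that matches your RPO, and copying the resulting file off the server, is one way to keep a recent snapshot available.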


## TLS certificate distribution process

Certificates are stored on the agent disk and are not saved in a snapshot. This means you will have to re-generate them in case you lose access to the agent's data.

Consul comes equipped with a command, [`consul tls cert create`](/consul/commands/tls/cert), that permits you to generate TLS certificates for the agents. This simplifies the automation of deployment by giving you the ability to generate a CA and TLS certificates as part of the process.

As an alternative, we suggest you use Vault as a CA and TLS certificate generator to help you automate the process. Refer to [Generate mTLS Certificates for Consul with Vault](/consul/docs/automate/consul-template/vault/mtls) to learn how to automate certificate generation and distribution for your Consul server agents.
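
If you use the built-in command rather than Vault, certificate generation can be folded into your deployment automation. The following sketch assumes that the `consul` binary is on the `PATH` and that the CA is written under the default file name `consul-agent-ca.pem`. The function name `generate_server_certs`, the output directory, and the datacenter name are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Sketch: generate a CA and one server certificate with the built-in
# `consul tls` commands. Output directory and datacenter are placeholders.
generate_server_certs() {
  local out_dir="$1"     # where the CA and certificate files are written
  local datacenter="$2"  # should match the agents' datacenter setting
  mkdir -p "$out_dir" || return 1
  (
    cd "$out_dir" || exit 1
    umask 077  # keep private keys readable by the owner only
    # Create the CA only once and reuse it on later runs.
    [ -f consul-agent-ca.pem ] || consul tls ca create || exit 1
    consul tls cert create -server -dc "$datacenter"
  )
}
```

Because the CA private key is what lets you issue matching certificates later, store it somewhere that survives the loss of the agents themselves.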


## ACL down policy

When your primary datacenter is down you lose your ability to validate ACL policies. To mitigate this, Consul has a configuration parameter, [`acl.down_policy`](/consul/docs/reference/agent/configuration-file/acl#acl_down_policy), that tells Consul which strategy to follow if ACLs cannot be validated against the primary datacenter.

By default, Consul adopts the `extend-cache` approach, meaning that in case of an outage Consul will allow cached ACL objects to be used, ignoring their TTL values. If a non-cached ACL is used, `extend-cache` acts like `deny`.

If you change `down_policy` to the more restrictive value of `deny`, the outage affects you more severely, because all ACL-protected operations in the secondary datacenter are denied until the primary datacenter is restored.
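
Setting the policy explicitly in your agent configuration makes the intended behavior visible during a review of your disaster recovery plan. The sketch below writes a minimal HCL fragment from a shell function. The function name `write_acl_down_policy` and the file path are hypothetical, and the fragment omits every other ACL setting that your agents may need.

```shell
#!/usr/bin/env bash
# Sketch: write an agent configuration fragment that sets the ACL down
# policy explicitly. The target path is a placeholder chosen by the caller.
write_acl_down_policy() {
  local config_file="$1"
  local policy="${2:-extend-cache}"  # default mirrors Consul's default
  cat > "$config_file" <<EOF
acl {
  enabled = true
  down_policy = "${policy}"
}
EOF
}
```

For example, `write_acl_down_policy ./acl-down-policy.hcl deny` produces a fragment that selects the more restrictive behavior.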


### Client ACL tokens reconfiguration

When you restore a snapshot to a new Consul cluster, depending on the initial configuration you might need to reconfigure the ACL tokens for the client agents.

- If token persistence was enabled before the snapshot was captured, using the [`enable_token_persistence`](/consul/docs/reference/agent/configuration-file/acl#acl_enable_token_persistence) configuration flag, then the client agents will resume function after the snapshot restore in the cluster's server agents, without the need for reconfiguration.
- If the ACL tokens for the agents were specified directly in the client agent configuration before the snapshot was captured, using the [`acl.tokens.agent`](/consul/docs/reference/agent/configuration-file/acl#acl_tokens_agent) parameter, then the client agents will resume function after the snapshot restore in the cluster's server agents, without the need for reconfiguration.
- If none of the previous options were enabled, then ACL tokens are not persisted after a restore. This means that Consul clients cannot rejoin the datacenter because they do not have the required permissions, and they need additional configuration before they can be restored. Use the [`consul acl set-agent-token` command](/consul/commands/acl/set-agent-token#agent), the [`acl.tokens.agent`](/consul/docs/reference/agent/configuration-file/acl#acl_tokens_agent) configuration parameter, or the `CONSUL_HTTP_TOKEN` variable to update the token on client agents.

The table below summarizes the possible configurations and indicates whether Consul clients need to be reconfigured.

| Token persistence enabled | ACL token provided in Consul client config | Consul client requires a re-configuration |
| --- | --- | --- |
| Yes | No | No |
| No | Yes | No |
| Yes | Yes | No |
| No | No | Yes |
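
The table reduces to a single rule, which the following sketch encodes for use in a recovery runbook or pre-flight check. The function name `client_needs_reconfiguration` and the `AGENT_TOKEN` variable are hypothetical. The commented command at the end shows the `consul acl set-agent-token` form for the `agent` token type.

```shell
#!/usr/bin/env bash
# Encodes the table above: a client needs reconfiguration only when
# neither token persistence nor a configured agent token is present.
client_needs_reconfiguration() {
  local persistence_enabled="$1"  # "yes" or "no"
  local token_in_config="$2"      # "yes" or "no"
  if [ "$persistence_enabled" = "no" ] && [ "$token_in_config" = "no" ]; then
    echo "yes"
  else
    echo "no"
  fi
}

# On a client that needs it, the token can be set again at runtime.
# AGENT_TOKEN is a placeholder for a token with the required permissions:
#   consul acl set-agent-token agent "$AGENT_TOKEN"
```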


## Federation strategies to mitigate outages

Introducing Consul federation in your environment, by having multiple Consul datacenters federated using WAN or cluster peering, can increase your resilience to disruptive events by replicating services across multiple datacenters, regions, and cloud providers.

By implementing federation in your environment, you can use Consul features that increase resilience to service failure:

- Within a single datacenter, Consul provides automatic failover for services by omitting failed service instances from DNS lookups.
- WAN federated clusters can use [prepared queries](/consul/docs/manage-traffic/failover/prepared-query) to let users define failover policies in a centralized way.
- Cluster-peered federated datacenters can use [sameness groups](/consul/docs/manage-traffic/failover/sameness-group) to automatically redirect service traffic to healthy instances in failover scenarios.

To deploy a multi-datacenter federated Consul cluster you can refer to the following documentation:

- [Basic Federation with WAN Gossip](/consul/docs/east-west/wan-federation/vms)
- [ACL Replication for Multiple Datacenters](/consul/docs/secure/acl/token/federation)

If you are using Consul's service mesh in your WAN-Federated environment, you should also set [`enable_central_service_config = true`](/consul/docs/reference/agent/configuration-file/general#enable_central_service_config) on your Consul clients, which allows you to centrally configure the sidecar and mesh gateway proxies.

To use mesh gateways, refer to the [Mesh gateways](/consul/docs/east-west/mesh-gateway) documentation.


## Primary Consul datacenter outage impact

When you design and architect your WAN-federated Consul environment, it is important to consider the critical role of the primary datacenter in the multi-cluster deployment. The primary Consul datacenter serves as the source of truth for the following data.

Suggested change
When you design and architect your WAN-federated Consul environment, it is important to consider the critical role of the primary datacenter in the multi-cluster deployment. The primary Consul datacenter serves as the source of truth for the following data.
When you design and architect your WAN-federated Consul environment, it is important to consider the critical role of the primary datacenter in the multi-cluster deployment. The primary Consul datacenter serves as the source of truth for the following data:


1. Certificate Authority management, if you use the built-in Consul CA. The root CA resides in the primary Consul datacenter and must sign the certificates for the additional Consul datacenters.
1. ACLs
1. Intentions

Suggested change
1. Certificate Authority management, if you use the built-in Consul CA. The root CA resides in the primary Consul datacenter and must sign the certificates for the additional Consul datacenters.
1. ACLs
1. Intentions
- ACL operations, including tokens and policies.
- Service intentions for secure service-to-service communication.
- Certificate Authority management, if you use the built-in Consul CA. The root CA resides in the primary Consul datacenter and must sign the certificates for the additional Consul datacenters.


The table below shows the impact on Consul operations of a full outage of the primary Consul datacenter.

| Consul feature | Create | Read | Update | Delete |
| -------------- | -------- | ---------------- | -------- | -------- |
| ACLs | &#10060; | &#9989; &#x00B9; | &#10060; | &#10060; |
| Intentions | &#10060; | &#9989; &#x00B2; | &#10060; | &#10060; |
| KV Store | &#9989; | &#9989; | &#9989; | &#9989; |
| Services | &#9989; | &#9989; | &#9989; | &#9989; |


1. The ability to read and validate ACLs assumes that the default setting of `extend-cache` is used for the ACL down policy and that the ACL token was cached in the local datacenter before the primary datacenter outage.
2. The ability to read and validate intentions assumes that the intentions were created while the primary datacenter was online.

For TLS certificate management, you can greatly reduce the impact of a primary datacenter outage by using Vault both to [generate mTLS Certificates for Consul agents](/consul/docs/automate/consul-template/vault/mtls) and as a [Consul service mesh certificate authority](/consul/tutorials/operate-consul/vault-pki-consul-connect-ca).

### Clientless primary Consul datacenter

Once you establish and federate a primary Consul datacenter, you cannot migrate, change, or move it. An effective pattern for large Consul multi-cluster deployments is to have a dedicated primary Consul datacenter with the sole purpose of serving as a primary. You would only include Consul servers in this primary datacenter and not connect any client nodes or services. This primary Consul datacenter can then be federated normally with other Consul datacenters, which will each contain both servers and clients.

Suggested change
Once you establish and federate a primary Consul datacenter, you cannot migrate, change, or move it. An effective pattern for large Consul multi-cluster deployments is to have a dedicated primary Consul datacenter with the sole purpose of serving as a primary. You would only include Consul servers in this primary datacenter and not connect any client nodes or services. This primary Consul datacenter can then be federated normally with other Consul datacenters, which will each contain both servers and clients.
Once you establish a primary Consul datacenter for your federated deployment, you cannot migrate, change, or move it.
One effective pattern for large Consul multi-cluster deployments is to have a dedicated primary Consul datacenter with the sole purpose of serving as the primary datacenter. You would only include Consul servers in this primary datacenter and not connect any client nodes or services. This primary Consul datacenter can then be federated normally with other Consul datacenters, which contain both servers and clients.


This approach provides two distinct advantages.

- It becomes easier to move the primary Consul datacenter. For example, you may want to migrate it from an on premises datacenter to a cloud environment. Typically, this would entail performing a backup and restore of the primary Consul datacenter to the alternate location. Review the [Disaster Recovery for the Primary Datacenter](/consul/tutorials/datacenter-operations/recovery-outage-primary) tutorial for guidance on restoring a Consul cluster.

Suggested change
- It becomes easier to move the primary Consul datacenter. For example, you may want to migrate it from an on premises datacenter to a cloud environment. Typically, this would entail performing a backup and restore of the primary Consul datacenter to the alternate location. Review the [Disaster Recovery for the Primary Datacenter](/consul/tutorials/datacenter-operations/recovery-outage-primary) tutorial for guidance on restoring a Consul cluster.
- It becomes easier to move the primary Consul datacenter. For example, you may want to migrate it from an on premises datacenter to a cloud environment. Typically, this process entails performing a backup and restore of the primary Consul datacenter to the alternate location. For more information, refer to [Backup and restore a Consul datacenter](/consul/docs/manage/disaster-recovery/backup-restore).

- In the event of a disaster, the additional Consul datacenters can still continue to function independently of the primary Consul datacenter although functionality will be reduced until the primary Consul datacenter is brought back online.

Suggested change
- In the event of a disaster, the additional Consul datacenters can still continue to function independently of the primary Consul datacenter although functionality will be reduced until the primary Consul datacenter is brought back online.
- If your primary datacenter experiences a disaster, the other Consul datacenters can still continue to function independently. They will operate with reduced functionality until the primary Consul datacenter is brought back online.


## Additional guidance

This page helps you build your internal operations manual for outages and create a disaster recovery strategy.

Test the manual multiple times before you experience an outage to confirm that the steps are correct and to measure the time needed for a recovery against your desired _RTO_.

Use our tutorials on disaster recovery to try the commands in a test environment:

- [Disaster recovery for Consul clusters](/consul/tutorials/operate-consul/recovery-outage)
- [Disaster Recovery for Consul on Kubernetes](/consul/tutorials/production-kubernetes/kubernetes-disaster-recovery)