Tuesday, June 13, 2023

Designing Azure Storage Account Regional Failover with Private Endpoints

Over the past year I’ve had the opportunity to work on several projects designing disaster recovery from one Azure region to another. One of the most common topics that comes up is how to handle storage accounts that are accessed through private endpoints and have their public endpoints disabled:

image

image

The purpose of this blog post is to provide a walkthrough of possible methods to design regional failover with private endpoints.

Sample Environment

Take the following topology as an example:

image

In this topology, we have a storage account in the East US region that is configured with Read-access geo-redundant storage (RA-GRS), so all data written to it is automatically and asynchronously replicated to the paired region, West US:

image

Since Read Access is configured, a secondary endpoint is available for read access on the replicated copy in the secondary region:

image

A private endpoint is provisioned in the East US region so the vm-east-us-prod virtual machine can access the storage account privately, from its own IP address of 10.1.0.4 to the private endpoint at 10.1.2.4, within the vnet-east-us VNet:

image

Although a secondary endpoint is available, it should not be mistaken for an endpoint that can be used for DR purposes. It only provides read access, via the public endpoint, to the replicated copy in West US during normal operation.

Notice that there is a pre-deployed virtual machine in the West US region that serves to provide continued access to the storage account in the event that the East US region is unavailable. This type of setup is very common, as a DR failover region is typically pre-staged with networks that host the resources needed to continue operations when the primary region is down.

Scenario #1 – Shared Private DNS Zone for Primary and Secondary Regions

One common design between two regions is to share the Private DNS Zone between the VNets in both regions. This configuration allows both VNets to use the same DNS zone for name resolution, so both resolve to the same private IP address: the one configured for the private endpoint in the primary region that provides access to the storage account:

image

In the diagram above, the secondary region’s virtual machine is placed in a VNet that is linked to the same private DNS zone:

image

It is important to note that we are able to link the two VNets to the same Private DNS Zone because a Private DNS Zone is a Global resource, even though it is placed in a regional resource group:

image

This type of configuration means that resolving easteusblobprod.privatelink.blob.core.windows.net in either region will direct the traffic to the private endpoint deployed in East US. Since the two regions have Global VNet Peering configured, the West US traffic will traverse that peering connection to the East US region.

image
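A quick way to confirm this behaviour from either virtual machine is to resolve the name and check that the answer is a private address. The following is a minimal sketch, not something from the original environment: the function name resolves_to_private_ip is my own, and the resolver parameter exists only so the check can be exercised without a real DNS lookup.

```python
import ipaddress
import socket
from typing import Callable


def resolves_to_private_ip(
    hostname: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True when hostname resolves to an address in a private range.

    resolver is injectable (hypothetical parameter) so the logic can be
    tested offline; by default the operating system's resolver is used.
    """
    address = resolver(hostname)
    # is_private is True for ranges such as 10.0.0.0/8, which is where the
    # private endpoint address 10.1.2.4 in this post lives.
    return ipaddress.ip_address(address).is_private


# Example call from vm-east-us-prod or the West US VM (performs a real lookup):
# resolves_to_private_ip("easteusblobprod.privatelink.blob.core.windows.net")
```

With the shared zone in place, I would expect the check to succeed from both VNets, since both are linked to the zone that holds the record for 10.1.2.4.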

If the storage account becomes unavailable in East US, or it has been manually failed over to the West US region, traffic will continue to be directed to the private endpoint in East US and then sent over the private link to the failed-over storage account in the West US region, which has now become LRS (Locally-redundant storage):

image

image

Unfortunately, such a design would not provide the required access during a full East US regional failure, because the private endpoint in East US would be unavailable along with the rest of the region:

image

A common design is to have a DR runbook that performs the following in the event of a regional failure:

  1. Provision a new private endpoint in the West US region
  2. Update the Private DNS Zone’s record to direct traffic to the new private endpoint
image
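The two runbook steps above can be sketched as a small in-memory model. This is an illustration of the sequence only, under stated assumptions: PrivateDnsZone and fail_over_private_endpoint are hypothetical names, no Azure API is called, and the West US endpoint address 10.2.2.4 is made up for the example. In a real runbook, provision_endpoint would be whatever Infrastructure as Code or automation step creates the endpoint and returns its IP.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PrivateDnsZone:
    """Hypothetical stand-in for a Private DNS Zone: record name -> A record IP."""
    name: str
    a_records: Dict[str, str] = field(default_factory=dict)


def fail_over_private_endpoint(
    zone: PrivateDnsZone,
    record_name: str,
    provision_endpoint: Callable[[], str],
) -> Tuple[Optional[str], str]:
    """Run the two DR steps and return (old_ip, new_ip)."""
    # Step 1: provision a new private endpoint in the DR region.
    new_ip = provision_endpoint()
    # Step 2: update the zone's record to direct traffic to the new endpoint.
    old_ip = zone.a_records.get(record_name)
    zone.a_records[record_name] = new_ip
    return old_ip, new_ip


# Shared zone from Scenario #1, initially pointing at the East US endpoint.
zone = PrivateDnsZone(
    "privatelink.blob.core.windows.net",
    {"easteusblobprod": "10.1.2.4"},
)
# 10.2.2.4 is an assumed address for the new West US private endpoint.
old_ip, new_ip = fail_over_private_endpoint(zone, "easteusblobprod", lambda: "10.2.2.4")
```

Because both VNets are linked to the same zone, the single record update is what redirects the West US virtual machine to the new endpoint.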

This type of design requires manual steps to be executed but saves cost in the disaster recovery region. A private endpoint costs $0.014 (CAD) per hour, which equates to around $10.22/month, but larger environments can have many private endpoints, and charges for resources that are not actively used aren’t well received by organizations. Environments leveraging automation through Infrastructure as Code are great candidates for this type of design, as the resources and changes can be executed with little manual labour. Furthermore, disaster recovery solutions are not always automatically invoked, so having to provision private endpoints during a catastrophic event is not uncommon. An example of this could be leveraging Azure Site Recovery to recover VMs, with its recovery plan capability executing Azure Automation runbooks.
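To put the cost argument in numbers, here is a rough calculation using the hourly rate quoted above. It is a sketch with assumptions of my own: a month is taken as 730 hours (8,760 hours per year divided by 12), which reproduces the $10.22 figure, and data processing charges are ignored.

```python
HOURLY_RATE_CAD = 0.014  # per private endpoint, as quoted in this post
HOURS_PER_MONTH = 730    # assumption: 8,760 hours per year / 12 months


def monthly_endpoint_cost(endpoint_count: int, hourly_rate: float = HOURLY_RATE_CAD) -> float:
    """Approximate monthly cost (CAD) of keeping idle private endpoints provisioned."""
    return round(endpoint_count * hourly_rate * HOURS_PER_MONTH, 2)


one_endpoint = monthly_endpoint_cost(1)      # the single-endpoint case from the post
fifty_endpoints = monthly_endpoint_cost(50)  # a hypothetical larger environment
```

A single endpoint is cheap, but the cost scales linearly with the number of pre-provisioned endpoints, which is why the runbook approach appeals to larger environments.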

Scenario #2 – Separate Private DNS Zone for Primary and Secondary Regions

If there is a desire to pre-provision all resources, to either fully automate or reduce the amount of manual labour involved during a DR event, it is possible to provision a private endpoint in the West US disaster recovery region that is linked to the storage account. The important design change here is that a second private DNS zone is created for the DR region and linked to the DR VNet, as shown in the diagram below:

image

Notice that the pre-provisioned private endpoint now allows the virtual machine in the West US DR region to access the storage account through its own private link rather than over the global VNet peering. I won’t go into the details, but I have configured cross-region active/active deployments with such a design.

Here is what the configuration looks like in the Azure portal:

image

image

image

image

With the above design, a regional loss will require no manual configuration to access the storage account failed over to West US:

image
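The difference from Scenario #1 can be modelled in a few lines. This is only a sketch under assumptions: the VNet name vnet-west-us and the West US endpoint address 10.2.2.4 are made up for the example (the post only names vnet-east-us and 10.1.2.4), and the dictionary stands in for two separate private DNS zones, each linked to a single VNet.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Endpoint:
    ip: str
    region: str


# One private DNS zone per region, each linked only to its own VNet.
# "vnet-west-us" and 10.2.2.4 are assumed names/values for illustration.
ZONE_BY_VNET: Dict[str, Dict[str, Endpoint]] = {
    "vnet-east-us": {"easteusblobprod": Endpoint("10.1.2.4", "eastus")},
    "vnet-west-us": {"easteusblobprod": Endpoint("10.2.2.4", "westus")},
}


def resolve(vnet: str, record: str) -> Endpoint:
    """Resolve a record using the zone linked to the given VNet."""
    return ZONE_BY_VNET[vnet][record]


def can_reach_storage(vnet: str, record: str, failed_regions: Set[str]) -> bool:
    """True if the endpoint this VNet resolves to sits in a healthy region."""
    return resolve(vnet, record).region not in failed_regions


# With East US down, the West US VNet still resolves to its local endpoint.
west_ok = can_reach_storage("vnet-west-us", "easteusblobprod", {"eastus"})
```

No record is rewritten at failover time in this model, which is the point of the design: each region already resolves the storage account name to an endpoint in its own region.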

In summary, this design removes the requirement to provision a private endpoint and update DNS during a disaster recovery event. However, it does incur additional cost, as well as the overhead of maintaining multiple private DNS zones that are associated with the different VNets in each region. There will also be additional considerations when there is on-premises hybrid connectivity to the Azure regions and traffic originating outside of Azure needs to reach the private endpoint.

Hope this gives the reader a good idea of the designs available for providing private endpoint connectivity during a disaster recovery event.
