
Friday, July 30, 2021

Automating Azure Site Recovery Recovery Plan Test Failover with PowerShell Script (on-premises VMs to Azure)

I was recently asked by a colleague whether I had any PowerShell scripts to automate the test failover and cleanup of Azure Site Recovery replicated VMs. My first thought was that there must be plenty of scripts available on the internet, but I quickly found that the Google results were either the official Microsoft documentation, which walks through configuring ASR, replicating, and failing over only a single VM (https://docs.microsoft.com/en-us/azure/site-recovery/azure-to-azure-powershell), or blog posts that provided bits and pieces of information rather than a complete script.

Having been involved in Azure Site Recovery design, implementation and testing, I have created a PowerShell script to initiate the failover of a recovery plan and then perform the cleanup when the DR environment has been tested. This post serves to share the script that I use and I would encourage anyone who decides to use it to improve and customize the script as needed.

Environment

The environment this script is written for has an on-premises source and a target in Azure’s East US region. The source environment consists of virtual machines hosted on VMware vSphere.

Requirements

  1. Account with appropriate permissions that will be used to connect to the tenant with the Connect-AzAccount PowerShell cmdlet
  2. Recovery Plan already configured (we’ll be initiating the Test failover on the Recovery Plan and not individual VMs).
  3. The Subscription ID containing the servers being replicated
  4. The name of the Recovery Services Vault containing the replicated VMs
  5. The Recovery Plan name that will be failed over
  6. The VNet name that will be used for the failover VMs

Script Process

  1. Connect to Azure with Connect-AzAccount
  2. Set the context to the subscription ID
  3. Initiate the Test Failover task for the recovery plan
  4. Wait until the Test Failover has completed
  5. Notify user that the Test Failover has completed
  6. Pause and prompt the user to cleanup the failover test VMs
  7. Proceed to clean up Test Failover
  8. End script

I have plans to add improvements in the future, such as accepting a subscription ID upon execution, providing recovery plan selection for failover testing, or listing failed-over VM details (I can’t seem to find a cmdlet that displays the list of VMs and their status in a specified recovery plan).
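As a rough sketch of the first improvement, the hard-coded values could be replaced with a param block at the top of the script; the parameter names below are only placeholders and not part of the current script:

param (
    [Parameter(Mandatory = $true)] [string] $SubscriptionId,
    [Parameter(Mandatory = $true)] [string] $RSVaultName,
    [Parameter(Mandatory = $true)] [string] $ASRRecoveryPlanName,
    [Parameter(Mandatory = $true)] [string] $TestFailoverVNetName
)

# The script would then consume these values instead of literals, e.g.:
# Set-AzContext -SubscriptionId $SubscriptionId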

Script Variables

$RSVaultName = <name of the Recovery Services vault> - e.g. "rsv-us-eus-contoso-asr"

$ASRRecoveryPlanName = <name of Recovery Plan> - e.g. "Recover-Domain-Controllers"

$TestFailoverVNetName = <name of the VNet in the failover site that the test VMs will be connected to> - e.g. "vnet-us-eus-dr"

The Script

The following is the script:

Connect-AzAccount

Set-AzContext -SubscriptionId "adae0952-xxxx-xxxx-xxxx-2b8ef42c9bbb"

$RSVaultName = "rsv-us-eus-contoso-asr"

$ASRRecoveryPlanName = "Recover-Domain-Controllers"

$TestFailoverVNetName = "vnet-us-eus-dr"

$vault = Get-AzRecoveryServicesVault -Name $RSVaultName

Set-AzRecoveryServicesAsrVaultContext -Vault $vault

$RecoveryPlan = Get-AzRecoveryServicesAsrRecoveryPlan -FriendlyName $ASRRecoveryPlanName

$TFOVnet = Get-AzVirtualNetwork -Name $TestFailoverVNetName

$TFONetwork= $TFOVnet.Id

#Start test failover of recovery plan

$Job_TFO = Start-AzRecoveryServicesAsrTestFailoverJob -RecoveryPlan $RecoveryPlan -Direction PrimaryToRecovery -AzureVMNetworkId $TFONetwork

do {
    Clear-Host
    Write-Host "======== Monitoring Failover ========"
    Write-Host "Status will refresh every 5 seconds."

    try {
        $Job_TFOState = Get-AzRecoveryServicesAsrJob -Job $Job_TFO -ErrorAction Stop | Select-Object State
    }
    catch {
        Write-Host -ForegroundColor Red "ERROR - Unable to get status of Failover job"
        Write-Host -ForegroundColor Red "ERROR - $_"
        exit
    }

    Write-Host "Failover status for $($Job_TFO.TargetObjectName) is $($Job_TFOState.State)"
    Start-Sleep 5
} while (($Job_TFOState.State -eq "InProgress") -or ($Job_TFOState.State -eq "NotStarted"))

if ($Job_TFOState.State -eq "Failed") {
    Write-Host "The test failover job failed. Script terminating."
    Exit
} else {

Read-Host -Prompt "Test failover has completed. Please check the ASR portal and test VMs, then press Enter to perform cleanup..."

#Start test failover cleanup of recovery plan

$Job_TFOCleanup = Start-AzRecoveryServicesAsrTestFailoverCleanupJob -RecoveryPlan $RecoveryPlan -Comment "Testing Completed"

do {
    Clear-Host
    Write-Host "======== Monitoring Cleanup ========"
    Write-Host "Status will refresh every 5 seconds."

    try {
        $Job_TFOCleanupState = Get-AzRecoveryServicesAsrJob -Job $Job_TFOCleanup -ErrorAction Stop | Select-Object State
    }
    catch {
        Write-Host -ForegroundColor Red "ERROR - Unable to get status of cleanup job"
        Write-Host -ForegroundColor Red "ERROR - $_"
        exit
    }

    Write-Host "Cleanup status for $($Job_TFO.TargetObjectName) is $($Job_TFOCleanupState.State)"
    Start-Sleep 5
} while (($Job_TFOCleanupState.State -eq "InProgress") -or ($Job_TFOCleanupState.State -eq "NotStarted"))

Write-Host "Test failover cleanup completed."

}

image

The following are screenshots of the PowerShell script output:

image

I hope this will help anyone out there who may be looking for a PowerShell script to automate the ASR failover process.

One of the additions I wanted to make to this script was to list the status of the VMs in the recovery plan after the test failover has completed, but I could not find a way to list only the VMs that belong to a given recovery plan. The cmdlets below list all of the protected VMs, but combing through the properties does not reveal any reference to the recovery plans they belong to. Please feel free to comment if you happen to know the solution.

$PrimaryFabric = Get-AzRecoveryServicesAsrFabric -FriendlyName "svr-asr-01"

#svr-asr-01 represents the Configuration Server

$PrimaryProtContainer = Get-AzRecoveryServicesAsrProtectionContainer -Fabric $PrimaryFabric

$ReplicationProtectedItem = Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $PrimaryProtContainer
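That said, the recovery plan object returned by Get-AzRecoveryServicesAsrRecoveryPlan does appear to expose a Groups property containing the protected items in each group of the plan, so the untested sketch below might be a starting point for cross-referencing them against the protected items retrieved above. The property names used here (ReplicationProtectedItems, Id, ID) are assumptions based on the module's object model and should be verified:

# Untested sketch: pull the plan, flatten the protected-item IDs referenced by each
# of its groups, then match those IDs against the items from the protection container.
$RecoveryPlan = Get-AzRecoveryServicesAsrRecoveryPlan -FriendlyName $ASRRecoveryPlanName

$planItemIds = $RecoveryPlan.Groups |
    ForEach-Object { $_.ReplicationProtectedItems } |
    ForEach-Object { $_.Id }

$ReplicationProtectedItem |
    Where-Object { $planItemIds -contains $_.ID } |
    Select-Object FriendlyName, ProtectionState, ReplicationHealth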

----------Update July 31, 2021---------

After reviewing some of my old notes, I managed to find another version of the PowerShell script that performs the test failover for two recovery plans and includes steps to shut down a VM and remove the VNet peering between the production and DR regions before the test failover, then recreate them afterwards. The following is a copy of the script:

Connect-AzAccount

Set-AzContext -SubscriptionId "53ea69af-xxx-xxxx-a020-xxxxea02f8b"

#Shutdown DC2

Write-Host "Shutting down DC2 VM in DR"

$DRDCName = "DC2"

$DRDCRG = "Canada-East-Prod"

Stop-AzVM -ResourceGroupName $DRDCRG -Name $DRDCName -force

#Declare variables for DR production VNet

$DRVNetName = "vnet-prod-canadaeast"

$DRVnetRG = "Canada-East-Prod"

$DRVNetPeerName = "DR-to-Prod"

$DRVNetObj = Get-AzVirtualNetwork -Name $DRVNetName

$DRVNetID = $DRVNetObj.ID

#Declare variables for Production VNet

$ProdVNetName = "Contoso-Prod-vnet"

$ProdVnetRG = "Contoso-Prod"

$ProdVNetPeerName = "Prod-to-DR"

$ProdVNetObj = Get-AzVirtualNetwork -Name $ProdVNetName

$ProdVNetID = $ProdVNetObj.ID

# Remove the DR VNet's peering to production

Write-Host "Removing VNet peering between Production and DR environment"

Remove-AzVirtualNetworkPeering -Name $DRVNetPeerName -VirtualNetworkName $DRVNetName -ResourceGroupName $DRVnetRG -force

Remove-AzVirtualNetworkPeering -Name $ProdVNetPeerName -VirtualNetworkName $ProdVNetName -ResourceGroupName $ProdVnetRG -force

#Failover Test for Domain Controller BREAZDC2

$RSVaultName = "rsv-asr-canada-east"

$ASRRecoveryPlanName = "Domain-Controller"

$TestFailoverVNetName = "vnet-prod-canadaeast"

$vault = Get-AzRecoveryServicesVault -Name $RSVaultName

Set-AzRecoveryServicesAsrVaultContext -Vault $vault

$RecoveryPlan = Get-AzRecoveryServicesAsrRecoveryPlan -FriendlyName $ASRRecoveryPlanName

$TFOVnet = Get-AzVirtualNetwork -Name $TestFailoverVNetName

$TFONetwork= $TFOVnet.Id

$Job_TFO = Start-AzRecoveryServicesAsrTestFailoverJob -RecoveryPlan $RecoveryPlan -Direction PrimaryToRecovery -AzureVMNetworkId $TFONetwork

do {
    Clear-Host
    Write-Host "======== Monitoring Failover ========"
    Write-Host "Status will refresh every 5 seconds."

    try {
        $Job_TFOState = Get-AzRecoveryServicesAsrJob -Job $Job_TFO -ErrorAction Stop | Select-Object State
    }
    catch {
        Write-Host -ForegroundColor Red "ERROR - Unable to get status of Failover job"
        Write-Host -ForegroundColor Red "ERROR - $_"
        exit
    }

    Write-Host "Failover status for $($Job_TFO.TargetObjectName) is $($Job_TFOState.State)"
    Start-Sleep 5
} while (($Job_TFOState.State -eq "InProgress") -or ($Job_TFOState.State -eq "NotStarted"))

if ($Job_TFOState.State -eq "Failed") {
    Write-Host "The test failover job failed. Script terminating."
    Exit
} else {

#Failover Test for Remaining Servers

$ASRRecoveryPlanName = "DR-Servers"

$RecoveryPlan = Get-AzRecoveryServicesAsrRecoveryPlan -FriendlyName $ASRRecoveryPlanName

$Job_TFO = Start-AzRecoveryServicesAsrTestFailoverJob -RecoveryPlan $RecoveryPlan -Direction PrimaryToRecovery -AzureVMNetworkId $TFONetwork

do {
    Clear-Host
    Write-Host "======== Monitoring Failover ========"
    Write-Host "Status will refresh every 5 seconds."

    try {
        $Job_TFOState = Get-AzRecoveryServicesAsrJob -Job $Job_TFO -ErrorAction Stop | Select-Object State
    }
    catch {
        Write-Host -ForegroundColor Red "ERROR - Unable to get status of Failover job"
        Write-Host -ForegroundColor Red "ERROR - $_"
        exit
    }

    Write-Host "Failover status for $($Job_TFO.TargetObjectName) is $($Job_TFOState.State)"
    Start-Sleep 5
} while (($Job_TFOState.State -eq "InProgress") -or ($Job_TFOState.State -eq "NotStarted"))

if ($Job_TFOState.State -eq "Failed") {
    Write-Host "The test failover job failed. Script terminating."
    Exit
} else {

Read-Host -Prompt "Test failover has completed. Please check the ASR portal and test VMs, then press Enter to perform cleanup..."

$Job_TFOCleanup = Start-AzRecoveryServicesAsrTestFailoverCleanupJob -RecoveryPlan $RecoveryPlan -Comment "Testing Completed"

do {
    Clear-Host
    Write-Host "======== Monitoring Cleanup ========"
    Write-Host "Status will refresh every 5 seconds."

    try {
        $Job_TFOCleanupState = Get-AzRecoveryServicesAsrJob -Job $Job_TFOCleanup -ErrorAction Stop | Select-Object State
    }
    catch {
        Write-Host -ForegroundColor Red "ERROR - Unable to get status of cleanup job"
        Write-Host -ForegroundColor Red "ERROR - $_"
        exit
    }

    Write-Host "Cleanup status for $($Job_TFO.TargetObjectName) is $($Job_TFOCleanupState.State)"
    Start-Sleep 5
} while (($Job_TFOCleanupState.State -eq "InProgress") -or ($Job_TFOCleanupState.State -eq "NotStarted"))

$ASRRecoveryPlanName = "Domain-Controller"

$RecoveryPlan = Get-AzRecoveryServicesAsrRecoveryPlan -FriendlyName $ASRRecoveryPlanName

$Job_TFOCleanup = Start-AzRecoveryServicesAsrTestFailoverCleanupJob -RecoveryPlan $RecoveryPlan -Comment "Testing Completed"

do {
    Clear-Host
    Write-Host "======== Monitoring Cleanup ========"
    Write-Host "Status will refresh every 5 seconds."

    try {
        $Job_TFOCleanupState = Get-AzRecoveryServicesAsrJob -Job $Job_TFOCleanup -ErrorAction Stop | Select-Object State
    }
    catch {
        Write-Host -ForegroundColor Red "ERROR - Unable to get status of cleanup job"
        Write-Host -ForegroundColor Red "ERROR - $_"
        exit
    }

    Write-Host "Cleanup status for $($ASRRecoveryPlanName) is $($Job_TFOCleanupState.State)"
    Start-Sleep 5
} while (($Job_TFOCleanupState.State -eq "InProgress") -or ($Job_TFOCleanupState.State -eq "NotStarted"))

Write-Host "Test failover cleanup completed."

}

}

#Create the DR VNet's peering to production

Write-Host "Recreating VNet peering between Production and DR environment after failover testing"

Add-AzVirtualNetworkPeering -Name $DRVNetPeerName -VirtualNetwork $DRVNetObj -RemoteVirtualNetworkId $ProdVNetID -AllowForwardedTraffic

Add-AzVirtualNetworkPeering -Name $ProdVNetPeerName -VirtualNetwork $ProdVNetObj -RemoteVirtualNetworkId $DRVNetID -AllowForwardedTraffic

#Power On DC2

Write-Host "Powering on DC2 VM in DR after testing"

Start-AzVM -ResourceGroupName $DRDCRG -Name $DRDCName

Monday, July 26, 2021

Configuring Microsoft Azure AD Single Sign-On (SSO) for Citrix ShareFile

I recently had an ex-colleague reach out to me about configuring the integration between Citrix ShareFile and Azure Active Directory (Azure AD), as he was required to configure SAML authentication for a Citrix ShareFile portal so that it would use Azure AD as the IdP. The official documentation can be found here:

How to Configure Single Sign-On (SSO) for ShareFile
https://support.citrix.com/article/CTX208557

Tutorial: Azure Active Directory integration with Citrix ShareFile
https://docs.microsoft.com/en-us/azure/active-directory/saas-apps/sharefile-tutorial

However, the documentation wasn’t extremely clear on some of the steps, and other available blog posts reference the older Azure portal, so I thought writing this post may help anyone looking for updated information.

Step #1 – Adding Citrix ShareFile as an Enterprise Application

Begin by logging into portal.azure.com for the tenant that will be providing Azure AD as the IdP, then navigate to Azure Active Directory > Enterprise Applications:

image

Click on New application:

image

Search for Citrix ShareFile and then click on the tile:

image

A window will slide out from the right to display the application; proceed to click on the Create button:

image

image

The creation of the application will take a few minutes and eventually finish:

image

Step #2 – Configure the Citrix ShareFile Enterprise Application

Proceed to navigate into the Single sign-on configuration in the ShareFile Enterprise Application:

image

Click on the SAML tile:

image

The SAML configuration will be displayed:

image

Click on the Edit button for the Basic SAML Configuration:

image

Remove the default Identifier (Entity ID) configuration:

image

Enter the following for the configuration and then save it:

Identifier (Entity ID):

https://<customDomain>.sharefile.com/saml/info (set this entry as the default)

https://<customDomain>.sharefile.com

Reply URL (Assertion Consumer Service URL):

https://<customDomain>.sharefile.com/saml/acs

Sign on URL:

https://<customDomain>.sharefile.com/saml/login

Relay State:

Leave blank.

Logout Url:

Leave blank.

image

image

Saving the settings will now display the new configuration:

image

You will be prompted to test the single sign-on settings upon successfully saving the SAML configuration, but given that we have not configured ShareFile yet, select No, I’ll test later:

image

Scroll down and locate the certificate download link labeled as:

Certificate (Base64) Download

Download the certificate, then proceed to expand the Configuration URLs section and copy the values for the following to somewhere like Notepad:

  • Login URL
  • Azure AD Identifier
  • Logout URL
image

*Note that the Login URL and Logout URL values are the same and the following is a sample:

https://login.microsoftonline.com/97f1d4b7-d6e7-4ebb-842d-cce6024b0bb3/saml2
https://sts.windows.net/87f1d4b7-d6e7-4ebb-842d-cce6024b0bb2/
https://login.microsoftonline.com/97f1d4b7-d6e7-4ebb-842d-cce6024b0bb3/saml2

Step #3 – Grant Azure AD user access to ShareFile

The next step is to grant access to the users and groups who will be logging into ShareFile with their Azure AD credentials. Failure to do so will result in an error indicating that the user logging on is not assigned to a role for the application.

From within the Citrix ShareFile Enterprise Application, navigate to Users and groups then click on the Add user/group button:

image

Use the Users and groups link to select either a test user or a group that will log into ShareFile with their Azure AD credentials (I will use my user account for this example), and then use the Select a role link to configure a role. The Microsoft documentation indicates that none needs to be selected because Default Access will be configured automatically, but I’ve found that the Assign button does not become active until a role is selected. Other documentation I was able to find indicates the Employee role should be configured, so proceed with Employee as the role:

image

Proceed by clicking the Assign button:

image

Notice that my account is now assigned:

image
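For reference, the same assignment can likely be scripted with the AzureAD PowerShell module rather than the portal. The snippet below is a hedged sketch: the user UPN is a placeholder, and it assumes the ShareFile service principal exposes an Employee app role, falling back to the default access role if it does not:

Connect-AzureAD
$sp   = Get-AzureADServicePrincipal -SearchString "Citrix ShareFile" | Select-Object -First 1
$user = Get-AzureADUser -ObjectId "user@contoso.com"   # placeholder UPN

# Use the Employee app role if the application exposes one; otherwise assign the
# default access role (an empty GUID).
$role   = $sp.AppRoles | Where-Object { $_.DisplayName -eq "Employee" }
$roleId = if ($role) { $role.Id } else { [Guid]::Empty }

New-AzureADUserAppRoleAssignment -ObjectId $user.ObjectId -PrincipalId $user.ObjectId `
    -ResourceId $sp.ObjectId -Id $roleId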

Step #4 – Configure ShareFile for Single sign-on / SAML 2.0

With Azure AD configured, proceed to log into the ShareFile portal as an administrator, then navigate to Settings > Admin Settings > Security > Login & Security Policy:

image

Scroll down to the Single sign-on / SAML 2.0 Configuration section and select Yes for Enable SAML:

image

Proceed by opening the Notepad file with the Configuration URLs that were copied from Azure:

  • Login URL
  • Azure AD Identifier
  • Logout URL

Also open the downloaded Citrix ShareFile.cer Certificate (Base64):

image

Fill in the following fields:

Field: ShareFile Issuer / Entity ID
Value: https://<customDomain>.sharefile.com/saml/info

Field: Your IDP Issuer / Entity ID
Value: Azure AD Identifier (example: https://sts.windows.net/87f1d4b7-d6e7-4ebb-942d-cce6024b0bb2/)

Field: X.509 Certificate
Value: Paste the certificate content from the downloaded Citrix ShareFile.cer Certificate into the configuration.

Field: Login URL
Value: Login URL from Azure (example: https://login.microsoftonline.com/87f1d4b5-d6e7-4ebb-842d-cce6024b0bb2/saml2)

Field: Logout URL:
Value: Logout URL from Azure (example: https://login.microsoftonline.com/87f1d4b5-d6e7-4ebb-842d-cce6024b0bb2/saml2)

image

Scroll down to the Optional Settings section:

image

Locate the SP-Initiated Auth Context configuration:

image

Change the configuration to User Name and Password, select Exact for the field to the right, and save the settings:

image

Step #5 – Set up user as Employee in ShareFile

The next step is to set up the corresponding test user or ShareFile users in ShareFile. This environment uses on-premises Active Directory accounts, which are synced into Azure AD, and the method I used to configure the accounts in ShareFile is the ShareFile User Management Tool (https://support.citrix.com/article/CTX214038). I will not be demonstrating that process in this post.

Step #6 – Test SSO with SAML

The final step is to test SSO to ensure that the configuration is correct. We can begin by using the Test this application button in the Citrix ShareFile Enterprise Application in the Azure portal:

image

image

A successful test will display the following:

image

Next, navigate to the ShareFile login portal and you will notice the additional Company Employee Sign In option for logging in:

image

Proceed to log in and confirm that the process is successful.

Azure Site Recovery replication for Windows 2008 R2 server fails with: "Installation of mobility agent has failed as SHA-2 code signing is not supported on the current Microsoft Windows Server 2008 R2 Standard OS version"

Even though Windows Server 2008 R2 has reached end of support, I still periodically come across it when working with clients, and one of the common scenarios I’ve had to deal with is replicating these servers from an on-premises network to Microsoft Azure with Azure Site Recovery. Below is an issue I’ve seen quite a few times, so I’d like to write this quick blog post to describe the problem and the steps to remediate it.

Problem

You’re trying to replicate an on-premises Windows 2008 R2 server that has Service Pack 1 installed to Azure with Azure Site Recovery:

image

However, the installation of the mobility service fails:

image

The specific Error Details for the server are as follow:

----------------------------------------------------------------------------------------------------------------------------

Error Details

Installing Mobility Service and preparing target

Error ID: 78007

Error Message: The requested operation did not complete.

Provider error: Provider error code: 95560. Provider error message: Installation of mobility agent has failed as SHA-2 code signing is not supported on the current Microsoft Windows Server 2008 R2 Standard OS version. Provider error possible causes: For successful installation, mobility service requires SHA-2 support as SHA-1 is deprecated from September 2019. Provider error recommended action: Update your Microsoft Windows Server 2008 R2 Standard operating system with the following KB articles and then retry the operation. Servicing stack update (SSU) https://support.microsoft.com/en-us/help/4490628 SHA-2 update https://support.microsoft.com/en-us/help/4474419/sha-2-code-signing-support-update Learn more (https://aka.ms/asr-os-support)

Possible causes: Check the provider error for more details.

Recommendation: Resolve the issue as recommended in the provider error details.

Related links:

  • https://support.microsoft.com/en-us/help/4490628
  • https://support.microsoft.com/en-us/help/4474419/sha-2-code-signing-support-update
  • https://aka.ms/asr-os-support

First Seen At: 7/22/2021, 9:28:00 PM

----------------------------------------------------------------------------------------------------------------------------

image

The Error Details suggest downloading and installing KB4490628, but when you attempt to do so, the installation wizard indicates the update is already installed on the server:

https://support.microsoft.com/en-us/help/4490628

AMD64-all-windows6.1-kb4490628-x64_d3de52d6987f7c8bdc2c015dca69eac96047c76e.msu

image

Solution

I’ve come across the following two scenarios for this:

  1. The update KB4490628 indicated above has been installed
  2. The update KB4490628 indicated above has not been installed

Regardless of which of the above scenarios applies to the problematic server, the first step is to download the following KB4474419 update and install it:

2019-09 Security Update for Windows Server 2008 R2 for x64-based Systems (KB4474419)

AMD64-all-windows6.1-kb4474419-v3-x64_b5614c6cea5cb4e198717789633dca16308ef79c.msu

image

image

Once the update has been installed and the server has been restarted, try installing the suggested KB4490628. If it was already installed, the installer will not continue; if it wasn’t, the installation will proceed, complete, and not require a restart.
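To quickly confirm which of the two updates is already present before retrying the job, a simple check with the built-in Get-HotFix cmdlet can be run on the 2008 R2 server:

# Lists the updates that are installed; a missing update simply returns nothing.
Get-HotFix -Id KB4490628, KB4474419 -ErrorAction SilentlyContinue |
    Select-Object HotFixID, InstalledOn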

With the above completed, the Microsoft Azure Site Recovery Mobility Service/Master Target Server should now install successfully and the Enable replication job should complete successfully:

image

With the required updates installed, the deployment of the Mobility Service agent should succeed and the replication job should complete:

image

Hope this helps anyone who may be encountering this issue.

Wednesday, July 7, 2021

What are Proximity Placement Groups?

Proximity Placement Groups were welcomed by many organizations when Microsoft announced the preview (https://azure.microsoft.com/en-us/blog/introducing-proximity-placement-groups/) in July 2019 and finally general availability (https://azure.microsoft.com/en-ca/blog/announcing-the-general-availability-of-proximity-placement-groups/) in December 2019. The concept isn’t in any way complex, but I wanted to write this post to demonstrate its use case for an OpenText Archive Center solution hosted on Azure, a project I was recently involved in. Before I begin, the following is the official documentation provided by Microsoft:

Proximity placement groups
https://docs.microsoft.com/en-us/azure/virtual-machines/co-location

The Scenario

One of the decisions we had to make at the beginning was how to deliver HA across Availability Zones in a region, but OpenText was not clear as to whether they supported clustering Archive Center across Availability Zones due to potential latency concerns. I do not believe that Microsoft publishes specific latency metrics for each region’s zones, but the general guideline I use is that latency across zones can be 2 ms or less, as per the marketing material here:

https://azure.microsoft.com/en-ca/global-infrastructure/availability-zones/#faq

What is the latency perimeter for an Availability Zone?

We ensure that customer impact is minimal to none with a latency perimeter of less than two milliseconds between Availability Zones.

image

To make a long story short, what OpenText eventually provided as a requirement was a 4-node cluster, where 2 servers need to be in one zone and the other 2 can be in another. The servers located in the same zone must have the lowest latency possible, preferably being hosted in the same datacenter. The following is a diagram depicting the requirement:

image

Limitations of Availability Zones

With the above requirements in mind, simply deploying 2 nodes with the availability zone set to 1 and another 2 nodes with the availability zone set to 2 or 3 would not suffice, because of the following facts about Azure regions, zones and datacenters as the Azure footprint grows:

  1. Availability Zones can span multiple datacenters because each zone can contain more than one datacenter
  2. Scale sets can span multiple datacenters
  3. Availability Sets in the future can span multiple datacenters

Microsoft understands that organizations would need a way to guarantee the lowest latency between VMs and therefore provides the concept of Proximity Placement Groups, which is a logical grouping used to guarantee Azure compute resources are physically located close to each other. Proximity Placement Groups can also be useful for low latency between stand-alone VMs, availability sets, virtual machine scale sets, and multiple application tier virtual machines.

Availability Zones with Proximity Groups

Leveraging Proximity Placement Groups (PPG) will guarantee that the 2 OpenText Archive Center cluster nodes in the same Availability Zone are also placed together to provide the lowest latency. The following is a diagram that depicts this.

image

Note that the above diagram includes two additional Document Pipeline servers that will also be grouped with each of the OpenText Archive Center servers.

Limitations of Proximity Placement Group

Proximity Placement Groups do have limitations when there are VM SKUs that are considered exotic and may not be offered in every datacenter. Examples of these VMs are the N series with NVIDIA cards or large-sized VMs for SAP. When mixing exotic VM SKUs, it is best to power on the most exotic VM first so that the more popular VMs are likely to be available in the same datacenter. In the event that Azure is unable to power on a VM in the same datacenter as the previously powered-on VMs, it will fail with the error message:

Oversubscribed allocation request
Stop allocate and try in reverse order

Another method of ensuring all VMs can be powered on is to use an ARM template to deploy all the VMs in the Proximity Placement Group together, as Azure will then locate a datacenter that has all of the requested VM SKUs available.
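For deployments scripted with Az PowerShell rather than an ARM template, the following is a minimal sketch of creating a proximity placement group and referencing it when building a VM; the resource group, names and location are placeholders:

# Create the proximity placement group
$ppg = New-AzProximityPlacementGroup -ResourceGroupName "rg-archive-dr" -Name "ppg-archive-zone1" `
    -Location "canadacentral" -ProximityPlacementGroupType Standard

# Reference the group's resource ID when building the VM configuration; New-AzVM
# exposes the same -ProximityPlacementGroupId parameter.
$vmConfig = New-AzVMConfig -VMName "vm-archive-01" -VMSize "Standard_D4s_v3" `
    -Zone "1" -ProximityPlacementGroupId $ppg.Id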

Measuring Latency Across Availability Zones

When we think about testing latency between servers, the quickest method is to use PING, and while we get a millisecond latency metric in the results, it isn’t an accurate way to measure latency. The Microsoft-recommended tools, latte.exe for Windows and sockperf for Linux, measure TCP and UDP delivery time rather than ICMP, and are described in the following article:

Test VM network latency
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-test-latency

The following is a demonstration of using latte.exe to obtain statistics for two Azure VMs hosted in different Azure Availability Zones in the Canada Central region.

image

I will begin by using two low-end Standard B1s (1 vCPU, 1 GiB memory) virtual machines.

Set Up Receiver VM

Begin by logging onto the receiving VM and opening the firewall for the latte.exe tool:

netsh advfirewall firewall add rule program=c:\temp\latte.exe name="Latte" protocol=any dir=in action=allow enable=yes profile=ANY

image

The Latte application should be shown in the Allowed apps and features list upon successfully executing the netsh command:

image

On the receiver, start latte.exe (run it from the CMD window, not from PowerShell):

latte -a <Receiver IP address>:<port> -i <iterations>
latte -a 10.0.0.5:5005 -i 65100
Protocol TCP
SendMethod Blocking
ReceiveMethod Blocking
SO_SNDBUF Default
SO_RCVBUF Default
MsgSize(byte) 4
Iterations 65100

The parameters are as follows:

  • Use -a to specify the IP address and port
  • Use the IP address of the receiving VM
  • Any available port number can be used (this example uses 5005)
  • Use -i to specify the iterations
  • Microsoft documentation indicates that around 65,000 iterations is long enough to return representative results
image

Set Up Sender VM

On the sender, start latte.exe (run it from the CMD window, not from PowerShell):

latte -c -a <Receiver IP address>:<port> -i <iterations>

The resulting command is the same as on the receiver, except with the addition of -c to indicate that this is the client, or sender:

latte -c -a 10.0.0.5:5005 -i 65100
Protocol TCP
SendMethod Blocking
ReceiveMethod Blocking
SO_SNDBUF Default
SO_RCVBUF Default
MsgSize(byte) 4
Iterations 65100

image

Results

Wait for a minute or so for the results to be displayed:

Sender:

C:\Temp>latte -c -a 10.0.0.5:5005 -i 65100
Protocol TCP
SendMethod Blocking
ReceiveMethod Blocking
SO_SNDBUF Default
SO_RCVBUF Default
MsgSize(byte) 4
Iterations 65100
Latency(usec) 2026.40
CPU(%) 5.2
CtxSwitch/sec 1382 (2.80/iteration)
SysCall/sec 3678 (7.45/iteration)
Interrupt/sec 1100 (2.23/iteration)
C:\Temp>

image

Receiver:

C:\Temp>latte -a 10.0.0.5:5005 -i 65100
Protocol TCP
SendMethod Blocking
ReceiveMethod Blocking
SO_SNDBUF Default
SO_RCVBUF Default
MsgSize(byte) 4
Iterations 65100
Latency(usec) 2026.51
CPU(%) 2.2
CtxSwitch/sec 1140 (2.31/iteration)
SysCall/sec 1524 (3.09/iteration)
Interrupt/sec 1068 (2.17/iteration)
C:\Temp>

image

Sender and receiver metrics side by side:

Note that the latency is labeled as Latency(usec), which is in microseconds, and the result of 2130.93 usec is about 2 ms.

image

Next, I will change the VM size from the low end Standard B1s (1 vcpus, 1 GiB memory) to D series Standard D2s v3 (2 vcpus, 8 GiB memory).

Notice the latency, which is 1546 usec, is better with the D series:

image

However, changing the VM size to the larger Standard D4s v3 (4 vcpus, 16 GiB memory) actually yields slower results at 1840.29 usec. This is likely due to fluctuations in the connectivity speed between the datacenters.

image

Accelerated Networking

Accelerated Networking is one of the recommendations provided in the Proximity Placement Group documentation. Not all VM sizes are capable of accelerated networking but the Standard D4s v3 (4 vcpus, 16 GiB memory) supports it so the following is a test with it enabled.

image

I have validated that my operating system is on the list of supported operating systems (the portal warns: "If connectivity to your VM is disrupted due to incompatible OS, please disable accelerated networking here and connection will resume.").

image

image
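For reference, accelerated networking can also be toggled on an existing NIC with the Az.Network module instead of the portal; the NIC and resource group names below are placeholders, and the VM must be deallocated and use a supported size and OS:

$nic = Get-AzNetworkInterface -Name "vm-latency-01-nic" -ResourceGroupName "rg-latency-test"
$nic.EnableAcceleratedNetworking = $true
$nic | Set-AzNetworkInterface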

Note that the latency has decreased to 1364.43 usec after enabling accelerated networking:

image

Latency within the same Availability Zone

The following are tests with two Standard D4s v3 (4 vcpus, 16 GiB memory) VMs without accelerated networking in the same availability zone.

Latency is 308.55 usecs.

image

The following are tests with two Standard D4s v3 (4 vcpus, 16 GiB memory) with accelerated networking.

The latency significantly improves to 55 to 61.09 usecs.

image

Same Availability Zone with Proximity Placement Group

The following are tests with two Standard D4s v3 (4 vcpus, 16 GiB memory) VMs with accelerated networking in the same availability zone and with a Proximity Placement Group configured.

Create the Proximity Placement Group:

image

image

image

Add the VMs to the Proximity Placement Group:

image

image

The latency results are more or less the same, though the 3 tests are slightly lower at 54 to 57 usec:

image

Summary

The fact that the results for two VMs in the same availability zone without PPG and the results for two VMs in the same availability zone with PPG are more or less the same should not discourage you from using PPG, because the VMs were likely already placed in the same datacenter. Using a Proximity Placement Group guarantees that this is the case every time the VMs are powered off and back on.

The sample size of my tests is too small to claim that the results are conclusive, but I hope they give a general idea of the latency improvements with Accelerated Networking and Proximity Placement Groups.

If you would like to learn more about real world applications and Proximity Placement Groups, the following SAP on Azure article is a good read:

Azure proximity placement groups for optimal network latency with SAP applications

https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/virtual-machines/workloads/sap/sap-proximity-placement-scenarios.md