Sunday, September 18, 2011

Running a VMware Site Recovery Manager recovery plan errors out with the message: “Error: Operation timed out.”

Problem

You notice that one of your VMware Site Recovery Manager recovery plans errors out at the Change Network Settings step with the message:

Error: Operation timed out.

image

The Change Network Settings step is where SRM attempts to edit the virtual machine’s network interface cards’ settings. These settings include the following:

  • IP address
  • DNS servers
  • DNS suffix
  • WINS servers

The whole process is managed by VMware Tools, which is installed on the guest virtual machine. The sequence is as follows:

  1. Boot the virtual machine
  2. Apply network changes to virtual machine
  3. Wait for VMware Tools to report back to SRM that the changes have been made
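The third step is essentially a wait with a deadline, which is why a slow guest surfaces as "Operation timed out". The sketch below is only a model of that behaviour, not SRM's actual implementation; `wait_for_report`, `has_reported`, and the polling interval are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable


def wait_for_report(
    has_reported: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if has_reported() becomes true before timeout_s elapses.

    has_reported is a hypothetical stand-in for "the guest tools have
    confirmed the network change". clock and sleep are injectable so the
    loop can be exercised without really waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if has_reported():
            return True           # guest reported back in time
        if clock() >= deadline:
            return False          # this is the "Operation timed out" case
        sleep(poll_interval_s)
```

A guest that needs longer than `timeout_s` to boot and apply its settings makes the function return `False` even though nothing is actually broken, which is the situation the timeout increase described below is meant to address.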

There are times when the virtual machine may have one of the following issues:

  1. It takes a very long time to boot up
  2. The customizations take a long time to apply successfully
  3. It simply never reports back

If issue 1 or 2 is the reason the recovery plan fails, you can increase the timeout value of the operation to give the guest operating system enough time to boot and report that it has successfully changed its network settings. The default timeout value is 300 seconds, so try raising it to 600, 1200, or 2000 (2000 is the maximum) to see whether the operation completes successfully. The following screenshots demonstrate the process:

1. Log on to the recovery site’s vCenter, open the Site Recovery plug-in, expand the Recovery Plans node, right-click the recovery plan that’s failing and select Edit Recovery Plan:

image

2. Click the Next button until you reach the Response Times section, then increase the value for Change Network Settings.

image

Once the changes have been applied, try running the test again.
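The trial-and-error approach above (300 by default, then 600, 1200, 2000) can be written down as a small escalation loop. This is a sketch for planning purposes only: the timeout is changed by hand in the Edit Recovery Plan wizard, and `run_test` is a hypothetical callback standing in for "set this value and rerun the test". The 2000-second ceiling is the maximum quoted in this post.

```python
from typing import Callable, Optional, Sequence

DEFAULT_TIMEOUT_S = 300   # default quoted in this post
MAX_TIMEOUT_S = 2000      # maximum quoted in this post


def first_working_timeout(
    run_test: Callable[[int], bool],
    candidates: Sequence[int] = (600, 1200, 2000),
) -> Optional[int]:
    """Try each timeout in order; return the first one for which the test passes.

    run_test(timeout_s) is a hypothetical stand-in for manually setting the
    Change Network Settings timeout and rerunning the recovery plan test.
    Returns None if every candidate still times out.
    """
    for timeout_s in candidates:
        if timeout_s > MAX_TIMEOUT_S:
            raise ValueError(f"{timeout_s} exceeds the maximum of {MAX_TIMEOUT_S}")
        if run_test(timeout_s):
            return timeout_s
    return None
```

A result of `None` corresponds to the third issue in the list (the virtual machine never reports back), where a longer timeout will not help.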

The same applies to failures that point to Wait for OS HeartBeat timeouts. If the test appears to fail during that step, try increasing its timeout to give the virtual machine more time to report back to SRM that it is in good health.

**Note: I’ve always been careful to let the client know that increasing the timeout value doesn’t necessarily mean the plan will run much longer, because the virtual machines that were operating correctly will take the same amount of time and SRM will only wait longer for the problematic one or ones. That said, this is where a decision needs to be made about whether it makes more sense to manually change the IP address in a DR situation rather than wait upwards of an extra 30 minutes or so for SRM to change it for you.
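To put a number on the "extra 30 minutes or so", here is a back-of-the-envelope calculation. It assumes the step waits at most the configured timeout for a virtual machine that never reports back, and it looks at a single problematic virtual machine only; the function name is hypothetical.

```python
def worst_case_extra_wait_s(old_timeout_s: int, new_timeout_s: int) -> int:
    """Extra seconds the step may wait for one VM that never reports back."""
    return max(0, new_timeout_s - old_timeout_s)


# Going from the default (300 s) to the maximum (2000 s):
extra_s = worst_case_extra_wait_s(300, 2000)
print(extra_s, "seconds =", round(extra_s / 60, 1), "minutes")
# prints: 1700 seconds = 28.3 minutes
```

Virtual machines that report back promptly are unaffected by the higher value, since the wait ends as soon as they report.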

If the steps above do not correct the issue, VMware’s documentation suggests checking the following items next:

  1. Ensure that VMware Tools is installed on the virtual machine and is up-to-date.
  2. If the virtual machine uses a DHCP server to obtain a dynamic or reserved IP address, ensure that the DHCP server is up and operational before the virtual machine is booted.
  3. If the virtual machine is configured with a static IP address in the customizations, ensure that the static IP is available and that there are no conflicts.
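For the third check, one cheap sanity test is to look for duplicate addresses within the planned customizations themselves. The sketch below assumes you have the planned static IPs written down as a simple mapping; the virtual machine names and addresses are made up for illustration, and the check only finds duplicates inside that list. It cannot tell you whether an address is already in use on the recovery network.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List


def find_static_ip_conflicts(planned_ips: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {ip: [vm names]} for every IP assigned to more than one VM.

    planned_ips maps a VM name to the static IP in its customization.
    ipaddress.ip_address() also rejects malformed addresses with ValueError.
    """
    vms_by_ip = defaultdict(list)
    for vm_name, ip_text in planned_ips.items():
        vms_by_ip[ipaddress.ip_address(ip_text)].append(vm_name)
    return {str(ip): sorted(vms) for ip, vms in vms_by_ip.items() if len(vms) > 1}


# Hypothetical example data
planned = {
    "app01": "10.1.20.15",
    "web01": "10.1.20.16",
    "db01": "10.1.20.15",   # same address as app01
}
conflicts = find_static_ip_conflicts(planned)
print(conflicts)
```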

If none of the steps above corrects the error, try removing the customization and running the test again. If the test succeeds without the customization, the error is most likely caused at the virtual machine operating system level. Errors caused by the operating system are difficult to troubleshoot because the cause can be either the OS itself or a third-party application. At this point, I would recommend removing the customization and making a note in the recovery process documentation to manually change the IP address for the problematic virtual machine.

1 comment:

Ken said...

Hi Terence

I am encountering an error when I use a .bat file with a netsh command to change the IP and static DNS servers on the recovered VM. I have tried a script path to the VM and to the vSphere server, but both error out. The netsh command works, but I just cannot get it to run from the recovery plan. Any ideas or alternate ways to change the IP?