vSphere 5.5/6 DNS failover mechanism

Sometimes you think you know exactly how things work, and then you learn the hard way that you were wrong.

So what happened?

Many hosts appeared unreachable in vCenter at first. Over time, more and more hosts seemed to disappear. Login to the Web Client was also very slow.
After searching a bit we saw that the DNS server configured as primary DNS server on ESXi 5.5/6 hosts and vCenter Servers was down.

There is a second one configured, which should be used as failover, no? Well, it’s complicated…
It appears that the failover happens per request: for every single lookup, the primary DNS server is tried first, and the secondary one is only contacted once that attempt has timed out. In other words, the system never switches over to the secondary DNS server for good while the primary one is unreachable. With the standard timeout and retry values, these delayed requests pile up, and depending on the size of the infrastructure, the whole environment can become unmanageable.
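To get a feeling for how quickly this adds up, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function names are made up for this post, the 5-second timeout is an assumption (a common resolver default, not a value we measured on our hosts), and real resolvers add retries and other details on top.

```python
from typing import Optional, Sequence


def delay_before_answer(servers_up: Sequence[bool], timeout: float,
                        first: int = 0) -> Optional[float]:
    """Seconds spent waiting on dead servers before a healthy one answers.

    Models a single pass over the server list, starting at index `first`.
    Returns None if no server in the list is reachable.
    """
    waited = 0.0
    count = len(servers_up)
    for step in range(count):
        if servers_up[(first + step) % count]:
            return waited
        waited += timeout  # a dead server costs one full timeout
    return None


def total_delay(lookups: int, servers_up: Sequence[bool], timeout: float,
                rotate: bool = False) -> float:
    """Accumulated waiting time for a number of sequential lookups."""
    total = 0.0
    for i in range(lookups):
        # Without rotation every lookup starts at the primary server again.
        first = i % len(servers_up) if rotate else 0
        delay = delay_before_answer(servers_up, timeout, first)
        if delay is None:
            raise RuntimeError("no DNS server reachable")
        total += delay
    return total


# Primary down, secondary up, 100 lookups:
print(total_delay(100, [False, True], timeout=5.0))               # assumed default
print(total_delay(100, [False, True], timeout=1.0, rotate=True))  # shorter timeout plus rotation
```

In this simplified model, every lookup pays the full timeout as long as the primary server is down, which is why shortening the timeout and rotating between servers makes such a difference.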

This is – according to VMware support – by design. But it can be adapted with a workaround to avoid issues like the ones we experienced.

What can be done?

VMware KB 2145255 describes a workaround for the vCenter Server Appliance 5.5.x and 6.0.x.
They suggest editing the /etc/resolv.conf file to define timeout and retry values, and adding a rotation mechanism, by simply appending this line to the existing file:

options timeout:1 attempts:1 rotate
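If you want to check what is actually configured on a system, the options line is easy to parse. The following Python sketch is just an illustration: parse_resolver_options is a hypothetical helper name, and the name server addresses in the sample are documentation placeholders, not real servers.

```python
def parse_resolver_options(resolv_conf_text: str) -> dict:
    """Collect the settings from every 'options' line of a resolv.conf-style text.

    Tokens such as 'timeout:1' become {'timeout': 1}; bare flags such as
    'rotate' become {'rotate': True}.
    """
    options = {}
    for raw_line in resolv_conf_text.splitlines():
        fields = raw_line.split()
        if not fields or fields[0] != "options":
            continue  # nameserver, search, comments, blank lines
        for token in fields[1:]:
            name, separator, value = token.partition(":")
            options[name] = int(value) if separator else True
    return options


sample = """nameserver 192.0.2.10
nameserver 192.0.2.11
options timeout:1 attempts:1 rotate
"""
print(parse_resolver_options(sample))
```

A file without an options line yields an empty dictionary, which means the resolver is running with its built-in defaults.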

Unfortunately, the KB only mentions the vCenter Server Appliance 5.5.x and 6.0.x, not vSphere hosts or other appliances.
After a few tests on some hosts, I found that the workaround works very well on vSphere 5.5/6 hosts too. VMware support later validated it for vSphere hosts and for other Linux-based appliances.

Note that this behaviour only affects vSphere up to version 6. As vSphere 6.5 is based on Photon OS, it is not impacted and does not need any workaround.

Appliances

To modify appliances, simply follow the steps in the KB:

To reduce the timeout value and allow the appliance to fail over to the next available DNS server, modify the /etc/resolv.conf file.

Note: If you are using a dispersed vSphere topology with one or more external PSCs, vCenter Management nodes, and Single Sign-On nodes, you must perform these steps on all appliances. If you modify the DNS settings through the VAMI or the virtual machine console, these values are lost and need to be reapplied.

Note: Take a snapshot of the vCenter Server Appliance before proceeding.

  1. Open an SSH session to the vCenter Server Appliance.
  2. Take a backup of the /etc/resolv.conf file.
  3. Open the /etc/resolv.conf file using a suitable text editor.
  4. Add these values to the end of the file:
     options timeout:1 attempts:1 rotate
     Notes:
       • The timeout value controls the time in seconds before moving on to the next DNS server.
       • The attempts value controls the number of retries before moving to the next DNS server.
       • The rotate value adds a round robin behavior.
  5. Save and close the file.
  6. Reboot the appliance.
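If you have to repeat the backup and edit steps on several appliances, they can be scripted. The Python sketch below is only one possible way to do it, not something taken from the KB: add_resolver_options and the .bak suffix are names I made up, and it assumes a Python interpreter is available on the appliance. It skips the change if the line is already present, so running it twice does no harm.

```python
import shutil
from pathlib import Path

OPTIONS_LINE = "options timeout:1 attempts:1 rotate"


def add_resolver_options(path: str, backup_suffix: str = ".bak") -> bool:
    """Back up the file, then append OPTIONS_LINE to it.

    Returns False and leaves the file untouched if the line already exists.
    """
    target = Path(path)
    text = target.read_text()
    if OPTIONS_LINE in (line.strip() for line in text.splitlines()):
        return False

    # Keep a copy of the original file next to it (hypothetical naming scheme).
    shutil.copy2(target, target.with_name(target.name + backup_suffix))

    # Make sure the new line starts on a line of its own.
    if text and not text.endswith("\n"):
        text += "\n"
    target.write_text(text + OPTIONS_LINE + "\n")
    return True
```

Calling add_resolver_options("/etc/resolv.conf") as root would cover the backup and edit steps; the reboot still has to be done afterwards, and the change has to be reapplied whenever the DNS settings are modified through the VAMI.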

vSphere hosts

If you have many hosts, connecting to each one of them takes time, so it is handy to script this.
All you need is plink, the command-line SSH client from the PuTTY suite.

I wrote the following little script to modify the /etc/resolv.conf file on each host registered to a vCenter:

# Connect to the vCenter Server that manages the hosts
Connect-VIServer <VCENTER>
foreach ($vmhost in Get-VMHost) {
    # Start the SSH service if it is disabled
    $ssh = $vmhost | Get-VMHostService | Where-Object { $_.Key -eq "TSM-SSH" }
    Start-VMHostService -HostService $ssh -Confirm:$false

    # Append the options to the /etc/resolv.conf file on the host
    # ("echo y" answers the host key prompt on the first connection)
    echo y | .\plink.exe -v $vmhost.Name -l <USERNAME> -pw <PASSWORD> "echo 'options timeout:1 attempts:1 rotate' >> /etc/resolv.conf"

    # Stop the SSH service again
    Stop-VMHostService -HostService $ssh -Confirm:$false
}
Disconnect-VIServer <VCENTER> -Confirm:$false

That’s it!