If you use NetApp (or IBM N series) NFS storage with VMware vSphere 5, or you plan to upgrade to vSphere 5 on NetApp NFS storage, beware of a NetApp bug that causes ESXi hosts to lose connection to their NFS datastores.
There is a bug in the NetApp Data ONTAP code that causes NFS clients, including VMware vSphere 5, to lose connection to their NFS exports under heavy load. It appears to affect the lower-end NetApp FAS systems and vSphere 5. I am not sure whether the problem also occurs with vSphere 4; it may be limited to vSphere 5 because vSphere 5 sends a larger number of asynchronous requests.
I have witnessed this happen a number of times now. The first thing you notice is that the NFS datastores on the heavily loaded NetApp controller are grayed out in the vSphere Client and all of the virtual machines on those datastores stop responding. After a few minutes the datastores come back, and with luck your virtual machines continue as if nothing had happened, although I have seen Microsoft SQL services stop, IBM Domino servers hang and other applications crash. One other side effect I have seen is that any virtual machine that was powered off and stored on one of the affected datastores is shown grayed out and marked (inaccessible) in the vCenter inventory once the datastores come back online. This symptom is described in Matt Vogt's blog at http://blog.mattvogt.net/2013/02/08/vms-grayed-out-after-nfs-datastore-restored/; see his second screenshot for an example.
There are a couple of workarounds for this issue as detailed in the VMware KB article http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2016122 and the following NetApp Bug notices: –
NetApp Director Vaughn Stewart blogged about this problem earlier in the year; see http://virtualstorageguy.com/2013/02/08/heads-up-avoiding-vmware-vsphere-esxi-5-nfs-disconnect-issues/
The bug has been fixed in the following versions of Data ONTAP:
- 7.3 stream – 7.3.7P
- 8.0 stream – 8.0.5
- 8.1 stream – 8.1.2P4
- 8.2 stream – 8.2RC1
Vaughn’s post above suggested that the bug would be fixed in the 8.1.3 release, which had not been released when this post was published. However, the bug notice for 321428 states that it was fixed in 8.1.2P4.
Therefore, if you are already running vSphere 5 with NetApp NFS storage, I suggest that you upgrade to one of the versions of Data ONTAP in which this bug is fixed. If you are planning to upgrade to vSphere 5 and you use NFS storage, make sure you are running one of the versions of Data ONTAP listed above, or later, before you upgrade your ESXi servers.
You can tell if you have experienced this issue by looking for messages similar to the following in /var/log/vmkwarning.log:
Lost connection to the server <NFS-Server-IP-Address> mount point /vol/<NFS-Volume>, mounted as <datastore-id> ("<datastore-name>")
Lost connection to the server 10.10.0.50 mount point /vol/VMware-Datastore01, mounted as bf7ce3db-42c081a2-0000-000000000000 ("Datastore01")
From the ESXi command line you can use grep to find these messages, for example:
grep "Lost connection to the server" /var/log/vmkwarning.log
Bear in mind that these messages only show that a connection was lost; the cause may be something other than this bug.
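To see which exports were affected and how often, the grep can be extended to count the messages per NFS server and mount point. The following is only a sketch: the function name summarize_lost_connections and the LOG variable are my own, and it assumes that sed, sort and uniq are available in your shell. If they are not, copy the log to a workstation and run it there.

```shell
# Count "Lost connection" messages per NFS server and mount point.
# LOG is a hypothetical override; it defaults to the log path used above.
LOG="${LOG:-/var/log/vmkwarning.log}"

summarize_lost_connections() {
    # $1 = log file to scan
    grep "Lost connection to the server" "$1" \
        | sed 's/.*Lost connection to the server \([^ ]*\) mount point \([^,]*\),.*/\1 \2/' \
        | sort \
        | uniq -c
}

# Only scan the default log if it exists and is readable on this machine.
if [ -r "$LOG" ]; then
    summarize_lost_connections "$LOG"
fi
```

Each output line is a count followed by the server address and the mount point. A high count against a single controller during busy periods fits the pattern described above, but it does not prove that this bug is the cause.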