This article applies to using ESXi 3.5 with HP BL685 blade servers.
Not too long ago, our ESXi servers started dying randomly. It began with the ESXi servers becoming disconnected from VirtualCenter. When SSHing into the system, we found a bunch of unkillable processes, some of which are listed below:
The /opt directory also mysteriously went missing, and there were timeouts from vmhba2, which was the SCSI controller. Thankfully the VMs running on the system kept running. Running /sbin/services.sh stop does not kill the processes, and starting new ones on top of them does not restore VirtualCenter connectivity either.
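For anyone seeing similar symptoms, here is a rough sketch of the checks we ran from the console. These are generic Linux/BusyBox commands; the exact tooling available on the ESXi 3.5 console may differ:

```shell
# Unkillable processes are typically stuck in uninterruptible sleep ("D" state),
# waiting on I/O that never completes. List them:
ps -eo pid,stat,comm | awk 'NR==1 || $2 ~ /^D/'

# Confirm whether /opt has really disappeared
ls -ld /opt 2>&1

# Look for SCSI timeout messages from the vmhba2 controller in the kernel log
dmesg 2>/dev/null | grep -i vmhba2 || echo "no vmhba2 messages found"
```

A process stuck in D state cannot be killed even with `kill -9`, which is why restarting the management services on top of them doesn't help.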
I have been in touch with VMware support over a number of months, for each incident, and they have not been able to offer any advice other than to reboot the server, which IMHO is a very bad thing to do when you're dealing with production systems.
After about the fifth incident, support finally told me that they've received 40+ tickets against the issue and that they are working on a fix. They think it's an interoperability problem between the cache on the SCSI controller and ESXi. In the meantime we just need to keep rebooting our servers whenever it happens. Bad form, VMware! I thought this was supposed to be an Enterprise product. This CAN NOT happen with enterprise-level software!
The VMware internal engineering # for this issue is 420010.
Update: Apparently this bug has been fixed in a firmware update for the storage controller. The problem usually occurs at around the 60-day mark, so I'm waiting to see if it's fixed yet. More to come.
Update 2: 70 days in and the bug hasn’t resurfaced. Looks like the problem is definitely fixed in the updated storage controller firmware.