A new customer comes in and we deploy a new VMware vSphere private cloud platform for them (our first on this type of hardware). Nothing special or too fancy, but our first one with 10G production networking.
After a few weeks, the integration team complains that a random VM stopped being able to communicate with another VM, but only that one specific VM. Moving the "broken" VM to a different ESXi fixed things, so we suspected a bad cable/connection/port/switch. Various tests turned up nothing, so we just waited for it to happen again.
A few days later, same thing. Some more debugging, packet captures, nothing. Rebooting the ESXi fixed the issue, so it probably wasn't the cables or the switch. A support ticket was opened with VMware, who threw all sorts of useless "advice" at us (update drivers, firmware, OS, etc. etc.).
This kept happening more and more, to the point of multiple daily occurrences. Again, only specific VMs couldn't reach other specific VMs; the affected VMs could still be SSHed into and could communicate with everything else. Each time, the only fix was rebooting the hypervisor. VMware were completely and utterly useless, even with all the logs, timelines, etc.
A few weeks in, the customer is getting pissed. We explain that we've tried all sorts of debugging everywhere (packet captures on the ESXi, on the switches, inside the guest OSes, etc. etc.) and there's no rhyme or reason to it: all sorts of VMs, with different virtual hardware versions, different guest OSes, different virtual NIC types, on different ESXi hosts. We're working on it with the vendor, and it's probably a software bug.
One morning I decided to just sit down and read all of the logs on one of the ESXi hosts, trying to spot something weird (early on we had tried grepping for errors and warnings, which yielded just VMware log vomit and nothing of use). There was too much of it, and I didn't see anything.

In desperation, I Googled various combinations of "vmware", "nic type", and "network issues", and boom: I stumbled upon Intel forums with months of people complaining that the Intel X710 NIC's drivers are broken, log a "Malicious Driver Detected" message (not an error), and just shut down traffic on that specific port. And what do you know, those are the NICs we were using, and we had those messages. That piece of shit of a driver had been known not to work for months (it was either that, or it crashing the whole machine), but it was proudly sitting on VMware's compatibility list.

When I told VMware's support about it, they said they were aware of it internally, but refused to remove the driver from the compatibility list. However, if we upgraded to the beta release of the next major vSphere version, there was a newer driver that supposedly fixed everything. We did that, and everything was finally fixed, but there were machines with similar issues where the driver wasn't updated for years afterwards.
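In hindsight, the reason our early greps missed it is simple: the message was logged at info level, so filtering for errors and warnings filtered it out. A minimal sketch of that failure mode, using a made-up sample log line (the exact wording and driver name vary by driver build, so treat the line itself as illustrative):

```shell
# Write a hypothetical vmkernel.log entry resembling the X710 message.
# Note it contains no "error" or "warning" keyword at all.
printf '2018-03-01T09:14:02Z cpu4 vmkernel: i40en: Malicious Driver Detection event on port 0\n' > /tmp/vmkernel.log

# An error-only grep finds nothing (prints 0), which is what bit us:
grep -ci 'error' /tmp/vmkernel.log || true

# Searching for the actual message text finds it (prints 1):
grep -ci 'malicious' /tmp/vmkernel.log
```

The lesson encoded here: when triaging, don't assume the interesting line carries a severity keyword; grep for symptoms (NIC name, port, "link", driver module name) as well as for "error".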
This is the event that taught me that enterprise vendors don't know that much even about their own software, that VMware's support is useless, and that hardware compatibility lists are useless too. You actually need to know what you're doing; you can't rely on support saving you.