Tuesday, Feb 11, 2020
Broadcom NIC + Windows 2012 R2 + VMQ enabled = horror instead of performance benefit
Do you have a Hyper-V host running on Windows 2012 R2 with Broadcom (NetXtreme 1-gigabit) NICs, and are you experiencing network latency issues? Read on!
Ok, I mentioned it in the title, but what the hell is VMQ?
To get started, I’ll briefly explain what VMQ (Virtual Machine Queue) is.
Per the Wikipedia article, the Virtual Machine Queue (VMQ) is a hardware virtualization technology for the efficient transfer of network traffic (such as TCP/IP, iSCSI or FCoE) to a virtualized host OS. A VMQ-capable NIC can use DMA to transfer all incoming frames that should be routed to a receive queue into the receive buffers allocated for that queue.
To put it in slightly simpler terms, the idea of VMQ is to offload VM packet processing from the Hyper-V host’s CPU to the NIC. In general, it is a great idea, but (obviously) it has some flaws. Some companies, such as Palo Alto Networks, even suggest disabling it in the first place.
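If you want to see how this looks on your own host before changing anything, two cmdlets are worth knowing. A minimal sketch, run from an elevated PowerShell prompt on the Hyper-V host:
# Show per-adapter VMQ settings (enabled or not, and which processors they may use)
Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled, BaseProcessorNumber, MaxProcessors

# Show which VMQ queues are currently allocated (empty output means no queues are in use)
Get-NetAdapterVmqQueue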
How did I notice the latency?
In our case, the Hyper-V host was running in an external network to which I was connected via VPN, so neither I nor my colleagues noticed the latency at first. However, when we hosted the first few services on the VMs and started using them, we noticed that the connection to those VMs was sometimes slow. Ping from our PCs to those VMs wasn’t of much use due to VPN latency (although the numbers were quite good), so we blamed it on the VPN connection, throttling, the number of connected users, and so on. How naive of us.
One day I was working with an app that runs some fairly intensive queries and noticed unusual slowness. The same queries ran much faster (almost 10 times faster) on my machine, against my local database.
I connected to the database VM and ran the same query through pgAdmin directly on it, with no slowness. Now that was a bit fishy.
After some time spent debugging the PostgreSQL instances and blaming various configuration settings, I ran several ping tests and got very erratic results.
- ping from a separate physical PC (in the same network) to the Hyper-V host = less than 1ms - good
C:\Users\Administrator> ping 192.168.30.46
Pinging 192.168.30.46 with 32 bytes of data:
Reply from 192.168.30.46: bytes=32 time<1ms TTL=128
Reply from 192.168.30.46: bytes=32 time<1ms TTL=128
Reply from 192.168.30.46: bytes=32 time=1ms TTL=128
Reply from 192.168.30.46: bytes=32 time<1ms TTL=128
- ping from the same physical PC as above to VM1 = 100-150ms - not good
C:\Users\Administrator> ping 192.168.30.129
Pinging 192.168.30.129 with 32 bytes of data:
Reply from 192.168.30.129: bytes=32 time=148ms TTL=128
Reply from 192.168.30.129: bytes=32 time=88ms TTL=128
Reply from 192.168.30.129: bytes=32 time=102ms TTL=128
Reply from 192.168.30.129: bytes=32 time=121ms TTL=128
- ping from VM1 to VM2 = less than 1ms - good
C:\Users\Administrator> ping 192.168.30.130
Pinging 192.168.30.130 with 32 bytes of data:
Reply from 192.168.30.130: bytes=32 time<1ms TTL=128
Reply from 192.168.30.130: bytes=32 time=1ms TTL=128
Reply from 192.168.30.130: bytes=32 time<1ms TTL=128
Reply from 192.168.30.130: bytes=32 time<1ms TTL=128
- ping from VM1 to the physical PC = 15-30ms - not great, given they are in the same network
C:\Users\Administrator> ping 192.168.30.121
Pinging 192.168.30.121 with 32 bytes of data:
Reply from 192.168.30.121: bytes=32 time=18ms TTL=128
Reply from 192.168.30.121: bytes=32 time=15ms TTL=128
Reply from 192.168.30.121: bytes=32 time=31ms TTL=128
Reply from 192.168.30.121: bytes=32 time=27ms TTL=128
- ping from VM2 to VM1 = 50-350ms (with spikes over 700ms) - definitely not good
C:\Users\Administrator> ping 192.168.30.129
Pinging 192.168.30.129 with 32 bytes of data:
Reply from 192.168.30.129: bytes=32 time=168ms TTL=128
Reply from 192.168.30.129: bytes=32 time=49ms TTL=128
Reply from 192.168.30.129: bytes=32 time=328ms TTL=128
Reply from 192.168.30.129: bytes=32 time=340ms TTL=128
Reply from 192.168.30.129: bytes=32 time=270ms TTL=128
Reply from 192.168.30.129: bytes=32 time=110ms TTL=128
Reply from 192.168.30.129: bytes=32 time=720ms TTL=128
Now, these last results raised a red alert. If the VMs are running on the same physical device, why the hell would ping between them be so slow?
As you can see, the behavior was very erratic.
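If you want to repeat a quick latency matrix like this without pinging everything by hand, here is a rough PowerShell sketch. The IP addresses are simply the ones from my tests above, so replace them with your own; it assumes Windows PowerShell, where the reply objects expose ResponseTime in milliseconds:
# Rough latency check against a list of hosts (IPs are just the examples from above)
$targets = '192.168.30.46', '192.168.30.129', '192.168.30.130', '192.168.30.121'

foreach ($target in $targets) {
    # Send four echo requests and average the reply times
    $replies = Test-Connection -ComputerName $target -Count 4 -ErrorAction SilentlyContinue
    if ($replies) {
        $avg = ($replies | Measure-Object -Property ResponseTime -Average).Average
        '{0,-16} avg {1,4:N0} ms' -f $target, $avg
    }
    else {
        '{0,-16} unreachable' -f $target
    }
}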
Ok, how to fix it? I want speed
After a bit of Googling, I ran into this article.
To be honest, I read the article only up to the line “Broadcom is aware of this issue and will release a driver update to resolve the issue.”, after which I started looking for a new driver and closed the tab.
Unfortunately, the driver update didn’t help, and as I didn’t want to waste more time hunting for another driver that might work better, I searched for alternative solutions and workarounds.
If I had been smarter, I would have re-read the article to the end right after that first failure and checked the other suggestions, but my urge to Google deeper was stronger.
Somewhere on Reddit (thank you, unknown commenter) I read that the easiest way to resolve this problem is to disable VMQ in the Hyper-V settings for each VM.
That helped a lot, and performance improved by around 30-40%, but ping between the VMs still behaved erratically.
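For reference, that per-VM setting can also be applied from PowerShell on the host instead of clicking through each VM’s network adapter settings. A small sketch, where 'VM1' is just a placeholder for your own VM name (a VmqWeight of 0 disables VMQ for that virtual adapter):
# Disable VMQ for a single VM's network adapter (VmqWeight 0 = disabled); 'VM1' is a placeholder
Set-VMNetworkAdapter -VMName 'VM1' -VmqWeight 0

# Verify the setting
Get-VMNetworkAdapter -VMName 'VM1' | Format-Table VMName, Name, VmqWeight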
Eventually, I ran into a suggestion to disable VMQ per network interface (the very same suggestion given by Microsoft in their article). First, I ran Get-NetAdapterVmq in PowerShell on the Hyper-V host and got the following:
PS C:\Users\Administrator> Get-netAdapterVMQ
Name InterfaceDescription Enabled
---- -------------------- -------
Embedded LOM 1 Port 4 Broadcom NetXtreme Gigabit E...#2 True
Embedded LOM 1 Port 3 Broadcom NetXtreme Gigabit E...#3 True
Embedded LOM 1 Port 2 Broadcom NetXtreme Gigabit E...#4 True
Embedded LOM 1 Port 1 Broadcom NetXtreme Gigabit Eth... True
Second, I ran Disable-NetAdapterVmq -Name '<Adapter Name>' for each of the network adapters.
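If you have several adapters, the same thing can be done in one go; a small sketch that loops over every adapter that still has VMQ enabled:
# Disable VMQ on every adapter that currently has it enabled
Get-NetAdapterVmq |
    Where-Object { $_.Enabled } |
    ForEach-Object { Disable-NetAdapterVmq -Name $_.Name }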
To verify, I ran Get-NetAdapterVmq again and got the following:
PS C:\Users\Administrator> Get-netAdapterVMQ
Name InterfaceDescription Enabled
---- -------------------- -------
Embedded LOM 1 Port 4 Broadcom NetXtreme Gigabit E...#2 False
Embedded LOM 1 Port 3 Broadcom NetXtreme Gigabit E...#3 False
Embedded LOM 1 Port 2 Broadcom NetXtreme Gigabit E...#4 False
Embedded LOM 1 Port 1 Broadcom NetXtreme Gigabit Eth... False
Once I did this, I repeated the ping tests (all scenarios from above) and every one of them happily came in under 1ms.
What if I want VMQ?
Although I left VMQ disabled, I suppose some of you will want to make use of it. For that, I suggest reading this Reddit post.
I also recommend reading the same post for a bit of a laugh: the OP improved their production environment’s performance almost 10x just by disabling VMQ!
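If you do decide to keep VMQ, the advice you will usually find (in posts like that one and in Microsoft’s VMQ tuning guidance) boils down to giving each physical adapter its own, non-overlapping range of CPU cores so the queues don’t all land on core 0. A hedged sketch of what that looks like; the processor numbers below are purely illustrative and depend on your core count and hyper-threading setup:
# Illustrative values only: give each NIC its own, non-overlapping processor range
# Note: with hyper-threading enabled, VMQ typically uses only even-numbered (physical) cores
Set-NetAdapterVmq -Name 'Embedded LOM 1 Port 1' -BaseProcessorNumber 2 -MaxProcessors 2
Set-NetAdapterVmq -Name 'Embedded LOM 1 Port 2' -BaseProcessorNumber 6 -MaxProcessors 2

# Re-enable VMQ on the adapters if it was disabled earlier
Enable-NetAdapterVmq -Name 'Embedded LOM 1 Port 1'
Enable-NetAdapterVmq -Name 'Embedded LOM 1 Port 2'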
Make sure to post your comments if you decide to re-enable VMQ and try the steps from the aforementioned post.
Conclusion
I spoke about this problem with my dear colleague (a sysadmin) who usually takes care of preparing Hyper-V hosts, and he laughed a bit, as he is in the habit of disabling VMQ by default.
This particular Hyper-V host was prepared by another colleague, who wasn’t familiar with the problem (and neither was I).
Overall, I lost a few hours chasing the latency before I got on the right track and started looking at the network itself.
In the end, this was quite a good lesson, as I hadn’t encountered VMQ problems before, but I will think twice if I ever run into the same hardware + OS + latency combination again.
Have you encountered the same problem? Do you have a better/easier solution? I am looking forward to your comments.