Confusing Mellanox Windows PerfMon Counters

Introduction

So you start out doing SMB Direct. Maybe you’re doing RoCE, and if so there’s a good chance you’ll be using the excellent Mellanox cards. You studied hard, read a lot, and put some real effort into setting it up. The SMB Direct / DCB configuration is how you think it should be and things are working as expected.

Curious as you are, you want to find out if you can see Priority Flow Control at work. The easiest way to do so is by using the Windows Performance Monitor counters that Mellanox provides.

Confusing Mellanox Windows PerfMon Counters

So you take your first look at the Mellanox Adapter QoS PerfMon counters for the ConnectX series for SMB Direct (RDMA) traffic. When you want to see what’s happening with regard to pause frames that have been sent and received, and what pause duration was requested from the receiving hop (or received from the sending hop), you can get confused. The naming is a bit counterintuitive.


The Rcv Pause Duration is not the duration requested by the pause frames the host received, but by the pause frames the host sent. Likewise, the Sent Pause Duration is not the duration requested by the pause frames the host sent, but by the pause frames the host received.


So you might end up wondering why your host sends pause frames only to see the Rcv Pause Duration go up. Now you know why.
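To keep the inversion straight while reading the counters, it can help to write it down as a small lookup table. The following is only a sketch based on the behavior described above: the dictionary and function names are made up for illustration and are not part of any Mellanox or Windows API, and it assumes the pause frame counters themselves mean what their labels say.

```python
# Illustrative lookup only: these names are hypothetical helpers,
# not part of any Mellanox or Windows API.

# Which duration counter moves together with which pause frame counter,
# following the inverted naming described in the text.
DURATION_FOR_FRAMES = {
    "Sent Pause Frames": "Rcv Pause Duration",
    "Rcv Pause Frames": "Sent Pause Duration",
}

# Plain-language meaning of each counter label as displayed in PerfMon.
# Assumption: the frame counters are named as expected; only the
# duration counters are inverted.
MEANING = {
    "Sent Pause Frames": "pause frames this host sent",
    "Rcv Pause Frames": "pause frames this host received",
    "Rcv Pause Duration": "duration requested by the pause frames this host sent",
    "Sent Pause Duration": "duration requested by the pause frames this host received",
}


def duration_counter_for(frames_label: str) -> str:
    """Return the duration counter to watch alongside a pause frame counter."""
    return DURATION_FOR_FRAMES[frames_label]


def describe(label: str) -> str:
    """Return the displayed label together with what it actually measures."""
    return f"{label}: {MEANING[label]}"


if __name__ == "__main__":
    for label in MEANING:
        print(describe(label))
```

So when a host is sending pause frames, `duration_counter_for("Sent Pause Frames")` points you at Rcv Pause Duration as the counter that should rise with it.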

Now, there were plans to fix this in WinOF 4.95. The original release notes mentioned it, which made me quite happy, as most people are confused enough as it is when it comes to RDMA/RoCE/DCB configurations.

A screenshot of the change in the original Mellanox WinOF VPI Release Notes revision 4.95


Unfortunately, this did not happen. The change was removed in a newer version of these release notes. My guess is that it could have been a breaking change of some sort if a lot of tooling or automation expects these counter names.

I still remember how puzzled I was when looking at counters that didn’t make sense to me, and the tedious labor of empirical testing it took to figure out that the wording was a bit “less than optimal”.

But look, once you know this you just need to keep it in mind. For now, we’ll have to live with some confusing Mellanox Windows PerfMon counter names. At least I hope I have saved you the confusion and time I went through when first starting with these Mellanox counters. Other than that, I can only say that you should not be discouraged, as they have been and still are a great tool for checking RoCE DCB/PFC configs.

6 thoughts on “Confusing Mellanox Windows PerfMon Counters”

  1. Do the counters for pause frames and duration on non-utilized priorities increment for no obvious reason? I’ve got a Mellanox ConnectX-3 Pro with the latest drivers. I’m using priority 0 for default traffic and priority 3 for SMB Direct traffic. When I watch the Rcv Pause Frames, Rcv Pause Duration, Sent Pause Frames and Sent Pause Duration counters of the non-utilized priorities, they increase. What’s weird is that, for example, priority 1 will only have its Rcv Pause Frames counter increment while its Rcv Pause Duration counter remains at zero. Similarly, priority 2 will have its Rcv Pause Duration counter increment while its Rcv Pause Frames counter remains at zero. I can see KB and packets received and sent on priorities 0 and 3 like I should, while the other priorities don’t show the packet or KB count increasing. Another interesting thing is that if I stop traffic over the network so that the counters don’t change and then send 1 ping from a remote computer, I see the packets received count increment by 1 on priority 0 like I should, but the Rcv Pause Frames count on priority 1 also increments by 1. It’s almost as if the counters aren’t tracking the data properly. The switch doesn’t show any pause frames during this ping test. Have you experienced this at all when testing? Can this be safely ignored since the priorities are disabled? If it can be ignored, it’s misleading nonetheless…

    • That doesn’t sound right. Maybe a DCBx configuration on the switches is applying settings other than the ones you intend?

      • I tried plugging two of the servers directly together and I was still getting the same thing. Willing was set to false on both servers. So I wiped everything and started over fresh. I got PFC working after I reinstalled the OSes and set things back up. I have SMB at priority 3 and all other traffic at priority 0. PFC is enabled only on priority 3. If I saturate the pipes with SMB traffic I can see the pause frame count go up on the switches and on priority 3 in the Mellanox Adapter QoS Counters for the HBAs on Windows. Everything is working great, and the frame count numbers line up. That is, until I install the Hyper-V role. After I install the Hyper-V role and restart, the Rcv Pause Duration, Rcv Pause Frames, Sent Pause Duration and Sent Pause Frames counters under “Mellanox Adapter QoS Counters” appear to mess up like they did before. No Hyper-V switches have been installed and none of the networking has changed. The only difference is that the Hyper-V role was installed. Something is obviously not right. How can the duration increment without the pause frame count incrementing? I saturated the links to generate pause frames. The counters for priority 3 never incremented. The switch did show some RX XOFF frames and Rx Total Frames on the port, though. Any idea what I should do, or whether there is a way to confirm that it is still working? I’m using Dell-branded ConnectX-3 cards and S4048T-ON switches. I’m at a loss… Here is a link to what I’m seeing on the counters: http://www.kreel.com/Mellanox_QoS_Counters.png

        • I may have figured it out. It may be related to the SR-IOV settings. The global setting was disabled in the BIOS but SR-IOV was enabled on the cards. I disabled that and I think it’s working now with Hyper-V installed. I have to run some more tests to verify, but I believe this was the problem.

          • Interesting side effect – You’d assume that setting would be ignored in that case. Share the results when you can.

          • In the BIOS under Device Settings -> ConnectX-3 Pro -> Device Level Config – Virtualization Mode: change from SR-IOV to None. Everything works on all the servers I tested when SR-IOV is disabled on the HCAs like this. Obviously, disabling SR-IOV on the HCAs means SR-IOV won’t work… I tried a bunch of things to get SR-IOV to work but did not succeed. Note that changing the global SR-IOV setting in the BIOS didn’t appear to make a difference one way or the other. Only the setting on the cards did.
