Force Mellanox ConnectX-4 Lx 25Gbps to 10Gbps speed

Introduction

As you might remember, I wrote a blog post about SFP+ and SFP28 compatibility. In it I discuss future proofing your network investments and not having to upgrade everything all at once. One example is that buying 25Gbps NICs when your main network infrastructure is still on 10Gbps is not an issue. 25Gbps normally handles 10Gbps well, so you don’t have to replace all parts in the fabric at the same time. You can start with either the network fabric or the server NICs. It’s a way of future proofing the investments you make today.

When installing Mellanox ConnectX-4 Lx 25Gbps NICs in a bunch of servers, we hit an issue when connecting them to the DELL EMC N4000 10Gbps switches. The intent is to replace these with 25/50/100Gbps switches in the future.

The links did not come up. The switch ports are normally forced to 10Gbps in our setups, so we checked that. The speed was indeed fixed at 10Gbps. When we changed that to auto-negotiate, the link came up at 1Gbps.

Naturally you check everything, from the cabling to the transceivers used (BCM84754 on the switches), but that all checked out. We also checked the firmware on the switches to determine whether it was up to date and whether a newer version fixed a known issue related to this. But no, everything was up to date on both the switches and the NICs.

Note that these links worked fine when used with 10Gbps cards like the ConnectX-3 Pro. The DELL branded transceivers on the switches were BCM84754 (Broadcom).

The fix: Force Mellanox ConnectX-4 Lx 25Gbps to 10Gbps speed

I do not need to tell you that when you want 10Gbps, getting 1Gbps doesn’t fly well. The fix was easy enough. We put the switch ports back to a fixed speed of 10Gbps. Auto-negotiate doesn’t deliver, but that is no worry as we fix the port speed anyway. We then used the Mellanox mlx5cmd.exe tool to change the NIC ports from auto-negotiate to fixed.

On hosts with Mellanox ConnectX-4 NICs, open an elevated command prompt.

Navigate to C:\Program Files\Mellanox\MLNX_WinOF2\Management Tools. Run the below command to check the current link speed.

mlx5cmd.exe -LinkSpeed -Name "MyNicName" -Query

Note that both 10 and 25Gbps are listed as supported, so the port is set to auto-negotiate.

We force the link speed to 10Gbps:

mlx5cmd.exe -LinkSpeed -Name "MyNicName" -Set 10

Link speed is forced to 10Gbps

The link comes up at 10Gbps

Likewise you can force the link to 25Gbps. If you want to change it back to the default you can force the link speed to auto-negotiate again.

mlx5cmd.exe -LinkSpeed -Name "MyNicName" -Set 0

See https://community.mellanox.com/s/article/mlx5cmd-exe for more information on this tool.
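When many hosts or ports need the same change, it can help to generate the command lines rather than type them one by one. What follows is only a minimal sketch in Python, not part of the Mellanox tooling. It assumes Python is available on the host, the function names are made up for illustration, and the value 25 for -Set is assumed by analogy with the 10 and 0 values shown above. By default it only returns the commands so you can review them first.

```python
import subprocess

# Directory taken from the post; adjust if WinOF-2 is installed elsewhere.
TOOLS_DIR = r"C:\Program Files\Mellanox\MLNX_WinOF2\Management Tools"
TOOL = TOOLS_DIR + r"\mlx5cmd.exe"

# 10 and 0 (auto-negotiate) are shown in the post; 25 is an assumption by analogy.
ALLOWED_SPEEDS = (0, 10, 25)


def link_speed_command(nic_name, speed=None):
    """Build the argument list: -Query when speed is None, otherwise -Set <speed>."""
    # Strip stray whitespace so a pasted name such as "MyNicName " still matches.
    args = [TOOL, "-LinkSpeed", "-Name", nic_name.strip()]
    if speed is None:
        return args + ["-Query"]
    if speed not in ALLOWED_SPEEDS:
        raise ValueError(f"unsupported link speed: {speed}")
    return args + ["-Set", str(speed)]


def force_link_speed(nic_names, speed, dry_run=True):
    """Build one command per NIC; run them only when dry_run is False."""
    commands = [link_speed_command(name, speed) for name in nic_names]
    if not dry_run:
        for command in commands:
            # Must be started from an elevated prompt on the Windows host.
            subprocess.run(command, check=True)
    return commands
```

Calling force_link_speed(["MyNicName"], 10) returns the command for review, and passing dry_run=False executes it. Passing the arguments as a list avoids quoting problems with NIC names that contain spaces.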

Do note that the switch port also needs to be set to 10Gbps fixed. As you can see below the command will notify you when those are still on auto.

The change was done but still no uplink when the switch port isn’t fixed to 10Gbps.

Conclusion

So my statement holds true: the path to 25/50/100Gbps is one you can take step by step with future proofing. You might run into some issues, but these are normally fixable. I have shared with you how to fix failing or wrong speed negotiations on 25Gbps RDMA NICs (Mellanox ConnectX-4 Lx) when connecting to 10Gbps ports. I’m pretty sure the same holds true for other models. I have also had cards where things worked out of the box, so don’t give up when you hit an issue. I hope this helps some of you out there.

Invest in attending VeeamON 2019

Protecting your data is paramount

It is often stated about storage engineers that they have only one rule and that is to never lose data. This is true and it pretty much determines all of their actions. But in reality, this is a top priority for all of us. It has always been like this, but in a full stack world, it is truer than ever. The prime directive of anyone working in ICT, in any role, from helpdesk to CIO, from IT Pro to developer, is to keep services up and running. You have to do this in order to be able to deliver value to the customers. That is how you make a living.

And guess what? In order for services to be up and running, you also need to have any required data protected, available and usable. It’s no good to have a developer corrupt the data logically or physically, have a CISO force encryption but lose the keys to the kingdom, or have ransomware render your backups useless. The list of bad things that can and do happen is long and sobering.

It’s no good to have data available that you cannot use or read. This is true for all data, no matter its nature or where it resides. Stating that technically the data is not lost because you know where it is, you just can never access it again, may be funny to people who are not involved or affected, but it is no good at all and far from a joke.

The reality is that bad things do happen no matter how well you try to prevent them. So, you should never bank on solitary solutions. Always give yourself multiple options. From the good old 3-2-1 rule to having air-gap studies, to not banking on one type of storage, technology, location, process or person, it is all part of adequate to excellent protection. If the business truly and honestly states its data is the gold that makes it thrive, it must act to protect it accordingly.

Are you going then?

No, unfortunately, I am otherwise engaged on the days of the conference. Didier, how can you tell us to invest in attending VeeamON 2019 when you are not even going yourself? First of all, I’d go if I did not have prior engagements. I have done so in the past. I learned a lot and I’m still in touch with customers and Veeam employees I met there. Secondly, an investment does not always involve my physical person being on the scene. It is about getting the right knowledge and insights to the right people in the right places. That means sending colleagues, employees and customers to conferences and training, and not always yourself. You develop the skills, knowledge, and capabilities of your organization. Data availability is a primary concern and a goal that I can only achieve when it does not depend solely on me. That too is investing. Helping others learn, grow and succeed helps everyone out. Needless to say, I wish I was able to go.

Attend VeeamON 2019

Given the important task of keeping data available, the responsibility to protect it and the need to be able to recover it when things go wrong, you have a business case to attend VeeamON 2019. In today’s diverse, complex, and fragmented IT ecosystem, that data is spread across more places, in different layers and in more elaborate solutions than ever before. This makes data protection and availability a challenge.

When business happens at the speed of light you also have to make data available at that pace when required in different locations for various use cases. Data mobility is a reality and this also has to happen in a secure and workable manner.

We can use all the help we can get to make this happen. The good news is you do not have to go it alone. Attend VeeamON 2019 to give yourself a head start. Invest in your business and your staff. Help them succeed. Register!

You can tap into the collective brain of all the attendees. Talk to knowledgeable and experienced practitioners. Learn from your peers and Veeam experts. Gain insights from others on how to approach the challenges you face. Have that discussion to gain a better understanding of what you can do and how. Get your compass aligned so you can confirm you are on track to deliver what is asked of you. After all, it is not a small or benign job to have to ensure the safety of your organization’s 21st-century gold, its data. Discover what Veeam is doing in R&D to help you meet future requirements wherever your data lives.

Don’t forget that others can learn from you as well. Consider responding to the request for proposals. Share what you know and have learned. Contribute and learn whilst teaching. Invest in attending VeeamON 2019.

Windows Server 2019 SMB Direct Best Practices

The Hyper-V amigos, @Hypervserver and @workinghardinit, recently did a webinar together about Windows Server 2019 SMB Direct Best Practices. We also discuss some trends and put some things into perspective. Rachfahl IT Solutions does more of these cool webinars for you to check out (see the info at the end of the video). You can watch the webinar below on Vimeo. It’s quite an honor to be invited to talk on this subject, as Carsten is one of the world’s most experienced S2D practitioners.

https://vimeo.com/319818020
Windows Server 2019 SMB Direct Best Practices Webinar

Need to know more?

I hope this gets you started and updated. Need to know more? Want more details, advice and a deeper and more elaborate discussion? I will be talking about this on various occasions this year. One of them is the Cloud & Datacenter Conference Germany 2019 in Hanau. Register to secure your spot. It is a great conference with a lot of hands-on, real-life knowledge being shared. I will be around for the Hyper-V pre-day and during the entire conference. This means you can find me to talk shop. Be warned, I can go on about the subject for a while.

Replay Manager Configure Server There was an error loading the configuration information.

When replacing a bunch of servers with new DELL R740s (Hyper-V clusters, file clusters, backup targets etc.) I ran into an issue with the DELL Replay Manager software. The servers leverage multiple DELL EMC Storage Center SANs. They have multiple ones for scale-out, redundancy, failover, multiple datacenters, …

With some of the servers I noticed that the loading of the information was slow, while most others were just fine. But with 4 of the servers the connection never actually happened. The connectivity was just fine, and the connectivity test confirmed this. As this had zero impact on the actual replays that were scheduled, it went unnoticed. But when you are adding and removing servers you might need to dive into Server Configuration, and that is where, after a minute, we got the error below.

Configure Server
There was an error loading the configuration information.
Error Message:
The request channel timed out while waiting for a reply after 00:01:00. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout.

Notice that the GUI says it is connecting to our demo server82… but unless you need info from the server itself, you might still see the info it gets from the Storage Center SAN.

This is quite annoying as we need to be in there. So how do we fix this? I have some ideas, as I know this error from .NET WCF, but in this case I was looking for an easier way out, especially as I don’t have all the information about this 3rd party application. The good news is that it is easily fixed.

Fixing this

Replay Manager stores the replays it creates, and the metadata about those replays, on the SAN itself. That’s why you can still see those even when you can’t actually connect to the server. The list of servers you add and use in Replay Manager is stored locally where the client lives. This file is portable, just copy it from your profile and hand it to a colleague. No big deal.

Now the server configuration you do from the Replay Manager GUI tool itself is stored on each and every server where you have the Replay Manager service installed. You will find that file, ReplayManager.config.xml, under C:\ProgramData\Compellent\ReplayManager.

Make a copy to be sure and edit the original using a text editor that has elevated permissions so you can save your changes. In the example file of one server below, note that server82 (green) has 2 old Compellent SC entries (yellow) that are no longer in service. One SAN it cannot find won’t exceed the time-out window, but it does slow the GUI down significantly. With 2 or more phantom old SANs, looking for them takes so long that you get the time-out error.
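If you have many servers to check, the backup and the search for stale entries can be scripted. This is only a minimal sketch in Python, not something Replay Manager ships with. The function name is made up for illustration, the file is assumed to be UTF-8 text, and nothing is assumed about the XML layout: the script just copies the file and lists the lines that mention a name you pass in, so you still do the actual edit by hand.

```python
import shutil
from pathlib import Path


def backup_and_find(config_path, stale_names):
    """Copy the config file next to itself and list the lines mentioning a stale SAN."""
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + ".bak")
    # Keep an untouched copy before anyone edits the original.
    shutil.copy2(config_path, backup_path)

    wanted = [name.lower() for name in stale_names]
    hits = []
    # Assumption: the file is UTF-8 text; undecodable bytes are replaced, not fatal.
    text = config_path.read_text(encoding="utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(name in lowered for name in wanted):
            hits.append((number, line.strip()))
    return backup_path, hits
```

Run it from an elevated session against C:\ProgramData\Compellent\ReplayManager\ReplayManager.config.xml with the names or addresses of the decommissioned Storage Centers. The returned line numbers tell you where to look in the editor.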

The fix is easy: cut the key values for the old SANs out of the file and save it. You then restart the Replay Manager service on that server via an elevated command prompt (or use the GUI):
net stop ReplayManager
net start ReplayManager

Restart the Replay Manager service on the server you need to manage before connecting to it again with the Replay Manager client GUI.

When you now close and relaunch the Replay Manager GUI and connect to the server, things will be a lot faster and will certainly not time out anymore.

Conclusion

Maintain your environment. Try to remove any decommissioned Storage Center SAN from your server configurations in Replay Manager before you take it offline and dispose of it. If you forget and run into a slow-loading Replay Manager GUI or hit a time-out, don’t panic. Replay Manager is actually quite solid and recoverable. We have shown you how to fix this by editing the ReplayManager.config.xml file on the server you need to connect to but can’t. You basically just remove the references to the Storage Centers that no longer exist. I hope it helps some of you out there if you run into this. Feel free to reach out in the comments if you have any questions.