Barnacles, strategists, consultants and coaches at the office

Disclaimer: The Dilbert® Life series is a string of posts on corporate culture from hell and dysfunctional organizations running wild. This can be quite shocking and sobering. A sense of humor will help when reading this. If you need to live in a sugar-coated world where all is well and bliss, and think all you do is close to godliness, stop reading right now and forget about these blog entries. It’s going to be dark. Pitch black at times actually, with a twist of humor, if you can laugh at yourself.

When people tell me they have strategy consultants, ITIL, SCRUM, KANBAN, … coaches and architects, and that these are well embedded in their organization to ensure operational and long-term success, I always try to envision this. No matter how hard I try to see the “marketing brochure” mental picture and the connotation of professionalism and success this is supposed to inspire, I never succeed.


In reality this is the mental picture I get: barnacles!


Barnacles, strategists, consultants and coaches at the office inspire me to get a chisel and high pressure cleaner to get rid of these. Barnacles slow us down, reduce efficiency and lead to structural damage.

Most organizations are failing due to their obsession with failing. That’s why ITIL is considered a success. All evidence to the contrary I must add. I have never found an IT professional who had seen any benefits to the success of IT come from ITIL.

ITIL is considered a success by people who are trying to manage IT but who do not understand IT. That’s business analysts, project managers, architects and, way too often, way too many IT managers. I’m not picking on ITIL per se. Put any methodology in the hands of scared, clueless people and they cling to it like a shipwrecked person to a life preserver. It’s a tool to be used where and when needed. Walking around in one at the office is pretty silly.

ITIL caters to their fears and their childish need to avoid failure. You might say that’s a result, but I think we can at least agree this is not success or progress, which is the type of result you’re looking for in a business. Still, why do so many waste so much time on processes of control that will not be sustainable in the reality of the field? It soothes fears and it feeds the need to be seen as in charge and having things under control. They think it makes them look like they are in charge. Basically they’re acting. Like kids, pretending to be what they are not and will never be. It’s a sad day when I have to quote from Corinthians but desperate times call for desperate measures.

“When I was a child, I talked like a child, I thought like a child, I reasoned like a child. When I became a man, I put the ways of childhood behind me.”

Clearly too many people have missed some essential and significant steps, or got stuck in them, in professional life. Clever consultants and coaches cash in on delivering the instruments to anticipate problems, avoid problems, detect problems, manage change to avoid problems and, last but not least, provide a framework to proactively deal with anything up to and including nuclear warfare. In my reality these people are more on par with racketeers, con men, liars and priests of false religions. As in real life they can make big money and gain a lot of influence and power, but only if you allow them to. That, however, does not make them right.

Failure is not an option. It is, for all practical purposes, guaranteed and free of charge. What you need is smart people who understand the context, have great situational awareness and possess the ability to think and act fast. This is not the same as wasting time and money in endless meetings, task forces and procedures. It’s always what you never considered that will get you in the end. Solve the problems you have fast, effectively and decisively, to the best of your abilities and in alignment with the environment. If you can do that, you have just made progress en route to success! The results are fast, measurable and simple enough, as they are noticeable without a microscope.

There is way too much waste in governance, leading to the exact opposite of what one is, supposedly, trying to achieve, which is a better and more successful business. In fact, in cost and head count these activities outnumber the people delivering tangible results by 3 to 1, and in some cases even more. They assign blame and steal success, as in reality their main purpose is to avoid being blamed themselves and to look good in order to get ahead.

Meanwhile your organization keeps failing as you keep adding overhead, head count and expenses. What you need to do is let your good and best employees excel at what they do best: achieve progress and move along. You need to steer that effort and ability towards the company goals and stimulate your employees.

Move fast, navigate through the unpredictable waters and learn how to deal with the fallout effectively. Whatever you do, don’t think that more governance is the way forward or that it is real work compared to actual progress through results. Face it, you are probably not a nuclear power plant or a highly regulated medical institution. You’re most likely an SME trying to thrive with limited budgets, resources and time. So not wasting any of it is paramount. Get rid of the crud, mend your sails and chop the barnacles off your ship’s hull. You can achieve more with men of steel on wooden ships than vice versa. The latter tend to stay and rust in safe harbors. In the end this does not mean you’re reckless!

Upgrade the firmware on a Brocade Fibre Channel Switch

NOTE: content available as pdf download here.


In order to maintain a secure, well-functioning fibre channel fabric over the years you’ll need to perform a firmware upgrade now and again. Brocade fibre channel switches are expensive but they do deliver a very solid experience. This experience is also obvious in the firmware upgrade process. We’ll walk through this as a guide on how to upgrade the firmware on a Brocade fibre channel switch environment.

Have an FTP/SFTP/SCP server in place

If you have some switches in your environment you’re probably already running a TFTP or FTP server for upgrading those. For TFTP I use the free, simple and good one provided by SolarWinds. They also offer a free SCP/SFTP solution. For FTP it depends: either we have IIS with FTP (and FTPS) set up, or we use FileZilla FTP Server, which also offers SFTP and FTPS. In any case this is not a blog post about these solutions. If you’re responsible for keeping network gear in tip-top shape you should have this little piece of infrastructure set up for both downloads and uploads of configurations (backup/restore), firmware and boot code. If you don’t have this, it’s about time you set one up, sport! A virtual machine will do just fine. We back that VM up as well, since we store our firmware and configuration backups on it. For mobile scenarios I just keep TFTP & FileZilla Server installed and ready to go on my laptop, in a stopped state until I need them.

Getting the correct Fabric OS firmware

It’s up to your SAN & switch vendors to inform you about support for firmware releases. Some OEMs will publish those on their own support sites; some will coordinate with Brocade to deliver them as a download for the specific models they sell and support. Dell does this. To get it, select your switch version on the Dell support site and under downloads you’ll find a link.


That link takes you to the Brocade download page for DELL customers.


Make sure you download the correct firmware for your switch. Read the release notes and make sure the hardware you use is supported. Do your homework and go through the Brocade Fabric OS (FOS) 7.x Compatibility Matrix. There is no reason to shoot yourself in the foot when this can be avoided. I always contact DELL Compellent CoPilot support to verify the version is supported with the Compellent Storage Center firmware.

When you have downloaded the firmware for your operating system (I’m on Windows), unzip it and place the content of the resulting folder in your FTP root or desired folder. I tend to put the active firmware under the root and archive older ones as they get replaced. So that root looks like this. You can copy it there over RDP or via an FTP client. If the FTP server is running on your laptop, it’s just a local copy.

clip_image005

The upgrade process

A word on upgrading the firmware

If you move from a single major level/version to the next, or upgrade within a single major level/version, you can do non-disruptive upgrades with a High Availability (HA) reboot. This means that while the switch reloads it will not impact the data flow; the FC ports stay online. Everything keeps running, bar that you lose connectivity to the switch console for a short time.

Some non-disruptive upgrade examples:

v6.3.2e to v6.4.3g

v7.4.0a to v7.4.0b

v7.3.0c to v7.4.0b

Note that this way you can step from an old version to a new one, step by step, without ever needing downtime. I have always found this a really cool capability.

You can find Brocade’s recommendations on what the desired version of a major release is in https://www.brocade.com/content/dam/common/documents/content-types/target-path-selection-guide/brocade-fos-target-path.pdf

I tend to wait a bit with the latest releases, as the newer ones need some wrinkles taken care of, as we can now see with 7.4.1, which is susceptible to memory leaks.

Some disruptive upgrade examples (FC ports go down):

v7.1.2b to v7.4.0a

v6.4.3h to v7.4.0b

Our upgrade here from 7.4.0a to 7.4.0b is non-disruptive, as was the upgrade from 7.3.0c to 7.4.0a. You can jump more than one version at once, but it will require a reboot that takes the switch out of action. That’s not a huge issue if you have two redundant fabrics (and you should), but it can be avoided by moving between versions one at a time. It takes longer, but it’s totally non-disruptive, which I consider a good thing in production. I reserve disruptive upgrades for greenfield scenarios or new switches that will be added to the fabric after I’m done upgrading.
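The stepping rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a Brocade tool: the RELEASE_LEVELS list is an assumed ordering of Fabric OS release levels, and the function names are hypothetical. Always confirm the supported upgrade paths in the release notes of your target version.

```python
import re

# Assumed ordering of Fabric OS release levels (illustration only;
# verify against the Brocade release notes for your hardware).
RELEASE_LEVELS = ["6.3", "6.4", "7.0", "7.1", "7.2", "7.3", "7.4"]


def release_level(version: str) -> str:
    """Return the release level of a version string, e.g. 'v7.4.0b' -> '7.4'."""
    match = re.match(r"^[vV]?(\d+)\.(\d+)\.", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def is_non_disruptive(current: str, target: str) -> bool:
    """True when the target is in the same release level or the next one up."""
    start = RELEASE_LEVELS.index(release_level(current))
    end = RELEASE_LEVELS.index(release_level(target))
    return 0 <= end - start <= 1


def stepping_levels(current: str, target: str) -> list[str]:
    """Release levels to pass through, one at a time, to avoid a disruptive jump."""
    start = RELEASE_LEVELS.index(release_level(current))
    end = RELEASE_LEVELS.index(release_level(target))
    return RELEASE_LEVELS[start + 1 : end + 1]
```

With this sketch, is_non_disruptive("v7.3.0c", "v7.4.0b") is true, while going from 7.1.2b to 7.4.0a is flagged as disruptive and stepping_levels lists 7.2, 7.3 and 7.4 as the intermediate levels to visit instead.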

Prior to the upgrade

There is no need to run a copy run or write memory on a Brocade FC switch. It persists what you do, and you have to save and activate your zoning configuration anyway when you configure it (cfgsave). All other changes are persisted automatically. So in that regard you should be all good to go.

Make a backup copy of your configuration as is. This gives you a way out if the shit hits the fan and you need to restore to a switch you had to reset or so. Don’t forget to do this for the switches in both fabrics, which normally you have in production!

Log on to the switch with your username and password over telnet or ssh (I use PuTTY or KiTTY).

MySwitchName:admin> configupload

Hit ENTER

Select the protocol of the backup target server you are using

Protocol (scp, ftp, sftp, local) [ftp]: ftp

Hit ENTER

Server Name or IP Address [host]: 10.1.1.12

Hit ENTER

Enter the user, here I’m using anonymous

User Name [user]: anonymous

Hit ENTER

Give the backup file a clear and identifying name

Path/Filename [<home dir>/config.txt]: MySwitchNameConfig20151208.txt

Hit ENTER

Select all (default)

Section (all|chassis|switch [all]): all

configUpload complete: All selected config parameters are uploaded

That’s it. You can verify you have a readable backup file on your FTP server now.
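If you back up several switches regularly, it helps to generate the file name instead of typing it. The sketch below builds a name in the same style as the example above and lists the answers in the order the prompts were shown. The function names are hypothetical, and your FOS version may ask additional questions (a password, for example) that are not covered here.

```python
from datetime import date


def backup_filename(switch_name: str, day: date) -> str:
    """Build a name like 'MySwitchNameConfig20151208.txt'."""
    return f"{switch_name}Config{day:%Y%m%d}.txt"


def configupload_answers(server: str, user: str, filename: str) -> list[str]:
    """Answers in the order of the prompts shown in the walkthrough above:
    protocol, server, user name, path/filename, section."""
    return ["ftp", server, user, filename, "all"]
```

For example, backup_filename("MySwitchName", date(2015, 12, 8)) gives MySwitchNameConfig20151208.txt, so backups sort by switch and date on the FTP server.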


The Upgrade

A production environment normally has 2 fabrics for redundancy. Each fabric consists of 1 or more switches. It’s wise to start with one fabric and complete the upgrade there. Only after all has proven to be well there should you move on to the second fabric. To avoid any impact on production I tend to plan these early or late in the day, also avoiding any backup activity. Depending on your environment you could see some connectivity drops on any FC-IP links (remote SAN replication, FC to IP <-> IP to FC), but when you work one fabric at a time you can mitigate this during production hours via redundancy.

Log on to the first Brocade fabric switch with your username and password over telnet or ssh (I use PuTTY or KiTTY). At the console prompt type

firmwaredownload

This is the command for the non-disruptive upgrade. If you need or want to do a disruptive one, you’ll need to use firmwaredownload -s.

Hit Enter

Enter the IP address of the FTP server (or the name if you have name resolution set up and working)

Server Name or IP address: 10.1.1.12

User name: I fill out anonymous as this gives me the best results. Leaving it blank doesn’t always work depending on your FTP server.

User Name: anonymous

Enter the path to the firmware, I placed the firmware folder in the root of the FTP server so that is

Path: /v7.4.0b

Hit enter

At the password prompt leave the password empty. Anonymous FTP doesn’t need one.

Password:

Hit enter and the upgrade process preparation starts. After the checks have passed you’ll be asked if you want to continue. We enter Y for yes and hit Enter. The firmware download starts and you’ll see lots of packages being downloaded. Just let it run.


This goes on for a while. At one point you’ll see the PROM update happening.

When the download is done, the switch removes unneeded files, informs you that the download has completed and starts the HA reboot. HA stands for high availability. Basically it fails over to the other CP (Control Processor, see http://www.brocade.com/content/html/en/software-upgrade-guide/FOS_740_UPGRADE/GUID-20EC78ED-FA91-4CA6-9044-E6700F4A5DA1.html) while the first one reboots and loads the new firmware. All this happens while data traffic keeps flowing through the switch. Pretty neat.

When you keep a continuous ping to the FC switch running during the HA reboot you’ll see a short drop in connectivity.


But do realize that since this is an HA reboot the data traffic is not interrupted at all. When you get connectivity back, you SSH to the switch and verify the reported version, which here is now 7.4.0b.
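If you want to put a number on the short management connectivity drop seen in the continuous ping, you can log each ping result with a timestamp and compute the gap afterwards. The sketch below assumes you already have such a log as (timestamp, reply received) pairs; how you collect it is up to you, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def longest_outage(samples: list[tuple[datetime, bool]]) -> timedelta:
    """Longest gap between the last good reply before a drop and the first
    good reply after it.

    `samples` is a time-ordered list of (timestamp, reply_received) pairs.
    An outage still in progress at the end of the log is not counted.
    """
    longest = timedelta(0)
    last_ok = None      # timestamp of the most recent successful ping
    in_outage = False   # True while replies are being lost
    for stamp, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, stamp - last_ok)
            last_ok = stamp
            in_outage = False
        else:
            in_outage = True
    return longest
```

Keep in mind this only measures reachability of the switch management interface. It says nothing about the FC data path, which keeps flowing during an HA reboot.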


That’s it. Move on to the next switch in the same fabric until you’re done. But stop there before you move on to your second fabric (failure domain). It pays to go slow with firmware upgrades in an existing environment.

This doesn’t just mean waiting a while before installing the very latest firmware to see whether any issues pop up in the forums. It also means you should upgrade one fabric at a time and evaluate the effects. If no problems arise, you can move on with the second fabric. By doing so you will always have a functional fabric even if you need to bring down the other one in order to resolve an issue.
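The "one fabric at a time" rule is easy to encode if you keep an inventory of your switches. The sketch below groups switches into per-fabric batches so that one fabric is finished and evaluated before the other is touched. The switch names, fabric labels and function name are hypothetical.

```python
def rollout_order(switches: dict[str, str], first_fabric: str) -> list[list[str]]:
    """Group switches into per-fabric batches, starting with `first_fabric`.

    `switches` maps a switch name to the fabric it belongs to. Each batch
    should be upgraded and evaluated completely before the next one starts.
    """
    if first_fabric not in switches.values():
        raise ValueError(f"no switch belongs to fabric {first_fabric!r}")
    # The chosen fabric sorts first; the remaining fabrics follow by name.
    fabrics = sorted(set(switches.values()), key=lambda name: (name != first_fabric, name))
    return [
        sorted(switch for switch, fabric in switches.items() if fabric == current)
        for current in fabrics
    ]
```

Given four switches spread over fabrics A and B, rollout_order(inventory, "B") returns the two B switches as the first batch and the two A switches as the second.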

On the other hand, don’t leave fabrics unattended for years. Even if you have no functional issues, bugs are getting fixed and perhaps more importantly security issues are addressed as well as browser and Java issues for GUI management. I do wish that the 6.4.x series of the firmware got an update in order for it to work well with Java 8.x.

A highly redundant Application Delivery Controller Setup with KempTechnologies

Introduction

The goal was to make sure the KempTechnologies LoadMaster Application Delivery Controller was capable of handling the traffic to all load balanced virtual machines in a high volume data and compute environment. Needless to say the solution had to be highly available.


The environment offers rack and row as failure units in power, networking and compute. Hyper-V cluster nodes are spread across racks in different rows. Networking is high to continuously available, allowing for planned and unplanned maintenance as well as failure of switches. All racks have redundant PDUs that are remotely managed over Ethernet. There is a separate out of band network with remote access.

The 2 Kemp LoadMasters are mounted in different rows and different racks to spread the risk and maintain high availability. Eth0 & Eth2 are in an active-passive bond for a redundant management interface; eth1 is used to provide a secondary backup link for HA. These use the switch independent redundant switches of the rack that also uplink (VLT) to the Force10 switches (spread across racks and rows themselves). The two 10 Gbps ports are in an active-passive bond to trunked ports of the two redundant switch independent 10 Gbps switches in the rack. So we also have protection against port or cable failures.


A tip: with DELL switches, use TRUNK for the port mode, not GENERAL.

This design gives us a lot of capabilities. We have redundant networking for all networks. We have active-passive LoadMasters, which means:

  • Failover when the active one fails
  • Non service interrupting firmware upgrades
  • The rack is the failure domain. As each rack is in a different row we also mitigate “localized” issues (power, maintenance affecting the rack, …)

Combine this with the fact that these are bare metal LoadMasters (DELL R320 with iDRAC, see Remote Access to the KEMP R320 LoadMaster (DELL) via DRAC Adds Value) and we have out of band management even when we have network issues. The racks are provisioned with PDUs that are managed over Ethernet, so we can even cut the power remotely if needed to resolve issues.
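The placement rule used here (the two nodes of an HA pair must not share a row or a rack) can be checked against an inventory with a few lines of code. This is a minimal sketch: the Placement type, its fields and the node labels are hypothetical, and it assumes rack labels are only unique within a row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    node: str
    row: str
    rack: str  # rack label within its row


def shared_failure_units(a: Placement, b: Placement) -> list[str]:
    """Return the failure units ('row', 'rack') that two nodes have in common.

    An empty list means the pair is spread across both rows and racks.
    """
    shared = []
    if a.row == b.row:
        shared.append("row")
        # Racks are nested in rows, so a shared rack implies a shared row.
        if a.rack == b.rack:
            shared.append("rack")
    return shared
```

An empty result for the two LoadMasters confirms that a rack or row level power or maintenance event can take out at most one of them.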

Conclusion

The results are very good and we get “zero ping loss” failover between the LoadMaster Nodes during testing.

We have a solid, redundant Application Delivery Controller deployment that does not break the switch independent TOR setup that exists in all racks/rows. It’s active-passive at the controller level and active-passive at the network (bonding) level. If that is an issue, the TOR switches should be configured as MLAGs. That would enable LACP for the bonded interfaces. At the LoadMaster level these could be configured as a cluster to get an active-active setup, if some of the restrictions this imposes are not a concern in your environment.

Important Note:

Some high end switches, such as the Force10 Series with VLT, support attaching single homed devices (devices not attached to both members of a VLT). While VLT and MLAG are very similar, MLAGs come with their own needs & restrictions. Not all switches that support MLAG can handle single homed devices. The obvious solution is not to attach single homed devices, but that is not always a possibility with certain devices. That means other solutions are needed. One is adding switches, which could lead to a significant rise in the number of switches needed, defeating the economics of affordable redundant TOR networking (cost of switches, power, rack space, operations, …). Another is leveraging MSTP and configuring a dedicated MSTP network for a VLAN, which also might not always be possible or feasible to solve the issue: those single homed devices might very well need to be in the same VLANs as the dual homed ones. Stacking would also solve the above issue, as the MLAG restrictions do not apply. I do not like stacking, however, as it breaks the switch independent redundant network design, even during planned maintenance, as a firmware upgrade brings down the entire stack.

One thing that is missing is the ability to fail over when the network fails. There is no concept of a “protected” network. This could help mitigate issues: when a virtual service is down due to network issues, the LoadMaster could try to fail over to see if we have more success on the other node. For certain scenarios this could prevent long periods of down time.

Accelerated Checkpoint merging with ReFS v2 in Windows Server 2016

Introduction

This blog post is a teaser where we show you some of the results we have seen with ReFS v2 in Windows Server 2016 (TPv4). In a previous blog post (Lightning Fast Fixed VHDX File Creation Speed With ReFS on Windows Server 2016) we demonstrated the very fast VHDX file creation capabilities we got with ReFS v2. Now we look at another benefit of ReFS v2 in a Hyper-V environment, thanks to a feature of ReFS v2 called block cloning: we get accelerated checkpoint merging with ReFS v2 in Windows Server 2016.

The Demo

For this short demo we have a virtual machine running Windows Server 2016. It resides on a CSV formatted with ReFS (64K unit allocation size). Inside the virtual machine there is a second data disk. Our VM, called CheckPointReFS, has this data volume formatted with ReFS (64K unit allocation size) and it runs on the ReFS formatted CSV. The disks in this test are fixed size VHDX files.

On the data volumes we have about 30GB worth of ISO files. We checkpoint the VMs and then create a copy of those files on the data volume.


We then delete this checkpoint.


Via the events 19070 (start of a background disk merge) and 19080 (completion of a background disk merge) in the Microsoft-Windows-Hyper-V-VMMS/Admin logs we calculate the time this took: 5 seconds.
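The calculation itself is just a matter of pairing each 19070 event with the 19080 event that follows it. The sketch below shows the idea in Python, assuming you have exported the events as (timestamp, event ID) pairs in time order. The function name is hypothetical, and the pairing assumes only one merge runs at a time; with several disks merging in parallel you would also need to match on the disk path in the event message.

```python
from datetime import datetime, timedelta

MERGE_STARTED = 19070    # start of a background disk merge
MERGE_COMPLETED = 19080  # completion of a background disk merge


def merge_durations(events: list[tuple[datetime, int]]) -> list[timedelta]:
    """Pair each merge start with the next completion and return the elapsed times.

    `events` is a time-ordered list of (timestamp, event_id) pairs, for example
    exported from the Microsoft-Windows-Hyper-V-VMMS/Admin log. Other event IDs
    are ignored.
    """
    durations = []
    started_at = None
    for stamp, event_id in events:
        if event_id == MERGE_STARTED:
            started_at = stamp
        elif event_id == MERGE_COMPLETED and started_at is not None:
            durations.append(stamp - started_at)
            started_at = None
    return durations
```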


There are moments you just have to say “WAUW”. Really, this rocks and it’s amazing. So amazing I figured I had made a mistake, so I ran it again … 4 seconds. WOEHOE! What were the times you saw when you last deleted a large checkpoint?

I am really looking forward to doing more testing with the ReFS v2 capabilities with Hyper-V on Windows Server 2016.