I’m Attending The E2E Virtualization Conference

Well I’ve just finished doing the paperwork for attending the Experts 2 Experts conference in London http://www.pubforum.info/pubforum/E2E2011London.aspx. It runs from 18th to 20th November 2011. I’m looking forward to this one as I’m going to meet up with a lot of people from my online network and have a chance to discuss our virtualization experiences and share information in real life, face to face.

It’s good to attend vendor-independent events to exchange information and to enrich and extend our networks. I already know that several people from my twitter/blogging network will be attending and I’m happy to meet up with you if you’re there. Just let me know via e-mail, the feedback option on this blog or via twitter (@workinghardinit). Well, I’ll see you there!

Consider CPU Power Optimization Versus Performance When Virtualizing

Over the past couple of years I’ve read, heard, seen and responded to reports of users dealing with performance issues when trying to save the planet with the power saving options on CPUs. As this is often enabled by default they often don’t even realize it is in play. Now for most laptop users this is fine, and even for a lot of desktop users it delivers on the promise of less energy consumption. Sure, there are always some power users and techies who need every last drop of pure power, but on the whole life is good this way. So you reduce your power needs, help save the planet and hopefully save some money along the way as well. Now, even when you’re filthy rich and money is no object to you whatsoever, you could still be in a place where there are no extra watts available because capacity is maxed out or because it has been reserved for special events like the London Olympics, so keeping power consumption in check becomes a concern for you as well.

Now this might make good economic sense for a lot of environments (mobile computing) but in other places it might not work out that well. So when you have all this cool & advanced power management running, in some environments you need to take care not to turn your virtualization hosts into underachievers. Perhaps that’s putting it too strongly, but hey, I need to wake you up to get your attention. The more realistic issue is that people are running more and more heavy workloads in virtual machines and that the hosts used for that contain more and more cores per socket, use very advanced CPU functionality and carry huge amounts of RAM. Look at these KB articles: KB2532917: Hyper-V Virtual Machines Exhibit Slow Startup and Shutdown and KB 2000977: Hyper-V: Performance decrease in VMs on Intel Xeon 5500 (Nehalem) systems. All this doesn’t always compute (pun intended) very well.

Most Hyper-V consultants will also be familiar with the blue screen bugs related to C-states, like You receive a “Stop 0x0000007E” error on the first restart after you enable Hyper-V on a Windows Server 2008 R2-based computer and Stop error message on a Windows Server 2008 R2-based computer that has the Hyper-V role installed and that uses one or more Intel CPUs that are code-named Nehalem: “0x00000101 – CLOCK_WATCHDOG_TIMEOUT”, on top of the KB articles mentioned above. I got bitten by the latter one a few times (yes, I was a very early adopter of Hyper-V). Don’t start bashing Microsoft too hard on this; VMware and other vendors are dealing with their own C-state (core parking) devils (just Google for it), and read the articles to realize this is sometimes a hardware/firmware issue. A colleague of mine told me that some experts are advising to just turn C-states off in a virtualization environment. I’ll leave that to the situation at hand, but it is an area that you need to be aware of and watch out for. As always, and especially if you’re reading this in 2014, realize that all information has a time-limited shelf life based on the technology at the time of writing. Technology evolves and who knows what CPUs & hypervisors will be capable of in the future? Also, these bugs have been listed on most Hyper-V blogs as they emerged, so I hope you’re not totally surprised.

It’s not just the C-states we need to watch out for; the P-states have given us some performance issues as well. I’ve come across some “strange” results in virtualized environments, ranging from “merely confused” system administrators to customers suffering from underperforming servers, both physical and virtual actually. All those fancy settings like SpeedStep (Intel) or Cool’n’Quiet (AMD) might cause some issues, perhaps not in your environment, but it pays to check it out and be aware of them, as servers arrive with those settings enabled in the BIOS and Windows Server 2008 R2 uses them by default. Oh, if you need some reading on what C-states and P-states are, take a look at “C-states and P-states are very different”.
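
If you want to see where a host stands before touching anything, the stock powercfg tool will tell you which plan is active and what the minimum/maximum processor states are set to. Below is a minimal sketch of how I’d wrap that, assuming a Windows Server 2008 R2 box with Python on it; powercfg does all the real work here.

```python
import subprocess

def run(args):
    """Run a command and return its text output (Windows assumed)."""
    return subprocess.check_output(args, universal_newlines=True)

# Which power plan is active? A fresh Windows Server 2008 R2 install
# typically runs "Balanced", which allows P-state throttling.
print(run(["powercfg", "/getactivescheme"]))

# Dump the processor power management settings of the current plan,
# including the minimum/maximum processor state.
print(run(["powercfg", "/query", "SCHEME_CURRENT", "SUB_PROCESSOR"]))
```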

Some confusion can happen when virtual machines report less speed than the physical CPUs can deliver, worsened by the fact that sometimes it varies between VMs on the same host. As long as this doesn’t cause performance issues it can be lived with by most people but the inquisitive minds. When performance takes a dive, servers start to respond slower and apps wind down to a glacial pace, you see productivity suffer, which causes people to get upset. To add to the confusion, SCVMM allows you to assign a CPU type to your VMs as a hint to SCVMM to help out with intelligent placement of the virtual machines (see What is CPU Type in SCVMM 2008 R2 VM Processor Hardware Profile?), which confuses some people even more. And guess on whose desk that all ends up?
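
A quick way to see that gap for yourself is to compare the current clock speed against the rated maximum via WMI. A rough sketch, again just wrapping the built-in wmic tool from Python; note that CurrentClockSpeed isn’t refreshed equally well on every platform, so treat it as an indication rather than gospel.

```python
import subprocess

# Ask WMI, via the stock wmic tool, what the CPUs are clocked at right now
# versus their rated maximum. A big gap on an idle host is usually just
# SpeedStep/Cool'n'Quiet doing its job, and it also explains why a VM can
# "see" less speed than the spec sheet promises.
output = subprocess.check_output(
    ["wmic", "cpu", "get", "Name,CurrentClockSpeed,MaxClockSpeed", "/format:list"],
    universal_newlines=True)

for line in output.splitlines():
    if line.strip():          # wmic pads its output with blank lines
        print(line.strip())
```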

When talking performance on servers we see issues that pitch power (and money, and penguin) savings against raw performance. We’ve seen some SQL Servers and other CPU-hungry GIS application servers underperform big time (15% to 20%) under certain conditions. How is this possible? Well, the CPUs are trimmed down in voltage and frequency to reduce power consumption when the performance is not needed. The principle is that they will spring back into action when it is needed. In reality, this “springing” back into action isn’t that responsive. It seems that the gradual trimming down or beefing up of the CPU’s voltage and frequency isn’t that transparent to the processes needing it, probably because constant, real-time, atomic adjustments aren’t worth the effort or are technically challenging. For high-performance demands this is not good enough and could lead to more money spent on extra servers and time spent on different approaches (code, design, and architecture) to deal with a somewhat artificial performance issue. The only time you’re not going to have these issues is when your servers are either running apps with mediocre to low performance needs or when they are so hungry for performance that those CPUs never get trimmed down, they just don’t get the opportunity to do so. There is a lot to think about here, and now add server virtualization into the mix. No, my dear application owner, Task Manager’s CPU information is not the real raw info you can depend on for the complete truth and nothing but the truth. Many years ago CPU-Z was my favorite tool to help tweak my home PC. Back then I never thought it would become part of my virtualization toolkit, but it’s easy and faster than figuring it out with all the various performance counters.
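
The performance counter that actually tells this story is “% of Maximum Frequency” in the Processor Information set (available on Windows 7 / Windows Server 2008 R2 and later). A small sketch that samples it with typeperf; the sample count and interval are just arbitrary values I picked for the example.

```python
import subprocess

# "% of Maximum Frequency" shows how far the CPUs have been throttled down.
# Ten samples, one second apart; on a supposedly busy server anything well
# below 100 means the P-states are trimming your frequency.
counter = r"\Processor Information(_Total)\% of Maximum Frequency"
subprocess.call(["typeperf", counter, "-sc", "10", "-si", "1"])
```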

Now don’t think this is an “RDBMS only” problem and that, since you’re a VDI guy or a GIS or data crunching guy, you’re out of the woods. VDI and other resource-hungry applications (like GIS and data crunching) that show heterogeneous patterns in CPU needs can suffer as well, and you’d do well to check on your vCPUs and pCPUs and how they are running under different loads. I actually started looking at SQL Server after first seeing the issue with a freaked-out GIS application running its vCPUs at 100% while the pCPUs were all relaxed about it. It made me go “hang on, I need to check something”, and that’s when I ran into a TechNet forum post on Hyper-V Core Parking performance issues leading to some interesting articles by Glenn Berry and Brent Ozar, who are dealing with this on physical servers as well. The latter article even mentions an HP iLO card bug that prevents the CPU from throttling back up completely. Ouch!
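
Something along these lines, run on the Hyper-V parent partition, lets you put the host’s logical processors and the guests’ virtual processors side by side. It’s only a sketch using the standard Hyper-V performance counters via typeperf; the sampling values are again arbitrary.

```python
import subprocess

# Put the guests' view next to the host's view. If the virtual processors
# are pegged at 100% while the logical processors on the host stay relaxed,
# power management (or core parking) is a prime suspect.
counters = [
    r"\Hyper-V Hypervisor Virtual Processor(*)\% Guest Run Time",
    r"\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time",
]
subprocess.call(["typeperf"] + counters + ["-sc", "10", "-si", "2"])
```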

Depending on your findings and needs you might just want to turn SpeedStep or Cool’n’Quiet off, either in the BIOS or in Windows. Food for thought: what if one day some vendors decide you don’t need to be able to turn that off, and it disappears from your view and ultimately from your control … The “good enough is good enough” world can lead to a very mediocre world. Am I being paranoid? Nope, not according to Ron Oglesby (you want VDI reality checks? Check him out) in his blog post SpeedStep and VDI? Is it a good thing? Not for me. where Cisco UCS 230 blades are causing him problems.
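
For the Windows side of that equation, the two usual knobs are switching to the built-in High performance plan or pinning the minimum processor state at 100%. A hedged sketch of both, again just driving powercfg; the BIOS side you’ll still have to handle in the BIOS itself.

```python
import subprocess

def run(args):
    subprocess.check_call(args)

# Option 1: activate the built-in "High performance" plan
# (SCHEME_MIN stands for minimum power savings).
run(["powercfg", "/setactive", "SCHEME_MIN"])

# Option 2: stay on the current plan but pin the minimum processor state
# at 100% so the OS stops requesting lower P-states.
run(["powercfg", "/setacvalueindex", "SCHEME_CURRENT",
     "SUB_PROCESSOR", "PROCTHROTTLEMIN", "100"])
run(["powercfg", "/setactive", "SCHEME_CURRENT"])  # re-apply so it takes effect
```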

So what do I do? Well, to be honest, when the need for stellar and pure raw performance is there, the power savings go out the window whenever I see that they’re causing issues. If they don’t, fine, then they can stay. So yes, this means no money saved, no reduction of cooling costs and penguins (not Linux, but those fluffy birds at the South Pole that can’t fly) losing square footage of ice surface. Why? Because the business wants and needs the performance and they are nagging me to deliver it. When you have a need for that performance you’ll make that trade-off and it will be the correct decision. Their fancy new servers performing worse than, or no better than, what they replaced and that virtualization project getting bashed for failing to deliver? Ouch! That is unacceptable. But, to tell you the truth, I kind of like penguins. They are cute. So I’m going to try and help them with Dynamic Optimization and Power Optimization in System Center Virtual Machine Manager 2012. Perhaps this has a better chance of providing power savings for performance-critical setups than the advanced CPU capabilities. With this approach, you have nodes running on full power, while distributing the load and shutting down entire nodes when there is overcapacity. I’ll be happy to report how this works out in real life. But do mind that this is very environment-dependent and you might not have any issues whatsoever, so don’t try to fix what is not broken.
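
To make the idea concrete: this is a toy model only, absolutely not SCVMM’s actual placement algorithm, but it shows the kind of arithmetic behind parking nodes when there is overcapacity. The 60% consolidation ceiling is an assumption I made up for the example; real placement also weighs memory, networking and reserves.

```python
import math

def nodes_that_can_be_parked(node_loads_percent, load_ceiling_percent=60):
    """Toy estimate: how many nodes could be powered down after consolidation?"""
    total_load = sum(node_loads_percent)  # total work, expressed in % of a single node
    nodes_needed = max(1, int(math.ceil(total_load / float(load_ceiling_percent))))
    return max(len(node_loads_percent) - nodes_needed, 0)

# Four nodes ticking over at low load: two of them could be parked.
print(nodes_that_can_be_parked([25, 30, 20, 15]))
```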

The thing is, in most places you can’t hang around for many weeks fine-tuning every little configuration option in the CPUs in collaboration with developers & operations. The production needs, costs and time constraints (by the time they notice any issues, “playtime” has come and gone) just won’t allow for it. I’m happy to have those options where I have the opportunity to use them, but in most environments I’ll stick with easier and faster fixes due to those constraints. Microsoft also tells us to keep an eye on power-saving settings in the KB article Degraded overall performance on Windows Server 2008 R2 and offers some links to more guidance on this subject. There is no “one size fits all” solution. By the way, some people claim that the best performance results come from leaving SpeedStep on in the BIOS and disabling it in Windows. Others swear by disabling it in the BIOS. I just tend to use what I can where I can and go by the results. It’s all a bit empirical and this is a cool topic to explore, but as always time is limited and you’re not always in a position where you can try it all out at will.

In the end, it comes down to making choices. This is not as hard as you think as long as you make the right choices for the right reasons. Even with the physical desktops that are Wake-on-LAN (WOL) enabled to allow users to remotely boot them when they want to work from home or while traveling, I’ve been known to tell the bean counters that they had to pick one of two: have all options available to their users or save the penguins. You see, WOL with a machine that has been shut down works just fine. But when machines go into hibernation or standby you have to allow the NICs to wake the computer from hibernation or standby for WOL to work, or the users won’t be able to remotely connect to them. See more on this at http://technet.microsoft.com/en-us/library/ee617165(WS.10).aspx. But this means they’ll wake up a lot more than necessary due to non-targeted network traffic. So what? Think of the benefits! An employee wanting to work a bit at 20:00 on her hibernating PC at work so she can take a couple of hours to take her kid to the doctor the next morning can do so = priceless, as that mother knows what a great boss and company she works for.
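
Checking which devices are actually armed to wake a machine is easy enough with powercfg. A small sketch; the NIC name in the commented-out line is a placeholder, use whatever name the query returns on your desktops.

```python
import subprocess

def run(args):
    return subprocess.check_output(args, universal_newlines=True)

# Which devices are currently allowed to wake this machine from
# standby/hibernation? If the NIC isn't listed, WOL into a sleeping
# desktop won't work.
print(run(["powercfg", "/devicequery", "wake_armed"]))

# And which devices could be armed if we wanted to?
print(run(["powercfg", "/devicequery", "wake_programmable"]))

# Arming a specific NIC (placeholder name, use the exact name returned above):
# run(["powercfg", "/deviceenablewake", "Intel(R) 82579LM Gigabit Network Connection"])
```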

A VDI Reality Check @ BriForum 2011 For Resource Hungry Desktops In A Demanding Environment

So what did we notice? VDI generates enough interest from various angles, that is for sure, both on the demand side and on the (re)seller & integrator side. Most storage vendors are bullish enough to claim that they can handle whatever IOPS are required to get the most bang for the buck, but only the smaller or newest players were present and engaged in interaction with the attendees. One thing is for sure: VDI has some serious potential, but it has to be prepared well and implemented thoroughly. Don’t do it over the weekend and see if it works out for all your users.

The number of tools & tactics for VDI on both the storage side and the configuration/management side is both larger and more diverse than with server virtualization. The possible variations on how to tackle a VDI project are almost automatically more numerous as well. This is due to the fact that desktops are often a lot more complex and heterogeneous in nature than server-side apps. On top of that, the IO on a desktop can be quite high. Some of it can be blamed on the client OS, but a lot of it has to do with the applications and utilities used on desktops. I think that developers had so many resources at their disposal that there wasn’t too much pressure on optimization there. The age of multi-core and x64 will help in thinking more about how an application uses CPU cycles, but virtualization might very well help in abstracting that away. When a PC has one vCPU and the host has 4*8 cores, how good is that hypervisor at using all that pCPU power to address the needs of that one vCPU? But I digress. All in all, it takes more effort and complexity to do VDI than server virtualization. So there is a higher cost, or at least the CAPEX isn’t such a convincing, clear-cut story as it is with server virtualization. If you’re not doing the latter today when and where you can, you are missing out on a major number of benefits that are just too good to ignore. I wouldn’t dare say that for VDI. Treating VDI just like server virtualization is said to be one of the main reasons for VDI failing, being put on hold or being limited to a smaller segment of the desktop population.

My experience with server virtualization is also with rather heterogeneous environments, where we have VMs with anything between 1 and 4 virtual CPUs and 2 to 12 GB of RAM. And yet I have to admit it has been a great success. Nevertheless, I can’t say that helped me much in my confidence that a large part of our desktop environment can be virtualized successfully and cost-effectively, as I think that our desktops are such vicious resource hogs they need another step forward in raw power and functionality versus cost. Let me briefly describe the environment. 85% of the workforce at my current gig has dual 24” wide screens, with anything between 4 GB and 8 GB of RAM, quad-core CPUs and SCSI/SATA 10,000 RPM disks with anything between 250 GB and 1 TB of local storage, in combination with very decent GPUs. The employees run Visual Studio, SQL Server, multiple CAD & GIS packages, and various specialized image processing software that works on image and other files that can be 2 GB or even larger. If they aren’t that large, then they are still very numerous. On top of that, 1 Gbps network to the desktop is the only thing we offer anymore. So this is not a common office suite plus a couple of LOB applications order; this is a large and rich menu for a very hard-to-please audience. That means that if you ask them what they want, they only answer more, more, more … And I won’t even mention 3D screens & goggles.

Now I know that for X amount of time the machines are idle or doing a lot less, but in the end that’s just a very nice statistic. When a couple of dozen users start playing around with those tools and throw that data around, you still need them and their colleagues to be happy customers. Frankly, even with the physical hardware that they have now that can be a challenge. And please don’t start about better, less resource-wasting applications and such. You can’t just f* the business and tell them to get or wait for better apps. That flies in the face of reality. You have to be able to deliver the power where and when needed with the software they use. You just can’t control the entire universe.

I heard about integrators achieving 40-60 VMs per host in a VDI project. Some customers can make do with Windows 7 and 1 GB of RAM. I’m not one of those. I think the guys & gals of the service desk would need armed escorts if we rolled that out to the employees they care for. One of the things I noticed is that a lot of people choose to implement storage just for VDI. I’m not surprised. But until now I’ve not needed to do that, not even for databases and other resource hogs. Separate clusters, yes, as the pCPU/vCPU ratio and memory requirements differ a lot from the other servers. The fact that the separate cluster uses other HBAs and LUNs also helps.
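
Just to show why those density numbers don’t translate to my environment, here’s some back-of-the-envelope arithmetic. All the numbers (parent partition reserve, per-VM overhead, the 6 GB desktop) are illustrative assumptions on my part, not measurements.

```python
def host_ram_needed_gb(vms_per_host, ram_per_vm_gb,
                       parent_reserve_gb=4, overhead_per_vm_gb=0.1):
    """Rough host RAM estimate: guest RAM + a little per-VM overhead + parent OS."""
    return vms_per_host * (ram_per_vm_gb + overhead_per_vm_gb) + parent_reserve_gb

# The "60 VMs at 1 GB each" story from the integrators: roughly 70 GB per host.
print(host_ram_needed_gb(60, 1))
# The desktops described above, virtualized as-is at, say, 6 GB each:
# roughly 370 GB per host, a very different class of machine.
print(host_ram_needed_gb(60, 6))
```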

Next to SANs, local storage for VDI is another option, for both performance and cost. But for recovery, this isn’t quite that good a solution. The idea of having non-persistent disks (in a pool), or a combination of those with persistent disks, is not something I can see fly with our users. And frankly, a show of hands at BriForum seems to indicate that this isn’t very widespread. VDI takes really high-performance storage, isolated from your server virtualization, to make it a success. On top of that you need control, rapid provisioning, and user virtualization & workspace management in a layered/abstracted way. Lots of interest there, but again, yet more tools to get it done. Then there is also application virtualization, terminal service-based solutions, etc. So we get a more involved, diverse, and expensive solution compared to server virtualization. Now, to offset these costs we need to look at what we can gain. So where are the benefits to be found?

With non-persistent disks you have rapid provisioning of known good machines in a pool, but your environment must accept this, and I don’t see this flying well in the face of the reality of the consumerization of ICT. De-duplication and thin provisioning help to get the storage needs under control, but the bigger the client-side storage needs and the more diverse they are, the fewer gains can be found there. Better control, provisioning, resource sharing, manageability, disaster recovery, it is all possible, but it is all so very specific to the environment compared to server virtualization, and some solutions contradict gains that might have been secured with other approaches (disaster recovery, business continuity with SAN versus local storage). One of the most interesting possibilities for the environment I described was perhaps doing virtualization on the client. I look at it as booting from VHD in the Windows 7 era, but on steroids. If you can safeguard the images/disks on a SAN with de-duplication & thin provisioning you can have high availability & business continuity, as losing the desktops is a matter of pushing the VM to other hardware, which, due to the abstraction by virtualization, shouldn’t be a problem. It also deals with the network issues of VDI, a hidden bottleneck as most people focus on the storage. Truth be told, the bandwidth we consume is so big that VDI might deliver its best improvements for us on that front.

Somewhat surprising was that Microsoft, whilst being really present at PubForum in Dublin, was nowhere to be seen at BriForum. Citrix was saving its best for its own conference (Synergy) I think. Too bad, I mean when talking about VDI in 2011 we’re talking about Windows 7 for the absolute majority of implementations, and Citrix has a strong position in VDI, really giving VMware a run for their money. Why miss the opportunity? And yesterday at TechEd USA we heard about the HSBC story of a 100,000-seat VDI solution on Hyper-V http://www.microsoft.com/Presspass/press/2011/may11/05-16TechEd11PR.mspx.

On a side note, I wish I would/could have gone to PubForum as well. Should have done that. Now, these musings are based upon what I see at my current place of endeavor. VDI has a time and place where it can provide significant operational and usage advantages that make the business case for VDI. Today, I’m not convinced this is the case for our needs at this moment in time. Looking at our refresh schedule, we’ll probably pass on a VDI solution for the coming one. But booting from VHD as a standard in the future … I’m going to look into that; it will be a step towards the future I think.
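
For reference, native boot from VHD on Windows 7 / Windows Server 2008 R2 boils down to copying the current boot entry and pointing it at a VHD with bcdedit. A rough sketch of those steps scripted from Python; the VHD path is a made-up example and you’d run this from an elevated prompt, at your own risk.

```python
import re
import subprocess

# Hypothetical image location; adjust to your own layout.
vhd = r"[C:]\Images\Win7Desktop.vhd"

# Copy the current boot entry and fish the new entry's GUID out of the output.
out = subprocess.check_output(
    ["bcdedit", "/copy", "{current}", "/d", "Boot from VHD"],
    universal_newlines=True)
guid = re.search(r"\{[0-9a-fA-F-]+\}", out).group(0)

# Point the new entry at the VHD for both the boot device and the OS device.
subprocess.check_call(["bcdedit", "/set", guid, "device", "vhd=" + vhd])
subprocess.check_call(["bcdedit", "/set", guid, "osdevice", "vhd=" + vhd])
```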

To conclude, BriForum 2011 was a good experience and its smaller scale makes for plenty of good opportunities for interaction and discussion. A very positive note is that most vendors & companies present were discussing real issues we all face. So it was more than just sales demos. Brian, nice job.

BriForum 2011 Europe Here I Come

As you might have read in a previous blog post and noticed in the sidebar, I’m off to London (UK) to attend BriForum 2011 Europe. It’s time to get away from the wide screens overlooking my ICT empire toys and broaden my horizons. For those who think the cloud is going to take away your job … think again, I’m getting busier than ever. The reality is that we just can’t push a button and have everything up and running in the cloud. Greenfield projects and startups might beat existing infrastructure & application architecture over the head with cloud and make those businesses run harder for their money, but they will run and compete. That race will produce a huge workload.

So I’m off to dive into some sessions on Cloud, Server & Application Virtualization, VDI … should make for some interesting days. I hope to be able to talk to lots of people with a variety of experiences to help find new or alternate ways to address some issues (or challenges) we need to tackle in the years ahead. Subjects like Disaster Recovery, Business Continuity, application-aware storage in a virtualized environment, Geo Clustering, Site Recovery, … should give us ample to discuss. Give us a shout if you’re there. It’s also a nice opportunity to meet up with some fellow bloggers and twitter acquaintances.

A colleague of mine is heading to Atlanta in the USA to attend TechEd 2011 USA. So he’s crossing the big pond to get some brand new info on the latest and greatest in Microsoft technologies on the IT Pro side of the business.

So off to London I go, onwards & always going forward in IT, as there is no turning back. I’ll keep you posted when I find the time to do so.