Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 1

This is a multipart series based on some lab test & work I did.

  1. Part 1 Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 1
  2. Part 2 Upgrading Hyper-V Cluster Nodes to Windows Server 2012 (Beta) – Part 2
  3. Part 3 Upgrading Hyper-V Cluster Nodes to Windows 8 (Beta) – Part 3

After I got back from the MVP Summit 2012 in Bellevue/Redmond I could wait to start playing with a Windows 8 Hyper-V cluster so I decided to upgrade my Windows 2008 R2 cluster nodes to Windows 8. That means evicting them on by one, upgrading them and adding them to a new Windows 8 cluster. As we can build a one node cluster this can be done a node at the time. This isn’t a fail proof definite “How To”, I’m just sharing what I did.

Evicting a node

Before evicting a node make sure all virtual machines are running on the other node(s). As you can see the cluster warrior has 2 nodes, crusader & saracen (I was listening to some Saxon heavy metal at the time I built that lab setup). We evacuated node saracen prior to evicting it.

image

Evict the node & confirm when asked.

image

image

When this is done all storage is off line to the node evicted from the cluster. No need to worry about that.

Upgrade that node to Windows 8

To anyone having installed/upgraded to Windows 2008 R2 this should all be a very recognizable experience. Being lazy, I left the iSCSI initiator configuration in there with the Hyper-V & failover cluster roles installed during the upgrade. Now for production environments I like to build my nodes from scratch to have an exactly known, new and clean installation base. But for my test lab at home I wanted to get it done as fast as possible. If only the days had more hours …For extra safety you can pull the plug (or disable the switch ports) on your iSCSI or FC connections and make sure no storage is presented to the node during the upgrade process. Now please do mind is use Intel server grade NIC adaptors for which Windows 8 beta has drivers. Your situation may vary so I can’t guarantee the 7 year old FC HBA in your lab server will just work, OK!?

So run setup.exe from the Windows 8 (Beta) ISO you extracted to a folder on the server or  from the (bootable) USB you created with the downloaded ISO.

image

 

The Windows Setup installer will start.

04 run setup

 

Click on “Install now” to proceed and start the setup process.

image

 

Select to “Go online to get the latest updates for Setup (Recommended)”

image

 

So it looks for updates on line.

image

 

It didn’t find any but that’s OK.

image

 

Select the installation you want. I went with for Server with a GUI as I want screen shots. But as I wrote in the blog post Windows 8 Server With GUI, Minimal Server Interface & Server Core Lesson with the Desktop Experience Feature you can turn it into a Server Core Installation and back again now. So no regrets with any choice you make here, which is a nice improvement that can save us a lot of time.

image

Accept the EULA

image

 

We opt to upgrade (in production I go for a clean install)

image

 

I get notified that I have to remove PerfectDisk. I had an evaluation copy of Raxco PerfectDisk installed I used to do some testing with redirected CSV traffic and defragmentation (see Some Feedback On How to defrag a Hyper-V R2 Cluster Shared Volume).

image

 

So the upgrade was cancelled.

image

 

I uninstalled PerfectDisk but still it was a no go. I  had to remove all traces of it in the registry & files systems that the uninstall left or the upgrade just wouldn’t start. But after that it worked.

image

 

That means we can kick of the upgrade! It all looks very familiar Smile It takes a couple of reboots and some patience. But all in all it’s a fast process.

image

image

image

image

After this step it takes a couple of reboots and some patience. But all in all it’s a fast process. After some reboots and a screen that goes dark in between those …we get our restyled beta fish.

image

image

image

And voila we’re where we need to be … Smile

image

 

After the upgrade process I ran into one error. The GUI for Failover Clustering would not start. The solution if found for that was simply to remove that role and add it again. That did the trick.

ClusGUI

 

So this was a description of the first steps to transition a  Windows 2008 R2 SP1 cluster to a  Windows 8 (Beta) Cluster. As seen we evict the nodes one by one to upgrade them or do a clean install. In the latter case you’ll need to do the iSCSI initiator configuration again,  install the Failover Cluster role and in the case of a Hyper-V cluster the Hyper-V role. The nodes can than be added to a new Windows 8 cluster, starting out with a one node cluster. More on that in the second part of this blog post.

Hotfixes For Hyper-V & Failover Clustering Can Be Confusing KB2496089 & KB2521348

As I’m building or extending a number of Hyper-V Clusters in the next 4 months I’m gathering/updating my list with the Windows 2008 R2 SP1 hotfixes relating to Hyper-V and Failover Clustering. Microsoft has once published KB2545685: Recommended hotfixes and updates for Windows Server 2008 R2 SP1 Failover Clusters but that list is not kept up to date, the two hotfixes mentioned are in the list below. I also intend to update my list for Windows Server 2008 SP2 and Windows 2008 R2 RTM. As I will run into to these and it’s nice to have a quick reference list.

I’ll include my current list below. Some of these fixes are purely related to Hyper-V, some to a combination of hyper-V and clusters, some only to clustering and some to Windows in general. But they are all ones that will bite you when running Hyper-V (in a failover cluster or stand alone). Now for the fun part with some hotfixes I’ll address in this blog post. Confusion Smile Take a look at the purple text and the green text hotfixes and the discussion below. Are there any others like this I don’t know about?

* KB2496089 is included in SP1 according to “Updates in Win7 and WS08R2 SP1.xls” that can be downloaded here (http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=269) but the Dutch language KB article states it applies to W2K8R2SP1 http://support.microsoft.com/kb/2496089/nl

Artikel ID: 2498472 – Laatste beoordeling: dinsdag 10 februari 2011 – Wijziging: 1.0

Vereisten

Deze hotfix moet worden uitgevoerd een van de volgende besturings systemen:

  • Windows Server 2008 R2
  • Servicepack 1 (SP1) voor Windows Server 2008 R2
Voor alle ondersteunde x64 versies van Windows Server 2008 R2

6.1.7600.20881
4,507,648
15-Jan-2011
04: 10
x64

Vmms.exe
6.1.7601.21642
4,626,944
15-Jan-2011
04: 05
x64

When you try to install the hotfix it will. So is it really in there? Compare file versions! Well the version after installing the hotfix on a W2K8R2 SP1 Hyper-V server the version of vmms.exe was 6.1.7601.21642 and on a Hyper-V server with SP1 its was 6.1.7061.17514. Buy the way these are English versions of the OS, no language packs installed.

With hotfix installed on SP1

Withhotfix_thumb[1]

Without hotfix installed on SP1

Withoutpatch_thumb[1]

To make matters even more confusing while the Dutch KB article states it applies to both W2K8R2 RTM and W2K8R2SP1 but the English version of the article has been modified and only mentions W2K8R2 RTM anymore.

http://support.microsoft.com/kb/2496089/en-us

Article ID: 2496089 – Last Review: February 23, 2011 – Revision: 2.0

For all supported x64-based versions of Windows Server 2008 R2

Vmms.exe
6.1.7600.20881
4,507,648
15-Jan-2011
04:10
x64

So what gives? Has SP1 for W2K8R2 been updated with the fix included and did the SP1 version I installed (official one right after it went RTM) in the lab not yet include it? Do the service packs differ with language, i.e. only the English one got updated?. Sigh :-/ Now for the good news: ** It’s all very academic because of this KB 2521348 A virtual machine online backup fails in Windows Server 2008 R2 when the SAN policy is set to “Offline All” which brings the vmms.exe version to 6.1.7601.21686 and this hot fix supersedes KB2496089 Smile. See http://blogs.technet.com/b/yongrhee/archive/2011/05/22/list-of-hyper-v-windows-server-2008-r2-sp1-hotfixes.aspx where this is explicitly mentioned.

Ramazan Can mentions hotfix 2496089 and whether it is included in SP1 in the comments on his blog post http://ramazancan.wordpress.com/2011/06/14/post-sp1-hotfixes-for-windows-2008-r2-sp1-with-failover-clustering-and-hyper-v/ but I’m not very convinced it is indeed included. The machine I tested on are W2K8R2 English RTM updated to SP1, not installations for the media including SP1 so perhaps there could also be a difference. It also should not matter that if you install SP1 before adding the Hyper-V role, so that can’t be the cause.

Anyway, keep your systems up to date and running smoothly, but treat your Hyper-V clusters with all due care and attention.

  1. KB2277904: You cannot access an MPIO-controlled storage device in Windows Server 2008 R2 (SP1) after you send the “IOCTL_MPIO_PASS_THROUGH_PATH_DIRECT” control code that has an invalid MPIO path ID
  2. KB2519736: Stop error message in Windows Server 2008 R2 SP1 or in Windows 7 SP1: “STOP: 0x0000007F”
  3. KB2496089: The Hyper-V Virtual Machine Management service stops responding intermittently when the service is stopped in Windows Server 2008 R2
  4. KB2485986: An update is available for Hyper-V Best Practices Analyzer for Windows Server 2008 R2 (SP1)
  5. KB2494162: The Cluster service stops unexpectedly on a Windows Server 2008 R2 (SP1) failover cluster node when you perform multiple backup operations in parallel on a cluster shared volume
  6. KB2496089: The Hyper-V Virtual Machine Management service stops responding intermittently when the service is stopped in Windows Server 2008 R2 (SP1)*
  7. KB2521348: A virtual machine online backup fails in Windows Server 2008 R2 (SP1) when the SAN policy is set to “Offline All”**
  8. KB2531907: Validate SCSI Device Vital Product Data (VPD) test fails after you install Windows Server 2008 R2 SP1
  9. KB2462576: The NFS share cannot be brought online in Windows Server 2008 R2 when you try to create the NFS share as a cluster resource on a third-party storage disk
  10. KB2501763: Read-only pass-through disk after you add the disk to a highly available VM in a Windows Server 2008 R2 SP1 failover cluster
  11. KB2520235: “0x0000009E” Stop error when you add an extra storage disk to a failover cluster in Windows Server 2008 R2 (SP1)
  12. KB2460971: MPIO failover fails on a computer that is running Windows Server 2008 R2 (SP1)
  13. KB2511962: “0x000000D1” Stop error occurs in the Mpio.sys driver in Windows Server 2008 R2 (SP1)
  14. KB2494036: A hotfix is available to let you configure a cluster node that does not have quorum votes in Windows Server 2008 and in Windows Server 2008 R2 (SP1)
  15. KB2519946: Timeout Detection and Recovery (TDR) randomly occurs in a virtual machine that uses the RemoteFX feature in Windows Server 2008 R2 (SP1)
  16. KB2512715: Validate Operating System Installation Option test may identify Windows Server 2008 R2 Server Core installation type incorrectly in Windows Server 2008 R2 (SP1)
  17. KB2523676: GPU is not accessed leads to some VMs that use the RemoteFX feature to not start in Windows Server 2008 R2 SP1
  18. KB2533362: Hyper-V settings hang after installing RemoteFX on Windows 2008 R2 SP1
  19. KB2529956: Windows Server 2008 R2 (SP1) installation may hang if more than 64 logical processors are active
  20. KB2545227: Event ID 10 is logged in the Application log after you install Service Pack 1 for Windows 7 or Windows Server 2008 R2
  21. KB2517329: Performance decreases in Windows Server 2008 R2 (SP1) when the Hyper-V role is installed on a computer that uses Intel Westmere or Sandy Bridge processors
  22. KB2532917: Hyper-V Virtual Machines Exhibit Slow Startup and Shutdown
  23. KB2494016: Stop error 0x0000007a occurs on a virtual machine that is running on a Windows Server 2008 R2-based failover cluster with a cluster shared volume, and the state of the CSV is switched to redirected access
  24. KB2263829: The network connection of a running Hyper-V virtual machine may be lost under heavy outgoing network traffic on a computer that is running Windows Server 2008 R2 SP1
  25. KB2406705: Some I/O requests to a storage device fail on a fault-tolerant system that is running Windows Server 2008 or Windows Server 2008 R2 (SP1) when you perform a surprise removal of one path to the storage device
  26. KB2522766: The MPIO driver fails over all paths incorrectly when a transient single failure occurs in Windows Server 2008 or in Windows Server 2008 R2

Exchange 2007 & 2010 Event ID’s: 2601, 2604, 2501 & Users Can’t Access Mailboxes / Public Folders On My Day Off

I took the day off as I needed some time to deal with government administration. Good thing this is a blog about IT issues because holey crap what a time eating, confusing and rather pointless mess government administration can be. The process to get to the desired outcome is very tedious, prone to misunderstanding & pretty inefficient . What the entire duration of the process and the number of administrative entities involved contribute to the desired result is a mystery. It’s pure show and window dressing. But OK, we took the day of to finally get it all sorted after 5 months of patiently waiting for this day.

So I sleep until 08:00, get up and head for the kitchen for a jar of coffee. With the only Java I truly like in my hand I make my way to the home office. I check mails/alerts from System Center, Support Requests etc. I’m like a responsible guy dude, even when I need a day off. I do monitor the condition of my projects in production and I do step in when needed and document my findings. It keeps me honest when I design and sell my solutions. Beware of some architects that are not the ones having to deal with the crap architectures they design, they are often empty suits. Anyway, I see an issue that could be a warning of more to come. Someone has a problem with Outlook 2007 which reports the following error (translation from Dutch):

“Unable to expand the folder. The Microsoft Exchange Server computer is not available. Either there are network problems or the Microsoft Exchange Server computer is down for maintenance.(/o=<DOMAIN>/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=servers/cn=<dagmember1>)”

Now I know that user. Smart, diligent and reliable. That user even provides the relevant and necessary information in their support request. Yes they do exist and HRM should hire those exclusively. So in combination with that error we knew we did not have an PEBKAC or ID-10T on our hands but a real issue.

I quickly check that DAG member node Outlook of that user is trying to connect to but I know that due to maintenance their mailboxes currently reside on another member of the DAG. So i could very well be just the public folders. Bingo. A quick test reveals this to be the case. Also the Windows 2008 R2 server and Exchange 2010 itself are running perfectly fine, happy as can be, except on that one node we see the Application Event Log messages:

Log Name:      Application
Source:        MSExchange ADAccess
Date:          8/19/2010 7:12:43 AM
Event ID:      2601
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      dagmember1.company.blog
Description:
Process MSEXCHANGEADTOPOLOGY (PID=1620). When initializing a remote procedure call (RPC) to the Microsoft Exchange Active Directory Topology service, Exchange could not retrieve the SID for account <WKGUID=XXXXXXXXXXNOREALIDXXXXXXXXXXXXXX,CN=Microsoft Exchange,CN=Services,CN=Configuration,…> – Error code=8007077f. The Microsoft Exchange Active Directory Topology service will continue starting with limited permissions.

Log Name:      Application
Source:        MSExchange ADAccess
Date:          8/19/2010 7:12:43 AM
Event ID:      2604
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      dagmember1.company.demo
Description:
Process MSEXCHANGEADTOPOLOGY (PID=1620). When updating security for a remote procedure call (RPC) access for the Microsoft Exchange Active Directory Topology service, Exchange could not retrieve the security descriptor for Exchange server object DAGMEMBER1 – Error code=8007077f. The Microsoft Exchange Active Directory Topology service will continue starting with limited permissions.

Log Name:      Application
Source:        MSExchange ADAccess
Date:          8/19/2010 7:12:43 AM
Event ID:      2501
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      dagmember1.company.blog
Description:
Process MSEXCHANGEADTOPOLOGY (PID=1620). The site monitor API was unable to verify the site name for this Exchange computer – Call=DsctxGetContext Error code=8007077f. Make sure that Exchange server is correctly registered on the DNS server.

I think I’m OK when I see the possible cause. Why? Because I also know even if that probable cause isn’t the problem, it’s a hiccup I’ve seen before and I know how to fix its one. When you search those errors you can find a TechNet article describing a possible cause: “An inactive network connection is first on the binding list” http://technet.microsoft.com/en-us/library/dd789571(EXCHG.80).aspx. The fix is quite simple. Correct the NIC order and restart the MSExchange ADTopology Service. I had my scare about Active Directory and DNS horrors the first time I ever saw this one. So no gut wrenching panic here :-)

But why do servers ever get in to this state when the NIC ordering is just fine? We did some firmware and upgrade recently after hours but that didn’t affect the NIC binding order. Now I’m pretty weird at times but I still know what I’m doing. Those NIC where OK when I configured those servers. Checking that has become a second nature on multi homed and clustered servers. I also remember happening this to me once before somewhere in February 2010 with another setup of Exchange 2010 on Windows 2008 R2. And in that case the NIC order in the binding list was also OK. I checked back then as well just to make sure. But since I build those Exchange 2010 setups myself I just know they are close to godliness both in design and implementation :-). Back then the issue went away by restarting the server, restarting the MSExchange ADTopology Service will do however, and the problem never came back. For some reason the AD Site information query fails. Now Windows retries and is OK after a while. Exchange, tries to get the AD Site information once, fails and keeps thinking there is an issue. With as a result clients have no connectivity and those errors that initially make you think you could have DNS issues, AD problems etc. But fortunately it’s a lot less serious.

So when the NIC binding order is OK why does this happen? I can’t tell you for sure but I do know that I’m not the only one (not that weird after all) since Microsoft published KB Article “MSExchange ADAccess Event ID’s 2601, 2604, 2501” http://support.microsoft.com/kb/2025528 . This article is a so called FAST PUBLISH from Microsoft Support and states that the issue only occurs on Windows 2008 R2 and that it affect Exchange 2007 and Exchange 2010. The cause? Well this is where they provide only what I already knew:

“During a restart of the server, the operating system queries Active Directory to get its AD Site information.  On a Windows 2008 R2 server, this will sometimes fail.  As the Exchange services are starting, it also will do a query for its AD Site and that too will fail. Windows will continue to try and determine its AD Site name and will eventually succeed.  However, Exchange does not re-try the query and the above errors are logged in the application log every 15 minutes.”

And yes the workaround/fix is also nothing new:

“After the server has been up for a minute or two, run NLTest /DSGetSite to verify that that the proper Active Directory Site is being returned by Windows.  Once that has been verified, restart the MSExchange ADTopology Service.”

Do note that this will also restart a slew of dependant Exchange services so it takes a little while.

  • Microsoft Exchange Transport Log Search
  • Microsoft Exchange Transport Log
  • Microsoft Exchange Service Host
  • Microsoft Exchange Search Indexer
  • Microsoft Exchange Replication Service
  • Microsoft Exchange Mail Submission
  • Microsoft Exchange Mailbox Assistants
  • Microsoft Exchange File Distribution
  • Microsoft Exchange EdgeSync
  • Microsoft Exchange Anti-spam Update

So after some manual intervention we had the users back in business. And all is well for them, as they rise and sleep under the watchful eye of a bunch of good IT Pro’s who’ll protect them form further harm and problems 😉 Now I need to get an auto fix for this I think until Microsoft fixes this one for good. SCOM where are you? No, no, no … It’s my day off for getting that administration done!

Reflections on Getting Windows Network Load Balancing To Work (Part 2)

This is part 2 in series on Windows Network Load Balancing. Part 1 can be found here: https://blog.workinghardinit.work/2010/07/01/reflections-on-getting-windows-network-load-balancing-to-work-part-1/

On Default Gateways, Routing & Forwarding.

Here’s a bullet list of what people tend to trip over when configuring NLB network settings.

  • No support for multiple Default Gateways that are on multiple subnets
  • The default gateway does not have to be empty on the NLB NIC
  • The Private and the NLB NIC can be on separate or the same subnets
  • You can have multiple Default Gateways if they are on the same subnet
  • Don’t forget about static routes where and when needed.
  • Beware of the strong host model in Windows 2008 (R2) for both IPv4 & IPv6 (WK3 it was only for IPv6)
  • Mind the order of the connections in Adapters and Bindings.

Now let’s address the subjects in this list.

No support for multiple Default Gateways that are on multiple subnets

When using IP addresses from different subnets you cannot have a default gateway on every NIC because that will cause routing issues. This is not different for the NIC’s used in Windows NLB. So you can have only one NIC with a Default Gateway and if the other NICs need to route somewhere you need to add static persistent routes. Those routes must be persistent or they will not survive a reboot of the server. In the figure below you see a classic two NIC NLB cluster with the Default Gateway Empty on the NLB NIC. This could be a valid setup for an intranet. You can add routes for the subnet in the company that need to be able to talk to the NLB Cluster and you’re golden. The Private NIC gets a default gateway and acts like any other NIC in your network.

image

In this example we have the Default Gateway on the Private NICs they can route internally and to the internet. If you need traffic to & from the internet form the NLB NIC you could enable forwarding on the NLB NIC or enable weak host behavior which can be done more atomic than what you achieve by enabling forwarding. If you only need to route internally we could use the same approach of enabling forwarding instead of adding static persistent routes for the NLB NIC. But then you don’t isolate & protect traffic that neatly and it will route to everywhere the default gateway can get.

So we prefer to play with static persistent routes in this case. We’ll briefly look at some examples now. If you only need to route internally (i.e. to reach the database or a client PC) from the NLB NIC we add the needed static persistent routes on the NLB NICs using the route command.

In order for the NLB NICs to reach the database with strong host model and no forwarding enabled:

Route add -p 10.30.0.0 mask 255.255.0.0 10.10.0.1

To reach the client PC’s:

Route add -p 10.20.0.0 mask 255.255.0.0 10.10.0.1

(Using route print you can look at the routes and using route delete you can get rid of them.)

Or by using netsh, (it’s advised to use netsh from Windows 2008 on)

netsh interface ipv4 add route 10.30.0.0/16 "NLB NIC" 10.10.0.1

netsh interface ipv4 add route 10.20.0.0/16 "NLB NIC" 10.10.0.1

(you can look at the routing table by using netsh interface ipv4 show route, with netsh interface ipv4 delete route you get ridd of then, see http://technet.microsoft.com/en-us/library/cc731521(WS.10).aspx for more information.

You could also connect to the database over the PRIVATE NIC and then you don’t need that route. If you can configure it like that it’s a good solution. But all situations differ.

You can also play with the weakhost / stronghost model behaviour:

netsh interface ipv4 set interface Private NIC weakhostsend=enabled

netsh interface ipv4 set interface Private NIC weakhostreceive=enabled

netsh interface ipv4 set interface NLB NIC weakhostsend=enabled

netsh interface ipv4 set interface NLB NIC weakhostreceive=enabled

Now don’t just blindly enable on every NIC you can find on the server. Test what you really need and use only that. I leave that as an exercise to the readers. It really depends on the situation and needs for your particular situationJ. Keep in mind that when you enable weakhostsend and weakhostreceive on every NIC this reverts your Windows 2008 servers back to Windows 2003 behavior and this might not be needed or wanted. So just enable what you need for optimal security.

Naturally enabling forwarding will do the trick as well, as this creates a weak host model. Depending on how many NICs you use and how traffic must flow you might have to do it on more than one NIC, normally the one(s) without a default gateway.

netsh interface ipv4 set interface "NLB NIC" forwarding=enabled

 

If you want to see the configuration of the NIC you can run:

           netsh interface ipv4 show interface l=verbose

That will produce something like below:

Interface Local Area Connection Parameters
———————————————-
IfLuid                             : ethernet_5
IfIndex                            : 3
State                              : connected
Metric                             : 10
Link MTU                           : 1500 bytes
Reachable Time                     : 21500 ms
Base Reachable Time                : 30000 ms
Retransmission Interval            : 1000 ms
DAD Transmits                      : 3
Site Prefix Length                 : 64
Site Id                            : 1
Forwarding                         : disabled
Advertising                        : disabled
Neighbor Discovery                 : enabled
Neighbor Unreachability Detection  : enabled
Router Discovery                   : dhcp
Managed Address Configuration      : enabled
Other Stateful Configuration       : enabled
Weak Host Sends                    : disabled
Weak Host Receives                 : disabled

Use Automatic Metric               : enabled
Ignore Default Routes              : disabled
Advertised Router Lifetime         : 1800 seconds
Advertise Default Route            : disabled
Current Hop Limit                  : 0
Force ARPND Wake up patterns       : disabled
Directed MAC Wake up patterns      : disabled


The default gateway does not have to be empty on the NLB NIC

It is not a hard requirement to leave the Default Gateway on the NLB NIC empty and put it on the private NIC. You can set it on the NLB NIC and leave the private NIC’s gateway empty instead. An example of this you can see in the demo. This is the best choice in my opinion when you need the NLB NIC to route to destinations you don’t know how to reach, i.e. the internet, so for public websites. The prime function of the default gateway is exactly to help with that. When you don’t know where to send it, send it to the Default Gateway. If you need to reach other internal subnets from the Private NIC, just use static routes. Don’t use the NLB NIC as that is internet facing in this case. You can see an example of this in the figure below. Also in this case you’ll find that you do not have to enable forwarding on the NIC using netsh, as the NIC that has to answer to the unknown IP Address has the Default Gateway. This setup works great for example in a managed domain environment for internet access where the NLB NICs are internet facing and the private NIC is for management, Active Directory, Backups, etc.

image

In this example we have the Default Gateway on the NLB NICs so it can route internet traffic. Any routes needed in the Private NIC subnet are added as persistent static routes. An example of this is to reach the database server.

As traffic from the Private range is never supposed to go via the NLB Public range and vice versa we do not need to care about forwarding or strong host /weak host models. We can keep traffic nicely separated and that is a good thing. If you build this on Windows 2008(R2) just like you did on Windows 2003 it would work out of the box and you might not even know about a change in default behavior from weak host model to strong host model.

To get the PRIVATE NIC to reach the database server you’d add static routes and be done with it.

Add needed static persistent routes using the route command:

Route add -p 10.20.0.0 mask 255.255.0.0 172.16.2.1

Or by using netsh, (it’s advised to use netsh from Windows 2008 on)

netsh interface ipv4 add route 10.20.0.0/16 "PRIVATE NIC" 172.16.2.1

No requirement to have different subnets for Private and NLB NICs  / Multiple Gateways When the subnets are the same

There is no requirement to have different subnets for every NIC. Sometimes I read that this is a requirement on forums when someone is having issues but it’s not.You can also experiment with multiple Default Gateways if they are on the same subnet (WARNINGS APPPLY*)

image

So here you can play with giving every NIC a default gateway (same subnet, so no issues), with static persistent routes, with enabling forwarding and weak host / strong host configuration. I tend to use only one gateway and use static persistent routes. If I need to relay I’ll go for weak host minimal configuration or revert to forwarding.

WARNINGS APPLY*: When you start having multiple NIC’s for multiple NLB Clusters on the same NLB nodes, things can get a bit complicated and unpredictable. So I prefer only to use a default gateway on both NICs when you have two NIC , one for private (management) traffic and one for the NLB cluster traffic. Once you have multiple NIC’s for multiple NLB clusters (1 private NIC + 2 or more NLB cluster NICs) you can no longer play this game safely, even if they are all on the same subnet, without running into trouble I have experienced. You can get an event id 18 “NLB cluster [X.X.X.X]: NLB detected duplicate cluster subnets. This may be due to network partitioning, which prevents NLB heartbeats of one or more hosts from reaching the other cluster hosts. Although NLB operations have resumed properly, please investigate the cause of the network partitioning” . Also in this situation you can’t have a default gateway on the management NIC and one on one of the NLB NIC’s without a default gateway on the second NLB NIC. Forget that. You can get issues with a node remaining in “converging” forever and what’s worse the NLB cluster will send traffic to all nodes so 1/x connections will fail. Rebooting one node might help but once you reboot ‘m both you run the risk of this happening and you really don’t want that. Once you dealing with multiple cluster IP addresses on multiple separate NIC’s you’d better stick to one default gateway on one of the NIC’s and nowhere else.  This kind of makes me wonder if it’s pure luck that it works with 2 cluster NICs or not, with multiple and with reboots of the nodes I know we run into trouble and that’s no good.

It’s also smart not to mix static routes with forwarding to achieve the same thing. And please have the exact same configuration on very particular NIC on every node. Not one node with NLB NIC 1 routing via static routes and the other node using forwarding on NLB NIC 1. That’s asking for inconsistent behavior.

We’ll briefly look at some examples now.

If you only need to route internally (i.e to reach the database or a client PC) we add the needed static persistent routes on the NLB NICs using the route command.

In order for the NLB NICs to reach the database with strong host model and no forwarding enabled:

Route add -p 10.30.0.0 mask 255.255.0.0 10.10.0.1

To reach the client PC’s:

Route add -p 10.20.0.0 mask 255.255.0.0 10.10.0.1

(Using route print you can look at the routes and using route delete you can get rid of them.)

Or by using netsh, (it’s advised to use netsh from Windows 2008 on)

netsh interface ipv4 add route 10.30.0.0/16 "NLB NIC" 10.10.0.1

netsh interface ipv4 add route 10.20.0.0/16 "NLB NIC" 10.10.0.1

(you can look at the routing table by using netsh interface ipv4 show route, with netsh interface ipv4 delete route you get ridd of then, see http://technet.microsoft.com/en-us/library/cc731521(WS.10).aspx for more information.

You can also just enter the default gateway on the NLB NICs as well. All NICs are on the same subnet this will cause no issues. Just remember that traffic will also go to where ever that gateway routes, even to the internet.

We already know we can play with the weakhost / stronghost model:

netsh interface ipv4 set interface Private NIC weakhostsend=enabled

netsh interface ipv4 set interface Private NIC weakhostreceive=enabled

netsh interface ipv4 set interface NLB NIC weakhostsend=enabled

netsh interface ipv4 set interface NLB NIC weakhostreceive=enabled

Again don’t just blindly enable on every NIC you can find on the server. Test what you really need and use only that. I leave that as an exercise to the readers. As I’ve said before, it really depends on the situation and needs for your particular situation. Keep in mind that when you enable weakhostsend and weakhostreceive on every NIC this will just revert your Windows 2008 server into Windows 2003 behavior and this might not be needed or wanted. So just enable what you need for optimal security.

There is a very good explanation of strong and weak host behavior by "The Cable Guy" at http://technet.microsoft.com/en-us/magazine/2007.09.cableguy.aspx I strongly advise you to go take a look.

And naturally enabling forwarding will do the trick in this scenario as well, as this creates a weak host model. Depending on how many NICs you use and how traffic must flow you might have to do it on more than one NIC, normally the one(s) without a default gateway.

netsh interface ipv4 set interface "NLB NIC" forwarding=enabled

 
When & Why Use Three NICs or more?

NLB supports using multiple network adapters to configure separate clusters. This allows for configuring multiple independent clusters on each host. We used to have only virtual clusters meaning that you could configure multiple clusters on a single network adapter. Anyone who ever had to trouble shoot some networking or configuration issues on a production NLB will appreciate the ability to limit interruptions and problems to one cluster instead of 2 or more. As an example of this I had to trouble shoot a CAS/HUB Exchange Implementation two node NLB implementation. The NLB Cluster of the CAS role had this very issue, but since it was running on its own cluster with a separate NIC the HUB role NLB cluster has no issues what so ever. Another good reason to use more NIC is to separate traffic, for example FTP versus HTTP on the same NLB Cluster.

One of the worst things that can happen is that an issue messes up the proper functioning of the NLB itself. That way even if the virtual IP remains available no host or only some of the hosts get network traffic. That means the cluster is unavailable or is only partially responding. This is a bad situation to be in and can be hard to trouble shoot. Since it’s a high availability technology you can bet someone is looking over your shoulder that has a vested interest in getting that resolved as soon as possible.

Mind the order of the connections in Adapters and Bindings

Make sure the PRIVATE NIC that is to be used for private network traffic (DNS, AD, RDP, …) is listed first. That prevent any issues (speed, functionality) of those services and you experience will be much better. This is illustrated in the figures below. LAN-HUB is the PRIVATE NIC here. The others are for NLB (yup it’s an Exchange 2010 setup).

image

Conclusion & recapitulation

I’ll finish with some closing musings on single & multiple default gateway and getting/sending network traffic where it needs to go.

When you enter a gateway on the second, third and so on NIC next to the one on the first NIC you’ll get a warning:

—————————

Microsoft TCP/IP

—————————

Warning – Multiple default gateways are intended to provide redundancy to a single network (such as an intranet or the Internet). They will not function properly when the gateways are on two separate, disjoint networks (such as one on your intranet and one on the Internet). Do you want to save this configuration?

—————————

Yes No

—————————

This will not work reliable when you have multiple subnets. This is why you use static persistent routing entries. Depending on your needs you can also use forwarding or the weak host model and even combine those with static persistent routes if needed of desired. Now the above also means that if you have multiple NICs with IP addresses on the same subnet you can indeed enter a Default Gateway on all of them.

If you don’t have or cannot have a Default Gateway filled in you are left with two options. If you know what needs to go where you can add static routes, which is basically telling the NIC the IP of a gateway to send traffic to for a certain destination. This is assuming you can reach that IP and that the traffic is not from a source/ to a destination that has no route defined and firewall allow for it, etc.

If you have no route or you can’t specify one (i.e. you can’t predict where traffic will have to go) you have one other option left and that is to route the traffic via the NIC that does have a Default Gateway. This used to work out of the box on Windows 2003 and earlier, but it doesn’t work out of the box since Windows 2008 (R2). That is because by default NICs in Windows 2008(R2) operate in a strong host model. So it will not receive or send traffic destined for some other IP than itself or send traffic originating somewhere else than itself. For that you’ll need to set the NIC properties to weak host send and receive or you need to enable forwarding. Actually forwarding is disabled by default on Windows 2003 as well. The big difference is that Windows 2003 operates in a weak host manner (send/receive) as opposed to Windows 2008 (R2) strong host mode. By enabling forwarding we put the Windows 2008 server in weak host mode and as such it works (see RFC1122). On the internet you’ll find both solutions, but the link between the two is often never made. Using weak host receiving and weak host sending allows for more atomic, custom configurations than forwarding.

Contact me via the web site or leave a comment if you have any questions or suggestions.

Post Script / Side Note because someone asked J

Basically you can have multiple gateways on a server but only one default gateway. You can add more than one default gateway on the same NIC but then they will only be used when the default gateway filled out in is not available, it will then try the next one and so forth. You can add multiple gateways to a single NIC or one or more to multiple NICs but that can, get messy very quickly. Whether it is wise to provide gateway redundancy in such a manner is another discussion. See also KB article http://support.microsoft.com/kb/157025. Be mindful of the extra configurations you’ll need (Dead Gateway Detection). This is a rather uncommon scenario on a windows server. You can use it for redundancy or when you want the traffic to go to a certain default gateway instead of another when it is available (so separate traffic for example for cost or to reduce the traffic load).

And then there’s adding a default gateway that’s on another subnet than the IP address of the NIC. In that case you get this warning:


—————————

Microsoft TCP/IP

—————————

Warning – The default gateway is not on the same network segment (subnet) that is defined by the IP address and subnet mask. Do you want to save this configuration?

—————————

Yes No

—————————

All pretty cool stuff you can do to mess with peoples head and understanding of what’s going on (it can work if the router on the local subnet has a route the subnet where that default gateway lives and PROXY ARP is working … but we’re not going to turn this into a networking course or pretty soon we’ll be installing RRAS and turn the server into a router.