Intermittent TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019

Introduction

Someone asked me to help investigate an issue that was hindering client/server applications. They suffered from intermittent TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019. Normally everything went fine but one in every 250 to 500 connection of the server-client to a database on SQL Server 2016 they got the error below.

System.ServiceModel.CommunicationException: The InstanceStore could not be initialized. —> System.Runtime.DurableInstancing.InstancePersistenceCommandException: The execution of the InstancePersistenceCommand named {urn:schemas-microsoft-com:System.Activities.Persistence/command}CreateWorkflowOwner was interrupted by an error. —> System.Data.SqlClient.SqlException: A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 – An existing connection was forcibly closed by the remote host.) —> System.ComponentModel.Win32Exception: An existing connection was forcibly closed by the remote host

Now actually retrying the connection in the code could work around this. Good code should have such mechanisms implemented. But in the end, there was indeed an issue.

Intermittent TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019

I did a quick verification of any network issues. The network was fine, so I could take this cause of the table. The error itself did not happen constantly, but rather infrequently. All indications pointed to a (TLS) configuration issue. “An existing connection was forcibly closed by the remote host” was the clearest hint for this. But then one would expect this to happen every time.

We also checked that the Windows Server 201 R2 hosts were fully up to date and had either .NET 4.7 or .NET 4.8 installed. These versions support TLS 1.2 without issues normally.

Also, on Windows Server 2012 R2 TLS 1.2 is enabled by default and does not require editing the registry to enable it. You have to do this is you want to disable it and re-enable it.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.2\Client]
“DisabledByDefault”=dword:00000000

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.2\Server]
“DisabledByDefault”=dword:00000000

These were not present on the Windows Server 2012 R2 host but also not on the Windows Server 2016 or 2019 hosts we use for comparison.

We also check if KB3154520 – Support for TLS System Default Versions included in the .NET Framework 3.5 on Windows 8.1 and Windows Server 2012 R2 is installed. This allows using the operating system defaults for SSL and TLS instead of the hardcoded .NET Framework defaults.

For 64-bit operating systems:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft.NETFramework\v2.0.50727]
“SystemDefaultTlsVersions”=dword:00000001

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft.NETFramework\v2.0.50727]
“SystemDefaultTlsVersions”=dword:00000001
For 32-bit operating systems:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft.NETFramework\v2.0.50727]
“SystemDefaultTlsVersions”=dword:00000001


Note If the application has set the ServicePointManager.SecureProtocol in code or through config files to a specific value or uses the SslStream.AuthenticateAs* APIs to specify a specific SslProtocols enum, the registry setting behavior does not occur. But in our test code, we also have control over this. So we can test a lot of permutations.viu

Test code

To properly dive into the issue we needed to reproduce the error at will or at least very fast. So for that, we contacted a dev and asked him to share the code paths that actually made the connections to the databases. This is to verify if there were multiple services connecting and maybe only one had issues. It turned out it was all the same. It all failed in the same fashion.

So we came up with a test program to try and reproduce it as fast as possible even if it occurred infrequently. That ability, to test configuration changes fast, was key in finding a solution. With this test program, we did not see this issue with clients that were running on Windows Server 2016 or Windows Server 2019. Not with the actual services in the environments test or in production nor with our automated test tool. Based on the information and documentation of .NET and Windows Server 2012 R2 this should not have been an issue there either. But still, here we are.

The good thing about the test code is that we can easily play with different settings in regards to the TLS version specified in the code. We noted that using TLS 1.1 or 1.0 would show a drop in connection errors versus TLS 1.2 but not eliminate them. No matter what permutation we tried we just got a difference in frequency of the issue. We were not able to get rid of the error. Now that we had tried to deal with the issue on the application level we decided to focus on the host.

The host

Even with no TLS 1.2 enforced the Windows Server 2012 R2 client hello in its connection to the SQL Server host uses TLS 1.2. The cipher they select ( TLS_DHE_RSA_WITH_AES_256_GCM_SHA384) is actually considered weak. This is evident when we look at the client and server hello to port 1433.

TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019
Client hello
TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019
Server Hello

Trying to specify the TLS version in the test code did nor resolve the issue. So we now tried solving it on the host. We are going for best practices on the Windows Server 2012 R2 client side. ! In the end, I ended up doing the following

  • Allowed only TLS 1.2
  • Allowed only Secure ciphers & Ordered those for Perfect Forward Privacy
  • Enforce the OS to always select the TLS version
  • Enforce the use of the most secure (TLS 1.2 as that is the only one we allow)

Note that there is no support for TLS 1.3 at the moment of writing.

After this, we ran our automated tests again and did not see even one occurrence of these issues anymore. This fixed it. It worked without any other changes except for one. There was one application that had not been upgraded to .NET 4.6 or higher and enforced TLS 1.1. So we leaned on the app owners a bit to have them recompile their code. In the end, they went with 4.8.

In the network capture of connections, we see they now select a secure cipher (TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384).. With this setup, we did not experience the flowing infrequent error anymore. “System.Data.SqlClient.SqlException: A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 – An existing connection was forcibly closed by the remote host.) —> System.ComponentModel.Win32Exception: An existing connection was forcibly closed by the remote host

Client Hello
Server Hello

The cause

We had an assumption is that once in a while and older/weaker ciphers were used and that this causes the connection error. By implementing best practices we disabled those which prevent this from happening. As we did not change the source code om most machines it either used the .Net framework defaults or the specified settings.

You can easily implement best practices by using a community script like Setup Microsoft Windows or IIS for SSL Perfect Forward Secrecy and TLS 1.2. There is also a tool like IISCrypto or IISCryptoCLI. Potentially, you can always automate this and devise a way to configure this wholesale with the tools of your choice if needed.

The solution

As you can read above, what finally did the trick was implementing a TLS 1.2 only best practices configuration on the Windows Server 2012 R2 hosts.

If you run into a similar issue please test any solution, including mine here, before implementing it. TLS versions that software and operating systems require or are able to use differ. So any solution needs to be tested end to end for a particular use case. Even more so if ServicePointManager.SecureProtocol in code or through config files or via  SslStream.AuthenticateAs* APIs are in play.

In this particular environment forcing TLS 1.2 and having the OS control which TLS version is being used by the .NET applications worked. Your mileage may differ and you might need to use a different approach to fix the issue in your environment. Big boy rules apply here!

Conclusion

Tech debt will sooner or later always rear its ugly head. There is a reason I upgrade and update regularly and well ahead of deadlines. I have always been doing that as much as possible (SSL Certs And Achieving “A” Level Security With Older Windows Version) . But in this case that this seemed more like a bug than a configuration issue. It only happened every now and then which meant troubleshooting was a bit more difficult. But that was addressed with a little test program. this helped us test configuration changes to fix these Intermittent TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019 fast and easily. I shared this case as it might help other people out there struggling with the same issue.

18 thoughts on “Intermittent TLS issues with Windows Server 2012 R2 connecting to SQL Server 2016 running on Windows Server 2016 or 2019

  1. That is some mighty fine sleuthing and wonderful blogging! I’ve encountered the same issue and I was wondering if it would be possible to share the test code since as you’ve noted, this issue occurs very intermittently and so testing resolution is quite painful without a reliable way to test? For us the issue seems to have begun after to occur once a Windows Server 2016 CU post Nov 2019 was installed (i.e. any CU from Dec 2019 to Mar 2020)…but we’re not 100% sure of this.

    Thank you for the great blog post!

    • I can dig some of it up. Can you compile .NET code? When using Visual Studio Code instead of .NET we need to use .NET Core and I think I have that as well somewhere.

  2. I think I was able to reproduce using a tight loop calling System.Data.SqlClient.SqlConnection within PowerShell. I will reapply the latest Server CU and give it a whirl.

    Thanks again for your great post!

    • Yeh, that will do the job as well, in the end, we looped SQLCommand in a DataReader hit the issue every X hundred times. Good luck. Happy it helped someone out.

    • Here is some .NET core code for inspiration. In PowerShell you can do the same. I had to mask connectionstrings etc. But you get the drift.
      using System;
      using System.Collections.Generic;
      using System.Configuration;
      using System.Data.SqlClient;
      using System.Diagnostics;
      using System.IO;
      using System.Linq;
      using System.Net;
      using System.Text;
      using System.Threading;
      using System.Threading.Tasks;

      namespace TLS-TEST
      {
      class Program
      {
      static void Main(string[] args)
      {
      int count = 0;
      string connectionString = “CONNECTIONSTRINGe”;
      //connectionString = ConfigurationManager.ConnectionStrings[“DATA”].ConnectionString;

              //Console.WriteLine("Attempt to connect");
      
              //
              // Summary:
              //     Specifies the Secure Socket Layer (SSL) 3.0 security protocol.
              // Ssl3 = 48,
              //
              // Summary:
              //     Specifies the Transport Layer Security (TLS) 1.0 security protocol.
              // Tls = 192,
              //
              // Summary:
              //     Specifies the Transport Layer Security (TLS) 1.1 security protocol.
              // Tls11 = 768,
              //
              // Summary:
              //     Specifies the Transport Layer Security (TLS) 1.2 security protocol.
              // Tls12 = 3072
              //System.Net.ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11;
              //Console.WriteLine("security protocol: " + System.Net.ServicePointManager.SecurityProtocol);
              while (true)
              {
      
      
                  try
                  {
                      using (SqlConnection con = new SqlConnection(connectionString))
                      {
                          con.Open();
      
                          using (SqlCommand command = new SqlCommand("SELECT top(1) * from sys.tables", con))
                          {
                              using (SqlDataReader reader = command.ExecuteReader())
                              {
                                  while (reader.Read())
                                  {
                                      //Console.WriteLine("Result: {0}", reader.GetString(0));
                                      var output = reader.GetString(0);
                                      Console.WriteLine (output);
                                      count++;
                                      string filename = String.Format("{0:HH-mm-ss.ffff}__{1}{2}.txt", DateTime.Now, "Success", count);
      
                                      using (StreamWriter sw = File.CreateText(Path.Combine(@"c:\temp\TLSConnectionTest", filename)))
                                      {
                                          sw.WriteLine("It went well");
                                          sw.WriteLine("Date : " + DateTime.Now.ToString());
                                          sw.WriteLine("security protocol: " + System.Net.ServicePointManager.SecurityProtocol);
                                      }                                  
                                  }
                              }                            
                          }                      
                      }
                  }
      
                  catch (Exception ex)
                  {
                      count++;
                      string filename = String.Format("{0:HH-mm-ss.ffff}__{1}{2}.txt", DateTime.Now, "exception", count);
      
                      using (StreamWriter sw = File.CreateText(Path.Combine(@"c:\temp\TLSConnectionTest", filename)))
                      {
                          sw.WriteLine("EXCEPTION!");
                          sw.WriteLine("Date : " + DateTime.Now.ToString());
                          sw.WriteLine("message:" + ex.Message);
                          sw.WriteLine("stacktrace:" + ex.StackTrace);
                          sw.WriteLine("security protocol: " + System.Net.ServicePointManager.SecurityProtocol);
                      }
                  }
              }
          }
      }
      

      }

  3. Similarly here’s the essential bits I used:

    $i = 0
    while($i -lt 99999)
    {
    $i++
    try
    {
    $SqlConnection = New-Object System.Data.SqlClient.SqlConnection
    $SqlConnection.ConnectionString = “Server=;Database=master;Integrated Security=true;Pooling=false;”
    $SqlConnection.Open();
    $SqlConnection.Close()
    }
    catch
    {
    write-host $($_ | Out-String) -foregroundcolor “red” -backgroundcolor “black”
    write-host $($_.Exception|format-list -force|out-string) -foregroundcolor “red” -backgroundcolor “black”
    }
    }

  4. Just wanted to give a big thank you for taking your time to write about this.
    We’re facing the exact same problem (symptoms) and you’re pushing us in the right direction.
    Thanks and keep up the good work! 🙂
    /Charles

    • You’re most welcome. Please report back with your success in this as I am in contact with MSFT about looking into this.

  5. We were unable to reproduce the issue using the PowerShell script (I suspect that the issue has to do with image activation), but we proceeded with disabling the ciphers:

    TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
    TLS_DHE_RSA_WITH_AES_128_GCM_SHA256

    …on one of our utility servers that access a SQL 2019 database server (running on Windows Server 2019 Standard), and the issue disappeared (we monitored for a full week)!

    This morning, we disabled the two ciphers on three other servers – and will monitor the results.

    Thanks again for your help!

    Andrew de Mos

  6. Very well explained Ron. We are experiencing same issue in our environment and your post helped us to look in right direction. Appreciate it!

  7. Regard error: “A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 – An existing connection was forcibly closed by the remote host.)”

    I found this article which mentions the TLS_DHE cipher suite on Windows 2012 R2 has issues account for leading zeros on the MAC when calculating keys. And the TLS client and server need to calculate keys exactly the same way otherwise they get different results.

    The article mentions Windows 2012 R2 does not have the patch, but Windows 2016 and up do.
    I think this is my problem. And your solution should fix it. I’ll see after my next reboot.

    https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/apps-forcibly-closed-tls-connection-errors

    • Thx for the feedback. Do let us know what the end result is for you. It sure seems like you should be ok after implementing the workaround.

  8. Pingback: Offline Azure Devops Windows 2012 R2 build server with failing builds - Working Hard In ITWorking Hard In IT

  9. Lovely post, thank you for the in-depth article. I’m recently facing a issue similar to this, but it’s intermittent after using what we call ‘Gateway’ console applications, even Xamarin IOS/Android applications – which all use later TLS 1.2 consistently. What we are finding is that our Server 2016 (even with the REG Edit settings for Client/Server) set to disabled anything but TLS 1.2, to somehow come back with a Error Event in Event Log that says ”An TLS 1.2 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The TLS connection request has failed.”

    This is bizare, and i have a support request with Azure for our Cloud Service (Classic) that we see this on, but my question is: “Have you ever seen a similar issue like your article explains, generated by a Server that seems to turn off/on intermittently with Schannel Error Log Event created ?

    This seems bizare….

Leave a Reply, get the discussion going, share and learn with your peers.

This site uses Akismet to reduce spam. Learn how your comment data is processed.