July 26, 2016 - Alan Finn

Hyper-V and QLogic Equals DPC_WATCHDOG_VIOLATION BSOD

Working with some older hardware (HP DL585 G7 servers and NC523 SFP 10Gb dual-port adapters), I ran into an issue with a Hyper-V cluster where the nodes would intermittently crash with a DPC_WATCHDOG_VIOLATION BSOD (bug check code 0x133). I could reliably reproduce the crash by manually initiating a Live Migration. This error is essentially caused by a driver exceeding a timeout threshold. You can read more about the watchdog violation here and if you’re feeling really geeky, you can read about DPC objects and driver I/O here.

After analyzing the memory.dmp, the stack pointed to the QLogic driver (dlxgnd64.sys). As I’m sure you would, I set out to update the driver for the Intelligent NIC; however, since the server was already a little over 2 years old, the latest version of the HP driver was already installed. Hmm… Next, I went to QLogic directly and looked up their model number for the NC523, which they OEM for HP; it turned out to be the QLE3242. The driver on the QLogic site was more current, so I gave that a shot. After updating, I tested again with a Live Migration and once again enjoyed the lovely cornflower blue hue of the BSOD. Crap. Back to Google.

After additional digging, I found Event ID 106 errors in the System event log regarding load balanced teaming on the NIC. After a little research, I ran across this article on MS Support. Again, I’ll let you read the details, but in a nutshell, the NICs in the team were overlapping in their use of the same processors. As I was using hyper-threading, I followed the steps in the article to assign a specific base processor to each NIC and set the maximum number of processors VMQ could use:

Set-NetAdapterVmq -Name "Ethernet1" -BaseProcessorNumber 4 -MaxProcessors 8   # VMQ would use processors 4,6,8,10,12,14,16,18
Set-NetAdapterVmq -Name "Ethernet2" -BaseProcessorNumber 20 -MaxProcessors 8  # VMQ would use processors 20,22,24,26,28,30,32,34

This did not require a restart, and once I made the changes on the NICs, I was able to Live Migrate without any crashes. I will also note that although I updated the drivers here, I tested the VMQ settings without the driver update on another Hyper-V cluster with identical hardware, and they resolved the issue there as well. I burned about 6 to 8 hours banging my head on various troubleshooting items, including several I didn’t include here, so I hope this post saves you a bit of time and headache.
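To illustrate how the processor lists noted next to the two commands are derived, here is a minimal sketch. It assumes hyper-threading is enabled, so VMQ is taken to use every second logical processor starting at the base. `Get-VmqProcessorList` is a hypothetical helper written for this illustration, not a cmdlet from the NetAdapter module.

```powershell
# Hypothetical helper: lists the logical processors a VMQ adapter would use,
# assuming hyper-threading (only every second logical processor is used).
function Get-VmqProcessorList {
    param (
        [int]$BaseProcessorNumber,
        [int]$MaxProcessors
    )
    for ($i = 0; $i -lt $MaxProcessors; $i++) {
        $BaseProcessorNumber + (2 * $i)
    }
}

$nic1 = Get-VmqProcessorList -BaseProcessorNumber 4 -MaxProcessors 8
$nic2 = Get-VmqProcessorList -BaseProcessorNumber 20 -MaxProcessors 8

"Ethernet1: $($nic1 -join ',')"
"Ethernet2: $($nic2 -join ',')"

# Any processor number present in both lists would indicate an overlap
$overlap = @($nic1 | Where-Object { $nic2 -contains $_ })
"Overlapping processors: $($overlap.Count)"
```

With the values used in this post the two lists do not share any processor, which is the condition the teaming warning is asking for.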

April 18, 2016 - Alan Finn

Windows Splash Screen Appears When Launching Application on XenApp/XenDesktop 7.6 and Storefront 3.5

After upgrading Storefront from 2.5 to 3.5, I noticed that all published applications where the VDA was running on Windows 2012 R2 started displaying the Windows logon process in a splash screen.

The application continued to launch successfully, but this splash screen did not start appearing until after the Storefront upgrade. It also did not occur on VDAs running on Windows 2008 R2, only on the 2012 R2 servers. The fix was to update the following registry value on the VDA:
Key: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Citrix\Logon
Name: DisableStatus
Type: REG_DWORD
Value: 0x00000000
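
If you would rather script the change than edit the registry by hand, something like the following should work from an elevated PowerShell prompt on the VDA. This is only a sketch: it assumes the key lives at SOFTWARE\Wow6432Node\Citrix\Logon, so check the path on your own VDA before running it.

```powershell
# Assumed key path; verify it on your own VDA before running
$path = 'HKLM:\SOFTWARE\Wow6432Node\Citrix\Logon'

# Create the key if it is missing, then set DisableStatus to 0 (REG_DWORD)
if (-not (Test-Path -Path $path)) {
    New-Item -Path $path -Force | Out-Null
}
New-ItemProperty -Path $path -Name 'DisableStatus' -PropertyType DWord -Value 0 -Force | Out-Null

# Read the value back to confirm the change
Get-ItemProperty -Path $path -Name 'DisableStatus'
```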

December 28, 2015 - Alan Finn

Netscaler 10.5 and Windows 2012 R2 SChannel Errors with TLS 1.2

As part of a recent upgrade, I moved a couple of Netscaler Access Gateways from version 10.1 to version 10.5. The upgrade went very smoothly, no errors, no user calls… for a while. The next day, we started receiving calls about issues launching apps via Storefront. Some users were receiving the “SSL Error 43: The proxy denied access to…” error with their STA ticket when clicking on their application icons on the web page.

Tracking down the servers based on the STA ID in the ticket, I noticed that users only had issues when they were attempting to authenticate to Windows 2012 R2 delivery controllers. The Windows 2008 R2 delivery controllers were not denying the STA requests. Jumping on one of the Windows 2012 R2 delivery controllers, I noticed the System event log was flooded with Schannel errors for Event ID 36874 (An TLS 1.2 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The SSL connection request has failed.) and Event ID 36888 (A fatal alert was generated and sent to the remote endpoint. This may result in termination of the connection. The TLS protocol defined fatal error code is 40. The Windows SChannel error state is 1205.). Well, we obviously have an SSL issue, but these codes aren’t exactly pointing me anywhere. Looking up the error code in the RFC for the TLS protocol (http://tools.ietf.org/html/rfc5246), I found that alert code 40 is a handshake failure (you can find this in Appendix A.3, Alert Messages). I can’t remember where exactly I found the enum definition for the Schannel 1205 code, but it basically means that a fatal error was sent to the endpoint and the connection was being forcibly terminated. At least I now knew there was an issue with the SSL handshake between the Netscalers and the Windows 2012 R2 delivery controllers. Time for some network tracing.

Firing up Wireshark on the delivery controller, I could see that the connection was getting immediately reset by the server after the Client Hello from the Netscaler.

Expanding the Client Hello packet in the capture, I could see a list of ciphers currently being offered by the Netscaler. (Note – for the sake of easier troubleshooting, I left the default grouping of ciphers in place as it was a large group of widely accepted ciphers until I identified the issue and then trimmed down the cipher list. You should limit the number of ciphers available on the virtual server of your Access Gateway to just what you need and leverage the more current stronger methods available such as AES 256 over RC4 and MD5, etc. if possible.)

Next, I configured the SSL Cipher Suite Order on the Windows server to match what the Netscaler was presenting in the Client Hello packet, at least the top 10 or so. This can be done using either gpedit.msc for local policy or the Group Policy Management Console, as follows:

In either editor, expand Computer Configuration/Administrative Templates/Network.
Click on SSL Cipher Suite Order under SSL Configuration Settings.
Select the Enabled option and then follow the instructions in the Help section of the policy. Basically, all the ciphers you want will be listed on a single line separated by commas with no spaces anywhere.
You must reboot the server for the changes to take effect.
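
As a small aid for building that single-line value, the sketch below joins a list of suite names with commas and refuses to continue if any whitespace sneaks in. The three suite names are only examples; substitute the ones you see in your own Client Hello capture.

```powershell
# Example suite names only; replace with the suites from your own capture
$suites = @(
    'TLS_RSA_WITH_AES_256_CBC_SHA256',
    'TLS_RSA_WITH_AES_256_CBC_SHA',
    'TLS_RSA_WITH_AES_128_CBC_SHA'
)

# Trim each entry and join with commas only (no spaces anywhere)
$policyValue = ($suites | ForEach-Object { $_.Trim() }) -join ','

if ($policyValue -match '\s') {
    throw 'Cipher list must not contain whitespace.'
}

# Paste this string into the SSL Cipher Suites box of the policy
$policyValue
```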

Even after the reboot, the SChannel errors were still present and the network captures were still showing the handshake failing due to a reset from the server. I’ll save you the time you will spend on re-ordering the ciphers on both the Netscaler and the Windows Server 2012 R2 Delivery Controller along with the multitude of reboots that go with it; it simply won’t work (at least at the time I published this). I stepped back and decided to try tweaking the TLS protocol versions since I wasn’t getting anywhere with the cipher suites (key exchange algorithms). For the sake of brevity, after much additional testing, headbanging, and googling I was able to get the handshake to work when I disabled TLS 1.2 on the Windows 2012 server. This forced the server to renegotiate using TLS 1.1 with the Netscaler which worked with the cipher suites I tested with that were supported by both the OS and the Netscaler. I did find a nice article supporting this here for additional reference.

To disable TLS 1.2 on the server, you need to modify a registry key:

Go to HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols.
If the TLS 1.2 key does not exist, create it.
Inside the TLS 1.2 key, create another key called Client.
Within the Client key, create two REG_DWORD values:

a. DisabledByDefault (set value to 1).

b. Enabled (set value to 0).

You will need to reboot one more time for the changes to take effect. This finally cleared up my SChannel errors and allowed me to add the controllers back as STAs on the virtual server, in a green status this time.
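
For anyone who prefers to script the registry steps, here is a rough PowerShell equivalent. It is a sketch of the manual steps in this post and nothing more: it touches only the Client key named above, and it must be run from an elevated prompt.

```powershell
$protocols = 'HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols'
$clientKey = Join-Path -Path $protocols -ChildPath 'TLS 1.2\Client'

# -Force creates the TLS 1.2 key and the Client key beneath it if they are missing
if (-not (Test-Path -Path $clientKey)) {
    New-Item -Path $clientKey -Force | Out-Null
}

# The two REG_DWORD values described in the steps
New-ItemProperty -Path $clientKey -Name 'DisabledByDefault' -PropertyType DWord -Value 1 -Force | Out-Null
New-ItemProperty -Path $clientKey -Name 'Enabled' -PropertyType DWord -Value 0 -Force | Out-Null

# Confirm the result before rebooting
Get-ItemProperty -Path $clientKey | Select-Object DisabledByDefault, Enabled
```

Schannel also recognizes a sibling Server key that takes the same two values for inbound connections. The steps in this post only mention Client, so that is all this sketch changes.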

August 20, 2015 - Alan Finn

Using Regular Expressions with SCOM 2012 Groups

A few examples of using regular expressions in group targeting in SCOM.

String pattern matching

(?i:fs) - This simple pattern will match “fs” anywhere in a string and is case-insensitive. For example, DFW-FS01 would match. The parentheses and question mark stipulate a non-capturing group. A capturing group stores regex matches for use later in the expression. Since we don’t need to do anything with the match, the non-capturing group makes more sense and is optimized for this case. The i after the question mark is a modifier that stipulates a case-insensitive match. This would effectively match fs, Fs, fS, and FS.

(?i:fs|ps) - This expands on the previous example to match alternatives in the non-capturing group. Let’s say we wanted to add both file servers and print servers into a group expression. This example would match both DFW-FS01 and DFW-PS01. Think of the pipe symbol like an “or” conditional operator.

(?i:[pf]s) - This is another way to get the same results as the previous example. In this case we know that our server names will contain either FS or PS, so we put the “p” and the “f” in brackets, which means match either “p” or “f” immediately followed by “s”.

(?i:[a-z][a-z][a-z]-sql-cl[^c-z]) - We can also match by character ranges and exclude characters as well. The brackets with a-z inside mean to match any single character “a” through “z”. The caret inside the last bracket negates the match, so this means match any single character except “c” through “z”. This would match something like DFW-SQL-CLA, but not DFW-SQL-CLD.
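
A quick way to check patterns like these before putting them into a group definition is to run them through .NET regex from PowerShell. The sketch below is only an approximation of what SCOM will do, and the server names are made up for the test. Note that it uses [regex]::IsMatch rather than -match, because -match is case-insensitive by default and would hide what the (?i:) modifier is doing.

```powershell
# [regex]::IsMatch is case-sensitive by default, so any case-insensitive
# behavior below comes from the (?i:) modifier inside the pattern itself.
$tests = @(
    @{ Pattern = '(?i:fs)';                           Name = 'DFW-FS01' },
    @{ Pattern = '(?i:fs|ps)';                        Name = 'dfw-ps01' },
    @{ Pattern = '(?i:[pf]s)';                        Name = 'DFW-PS01' },
    @{ Pattern = '(?i:[a-z][a-z][a-z]-sql-cl[^c-z])'; Name = 'DFW-SQL-CLA' },
    @{ Pattern = '(?i:[a-z][a-z][a-z]-sql-cl[^c-z])'; Name = 'DFW-SQL-CLD' }
)

foreach ($t in $tests) {
    $result = [regex]::IsMatch($t.Name, $t.Pattern)
    '{0,-36} {1,-12} {2}' -f $t.Pattern, $t.Name, $result
}
```

Only the last row should come back False, since “D” falls inside the excluded c-z range.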

Number pattern matching

Parentheses, brackets, etc. still have the same function with numbers. IP ranges are a good example of common pattern matching in SCOM; some examples are as follows:

^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ - While not the best method, if you are not worried about validating that each octet holds a valid number, this is a simple match for an IPv4 pattern. Let’s break this down.

The caret at the beginning means that this is the start of the string, there should be no characters or digits before this in the match.
([0-9]{1,3}) is a capturing group; it uses parentheses like the groups in the string matching earlier, but without the leading question mark. The [0-9] means to match any single digit between 0 and 9. The {1,3} means to repeat the match 1 to 3 times. This is how we match one octet whether it has one, two, or three digits.
The \. or “backslash dot” is how we match the dot between octets. The “dot” is a meta-character (special character) that normally matches any single character in an expression. Since we want it to match an actual “dot”, we use the “backslash” to escape it and tell the expression to match the “dot” exactly, not as a meta-character.
We then repeat this pattern again for each octet coming to the $ dollar sign at the end. This simply means that this should be the end of the string and no more characters or digits should come after it. Since we are matching an IP in this example expressly, we don’t expect to see anything afterwards. If you needed to match an IP address as part of a string or sentence where you expect characters after the IP address, simply remove the dollar sign.

As I mentioned earlier, this is a quick method to match an IP address; however, it will also match 999.999.999.999, which doesn’t fall into any IPv4 scheme I’ve worked with. Let’s say we want to match a specific host address on any class B-sized (/16) network in the 10.x.x.x range. The following would meet that criteria: ^(10)\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.(10)\.(250). We already understand the first, third, and fourth octets, but what about the second? Let’s split it up at each pipe:

[0-9] - match any digit between zero and nine. This takes care of single digits.
[1-9][0-9] - match anything between 10 and 99.
1[0-9][0-9] - match anything between 100 and 199.
2[0-4][0-9] - match anything between 200 and 249. This and the next alternative are where we limit the octet to the allowed IPv4 range.
25[0-5] - match anything between 250 and 255.

That’s it. That will cover any number between 0 and 255 in the second octet.

Another example might be to match anything on the 172.24.x.0 network where the last octet has to be 224 or 225. This would look something like this: ^(172)\.(24)\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.(224|225)$
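
To see the difference between the loose and the range-checked patterns, here is a small sketch using .NET regex from PowerShell. The variable names and sample addresses are mine, and the octet alternation is written with 1[0-9][0-9] for the 100 to 199 range; a variant using 1[0-9][1-9] in that position would miss values such as 100, which the last check demonstrates.

```powershell
# One octet limited to 0-255
$octet = '([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'

# Loose pattern: any one to three digits per octet
$loose = '^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'

# Strict patterns assembled from the octet fragment
$strict  = '^{0}\.{0}\.{0}\.{0}$' -f $octet
$tenNet  = '^(10)\.{0}\.(10)\.(250)' -f $octet
$lastTwo = '^(172)\.(24)\.{0}\.(224|225)$' -f $octet

[regex]::IsMatch('999.999.999.999', $loose)    # True, the loose pattern accepts it
[regex]::IsMatch('999.999.999.999', $strict)   # False
[regex]::IsMatch('10.100.10.250', $tenNet)     # True
[regex]::IsMatch('172.24.7.224', $lastTwo)     # True
[regex]::IsMatch('172.24.7.226', $lastTwo)     # False

# A 1[0-9][1-9] alternative cannot match an octet of 100
$buggy = '^(10)\.([0-9]|[1-9][0-9]|1[0-9][1-9]|2[0-4][0-9]|25[0-5])\.(10)\.(250)'
[regex]::IsMatch('10.100.10.250', $buggy)      # False
```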

String and number pattern matching

What if we had multiple four-node clusters using a naming convention similar to DFW-SQLCL1, DFW-SQLCL2, DFW-SQLCL3, DFW-SQLCL4 and we wanted to group only the first and second nodes from all sites into a group? We would use the following expression: (?i:[a-z][a-z][a-z]-sqlcl[12]), or to shorten it, (?i:[a-z]{3}-sqlcl)[12]. Note that inside brackets the pipe is a literal character, so [1|2] would also match a “|”; [12] is all that is needed.
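
Here is a short check of the node-matching expression, again as a sketch using .NET regex from PowerShell with made-up names. The pattern is written with [12] as the final character class; the last line shows why a pipe inside brackets is not an “or”.

```powershell
$pattern = '(?i:[a-z]{3}-sqlcl[12])'

[regex]::IsMatch('DFW-SQLCL1', $pattern)   # True
[regex]::IsMatch('dfw-sqlcl2', $pattern)   # True
[regex]::IsMatch('DFW-SQLCL3', $pattern)   # False

# Inside brackets the pipe is a literal, so [1|2] accepts a name ending in "|"
[regex]::IsMatch('DFW-SQLCL|', '(?i:[a-z]{3}-sqlcl[1|2])')   # True
```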

It helps to use a regular expression tester when working with these. One good option is https://regex101.com

August 05, 2015 - Alan Finn

Purging Kerberos Ticket Cache on Remote Machines

I was recently asked to help the DBA and Storage teams with an issue related to backup authentication. From what I was told, they had been testing different authentication methods to access a Data Domain device as a backup target on several SQL clusters. When attempting to normalize everything using a single account and authentication mechanism, they were running into authentication issues getting back to the share on the Data Domain due to cached Kerberos tickets. The vendor recommended that they purge the Kerberos cache on each of the servers to clear the tickets. The kicker was that there were quite a few servers involved, so logging on to each one and manually running klist.exe would have been fairly time consuming. The DBAs were not very keen on my first suggestion to just remotely reboot the passive nodes and let clustering work its magic. They responded by calling me crazy and making absurd claims about production outage this and change control that, etc. Geesh (chuckle)!

Having been shot down as a cluster-reboot-comedian, I threw together the following script to remotely run klist on each of the servers via Invoke-Command:

<#
	.SYNOPSIS
		Deletes all current kerberos tickets on specified machines
	
	.DESCRIPTION
		Uses klist.exe to purge Kerberos tickets on designated servers/workstations.
	
	.PARAMETER Targets
		String array of computer names.
	
	.EXAMPLE
		PS C:\> .\Remove-KerbTickets.ps1 -Targets Server01, Server02, Server03

	.EXAMPLE
		PS C:\> $arr = Get-ADComputer -LDAPFilter "(name=*FS01)" | Select-Object -ExpandProperty name
	 	PS C:\> $arr | .\Remove-KerbTickets.ps1
#>
	
[CmdletBinding()]
param
(
	[Parameter(Mandatory = $true,
			   ValueFromPipeline = $true)]
	[string[]]$Targets
)

process {
	$CurrentSessions = @()
	
	$scriptcontent = { param ($SessionItem); klist -li $SessionItem purge}
	
	foreach ($Server in $Targets) {
		$Error.Clear()
		$CurrentSessions = @() # Start with an empty list for each server so sessions from a previous server are not reused
		$CurrentLogonSessions = Get-WmiObject -ComputerName $Server -Class Win32_LogonSession -ErrorAction SilentlyContinue
		
		if (!$Error) {
			foreach ($Session in $CurrentLogonSessions) {
				$UserID = [convert]::ToString([int64]$Session.LogonID, 16) # Convert the LogonID value from decimal to hex
				$UserID = '0x' + $UserID # Prepend the hex prefix to the string
				$CurrentSessions += $UserID # Add string to the array
			}
			
			foreach ($SessionItem in $CurrentSessions) {
				$Error.Clear()
				$results = Invoke-Command -ComputerName $Server -ScriptBlock $scriptcontent -ArgumentList $SessionItem -ErrorAction SilentlyContinue
				If ($Error[0].Exception.Message -match "The client cannot connect") {
					Write-Host "$Server - Unable to connect via WinRM"
					break
				}
				<# Example placeholder to handle klist errors				
				if ($results -match "0xc000005f") {
					Write-Host "Session no longer exists or may have been terminated."	
				}
				#>
				if ($results -match "purged") {
					Write-Host "$Server - Ticket(s) purged"
				}
			}
			$Error.Clear()
		}
		else {
			Write-Host "$Server - WMI Error: $($Error[0].Exception.Message)"
		}
	}
	$CurrentSessions.Clear()
}
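
To make the LogonID conversion in the script a little more concrete, here is a tiny standalone sketch. It uses 999, the well-known logon ID of the local SYSTEM session, as a sample value; on a real server the value would come from Win32_LogonSession as in the script.

```powershell
# Sample value only: 999 is the local SYSTEM logon session.
# WMI returns LogonId as a string, hence the cast to int64.
$logonId = '999'

# Convert decimal to hex and add the 0x prefix that klist expects
$hex = '0x' + [Convert]::ToString([int64]$logonId, 16)
$hex

# The command the script block would effectively run for that session
"klist -li $hex purge"
```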

afinn.net