I was looking into a Windows Server 2008 R2 cluster the other day. I do like clusters, especially Microsoft ones. This particular cluster had experienced an incident and, through a combination of factors and follow-up actions, had been left in a state where the cluster service was failing due to configuration errors.
This quote, by the way, is from Padme in Star Wars Episode II.
An incident had occurred during an operation to remove a disk resource and replace it with another disk resource. The operation appeared to have succeeded until a few days later, when one of the cluster nodes was rebooted. This was a three-node cluster in a majority node set configuration.
In order to shed some light on the issue, the logs needed to be enabled and extracted for processing. There’s a good blog post here describing the logging mechanism for Failover Cluster Services.
So, logging was turned on and the logs were extracted. This was the relevant result.
DBG [RCM] rcm::RcmGroup::InitializeFromDb()
ERR [CORE] Node 1: exception caught ERROR_FILE_NOT_FOUND(2)' because of 'OpenSubKey failed.'
INFO 000007fe:fd12ac3d( ERROR_MOD_NOT_FOUND(126) )
INFO 00000000:01a3ed18( ERROR_MOD_NOT_FOUND(126) )
INFO 00000000:01a3f028( ERROR_MOD_NOT_FOUND(126) )
INFO 00000000:ff7ea368( ERROR_MOD_NOT_FOUND(126) )
INFO 00000000:01a3ecf0( ERROR_MOD_NOT_FOUND(126) )
INFO 00000001:e06d7363( ERROR_MOD_NOT_FOUND(126) )
ERR Exception in the InstallState is fatal (status = 2)
DBG Exception in the InstallState is fatal: set netft heartbeat interval to 900 seconds
ERR Exception in the InstallState is fatal (status = 2), executing OnStop
INFO [DM]: Shutting down, so unloading the cluster database.
INFO [DM] Shutting down, so unloading the cluster database (waitForLock: false).
DBG [DM] Unloading Hive, Key \Registry\Machine\Cluster, discardCurrentChanges true
ERR FatalError is Calling Exit Process.
WARN [RHS] Cluster service has terminated.
WARN [RHS] Cluster service has terminated.
DBG [RHS] Closing all resources...
DBG [RHS] Closing all resources...
DBG [RHS] Shutdown event is signaled
INFO [RHS] Exiting.
DBG [RHS] Shutdown event is signaled
INFO [RHS] Exiting.
That’s not what you want to see. Interestingly enough, given that the error turned out to be a reference to a missing resource, I can understand the ERROR_FILE_NOT_FOUND (2) error, but the ERROR_MOD_NOT_FOUND (126) entries were a little confusing. By this stage of the problem the log entries were the same across all three nodes, because the incorrect settings in CLUSDB had been replicated to the other nodes.
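When the log is longer than this excerpt, a small filter can help pull out just the error codes reported on ERR lines. The following is only a rough sketch in Python, not part of the original troubleshooting. It assumes each line starts with the level (DBG, INFO, WARN, ERR) as in the excerpt above; a raw cluster log usually carries extra leading fields that would need stripping first. The function name summarize_errors is made up for illustration.

```python
import re

# Matches a symbolic Win32 error name followed by its numeric code,
# e.g. "ERROR_FILE_NOT_FOUND(2)".
WIN32_ERROR = re.compile(r"\b(ERROR_[A-Z_]+)\((\d+)\)")


def summarize_errors(log_text):
    """Return (name, code) pairs for every Win32 error named on an ERR line.

    Assumption: each line begins with its level, as in the excerpt above.
    """
    found = []
    for line in log_text.splitlines():
        parts = line.strip().split(None, 1)
        # Skip blank lines and anything that is not logged at ERR level.
        if len(parts) < 2 or parts[0] != "ERR":
            continue
        for name, code in WIN32_ERROR.findall(parts[1]):
            found.append((name, int(code)))
    return found
```

Run against the excerpt above, this would skip the INFO lines carrying ERROR_MOD_NOT_FOUND and report only the error raised at ERR level, which is the one that pointed at the real problem.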
My original thought was to use Microsoft Sysinternals Process Monitor to observe which resource the service was trying to access when it failed (it would have shown an attempt to open a non-existent registry key). However, after some more poking around within the CLUSDB file (this is the registry hive loaded by the cluster service), I was able to locate the missing resource.
In case that somehow happens again, I created a small script to search the resource group configuration for any reference to a missing resource. The script can be found here. You will need to change the extension back to .vbs. To use this script, take a copy of the CLUSDB file from a broken cluster node (the file is typically located in %SystemRoot%\Cluster). Copy the file to another machine and then use REGEDIT to load the hive into HKEY_LOCAL_MACHINE\Cluster (that location was arbitrarily chosen and is referenced by the script). The script can then be executed from a command prompt with cscript and will process the registry data in that location. Once it has finished, you can unload the hive from within REGEDIT.
Simply execute the script and it should list all of the resources and resource groups. If it locates a resource group with a reference to a missing resource, that will be highlighted in the script output (look for **Missing**). I’ve not included rectification steps in the script, as I tend to prefer manually fixing registry errors.
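For anyone who would rather not run VBScript, the core of that check can be sketched in a few lines of Python. This is not the script linked above, just an illustration of the same idea. It rests on an assumption about the hive layout that you should verify against your own CLUSDB: that each key under Groups holds a multi-string value named Contains listing resource GUIDs, and that each resource has its own key under Resources. The function names are made up for this example.

```python
def find_missing_resources(groups, resources):
    """Map each group to the resource GUIDs it references that do not exist.

    groups:    dict of group id -> list of resource GUIDs the group contains
    resources: iterable of resource GUIDs that actually have a definition
    """
    known = {guid.lower() for guid in resources}  # GUID case is not significant
    missing = {}
    for group_id, members in groups.items():
        absent = [g for g in members if g.lower() not in known]
        if absent:
            missing[group_id] = absent
    return missing


def read_loaded_hive(root="Cluster"):
    """Read groups and resources from the hive loaded under HKLM (Windows only).

    Assumption: Groups\\<id> has a multi-string 'Contains' value and
    Resources\\<id> is one key per resource. Check this in REGEDIT first.
    """
    import winreg  # imported here so the pure function above runs anywhere

    def subkeys(key):
        count = winreg.QueryInfoKey(key)[0]  # first item is the subkey count
        return [winreg.EnumKey(key, i) for i in range(count)]

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, root + r"\Resources") as res:
        resources = subkeys(res)

    groups = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, root + r"\Groups") as grp:
        for group_id in subkeys(grp):
            with winreg.OpenKey(grp, group_id) as g:
                try:
                    members, _value_type = winreg.QueryValueEx(g, "Contains")
                except FileNotFoundError:
                    members = []  # group with no 'Contains' value
            groups[group_id] = list(members)
    return groups, resources
```

With the hive loaded in REGEDIT as described above, calling read_loaded_hive() and passing its two results to find_missing_resources() should give a dictionary of the groups that point at resources with no definition. Like the original script, this only reports; it changes nothing in the registry.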
Should you find a missing resource, search for the resource GUID (e.g. d5bcfa55-04fc-4d68-9a67-7f9708d80faf) in the HKEY_LOCAL_MACHINE\Cluster hive. You will likely need to remove that entry from the resource group, and possibly remove a dependency key as well depending on where this is referenced.
When the CLUSDB file was corrected, it was applied to the first node only for testing (all 3 nodes had their Cluster Service stopped). Using a few command line switches, the first node could be brought up by itself without any resource groups coming online. There are a few useful switches for starting the service in a controlled fashion.
net start clussvc /ips /fq
The /IPS switch instructs the cluster service to not attempt to bring any non-core resource groups online. The /FQ switch instructs the node to form a quorum despite no connectivity to the other offline nodes.
Ideally, however, a working CLUSDB file from an existing system state backup should have been used to correct this issue.
It was an interesting problem. Trying to recover clusters where valid configuration backups are inexplicably unavailable is always a fun challenge. My advice is always the same, however: when you’re performing any work on a cluster, do make sure it’s in line with the processes and procedures recommended by the application and cluster software vendors.