Trouble Connecting to Cluster Nodes? Check WMI!

A frequent cluster network connection issue occurs when the cluster cannot use WMI.  WMI (Windows Management Instrumentation) is an interface through which Windows components provide information and notifications to each other, often between remote computers.  Failover Clustering and System Center Virtual Machine Manager (SCVMM) often use WMI to communicate between cluster nodes, so if there is an issue contacting a cluster node, WMI may be the culprit.  We use WMI in most of our wizards, such as ‘Create Cluster Wizard’, ‘Validate a Configuration Wizard’, and ‘Add Node Wizard’, so any of the following messages and warnings could be due to WMI issues:

·         “RPC Server Unavailable” error.

·         Access is Denied.

·         The computer ‘Node1’ could not be reached.

·         Failed to retrieve the maximum number of nodes for ‘{0}’.

·         The computer ‘Node1.contoso.com’ does not have the Failover Clustering feature installed.  Use Server Manager to install the feature on this computer.

o   Note: first confirm you have installed the Failover Clustering feature on this node
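
When triaging a collected error string, it can help to check it against the messages listed above. The following is only a minimal sketch: the pattern list and the function name may_be_wmi_related are illustrative assumptions, and a match only suggests that WMI is worth checking, not that it is the cause.

```python
import re

# Patterns written from the messages listed above; the node name varies,
# so it is matched loosely. This list is illustrative, not exhaustive.
WMI_RELATED_PATTERNS = [
    r"RPC Server Unavailable",
    r"Access is Denied",
    r"The computer .+ could not be reached",
    r"Failed to retrieve the maximum number of nodes for .+",
    r"The computer .+ does not have the Failover Clustering feature installed",
]


def may_be_wmi_related(message):
    """Return True if the message resembles one of the listed WMI symptoms."""
    return any(
        re.search(pattern, message, re.IGNORECASE)
        for pattern in WMI_RELATED_PATTERNS
    )
```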

 

 

Troubleshooting Steps

Follow this series of troubleshooting steps so that you can continue connecting to your cluster.

 

1) Ensure it is not a DNS Issue

It is possible that the reason you cannot contact the other servers is due to a DNS issue.  Before troubleshooting WMI, try connecting to that cluster, node or server using these methods when prompted by the cluster:

a)      Network Name for the cluster or node

a.       Example: MyNode

b)      FQDN for the cluster or node

a.       Example: MyNode.contoso.com

c)       IP Address for the cluster or node

a.       Example: 10.10.10.123

d)      Some wizard pages have a ‘browse’ button which allows you to find other clusters in the domain through Active Directory
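
If you would rather script the name checks above than type each form into the wizard, a rough sketch along these lines may help. It assumes Python is available on the machine you are testing from; the function name check_names is hypothetical, and the names passed in would be your own network name, FQDN and IP address.

```python
import socket


def check_names(candidates, resolver=socket.getaddrinfo):
    """Try to resolve each candidate name.

    Returns a dict mapping each candidate to a sorted list of addresses,
    or to None when resolution fails. The resolver is injectable so the
    logic can be exercised without touching a real DNS server.
    """
    results = {}
    for name in candidates:
        try:
            infos = resolver(name, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
            continue
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string.
        results[name] = sorted({info[4][0] for info in infos})
    return results


# Example call (performs real lookups, so it is left commented out):
# print(check_names(["MyNode", "MyNode.contoso.com", "10.10.10.123"]))
```

If the short name fails but the FQDN or IP address resolves, the problem is more likely name resolution than WMI.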

 

 

2) Check that WMI is Running on the Node

Windows Server Failover Clustering supports PowerShell, and earlier versions also come with a lightweight WMI client (WBEMTest).  Using either PowerShell or WBEMTest you can confirm that WMI is up and running.  Although you can use WMI remotely, it is better to test this directly on the server to ensure that no other networking or firewall issues are affecting the connection.

 

WMI Service

First check that the ‘Windows Management Instrumentation’ Service has started on each node by opening the Services console on that node. Also check that its Startup Type is set to Automatic.

 

 

Next we will check that Failover Clustering WMI (MSCluster) is running.  These tests would be applicable after the cluster has already been created since we are checking for cluster-specific WMI functionality. 

WBEMTest or directly on the server

·         Launch CMD

·         CMD > WBEMTest

·         The Windows Management Instrumentation Tester will launch

·         Select Connect

·         Namespace: Root\MSCluster

·         Select Connect

o   If you see more options available, it means you are connected and WMI is working

§  Feel free to try a query to confirm, such as selecting ‘Query’ and entering: SELECT * FROM MSCluster_Resource

o   If you see an error, there is a WMI issue

PowerShell or remotely from another node within the same cluster (2008 R2 or higher only)

·         Launch Elevated PowerShell

·         PS > Get-WmiObject MSCluster_ResourceGroup -ComputerName MyNode -Namespace "ROOT\MSCluster"

o   If you see a lot of information displayed, WMI is running

o   If you see an error, there is a WMI or firewall issue

 

 

3) Check your Firewall Settings

When a cluster is created, we automatically open up all the firewall settings you need.  However, enterprise security policies can make changes over time, so it is worth checking that the firewall on each server is allowing cluster communication.  WMI requires a DCOM connection to be made between the nodes, so you need to ensure that the ‘Remote Administration’ setting is enabled on every cluster node.  This can be done through the Windows Firewall GUI or by running the elevated command: CMD > netsh firewall set service RemoteAdmin enable.  You will see a variety of errors or warnings if your firewall is not properly configured.  For more information about how WMI uses the firewall and troubleshooting firewall issues, visit: http://msdn.microsoft.com/en-us/library/aa389286(VS.85).aspx.

 

 

4) Reboot the Node

This can often fix intermittent issues.  Follow best practices when rebooting the server, such as live migrating VMs and gracefully failing over other services and applications to reduce downtime.  Only do this if the other troubleshooting attempts described above have failed.

 

 

5) Rebuild a Corrupt WMI Repository

If you continue to see errors after confirming that WMI is running and the firewall is properly configured, and after rebooting, it is possible that your WMI repository has become corrupt so the cluster can no longer read from it.  The following steps will enable you to rebuild your repository so that the other nodes can read from it again.  Rebuilding the repository should be your last troubleshooting step, not your first.

 

·         In the Services console, manually stop the WMI service to ensure that dependent services are stopped

·         Start WMI service again

·         Launch an elevated CMD or PowerShell

·         CMD/PS > winmgmt /ResetRepository

 

 

6) Patch WMI for Performance Improvements

Your initial connection problems should now be fixed.  If you continue to experience intermittent connection issues caused by WMI, it could be due to the performance of your servers.  We have released a hotfix for 2008 R2 which improves the speed at which WMI queries return, and it is optimized for the most common WMI calls which SCVMM makes.  Get it here: http://support.microsoft.com/kb/974930.

 

 

Good luck in resolving your cluster connection issues with WMI!

Source: https://blogs.msdn.microsoft.com/clustering/2010/11/23/trouble-connecting-to-cluster-nodes-check-wmi/

How to Troubleshoot Create Cluster Failures

In this blog, I will outline the steps in order to troubleshoot “Create Cluster” failures with Windows Server 2012 or later Failover Clustering.


Step 1: Run the Cluster Validation Tool

The cluster validation tool runs a suite of tests to verify that your hardware and settings are compatible with failover clustering. The first thing to do when troubleshooting, and something you should do every time you create a cluster, is to run the Validate tool. To run cluster validation:

  1. Open the Failover Cluster Manager snap-in (CluAdmin.msc)
  2. Select Validate Cluster.
    Note: You can also use the Failover Clustering Windows PowerShell® cmdlet, Test-Cluster, to validate your cluster.
  3. Navigate to the C:\Windows\Cluster\Reports directory and open the Validation Report .MHT file (.HTM in Windows Server 2016)
  4. Review any tests that report as Failed or Warning.

The validation summary provides a starting point to drill down further into the failure.  For instance, the summary can reveal an invalid Windows Firewall configuration.

It is also useful to investigate the warnings flagged by Validate. For example, a warning from the Active Directory Configuration test can flag a potential cluster creation problem.


Step 2: Analyze the CreateCluster Log

If you cannot successfully create a cluster even though all your validation tests pass, the next step is to examine the CreateCluster.mht file. This file is created during the cluster creation process through the “Create Cluster” wizard in Failover Cluster Manager or the New-Cluster Failover Clustering Windows PowerShell® cmdlet. The file can be found in the following location: C:\Windows\Cluster\Reports\CreateCluster.mht
Note:  In Windows Server 2016 the report is changed from .MHT to .HTM

The admin level logging in the CreateCluster.mht file can help you determine the step at which the cluster creation process failed. For example, the entries in CreateCluster.mht may show that there was a problem with configuring a Cluster Name Object for the cluster.


Step 3: Turn on Cluster API Debug Tracing

If you are unable to pinpoint the root cause of the failure from either the Validate report or the Create Cluster log, then verbose debug logging can be enabled. Debug tracing can be turned on with the following steps:

  1. Open Event Viewer (eventvwr.msc)
  2. Click View then “Show Analytic and Debug Logs”
  3. Browse down to Applications and Services Logs \ Microsoft \ Windows \ FailoverClustering-Client \ Diagnostic
  4. Right-click on Diagnostic and select “Enable Log”
  5. Attempt to create a cluster
  6. Right-click on Diagnostic and select “Disable Log”
    Note: The debug tracing will be generated to the Diagnostic channel and viewable only after you disable logging.
  7. Left-click on Diagnostic to view the logging captured.

For example, events are generated to the Diagnostic channel when cluster creation fails because the Cluster Name Object cannot be added to the clusterou container. In this case, the cluster administrator does not have the Read All Properties permission on the organizational unit (OU) in Active Directory.


Step 3b: Turn on Cluster API Event Log Tracing Programmatically

You can also turn on the Cluster API event log tracing programmatically. The debug information obtained will be the same as Step 3 but you are able to set this up using a script. The following are the steps to configure:

  1. Run to start the logging:
    logman start clusapiLogs -p {a82fda5d-745f-409c-b0fe-18ae0678a0e0} -o clusapi.etl -ets
  2. Attempt to create a cluster
  3. Run to stop the logging:
    logman stop clusapiLogs -ets
  4. Run to generate the dump file:
    tracerpt clusapi.etl -of CSV -o c:\report.csv
  5. Open the generated Comma Separated Value (CSV) dump file and examine the User Data column for potential issues. Note that the ‘-o’ parameter determines where the CSV dump file is generated.

The following are some examples of Cluster API event log traces found for a “create cluster” failure.

CreateCluster: Create cluster test-33 will be using a Read-Write DC \\VM1.subhattcluster.com.
CreateClusterNameCOIfNotExists: Failed to create computer object test-33 on DC \\VM1.subhattcluster.com with OU ou=clusterou
CreateCluster: Create cluster failed with exception. Error = 8202
msg: Failed to create cluster name test-33 on DC \\VM1.subhattcluster.com. Error 8202.
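
Scanning the CSV dump by eye gets tedious on a long trace. The sketch below shows one way to pull out suspicious rows. It assumes the report has a header row with a column literally named "User Data" (as step 5 above suggests) and that the words "failed" or "error" mark interesting entries. Both are assumptions to adjust against your own report, and find_failures is a hypothetical helper name.

```python
import csv


def find_failures(lines, column="User Data", keywords=("failed", "error")):
    """Return the text of every row whose chosen column mentions a keyword.

    `lines` is any iterable of CSV lines (an open file or a list of strings).
    The column name and keywords are assumptions, not a fixed format.
    """
    # skipinitialspace tolerates a space after each comma in the header/rows.
    reader = csv.DictReader(lines, skipinitialspace=True)
    hits = []
    for row in reader:
        text = row.get(column) or ""
        if any(keyword in text.lower() for keyword in keywords):
            hits.append(text.strip())
    return hits
```

To use it on a real report, open the CSV file with the encoding your tracerpt output uses and pass the file object as `lines`.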

Step 4: Generate the Cluster.log file

The cluster log provides verbose logging for the cluster service and allows advanced troubleshooting. The cluster log can be generated even when the cluster creation fails by specifying the node to collect the log on. You can generate the cluster log using the Failover Clustering Windows PowerShell® cmdlet Get-ClusterLog:

Get-ClusterLog -Node <CreateClusterNode>

Note:  The default verbosity level for the cluster log is 3. This proves to be sufficient for most debugging purposes. However, if this verbosity level is not capturing the data you need, you can increase the verbosity level.  On a Windows PowerShell® console run:

(Get-Cluster).ClusterLogLevel = 5

This generates a large volume of log output, so the default level should be restored once the troubleshooting is completed.

The cluster log can be generated in local time using Failover Clustering Windows PowerShell®:

Get-ClusterLog -UseLocalTime

Bonus Tip:

The number one reason for create cluster failures is misconfigured permissions in Active Directory environments, which result in failures while creating the Cluster Name Object (CNO).

Review: “How to Create a Cluster in a Restrictive Active Directory Environment”

“Failover Cluster Step-by-Step Guide: Configuring Accounts in Active Directory”

Did you really review the links above? Here’s a quick test… How would you fix the following “Create Cluster” errors?

1.       An enabled computer account (object) for <cno> was found.

Answer:

1.       Verify that the cluster name you are attempting to use for the new cluster is not already being used by a cluster in production. If it is, you should choose another name for the cluster.  In other words, you need to ensure that you can take over the computer name with no adverse repercussions.

2.       On the Domain Controller, launch the Active Directory Users and Computers snap-in (type dsa.msc)

3.       Navigate to the OU that has the cluster name you are trying to use. In this case you are searching for “Test-8”. You might have to search multiple OUs to find the conflicting cluster name.

4.       Delete the existing Cluster Name Object (CNO), “Test-8”, or disable it by right-clicking on the CNO and selecting Disable.

 

2.       You do not have permissions to create a computer account (object) in Active Directory

Answer:

1.       On the Domain Controller, launch the Active Directory Users and Computers snap-in (type dsa.msc)

2.       On the View menu, make sure that Advanced Features is selected

3.       Navigate to the OU you are trying to create your Cluster Name Object (CNO) in. By default this will be the same OU as that of the node you are trying to create a cluster from.

4.       Right-click on the OU and select Properties and then the Security tab.

5.       Ensure that the Cluster Administrator has Create all child objects permissions

6.       Select the Advanced tab

7.       Click Add, type the name of the cluster administrator account for the Principal

8.       In the Permission container dialog box, locate the Create Computer objects and Read All Properties permissions, and make sure that the Allow check box is selected for each one.

A final note: In this blog I have focused on “Create Cluster” failures. However, the same troubleshooting steps can also be used for “Add node” failures (failures encountered while adding a node to a cluster).

Source: https://blogs.msdn.microsoft.com/clustering/2012/05/07/how-to-troubleshoot-create-cluster-failures-in-windows-server-2012/

SMB 3.0 on Clusters

Enabling server application storage on file shares

To enable server applications to store their live data on file shares, there are two requirements. First, the server role or application needs to support it. This includes updating the application to support UNC paths (\\server\share\file.vhd) in its setup and management tools, as well as fully testing the applications in the use cases in this scenario. In Microsoft SQL Server 2008 R2, there is added support for storing SQL user databases on SMB file shares. Microsoft SQL Server 2012 adds support for the SQL system database, as well as configuring SQL Server as a cluster. As demonstrated at the //BUILD conference, Windows Server “8” has also added support for storing virtual machines files on SMB file shares.

Second, the file server itself needs to support allowing server applications to store their data on file shares. During Windows Server “8” customer planning engagements, we identified the following top-level requirements for the file server to support storing server applications:

Continuous availability. Server applications expect storage to always be available and in general, do not handle input/output (I/O) errors or unexpected closures of file handles well. These types of events may cause virtual machines to crash because the virtual machine can no longer write to its disk or cause databases to go offline. Customers commonly deploy hardware redundancy, such as multiple network adapters, network switches, and Windows cluster configurations to mitigate hardware outages. While such configurations allow the file server to quickly recover from a failure, the recovery is not transparent to the application and virtual machines must be restarted and databases brought online. The Windows Server “8” file server solution must be able to quickly and transparently recover from network or node failures, with no downtime or administrator intervention required.

Performance. Some server roles, such as Hyper-V and SQL Server, are very sensitive to storage performance, including bandwidth, latency and I/O per second (IOPS). It is also important to ensure that CPU consumption when accessing storage is kept to a minimum to provide as much CPU time to the application as possible. Finally, server applications tend to have an access pattern that is very different than that of user applications. Where user applications mostly read or write a file in full, server applications tend to append or update existing data. The Windows Server “8” file server solution must be able to deliver storage bandwidth to server applications almost equivalently to that of multiple 10Gbps Ethernet network or Infiniband adapters with latency, IOPS, and CPU consumption rivaling that of Fibre Channel.

Scalability. The configurations for a Windows file server cluster are often deployed in active-passive configurations, which leaves at least one node unused. A workaround is to configure multiple file server instances in a cluster. This allows you to use all of the hardware in the cluster. However, this requires additional administration and the bandwidth available for a share is still limited to the bandwidth available on the node where it is currently online. The Windows Server “8” file server must be able to support active-active configurations where a share can be accessed through any node, increasing the maximum bandwidth to the aggregate of the cluster nodes and simplifying administration.

Data protection. Another key ability is creation of application-consistent shadow copies of the data for backup purposes. In Windows, this is usually accomplished using the Volume Shadow Copy Service (VSS) infrastructure. VSS, in its current form, only supports local storage. The Windows “8” file server solution must be able to support application consistent shadow copies through full integration with VSS and with minimal impact on existing VSS requestors, writers, and providers.

As you can see, this is quite a demanding list of requirements. However, we agreed that we needed to address all of them to provide a reliable, available, and serviceable file server with great performance for server application storage.

Features overview

Supporting server application storage on file shares in Windows Server “8” was a major decision for the product team. Several features were introduced specifically to make sure file storage could meet or exceed the requirements commonly applied to block storage, without losing file storage’s inherent benefits in ease of management and cost effectiveness. This also required the introduction of a new version of SMB, which is Windows’ main remote file protocol. These new capabilities include:

SMB Transparent Failover: Enables administrators to perform hardware or software maintenance of nodes in a clustered file server without interrupting server applications storing data on these file shares. Also, if a hardware or software failure occurs on a cluster node, this feature enables SMB clients to transparently reconnect to another cluster node without interrupting server applications that are storing data on these file shares. This is achieved regardless of the type of operation that is under way when the failure occurs. For block-based storage, this is the equivalent of having a multi-controller storage array.

SMB Multichannel: Enables you to simultaneously use multiple connections and network interfaces with two main benefits: increased throughput and fault tolerance. For instance, if you have four 10GbE interfaces on both the SMB client and server, you can simultaneously use them to effectively achieve 40Gbps throughput from the four 10Gbps network adapters. In the event that one of the network adapters or cables fails, your SMB client will continue to use the network uninterrupted, at a lower throughput. Best of all, this is achieved without additional configuration steps. You only need to configure the multiple network interfaces as you normally would.

SMB Direct: One of the main advantages of Fibre Channel block storage is the ability to have low latency and fast, offloaded data transfers. To match that in the file server world, SMB introduces support for network adapters that have RDMA capability and can function at full speed with very low latency, while using very little CPU. When using one of three RDMA technologies (Infiniband, iWARP or RoCE), the SMB client has a low CPU overhead, which is comparable to Fibre Channel, and saves CPU cycles for the main workload on the box, such as Hyper-V or SQL Server. Best of all, these network interfaces are detected and function without requiring additional SMB configuration steps. If RDMA interfaces are available, they will be automatically used.

SMB Scale-Out: Taking advantage of Cluster Shared Volume (CSV) version 2, administrators can create file shares that provide simultaneous access to data files, with direct I/O, through all nodes in a file server cluster. This means that the maximum file serving capacity for a given share is no longer limited by the capacity of a single cluster node, but rather the aggregate bandwidth across the cluster. Also, this active-active configuration lets you balance the load across cluster nodes by moving file server clients without any service interruption. Finally, SMB Scale-Out simplifies the management of clustered file servers and file shares.

VSS for SMB File Shares: The ability to create application-consistent snapshots of the server application data is critical to backing up the data. In Windows, this is accomplished using the Volume Shadow Copy Service (VSS) infrastructure. VSS for SMB file shares extends the VSS infrastructure to perform application-consistent shadow copies of data stored on remote SMB file shares for backup and restore purposes. In addition, VSS for SMB file shares enable backup applications to read the backup data directly from a shadow copy file share rather than involving the server application computer during the data transfer. Because this feature leverages the existing VSS infrastructure, it is easy to integrate with existing VSS-aware backup software and VSS-aware applications, such as Hyper-V.

SMB-specific Windows PowerShell cmdlets: Managing file shares is now accomplished using either the new Windows Server Manager GUI supporting file server clusters, which includes several profiles for creating SMB shares, or using the all new SMB Windows PowerShell cmdlets, which use the familiar Windows PowerShell infrastructure for command-line and scripting. This complete new set of Windows PowerShell version 3 cmdlets was created to manage file shares, file share permissions, client mappings, server configuration, and client configuration. There is also an extensive set of cmdlets to monitor sessions, open files, connections, network interfaces, and multichannel connections. These cmdlets are built upon a standards-based management protocol using WMIv2 classes that allow developers, on Windows and Linux, to create automated solutions for file server configuration and monitoring.

SMB Performance Counters: In the application server world, storage performance is paramount, as is the ability to measure it. With that in mind, Windows Server “8” includes server and client performance counters that allow administrators to easily look into the key metrics for file storage, including IOPS, latency, queue depth, and throughput. These counters match the familiar block storage performance counters, making it simple to leverage your existing skills and guidance for storage performance for Windows Server.

Performance: Performance was also a key area of focus in SMB. In addition to making the large maximum transmission unit (large MTU) enabled by default, there was a significant amount of work to optimize performance for different kinds of workloads, covering both small and large I/O, and both sequential and random access. These optimizations were developed while investigating typical end-to-end workloads, such as online transaction processing, data warehousing, virtual web servers in a private cloud, virtual desktop infrastructure, and consolidated home folders. These investigations led to specific improvements in many areas of the operating system.
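
The SMB Multichannel example above (four 10GbE interfaces giving roughly 40Gbps, with reduced throughput if one fails) is simple arithmetic on nominal link speeds. The sketch below just makes that arithmetic explicit. The function name aggregate_gbps is hypothetical, and the figures are nominal upper bounds, not measured throughput.

```python
def aggregate_gbps(link_speeds_gbps, failed=()):
    """Sum the nominal speed of every link whose index is not in `failed`."""
    return sum(
        speed
        for index, speed in enumerate(link_speeds_gbps)
        if index not in failed
    )


links = [10, 10, 10, 10]                   # four 10GbE interfaces
print(aggregate_gbps(links))               # 40
print(aggregate_gbps(links, failed={2}))   # 30, one adapter or cable down
```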

Let us take a closer look at SMB Transparent Failover. SMB Transparent Failover requires:

  • A failover cluster running Windows Server “8” with at least two cluster nodes and configured with the file server role. The cluster must pass the cluster validation tests in “Validate a Configuration Wizard”.
  • File shares created with the continuous availability property, which is the default setting for clustered file shares.
  • Computers accessing the clustered file shares must be running Windows “8” Consumer Preview or Windows Server “8”.

When the SMB client initially connects to the file share, the client determines whether the file share has the continuous availability property set. If it does, this means the file share is a clustered file share and supports SMB transparent failover. When the SMB client subsequently opens a file on the file share on behalf of the application, it requests a persistent file handle. When the SMB server receives a request to open a file with a persistent handle, the SMB server interacts with the Resume Key filter to persist sufficient information about the file handle, along with a unique key (resume key) supplied by the SMB client, to stable storage. The SMB client uses the resume key to reference the handle during a resume operation after a failover. To protect against data loss from writing data into an unstable cache, persistent file handles are always opened with write through.

If a failure occurs on the file server cluster node to which the SMB client is connected, the SMB client attempts to reconnect to another file server cluster node. Once the SMB client successfully reconnects to another node in the cluster, the SMB client starts the resume operation using the resume key. When the SMB server receives the resume key, it interacts with the Resume Key filter to recover the handle state to the same state it was prior to the failure with end-to-end support (SMB client, SMB server and Resume Key filter) for operations that can be replayed, as well as operations that cannot be replayed. The application running on the SMB client computer does not experience any failures or errors during this operation. From an application perspective, it appears the I/O operations are stalled for a small amount of time.
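
To make the resume flow above more concrete, here is a toy model in Python. It is not the SMB protocol or any Windows API: the class names (StableStore, FileServerNode, SmbClient) are invented for illustration, and a shared dictionary stands in for the stable storage used by the Resume Key filter.

```python
import uuid


class StableStore:
    """Stands in for stable storage shared by the cluster nodes."""

    def __init__(self):
        self.handles = {}


class FileServerNode:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.alive = True

    def open_persistent(self, path, resume_key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        # Persist enough handle state, keyed by the client's resume key.
        # Persistent handles are always opened with write through.
        self.store.handles[resume_key] = {"path": path, "write_through": True}

    def resume(self, resume_key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        # Recover the handle state saved before the failure.
        return self.store.handles[resume_key]


class SmbClient:
    def __init__(self, nodes):
        self.nodes = nodes
        self.current = nodes[0]
        self.resume_key = None

    def open(self, path):
        self.resume_key = str(uuid.uuid4())  # unique key supplied by the client
        self.current.open_persistent(path, self.resume_key)

    def reconnect_and_resume(self):
        # Reconnect to any surviving node and resume using the same key.
        for node in self.nodes:
            if node.alive:
                self.current = node
                return node.resume(self.resume_key)
        raise ConnectionError("no cluster node available")
```

The point of the model is that the handle state lives outside the failed node, so any surviving node can restore it from the key alone.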

It is very important to keep the number of I/O stalls during a failover to a minimum. Since SMB is sitting on top of TCP/IP, the SMB client would normally rely on TCP timeout to determine if the file server cluster node has failed. However, relying on TCP timeouts can lead to fairly long I/O stalls, since each timeout is typically ~20 seconds. SMB Witness was created to enable faster recovery from unplanned failures, allowing the SMB client to not have to wait for a TCP timeout. SMB Witness is a new service that is installed automatically with the failover clustering feature. When the SMB client initially connects to a file server cluster node, the SMB client notifies the SMB Witness client, which is running on the same computer. The SMB Witness client obtains a list of cluster members from the SMB Witness service running on the file server cluster node. The SMB Witness client picks a different cluster member and issues a registration request to the SMB Witness service on that cluster member.

If an unplanned failure occurs on the file server cluster node, the SMB Witness service on the other cluster member receives a notification from the cluster service. The SMB Witness service also notifies the SMB Witness client, which in turn notifies the SMB client that the file server cluster node has failed. Upon receiving the SMB Witness notification, the SMB client immediately starts reconnecting to a different file server cluster node, which significantly speeds up recovery from unplanned failures.

Installing Failover Cluster feature using Windows PowerShell

It is important to note that you must run these cmdlets in a PowerShell console that is opened with elevated privileges, which means opening it with the “Run as Administrator” option.

The following cmdlet will install the Failover Clustering feature and the management tools.
Note: If you do not specify the -IncludeManagementTools switch, the Failover Cluster Manager and PowerShell cmdlets for cluster will not be installed.

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

You can use the -ComputerName parameter to install the features on other servers without having to log into them.  Here is an example of the cmdlet to install the failover cluster feature and tools on a specified server, in this case “Foo”:

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools -ComputerName Foo

If you would like to find the list of features and the names to specify in the Install-WindowsFeature cmdlet, you can use this cmdlet:

Get-WindowsFeature

Wildcards can be helpful to narrow down the returned set of features:

Get-WindowsFeature Failover*

The Get-WindowsFeature Failover* cmdlet will return the feature, but not the tools.  To get the tools you can use the following:

Get-WindowsFeature RSAT-Cluster*

 

Understand the components of PAM

Privileged Access Management keeps administrative access separate from day-to-day user accounts. This solution relies on parallel forests:

  • CORP: Your general-purpose corporate forest that includes one or more domains. While you may have multiple CORP forests, the examples in these articles assume a single forest with a single domain for simplicity.
  • PRIV: A dedicated forest created especially for this PAM scenario. This forest includes one domain to accommodate privileged groups and accounts which are shadowed from one or more CORP domains.

The MIM solution as configured for PAM includes the following components:

  • MIM Service: implements business logic for performing identity and access management operations, including privileged account management and elevation request handling.
  • MIM Portal: a SharePoint-based portal, hosted by SharePoint 2013, which provides an administrator management and configuration UI.
  • MIM Service Database: stored in SQL Server 2012 or 2014, and holds identity data and meta-data required for MIM Service.
  • PAM Monitoring Service and PAM Component Service: two services that manage the lifecycle of privileged accounts and assist the PRIV AD in group membership lifecycle.
  • PowerShell cmdlets: for populating MIM Service and PRIV AD with users and groups that correspond to the users and groups in the CORP forest for PAM administrators, and for end users requesting just-in-time (JIT) use of privileges on an administrative account.
  • PAM REST API and sample portal: for developers integrating MIM in the PAM scenario with custom clients for elevation, without needing to use PowerShell or SOAP. The use of the REST API is demonstrated with a sample web application.

Once installed and configured, each group created by the migration procedure in the PRIV forest is a shadow SIDHistory-based security group (or in a later update with Windows Server vNext, a foreign principal group) mirroring the SID group in the original CORP forest. Furthermore, when the MIM Service adds members to these groups in the PRIV forest, those memberships will be time limited.

As a result, when a user requests elevation using the PowerShell cmdlets, and their request is approved, the MIM Service will add their account in the PRIV forest to a group in the PRIV forest. When the user logs in with their privileged account, their Kerberos token will contain a Security Identifier (SID) identical to the SID of the group in the CORP forest. Since the CORP forest is configured to trust the PRIV forest, the elevated account being used to access a resource in the CORP forest appears, to a resource checking the Kerberos group memberships, to be a member of that resource’s security groups. This is provided via Kerberos cross-forest authentication.

Furthermore, these memberships are time limited so that after a preconfigured interval of time, the user’s administrative account will no longer be part of the group in the PRIV forest. As a result, that account will no longer be usable for accessing additional resources.
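
The mechanism described above can be illustrated with a small toy model. This is only a sketch of the idea, not how MIM or Active Directory implement it: the class and function names (ShadowGroup, token_sids, can_access), the SID string, and the account name are all hypothetical and chosen for illustration. It shows a PRIV-forest shadow group that carries a CORP group's SID, a time-limited membership, and a CORP resource that checks the SIDs presented at logon.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ShadowGroup:
    """Toy PRIV-forest group that carries the SID of a CORP-forest group."""
    name: str
    corp_sid: str  # hypothetical SID string, for illustration only
    # Maps account name -> time at which the membership expires.
    members: dict = field(default_factory=dict)

    def add_member(self, account, now, ttl):
        self.members[account] = now + ttl

    def has_member(self, account, now):
        expires = self.members.get(account)
        return expires is not None and now < expires


def token_sids(account, groups, now):
    """Collect the SIDs a logon at time `now` would carry for `account`."""
    return {g.corp_sid for g in groups if g.has_member(account, now)}


def can_access(resource_acl, sids):
    """A CORP resource grants access if any presented SID is in its ACL."""
    return bool(resource_acl & sids)


start = datetime(2024, 1, 1, 9, 0)
admins = ShadowGroup("PRIV.CorpAdmins", corp_sid="S-1-5-21-1111-2222-3333-1105")
admins.add_member("PRIV\\jen", now=start, ttl=timedelta(hours=1))

acl = {"S-1-5-21-1111-2222-3333-1105"}
during = can_access(acl, token_sids("PRIV\\jen", [admins], start + timedelta(minutes=30)))
after = can_access(acl, token_sids("PRIV\\jen", [admins], start + timedelta(hours=2)))
print(during, after)  # True False
```

Within the one-hour window the account presents the CORP group's SID and is granted access; after the window the same logon carries no matching SID.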

Source: https://docs.microsoft.com/en-us/microsoft-identity-manager/pam/principles-of-operation

Privileged Access Management for Active Directory Domain Services

Privileged Access Management (PAM) is a solution that is based on Microsoft Identity Manager (MIM), Windows Server 2012 R2, and Windows Server Technical Preview. It helps organizations restrict privileged access within an existing Active Directory environment.

NOTE

PAM is an instance of Privileged Identity Management (PIM) that is implemented using Microsoft Identity Manager (MIM).

Privileged Access Management accomplishes two goals:

  • Re-establish control over a compromised Active Directory environment by maintaining a separate bastion environment that is known to be unaffected by malicious attacks.
  • Isolate the use of privileged accounts to reduce the risk of those credentials being stolen.

What problems does PAM help solve?

A real concern for enterprises today is resource access within an Active Directory environment. Particularly troubling is news about vulnerabilities, unauthorized privilege escalations, and other types of unauthorized access including pass-the-hash, pass-the-ticket, spear phishing, and Kerberos compromises.

Today, it’s too easy for attackers to obtain Domain Admins account credentials, and it’s too hard to discover these attacks after the fact. The goal of PAM is to reduce opportunities for malicious users to get access, while increasing your control and awareness of the environment.

PAM makes it harder for attackers to penetrate a network and obtain privileged account access. PAM adds protection to privileged groups that control access across a range of domain-joined computers and applications on those computers. It also adds more monitoring, more visibility, and more fine-grained controls so that organizations can see who their privileged administrators are and what they are doing. PAM gives organizations more insight into how administrative accounts are used in the environment.

How is PAM set up?

PAM builds on the principle of just-in-time administration, which relates to just enough administration (JEA). JEA is a Windows PowerShell toolkit that defines a set of commands for performing privileged activities and an endpoint where administrators can get authorization to run those commands. In JEA, an administrator decides that users with a certain privilege can perform a certain task. Every time an eligible user needs to perform that task, they enable that permission. The permissions expire after a specified time period, so that a malicious user can’t steal the access.

PAM setup and operation have four steps.

PAM steps: prepare, protect, operate, monitor - diagram

  1. Prepare: Identify which groups in your existing forest have significant privileges. Recreate these groups without members in the bastion forest.

  2. Protect: Set up lifecycle and authentication protection, such as Multi-Factor Authentication (MFA), for when users request just-in-time administration. MFA helps prevent programmatic attacks from malicious software or following credential theft.

  3. Operate: After authentication requirements are met and a request is approved, a user account gets added temporarily to a privileged group in the bastion forest. For a pre-set amount of time, the administrator has all privileges and access permissions that are assigned to that group. After that time, the account is removed from the group.

  4. Monitor: PAM adds auditing, alerts, and reports of privileged access requests. You can review the history of privileged access, and see who performed an activity. You can decide whether the activity is valid or not and easily identify unauthorized activity, such as an attempt to add a user directly to a privileged group in the original forest. This step is important not only to identify malicious software but also for tracking "inside" attackers.

How does PAM work?

PAM is based on new capabilities in AD DS, particularly for domain account authentication and authorization, and new capabilities in Microsoft Identity Manager. PAM separates privileged accounts from an existing Active Directory environment. When a privileged account needs to be used, it first needs to be requested, and then approved. After approval, the privileged account is given permission via a foreign principal group in a new bastion forest rather than in the current forest of the user or application. The use of a bastion forest gives the organization greater control, such as when a user can be a member of a privileged group, and how the user needs to authenticate.

Active Directory, the MIM Service, and other portions of this solution can also be deployed in a high availability configuration.

The following example shows how PIM works in more detail.

PIM process and participants - diagram

The bastion forest issues time-limited group memberships, which in turn produce time-limited ticket-granting tickets (TGTs). Kerberos-based applications or services can honor and enforce these TGTs, if the apps and services exist in forests that trust the bastion forest.
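
As a rough illustration of how a time-limited membership could translate into a time-limited TGT, the sketch below caps a ticket's lifetime at the shortest remaining membership time. This is an assumption made for illustration: the function name tgt_lifetime, the 10-hour default, and the capping rule itself are not taken from this article.

```python
from datetime import datetime, timedelta

DEFAULT_TGT_LIFETIME = timedelta(hours=10)  # assumed domain policy value


def tgt_lifetime(now, membership_expiries, default=DEFAULT_TGT_LIFETIME):
    """Return the ticket lifetime, capped by the soonest-expiring membership."""
    # Only memberships that are still valid at `now` constrain the ticket.
    remaining = [exp - now for exp in membership_expiries if exp > now]
    return min([default, *remaining])


now = datetime(2024, 1, 1, 9, 0)
expiries = [now + timedelta(minutes=45), now + timedelta(hours=3)]
lifetime = tgt_lifetime(now, expiries)
print(lifetime)  # 0:45:00
```

With no time-limited memberships the function falls back to the assumed default lifetime.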

Day-to-day user accounts do not need to move to a new forest. The same is true with the computers, applications, and their groups. They stay where they are today in an existing forest. Consider the example of an organization that is concerned with these cybersecurity issues today, but has no immediate plans to upgrade the server infrastructure to the next version of Windows Server. That organization can still take advantage of this combined solution by using MIM and a new bastion forest, and can better control access to existing resources.

PAM offers the following advantages:

  • Isolation/scoping of privileges: Users do not hold privileges on accounts that are also used for non-privileged tasks like checking email or browsing the Internet. Users need to request privileges. Requests are approved or denied based on MIM policies defined by a PAM administrator. Until a request is approved, privileged access is not available.

  • Step-up and proof-up: These are new authentication and authorization challenges to help manage the lifecycle of separate administrative accounts. The user can request the elevation of an administrative account and that request goes through MIM workflows.

  • Additional logging: Along with the built-in MIM workflows, there is additional logging for PAM that identifies the request, how it was authorized, and any events that occur after approval.

  • Customizable workflow: The MIM workflows can be configured for different scenarios, and multiple workflows can be used, based on the parameters of the requesting user or requested roles.

How do users request privileged access?

There are a number of ways in which a user can submit a request, including:

  • The MIM Service Web Services API
  • A REST endpoint
  • Windows PowerShell (New-PAMRequest)

What workflows and monitoring options are available?

As an example, let’s say a user was a member of an administrative group before PIM is set up. As part of PIM setup, the user is removed from the administrative group, and a policy is created in MIM. The policy specifies that if that user requests administrative privileges and is authenticated by MFA, the request is approved and a separate account for the user will be added to the privileged group in the bastion forest.

Assuming the request is approved, the Action workflow communicates directly with bastion forest Active Directory to put a user in a group. For example, when Jen requests to administer the HR database, the administrative account for Jen is added to the privileged group in the bastion forest within seconds. Her administrative account’s membership in that group will expire after a time limit. With Windows Server Technical Preview, that membership is associated in Active Directory with a time limit; with Windows Server 2012 R2 in the bastion forest, that time limit is enforced by MIM.
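
The policy in this example can be sketched as a simple decision function. This is a toy model under stated assumptions, not the MIM workflow engine: RolePolicy, Request, and evaluate are hypothetical names, and the role name and one-hour limit are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RolePolicy:
    role: str
    eligible_users: frozenset
    requires_mfa: bool
    ttl: timedelta  # how long an approved membership lasts


@dataclass(frozen=True)
class Request:
    user: str
    role: str
    mfa_passed: bool


def evaluate(request, policy, now):
    """Return the membership expiry time if approved, otherwise None."""
    if request.role != policy.role:
        return None
    if request.user not in policy.eligible_users:
        return None
    if policy.requires_mfa and not request.mfa_passed:
        return None
    return now + policy.ttl


policy = RolePolicy("HR-DB-Admins", frozenset({"jen"}), requires_mfa=True, ttl=timedelta(hours=1))
now = datetime(2024, 1, 1, 9, 0)
approved = evaluate(Request("jen", "HR-DB-Admins", mfa_passed=True), policy, now)
denied = evaluate(Request("jen", "HR-DB-Admins", mfa_passed=False), policy, now)
print(approved, denied)  # 2024-01-01 10:00:00 None
```

An approved request yields the time at which the membership should end; a request that fails the MFA requirement yields nothing and no membership is added.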

NOTE

When you add a new member to a group, the change needs to replicate to other domain controllers (DCs) in the bastion forest. Replication latency can affect users' ability to access resources. For more information about replication latency, see How Active Directory Replication Topology Works.

In contrast, an expired link is evaluated in real time by the Security Accounts Manager (SAM). Even though the addition of a group member needs to be replicated to the DC that receives the access request, the removal of a group member is evaluated instantaneously on any DC.
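
The asymmetry between additions and expirations can be shown with a toy model of two domain controllers. It is a sketch only: the DomainController class and replicate function are hypothetical and do not represent how SAM or AD replication actually work. An addition must be copied to a DC before that DC honors it, while an expiry is checked against the clock at evaluation time and needs no further replication.

```python
from datetime import datetime, timedelta


class DomainController:
    """Toy DC holding a local copy of group links: account -> expiry time."""

    def __init__(self, name):
        self.name = name
        self.links = {}

    def is_member(self, account, now):
        # Expiry is compared with the current time on every check,
        # so no replication event is needed to end a membership.
        expiry = self.links.get(account)
        return expiry is not None and now < expiry


def replicate(source, target):
    """Copy link values, including their expiry times, to another DC."""
    target.links.update(source.links)


dc1, dc2 = DomainController("DC1"), DomainController("DC2")
t0 = datetime(2024, 1, 1, 9, 0)
dc1.links["PRIV\\jen"] = t0 + timedelta(hours=1)  # the addition lands on DC1 only

before_repl = dc2.is_member("PRIV\\jen", t0 + timedelta(minutes=1))
replicate(dc1, dc2)
after_repl = dc2.is_member("PRIV\\jen", t0 + timedelta(minutes=5))
after_expiry = [dc.is_member("PRIV\\jen", t0 + timedelta(hours=2)) for dc in (dc1, dc2)]
print(before_repl, after_repl, after_expiry)  # False True [False, False]
```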

This workflow is specifically intended for these administrative accounts. Administrators (or even scripts) that need only occasional access to privileged groups can request precisely that access. MIM logs the request and the changes in Active Directory, and you can view them in Event Viewer or send the data to enterprise monitoring solutions such as System Center 2012 – Operations Manager Audit Collection Services (ACS), or other third-party tools.

Source: https://docs.microsoft.com/en-us/microsoft-identity-manager/pam/privileged-identity-management-for-active-directory-domain-services