How to Troubleshoot Create Cluster Failures
Enabling server application storage on file shares
To enable server applications to store their live data on file shares, there are two requirements. First, the server role or application needs to support it. This includes updating the application to support UNC paths (\\server\share\file.vhd) in its setup and management tools, as well as fully testing the applications in the use cases in this scenario. In Microsoft SQL Server 2008 R2, there is added support for storing SQL user databases on SMB file shares. Microsoft SQL Server 2012 adds support for the SQL system database, as well as configuring SQL Server as a cluster. As demonstrated at the //BUILD conference, Windows Server “8” has also added support for storing virtual machines files on SMB file shares.
Second, the file server itself needs to support allowing server applications to store their data on file shares. During Windows Server “8” customer planning engagements, we identified the following top-level requirements for the file server to support storing server applications:
– Continuous availability. Server applications expect storage to always be available and in general, do not handle input/output (I/O) errors or unexpected closures of file handles well. These types of events may cause virtual machines to crash because the virtual machine can no longer write to its disk or cause databases to go offline. Customers commonly deploy hardware redundancy, such as multiple network adapters, network switches, and Windows cluster configurations to mitigate hardware outages. While such configurations allow the file server to quickly recover from a failure, the recovery is not transparent to the application and virtual machines must be restarted and databases brought online. The Windows Server “8” file server solution must be able to quickly and transparently recover from network or node failures, with no downtime or administrator intervention required.
– Performance. Some server roles, such as Hyper-V and SQL Server, are very sensitive to storage performance, including bandwidth, latency and I/O per second (IOPS). It is also important to ensure that CPU consumption when accessing storage is kept to a minimum to provide as much CPU time to the application as possible. Finally, server applications tend to have an access pattern that is very different than that of user applications. Where user applications mostly read or write a file in full, server applications tend to append or update existing data. The Windows Server “8” file server solution must be able to deliver storage bandwidth to server applications almost equivalently to that of multiple 10Gbps Ethernet network or Infiniband adapters with latency, IOPS, and CPU consumption rivaling that of Fibre Channel.
– Scalability. The configurations for a Windows file server cluster are often deployed in active-passive configurations, which leaves at least one node unused. A workaround is to configure multiple file server instances in a cluster. This allows you to use all of the hardware in the cluster. However, this requires additional administration and the bandwidth available for a share is still limited to the bandwidth available on the node where it is currently online. The Windows Server “8” file server must be able to support active-active configurations where a share can be accessed through any node, increasing the maximum bandwidth to the aggregate of the cluster nodes and simplifying administration.
– Data protection. Another key ability is creation of application-consistent shadow copies of the data for backup purposes. In Windows, this is usually accomplished using the Volume Shadow Copy Service (VSS) infrastructure. VSS, in its current form, only supports local storage. The Windows “8” file server solution must be able to support application consistent shadow copies through full integration with VSS and with minimal impact on existing VSS requestors, writers, and providers.
As you can see, this is quite a demanding list of requirements. However, we agreed that we needed to address all of them to provide a reliable, available, and serviceable file server with great performance for se rver application storage.
Supporting server application storage on file shares in Windows Server “8” was a major decision for the product team. Several features were introduced specifically to make sure file storage could meet or exceed the requirements commonly applied to block storage, without losing file storage’s inherit benefits in ease of management and cost effectiveness. This also required the introduction of a new version of SMB, which is Window’s main remote file protocol. These new capabilities include:
SMB Transparent Failover: Enables administrators to perform hardware or software maintenance of nodes in a clustered file server without interrupting server applications storing data on these file shares. Also, if a hardware or software failure occurs on a cluster node, this feature enables SMB clients to transparently reconnect to another cluster node without interrupting server applications that are storing data on these file shares. This is achieved regardless of the type of operation that is under way when the failure occurs. For block-based storage, this is the equivalent of having a multi-controller storage array.
SMB Multichannel: Enables you to simultaneously use multiple connections and network interfaces with two main benefits: increased throughput and fault tolerance. For instance, if you have four 10GbE interfaces on both the SMB client and server, you can simultaneously use them to effectively achieve 40Gbps throughput from the four 10Gbps network adapters. In the event that one of the network adapters or cables fails, your SMB client will continue to use the network uninterrupted, at a lower throughput. Best of all, this is achieved without additional configuration steps. You only need to configure the multiple network interfaces as you normally would.
SMB Direct: One of the main advantages of Fibre Channel block storage is the ability to have low latency and fast, offloaded data transfers. To match that in the file server world, SMB introduces support for network adapters that have RDMA capability and can function at full speed with very low latency, while using very little CPU. When using one of three RDMA technologies (Infiniband, iWARP or RoCE), the SMB client has a low CPU overhead, which is comparable to Fibre Channel, and saves CPU cycles for the main workload on the box, such as Hyper-V or SQL Server. Best of all, these network interfaces are detected and function without requiring additional SMB configuration steps. If RDMA interfaces are available, they will be automatically used.
SMB Scale-Out: Taking advantage of Cluster Shared Volume (CSV) version 2, administrators can create file shares that provide simultaneous access to data files, with direct I/O, through all nodes in a file server cluster. This means that the maximum file serving capacity for a given share is no longer limited by the capacity of a single cluster node, but rather the aggregate bandwidth across the cluster. Also, this active-active configuration lets you balance the load across cluster nodes by moving file server clients without any service interruption. Finally, SMB Scale-Out simplifies the management of clustered file servers and file shares.
VSS for SMB File Shares: The ability to create application-consistent snapshots of the server application data is critical to backing up the data. In Windows, this is accomplished using the Volume Shadow Copy Service (VSS) infrastructure. VSS for SMB file shares extends the VSS infrastructure to perform application-consistent shadow copies of data stored on remote SMB file shares for backup and restore purposes. In addition, VSS for SMB file shares enable backup applications to read the backup data directly from a shadow copy file share rather than involving the server application computer during the data transfer. Because this feature leverages the existing VSS infrastructure, it is easy to integrate with existing VSS-aware backup software and VSS-aware applications, such as Hyper-V.
SMB-specific Windows PowerShell cmdlets: Managing file shares is now accomplished using either the new Windows Server Manager GUIsupporting file server clusters, which includes several profiles for creating SMB shares, or using the all new SMB Windows PowerShell cmdlets, which use the familiar Windows PowerShell infrastructure for command-line and scripting. This complete new set of Windows PowerShell version 3 cmdlets was created to manage file shares, file share permissions, client mappings, server configuration, and client configuration. There is also an extensive set of cmdlets to monitor sessions, open files, connections, network interfaces, and multichannel connections. These cmdlets are built upon a standards-based management protocol using WMIv2 classes that allow developers, on Windows and Linux, to create automated solutions for file server configuration and monitoring.
SMB Performance Counters: In the application server world, storage performance is paramount, as is the ability to measure it. With that in mind, Windows Server “8” includes server and client performance counters that allow administrators to easily look into the key metrics for file storage, including IOPs, latency, queue depth, and throughput. These counters match the familiar block storage performance counters, making it simple to leverage your existing skills and guidance for storage performance for Windows Server.
Performance: Performance was also a key area of focus in SMB. In addition to making the large maximum transmission unit (large MTU) enabled by default, there was a significant amount of work to optimize performance for different kinds of workloads, covering both small and large I/O, and both sequential and random access. These optimizations were developed while investigating typical end-to-end workloads, such as online transaction processing, data warehousing, virtual web servers in a private cloud, virtual desktop infrastructure, and consolidated home folders. These investigations led to specific improvements in many areas of the operating system.
Let us take a closer look at SMB Transparent Failover. SMB Transparent Failover requires:
- A failover cluster running Windows Server “8” with at least two cluster nodes and configured with the file server role. The cluster must pass the cluster validation tests in “Validate a Configuration Wizard”.
- File shares created with the continuous availability property, which is the default setting for clustered file shares.
- Computers accessing the clustered file shares must be running Windows “8” Consumer Preview or Windows Server “8”.
When the SMB client initially connects to the file share, the client determines whether the file share has the continuous availability property set. If it does, this means the file share is a clustered file share and supports SMB transparent failover. When the SMB client subsequently opens a file on the file share on behalf of the application, it requests a persistent file handle. When the SMB server receives a request to open a file with a persistent handle, the SMB server interacts with the Resume Key filter to persist sufficient information about the file handle, along with a unique key (resume key) supplied by the SMB client, to stable storage. The SMB client uses the resume key to reference the handle during a resume operation after a failover. To protect against data loss from writing data into an unstable cache, persistent file handles are always opened with write through.
If a failure occurs on the file server cluster node to which the SMB client is connected, the SMB client attempts to reconnect to another file server cluster node. Once the SMB client successfully reconnects to another node in the cluster, the SMB client starts the resume operation using the resume key. When the SMB server receives the resume key, it interacts with the Resume Key filter to recover the handle state to the same state it was prior to the failure with end-to-end support (SMB client, SMB server and Resume Key filter) for operations that can be replayed, as well as operations that cannot be replayed. The application running on the SMB client computer does not experience any failures or errors during this operation. From an application perspective, it appears the I/O operations are stalled for a small amount of time.
It is very important to keep the number of I/O stalls during a failover to a minimum. Since SMB is sitting on top of TCP/IP, the SMB client would normally rely on TCP timeout to determine if the file server cluster node has failed. However, relying on TCP timeouts can lead to fairly long I/O stalls, since each timeout is typically ~20 seconds. SMB Witness was created to enable faster recovery from unplanned failures, allowing the SMB client to not have to wait for a TCP timeout. SMB Witness is a new service that is installed automatically with the failover clustering feature. When the SMB client initially connects to a file server cluster node, the SMB client notifies the SMB Witness client, which is running on the same computer. The SMB Witness client obtains a list of cluster members from the SMB Witness service running on the file server cluster node. The SMB Witness client picks a different cluster member and issues a registration request to the SMB Witness service on that cluster member.
If an unplanned failure occurs on the file server cluster node, the SMB Witness service on the other cluster member receives a notification from the cluster service. The SMB Witness service also notifies the SMB Witness client, which in turns notifies the SMB client that the file server cluster node has failed. Upon receiving the SMB Witness notification, the SMB client immediately starts reconnecting to a different file server cluster node, which significantly speeds up recovery from unplanned failures.
Privileged Access Management keeps administrative access separate from day-to-day user accounts. This solution relies on parallel forests:+
- CORP: Your general-purpose corporate forest that includes one or more domains. While you may have multiple CORP forests, the examples in these articles assume a single forest with a single domain for simplicity.
- PRIV: A dedicated forest created especially for this PAM scenario. This forest includes one domain to accommodate privileged groups and accounts which are shadowed from one or more CORP domains.
The MIM solution as configured for PAM includes the following components:
- MIM Service: implements business logic for performing identity and access management operations, including privileged account management and elevation request handling.
- MIM Portal: a SharePoint-based portal, hosted by SharePoint 2013, which provides an administrator management and configuration UI.
- MIM Service Database: stored in SQL Server 2012 or 2014, and holds identity data and meta-data required for MIM Service.
- PAM Monitoring Service and PAM Component Service: two services that manage the lifecycle of privileged accounts and assists the PRIV AD in group membership lifecycle.
- PowerShell cmdlets: for populating MIM Service and PRIV AD with users and groups that correspond to the users and groups in the CORP forest for PAM administrators, and for end users requesting just-in-time (JIT) use of privileges on an administrative account.
- PAM REST API and sample portal: for developers integrating MIM in the PAM scenario with custom clients for elevation, without needing to use PowerShell or SOAP. The use of the REST API is demonstrated with a sample web application.
Once installed and configured, each group created by the migration procedure in the PRIV forest is a shadow SIDHistory-based security group (or in a later update with Windows Server vNext, a foreign principal group) mirroring the SID group in the original CORP forest. Furthermore, when the MIM Service adds members to these groups in the PRIV forest, those memberships will be time limited.
As a result, when a user requests elevation using the PowerShell cmdlets, and their request is approved, the MIM Service will add their account in the PRIV forest to a group in the PRIV forest. When the user logs in with their privileged account, their Kerberos token will contain a Security Identifier (SID) identical to the SID of the group in the CORP forest. Since the CORP forest is configured to trust the PRIV forest, the elevated account being used to access a resource in the CORP forest appears, to a resource checking the Kerberos group memberships, be a member of that resource’s security groups. This is provided via Kerberos cross-forest authentication.
Furthermore, these memberships are time limited so that after a preconfigured interval of time, the user’s administrative account will no longer be part of the group in the PRIV forest. As a result, that account will no longer be usable for accessing additional resources.
IN THIS ARTICLE
Privileged Access Management (PAM) is a solution that is based on Microsoft Identity Manager (MIM), Windows Server 2012 R2, and Windows Server Technical Preview. It helps organizations restrict privileged access within an existing Active Directory environment.
Privileged Access Management accomplishes two goals:
- Re-establish control over a compromised Active Directory environment by maintaining a separate bastion environment that is known to be unaffected by malicious attacks.
- Isolate the use of privileged accounts to reduce the risk of those credentials being stolen.
What problems does PAM help solve?
A real concern for enterprises today is resource access within an Active Directory environment. Particularly troubling is news about vulnerabilities, unauthorized privilege escalations, and other types of unauthorized access including pass-the-hash, pass-the-ticket, spear phishing, and Kerberos compromises.
Today, it’s too easy for attackers to obtain Domain Admins account credentials, and it’s too hard to discover these attacks after the fact. The goal of PAM is to reduce opportunities for malicious users to get access, while increasing your control and awareness of the environment.
PAM makes it harder for attackers to penetrate a network and obtain privileged account access. PAM adds protection to privileged groups that control access across a range of domain-joined computers and applications on those computers. It also adds more monitoring, more visibility, and more fine-grained controls so that organizations can see who their privileged administrators are and what are they doing. PAM gives organizations more insight into how administrative accounts are used in the environment.
How is PAM set up?
PAM builds on the principle of just-in-time administration, which relates to . JEA is a Windows PowerShell toolkit that defines a set of commands for performing privileged activities and an endpoint where administrators can get authorization to run those commands. In JEA, an administrator decides that users with a certain privilege can perform a certain task. Every time an eligible user needs to perform that task, they enable that permission. The permissions expire after a specified time period, so that a malicious user can’t steal the access.
PAM setup and operation has four steps.
Prepare: Identify which groups in your existing forest have significant privileges. Recreate these groups without members in the bastion forest.
Protect: Set up lifecycle and authentication protection, such as Multi-Factor Authentication (MFA), for when users request just-in-time administration. MFA helps prevent programmatic attacks from malicious software or following credential theft.
Operate: After authentication requirements are met and a request is approved, a user account gets added temporarily to a privileged group in the bastion forest. For a pre-set amount of time, the administrator has all privileges and access permissions that are assigned to that group. After that time, the account is removed from the group.
Monitor: PAM adds auditing, alerts, and reports of privileged access requests. You can review the history of privileged access, and see who performed an activity. You can decide whether the activity is valid or not and easily identify unauthorized activity, such as an attempt to add a user directly to a privileged group in the original forest. This step is important not only to identify malicious software but also for tracking "inside" attackers.
How does PAM work?
PAM is based on new capabilities in AD DS, particularly for domain account authentication and authorization, and new capabilities in Microsoft Identity Manager. PAM separates privileged accounts from an existing Active Directory environment. When a privileged account needs to be used, it first needs to be requested, and then approved. After approval, the privileged account is given permission via a foreign principal group in a new bastion forest rather than in the current forest of the user or application. The use of a bastion forest gives the organization greater control, such as when a user can be a member of a privileged group, and how the user needs to authenticate.
Active Directory, the MIM Service, and other portions of this solution can also be deployed in a high availability configuration.
The following example shows how PIM works in more detail.
The bastion forest issues time-limited group memberships, which in turn produce time-limited ticket-granting tickets (TGTs). Kerberos-based applications or services can honor and enforce these TGTs, if the apps and services exist in forests that trust the bastion forest.
Day-to-day user accounts do not need to move to a new forest. The same is true with the computers, applications, and their groups. They stay where they are today in an existing forest. Consider the example of an organization that is concerned with these cybersecurity issues today, but has no immediate plans to upgrade the server infrastructure to the next version of Windows Server. That organization can still take advantage of this combined solution by using MIM and a new bastion forest, and can better control access to existing resources.
PAM offers the following advantages:
Isolation/scoping of privileges: Users do not hold privileges on accounts that are also used for non-privileged tasks like checking email or browsing the Internet. Users need to request privileges. Requests are approved or denied based on MIM policies defined by a PAM administrator. Until a request is approved, privileged access is not available.
Step-up and proof-up: These are new authentication and authorization challenges to help manage the lifecycle of separate administrative accounts. The user can request the elevation of an administrative account and that request goes through MIM workflows.
Additional logging: Along with the built-in MIM workflows, there is additional logging for PAM that identifies the request, how it was authorized, and any events that occur after approval.
Customizable workflow: The MIM workflows can be configured for different scenarios, and multiple workflows can be used, based on the parameters of the requesting user or requested roles.
How do users request privileged access?
There are a number of ways in which a user can submit a request, including:
- The MIM Services Web Services API
- A REST endpoint
- Windows PowerShell (
What workflows and monitoring options are available?
As an example, let’s say a user was a member of an administrative group before PIM is set up. As part of PIM setup, the user is removed from the administrative group, and a policy is created in MIM. The policy specifies that if that user requests administrative privileges and is authenticated by MFA, the request is approved and a separate account for the user will be added to the privileged group in the bastion forest.
Assuming the request is approved, the Action workflow communicates directly with bastion forest Active Directory to put a user in a group. For example, when Jen requests to administer the HR database, the administrative account for Jen is added to the privileged group in the bastion forest within seconds. Her administrative account’s membership in that group will expire after a time limit. With Windows Server Technical Preview, that membership is associated in Active Directory with a time limit; with Windows Server 2012 R2 in the bastion forest, that time limit is enforced by MIM.
This workflow is specifically intended for these administrative accounts. Administrators (or even scripts) who need only occasional access for privileged groups, can precisely request that access. MIM logs the request and the changes in Active Directory, and you can view them in Event Viewer or send the data to enterprise monitoring solutions such as System Center 2012 – Operations Manager Audit Collection Services (ACS), or other third-party tools.