Micro-S2D

Table of Contents

Preface

I know we all hate nothing more than a long story before the recipe, but this one has some real explaining to do. Lucky you, I have a table of contents and you can skip over my woes.

First, this is not what this project was supposed to be. I chose the Minisforum MS-01 to make a 2-node S2D cluster due to the NVMe slots, dual 20g Thunderbolt, and dual 10g SFP+. I was hoping to use the dual 20g Thunderbolt ports for storage & live migration, and the 10g X710 ports for Management & Compute. I even picked up some QNAP cards so I could do 5x NVMe per node.

Unfortunately, I could not get the Thunderbolt ports to work for SMB Multichannel or Network ATC, so using them for storage was off the table. They did, however, come in handy as a way to RDP from one node to the other while I was messing with network changes, so they're not entirely useless. Next problem: I could not get the X710s to behave. I tried firmware updates, the latest drivers, the oldest drivers, etc. I just could not get them to behave properly, even for compute & management traffic.

Additionally, I started off with a big lot of enterprise 1.92TB M.2 PM983 NVMe drives. These refused to work whenever the OS was booted from an NVMe, regardless of which slot, how many drives, etc. I ended up getting Samsung 9100 Pros that work fine, but they are about half the size and don't have the same Power Loss Protection capabilities.

So I eventually settled on putting a Mellanox ConnectX-4 Lx NIC in each node, and running Storage, Compute, and Management traffic over those. That of course meant no QNAP card, and only 3x SSDs per node. It also meant that I needed some RDMA-capable switches, which was not my initial intention. It did, however, give me a chance to try out Cumulus. Let me emphasize: nearly every part of this is unsupported by Microsoft. You should not use this for your business. But it is quite the fun little homelab cluster.


Hardware Setup

Hardware

First up, the hardware I used:

Part       Model
CPU        i9-12900H (14c/20t)
RAM        2x 48GB DDR5 Crucial
Boot SSD   Random old NVMe
Data SSD   2x Samsung 9100 Pro 1TB
NIC        Mellanox ConnectX-4 Lx

Step 1 is of course to install everything. When it comes to SSD placement, make sure to put the boot drive in the slot furthest to the right. That slot only runs at PCIe 3.0 x2, so we don't want to use it for a real data drive. Also, make sure that the M.2/U.2 switch is set to M.2, or you're going to have a really bad day.


BIOS

Once they’re all put together, onto the BIOS Setup.

  • All of my nodes shipped with BIOS 1.26, which seemed to work fine for me, but if you bought used I’d make sure both nodes are on the same version.
  • If you bought used, go ahead and reset to factory defaults. If you bought new, maybe do it anyways. I’ve seen some weird things with these guys.
  • Advanced menu.
    • Trusted Computing:
      • Enable SHA384 and SM3_256 PCR Banks
    • Onboard Devices Settings:
      • DVMT Pre-Allocated: 48M
      • Aperture Size: 128MB
    • ACPI Settings:
      • Restore (Restory?) on AC Power Loss: Last State
    • HW Monitor & Smart Fan:
      • Set all fans to “Full Mode”. These boxes are still going to run plenty warm, and at full speed they’re still quieter than my other servers.
  • I also set the ME Password, but haven’t yet started to do anything with it.

AD Prep

Lots of this is optional and environment-specific, but this is what I did ahead of time to make things easier. (A small PowerShell sketch of the OU/GPO scaffolding follows this list.)

  • Create OU for HV Clusters, with a sub OU for this particular cluster.
  • Create and link GPO to allow RDP and Remote Powershell.
  • Create and link GPO to configure W32TM (NTP) settings.
  • Create and link GPO to use Delivery Optimization on the local network.
  • Create and link GPO to disable Interactive Logon CTRL+ALT+DEL.
  • Create and link GPO to allow necessary Windows Firewall rules.
    • RDP
    • ICMP (Ping)
    • File and Printer Sharing (SMB)
    • WinRM
    • WMI
    • Delivery Optimization
    • Performance Logs and Alerts
    • Virtual Machine Monitoring
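  • If you would rather script the scaffolding above, here is a minimal sketch using the ActiveDirectory and GroupPolicy modules. The domain, OU names, and GPO name are placeholders for illustration, and the actual policy settings still need to be filled in (the RDP registry value is shown as one example).
    # Create the OU structure (placeholder names and paths)
    New-ADOrganizationalUnit -Name "HV Clusters" -Path "DC=lab,DC=local"
    New-ADOrganizationalUnit -Name "MICRO-S2D" -Path "OU=HV Clusters,DC=lab,DC=local"

    # Create a GPO and link it to the cluster OU
    New-GPO -Name "HV - Remote Management" | New-GPLink -Target "OU=MICRO-S2D,OU=HV Clusters,DC=lab,DC=local"

    # Example setting: allow RDP by clearing fDenyTSConnections
    Set-GPRegistryValue -Name "HV - Remote Management" -Key "HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server" -ValueName "fDenyTSConnections" -Type DWord -Value 0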

Initial OS Setup

I used the absolute latest version of Windows Server vNext, but feel free to use the retail release of Windows Server 2025 if you'd like.

  • Prep Windows Server on a USB drive
  • Boot, and install Windows Server to the boot NVMe.
    • If you did vNext, you can use the public vNext activation key “2KNJJ-33Y9H-2GXGX-KMQWH-G6H67”
  • Set an admin password, and login.
  • Do a Rename-Computer and restart
    Rename-Computer -NewName HV01 -Restart
  • Set a static IP on a NIC. My preference is the first ConnectX-4 that shows up. (A scripted sketch of this and the domain join is at the end of this list.)
  • Join the Domain
    • Move the computer objects to the dedicated OU now, not later.
  • Restart the computer
    Restart-Computer
  • Set the timezone.
    Set-TimeZone -Id "Central Standard Time"
  • Set Power Plan to High Performance
    Powercfg -setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c
  • Set minimum processor state to 25%
    Powercfg -setacvalueindex 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c 54533251-82be-4824-96c1-47b60b740d00 893dee8e-2bef-41e0-89c6-b55d0929964c 25
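  • As promised above, a minimal sketch of the static IP and domain join steps. The gateway, DNS server, domain name, and OU path are placeholders; the interface alias is whatever Get-NetAdapter shows for the first ConnectX-4 (the adapters get renamed later in this post).
    # Static IP on the first ConnectX-4 (management addressing from the IP table later in this post)
    New-NetIPAddress -InterfaceAlias 'Ethernet' -IPAddress 10.10.0.41 -PrefixLength 23 -DefaultGateway 10.10.0.1
    Set-DnsClientServerAddress -InterfaceAlias 'Ethernet' -ServerAddresses 10.10.0.10

    # Join the domain and drop the computer object straight into the cluster OU (placeholder names)
    Add-Computer -DomainName lab.local -OUPath "OU=MICRO-S2D,OU=HV Clusters,DC=lab,DC=local" -Credential (Get-Credential) -Restart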

Drivers

Host Network Prep

This part is pretty important, and will help set you up for Failover Clustering health checks, and Network ATC Configuration.

  • Now that you have the chipset drivers, if you connected the nodes with a Thunderbolt link, it will pop up as a network connection. Set static IPs on both ends with no gateway (a quick sketch follows the IP table below).

As an example, I've included this IP table to help. I would highly recommend not touching the storage networks (VLANs 711 and 712), but feel free to modify the others to fit your environment.

Cluster IP: 10.10.0.40/23

NIC           HV01            HV02            VLAN
vManagement   10.10.0.41/23   10.10.0.42/23   1
vSMB1         10.71.1.41/24   10.71.1.42/24   711
vSMB2         10.71.2.41/24   10.71.2.42/24   712
TB1           10.72.1.41/24   10.72.1.42/24   N/A
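
As a quick sketch of the Thunderbolt link config on HV01 (the 10.72.1.x addressing comes from the table above, the 'TB1' alias assumes the adapter rename in the next section, and no gateway or DNS is set on purpose):

    New-NetIPAddress -InterfaceAlias 'TB1' -IPAddress 10.72.1.41 -PrefixLength 24
    # Repeat on HV02 with 10.72.1.42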

Rename Network Adapters

You will need to substitute the proper -NewName depending on the order in which your adapters got named by default. You can see in this screenshot that the adapters on this node got added in a different order.

Get-NetAdapter | Sort Name

Rename-NetAdapter -Name 'Ethernet' -NewName 'CX4-1'
Rename-NetAdapter -Name 'Ethernet 2' -NewName 'CX4-2'
Rename-NetAdapter -Name 'Ethernet 3' -NewName 'X710-1'
Rename-NetAdapter -Name 'Ethernet 4' -NewName 'X710-2'
Rename-NetAdapter -Name 'Ethernet 5' -NewName 'TB1'

Cluster Configuration

  • First, we’re going to check that the data drives are ready to be pooled. At this point, you will have to run this on each node.
    Get-PhysicalDisk
  • Add Roles and Features. You 100% want to do this from PowerShell, not Server Manager.
    Install-WindowsFeature -Name "Hyper-V", "Failover-Clustering", "Data-Center-Bridging", "RSAT-Clustering-PowerShell", "Hyper-V-PowerShell", "FS-FileServer", "NetworkATC" -IncludeAllSubFeature -IncludeManagementTools -Restart
  • FCM Validation:
    • Open Failover Cluster Manager. If you did server core, you will need to do this from a management machine.
    • In the top right, click “Validate Configuration”
    • It will take a few minutes to validate, and provide a report. There may be some errors or warnings, and you should view the report and check them all.
    • Now, if you set a static IP on CX4-1, but not CX4-2, this validation will complain about a DHCP mismatch on a cluster network. This is fine and can be ignored. It’s also not abnormal that one of the two nodes got a defender update that the other one didn’t get. I always ignore that.
    • Assuming everything else in the validation report looks good, click the “Create the cluster now using the validated nodes” box, and click finish.
    • Fill in the name you’d like to provide the cluster, click next a couple of times, and click create.
    • From the cluster overview page, go to “Cluster Core Resources” and right-click the “Server Name” object with your cluster's name. Go to Properties, select the network address and click Edit, enter the IP address you'd like to use for the cluster, then click OK. Optionally also select “Publish PTR records” and click OK again. (If you'd rather skip the GUI entirely, a rough PowerShell equivalent of the validate/create flow is sketched below.)
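  • For reference, a minimal PowerShell sketch of the same validate/create flow, using the node names, cluster name, and cluster IP from this post. -NoStorage keeps the data drives out of the cluster until S2D is enabled later.
    Test-Cluster -Node HV01, HV02
    New-Cluster -Name HV-CLUS1 -Node HV01, HV02 -StaticAddress 10.10.0.40 -NoStorage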

Network ATC

  • This section assumes you are using the same exact network setup as me, but this will hopefully at least give you a good starting point to understand ATC. You only have to do this on one node, and it will apply to the whole cluster.
  • First, create the intent
    Add-NetIntent -ClusterName HV-CLUS1 -Name ConvergedIntent -Management -Compute -Storage -AdapterName CX4-1, CX4-2
  • Next, we need to get the status and make sure it’s succeeded. DO NOT CONTINUE UNTIL THIS IS COMPLETE
    Get-NetIntentStatus
  • Next, we've got some overrides. Network ATC will revert settings you configure manually elsewhere (that's its intentional configuration-drift correction at work), so cluster-wide settings like these need to be applied as overrides.
    $ClusterOverride = New-NetIntentGlobalClusterOverrides
    $ClusterOverride.EnableNetworkNaming = $True
    $ClusterOverride.EnableLiveMigrationNetworkSelection = $True
    $ClusterOverride.EnableVirtualMachineMigrationPerformanceSelection = $True
    $ClusterOverride.VirtualMachineMigrationPerformanceOption = "SMB"
    $ClusterOverride.MaximumVirtualMachineMigrations = "2"
    Set-NetIntent -GlobalClusterOverrides $ClusterOverride
  • And just like before, we want to watch until it’s completed with the following commands.
    Get-NetIntentStatus
    Get-NetIntentStatus -GlobalOverrides
  • If “Get-NetIntentStatus -GlobalOverrides” comes back with the error “WindowsFeatureNotInstalled”, run this command and then configure the settings from WAC instead.
    Remove-NetIntent -GlobalOverrides
  • If you’re looking for some alternate Network ATC setup configurations, Lee has a great blog post here: https://www.hciharrison.com/azure-stack-hci/network-atc/
  • If you're feeling really adventurous and trying to get the X710s to work, Network ATC will fail because they do not support RDMA. This is the same problem you'd have trying to use Network ATC in a VM or with any other non-RDMA NICs. Here are the necessary overrides to make it work. DO NOT DO THIS IF YOU HAVE RDMA-CAPABLE NICS.
    $Override = New-NetIntentAdapterPropertyOverrides
    $Override.JumboPacket = "9000"
    $Override.NetworkDirect = $false
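  • Note that the snippet above only builds the override object; as far as I can tell it still has to be passed in when the intent is created. A hedged sketch (the intent name is just an example):
    Add-NetIntent -Name X710Intent -Management -Compute -AdapterName X710-1, X710-2 -AdapterPropertyOverrides $Override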

Setup Cluster Aware Updating (CAU)

  • First step here is often overlooked, and VERY important. Give the cluster object full control of its OU! (A command-line sketch for this is at the end of this section.)
  • These are the parameters I chose to use for this lab cluster. I expect most of you to somewhat modify these settings.
     $Parameters = @{
         ClusterName = 'HV-CLUS1'
         DaysOfWeek = 'Monday', 'Friday'
         WeeksOfMonth = 1, 2, 3, 4
         MaxFailedNodes = 0
         MaxRetriesPerNode = 3
         RebootTimeoutMinutes = 30
         SuspendClusterNodeTimeoutMinutes = 30
         SuspendRetriesPerNode = 2
         WaitForStorageRepairTimeoutMinutes = 60
         RequireAllNodesOnline = $true
         AttemptSoftReboot = $true
         EnableFirewallRules = $true
         Force = $true
     }
     Add-CauClusterRole @Parameters
  • Then, we’ll check that it’s working.
    Get-CauClusterRole
  • Assuming it says Online, and the settings look right, you should be good. I did however run into an odd issue a few days after deployment. My CAU status went to Offline. Thankfully, super easy to check, and super easy to fix.
    Get-CauClusterRole
    Enable-CauClusterRole
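  • For the “full control of its OU” step at the top of this section, the GUI route is Active Directory Users and Computers > (your cluster OU) > Properties > Security. A hedged command-line sketch, where the OU path is a placeholder and HV-CLUS1$ is the cluster name object's computer account:
    dsacls 'OU=MICRO-S2D,OU=HV Clusters,DC=lab,DC=local' /G 'LAB\HV-CLUS1$:GA'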

Set Cluster Witness

  • Since this is a 2-node cluster, we definitely want a cluster witness. This is super easy, you just need to point it at a SMB share that both nodes can write to.
    Set-ClusterQuorum -Cluster HV-CLUS1 -FileShareWitness \\SERVER-1\HV-CLUS1-SHARE -Credential (Get-Credential)
    Get-ClusterQuorum -Cluster HV-CLUS1
    • You should then check that the cluster witness shows as “online” in FCM.

Setup S2D

  • Finally the fun part!
    Enable-ClusterStorageSpacesDirect -PoolFriendlyName MICRO-S2D -Verbose
  • And check that the storage pool is online and healthy.
    Get-StoragePool
  • Or if you want more details:
    Get-StoragePool -IsPrimordial $false | FL
  • Validate that the ClusterPerformanceHistory volume has been created. This can take up to ~15 minutes.
    Get-Volume

S2D Volume Creation

  • First step here is really to understand how much space is available in your pool.
  • Since I used 2x 1TB SSDs per node, I'm going to “lose” one drive's worth of capacity per node to the recommended reserve. With 2-way mirroring across the two nodes, that leaves just under 1TB usable. I don't want to run at 100% right out of the gate, so I'm going with a 600GB CSV (Cluster Shared Volume) to start.
    $VolumeName = "Workloads1"
    $StoragePool = Get-StoragePool -IsPrimordial $False
    New-Volume -StoragePool $StoragePool -FriendlyName $VolumeName -FileSystem CSVFS_ReFS -Size 600GB -ResiliencySettingName "Mirror" -ProvisioningType "Fixed"
  • And for fun, I wanted a dedicated CSV for the PDC that I’m going to run on this cluster.
    $VolumeName = "DC1"
    $StoragePool = Get-StoragePool -IsPrimordial $False
    New-Volume -StoragePool $StoragePool -FriendlyName $VolumeName -FileSystem CSVFS_ReFS -Size 128GB -ResiliencySettingName "Mirror" -ProvisioningType "Fixed"

Hyper-V Tweaks

  • Create (my) standard folder structure.
    New-Item -Path C:\ClusterStorage\Workloads1\VMs -ItemType Directory
    New-Item -Path C:\ClusterStorage\Workloads1\VHDs -ItemType Directory
    New-Item -Path C:\ClusterStorage\Workloads1\ISOs -ItemType Directory
  • Set default locations for VM Creation.
    Get-ClusterNode | Foreach { Set-VMHost -ComputerName $_.Name -VirtualMachinePath 'C:\ClusterStorage\Workloads1\VMs' }
    Get-ClusterNode | Foreach { Set-VMHost -ComputerName $_.Name -VirtualHardDiskPath 'C:\ClusterStorage\Workloads1\VHDs' }
  • Increase Failover Cluster load balancer aggressiveness (Yes, this is actually the valid PowerShell way to set this…)
    (Get-Cluster).AutoBalancerLevel = 2
  • Set maximum parallel migrations. Don’t ask me why this isn’t covered with all the other things we’ve set.
    (Get-Cluster).MaximumParallelMigrations = 3

Double checks, extra validation, and tidbits

  • I had plenty of weird things going on with WAC Network ATC on this deployment. I blame it on using the latest vNext version of Windows Server, latest preview version of WAC, and some other unnamed new features. This is basically a list of good things to double check.

  • Double check that the Network ATC extension works in WAC. Mine did not, and I had to add a different extension feed to get a newer version of Network ATC than what was in the default feed.

  • Double check Network ATC Global Overrides. I had some funky stuff with my Network ATC Powershell module, so I double checked everything looked good in WAC.

    • WAC > Cluster > Network ATC Cluster Settings


  • Set CPU Scheduler to the newer Core Scheduler. I haven’t found a way to do this in Powershell yet. If you find one, let me know!

    • WAC > Cluster > Settings > General

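    • Not strictly PowerShell, but the classic command-line route (run from an elevated prompt, reboot afterwards) is supposed to be:
      bcdedit /set hypervisorschedulertype core
    • After the reboot, event ID 2 in the Microsoft-Windows-Hyper-V-Hypervisor operational log shows which scheduler is active.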

  • Enable In-Memory read cache. I have enough RAM, so I'm happy to give some up to the in-memory read cache. This is basically what every Linux ZFS system is doing with ARC.

    • WAC > Cluster > Settings > In-memory Cache

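    • If you would rather script it, the knob WAC is turning here is (as far as I can tell) the cluster's CSV in-memory read cache, exposed as a cluster property in MB; a quick sketch setting it to 2GB:
      (Get-Cluster).BlockCacheSize = 2048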

  • Set Storage Spaces Repair Speed

    • WAC > Cluster > Settings > Storage Spaces and Pools


  • Check SMB Multichannel

    Get-SmbMultichannelConnection -SmbInstance SBL

  • Getting rid of removed adapters showing up as partitioned in FCM

    • The docs claim that you can use Add-ClusterExcludedAdapter commands to fix this, but I have never been able to get them to work. Here’s the actual fix:
    • Open Regedit, navigate to “Computer\HKEY_LOCAL_MACHINE\Cluster\NetworkInterfaces" and delete all references to the removed adapter. Having good friendly names will help here.
  • Affinity rules vs Preferred Owners

    • Affinity rules can be configured in WAC or Powershell, and allow you to try and keep two or more VMs together or apart. For most production VMs, this is what you probably want to use. You can also enable “soft anti-affinity” so if you don’t have enough nodes in the cluster to keep them apart, it will keep them running. Perfect use case for this is having two domain controllers on a 3-node cluster.
    • Preferred Owner allows you to specify which node in the cluster you’d like the role (VM) to run on. This can have some interesting use cases. In my case, I want to keep DC1 on HV01, and DC2 on HV02 so I chose to use preferred owners. Preferred owner does however have some… unintuitive logic to it. If you think you want to use this, you should read the doc thoroughly. https://learn.microsoft.com/en-us/troubleshoot/windows-server/high-availability/groups-fail-logic-three-more-cluster-node-members
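    • For reference, a hedged PowerShell sketch of both approaches (the group/VM names are just examples): an anti-affinity rule to keep two DCs apart, and preferred owners to pin each DC to a node.
      # Anti-affinity rule (FailoverClusters module, Windows Server 2022 and later)
      New-ClusterAffinityRule -Name 'DC-AntiAffinity' -RuleType DifferentNode
      Add-ClusterGroupToAffinityRule -Name 'DC-AntiAffinity' -Groups 'DC1', 'DC2'

      # Preferred owners on the clustered VM roles
      Set-ClusterOwnerNode -Group 'DC1' -Owners 'HV01'
      Set-ClusterOwnerNode -Group 'DC2' -Owners 'HV02'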
  • Ubuntu VM optimizations

    sudo apt update
    sudo apt install linux-azure linux-image-azure linux-headers-azure linux-tools-common linux-cloud-tools-common linux-tools-azure linux-cloud-tools-azure
    sudo apt full-upgrade
    sudo apt install openssh-server
    sudo reboot