2024 Lab Overhaul Part 1 – Expansion: New Hosts, Upgraded Storage & Networking

Introduction

In the last two years, my home lab has undergone significant transformations to enhance performance, scalability, and reliability. In this post, I’ll walk you through the initial phase of these upgrades, focusing on adding two new lab hosts and transitioning from a single-node Ceph-based storage solution to a more robust setup with TrueNAS on a Dell R730xd. These changes laid the foundation for a more resilient and efficient environment that powers my increasingly complex smart home and lab infrastructure. This post is the first in a several-part series detailing many of the changes I have made since my last posts back in the summer of 2022.

Adding Two More Compute Hosts

One of the first steps in expanding my lab was adding two new hosts. With growing demands on my existing infrastructure, including the need to run more virtual machines (VMs) and manage heavier workloads, the addition of these hosts was essential.

  • Hardware Selection
    • CPUs: Since my existing hosts were using Ryzen 5000 series processors, I chose the same for the new hosts to maintain VMware EVC compatibility and allow seamless failover and load balancing of VMs. One of the hosts now uses a G-series chip with an iGPU, letting me run the hypervisor console on onboard graphics and free up a PCIe slot for VM passthrough.
    • RAM: Both servers have at least 64 GB of RAM. With all my VMs powered on and under load, I only need about two hosts' worth of compute and RAM, but having four allows for easy maintenance.
    • Networking: Both servers have Intel x520 dual-port cards installed, allowing for 2x 10 Gb SFP+ connectivity. More on this in a future post.
  • Power Impact
    • For average usage, these computers sip power, drawing only about 50 kWh per month each, or roughly $6/month/host before any solar/battery savings.
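
For context, a quick back-of-the-envelope check of those numbers (the electric rate is my inference from the quoted cost, not a figure from my bill):

```sh
# 50 kWh/month spread over ~730 hours is roughly a 68 W average draw per host
echo "scale=1; 50000/730" | bc    # => 68.4 (watts)

# $6 for 50 kWh implies an electric rate of about $0.12/kWh
echo "scale=2; 6/50" | bc         # => .12 (dollars per kWh)
```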

Conversion from Ceph to TrueNAS

While Ceph worked well when I set it up, I was running it completely against best practices. For starters, upgrades were a massive undertaking because I did not have multiple manager and monitor nodes. Secondly, I had modified the CRUSH map to provide redundancy at the disk level instead of the server level so I could run everything on a single PC. Finally, even with SSD caches and logging, performance was only moderate when presenting iSCSI or NFS shares to VMware, and drives were not hot-swappable or easily identifiable. After researching various potential solutions, I settled on digging out an R720xd I had used in a previous iteration of the lab and installing TrueNAS.

  • Why TrueNAS?
    • TrueNAS offered a simplified management interface, robust features, and better performance for my specific use case. It also let me start playing around with ZFS instead of the traditional hardware RAID I am used to. Plus, if I ever need more storage in the future, I can replace disks one at a time to expand capacity (see the pool sketch after this list) or add another server.
  • TrueNAS Challenges
    • When I started with TrueNAS, I unknowingly created some easily avoidable complications for myself. First off, I installed TrueNAS Core, which received its final major update only a few days before this post was written. Luckily, migration to TrueNAS Scale was fairly seamless.
    • Secondly, I configured each disk as its own single-disk RAID 0 virtual disk in the PERC controller settings. This was mainly due to not being able to put disks into non-RAID mode on this generation of server and not wanting to flash alternative IT-mode firmware onto the controller. It also meant that any time a disk was replaced, I had to reboot the server to create a new RAID array. This was later rectified when I accidentally fried something on the server (more on this in a future post) and decided to upgrade to an R730xd, which lets me put disks into non-RAID mode and present them directly to TrueNAS from iDRAC with no reboots (see the racadm sketch after this list). TrueNAS can now also see metrics like disk temperatures.
  • Migration
    • Migration was fairly simple, as I was able to use storage vMotion inside VMware for most systems (a scripted equivalent is sketched after this list). Because I opted to move over the disks that had been powering Ceph, I had to give up my local Veeam backups, and my media server files had to be restored from S3-compatible storage due to the size of the VMDKs.
    • With the desktop no longer needed for Ceph, it became the fourth compute host, so the system was reused instead of going to waste.
  • New Datasets
    • Since I had more than 8 drive bays available, I decided to expand my storage while I was at it. First, I purchased a PCIe-to-NVMe riser and 2x 2 TB Intel Optane NVMe drives, which are mirrored. The most latency-sensitive workloads, such as a VM that hosts several game servers, were migrated to this array.
    • Second, I created a raidz2 pool with 4x 2 TB SATA SSDs. These host the VM OS disks for all VMs that are not on the NVMe pool, which means only my media server files and my Veeam backups live on the HDD pool. Veeam backups are now just on ReFS, as I have offsite immutable backups on S3-compatible storage. I could have used MinIO on top of TrueNAS; however, research showed support for it was planned to end, and Veeam’s performance with it was troublesome at best. The HDD pool is still 6x 6 TB 7200 RPM drives, now in raidz2, with 2x 1 TB SATA SSDs mirrored for logs.
    • All in all, this gives me almost 27 TB usable on the server (see the pool sketch below). I do, however, lose several terabytes to overhead, as TrueNAS prefers pools to stay below 80% utilization and VMware likes VMFS datastores below 75%.
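
For reference, flipping a disk to non-RAID mode from iDRAC can be done with racadm. This is a minimal sketch rather than the exact commands I ran; the disk and controller FQDDs are placeholders and will differ on your hardware:

```sh
# List the physical disks behind the PERC and their current state
racadm storage get pdisks -o -p State

# Convert one disk to non-RAID so TrueNAS sees it directly (FQDDs are examples)
racadm storage converttononraid:Disk.Bay.0:Enclosure.Internal.0-1:RAID.Integrated.1-1

# Apply the change as a realtime job; no reboot required on this controller
racadm jobqueue create RAID.Integrated.1-1 --realtime
```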
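The storage vMotions themselves were mostly point-and-click, but the same move can be scripted. A rough sketch with govc, assuming placeholder vCenter credentials, datastore, and VM names:

```sh
# Connection details for govc (values are placeholders)
export GOVC_URL='https://vcenter.lab.local'
export GOVC_USERNAME='administrator@vsphere.local'
export GOVC_PASSWORD='changeme'
export GOVC_INSECURE=1   # lab vCenter with a self-signed cert

# Storage vMotion a VM onto the new TrueNAS-backed datastore
govc vm.migrate -ds truenas-iscsi my-vm
```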
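TrueNAS builds pools through its UI, but the resulting layout is roughly equivalent to the zpool commands below (device names are placeholders). The last two lines show the replace-one-disk-at-a-time expansion path mentioned earlier, and the usable figures are TiB after mirror/raidz2 overhead, which is where the "almost 27 TB" comes from:

```sh
# NVMe pool: 2x 2 TB Optane, mirrored (~1.8 TiB usable)
zpool create fast mirror /dev/nvme0n1 /dev/nvme1n1

# SSD pool: 4x 2 TB SATA SSDs in raidz2 (~3.6 TiB usable)
zpool create ssd raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# HDD pool: 6x 6 TB 7200 RPM in raidz2 with a mirrored SLOG (~21.8 TiB usable)
zpool create tank raidz2 /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj \
  log mirror /dev/sdk /dev/sdl

# Expansion path: with autoexpand on, replace each disk in a vdev one at a
# time with a larger one; capacity grows once the last resilver completes
zpool set autoexpand=on tank
zpool replace tank /dev/sde /dev/sdm
```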

Lab Relocation

With the introduction of PowerEdge servers and an enterprise network switch came increased heat and noise, to the point where I could no longer reasonably keep my lab in my office or a closet. So, I decided to purchase an open 42U rack and move everything to my garage.

  • Why Relocate?
    • As I said, heat and noise were concerns. While you would think the Texas summers would cause cooling issues for the servers, this is their second summer out there and they are still handling it like champs. You can hear some fan noise in the laundry room, which has the door to the garage, but not once the laundry room doors are shut.
    • In addition, the relocation made it easier to perform some networking upgrades and changes for the rest of the house to expand on what was done for the lab.
  • UPS Installation
    • While there, I also decided to invest in a 2U 1500 VA UPS. I’ve since added solar to the house and a Tesla PowerWall, but surprisingly they have not made the UPS redundant. I’ll explain this in more detail in a later post, as I have some nifty automation around it, but the short version is that the PowerWall regularly fails to make a clean cutover and its capacity is finite, so the UPS bridges cutovers and handles clean shutdowns (a minimal config sketch follows this list).
  • Additional Switching and Systems
    • Since the relocation, I have added a few more systems to the rack, which I’ll cover in later parts. It felt worth mentioning since there is a photo of the rack in its current state and not the state it was in when I built it.
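
The automation deserves its own post, but the clean-shutdown half is the standard UPS pattern. A minimal sketch using NUT (Network UPS Tools); the UPS name and password are placeholders, and my actual setup differs:

```conf
# /etc/nut/upsmon.conf (minimal sketch)
# Watch the rack UPS and trigger a clean shutdown when the battery runs low
MONITOR rackups@localhost 1 upsmon changeme master
SHUTDOWNCMD "/sbin/shutdown -h +0"
```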

Networking Upgrades

To further enhance lab performance and leverage the higher IOPS of the new NVMe and SSD datasets, I moved everything from 2.5 GbE over to 10 Gb SFP+. I also took the time to improve overall internet reliability at the house.

  • Upgrading the Lab
    • Upgrading the lab was relatively simple. Each physical server now has a dual-port Intel X520 card, allowing for 2x 10 Gb SFP+. For roughly $300 on eBay, I managed to pick up a Dell PowerConnect 8132F, which has 24x 10 Gb SFP+ ports.
  • Retirement of NSX and Expansion of pfSense
    • As I wasn’t really using it, I decided to retire NSX from the lab. As you may recall, I also had a pfSense VM to handle routing and VLAN tagging, since my consumer Asus router did not support tagging. That VM has also been retired, replaced by a proper firewall running pfSense on a Dell PowerEdge R220. I have not fully stress-tested it, but a quick iperf test (sketched after this list) showed at least 2.5 Gbps between subnets; the device I was testing from only had a 2.5 Gb NIC, so that figure is a floor rather than the firewall's ceiling.
  • Wireless Mesh and Other Upgrades
    • As I started adding wireless devices in and around the house, I noticed connectivity was regularly dropping for some of them. So, I converted my Asus router to AP mode, purchased a few more, and created a three-device mesh, including one in the garage, so I can quickly look things up in the driveway and garage without walking back inside to get videos and forums to load. The backhaul for this mesh is all Cat 5e, as nearly every room in the house was wired with it, connecting to RJ11 ports in the wall and terminating at a security panel in my master closet. Swapping the wall plates to RJ45 and installing a switch alongside the main Asus AP allows for multi-gig connectivity, since the Ethernet runs are short enough. For example, the backbone from the main AP to the lab/pfSense server is 10 Gb, and my gaming PC has 2.5 GbE.
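
For completeness, here is the shape of the inter-subnet throughput test mentioned above; the IP is a placeholder for a host on the other VLAN:

```sh
# On a server in one subnet, listen for test traffic
iperf3 -s

# From a client in another subnet, push traffic through pfSense for 30 seconds
iperf3 -c 10.0.20.10 -t 30
# Result was ~2.5 Gbps, limited by the client's 2.5 GbE NIC, not the firewall
```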

All of this work ended up flowing together nicely and laid the foundation for a lot of future enhancements. In the next part of the series, I’ll go through some UniFi, 3D printing, and media server goodies that I added or that benefited from these changes.
