Published
April 5, 2024

Reliable Distributed Storage for Bare-Metal CAPI Clusters

Kevin Reeuwijk
Principal Solution Architect

Starting to look at bare metal? You’re in good company

With many VMware customers concerned about Broadcom's plans for their products, more organizations are looking for alternatives. 

Why should you run your on-premises Kubernetes clusters on top of a virtualization layer like vSphere, when you could just as well run them on bare metal physical servers in the data center?

It’s no wonder we’re seeing bare metal Kubernetes becoming much more popular. In fact, we think bare metal is ready for prime time!

Choose Your Data Storage Architecture Wisely

Data storage is an important consideration when you’re designing your bare metal Kubernetes environment. 

You have multiple options to choose from, including backend storage arrays with CSI (Container Storage Interface) drivers, and local disks pooled by distributed storage solutions like Portworx, Rook-Ceph, Quobyte and others.

Distributed storage solutions are great because they allow for hyper-converged architectures: much like VMware vSAN, they enable you to pool the local disks in each server into a shared storage pool. 

The challenge then is how you handle the inevitable upgrades of Kubernetes and operating systems, without affecting the bare metal storage pool’s availability or performance. 

Introducing CAPI to the Mix: Watch Out for Repaves

Nowadays Cluster API (CAPI) is a popular tool for declaratively automating the deployment and maintenance of Kubernetes clusters. 

CAPI handles upgrades through a mechanism known as “repaving”—replacing old nodes in the cluster one by one with new nodes that have the new desired state in place. 

But how can you perform such a repave without triggering massive storage resilvering actions or losing the whole storage cluster altogether? 

Let us show you how you can use Canonical MAAS (Metal as a Service) and Spectro Cloud Palette to make these repave actions work seamlessly, without data loss.

Retaining Data after CAPI Node Repaves

The solution to the challenge is pretty straightforward. We must ensure two things:

  1. Whenever CAPI repaves a node, the data on the local disks must not be wiped.
  2. The worker nodes of the cluster are repaved by removing one server from the cluster at a time, then reinstalling that same machine and rejoining it to the cluster with the data on its local disks still intact.

Most bare metal deployment solutions have logic to wipe a server’s disks before operating system installation, and some also provide the ability to wipe a server’s disks when the node is uninstalled. 

For our needs here, we don’t want any of that to happen, so we must configure the bare metal deployment solution to never wipe disks other than the disk on which the OS is installed.

Next, to make sure we don’t cause the storage solution to start recovery operations from lost redundancy, we have to make sure we repave the cluster in the right sequence. 

By default, most CAPI solutions use the “Expand First” (`RollingUpdateScaleOut` in CAPI terms) repave logic. This logic installs a fresh new server and adds it to the cluster first, before removing an old server. 

While this ensures the cluster never has less total compute capacity than before you started the repave operation, it is problematic for distributed storage clusters: you are introducing a new node without any data to the cluster, while taking away a node that does contain data. 

So instead, we want to use the “Contract First” (`RollingUpdateScaleIn`) repave logic for the pool of storage nodes. That way we can remove a storage node first (temporarily running with reduced redundancy), then reinstall it and add it back to the cluster, thereby immediately restoring data redundancy.
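
On a raw CAPI `MachineDeployment`, the difference between the two behaviors comes down to the rolling update parameters. The fragment below is a minimal, illustrative sketch (the object name is made up, and the `selector`/`template` sections are omitted for brevity) of a contract-first rollout that never surges and replaces at most one machine at a time:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: cluster1-storage-pool   # hypothetical name for the storage worker pool
spec:
  clusterName: cluster1
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0         # "Contract First": never add a new machine before removing one
      maxUnavailable: 1   # repave exactly one storage node at a time
  # selector, template, bootstrap and infrastructure references omitted
```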

Putting the Plan into Action

Spectro Cloud Palette integrates with Canonical MAAS to perform bare metal deployment of Kubernetes clusters. It’s an easy way to build entire Kubernetes clusters from scratch, in a fully automated way.

To configure both products to support reliable distributed storage in bare metal clusters, we must perform three steps in Canonical MAAS and two steps in Spectro Cloud Palette:

Canonical MAAS:

  1. Use a fixed resource pool for our storage nodes
  2. Remove all additional disks except the OS disk from the nodes’ configuration
  3. Disable the “Erase nodes’ disks prior to releasing” option

Spectro Cloud Palette:

  1. Set the worker pool for the storage nodes to “Contract First”
  2. Increase the “node repave interval” for the worker pool containing storage nodes to allow the storage software enough time to properly install and initialize

Let’s go through these steps in a bit more detail.

Steps for Canonical MAAS

We need a fixed set of servers that act as storage nodes in our Kubernetes cluster, so let’s create a dedicated resource pool in MAAS and assign those servers to it:

Distributed storage for bare metal clusters

In this example we have a resource pool called `cl1_storage` with four nodes in it. These will be our storage nodes for cluster1. 

All four nodes will be in use during normal operation. When this pool gets repaved, one node at a time will leave the cluster, be reinstalled, and then join the cluster again.
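
If you manage the CAPI objects directly, the machine template for the MAAS infrastructure provider is where this resource pool is typically referenced. The fragment below is purely a sketch: it assumes Spectro Cloud’s `cluster-api-provider-maas`, and field names such as `resourcePool`, `minCPU` and `minMemoryInMB` may differ between provider versions, so verify against the CRDs in your own management cluster:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: MaasMachineTemplate
metadata:
  name: cluster1-storage-nodes    # hypothetical template for the storage worker pool
spec:
  template:
    spec:
      resourcePool: cl1_storage   # only machines from this MAAS resource pool are eligible
      minCPU: 8                   # assumed sizing fields; adjust to your provider's schema
      minMemoryInMB: 32768
      image: ubuntu
```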

Next, let’s make sure that MAAS is not aware of the additional disks that the servers contain, so that those disks don’t get wiped. 

Any disks that are present in the “Storage” tab of the server in MAAS will be wiped at the start of the OS installation procedure. So, for every server we must remove all disks except for the disk on which the OS will be installed: 

Server storage configuration in MAAS, with only the OS disk remaining

If the “Available disks and partitions” section is empty, like in the screenshot above, you’ve done it correctly.

Finally, to ensure MAAS does not wipe all data from all disks (even the ones it doesn’t know about) during node release, we have to check an option in Settings → Storage:

Storage settings in bare metal environment

The above is the default configuration in MAAS, which does not wipe disks when a node is released. However, you want to make sure that on your MAAS instance, the “Erase nodes’ disks prior to releasing” option is also disabled.

This feature is a systemwide option today. If you need this feature enabled for some nodes, but disabled for others, please chime in on this MAAS feature request.

Steps for Spectro Cloud Palette

You can make all the changes you need to Spectro Cloud Palette from a single screen. 

In the deployment wizard, configure the worker pool for storage to match the MAAS resource pool in name and number of servers:

configuring the worker pool for storage
  • The number of nodes should match the number of bare metal servers in the MAAS resource pool
  • The Rolling update strategy should be set to Contract First
  • The Resource pool should be set to the correct one (cl1_storage in this example)
  • The Node repave interval should be increased from its default of 0 seconds. We recommend at least 10 minutes, or 15 if you want to be extra safe.

The node repave interval controls the `minReadySeconds` property on the `MachineDeployment` resource for that node pool in CAPI. During repaves, it tells CAPI to wait for that many seconds after a new node has joined the cluster before moving on to draining and deleting the next node from the cluster. 

Since distributed storage solutions need several minutes to install and initialize on a fresh node, we can use this property to slow down the repave process just enough so that Portworx or Rook-Ceph has enough time to complete installation and rejoin the storage cluster, restoring data redundancy. That then allows the next node to drain and leave the cluster without causing a double failure.
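
For reference, this is roughly where the setting lands on the CAPI object. The fragment below is an illustrative sketch, not Palette’s exact rendered output: the object name is hypothetical, and 600 seconds corresponds to the 10-minute repave interval recommended above:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: cluster1-storage-pool   # hypothetical name for the storage worker pool
spec:
  clusterName: cluster1
  replicas: 4                   # matches the four servers in the cl1_storage resource pool
  minReadySeconds: 600          # node repave interval: wait 10 minutes after a node joins
  # strategy (rollingUpdate) settings as shown earlier
```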

Expect the Unexpected

We’ve tested this method successfully using both Portworx and Rook-Ceph. That said, you should always keep an eye on your clusters during repave operations. 

That’s because the node repave interval only waits for a fixed amount of time; it does not wait until some custom success condition is met. 

So, if a different issue causes the distributed storage software to not install properly on the new node, you can still run into trouble. For example, Portworx supports specific kernel versions, and installing new nodes with a kernel version it doesn’t support can prevent the installation from succeeding. For that reason, it’s a good idea to lock the kernel version that MAAS deploys. Reach out to us if you want to learn how to achieve that.

Always make sure you test the upgrades first (as Spectronaut Darren describes in this blog post on upgrading) and monitor your nodes during upgrades to make sure the storage components work as expected.

Next Steps

We love sharing our expertise about storage on Kubernetes, whether you’re running in public cloud environments, at the edge, or on bare metal server hardware. 

So do check out our other blogs and webinars, or get in touch to schedule a 1:1 demo to see how Palette can help you.

Tags:
Bare Metal
Integrations
Cluster Profiles
Operations