Published
November 3, 2022

Deploying complex infrastructure with a Terraform state machine

How do you handle dependencies in complex deployments?

In today’s world of interconnected cloud services, deploying application infrastructure can get pretty complicated. For example, your Kubernetes app in EKS may need several pods to share storage, requiring you to set up Amazon EFS for your cluster. Your IT department may require you to use RFC1918 IP address conservation for any EKS clusters you deploy in their main VPC. Another app might be deployed through Flux and require retrieving a SOPS key from your company’s secret store first and adding it to the cluster as a secret. Automating all the steps of actually getting a Kubernetes application into production is not easy.

Terraform has helped many companies address all or part of this problem. But because Terraform is a desired-state language, it can be difficult to deal with complex interdependencies between resources. The Terraform language is aimed at deploying everything in a single run, using resource dependencies to create resources in the right order. However, the more complex the infrastructure gets, the more difficult it becomes to avoid circular dependencies in your TF code that make deployment difficult, or sometimes even outright impossible.

A real example — and a workaround you can use today

I personally ran into this problem as I was working with a customer that uses the RFC1918 IP address conservation approach for EKS. This approach requires customizing the AWS VPC CNI so that it attaches pods to dedicated pod networking subnets that use a special RFC1918 CIDR (typically 100.64.0.0/16).

Due to limitations at the time, deploying the CNI with a custom configuration wasn’t working through CAPA (the Cluster API Provider for AWS), so I needed to deploy the cluster with a vanilla CNI configuration first, then redeploy the AWS VPC CNI Helm chart over it with a custom configuration. Making this work in a single Terraform run was simply impossible, so I had to figure out a solution.

I figured that if I could just run Terraform multiple times, with slightly different configurations for each run, I could deliver a working solution for the customer.

Terraform would still be able to manage the final state of the infrastructure that it deployed, as well as correctly deprovision it on a TF destroy run. So, the question became: how can I make Terraform apply different configurations, depending on which step of the deployment process the cluster is on? And how can I maintain the state of which deployment step a cluster is in?

Let’s take this step by step…

To start with the easy part, maintaining state: this can be accomplished with a tag on the cluster that stores the current step. I chose a simple step tag containing a number that indicates which step of the multi-step process the cluster is on. A fresh cluster starts at step 0 and completes at an arbitrary final step, incrementing by 1 on each run.
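The idea can be sketched in a few lines of shell (variable names here are illustrative, not from the actual module):

```shell
# Hypothetical sketch of the step-tag state machine. In practice the tag value
# is read from the cluster via the management API; an empty value means the
# cluster (and thus the tag) does not exist yet.
STEP_TAG=""                 # fresh cluster: the tag is absent
CURRENT=${STEP_TAG:--1}     # treat a missing tag as step -1
NEXT=$((CURRENT + 1))       # each run advances the state machine by one
echo "next step: $NEXT"
```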

The next challenge was determining the correct value of the step tag dynamically during a TF run. I achieved this through an external data source in Terraform:

 
data "external" "wait_for_cluster_state" {
  program = ["bash", "./modules/cluster/wait_for_cluster_state.sh"]
  query = {
    CLUSTER_NAME = var.name
    SC_HOST      = var.sc_host
    SC_API_KEY   = var.sc_api_key
    SC_PROJECT   = var.sc_project_name
  }
}


This will execute the wait_for_cluster_state.sh bash script, which queries the Spectro Cloud Palette API to determine whether the cluster exists and which step it is on. It also verifies that all pending changes on the cluster have been fully implemented, and will loop until that is the case. This is helpful in scenarios where, for example, the new VPC CNI config has been applied, but it takes the cluster a while to fully implement the changes.
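The script (shown next) starts with an eval that relies on Terraform’s external program protocol: the query arguments arrive as a single JSON object on stdin, and the program must emit a JSON object of strings on stdout. A minimal sketch of what that jq filter produces, using made-up values (this assumes jq is installed):

```shell
# Made-up input values; in the real script this JSON arrives from Terraform on stdin.
echo '{"CLUSTER_NAME":"demo","SC_HOST":"api.example.com"}' |
  jq -r '. | to_entries | .[] | .key + "=" + (.value | @sh)'
# Emits shell assignments such as: CLUSTER_NAME='demo'
# which the script then eval's to turn the query into shell variables.
```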

This is the script I use for this external data source:

 
#!/bin/bash
# Parse the JSON query object that Terraform passes on stdin into shell variables
eval "$(jq -r '. | to_entries | .[] | .key + "=" + (.value | @sh)')"
# results in:
# $CLUSTER_NAME, $SC_HOST, $SC_API_KEY, $SC_PROJECT

# Get project UID
PROJ_UID=$(curl -s -H "ApiKey:$SC_API_KEY" https://$SC_HOST/v1/projects | jq -r --arg SC_PROJECT "$SC_PROJECT" '.items[].metadata | select(.name==$SC_PROJECT) | .uid')

# Get cluster state info
CLUSTER_STATE_DATA=$(curl -s -H "ApiKey:$SC_API_KEY" -H "projectUid:$PROJ_UID" https://$SC_HOST/v1/spectroclusters\?filters\=metadata.name\=${CLUSTER_NAME}ANDstatus.state!="Deleted"\&fields\=metadata.uid,metadata.labels,status.state,status.conditions,status.packs)

# Parse status.state field
CLUSTER_STATE=$(echo $CLUSTER_STATE_DATA | jq -r '.items[0].status.state')

if [ "$CLUSTER_STATE" = "Running" ]; then
  # Parse metadata.labels.step field
  CLUSTER_STEP=$(echo $CLUSTER_STATE_DATA | jq -r '.items[0].metadata.labels.step')
  # Loop as long as there are machine pools or packs still being created/applied
  while echo $CLUSTER_STATE_DATA | jq -r '.items[] | "status=" + .status.conditions[].status, "status=" + .status.packs[].condition.status' | grep -e "status=True" -v > /dev/null
  do
    sleep 35
    # Refresh cluster state info for next iteration
    CLUSTER_STATE_DATA=$(curl -s -H "ApiKey:$SC_API_KEY" -H "projectUid:$PROJ_UID" https://$SC_HOST/v1/spectroclusters\?filters\=metadata.name\=${CLUSTER_NAME}ANDstatus.state!="Deleted"\&fields\=status.conditions,status.packs)
  done
  # Output current cluster step
  jq -n --arg CLUSTER_STEP "$CLUSTER_STEP" '{step:$CLUSTER_STEP}'
else
  # Cluster does not exist, output step -1
  jq -n '{step:"-1"}'
fi


The script returns -1 if the cluster doesn’t exist yet; otherwise it returns the value of the step tag on the cluster. We can then use the value of the current step in our code as data.external.wait_for_cluster_state.result.step. I use the following code to generate two local variables that make it easier to use the cluster state elsewhere in the code:

 
locals {
  cluster_exists = tonumber(data.external.wait_for_cluster_state.result.step) >= 0 ? true : false
  step = (
    local.cluster_exists == true ?
    (
      lookup(local.eks_state_map, data.external.wait_for_cluster_state.result.step).last_step == true ?
      tonumber(data.external.wait_for_cluster_state.result.step) :
      tonumber(data.external.wait_for_cluster_state.result.step) + 1
    ) : 0
  )
  eks_state_map = { # map of unique configs for each step
    # ... (step definitions shown below)
  }
}

The cluster_exists variable is very useful for other blocks in Terraform, where you only want to define a resource or data source if the base cluster has been deployed. For example, I needed to retrieve the value of an AWS security group that gets auto-created when the EKS cluster is deployed. So, I used this variable to define the data source like this:

 
data "aws_security_group" "eks" {
  count = local.cluster_exists == true ? 1 : 0
  tags = {
    "aws:eks:cluster-name" = var.name
  }
}

The step variable combines several pieces of logic:

  • If the cluster does not exist, return 0
  • If the cluster exists and its current step value is not the last step in the process, increment the step value by 1 and return that value
  • If the cluster exists and its current step value is the last step in the process, return the step value as-is.
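Those three rules can be sketched as a small shell function (hypothetical, purely to illustrate the locals logic above):

```shell
# Hedged sketch: given the step read from the cluster (-1 if absent) and the
# highest step in the state map, pick the step for this run.
# Function name and arguments are illustrative only.
next_step() {
  current=$1; last=$2
  if [ "$current" -lt 0 ]; then
    echo 0                    # no cluster yet: start at step 0
  elif [ "$current" -lt "$last" ]; then
    echo $((current + 1))     # mid-process: advance one step
  else
    echo "$current"           # final step reached: return it as-is
  fi
}
next_step -1 2   # prints 0
next_step 1 2    # prints 2
next_step 2 2    # prints 2
```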

The last step: the state map

To determine which step constitutes the last step in the process, we come to the final piece of the puzzle: the state map. This is a local variable that provides the ability to define a unique configuration for every individual step. The basic structure looks like this:

 
eks_state_map = {
  "0" = {
    last_step = false
    tags = ["step:0"]
    # Variables for step 0
    # variable1 = value_0
  },

  "1" = {
    last_step = false
    tags = ["step:1"]
    # Variables for step 1
    # variable1 = value_1
  },

  "2" = {
    last_step = true
    tags = ["step:2"]
    # Variables for step 2
    # variable1 = value_2
  }
}


I found the most powerful way to leverage the state map is through dynamic blocks. For example, I defined dynamic blocks in the spectrocloud_cluster_eks resource, so that I can dynamically set the cluster configuration based on the current step:

 
resource "spectrocloud_cluster_eks" "this" {
  name = var.name
  tags = lookup(local.eks_state_map, local.step).tags
  cloud_account_id = data.spectrocloud_cloudaccount_aws.this.id

  cloud_config {
    ssh_key_name = var.sshKeyName
    region = var.aws_region
    endpoint_access = "public"
    vpc_id = data.aws_vpc.eks.id
    az_subnets = {
      "${var.aws_region}a" = "${data.aws_subnet.eks-prv-0.id},${data.aws_subnet.eks-pub-0.id}"
      "${var.aws_region}b" = "${data.aws_subnet.eks-prv-1.id},${data.aws_subnet.eks-pub-1.id}"
      "${var.aws_region}c" = "${data.aws_subnet.eks-prv-2.id},${data.aws_subnet.eks-pub-2.id}"
    }
  }

  dynamic "cluster_profile" {
    for_each = lookup(local.eks_state_map, local.step).profiles
    content {
      id = cluster_profile.value.id
      dynamic "pack" {
        for_each = cluster_profile.value.packs
        content {
          name = pack.value.name
          tag = pack.value.tag
          values = pack.value.values
          dynamic "manifest" {
            for_each = try(pack.value.manifests, [])
            content {
              name = manifest.value.name
              content = manifest.value.content
            }
          }
        }
      }
    }
  }

  dynamic "machine_pool" {
    for_each = lookup(local.eks_state_map, local.step).machine_pools
    content {
      name = machine_pool.value.name
      min = machine_pool.value.min
      count = machine_pool.value.count
      max = machine_pool.value.max
      instance_type = machine_pool.value.instance_type
      disk_size_gb = machine_pool.value.disk_size_gb
      az_subnets = machine_pool.value.az_subnets
    }
  }
}

This then allows me to set the desired state per step in the state map, like so:

 
eks_state_map = {
  "0" = {
    last_step = false
    tags = ["step:0"]
    # Variables for step 0
    profiles = [
      {
        id = data.spectrocloud_cluster_profile.infra_base_profile.id
        packs = []
      }
    ]
    machine_pools = [
      {
        name = "temp-worker-pool"
        min = 1
        count = 1
        max = 1
        instance_type = "t3.large"
        disk_size_gb  = 60
        az_subnets = {
          "${var.aws_region}a" = data.aws_subnet.eks-prv-0.id
          "${var.aws_region}b" = data.aws_subnet.eks-prv-1.id
          "${var.aws_region}c" = data.aws_subnet.eks-prv-2.id
        }
      }
    ]
  },
  "1" = {
    last_step = false
    tags = ["step:1"]
    # Variables for step 1
    profiles = [
      {
        id = data.spectrocloud_cluster_profile.infra_base_profile.id
        packs = []
      },
      {
        id    = data.spectrocloud_cluster_profile.infra_cni_profile.id
        packs = [
          {
            name = "cni-aws-vpc-addon"
            tag  = "1.0.0"
            values = templatefile("${path.module}/config/cni-aws-vpc.yaml", {
              eks-security-group : local.eks-security-group-id,
              eks-subnet-a       : local.eks-pod-subnet-a
              eks-subnet-b       : local.eks-pod-subnet-b
              eks-subnet-c       : local.eks-pod-subnet-c
            })
          }
        ]
      }
    ]

    machine_pools = [
      {
        name = "temp-worker-pool"
        min = 1
        count = 1
        max = 1
        instance_type = "t3.large"
        disk_size_gb  = 60
        az_subnets = {
          "${var.aws_region}a" = data.aws_subnet.eks-prv-0.id
          "${var.aws_region}b" = data.aws_subnet.eks-prv-1.id
          "${var.aws_region}c" = data.aws_subnet.eks-prv-2.id
        }
      }
    ]
  },
}

Finally, we need to tie it all together and make Terraform output some useful information so that we can use a simple script to loop TF runs until the last step is reached. First, we define some useful outputs:

 
output "cluster_name" {
  value = spectrocloud_cluster_eks.this.name
}

output "completed_step" {
  value = local.step
}

output "last_step" {
  value = lookup(local.eks_state_map, local.step).last_step
}

We then use the last_step output to determine whether we need to perform another TF run, in the following script:

 
terraform fmt -check
terraform init -input=false -upgrade
terraform validate

until [ "$LAST_STEP" = "true" ]
do
  echo "Applying Terraform configuration..."
  terraform apply -auto-approve -input=false
  echo "Terraform apply complete."
  LAST_STEP=$(terraform output -raw last_step)
  if [ "$LAST_STEP" = "true" ] || [ "$LAST_STEP" = "false" ]; then
    if [ "$LAST_STEP" = "false" ]; then
      echo "Waiting 2 minutes before starting next Terraform cycle..."
      sleep 120
    fi
  else
    LAST_STEP="true"
  fi
done

And with this, our Terraform state machine is complete. I hope this state machine walkthrough is useful for tackling your more complex Terraform infrastructure deployment challenges. It certainly helped me get more out of Terraform than I was able to before.

New CAPA developments ahead!

While my solution works, it was always meant as a temporary workaround until the technology existed to no longer need multiple runs to deploy the infrastructure.

So, while this workaround is still a useful tool to keep in your back pocket, I’m pleased to say that we have also contributed significant improvements to CAPA to get the custom VPC CNI functionality working out of the box. This capability is new in Palette 3.0 — and it means I’ll be able to move my customer’s automation back to a single Terraform run. To find out more about Palette 3.0, check out the release notes.

Tags:
How to
Enterprise Scale