Breaking the Terraform Monolith - Silos of Infrastructure

Posted on Jan 15, 2019

At Globality we have about 120K lines of Terraform configuration that we use to manage our infrastructure. We manage everything we can this way. AWS, GitHub, CI, and more.

We’re heavy users; we even go so far as to work with our own patched version of Terraform while the official plugins catch up with us. For example, we recently moved a number of our AI classifiers to SageMaker, but the AWS provider didn’t support all of the configuration options we needed.

As with any software system, the combination of organic growth and large code bases creates challenges.

I want to share some of these challenges and our solutions.

The Challenges

Build Time (plan and apply)

We started (relatively) naively: we had one single git repository (globality-terraform) and put everything in it.

We weren’t completely stupid; we organized our code into modules and directories, but even so, every time we wanted to make a change, we had to run terraform plan and terraform apply against every resource in our repository 1.

The runs were not fast. It was unbearably slow for changesets that only touched a few resources.

Access Control

We take security seriously. We run on the principle of least access. You only have the access you absolutely must have.

Engineers typically don’t have access to database configurations, core networking, and so forth, and they certainly do not have write access to these things. Engineers may have access to service definitions, message queues, and application load balancers. By having a single terraform repository with all of our infrastructure, we were blocking engineers from using terraform in the places where they did have access, because it wasn’t reasonable to give them access to everything else.

Surface Area

Our most common infrastructure changes relate to adding/changing services. (We do this a lot!)

In a typical case, we’d need to:

  1. Create a new docker image repository (in ECR) for the service.
  2. Create a task definition (in ECS) to run one or more containers from this image.
  3. Create a service (in ECS) for this task definition.
  4. Create some number of queues (in SQS) for the service’s asynchronous daemon(s).
  5. Subscribe these queues to appropriate topics (in SNS).
  6. Configure storage for application secrets (in SecretsManager).
  7. Connect load balancers (ALB) and DNS (Route53) to the service.
  8. More?

We wrote HOWTO documents and step-by-step instructions, but this was all still just too much friction.

Terraform Fragility

Up until Terraform 0.12, repetitive resource definitions had to rely on various count-based hacks.

For example, if you have many ECR repos, you might do something like:

resource "aws_ecr_repository" "repo" {
  count = "${length(var.services)}"
  name  = "${var.services[count.index]}"
}

This works… right up until it doesn’t, because Terraform tracks state based on the index of the declaration. In particular, if you change the order of your definitions in var.services – which frequently happens when you deprecate some service that isn’t the one you most recently added – Terraform sees this as a change to every resource after the lowest index that changed.

This can completely ruin your day, in more ways than you could possibly imagine. For example, it can delete and recreate your service’s DNS record, making it unavailable for a couple of minutes, or remove the service altogether and break the service discovery tier. Messy.

What We’ve Done

Terraform Template Generators

Instead of using Terraform count to produce copies of resources, we write template generators to produce Terraform configuration files. We typically do so by writing a small Python DSL (dataclasses and Enums are excellent here) that feeds Jinja2 templates.

We’ve done this to generate ECS definitions from a centralized service directory DSL. We’ve done this to generate user entitlements (as AWS IAM roles) from data in our SSO system.

Because this approach generates discrete resources without the index-based identifiers, removing resources results in a terraform plan that only involves those resources. (It’s also far easier to employ Terraform’s -target mechanism without index-based identifiers.)
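
For instance, the generated output for the ECR repositories might look like this (a minimal sketch; the service names here are made up):

# Generated by the service-directory generator - do not edit by hand.
resource "aws_ecr_repository" "billing" {
  name = "billing"
}

resource "aws_ecr_repository" "search" {
  name = "search"
}

Each service gets its own addressable resource, so plans and -target runs work on exact names instead of indexes.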

The other nice thing about this approach is that it allows engineers to do work without necessarily having to learn Terraform. Adding or modifying a service consists of editing a Python or JSON file in git, creating a PR, and letting the generator do its job. The number of Terraform files (most) engineers have to touch drops to a grand total of zero. The error rate goes down because no one has to remember (or consult a HOWTO) to create the ECR repository in the right place or add the right subscription.

We ended up with clean DSLs, well-organized Terraform repositories, and happier developers.

Separation Of Concerns

If you think about your application infrastructure (beyond a hobby, single-entry-point app), you have multiple layers that you need to manage.

  1. Network - VPC, routing tables, security groups, VPN
  2. Compute Capacity - Some “cluster” abstraction built from auto scaling groups, load balancers, and so on
  3. CDN - Anything that is customer facing and needs to “sit” behind a CDN; for example, our auth layer and static files
  4. Tasks - Any scheduled task that runs at the infrastructure level; for us, that includes a ton of Lambda functions, Step Functions, backup systems, archiving systems, and more
  5. Services - Actual applications run here, including ECS/Kubernetes definitions, DNS records, service discovery, and more

There are more layers than the ones mentioned here, such as storage (databases, caches, Redis), secrets storage, and encryption keys; however, the principle stands.

Having these layers means we can give access to any ECS-based API without giving access to any VPC or CDN resources.
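
As a rough illustration (this is only a sketch, not our actual policy; the policy name and action list are made up), the layer split maps naturally onto IAM policies:

resource "aws_iam_policy" "service_layer" {
  name = "service-layer-${var.environment}"

  # Engineers working in the service layer can manage ECS, ECR, and SQS,
  # but get nothing that touches VPCs, routing, or the CDN.
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecs:*", "ecr:*", "sqs:*", "elasticloadbalancing:*"],
      "Resource": "*"
    }
  ]
}
EOF
}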

Only a handful of people (literally) have access to all of the above; out of those, only about 50% have access to prod, and only 10% of those have access to the base security/network layers.

Separating terraform into repositories based on these layers was the solution we went with.

That, however, created other challenges…

Terraform resource linking

Let’s say you want to create a load balancer with terraform; to do that, you need to create a security group and assign that security group to the load balancer.

resource "aws_security_group" "lb_sg" {
  # REDACTED
}

resource "aws_lb" "test" {
  # REDACTED
  security_groups    = ["${aws_security_group.lb_sg.id}"]
}

This is the main reason a lot of terraform configuration ends up monolithic: you need access to resources in order to link them to other resources.

If you break up your terraform, you need to address this. We started off from the basics.

Every resource we have follows strict conventions.

  1. The name includes the environment name at the end, or it IS the environment name. The VPC will be called dev, any subnet will be called ephemeral-dev or private-dev, an auto scaling group will be called frontend-dev, etc… (see the sketch after this list)
  2. Every resource will have a tag called Environment with the actual environment name in it. 2
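
A minimal sketch of what these conventions look like on a resource (the subnet, VPC reference, and CIDR are illustrative):

resource "aws_subnet" "private" {
  vpc_id     = "${aws_vpc.main.id}"   # hypothetical VPC resource in the network repo
  cidr_block = "10.0.1.0/24"

  tags {
    Name        = "private-${var.environment}"   # e.g. "private-dev"
    Environment = "${var.environment}"
  }
}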

Now that we have the standards established, we can use terraform to query these resources.

We treat each repository as a “service” in the micro-services world. Each repository does its own thing, and some of them base their work on the assumption that some other repository has already done its thing.

So, the layer that introduces the ALB queries for a security group that was provisioned by another layer.

Let’s take a concrete example.

For us, the entire base network is its own terraform repository; we create the VPC, the subnets, the routing tables, and all the rest of the network base in that repo.

In other repos, we might need access to that VPC’s vpc_id attribute, or to the subnet ids in order to place other resources in them (lambda functions, for example).

Using data resources

One of the most powerful features in terraform is the ability to query your existing infrastructure using what’s called a data resource.

For example, if you want to use the vpc_id, you need this data resource:

data "aws_vpc" "selected" {
  filter {
    name = "tag:Name"
    values = ["${var.environment}"]
  }
}

This filters the VPCs by their Name tag and returns the matching one. You can then use data.aws_vpc.selected.id.
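
For example (a sketch; the security group name is made up), a repo that does not own the VPC can still attach resources to it:

resource "aws_security_group" "service" {
  name   = "service-${var.environment}"
  vpc_id = "${data.aws_vpc.selected.id}"
}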

You can do the same to select subnet ids, for example:

data "aws_vpc" "selected" {
  filter {
    name   = "tag:Name"
    values = ["${var.environment}"]
  }
}

data "aws_subnet_ids" "selected" {
  vpc_id = "${data.aws_vpc.selected.id}"

  filter {
    name   = "tag:Name"
    values = ["private-${var.environment}"]
  }
}

This technique allows you to rely on resources already provisioned by another terraform repository.
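
For example (a sketch in the same pre-0.12 syntax; the function and security group names are made up), a repo that owns scheduled tasks can place a lambda function into the private subnets queried above:

resource "aws_lambda_function" "archiver" {
  function_name = "archiver-${var.environment}"
  # REDACTED: handler, runtime, role, deployment package

  vpc_config {
    subnet_ids         = ["${data.aws_subnet_ids.selected.ids}"]
    security_group_ids = ["${aws_security_group.archiver.id}"]   # hypothetical SG defined in the same repo
  }
}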

This is especially useful for us in DR scenarios. We have a base of the environment provisioned in other regions, so when we need to flip over, we don’t need to create everything from scratch.

Since we follow strict conventions, the data resources always return the right thing, whether we are in the primary region or in disaster recovery mode.

Using remote_state

In scenarios where you can’t use data resources (when a suitable one doesn’t exist or doesn’t fit what you need), you can read the state of repo X from repo Y.

Any output from the main terraform execution context is written to the state. That state can then be read and used in other repos.

data "terraform_remote_state" "account" {
  backend = "s3"

  config {
    bucket = "{ BUCKET_NAME }"
    key    = "{ KEY_NAME }"
    region = "{ REGION }"
  }
}

You can then reference any valid output variable in it using ${data.terraform_remote_state.account.some_variable_name}.
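
A minimal sketch of both sides (the output name and the role are made up for illustration):

# In the repository that owns the resource (the "account" repo here):
output "deploy_role_arn" {
  value = "${aws_iam_role.deploy.arn}"   # hypothetical role defined in that repo
}

# In the consuming repository, next to the terraform_remote_state data source above:
provider "aws" {
  region = "us-east-1"   # illustrative

  assume_role {
    role_arn = "${data.terraform_remote_state.account.deploy_role_arn}"
  }
}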

Automation

We wrap each repository with automation:

  1. We run terraform fmt, the terraform equivalent of lint.
  2. We run that fmt on a fixed set of directories (again - standards).
  3. We run plan on every commit to a branch and post that plan back to GitHub with the diff.
  4. After the PR is approved, we apply the plan (on development).

Having fmt run with every commit (just like tests in the application world) makes all of our repositories share the same aesthetics and look-and-feel.

Running the plan through CI makes sure we have a record of every PR’s diff, and people who don’t have permission to run terraform themselves (most) can still make changes and make sense of things.

Conclusion

We now have about 6-8 layers of terraform code. Each of them runs on its own cadence, and you can experiment on one without affecting the others.

We have engineers self-serve on a lot of infrastructure changes without writing terraform or having the permission to run it on their machines.

Everything runs through GitHub flows: CI runs lint, plan, and apply, and comments back to GitHub with the expected changes.

Thank you

Thanks to Jesse and Moshe for reviewing earlier versions of this blog post


  1. Yes, there is a way to tell terraform to narrow down the scope of what you want to run, but we also run terraform on CI, where we can’t realistically narrow the scope for every change. ↩︎

  2. We have separate accounts too. Even so, we still use these conventions. ↩︎