Pythian Blog: Technical Track

7 tips for managing Infrastructure with Terraform

Managing Infrastructure with Terraform

HashiCorp's Terraform is a powerful tool for managing diverse infrastructure as code and for automating deployment tasks at the infrastructure layer, using provider-exposed APIs such as those from AWS and vSphere. Getting started with Terraform is fairly simple. However, as we plan towards an operational framework, and as expected in our DevOps paradigm, we begin to focus on consistency and standardization of coding practices, release testing and deployment procedures, and managing state and dependencies effectively. Terraform does some cool stuff, but to get real value from it, there are a few considerations that will help keep your environment manageable.

Managing State and Environmental Separation

The Terraform state file is how the operational state of the environment is stored. By default it is local to the system running the 'terraform' toolset. In order to collaborate and provide support in a team environment, the state must be revision-controlled and stored centrally. Git does this job nicely, but it is imperative that the state be stored and managed separately from the code, and separately for each environment. We can wrap a procedure around the Terraform tasks that we execute to check out the state for the environment being tested (e.g. development, staging, production), and run a 'terraform plan' to ensure everything in the current environment is as it should be. For example, a script that executes the following:
cd terraform-env
git checkout origin_code production
git checkout origin_state production
terraform plan -detailed-exitcode
...should report no changes and return an exit status of 0 (meaning an empty diff between code and state). The '-detailed-exitcode' option matters here: without it, 'terraform plan' exits with 0 whether or not changes are pending; with it, the command exits with 2 when the plan contains changes and 1 on error. This makes our little command sequence scriptable; we can raise an alarm on a non-zero exit status to alert us to changes in the infrastructure's state that we did not initiate, such as a failed instance. Note, however, that this is not a good solution for monitoring the provider's API services themselves (the 'terraform plan' command will hang if the provider's API is unavailable). When we move to a release of updated code, 'terraform plan' will yield a list of changes to the environment for review. Subsequently, a 'terraform apply' updates both the environment and the state. Once the release is verified, the new state file can be stored in the repository for operational reference.
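The drift check described above can be wrapped in a small shell function. The following is a minimal sketch, not a definitive implementation: the function name check_drift and its output labels are hypothetical, and it assumes the '-detailed-exitcode' option of 'terraform plan', which exits with 0 when there is no diff, 1 on error and 2 when changes are pending.

```shell
#!/bin/sh
# check_drift: hypothetical helper that classifies the result of a drift check.
# Assumes it is run from the directory holding the checked-out code and state.
check_drift() {
    status=0
    # Discard the plan text; only the exit status matters for alerting.
    terraform plan -detailed-exitcode >/dev/null 2>&1 || status=$?
    case "$status" in
        0) echo "in-sync" ;;  # empty diff between code and state
        2) echo "drift" ;;    # the plan contains pending changes
        *) echo "error" ;;    # the plan itself failed
    esac
    return "$status"
}
```

A monitoring job could call check_drift after checking out the production code and state, and raise an alarm on any non-zero return value.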

Managing Resource Dependencies

When developing, use references to resource object attributes to map dependent resources to each other, and check that dependencies are properly ordered using the graph command and a pipe:
$ terraform graph | dot -Tpng > graph.png
(Use '-Tps > graph.ps' instead for PostScript output.) Each vertex in our dependency graph maps to a resource object in a template, and each edge to a reference between resources, for example:
resource "aws_route_table" "private_b" {
  ...
  nat_gateway_id = "${aws_nat_gateway.nat_gw_b.id}"
  ...
}
This ties the resources together by a programmatic object attribute (vs. a string identifier, addressed below). In this case the route table depends on the NAT gateway in an AWS VPC, which creates an edge between the two graph nodes that Terraform uses to determine build order and plan output. The related NAT gateway resource is also in Terraform:
resource "aws_nat_gateway" "nat_gw_b" {
  ...
}
We can also specify string reference values for existing infrastructure in our resource templates (but not with depends_on), and we can use 'terraform import' to add existing resources to the state (generation of resource templates for imported resources is planned for future releases of Terraform). However, this 'state-only' import is very fragile: because the resource doesn't exist in our templates, Terraform orphans it and marks it tainted on the next 'terraform plan'. As much as possible, we recommend that all dependencies be managed in Terraform and referenced by their resource object attributes, specifically so that the tool can establish the relationship in its dependency graph. If a dependency resource is unmanaged and it fails or is removed from the environment, Terraform will throw an error when it tries to reference the non-existent resource by its string identifier.

Abstracting and Interpolating Configuration Variables

To keep the environment's configuration manageable by operations, use variables (in a 'variables.tf' template) to abstract configuration values into a single location. The idea is to make the environment as tunable as possible by exposing configuration options. First, the basics:
/* IAM user's API access key and secret */
variable "access_key" {}
variable "secret_key" {}
/* IAM user's EC2 instance access key */
variable "keypair" {}
variable "keyfile" {}
These empty values will be picked up from the 'terraform.tfvars' file, which is excluded from revision control, should have '600' permissions, and contains your personal credentials for the provider's API (AWS in this particular case) and the deployment key to use (i.e. your IAM ssh keypair). If a variable is not defined in the 'terraform.tfvars' file or elsewhere, Terraform prompts for its value on execution. We can also use variable blocks to define collections, in this case for the various system roles in a Hadoop cluster:
variable "type" {
  description = "Instance type to deploy for each cluster role."
  default {
    "server" = "t2.medium"
    "edgenode" = "t2.small"
    "datanode" = "d2.2xlarge"
    "servicenode" = "t2.medium"
  }
}
These can be called from your resource templates, using (for example):
/* specify the instance type for the role */
instance_type = "${var.type.datanode}"
If we have tags that we wish to abstract into configurable variables, we can add a variable collection like this:
variable "ami_ubuntu14" {
  description = "The AMI id of the image to use, and relevant platform string for instance tagging."
  default {
    "us-east-1" = "ami-8446ff93"
    "us-west-2" = "ami-65579105"
    "platform" = "Ubuntu 14.04 LTS"
  }
}
...and do this in our resource template:
tags {
  Platform = "${var.ami_ubuntu14.platform}"
  Tier = "cluster"
  Role = "datanode"
}
Note that collections are limited to one sub-level of hierarchy (i.e. we can't embed a collection in a collection). But, we can embed a list as a string...

Use Interpolation and Functions Effectively

Use interpolation to look up list values, such as availability zones, that are stored in variables. For example, in the variables file we define something like:
variable "region" {
  default {
    "primary" = "us-west-2"
    "backup" = "us-east-1"
  }
}

variable "azones" {
  description = "List of AZ's per region for lookup from resource templates."
  default {
    "us-east-1" = "us-east-1a,us-east-1b"
    "us-west-2" = "us-west-2a,us-west-2b,us-west-2c"
  }
}
...and in the resource template, we define something like:
availability_zone = "${element(split(",", lookup(var.azones, var.region.primary)), 0)}"
...where the element and split functions in concert allow us to create, manage and extend our variable maps without touching the resource templates. This allows templates to be reused and extended with flexible, list-based configuration values. For instance, we can change the primary region and supply a new list of availability zones that match that new region, without touching the lookup in the resource template that references it.
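To make the lookup chain concrete, here is a rough shell emulation of what 'element(split(",", ...), N)' evaluates to. This is an illustrative sketch rather than Terraform code: the function name element_of is hypothetical, and the wrap-around on an out-of-range index mirrors the modulo behavior documented for element().

```shell
#!/bin/sh
# element_of LIST INDEX
# Print the zero-based INDEX-th item of a comma-separated LIST,
# wrapping around when INDEX is past the end of the list.
element_of() {
    list=$1
    index=$2
    old_ifs=$IFS
    IFS=,
    set -f            # keep list items from being glob-expanded
    set -- $list      # split on commas into positional parameters
    set +f
    IFS=$old_ifs
    [ "$#" -gt 0 ] || return 1
    shift $(( $index % $# ))
    printf '%s\n' "$1"
}

# The AZ list stored under "us-west-2" in the "azones" map above.
azones_us_west_2="us-west-2a,us-west-2b,us-west-2c"
```

Calling element_of "$azones_us_west_2" 0 prints us-west-2a, the same zone the resource template selects for the primary region. Adding a zone to the string changes what higher indexes resolve to without any edit to the lookup itself.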

Build Resources Based on Variable Options

We were able to figure out how to extend deployment options by specifying a true / false value and multiplying by the resource's 'count' variable. This turns out to be very useful indeed. For example, in my reference template I multiply the count by a true / false variable value called server.needed. If true, the count will be multiplied by one (the numeric representation of 'true'), yielding the count specified in the other variable. If false, it is multiplied by zero, and no servers will be deployed.
count = "${var.server.count * var.server.needed}"
There's even a neat way to negate this. A value of false is zero or null, and a value of true is any other value. So, if I wish to specify, for instance, whether a DB configuration will be local or remote, I can do this on the resource template that defines the remote database, perhaps in an RDS resource template:
count = "${1 * var.db.remote}"
...where db.remote is a true / false value. Then, in the resource template where the local database will be defined, I can test for the inverse value, triggering a local build if var.db.remote is false:
count = "${1 * (1 - var.db.remote)}"
The subtraction from one makes it false if true (1 - 1 = 0, false), and true if false (1 - 0 = 1, true). A great application of this is to decide whether to deploy in one availability zone or two, based on a configuration flag:
variable "multi_az" {
  description = "Define whether the VPC will be deployed with multiple availability zones."
  default = true
}

/* Deploy an instance in AZ 1 */
resource "aws_instance" "inst_az1" {
  ...
  availability_zone = "${element(split(",", lookup(var.azones, var.region.primary)), 0)}"
  count = "${var.inst_count}"
}

/* Deploy a second instance in AZ 2, if multi-AZ is true */
resource "aws_instance" "inst_az2" {
  ...
  availability_zone = "${element(split(",", lookup(var.azones, var.region.primary)), 1)}"
  count = "${var.inst_count * var.multi_az}"
}
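The count arithmetic in this section can be checked with ordinary integer math. The sketch below reproduces it in shell; the function names are hypothetical, and it assumes, as the text does, that a flag is represented as 1 (true) or 0 (false).

```shell
#!/bin/sh
# enabled_count COUNT FLAG: COUNT resources when FLAG is 1, none when FLAG is 0.
enabled_count() {
    echo $(( $1 * $2 ))
}

# inverse_count COUNT FLAG: COUNT resources when FLAG is 0, none when FLAG is 1.
inverse_count() {
    echo $(( $1 * (1 - $2) ))
}
```

With an instance count of 3, enabled_count 3 1 gives 3 and enabled_count 3 0 gives 0, so the second-AZ instances exist only when the multi-AZ flag is true. Likewise, inverse_count 1 1 gives 0 and inverse_count 1 0 gives 1, matching the local-database case.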

Properly Configure .gitignore

This is really basic but really important. Abstract your credentials from the repository. The .gitignore for the code repository should list:
terraform.*
...which will exclude the tfvars and state / state backup files from the code repository. The .gitignore for the state repository should list:
*
!terraform.tfstate*
...which will result in git storing only the state files for operational management of the environment by the support team. Of course, remember never to add credentials or sensitive information into the repository. Use a local 'terraform.tfvars' file with strict permissions to store your credentials, and leverage ssh-agent to cache and manage your authentication to the environment.

Use Null Resource Templates to Decouple CM

One of the powerful features of Terraform is its ability to bootstrap configuration management frameworks, such as Puppet and Ansible. In order to allow Terraform to trigger configuration management without destroying the resources and recreating them, we discovered a method to decouple the bootstrapping by defining a null resource and a trigger, like this:
resource "null_resource" "datanode" {
  count = "${var.count.datanode}"

  triggers {
    instance_ids = "${element(aws_instance.datanode.*.id, count.index)}"
  }

  provisioner "remote-exec" {
    inline = [
      ...
    ]

    connection {
      type = "ssh"
      user = "centos"
      host = "${element(aws_instance.datanode.*.private_ip, count.index)}"
    }
  }
}
This resource builds a provisioner for each datanode in a Hadoop cluster of variable size. The element function combined with the count.index value allows us to create one copy of this resource per node, and trigger a re-provisioning on a single node using the array element identifier. For example, we can now use 'terraform taint null_resource.datanode.N; terraform plan' (where N is the node's count index; the exact address syntax for an indexed resource varies between Terraform versions) to get Terraform to plan the execution of the provisioner for a specific datanode, without having to recreate the instance resource or run the same provisioner on all the other nodes. We can also assign null resource provisioners directly to specific instances, by replacing the splat list in the trigger with a direct reference. For instance, we can provision our server node with this trigger:
triggers {
  server_id = "${aws_instance.server.id}"
}
...which tells Terraform that if the instance id changes for the server, it should recreate the resource containing the trigger (which happens to be a provisioner in a null resource block, in our case).
