IaC for AI Platforms with Terraform

Reproducible, auditable, drift-free infrastructure for AI agent platforms. Terraform modules for Memory Spine provisioning, secrets management, and multi-environment deployment.

AI platforms are deceptively complex to operate. A single Memory Spine deployment might include a vector store, a key-value layer, a REST API, TLS certificates, secrets for upstream LLM providers, and autoscaling policies that differ per environment. Multiply that by three environments — development, staging, production — and you’re staring at a configuration matrix that no wiki page can keep honest.

Infrastructure as Code eliminates the guesswork. Every resource is declared, versioned, and reproducible. In this guide we walk through a complete Terraform-based approach to provisioning, securing, and operating Memory Spine and the broader ChaozCode AI platform — from module design to drift detection.

1. Why IaC Matters for AI Platforms

Traditional web services already benefit from IaC, but AI platforms raise the stakes in three specific ways.

Reproducibility Across Experiments

Machine-learning workflows are only as trustworthy as the environment they run in. When an agent behaves differently in staging than in production, the first question is always: are the environments truly identical? IaC guarantees they are. Every network rule, disk size, and environment variable is declared in HCL, reviewed in a pull request, and applied atomically. If a model evaluation passes in staging, you know the infrastructure is not the variable.

Drift Detection

AI platforms accumulate manual changes fast. Someone bumps the Memory Spine replica count to handle a load test, forgets to revert it, and the monthly bill spikes. Someone else adds a permissive security group rule to debug a connectivity issue; it stays open for months. Terraform’s plan command surfaces every deviation from the declared state before anything is applied.

Environment Parity

A Memory Spine cluster in dev should mirror production in structure, even if it runs at a smaller scale. IaC lets you share the same module with different variable files so that network topology, IAM policies, and service dependencies remain identical. Only resource sizes change.

📊 Industry Data

Organizations using IaC for AI workloads report 72% fewer environment-related incident tickets and a 3.4× improvement in mean time to recovery compared to manually provisioned infrastructure (2025 DORA State of DevOps Report).

2. Terraform Module Design for AI Services

A well-structured Terraform module is the foundation of every reliable deployment. For AI platforms you typically need three layers of modules: core infrastructure (networking, DNS, certificates), data services (databases, vector stores, caches), and application services (Memory Spine API, agent workers, monitoring).

Recommended Directory Layout

terraform/
├── modules/
│   ├── networking/          # VPC, subnets, security groups
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── memory-spine/        # Memory Spine cluster
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── templates/
│   │       └── user-data.sh.tpl
│   ├── secrets/             # Vault + secret injection
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/          # Prometheus, Grafana, alerting
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf          # Module calls with dev vars
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── terraform.tfvars
│       └── backend.tf
└── policies/
    ├── cost-guardrails.sentinel
    └── security-baseline.sentinel

Each module exposes a narrow interface through variables.tf and outputs.tf. Modules never hard-code environment-specific values. The environments/ directories supply those values and wire modules together.
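To make the wiring concrete, here is a sketch of how an environment's main.tf might call these modules. The module paths match the layout above; the networking module's output names (vpc_id, internal_sg_id, and the subnet lists) are illustrative assumptions, not part of the module contract shown below.

```hcl
# environments/dev/main.tf (illustrative sketch)

module "networking" {
  source      = "../../modules/networking"
  environment = "dev"
}

module "memory_spine" {
  source = "../../modules/memory-spine"

  environment         = var.environment
  instance_count      = var.instance_count
  instance_type       = var.instance_type
  vector_store_engine = var.vector_store_engine
  enable_tls          = var.enable_tls
  vault_address       = var.vault_address
  allowed_cidr_blocks = var.allowed_cidr_blocks
  tags                = var.tags

  # Wiring between modules happens through outputs, never hard-coded IDs.
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnet_ids
  public_subnet_ids  = module.networking.public_subnet_ids
  internal_sg_id     = module.networking.internal_sg_id
}
```

Because every cross-module reference goes through an output, Terraform derives the dependency graph automatically; no depends_on bookkeeping is needed.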

Variables and Outputs Contract

# modules/memory-spine/variables.tf

variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_count" {
  description = "Number of Memory Spine API replicas"
  type        = number
  default     = 2
}

variable "instance_type" {
  description = "EC2 instance type for Memory Spine nodes"
  type        = string
  default     = "r6i.xlarge"
}

variable "vector_store_engine" {
  description = "Vector store backend: qdrant or pgvector"
  type        = string
  default     = "qdrant"
}

variable "enable_tls" {
  description = "Enable TLS termination at the load balancer"
  type        = bool
  default     = true
}

variable "vault_address" {
  description = "HashiCorp Vault endpoint for secret injection"
  type        = string
}

variable "allowed_cidr_blocks" {
  description = "CIDR blocks allowed to reach Memory Spine API"
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Resource tags applied to all created resources"
  type        = map(string)
  default     = {}
}

# modules/memory-spine/outputs.tf

output "api_endpoint" {
  description = "Memory Spine API load balancer URL"
  value       = "https://${aws_lb.memory_spine.dns_name}"
}

output "vector_store_endpoint" {
  description = "Internal vector store connection string"
  value       = aws_instance.vector_store[0].private_ip
  sensitive   = true
}

output "security_group_id" {
  description = "Security group attached to Memory Spine nodes"
  value       = aws_security_group.memory_spine.id
}

This strict contract means any team — platform engineering, ML engineering, or security — can review the module interface without reading the implementation. Changes to the contract surface in pull-request diffs, triggering appropriate review.

3. Provisioning Memory Spine with Terraform

Here is the core module that provisions a Memory Spine cluster: compute instances, a vector store, an application load balancer, and the necessary security groups.

# modules/memory-spine/main.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}

# ── Security Group ──
resource "aws_security_group" "memory_spine" {
  name_prefix = "memspine-${var.environment}-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from allowed CIDRs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.allowed_cidr_blocks
  }

  ingress {
    description     = "Internal API traffic"
    from_port       = 8788
    to_port         = 8788
    protocol        = "tcp"
    security_groups = [var.internal_sg_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = merge(var.tags, {
    Name = "memspine-${var.environment}"
    Service = "memory-spine"
  })
}

# ── Launch Template ──
resource "aws_launch_template" "memory_spine" {
  name_prefix   = "memspine-${var.environment}-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  user_data = base64encode(templatefile(
    "${path.module}/templates/user-data.sh.tpl",
    {
      environment        = var.environment
      vault_address      = var.vault_address
      vector_store_engine = var.vector_store_engine
    }
  ))

  vpc_security_group_ids = [aws_security_group.memory_spine.id]

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_size = var.environment == "prod" ? 200 : 50
      volume_type = "gp3"
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.tags, {
      Name = "memspine-${var.environment}"
    })
  }
}

# ── Auto Scaling Group ──
resource "aws_autoscaling_group" "memory_spine" {
  name                = "memspine-${var.environment}"
  desired_capacity    = var.instance_count
  min_size            = var.environment == "prod" ? 2 : 1
  max_size            = var.instance_count * 3
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.memory_spine.arn]

  launch_template {
    id      = aws_launch_template.memory_spine.id
    version = "$Latest"
  }

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}

# ── Application Load Balancer ──
resource "aws_lb" "memory_spine" {
  name               = "memspine-${var.environment}"
  internal           = var.environment != "prod"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.memory_spine.id]
  # Internal (non-prod) load balancers belong in private subnets;
  # only the internet-facing prod ALB lives in public subnets.
  subnets = var.environment == "prod" ? var.public_subnet_ids : var.private_subnet_ids

  tags = var.tags
}

resource "aws_lb_target_group" "memory_spine" {
  name     = "memspine-${var.environment}-tg"
  port     = 8788
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
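The Auto Scaling Group above sets capacity bounds but no scaling behavior. A target-tracking policy is one way to fill that gap; this sketch is an assumption layered on the module (the resource name and target values are illustrative), scaling on average CPU with a more relaxed target outside production:

```hcl
# modules/memory-spine/autoscaling.tf (illustrative sketch)

resource "aws_autoscaling_policy" "memory_spine_cpu" {
  name                   = "memspine-${var.environment}-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.memory_spine.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    # Prod scales earlier to preserve headroom; dev tolerates higher load.
    target_value = var.environment == "prod" ? 55 : 75
  }
}
```

Because the policy lives in the module, every environment gets the same scaling behavior with only the target differing, which preserves the environment-parity guarantee from section 1.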

Running Plan and Apply

The deployment workflow follows a standard plan-review-apply cycle. In CI, the plan output is posted as a pull-request comment so reviewers see exactly which resources will change.

#!/usr/bin/env bash
# deploy.sh — plan and apply for a target environment

set -euo pipefail

ENV="${1:?Usage: deploy.sh <environment>}"
DIR="terraform/environments/${ENV}"

echo "══════ Initializing ${ENV} ══════"
terraform -chdir="${DIR}" init -upgrade

echo "══════ Planning ${ENV} ══════"
PLAN_EXIT=0
terraform -chdir="${DIR}" plan \
  -out="${ENV}.tfplan" \
  -var-file="terraform.tfvars" \
  -detailed-exitcode || PLAN_EXIT=$?
# -detailed-exitcode returns 0 (no changes), 1 (error), or 2 (changes).
# The `|| PLAN_EXIT=$?` capture is required: under `set -e`, a bare exit
# code 2 would abort the script before we could branch on it.

if [ "${PLAN_EXIT}" -eq 0 ]; then
  echo "No changes detected."
  exit 0
elif [ "${PLAN_EXIT}" -eq 2 ]; then
  echo "Changes detected — applying..."
  terraform -chdir="${DIR}" apply "${ENV}.tfplan"
else
  echo "Plan failed." >&2
  exit 1
fi

echo "══════ Verifying health ══════"
API_URL=$(terraform -chdir="${DIR}" output -raw api_endpoint)
curl -sf "${API_URL}/health" || {
  echo "Health check failed after apply!" >&2
  exit 1
}
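The -detailed-exitcode convention drives the whole workflow, and it interacts subtly with set -e. Here is the capture pattern in isolation, with a hypothetical stub standing in for terraform plan:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stub for `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = changes present.
fake_plan() { return 2; }

rc=0
fake_plan || rc=$?   # `|| rc=$?` keeps set -e from aborting on exit code 2

case "$rc" in
  0) echo "no changes" ;;
  2) echo "changes detected" ;;
  *) echo "plan failed" >&2; exit 1 ;;
esac
```

Without the || guard, any non-zero status terminates the script immediately, so the "changes detected" branch would never run.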

State Management

Each environment uses an isolated state backend. For AWS deployments, an S3 bucket with DynamoDB locking is standard:

# environments/prod/backend.tf

terraform {
  backend "s3" {
    bucket         = "chaozcode-terraform-state"
    key            = "memory-spine/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Separate state files per environment prevent accidental cross-environment changes. A single terraform destroy in dev never touches the production state.
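The bucket and lock table referenced in backend.tf must exist before the first terraform init, so they are typically bootstrapped once, outside the per-environment stacks. A minimal sketch (the bucket and table names match the backend block above; the versioning and KMS encryption settings are recommended defaults, not requirements):

```hcl
# bootstrap/state-backend.tf (run once, before any environment init)

resource "aws_s3_bucket" "terraform_state" {
  bucket = "chaozcode-terraform-state"
}

# Versioning lets you recover an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"   # attribute name Terraform's S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

This bootstrap stack is the one piece of infrastructure that cannot manage its own state; keep its (local) state in version control or import it once the remote backend exists.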

4. Secrets Management for AI Platforms

AI platforms handle high-value secrets: LLM provider API keys, database credentials, encryption keys for stored memories, and service tokens for inter-service communication. Putting these in terraform.tfvars is a non-starter. The answer is HashiCorp Vault integrated directly into your Terraform workflow.

Vault Provider Configuration

# modules/secrets/main.tf

provider "vault" {
  address = var.vault_address
}

# KV secrets engine for AI provider keys
resource "vault_mount" "ai_keys" {
  path        = "secret/ai-providers"
  type        = "kv-v2"
  description = "AI provider API keys with versioning"
}

# OpenAI key with automatic rotation metadata
resource "vault_kv_secret_v2" "openai" {
  mount = vault_mount.ai_keys.path
  name  = "${var.environment}/openai"

  data_json = jsonencode({
    api_key          = var.openai_api_key
    organization_id  = var.openai_org_id
    rotation_due     = timeadd(timestamp(), "720h")
  })

  lifecycle {
    ignore_changes = [data_json]
  }
}

# Policy granting Memory Spine read access
resource "vault_policy" "memory_spine" {
  name   = "memory-spine-${var.environment}"
  policy = <<-EOT
    # KV v2 API paths insert data/ and metadata/ after the mount path
    # ("secret/ai-providers"), not after "secret/".
    path "secret/ai-providers/data/${var.environment}/*" {
      capabilities = ["read"]
    }
    path "secret/ai-providers/metadata/${var.environment}/*" {
      capabilities = ["list"]
    }
  EOT
}

# AppRole for Memory Spine service authentication
resource "vault_approle_auth_backend_role" "memory_spine" {
  backend        = "approle"
  role_name      = "memory-spine-${var.environment}"
  token_policies = [vault_policy.memory_spine.name]
  token_ttl      = 3600
  token_max_ttl  = 14400
}

⚠️ Never Store Secrets in State

Terraform state files record the attributes of every managed resource, including secret values, in plaintext. Always encrypt your state backend, restrict access with IAM policies, and mark secret outputs with sensitive = true (this hides them from CLI output but does not remove them from state). Note that data sources are recorded in state as well, so reading a secret through a data source does not keep it out. To avoid persisting secrets entirely, inject them at runtime (for example via Vault Agent or an AppRole login from the application itself) or use Terraform's ephemeral values, available since 1.10, which are never written to state or plan files.

Environment-Specific Secret Paths

Vault paths follow a strict convention that mirrors the Terraform environment structure: secret/ai-providers/dev/*, secret/ai-providers/staging/*, and secret/ai-providers/prod/*. Each environment's Memory Spine instance authenticates with a dedicated AppRole and can only read secrets under its own path.

This isolation ensures that a compromised dev environment cannot read production secrets, even if an attacker obtains the dev AppRole credentials.
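For KV v2 mounts, API paths insert data/ (reads) or metadata/ (listing) after the mount path. A small helper makes the convention explicit; the function is illustrative, not part of any SDK:

```python
def vault_kv2_path(mount: str, environment: str, name: str,
                   *, metadata: bool = False) -> str:
    """Build a KV v2 API path: <mount>/data/<env>/<name>,
    or <mount>/metadata/<env>/<name> for list operations."""
    segment = "metadata" if metadata else "data"
    return f"{mount.rstrip('/')}/{segment}/{environment}/{name}"

# Each environment reads only under its own prefix:
print(vault_kv2_path("secret/ai-providers", "prod", "openai"))
# secret/ai-providers/data/prod/openai
```

The Vault policy above grants read on exactly this prefix, so the AppRole token issued to a dev instance fails with a 403 on any prod path.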

5. Multi-Environment Deployment

There are two dominant strategies for managing multiple environments in Terraform: workspaces and directory-per-environment. For AI platforms, we strongly recommend the directory approach.

| Aspect | Workspace Strategy | Directory Strategy |
|---|---|---|
| State isolation | Shared backend, separate state keys | Fully separate backends possible |
| Module version pinning | Same module version across envs | Each env can pin different versions |
| Review clarity | Harder to see which env a change targets | PR diff clearly scoped to one env |
| Provider config | Shared provider, conditional logic | Per-env provider with explicit config |
| Blast radius | Higher — wrong workspace = wrong env | Lower — directory name is explicit |
| CI/CD integration | Requires workspace switching step | Path-based triggers, no switching |

Promoting Artifacts Across Environments

A promotion workflow ensures changes move from dev → staging → production with validation at each gate:

  1. Dev: Engineer opens a PR modifying environments/dev/. CI runs terraform plan and posts the diff. On merge, auto-apply deploys to dev.
  2. Staging: Once dev is verified, the engineer copies the updated terraform.tfvars values (adjusted for staging scale) into environments/staging/. Staging apply runs integration tests against the Memory Spine health endpoint.
  3. Production: A release manager reviews the staging results, opens a PR to environments/prod/, and applies after approval from both platform and security teams.

# environments/dev/terraform.tfvars

environment        = "dev"
instance_count     = 1
instance_type      = "t3.large"
vector_store_engine = "pgvector"
enable_tls         = false
vault_address      = "https://vault.dev.chaozcode.internal"

allowed_cidr_blocks = [
  "10.0.0.0/8"     # Internal VPN
]

tags = {
  Environment = "dev"
  Team        = "platform-engineering"
  CostCenter  = "engineering"
}

# environments/prod/terraform.tfvars

environment        = "prod"
instance_count     = 4
instance_type      = "r6i.2xlarge"
vector_store_engine = "qdrant"
enable_tls         = true
vault_address      = "https://vault.prod.chaozcode.internal"

allowed_cidr_blocks = [
  "10.0.0.0/8",       # Internal services
  "203.0.113.0/24"    # CDN egress
]

tags = {
  Environment = "prod"
  Team        = "platform-engineering"
  CostCenter  = "infrastructure"
  Compliance  = "soc2"
}

The same Memory Spine module serves both environments. Dev uses a single t3.large with pgvector and no TLS (behind a VPN). Production runs four r6i.2xlarge instances with Qdrant and TLS termination at the load balancer. The infrastructure shape is identical; only the scale and security posture change.

6. Drift Detection and Compliance

Deploying infrastructure is only half the battle. The other half is ensuring it stays in the declared state. Terraform provides the mechanism; you need the process.

Scheduled Plan Validation

Run terraform plan on a schedule — every four hours in production, daily in staging. If the plan detects drift (exit code 2), fire an alert to the on-call platform engineer.

# .github/workflows/drift-detection.yml (excerpt)

- name: Check for drift
  run: |
    EXIT_CODE=0
    terraform -chdir=terraform/environments/prod plan \
      -detailed-exitcode \
      -var-file=terraform.tfvars \
      -no-color > plan-output.txt 2>&1 || EXIT_CODE=$?
    # `|| EXIT_CODE=$?` captures the status without failing the step.
    # A trailing `|| true` would discard it, and PIPESTATUS would then
    # report the status of `true`, so drift would never be detected.
    if [ "$EXIT_CODE" -eq 2 ]; then
      echo "::error::Infrastructure drift detected in production!"
      cat plan-output.txt
      # Post to Slack or PagerDuty
      curl -X POST "$SLACK_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d '{"text":"⚠️ Terraform drift detected in prod Memory Spine. Review plan output."}'
      exit 1
    elif [ "$EXIT_CODE" -ne 0 ]; then
      echo "::error::Terraform plan failed; see plan-output.txt"
      cat plan-output.txt
      exit 1
    fi
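For richer alerts than a raw plan dump, the machine-readable plan from terraform show -json <planfile> can be summarized before posting. A sketch using Terraform's documented JSON plan format (resource_changes[].change.actions); the sample payload is constructed for illustration:

```python
import json

def summarize_drift(plan_json: str) -> dict:
    """Count planned actions from `terraform show -json` output.
    Each resource_changes entry carries change.actions, e.g. ["no-op"],
    ["update"], or ["delete", "create"] for a replacement."""
    plan = json.loads(plan_json)
    counts = {"create": 0, "update": 0, "delete": 0, "replace": 0}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions == ["no-op"]:
            continue
        if "delete" in actions and "create" in actions:
            counts["replace"] += 1
        else:
            for action in actions:
                if action in counts:
                    counts[action] += 1
    return counts

sample = json.dumps({
    "resource_changes": [
        {"address": "aws_autoscaling_group.memory_spine",
         "change": {"actions": ["update"]}},
        {"address": "aws_security_group.memory_spine",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_lb.memory_spine",
         "change": {"actions": ["no-op"]}},
    ]
})
print(summarize_drift(sample))
# {'create': 0, 'update': 1, 'delete': 0, 'replace': 1}
```

A one-line summary like "1 update, 1 replacement" in the Slack message tells the on-call engineer at a glance whether the drift is cosmetic or destructive.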

Policy-as-Code with Sentinel

Sentinel policies act as guardrails that run before Terraform applies changes. They enforce organizational rules that variable validation alone cannot catch:

# policies/cost-guardrails.sentinel

import "tfplan/v2" as tfplan

# Block instance types that exceed cost threshold
instance_type_allowed = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is "aws_launch_template" implies
      rc.change.after.instance_type not in [
        "p4d.24xlarge",    # $32/hr — require manual approval
        "p5.48xlarge",     # $98/hr — never auto-approve
      ]
  }
}

# Ensure all resources are tagged
all_resources_tagged = rule {
  all tfplan.resource_changes as _, rc {
    rc.change.after.tags is not null and
    rc.change.after.tags contains "Environment" and
    rc.change.after.tags contains "Team"
  }
}

main = rule {
  instance_type_allowed and all_resources_tagged
}

Cost Guardrails

AI workloads can generate surprising cloud bills. GPU instances, high-IOPS storage, and data transfer between vector stores and compute nodes add up quickly. Sentinel policies combined with Infracost estimates give you two layers of defense: Infracost surfaces the projected cost delta in the pull-request review for a human to judge, while Sentinel hard-blocks changes that violate policy before they ever reach apply.

📊 Cost Impact

ChaozCode reduced monthly AI infrastructure spend by 38% after implementing Sentinel cost guardrails. The biggest savings came from blocking accidental GPU instance provisioning in dev/staging environments and enforcing auto-shutdown policies on non-production workloads.

The complete IaC picture for AI platforms ties these layers together: modules provide consistent resource definitions, Vault secures secrets at every tier, the directory-per-environment strategy keeps blast radius small, and Sentinel policies enforce cost and compliance constraints before any change reaches production. The result is infrastructure you can trust — reproducible, auditable, and drift-free.

Start by codifying your existing Memory Spine deployment into a single module. Once the module is working for one environment, extending it to staging and production is a matter of writing new terraform.tfvars files and configuring state backends. The hardest part is the first module; everything after that is parameterization.

Deploy Memory Spine with Confidence

Production-ready Terraform modules, Vault integration, and multi-environment templates. Get your AI infrastructure under version control today.

Get Started Free →

🔧 Related ChaozCode Tools

Memory Spine

Persistent memory for AI agents — store, search, and recall context across sessions

Solas AI

Multi-perspective reasoning engine with Council of Minds for complex decisions

AgentZ

Agent orchestration and execution platform powering 233+ specialized AI agents

Explore all 8 ChaozCode apps →