IaC for AI Platforms with Terraform

Reproducible, auditable, drift-free infrastructure for AI agent platforms. Terraform modules for Memory Spine provisioning, secrets management, and multi-environment deployment.

AI platforms are deceptively complex to operate. A single Memory Spine deployment might include a vector store, a key-value layer, a REST API, TLS certificates, secrets for upstream LLM providers, and autoscaling policies that differ per environment. Multiply that by three environments — development, staging, production — and you’re staring at a configuration matrix that no wiki page can keep honest.

Infrastructure as Code eliminates the guesswork. Every resource is declared, versioned, and reproducible. In this guide we walk through a complete Terraform-based approach to provisioning, securing, and operating Memory Spine and the broader ChaozCode AI platform — from module design to drift detection.

1. Why IaC Matters for AI Platforms

Traditional web services already benefit from IaC, but AI platforms raise the stakes in three specific ways.

Reproducibility Across Experiments

Machine-learning workflows are only as trustworthy as the environment they run in. When an agent behaves differently in staging than in production, the first question is always: are the environments truly identical? IaC guarantees they are. Every network rule, disk size, and environment variable is declared in HCL, reviewed in a pull request, and applied atomically. If a model evaluation passes in staging, you know the infrastructure is not the variable.

Drift Detection

AI platforms accumulate manual changes fast. Someone bumps the Memory Spine replica count to handle a load test, forgets to revert it, and the monthly bill spikes. Someone else adds a permissive security group rule to debug a connectivity issue; it stays open for months. Terraform’s plan command surfaces every deviation from the declared state before anything is applied.

Environment Parity

A Memory Spine cluster in dev should mirror production in structure, even if it runs at a smaller scale. IaC lets you share the same module with different variable files so that network topology, IAM policies, and service dependencies remain identical. Only resource sizes change.

📊 Industry Data

Organizations using IaC for AI workloads report 72% fewer environment-related incident tickets and a 3.4× improvement in mean time to recovery compared to manually provisioned infrastructure (2025 DORA State of DevOps Report).

2. Terraform Module Design for AI Services

A well-structured Terraform module is the foundation of every reliable deployment. For AI platforms you typically need three layers of modules: core infrastructure (networking, DNS, certificates), data services (databases, vector stores, caches), and application services (Memory Spine API, agent workers, monitoring).

Recommended Directory Layout

terraform/
├── modules/
│   ├── networking/          # VPC, subnets, security groups
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── memory-spine/        # Memory Spine cluster
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── templates/
│   │       └── user-data.sh.tpl
│   ├── secrets/             # Vault + secret injection
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/          # Prometheus, Grafana, alerting
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf          # Module calls with dev vars
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── terraform.tfvars
│       └── backend.tf
└── policies/
    ├── cost-guardrails.sentinel
    └── security-baseline.sentinel

Each module exposes a narrow interface through variables.tf and outputs.tf. Modules never hard-code environment-specific values. The environments/ directories supply those values and wire modules together.
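To make the wiring concrete, here is a sketch of how an environment's main.tf might call these modules. The module paths match the layout above; the networking module's output names (vpc_id, internal_sg_id, and the subnet lists) are illustrative assumptions, not part of the module contract shown below.

```hcl
# environments/dev/main.tf (illustrative sketch)

module "networking" {
  source      = "../../modules/networking"
  environment = "dev"
}

module "memory_spine" {
  source = "../../modules/memory-spine"

  environment         = var.environment
  instance_count      = var.instance_count
  instance_type       = var.instance_type
  vector_store_engine = var.vector_store_engine
  enable_tls          = var.enable_tls
  vault_address       = var.vault_address
  allowed_cidr_blocks = var.allowed_cidr_blocks
  tags                = var.tags

  # Wiring between modules happens through outputs, never hard-coded IDs.
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnet_ids
  public_subnet_ids  = module.networking.public_subnet_ids
  internal_sg_id     = module.networking.internal_sg_id
}
```

Because every cross-module reference goes through an output, Terraform derives the dependency graph automatically; no depends_on bookkeeping is needed.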

Variables and Outputs Contract

# modules/memory-spine/variables.tf

variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_count" {
  description = "Number of Memory Spine API replicas"
  type        = number
  default     = 2
}

variable "instance_type" {
  description = "EC2 instance type for Memory Spine nodes"
  type        = string
  default     = "r6i.xlarge"
}

variable "vector_store_engine" {
  description = "Vector store backend: qdrant or pgvector"
  type        = string
  default     = "qdrant"
}

variable "enable_tls" {
  description = "Enable TLS termination at the load balancer"
  type        = bool
  default     = true
}

variable "vault_address" {
  description = "HashiCorp Vault endpoint for secret injection"
  type        = string
}

variable "allowed_cidr_blocks" {
  description = "CIDR blocks allowed to reach Memory Spine API"
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Resource tags applied to all created resources"
  type        = map(string)
  default     = {}
}

# modules/memory-spine/outputs.tf

output "api_endpoint" {
  description = "Memory Spine API load balancer URL"
  value       = "https://${aws_lb.memory_spine.dns_name}"
}

output "vector_store_endpoint" {
  description = "Internal vector store connection string"
  value       = aws_instance.vector_store[0].private_ip
  sensitive   = true
}

output "security_group_id" {
  description = "Security group attached to Memory Spine nodes"
  value       = aws_security_group.memory_spine.id
}

This strict contract means any team — platform engineering, ML engineering, or security — can review the module interface without reading the implementation. Changes to the contract surface in pull-request diffs, triggering appropriate review.

3. Provisioning Memory Spine with Terraform

Here is the core module that provisions a Memory Spine cluster: compute instances, a vector store, an application load balancer, and the necessary security groups.

# modules/memory-spine/main.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}

# ── Security Group ──
resource "aws_security_group" "memory_spine" {
  name_prefix = "memspine-${var.environment}-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from allowed CIDRs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.allowed_cidr_blocks
  }

  ingress {
    description     = "Internal API traffic"
    from_port       = 8788
    to_port         = 8788
    protocol        = "tcp"
    security_groups = [var.internal_sg_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = merge(var.tags, {
    Name = "memspine-${var.environment}"
    Service = "memory-spine"
  })
}

# ── Launch Template ──
resource "aws_launch_template" "memory_spine" {
  name_prefix   = "memspine-${var.environment}-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  user_data = base64encode(templatefile(
    "${path.module}/templates/user-data.sh.tpl",
    {
      environment        = var.environment
      vault_address      = var.vault_address
      vector_store_engine = var.vector_store_engine
    }
  ))

  vpc_security_group_ids = [aws_security_group.memory_spine.id]

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_size = var.environment == "prod" ? 200 : 50
      volume_type = "gp3"
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.tags, {
      Name = "memspine-${var.environment}"
    })
  }
}

# ── Auto Scaling Group ──
resource "aws_autoscaling_group" "memory_spine" {
  name                = "memspine-${var.environment}"
  desired_capacity    = var.instance_count
  min_size            = var.environment == "prod" ? 2 : 1
  max_size            = var.instance_count * 3
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.memory_spine.arn]

  launch_template {
    id      = aws_launch_template.memory_spine.id
    version = "$Latest"
  }

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}

# ── Application Load Balancer ──
resource "aws_lb" "memory_spine" {
  name               = "memspine-${var.environment}"
  internal           = var.environment != "prod"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.memory_spine.id]
  # Internal (non-prod) load balancers belong in private subnets;
  # only the internet-facing prod ALB lives in public subnets.
  subnets = var.environment == "prod" ? var.public_subnet_ids : var.private_subnet_ids

  tags = var.tags
}

resource "aws_lb_target_group" "memory_spine" {
  name     = "memspine-${var.environment}-tg"
  port     = 8788
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
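The Auto Scaling Group above sets capacity bounds but no scaling behavior. A target-tracking policy is one way to fill that gap; this sketch is an assumption layered on the module (the resource name and target values are illustrative), scaling on average CPU with a more relaxed target outside production:

```hcl
# modules/memory-spine/autoscaling.tf (illustrative sketch)

resource "aws_autoscaling_policy" "memory_spine_cpu" {
  name                   = "memspine-${var.environment}-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.memory_spine.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    # Prod scales earlier to preserve headroom; dev tolerates higher load.
    target_value = var.environment == "prod" ? 55 : 75
  }
}
```

Because the policy lives in the module, every environment gets the same scaling behavior with only the target differing, which preserves the environment-parity guarantee from section 1.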

Running Plan and Apply

The deployment workflow follows a standard plan-review-apply cycle. In CI, the plan output is posted as a pull-request comment so reviewers see exactly which resources will change.

#!/usr/bin/env bash
# deploy.sh — plan and apply for a target environment

set -euo pipefail

ENV="${1:?Usage: deploy.sh <environment>}"
DIR="terraform/environments/${ENV}"

echo "══════ Initializing ${ENV} ══════"
terraform -chdir="${DIR}" init -upgrade

echo "══════ Planning ${ENV} ══════"
PLAN_EXIT=0
terraform -chdir="${DIR}" plan \
  -out="${ENV}.tfplan" \
  -var-file="terraform.tfvars" \
  -detailed-exitcode || PLAN_EXIT=$?
# -detailed-exitcode returns 0 (no changes), 1 (error), or 2 (changes).
# The `|| PLAN_EXIT=$?` capture is required: under `set -e`, a bare exit
# code 2 would abort the script before we could branch on it.

if [ "${PLAN_EXIT}" -eq 0 ]; then
  echo "No changes detected."
  exit 0
elif [ "${PLAN_EXIT}" -eq 2 ]; then
  echo "Changes detected — applying..."
  terraform -chdir="${DIR}" apply "${ENV}.tfplan"
else
  echo "Plan failed." >&2
  exit 1
fi

echo "══════ Verifying health ══════"
API_URL=$(terraform -chdir="${DIR}" output -raw api_endpoint)
curl -sf "${API_URL}/health" || {
  echo "Health check failed after apply!" >&2
  exit 1
}
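The -detailed-exitcode convention drives the whole workflow, and it interacts subtly with set -e. Here is the capture pattern in isolation, with a hypothetical stub standing in for terraform plan:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stub for `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = changes present.
fake_plan() { return 2; }

rc=0
fake_plan || rc=$?   # `|| rc=$?` keeps set -e from aborting on exit code 2

case "$rc" in
  0) echo "no changes" ;;
  2) echo "changes detected" ;;
  *) echo "plan failed" >&2; exit 1 ;;
esac
```

Without the || guard, any non-zero status terminates the script immediately, so the "changes detected" branch would never run.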

State Management

Each environment uses an isolated state backend. For AWS deployments, an S3 bucket with DynamoDB locking is standard:

# environments/prod/backend.tf

terraform {
  backend "s3" {
    bucket         = "chaozcode-terraform-state"
    key            = "memory-spine/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Separate state files per environment prevent accidental cross-environment changes. A single terraform destroy in dev never touches the production state.
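The bucket and lock table referenced in backend.tf must exist before the first terraform init, so they are typically bootstrapped once, outside the per-environment stacks. A minimal sketch (the bucket and table names match the backend block above; the versioning and KMS encryption settings are recommended defaults, not requirements):

```hcl
# bootstrap/state-backend.tf (run once, before any environment init)

resource "aws_s3_bucket" "terraform_state" {
  bucket = "chaozcode-terraform-state"
}

# Versioning lets you recover an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"   # attribute name Terraform's S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

This bootstrap stack is the one piece of infrastructure that cannot manage its own state; keep its (local) state in version control or import it once the remote backend exists.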

4. Secrets Management for AI Platforms

AI platforms handle high-value secrets: LLM provider API keys, database credentials, encryption keys for stored memories, and service tokens for inter-service communication. Putting these in terraform.tfvars is a non-starter. The answer is HashiCorp Vault integrated directly into your Terraform workflow.

Vault Provider Configuration

# modules/secrets/main.tf

provider "vault" {
  address = var.vault_address
}

# KV secrets engine for AI provider keys
resource "vault_mount" "ai_keys" {
  path        = "secret/ai-providers"
  type        = "kv-v2"
  description = "AI provider API keys with versioning"
}

# OpenAI key with automatic rotation metadata
resource "vault_kv_secret_v2" "openai" {
  mount = vault_mount.ai_keys.path
  name  = "${var.environment}/openai"

  data_json = jsonencode({
    api_key          = var.openai_api_key
    organization_id  = var.openai_org_id
    rotation_due     = timeadd(timestamp(), "720h")
  })

  lifecycle {
    ignore_changes = [data_json]
  }
}

# Policy granting Memory Spine read access
resource "vault_policy" "memory_spine" {
  name   = "memory-spine-${var.environment}"
  policy = <<-EOT
    # KV v2 API paths insert data/ and metadata/ after the mount path
    # ("secret/ai-providers"), not after "secret/".
    path "secret/ai-providers/data/${var.environment}/*" {
      capabilities = ["read"]
    }
    path "secret/ai-providers/metadata/${var.environment}/*" {
      capabilities = ["list"]
    }
  EOT
}

# AppRole for Memory Spine service authentication
resource "vault_approle_auth_backend_role" "memory_spine" {
  backend        = "approle"
  role_name      = "memory-spine-${var.environment}"
  token_policies = [vault_policy.memory_spine.name]
  token_ttl      = 3600
  token_max_ttl  = 14400
}

⚠️ Never Store Secrets in State

Terraform state files record the attributes of every managed resource, including secret values, in plaintext. Always encrypt your state backend, restrict access with IAM policies, and mark secret outputs with sensitive = true (this hides them from CLI output but does not remove them from state). Note that data sources are recorded in state as well, so reading a secret through a data source does not keep it out. To avoid persisting secrets entirely, inject them at runtime (for example via Vault Agent or an AppRole login from the application itself) or use Terraform's ephemeral values, available since 1.10, which are never written to state or plan files.

Environment-Specific Secret Paths

Vault paths follow a strict convention that mirrors the Terraform environment structure: secret/ai-providers/dev/*, secret/ai-providers/staging/*, and secret/ai-providers/prod/*. Each environment's Memory Spine instance authenticates with a dedicated AppRole and can only read secrets under its own path.

This isolation ensures that a compromised dev environment cannot read production secrets, even if an attacker obtains the dev AppRole credentials.
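For KV v2 mounts, API paths insert data/ (reads) or metadata/ (listing) after the mount path. A small helper makes the convention explicit; the function is illustrative, not part of any SDK:

```python
def vault_kv2_path(mount: str, environment: str, name: str,
                   *, metadata: bool = False) -> str:
    """Build a KV v2 API path: <mount>/data/<env>/<name>,
    or <mount>/metadata/<env>/<name> for list operations."""
    segment = "metadata" if metadata else "data"
    return f"{mount.rstrip('/')}/{segment}/{environment}/{name}"

# Each environment reads only under its own prefix:
print(vault_kv2_path("secret/ai-providers", "prod", "openai"))
# secret/ai-providers/data/prod/openai
```

The Vault policy above grants read on exactly this prefix, so the AppRole token issued to a dev instance fails with a 403 on any prod path.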

5. Multi-Environment Deployment

There are two dominant strategies for managing multiple environments in Terraform: workspaces and directory-per-environment. For AI platforms, we strongly recommend the directory approach.

| Aspect | Workspace Strategy | Directory Strategy |
|---|---|---|
| State isolation | Shared backend, separate state keys | Fully separate backends possible |
| Module version pinning | Same module version across envs | Each env can pin different versions |
| Review clarity | Harder to see which env a change targets | PR diff clearly scoped to one env |
| Provider config | Shared provider, conditional logic | Per-env provider with explicit config |
| Blast radius | Higher — wrong workspace = wrong env | Lower — directory name is explicit |
| CI/CD integration | Requires workspace switching step | Path-based triggers, no switching |

Promoting Artifacts Across Environments

A promotion workflow ensures changes move from dev → staging → production with validation at each gate:

  1. Dev: Engineer opens a PR modifying environments/dev/. CI runs terraform plan and posts the diff. On merge, auto-apply deploys to dev.
  2. Staging: Once dev is verified, the engineer copies the updated terraform.tfvars values (adjusted for staging scale) into environments/staging/. Staging apply runs integration tests against the Memory Spine health endpoint.
  3. Production: A release manager reviews the staging results, opens a PR to environments/prod/, and applies after approval from both platform and security teams.

# environments/dev/terraform.tfvars

environment        = "dev"
instance_count     = 1
instance_type      = "t3.large"
vector_store_engine = "pgvector"
enable_tls         = false
vault_address      = "https://vault.dev.chaozcode.internal"

allowed_cidr_blocks = [
  "10.0.0.0/8"     # Internal VPN
]

tags = {
  Environment = "dev"
  Team        = "platform-engineering"
  CostCenter  = "engineering"
}

# environments/prod/terraform.tfvars

environment        = "prod"
instance_count     = 4
instance_type      = "r6i.2xlarge"
vector_store_engine = "qdrant"
enable_tls         = true
vault_address      = "https://vault.prod.chaozcode.internal"

allowed_cidr_blocks = [
  "10.0.0.0/8",       # Internal services
  "203.0.113.0/24"    # CDN egress
]

tags = {
  Environment = "prod"
  Team        = "platform-engineering"
  CostCenter  = "infrastructure"
  Compliance  = "soc2"
}

The same Memory Spine module serves both environments. Dev uses a single t3.large with pgvector and no TLS (behind a VPN). Production runs four r6i.2xlarge instances with Qdrant and TLS termination at the load balancer. The infrastructure shape is identical; only the scale and security posture change.

6. Drift Detection and Compliance

Deploying infrastructure is only half the battle. The other half is ensuring it stays in the declared state. Terraform provides the mechanism; you need the process.

Scheduled Plan Validation

Run terraform plan on a schedule — every four hours in production, daily in staging. If the plan detects drift (exit code 2), fire an alert to the on-call platform engineer.

# .github/workflows/drift-detection.yml (excerpt)

- name: Check for drift
  run: |
    EXIT_CODE=0
    terraform -chdir=terraform/environments/prod plan \
      -detailed-exitcode \
      -var-file=terraform.tfvars \
      -no-color > plan-output.txt 2>&1 || EXIT_CODE=$?
    # `|| EXIT_CODE=$?` captures the status without failing the step.
    # A trailing `|| true` would discard it, and PIPESTATUS would then
    # report the status of `true`, so drift would never be detected.
    if [ "$EXIT_CODE" -eq 2 ]; then
      echo "::error::Infrastructure drift detected in production!"
      cat plan-output.txt
      # Post to Slack or PagerDuty
      curl -X POST "$SLACK_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d '{"text":"⚠️ Terraform drift detected in prod Memory Spine. Review plan output."}'
      exit 1
    elif [ "$EXIT_CODE" -ne 0 ]; then
      echo "::error::Terraform plan failed; see plan-output.txt"
      cat plan-output.txt
      exit 1
    fi
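For richer alerts than a raw plan dump, the machine-readable plan from terraform show -json <planfile> can be summarized before posting. A sketch using Terraform's documented JSON plan format (resource_changes[].change.actions); the sample payload is constructed for illustration:

```python
import json

def summarize_drift(plan_json: str) -> dict:
    """Count planned actions from `terraform show -json` output.
    Each resource_changes entry carries change.actions, e.g. ["no-op"],
    ["update"], or ["delete", "create"] for a replacement."""
    plan = json.loads(plan_json)
    counts = {"create": 0, "update": 0, "delete": 0, "replace": 0}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions == ["no-op"]:
            continue
        if "delete" in actions and "create" in actions:
            counts["replace"] += 1
        else:
            for action in actions:
                if action in counts:
                    counts[action] += 1
    return counts

sample = json.dumps({
    "resource_changes": [
        {"address": "aws_autoscaling_group.memory_spine",
         "change": {"actions": ["update"]}},
        {"address": "aws_security_group.memory_spine",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_lb.memory_spine",
         "change": {"actions": ["no-op"]}},
    ]
})
print(summarize_drift(sample))
# {'create': 0, 'update': 1, 'delete': 0, 'replace': 1}
```

A one-line summary like "1 update, 1 replacement" in the Slack message tells the on-call engineer at a glance whether the drift is cosmetic or destructive.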

Policy-as-Code with Sentinel

Sentinel policies act as guardrails that run before Terraform applies changes. They enforce organizational rules that variable validation alone cannot catch:

# policies/cost-guardrails.sentinel

import "tfplan/v2" as tfplan

# Block instance types that exceed cost threshold
instance_type_allowed = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is "aws_launch_template" implies
      rc.change.after.instance_type not in [
        "p4d.24xlarge",    # $32/hr — require manual approval
        "p5.48xlarge",     # $98/hr — never auto-approve
      ]
  }
}

# Ensure all resources are tagged
all_resources_tagged = rule {
  all tfplan.resource_changes as _, rc {
    rc.change.after.tags is not null and
    rc.change.after.tags contains "Environment" and
    rc.change.after.tags contains "Team"
  }
}

main = rule {
  instance_type_allowed and all_resources_tagged
}

Cost Guardrails

AI workloads can generate surprising cloud bills. GPU instances, high-IOPS storage, and data transfer between vector stores and compute nodes add up quickly. Sentinel policies combined with Infracost estimates give you two layers of defense: Infracost surfaces the projected cost delta in the pull-request review for a human to judge, while Sentinel hard-blocks changes that violate policy before they ever reach apply.

📊 Cost Impact

ChaozCode reduced monthly AI infrastructure spend by 38% after implementing Sentinel cost guardrails. The biggest savings came from blocking accidental GPU instance provisioning in dev/staging environments and enforcing auto-shutdown policies on non-production workloads.

The complete IaC picture for AI platforms ties these layers together: modules provide consistent resource definitions, Vault secures secrets at every tier, the directory-per-environment strategy keeps blast radius small, and Sentinel policies enforce cost and compliance constraints before any change reaches production. The result is infrastructure you can trust — reproducible, auditable, and drift-free.

Start by codifying your existing Memory Spine deployment into a single module. Once the module is working for one environment, extending it to staging and production is a matter of writing new terraform.tfvars files and configuring state backends. The hardest part is the first module; everything after that is parameterization.

Deploy Memory Spine with Confidence

Production-ready Terraform modules, Vault integration, and multi-environment templates. Get your AI infrastructure under version control today.

Get Started Free →

🔧 Related ChaozCode Tools

Memory Spine

Persistent memory for AI agents — store, search, and recall context across sessions

Solas AI

Multi-perspective reasoning engine with Council of Minds for complex decisions

AgentZ

Agent orchestration and execution platform powering 233+ specialized AI agents

Explore all 8 ChaozCode apps →