AI platforms are deceptively complex to operate. A single Memory Spine deployment might include a vector store, a key-value layer, a REST API, TLS certificates, secrets for upstream LLM providers, and autoscaling policies that differ per environment. Multiply that by three environments — development, staging, production — and you’re staring at a configuration matrix that no wiki page can keep honest.
Infrastructure as Code eliminates the guesswork. Every resource is declared, versioned, and reproducible. In this guide we walk through a complete Terraform-based approach to provisioning, securing, and operating Memory Spine and the broader ChaozCode AI platform — from module design to drift detection.
1. Why IaC Matters for AI Platforms
Traditional web services already benefit from IaC, but AI platforms raise the stakes in three specific ways.
Reproducibility Across Experiments
Machine-learning workflows are only as trustworthy as the environment they run in. When an agent behaves differently in staging than in production, the first question is always: are the environments truly identical? IaC guarantees they are. Every network rule, disk size, and environment variable is declared in HCL, reviewed in a pull request, and applied atomically. If a model evaluation passes in staging, you know the infrastructure is not the variable.
Drift Detection
AI platforms accumulate manual changes fast. Someone bumps the Memory Spine replica count to handle a load test, forgets to revert it, and the monthly bill spikes. Someone else adds a permissive security group rule to debug a connectivity issue; it stays open for months. Terraform’s plan command surfaces every deviation from the declared state before anything is applied.
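The exit-code contract is what makes this automatable: with `-detailed-exitcode`, `terraform plan` returns 0 for a clean state, 2 when changes or drift exist, and 1 on error. A minimal bash sketch (the `drift_status` helper is hypothetical) shows how a scheduler can branch on it:

```shell
#!/usr/bin/env bash
# Sketch of branching on `terraform plan -detailed-exitcode` results.
# drift_status is a hypothetical helper; the exit-code meanings
# (0 = clean, 1 = error, 2 = changes pending) are Terraform's own.
drift_status() {
  case "$1" in
    0) echo "clean" ;;   # state matches configuration
    2) echo "drift" ;;   # plan found pending changes: alert on-call
    *) echo "error" ;;   # the plan itself failed
  esac
}

# Typical wiring (capture the code without tripping `set -e`):
#   rc=0; terraform plan -detailed-exitcode >/dev/null || rc=$?
#   drift_status "$rc"
drift_status 2   # prints "drift"
```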
Environment Parity
A Memory Spine cluster in dev should mirror production in structure, even if it runs at a smaller scale. IaC lets you share the same module with different variable files so that network topology, IAM policies, and service dependencies remain identical. Only resource sizes change.
Organizations using IaC for AI workloads report 72% fewer environment-related incident tickets and a 3.4× improvement in mean time to recovery compared to manually provisioned infrastructure (2025 DORA State of DevOps Report).
2. Terraform Module Design for AI Services
A well-structured Terraform module is the foundation of every reliable deployment. For AI platforms you typically need three layers of modules: core infrastructure (networking, DNS, certificates), data services (databases, vector stores, caches), and application services (Memory Spine API, agent workers, monitoring).
Recommended Directory Layout
```text
terraform/
├── modules/
│   ├── networking/            # VPC, subnets, security groups
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── memory-spine/          # Memory Spine cluster
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── templates/
│   │       └── user-data.sh.tpl
│   ├── secrets/               # Vault + secret injection
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/            # Prometheus, Grafana, alerting
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf            # Module calls with dev vars
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── terraform.tfvars
│       └── backend.tf
└── policies/
    ├── cost-guardrails.sentinel
    └── security-baseline.sentinel
```
Each module exposes a narrow interface through variables.tf and outputs.tf. Modules never hard-code environment-specific values. The environments/ directories supply those values and wire modules together.
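To make the wiring concrete, here is a sketch of what an `environments/dev/main.tf` might look like. The networking module's output names (`vpc_id`, `internal_sg_id`, and the subnet lists) are illustrative assumptions, not part of the module interfaces shown in this article:

```hcl
# environments/dev/main.tf (sketch; networking output names are assumed)
module "networking" {
  source      = "../../modules/networking"
  environment = var.environment
}

module "memory_spine" {
  source = "../../modules/memory-spine"

  environment         = var.environment
  instance_count      = var.instance_count
  instance_type       = var.instance_type
  vector_store_engine = var.vector_store_engine
  enable_tls          = var.enable_tls
  vault_address       = var.vault_address
  allowed_cidr_blocks = var.allowed_cidr_blocks
  tags                = var.tags

  # Wiring from the networking module keeps topology identical per env.
  vpc_id             = module.networking.vpc_id
  internal_sg_id     = module.networking.internal_sg_id
  private_subnet_ids = module.networking.private_subnet_ids
  public_subnet_ids  = module.networking.public_subnet_ids
}
```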
Variables and Outputs Contract
```hcl
# modules/memory-spine/variables.tf
variable "environment" {
  description = "Deployment environment (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_count" {
  description = "Number of Memory Spine API replicas"
  type        = number
  default     = 2
}

variable "instance_type" {
  description = "EC2 instance type for Memory Spine nodes"
  type        = string
  default     = "r6i.xlarge"
}

variable "vector_store_engine" {
  description = "Vector store backend: qdrant or pgvector"
  type        = string
  default     = "qdrant"
}

variable "enable_tls" {
  description = "Enable TLS termination at the load balancer"
  type        = bool
  default     = true
}

variable "vault_address" {
  description = "HashiCorp Vault endpoint for secret injection"
  type        = string
}

variable "allowed_cidr_blocks" {
  description = "CIDR blocks allowed to reach Memory Spine API"
  type        = list(string)
  default     = []
}

variable "tags" {
  description = "Resource tags applied to all created resources"
  type        = map(string)
  default     = {}
}
```

```hcl
# modules/memory-spine/outputs.tf
output "api_endpoint" {
  description = "Memory Spine API load balancer URL"
  value       = "https://${aws_lb.memory_spine.dns_name}"
}

output "vector_store_endpoint" {
  description = "Internal vector store connection string"
  value       = aws_instance.vector_store[0].private_ip
  sensitive   = true
}

output "security_group_id" {
  description = "Security group attached to Memory Spine nodes"
  value       = aws_security_group.memory_spine.id
}
```
This strict contract means any team — platform engineering, ML engineering, or security — can review the module interface without reading the implementation. Changes to the contract surface in pull-request diffs, triggering appropriate review.
3. Provisioning Memory Spine with Terraform
Here is the core module that provisions a Memory Spine cluster: compute instances, a vector store, an application load balancer, and the necessary security groups.
```hcl
# modules/memory-spine/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }
}

# Ubuntu 22.04 LTS AMI lookup, referenced by the launch template below
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# ── Security Group ──
resource "aws_security_group" "memory_spine" {
  name_prefix = "memspine-${var.environment}-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from allowed CIDRs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.allowed_cidr_blocks
  }

  ingress {
    description     = "Internal API traffic"
    from_port       = 8788
    to_port         = 8788
    protocol        = "tcp"
    security_groups = [var.internal_sg_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = merge(var.tags, {
    Name    = "memspine-${var.environment}"
    Service = "memory-spine"
  })
}

# ── Launch Template ──
resource "aws_launch_template" "memory_spine" {
  name_prefix   = "memspine-${var.environment}-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  user_data = base64encode(templatefile(
    "${path.module}/templates/user-data.sh.tpl",
    {
      environment         = var.environment
      vault_address       = var.vault_address
      vector_store_engine = var.vector_store_engine
    }
  ))

  vpc_security_group_ids = [aws_security_group.memory_spine.id]

  block_device_mappings {
    device_name = "/dev/sda1"

    ebs {
      volume_size = var.environment == "prod" ? 200 : 50
      volume_type = "gp3"
      encrypted   = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.tags, {
      Name = "memspine-${var.environment}"
    })
  }
}

# ── Auto Scaling Group ──
resource "aws_autoscaling_group" "memory_spine" {
  name                = "memspine-${var.environment}"
  desired_capacity    = var.instance_count
  min_size            = var.environment == "prod" ? 2 : 1
  max_size            = var.instance_count * 3
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.memory_spine.arn]

  launch_template {
    id      = aws_launch_template.memory_spine.id
    version = "$Latest"
  }

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}

# ── Application Load Balancer ──
resource "aws_lb" "memory_spine" {
  name               = "memspine-${var.environment}"
  internal           = var.environment != "prod"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.memory_spine.id]
  subnets            = var.public_subnet_ids
  tags               = var.tags
}

resource "aws_lb_target_group" "memory_spine" {
  name     = "memspine-${var.environment}-tg"
  port     = 8788
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
```
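The launch template renders `templates/user-data.sh.tpl` with three variables. A minimal sketch of that template is below; the `${...}` expressions are Terraform `templatefile()` interpolations (matching the call in `main.tf`), while the file paths and systemd unit name are hypothetical:

```shell
#!/usr/bin/env bash
# modules/memory-spine/templates/user-data.sh.tpl: illustrative sketch.
# ${environment}, ${vault_address}, and ${vector_store_engine} are filled
# in by templatefile(); paths and the unit name are assumptions.
set -euo pipefail

mkdir -p /etc/memory-spine
cat > /etc/memory-spine/env <<EOF
MEMSPINE_ENVIRONMENT=${environment}
MEMSPINE_VECTOR_STORE=${vector_store_engine}
VAULT_ADDR=${vault_address}
EOF

systemctl enable --now memory-spine
```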
Running Plan and Apply
The deployment workflow follows a standard plan-review-apply cycle. In CI, the plan output is posted as a pull-request comment so reviewers see exactly which resources will change.
```bash
#!/usr/bin/env bash
# deploy.sh — plan and apply for a target environment
set -euo pipefail

ENV="${1:?Usage: deploy.sh <environment>}"
DIR="terraform/environments/${ENV}"

echo "══════ Initializing ${ENV} ══════"
terraform -chdir="${DIR}" init -upgrade

echo "══════ Planning ${ENV} ══════"
# -detailed-exitcode returns 2 when changes are pending, which `set -e`
# would otherwise treat as a failure; capture the code explicitly.
PLAN_EXIT=0
terraform -chdir="${DIR}" plan \
  -out="${ENV}.tfplan" \
  -var-file="terraform.tfvars" \
  -detailed-exitcode || PLAN_EXIT=$?

if [ "${PLAN_EXIT}" -eq 0 ]; then
  echo "No changes detected."
  exit 0
elif [ "${PLAN_EXIT}" -eq 2 ]; then
  echo "Changes detected — applying..."
  terraform -chdir="${DIR}" apply "${ENV}.tfplan"
else
  echo "Plan failed." >&2
  exit 1
fi

echo "══════ Verifying health ══════"
API_URL=$(terraform -chdir="${DIR}" output -raw api_endpoint)
curl -sf "${API_URL}/health" || {
  echo "Health check failed after apply!" >&2
  exit 1
}
```
State Management
Each environment uses an isolated state backend. For AWS deployments, an S3 bucket with DynamoDB locking is standard:
```hcl
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "chaozcode-terraform-state"
    key            = "memory-spine/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```
Separate state files per environment prevent accidental cross-environment changes. A single terraform destroy in dev never touches the production state.
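The state bucket and lock table must exist before the first `terraform init` can succeed. A one-time bootstrap configuration, applied from a separate directory with local state, is one way to create them; the sketch below uses the bucket and table names from the backend block above, while the resource structure follows the AWS provider v5 conventions:

```hcl
# bootstrap/main.tf: one-time remote-state setup (sketch; apply once
# with local state before any environment runs `terraform init`).
resource "aws_s3_bucket" "tf_state" {
  bucket = "chaozcode-terraform-state"
}

# Versioning lets you recover from a corrupted or mistakenly pushed state.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# DynamoDB table used by the S3 backend for state locking.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```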
4. Secrets Management for AI Platforms
AI platforms handle high-value secrets: LLM provider API keys, database credentials, encryption keys for stored memories, and service tokens for inter-service communication. Putting these in terraform.tfvars is a non-starter. The answer is HashiCorp Vault integrated directly into your Terraform workflow.
Vault Provider Configuration
```hcl
# modules/secrets/main.tf
provider "vault" {
  address = var.vault_address
}

# KV secrets engine for AI provider keys
resource "vault_mount" "ai_keys" {
  path        = "secret/ai-providers"
  type        = "kv-v2"
  description = "AI provider API keys with versioning"
}

# OpenAI key with automatic rotation metadata
resource "vault_kv_secret_v2" "openai" {
  mount = vault_mount.ai_keys.path
  name  = "${var.environment}/openai"

  data_json = jsonencode({
    api_key         = var.openai_api_key
    organization_id = var.openai_org_id
    rotation_due    = timeadd(timestamp(), "720h")
  })

  lifecycle {
    ignore_changes = [data_json]
  }
}

# Policy granting Memory Spine read access. Note the kv-v2 path layout:
# with the engine mounted at secret/ai-providers, data lives under
# secret/ai-providers/data/... and metadata under .../metadata/...
resource "vault_policy" "memory_spine" {
  name   = "memory-spine-${var.environment}"
  policy = <<-EOT
    path "secret/ai-providers/data/${var.environment}/*" {
      capabilities = ["read"]
    }
    path "secret/ai-providers/metadata/${var.environment}/*" {
      capabilities = ["list"]
    }
  EOT
}

# AppRole for Memory Spine service authentication
resource "vault_approle_auth_backend_role" "memory_spine" {
  backend        = "approle"
  role_name      = "memory-spine-${var.environment}"
  token_policies = [vault_policy.memory_spine.name]
  token_ttl      = 3600
  token_max_ttl  = 14400
}
```
Terraform state files contain the plaintext values of all resources, including secrets. Always encrypt your state backend, restrict access with IAM policies, and mark secret outputs with sensitive = true. Be aware that data sources such as vault_generic_secret are also recorded in state, so reading secrets at apply time does not keep them out of the state file; where possible, have the service fetch its secrets from Vault at runtime (for example via its AppRole) instead of passing them through Terraform.
Environment-Specific Secret Paths
Vault paths follow a strict convention that mirrors the Terraform environment structure. Each environment’s Memory Spine instance authenticates with a dedicated AppRole and can only read secrets under its own path:
- `secret/ai-providers/dev/openai` — Dev OpenAI key (may use a lower-tier model)
- `secret/ai-providers/staging/openai` — Staging key with production-tier access for integration tests
- `secret/ai-providers/prod/openai` — Production key with strict rotation policy
This isolation ensures that a compromised dev environment cannot read production secrets, even if an attacker obtains the dev AppRole credentials.
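On the instance side, one way (among several) to consume the AppRole is HashiCorp Vault Agent, which logs in with the role and renders secrets to disk for the Memory Spine process. The sketch below uses Vault Agent's standard `auto_auth`/`template` configuration; the file paths and the template source are assumptions, not part of the module above:

```hcl
# /etc/vault-agent/agent.hcl: illustrative Vault Agent config.
# Paths and the env.ctmpl template are hypothetical.
vault {
  address = "https://vault.prod.chaozcode.internal"
}

auto_auth {
  # Log in with the AppRole provisioned by the secrets module.
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault-agent/role-id"
      secret_id_file_path = "/etc/vault-agent/secret-id"
    }
  }

  # Write the resulting token where the service (or other tools) can use it.
  sink "file" {
    config = {
      path = "/run/vault/token"
    }
  }
}

# Render provider keys into an env file consumed by Memory Spine.
template {
  source      = "/etc/memory-spine/env.ctmpl"
  destination = "/etc/memory-spine/env"
}
```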
5. Multi-Environment Deployment
There are two dominant strategies for managing multiple environments in Terraform: workspaces and directory-per-environment. For AI platforms, we strongly recommend the directory approach.
| Aspect | Workspace Strategy | Directory Strategy |
|---|---|---|
| State isolation | Shared backend, separate state keys | Fully separate backends possible |
| Module version pinning | Same module version across envs | Each env can pin different versions |
| Review clarity | Harder to see which env a change targets | PR diff clearly scoped to one env |
| Provider config | Shared provider, conditional logic | Per-env provider with explicit config |
| Blast radius | Higher — wrong workspace = wrong env | Lower — directory name is explicit |
| CI/CD integration | Requires workspace switching step | Path-based triggers, no switching |
Promoting Artifacts Across Environments
A promotion workflow ensures changes move from dev → staging → production with validation at each gate:
- Dev: Engineer opens a PR modifying `environments/dev/`. CI runs `terraform plan` and posts the diff. On merge, auto-apply deploys to dev.
- Staging: Once dev is verified, the engineer copies the updated `terraform.tfvars` values (adjusted for staging scale) into `environments/staging/`. Staging apply runs integration tests against the Memory Spine health endpoint.
- Production: A release manager reviews the staging results, opens a PR to `environments/prod/`, and applies after approval from both platform and security teams.
```hcl
# environments/dev/terraform.tfvars
environment         = "dev"
instance_count      = 1
instance_type       = "t3.large"
vector_store_engine = "pgvector"
enable_tls          = false
vault_address       = "https://vault.dev.chaozcode.internal"

allowed_cidr_blocks = [
  "10.0.0.0/8" # Internal VPN
]

tags = {
  Environment = "dev"
  Team        = "platform-engineering"
  CostCenter  = "engineering"
}
```

```hcl
# environments/prod/terraform.tfvars
environment         = "prod"
instance_count      = 4
instance_type       = "r6i.2xlarge"
vector_store_engine = "qdrant"
enable_tls          = true
vault_address       = "https://vault.prod.chaozcode.internal"

allowed_cidr_blocks = [
  "10.0.0.0/8",     # Internal services
  "203.0.113.0/24"  # CDN egress
]

tags = {
  Environment = "prod"
  Team        = "platform-engineering"
  CostCenter  = "infrastructure"
  Compliance  = "soc2"
}
```
The same Memory Spine module serves both environments. Dev uses a single t3.large with pgvector and no TLS (behind a VPN). Production runs four r6i.2xlarge instances with Qdrant and TLS termination at the load balancer. The infrastructure shape is identical; only the scale and security posture change.
6. Drift Detection and Compliance
Deploying infrastructure is only half the battle. The other half is ensuring it stays in the declared state. Terraform provides the mechanism; you need the process.
Scheduled Plan Validation
Run terraform plan on a schedule — every four hours in production, daily in staging. If the plan detects drift (exit code 2), fire an alert to the on-call platform engineer.
```yaml
# .github/workflows/drift-detection.yml (excerpt)
- name: Check for drift
  run: |
    # Capture the exit code directly; `|| EXIT_CODE=$?` keeps the step
    # alive when plan returns 2 (drift) under the default `bash -e` shell.
    EXIT_CODE=0
    terraform -chdir=terraform/environments/prod plan \
      -detailed-exitcode \
      -var-file=terraform.tfvars \
      -no-color > plan-output.txt 2>&1 || EXIT_CODE=$?
    if [ "$EXIT_CODE" -eq 2 ]; then
      echo "::error::Infrastructure drift detected in production!"
      cat plan-output.txt
      # Post to Slack or PagerDuty
      curl -X POST "$SLACK_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d '{"text":"⚠️ Terraform drift detected in prod Memory Spine. Review plan output."}'
      exit 1
    fi
```
Policy-as-Code with Sentinel
Sentinel policies act as guardrails that run before Terraform applies changes. They enforce organizational rules that variable validation alone cannot catch:
```sentinel
# policies/cost-guardrails.sentinel
import "tfplan/v2" as tfplan

# Block instance types that exceed cost threshold
instance_type_allowed = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is "aws_launch_template" implies
      rc.change.after.instance_type not in [
        "p4d.24xlarge", # $32/hr — require manual approval
        "p5.48xlarge",  # $98/hr — never auto-approve
      ]
  }
}

# Ensure all resources are tagged
all_resources_tagged = rule {
  all tfplan.resource_changes as _, rc {
    rc.change.after.tags is not null and
    rc.change.after.tags contains "Environment" and
    rc.change.after.tags contains "Team"
  }
}

main = rule {
  instance_type_allowed and all_resources_tagged
}
```
Cost Guardrails
AI workloads can generate surprising cloud bills. GPU instances, high-IOPS storage, and data transfer between vector stores and compute nodes add up quickly. Sentinel policies combined with Infracost estimates give you two layers of defense:
- Pre-apply: Sentinel blocks any plan that introduces a resource above a cost threshold without explicit override
- Post-apply: Infracost runs on every PR and comments with a cost diff, so reviewers see the dollar impact before merging
- Reactive: AWS Budgets or GCP Budget Alerts trigger notifications when actual spend exceeds forecasted thresholds
ChaozCode reduced monthly AI infrastructure spend by 38% after implementing Sentinel cost guardrails. The biggest savings came from blocking accidental GPU instance provisioning in dev/staging environments and enforcing auto-shutdown policies on non-production workloads.
The complete IaC picture for AI platforms ties these layers together: modules provide consistent resource definitions, Vault secures secrets at every tier, the directory-per-environment strategy keeps blast radius small, and Sentinel policies enforce cost and compliance constraints before any change reaches production. The result is infrastructure you can trust — reproducible, auditable, and drift-free.
Start by codifying your existing Memory Spine deployment into a single module. Once the module is working for one environment, extending it to staging and production is a matter of writing new terraform.tfvars files and configuring state backends. The hardest part is the first module; everything after that is parameterization.
Deploy Memory Spine with Confidence
Production-ready Terraform modules, Vault integration, and multi-environment templates. Get your AI infrastructure under version control today.
Get Started Free →