Compliance & Policy as Code

Audit Trails & Change Management

Every compliance framework — SOC 2, ISO 27001, PCI DSS, HIPAA — converges on the same foundational question: who changed what, when, and was it authorized? Audit trails answer that question for regulators. PR-based change management answers it for your engineering team in real time. These are not separate concerns; they are two layers of the same control, and top-tier engineering organizations run both in concert.

What Makes an Audit Log Immutable

A log is only evidence if it cannot be tampered with after the fact. Three properties define immutability in practice:

Write-once storage: logs are written to a destination where even the owner cannot delete or modify individual records. AWS CloudTrail with S3 Object Lock (Compliance mode), GCS with Bucket Lock, and Azure Blob with immutability policies all enforce this at the storage layer.
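Cryptographic integrity: each log batch is hashed and the hash is stored separately or signed. With log file validation enabled, CloudTrail delivers a digest file every hour containing the SHA-256 hash of each log file delivered in that window. Each digest is signed and records the hash of the previous digest, so a deleted or altered file breaks the chain. You can validate integrity after the fact: aws cloudtrail validate-logs --trail-arn <arn> --start-time <ISO8601>.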
Centralized aggregation: logs from every service and every region flow to a dedicated security account or log archive that application teams have no write access to. In AWS this is typically a dedicated Log Archive account in AWS Organizations, receiving logs via CloudTrail Organization Trail and centralized CloudWatch Logs destinations.

Immutability is a storage property, not a process property. Promising that nobody will delete logs is not a control. Configuring Object Lock with a 7-year retention in Compliance mode is a control. In Compliance mode no user, including the AWS account root user, can delete a locked object version or shorten its retention period until that period expires. This is the kind of evidence that supports the PCI DSS requirement to protect audit logs from modification (Requirement 10.5 in v3.2.1, 10.3 in v4.0).
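To make the hash-chain idea concrete, here is a minimal sketch of chained digests in Python. It is not CloudTrail's actual digest format: the record layout and the function names (build_digest, verify_chain) are hypothetical, and the RSA signature that CloudTrail applies to each digest is omitted.

```python
import hashlib
import json
from typing import Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def digest_hash(digest: dict) -> str:
    # Hash a canonical JSON encoding so the result is stable across runs.
    return sha256_hex(json.dumps(digest, sort_keys=True).encode())


def build_digest(log_files: dict, previous: Optional[dict]) -> dict:
    """Record the hash of every log file plus a link to the previous digest."""
    return {
        "files": {name: sha256_hex(body) for name, body in log_files.items()},
        "previous_digest_hash": digest_hash(previous) if previous is not None else None,
    }


def verify_chain(digests: list, stored_files: dict) -> list:
    """Return a list of problems; an empty list means the chain is intact."""
    problems = []
    previous = None
    for index, digest in enumerate(digests):
        expected_link = digest_hash(previous) if previous is not None else None
        if digest["previous_digest_hash"] != expected_link:
            problems.append(f"digest {index}: chain link mismatch")
        for name, expected in digest["files"].items():
            body = stored_files.get(name)
            if body is None:
                problems.append(f"digest {index}: {name} missing")
            elif sha256_hex(body) != expected:
                problems.append(f"digest {index}: {name} modified")
        previous = digest
    return problems
```

Because each digest embeds the hash of its predecessor, removing or editing any one digest or log file shows up as a mismatch further along the chain. That is why the digests need to live somewhere the log writer cannot rewrite.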

What Must Be Logged

Log everything that changes state. For cloud infrastructure the minimum baseline is:

Control plane API calls: every RunInstances, DeleteBucket, AssumeRole, and IAM mutation. AWS CloudTrail management events, GCP Cloud Audit Logs (Admin Activity), Azure Activity Log.
Data plane reads on sensitive resources: S3 data events for buckets containing PII or payment data. These are high-volume and cost money; enable them selectively with CloudTrail event selectors scoped to specific buckets.
Authentication events: console logins (including MFA status), API key creation, credential rotation failures. A failed MFA attempt at 03:00 UTC from an unusual IP is signal — you need to be able to detect it.
Infrastructure-as-Code plan and apply: who ran terraform apply, what plan hash was approved, what changed. Terraform Cloud / Atlantis keeps this natively; for open-source pipelines you must capture this in your CI log store.
Kubernetes audit log: with an audit policy configured, every kubectl exec, delete pod, RBAC binding change, and secret access is recorded at the API server. Ship these to your SIEM; they are the equivalent of CloudTrail for your cluster control plane.
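As an illustration of why authentication events belong in the baseline, the sketch below flags console logins that failed or succeeded without MFA. It assumes CloudTrail ConsoleLogin records carrying responseElements.ConsoleLogin and additionalEventData.MFAUsed; the function names and finding labels are hypothetical. Federated or SSO logins may report MFAUsed as "No" even when the identity provider enforced MFA, so treat hits as leads rather than verdicts.

```python
from typing import Optional


def classify_console_login(event: dict) -> Optional[str]:
    """Label a CloudTrail record, or return None if it is not of interest."""
    if event.get("eventName") != "ConsoleLogin":
        return None
    outcome = (event.get("responseElements") or {}).get("ConsoleLogin")
    mfa_used = (event.get("additionalEventData") or {}).get("MFAUsed")
    if outcome == "Failure":
        return "failed_login"
    if outcome == "Success" and mfa_used != "Yes":
        return "login_without_mfa"
    return None


def suspicious_logins(events: list) -> list:
    """Collect a small finding per flagged login for alerting or review."""
    findings = []
    for event in events:
        label = classify_console_login(event)
        if label is not None:
            findings.append({
                "finding": label,
                "time": event.get("eventTime"),
                "source_ip": event.get("sourceIPAddress"),
                "actor": (event.get("userIdentity") or {}).get("arn"),
            })
    return findings
```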

# Enable CloudTrail organization trail with S3 Object Lock
# (run once from the management account)
# Prerequisites: the bucket already exists with versioning enabled
# (Object Lock requires it) and its bucket policy allows CloudTrail to write.

aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name my-org-audit-logs \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --is-organization-trail

aws cloudtrail start-logging --name org-audit-trail

# Lock the S3 bucket: Compliance mode, 7-year default retention.
# Compliance-mode retention cannot be shortened later, so test on a scratch bucket first.
aws s3api put-object-lock-configuration \
  --bucket my-org-audit-logs \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Years": 7
      }
    }
  }'

# Validate CloudTrail log integrity for the past 24 hours
# (date -d is GNU date syntax; adjust on macOS/BSD)
aws cloudtrail validate-logs \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/org-audit-trail \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

PR-Based Change Management as the Audit Story

GitOps is not just a deployment pattern; it is your change management system. Every infrastructure change merged through a pull request produces a durable, cross-referenced record: who proposed the change, who reviewed it, what automated checks passed, who approved it, and exactly what diff was applied. This is the evidence an auditor wants for SOC 2 CC8.1 (Change Management) and for ISO 27001 change management (Annex A.12.1.2 in the 2013 edition, A.8.32 in the 2022 edition).

The audit story for any production change must be reconstructable from the Git history and the pull request record alone:

Engineer opens a PR against the main branch of the infrastructure repo.
CI runs terraform plan, tflint, and OPA policy checks automatically. Results are posted as PR comments.
At least one peer reviewer approves (enforced by branch protection: require N approvals, dismiss stale reviews, require status checks).
On merge, the pipeline runs terraform apply using a time-limited OIDC token — no long-lived credentials in CI.
The commit SHA, Terraform plan hash, apply output, and the PR URL are emitted to your audit log store.

[Figure: PR-based change management. Every production change flows through review and produces an immutable audit record.]
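One way to tie the apply step to the reviewed plan is to refuse to apply unless the plan file's hash matches the hash recorded at approval time. The sketch below assumes the approved SHA-256 hash was stored when the PR checks ran; the function names are hypothetical and the apply itself is out of scope.

```python
import hashlib
import hmac


def plan_sha256(plan_bytes: bytes) -> str:
    """Hex SHA-256 of the serialized plan, as sha256sum would print it."""
    return hashlib.sha256(plan_bytes).hexdigest()


def plan_matches_approval(plan_bytes: bytes, approved_hash: str) -> bool:
    """True only if the plan about to be applied is the one that was reviewed."""
    actual = plan_sha256(plan_bytes)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, approved_hash.strip().lower())
```

In a pipeline, plan_bytes would typically come from reading the saved plan file (for example tfplan.binary) just before apply, and a False result would fail the job before any resource is touched.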

Branch Protection as a Mandatory Control

Branch protection rules on your main or production branch are not optional. Without them, anyone with write access can push directly to main and bypass the entire review process — the audit trail exists but the authorization step is missing. Minimum required settings for a compliant infrastructure repo:

# GitHub branch protection via gh CLI (set on the infra repo)
# The body is sent with --input because --field would pass nested JSON as a string.
gh api repos/{owner}/{repo}/branches/main/protection \
  --method PUT \
  --header "Accept: application/vnd.github+json" \
  --input - <<'EOF'
{
  "required_status_checks": {"strict": true, "contexts": ["terraform-plan", "opa-policy"]},
  "enforce_admins": true,
  "required_pull_request_reviews": {
    "required_approving_review_count": 1,
    "dismiss_stale_reviews": true,
    "require_code_owner_reviews": true
  },
  "restrictions": null,
  "allow_force_pushes": false,
  "allow_deletions": false
}
EOF

Enforce on admins. enforce_admins: true is one of the most commonly skipped settings. Without it, repository admins can push directly to main, and in a post-incident audit "the admin bypassed the process" is as damaging as having no process at all. Expect SOC 2 auditors to ask about it.
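Settings drift, so it can help to check the live configuration against the minimum on a schedule. The sketch below inspects a dictionary shaped like the response of GitHub's "get branch protection" endpoint, where boolean settings such as enforce_admins are nested under an "enabled" key. The function name and the wording of the gap messages are hypothetical; verify the response shape against the API version you use.

```python
def protection_gaps(protection: dict, required_checks: list) -> list:
    """Return human-readable gaps between live settings and the minimum baseline."""

    def enabled(key: str) -> bool:
        return bool((protection.get(key) or {}).get("enabled", False))

    gaps = []
    if not enabled("enforce_admins"):
        gaps.append("enforce_admins is not enabled")

    reviews = protection.get("required_pull_request_reviews") or {}
    if reviews.get("required_approving_review_count", 0) < 1:
        gaps.append("no approving review required")
    if not reviews.get("dismiss_stale_reviews", False):
        gaps.append("stale reviews are not dismissed")

    checks = protection.get("required_status_checks") or {}
    missing = sorted(set(required_checks) - set(checks.get("contexts", [])))
    if missing:
        gaps.append("missing required status checks: " + ", ".join(missing))

    if enabled("allow_force_pushes"):
        gaps.append("force pushes are allowed")
    if enabled("allow_deletions"):
        gaps.append("branch deletion is allowed")
    return gaps
```

An empty list means the branch meets the baseline; anything else is a finding worth recording alongside the date it was detected.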

Linking PR Merges to Deployment Events in the Audit Log

A gap many organizations miss: the PR is in GitHub, the deployment event is in CloudTrail, but there is no automated link between them. An auditor asks: "Show me the approval for the security group change deployed on March 14." Without cross-referencing, this requires manual detective work across two systems.

The fix is to emit a structured audit event from your CI/CD pipeline at apply time that contains both the GitHub PR URL and the cloud resource change details. Write this to your centralized log destination (CloudWatch Logs, Datadog, Splunk) with a consistent schema:

# Emit a structured change audit event from your CI pipeline
# (example shell step run after terraform apply)
# Assumes PR_NUMBER and PR_APPROVERS are set by an earlier pipeline step,
# jq is installed, and the log group and stream already exist.
# The stream name "ci-pipeline" is a placeholder.

COMMIT_SHA=$(git rev-parse HEAD)
PR_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/pull/${PR_NUMBER}"
PLAN_HASH=$(sha256sum tfplan.binary | awk '{print $1}')
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Build the event with jq so every value is JSON-escaped correctly
EVENT=$(jq -cn \
  --arg timestamp "$TIMESTAMP" \
  --arg actor "$GITHUB_ACTOR" \
  --arg pr_url "$PR_URL" \
  --arg commit_sha "$COMMIT_SHA" \
  --arg plan_hash "$PLAN_HASH" \
  --arg approvers "$PR_APPROVERS" \
  --arg repository "$GITHUB_REPOSITORY" \
  '{event_type: "infrastructure_change", timestamp: $timestamp, actor: $actor,
    pr_url: $pr_url, commit_sha: $commit_sha, plan_hash: $plan_hash,
    approvers: $approvers, environment: "production", repository: $repository}')

# CloudWatch Logs expects the event timestamp in epoch milliseconds
aws logs put-log-events \
  --log-group-name /audit/infra-changes \
  --log-stream-name ci-pipeline \
  --log-events "$(jq -cn --arg msg "$EVENT" --argjson ts "$(( $(date +%s) * 1000 ))" \
      '[{timestamp: $ts, message: $msg}]')"
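With events in this schema, the auditor's question becomes a lookup. The sketch below takes the time of a cloud-side change and returns the first infrastructure_change event emitted at or after it within a window, on the assumption that the pipeline emits its event shortly after apply finishes. The function names and the 30-minute default are hypothetical, and time proximity is only a heuristic. Concurrent pipelines would need a stronger key such as a commit SHA carried on the cloud-side event.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_utc(timestamp: str) -> datetime:
    """Parse timestamps of the form 2024-03-14T10:15:00Z."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_change_record(
    audit_events: list,
    cloud_event_time: str,
    window: timedelta = timedelta(minutes=30),
) -> Optional[dict]:
    """Return the pipeline audit event most likely behind a cloud-side change."""
    changed_at = parse_utc(cloud_event_time)
    candidates = [
        event
        for event in audit_events
        if event.get("event_type") == "infrastructure_change"
        and changed_at <= parse_utc(event["timestamp"]) <= changed_at + window
    ]
    # The earliest event after the change is the closest match.
    return min(candidates, key=lambda event: parse_utc(event["timestamp"]), default=None)
```

The returned record's pr_url and approvers fields then answer "who approved this" without leaving the log store.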

Production Failure Modes

Even well-designed audit systems have recurring failure patterns at scale:

Log destination outage silently drops events: your application keeps running but audit events are lost. Alert on delivery failures (for CloudTrail, aws cloudtrail get-trail-status reports LatestDeliveryError and LatestDeliveryTime), and put a dead-letter queue behind log shipping pipelines.
Clock skew between services: logs from two services show events out of causal order, making incident reconstruction unreliable. Enforce NTP synchronization on all hosts and use a monotonic event sequencer (Kafka offsets or Kinesis sequence numbers, which order events within a partition or shard) for high-volume streams where sub-second ordering matters.
Credentials in the audit log: a developer runs aws s3 cp s3://bucket/file . --sse-c AES256 --sse-c-key <base64key> and the key ends up in shell history and in any CI or session log that records the command line, which then flows into your log store. Log sanitization middleware in your SIEM should redact known secret patterns, but the root fix is keeping secrets out of CLI arguments (use environment variables or AWS Secrets Manager references instead).
Stale retention policies: S3 lifecycle rules delete log files after 90 days, but the compliance requirement is 12 months. Run a quarterly audit: aws s3api get-bucket-lifecycle-configuration --bucket my-org-audit-logs and verify retention matches your policy.
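The quarterly retention check can be automated. The sketch below takes the JSON that get-bucket-lifecycle-configuration returns (a top-level "Rules" list whose entries may carry "Expiration": {"Days": N}) and reports enabled rules that expire objects sooner than policy allows. The function name and message format are hypothetical, and rules that expire by date or that only manage noncurrent versions are not covered.

```python
def short_retention_rules(lifecycle: dict, required_days: int) -> list:
    """List enabled lifecycle rules that delete objects before required_days."""
    findings = []
    for rule in lifecycle.get("Rules", []):
        if rule.get("Status") != "Enabled":
            continue  # disabled rules do not delete anything
        days = (rule.get("Expiration") or {}).get("Days")
        if days is not None and days < required_days:
            rule_id = rule.get("ID", "<no id>")
            findings.append(
                f"{rule_id}: expires after {days} days, policy requires {required_days}"
            )
    return findings
```

Feeding it the parsed CLI output with required_days=365 turns the 90-day example above into a failing check rather than a surprise during the audit.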

Never store audit logs in the same account as the application. If an attacker compromises your production account and obtains root or admin credentials, the first thing they do is delete logs to cover their tracks. Logs in a dedicated, locked-down Log Archive account with no cross-account delete permissions survive a production account breach.

Querying Audit Logs at Scale

Collecting logs is only useful if you can query them quickly during an incident or audit. CloudTrail Lake lets you run SQL directly against the events in an event data store without exporting to Athena first. For an organization-wide trail, a typical query retrieving IAM policy and user changes in the past 30 days looks like:

-- CloudTrail Lake: find IAM policy and user mutations in the last 30 days
SELECT
  eventTime,
  userIdentity.arn AS actor,
  eventName,
  requestParameters,
  sourceIPAddress,
  userAgent
FROM
  ${event_data_store_id}
WHERE
  eventSource = 'iam.amazonaws.com'
  AND eventName IN (
    'AttachRolePolicy','DetachRolePolicy',
    'PutRolePolicy','DeleteRolePolicy',
    'CreatePolicy','DeletePolicy',
    'CreateUser','DeleteUser'
  )
  AND eventTime > DATE_ADD('DAY', -30, NOW())
ORDER BY
  eventTime DESC
LIMIT 500

For multi-cloud environments, route everything through a single SIEM (Splunk, Elastic, Datadog) and normalize to a common schema (OCSF — Open Cybersecurity Schema Framework — is gaining adoption as the cross-cloud standard). This lets a single query surface events from AWS, GCP, and Kubernetes simultaneously during an incident.
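Normalization can be pictured as a pair of small adapters feeding one timeline. The sketch below maps a CloudTrail record and a Kubernetes audit record onto a deliberately simplified common shape. The output field names (time, actor, action, source_ip, origin) are placeholders and not OCSF attribute names; a real deployment would map to the OCSF classes its SIEM supports.

```python
from datetime import datetime


def parse_time(timestamp: str) -> datetime:
    # Both sources use RFC 3339 UTC timestamps ending in "Z".
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def from_cloudtrail(event: dict) -> dict:
    return {
        "time": parse_time(event["eventTime"]),
        "actor": (event.get("userIdentity") or {}).get("arn", "unknown"),
        "action": event["eventName"],
        "source_ip": event.get("sourceIPAddress"),
        "origin": "aws_cloudtrail",
    }


def from_kubernetes_audit(event: dict) -> dict:
    resource = (event.get("objectRef") or {}).get("resource", "unknown")
    source_ips = event.get("sourceIPs") or [None]
    return {
        "time": parse_time(event["requestReceivedTimestamp"]),
        "actor": (event.get("user") or {}).get("username", "unknown"),
        "action": f"{event['verb']} {resource}",
        "source_ip": source_ips[0],
        "origin": "kubernetes_audit",
    }


def merged_timeline(cloudtrail_events: list, k8s_events: list) -> list:
    """One chronologically ordered list across both sources."""
    normalized = [from_cloudtrail(event) for event in cloudtrail_events]
    normalized += [from_kubernetes_audit(event) for event in k8s_events]
    return sorted(normalized, key=lambda event: event["time"])
```

Parsing timestamps into datetime values before sorting matters here: the two sources use different sub-second precision, so comparing the raw strings can misorder events that fall within the same second.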