Audit Trails & Change Management
Every compliance framework — SOC 2, ISO 27001, PCI DSS, HIPAA — converges on the same foundational question: who changed what, when, and was it authorized? Audit trails answer that question for regulators. PR-based change management answers it for your engineering team in real time. These are not separate concerns; they are two layers of the same control, and top-tier engineering organizations run both in concert.
What Makes an Audit Log Immutable
A log is only evidence if it cannot be tampered with after the fact. Three properties define immutability in practice:
- Write-once storage: logs are written to a destination where even the owner cannot delete or modify individual records. AWS CloudTrail with S3 Object Lock (Compliance mode), GCS with Bucket Lock, and Azure Blob with immutability policies all enforce this at the storage layer.
- Cryptographic integrity: each log batch is hashed and the hash is stored separately or signed. CloudTrail delivers a digest file every hour containing SHA-256 hashes of all log files in the preceding window. You can validate integrity offline: `aws cloudtrail validate-logs --trail-arn <arn> --start-time <ISO8601>`.
- Centralized aggregation: logs from every service and every region flow to a dedicated security account or log archive that application teams have no write access to. In AWS this is typically a dedicated Log Archive account in AWS Organizations, receiving logs via CloudTrail Organization Trail and centralized CloudWatch Logs destinations.
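The cryptographic-integrity property can be illustrated with a minimal Python sketch. It is not the CloudTrail digest format (real digest files are also signed and chained to the previous digest); the function names and the JSON layout here are assumptions made for illustration only.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_digest(log_files: dict[str, bytes]) -> str:
    """Return a JSON digest mapping each log file name to its SHA-256 hash.

    In practice this digest would be stored (and signed) separately
    from the log files it describes.
    """
    entries = {name: sha256_hex(body) for name, body in log_files.items()}
    return json.dumps(entries, sort_keys=True)


def verify_against_digest(log_files: dict[str, bytes], digest_json: str) -> list[str]:
    """Return the names of files that are missing or whose hash differs."""
    expected = json.loads(digest_json)
    bad = []
    for name, want in expected.items():
        body = log_files.get(name)
        if body is None or sha256_hex(body) != want:
            bad.append(name)
    return sorted(bad)
```

Any file that was modified or deleted after the digest was written shows up in the returned list, which is the property an auditor relies on.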
What Must Be Logged
Log everything that changes state. For cloud infrastructure the minimum baseline is:
- Control plane API calls: every `CreateInstance`, `DeleteBucket`, `AssumeRole`, and IAM mutation. AWS CloudTrail management events, GCP Cloud Audit Logs (Admin Activity), Azure Activity Log.
- Data plane reads on sensitive resources: S3 data events for buckets containing PII or payment data. These are high-volume and cost money; enable them selectively with CloudTrail event selectors scoped to specific buckets.
- Authentication events: console logins (including MFA status), API key creation, credential rotation failures. A failed MFA attempt at 03:00 UTC from an unusual IP is signal — you need to be able to detect it.
- Infrastructure-as-Code plan and apply: who ran `terraform apply`, what plan hash was approved, what changed. Terraform Cloud / Atlantis keeps this natively; for open-source pipelines you must capture this in your CI log store.
- Kubernetes audit log: every `kubectl exec`, `delete pod`, RBAC binding change, and secret access is recorded at the API server. Ship these to your SIEM; they are the equivalent of CloudTrail for your cluster control plane.
PR-Based Change Management as the Audit Story
GitOps is not just a deployment pattern — it is your change management system. Every infrastructure change merged through a pull request produces a durable, cross-referenced record: who proposed the change, who reviewed it, what automated checks passed, who approved it, and exactly what diff was applied. This is the evidence an auditor wants for SOC 2 CC8.1 (Change Management) and ISO 27001 A.12.1.2 (Change Management Procedure).
The audit story for any production change must be reconstructable from the Git history alone:
- Engineer opens a PR against the `main` branch of the infrastructure repo.
- CI runs `terraform plan`, `tflint`, and OPA policy checks automatically. Results are posted as PR comments.
- At least one peer reviewer approves (enforced by branch protection: require N approvals, dismiss stale reviews, require status checks).
- On merge, the pipeline runs `terraform apply` using a time-limited OIDC token, so no long-lived credentials live in CI.
- The commit SHA, Terraform plan hash, apply output, and the PR URL are emitted to your audit log store.
Branch Protection as a Mandatory Control
Branch protection rules on your main or production branch are not optional. Without them, anyone with write access can push directly to main and bypass the entire review process — the audit trail exists but the authorization step is missing. Minimum required settings for a compliant infrastructure repo:
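As a minimal sketch, the settings below are written as a Python dictionary using the field names of GitHub's branch protection REST API, together with a small checker that reports gaps. The status check names (`terraform-plan`, `tflint`, `opa-policy`) and the `protection_gaps` helper are hypothetical; adapt them to whatever your CI actually reports.

```python
# Hypothetical CI check names; replace with the contexts your pipeline reports.
REQUIRED_CHECKS = {"terraform-plan", "tflint", "opa-policy"}

# Shape follows the body of GitHub's "update branch protection" endpoint.
protection = {
    "required_status_checks": {"strict": True, "contexts": sorted(REQUIRED_CHECKS)},
    "enforce_admins": True,
    "required_pull_request_reviews": {
        "dismiss_stale_reviews": True,
        "required_approving_review_count": 1,
    },
    "restrictions": None,
    "allow_force_pushes": False,
    "allow_deletions": False,
}


def protection_gaps(settings: dict, min_approvals: int = 1) -> list[str]:
    """Return human-readable findings for settings weaker than the baseline."""
    gaps = []
    if not settings.get("enforce_admins"):
        gaps.append("enforce_admins is not enabled")
    reviews = settings.get("required_pull_request_reviews") or {}
    if reviews.get("required_approving_review_count", 0) < min_approvals:
        gaps.append(f"fewer than {min_approvals} approving review(s) required")
    if not reviews.get("dismiss_stale_reviews"):
        gaps.append("stale reviews are not dismissed on new commits")
    checks = settings.get("required_status_checks") or {}
    missing = REQUIRED_CHECKS - set(checks.get("contexts", []))
    if missing:
        gaps.append("missing required status checks: " + ", ".join(sorted(missing)))
    if settings.get("allow_force_pushes"):
        gaps.append("force pushes are allowed")
    return gaps
```

Running a check like this on a schedule turns the branch protection baseline into evidence rather than a one-time configuration.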
enforce_admins: true is the most commonly skipped setting. Without it, repository admins can push directly to main — and in a post-incident audit, "the admin bypassed the process" is as damaging as having no process at all. SOC 2 auditors check this explicitly.
Linking PR Merges to Deployment Events in the Audit Log
A gap that organizations miss: the PR is in GitHub, the deployment event is in CloudTrail, but there is no automated link between them. An auditor asks: "Show me the approval for the security group change deployed on March 14." Without cross-referencing, this requires manual detective work across two systems.
The fix is to emit a structured audit event from your CI/CD pipeline at apply time that contains both the GitHub PR URL and the cloud resource change details. Write this to your centralized log destination (CloudWatch Logs, Datadog, Splunk) with a consistent schema:
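One possible shape for such an event is sketched below in Python. The field names, the `ApplyAuditEvent` class, and the `build_event` helper are illustrative assumptions rather than an established schema; the point is that the PR URL, commit SHA, plan hash, and changed resources travel together in a single record.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ApplyAuditEvent:
    """One record per pipeline apply, linking the PR to the cloud changes."""
    event_type: str
    timestamp: str
    pr_url: str
    commit_sha: str
    plan_hash: str
    approved_by: list[str]
    applied_by: str
    resources_changed: list[str]


def build_event(pr_url: str, commit_sha: str, plan_hash: str,
                approved_by: list[str], applied_by: str,
                resources_changed: list[str],
                now: Optional[datetime] = None) -> str:
    """Serialize the audit event as a single JSON line with stable key order."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    event = ApplyAuditEvent(
        event_type="terraform_apply",
        timestamp=ts,
        pr_url=pr_url,
        commit_sha=commit_sha,
        plan_hash=plan_hash,
        approved_by=approved_by,
        applied_by=applied_by,
        resources_changed=resources_changed,
    )
    return json.dumps(asdict(event), sort_keys=True)
```

The pipeline would print or ship the returned line to the log destination; with the PR URL indexed, the auditor's question becomes a single lookup.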
Production Failure Modes
Even well-designed audit systems have recurring failure patterns at scale:
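- Log destination outage silently drops events: your application keeps running but audit events are lost. CloudTrail's SNS notifications fire on log file delivery, so alert when they stop arriving, check `LatestDeliveryError` in the output of `aws cloudtrail get-trail-status`, or deploy a dead-letter queue for log shipping pipelines.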
- Clock skew between services: logs from two services show events out of causal order, making incident reconstruction unreliable. Enforce NTP synchronization on all hosts and use a monotonic event sequencer (Kafka offsets, Kinesis sequence numbers) for high-volume streams where sub-second ordering matters.
- Credentials in the audit log: a developer runs `aws s3 cp s3://bucket/file . --sse-c AES256 --sse-c-key <base64key>` and the key appears in CloudTrail. Log sanitization middleware in your SIEM should redact known secret patterns, but the root fix is preventing secrets from appearing in CLI arguments (use environment variables or AWS Secrets Manager references instead).
- Stale retention policies: S3 lifecycle rules delete log files after 90 days, but the compliance requirement is 12 months. Run a quarterly audit: `aws s3api get-bucket-lifecycle-configuration --bucket my-org-audit-logs` and verify retention matches your policy.
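The retention check in the last item can be automated. The Python sketch below takes the parsed JSON output of `get-bucket-lifecycle-configuration` and flags enabled rules that expire objects too early. The `retention_gaps` helper is hypothetical, and it only inspects `Expiration.Days`; rules that expire by date or by noncurrent version are out of scope here.

```python
def retention_gaps(lifecycle: dict, required_days: int) -> list[str]:
    """Return findings for enabled lifecycle rules that expire objects
    before the required retention period has elapsed."""
    gaps = []
    for rule in lifecycle.get("Rules", []):
        if rule.get("Status") != "Enabled":
            continue  # disabled rules do not delete anything
        days = rule.get("Expiration", {}).get("Days")
        if days is not None and days < required_days:
            rule_id = rule.get("ID", "<unnamed>")
            gaps.append(
                f"rule {rule_id} expires objects after {days} days "
                f"(required: {required_days})"
            )
    return gaps
```

Feeding it the 90-day rule from the example with `required_days=365` returns one finding, which can fail a scheduled compliance job.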
Querying Audit Logs at Scale
Collecting logs is only useful if you can query them quickly during an incident or audit. CloudTrail Lake lets you run SQL directly against your trail events without exporting to Athena first. For an organization-wide trail, a typical query retrieving all IAM changes in the past 30 days looks like:
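A sketch of such a query, built in Python so the time window is computed rather than hard-coded. The column names are assumed to mirror CloudTrail record fields and should be checked against your event data store's schema; the `iam_changes_query` helper and the ID format check are illustrative assumptions.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional


def iam_changes_query(event_data_store_id: str, days: int = 30,
                      now: Optional[datetime] = None) -> str:
    """Build a CloudTrail Lake SQL statement for IAM write events
    in the last `days` days."""
    # The ID is interpolated into the FROM clause, so reject anything
    # that is not a plain UUID-style identifier.
    if not re.fullmatch(r"[0-9a-fA-F-]{36}", event_data_store_id):
        raise ValueError("unexpected event data store ID format")
    current = now or datetime.now(timezone.utc)
    since = (current - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "SELECT eventTime, userIdentity.arn, eventName, awsRegion, recipientAccountId "
        f"FROM {event_data_store_id} "
        "WHERE eventSource = 'iam.amazonaws.com' "
        "AND readOnly = false "
        f"AND eventTime > '{since}' "
        "ORDER BY eventTime DESC"
    )
```

The resulting string can be pasted into the CloudTrail Lake query editor or submitted through the SDK's query API.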
For multi-cloud environments, route everything through a single SIEM (Splunk, Elastic, Datadog) and normalize to a common schema (OCSF — Open Cybersecurity Schema Framework — is gaining adoption as the cross-cloud standard). This lets a single query surface events from AWS, GCP, and Kubernetes simultaneously during an incident.