Cloud & Kubernetes Security Hardening

Detection & Response in the Cloud

Hardening reduces the probability of a breach; detection and response reduce the blast radius when hardening fails. Every major cloud provider ships native telemetry services. On AWS these are CloudTrail, GuardDuty, Security Hub, and CloudWatch Logs, which together give you visibility into what the principals, services, and resources in your accounts are doing. At scale, the gap between teams that get paged on an attacker's second API call and teams that discover a breach in a quarterly audit comes down to one thing: how well they have operationalized this telemetry into actionable, low-noise alerts.

This lesson covers the full detection path: where the raw events come from, how GuardDuty's threat intelligence models work, how to write your own CloudWatch Metric Filters for the suspicious patterns GuardDuty does not cover, and how to build a response runbook that closes the loop from alert to remediation in minutes rather than days.

CloudTrail: The Immutable API Ledger

AWS CloudTrail records API activity in your account (console actions, CLI commands, SDK calls, and service-to-service calls) as structured JSON events delivered to S3 and optionally to CloudWatch Logs. Management events are recorded by default; data events, such as S3 object reads, must be enabled explicitly. CloudTrail is the foundation of every detection and forensic investigation workflow.

Production best practices that most teams miss:

Enable a multi-region trail: a single-region trail can miss global service events (IAM, STS, Route 53, CloudFront) and misses any resource created in a region you are not watching. The --is-multi-region-trail flag fixes this.
Enable log file validation: CloudTrail delivers digest files that contain a SHA-256 hash of each log file and are themselves signed. --enable-log-file-validation lets you prove to auditors and incident responders that logs have not been tampered with.
Protect the S3 destination: restrict bucket access so that only CloudTrail can write and only your SIEM and responders can read; block public access; enable MFA Delete; enable S3 Object Lock in compliance mode for your retention window.
Send to CloudWatch Logs: S3 delivery is batched and typically lags the API call by several minutes; CloudWatch Logs delivery is usually faster and enables Metric Filters, so alarms can fire within minutes of the event.
Enable CloudTrail Insights: anomaly detection on management API call rates and error rates; it flags unusual bursts, such as mass resource creation or a spike in access-denied errors, that would otherwise look like ordinary activity.

# Create a hardened multi-region trail (run once per management account)
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name my-org-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --cloud-watch-logs-log-group-arn arn:aws:logs:us-east-1:123456789012:log-group:cloudtrail:* \
  --cloud-watch-logs-role-arn arn:aws:iam::123456789012:role/CloudTrailCloudWatchRole \
  --include-global-service-events

aws cloudtrail start-logging --name org-audit-trail

# Enable Insights (API call-rate and error-rate anomalies)
aws cloudtrail put-insight-selectors \
  --trail-name org-audit-trail \
  --insight-selectors '[{"InsightType":"ApiCallRateInsight"},{"InsightType":"ApiErrorRateInsight"}]'
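The settings in the checklist above can be verified continuously rather than trusted once. The following is a minimal sketch, assuming you have already fetched a trail description in the shape of one `trailList` entry from CloudTrail's DescribeTrails API; the function name `audit_trail` and the gap messages are hypothetical.

```python
from typing import Dict, List


def audit_trail(trail: Dict) -> List[str]:
    """Return the hardening gaps found in one DescribeTrails entry."""
    # Each tuple pairs a DescribeTrails field with the gap it indicates
    # when the field is missing or falsy.
    checks = [
        ("IsMultiRegionTrail", "trail is single-region"),
        ("IncludeGlobalServiceEvents", "global service events are not recorded"),
        ("LogFileValidationEnabled", "log file validation is off"),
        ("CloudWatchLogsLogGroupArn", "no CloudWatch Logs delivery"),
        ("HasInsightSelectors", "CloudTrail Insights is not enabled"),
    ]
    return [message for field, message in checks if not trail.get(field)]


# Example: a trail that was created without validation or Insights.
sample = {
    "Name": "org-audit-trail",
    "IsMultiRegionTrail": True,
    "IncludeGlobalServiceEvents": True,
    "CloudWatchLogsLogGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:cloudtrail:*",
}
for gap in audit_trail(sample):
    print(f"{sample['Name']}: {gap}")
```

In practice the input would come from something like `boto3.client("cloudtrail").describe_trails()["trailList"]`, and a non-empty result would fail a scheduled compliance job.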

At the organization level, create an AWS Organizations trail from the management account or from a delegated administrator account (typically the security tooling account). One organization trail covers all member accounts, and member account admins cannot modify or delete it. This is a common pattern in large multi-account AWS organizations.

GuardDuty: Threat Intelligence at Cloud Scale

Amazon GuardDuty is a regional, managed threat detection service that continuously analyzes CloudTrail management events, CloudTrail S3 data events, VPC Flow Logs, and DNS query logs without requiring you to route anything to it manually. It correlates this data against AWS threat intelligence feeds, CrowdStrike, Proofpoint, and its own ML anomaly models to produce findings — scored, categorized alerts.

GuardDuty findings are organized into three families you must know cold:

Reconnaissance: port scanning (Recon:EC2/PortProbeUnprotectedPort), API calls from known malicious IPs that enumerate your resources (Recon:IAMUser/MaliciousIPCaller). Attackers mapping your environment before they act.
Credential Compromise / Instance Compromise: API calls from Tor exit nodes (UnauthorizedAccess:IAMUser/TorIPCaller), API calls from IPs on known threat lists (UnauthorizedAccess:IAMUser/MaliciousIPCaller), an EC2 instance connecting to the Tor network (UnauthorizedAccess:EC2/TorClient), credential-related API calls that deviate from the principal's usual behavior (CredentialAccess:IAMUser/AnomalousBehavior).
Exfiltration / Impact: S3 data retrieval by a principal that does not normally read that data (Exfiltration:S3/ObjectRead.Unusual in older detectors, Exfiltration:S3/AnomalousBehavior in current ones), an EC2 instance communicating with a known C2 domain (Backdoor:EC2/C&CActivity.B), Bitcoin mining traffic (CryptoCurrency:EC2/BitcoinTool.B).
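The finding names above follow GuardDuty's documented naming format, ThreatPurpose:ResourceTypeAffected/ThreatFamilyName.DetectionMechanism!Artifact, where the last two parts are optional. A small parser makes that structure usable in routing logic. This is a sketch only; `FindingType`, `parse_finding_type`, and the `LESSON_FAMILY` grouping are illustrative names, and the grouping simply mirrors the three families in this lesson rather than any AWS taxonomy.

```python
from typing import NamedTuple, Optional


class FindingType(NamedTuple):
    purpose: str               # e.g. "Backdoor"
    resource: str              # e.g. "EC2"
    family: str                # e.g. "C&CActivity"
    mechanism: Optional[str]   # e.g. "B", absent on some findings
    artifact: Optional[str]    # e.g. "DNS", absent on most findings


def parse_finding_type(value: str) -> FindingType:
    """Split a GuardDuty finding type string into its named parts."""
    purpose, _, rest = value.partition(":")
    resource, _, detail = rest.partition("/")
    detail, _, artifact = detail.partition("!")
    family, _, mechanism = detail.partition(".")
    if not (purpose and resource and family):
        raise ValueError(f"not a GuardDuty finding type: {value!r}")
    return FindingType(purpose, resource, family, mechanism or None, artifact or None)


# Hypothetical mapping from threat purpose to the families used in this lesson.
LESSON_FAMILY = {
    "Recon": "Reconnaissance",
    "UnauthorizedAccess": "Credential Compromise / Instance Compromise",
    "CredentialAccess": "Credential Compromise / Instance Compromise",
    "Exfiltration": "Exfiltration / Impact",
    "Backdoor": "Exfiltration / Impact",
    "CryptoCurrency": "Exfiltration / Impact",
}

parsed = parse_finding_type("Backdoor:EC2/C&CActivity.B")
print(parsed.resource, LESSON_FAMILY.get(parsed.purpose, "Unclassified"))
```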

Detection pipeline: raw logs feed GuardDuty and CloudWatch; findings route through EventBridge to alerting and automated remediation.

Writing CloudWatch Metric Filters for Custom Detection

GuardDuty covers the most common threat patterns, but your environment has unique risks that require custom signal extraction. CloudWatch Metric Filters match CloudTrail events as they arrive in the log group, increment a custom metric when a pattern matches, and trigger an alarm. End-to-end latency is dominated by CloudTrail delivery plus the alarm period, so expect alerts within a few minutes of the API call.

High-signal patterns every production account should monitor:

Root account usage (any API call from the root principal)
ConsoleLogin failures — brute-force attempts on IAM user passwords
IAM policy and user changes — unauthorized privilege escalation
CloudTrail being stopped or deleted — an attacker covering their tracks
Security group ingress rules opened to 0.0.0.0/0
KMS key deletion scheduled — a ransomware precursor

# 1. Create a Metric Filter for root account activity
aws logs put-metric-filter \
  --log-group-name cloudtrail \
  --filter-name RootAccountUsage \
  --filter-pattern '{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS && $.eventType != "AwsServiceEvent" }' \
  --metric-transformations \
    metricName=RootAccountUsageCount,metricNamespace=CloudTrailAlerts,metricValue=1,defaultValue=0

# 2. Alarm: any root usage is a critical page (threshold=1 over 5 min)
aws cloudwatch put-metric-alarm \
  --alarm-name RootAccountUsage \
  --metric-name RootAccountUsageCount \
  --namespace CloudTrailAlerts \
  --statistic Sum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:security-alerts \
  --alarm-description "Root account API call detected"

# 3. Metric Filter: CloudTrail stopped, deleted, or modified
aws logs put-metric-filter \
  --log-group-name cloudtrail \
  --filter-name CloudTrailMutated \
  --filter-pattern '{ ($.eventName = StopLogging) || ($.eventName = DeleteTrail) || ($.eventName = UpdateTrail) }' \
  --metric-transformations \
    metricName=CloudTrailMutatedCount,metricNamespace=CloudTrailAlerts,metricValue=1,defaultValue=0
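The remaining patterns in the list can be kept in one table and turned into API arguments, so every filter is created the same way. The sketch below builds keyword arguments in the shape expected by the CloudWatch Logs PutMetricFilter API (for example via boto3's `put_metric_filter`). The two patterns are modeled on the CIS AWS Foundations Benchmark monitoring recommendations and should be checked against the benchmark version you target; the filter names and the `metric_filter_kwargs` helper are hypothetical.

```python
from typing import Dict

LOG_GROUP = "cloudtrail"          # same log group as the CLI examples
NAMESPACE = "CloudTrailAlerts"    # same namespace as the CLI examples

# Filter name -> CloudWatch Logs JSON filter pattern.
PATTERNS: Dict[str, str] = {
    "ConsoleLoginFailure": (
        '{ ($.eventName = ConsoleLogin) && '
        '($.errorMessage = "Failed authentication") }'
    ),
    "KmsKeyDisabledOrScheduledForDeletion": (
        '{ ($.eventSource = kms.amazonaws.com) && '
        '(($.eventName = DisableKey) || ($.eventName = ScheduleKeyDeletion)) }'
    ),
}


def metric_filter_kwargs(name: str, pattern: str) -> Dict:
    """Build PutMetricFilter arguments that emit 1 per matching event."""
    return {
        "logGroupName": LOG_GROUP,
        "filterName": name,
        "filterPattern": pattern,
        "metricTransformations": [
            {
                "metricName": f"{name}Count",
                "metricNamespace": NAMESPACE,
                "metricValue": "1",   # the API expects a string here
                "defaultValue": 0,
            }
        ],
    }


for filter_name, filter_pattern in PATTERNS.items():
    print(metric_filter_kwargs(filter_name, filter_pattern)["filterName"])
```

Each result could then be passed as `boto3.client("logs").put_metric_filter(**kwargs)`, followed by one alarm per metric as in step 2 above.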

The CIS AWS Foundations Benchmark (v1.5) publishes filter patterns for its recommended monitoring controls, including root usage, console sign-in failures, IAM policy changes, and CloudTrail configuration changes. Automate their deployment via Terraform using the aws_cloudwatch_log_metric_filter and aws_cloudwatch_metric_alarm resources: treat your detection posture as code and test it in CI.

Automating Response with EventBridge + Lambda

An alert you act on in 2 hours is not detection — it is forensics. The goal is automated containment for high-confidence findings and human-in-the-loop confirmation for lower-confidence ones. The pattern used at scale: GuardDuty findings emit to EventBridge; a Lambda function evaluates severity and finding type; it either auto-remediates (revoke credentials, isolate instance, block IP in WAF) or opens a PagerDuty incident with the full context pre-populated.

Auto-remediation actions by finding type:

Compromised IAM credentials (UnauthorizedAccess:IAMUser/*): Lambda calls iam:AttachUserPolicy to attach a deny-all quarantine policy and iam:UpdateAccessKey to set every access key for the user to Inactive. The user retains their identity but cannot act until the security team rotates the keys and removes the quarantine policy.
Compromised EC2 instance (Backdoor:EC2/*, CryptoCurrency:EC2/*): Lambda replaces the instance's security groups with a dedicated quarantine group that allows only SSH from the security team's bastion CIDR, creates an EBS snapshot for forensics, and tags the instance SecurityStatus=Quarantined.
Unusual S3 access (Exfiltration:S3/*): Lambda enables S3 Block Public Access on the bucket and removes any bucket policy statement that allows s3:GetObject to *.
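The mapping from finding type to remediation can live in a small dispatch table instead of a chain of if statements. A minimal sketch, assuming the wildcard patterns listed above; the playbook names are hypothetical placeholders for functions you would implement separately.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (finding type pattern, playbook name). Order matters: first match wins.
PLAYBOOKS: List[Tuple[str, str]] = [
    ("UnauthorizedAccess:IAMUser/*", "quarantine_iam_user"),
    ("Backdoor:EC2/*", "isolate_ec2_instance"),
    ("CryptoCurrency:EC2/*", "isolate_ec2_instance"),
    ("Exfiltration:S3/*", "lock_down_s3_bucket"),
]


def select_playbook(finding_type: str) -> Optional[str]:
    """Return the playbook for a finding type, or None to route to a human."""
    for pattern, playbook in PLAYBOOKS:
        # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
        if fnmatchcase(finding_type, pattern):
            return playbook
    return None


print(select_playbook("CryptoCurrency:EC2/BitcoinTool.B"))
print(select_playbook("Recon:EC2/PortProbeUnprotectedPort"))
```

Returning None for unmatched types keeps the default path human-reviewed, so a new GuardDuty finding type never triggers automation by accident.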

# EventBridge rule: route HIGH severity GuardDuty findings to Lambda
# (Severity >= 7.0 = HIGH; >= 4.0 = MEDIUM)
{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [{ "numeric": [">=", 7] }]
  }
}
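Before deploying the rule, it helps to check locally which sample events it would select. The sketch below re-implements the rule's three conditions in plain Python and labels severities using the thresholds from the comment above; it is an approximation for unit tests, not EventBridge's own matcher, and both function names are hypothetical.

```python
from typing import Dict


def severity_label(severity: float) -> str:
    """Label a GuardDuty severity using the thresholds quoted in the rule comment."""
    if severity >= 7.0:
        return "HIGH"
    if severity >= 4.0:
        return "MEDIUM"
    return "LOW"


def matches_rule(event: Dict, minimum: float = 7.0) -> bool:
    """Approximate the EventBridge pattern: source, detail-type, severity >= minimum."""
    severity = event.get("detail", {}).get("severity")
    return (
        event.get("source") == "aws.guardduty"
        and event.get("detail-type") == "GuardDuty Finding"
        and isinstance(severity, (int, float))
        and severity >= minimum
    )


sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"severity": 8, "type": "UnauthorizedAccess:IAMUser/TorIPCaller"},
}
print(matches_rule(sample_event), severity_label(sample_event["detail"]["severity"]))
```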

# Lambda handler skeleton (Python 3.12): quarantine a compromised IAM user
import json

import boto3

iam = boto3.client("iam")
QUARANTINE_POLICY_ARN = "arn:aws:iam::123456789012:policy/SecurityQuarantine"

def handler(event, context):
    finding = event["detail"]
    ftype   = finding["type"]
    details = finding.get("resource", {}).get("accessKeyDetails", {})
    user    = details.get("userName")

    # IAMUser findings can also fire for assumed roles and root, where
    # userName is not an IAM user; the calls below would fail for those.
    if not user or "IAMUser" not in ftype or details.get("userType") != "IAMUser":
        return  # handled by other targets

    # 1. Attach deny-all quarantine policy
    iam.attach_user_policy(UserName=user, PolicyArn=QUARANTINE_POLICY_ARN)

    # 2. Deactivate all access keys for the user
    keys = iam.list_access_keys(UserName=user)["AccessKeyMetadata"]
    for key in keys:
        iam.update_access_key(
            UserName=user,
            AccessKeyId=key["AccessKeyId"],
            Status="Inactive"
        )

    # 3. Emit a structured log line for the audit trail
    print(json.dumps({"action": "quarantined", "user": user, "finding": ftype}))

Auto-remediation can cause outages if a legitimate service account triggers a false positive. Always scope auto-remediation to findings with severity >= 7.0 and finding types you have validated against your environment. Use a DO_NOT_QUARANTINE IAM tag on service accounts that should instead route to a human-reviewed queue. Test your Lambda against GuardDuty sample findings (aws guardduty create-sample-findings) in a non-production account before enabling in production.
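The guardrails in the paragraph above reduce to a small decision function that runs before any remediation call. This is a sketch under stated assumptions: tags arrive in the `[{"Key": ..., "Value": ...}]` shape that IAM's ListUserTags API returns, the presence of the DO_NOT_QUARANTINE key is enough to exempt a user regardless of its value, and the function names and return strings are hypothetical.

```python
from typing import Dict, List

EXEMPT_TAG = "DO_NOT_QUARANTINE"   # tag key from the text; value is ignored here
AUTO_REMEDIATE_MIN_SEVERITY = 7.0  # threshold from the text


def is_exempt(tags: List[Dict[str, str]]) -> bool:
    """True if the principal carries the exemption tag."""
    return any(tag.get("Key") == EXEMPT_TAG for tag in tags)


def decide(severity: float, tags: List[Dict[str, str]]) -> str:
    """Choose between automated quarantine and a human-reviewed queue."""
    if severity < AUTO_REMEDIATE_MIN_SEVERITY:
        return "page_human"
    if is_exempt(tags):
        return "page_human"
    return "auto_quarantine"


service_account_tags = [{"Key": "DO_NOT_QUARANTINE", "Value": "true"}]
print(decide(8.0, service_account_tags))
print(decide(8.0, []))
```

Keeping the decision separate from the boto3 calls means it can be unit-tested without AWS credentials, and the same function can gate every playbook.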

Correlating Signals: Security Hub as the Single Pane

At any realistic AWS footprint (dozens of accounts, multiple regions), GuardDuty findings, Config rules, Macie findings, Inspector vulnerability reports, and your custom CloudWatch alarms create a signal deluge if consumed independently. AWS Security Hub aggregates all of these into a single, normalized finding format (ASFF — AWS Security Finding Format) and provides a cross-account, cross-region dashboard with compliance scores against CIS, PCI-DSS, and NIST 800-53.
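Custom CloudWatch alarms do not arrive in Security Hub on their own; something has to translate them into ASFF. The sketch below builds a minimal ASFF document with the fields Security Hub requires for imported findings, using the account's default product ARN for custom integrations. The helper name, the Id and GeneratorId conventions, and the chosen finding type are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_asff_finding(
    account_id: str,
    region: str,
    alarm_name: str,
    description: str,
    resource_id: str,
    severity_label: str = "HIGH",
    now: Optional[datetime] = None,
) -> Dict:
    """Build a minimal ASFF finding for a custom CloudWatch alarm."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": f"custom/{alarm_name}/{timestamp}",       # hypothetical Id scheme
        "ProductArn": (
            f"arn:aws:securityhub:{region}:{account_id}:"
            f"product/{account_id}/default"
        ),
        "GeneratorId": f"cloudwatch-alarm/{alarm_name}",  # hypothetical scheme
        "AwsAccountId": account_id,
        "Types": ["Unusual Behaviors/User"],
        "CreatedAt": timestamp,
        "UpdatedAt": timestamp,
        "Severity": {"Label": severity_label},
        "Title": alarm_name,
        "Description": description,
        "Resources": [{"Type": "Other", "Id": resource_id}],
    }


finding = build_asff_finding(
    "123456789012", "us-east-1", "RootAccountUsage",
    "Root account API call detected", "arn:aws:iam::123456789012:root",
)
print(finding["ProductArn"])
```

The result could be submitted with `boto3.client("securityhub").batch_import_findings(Findings=[finding])`, after which it appears alongside GuardDuty and Config findings.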

Enable Security Hub with your security tooling account as the delegated administrator and aggregate all member accounts and regions into it. For remediation, AWS publishes a separately deployed solution, Automated Response and Remediation (SHARR), that can serve as the foundation for your runbook library. Findings in Security Hub are emitted to EventBridge, either automatically or through a custom action that an analyst selects, and an EventBridge rule can then trigger a remediation playbook.

Incident Response Runbook Principles

Detection without a practiced runbook is theater. A production-grade runbook for a GuardDuty finding has five phases: Triage (is this a true positive?), Containment (stop the bleeding), Eradication (remove attacker presence), Recovery (restore service), and Post-Incident Review, or PIR (close the detection gap). Every phase must have an explicit owner, a decision tree, and a maximum time budget. For credential compromise, a reasonable set of targets is: Triage < 5 min (automated), Containment < 15 min (automated or on-call engineer), Eradication and Recovery < 2 hours, PIR within 5 business days.
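Time budgets are easier to enforce when they are data rather than prose. A minimal sketch, assuming each budget applies to the duration of its own phase and that phase durations are recorded by your incident tooling; the `BUDGETS` keys and the `breached_phases` helper are hypothetical.

```python
from datetime import timedelta
from typing import Dict, List

# Per-phase time budgets for credential compromise, taken from the text.
BUDGETS: Dict[str, timedelta] = {
    "triage": timedelta(minutes=5),
    "containment": timedelta(minutes=15),
    "eradication_and_recovery": timedelta(hours=2),
}


def breached_phases(elapsed: Dict[str, timedelta]) -> List[str]:
    """Return the phases whose recorded duration exceeded the budget."""
    return [
        phase
        for phase, budget in BUDGETS.items()
        if elapsed.get(phase, timedelta(0)) > budget
    ]


incident = {
    "triage": timedelta(minutes=3),
    "containment": timedelta(minutes=22),
    "eradication_and_recovery": timedelta(minutes=95),
}
print(breached_phases(incident))
```

Feeding the output into the post-incident review gives each PIR a concrete list of phases to examine.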

Store runbooks as code in your security repository, link them directly from the GuardDuty finding detail in PagerDuty, and rehearse them quarterly with tabletop exercises and annual chaos/game days where red team engineers deliberately trigger findings to verify the detection-to-containment pipeline is working end-to-end.