Cloud & Kubernetes Security Hardening

Identity Is the Perimeter

18 min Lesson 3 of 28

In traditional data-center security, the network edge was the boundary. Firewalls guarded the moat. If you were inside the castle walls, you were trusted. Cloud destroyed that model. Infrastructure now spans multiple clouds, laptops, CI/CD runners, SaaS APIs, and Kubernetes pods, many of them communicating over the public internet. The moat is gone. Identity is now the only consistent enforcement point you have.

This lesson focuses on three production-grade skills that separate senior DevOps engineers from junior ones: enforcing least privilege at scale, running continuous role hygiene programs, and deploying automated access analyzers that catch drift before attackers do.

Least Privilege at Scale: Why It Is Hard and How Big-Tech Does It

The principle of least privilege is easy to state and very hard to maintain at scale without tooling. The moment a developer adds AdministratorAccess to a role "just for testing" and the PR merges on a Friday, the blast radius of that role grows from one workload to the entire account. At scale (hundreds of roles, dozens of CI pipelines, thousands of Kubernetes service accounts), manual review is theater.

Production-grade least privilege requires three things working together:

  1. Start narrow, drift upward deliberately. New roles begin with zero permissions. Permissions are added only when a documented service requirement exists, not when someone gets an AccessDenied error and escalates.
  2. Measure actual usage, not assumed usage. AWS IAM Access Analyzer generates policy recommendations from up to 90 days of CloudTrail activity. Anything unused gets removed on a regular cadence.
  3. Treat IAM policies as code. All role definitions live in Terraform. Changes go through pull request review, just like application code. No console-only changes are permitted; SCPs (Service Control Policies) can enforce this at the organization level, for example by denying IAM write actions to every principal except the deployment pipeline's role.
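
Once policies live in version control, a CI step can reject obviously broad grants before a human reviews the pull request. The following is a minimal sketch of such a check, not a replacement for Access Analyzer policy validation. The function names `as_list` and `find_broad_statements` and the sample policy are hypothetical, and the rule set (flag `*` actions, `service:*` actions, and `*` resources on Allow statements) is an assumption chosen for illustration.

```python
import json


def as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def find_broad_statements(policy):
    """Return (sid, reason) pairs for Allow statements that use wildcards."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may appear unwrapped
        statements = [statements]
    problems = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue  # Deny statements with wildcards are not a risk here
        sid = stmt.get("Sid", f"statement[{index}]")
        for action in as_list(stmt.get("Action")):
            if action == "*" or action.endswith(":*"):
                problems.append((sid, f"wildcard action {action}"))
        if "*" in as_list(stmt.get("Resource")):
            problems.append((sid, "resource is *"))
    return problems


if __name__ == "__main__":
    # Hypothetical policy document, as it might be rendered from Terraform.
    example = json.loads("""
    {
      "Version": "2012-10-17",
      "Statement": [
        {"Sid": "ReadOrdersTable", "Effect": "Allow",
         "Action": ["dynamodb:GetItem", "dynamodb:Query"],
         "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"},
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": "dynamodb:*", "Resource": "*"}
      ]
    }
    """)
    for sid, reason in find_broad_statements(example):
        print(f"{sid}: {reason}")
```

A pipeline would typically fail the build when the returned list is non-empty and require an explicit, reviewed exception for any statement that genuinely needs a wildcard.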

A concrete example: a Lambda function that reads from one DynamoDB table should have a role scoped exactly to that table and that action. Not DynamoDB full access. Not DynamoDB read-only on all tables. One table, one set of actions.

```hcl
# Terraform: minimal IAM role for a Lambda reading one DynamoDB table
resource "aws_iam_role" "order_reader_lambda" {
  name = "order-reader-lambda-role"

  # Trust policy: only the Lambda service may assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "order_reader_policy" {
  name = "order-reader-policy"
  role = aws_iam_role.order_reader_lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "ReadOrdersTable"
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:Query",
          "dynamodb:BatchGetItem"
        ]
        Resource = "arn:aws:dynamodb:us-east-1:123456789012:table/orders"
      },
      {
        Sid    = "WriteLogs"
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        # Broader than the rest of this policy; scope it to the
        # function's own log group where possible.
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}
```

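When you are unsure what permissions a workload actually needs, deploy it first with CloudTrail logging enabled and the arn:aws:iam::aws:policy/ReadOnlyAccess policy attached. Let it run for a day, or long enough to exercise its normal code paths. Then use IAM Access Analyzer policy generation to get a list of actions based on the CloudTrail events the role actually produced. Treat the generated policy as a starting point: Access Analyzer does not identify action-level activity for CloudTrail data events (S3 object reads, for example), so review it before applying. Strip the role back to exactly what it needs, then remove ReadOnlyAccess. This is the build-measure-tighten loop that large cloud-native organizations such as Google and Amazon are reported to use internally.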
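
The "measure, then tighten" step comes down to a set difference between what a role is granted and what it was observed doing. The sketch below illustrates that idea on plain dictionaries shaped like CloudTrail records (`eventSource`, `eventName`). The function names are hypothetical, and the mapping from `eventSource` to an IAM service prefix is a simplifying assumption that does not hold for every AWS service.

```python
def observed_actions(cloudtrail_events):
    """Convert CloudTrail records into IAM-style 'service:Action' strings.

    Assumes the IAM service prefix equals the first label of eventSource
    (dynamodb.amazonaws.com -> dynamodb). That is true for many services
    but not all, so treat the result as a starting point.
    """
    actions = set()
    for event in cloudtrail_events:
        service = event["eventSource"].split(".")[0]
        actions.add(f"{service}:{event['eventName']}")
    return actions


def tighten(granted, cloudtrail_events):
    """Split granted actions into those seen in use and those never seen."""
    used = observed_actions(cloudtrail_events)
    keep = sorted(a for a in granted if a in used)
    remove = sorted(a for a in granted if a not in used)
    return keep, remove


if __name__ == "__main__":
    # Hypothetical grant list and a tiny sample of recorded events.
    granted = ["dynamodb:GetItem", "dynamodb:Query",
               "dynamodb:Scan", "dynamodb:DescribeTable"]
    events = [
        {"eventSource": "dynamodb.amazonaws.com", "eventName": "GetItem"},
        {"eventSource": "dynamodb.amazonaws.com", "eventName": "Query"},
        {"eventSource": "dynamodb.amazonaws.com", "eventName": "GetItem"},
    ]
    keep, remove = tighten(granted, events)
    print("keep:", keep)
    print("candidates for removal:", remove)
```

The removal list is a set of candidates, not a verdict: an action used only during monthly batch jobs or incident recovery will not show up in a one-day sample.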

Kubernetes Service Account Least Privilege with IRSA and Workload Identity

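Inside Kubernetes, every pod that calls cloud provider APIs should use a federated workload identity: IAM Roles for Service Accounts (IRSA) on EKS for AWS APIs, or Workload Identity on GKE for Google Cloud APIs. Never mount long-lived AWS credentials as secrets. The IRSA flow exchanges a projected service account token for short-lived STS credentials scoped to a specific Kubernetes service account, bound by an OIDC trust relationship. The projected token is bound to the pod and rotated automatically, and the STS credentials expire on their own (after one hour by default).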

```yaml
# Annotate the Kubernetes service account to bind it to an IAM role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-reader-sa
  namespace: orders
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/order-reader-lambda-role
---
# The IAM role trust policy must allow the OIDC provider to assume it
# (managed by Terraform eks_iam_role_for_service_account module in practice)
# Verify the token is being injected correctly in the running pod:
#   kubectl exec -n orders <pod> -- env | grep AWS
#   AWS_ROLE_ARN=arn:aws:iam::123456789012:role/order-reader-lambda-role
#   AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
```

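A production failure worth knowing about: the OIDC thumbprint in the IAM identity provider goes stale when the certificate behind the issuer endpoint is rotated. Refresh the thumbprint via Terraform on a regular cadence, for example alongside your AWS SDK upgrades, or pin the root CA thumbprint rather than the intermediate. IAM now validates many OIDC issuers against its own library of trusted root CAs and relies on the configured thumbprint only when the issuer's certificate is not signed by one of them, so check the current AWS documentation to see whether this applies to your cluster. Silent auth failures in workloads are the symptom; a stale OIDC thumbprint is one root cause to rule out early.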
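
The OIDC trust relationship that binds a role to one service account is easiest to understand as a document. The sketch below builds such a trust policy in Python. The function name `irsa_trust_policy` is hypothetical and the issuer value in the example is a placeholder; the structure (a federated principal, `sts:AssumeRoleWithWebIdentity`, and `sub`/`aud` conditions keyed by the issuer) follows the shape AWS documents for IRSA.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build a trust policy letting one Kubernetes service account assume a role.

    oidc_issuer is the cluster's issuer URL without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # StringEquals (not StringLike) pins the role to exactly one
                # service account in exactly one namespace.
                "StringEquals": {
                    f"{oidc_issuer}:sub": subject,
                    f"{oidc_issuer}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


if __name__ == "__main__":
    # The issuer below is a placeholder, not a real cluster.
    policy = irsa_trust_policy(
        account_id="123456789012",
        oidc_issuer="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
        namespace="orders",
        service_account="order-reader-sa",
    )
    print(json.dumps(policy, indent=2))
```

The `sub` condition is the part that enforces least privilege: without it, any service account in the cluster could assume the role.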

Role Hygiene: Continuous Cleanup at Scale

Roles accumulate. An engineer leaves, their personal role stays. A one-off migration project ends, its cross-account role stays. Over 18 months, the role count in a mid-size AWS account can triple without anyone deliberately deciding to grow it. Role hygiene is the operational practice of continuously detecting and eliminating this drift.

The standard big-tech playbook:

  • Last used tracking. Every IAM role records when it was last used and in which region. Roles unused for 90 days are quarantined (deny all via inline deny policy), then deleted after a 14-day grace period.
  • Role ownership tagging. Every role carries Owner, Team, and Expires tags. Untagged roles are quarantined automatically. CI pipelines enforce tags at creation time via policy-as-code (OPA or SCP).
  • Permissions boundary enforcement. Every developer-created role must have a permissions boundary attached, which caps the maximum permissions the role can ever exercise regardless of the policies attached to it. This prevents privilege escalation even if a developer creates a role with a broad policy.
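
```bash
# AWS CLI: find IAM roles unused for more than 90 days

# The credential report covers IAM users (passwords and access keys), not
# roles; use it to find stale human credentials:
aws iam generate-credential-report
aws iam get-credential-report --output text --query Content | base64 -d > creds.csv

# Role last-used data comes from get-role; list-roles omits RoleLastUsed.
for role in $(aws iam list-roles --query 'Roles[].RoleName' --output text); do
  last_used=$(aws iam get-role --role-name "$role" \
    --query 'Role.RoleLastUsed.LastUsedDate' --output text)
  printf '%s\t%s\n' "$last_used" "$role"
done > role-last-used.tsv

# Least recently used roles first:
grep -v '^None' role-last-used.tsv | sort | head -40

# Roles with no recorded use ("None" in text output):
grep '^None' role-last-used.tsv | cut -f2

# For per-service last-accessed data (Access Advisor), see
# aws iam generate-service-last-accessed-details.

# Attach a quarantine deny policy to a stale role:
aws iam put-role-policy \
  --role-name stale-migration-role \
  --policy-name QuarantineDenyAll \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"*","Resource":"*"}]}'
```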
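
The playbook above is a small state machine: a role is kept, quarantined, or deleted depending on its tags, its last use, and how long it has been quarantined. The sketch below encodes those rules with the 90-day and 14-day thresholds from the list. The record shape (`tags`, `created`, `last_used`, `quarantined_at`) and the function name are hypothetical, and measuring never-used roles from their creation date is an assumption the text does not spell out.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"Owner", "Team", "Expires"}
STALE_AFTER = timedelta(days=90)
GRACE_PERIOD = timedelta(days=14)


def classify_role(role, now):
    """Return 'keep', 'quarantine', or 'delete' for one role record.

    role is a plain dict (hypothetical shape):
      tags            dict of tag key to value
      created         datetime the role was created
      last_used       datetime, or None if never used
      quarantined_at  datetime, or None if not quarantined
    """
    quarantined_at = role.get("quarantined_at")
    if quarantined_at is not None:
        # Already quarantined: delete once the grace period has elapsed.
        return "delete" if now - quarantined_at >= GRACE_PERIOD else "quarantine"
    if not REQUIRED_TAGS.issubset(role.get("tags", {})):
        return "quarantine"  # untagged roles are quarantined automatically
    # Assumption: a role that was never used is measured from its creation.
    last_activity = role.get("last_used") or role["created"]
    if now - last_activity > STALE_AFTER:
        return "quarantine"
    return "keep"


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    stale = {
        "tags": {"Owner": "alice", "Team": "orders", "Expires": "2025-01-01"},
        "created": now - timedelta(days=400),
        "last_used": now - timedelta(days=120),
        "quarantined_at": None,
    }
    print(classify_role(stale, now))
```

Keeping the decision in a pure function like this makes the thresholds reviewable and testable separately from the code that calls the AWS APIs.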

IAM Access Analyzer: Automated Drift Detection

AWS IAM Access Analyzer is a continuous analysis engine that monitors resource-based policies (S3 bucket policies, KMS key policies, SQS queue policies, Lambda function policies, IAM role trust policies, and more) and reports any principal outside your zone of trust that has been granted access. It also generates least-privilege policy recommendations from CloudTrail activity and validates policies you write against the IAM policy grammar and known security anti-patterns before you deploy them.

[Figure: IAM Access Analyzer flow, CloudTrail → Analyzer → Findings → Remediation. Inputs: CloudTrail API events and resource policies (S3, KMS, SQS, Lambda, roles). Analyzer functions: policy validation and external access findings. Finding types: external access, unused permissions, policy errors, overpermissive grants. Remediation: archive or fix, Terraform PR, alert to on-call. Exports to Security Hub, EventBridge, and SNS.]
IAM Access Analyzer continuously reads resource policies and CloudTrail events, surfaces findings, and feeds remediation workflows.

An analyzer is a regional resource whose zone of trust is either a single account or the entire AWS Organization. Create an organization-level analyzer in every enabled region, including regions you think you do not use, because attackers favor quiet regions. Route all findings to Security Hub, which aggregates them with findings from GuardDuty, Inspector, and Macie into a single view that your security team monitors.

```bash
# Create an organization-level analyzer (run from the management account).
# Analyzers are regional, so repeat this for each enabled region.
aws accessanalyzer create-analyzer \
  --analyzer-name org-analyzer \
  --type ORGANIZATION \
  --region us-east-1

# Generate a least-privilege policy recommendation for a specific role
# based on up to 90 days of CloudTrail activity. Add --cloud-trail-details
# (trail ARN, access role, start time) to select the trail and window, then
# fetch the result with: aws accessanalyzer get-generated-policy --job-id <id>
aws accessanalyzer start-policy-generation \
  --policy-generation-details '{"principalArn":"arn:aws:iam::123456789012:role/order-reader-lambda-role"}'

# List unresolved findings (external access alerts):
aws accessanalyzer list-findings \
  --analyzer-arn arn:aws:accessanalyzer:us-east-1:123456789012:analyzer/org-analyzer \
  --filter '{"status":{"eq":["ACTIVE"]}}' \
  --query 'findings[*].[id,resource,resourceType,condition]' \
  --output table

# Validate a policy document before you deploy it:
aws accessanalyzer validate-policy \
  --policy-document file://proposed-policy.json \
  --policy-type IDENTITY_POLICY
```

Access Analyzer catches a class of misconfiguration that manual review tends to miss: implicit public access. Consider a KMS key policy with "Principal": "*" and a condition that team members assume is restrictive but that is in fact always true. A human reviewer skimming the JSON will likely accept it. The analyzer evaluates the full policy logic and flags it. Enable it and treat every active finding as a Sev-2 incident.
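
A much weaker but still useful pre-review step is to surface every Allow statement that names a wildcard principal, so that a human looks at its condition deliberately instead of skimming past it. The sketch below does only that. It does not evaluate whether a condition is actually restrictive, which is the part Access Analyzer handles; the function names and the sample key policy are hypothetical.

```python
def is_wildcard_principal(principal):
    """True for "Principal": "*" and for {"AWS": "*"} in string or list form."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS", [])
        aws = [aws] if isinstance(aws, str) else aws
        return "*" in aws
    return False


def public_candidates(policy):
    """List Allow statements open to any principal, with their conditions."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may appear unwrapped
        statements = [statements]
    results = []
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        if is_wildcard_principal(stmt.get("Principal")):
            results.append({
                "sid": stmt.get("Sid", f"statement[{index}]"),
                # None means the grant is unconditionally public.
                "condition": stmt.get("Condition"),
            })
    return results


if __name__ == "__main__":
    # Hypothetical KMS key policy with one open statement.
    key_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "Admin", "Effect": "Allow",
             "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
             "Action": "kms:*", "Resource": "*"},
            {"Sid": "OpenDecrypt", "Effect": "Allow",
             "Principal": "*", "Action": "kms:Decrypt", "Resource": "*",
             "Condition": {"StringLike": {"kms:ViaService": "*"}}},
        ],
    }
    for hit in public_candidates(key_policy):
        print(hit["sid"], "->", hit["condition"])
```

In the sample, the `StringLike` condition on `kms:ViaService` uses a pattern that matches any value, which is the kind of condition that looks restrictive at a glance and deserves a closer look.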

Putting It Together: The Identity Hygiene Feedback Loop

Best-in-class organizations run identity hygiene as a continuous, automated feedback loop rather than a quarterly manual audit. The cycle is: define narrow roles in Terraform → deploy → measure actual usage via CloudTrail → generate Access Analyzer recommendations → open automated PRs to remove unused permissions → merge → repeat every 90 days. Pair this with alerting on any AssumeRole call that crosses account boundaries unexpectedly, any API call made by the root user, and any attachment of a policy that contains a wildcard action. Identity hygiene is not a project; it is a standing on-call rotation item.
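
Two of the alerts named above can be expressed as simple predicates over a CloudTrail record. The sketch below assumes events arrive as dictionaries with the standard CloudTrail fields `eventName`, `userIdentity`, and `requestParameters`. The function name, the `trusted_accounts` allow-list, and the alert wording are hypothetical, and a real deployment would more likely express these rules as EventBridge patterns or SIEM queries.

```python
def alerts_for_event(event, trusted_accounts):
    """Return a list of alert strings for one CloudTrail record (a dict)."""
    alerts = []
    identity = event.get("userIdentity", {})

    # Any API call made with root credentials is worth a page.
    if identity.get("type") == "Root":
        alerts.append(f"root user called {event.get('eventName')}")

    if event.get("eventName") == "AssumeRole":
        role_arn = (event.get("requestParameters") or {}).get("roleArn", "")
        # ARN layout: arn:partition:service:region:account-id:resource
        parts = role_arn.split(":")
        role_account = parts[4] if len(parts) > 4 else None
        caller_account = identity.get("accountId")
        if (role_account and caller_account
                and caller_account != role_account
                and caller_account not in trusted_accounts):
            alerts.append(
                f"unexpected cross-account AssumeRole from {caller_account} "
                f"into {role_arn}"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical record: an unknown account assuming a role in ours.
    event = {
        "eventName": "AssumeRole",
        "userIdentity": {"type": "IAMUser", "accountId": "999999999999"},
        "requestParameters": {
            "roleArn": "arn:aws:iam::123456789012:role/order-reader-lambda-role"
        },
    }
    for alert in alerts_for_event(event, trusted_accounts={"123456789012"}):
        print(alert)
```

Calls made by AWS services carry no caller account ID in `userIdentity`, so the cross-account check skips them rather than raising a false alarm.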