Capstone: A Big-Tech Production Platform

The Complete Platform & Your Career

18 min Lesson 10 of 30

You have built every layer of the Arctiq Commerce platform across the previous nine lessons. Now step back and see the entire system as one coherent architecture — end to end, from a user's HTTPS request to the database write, through the observability pipeline, enforced by security policy, protected by SRE practice, and governed by cost controls. Then we address the forward-looking question that matters most at this stage: how do you turn this project-level depth into a senior DevOps or SRE career?

The Assembled Platform — End-to-End Request Flow

Every architectural decision you made traces through a real request. A user on a mobile device in Frankfurt places an order. Here is the complete path, and where each platform layer intervenes:

DNS & edge: Route 53 resolves api.arctiq.com to the CloudFront distribution, and the user connects to a nearby edge location (CloudFront is a global service, not a per-region one). CloudFront evaluates the WAF rule group (rate limit, OWASP Top 10, IP reputation list from Lesson 7) and forwards the request to its origin, which latency-based routing resolves to the ALB in eu-west-1. The EU region is now fully active, not standby, after the active-active promotion you completed in Lesson 8.
Ingress & service mesh: From the load balancer, the request reaches the Istio ingress gateway. Istio terminates TLS, validates the JWT (via the RequestAuthentication policy), and routes to the orders service in the team-commerce namespace. Envoy injects a traceparent header (W3C Trace Context); this is the trace root, and it carries through every downstream call as long as each service forwards the header on its outbound requests.
Application compute: Karpenter has scheduled the orders pod on a c6g.2xlarge Spot instance in the eu-west-1b AZ. The pod's service account uses IRSA — no static credentials; the AWS SDK reads a projected token and exchanges it with STS for short-lived credentials scoped to the orders-service-role.
Secrets: At pod startup, the Vault Agent injected the database DSN and the Stripe API key into an in-memory tmpfs volume at /vault/secrets/. The application reads from the file — it has never seen a static secret.
Data write: The order is written to the Aurora PostgreSQL cluster in eu-west-1 (formerly a read replica, promoted to writer during the DR exercise in Lesson 8). Aurora Global Database replicates the write to us-east-1 with a typical lag of 80–120 ms. An order.placed event is published to the MSK Kafka topic orders-v2, and MirrorMaker 2 replicates that topic to the US cluster asynchronously.
Observability: The Envoy sidecar reports span data to the OpenTelemetry Collector DaemonSet. The Collector batches and sends traces to Jaeger, metrics to Prometheus (via remote_write to Thanos), and structured logs to Loki. The entire order flow — ingress latency, database write duration, Kafka produce latency — appears as a single flame graph in Grafana within 15 seconds of the request completing.
Policy enforcement: Falco is watching the orders pod. If a process in the pod attempts a fork/exec outside the allowed list (defined in the custom Falco rule from Lesson 7), an alert fires to the security Slack channel, and the OPA admission policy blocks any subsequent attempt to open an exec session into that pod.
Cost accounting: The orders pod carries the labels team=commerce, service=orders, env=prod, and Kubecost allocates cost by those labels. The pod's share of compute, data transfer, and ALB LCUs is automatically attributed to the commerce team's monthly budget in the Kubecost dashboard.
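
To make the trace-propagation step concrete, here is a minimal sketch of how a service might parse an incoming traceparent header and derive the value it forwards downstream. The header layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) follows the W3C Trace Context format. The function names parseTraceparent and childTraceparent are hypothetical, and Math.random stands in for a proper random source; in practice an OpenTelemetry SDK would normally do this for you.

```javascript
// Sketch only: parse and propagate a W3C Trace Context `traceparent` header.
// Layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null when the header is malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid under the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Illustration-grade randomness; a real tracer would use a stronger source.
function randomHex(byteCount) {
  let out = '';
  for (let i = 0; i < byteCount; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, '0');
  }
  return out;
}

// Builds the header for an outbound call: same trace ID, new span ID,
// sampled flag carried over from the incoming request.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  if (!parent) return null;
  let spanId;
  do { spanId = randomHex(8); } while (/^0+$/.test(spanId));
  return `00-${parent.traceId}-${spanId}-${parent.sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);
console.log(outgoing); // same trace ID as `incoming`, different span ID
```

The trace ID is the only part that stays constant across hops, which is why a single dropped header splits one request into two unrelated traces.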

Key architecture insight: The platform is invisible to the product engineer who owns the orders service. They write business logic, push to GitHub, and within 7 minutes their code is running in production across two regions, traced, alerted on, and cost-attributed — without opening a single ticket to the platform team. That invisibility is the measure of a mature platform.

The Complete Architecture Diagram

The complete Arctiq Commerce platform end-to-end: user request entering via Route 53 and CloudFront, flowing through Istio into Karpenter-scheduled EKS pods, writing to Aurora and Kafka, observed by the OpenTelemetry stack, enforced by Vault/Falco/OPA, and cost-tracked by Kubecost — active in both regions simultaneously.

Platform Health Validation — The Smoke Test Suite

Every production deploy should end with a programmatic validation that confirms the platform is functioning end to end, not just that the pods are Running. The following k6 script is the canonical smoke test, run as an Argo CD PostSync hook once a GitOps sync has completed and the synced resources are healthy.

// k6 smoke test — runs as ArgoCD PostSync hook, budget: 60s, <1% error rate
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const orderLatency = new Trend('order_latency_ms', true);

export const options = {
  vus: 10,
  duration: '45s',
  thresholds: {
    errors: ['rate<0.01'],           // <1% error rate
    order_latency_ms: ['p(99)<800'],  // 99th percentile <800ms
    http_req_failed: ['rate<0.01'],
  },
};

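const BASE = __ENV.TARGET_URL || 'https://api.arctiq.com';
// Unique per run, so idempotency keys never collide with smoke orders from earlier deploys.
// RUN_ID is an assumed variable name; without it the script falls back to a timestamp.
const RUN_ID = __ENV.RUN_ID || `${Date.now()}`;

// Returns order_id from a JSON response body, or undefined if the body is not valid JSON
// (for example an HTML error page from a proxy).
function orderId(res) {
  try {
    return JSON.parse(res.body).order_id;
  } catch (_) {
    return undefined;
  }
}

export default function () {
  // 1. Health probe (ingress, service mesh, pod readiness)
  const health = http.get(`${BASE}/healthz`);
  check(health, { 'healthz 200': (r) => r.status === 200 });
  errorRate.add(health.status !== 200);

  // 2. Authenticated order flow (exercises JWT, Vault secret, Aurora write)
  const headers = { 'Authorization': `Bearer ${__ENV.SMOKE_TOKEN}`,
                    'Content-Type': 'application/json' };
  const order = http.post(`${BASE}/v1/orders`,
    JSON.stringify({ sku: 'SMOKE-TEST-SKU', qty: 1,
                     idempotency_key: `smoke-${RUN_ID}-${__VU}-${__ITER}` }),
    { headers });
  orderLatency.add(order.timings.duration); // request duration as measured by k6
  check(order, {
    'order 201': (r) => r.status === 201,
    'order id present': (r) => orderId(r) !== undefined,
  });
  errorRate.add(order.status !== 201);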

  sleep(1);
}

Production practice at scale: Large-scale teams commonly give post-deploy smoke tests a hard time budget (60 seconds here) and run them against production, not staging, because staging is rarely representative at traffic scale. The smoke- prefix on the idempotency key and the dedicated SMOKE-TEST-SKU make smoke-test writes identifiable, so a nightly job can clean them up. The k6 thresholds block is the gate: if any threshold fails (for example, p(99) reaches 800 ms), k6 exits with a non-zero code, the PostSync hook fails, the Argo CD sync is marked failed, and Argo Rollouts rolls the release back automatically.
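
As an illustration of the nightly cleanup mentioned above, the sketch below selects smoke-test orders by their idempotency key prefix. The order shape (order_id, idempotency_key, created_at) and the one-hour minimum age are assumptions made for this example, not something defined earlier in the course; the real job would query the orders table instead of filtering an in-memory array.

```javascript
// Sketch only: pick out smoke-test orders that are safe to delete.
// Assumed order shape: { order_id, idempotency_key, created_at (ISO 8601 string) }.
const SMOKE_KEY_PREFIX = 'smoke-';

// Returns the IDs of smoke orders older than minAgeMs, so that orders from a
// smoke test that is still running are left alone.
function selectSmokeOrders(orders, nowMs, minAgeMs) {
  return orders
    .filter((o) => typeof o.idempotency_key === 'string'
      && o.idempotency_key.startsWith(SMOKE_KEY_PREFIX)
      && nowMs - Date.parse(o.created_at) >= minAgeMs)
    .map((o) => o.order_id);
}

const exampleOrders = [
  { order_id: 'o-1', idempotency_key: 'smoke-3-17', created_at: '2024-01-01T10:00:00Z' },
  { order_id: 'o-2', idempotency_key: 'cust-8841', created_at: '2024-01-01T10:00:00Z' },
  { order_id: 'o-3', idempotency_key: 'smoke-1-0', created_at: '2024-01-01T23:59:30Z' },
];
const midnight = Date.parse('2024-01-02T00:00:00Z');
console.log(selectSmokeOrders(exampleOrders, midnight, 60 * 60 * 1000)); // [ 'o-1' ]
```

Order o-2 is a customer order and o-3 is only 30 seconds old, so neither is selected.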

Key Platform Metrics — What a Staff Engineer Tracks Weekly

The platform team owns five dashboards, each reviewed weekly in the engineering all-hands. These are not vanity metrics: they are leading indicators of platform health that surface problems before they become incidents.

Deploy frequency and lead time: Target is >20 deploys/day across all 12 teams, median lead time <10 minutes. A lead time spike above 20 minutes usually means a flaky test in CI, not a Kubernetes problem — the dashboard exposes this. Current: 34 deploys/day, median 6.8 min.
Error budget burn rate: The 99.95% SLO has a quarterly error budget of about 65 minutes (0.05% of a 91-day quarter, which is 131,040 minutes). A 6x burn rate (consuming the budget in one sixth of the quarter) triggers a freeze on risky changes. Track this as a Grafana alert, not a monthly report. Current: 1.2x, well below the freeze threshold.
Pod eviction rate: Karpenter Spot interruptions cause evictions. More than 3% eviction rate per day signals that the On-Demand fallback capacity ratio needs adjusting, or that a workload is missing a PodDisruptionBudget. Current: 0.8%.
Secrets rotation lag: Vault dynamic credentials rotate every 15 minutes for the database DSN. If any workload holds a lease older than 30 minutes, it appears in the Vault audit log as a violation. Track as a Prometheus gauge scraped from the Vault metrics endpoint. Current: 0 violations.
Cost per 1,000 orders: The business metric that connects platform efficiency to revenue. Kubecost computes this by joining K8s cost data with order throughput from the application metrics. Trend: $0.31/1k orders, target <$0.40. Spot usage saves ~$21k/month vs on-demand.
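
The error budget arithmetic behind the burn-rate dashboard is worth working through once. The sketch below is plain arithmetic, assuming a 91-day quarter; the function names are hypothetical, and a real setup would express the same ratios as Prometheus recording rules instead.

```javascript
// Sketch only: error budget and burn rate arithmetic for an availability SLO.

// Minutes of allowed unavailability in the window.
function errorBudgetMinutes(sloPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (100 - sloPercent) / 100;
}

// Burn rate = observed error fraction / error fraction the SLO allows.
// A value of 1 spends the budget exactly at the end of the window.
function burnRate(badMinutes, elapsedMinutes, sloPercent) {
  const observed = badMinutes / elapsedMinutes;
  const allowed = (100 - sloPercent) / 100;
  return observed / allowed;
}

// Days until the budget is gone if the burn rate holds from the start of the window.
function daysToExhaustion(rate, windowDays) {
  return windowDays / rate;
}

console.log(errorBudgetMinutes(99.95, 91).toFixed(2)); // "65.52"
console.log(errorBudgetMinutes(99.9, 91).toFixed(2));  // "131.04"
console.log(burnRate(10, 7 * 24 * 60, 99.95).toFixed(1)); // "2.0" (10 bad minutes in a week)
console.log(daysToExhaustion(6, 91).toFixed(1));          // "15.2"
```

Note how sensitive the budget is to the last digit of the SLO: moving from 99.9% to 99.95% halves the minutes available for the quarter.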

What the Capstone Proved — and What It Did Not

Being honest about the limits of this capstone is itself a senior-engineering skill. What you have built is a complete, production-grade architecture for a company at the 2–20 million user scale. What you have not built, and what would come next at a real company, includes:

Multi-tenancy isolation: Arctiq runs 12 teams in a single EKS cluster. At 50+ teams, namespace-level isolation begins to break down (noisy neighbor on the API server, overly wide RBAC, etcd pressure). The next evolution is dedicated clusters per business unit with a fleet management layer (Cluster API or EKS Blueprints at the account level).
Global distributed tracing at billion-request scale: Jaeger with in-memory storage works at our scale. At 1 billion requests/day, you need tail-based sampling (for example the OpenTelemetry Collector's tail sampling processor or Honeycomb Refinery), a scalable trace store (Tempo on object storage, or a columnar store such as ClickHouse), and sampling policies that keep 100% of error traces and about 1% of healthy ones.
FinOps maturity: Kubecost gives you per-team cost visibility. True FinOps at a large company adds unit economics (cost per API call, cost per active user), anomaly detection on spend (ML-based, not threshold-based), and engineer-facing cost nudges in the PR pipeline ("this change will increase infra cost by $400/month").
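
The sampling policy described in the tracing item (keep every error trace, keep roughly 1% of healthy ones) can be sketched as a single decision function. This is an illustration under assumptions: the trace shape (traceId plus a spans array with a status field) is hypothetical, and hashing on the first eight hex characters of the trace ID is one simple way to make the decision deterministic, not how any particular product implements it.

```javascript
// Sketch only: a tail-based sampling decision, made after all spans of a trace
// have been collected. Assumed trace shape: { traceId, spans: [{ status }] }.

// Maps a hex trace ID to a stable fraction in [0, 1), so every collector
// replica makes the same decision for the same trace.
function traceIdFraction(traceId) {
  return parseInt(traceId.slice(0, 8), 16) / 0x100000000;
}

function keepTrace(trace, healthyRate = 0.01) {
  const hasError = trace.spans.some((span) => span.status === 'error');
  if (hasError) return true; // error traces are always kept
  return traceIdFraction(trace.traceId) < healthyRate;
}

const failed = { traceId: 'ffffffff000000000000000000000001', spans: [{ status: 'ok' }, { status: 'error' }] };
const healthy = { traceId: 'ffffffff000000000000000000000002', spans: [{ status: 'ok' }] };
console.log(keepTrace(failed));  // true
console.log(keepTrace(healthy)); // false
```

The decision needs the whole trace, which is why tail-based sampling requires buffering spans in the collector tier, and why it costs more memory than head-based sampling.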

Honest assessment: A senior engineer can build what you built in this capstone. A Staff engineer can explain why each architectural choice was made and what would force a different choice. A Principal engineer can evolve the platform as the company grows from 2 million to 200 million users without a full rewrite. The difference is not tooling knowledge — it is accumulated judgment about failure modes, organizational dynamics, and the cost of complexity.

Your Career Path — The Three Trajectories

Completing a capstone like this positions you at the senior engineer level. The natural next steps diverge into three trajectories, and choosing consciously between them matters more than most engineers realize:

Staff / Principal Platform Engineer: You own a platform used by hundreds of engineers. Your output is multiplied through others — you write the golden-path templates, the architectural patterns, the internal standards. The skills that compound: technical writing, system design at scale, organizational influence without authority, and the discipline to say no to requests that increase operational burden without a commensurate reliability gain.
SRE / Production Engineering: You own the reliability of large-scale systems — typically 1M+ RPS, complex failure domains, and SLOs that real customers feel. The skills that compound: statistical analysis of reliability data, deep knowledge of kernel-level performance (eBPF, perf flamegraphs, latency histograms), incident command, and the ability to translate reliability risk into business risk that an executive can act on.
Engineering Manager / Director of Platform: You own the team that builds the platform. Your output is the output of 8–20 engineers. The skills that compound: hiring for judgment not credentials, creating a team culture that treats operational toil as a first-class engineering problem, communicating platform value to non-technical stakeholders, and defining a multi-year platform roadmap that stays relevant as the product changes.

# Practical 90-day post-capstone action plan

# Week 1-2: Publish the capstone
# Push the Terraform modules and k8s manifests to a public GitHub repo.
# Write a 2,000-word architectural decision record (ADR) explaining the three
# most interesting trade-offs you made. This becomes your portfolio artifact.
# -b sets the initial branch name (requires Git 2.28 or newer)
git init -b main arctiq-platform && cd arctiq-platform
# structure: modules/ (terraform), k8s/ (manifests), docs/ (ADRs), scripts/ (smoke tests)

# Week 3-4: Get a real AWS environment
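# New-account credits vary, so check the current AWS Free Tier terms before you start.
# Spin up a real EKS cluster (t3.medium nodes, ~$0.04/hr each, plus $0.10/hr for the
# EKS control plane), deploy the kube-prometheus-stack, and run the k6 smoke test
# against a real endpoint. Delete the cluster when you are done to stop the charges.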
eksctl create cluster \
  --name arctiq-demo \
  --region us-east-1 \
  --nodegroup-name default \
  --node-type t3.medium \
  --nodes 3 \
  --managed

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=demo1234 \
  --set prometheus.prometheusSpec.retention=7d

# Week 5-8: Certifications that matter at senior level
# CKA (Certified Kubernetes Administrator) — the one cert hiring managers check
# AWS Solutions Architect Professional — signals cloud architecture depth
# HashiCorp Vault Associate — useful if targeting companies using Vault

# Week 9-12: Contribute to the ecosystem
# Open a PR to a CNCF project (Karpenter, OpenTelemetry, Argo). Even docs.
# A merged PR to a major project is worth more than any certification on a resume.
# Good first issues: https://github.com/aws/karpenter-provider-aws/issues?q=label%3A%22good+first+issue%22

The Senior DevOps Interview — What Is Actually Tested

Technical interviews for senior DevOps and SRE roles at top companies have shifted significantly. The commodity questions ("what is a pod?", "explain blue-green deployments") are screened out in the initial filter. The interviews that matter test four things:

System design under ambiguity: You will be given a vague prompt like "design the deployment system for a company with 500 microservices." The interviewer is testing whether you ask clarifying questions (deployment frequency? team structure? tolerance for complexity?) before drawing boxes. The capstone taught you to start with requirements, not solutions.
Incident analysis: You will be given a graph or a log snippet and asked to diagnose a production incident. Typical scenario: latency p99 spiked to 4 seconds at 14:23 but p50 was unchanged — what do you look at first? (Answer: tail latency with stable median suggests a single slow downstream, not a cluster-wide problem — check distributed traces for the slowest 1% of requests, look at GC pause metrics and connection pool saturation.)
Trade-off reasoning: "Should we use Istio or Linkerd for our service mesh?" There is no right answer — there are trade-offs. Istio's broader feature set costs more CPU/memory and operational complexity. Linkerd's Rust data plane is faster and simpler but has less ecosystem tooling. The interviewer is testing whether you can articulate trade-offs clearly, not whether you have memorized a winner.
Production failure modes: "What breaks first when your EKS cluster scales from 500 to 5,000 nodes?" The answer involves etcd write throughput, API server request rate limits, a CoreDNS NXDOMAIN flood caused by search-domain expansion under the default ndots:5 setting, and the default client QPS limits on the Kubernetes scheduler. These are not things you learn from tutorials: they come from running production systems at scale, or from studying post-mortems from companies that have.
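
The reasoning in the incident analysis item (p99 spikes while p50 stays flat) is easy to verify with synthetic numbers. The sketch below uses made-up latencies: 1,000 requests at 100 ms, then the same traffic with 2% of requests hitting a slow downstream at 4,000 ms. The percentile function uses the nearest-rank method, which is one of several common definitions.

```javascript
// Sketch only: why a slow downstream moves p99 but not p50. All data is synthetic.

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedValues, p) {
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

function summarize(latenciesMs) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return { p50: percentile(sorted, 50), p99: percentile(sorted, 99) };
}

// Baseline: every request takes 100 ms.
const baseline = Array.from({ length: 1000 }, () => 100);
// Incident: every 50th request (2%) waits 4 seconds on one slow dependency.
const incident = baseline.map((ms, i) => (i % 50 === 0 ? 4000 : ms));

console.log(summarize(baseline)); // { p50: 100, p99: 100 }
console.log(summarize(incident)); // { p50: 100, p99: 4000 }
```

Because only 2% of requests are affected, the median cannot move; a cluster-wide problem such as CPU saturation would shift both percentiles together.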

The highest-signal interview preparation: Read 20 public post-mortems from companies running at the scale you want to work at. Cloudflare, Stripe, Shopify, GitHub, and Slack all publish detailed incident reports. For each one, trace the failure back to the platform layer that could have prevented it (better alerting, a chaos experiment that would have surfaced the failure mode, a missing circuit breaker). This exercise builds more interviewing ability than any course can.

Closing: What This Course Actually Taught You

This course was never really about Terraform syntax or Kubernetes YAML. Those are implementation details that change with every major version. What the course built, across 50 tutorials and this capstone, is a mental model for how production systems fail and how resilient platforms are designed to fail gracefully. The four mental models that will serve you for the next decade:

Defense in depth: No single control is enough. Every layer — network, admission control, runtime security, secret rotation, alerting — exists because the layer above it will eventually fail. You do not trust the WAF to block everything, so you have Istio mTLS. You do not trust mTLS alone, so you have OPA. You do not trust OPA alone, so you have Falco. The goal is not zero breaches — it is detecting and containing a breach before it becomes a catastrophe.
Observability-first design: A system you cannot observe is a system you cannot understand and cannot improve. Every service you deploy should be instrumentable before it is deployed, not instrumented after the first incident. SLOs are not monitoring configuration — they are a contract between the platform and the business.
Toil is a system design failure: Every time an engineer has to do something manually that could be automated, that is a bug in the platform, not a feature of the role. The greatest leverage a platform team has is to eliminate entire categories of manual work — not to do the manual work faster.
Requirements drive architecture: Every tool, every service, every layer in the platform exists because of a specific requirement with a number attached to it. If you cannot point to the requirement that justifies a piece of complexity, that piece of complexity should not exist.

The platform you built in this capstone handles the Arctiq Commerce requirements. The judgment you developed building it handles requirements you have not seen yet. That is what a senior DevOps engineering career is built on — not the tools, but the judgment to choose the right tools for the constraints in front of you.

The most common career mistake at this level: Staying too close to the technology and not investing in communication skills. The best platform engineers at large companies spend 30–40% of their time writing — architectural decision records, post-mortems, proposals, engineering blogs. The engineers who advance to Staff and Principal level are not necessarily the deepest technically — they are the ones who can explain complex systems simply enough that a VP of Engineering can make an informed decision. If you take one action from this lesson, start writing publicly about what you build.