Performance & Load Testing

Performance Engineering Mindset

Performance is not a feature you bolt on at the end of a release cycle. At Google, Amazon, and Netflix, performance is treated as a correctness requirement — a system that returns the right answer in 30 seconds is, for most purposes, wrong. This lesson builds the conceptual foundation you need before touching a single load-testing tool: the precise vocabulary of performance (latency, throughput, percentiles), the economics of slowness, and the mental models that separate engineers who find real bottlenecks from those who chase noise.

Latency vs. Throughput — Two Orthogonal Axes

Latency is the time elapsed from the moment a request is sent to the moment a complete response is received. Measure it from the client's perspective wherever you can: server-side timers miss network transit and time spent queued before the handler starts. Latency is a property of a single request.

Throughput is the rate at which a system completes work — typically requests per second (RPS), transactions per second (TPS), or messages per second. Throughput is a property of the system under a given load.

These two axes are related but not the same. Little's Law captures the relationship cleanly:

# Little's Law:
# L = λ × W
#
# L = average number of requests in the system (concurrency)
# λ = throughput (arrivals per second = requests per second at steady state)
# W = average latency (seconds per request)
#
# Example: your service processes 1,000 RPS with mean latency 50 ms
#   L = 1000 × 0.050 = 50 concurrent requests in-flight at any moment
#
# Consequence: to DOUBLE throughput at the SAME latency,
# you must be able to handle double the in-flight concurrency.
# If your thread pool is sized at 50, doubling RPS will queue requests
# and latency will climb sharply, even if the CPU is far from saturated.

Little's Law is technology-agnostic and applies to any stable queuing system: a Kubernetes pod, a database connection pool, a load balancer, or a highway on-ramp. Fix any one of the three quantities and the other two move together. If concurrency is capped (a thread pool, a connection pool), pushing offered load past L / W cannot raise throughput; the excess requests queue and latency rises. Conversely, cutting latency at the same concurrency raises the throughput the system can sustain. Keep this triangle in your head at all times.
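The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a capacity-planning tool; the function names in_flight and max_throughput are made up for this example.

```python
def in_flight(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return throughput_rps * mean_latency_s


def max_throughput(concurrency_limit: int, mean_latency_s: float) -> float:
    """Rearranged as lambda = L / W: the highest sustainable throughput when
    at most `concurrency_limit` requests can be served at once."""
    return concurrency_limit / mean_latency_s


# The example from the text: 1,000 RPS at 50 ms mean latency.
print(f"in flight at 1,000 RPS: {in_flight(1000, 0.050):.0f}")
# A hypothetical pool of 50 workers at 50 ms per request.
print(f"ceiling for 50 workers: {max_throughput(50, 0.050):.0f} RPS")
# Doubling the offered load at the same latency needs twice the slots.
print(f"in flight at 2,000 RPS: {in_flight(2000, 0.050):.0f}")
```

The middle line is the one to remember when sizing pools: the worker count divided by mean latency is a hard ceiling on throughput, whatever the CPU is doing.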

Why Averages Lie: The Case for Percentiles

The single most important habit you can develop is never trusting a mean latency number. An average conceals the distribution. A service with mean latency 20 ms can still be delivering 5-second responses to a significant fraction of users if the distribution has a long tail.

The industry standard is to report latency as percentiles:

p50 (median): the latency experienced by the "typical" user. Half of requests are faster, half are slower.
p95: 95 % of requests complete within this time. 1 in 20 requests takes at least this long.
p99: The threshold that 99 % of requests stay under. At 1,000 RPS, roughly 10 requests per second exceed this number.
p999 (p99.9): Only 1 request in 1,000 is slower. At 10,000 RPS, 10 requests per second exceed it. For high-traffic services this is the number that drives SLA breaches and user complaints.
p9999 (p99.99) and max: The extreme tail and the single worst observation. Meaningful for financial transactions, safety-critical systems, or any workflow where a single slow response blocks downstream work.

A concrete production example: a checkout service at a major retailer might report p50=12 ms, p99=180 ms, p999=4,200 ms. The mean is 22 ms. Reporting "our service has 22 ms latency" hides the fact that 1 in 1,000 customers waits over 4 seconds — at peak traffic of 50,000 RPS, that is 50 users per second experiencing a 4-second stall on the payment page.

Figure: A realistic latency distribution. The mean sits near the peak, but the long tail stretches p99.9 to 350x the median, hidden from any average-based dashboard.
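To see how a mean hides a tail, the sketch below computes both from a hand-built synthetic sample. The sample is an assumption chosen to have the same p50, p99, and p99.9 as the checkout example; it is not real data, so its mean differs slightly. The percentile helper uses the nearest-rank method with integer per-mille inputs to avoid floating-point rounding at p99.9.

```python
from statistics import mean


def percentile(sorted_ms: list[int], per_mille: int) -> int:
    """Nearest-rank percentile. per_mille=990 means p99, 999 means p99.9."""
    rank = -(-len(sorted_ms) * per_mille // 1000)  # ceiling division
    return sorted_ms[max(rank, 1) - 1]


# Synthetic sample of 1,000 request latencies in milliseconds (assumed data).
latencies_ms = sorted([12] * 900 + [60] * 89 + [180] * 9 + [4200] * 2)

print(f"mean  = {mean(latencies_ms):.2f} ms")
print(f"p50   = {percentile(latencies_ms, 500)} ms")
print(f"p99   = {percentile(latencies_ms, 990)} ms")
print(f"p99.9 = {percentile(latencies_ms, 999)} ms")
```

The mean lands at about 26 ms, a number that describes almost no actual request: 90 % of the sample is faster, and the slowest requests are more than 150 times slower.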

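Prometheus classic histograms are pre-bucketed. The accuracy of histogram_quantile(0.99, ...) depends entirely on how you defined your bucket boundaries, because the function interpolates linearly inside whichever bucket contains the requested quantile. If your buckets stop at 1 s and your p99 is 2 s, Prometheus will silently return 1 s, the upper bound of the highest finite bucket. Always define at least one bucket above your worst expected latency. A common pattern is a roughly exponential series (steps of about 2x to 2.5x rather than strict doubling): 5 ms, 10 ms, 25 ms, 50 ms, 100 ms, 250 ms, 500 ms, 1 s, 2.5 s, 5 s, 10 s.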
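The clamping behaviour is easier to believe once you see it. The following Python function is a simplified emulation of how histogram_quantile treats a classic histogram (linear interpolation within a bucket, clamping at the highest finite bound). It is a sketch for intuition, not the Prometheus implementation, and the bucket counts are invented for the example. It assumes 0 < q < 1.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Simplified emulation of PromQL histogram_quantile for classic histograms.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
    bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the highest finite bucket, so the
                # result is clamped to that bucket's upper bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented counts: 1,000 observations, 30 of them slower than 1 s.
narrow = [(0.5, 950), (1.0, 970), (math.inf, 1000)]
wide = [(0.5, 950), (1.0, 970), (2.5, 995), (5.0, 1000), (math.inf, 1000)]

print(f"buckets stop at 1 s: p99 = {histogram_quantile(0.99, narrow):.2f} s")
print(f"buckets reach 5 s:   p99 = {histogram_quantile(0.99, wide):.2f} s")
```

With the same underlying observations, the narrow layout reports 1.00 s while the wider layout reports about 2.20 s. Neither is exact, but only the second tells you the tail has left the range you planned for.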

The Economics of Slowness

Performance problems have direct revenue impact. Amazon's engineering teams famously found that every 100 ms of latency added to page load cost roughly 1 % in sales. Google reported a 0.5-second slowdown in search results reduced traffic by 20 %. These numbers are from the 2000s — user tolerance for slowness has only decreased since.

The modern framing is the Core Web Vitals model: Google uses these metrics as a search ranking signal, and a page is rated "good" only when Interaction to Next Paint (INP) stays at or below 200 ms and Largest Contentful Paint (LCP) at or below 2.5 s, both assessed at the 75th percentile of page loads. Slowness is now a direct SEO cost, not just a UX cost.

At the infrastructure level, slowness has a compounding effect through cascading latency. In a microservice architecture where a single user request fans out to 10 downstream calls, the end-to-end latency follows the maximum of the parallel calls, not the mean. If each service has p99 = 50 ms and the calls are independent, the probability that at least one of 10 parallel calls exceeds its p99 is 1 − (0.99)^10 ≈ 9.6 %. In other words, each service's p99 becomes roughly the composite p90: about 1 in 10 user requests is at least as slow as the individual p99. To hold a composite p99 of 50 ms, each of the 10 services must stay under 50 ms at roughly its own p99.9. This is the tail amplification problem that every senior engineer at a services company must internalise.
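The fan-out arithmetic generalises to any number of parallel calls. The sketch below assumes the calls are statistically independent, which real services sharing hosts or dependencies often are not, so treat the output as a rough guide. The function names are illustrative.

```python
def p_any_slow(fan_out: int, per_call_quantile: float = 0.99) -> float:
    """Probability that at least one of `fan_out` independent parallel calls
    exceeds its own per-call quantile threshold."""
    return 1.0 - per_call_quantile ** fan_out


def required_per_call_quantile(fan_out: int, composite_quantile: float = 0.99) -> float:
    """Per-call quantile each dependency must meet so that the whole fan-out
    meets `composite_quantile` (independence assumed)."""
    return composite_quantile ** (1.0 / fan_out)


for n in (1, 10, 100):
    print(
        f"fan-out {n:>3}: "
        f"P(at least one call slower than its p99) = {p_any_slow(n):.1%}, "
        f"per-call quantile needed for composite p99 = "
        f"{required_per_call_quantile(n):.4%}"
    )
```

At a fan-out of 100, well over half of all user requests wait on at least one call that is slower than its p99, which is why large fan-out systems budget against p99.9 and beyond.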

The canonical mitigation for tail amplification is the hedged request pattern (also called "backup requests"): after a short delay (e.g., the 95th-percentile latency), send a second identical request to a different replica and use whichever response arrives first. Variants of this pattern are used with Google Bigtable, in Cassandra (speculative retry), and in Envoy-based service meshes (request hedging). The trade-off is increased backend load: a hedge that fires at p95 duplicates roughly 5 % of requests by definition, and a shorter delay duplicates more, in exchange for a significant reduction in tail latency. Only apply it to idempotent read paths.
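A minimal sketch of the pattern using Python's asyncio follows. The hedged function and the fake_replica stand-in are hypothetical names for this example, and the replica simulates network calls with sleeps. A production version would also need error handling, since here an exception from the first task to finish simply propagates.

```python
import asyncio
from typing import Awaitable, Callable


async def hedged(
    call: Callable[[str], Awaitable[str]],
    primary: str,
    backup: str,
    hedge_delay_s: float,
) -> str:
    """Send to `primary`. If no reply arrives within `hedge_delay_s`, also
    send to `backup` and return whichever response arrives first."""
    first = asyncio.ensure_future(call(primary))
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # primary answered in time, no hedge sent

    second = asyncio.ensure_future(call(backup))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not keep consuming capacity
    return done.pop().result()


async def fake_replica(name: str) -> str:
    # Hypothetical stand-in for a network call: "slow" stalls, others are fast.
    await asyncio.sleep(0.5 if name == "slow" else 0.01)
    return f"response from {name}"


result = asyncio.run(hedged(fake_replica, "slow", "fast", hedge_delay_s=0.05))
print(result)  # response from fast
```

The hedge delay is the tuning knob: set it near the p95 of the call being hedged so that only the slowest few percent of requests are ever duplicated.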

Connecting to What You Already Know

Everything you have done in this course feeds into performance engineering:

Prometheus + Grafana: your primary instrument for observing percentiles in production. The histogram_quantile function and RED dashboards (Rate, Errors, Duration) are the entry point for every performance investigation.
Kubernetes resource limits: CPU throttling at the cgroup level is one of the most common hidden latency contributors in containerised workloads. A pod running at its CPU limit does not slow gradually — requests queue and latency spikes non-linearly.
Distributed tracing (Jaeger/Tempo): percentile metrics tell you that something is slow; traces tell you where. The two tools are complementary, not redundant.
SLOs from SRE principles: your latency SLO (e.g., "p99 < 200 ms over a 30-day window") is the formal statement of the performance contract. Load testing validates that you can deliver that SLO under realistic traffic.

# Quick sanity-check: query p99 latency for a service in Prometheus
# (assumes standard histogram metric http_request_duration_seconds)

histogram_quantile(
  0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket{service="checkout"}[5m])
  )
)

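# If this returns NaN, the selector matched no observations in the window
# (no traffic, or a wrong label value).
# If it returns exactly your highest finite bucket bound, the true p99
# probably lies above your buckets. Check: compare the le="+Inf" bucket
# count with the count in your highest finite bucket.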

# Throughput for the same service:
sum(rate(http_request_duration_seconds_count{service="checkout"}[5m])) by (service)

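# Error ratio (complement of success rate). Assumes the metric carries a
# status_code label; note that 3xx and 4xx responses count as errors here: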
1 - (
  sum(rate(http_request_duration_seconds_count{service="checkout", status_code=~"2.."}[5m]))
  /
  sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
)

The Performance Engineering Workflow

Before you run a single load test, establish three things:

Define the SLO: what is the acceptable latency at which percentile, under what concurrency, for what error budget? Without a concrete target, you cannot declare success or failure.
Establish the baseline: measure the system under production-representative traffic now, before any changes. You cannot improve what you have not measured, and you cannot confirm an improvement without a baseline.
Identify the bottleneck hypothesis: use existing observability data (dashboards, traces, profiler samples) to form a specific hypothesis about where the constraint is. Load testing without a hypothesis produces noise; load testing with a hypothesis produces signal.
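One way to make the first two steps concrete is to write the SLO down as data and evaluate measurements against it. The sketch below is an illustration under assumed names (LatencySLO, meets_slo) and invented sample data, using a nearest-rank percentile; it is not tied to any particular load-testing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    per_mille: int          # 990 means p99
    threshold_ms: float     # the percentile must stay below this
    max_error_ratio: float  # error budget, as a fraction of requests


def percentile(sorted_ms: list[float], per_mille: int) -> float:
    """Nearest-rank percentile on an already sorted sample."""
    rank = -(-len(sorted_ms) * per_mille // 1000)  # ceiling division
    return sorted_ms[max(rank, 1) - 1]


def meets_slo(slo: LatencySLO, latencies_ms: list[float], error_count: int) -> bool:
    """True when both the latency target and the error budget are met."""
    observed = percentile(sorted(latencies_ms), slo.per_mille)
    error_ratio = error_count / len(latencies_ms)
    return observed < slo.threshold_ms and error_ratio <= slo.max_error_ratio


# "p99 < 200 ms" with a 0.1 % error budget (example target).
slo = LatencySLO(per_mille=990, threshold_ms=200, max_error_ratio=0.001)

# Invented measurements: a baseline run and a run after a change.
baseline = [20.0] * 980 + [150.0] * 15 + [400.0] * 5
candidate = [20.0] * 980 + [150.0] * 5 + [400.0] * 15

print("baseline meets SLO: ", meets_slo(slo, baseline, error_count=0))
print("candidate meets SLO:", meets_slo(slo, candidate, error_count=0))
```

Both runs have the same median, so only a check against the stated percentile separates the passing baseline from the failing candidate.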

The remaining lessons in this tutorial operationalise each of these steps — k6 and JMeter for synthetic load, profiling tools for application internals, and structured reporting to close the feedback loop with stakeholders. The mental models in this lesson are the lens through which everything else will be interpreted.