Performance & Load Testing

Designing Realistic Tests

A load test that does not resemble production is worse than no test at all. It gives you confidence you have not earned. Engineers who run synthetic tests — a single endpoint hammered with identical requests from one IP — routinely discover in production that their service melts under traffic patterns they never exercised. Designing a realistic test requires three things done right: production-like data, a traffic model that mirrors actual user behavior, and a test environment that does not introduce artifacts that mask real problems.

Why Synthetic Tests Lie

A purely synthetic load test fails in predictable ways. Databases cache execution plans and keep recently read pages in memory; hitting a single row ID every time means every request gets a cached plan and a hot page, which production never sees when lookups are spread across 400 million rows. CDN and reverse-proxy caches warm up on repeated identical URLs; the cache-hit rate in your test may be 95% while production sits closer to 40%. Connection pools are sized for expected concurrency; a single-threaded ramp never exhausts the pool, so it hides the connection starvation that an N+1 query pattern causes once 500 coroutines hit the ORM simultaneously. These are not edge cases. They are common reasons teams greenlight a system that falls over on launch day.

Production-Like Data

The gold standard is a recent, anonymized snapshot of production data loaded into the test database. Anonymizing PII is non-negotiable: replace real names and emails with generated equivalents, truncate payment tokens, and pseudonymize user IDs with a keyed hash such as HMAC-SHA-256. A hash cannot be reversed, so if you need traceability for debugging, keep a separately secured lookup table from pseudonym to original ID. Tools like faker, mimesis, or the PostgreSQL Anonymizer extension handle this at scale.
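A minimal sketch of keyed-hash pseudonymization, assuming Node.js and its built-in crypto module. The key constant, the function names, and the row fields (id, name, email) are hypothetical and chosen only for illustration; a real pipeline would load the key from a secret store.

```javascript
const crypto = require('crypto');

// Hypothetical placeholder: load the real key from a secret store, never from source control.
const PSEUDONYM_KEY = 'replace-with-a-secret-from-your-vault';

// Keyed hash: the same input always maps to the same pseudonym, so foreign keys
// still join, but the original ID cannot be recovered from the pseudonym itself.
function pseudonymizeId(userId, key = PSEUDONYM_KEY) {
  return crypto
    .createHmac('sha256', key)
    .update(String(userId))
    .digest('hex')
    .slice(0, 16); // keep 64 bits of the digest; lengthen if collisions matter
}

// Replace direct identifiers on one row; every other column passes through unchanged.
function anonymizeUser(row, key = PSEUDONYM_KEY) {
  const id = pseudonymizeId(row.id, key);
  return {
    ...row,
    id,
    name: `user_${id.slice(0, 8)}`,
    email: `${id}@example.invalid`,
  };
}
```

Because the mapping is deterministic for a given key, running the same function over every table that references the user ID keeps referential integrity intact.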

For most teams, a full production clone is impractical due to size. The correct approach is representative sampling: reproduce the statistical shape of production data rather than every row. That means preserving cardinality (number of distinct values), distribution skew (a small percentage of power users drive 80% of writes — Zipf distribution), and referential integrity (foreign keys, join tables). A 10 GB sample that reflects the 1 TB distribution correctly is far more useful than a 10 GB sample that overrepresents new accounts created in the last 24 hours.
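One way to reproduce that skew when picking users in a test script is a Zipf-weighted sampler in place of a uniform pick. The sketch below is an illustration, not a k6 feature: makeZipfSampler, the exponent s, and the injectable rng parameter are names introduced here, and the exponent should be fitted to your own write distribution.

```javascript
// Build a sampler that returns indices 0..n-1 with probability proportional
// to 1 / (rank ^ s), where rank = index + 1. s = 1 is the classic Zipf shape.
function makeZipfSampler(n, s = 1.0, rng = Math.random) {
  const cdf = new Array(n);
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += 1 / Math.pow(i + 1, s);
    cdf[i] = total; // cumulative (unnormalized) weight up to and including index i
  }
  return function sample() {
    const target = rng() * total;
    // Binary search for the first cumulative weight strictly above the target.
    let lo = 0;
    let hi = n - 1;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (cdf[mid] > target) hi = mid;
      else lo = mid + 1;
    }
    return lo;
  };
}
```

If the fixture is sorted so that the heaviest users come first, users[sampleUser()] with sampleUser = makeZipfSampler(users.length) sends most iterations to a small head of accounts while still touching the long tail.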

Key idea: The most dangerous data artifact is a cold cache at test start. In production, your Redis cache and CDN are warm. Running a load test against a cold cache measures your origin-server performance under peak miss load — a scenario that occurs in production only after a full cache flush or a deployment that invalidates all keys. Decide deliberately: if you are testing cache-cold behavior (a valid scenario), accept the inflated numbers; if you are testing steady-state throughput, pre-warm the cache before your measurement window begins.

To generate parameterized test data, build a CSV or JSON fixture file and drive your k6 (or JMeter) test from it rather than hardcoding values:

// k6 script: drive requests from JSON fixtures of users and products (the same pattern works for CSV)
import http from 'k6/http';
import { SharedArray } from 'k6/data';
import { check, sleep } from 'k6';

// SharedArray loads once, shared across all VUs — avoids duplicating 100k rows per VU
const users = new SharedArray('users', function () {
  return JSON.parse(open('./fixtures/users.json'));
});

const products = new SharedArray('products', function () {
  return JSON.parse(open('./fixtures/products.json'));
});

export default function () {
  const user    = users[Math.floor(Math.random() * users.length)];
  const product = products[Math.floor(Math.random() * products.length)];

  const res = http.get(
    `https://staging.example.com/api/v1/products/${product.id}`,
    { headers: { Authorization: `Bearer ${user.token}` } }
  );

  check(res, {
    'status 200':         (r) => r.status === 200,
    'latency < 200ms':   (r) => r.timings.duration < 200,
  });

  sleep(Math.random() * 2 + 0.5); // think time: 0.5 – 2.5 s
}

Traffic Modeling

Real traffic is not a flat line. It has daily and weekly periodicity, spikes triggered by marketing campaigns or viral events, gradual ramps at business-day start, and the terrifying thundering herd that occurs when a scheduled job releases thousands of deferred notifications simultaneously. Your test shape must model these patterns, not just peak load in isolation.

The canonical traffic model for a B2C web service follows a 24-hour curve that looks like a sine wave with a second, smaller peak in the evening. Extract this from your APM tool (Datadog, Grafana, New Relic) by querying the last 30 days of request-rate data and computing, for each hour of the day, the 95th-percentile request rate. The test scenario should replay a compressed version of that curve: ramp up to morning peak, sustain, dip, return to evening peak, ramp down. Total test duration: 30–60 minutes for a full shape replay.
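Compressing the hourly curve into test stages can be sketched as a small helper. This is an illustration under two stated assumptions: VU count scales linearly with request rate, and every hour gets an equal slice of the test. curveToStages and its parameters are hypothetical names; the output uses the { duration, target } stage shape that k6 expects.

```javascript
// hourlyRps: one request-rate value per hour of the day (for example the
//            95th-percentile rate for that hour over the last 30 days).
// peakVUs:   VU count to run at the busiest hour.
// totalMinutes: length of the compressed replay.
function curveToStages(hourlyRps, peakVUs, totalMinutes) {
  const peak = Math.max(...hourlyRps);
  const secondsPerStage = Math.round((totalMinutes * 60) / hourlyRps.length);
  return hourlyRps.map((rps) => ({
    duration: `${secondsPerStage}s`,
    // Scale each hour relative to the busiest hour.
    target: Math.round((rps / peak) * peakVUs),
  }));
}
```

A 24-value curve replayed over 48 minutes yields 24 stages of 120 seconds each, which can be pasted into the stages array of a ramping-vus scenario.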

k6 stages and the ramping-vus executor implement this directly:

// k6 executor: ramping-vus shaped after a real traffic curve
export const options = {
  scenarios: {
    traffic_shape: {
      executor: 'ramping-vus',
      startVUs: 0,
      // Each stage ramps linearly from the previous target to its own target.
      stages: [
        { duration: '5m',  target: 200  },  // morning ramp-up: 0 → 200
        { duration: '10m', target: 500  },  // climb to morning peak
        { duration: '5m',  target: 300  },  // fall to midday trough
        { duration: '10m', target: 600  },  // climb to afternoon spike (promo event)
        { duration: '5m',  target: 400  },  // post-spike decay
        { duration: '10m', target: 700  },  // climb to evening peak
        { duration: '5m',  target: 0    },  // ramp-down
      ],
      gracefulRampDown: '30s',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<800'],
    http_req_failed:   ['rate<0.005'],           // <0.5% error rate
  },
};

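// For a thundering-herd scenario, add a second scenario that uses the
// constant-arrival-rate executor with startTime: '25m', rate: 200, timeUnit: '1s',
// duration: '10s', and enough preAllocatedVUs to sustain it. That fires about
// 2,000 requests in 10 seconds at t=25m to simulate a push-notification release.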

Beyond VU count, model the request mix accurately. In a typical e-commerce service, reads outweigh writes 9:1. Catalog browse and search account for 70% of reads; checkout and order status pages account for the rest. If your load test is 100% catalog reads, you have not tested the database write contention that causes p99 latency to spike during flash sales. Build a weighted scenario that assigns each virtual user to a behavior group at the start of each iteration, matching the production mix.
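A weighted pick at the start of each iteration might look like the sketch below. The weights are derived from the 9:1 and 70% figures above purely as an illustration; the group names and the pickBehavior helper are hypothetical, and the real mix should come from your own traffic analysis.

```javascript
// Illustrative mix: 90% reads, of which 70% are browse/search (63% overall)
// and 30% are other reads (27% overall), plus 10% writes.
const BEHAVIOR_MIX = [
  { name: 'browse_search', weight: 63 },
  { name: 'other_reads',   weight: 27 },
  { name: 'writes',        weight: 10 },
];

// Return the name of one behavior group, chosen in proportion to its weight.
function pickBehavior(mix, rng = Math.random) {
  const total = mix.reduce((sum, b) => sum + b.weight, 0);
  let remaining = rng() * total;
  for (const behavior of mix) {
    remaining -= behavior.weight;
    if (remaining < 0) return behavior.name;
  }
  return mix[mix.length - 1].name; // guard against floating-point edge cases
}
```

Inside a k6 default function, the returned name can drive a switch statement that calls the request sequence for that group.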

Test Environment Fidelity

A test environment that misrepresents production topology produces results that are either falsely reassuring or falsely alarming — both outcomes waste engineering time. The minimum bar for a meaningful load test is:

Same hardware class. A 4-vCPU staging server will show saturation at 3,000 RPS while the 64-vCPU production fleet handles 80,000. Size staging at 1/N of the production fleet on the same instance type and multiply results by N, treating the product as a rough estimate because shared dependencies such as the database rarely scale linearly. Better still, test on a production replica that mirrors the exact instance type and autoscaling configuration.
Same network topology. If production traffic goes through a load balancer, CDN, and API gateway, the test must traverse the same path. Testing directly against an application server bypasses every layer that adds latency and that can become a bottleneck under load (connection limits on the load balancer, rate limits on the API gateway, origin-shield logic in the CDN).
Same dependency configuration. Stub or shadow downstream services only when they are genuinely outside your control (third-party payment processor). For services you own, run real instances sized proportionally. Stubbing your own inventory service because staging does not have one means you will never detect that your inventory service becomes a serialization bottleneck when order throughput exceeds 200 TPS.
Isolation from production traffic. Never run a load test against a production environment unless you have explicit blast-radius controls: feature flags that route test traffic to a shadow pool, or a canary environment wired to real backends but receiving synthetic requests only. A load test that leaks synthetic orders into the production order database is a data quality incident.

Figure: A representative load-test topology with a fixture-driven generator, the full CDN/LB/gateway path, a pre-warmed cache, and an anonymized DB snapshot, isolated from the production write path.

Think Time and Session Modeling

Real users pause between actions. A human browsing a product catalog spends 2–15 seconds reading a page before clicking the next link. Omitting think time in your test collapses the user session into a tight loop that can generate 10x or more the requests a real user would, which means your 500-VU test actually simulates 5,000 or more concurrent active users, a number you did not intend and cannot interpret. Model think time as a random variable drawn from a distribution that matches your session analytics (log-normal is common; check your browser telemetry). In k6, sleep(Math.random() * think_time) is the minimum; for more realism, draw from an exponential distribution (the gap between events in a Poisson process) with -mean * Math.log(1 - Math.random()).
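Both distributions can be sampled with a few lines of plain JavaScript. The helpers below are a sketch: the function names and the injectable rng parameter are introduced here, and mu and sigma for the log-normal case are assumed to come from fitting your own session analytics.

```javascript
// Exponential think time with the given mean, in seconds. This is the gap
// between events in a Poisson process, not a Poisson-distributed count.
function exponentialThinkTime(meanSeconds, rng = Math.random) {
  return -meanSeconds * Math.log(1 - rng());
}

// Log-normal think time. mu and sigma are the mean and standard deviation of
// ln(thinkTime), so the median is exp(mu). Uses the Box-Muller transform.
function logNormalThinkTime(mu, sigma, rng = Math.random) {
  const u1 = 1 - rng(); // shift to (0, 1] so Math.log never receives 0
  const u2 = rng();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z);
}

// Clamp to the range seen in analytics so one extreme draw cannot stall a VU.
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}
```

In a k6 script the result would be passed to sleep, for example sleep(clamp(logNormalThinkTime(Math.log(5), 0.6), 2, 15)) for a median of 5 seconds bounded to the 2–15 second range mentioned above; those parameter values are placeholders.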

Beyond think time, model the user journey: the ordered sequence of pages a typical session visits. An e-commerce session is not random URL selection. It is home → search → PDP (product detail page) → cart → checkout, with dropout probabilities at each step. If only 15% of sessions reach checkout, 85% never touch the checkout endpoint. If your load test hits checkout at the same rate as product search, you are testing a behavioral pattern that does not exist in production, and you will see checkout-service saturation that real users would never cause.
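A journey with per-step dropout can be sketched as a list of steps, each with a probability of continuing. The probabilities below are hypothetical placeholders chosen so that roughly 15% of sessions reach checkout; replace them with numbers from your own funnel analytics. The helper names are also introduced here for illustration.

```javascript
// continueProb: chance that a session proceeds from this step to the next one.
const JOURNEY = [
  { step: 'home',     continueProb: 0.80 },
  { step: 'search',   continueProb: 0.70 },
  { step: 'pdp',      continueProb: 0.45 },
  { step: 'cart',     continueProb: 0.60 },
  { step: 'checkout', continueProb: 0 },
];

// Walk the journey once and return the steps this session actually visits.
function simulateSession(journey, rng = Math.random) {
  const visited = [];
  for (const { step, continueProb } of journey) {
    visited.push(step);
    if (rng() >= continueProb) break; // user drops out after this step
  }
  return visited;
}

// Probability that a session gets as far as the named step.
function reachProbability(journey, stepName) {
  let p = 1;
  for (const { step, continueProb } of journey) {
    if (step === stepName) return p;
    p *= continueProb;
  }
  return 0; // step is not part of this journey
}
```

With these placeholder values, 0.80 × 0.70 × 0.45 × 0.60 ≈ 0.151, so about 15% of simulated sessions issue a checkout request. In a k6 iteration, each visited step would map to the HTTP calls for that page.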

Production practice: Extract your top-10 user journeys from session replay tools (FullStory, Amplitude, Mixpanel) or from your access logs via awk and sort -rn. Encode each journey as a k6 scenario group, assign VU weights matching the real journey distribution, and you have a test that the production analytics team can verify makes sense — a major credibility win when presenting results to leadership.

Production pitfall: Many teams size their test by VU count rather than by target RPS (requests per second). VU count is an opaque metric: 500 VUs with 0 ms think time generate radically more RPS than 500 VUs with 2 s think time. Always state your test targets in terms of RPS or arrivals/second, measure the actual rate during the test, and validate that it matches the production traffic you intended to simulate. Use k6's constant-arrival-rate executor when you need to hold a precise RPS target regardless of response time.
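The relationship between VU count and request rate can be made explicit with a closed-loop approximation. The sketch assumes each iteration issues exactly one request, waits for the response, and then sleeps for its think time; the function names are hypothetical, and real scripts with several requests per iteration need the formula adjusted accordingly.

```javascript
// Closed-loop model: one request per iteration, all times in seconds.
// Each VU completes one iteration every (response time + think time) seconds.
function expectedRps(vus, avgResponseSeconds, avgThinkSeconds) {
  return vus / (avgResponseSeconds + avgThinkSeconds);
}

// Inverse: how many VUs are needed to reach a target request rate.
function vusForTargetRps(targetRps, avgResponseSeconds, avgThinkSeconds) {
  return Math.ceil(targetRps * (avgResponseSeconds + avgThinkSeconds));
}
```

With 100 ms responses, 500 VUs produce roughly 5,000 RPS with no think time and roughly 238 RPS with 2 s of think time, which is why a VU count alone says little about the load applied.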

Baseline Sizing: A Worked Example

To arrive at a realistic peak VU count: from your APM tool, read the peak RPS you need to sustain (say, 2,000 RPS). From session analytics, your average session duration is 8 minutes and each session makes 40 requests — so each user generates 40/480 ≈ 0.083 RPS. Little's Law gives you concurrent users = throughput / per-user rate = 2,000 / 0.083 ≈ 24,000 concurrent sessions. That is your VU target at peak, not 500. Teams that start with an arbitrary "let's try 100 VUs" number are not load testing — they are warming up the JVM.
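The same arithmetic as a small helper, written in the standard Little's Law form L = λ × W, where λ is the session arrival rate and W is the session duration. The function name and parameters are introduced here for illustration.

```javascript
// peakRps:            request rate to sustain at peak
// sessionSeconds:     average session duration (W)
// requestsPerSession: average requests made by one session
function peakConcurrentSessions(peakRps, sessionSeconds, requestsPerSession) {
  const sessionsPerSecond = peakRps / requestsPerSession; // arrival rate λ
  return sessionsPerSecond * sessionSeconds;              // L = λ × W
}
```

For the figures above, peakConcurrentSessions(2000, 480, 40) returns 24000: 50 new sessions per second, each lasting 480 seconds.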

Once your test is running with production-like data, a realistic traffic shape, and correct environment fidelity, the metrics it produces mean something. Latency percentiles map to real user experience. Error rates reflect actual system behavior under load. Saturation signals point to the real bottleneck. Only then are you in a position to make an evidence-based capacity decision — which is what the rest of this tutorial covers.