Project: Design an IDP

This capstone lesson walks through the end-to-end design of a production-grade Internal Developer Platform (IDP) for a sample organisation — Acme Corp — a 400-engineer fintech running 120 microservices on Kubernetes across two AWS regions. By the end you will have a complete golden-path specification, a self-service surface design, the Backstage scaffolder template that powers it, the Crossplane Composition for on-demand PostgreSQL, and the Kyverno policies that enforce guardrails — every artefact ready to drop into a real monorepo.

Step 1 — Characterise the Organisation

Before writing a single line of YAML, collect four inputs:

  1. Service taxonomy. Acme has three archetypes: API service (Java/Go, REST/gRPC), async worker (Kafka consumer, Python/Go), and ML inference (Python, GPU-optional). Every golden path maps to one archetype.
  2. Cognitive tax survey. A 10-question dev survey reveals the top four pain points: (1) writing Kubernetes YAML from scratch, (2) provisioning databases, (3) wiring Datadog APM, (4) configuring RBAC. These become the IDP's first four self-service actions.
  3. Compliance requirements. PCI-DSS scope for the payments cluster means: immutable container images (no :latest), no root containers, mandatory network policy, secrets from Vault (not env vars). These become Kyverno policies enforced at admission.
  4. Existing assets inventory. Acme already has: Terraform modules for VPC/EKS, a Vault PKI mount, a Datadog agent DaemonSet, and 40 Helm charts of varying quality. The IDP wraps these; it does not replace them.
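Ranking the survey results can be as simple as a tally. The sketch below assumes each response is a list of pain-point labels; the `SurveyResponse` shape and the label strings are hypothetical, since the lesson does not specify the survey format.

```typescript
// Hypothetical survey shape: each respondent lists the pain points they hit.
interface SurveyResponse {
  respondent: string;
  painPoints: string[];
}

// Count how many respondents mention each pain point and return the top n labels.
function topPainPoints(responses: SurveyResponse[], n: number): string[] {
  const votes: Record<string, number> = {};
  for (const r of responses) {
    // Deduplicate so one respondent counts once per pain point.
    const unique = r.painPoints.filter((p, i, all) => all.indexOf(p) === i);
    for (const p of unique) {
      votes[p] = (votes[p] ?? 0) + 1;
    }
  }
  // Sort by vote count (descending), breaking ties alphabetically.
  return Object.keys(votes)
    .sort((a, b) => votes[b] - votes[a] || a.localeCompare(b))
    .slice(0, n);
}

const sample: SurveyResponse[] = [
  { respondent: 'a', painPoints: ['k8s-yaml', 'db-provisioning'] },
  { respondent: 'b', painPoints: ['k8s-yaml', 'rbac'] },
  { respondent: 'c', painPoints: ['k8s-yaml', 'db-provisioning', 'datadog-apm'] },
];
const ranked = topPainPoints(sample, 2);
```

Calling `topPainPoints(responses, 4)` on the real survey data would give the candidate list for the first four self-service actions.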

Step 2 — Define the Golden Path for an API Service

A golden path is a fully-specified, opinionated delivery path from git init to production. For Acme's API service archetype it covers six dimensions:

  • Scaffold — a Backstage Software Template generates the repo skeleton (Dockerfile, Makefile, .github/workflows/ci.yaml, k8s/ manifests, catalog-info.yaml, docs/ TechDocs stub) in under 60 seconds.
  • CI pipeline — GitHub Actions: lint → unit test → docker build --provenance=true --sbom=true → Trivy scan (block on CRITICAL) → push to ECR with immutable tag sha-<commit> → update the GitOps repo via a PR to k8s/overlays/staging/<service>/image.yaml.
  • GitOps delivery — ArgoCD Application CR per environment (staging, production). Staging auto-syncs; production requires a manual sync gate or a JIRA approval webhook.
  • Runtime — a pre-tested Helm chart with sensible defaults: resources.requests.cpu: 200m, resources.limits.memory: 512Mi, HPA on CPU+RPS, PodDisruptionBudget minAvailable: 1, Istio sidecar enabled, Datadog APM via admission controller.
  • Observability — automatic Datadog dashboard (provisioned by a Backstage action calling the Datadog API) and a default SLO (99.5% success rate, 300 ms p99 latency) wired to a PagerDuty service.
  • Security — secrets injected by the Vault Agent sidecar, which authenticates with a Vault AppRole; mTLS via Istio; NetworkPolicy denying all ingress except the mesh gateway and the monitoring namespace.

For the first six months the golden path is opt-in rather than mandatory. Teams that deviate still have the security policies enforced, but they own their own pipeline and dashboard. After six months, any service not on the golden path appears as a "technical debt" finding in the Backstage scorecard. This approach drives adoption without mandate wars.
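The opt-in rule can be expressed as a small scorecard check. This is a minimal sketch, assuming the six-month window is counted from the IDP launch date (the lesson does not say where the clock starts); `ServiceRecord` and `Finding` are hypothetical names, not part of any Backstage scorecard plugin.

```typescript
// Hypothetical record; field names are illustrative, not a Backstage schema.
interface ServiceRecord {
  name: string;
  onGoldenPath: boolean;
}

type Finding = 'compliant' | 'grace-period' | 'technical-debt';

const GRACE_PERIOD_MONTHS = 6; // opt-in window from the rollout rule

// Return the date on which the opt-in grace period ends (assumed to start at launch).
function graceDeadline(launch: Date): Date {
  const deadline = new Date(launch.getTime());
  deadline.setUTCMonth(deadline.getUTCMonth() + GRACE_PERIOD_MONTHS);
  return deadline;
}

// Classify a service for the scorecard at a given point in time.
function scorecardFinding(svc: ServiceRecord, launch: Date, now: Date): Finding {
  if (svc.onGoldenPath) return 'compliant';
  return now.getTime() < graceDeadline(launch).getTime()
    ? 'grace-period'
    : 'technical-debt';
}

const launch = new Date('2024-01-01T00:00:00Z');
const legacy: ServiceRecord = { name: 'ledger', onGoldenPath: false };
const early = scorecardFinding(legacy, launch, new Date('2024-03-01T00:00:00Z'));
const late = scorecardFinding(legacy, launch, new Date('2024-08-01T00:00:00Z'));
```

Keeping the rule in one function means the grace period can be changed in a single place if the rollout schedule slips.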

Step 3 — Design the Self-Service Surface

The self-service surface is what developers actually click or type. Acme ships three surfaces:

  1. Backstage portal — primary UI. Software Templates for scaffolding; a software catalog for discovery; TechDocs for documentation; Scorecards showing golden-path compliance per service.
  2. Platform CLI (acme) — wraps Backstage scaffolder API calls for engineers who live in the terminal. acme new service, acme provision db, acme open-dash <service>. The CLI is a thin wrapper; all logic lives in the platform API.
  3. GitOps self-service — for infrastructure primitives (databases, buckets, queues), the self-service action opens a PR to the platform GitOps repo. A human reviewer (or a policy bot) approves; Crossplane reconciles the resource. This keeps an audit trail in git — non-negotiable for PCI-DSS.
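To illustrate what "thin wrapper" means for the CLI, the sketch below maps a subcommand to the request body that Backstage's scaffolder task endpoint (`POST /api/scaffolder/v2/tasks`) accepts. The template refs and the `TEMPLATE_REFS` table are hypothetical; the real names would come from Acme's catalog.

```typescript
// Body accepted by the Backstage scaffolder task endpoint.
interface ScaffolderTaskRequest {
  templateRef: string;
  values: Record<string, unknown>;
}

// Hypothetical mapping from `acme` subcommands to template entity refs.
const TEMPLATE_REFS: Record<string, string> = {
  'new service': 'template:default/acme-api-service',
  'provision db': 'template:default/acme-provision-postgres',
};

// Translate a CLI invocation into a scaffolder request; no logic beyond the lookup.
function buildTaskRequest(
  command: string,
  values: Record<string, unknown>,
): ScaffolderTaskRequest {
  const templateRef = TEMPLATE_REFS[command];
  if (!templateRef) {
    throw new Error(`unknown command: acme ${command}`);
  }
  return { templateRef, values };
}

const request = buildTaskRequest('provision db', {
  serviceName: 'payments-api',
  environment: 'staging',
  storageSizeGi: 20,
});
```

The CLI would then send `request` as JSON with the user's Backstage token. Because validation and side effects stay in the template and its actions, the portal and the CLI cannot drift apart.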

The Backstage scaffolder action for on-demand PostgreSQL is the highest-traffic self-service action (run ~40 times per month at Acme). It calls a custom scaffolder backend action that creates a Crossplane PostgreSQLInstance claim, commits it to the GitOps repo, and opens a PR tagged with the requesting team's JIRA project key:

```typescript
// Backstage scaffolder action: acme:provision-postgres
// packages/backend/src/plugins/scaffolder/actions/provisionPostgres.ts (simplified)
import { createTemplateAction } from '@backstage/plugin-scaffolder-node';
import { Octokit } from '@octokit/rest';

export const createProvisionPostgresAction = () =>
  createTemplateAction<{
    serviceName: string;
    environment: 'staging' | 'production';
    storageSizeGi: number;
  }>({
    id: 'acme:provision-postgres',
    schema: {
      input: {
        type: 'object',
        required: ['serviceName', 'environment', 'storageSizeGi'],
        properties: {
          serviceName: { type: 'string' },
          environment: { type: 'string', enum: ['staging', 'production'] },
          storageSizeGi: { type: 'number', minimum: 10, maximum: 500 },
        },
      },
    },
    async handler(ctx) {
      const { serviceName, environment, storageSizeGi } = ctx.input;

      // Crossplane claim. The YAML is flush-left so its indentation survives the template literal.
      const claimYaml = `apiVersion: platform.acme.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: ${serviceName}-${environment}
  namespace: ${serviceName}
  labels:
    team: ${ctx.templateInfo?.entity?.spec?.owner ?? 'unknown'}
    environment: ${environment}
spec:
  parameters:
    storageSizeGi: ${storageSizeGi}
    version: "15"
    highAvailability: ${environment === 'production'}
  writeConnectionSecretToRef:
    name: ${serviceName}-postgres-creds
`;

      const octokit = new Octokit({ auth: ctx.secrets?.githubToken });
      const repo = { owner: 'acme-corp', repo: 'platform-gitops' };
      const branch = `provision-${serviceName}-postgres-${Date.now()}`;

      // The branch must exist before a file can be committed to it.
      // Assumes the default branch of the GitOps repo is `main`.
      const { data: base } = await octokit.rest.git.getRef({ ...repo, ref: 'heads/main' });
      await octokit.rest.git.createRef({
        ...repo,
        ref: `refs/heads/${branch}`,
        sha: base.object.sha,
      });

      // Commit the claim to the new branch.
      await octokit.rest.repos.createOrUpdateFileContents({
        ...repo,
        path: `claims/${environment}/${serviceName}-postgres.yaml`,
        message: `chore: provision postgres for ${serviceName} (${environment})`,
        content: Buffer.from(claimYaml).toString('base64'),
        branch,
      });

      // Open the PR against the platform GitOps repo.
      await octokit.rest.pulls.create({
        ...repo,
        title: `Provision postgres for ${serviceName} (${environment})`,
        head: branch,
        base: 'main',
      });

      ctx.logger.info(`Postgres claim PR opened for ${serviceName} in ${environment}`);
    },
  });
```
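The action interpolates `serviceName` directly into YAML, a file path, and a branch name, so it is worth validating the input before anything is rendered. The following is a sketch of such a guard, assuming the service name doubles as the Kubernetes namespace (as it does in the claim) and must therefore be an RFC 1123 DNS label; `validateClaimInput` is a hypothetical helper, not part of Backstage.

```typescript
type Environment = 'staging' | 'production';

interface ClaimInput {
  serviceName: string;
  environment: Environment;
  storageSizeGi: number;
}

// Kubernetes namespace names must be RFC 1123 DNS labels.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;

// Return a list of human-readable problems; an empty list means the input is safe to render.
function validateClaimInput(input: ClaimInput): string[] {
  const errors: string[] = [];
  if (input.serviceName.length > 63 || !DNS_LABEL.test(input.serviceName)) {
    errors.push('serviceName must be a DNS label (lowercase alphanumerics and "-", at most 63 characters)');
  }
  // Written as a negated range check so that NaN is rejected too.
  if (!(input.storageSizeGi >= 10 && input.storageSizeGi <= 500)) {
    errors.push('storageSizeGi must be between 10 and 500');
  }
  return errors;
}

const okErrors = validateClaimInput({
  serviceName: 'payments-api',
  environment: 'staging',
  storageSizeGi: 20,
});
const badErrors = validateClaimInput({
  serviceName: 'Payments_API',
  environment: 'production',
  storageSizeGi: 5,
});
```

A handler could call this first and throw if the list is non-empty, which keeps malformed names out of the GitOps repo and gives the developer an immediate error in the portal.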

Step 4 — The IDP Architecture Diagram

The diagram below shows how Acme's four platform planes interact at runtime: the developer plane, the control plane (Backstage + platform API), the delivery plane (GitOps), and the infrastructure plane (Crossplane + cloud APIs).

Figure: Acme Corp IDP, four-plane architecture (Developer, Control, Delivery, and Infrastructure).

  • Developer Plane: Backstage Portal, acme CLI, GitOps PRs, TechDocs, Scorecard.
  • Control Plane: Backstage (Scaffolder + Catalog), Platform API (Policy + Quota engine), Vault (Secrets + PKI), Kyverno (Admission policies).
  • Delivery Plane: GitHub Actions CI (Build, scan, push image), GitOps Repo (Desired state manifests), ArgoCD (Reconcile to cluster), ECR (Image registry).
  • Infrastructure Plane: Crossplane (DB / Bucket / Queue), EKS in us-east-1/eu-west-1 (Multi-region clusters), Terraform Modules (VPC, IAM, ACM), Datadog (APM + SLOs + Alerts).

Step 5 — Enforcing Guardrails with Kyverno

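Guardrails are the policies that make the golden path trustworthy. Acme ships four baseline Kyverno ClusterPolicies covering the PCI-DSS requirements from Step 1. Two of them are shown here; the image-tag policy exempts the kube-system, monitoring, and vault namespaces: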

```yaml
# policy: require-immutable-image-tag.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-immutable-image-tag
  annotations:
    policies.kyverno.io/category: Best Practices
    policies.kyverno.io/description: >
      Blocks :latest and other mutable tags; all images must use a
      sha-<commit> tag.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-image-tag
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaceSelector:
                matchExpressions:
                  - key: kubernetes.io/metadata.name
                    operator: NotIn
                    values: [kube-system, monitoring, vault]
      validate:
        message: "Image tag must start with sha- (sha-<commit>). :latest and mutable tags are forbidden."
        pattern:
          spec:
            containers:
              # Kyverno patterns support the wildcards * and ?, not regular expressions,
              # so this checks the sha- prefix only. Enforcing the 7 to 40 hex characters
              # would need a deny rule with a regex condition instead of a pattern.
              - image: "*:sha-*"
---
# policy: require-non-root.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-non-root
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-non-root
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Pods must set runAsNonRoot: true and every container must set allowPrivilegeEscalation: false."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
            containers:
              - securityContext:
                  allowPrivilegeEscalation: false
```
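Admission-time rejection is a slow feedback loop, so the same checks can be mirrored in a local pre-flight step (for example in the CLI or CI). The sketch below is an approximation of the two policies in TypeScript, not a replacement for Kyverno. `PodSpec`, `ContainerSpec`, and `preflight` are hypothetical names, and the regular expression follows the `sha-` plus 7 to 40 hex characters format stated in the policy message.

```typescript
// Minimal slice of a Pod spec: only the fields the two policies inspect.
interface ContainerSpec {
  name: string;
  image: string;
  securityContext?: { allowPrivilegeEscalation?: boolean };
}

interface PodSpec {
  securityContext?: { runAsNonRoot?: boolean };
  containers: ContainerSpec[];
}

// The image reference must end in a tag of the form sha-<7 to 40 hex characters>.
const IMMUTABLE_TAG = /:sha-[a-f0-9]{7,40}$/;

// Return one message per violation; an empty list means the pod should pass both policies.
function preflight(pod: PodSpec): string[] {
  const violations: string[] = [];
  if (pod.securityContext?.runAsNonRoot !== true) {
    violations.push('pod: securityContext.runAsNonRoot must be true');
  }
  for (const c of pod.containers) {
    if (!IMMUTABLE_TAG.test(c.image)) {
      violations.push(`${c.name}: image tag must match sha-<commit>`);
    }
    if (c.securityContext?.allowPrivilegeEscalation !== false) {
      violations.push(`${c.name}: allowPrivilegeEscalation must be false`);
    }
  }
  return violations;
}

const compliantPod: PodSpec = {
  securityContext: { runAsNonRoot: true },
  containers: [
    {
      name: 'api',
      image: 'registry.example.com/payments-api:sha-1a2b3c4',
      securityContext: { allowPrivilegeEscalation: false },
    },
  ],
};
const riskyPod: PodSpec = {
  containers: [{ name: 'api', image: 'nginx:latest' }],
};
```

Running `preflight(riskyPod)` reports three violations (missing `runAsNonRoot`, a mutable tag, and unset `allowPrivilegeEscalation`), which is the same set of problems the cluster would reject at admission.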

Step 6 — Measuring IDP Success

An IDP without metrics is a platform team acting on faith. Acme tracks four DORA-aligned platform KPIs and reviews them week over week:

  • Onboarding time to first deployment — target: < 2 hours from acme new service to first staging deployment. Measured by diffing scaffold timestamp vs first ArgoCD sync timestamp stored in the platform telemetry DB.
  • Self-service success rate — percentage of database/queue provisioning requests that complete without a platform team ticket. Target: > 90%. Measured by the Crossplane claim reconciliation events.
  • Golden path adoption — percentage of services with a Backstage scorecard score > 80/100. Target: > 75% of services. Surfaced on the engineering all-hands dashboard.
  • Platform p99 API latency — the Backstage backend and platform API must respond in < 200 ms p99. Breaching this SLO pages the platform team — an IDP that is slower than filing a ticket will not be used.
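The four KPIs above reduce to simple arithmetic once the raw events are available. The sketch below shows one way to compute them; the `Onboarding` and `ProvisionRequest` shapes are hypothetical, since the lesson does not define the telemetry schema, and the percentile uses the nearest-rank method as an assumption.

```typescript
// Hypothetical telemetry rows; the platform telemetry schema is not specified.
interface Onboarding {
  scaffoldedAt: Date;
  firstSyncAt: Date;
}
interface ProvisionRequest {
  reconciled: boolean;
  ticketOpened: boolean;
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Hours from scaffold to first ArgoCD sync.
function onboardingHours(o: Onboarding): number {
  return (o.firstSyncAt.getTime() - o.scaffoldedAt.getTime()) / MS_PER_HOUR;
}

// Percentage of requests that reconciled without a platform team ticket.
function selfServiceSuccessRate(reqs: ProvisionRequest[]): number {
  if (reqs.length === 0) return 0;
  const ok = reqs.filter(r => r.reconciled && !r.ticketOpened).length;
  return (ok * 100) / reqs.length;
}

// Percentage of services whose scorecard score is above 80.
function goldenPathAdoption(scores: number[]): number {
  if (scores.length === 0) return 0;
  return (scores.filter(s => s > 80).length * 100) / scores.length;
}

// Nearest-rank percentile, for example percentile(latenciesMs, 99).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const hours = onboardingHours({
  scaffoldedAt: new Date('2024-05-01T09:00:00Z'),
  firstSyncAt: new Date('2024-05-01T10:30:00Z'),
});
const adoption = goldenPathAdoption([90, 85, 70, 81]);
```

Each function maps to one target: `hours < 2`, `selfServiceSuccessRate(...) > 90`, `adoption > 75`, and `percentile(latenciesMs, 99) < 200`.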

Publish the platform metrics dashboard openly within your organisation. When product teams can see that 87% of services are on the golden path and that onboarding time dropped from 3 days to 90 minutes, the platform team earns organisational trust and budget for the next phase. Visibility is a product feature.

The most common IDP failure mode is over-abstracting too early. Teams that build a beautiful self-service portal before talking to developers frequently abstract away the wrong things and create a platform that nobody uses. Interview five representative engineering teams, identify their top three pain points, and ship a walking skeleton that solves exactly those three problems before adding anything else. Expand the surface incrementally.

Putting It All Together

Acme's IDP ships as a monorepo: platform-gitops/ holds the desired state (ArgoCD ApplicationSets, Crossplane Compositions, Kyverno policies, Vault policies); platform-backstage/ holds the portal; acme-cli/ holds the CLI. Each component is versioned independently, deployed via its own GitOps pipeline, and has an SLO. The platform team treats this stack as a product: it has a roadmap, a changelog, a deprecation policy, and an on-call rotation. That product mindset, rather than any particular tool choice, is what separates platforms that scale from platforms that become legacy monoliths.
