Platform Engineering & Developer Experience

Self-Service Infrastructure

In 2010 a developer who needed a database waited for a ticket. The DBA team provisioned it in two weeks. In 2025 the same developer opens their internal developer portal, fills in a form — engine, version, storage class, backup schedule — clicks Create, and has a running, policy-compliant database in four minutes. The infrastructure still gets provisioned by the same cloud APIs. The difference is who drives them, and whether guardrails prevent bad outcomes at the moment of action rather than in a post-incident review.

Self-service infrastructure is the operational heart of a mature internal developer platform (IDP). This lesson covers the three-layer model that makes it work: platform APIs that abstract complexity, a control-plane that reconciles desired state, and guardrails enforced before resources are ever created. Crossplane sits at the centre of the most influential open-source implementation of this pattern today.

Why Platform APIs, Not Raw Cloud APIs

Every major cloud exposes thousands of resource types across hundreds of API surfaces. A single RDS instance requires decisions about subnet groups, parameter groups, IAM roles, KMS keys, security group rules, backup windows, deletion protection, and multi-AZ topology. A developer should make none of those decisions — they are organisational decisions, made once, encoded in a platform API.

A platform API is an opinionated, organisation-scoped abstraction. Instead of aws_db_instance with some fifty attributes, a developer sees PostgresDatabase with five: name, size (small/medium/large), environment, backup policy, and owner team. The platform API maps those five inputs onto the dozens of underlying infrastructure settings and fills in everything else from your organisation's standards. This is the golden path from the previous lesson made machine-enforceable.

The contract a platform API must satisfy:

  • Idempotent: calling it twice with the same input produces the same result; safe to re-apply on drift detection.
  • Observable: callers can poll or watch the status of their request; the system emits events on state transitions.
  • Self-documenting: the schema is machine-readable (OpenAPI or a Kubernetes CRD) so the portal, the CLI, and the policy engine all derive from a single source of truth.
  • Auditable: every mutation carries actor identity, timestamp, and reason — written into an immutable audit log before the cloud API call is ever made.

Kubernetes CRDs as a universal platform API surface. Crossplane and many other IDP implementations model platform resources as Kubernetes custom resources, and portal tooling such as Backstage scaffolder templates can emit those resources directly. This gives you the entire Kubernetes API machinery for free: admission webhooks for validation, RBAC for authorisation, kubectl for CLI access, and the watch API for real-time status. Your developers already know how to read a YAML manifest; you do not need to build a new client SDK.
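
As a rough illustration of the "RBAC for authorisation" point, the manifest below sketches how a team could be granted access to one platform resource type in its own namespace. It reuses the platform.example.com group, the postgresdatabases plural, and the team-payments namespace that appear later in this lesson; the Role and RoleBinding names and the payments-team group are hypothetical and depend on your identity provider.

```yaml
# Hypothetical RBAC: the payments team may manage PostgresDatabase claims
# in its own namespace and nothing else on the platform API.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: postgresdatabase-editor          # illustrative name
  namespace: team-payments
rules:
  - apiGroups: ["platform.example.com"]
    resources: ["postgresdatabases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-postgresdatabase-editor # illustrative name
  namespace: team-payments
subjects:
  - kind: Group
    name: payments-team                  # assumed group name from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: postgresdatabase-editor
  apiGroup: rbac.authorization.k8s.io
```

Because the Role lists only the claim type, the team never needs (and never gets) permissions on the underlying provider resources.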

Crossplane: Control Plane as Platform Foundation

Crossplane extends Kubernetes into a universal control plane for any infrastructure. It adds three primitives on top of standard Kubernetes:

  • Provider: a controller pod that translates Crossplane resources into real cloud API calls. Providers exist for AWS, GCP, Azure, Vault, Helm, Terraform, and many other systems; some are maintained officially and others by the community. Each provider ships its own CRDs: one CRD per cloud resource type.
  • CompositeResourceDefinition (XRD): defines your custom platform API type — for example, PostgresDatabase — as a CRD schema. This is the resource type developers and portal forms target.
  • Composition: a mapping from your high-level PostgresDatabase claim to the underlying low-level provider resources (VPC, subnet group, IAM role, RDS instance, Route 53 record) with all organisational opinions baked in.

The reconciliation loop is pure Kubernetes: a developer applies a PostgresDatabase manifest; the Crossplane composite controller reads the Composition and materialises the constituent provider resources; each provider controller reconciles those resources against the real cloud API and writes status back. The developer watches the status field on their claim; the platform team never touches a ticket.

```shell
# Step 1: install Crossplane into your platform cluster.
# Composition revisions are enabled by default in recent Crossplane releases,
# so the old --enable-composition-revisions alpha flag is not passed here.
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace

# Step 2: install the AWS RDS provider (credentials via an existing IRSA role).
# The IRSA annotation is attached to the provider's ServiceAccount through a
# DeploymentRuntimeConfig (Crossplane 1.14+), which replaces the deprecated
# ControllerConfig.
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: upbound-provider-aws-rds
spec:
  package: xpkg.upbound.io/upbound/provider-aws-rds:v1.7.0
  runtimeConfigRef:
    name: irsa-provider-config
---
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: irsa-provider-config
spec:
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/CrossplaneRDSRole
EOF

# Step 3: define the platform API type (XRD).
# Note: the backup policy input mentioned earlier is not modelled in this
# minimal schema.
cat <<EOF | kubectl apply -f -
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresdatabases.platform.example.com
spec:
  group: platform.example.com
  names:
    kind: XPostgresDatabase
    plural: xpostgresdatabases
  claimNames:
    kind: PostgresDatabase
    plural: postgresdatabases
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                  enum: [small, medium, large]
                environment:
                  type: string
                  enum: [dev, staging, prod]
                ownerTeam:
                  type: string
              required: [size, environment, ownerTeam]
EOF
```

Once the XRD is registered, a developer can claim a database in any namespace they have permission to write to. They never see RDS parameter groups, subnet IDs, or KMS key ARNs — those live in the Composition, owned by the platform team:

```yaml
# Developer-facing claim (goes into the team's app namespace)
apiVersion: platform.example.com/v1alpha1
kind: PostgresDatabase
metadata:
  name: payments-db
  namespace: team-payments
  annotations:
    platform.example.com/owner: payments-team@example.com
    platform.example.com/cost-center: "CC-4421"
spec:
  size: medium
  environment: prod
  ownerTeam: payments
  # Secret that receives the connection details; this is the name shown in
  # the CONNECTION-SECRET column of the output below.
  writeConnectionSecretToRef:
    name: payments-db-creds
```

```shell
# Watch until ready (typically 3-5 minutes for RDS)
kubectl get postgresdatabase payments-db -n team-payments -w
# NAME          SYNCED   READY   CONNECTION-SECRET   AGE
# payments-db   True     True    payments-db-creds   4m32s
```
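
The SYNCED and READY columns are a summary of the claim's status conditions, which is what a portal or CLI would typically watch. The excerpt below is a sketch of what that status can look like once provisioning has finished; the values are illustrative and a real object carries additional fields such as timestamps.

```yaml
# Illustrative excerpt of:
#   kubectl get postgresdatabase payments-db -n team-payments -o yaml
status:
  conditions:
    - type: Synced            # the claim was reconciled without errors
      status: "True"
      reason: ReconcileSuccess
    - type: Ready             # the composed resources report ready
      status: "True"
      reason: Available
```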
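
Version your Compositions, not just your XRDs. Crossplane supports Composition revisions: every change to a Composition produces a new CompositionRevision. When you update an organisational policy (say, enforcing storage encryption with a new KMS key), publish the updated Composition and set compositionUpdatePolicy: Manual on the claims whose rollout you want to control. Claims with the Manual policy stay pinned to their current revision until you change their compositionRevisionRef, while new claims pick up the latest revision. With the default Automatic policy, existing claims move to the new revision as soon as it is published. Note that compositeDeletePolicy is a different setting: it only controls how deletion propagates from a claim to its composite resource. Manual pinning gives you a safe migration path without a flag day that breaks all existing infrastructure. Google's internal platform tooling reportedly follows a similar revision model: never mutate infrastructure under running workloads without a transition window.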
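
Pinning an existing claim to a revision can be sketched as follows. This assumes the claim opts into manual updates through compositionUpdatePolicy: Manual; the revision name is hypothetical and would in practice come from the output of kubectl get compositionrevisions.

```yaml
# Sketch: an existing claim pinned to a specific Composition revision.
apiVersion: platform.example.com/v1alpha1
kind: PostgresDatabase
metadata:
  name: payments-db
  namespace: team-payments
spec:
  size: medium
  environment: prod
  ownerTeam: payments
  compositionUpdatePolicy: Manual          # do not follow new revisions automatically
  compositionRevisionRef:
    name: xpostgresdatabase-aws-3f9c2ab    # hypothetical revision name
```

Migrating the claim later is then an explicit, reviewable change: update compositionRevisionRef.name to the newer revision in the team's GitOps repository.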

Abstraction Layers: The Control Plane Stack

Crossplane is rarely used alone. Production platforms layer multiple control planes and abstraction levels. The standard stack at big-tech scale looks like this, from highest to lowest abstraction:

  1. Developer Portal / GitOps manifest: The developer's entry point. A Backstage software template renders a form, commits a PostgresDatabase YAML to the team's GitOps repository, and a Flux or Argo CD application syncs it to the platform cluster.
  2. Platform API (Crossplane XRD / Claim): The organisational contract. Validates inputs, enforces naming conventions via an admission webhook, stamps default labels (cost-center, team, environment), and routes to the correct Composition based on environment.
  3. Composition: The organisational opinion. Maps the claim to a set of managed resources with all hardened defaults: encryption at rest, multi-AZ for prod, backup retention, deletion protection, security group allowing only the app's pod CIDR.
  4. Provider Managed Resources: The cloud API translation layer. One Crossplane provider CRD per cloud resource type (e.g., Instance, SubnetGroup, and ParameterGroup in the Upbound RDS provider's rds.aws.upbound.io group). The provider controller calls the AWS SDK and reconciles continuously.
  5. Cloud Control Plane (AWS/GCP/Azure): The actual infrastructure. The provider authenticates via IRSA/Workload Identity, making least-privilege API calls scoped to a single account or project per environment.

Figure: Control Plane Abstraction Stack. Five layers from top to bottom (Developer Portal, Platform API, Composition, Provider Managed Resources, Cloud Control Plane), connected by a continuous reconcile loop.
Each layer translates a higher-level intent into lower-level resource operations. Developers interact only with Layers 1 and 2; organisational policy lives in Layer 3; the cloud sees only least-privilege API calls at Layer 5.
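
To make the step from Layer 3 to Layer 4 concrete, here is a minimal sketch of a Composition for the XPostgresDatabase type. It assumes the classic patch-and-transform style (newer Crossplane releases favour function pipelines) and the Upbound RDS provider's Instance resource. The Composition name, region, engine version, storage size, subnet group, username, and instance classes are all illustrative assumptions, and a production Composition would compose several more resources.

```yaml
# Sketch only: one composed resource, hardened defaults in the base,
# a single developer input (size) patched in.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgresdatabase-aws              # hypothetical name
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1alpha1
    kind: XPostgresDatabase
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            region: eu-west-1              # assumed region
            engine: postgres
            engineVersion: "16"            # assumed approved version
            allocatedStorage: 50           # assumed size in GiB
            storageEncrypted: true         # organisational default
            deletionProtection: true       # organisational default
            backupRetentionPeriod: 7       # organisational default, in days
            dbSubnetGroupName: platform-private   # assumed pre-existing subnet group
            username: platform_admin       # assumed master username
            manageMasterUserPassword: true # let AWS manage the master password
      patches:
        # Translate the developer-facing size into an instance class.
        - type: FromCompositeFieldPath
          fromFieldPath: spec.size
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                small: db.t3.small
                medium: db.m6g.large
                large: db.m6g.2xlarge
        # Carry the owning team through as a cloud tag.
        - type: FromCompositeFieldPath
          fromFieldPath: spec.ownerTeam
          toFieldPath: spec.forProvider.tags.ownerTeam
```

Everything in base is an organisational decision; the only values a developer can influence are the ones that appear as fromFieldPath in a patch.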

Guardrails: Policy Before Provisioning

Self-service without guardrails is just unsupervised cloud spending. Guardrails must fire before a resource is ever created, not after a cost spike or a security scan surfaces a violation. Three layers of guardrail work together in a mature platform:

  • Schema validation (XRD openAPIV3Schema): Rejects malformed claims at admission. A developer cannot request size: xlarge if the enum only allows small/medium/large. This is a synchronous gate evaluated by the API server itself, with negligible added latency.
  • OPA/Kyverno admission policies: Business logic the XRD schema cannot express. Examples: prod resources require a cost-center annotation; no developer namespace may provision more than three databases; the RDS major version must be on the approved list. Kyverno policies are Kubernetes custom resources themselves, so they can be version-controlled alongside the platform code.
  • Composition-enforced defaults: Even if a developer submits a valid claim, the Composition applies immutable platform-wide settings they cannot override: deletionPolicy: Delete only for dev, Orphan for prod; encryption enforced regardless of what the claim says; backup retention minimum 7 days in production regardless of the backupPolicy field value.

```yaml
# Kyverno ClusterPolicy: require a cost-center annotation on platform claims
# created in namespaces labelled environment=prod
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-center-label
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-cost-center
      match:
        any:
          - resources:
              kinds:
                - PostgresDatabase
                - RedisCluster
                - KafkaTopic
              namespaceSelector:
                matchLabels:
                  environment: prod
      validate:
        message: "Production platform resources must carry a platform.example.com/cost-center annotation."
        pattern:
          metadata:
            annotations:
              platform.example.com/cost-center: "?*"
```
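
The third guardrail, Composition-enforced defaults, can be sketched as patches keyed on the environment field. The fragment below assumes a patch-and-transform Composition and an RDS managed resource with a multiAz field; treating staging like prod for deletion and like dev for multi-AZ is an assumption, since the text above only specifies dev and prod.

```yaml
# Fragment of one entry under spec.resources in a Composition.
# The developer chooses the environment; the platform decides what it implies.
patches:
  - type: FromCompositeFieldPath
    fromFieldPath: spec.environment
    toFieldPath: spec.deletionPolicy
    transforms:
      - type: map
        map:
          dev: Delete        # dev databases are removed with the claim
          staging: Orphan    # assumption: staging is treated like prod
          prod: Orphan       # prod databases survive claim deletion
  - type: FromCompositeFieldPath
    fromFieldPath: spec.environment
    toFieldPath: spec.forProvider.multiAz
    transforms:
      - type: map
        map:
          dev: "false"
          staging: "false"   # assumption: staging stays single-AZ
          prod: "true"
      - type: convert        # turn the mapped string into a boolean
        convert:
          toType: bool
```

Because neither deletionPolicy nor multiAz is exposed in the XRD schema, a claim has no field through which to override them.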

Composition drift is the silent killer of self-service platforms. Once developers trust the platform to apply correct defaults, they stop auditing individual resources. If a Composition change inadvertently removes a backup retention rule or relaxes a security group, hundreds of databases could be misconfigured before anyone notices. Treat every Composition change as a production change: require a PR review from the security team, render the Composition offline as a unit test (crossplane render, or crossplane beta render in older CLI releases), and publish the diff of resulting managed resources before merging. Spotify reportedly requires a rendered-resource diff as a mandatory review artifact on every Composition PR.

Production Failure Modes

Self-service infrastructure surfaces its own failure patterns that differ from manually provisioned infrastructure:

  • Provider throttling cascades: A Composition that creates ten managed resources at once can trigger ten concurrent cloud API calls. At scale (50 teams creating databases simultaneously during a fleet rotation) you will hit AWS API rate limits. Crossplane providers expose flags to control this; the Upbound AWS providers call them --max-reconcile-rate and --poll (the poll interval). Tune them: a max-reconcile-rate of 10 and a 10-minute poll interval is a sane starting point for RDS at 500-resource scale.
  • Orphaned resources from failed compositions: If a Composition creates resources A, B, and C and the creation of C fails permanently, resources A and B may remain: billed, unsecured, unmonitored. Implement a finalizer-based cleanup controller and alert on PostgresDatabase claims (and their XPostgresDatabase composite resources) stuck in Synced: False for more than 30 minutes.
  • IRSA role scope too broad: Many teams create one cross-account IAM role for the entire Crossplane provider. A compromised provider pod then has write access to all cloud resources in the account. Use a separate ProviderConfig per environment and scope each IAM role's policy to the actions of a single service (rds:* only, not *), tightening it further with resource ARNs and condition keys where possible.
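
The first and third mitigations can be sketched in configuration. The example assumes Crossplane 1.14 or later (for DeploymentRuntimeConfig) and the Upbound AWS provider family, whose poll-interval flag is spelled --poll. The object names and the role ARN are hypothetical, and because a Provider references a single runtime config, these arguments would in practice live in the same object that carries the IRSA annotation.

```yaml
# Throttle the provider: fewer reconciles per second, less frequent drift polling.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: rds-provider-runtime               # hypothetical name
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime        # the provider's container
              args:
                - --max-reconcile-rate=10
                - --poll=10m
---
# One ProviderConfig per environment, each assuming a narrowly scoped role.
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: prod
spec:
  credentials:
    source: IRSA
  assumeRoleChain:
    - roleARN: arn:aws:iam::123456789012:role/CrossplaneRDSProd   # hypothetical role
```

Managed resources select an environment's credentials through spec.providerConfigRef.name, so a Composition can route prod claims to the prod ProviderConfig and dev claims to a separate one.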

Self-service infrastructure done well is invisible to developers and auditable to everyone else. The developer gets a database in four minutes; the security team sees a complete audit trail; the finance team has a cost-center tag on every resource; the platform team spends zero time on tickets. That four-minute provisioning time — with full policy compliance baked in — is the concrete product metric that justifies building the platform in the first place.