This lesson is the capstone of the Multi-Cloud tutorial. You will design a single production workload — a stateless API service backed by a managed database and an object store — and produce concrete, deployable configurations for both Azure and GCP. The goal is not to write two separate architectures; it is to write one architecture that happens to have two cloud expressions, driven by a shared Terraform root module and a cloud-specific vars file. That is the discipline separating multi-cloud competency from multi-cloud chaos.
Workload Definition: The Target System
The workload is a REST API that handles user requests, reads from and writes to a managed relational database, stores uploaded files in object storage, and emits structured logs. It is containerised and runs on Kubernetes. The portability contract is:
Compute: Kubernetes (AKS on Azure, GKE on GCP) — same Helm chart, same Deployment and Service manifests.
Database: Managed PostgreSQL (Azure Database for PostgreSQL Flexible Server / Cloud SQL for PostgreSQL) — same schema, same connection string format.
Secrets: Azure Key Vault / GCP Secret Manager — injected at pod startup via the Secrets Store CSI Driver.
Networking: Private endpoints for the database; no public IP on the DB in either cloud.
The portability boundary is at the infrastructure layer, not the application layer. The API container image is identical — built once in CI and pushed to a shared registry. Cloud-specific details (connection strings, bucket names, secret paths) are injected as environment variables. The application code never contains an if cloud == azure branch.
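One way to keep the application free of cloud branches is to read every cloud-specific value from the environment at startup and fail fast if one is missing. The sketch below assumes three hypothetical variable names (DATABASE_URL, BUCKET_NAME, SECRETS_DIR); the real names are whatever the Helm chart injects.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Cloud-specific values, all supplied from outside the container image."""

    database_url: str
    bucket_name: str
    secrets_dir: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Variable names are illustrative assumptions, not part of the original lesson.
        required = ("DATABASE_URL", "BUCKET_NAME", "SECRETS_DIR")
        missing = [name for name in required if not env.get(name)]
        if missing:
            # Fail at startup rather than on the first request that needs the value.
            raise RuntimeError("missing environment variables: " + ", ".join(missing))
        return cls(
            database_url=env["DATABASE_URL"],
            bucket_name=env["BUCKET_NAME"],
            secrets_dir=env["SECRETS_DIR"],
        )
```

Nothing in this class names a provider: the same code path runs on AKS and GKE, and only the injected values differ.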
Architecture Diagrams: Azure and GCP Side-by-Side
Azure target architecture: AKS cluster in a dedicated node subnet, all data services on private endpoints, ingress via Application Gateway with WAF.
GCP target architecture: GKE cluster in a dedicated node subnet, Cloud SQL accessed via Private Service Access, Secret Manager reached over Private Google Access, ingress via GCP L7 Load Balancer with Cloud Armor WAF.
Shared Terraform Root Module
The IaC lives in a single Git repository. The folder structure separates the cloud-agnostic Kubernetes manifests (Helm chart) from the cloud-specific Terraform configurations. A backend.tf per environment points to Azure Blob or GCS for state. The variables.tf defines the contract; the azure.tfvars and gcp.tfvars supply cloud-specific values.
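The contract only holds if both vars files assign exactly the variables that variables.tf declares. A rough pre-flight check can be sketched as follows; it assumes flat key = value vars files (no nested maps), and apart from secret_provider every variable name and value shown is hypothetical.

```python
import re

# Matches a top-level "name =" assignment in a flat .tfvars file.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_-]*)\s*=")


def tfvars_keys(text: str) -> set[str]:
    """Return the variable names assigned in a flat .tfvars file."""
    keys = set()
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "//")):
            continue  # skip blank lines and comments
        match = ASSIGNMENT.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def contract_violations(contract: set[str], files: dict[str, str]) -> dict:
    """Report, per file, variables that are missing from or extra to the contract."""
    report = {}
    for name, text in files.items():
        keys = tfvars_keys(text)
        missing = sorted(contract - keys)
        extra = sorted(keys - contract)
        if missing or extra:
            report[name] = {"missing": missing, "extra": extra}
    return report


# Hypothetical contract and vars files, for illustration only.
CONTRACT = {"cloud", "region", "cluster_name", "db_sku", "secret_provider"}
AZURE_TFVARS = """
cloud           = "azure"
region          = "westeurope"
cluster_name    = "api-prod"
db_sku          = "GP_Standard_D2s_v3"
secret_provider = "azure"
"""
GCP_TFVARS = """
cloud           = "gcp"
region          = "europe-west1"
cluster_name    = "api-prod"
secret_provider = "gcp"
"""

if __name__ == "__main__":
    files = {"azure.tfvars": AZURE_TFVARS, "gcp.tfvars": GCP_TFVARS}
    print(contract_violations(CONTRACT, files))
```

Terraform itself reports an unset required variable at plan time. A check like this is only a convenience for catching drift between the two vars files earlier, for example in a pull-request job.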
The Helm chart in modules/k8s-api/ renders identically on both clusters. The only cloud-specific piece is the SecretProviderClass manifest, which is templated on the secret_provider variable passed from Terraform. The Secrets Store CSI Driver (installed on both AKS and GKE) reads from Key Vault or Secret Manager and mounts the credentials into the pod as files; with the driver's optional sync to a Kubernetes Secret, they can also be exposed as environment variables. Either way, the pods themselves never know which cloud they are on.
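The templating idea can be illustrated outside Helm with a small function that builds the SecretProviderClass for either provider. This is a sketch, not the chart's actual template: the resource name, vault name, tenant ID, project, and secret name are placeholders, and the parameter keys follow the published examples of the Azure and GCP providers as I understand them, so check them against the provider versions you deploy.

```python
import json


def secret_provider_class(secret_provider: str, name: str, parameters: dict) -> dict:
    """Build a SecretProviderClass manifest; only spec differs between clouds."""
    if secret_provider not in ("azure", "gcp"):
        raise ValueError(f"unsupported secret_provider: {secret_provider!r}")
    return {
        "apiVersion": "secrets-store.csi.x-k8s.io/v1",
        "kind": "SecretProviderClass",
        "metadata": {"name": name},
        "spec": {"provider": secret_provider, "parameters": parameters},
    }


# Placeholder values throughout; none of these come from the original lesson.
AZURE_PARAMS = {
    "keyvaultName": "example-vault",
    "tenantId": "00000000-0000-0000-0000-000000000000",
    "objects": "array:\n  - |\n    objectName: db-password\n    objectType: secret\n",
}
GCP_PARAMS = {
    "secrets": (
        "- resourceName: projects/example-project/secrets/db-password/versions/latest\n"
        "  path: db-password\n"
    ),
}

azure_spc = secret_provider_class("azure", "api-secrets", AZURE_PARAMS)
gcp_spc = secret_provider_class("gcp", "api-secrets", GCP_PARAMS)

if __name__ == "__main__":
    # JSON is accepted wherever Kubernetes accepts YAML manifests.
    print(json.dumps(azure_spc, indent=2))
    print(json.dumps(gcp_spc, indent=2))
```

Because both manifests share the same metadata.name, the Deployment's CSI volume can reference one SecretProviderClass name on both clusters.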
Version-pin the CSI driver on both clusters. The Secrets Store CSI Driver Helm chart version must match between Azure and GCP deployments — use the same chart version in both helm_release Terraform resources. A version skew between clusters is a common source of subtle secret-mount failures that only surface under certain Kubernetes versions, and are notoriously hard to debug.
CI/CD: One Pipeline, Two Targets
The GitHub Actions pipeline builds the image once, pushes it to a shared registry (Azure Container Registry or Artifact Registry, depending on which cloud is primary), and then runs terraform apply sequentially: first for prod-azure, then for prod-gcp. Canary traffic splitting is configured at the ingress level on each cloud independently, and the mechanism differs per provider: on Azure, weighted backends (supported by Application Gateway for Containers); on GCP, weighted backend references through the GKE Gateway API or Traffic Director. Confirm that the ingress controller you deploy actually supports weights before relying on it for canaries.
Database migrations are the hardest portability problem. Running alembic upgrade head or rails db:migrate against two separate PostgreSQL instances (one on Azure, one on GCP) requires either a shared migration job with both connection strings, or separate migration steps per cloud in the pipeline — with logic to abort the second cloud if the first migration fails. Never apply migrations to both clouds simultaneously if they are not in sync. The safest pattern is: migrate Azure, validate, migrate GCP; use a feature flag to gate new code paths until both databases are migrated.
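The "migrate Azure, validate, migrate GCP" ordering with abort-on-failure can be sketched as a small orchestration function. The Target type and its callables are hypothetical; in a real pipeline, migrate would wrap the migration command for one database and validate would run a schema or smoke check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Target:
    """One database to migrate; both callables return True on success."""

    name: str
    migrate: Callable[[], bool]
    validate: Callable[[], bool]


def run_migrations(targets: list[Target]) -> tuple[list[str], bool]:
    """Migrate targets strictly in order and stop at the first failure.

    Returns the names that were migrated and validated, plus a boolean saying
    whether the feature flag gating the new code paths may be enabled.
    """
    done: list[str] = []
    for target in targets:
        # A failed migration or validation aborts everything after it,
        # so a later cloud is never migrated past a broken earlier one.
        if not target.migrate() or not target.validate():
            return done, False
        done.append(target.name)
    # The flag is only safe once every database is on the new schema.
    return done, bool(done)
```

A caller would pass the Azure target first and the GCP target second, and enable the feature flag only when the second return value is True.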
Production Failure Modes and Lessons Learned
Running this architecture at scale surfaces three recurring failure patterns.
First, network asymmetry. On Azure, the private endpoint model resolves the database FQDN through a private DNS zone that must be linked to the VNet. If the link is missing, pods resolve the FQDN to the public IP, bypass the private path entirely, and are blocked by the firewall. GCP PSA has no equivalent DNS step, because Cloud SQL is reached on a private IP from a range allocated in the VPC, but the google_service_networking_connection resource must be applied before the Cloud SQL instance is created, and therefore before any GKE workloads start.
Second, secret rotation drift. When a database password is rotated in Key Vault, the mounted value on AKS refreshes only if rotation is enabled on the Secrets Store CSI Driver (the enableSecretRotation and rotationPollInterval settings in the driver's Helm chart). On GCP, rotating means adding a new Secret Manager version, and a SecretProviderClass that pins a version number needs an explicit version bump. Uncoordinated rotation leaves one cloud with a stale credential.
Third, cost model divergence. Azure charges for Private Endpoints per hour, plus data processed; GCP charges for Cloud NAT per gateway-hour, plus per GB processed. The same workload can cost meaningfully different amounts on each cloud, so track per-cloud cost in your observability stack from day one and catch anomalies before the monthly bill.
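The first failure pattern can be caught early with a check, run from inside a pod, that the database hostname resolves only to private addresses. This is a minimal sketch using the standard library; the port default is an assumption for PostgreSQL, and the hostname comes from whatever your deployment injects.

```python
import ipaddress
import socket


def resolved_addresses(hostname: str, port: int = 5432) -> set[str]:
    """Resolve a hostname the same way a client connection would."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def public_addresses(addresses) -> list[str]:
    """Return the addresses that are NOT in a private or otherwise reserved range."""
    return sorted(a for a in addresses if not ipaddress.ip_address(a).is_private)


def assert_private(hostname: str) -> None:
    """Raise if any resolved address would leave the private network path."""
    leaked = public_addresses(resolved_addresses(hostname))
    if leaked:
        raise RuntimeError(f"{hostname} resolves to public address(es): {leaked}")
```

Running assert_private against the database FQDN as an init step or readiness probe turns a missing private DNS zone link into an explicit error instead of a connection timeout. Note that is_private also treats loopback and other reserved ranges as private, which is acceptable for this purpose.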
Multi-cloud portability is not free — it has a complexity tax. The shared Terraform module pattern, the CSI secret templating, and the dual-pipeline runs all add overhead. The payoff is leverage in contract negotiations, disaster recovery across providers, and the ability to place workloads closer to customers in regions one provider does not cover. Decide consciously whether that payoff justifies the cost for your organisation before adopting this pattern at scale.