Multi-Cloud: Azure & GCP

Multi-Cloud Strategy & Abstractions

Kubernetes runs identically on GCP and Azure. Terraform speaks to both clouds with the same HCL syntax. So why do multi-cloud architectures fail in production so regularly? Because engineers abstract at the wrong layer, at the wrong time, for the wrong reasons. This lesson draws a precise line between abstraction that pays for itself and abstraction that becomes a maintenance burden nobody budgets for. Getting this judgement right is the difference between a portable, evolvable platform and a custom cloud brokerage nobody can maintain.

The Abstraction Spectrum

Before deciding what to abstract, you need a mental model of the available levers. Every abstraction sits somewhere on a spectrum between thin shim (minimal, cheap, easy to discard) and thick platform (opinionated, expensive, hard to undo). The goal is never maximum abstraction — it is the minimum abstraction that delivers the required operational benefit.

Level 0: No abstraction. Engineers use the AWS CLI, gcloud CLI, and Azure CLI directly. Scripts are provider-specific. This is acceptable when a single team owns a single provider. It breaks down when you need to audit, rotate, or replicate infrastructure across providers.
Level 1: Declarative IaC with provider modules. Terraform or Pulumi with separate per-provider modules. The language and workflow are shared (HCL for Terraform, a general-purpose language for Pulumi); the resource definitions are not. You get consistent state management, drift detection, and a plan/apply (or preview/up) workflow across all clouds without faking portability at the resource level. This is the sweet spot for most organizations.
Level 2: Kubernetes as compute abstraction. Workloads are packaged as Helm charts or Kustomize overlays. Cluster provisioning is provider-specific (EKS module vs. GKE module); workload definitions inside the cluster are largely portable. You accept that cluster management differs per provider while workload definitions mostly do not.
Level 3: Internal Developer Platform (IDP) abstraction. A platform API (Backstage, Port, or custom) exposes a simplified interface: "give me a service with 3 replicas, a Postgres database, and a public endpoint." The platform translates that into provider-specific resources. This is the Netflix/Spotify model. It is enormously powerful and enormously expensive to build and maintain correctly.
Level 4: Full portability runtime (anti-pattern for most). A fully homegrown cloud broker, or Crossplane compositions used as one, that tries to make AWS S3 and GCS interchangeable through a single API. Unless you are a cloud vendor yourself, this level erases the value of cloud-native managed services faster than it removes operational burden.

A rule of thumb at large engineering organizations: abstract the deployment workflow and the observability pipeline; do not abstract the cloud resource API. Platform teams that operate across providers typically maintain separate provider-specific Terraform modules rather than hiding providers behind a common interface, because the resource semantics are genuinely different and pretending otherwise introduces invisible bugs.

When to Abstract: Kubernetes

Kubernetes is the most successful compute abstraction in the industry. The case for it in a multi-cloud context is strong precisely because it targets the right layer — compute scheduling and workload lifecycle — and does not pretend to abstract storage, networking, or managed databases.

Abstract with Kubernetes when:

Your workloads are stateless or semi-stateless services. Pods, Deployments, HorizontalPodAutoscalers, and Services are genuinely portable. A Helm chart that deploys your API on EKS will deploy on GKE with a kubeconfig context switch and a provider-specific values file.
You need to avoid provider lock-in at the compute scheduling layer specifically. For the workloads themselves, moving from EKS to GKE costs a Terraform module swap, a kubeconfig update, and a new values overlay, not a rewrite of your application.
You already operate Kubernetes for other reasons (internal complexity, team expertise, tooling ecosystem) and multi-cloud is an additional requirement, not the primary one.

Do not expect Kubernetes to abstract:

Ingress and load balancing. AWS Load Balancer Controller annotations (alb.ingress.kubernetes.io/...) mean nothing to GKE's managed Ingress controller, and vice versa. You need cloud-specific ingress configurations even when both clusters run the same Helm chart.
Storage classes. gp3 on AWS, pd-ssd on GCP, and Premium_LRS on Azure are different disk types with different performance, pricing, and replication models, so the StorageClass objects that reference them are provider-specific.
IAM / service accounts. EKS IRSA, GKE Workload Identity, and AKS Workload Identity (the successor to the deprecated AAD Pod Identity) are all different mechanisms for the same goal (pod-level cloud credentials). Your Helm chart must be parameterized for these differences or you will hard-code one provider's annotation pattern and break the others.

# ── Helm values pattern for multi-cloud Kubernetes portability ─────────────────
# values-aws.yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/my-app-role"
ingress:
  className: "alb"
  annotations:
    alb.ingress.kubernetes.io/scheme: "internet-facing"
    alb.ingress.kubernetes.io/target-type: "ip"
storage:
  className: "gp3"

---
# values-gcp.yaml
serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account: "my-app@my-project.iam.gserviceaccount.com"
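ingress:
  # GKE's Ingress controller is selected by annotation and does not honor
  # spec.ingressClassName; an Ingress that sets both is rejected on create.
  # Assumes the chart maps className to spec.ingressClassName and omits it when empty.
  className: ""
  annotations:
    kubernetes.io/ingress.class: "gce"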
storage:
  className: "premium-rwo"

# Deploy command: same chart, provider-specific values overlay
# (swap values-aws.yaml for values-gcp.yaml when targeting GKE):
# helm upgrade --install my-app ./charts/my-app \
#   -f values-base.yaml \
#   -f values-aws.yaml \
#   --namespace production

When to Abstract: Terraform

Terraform is the right abstraction for infrastructure provisioning across clouds — but it is frequently misused. The correct pattern is per-provider modules, shared workflow. The wrong pattern is a single module that tries to provision AWS and GCP resources behind conditional logic.

Abstract with Terraform when:

You need consistent drift detection, state management, and plan/apply review across both clouds.
You want a single CI pipeline that provisions infrastructure on any provider using the same toolchain — terraform plan, terraform apply, terraform destroy.
You need to enforce organizational policies (tagging, naming conventions, allowed regions) consistently regardless of provider. This is best done with Sentinel (Terraform Enterprise) or OPA/Conftest in your CI pipeline.
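The tagging policy in the last point can be sketched as one canonical tag set rendered per provider before it reaches the IaC layer. The required keys, function names, and sample values below are hypothetical. The GCP rule applied here (lowercase letters, digits, underscores, and hyphens, at most 63 characters) is a conservative subset of what GCP labels accept, and AWS and Azure tags are assumed to pass through unchanged.

```python
import re

# Canonical taxonomy (hypothetical): every resource must carry these keys.
REQUIRED_KEYS = ("cost_center", "environment", "owner")

# Anything outside this set is replaced when rendering GCP label values.
_GCP_INVALID = re.compile(r"[^a-z0-9_-]")


def to_gcp_label_value(value: str) -> str:
    """Lowercase the value, replace unsupported characters, cap at 63 chars."""
    return _GCP_INVALID.sub("-", value.lower())[:63]


def render_tags(tags: dict, provider: str) -> dict:
    """Validate the canonical tag set and render it for one provider."""
    missing = [key for key in REQUIRED_KEYS if not tags.get(key)]
    if missing:
        raise ValueError(f"missing required tags: {', '.join(missing)}")
    if provider == "gcp":
        return {key: to_gcp_label_value(value) for key, value in tags.items()}
    if provider in ("aws", "azure"):
        return dict(tags)  # passed through unchanged in this sketch
    raise ValueError(f"unknown provider: {provider}")


canonical = {"cost_center": "CC-1042", "environment": "prod", "owner": "Platform Team"}
print(render_tags(canonical, "aws"))
# {'cost_center': 'CC-1042', 'environment': 'prod', 'owner': 'Platform Team'}
print(render_tags(canonical, "gcp"))
# {'cost_center': 'cc-1042', 'environment': 'prod', 'owner': 'platform-team'}
```

The point of the design is that the taxonomy is defined once while the rendering stays provider-native, which matches the per-provider module approach rather than hiding the providers behind one resource API.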

Do not try to abstract away provider differences in HCL:

# ── WRONG: conditional provider logic that hides resource differences ───────────
# This pattern looks clever and becomes unmaintainable.

# main.tf (anti-pattern)
variable "cloud_provider" {
  type = string  # "aws" or "gcp"
}

resource "aws_db_instance" "db" {
  count = var.cloud_provider == "aws" ? 1 : 0
  # ... AWS-specific config
}

resource "google_sql_database_instance" "db" {
  count = var.cloud_provider == "gcp" ? 1 : 0
  # ... GCP-specific config — DIFFERENT fields, different semantics
}

# ── RIGHT: separate modules per provider, shared calling convention ────────────
# modules/aws-postgres/main.tf  — contains aws_db_instance + security groups + KMS
# modules/gcp-postgres/main.tf  — contains google_sql_database_instance + VPC peering

# environments/prod-aws/main.tf
module "database" {
  source                  = "../../modules/aws-postgres"
  instance_class          = "db.r7g.xlarge"
  multi_az                = true
  backup_retention_period = 7
}

# environments/prod-gcp/main.tf
module "database" {
  source            = "../../modules/gcp-postgres"
  # Cloud SQL for PostgreSQL takes db-custom-<vCPUs>-<memory MB> tiers;
  # the db-n1-* names are legacy MySQL tiers.
  tier              = "db-custom-4-26624"
  availability_type = "REGIONAL"
  backup_enabled    = true
}

This structure gives you the benefits of Terraform (state, plan, review, policy enforcement) without pretending that an RDS instance and a Cloud SQL instance are the same thing. The modules share naming conventions and output interfaces, but the implementation is provider-native.

Use Terragrunt or Terraform workspaces for multi-environment, multi-cloud state isolation. With Terragrunt, each environment+provider combination gets its own state backend (S3 bucket in AWS, GCS bucket in GCP), and a root.hcl generates the backend config automatically from the directory path. Workspaces, by contrast, keep separate state files inside one shared backend, so they isolate environments within a provider but not the providers from each other. Either way, the aim is to prevent the two most common multi-cloud Terraform accidents: state file collision (two applies writing to the same state) and cross-environment resource deletion (a destroy in staging accidentally referencing production state).
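The invariant behind path-derived backends (one state location per environment+provider directory) can be illustrated with a small check that a CI job might run. This is a sketch of the idea, not of Terragrunt itself: the bucket names, the `<environment>-<provider>` directory convention, and both function names are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical bucket names: one state bucket per provider.
STATE_BUCKETS = {"aws": "s3://example-tfstate", "gcp": "gs://example-tfstate"}


def state_location(env_dir: str) -> str:
    """Map a directory such as 'environments/prod-aws' to a state URL."""
    name = PurePosixPath(env_dir).name  # e.g. 'prod-aws'
    environment, _, provider = name.rpartition("-")
    if not environment or provider not in STATE_BUCKETS:
        raise ValueError(f"expected '<environment>-<provider>', got {name!r}")
    return f"{STATE_BUCKETS[provider]}/{environment}/terraform.tfstate"


def find_collisions(env_dirs: list[str]) -> dict:
    """Return every state URL that more than one directory would write to."""
    owners: dict = {}
    for env_dir in env_dirs:
        owners.setdefault(state_location(env_dir), []).append(env_dir)
    return {url: dirs for url, dirs in owners.items() if len(dirs) > 1}


print(state_location("environments/prod-aws"))
# s3://example-tfstate/prod/terraform.tfstate
print(find_collisions(["environments/prod-aws", "environments/prod-gcp"]))
# {}
print(find_collisions(["environments/prod-aws", "legacy/prod-aws"]))
# {'s3://example-tfstate/prod/terraform.tfstate': ['environments/prod-aws', 'legacy/prod-aws']}
```

A check like this fails fast when a copied directory would silently share state with an existing one, which is the collision case described above.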

When NOT to Abstract

This is the section most multi-cloud guides skip. The decision to not abstract is as important as the decision to abstract. Here are the cases where abstraction costs more than it saves:

Managed data services. BigQuery, Redshift, Cloud Spanner, Aurora, Cosmos DB — do not abstract these. Their query engines, pricing models, concurrency behavior, and replication semantics are fundamentally different. An abstraction layer that treats them as interchangeable gives you the worst of all worlds: you cannot use BigQuery's slot-based pricing, partitioning, or BI Engine; you cannot use Aurora's Serverless v2 auto-scaling; and your "portable" query will perform badly on all of them because it avoids provider-specific optimization hints.
Identity and access mechanisms. Do not build a custom IAM abstraction. AWS IAM policies, GCP IAM bindings, and Azure RBAC have different expression models. Your security team needs to audit the native policies, not a translation layer. Use a federated IdP (Okta, Entra ID) as the single source of identity, and maintain provider-native role bindings managed by Terraform or a dedicated IaC pipeline.
Networking primitives. VPCs, subnets, security groups, and routing tables are provider-specific and intentionally so. An abstraction that tries to make AWS Security Groups and GCP Firewall Rules look the same will miss critical semantic differences. Both are stateful, but AWS security groups are allow-only rule sets attached to ENIs, while GCP VPC firewall rules are defined on the VPC network, support both allow and deny with explicit priorities, and select instances by network tag or service account. A policy that looks correct in the abstraction could be a security hole in the implementation.
Observability backends. Do not try to abstract CloudWatch, Cloud Monitoring, and Azure Monitor into a homogeneous API. Instead, run a single external observability platform (Datadog, Grafana Cloud, or Prometheus remote-write) that consumes from each provider's native telemetry. The providers emit; you aggregate externally. This is cheaper, more reliable, and does not require you to maintain a multi-cloud metrics adapter.

The internal cloud broker anti-pattern: several large organizations have built internal platforms that expose a unified "cloud resource API" to hide provider differences, effectively building a private cloud brokerage. In practice, these projects tend to run years behind provider feature releases, leave security improvements unshipped because the abstraction layer does not yet support the new IAM feature, and end up forcing engineering teams to file tickets to get access to features the cloud provider shipped six months ago. If you inherit one of these, plan a careful sunset, not an extension.

Abstraction decision map: the green column uses shared tooling; the red column stays provider-native to preserve semantics, security, and performance.

The Multi-Cloud Strategy Decision Framework

At senior level, you will be asked to define multi-cloud strategy, not just implement it. The following framework mirrors what platform engineering teams at large-scale companies actually use to make these decisions:

Classify each workload by portability value. For each service, ask: "If we had to move this to a different provider in 90 days, what would that cost?" Workloads with near-zero portability value (ones that deeply use provider-specific APIs) should be explicitly flagged as single-cloud committed. This is not a failure — it is a conscious, documented trade-off.
Separate the control plane from the data plane. The control plane (CI/CD, secrets rotation, cost reporting, alerting) should be cloud-agnostic or self-hosted. The data plane (where your users' traffic flows) can and should use provider-native services for performance and cost. Never build your alert pipeline on a cloud-native service that has no multi-region failover SLA.
Define your abstraction contracts as interfaces, not implementations. Document what outputs a "compute module" must produce (an endpoint, a scaling policy, a service account) without specifying how. The AWS implementation satisfies the contract with EKS + IRSA; the GCP implementation satisfies it with GKE + Workload Identity. Tests validate the contract, not the implementation.
Treat provider-specific features as opt-in, not opt-out. The default for a new service is: run on Kubernetes with a Helm chart, use Vault for secrets, emit OpenTelemetry traces. Teams can opt in to provider-specific services (BigQuery, Pub/Sub, Azure Service Bus) when the business case is clear and documented. Opt-in prevents accidental lock-in from individual engineers making convenient choices.
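The third point (contracts as interfaces, with tests that validate the contract) can be sketched as a check over the JSON printed by `terraform output -json`, where each output is an object with a `value` key. The output names follow the example in the text (endpoint, scaling policy, service account); the expected types, the function name, and the sample values are hypothetical.

```python
import json

# Outputs every "compute module" must expose, whatever the provider.
# The expected types are an assumption made for this sketch.
COMPUTE_CONTRACT = {
    "endpoint": str,
    "scaling_policy": dict,
    "service_account": str,
}


def contract_violations(outputs_json: str, contract: dict) -> list[str]:
    """Return human-readable violations for one module's outputs."""
    outputs = json.loads(outputs_json)
    violations = []
    for name, expected_type in contract.items():
        if name not in outputs:
            violations.append(f"missing output: {name}")
            continue
        value = outputs[name].get("value")
        if not isinstance(value, expected_type):
            violations.append(
                f"output {name} should be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return violations


# Hypothetical outputs from the AWS and GCP implementations of the contract.
aws_outputs = json.dumps({
    "endpoint": {"value": "https://api.aws.example.com"},
    "scaling_policy": {"value": {"min": 3, "max": 10}},
    "service_account": {"value": "arn:aws:iam::123456789012:role/my-app-role"},
})
gcp_outputs = json.dumps({
    "endpoint": {"value": "https://api.gcp.example.com"},
    "service_account": {"value": "my-app@my-project.iam.gserviceaccount.com"},
})

print(contract_violations(aws_outputs, COMPUTE_CONTRACT))  # []
print(contract_violations(gcp_outputs, COMPUTE_CONTRACT))  # ['missing output: scaling_policy']
```

The check never inspects how an output was produced, so the same test runs unchanged against the EKS-based and GKE-based modules.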

# ── OPA policy: enforce multi-cloud abstraction standards in Terraform plans ────
# policy/multicloud_standards.rego

package terraform.multicloud

import future.keywords.if
import future.keywords.contains

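# Deny any S3 bucket that is created or updated without a cost_center tag
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not is_delete_only(resource)
    not resource.change.after.tags.cost_center
    msg := sprintf("S3 bucket %v missing required cost_center tag", [resource.address])
}

# Deny creation of static IAM access keys (must use IRSA or Workload Identity).
# Deleting an existing key stays allowed so teams can migrate off it.
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_iam_access_key"
    resource.change.actions[_] == "create"
    msg := sprintf("Direct IAM access key %v is forbidden; use IRSA or Vault dynamic credentials", [resource.address])
}

# Require all GKE node pools to use Workload Identity.
# In plan JSON, nested blocks are lists, and workload_metadata_config sits inside node_config.
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "google_container_node_pool"
    not is_delete_only(resource)
    not uses_gke_metadata(resource)
    msg := sprintf("GKE node pool %v must enable Workload Identity (node_config.workload_metadata_config.mode = GKE_METADATA)", [resource.address])
}

# True when the plan only destroys the resource, so there is no "after" state to check
is_delete_only(resource) if resource.change.actions == ["delete"]

uses_gke_metadata(resource) if {
    resource.change.after.node_config[_].workload_metadata_config[_].mode == "GKE_METADATA"
}

# Run in CI (conftest evaluates only the "main" package unless told otherwise):
#   terraform show -json tfplan > plan.json
#   conftest test --policy policy/ --namespace terraform.multicloud plan.json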

Production Failure Modes in Multi-Cloud Abstractions

Understanding why abstraction layers fail in production is as important as knowing how to build them. These are the recurring patterns at organizations with 2+ years of multi-cloud operation:

Abstraction drift. The AWS and GCP versions of your Terraform module diverge over six months because the AWS team adds a feature and nobody updates the GCP module. Now you have two different systems being called by the same name. Mitigation: contract tests and scheduled drift checks in CI.
Ordering and consistency gaps in cross-cloud workflows. A workflow that writes to AWS SQS and reads from GCP Pub/Sub has no shared transaction boundary. Messages can arrive out of order or more than once, and error handling in one cloud does not automatically roll back state in the other. This is the distributed systems problem compounded by provider differences. Mitigation: idempotent consumers, explicit sequence numbers, and cross-cloud dead-letter queues with unified alerting.
Secret rotation lag. A secret stored in AWS Secrets Manager and replicated manually to GCP Secret Manager gets rotated on AWS but not on GCP. The GCP workload runs on stale credentials until it fails. Mitigation: Vault as the single source of truth with dynamic secrets; never store secrets in provider-native managers directly.
Cost attribution collapse. When a single user request fans out across AWS, GCP, and Azure, no single provider's cost report captures the true cost of that request. FinOps teams lose visibility, overages go undetected, and unit economics calculations become unreliable. Mitigation: consistent tagging taxonomy enforced at IaC level and a unified FinOps platform that normalizes across providers.
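The mitigation named for cross-cloud messaging (idempotent consumers with explicit sequence numbers) can be sketched as follows. The message shape, the class names, and the in-memory bookkeeping are hypothetical; a real consumer would keep this state in a durable store and route messages stuck in the buffer to a dead-letter queue.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    stream: str    # logical producer, e.g. the events of one order
    sequence: int  # assigned by the producer, starting at 1
    payload: str


@dataclass
class IdempotentConsumer:
    """Applies each message once, in sequence order per stream."""
    next_expected: dict = field(default_factory=dict)  # stream -> next sequence
    pending: dict = field(default_factory=dict)        # (stream, seq) -> Message
    applied: list = field(default_factory=list)

    def handle(self, msg: Message) -> None:
        expected = self.next_expected.get(msg.stream, 1)
        if msg.sequence < expected:
            return  # duplicate delivery: already applied, safe to ignore
        self.pending[(msg.stream, msg.sequence)] = msg  # buffer early arrivals
        # Drain every buffered message that is now in order.
        while (msg.stream, expected) in self.pending:
            ready = self.pending.pop((msg.stream, expected))
            self.applied.append(ready.payload)
            expected += 1
        self.next_expected[msg.stream] = expected


consumer = IdempotentConsumer()
for incoming in [
    Message("order-42", 2, "charged"),  # arrives early, buffered
    Message("order-42", 1, "created"),
    Message("order-42", 1, "created"),  # redelivery, ignored
    Message("order-42", 3, "shipped"),
]:
    consumer.handle(incoming)

print(consumer.applied)  # ['created', 'charged', 'shipped']
```

Because ordering and deduplication depend only on the producer-assigned sequence number, the consumer behaves the same whether a message travelled through SQS, Pub/Sub, or a bridge between them.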

The "portable by default" trap: mandating that every service must be portable across clouds sounds like good governance. In practice it forces engineers to avoid all provider-native managed services — no RDS, no BigQuery, no Azure Cognitive Services — because those are "not portable." The result is teams running self-managed PostgreSQL on EC2 instead of RDS, self-managed Kafka instead of MSK, and self-managed Elasticsearch instead of OpenSearch Service. The operational burden of self-managing those services dwarfs any portability benefit. Portability should be a workload-specific decision, not an organization-wide mandate.

A mature multi-cloud strategy is not defined by how much you abstract — it is defined by how deliberately you choose what to abstract. Kubernetes and Terraform give you proven, industry-standard abstractions at the layers where portability genuinely exists. Beyond those two layers, your goal is not to hide the clouds from each other; it is to operate them consistently through unified tooling for observability, secrets, cost, and identity. The next lesson examines how these strategies come together in a real cross-cloud reference architecture.