Multi-Cloud: Azure & GCP

Azure DevOps & AKS

18 min Lesson 4 of 28


Azure provides a tightly integrated DevOps story that stretches from source control through container builds to managed Kubernetes. Three services form the core: Azure Pipelines for CI/CD, Azure Container Registry (ACR) for private image storage, and Azure Kubernetes Service (AKS) for managed Kubernetes. In large organizations these services interlock with Microsoft Entra ID (formerly Azure Active Directory), role-based access control (RBAC), and managed identities, so the chain can run without stored passwords or long-lived tokens.

Azure Pipelines: CI/CD Without Secret Sprawl

Azure Pipelines is Microsoft's hosted CI/CD platform. Pipelines are defined in YAML at the root of the repository (azure-pipelines.yml) and run on Microsoft-hosted agents (Ubuntu, Windows, macOS) or self-hosted agent pools. The key architectural difference from GitHub Actions is that Azure Pipelines has first-class concepts of Environments and Deployment Jobs — these give you approval gates, exclusive locks, and deployment history baked in at the platform level, not bolted on after the fact.

The recommended authentication pattern between Pipelines and Azure services is a Workload Identity Federation (WIF) service connection. The pipeline acquires a short-lived OIDC token from the Azure DevOps token endpoint and exchanges it for an Azure AD access token via az login --federated-token — no service principal secret is stored anywhere. This is the 2024+ standard; the older approach of storing a client secret in a service connection is now considered a compliance risk.
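
The token exchange described above can be sketched as a shell function. This is a minimal illustration, not the service connection's actual implementation: the AzureCLI@2 task normally performs the login for you. The sketch assumes the client ID, tenant ID, and OIDC token are handed to the script by the pipeline (for example through the task's addSpnToEnvironment option, whose variable names you should verify), and wif_login with its DRY_RUN switch is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: log in to Azure with a federated (OIDC) token instead of a client secret.
# Assumption: the caller supplies the client ID, tenant ID and OIDC token,
# e.g. from variables exposed by the pipeline task.

wif_login() {  # args: client_id tenant_id oidc_token
  client_id=$1
  tenant_id=$2
  token=$3
  if [ -z "$client_id" ] || [ -z "$tenant_id" ] || [ -z "$token" ]; then
    echo "wif_login: client id, tenant id and token are all required" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print the command with the token redacted so it never reaches the logs.
    echo "az login --service-principal --username $client_id --tenant $tenant_id --federated-token <redacted>"
  else
    az login --service-principal \
      --username "$client_id" \
      --tenant "$tenant_id" \
      --federated-token "$token" \
      --output none
  fi
}
```

With DRY_RUN left at its default the function only prints the command; calling it with DRY_RUN=0 performs the real login, and at no point is a client secret read or written.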

```yaml
# azure-pipelines.yml: full CI + ACR push + AKS deploy pipeline
trigger:
  branches:
    include: [main]
  paths:
    exclude: ['**.md']

variables:
  imageRepository: 'api-service'
  containerRegistry: 'acmecr.azurecr.io'
  dockerfilePath: 'Dockerfile'
  tag: '$(Build.BuildId)'

stages:
  - stage: Build
    displayName: Build & Push Image
    jobs:
      - job: BuildJob
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: Docker@2
            displayName: Build and push to ACR
            inputs:
              command: buildAndPush
              repository: $(imageRepository)
              dockerfile: $(dockerfilePath)
              containerRegistry: 'acme-acr-service-connection'  # WIF service connection
              tags: |
                $(tag)
                latest

  - stage: DeployStaging
    displayName: Deploy to Staging
    dependsOn: Build
    jobs:
      - deployment: DeployToStaging
        environment: 'staging'  # Env has approval gates + history
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: Set AKS context
                  inputs:
                    azureSubscription: 'acme-aks-service-connection'
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      az aks get-credentials \
                        --resource-group acme-rg \
                        --name acme-aks-staging \
                        --overwrite-existing
                      kubectl set image deployment/api-service \
                        api-service=$(containerRegistry)/$(imageRepository):$(tag) \
                        --namespace staging
                      kubectl rollout status deployment/api-service --namespace staging

  - stage: DeployProd
    displayName: Deploy to Production
    dependsOn: DeployStaging
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToProd
        environment: 'production'  # Requires manual approval
        pool:
          vmImage: 'ubuntu-latest'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: Rolling deploy to prod AKS
                  inputs:
                    azureSubscription: 'acme-aks-service-connection'
                    scriptType: bash
                    scriptLocation: inlineScript
                    inlineScript: |
                      az aks get-credentials \
                        --resource-group acme-rg \
                        --name acme-aks-prod
                      kubectl set image deployment/api-service \
                        api-service=$(containerRegistry)/$(imageRepository):$(tag) \
                        --namespace production
                      kubectl rollout status deployment/api-service \
                        --namespace production --timeout=300s
```

Always tag images with the build ID and update latest together. The build ID tag is immutable and auditable (you can trace any running pod back to a pipeline run). The latest tag is a convenience pointer for local dev. In Kubernetes manifests on the cluster, reference the immutable tag — never latest — so rolling back is a one-command kubectl rollout undo rather than a rebuild.
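
The tagging rule above can be enforced with a small guard in a deploy script. The following is a sketch only: image_ref is a hypothetical helper rather than Azure tooling, and the registry, repository, and build ID values are illustrative.

```shell
#!/bin/sh
# Sketch: build an immutable image reference and refuse mutable tags.
# image_ref and the example values are hypothetical, not Azure tooling.

image_ref() {  # args: registry repository tag
  registry=$1
  repository=$2
  tag=$3
  case "$tag" in
    ""|latest)
      echo "image_ref: refusing mutable or empty tag '$tag'" >&2
      return 1
      ;;
  esac
  printf '%s/%s:%s\n' "$registry" "$repository" "$tag"
}

# Example: the value a deploy step would pass to kubectl set image.
image_ref acmecr.azurecr.io api-service 4812
# prints acmecr.azurecr.io/api-service:4812
```

Failing fast on latest keeps a mutable tag out of the Deployment spec, so every revision in the rollout history points at exactly one image.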

Azure Container Registry: Beyond a Simple Registry

ACR is much more than a private Docker Hub. At the Premium tier (required for production) it adds geo-replication, private endpoints, content trust, and built-in vulnerability scanning via Microsoft Defender for Containers. The authentication model is designed around managed identities: your AKS cluster's kubelet managed identity (the identity the nodes use to pull images) gets the AcrPull role on the registry, so Kubernetes can pull images without an imagePullSecret. This is the zero-secret pull pattern every production AKS cluster should use.

```shell
# Create ACR and attach it to AKS with zero-secret image pull
# (AKS kubelet managed identity gets AcrPull automatically)
az acr create \
  --resource-group acme-rg \
  --name acmecr \
  --sku Premium \
  --location eastus

# Comments sit above the command: a comment after a trailing backslash
# would break the line continuation.
#   --attach-acr         grants AcrPull to the kubelet MI automatically
#   --enable-oidc-issuer required for Workload Identity
az aks create \
  --resource-group acme-rg \
  --name acme-aks-prod \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --enable-managed-identity \
  --attach-acr acmecr \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --network-plugin azure \
  --network-policy calico \
  --generate-ssh-keys

# Verify the role assignment was created
az role assignment list \
  --scope "$(az acr show --name acmecr --query id -o tsv)" \
  --query "[?roleDefinitionName=='AcrPull']"
```
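
For a cluster that already exists, the same zero-secret pull setup can be applied after the fact. The sketch below assumes the resource names used in this lesson; run and attach_and_check_acr are hypothetical helpers, and DRY_RUN defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: attach ACR to an existing cluster, then validate that nodes can pull.
# DRY_RUN defaults to 1 so the commands are printed rather than executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

attach_and_check_acr() {  # args: resource_group cluster acr_name
  rg=$1
  cluster=$2
  acr=$3
  # Grants AcrPull on the registry to the cluster's kubelet identity.
  run az aks update --resource-group "$rg" --name "$cluster" --attach-acr "$acr"
  # Checks registry reachability and pull permissions from the cluster.
  run az aks check-acr --resource-group "$rg" --name "$cluster" --acr "$acr.azurecr.io"
}

attach_and_check_acr acme-rg acme-aks-prod acmecr
```

Setting DRY_RUN=0 executes both commands against the subscription you are logged in to.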

AKS Essentials: What the Managed Control Plane Hides (and What It Doesn't)

AKS manages the Kubernetes control plane (etcd, API server, scheduler, controller manager); Microsoft patches and upgrades it. On the Free tier the control plane costs nothing and you pay only for worker node VMs, while the Standard tier adds a per-cluster fee in exchange for an uptime SLA. But "managed" does not mean "zero ops": you still own node pool sizing, upgrade channels, networking choices, RBAC, pod security, autoscaling, and workload configuration. The decisions you make at cluster creation are expensive to change later.

[Figure: AKS Architecture: Control Plane, Node Pools, and Integrated Services. Microsoft-managed control plane (API server, etcd, scheduler/controllers, Entra RBAC); customer-managed worker nodes (system node pool, user node pools, Cluster Autoscaler); integrated services (Azure Load Balancer / AGIC, Azure Monitor / Container Insights, ACR, Azure Key Vault CSI driver, Azure Virtual Network).]
AKS splits responsibility: Microsoft owns the control plane; you own node pools, networking, autoscaling, and workload configuration.

Critical AKS Production Decisions

Network plugin: Use --network-plugin azure with --network-plugin-mode overlay (Azure CNI Overlay, now generally available). Nodes take their IPs from the VNet subnet, while pods draw from a separate private pod CIDR (set with --pod-cidr) and reach the rest of the network through the node's IP. Classic Azure CNI instead assigns every pod a VNet IP, which is a well-known cause of subnet IP exhaustion at scale. Pair with --network-policy calico or --network-policy azure for pod-level network policy enforcement.
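
The IP exhaustion problem is easiest to see with arithmetic. The sketch below assumes that classic Azure CNI reserves one subnet IP per node plus one per potential pod (the max-pods setting, 30 in this example), while CNI Overlay takes only one subnet IP per node. The function names are hypothetical, and surge nodes created during upgrades are ignored.

```shell
#!/bin/sh
# Sketch: compare subnet IP demand for classic Azure CNI vs. CNI Overlay.
# Assumes classic CNI reserves one IP per node plus one per potential pod
# (max pods per node), while Overlay takes only one subnet IP per node.

classic_cni_ips() {  # args: node_count max_pods_per_node
  echo $(( $1 * ($2 + 1) ))
}

overlay_ips() {      # args: node_count
  echo $(( $1 ))
}

nodes=50
max_pods=30
echo "classic CNI: $(classic_cni_ips "$nodes" "$max_pods") subnet IPs"
echo "CNI Overlay: $(overlay_ips "$nodes") subnet IPs"
# prints:
#   classic CNI: 1550 subnet IPs
#   CNI Overlay: 50 subnet IPs
```

Under these assumptions a 50-node cluster fits in a /24 subnet (251 usable addresses in Azure) with Overlay, but needs more than a /22 with classic CNI.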

Upgrade channel: Set --auto-upgrade-channel patch for production. This automatically applies patch-version upgrades (e.g., 1.29.3 → 1.29.5) within your current minor version, keeping you current on CVE fixes without major API changes. Minor version upgrades (1.29 → 1.30) stay manual so you can test workload compatibility first.
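
To make the scope of the patch channel concrete, here is a small sketch that classifies a version change as patch-only or not. is_patch_upgrade and its helpers are hypothetical and only mirror the rule described above; AKS itself decides which versions it rolls out.

```shell
#!/bin/sh
# Sketch: classify a version change the way the patch channel scopes it.
# is_patch_upgrade is a hypothetical helper, not part of az or kubectl.

minor_of() {  # "1.29.3" -> "1.29"
  echo "$1" | cut -d. -f1,2
}

patch_of() {  # "1.29.3" -> "3"
  echo "$1" | cut -d. -f3
}

# Succeeds only when both versions share major.minor and the patch increases.
is_patch_upgrade() {  # args: from_version to_version
  [ "$(minor_of "$1")" = "$(minor_of "$2")" ] &&
    [ "$(patch_of "$2")" -gt "$(patch_of "$1")" ]
}

if is_patch_upgrade 1.29.3 1.29.5; then echo "1.29.3 -> 1.29.5: automatic"; fi
if ! is_patch_upgrade 1.29.5 1.30.0; then echo "1.29.5 -> 1.30.0: manual"; fi
```

On an existing cluster the channel itself is set with az aks update --resource-group acme-rg --name acme-aks-prod --auto-upgrade-channel patch.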

Workload Identity (not Pod Identity): The legacy AAD Pod Identity add-on is deprecated. Use Azure Workload Identity — the OIDC-based successor — so pods assume Azure AD identities without node-level daemon sets or storing credentials as Kubernetes secrets.

```shell
# Configure Workload Identity: give a pod access to Key Vault without secrets

# Step 1: Create a managed identity for the workload
az identity create \
  --name api-service-identity \
  --resource-group acme-rg

MI_CLIENT_ID=$(az identity show \
  --name api-service-identity \
  --resource-group acme-rg \
  --query clientId -o tsv)

# Step 2: Grant Key Vault access to the identity
az keyvault set-policy \
  --name acme-keyvault \
  --secret-permissions get list \
  --spn "$MI_CLIENT_ID"

# Step 3: Federate the identity with the AKS OIDC issuer
AKS_OIDC=$(az aks show \
  --name acme-aks-prod \
  --resource-group acme-rg \
  --query oidcIssuerProfile.issuerUrl -o tsv)

az identity federated-credential create \
  --name api-service-fed \
  --identity-name api-service-identity \
  --resource-group acme-rg \
  --issuer "$AKS_OIDC" \
  --subject "system:serviceaccount:production:api-service-sa" \
  --audiences api://AzureADTokenExchange

# Step 4: Annotate the Kubernetes ServiceAccount
# kubectl annotate serviceaccount api-service-sa \
#   azure.workload.identity/client-id=$MI_CLIENT_ID \
#   --namespace production

# Step 5: Label the pod/deployment to use workload identity
# spec.template.metadata.labels:
#   azure.workload.identity/use: "true"

# The Azure Workload Identity webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID,
# and the projected token volume automatically; your app uses DefaultAzureCredential
```
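
Steps 4 and 5 can also be expressed declaratively. The sketch below renders the ServiceAccount and a Deployment patch as YAML through shell heredocs; both render functions are hypothetical helpers, and the all-zero client ID is a placeholder for the real value of MI_CLIENT_ID.

```shell
#!/bin/sh
# Sketch: render the manifests behind steps 4 and 5 instead of annotating by hand.
# Both render_* functions are hypothetical helpers.

render_service_account() {  # args: name namespace client_id
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: $1
  namespace: $2
  annotations:
    azure.workload.identity/client-id: "$3"
EOF
}

render_deployment_patch() {  # args: service_account_name
  cat <<EOF
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: $1
EOF
}

# Placeholder client ID for illustration only.
render_service_account api-service-sa production 00000000-0000-0000-0000-000000000000
render_deployment_patch api-service-sa
```

The first output can be piped to kubectl apply -f -, and the second saved to a file and applied with kubectl patch deployment api-service --namespace production --patch-file, assuming a kubectl version that supports that flag.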
Never store Azure credentials as Kubernetes Secrets. Secret values are only base64-encoded, so anyone who can run kubectl get secret -o yaml in the namespace can decode them immediately. Workload Identity removes the need to store them at all. For workloads that need to read secrets at runtime, mount them via the Azure Key Vault provider for the Secrets Store CSI driver (secrets-store-csi-driver): secrets are projected from Key Vault as files in the pod and stay out of etcd unless you opt in to syncing them into a Kubernetes Secret, which is what exposing them as environment variables requires.
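
A SecretProviderClass is what tells the CSI driver which Key Vault objects to mount. The sketch below renders one for the workload identity created earlier; the field layout follows the Azure provider's documented format as best I know it and should be checked against the provider version you run. The helper name, the db-password secret, and the placeholder IDs are all hypothetical.

```shell
#!/bin/sh
# Sketch: render a SecretProviderClass for the Azure Key Vault CSI provider.
# Field names reflect the provider's documented format; verify against your version.

render_secret_provider_class() {  # args: name namespace vault client_id tenant_id secret_name
  cat <<EOF
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: $1
  namespace: $2
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "$4"
    keyvaultName: $3
    tenantId: "$5"
    objects: |
      array:
        - |
          objectName: $6
          objectType: secret
EOF
}

# Placeholder IDs and secret name for illustration only.
render_secret_provider_class api-service-kv production acme-keyvault \
  00000000-0000-0000-0000-000000000000 \
  11111111-1111-1111-1111-111111111111 \
  db-password
```

A pod then references this class from a csi volume, and the secret appears as a file under the mount path rather than as an object in etcd.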

AKS Cost Optimization at Scale

AKS clusters in production commonly idle 40–60% of allocated resources during off-peak hours. The corrective measures are layered: Vertical Pod Autoscaler (VPA) right-sizes CPU/memory requests (a pod requesting 2 CPU but using 200m wastes node capacity); Cluster Autoscaler removes underutilized nodes after a configurable cooldown; and Spot node pools handle batch or fault-tolerant workloads at discounts of up to roughly 90%. Keep spot capacity in dedicated node pools: an AKS node pool is either Spot or regular priority, and AKS taints spot nodes (kubernetes.azure.com/scalesetpriority=spot:NoSchedule) so that only workloads that explicitly tolerate eviction are scheduled there.
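
Two of these measures can be sketched in shell: the waste calculation behind the VPA example, and the command that adds a dedicated Spot pool. The helper names, the pool name spotpool, and the min/max counts are illustrative assumptions, and DRY_RUN defaults to printing the az command instead of running it.

```shell
#!/bin/sh
# Sketch: quantify an oversized CPU request and add a dedicated Spot pool.
# DRY_RUN defaults to 1, so the az command is printed, not executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Whole-percent share of a CPU request that goes unused (values in millicores).
unused_request_pct() {  # args: requested used
  echo $(( ($1 - $2) * 100 / $1 ))
}

# Spot pool with autoscaling; a max price of -1 means nodes are not
# evicted on price and cost at most the on-demand rate.
add_spot_pool() {  # args: resource_group cluster pool_name
  run az aks nodepool add \
    --resource-group "$1" --cluster-name "$2" --name "$3" \
    --priority Spot --eviction-policy Delete --spot-max-price -1 \
    --enable-cluster-autoscaler --min-count 1 --max-count 10
}

echo "unused: $(unused_request_pct 2000 200)%"
add_spot_pool acme-rg acme-aks-prod spotpool
```

For the example in the text (2 CPU requested, 200m used) the first line prints unused: 90%, which is the capacity VPA would hand back to the scheduler.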

The Azure DevOps + ACR + AKS trinity maps directly to the GitOps model: Azure Pipelines builds and pushes an immutable image to ACR; a GitOps operator (Flux or Argo CD running in AKS) detects the image update and applies the new manifest. Pipelines own the build artifact; the GitOps operator owns the cluster state. This separation of concerns is the production-grade pattern at Azure-native shops — the pipeline never runs kubectl apply directly in this model.
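
One common variant of this flow has the pipeline finish by committing the new tag to the Git repository that the operator watches, rather than relying on registry polling. The sketch below shows only that manifest edit; bump_image_tag is a hypothetical helper, the manifest is a temporary one-line stand-in, and the git commit and push are left out.

```shell
#!/bin/sh
# Sketch: the pipeline's last step in a GitOps flow edits a manifest in Git
# instead of calling kubectl. bump_image_tag is a hypothetical helper.

bump_image_tag() {  # args: manifest_file image_repo new_tag
  # Rewrite "image: <repo>:<old tag>" so it points at the new immutable tag.
  sed "s|\(image: $2\):.*|\1:$3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo against a throwaway manifest fragment.
manifest=$(mktemp)
printf '        image: acmecr.azurecr.io/api-service:4811\n' > "$manifest"
bump_image_tag "$manifest" acmecr.azurecr.io/api-service 4812
cat "$manifest"
rm -f "$manifest"
```

After the commit is pushed, Flux or Argo CD reconciles the cluster to match, so the pipeline's credentials never need write access to the cluster.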