Data & Feature Pipelines
In traditional software, a broken deployment rolls back in minutes. In ML, the analogous disaster is silent: a feature pipeline drifts upstream, stale values flow into a serving model, and prediction quality degrades without a single error log. Data and feature pipelines are therefore not plumbing — they are the load-bearing structure of any ML platform. This lesson covers the three pillars that make them production-grade: data versioning, feature stores, and pipeline orchestration.
Data Versioning
Reproducibility is the non-negotiable foundation. If you cannot answer "what exact data produced this model?", you cannot debug, audit, or safely retrain. At scale, raw data lives in object storage (S3, GCS) or a data lakehouse (Delta Lake, Iceberg). Versioning strategies divide into two families:
- Copy-on-write snapshots: Delta Lake and Apache Iceberg store a transaction log of which files belong to each snapshot. Reads incur no copy overhead, and time travel is a single SQL clause (for example, VERSION AS OF in Delta Lake). This is the preferred approach for petabyte-scale tables.
- Content-addressable trees: DVC (Data Version Control) treats datasets like Git objects. Each version is identified by a hash of its contents, with a thin .dvc pointer file committed to the code repo. It works for any file type and any remote (S3, GCS, Azure, SSH).
Whichever family you use, record dataset_version=<hash> as metadata in the model registry entry. This closes the lineage loop and is the first thing on-call engineers check during a model regression incident.
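The content-addressing idea and the lineage metadata can be sketched with the standard library alone. This is a minimal illustration, not DVC's actual on-disk format or any real registry's API: the SHA-256 choice, the `registry_entry` helper, and the field names are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dataset_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file contents, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def registry_entry(model_name: str, data_path: Path) -> dict:
    """Build a hypothetical registry record that pins the dataset version."""
    return {
        "model": model_name,
        "dataset_version": dataset_hash(data_path),
        "dataset_path": str(data_path),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_text("user_id,revenue\n1,10.0\n2,5.5\n")
        print(json.dumps(registry_entry("churn_model", data), indent=2))
```

Because the version is derived from the bytes themselves, two identical files always map to the same version and any change to the data yields a new one, which is what makes the registry entry trustworthy during an incident.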
Feature Stores
A feature store solves the training-serving skew problem — the most common silent killer in ML production. Without one, the data scientist computes rolling_7d_revenue differently in a Jupyter notebook (full history) than the engineer does in a serving microservice (streaming window). The model trains on one distribution and infers on another.
A feature store has two planes:
- Offline store — columnar (Parquet on S3, BigQuery, Snowflake). Used for point-in-time-correct joins during training. "What was the user's 7-day revenue as of the training label timestamp?" This prevents future leakage.
- Online store — low-latency KV (Redis, DynamoDB, Cassandra). Serving path reads pre-materialized feature vectors in <5 ms. A materialization job syncs offline → online on a schedule or event trigger.
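The point-in-time-correct lookup that the offline store performs can be sketched as below. This is a simplified, single-user illustration in plain Python rather than how any particular feature store implements it; `FEATURE_HISTORY` and `feature_as_of` are hypothetical names, and the values are made up for the example.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical offline-store rows for one user, sorted by timestamp:
# (feature_timestamp, rolling_7d_revenue)
FEATURE_HISTORY: List[Tuple[datetime, float]] = [
    (datetime(2024, 1, 1), 100.0),
    (datetime(2024, 1, 8), 140.0),
    (datetime(2024, 1, 15), 90.0),
]


def feature_as_of(
    history: List[Tuple[datetime, float]], label_ts: datetime
) -> Optional[float]:
    """Return the latest feature value with timestamp <= label_ts, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    if idx == 0:
        return None  # no feature value existed yet at label time
    return history[idx - 1][1]


if __name__ == "__main__":
    # A label observed on Jan 10 may only see the Jan 8 value, never Jan 15.
    print(feature_as_of(FEATURE_HISTORY, datetime(2024, 1, 10)))
```

The key property is that a value timestamped after the label is never returned, which is exactly the future leakage a naive "latest value" join would introduce.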
Feast is the most widely deployed open-source option. In production, teams define features as Python objects and deploy them with a single CLI command, feast apply.
Pipeline Orchestration
An ML pipeline is not a single DAG — it is a composition of three interconnected DAGs: data ingestion (batch or streaming), feature engineering (transformations, aggregations, joins), and training (potentially triggered by data drift rather than a schedule). The orchestrator must handle dependencies between all three, retries, backfills, and resource-aware scheduling (GPU nodes are expensive — tasks that can run on CPU should not acquire GPU slots).
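To make the dependency handling concrete, here is a minimal sketch that composes ingestion, feature engineering, and training into one dependency graph and derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The task names and the `RESOURCES` mapping are hypothetical; a real orchestrator adds retries, backfills, and actual scheduling on top of this ordering.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it depends on.
PIPELINE = {
    "ingest_events": set(),
    "ingest_users": set(),
    "build_features": {"ingest_events", "ingest_users"},
    "detect_drift": {"build_features"},
    "train_model": {"build_features", "detect_drift"},
}

# Hypothetical resource labels: only training needs a GPU slot.
RESOURCES = {"train_model": "gpu"}


def execution_plan(dag):
    """Return (task, resource) pairs in an order that respects dependencies."""
    order = TopologicalSorter(dag).static_order()
    return [(task, RESOURCES.get(task, "cpu")) for task in order]


if __name__ == "__main__":
    for task, resource in execution_plan(PIPELINE):
        print(f"{task:15s} -> {resource}")
```

Tagging each task with the cheapest resource it can run on keeps CPU-only steps from occupying GPU slots, which is the scheduling concern raised above.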
Orchestration options ranked by operational overhead:
- Airflow / Astronomer: industry standard for data engineering. DAGs are Python; the operator ecosystem is vast. Main weakness: operating the scheduler. It was a single point of failure in open-source installations before Airflow 2.0 added support for running multiple schedulers, and the Celery or Kubernetes executor is required at scale.
- Prefect 2 / Prefect Cloud: designed to fix Airflow's pain points. Tasks are plain Python functions decorated with @task; flows build the DAG automatically. The control plane is SaaS; agents run inside your infrastructure. Good for teams that want to move fast without babysitting a scheduler.
- Kubeflow Pipelines (KFP): Kubernetes-native. Each step is a container; the pipeline YAML is compiled from the Python SDK. Best fit when every step must be reproducible and containerized, especially training steps that need Kubernetes resource requests.
- Metaflow: Netflix-built, AWS-native. Branching, resuming failed runs, and versioning are first-class. Preferred when the team is data-scientist-heavy and self-service is more important than operator control.
To keep backfills safe, parameterize every task by execution_date, treat state as append-only in the offline store, and test the backfill path in CI. Netflix and Airbnb run automated backfill tests on every feature pipeline commit.
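A backfill test of the kind described here can be sketched as follows. The in-memory `offline_store`, the event shape, and the function names are hypothetical stand-ins. Note one deliberate simplification: the sketch gets idempotence by overwriting the partition keyed by the execution date, whereas a strictly append-only store would write a new versioned partition instead.

```python
from datetime import date, timedelta

# Hypothetical in-memory stand-in for a partitioned offline store:
# partition key (ISO date string) -> list of feature rows.
offline_store: dict = {}


def compute_features(raw_events, execution_date: date):
    """Aggregate revenue per user for events that fall on execution_date."""
    totals = {}
    for event in raw_events:
        if event["date"] == execution_date:
            user = event["user_id"]
            totals[user] = totals.get(user, 0.0) + event["revenue"]
    return [{"user_id": u, "daily_revenue": r} for u, r in sorted(totals.items())]


def run_task(raw_events, execution_date: date) -> None:
    """Write exactly one partition, keyed by execution_date (never by 'now')."""
    key = execution_date.isoformat()
    offline_store[key] = compute_features(raw_events, execution_date)


def backfill(raw_events, start: date, end: date) -> None:
    """Re-run the task for every date in the inclusive range [start, end]."""
    day = start
    while day <= end:
        run_task(raw_events, day)
        day += timedelta(days=1)
```

A CI check then amounts to running `backfill` twice over the same range and asserting that the store is unchanged after the second run.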
Production Failure Modes
Senior engineers internalize these common breakages:
- Late-arriving data: streaming sources guarantee at-least-once delivery but not ordering. Watermarks (Flink, Spark Structured Streaming) must be configured with real-world latency budgets, not theoretical ones. A 10-minute watermark on a pipeline where 5% of events arrive 12 minutes late silently drops those events from feature windows.
- Schema drift: an upstream team adds a column, renames a field, or changes a type. A validation tool such as Great Expectations or Soda should run as the first task in every ingestion pipeline and fail fast on schema violations before data contaminates the feature store.
- Null propagation: a missing user_id 10 joins deep becomes a training dataset with 3% nulls in a feature the model has never seen null for, which becomes a serving-time NaN that some models handle silently and others score as zero.
- Clock skew between offline and online: if the materialization job runs every hour and your model requires features fresher than 60 minutes, you have a silent stale-feature problem in the peak-traffic window just before a materialization run.
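Two of these failure modes are easy to illustrate in a few lines. The sketch below first reproduces the watermark arithmetic from the late-arriving-data example, then shows a fail-fast schema and null check of the kind a validation task would run. Both are simplified models under stated assumptions: real engines track watermarks against the maximum observed event time rather than a per-event delay, and `EXPECTED_SCHEMA` and the function names are hypothetical, not the API of Great Expectations or Soda.

```python
def dropped_fraction(delays_min, watermark_min):
    """Fraction of events whose lateness exceeds the watermark (simplified)."""
    if not delays_min:
        return 0.0
    dropped = [d for d in delays_min if d > watermark_min]
    return len(dropped) / len(delays_min)


# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"user_id": int, "revenue": float}


def schema_violations(row, expected):
    """Return human-readable schema problems for one record (empty if clean)."""
    problems = []
    for name in sorted(expected.keys() - row.keys()):
        problems.append(f"missing column: {name}")
    for name in sorted(row.keys() - expected.keys()):
        problems.append(f"unexpected column: {name}")
    for name in sorted(expected.keys() & row.keys()):
        if row[name] is None:
            problems.append(f"null value: {name}")
        elif not isinstance(row[name], expected[name]):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    return problems


if __name__ == "__main__":
    # 95 events arrive 1 minute late, 5 arrive 12 minutes late.
    delays = [1] * 95 + [12] * 5
    print(dropped_fraction(delays, watermark_min=10))
    print(dropped_fraction(delays, watermark_min=15))
    print(schema_violations({"user_id": None, "revenue": "3"}, EXPECTED_SCHEMA))
```

Measuring the real delay distribution and choosing the watermark from it, then rejecting bad records at ingestion, turns both silent failures into visible ones.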