MLOps & DevOps for AI Systems

Data & Feature Pipelines

In traditional software, a broken deployment rolls back in minutes. In ML, the analogous disaster is silent: a feature pipeline drifts upstream, stale values flow into a serving model, and prediction quality degrades without a single error log. Data and feature pipelines are therefore not plumbing — they are the load-bearing structure of any ML platform. This lesson covers the three pillars that make them production-grade: data versioning, feature stores, and pipeline orchestration.

Data Versioning

Reproducibility is the non-negotiable foundation. If you cannot answer "what exact data produced this model?", you cannot debug, audit, or safely retrain. At scale, raw data lives in object storage (S3, GCS) or a data lakehouse (Delta Lake, Iceberg). Versioning strategies divide into two families:

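Copy-on-write snapshots: Delta Lake and Apache Iceberg keep table metadata (a transaction log in Delta Lake, snapshot manifests in Iceberg) recording which data files belong to each snapshot. Snapshots share unchanged files, so creating a new version does not copy the table, and time travel is a single clause in the query (for example, VERSION AS OF 42). This is the preferred approach for petabyte-scale tables.
Content-addressable trees: DVC (Data Version Control) treats datasets like Git objects. Each version is identified by a hash of its contents, and a thin .dvc pointer file is committed to the code repo while the data itself lives in remote storage. It works for any file type and any supported remote (S3, GCS, Azure, SSH).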

# Initialize DVC alongside an existing Git repo
dvc init
git add .dvc .dvcignore && git commit -m "Initialize DVC"

# Track a raw dataset file; the actual data goes to remote storage
dvc add data/raw/clicks_2025_06.parquet
# DVC writes the pointer file and a .gitignore next to the tracked file
git add data/raw/clicks_2025_06.parquet.dvc data/raw/.gitignore
git commit -m "Track June 2025 click dataset v1"

# Push data to S3 remote (already configured in .dvc/config)
dvc push

# On any machine: restore exact data for a given git commit
git checkout v1.3.0
dvc pull                    # fetches the hash-matched files from remote

# Delta Lake: time travel without DVC, here through the PySpark reader
# (assumes an active SparkSession named `spark` with Delta Lake configured)
df = (
    spark.read
    .format("delta")
    .option("versionAsOf", 42)
    .load("s3a://ml-lake/events/clicks")
)

At Uber, LinkedIn, and Meta, the canonical lineage system tracks which dataset version fed which training run. The training job writes dataset_version=<hash> as metadata into the model registry entry. This closes the lineage loop: given a model version, the registry names the exact data that produced it, which is the first thing on-call engineers check during a model regression incident.
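The lineage write can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any registry's real API: the "registry" is a plain JSON file, and `file_sha256` and `register_model` are hypothetical helper names. DVC records its own content hash in the .dvc pointer file; the sketch computes a SHA-256 instead so that it has no dependencies.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_model(registry: Path, model_name: str, dataset: Path) -> dict:
    """Append a registry entry recording which dataset version trained the model.

    The JSON-file registry is an assumption made for illustration only.
    """
    entry = {
        "model": model_name,
        "dataset_path": str(dataset),
        "dataset_version": file_sha256(dataset),
    }
    entries = json.loads(registry.read_text()) if registry.exists() else []
    entries.append(entry)
    registry.write_text(json.dumps(entries, indent=2))
    return entry
```

During an incident, looking up the model's entry and comparing dataset_version against the hash of the data currently in storage answers "did the training data change?" without rerunning anything.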

Feature Stores

A feature store solves the training-serving skew problem — the most common silent killer in ML production. Without one, the data scientist computes rolling_7d_revenue differently in a Jupyter notebook (full history) than the engineer does in a serving microservice (streaming window). The model trains on one distribution and infers on another.

A feature store has two planes:

Offline store — columnar (Parquet on S3, BigQuery, Snowflake). Used for point-in-time-correct joins during training. "What was the user's 7-day revenue as of the training label timestamp?" This prevents future leakage.
Online store — low-latency KV (Redis, DynamoDB, Cassandra). Serving path reads pre-materialized feature vectors in <5 ms. A materialization job syncs offline → online on a schedule or event trigger.
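The point-in-time rule behind the offline store can be shown without any feature-store library. The sketch below is dependency-free and uses a hypothetical helper, `point_in_time_value`; in practice the same lookup is done by the feature store's historical retrieval or by an as-of join in a dataframe library.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional


def point_in_time_value(
    feature_history: list[tuple[datetime, float]],
    label_time: datetime,
) -> Optional[float]:
    """Return the latest feature value observed at or before label_time.

    feature_history must be sorted by timestamp. Values recorded after
    label_time are never returned, which is what prevents future leakage.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, label_time)
    return feature_history[idx - 1][1] if idx > 0 else None


# Illustrative history of one user's 7-day revenue (made-up values).
history = [
    (datetime(2025, 6, 1), 120.0),
    (datetime(2025, 6, 8), 95.0),
    (datetime(2025, 6, 15), 300.0),  # later than the label below: must not leak
]
value_at_label = point_in_time_value(history, datetime(2025, 6, 10))
```

Here value_at_label is 95.0, the value known on June 8, even though a newer value exists in the table. A naive "latest value" join would have returned 300.0 and leaked the future into training.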

Feature store architecture: data flows from multiple sources through offline and online stores to training and serving consumers.

Feast is one of the most widely deployed open-source options. Teams define features as Python objects and register them with a single CLI command (feast apply):

# feature_repo/features.py  (Feast feature definitions)
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64
from datetime import timedelta

user = Entity(name="user_id", join_keys=["user_id"])

user_stats_source = FileSource(
    path="s3://ml-lake/features/user_stats/",
    timestamp_field="event_timestamp",
)

user_stats_fv = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=7),
    schema=[
        Field(name="rolling_7d_revenue",  dtype=Float32),
        Field(name="purchase_count_30d",  dtype=Int64),
        Field(name="last_active_days_ago", dtype=Int64),
    ],
    source=user_stats_source,
)

# --- Apply to registry and materialize to online store ---
# feast apply                                      # register schema
# feast materialize-incremental $(date -u +%Y-%m-%dT%H:%M:%S)

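# --- Python client used by both retrieval paths below ---
# from feast import FeatureStore
# store = FeatureStore(repo_path=".")   # directory containing feature_store.yaml

# --- Training: point-in-time correct retrieval ---
# store.get_historical_features(
#     entity_df=training_labels,       # must include user_id and event_timestamp
#     features=["user_stats:rolling_7d_revenue"],
# ).to_df()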

# --- Serving: low-latency online lookup ---
# store.get_online_features(
#     features=["user_stats:rolling_7d_revenue"],
#     entity_rows=[{"user_id": "u_12345"}],
# ).to_dict()

Training-serving skew kills silently. Without a feature store, a common failure mode looks like this: an engineer rewrites the revenue calculation in the serving path to fix a bug, the training path keeps the old logic, accuracy drops 4%, and no alert fires because the model still returns valid floats. Always define each feature once, in the store, and have both the training job and the serving layer read from that single definition.
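The "define once" rule can be illustrated without a feature store at all. In this minimal sketch the names `rolling_7d_revenue`, `training_row`, and `serving_features` are hypothetical, and events are held in memory as (timestamp, revenue) pairs. Both paths call the same function, so a bug fix cannot land in one path only.

```python
from datetime import datetime, timedelta

Event = tuple[datetime, float]  # (event_timestamp, revenue)


def rolling_7d_revenue(events: list[Event], as_of: datetime) -> float:
    """Sum revenue in the half-open window (as_of - 7 days, as_of]."""
    start = as_of - timedelta(days=7)
    return sum(revenue for ts, revenue in events if start < ts <= as_of)


def training_row(events: list[Event], label_time: datetime) -> dict:
    """Offline path: the feature evaluated as of the label timestamp."""
    return {"rolling_7d_revenue": rolling_7d_revenue(events, label_time)}


def serving_features(events: list[Event], now: datetime) -> dict:
    """Online path: the same definition, evaluated at request time."""
    return {"rolling_7d_revenue": rolling_7d_revenue(events, now)}
```

A cheap parity test in CI is to feed both functions the same events and timestamp and assert that the outputs are identical. Any divergence is skew by construction.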

Pipeline Orchestration

An ML pipeline is not a single DAG — it is a composition of three interconnected DAGs: data ingestion (batch or streaming), feature engineering (transformations, aggregations, joins), and training (potentially triggered by data drift rather than a schedule). The orchestrator must handle dependencies between all three, retries, backfills, and resource-aware scheduling (GPU nodes are expensive — tasks that can run on CPU should not acquire GPU slots).
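As a rough sketch of that composition, the three DAGs can be modeled as dependency maps and merged into one graph. The task names and the GPU_TASKS set are illustrative assumptions. A real orchestrator adds retries, backfills, and actual resource requests on top of this ordering step.

```python
from graphlib import TopologicalSorter

# Each mapping is task -> set of tasks it depends on (names are illustrative).
ingestion = {"validate_raw": {"ingest_raw"}}
features = {
    "compute_features": {"validate_raw"},
    "materialize_online": {"compute_features"},
}
training = {
    "train_model": {"compute_features"},
    "evaluate_model": {"train_model"},
}

# Composing the three DAGs is a union of their edges. The cross-DAG edges,
# such as compute_features depending on validate_raw, are what link them.
pipeline = {**ingestion, **features, **training}

# Tasks that would request a GPU node; everything else stays on CPU.
GPU_TASKS = {"train_model"}

# A valid execution order that respects every dependency.
order = list(TopologicalSorter(pipeline).static_order())
plan = [(task, "gpu" if task in GPU_TASKS else "cpu") for task in order]
```

Only train_model is tagged for a GPU slot in plan. The ingestion and feature tasks run on CPU even though they sit upstream of training in the same graph.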

Common orchestration options and their operational trade-offs:

Airflow / Astronomer: industry standard for data engineering. DAGs are Python; the operator ecosystem is vast. Main weakness: the scheduler was a single point of failure before Airflow 2.0 (which added support for running multiple schedulers), and a Celery or Kubernetes executor is required at scale.
Prefect 2 / Prefect Cloud: designed to fix Airflow's pain points. Tasks are plain Python functions decorated with @task, and the dependency graph is inferred from how a flow calls them rather than declared up front. With Prefect Cloud the control plane is SaaS, while agents (workers in later 2.x releases) run inside your infrastructure. Good for teams that want to move fast without babysitting a scheduler.
Kubeflow Pipelines (KFP): Kubernetes-native. Each step is a container; the pipeline YAML is compiled from the Python SDK. Best fit when every step must be reproducible and containerized, especially training steps that need Kubernetes resource requests.
Metaflow: Netflix-built and originally AWS-native (it now also runs on Kubernetes, Azure, and GCP). Branching, resuming failed runs, and versioning are first-class. Preferred when the team is data-scientist-heavy and self-service is more important than operator control.

# Prefect 2: minimal ML pipeline definition
from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner
import pandas as pd

@task(retries=3, retry_delay_seconds=60, log_prints=True)
def ingest_raw_data(date: str) -> pd.DataFrame:
    return pd.read_parquet(f"s3://ml-lake/raw/clicks/{date}.parquet")

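@task
def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    # Time-based 7-day window per user; rolling(7) would count 7 rows, not 7 days.
    # A complete window also needs the previous 6 days of input, not one daily file.
    df = df.dropna(subset=["user_id"]).copy()
    df["event_timestamp"] = pd.to_datetime(df["event_timestamp"])
    df = df.sort_values(["user_id", "event_timestamp"]).reset_index(drop=True)
    rolled = (
        df.set_index("event_timestamp")
          .groupby("user_id")["revenue"]
          .rolling("7D")
          .sum()
    )
    # Row order matches because df is already sorted by (user_id, event_timestamp).
    df["rolling_7d_revenue"] = rolled.to_numpy()
    return df[["user_id", "event_timestamp", "rolling_7d_revenue"]]

@task
def materialize_to_feature_store(features: pd.DataFrame) -> None:
    # write to offline store; downstream materialization job syncs to Redis
    # Partition by calendar date: partitioning on the raw timestamp would
    # create one directory per distinct event time.
    out = features.assign(
        event_date=features["event_timestamp"].dt.strftime("%Y-%m-%d")
    )
    out.to_parquet(
        "s3://ml-lake/features/user_stats/",
        partition_cols=["event_date"],
    )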

@flow(task_runner=ConcurrentTaskRunner(), name="daily-feature-pipeline")
def feature_pipeline(date: str = "2025-06-11") -> None:
    raw = ingest_raw_data(date)
    features = compute_features(raw)
    materialize_to_feature_store(features)

if __name__ == "__main__":
    feature_pipeline()

Design for backfill from day one. A pipeline that can only run on "today" will cost you the first time a bug is found in week-old features. Parametrize every pipeline by execution_date, treat state as append-only in the offline store, and test the backfill path in CI. Netflix and Airbnb run automated backfill tests on every feature pipeline commit.
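A backfill-friendly layout can be sketched as follows, assuming a local directory stands in for the offline store. `run_for_date` and `backfill` are hypothetical names, and the JSON payload is a placeholder for real feature output. Every run is keyed by execution_date and a run identifier, and files are opened in exclusive-create mode so earlier output is never overwritten.

```python
import json
from datetime import date, timedelta
from pathlib import Path


def run_for_date(root: Path, execution_date: date, run_id: str) -> Path:
    """Stand-in for the real pipeline: write one new file for one date.

    Mode "x" raises FileExistsError if the file already exists, which keeps
    the store append-only. A corrected backfill uses a new run_id.
    """
    out = root / f"event_date={execution_date.isoformat()}" / f"run={run_id}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("x") as fh:
        json.dump({"execution_date": execution_date.isoformat(), "run": run_id}, fh)
    return out


def backfill(root: Path, start: date, end: date, run_id: str) -> list[Path]:
    """Re-run the pipeline for every date in [start, end], inclusive."""
    days = (end - start).days + 1
    return [run_for_date(root, start + timedelta(days=i), run_id) for i in range(days)]
```

A CI test for the backfill path can call backfill over a short date range in a temporary directory and assert that one file per day appears, and that a second run under a new run_id adds files without touching the first set.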

Production Failure Modes

Senior engineers internalize these common breakages:

Late-arriving data: streaming sources typically guarantee at-least-once delivery but not ordering. Watermarks (Flink, Spark Structured Streaming) must be sized from measured end-to-end latency, not theoretical budgets. A 10-minute watermark on a pipeline where 5% of events arrive 12 minutes late silently drops those events from feature windows.
Schema drift: an upstream team adds a column, renames a field, or changes a type. A validation tool such as Great Expectations or Soda should run as the first task in every ingestion pipeline and fail fast on schema violations before bad data contaminates the feature store.
Null propagation: a missing user_id ten joins deep produces a training dataset with 3% nulls in a feature the model has never seen null. At serving time the same gap surfaces as a NaN, which some models handle silently and others score as zero.
Staleness between offline and online stores: if the materialization job runs every hour, an online value can be up to 60 minutes old (plus the job's run time) just before the next run. If your model requires features fresher than that, you have a silent stale-feature problem in that window.
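Two of the failure modes above lend themselves to small guard functions. The sketch below uses hypothetical helpers and a deliberately simplified model: `fraction_dropped` treats an event as dropped when its arrival lateness exceeds the watermark, and `schema_violations` is a hand-rolled check standing in for a validation library. EXPECTED_SCHEMA is an assumed schema for illustration.

```python
from datetime import timedelta


def fraction_dropped(lateness: list[timedelta], watermark: timedelta) -> float:
    """Fraction of events whose lateness exceeds the watermark allowance."""
    if not lateness:
        return 0.0
    late = sum(1 for delay in lateness if delay > watermark)
    return late / len(lateness)


# Assumed schema for the click events; adjust to the real source.
EXPECTED_SCHEMA = {"user_id": str, "event_timestamp": str, "revenue": float}


def schema_violations(row: dict, schema: dict = EXPECTED_SCHEMA) -> list[str]:
    """List every way a row deviates from the expected schema."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif row[column] is None:
            problems.append(f"null value: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"wrong type: {column}")
    unexpected = sorted(set(row) - set(schema))
    problems.extend(f"unexpected column: {column}" for column in unexpected)
    return problems
```

Running fraction_dropped over measured lateness samples before choosing a watermark turns the latency budget into a number that can be reviewed. An ingestion task that raises whenever schema_violations returns a non-empty list gives the fail-fast behavior described for schema drift and catches null keys at the same point.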

The two metrics every ML platform team tracks at the feature layer: feature freshness (P99 age of the online store value at query time) and feature coverage (fraction of serving requests where all required features are non-null and non-stale). A coverage drop below 95% on a high-value feature is treated as a P1 incident, not a data quality ticket.
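Both metrics are simple to compute once feature ages are logged per request. The following is a minimal sketch under assumed inputs: each request is a mapping from feature name to the age in seconds of the value served, with None meaning the feature was missing. The names `p99` and `feature_coverage` are hypothetical, and the percentile uses the nearest-rank method.

```python
from typing import Optional


def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def feature_coverage(
    requests: list[dict[str, Optional[float]]],
    required: list[str],
    max_age_seconds: float,
) -> float:
    """Fraction of requests where every required feature is present and fresh."""
    if not requests:
        raise ValueError("no requests")

    def covered(ages: dict[str, Optional[float]]) -> bool:
        return all(
            ages.get(name) is not None and ages[name] <= max_age_seconds
            for name in required
        )

    return sum(1 for ages in requests if covered(ages)) / len(requests)
```

Alerting on feature_coverage falling below the 95% threshold mentioned above, rather than on model accuracy, surfaces a stale or missing feature before it shows up as a prediction-quality regression.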