Implementation, Deployment & Maintenance

Data Migration

18 min Lesson 3 of 10

Data Migration

Cutting over to a new system is never just about software — it is equally about data. The records accumulated in a legacy system over years or decades represent irreplaceable institutional memory: patient histories, financial ledgers, customer orders, inventory counts. Migrating that data correctly is often the single riskiest activity in a go-live plan, and it is one of the analyst's most important responsibilities to design, oversee, and verify it.

This lesson walks through the four disciplines of a professional data migration: mapping legacy fields to the new schema, cleansing dirty records before they contaminate the new system, migrating via a repeatable pipeline, and verifying completeness and correctness after the move.

Why Data Migration Fails

Most data migration failures share the same root causes: the team discovers data quality problems only after cut-over (too late to fix without downtime), the mapping is undocumented and inconsistent, or verification is informal ("it looks about right"). A structured approach to each stage eliminates these failure modes before they reach production.

Stage 1 — Data Mapping

A data map (sometimes called a field mapping matrix) is a formal document that traces every field in the legacy system to its counterpart in the new system. For a clinic migrating from a paper-based scheduling tool to a cloud EMR, this means answering: where does PatientDOB from the old PATIENTS table land in the new schema? Is it patient.date_of_birth? Is the format DD/MM/YYYY vs YYYY-MM-DD? Are there multiple legacy fields that merge into one new field (first name + last name → full_name)?

A data map captures five things for every field pairing:

Source field — legacy table.column, data type, and sample values.
Target field — new schema table.column and data type.
Transformation rule — concatenation, format conversion, lookup translation, default value, or NULL handling.
Business rule exceptions — e.g., "if legacy status = 3, map to new status = 'suspended'; all other numeric codes map to 'active'."
Owner — the business stakeholder who confirmed the rule.

Analyst tip: Build the data map collaboratively with the data owner from the business side, not just the IT team. Only the business owner knows that the legacy field CUST_TYPE = 9 means "employee discount" — IT sees it as just a number.

Stage 2 — Data Cleansing

Legacy systems routinely accumulate decades of inconsistencies: duplicate customer records, phone numbers stored as free text, dates entered as "TBD", foreign-key orphans, and fields left NULL that the new system requires. Migrating dirty data into a clean schema propagates the problems and makes the new system look broken to users from day one.

A cleansing pass addresses four categories of defects:

Completeness — required fields that are blank or NULL. Action: backfill from another source, apply a default, or flag for manual review.
Consistency — the same concept stored differently across rows (e.g., country code stored as "UK", "United Kingdom", "GB"). Action: normalise against a reference lookup table.
Accuracy — values that are structurally valid but factually wrong (a birthdate of 01/01/1900 that is clearly a placeholder). Action: mark as suspect; route to business owner for correction.
Duplicates — multiple rows representing the same real-world entity. Action: merge using a defined master-record rule (most-recently-updated wins, or most-complete record wins).

Cleansing is done on a copy of the legacy data — never on the production source. The cleansed copy becomes the input to the migration pipeline. Keep a record of every change made; the audit trail is required for compliance in regulated industries such as healthcare and finance.

Warning: Do not migrate and cleanse simultaneously in a single script. Run a dedicated cleansing pass first, validate the cleansed dataset, then feed it into the migration. Mixing the two steps makes defects impossible to isolate and rerun.

Stage 3 — The Migration Pipeline

The migration itself is implemented as an ETL pipeline — Extract, Transform, Load — ideally as a repeatable, version-controlled script rather than a one-time manual operation. Repeatability is essential: the pipeline will run in development, in UAT, and at least once more during production cut-over, and the results must be identical each time.

The pipeline stages are:

Extract — pull data from the legacy source (database dump, CSV export, API call, or direct read-only connection).
Transform — apply every rule from the data map: reformat dates, translate code values, concatenate fields, derive calculated columns, handle NULLs.
Validate — run business-rule assertions against the transformed dataset before writing a single row to the target (e.g., "every patient record must have a valid NHS/SSN number"; "order totals must equal sum of line items"). Reject and log failing rows; do not silently skip them.
Load — write to the new system in dependency order (parent tables before child tables to respect foreign-key constraints). Use transactions so a partial load is rolled back cleanly on failure.
Post-load audit — run count and checksum queries immediately after loading; compare against source counts.

The five-stage data migration pipeline. Rejected rows are logged and routed back to business owners — never silently dropped.

Stage 4 — Migration Verification

Verification answers the question: did all the right data arrive, intact, in the new system? It has three layers:

Quantitative checks — compare row counts per entity between source and target. If the legacy system had 18,432 customer records and the new system has 18,432, that is a necessary condition (though not sufficient on its own). Also compare totals for financial fields: the sum of all invoice amounts in the legacy system must equal the sum in the new system.
Structural checks — verify that no NOT NULL constraints were violated, all foreign-key relationships are intact, and index uniqueness constraints hold. Run these as database-level assertions immediately post-load.
Business spot-checks — ask domain experts to manually inspect a representative sample of records (typically 0.5–1% of each entity type). For an online store migration: a buyer queries a known historic order and confirms that item names, quantities, prices, and status all appear correctly in the new system. Automated counts cannot catch a systematic transformation error; human spot-checks can.

Migration rehearsals: Run the full pipeline — extract, cleanse, transform, validate, load, verify — at least twice before the actual production cut-over: once in a development environment and once in a UAT environment against a production-sized data copy. The rehearsals surface timing issues (a migration that takes 11 hours cannot be done in a 4-hour maintenance window), missing transformation rules, and volume-related edge cases that never appear on small test datasets.

Migration Artefacts the Analyst Owns

As the analyst, you are responsible for producing and maintaining four artefacts throughout the migration process:

Data Map / Field Mapping Matrix — the complete source-to-target field register with all transformation rules and business owners. This is the foundation of everything else.
Data Quality Report — the output of the profiling and cleansing phase: counts of defects found by category, the rules applied to fix them, and the residual issues escalated to business owners.
Migration Run Log — a timestamped record of each pipeline execution: environment, start/end times, row counts extracted/rejected/loaded, and any errors encountered.
Verification Sign-off Checklist — the checklist of quantitative, structural, and business checks with pass/fail results and the name of the approving business stakeholder.

Four categories of data quality defects, with example, remediation action, and who owns each decision.

Putting It Together: Logistics Firm Scenario

A logistics company is retiring a 12-year-old on-premises transport management system (TMS) in favour of a cloud-based platform. The legacy TMS holds 840,000 shipment records, 62,000 customer accounts, and 14 years of driver log data stored in an Access database exported to CSV.

The analyst team spends three weeks building the data map (230 field pairings documented), discovers during profiling that 9% of customer phone numbers are in free-text narrative format rather than E.164, and that 4,200 shipment records reference a carrier code that no longer exists in the carrier table (orphaned foreign keys). The cleansing pass normalises phone numbers with a regex transformation, and the orphaned shipments are mapped to a special "legacy-unknown-carrier" record agreed with the business. The pipeline runs in dev in 6.5 hours, allowing the operations team to plan a Friday-night maintenance window. Verification spot-checks by the dispatch supervisor on 600 sample shipments pass with zero discrepancies. The migration is signed off before Monday morning.

Key takeaway: Data migration is not a technical afterthought — it is a structured analytical exercise. The analyst is the bridge between the business owners who understand what the data means and the engineers who build the pipeline that moves it. Both perspectives are required for a migration to succeed.