From Analysis to System Design

Designing the Data Layer from the ERD

18 min Lesson 5 of 10

Designing the Data Layer from the ERD

By the time a project reaches design, you have an approved Entity-Relationship Diagram (ERD) that captures what data the business needs and how the entities relate. That ERD is a logical model — it describes the world in business terms, free of any database technology. The next step is to translate it into a physical schema: concrete tables, typed columns, indexes, and integrity constraints that a real database engine can enforce. This translation is called logical-to-physical mapping, and it is one of the most consequential decisions a systems analyst makes. Done well, the database supports the application for years. Done carelessly, it forces painful migrations, silently loses data, or grinds queries to a halt under load.

Step 1 — Map Each Entity to a Table

The starting rule is straightforward: every strong entity in the ERD becomes a table, and every attribute becomes a column. But "straightforward" does not mean "thoughtless." Each decision deserves deliberate consideration:

  • Naming convention: table names should be consistent — plural snake_case (clinic_appointments) or singular Pascal-case, but never mixed. Pick one and enforce it across the schema.
  • Primary key: every table needs a primary key. If the natural key (e.g., a national ID or email address) is stable and compact, it can serve as the PK. More often, a surrogate key — an auto-incrementing integer or a UUID — is chosen because natural keys can change and real-world identifiers are sometimes duplicated.
  • Weak entities: a weak entity (one that cannot be identified without its owner) maps to a table whose primary key is a composite of its own partial key plus the FK pointing to the owner table.
Analyst note: Do not let developers rename your tables during implementation without your sign-off. The column and table names in the physical schema must trace back to terms used in the ERD and requirements document. Renaming silently breaks traceability.

Step 2 — Choose Column Data Types Precisely

Every ERD attribute must be assigned a concrete data type. This choice affects storage efficiency, query performance, and data integrity. Below are the most common mapping decisions analysts encounter:

  • Short text identifiers (names, codes, status flags) → VARCHAR(n) with a realistic upper bound — do not default to VARCHAR(255) everywhere; a country code is 2 characters, not 255.
  • Long text (notes, descriptions, addresses) → TEXT or NVARCHAR(MAX).
  • Whole numbers (quantities, counts, age) → INT or SMALLINT; use BIGINT only when the range demands it.
  • Money and ratesDECIMAL(p, s), never FLOAT or DOUBLE; floating-point arithmetic introduces rounding errors in financial calculations.
  • Dates and timesDATE, TIME, DATETIME, or TIMESTAMP — choose based on what you actually store. An appointment date does not need a time component; a log entry does. Store all timestamps in UTC.
  • Boolean flagsBOOLEAN (or TINYINT(1) in MySQL) — avoid storing 'Y'/'N' strings; they are invisible to query optimizers.
  • Enumerated values (status, category) → an ENUM type or a lookup/reference table. Reference tables are more maintainable when the list changes over time.

Step 3 — Translate Relationships into Foreign Keys

Every relationship line on the ERD becomes one or more foreign-key columns in the physical schema. The cardinality determines where the FK lives:

  • One-to-Many (1:N): the FK goes on the "many" side. In a clinic system, one doctor has many appointments, so appointments.doctor_id references doctors.id.
  • One-to-One (1:1): the FK can go on either side; place it on the table that is optional or more frequently queried alone.
  • Many-to-Many (M:N): create a junction table (also called a bridge or association table) whose PK is the composite of the two FKs. A course-enrollment system has students and courses in an M:N relationship — the junction table enrollments holds student_id and course_id, plus any attributes of the enrollment (enrollment date, grade).
ERD-to-Physical Mapping: Clinic Appointment Example Logical (ERD) Physical (Tables) Doctor id, name, specialty Patient id, name, dob, phone Appointment id, date, time status, notes 1 N 1 N doctors 🔑 id INT PK AUTO name VARCHAR(100) NOT NULL specialty VARCHAR(60) patients 🔑 id INT PK AUTO name VARCHAR(100) NOT NULL dob DATE phone VARCHAR(20) UNIQUE appointments 🔑 id INT PK AUTO 🔗 doctor_id INT FK 🔗 patient_id INT FK appt_date DATE NOT NULL appt_time TIME NOT NULL status ENUM(...) notes TEXT created_at TIMESTAMP maps to
Mapping a 3-entity ERD (Doctor, Patient, Appointment) to three physical tables with typed columns and foreign keys.

Step 4 — Define Constraints

Data integrity cannot be enforced by application code alone — users access databases through multiple paths (reporting tools, admin scripts, APIs). The database itself must be the last line of defence. The principal constraint types to specify in your design document are:

  • NOT NULL: every column that must always have a value — do not leave nullability to chance. In the clinic schema, appt_date can never be null; notes can.
  • UNIQUE: a single-column or multi-column uniqueness constraint — a patient's phone number must be unique; an invoice line must be unique by (invoice_id, line_number).
  • CHECK: inline value validation — CHECK (price >= 0), CHECK (end_date >= start_date). Always specify these; they prevent impossible states from reaching the database.
  • FOREIGN KEY with referential action: define ON DELETE and ON UPDATE behaviour explicitly. Options are RESTRICT, CASCADE, SET NULL, and NO ACTION. For the clinic, deleting a doctor should be RESTRICT if appointments exist — you do not want orphaned records silently created.
  • DEFAULT values: created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP removes the burden from the application and makes audit trailing automatic.
Common trap — cascading deletes: ON DELETE CASCADE is dangerous unless you genuinely want child rows destroyed when the parent is deleted. In an e-commerce system, deleting a product with CASCADE would silently erase all order lines referencing that product — destroying historical sales records. Use RESTRICT by default and justify every CASCADE in writing.

Step 5 — Design Indexes for Performance

A primary-key index is created automatically by every major database engine. But for any non-trivial system, you need to design additional indexes based on the queries the application will run. The principle is: index the columns you filter by and join on, not every column.

  • Foreign-key columns: index every FK column. Without an index on appointments.doctor_id, a query for "all appointments for Dr. Hana" does a full table scan every time.
  • Frequently filtered columns: if users search appointments by status or date range, those columns need indexes.
  • Composite indexes: when queries always filter on two columns together (e.g., patient_id + appt_date), a composite index is faster than two separate indexes.
  • Unique indexes: any UNIQUE constraint is implemented as a unique index — declare both together.
Index Design Decision Map Index Design: appointments table appointments columns id doctor_id patient_id appt_date appt_time status notes created_at PK index on id auto-created by engine idx_appt_doctor FK join to doctors idx_appt_patient FK join to patients idx_appt_date_status composite: date + status for dashboard queries No index on notes large TEXT, never filtered
Deciding which indexes to create on the appointments table based on query patterns.

Step 6 — Handle Derived Attributes and Multi-Valued Attributes

ERDs sometimes include attributes that do not map directly to a single column:

  • Derived attributes (e.g., age, derived from date of birth): do not store them as columns. Store the source (dob) and let the application or a database view compute the derived value. Storing derived data creates consistency problems as soon as the source changes.
  • Multi-valued attributes (e.g., a patient can have multiple phone numbers): do not stuff them into a single comma-delimited column. Create a child table — patient_phones(id, patient_id, phone_number, phone_type) — with a FK back to the parent. This preserves normalization and makes each phone number independently queryable.
Normalisation shortcut check: if any column in a new table could hold a list of values in a single cell — phone numbers, tags, statuses — stop and ask whether a child table is the right answer. It almost always is.

Documenting the Physical Schema

The output of this step is not code — it is a data layer specification that becomes a section of the Design Specification Document. For each table, document: table name and purpose, column names with types and nullability, primary key, all foreign keys and their referential actions, all check constraints, all non-PK indexes, and any relevant notes on default values or triggers. This document is reviewed by the DBA, the developers, and the lead analyst before a single line of SQL is written.

In a logistics company that manages shipments, this discipline prevents the silent bugs that arise when a developer assumes weight is an integer while the requirements say it can be fractional, or assumes status is unconstrained while the business has exactly five valid values. The physical schema specification closes those gaps before they become runtime failures.