Mastering Database Schema Design: From Normalization to Sharding and Scaling
This article explains essential database design principles—including normalization, denormalization, join avoidance, and various sharding techniques—while also covering scaling strategies such as vertical upgrades, horizontal partitioning, and the use of flash storage to boost performance.
1. Schema Design Principles
Relational database design relies on normal forms to guarantee data consistency and minimize redundancy.
1.1 First Normal Form (1NF)
All column values must be atomic; multi‑valued or composite attributes are prohibited.
1.2 Second Normal Form (2NF)
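Every non-key attribute must depend on the whole primary key, not on just part of it. The rule matters when the primary key is composite: an attribute determined by only one of the key columns belongs in a separate table.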
1.3 Third Normal Form (3NF)
No non-key attribute may depend on the primary key transitively, that is, through another non-key attribute. Attributes that describe a different business entity are moved into that entity's own table.
1.4 Boyce‑Codd Normal Form (BCNF)
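For every non-trivial functional dependency, the determinant must be a candidate key (more precisely, a superkey). BCNF is stricter than 3NF and removes redundancy that 3NF can leave behind when a table has overlapping composite candidate keys.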
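The dependency rules behind 2NF, 3NF and BCNF can be probed on sample data. The following is a minimal sketch, not a schema-analysis tool: the order_items rows and the violates_fd helper are hypothetical, and a dependency that holds in a sample only suggests, rather than proves, that it holds in the schema.

```python
def violates_fd(rows, determinant, dependent):
    """Return True if one determinant value maps to two different dependent values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return True
    return False


# Hypothetical rows with the composite primary key (order_id, product_id).
order_items = [
    {"order_id": 1, "product_id": 10, "product_name": "pen", "qty": 2},
    {"order_id": 2, "product_id": 10, "product_name": "pen", "qty": 5},
    {"order_id": 2, "product_id": 11, "product_name": "ink", "qty": 1},
]

# product_name is determined by product_id alone: a partial dependency,
# so this table is not in 2NF and product_name belongs in a product table.
partial_dependency_holds = not violates_fd(order_items, ["product_id"], ["product_name"])

# qty is not determined by product_id alone; it needs the whole key.
qty_needs_full_key = violates_fd(order_items, ["product_id"], ["qty"])
```

Moving product_name into its own table keyed by product_id removes the partial dependency while qty stays with the composite key.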
2. Denormalization Techniques
2.1 Reducing Data Redundancy for Login
A typical user_info table stores extensive user profile data, but login operations need only id, nickname, password and real_name. Moving these columns into a dedicated login table means each login reads a much narrower row, which reduces I/O and network traffic.
Table: user_info (≈1,000,000 rows)
- id BIGINT(20) NOT NULL
- name VARCHAR(32) NOT NULL
- gender TINYINT(4) NOT NULL
- age INT(8) NOT NULL
- tel VARCHAR(16) NULL
- email VARCHAR(64) NULL
- school VARCHAR(32) NULL
- company VARCHAR(32) NULL
- interest VARCHAR(512) NULL
- gmt_create DATETIME NOT NULL
- gmt_modified DATETIME NOT NULL
Table: login (≈1,000,000 rows)
- id BIGINT(20) NOT NULL
- nickname VARCHAR(32) NOT NULL
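- password VARCHAR(128) NOT NULL -- stored as MD5 hash in this example; MD5 is too fast for password storage, so a salted, slow hash such as bcrypt or Argon2 is preferable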
- real_name VARCHAR(32) NOT NULL
Read-only login queries now touch only the slim login table, cutting row size and bandwidth.
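The split can be tried out with Python's built-in sqlite3 module. This is a rough sketch under stated assumptions: SQLite's simplified column types stand in for the MySQL types listed above, only a few user_info columns are reproduced, the UNIQUE constraint on nickname and the find_login helper are illustrative additions, and the password value is a placeholder rather than a real hash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_info (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    gender   INTEGER NOT NULL,
    age      INTEGER NOT NULL,
    interest TEXT
);
CREATE TABLE login (
    id        INTEGER PRIMARY KEY,
    nickname  TEXT NOT NULL UNIQUE,  -- assumption: nicknames are unique
    password  TEXT NOT NULL,         -- placeholder, not a real hash
    real_name TEXT NOT NULL
);
""")
conn.execute(
    "INSERT INTO user_info VALUES (?, ?, ?, ?, ?)",
    (1, "Alice Example", 1, 30, "chess, hiking, " * 30),  # deliberately wide row
)
conn.execute(
    "INSERT INTO login VALUES (?, ?, ?, ?)",
    (1, "alice", "<password-hash>", "Alice Example"),
)


def find_login(conn, nickname):
    """Look up credentials from the narrow login table only."""
    return conn.execute(
        "SELECT id, password, real_name FROM login WHERE nickname = ?",
        (nickname,),
    ).fetchone()


login_row = find_login(conn, "alice")
```

The login path never reads the wide interest column; profile pages can still fetch user_info by the shared id.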
2.2 Avoiding Expensive Joins
MySQL traditionally executes joins as nested-loop joins (MySQL 8.0.18 and later can also use hash joins); other engines may use hash joins or sort-merge joins. Whatever the algorithm, join cost grows with the size of the inputs, and without a suitable index on the join column the engine may have to scan one table once for every row of the other. Designing schemas to minimize joins, by merging frequently accessed columns or accepting controlled redundancy, improves performance.
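To make the cost difference concrete, here is a simplified in-memory sketch of two of the join algorithms named above. It models the unindexed case only and is not how any particular engine implements them; the users and orders rows are hypothetical.

```python
def nested_loop_join(outer, inner, key):
    """Compare every outer row with every inner row: len(outer) * len(inner) comparisons."""
    return [{**o, **i} for o in outer for i in inner if o[key] == i[key]]


def hash_join(build, probe, key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = {}
    for b in build:
        buckets.setdefault(b[key], []).append(b)
    return [{**b, **p} for p in probe for b in buckets.get(p[key], [])]


users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "order_id": 100},
    {"user_id": 1, "order_id": 101},
    {"user_id": 3, "order_id": 102},
]

nl_result = nested_loop_join(users, orders, "user_id")
hj_result = hash_join(users, orders, "user_id")
```

Both functions return the same matched rows; the nested loop does work proportional to the product of the input sizes, while the hash join does work roughly proportional to their sum at the price of extra memory for the hash table.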
2.3 Moving Consistency Checks to the Application Layer
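Foreign-key and unique constraints add overhead to every INSERT and UPDATE, because the DBMS must verify them on each write. When all writes go through trusted application code, these checks can be dropped from the DBMS and enforced in the application instead, which reduces write latency. Logical correctness is then only as strong as the application's own checks, so concurrent writers need particular care.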
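An application-level referential check might look like the sketch below, again using sqlite3 as a stand-in database. The users and orders tables and the create_order helper are hypothetical. On a multi-connection server, a check-then-insert like this also needs suitable isolation or locking, which the sketch does not show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Note: orders.user_id has no FOREIGN KEY clause; the check lives in the application.
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, amount INTEGER NOT NULL);
""")
conn.execute("INSERT INTO users VALUES (1, 'ann')")
conn.commit()


def create_order(conn, order_id, user_id, amount):
    """Insert an order only if the referenced user exists; return True on success."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        exists = conn.execute(
            "SELECT 1 FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        if exists is None:
            return False
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, amount)
        )
        return True


accepted = create_order(conn, 10, 1, 250)
rejected = create_order(conn, 11, 99, 250)  # user 99 does not exist
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Any code path that writes to orders without going through create_order bypasses the check, which is the main risk of this approach.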
2.4 Minimal SQL Layer
At the storage-engine level, a relational database stores data as <k,v> pairs. A query such as SELECT user_id, user_name FROM user_info WHERE age = 8; makes the engine fetch the full rows that satisfy the predicate and only then project the requested columns. Stripping the higher-level features (authentication, SQL parsing, complex access control) leaves only the essential indexing, transaction and locking machinery, effectively turning the system into a key-value store with NoSQL-like performance.
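The key-value view of that query can be sketched with plain dictionaries. This is an illustrative model rather than any engine's actual storage format, and it assumes a secondary index on age exists; the sample rows are hypothetical.

```python
# Primary storage: primary key -> full row (the <k,v> pair).
rows = {
    1: {"user_id": 1, "user_name": "ann", "age": 8},
    2: {"user_id": 2, "user_name": "bob", "age": 9},
    3: {"user_id": 3, "user_name": "cat", "age": 8},
}

# Secondary index: age -> list of primary keys.
age_index = {}
for pk, row in rows.items():
    age_index.setdefault(row["age"], []).append(pk)


def select_id_name_by_age(age):
    """Mimic SELECT user_id, user_name FROM user_info WHERE age = ?"""
    result = []
    for pk in age_index.get(age, []):
        full_row = rows[pk]  # the whole value is fetched first
        result.append((full_row["user_id"], full_row["user_name"]))  # then projected
    return result


matches = select_id_name_by_age(8)
```

Everything a minimal SQL layer does here reduces to two key lookups per matching row: one in the index and one in the primary store.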
3. Scaling Strategies
3.1 Scale‑Up vs. Scale‑Out
Scale‑up adds resources (CPU, memory, SSD, network) to a single server. It is simple to implement and does not require data migration.
Scale‑out adds more database instances or nodes. It typically involves replication or data partitioning (sharding) to distribute load horizontally.
3.2 Data Sharding
Why shard? As a rough rule of thumb, a single MySQL instance reaches its practical limits at around tens of terabytes of data and a few thousand TPS. Sharding spreads data and load across multiple nodes so that throughput can keep growing. A good sharding scheme aims for:
- Low business coupling between nodes
- Consistent business type per node
- Balanced data volume and access frequency
- Maintained consistency and safety guarantees
Vertical sharding separates tables or databases by business function (e.g., user profile vs. order data), reducing cross‑table joins and simplifying schema evolution.
Horizontal sharding (partitioning) splits a single logical table into many tables with the same schema, based on a sharding key such as user_id or order_id. Advantages include fixed cost, elimination of single-table bottlenecks, and transparent transaction handling as long as a transaction stays within one shard. Drawbacks are routing complexity, limited query flexibility (only queries on the sharding key can be routed efficiently), difficulty with joins across shards, and costly re-sharding.
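A common way to map a sharding key to a shard is modulo hashing. The sketch below assumes integer user_id keys and a hypothetical user_info_NN table naming scheme; it also measures why re-sharding is costly when the shard count changes.

```python
def shard_for(user_id: int, shard_count: int) -> int:
    """Modulo routing: the shard index is the key modulo the number of shards."""
    return user_id % shard_count


def table_for(user_id: int, shard_count: int) -> str:
    """Hypothetical naming scheme: user_info_00, user_info_01, ..."""
    return f"user_info_{shard_for(user_id, shard_count):02d}"


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


keys = range(1, 10_001)
fraction = moved_fraction(keys, 4, 5)  # growing from 4 to 5 shards
```

With plain modulo routing, going from 4 to 5 shards relocates 80% of these keys, because a key stays put only when its remainders modulo 4 and modulo 5 agree. Schemes such as consistent hashing or a lookup table are often used to reduce that movement.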
Other variants include:
- Logical sharding: isolates business logic rather than pure data volume.
- Time-based partitioning: stores each time period (day, week, month) in a separate table.
- Hot-cold sharding: separates frequently accessed ("hot") rows from archival ("cold") rows.
- Volume-based sharding: splits tables once a row-count threshold is reached, common for log tables.
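Time-based partitioning usually comes down to deriving a table name from a date. The following sketch assumes monthly tables with a hypothetical base_YYYYMM naming convention and shows which tables a date-range query would need to touch.

```python
from datetime import date


def monthly_table(base: str, day: date) -> str:
    """Name of the monthly table that holds rows for the given day."""
    return f"{base}_{day.year:04d}{day.month:02d}"


def tables_for_range(base: str, start: date, end: date) -> list:
    """List the monthly tables a query over [start, end] has to touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{base}_{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return names


tables = tables_for_range("access_log", date(2023, 11, 15), date(2024, 2, 3))
```

Queries confined to one period hit a single table, and old periods can be archived or dropped as whole tables instead of being deleted row by row.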
3.3 Routing After Sharding
When data is distributed, the application must locate the correct shard. Common approaches:
- Embed routing logic directly in application code.
- Extend the database with plugins or middleware that perform transparent routing.
- Deploy a middle-layer proxy that intercepts SQL statements and forwards them to the appropriate shard without modifying the application.
Each method trades off development effort, transparency to the application, and operational complexity.
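The first approach, routing inside application code, can be sketched as follows. In-memory sqlite3 connections stand in for separate database nodes, and the ShardRouter class, the orders table and modulo routing on user_id are all illustrative assumptions rather than a reference design.

```python
import sqlite3


class ShardRouter:
    """Application-level routing: pick a connection from the sharding key."""

    def __init__(self, shard_count):
        # Each in-memory database plays the role of one shard node.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for conn in self.shards:
            conn.execute(
                "CREATE TABLE orders ("
                "order_id INTEGER PRIMARY KEY, "
                "user_id INTEGER NOT NULL, "
                "amount INTEGER NOT NULL)"
            )

    def _conn_for(self, user_id):
        return self.shards[user_id % len(self.shards)]

    def add_order(self, order_id, user_id, amount):
        conn = self._conn_for(user_id)
        with conn:  # single-shard transaction
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, amount)
            )

    def orders_for_user(self, user_id):
        # The sharding key is known, so exactly one shard is queried.
        return self._conn_for(user_id).execute(
            "SELECT order_id, amount FROM orders WHERE user_id = ? ORDER BY order_id",
            (user_id,),
        ).fetchall()

    def total_amount(self):
        # No sharding key in the query: scatter to every shard, then gather.
        return sum(
            conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
            for conn in self.shards
        )


router = ShardRouter(3)
router.add_order(1, 7, 100)
router.add_order(2, 7, 50)
router.add_order(3, 8, 25)
user7_orders = router.orders_for_user(7)
total = router.total_amount()
```

A proxy or middleware layer performs the same key-to-shard mapping, but outside the application, so the application keeps issuing ordinary SQL.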
3.4 Flash Storage for Scale‑Up
Replacing mechanical disks with SSDs removes the mechanical seek, cutting average random-access latency from ~5 ms to <1 ms and raising random IOPS from ~100 to several thousand or more, depending on the device. This delivers a substantial performance boost with minimal architectural changes.
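A back-of-the-envelope calculation shows what those IOPS figures mean for a random-read workload. The sketch assumes a purely I/O-bound workload with no caching; the 5,000 IOPS value is an assumed example of "several thousand", not a measured figure.

```python
def seconds_for_random_reads(read_count: int, iops: float) -> float:
    """Time to serve read_count random reads at a sustained IOPS rate."""
    return read_count / iops


reads = 1_000_000
hdd_seconds = seconds_for_random_reads(reads, 100)    # ~100 IOPS mechanical disk
ssd_seconds = seconds_for_random_reads(reads, 5_000)  # assumed SSD figure
speedup = hdd_seconds / ssd_seconds
```

Under these assumptions, one million random reads drop from about 10,000 seconds to about 200 seconds, a 50-fold improvement without any schema or application change.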
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
