From MySQL to Apache Doris: Key Design Shifts for OLAP Migration
This article explains how backend engineers should rethink table design, indexing, partitioning, and key strategies when migrating attendance data from MySQL's OLTP model to Apache Doris 2.1.7's OLAP architecture, providing concrete DDL examples and practical tips.
Mindset Shift: OLTP vs OLAP
Backend developers accustomed to MySQL’s OLTP patterns (auto‑increment primary keys, point‑lookup queries, B‑tree indexes) must adopt a different approach for Apache Doris, an OLAP engine optimized for scanning massive rows and performing aggregations.
Core Design Differences
Design goal : MySQL optimizes single‑row latency and transactional consistency; Doris optimizes large‑scale scans and aggregations.
Query pattern : MySQL favors point queries and small range scans; Doris expects time‑range scans, multi‑dimensional aggregations, and large‑table joins.
Table schema : MySQL typically uses an auto‑increment id primary key; Doris uses a composite time + user_id key for physical ordering.
Index strategy : MySQL builds B‑tree indexes for every filter column; Doris builds inverted indexes only for high‑cardinality fields and relies on full scans for low‑cardinality columns.
Practical Case: Attendance Event Table
MySQL (OLTP) DDL :
CREATE TABLE attendance_event (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
user_id VARCHAR(32) NOT NULL,
event_time DATETIME NOT NULL,
device_no VARCHAR(32),
status TINYINT,
user_type TINYINT COMMENT '1 staff, 2 contractor, 3 leader',
INDEX idx_user_id (user_id),
INDEX idx_event_time (event_time)
) ENGINE=InnoDB;Typical queries target a single row, e.g. SELECT * FROM attendance_event WHERE id = ? or WHERE user_id = ?.
Doris (OLAP) DDL :
CREATE TABLE attendance_event (
`time` DATETIME NOT NULL COMMENT 'event time',
`user_id` VARCHAR(32) NOT NULL COMMENT 'user ID',
`device_no` VARCHAR(32),
`status` TINYINT,
`user_type` TINYINT COMMENT '1 staff, 2 contractor, 3 leader',
-- other fields ...
) ENGINE=OLAP
DUPLICATE KEY(`time`, `user_id`)
PARTITION BY RANGE(`time`) (
PARTITION p202602 VALUES LESS THAN ('2026-03-01'),
PARTITION p_future VALUES LESS THAN (MAXVALUE)
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.start" = "-1",
"dynamic_partition.end" = "6",
"dynamic_partition.time_zone" = "Asia/Shanghai"
);Key changes:
Removed the auto‑increment id.
Used time + user_id as the duplicate key to define physical order.
Partitioned by time to enable partition pruning for time‑range queries.
Bucketed by user_id (16 buckets) to co‑locate a user’s rows for efficient GROUP BY / JOIN.
Key Concepts Explained
Partition vs. Bucket
Partition determines which data files need to be scanned. For example, a monthly partition lets a query with WHERE time BETWEEN … read only the relevant month.
Bucket determines how data is distributed for parallel scanning. With 16 buckets, up to 16 threads can read data concurrently.
UNIQUE KEY Semantics
In MySQL, PRIMARY KEY or UNIQUE KEY guarantees uniqueness and enables fast point lookups. In Doris, UNIQUE KEY implements idempotent writes: rows with the same key overwrite previous versions, which is useful for status fields that may be updated repeatedly.
CREATE TABLE entity_certificate (
`certificate_id` VARCHAR(32) NOT NULL,
`image_uploaded` TINYINT,
UNIQUE KEY(`certificate_id`) -- only the latest record is kept
) ENGINE=OLAP
DISTRIBUTED BY HASH(`certificate_id`) BUCKETS 16;Because the table is small, it should remain unpartitioned; partitioning would add unnecessary metadata overhead.
Index Strategy
Only high‑cardinality columns (e.g., device_no) receive inverted indexes. Low‑cardinality columns such as status or user_type are scanned directly.
CREATE INDEX idx_device_no ON attendance_event(`device_no`) USING INVERTED;Guideline: create inverted indexes for columns with >10% distinct values; skip columns with fewer than 100 distinct values.
Dynamic Partitioning
Manual creation of monthly partitions is tedious. Doris can manage partitions automatically via table properties:
PROPERTIES (
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "MONTH",
"dynamic_partition.start" = "-1", -- keep last month
"dynamic_partition.end" = "6", -- create 6 future months
"dynamic_partition.time_zone" = "Asia/Shanghai"
);Setting the time zone avoids boundary misalignment.
Table Classification & Design Patterns
Fact Tables (Time‑Series)
Pattern: DUPLICATE KEY(time, dimension) Partition: RANGE(time) with dynamic partitions
Distribution: HASH(dimension) Typical examples: attendance_event, daily snapshots, leave records
Dimension / Mapping Tables
Pattern: UNIQUE KEY(business_key) or DUPLICATE KEY(business_key) No partitions; only bucketing
Distribution: HASH(business_key) Typical examples: employee info, contractor info, certificate mapping
FAQ
Do I still need an auto‑increment ID in Doris?
No. Auto‑increment requires global coordination and hurts performance. Use business keys, Snowflake IDs, or UUIDs when a unique identifier is required.
Does Doris support column default values?
Currently not. Fields such as create_time, update_time, or deleted must be populated explicitly by the application.
Must every table be partitioned?
Only large tables that are queried by time range benefit from partitioning. Small dimension tables should remain unpartitioned.
How many buckets should I set?
Typically 10‑20 buckets per BE node; with three replicas, 16 buckets offers a good balance between parallelism and metadata overhead.
What happens if I create too many indexes?
Write performance degrades and storage usage rises; Doris’s inverted indexes are heavier than MySQL’s B‑tree indexes, so limit them to high‑cardinality columns.
How to handle data updates?
UNIQUE KEY tables automatically overwrite rows (suitable for low‑frequency updates).
Bulk updates can be performed with UPDATE, but they are slower than MySQL.
Recommended pattern: append‑only writes with a version column; use UNIQUE KEY only for fields that truly need overwriting.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
