Apache Doris Basics: Creating Databases, Tables, Partitioning, Data Import, and Rollup
This article provides a comprehensive guide to Apache Doris, covering how to create databases and tables with single and composite partitions, import data via broker and routine loads, understand its aggregate, uniq, and duplicate data models, and leverage rollup and prefix index features for optimized querying.
Apache Doris (originally Baidu Palo) is an open‑source distributed SQL database built on massive parallel processing technology, integrating concepts from Google Mesa, Apache Impala, and Apache ORC.
Basic Usage
Creating a database:
MySQL> CREATE DATABASE example_db;
MySQL> SHOW DATABASES;Creating tables with single and composite partitions. Single‑partition example:
CREATE TABLE table1 (
siteid INT DEFAULT '10',
citycode SMALLINT,
username VARCHAR(32) DEFAULT '',
pv BIGINT SUM DEFAULT '0'
) AGGREGATE KEY(siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");Composite‑partition example (partitioned by date range and hashed by siteid):
CREATE TABLE table2 (
event_day DATE,
siteid INT DEFAULT '10',
citycode SMALLINT,
username VARCHAR(32) DEFAULT '',
pv BIGINT SUM DEFAULT '0'
) AGGREGATE KEY(event_day, siteid, citycode, username)
PARTITION BY RANGE(event_day) (
PARTITION p201706 VALUES LESS THAN ('2017-07-01'),
PARTITION p201707 VALUES LESS THAN ('2017-08-01'),
PARTITION p201708 VALUES LESS THAN ('2017-09-01')
)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");After creation, you can view tables with SHOW TABLES; and describe them with DESC table_name;.
Data Import
Broker load example:
LOAD LABEL table1_20170708 (
DATA INFILE("hdfs://your.namenode.host:port/dir/table1_data")
INTO TABLE table1
) WITH BROKER hdfs (
"username"="hdfs_user",
"password"="hdfs_password"
) PROPERTIES (
"timeout"="3600",
"max_filter_ratio"="0.1"
);Routine load (Kafka) example:
CREATE ROUTINE LOAD db1.job1 on tbl1
PROPERTIES ("desired_concurrent_number"="1")
FROM KAFKA (
"kafka_broker_list"="broker1:9091,broker2:9091",
"kafka_topic"="my_topic",
"property.security.protocol"="ssl",
"property.ssl.ca.location"="FILE:ca.pem",
"property.ssl.certificate.location"="FILE:client.pem",
"property.ssl.key.location"="FILE:client.key",
"property.ssl.key.password"="abcdefg"
);Other import methods such as SparkLoad, S3 load, and StreamLoad are also supported.
Doris Introduction
Doris originated from Baidu in 2017, entered Apache incubation in 2018, and differs from the commercial version DorisDB. It combines MPP query engine, columnar storage, and compression technologies.
Data Model
Tables consist of rows and columns. Columns are divided into Key (dimension) and Value (metric) types. Three model types exist:
Aggregate model – supports aggregation functions (SUM, REPLACE, MAX, MIN) on Value columns.
Uniq model – a special case of Aggregate using REPLACE to enforce primary‑key uniqueness.
Duplicate model – stores raw rows without any aggregation; DUPLICATE KEY only defines sorting columns.
Example of an Aggregate table schema with SUM, REPLACE, MAX, and MIN aggregation types is provided, illustrating how identical Key rows are merged while Value columns are aggregated accordingly.
Rollup
Rollup creates materialized views (Rollup tables) on a base table to accelerate queries at coarser granularity. For Aggregate and Uniq models, Rollup stores pre‑aggregated data; for Duplicate models it merely reorders columns for prefix‑index optimization.
Example: a Rollup containing user_id and sum(cost) allows queries like SELECT user_id, SUM(cost) FROM table GROUP BY user_id; to hit the Rollup directly, reducing scanned data.
Prefix Index
Doris generates a sparse prefix index (up to 36 bytes, with varchar limited to 20 bytes) for the first columns of Base or Rollup tables. During query planning, Doris matches query predicates against this prefix index to select the most suitable Base/Rollup, improving scan efficiency.
Additional Notes
Default replication is 3 for high availability; the examples use a single replica for simplicity.
Tables support dynamic addition/removal of partitions, schema modifications, and adding Rollup tables to improve performance.
Null attributes on columns can affect query performance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
