Databases 16 min read

Master StarRocks: Simplify Partitioning, Data Import, and Table Optimization

This guide walks you through using StarRocks—covering effortless expression‑based partitioning, streamlined data loading with INSERT FROM FILES and PIPE, powerful in‑flight data transformation using SELECT/JOIN/UNNEST, and flexible table structure tweaks via ALTER TABLE to boost query performance.

StarRocks

Mar 13, 2024

Expression‑Based Partitioning and Bucketing

StarRocks 3.x allows partitions to be defined directly with the date_trunc() function. The system automatically creates daily partitions and, for detail tables, applies random bucketing. Tablet numbers are adjusted dynamically based on cluster resources and import volume.

CREATE TABLE user_behavior_declared (
    UserID int(11),
    ItemID int(11),
    CategoryID int(11),
    BehaviorType varchar(65533),
    Timestamp datetime
) DUPLICATE KEY (CategoryID, UserId)
PARTITION BY date_trunc('day', Timestamp)
-- DISTRIBUTED BY RANDOM  -- no longer required

SQL‑Friendly Data Import

Large data sets can be loaded from HDFS or cloud storage with the standard‑SQL INSERT FROM FILES statement. A quick SELECT * on FILES() helps verify schema and sample rows before loading.

SELECT * FROM FILES(
    'path' = 's3://starrocks-examples/user_behavior_ten_million_rows.parquet',
    'format' = 'parquet',
    'aws.s3.region' = 'us-east-1',
    'aws.s3.use_instance_profile' = 'false',
    'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
    'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
) LIMIT 5;

After confirming the schema, load the data:

INSERT INTO user_behavior_declared
SELECT * FROM FILES(
    'path' = 's3://starrocks-examples/user_behavior_ten_million_rows.parquet',
    'format' = 'parquet',
    'aws.s3.region' = 'us-east-1',
    'aws.s3.use_instance_profile' = 'false',
    'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
    'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
);

Progress can be monitored with SHOW LOAD or by querying information_schema.loads. For multi‑terabyte imports, the PIPE feature (available since V3.2) wraps INSERT FROM FILES, automatically splits the job into smaller tasks, tolerates per‑file errors, and supports continuous ingestion via the 'AUTO_INGEST' = 'TRUE' property.

CREATE PIPE user_behavior_pipe
PROPERTIES (
    'AUTO_INGEST' = 'FALSE',  -- set to TRUE for continuous ingestion
    'BATCH_SIZE' = '100MB',
    'BATCH_FILES' = '10'
)
AS
INSERT INTO user_behavior_declared
SELECT * FROM FILES(
    'path' = 's3://starrocks-examples/user-behavior-10-million-rows/*',
    'format' = 'parquet',
    'aws.s3.region' = 'us-east-1',
    'aws.s3.use_instance_profile' = 'false',
    'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
    'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
);

In‑Flight Data Transformation

The INSERT FROM FILES statement supports full SELECT and JOIN capabilities, enabling complex transformations during import. The example below expands an Array<STRING> column c_arr (values like "k8s#1#SUCC") into separate rows with individual fields using UNNEST and SPLIT:

SELECT UserID, ItemID, CategoryID, BehaviorType, Timestamp, c_arr,
       split(t_v.unnest, '#')[1] AS platform,
       cast(split(t_v.unnest, '#')[2] AS INT) AS id,
       split(t_v.unnest, '#')[3] AS status
FROM FILES(
    'path' = 's3://starrocks-examples/user_behavior_ten_million_rows.parquet',
    'format' = 'parquet',
    'aws.s3.region' = 'us-east-1',
    'aws.s3.use_instance_profile' = 'false',
    'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
    'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
), unnest(c_arr) AS t_v;

Flexible Table Structure Optimization

Table schemas can be adjusted with ALTER TABLE to change distribution, bucket count, or order‑by keys, allowing the system to adapt to evolving data volumes and query patterns.

ALTER TABLE user_behavior_declared DISTRIBUTED BY RANDOM;

ALTER TABLE user_behavior_declared DISTRIBUTED BY HASH(UserId) BUCKETS 12;

ALTER TABLE user_behavior_declared ORDER BY (CategoryId, BehaviorType);

These operations may require a full table reorganization; progress can be checked with SHOW ALTER TABLE OPTIMIZE.

Note: Changing the bucket method or key currently requires rebuilding the entire table.

Reference Links

https://docs.starrocks.io/zh/docs/table_design/expression_partitioning/

https://docs.starrocks.io/zh/docs/table_design/Data_distribution/

https://docs.starrocks.io/zh/docs/loading/s3/

https://docs.starrocks.io/zh/docs/sql-reference/sql-statements/data-definition/ALTER_TABLE/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL StarRocks partitioning data import Pipe Table Optimization

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.