How FlatJSON Transforms JSON Queries in StarRocks 4.0 for Near‑Columnar Performance
StarRocks 4.0 introduces FlatJSON, a columnar storage and execution engine that converts high‑frequency JSON fields into native columns, dramatically reducing I/O and CPU costs and enabling JSON queries to run with performance close to that of traditional columnar data.
StarRocks 4.0 releases a set of JSON‑related enhancements, the most important being FlatJSON, which stores and processes JSON data in a columnar fashion, allowing real‑time analytics on logs, click‑streams, and IoT data without the heavy overhead of full JSON parsing.
Why JSON Queries Are Slow
In typical analytical workloads, JSON is stored as a single string per row. Even simple queries such as

SELECT get_json_string(event, '$.type') AS event_type,
       COUNT(DISTINCT user_id)
FROM events_log
WHERE get_json_string(event, '$.region') = 'US'
  AND to_datetime(get_json_int(dt, '$.event_ts')) BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY event_type;

suffer from several problems:
Each row requires loading the entire JSON document into memory.
All fields are read, even if only a few are needed.
Filters cannot use indexes, forcing full‑table scans.
String‑based calculations prevent dictionary encoding and other columnar optimizations.
Together, these factors can make a query that should finish in milliseconds take tens of seconds.
Traditional Database Solutions
Binary serialization (e.g., PostgreSQL JSONB): parses JSON on write and stores a binary representation, but each document is still stored as a single value per row, so it falls short for OLAP workloads.
Manual ETL / Generated columns: users expand JSON into separate columns during ingestion, achieving near‑columnar speed at the cost of complex ETL pipelines.
Automatic columnar storage: the engine infers and extracts fields automatically, but this requires sophisticated schema evolution handling.
FlatJSON Architecture
FlatJSON builds on StarRocks' segment architecture. A segment (~1 GB) stores each column independently as pages with encoding and compression. During import, FlatJSON:
Scans all JSON keys and counts field frequency to identify "hot" fields.
Infers the most efficient data type for each hot field.
Stores hot fields as native columns (INT, STRING, DOUBLE, etc.).
Writes low‑frequency or schema‑varying fields into a fallback "redundant" column using the Binary JSON format.
Consequently, a JSON document is transformed into a semi‑structured table where frequently accessed fields behave like ordinary columns.
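The import steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not StarRocks' implementation: the function name `flatten_segment`, the `max_missing_ratio` threshold, and the "one Python type per key" rule for type inference are all assumptions made for the example.

```python
import json
from collections import Counter


def flatten_segment(rows, max_missing_ratio=0.3):
    """Split JSON rows into hot columns plus a per-row fallback document.

    Assumption for this sketch: a key is "hot" when it is missing from at
    most `max_missing_ratio` of the rows and all of its values share one type.
    """
    docs = [json.loads(r) for r in rows]
    total = len(docs)
    # Step 1: count how many rows contain each key.
    presence = Counter(key for doc in docs for key in doc)

    # Step 2: pick hot keys and infer a single type for each of them.
    hot = {}
    for key, count in presence.items():
        missing_ratio = 1 - count / total
        types = {type(doc[key]) for doc in docs if key in doc}
        if missing_ratio <= max_missing_ratio and len(types) == 1:
            hot[key] = types.pop()

    # Step 3: one value list per hot key (None where the key is absent).
    columns = {key: [doc.get(key) for doc in docs] for key in hot}
    # Step 4: everything else stays in a fallback document per row.
    fallback = [{k: v for k, v in doc.items() if k not in hot} for doc in docs]
    return hot, columns, fallback


rows = [
    '{"user_id": 12345, "region": "US", "event_type": "click", "ts": 1710000000}',
    '{"user_id": 54321, "region": "CA", "event_type": "purchase", '
    '"ts": 1710000300, "experiment_flag": "A"}',
]
hot, columns, fallback = flatten_segment(rows)
```

Here `experiment_flag` appears in only one of the two rows, so it stays in the fallback document, while the four fields present in every row become columns.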
Why FlatJSON Is Faster
Higher columnar compression: hot fields benefit from dictionary encoding, reducing storage size.
Eliminated redundancy: JSON keys are no longer stored repeatedly.
Lower I/O: only the columnized fields are read during query execution.
No runtime parsing: the execution engine reads column values directly, bypassing string parsing.
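To make the compression point concrete, here is a minimal sketch of dictionary encoding for one extracted string column. The helper names `dict_encode` and `dict_decode` are hypothetical, and a real storage engine would additionally bit-pack and compress the codes.

```python
def dict_encode(values):
    """Store each distinct value once and replace rows with integer codes."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def dict_decode(dictionary, codes):
    """Recover the original values from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


regions = ["US", "CA", "US", "US", "DE", "CA"]
dictionary, codes = dict_encode(regions)
restored = dict_decode(dictionary, codes)
```

With six rows and three distinct regions, the column is stored as a three-entry dictionary plus six small integers, and the key name `region` is not repeated per row at all.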
Execution‑Stage Optimizations
Indexing (ZoneMap, Bitmap, Bloom filter): indexes on columnized fields allow selective page reads, turning full scans into targeted reads.
Dictionary decoding: low‑cardinality strings (e.g., region) are stored as integer codes; predicates are rewritten to operate on codes, dramatically reducing CPU work.
Late materialization: rows are filtered using lightweight row identifiers first; actual column values are fetched only for rows that survive all filters.
Global dictionary: a cluster‑wide dictionary extends the benefits of dictionary encoding to aggregation, sorting, and join phases, turning string operations into integer operations.
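The sketch below combines three of these ideas on a toy dataset: zone‑map page pruning, a string predicate evaluated on dictionary codes, and late materialization of the output column. The column contents, `PAGE_SIZE`, and the function names are invented for illustration, and the logic is a simplified model rather than the engine's actual code path.

```python
PAGE_SIZE = 3

# Hypothetical, already-flattened columns for six rows.
ts = [100, 110, 120, 500, 510, 520]
region_dict = ["CA", "US"]         # dictionary of the region column
region_codes = [1, 0, 1, 1, 1, 0]  # per-row integer codes into region_dict
event_type = ["click", "view", "buy", "click", "buy", "view"]


def build_zone_map(column, page_size):
    """Record (min, max) per page; a real engine does this at write time."""
    return [
        (min(column[i:i + page_size]), max(column[i:i + page_size]))
        for i in range(0, len(column), page_size)
    ]


TS_ZONE_MAP = build_zone_map(ts, PAGE_SIZE)


def query(ts_low, ts_high, region_literal):
    """Return (event_type values, pages read) for rows matching both filters."""
    if region_literal not in region_dict:
        return [], 0
    # Rewrite the string predicate into an integer comparison on codes.
    target_code = region_dict.index(region_literal)

    pages_read = 0
    row_ids = []
    for page_no, (page_min, page_max) in enumerate(TS_ZONE_MAP):
        if page_max < ts_low or page_min > ts_high:
            continue  # the zone map proves no row in this page can match
        pages_read += 1
        start = page_no * PAGE_SIZE
        for row_id in range(start, min(start + PAGE_SIZE, len(ts))):
            in_range = ts_low <= ts[row_id] <= ts_high
            if in_range and region_codes[row_id] == target_code:
                row_ids.append(row_id)

    # Late materialization: touch event_type only for surviving row ids.
    return [event_type[r] for r in row_ids], pages_read


result, pages_read = query(100, 200, "US")
```

In this run the second page is skipped entirely because its minimum `ts` exceeds the upper bound, and `event_type` is read for only the two rows that pass both filters.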
Performance Comparison
Benchmarks on a 1‑billion‑row dataset show traditional JSON processing taking ~30 seconds, while FlatJSON completes the same aggregation in ~0.5 seconds, a roughly 60x speedup that reflects large reductions in both I/O and CPU usage.
Real‑World Use Cases
Event‑log analytics: queries that filter by region and event type drop from tens of seconds to sub‑second latency, with schema evolution handled automatically.
E‑commerce reporting: high‑cardinality SKU and price fields are columnized, reducing ETL complexity and accelerating report generation.
IoT monitoring: heterogeneous device schemas are accommodated; frequent metrics become columns while rare fields stay in the fallback JSON.
Enabling FlatJSON
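Create a table with a JSON column and set the property "flat_json.enable" = "true". The optional "flat_json.null.factor" property sets the largest fraction of rows in which a field may be missing (NULL) and still be extracted; with the value 0.3 used below, a field absent from more than 30% of rows stays in the fallback column.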
CREATE TABLE events_log (
    dt DATE,
    event_id BIGINT,
    event JSON
) DUPLICATE KEY(dt, event_id)
PARTITION BY date_trunc('DAY', dt)
DISTRIBUTED BY HASH(dt, event_id)
PROPERTIES (
    "flat_json.enable" = "true",
    "flat_json.null.factor" = "0.3"
);

Insert sample data and query it exactly as you would with a normal JSON column:
INSERT INTO events_log VALUES
('2025-09-01', 1001, PARSE_JSON('{"user_id":12345,"region":"US","event_type":"click","ts":1710000000}')),
('2025-09-01', 1002, PARSE_JSON('{"user_id":54321,"region":"CA","event_type":"purchase","ts":1710000300,"experiment_flag":"A"}'));
SELECT get_json_string(event, '$.event_type') AS event_type,
       COUNT(*) AS cnt
FROM events_log
WHERE get_json_string(event, '$.region') = 'US'
  AND get_json_int(event, '$.ts') BETWEEN 1710000000 AND 1710003600
GROUP BY event_type;

Conclusion
FlatJSON provides an engineering‑level solution for high‑performance JSON analytics. By automatically columnizing hot fields, leveraging indexes, dictionary encoding, global dictionaries, and late materialization, StarRocks enables sub‑second query latency on massive semi‑structured datasets without sacrificing schema flexibility.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.