How MaxCompute Turns Semi‑Structured Data into High‑Performance Columnar Storage
This article explains the nature of semi‑structured data, compares schema‑on‑read and schema‑on‑write approaches, and shows how Alibaba Cloud MaxCompute leverages columnar storage and dynamic parsing to achieve low‑cost, high‑performance analytics for large‑scale data workloads.
01 Overview of Semi‑Structured Data
Semi‑structured data lies between structured and unstructured data. Formats such as JSON and XML are self‑describing: each record carries its own schema information (field names and nesting), which makes it easier to parse and extract than raw unstructured data.
Structured data requires a predefined schema that is costly to alter. Semi‑structured data, by contrast, is flexible and can be nested, balancing adaptability with efficient access.
02 Traditional Solutions Comparison
Data warehouses handle semi‑structured data via two models. With schema on read, data is stored raw and parsed at query time, which is flexible but slow to query. With schema on write, data is parsed and transformed during ingestion, which yields better storage efficiency and faster queries but less flexibility for evolving schemas.
Schema‑on‑read requires full table scans and decompression for each query, leading to high CPU usage and latency, whereas schema‑on‑write allows direct column access for known fields, improving performance.
However, schema‑on‑write assumes stable schemas; frequent upstream changes force costly table alterations.
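The trade-off between the two models can be illustrated with a minimal Python sketch. It is not MaxCompute code: the sample rows, field names, and both helper functions are hypothetical, and plain Python lists stand in for column files.

```python
import json

# Hypothetical raw log lines; the field names are illustrative only.
raw_rows = [
    '{"user": "a", "latency_ms": 120, "region": "eu"}',
    '{"user": "b", "latency_ms": 80, "region": "us"}',
    '{"user": "c", "latency_ms": 200, "region": "eu"}',
]


def sum_latency_schema_on_read(rows):
    """Schema on read: every query re-parses every full document,
    even though only one field is needed."""
    return sum(json.loads(r)["latency_ms"] for r in rows)


def ingest_schema_on_write(rows, fields):
    """Schema on write: parse once at ingestion and store each known
    field as its own column. Fields not listed are dropped, which is
    why upstream schema changes force a table alteration."""
    columns = {f: [] for f in fields}
    for r in rows:
        doc = json.loads(r)
        for f in fields:
            columns[f].append(doc.get(f))
    return columns


columns = ingest_schema_on_write(raw_rows, ["user", "latency_ms", "region"])
total_read = sum_latency_schema_on_read(raw_rows)
total_write = sum(columns["latency_ms"])  # touches a single column only
```

Both totals are the same, but the second query reads one pre-parsed column instead of re-parsing every row.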
03 MaxCompute Semi‑Structured Data Solution
MaxCompute is a serverless, enterprise‑grade cloud data warehouse that supports analytics at petabyte to exabyte scale. It ingests semi‑structured data (e.g., logs, IoT events) and provides built‑in import pipelines, real‑time monitoring, and downstream analytics.
By extracting schema during write and retaining dynamic parsing at read time, MaxCompute achieves low storage cost, high query performance, and flexibility for rapid upstream iteration.
04 Columnar Storage of Semi‑Structured Data
During short‑term ingestion windows, common fields across records are identified and column‑stored using AliORC, Alibaba’s high‑performance ORC‑compatible format, which naturally supports nested structures and enables column pruning.
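As a rough illustration of common-field extraction, the following Python sketch counts how often each top-level key appears within one ingestion window and promotes the frequent ones to columns. The 50% threshold, the function names, and the sample records are assumptions made for this example; the article does not describe MaxCompute's actual selection rule, and AliORC is not involved here.

```python
from collections import Counter


def infer_common_fields(docs, min_ratio=0.5):
    """Return the keys present in at least `min_ratio` of the documents.
    The threshold is an assumed heuristic, not a documented rule."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.keys())
    threshold = min_ratio * len(docs)
    return sorted(k for k, c in counts.items() if c >= threshold)


def to_columns(docs, fields):
    """Lay the chosen fields out column by column; missing values become None."""
    return {f: [doc.get(f) for doc in docs] for f in fields}


# One hypothetical ingestion window of already-parsed JSON records.
window = [
    {"ts": 1, "level": "info", "msg": "a"},
    {"ts": 2, "level": "warn", "msg": "b"},
    {"ts": 3, "level": "info", "msg": "c", "trace_id": "x1"},
    {"ts": 4, "msg": "d"},
]

common = infer_common_fields(window)
columns = to_columns(window, common)
```

Here `trace_id` appears in only one of four records, so it falls below the threshold and is not given a column of its own.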
Dirty or sparse data is handled by storing each field’s raw binary value together with its type metadata, isolating corrupt entries without affecting columnar compression or query speed.
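One way to picture a value stored alongside its type metadata is a tagged cell, sketched below in Python. The tag names, the byte layouts, and the `encode_cell` and `decode_cell` helpers are hypothetical; the actual on-disk encoding is not described in the article.

```python
import struct


def encode_cell(value):
    """Return (type_tag, raw_bytes) for one field value."""
    if value is None:
        return ("null", b"")
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return ("bool", b"\x01" if value else b"\x00")
    if isinstance(value, int):
        # Assumes the value fits in a signed 64-bit integer.
        return ("int64", struct.pack("<q", value))
    if isinstance(value, float):
        return ("float64", struct.pack("<d", value))
    return ("string", str(value).encode("utf-8"))


def decode_cell(cell):
    """Rebuild the Python value from its tag and raw bytes."""
    tag, raw = cell
    if tag == "null":
        return None
    if tag == "bool":
        return raw == b"\x01"
    if tag == "int64":
        return struct.unpack("<q", raw)[0]
    if tag == "float64":
        return struct.unpack("<d", raw)[0]
    return raw.decode("utf-8")


# A mostly-integer column containing one dirty entry and one missing value.
status_values = [200, "N/A", 404, None]
status_column = [encode_cell(v) for v in status_values]
tags = [tag for tag, _ in status_column]
decoded = [decode_cell(c) for c in status_column]
```

Because each cell carries its own tag, the dirty `"N/A"` entry is stored as it arrived and does not force the clean integer cells to be rewritten as strings.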
Rare fields are aggregated into a special column to avoid column explosion, with on‑demand lookup when needed.
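The catch-all column for rare fields can be sketched as follows. This is a simplified Python illustration: serializing the leftovers as a JSON string is an assumption, and `split_record` and `lookup` are hypothetical helpers, not MaxCompute functions.

```python
import json


def split_record(doc, common_fields):
    """Split one record into dense column values and a single overflow value
    that holds every remaining (rare) field."""
    dense = {f: doc.get(f) for f in common_fields}
    rest = {k: v for k, v in doc.items() if k not in common_fields}
    overflow = json.dumps(rest, sort_keys=True) if rest else None
    return dense, overflow


def lookup(dense, overflow, field):
    """Read a field from its own column if it has one; otherwise parse the
    overflow value on demand."""
    if field in dense:
        return dense[field]
    if overflow is None:
        return None
    return json.loads(overflow).get(field)


record = {"ts": 1, "msg": "a", "debug_flag": True}
dense, overflow = split_record(record, ["ts", "msg"])
flag = lookup(dense, overflow, "debug_flag")
```

Queries on common fields never touch the overflow value, so the parsing cost is paid only when a rare field is actually requested.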
05 Adaptive Query Processing
MaxCompute’s engine builds a logical plan, inserts dynamic type‑conversion operators when column files contain heterogeneous types (e.g., int, string, binary), then performs column pruning and filtering, achieving near‑native columnar performance even with evolving schemas.
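The combination of dynamic type conversion, column pruning, and filtering can be sketched in a few lines of Python. The `cast_to_int` and `scan` functions and the sample table are hypothetical, and treating unconvertible values as NULL is an assumption; a real engine would apply these steps as operators in its query plan rather than over Python lists.

```python
def cast_to_int(value):
    """Dynamic conversion: return an int, or None when the value cannot be
    converted (assumed NULL-on-failure behavior)."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, (bytes, bytearray)):
        try:
            value = value.decode("utf-8")
        except UnicodeDecodeError:
            return None
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return None
    return None


def scan(table, project, predicate_col, predicate):
    """Read only the needed columns (pruning), cast the predicate column,
    then keep the rows that pass the filter."""
    needed = set(project) | {predicate_col}
    cols = {name: table[name] for name in needed}
    casted = [cast_to_int(v) for v in cols[predicate_col]]
    keep = [i for i, v in enumerate(casted) if v is not None and predicate(v)]
    return [{name: cols[name][i] for name in project} for i in keep]


# The status column mixes int, string, and binary values, as in the text.
table = {
    "status": [200, "404", b"500", "oops", 200],
    "path": ["/a", "/b", "/c", "/d", "/e"],
    "payload": ["p1", "p2", "p3", "p4", "p5"],  # never read by the query below
}

rows = scan(table, project=["path"], predicate_col="status",
            predicate=lambda s: s >= 400)
```

The `payload` column is never touched, and the unparseable `"oops"` value is filtered out instead of failing the whole query.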
06 Benefits and Future Work
Benchmarks show that column‑stored JSON in MaxCompute delivers up to an order of magnitude faster query times and lower storage usage compared to raw JSON strings, approaching native columnar performance. Remaining gaps include better native type conversion for dates and further optimization of JSON parsing.
Overall, MaxCompute’s out‑of‑the‑box semi‑structured columnar solution requires no user‑side code changes, automatically extracts common schemas, handles dirty and sparse data, and delivers high‑performance analytics at reduced cost.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
