MaxCompute Semi-Structured Data Solutions: Architecture, Comparison, and Performance Benefits

This article explains the concepts of semi‑structured data, compares traditional schema‑on‑read and schema‑on‑write approaches, and details MaxCompute's columnar storage solution—including AliORC, adaptive query processing, and handling of dirty or sparse data—to achieve high performance and low cost in big‑data warehousing.

DataFunSummit

Sep 7, 2023

The article first defines semi-structured data as a middle ground between structured and unstructured data. Such data is self-describing: formats like JSON and XML carry their own field names and nesting, so they combine flexibility with a well-defined protocol for parsing.

It then contrasts structured data (strictly defined tables with fixed schemas) with unstructured data (free-form media lacking any inherent schema), and explains the trade-offs between them in flexibility, storage efficiency, and query performance.

Semi‑structured data inherits the flexibility of unstructured data while providing enough schema information to enable efficient parsing and access, making it suitable for logs, IoT telemetry, mobile event reporting, and autonomous driving.

The article reviews two traditional data-warehouse approaches. Schema-on-read stores the raw data and parses it at query time, which gives high flexibility but low performance. Schema-on-write parses the data during ingestion, which gives higher performance but lower flexibility. The article also discusses the maintenance cost of schema-on-write in fast-changing business environments, where the declared schema has to keep up with the incoming data.
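The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the sample events, the field names, and the function names are hypothetical, and in-memory lists stand in for a real warehouse table.

```python
import json

# Hypothetical raw events; the second one carries an extra "region" field.
RAW_EVENTS = [
    '{"user": "a", "latency_ms": 120}',
    '{"user": "b", "latency_ms": 80, "region": "eu"}',
]


def query_schema_on_read(raw_rows, field):
    """Schema-on-read: parse every raw JSON string at query time."""
    return [json.loads(row).get(field) for row in raw_rows]


def ingest_schema_on_write(raw_rows, schema):
    """Schema-on-write: parse once at ingestion, keep only declared columns."""
    columns = {name: [] for name in schema}
    for row in raw_rows:
        record = json.loads(row)
        for name in schema:
            columns[name].append(record.get(name))
    return columns


def query_schema_on_write(columns, field):
    """Read a pre-parsed column directly, with no JSON parsing at query time."""
    return columns[field]


read_result = query_schema_on_read(RAW_EVENTS, "latency_ms")
table = ingest_schema_on_write(RAW_EVENTS, schema=["user", "latency_ms"])
write_result = query_schema_on_write(table, "latency_ms")
region_on_read = query_schema_on_read(RAW_EVENTS, "region")
```

Both paths return the same latency values, but only the schema-on-read path can still answer a query on `region`. In the schema-on-write path that field was dropped at ingestion because it was not in the declared schema, which is the kind of maintenance burden the article refers to.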

The article then presents MaxCompute's semi-structured data solution. MaxCompute is a serverless, cloud-native data-warehouse service that supports data volumes from hundreds of gigabytes to exabytes. It ingests semi-structured data, stores it in the AliORC columnar format, and uses dynamic schema extraction to store stable fields as columns while preserving flexibility for fields that are still evolving.
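As a rough illustration of dynamic schema extraction, the sketch below treats a field as "stable" when it appears in most rows of a batch. The frequency threshold, the function name `extract_stable_fields`, and the sample batches are assumptions made for this example; the article does not describe MaxCompute's actual extraction rules.

```python
import json
from collections import Counter


def extract_stable_fields(raw_rows, min_frequency=0.8):
    """Return top-level keys present in at least min_frequency of the rows.

    The 0.8 threshold is an arbitrary illustrative choice.
    """
    counts = Counter()
    for row in raw_rows:
        counts.update(json.loads(row).keys())
    total = len(raw_rows)
    return sorted(key for key, n in counts.items() if n / total >= min_frequency)


# Hypothetical batches: the second one reflects a later schema change.
batch_1 = [
    '{"device": "d1", "temp": 21.5}',
    '{"device": "d2", "temp": 22.0}',
    '{"device": "d3", "temp": 20.1, "debug": true}',
]
batch_2 = [
    '{"device": "d1", "temp": 21.5, "humidity": 40}',
    '{"device": "d2", "temp": 22.0, "humidity": 42}',
]

schema_1 = extract_stable_fields(batch_1)
schema_2 = extract_stable_fields(batch_2)
```

Because the schema is derived per batch rather than declared up front, a field such as `humidity` becomes a column once it appears consistently, while a rare field such as `debug` does not.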

The solution extracts the common structure from data whose shape is stable in the short term and stores those fields as columns, which reduces storage and improves query speed. Long-term schema evolution is handled with adaptive parsing. Dirty data is handled by storing each value together with its type information in a binary format, and sparse data is handled by aggregating low-frequency fields into a special column.
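The dirty-data and sparse-data ideas can be sketched as follows. The one-byte type tags, the byte layout, and the helper names are invented for illustration and are not the AliORC encoding; the sparse column is represented here as a JSON string purely for readability.

```python
import json
import struct

# Hypothetical type tags for the illustrative binary layout.
TAG_INT, TAG_STR = 0, 1


def encode_value(value):
    """Encode a value as one type-tag byte followed by its payload.

    Integers are stored as signed 64-bit little-endian (larger values would
    raise struct.error); everything else is stored as UTF-8 text.
    """
    if isinstance(value, int) and not isinstance(value, bool):
        return struct.pack("<Bq", TAG_INT, value)
    text = value if isinstance(value, str) else json.dumps(value)
    return struct.pack("<B", TAG_STR) + text.encode("utf-8")


def decode_value(blob):
    """Recover the original value by dispatching on the stored type tag."""
    if blob[0] == TAG_INT:
        return struct.unpack_from("<q", blob, 1)[0]
    return blob[1:].decode("utf-8")


def split_row(record, dense_fields):
    """Split a record into typed dense cells and one aggregated sparse cell."""
    dense = {
        name: encode_value(record[name]) if name in record else None
        for name in dense_fields
    }
    sparse = {k: v for k, v in record.items() if k not in dense_fields}
    return dense, json.dumps(sparse, sort_keys=True)


# "age" is an int in one row and a string in the other (dirty data);
# "beta_flag" is a rarely seen field (sparse data).
rows = [{"age": 30}, {"age": "thirty", "beta_flag": 1}]
encoded = [split_row(r, dense_fields=["age"]) for r in rows]
decoded_ages = [decode_value(dense["age"]) for dense, _ in encoded]
sparse_cells = [sparse for _, sparse in encoded]
```

Keeping the type tag next to each value means a column can hold both `30` and `"thirty"` without failing ingestion, and the rare `beta_flag` field does not need a mostly empty column of its own.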

During query execution, the engine performs column pruning, detects the actual runtime type of each field, and inserts dynamic conversion operators as needed to unify types (e.g., converting strings or binaries to integers) before applying filters, thus achieving adaptive processing.
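The query-time behavior described above can be approximated with a small sketch: read only the referenced column, inspect the runtime types found in it, and add a conversion step only when the values are not already integers. The function names, the sample table, and the rule that unconvertible values are dropped by the filter are assumptions for illustration, not MaxCompute's actual operators.

```python
def to_int(value):
    """Illustrative conversion operator: coerce int, str, or bytes to int.

    Returns None when the value cannot be interpreted as an integer.
    """
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, bytes):
        try:
            value = value.decode("utf-8")
        except UnicodeDecodeError:
            return None
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return None
    return None


def scan_filter(table, column, predicate):
    """Return the row indices whose value in `column` satisfies `predicate`."""
    values = table[column]  # column pruning: other columns are never read
    runtime_types = {type(v) for v in values}
    if runtime_types != {int}:
        # Insert the conversion step only when mixed types are detected.
        values = [to_int(v) for v in values]
    return [i for i, v in enumerate(values) if v is not None and predicate(v)]


# Hypothetical table: "status" holds ints, strings, bytes, and one bad value.
table = {
    "status": [200, "404", b"500", "n/a"],
    "payload": ["...", "...", "...", "..."],
}
matches = scan_filter(table, "status", lambda v: v >= 400)
clean = scan_filter({"status": [200, 503]}, "status", lambda v: v >= 400)
```

When every value is already an integer, as in the `clean` case, no conversion runs at all, which is the point of deciding at runtime rather than casting unconditionally.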

Performance analysis shows that the columnar JSON approach yields near‑native columnar query speeds and significantly reduces storage compared with raw JSON strings, while still leaving room for further optimization (e.g., better handling of date types).

In conclusion, MaxCompute’s out‑of‑the‑box semi‑structured columnar solution provides high performance, low storage cost, and minimal operational overhead, while supporting dynamic schema evolution, dirty and sparse data handling, and adaptive query processing.
