
How MaxCompute Turns Semi‑Structured Data into High‑Performance Columnar Storage

This article explains the nature of semi‑structured data, compares schema‑on‑read and schema‑on‑write approaches, and shows how Alibaba Cloud MaxCompute leverages columnar storage and dynamic parsing to achieve low‑cost, high‑performance analytics for large‑scale data workloads.

Alibaba Cloud Big Data AI Platform

Sep 14, 2023


01 Overview of Semi‑Structured Data

Semi‑structured data lies between structured and unstructured data. Self‑describing formats such as JSON and XML carry their schema information inline, with field names and value types embedded in each record, which makes them easier to parse and extract from than raw unstructured data.
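As a small illustration of what "self‑describing" means, the sketch below recovers field names and value types from a JSON document without any external schema. The sample document and the `describe` helper are hypothetical and only serve to make the idea concrete.

```python
import json

# Hypothetical sample document; field names are made up for illustration.
DOC = '{"user": {"id": 42, "tags": ["a", "b"]}, "ok": true, "score": 1.5}'


def describe(value):
    """Map a parsed JSON value to the names of the Python types it contains."""
    if isinstance(value, dict):
        # Recurse into nested objects so the nesting itself is preserved.
        return {key: describe(inner) for key, inner in value.items()}
    return type(value).__name__


SCHEMA = describe(json.loads(DOC))
print(SCHEMA)
```

Everything in `SCHEMA` comes from the document itself, which is the property the later sections rely on when extracting columns automatically.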

Unlike structured data, which requires predefined schemas and incurs high alteration costs, semi‑structured data offers flexibility and can be nested, providing a balance of adaptability and efficient access.

02 Traditional Solutions Comparison

Data warehouses handle semi‑structured data with one of two models. With schema on read, data is stored raw and parsed at query time, which is flexible but gives poor query performance. With schema on write, data is parsed and transformed during ingestion, which yields better storage efficiency and faster queries but less flexibility when schemas evolve.

Schema‑on‑read requires full table scans and decompression for each query, leading to high CPU usage and latency, whereas schema‑on‑write allows direct column access for known fields, improving performance.

However, schema‑on‑write assumes stable schemas; frequent upstream changes force costly table alterations.
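The trade‑off between the two models can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular warehouse implements either model: the records, field names, and helper functions are all hypothetical.

```python
import json

# Hypothetical raw records, as they might arrive from an upstream log producer.
RAW_ROWS = [
    '{"user": "a", "latency_ms": 12, "region": "eu"}',
    '{"user": "b", "latency_ms": 30, "region": "us"}',
    '{"user": "c", "latency_ms": 7, "region": "eu", "trace_id": "t-1"}',
]


def schema_on_read_sum(rows, field):
    """Parse every raw row at query time, even though only one field is needed."""
    total = 0
    for row in rows:
        total += json.loads(row)[field]
    return total


def ingest_schema_on_write(rows, fields):
    """Parse each row once at ingestion and keep one list per known field."""
    columns = {name: [] for name in fields}
    for row in rows:
        record = json.loads(row)
        for name in fields:
            columns[name].append(record.get(name))
    return columns


def schema_on_write_sum(columns, field):
    """Read a single column: no parsing, and other fields are never touched."""
    return sum(columns[field])


# The schema is fixed up front, so the new upstream field "trace_id" is dropped.
COLUMNS = ingest_schema_on_write(RAW_ROWS, ["user", "latency_ms", "region"])
```

Both query paths return the same sum, but the schema‑on‑write path does its parsing once at ingestion. The cost shows up in `COLUMNS`: the `trace_id` field added upstream is silently missing until the table definition is changed.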

03 MaxCompute Semi‑Structured Data Solution

MaxCompute is a serverless, enterprise‑grade cloud data warehouse that supports analytics at petabyte to exabyte scale. It ingests semi‑structured data (e.g., logs, IoT events) and provides built‑in import pipelines, real‑time monitoring, and downstream analytics.

By extracting schema during write and retaining dynamic parsing at read time, MaxCompute achieves low storage cost, high query performance, and flexibility for rapid upstream iteration.

04 Columnar Storage of Semi‑Structured Data

During short‑term ingestion windows, common fields across records are identified and column‑stored using AliORC, Alibaba’s high‑performance ORC‑compatible format, which naturally supports nested structures and enables column pruning.
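One way to picture the common‑field step is a frequency count over the records in a window. The sketch below is an assumption‑laden illustration: the 50% threshold, the dotted‑path flattening, and the function names are hypothetical and are not taken from MaxCompute or AliORC.

```python
from collections import Counter


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def common_fields(window, min_ratio=0.5):
    """Return the paths present in at least min_ratio of the window's records."""
    counts = Counter()
    for record in window:
        counts.update(flatten(record).keys())
    threshold = min_ratio * len(window)
    return sorted(path for path, seen in counts.items() if seen >= threshold)


# Hypothetical ingestion window of four records.
WINDOW = [
    {"id": 1, "device": {"os": "linux"}, "debug": True},
    {"id": 2, "device": {"os": "mac"}},
    {"id": 3, "device": {"os": "linux"}},
    {"id": 4, "device": {"os": "win"}, "note": "x"},
]

print(common_fields(WINDOW))
```

Here `id` and `device.os` appear in every record and would become columns, while `debug` and `note` each appear once and fall below the threshold.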

Dirty or sparse data is handled by storing each field’s raw binary value together with its type metadata, isolating corrupt entries without affecting columnar compression or query speed.
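The idea of keeping a raw value next to its type metadata can be sketched as a column of `(type_tag, raw_bytes)` cells. The tag names and the byte encoding below are assumptions chosen for readability; the actual on‑disk layout is not described in this article.

```python
def encode_cell(value):
    """Return (type_tag, raw_bytes) for one value; tags are illustrative only."""
    if value is None:
        return ("null", b"")
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return ("bool", b"\x01" if value else b"\x00")
    if isinstance(value, int):
        return ("int", str(value).encode("utf-8"))
    if isinstance(value, float):
        return ("float", repr(value).encode("utf-8"))
    return ("string", str(value).encode("utf-8"))


def decode_cell(cell):
    """Rebuild the original value from its tag and raw bytes."""
    tag, raw = cell
    if tag == "null":
        return None
    if tag == "bool":
        return raw == b"\x01"
    text = raw.decode("utf-8")
    if tag == "int":
        return int(text)
    if tag == "float":
        return float(text)
    return text


def mismatched_rows(column, expected_tag):
    """Indexes of cells whose tag differs from the expected one (nulls allowed)."""
    return [i for i, (tag, _) in enumerate(column) if tag not in (expected_tag, "null")]


# Hypothetical "status" column with one dirty entry and one missing value.
VALUES = [200, 404, "timeout", None]
COLUMN = [encode_cell(v) for v in VALUES]
```

Because each cell carries its own tag, the dirty `"timeout"` entry survives the round trip and can be located with `mismatched_rows(COLUMN, "int")` without failing the write of the well‑typed rows around it.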

Rare fields are aggregated into a special column to avoid column explosion, with on‑demand lookup when needed.
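A catch‑all column for rare fields might look like the following sketch. The column name `__rest__`, the JSON encoding of the overflow, and both helper functions are hypothetical; they only illustrate how the column count stays bounded while rare fields remain reachable.

```python
import json

OVERFLOW = "__rest__"  # hypothetical name for the catch-all column


def write_rows(records, common):
    """Store common fields as columns and pack everything else into OVERFLOW."""
    columns = {name: [] for name in common}
    columns[OVERFLOW] = []
    for record in records:
        for name in common:
            columns[name].append(record.get(name))
        rest = {k: v for k, v in record.items() if k not in common}
        columns[OVERFLOW].append(json.dumps(rest, sort_keys=True) if rest else None)
    return columns


def read_field(columns, name):
    """Read a dedicated column directly; parse the overflow only for rare fields."""
    if name != OVERFLOW and name in columns:
        return list(columns[name])
    values = []
    for blob in columns[OVERFLOW]:
        values.append(json.loads(blob).get(name) if blob else None)
    return values


# Hypothetical records: "trace" appears in only one of them.
RECORDS = [
    {"id": 1, "level": "info"},
    {"id": 2, "level": "warn", "trace": "t-9"},
    {"id": 3, "level": "info"},
]
TABLE = write_rows(RECORDS, ["id", "level"])
```

The table holds three columns regardless of how many distinct rare fields show up, and `read_field(TABLE, "trace")` pays the parsing cost only when such a field is actually requested.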

05 Adaptive Query Processing

MaxCompute’s engine builds a logical plan, inserts dynamic type‑conversion operators when column files contain heterogeneous types (e.g., int, string, binary), then performs column pruning and filtering, achieving near‑native columnar performance even with evolving schemas.
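The plan described above can be approximated in a toy form: each column file declares the type it stored for a field, and a cast is applied only where that type differs from what the query expects. The file layout, the `CASTS` table, and the `scan` and `query` functions are assumptions made for this sketch and do not reflect MaxCompute's internal operators.

```python
# Hypothetical column files: each maps a field to (stored_type, values).
FILES = [
    {
        "status": ("int", [200, 404]),
        "path": ("string", ["/a", "/b"]),
        "body": ("string", ["x", "y"]),
    },
    {
        "status": ("string", ["200", "500"]),  # upstream switched to strings
        "path": ("string", ["/c", "/d"]),
        "body": ("string", ["z", "w"]),
    },
]

# Conversions the sketch knows how to insert between stored and target types.
CASTS = {("string", "int"): int, ("int", "string"): str}


def scan(files, field, target_type):
    """Yield one field's values, casting only files whose stored type differs."""
    for column_file in files:
        stored_type, values = column_file[field]  # pruning: other fields unread
        if stored_type == target_type:
            yield from values
        else:
            cast = CASTS[(stored_type, target_type)]
            yield from (cast(value) for value in values)


def query(files, select, where_field, where_type, predicate):
    """Return `select` values for rows whose `where_field` passes `predicate`."""
    keys = scan(files, where_field, where_type)
    selected = scan(files, select, "string")
    return [value for key, value in zip(keys, selected) if predicate(key)]


ERROR_PATHS = query(FILES, "path", "status", "int", lambda status: status >= 400)
print(ERROR_PATHS)
```

Only `status` and `path` are read, the `body` column is never touched, and the cast runs only on the second file, where `status` was stored as a string.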

06 Benefits and Future Work

Benchmarks show that column‑stored JSON in MaxCompute delivers query times up to an order of magnitude faster, and lower storage usage, than raw JSON strings, approaching native columnar performance. Remaining work includes improving native type conversion for dates and further optimizing JSON parsing.

Overall, MaxCompute’s out‑of‑the‑box semi‑structured columnar solution requires no user‑side code changes, automatically extracts common schemas, handles dirty and sparse data, and delivers high‑performance analytics at reduced cost.


