How Structured Big Data Storage Powers Modern Data Systems

This article explores the core components of data systems, the evolution toward lightweight, intelligent big data architectures, the distinction between primary and secondary storage, challenges of data replication, and how Alibaba Cloud's Tablestore implements advanced features such as storage‑compute separation, CDC, and multi‑model indexing for scalable, cost‑effective structured big data storage.

Alibaba Cloud Developer

Sep 4, 2019

Preface

Every application relies on data processing, and data drives both business innovation and the shift toward intelligent systems. This article is written for data‑system engineers and architects and aims to inform their design choices.

Data System Architecture

Core Components

A typical data‑system architecture includes both application and data subsystems. The data side consists of several key components:

Relational Database : transaction‑oriented primary data store.

Cache : accelerates access to expensive query results.

Search Engine : provides complex condition queries and full‑text search.

Queue : decouples processing, enabling real‑time data exchange between upstream and downstream.

Unstructured Big Data Storage : stores massive files such as images or videos, supporting both online queries and offline analytics.

Structured Big Data Storage : bridges online and offline workloads, offering linear scalability for petabyte‑scale data.

Batch Compute : handles large‑scale offline analytics and interactive analysis.

Stream Compute : delivers low‑latency real‑time views.

Typical data system architecture diagram
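To make the Cache component above concrete, here is a minimal cache‑aside sketch. It is an illustration only: the `CacheAside` class, its TTL policy, and the `loader` callback (standing in for an expensive database or search query) are all assumptions, not something the article specifies.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, run the expensive loader and store the result."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # hypothetical expensive query function
        self._ttl = ttl_seconds    # how long a cached result stays valid
        self._clock = clock        # injectable clock, which makes expiry testable
        self._entries = {}         # key -> (expires_at, value)
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: skip the expensive query
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the underlying row changes."""
        self._entries.pop(key, None)
```

A caller would wrap the slow query once, for example `cache = CacheAside(run_report_query)`, and call `cache.get(report_id)` on the read path. Invalidating on write keeps stale results from outliving the data they were computed from.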

Primary vs. Secondary Storage

Primary storage receives writes directly from the application or from compute jobs and acts as the source of truth, so it often requires strong ACID guarantees. Secondary storage holds data derived from the primary store through synchronization or replication and is optimized for query, retrieval, and analysis.
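The relationship can be sketched in a few lines. In this toy example the order schema and the `build_customer_index` helper are hypothetical; the point is only that the secondary view is a pure function of the primary data, so it can be dropped and rebuilt.

```python
from collections import defaultdict

# Primary store: the source of truth, keyed by order id (hypothetical schema).
primary = {
    "o1": {"customer": "alice", "amount": 30},
    "o2": {"customer": "bob", "amount": 12},
    "o3": {"customer": "alice", "amount": 7},
}


def build_customer_index(rows):
    """Derive a query-optimized view: customer -> sorted list of order ids."""
    index = defaultdict(list)
    for order_id, row in rows.items():
        index[row["customer"]].append(order_id)
    return {customer: sorted(ids) for customer, ids in index.items()}


secondary = build_customer_index(primary)
print(secondary["alice"])  # ['o1', 'o3']

# Because the view is derived, losing it is not data loss: rebuild from primary.
rebuilt = build_customer_index(primary)
```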

Data Replication Techniques

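Application‑Level Multi‑Write : the simplest method, in which application code writes to both the primary and secondary stores. Because the two writes are not atomic, a failure between them leaves the stores inconsistent, and every additional secondary store adds another write path to the application, which hurts reliability and scalability.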

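Asynchronous Queue Replication : writes are decoupled through a message queue. The application either publishes every write to the queue and lets consumers update both stores, or writes the primary store synchronously and uses the queue only to update the secondary store asynchronously.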

Change Data Capture (CDC) : the primary store emits change logs that downstream stores consume; this approach offers the best decoupling but demands CDC support from the primary system.
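The CDC approach can be sketched as a primary store that appends every mutation to an ordered log, plus a consumer that replays the log from a checkpoint. This is a minimal in‑memory illustration under assumed names (`PrimaryStore`, `SecondaryReplica`, `changes_since`); a real system would persist the log and the checkpoint and handle retries.

```python
class PrimaryStore:
    """Toy primary store that records every mutation in an ordered change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # list of (op, key, value) in commit order

    def put(self, key, value):
        self.rows[key] = value
        self.log.append(("put", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append(("delete", key, None))

    def changes_since(self, offset):
        """Return all change records at or after the given log offset."""
        return self.log[offset:]


class SecondaryReplica:
    """Consumer that applies changes in log order and remembers its position."""

    def __init__(self):
        self.rows = {}
        self.checkpoint = 0  # next log offset to read

    def sync(self, primary):
        changes = primary.changes_since(self.checkpoint)
        for op, key, value in changes:
            if op == "put":
                self.rows[key] = value
            else:
                self.rows.pop(key, None)
        self.checkpoint += len(changes)
        return len(changes)


primary = PrimaryStore()
replica = SecondaryReplica()
primary.put("u1", {"city": "Hangzhou"})
primary.put("u2", {"city": "Beijing"})
primary.delete("u1")
replica.sync(primary)  # the application never writes to the replica directly
```

The application writes only to the primary store, so adding another downstream store means adding another consumer with its own checkpoint rather than changing application code.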

Storage Component Selection

Choosing storage involves balancing data models, query languages, cost, and performance for both online and offline workloads. Architects must consider whether a component serves as primary or secondary storage and ensure flexible data exchange channels for rapid iteration.

Open‑Source Structured Storage

Prominent open‑source options include HBase and Cassandra. HBase, built on HDFS with a wide‑column model, offers strong scalability and LSM‑based write performance but suffers from weak query capabilities, limited CDC support, high cost, operational complexity, and hotspot issues.
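The hotspot problem is easiest to see with time‑ordered row keys in a range‑partitioned table. The sketch below simulates it and shows one common mitigation, prefixing keys with a hash bucket ("salting"). The article does not prescribe this technique; the key format, bucket count, and split points are all assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter

NUM_BUCKETS = 4  # hypothetical number of salt buckets


def salted_key(row_key):
    """Prefix the key with a stable hash bucket so adjacent keys spread out."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return f"{digest[0] % NUM_BUCKETS:02d}|{row_key}"


def region_of(key, split_points):
    """Range partitioning: a key belongs to the region after the last split <= key."""
    return bisect.bisect_right(split_points, key)


# Time-ordered keys, as an event stream might produce (hypothetical format).
keys = [f"20190904-{i:06d}" for i in range(1000)]

# Splits chosen by date: all current writes fall into the newest region.
plain_splits = ["20190101", "20190401", "20190701"]
plain_load = Counter(region_of(k, plain_splits) for k in keys)

# Splits chosen by salt prefix: writes are spread over the buckets.
salt_splits = [f"{b:02d}" for b in range(1, NUM_BUCKETS)]
salted_load = Counter(region_of(salted_key(k), salt_splits) for k in keys)

print("plain :", dict(plain_load))
print("salted:", dict(salted_load))
```

The trade‑off is on the read side: a time‑range scan over salted keys has to fan out to every bucket and merge the results.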

Alibaba Cloud Tablestore

Tablestore is Alibaba Cloud’s structured big‑data storage service designed around the “derived data system” concept.

Storage‑Compute Separation : built on a distributed file system to achieve independent scaling of storage and compute.

LSM Engine : optimized for high‑throughput writes and data‑temperature layering.

Serverless Service Model : provides elastic compute resources that scale with workload, enabling true cost separation.

Multi‑Model Indexes : offers global secondary indexes, plus search indexes that support queries on arbitrary column combinations, full‑text search, and spatial queries.

CDC Tunnel Service : supports full and incremental data subscription in real time and integrates with Flink for stream processing.

Compute Ecosystem Compatibility : integrates with Alibaba Cloud compute services (MaxCompute, Data Lake Analytics) and open‑source engines (Flink, Spark) without data migration.

Lambda‑Plus Architecture : combines batch (Spark) and stream (Flink) processing on a single Tablestore master dataset, eliminating the need for dual writes.
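The LSM Engine item in the list above is the reason writes are cheap, and the idea fits in a short sketch. This is a generic, in‑memory illustration of a log‑structured merge tree, not Tablestore's actual engine: the `TinyLSM` class and its flush threshold are hypothetical, and it omits the write‑ahead log, on‑disk files, and data‑temperature tiering.

```python
import bisect

_TOMBSTONE = object()  # marker recording that a key was deleted


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}                 # mutable buffer that absorbs writes
        self.memtable_limit = memtable_limit
        self.runs = []                     # immutable sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value         # writes never touch existing runs
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, _TOMBSTONE)

    def flush(self):
        """Freeze the memtable into a sorted run of (keys, values)."""
        if not self.memtable:
            return
        keys = sorted(self.memtable)
        self.runs.insert(0, (keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def get(self, key, default=None):
        """Search newest data first: memtable, then runs from new to old."""
        if key in self.memtable:
            value = self.memtable[key]
            return default if value is _TOMBSTONE else value
        for keys, values in self.runs:
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return default if values[i] is _TOMBSTONE else values[i]
        return default

    def compact(self):
        """Merge all runs into one, keeping the newest value and dropping tombstones."""
        merged = {}
        for keys, values in reversed(self.runs):  # oldest first, newer overwrite
            merged.update(zip(keys, values))
        live = {k: v for k, v in merged.items() if v is not _TOMBSTONE}
        keys = sorted(live)
        self.runs = [(keys, [live[k] for k in keys])] if keys else []
```

Writes only append to the memtable or create new runs, which is why LSM engines sustain high write throughput; the cost is paid at read time (several runs may be consulted) and during background compaction.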

Conclusion

This article presented the essential components of data‑system architecture, discussed primary and secondary storage and replication strategies, evaluated open‑source options, and highlighted how Tablestore's design (storage‑compute separation, LSM engine, serverless model, multi‑model indexes, and CDC) addresses the key challenges of structured big‑data storage in modern cloud environments.


