How to Evolve Big Data Architectures for ZB‑Scale Analytics and Real‑World Use Cases

This article reviews the challenges of handling Zettabyte‑scale data, outlines practical big‑data processing architectures, discusses storage‑compute separation, data‑driven workflows, risk‑control frameworks, and shares concrete Spark implementations at MobTech, offering actionable insights for modern data engineers.

UCloud Tech

Dec 4, 2019

Big Data Processing Practices and Architecture Evolution

Global data generation is projected to reach 180 ZB by 2025, yet less than 10% of that data is effectively stored, used, and analyzed. The challenge is to extract valuable insights from ZB‑scale data to drive business growth. At UCan’s big‑data technical salon, five senior experts shared their explorations and practical applications.

Developers often struggle to choose the right big‑data framework for tasks such as aggregating billions of records. Options include HBase + Phoenix, Kudu + Impala, or Spark. UCloud big‑data engineer Liu Jingze discussed criteria for reducing operational costs while maintaining high performance.

Effective data analysis spans four layers: data sources (ingestion), storage, aggregation/computation, and application. These layers are supplemented by cross‑cutting scheduling, monitoring, permission, and metadata management.

A generic architecture was presented, featuring an OLTP SDK for backend interfaces, Flume for data collection, Kafka for streaming, Elasticsearch for modeling, and optional HDFS + Hive for cold storage. For large‑scale businesses, raw data may be retained in HDFS, while Hive provides a common cold‑storage solution.

When data volume grows, simple joins across multiple frameworks become inefficient. A wide‑table approach stores business data in MySQL or HBase, then uses Spark or Flink with asynchronous I/O to join dimension data, persisting the result back to HBase for fast access. Heavy‑weight metrics can be served directly via Phoenix, Impala, or Trafodion.
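The asynchronous dimension join described above can be sketched in plain Python. This is only an illustration of the pattern, not the speaker's Spark or Flink job: the in‑memory `DIM_USERS` table stands in for MySQL or HBase, and the names `lookup_dimension`, `enrich`, and `build_wide_rows` are hypothetical.

```python
import asyncio

# Hypothetical in-memory dimension table standing in for MySQL or HBase.
DIM_USERS = {1: {"city": "Shanghai"}, 2: {"city": "Beijing"}}


async def lookup_dimension(user_id):
    """Simulate a non-blocking lookup against an external dimension store."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return DIM_USERS.get(user_id, {})


async def enrich(event):
    """Join one fact record with its dimension attributes into a wide row."""
    dim = await lookup_dimension(event["user_id"])
    return {**event, **dim}


async def build_wide_rows(events):
    """Issue all lookups concurrently; gather keeps the input order."""
    return await asyncio.gather(*(enrich(e) for e in events))


events = [
    {"user_id": 1, "amount": 30},
    {"user_id": 2, "amount": 12},
    {"user_id": 3, "amount": 7},  # no matching dimension row
]
wide_rows = asyncio.run(build_wide_rows(events))
```

Because the lookups overlap instead of running one after another, the time spent waiting on the dimension store is shared across records. In a real pipeline the resulting wide rows would then be written back to the serving store.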

For extremely heavy workloads, detailed data from HBase can be pre‑aggregated in Spark/Flink and exposed to OLTP systems for backend services.
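As a rough illustration of pre‑aggregation, the sketch below rolls detail rows up into one summary row per day and app, so that a backend service reads a small table instead of scanning the detail data. The row layout and the `pre_aggregate` helper are assumptions made for this example. In practice the same grouping would run in Spark or Flink.

```python
from collections import defaultdict

# Hypothetical detail rows: (day, app, amount).
detail_rows = [
    ("2019-12-01", "app_a", 5),
    ("2019-12-01", "app_a", 7),
    ("2019-12-01", "app_b", 1),
    ("2019-12-02", "app_a", 2),
]


def pre_aggregate(rows):
    """Collapse detail rows into one summary row per (day, app) key."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0})
    for day, app, amount in rows:
        totals[(day, app)]["count"] += 1
        totals[(day, app)]["amount"] += amount
    return dict(totals)


summary = pre_aggregate(detail_rows)
```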

Storage‑Compute Separation and Data Abstraction

Early big‑data clusters combined storage and compute in a single data center, leading to high network overhead and underutilized resources. BLUECITY’s big‑data director Liu Baoliang proposed a storage‑compute separation architecture, emphasizing elastic scaling, tiered data handling, and rapid cluster provisioning.

The core consists of separate storage clusters (e.g., HDFS) and compute clusters (e.g., Spark, Flink), connected by a non‑blocking network. This design improves flexibility, simplifies hardware upgrades, and enhances performance.

Data‑Driven Methodology: From Collection to Feedback

Data‑driven practice involves four closed‑loop steps: data collection, modeling, analysis, and feedback. Fu Lili, co‑founder and chief architect of Sensors Analytics, emphasized unified data ingestion APIs, SDKs, and server‑side tools to standardize collection.

Modeling splits data into events, users, and entities, supporting user segmentation by attributes such as age, gender, location, and device. ETL challenges are mitigated by building common scheduling, computation, quality, and metadata management pipelines.
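A minimal sketch of the event and user model follows, assuming a deliberately small set of fields. The class and function names are illustrative and are not the Sensors Analytics schema. The entity type is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    age: int
    gender: str
    city: str
    device: str


@dataclass(frozen=True)
class Event:
    user_id: str
    name: str
    timestamp: int


def segment(users, **criteria):
    """Return the users whose attributes equal every given criterion."""
    return [
        u for u in users
        if all(getattr(u, field) == value for field, value in criteria.items())
    ]


def events_for(segment_users, events, name):
    """Return the named events performed by users in the segment."""
    ids = {u.user_id for u in segment_users}
    return [e for e in events if e.user_id in ids and e.name == name]


users = [
    User("u1", 25, "f", "Beijing", "ios"),
    User("u2", 31, "m", "Beijing", "android"),
    User("u3", 25, "f", "Shanghai", "ios"),
]
events = [
    Event("u1", "purchase", 1575417600),
    Event("u2", "purchase", 1575417700),
    Event("u1", "view", 1575417800),
]
beijing_women = segment(users, gender="f", city="Beijing")
purchases = events_for(beijing_women, events, "purchase")
```

Keeping users and events as separate record types means a segment defined on user attributes can be reused across any event analysis.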

Analysis can follow two paths: routine reporting for basic metrics or abstract models that empower users to self‑serve data. Flexibility, interpretability, and architectural simplicity are critical.

Business Risk Control in the Digital Era

Digital businesses face increasing black‑market threats. Liang Kun, CTO of Shumei Technology, identified three weaknesses in existing defenses: thin protection that relies on blacklists and simple rules, delayed response caused by offline T+1 mining, and slow evolution due to a lack of self‑learning.

A full‑stack risk‑control system should include deployment, strategy, profiling, and operation layers. Decoupling risk control from business logic allows rules to be updated independently. The architecture comprises a multi‑scenario strategy system, a real‑time risk engine, and a risk‑profile network, deployed across seven global cloud clusters that handle 30 billion daily requests with peak QPS above 100k.
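One way to picture the decoupling of rules from business logic is to keep the rules as data and evaluate them in a generic engine, so a rule change does not require a business‑code release. The sketch below is an assumption‑laden toy: the rule fields, the operator names, the scores, and the `block_at` threshold are all invented for illustration and do not describe Shumei's engine.

```python
# Hypothetical rules expressed as data, editable without touching business code.
RULES = [
    {"name": "blacklisted_device", "field": "device_id",
     "op": "in", "value": {"dev-666"}, "score": 100},
    {"name": "burst_requests", "field": "requests_per_minute",
     "op": "gt", "value": 60, "score": 40},
]

# Operators the engine understands; new ones can be registered here.
OPS = {
    "gt": lambda actual, expected: actual > expected,
    "in": lambda actual, expected: actual in expected,
}


def evaluate(request, rules, block_at=80):
    """Score a request against every rule and return a decision."""
    hits = [
        rule for rule in rules
        if rule["field"] in request
        and OPS[rule["op"]](request[rule["field"]], rule["value"])
    ]
    score = sum(rule["score"] for rule in hits)
    return {
        "score": score,
        "hits": [rule["name"] for rule in hits],
        "decision": "block" if score >= block_at else "pass",
    }


blocked = evaluate({"device_id": "dev-666", "requests_per_minute": 3}, RULES)
passed = evaluate({"device_id": "dev-1", "requests_per_minute": 100}, RULES)
```

The business service only calls `evaluate` and acts on the decision. Strategy changes happen by editing `RULES`.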

Spark in MobTech: Practical Cases

MobTech, a global data‑intelligence platform with 120 billion devices and 500k apps, faces the challenges of massive YARN/Spark workloads, large data volumes, and high resource consumption.

Two Spark case studies were presented:

The first case applies dynamic partition pruning with Bloom filters to discard irrelevant rows before joining a 2‑billion‑row table with a 10‑million‑row table. The SQL example:

SELECT /*+ bloomfilter(b.id) */ a.*, b.* FROM a JOIN b ON a.id = b.id

A Bloom filter never produces false negatives: if it reports a key as absent, the key is definitely not in the set. A positive result, however, may be a false positive. The filter trades some accuracy for a much smaller memory footprint.
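The behaviour described above can be shown with a small pure‑Python Bloom filter used as a pre‑filter before a join. This is a sketch under stated assumptions, not MobTech's Spark implementation: the `BloomFilter` class, its sizing parameters, and the toy tables are hypothetical, and the filter is assumed to be built from the smaller table's join keys.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several seeded hash functions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive one bit position per seed from a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# Build the filter from the small table's join keys.
small_ids = range(0, 1000, 10)
bloom = BloomFilter()
for key in small_ids:
    bloom.add(key)

# Pre-filter the large table, then run the exact join on the survivors.
large_rows = [{"id": i} for i in range(5000)]
candidates = [row for row in large_rows if bloom.might_contain(row["id"])]
small_set = set(small_ids)
joined = [row for row in candidates if row["id"] in small_set]
```

The exact join still runs on `candidates`, so any false positives that pass the filter are removed there. The filter only reduces how many rows of the large table reach the join.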

The second case retrieves and computes over trillion‑scale data for 4,000+ tags spanning two years. The solution combines horizontal integration (consolidating tag tables) with vertical integration (weekly/monthly aggregation), and creates an index table that links dates to data IDs to accelerate queries.
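The idea of weekly aggregation plus a date‑to‑data‑ID index can be sketched as follows. The article does not give the layout of the index table, so everything here is an assumption: daily tag counts are rolled up into ISO‑week blocks, and a hypothetical index maps each date to the ID of the block that holds it, so a date‑range query only opens the blocks it needs.

```python
from collections import defaultdict
from datetime import date


def week_block_id(day):
    """Name the weekly block that holds a given date, e.g. '2019-W49'."""
    iso = day.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def build_blocks_and_index(daily_rows):
    """Roll daily tag counts into weekly blocks and index dates to blocks."""
    blocks = defaultdict(lambda: defaultdict(int))  # block_id -> tag -> count
    index = {}                                      # date -> block_id
    for day, tag, count in daily_rows:
        block_id = week_block_id(day)
        blocks[block_id][tag] += count
        index[day] = block_id
    return blocks, index


def blocks_for_range(index, start, end):
    """Return the IDs of the weekly blocks touched by a date range."""
    return sorted({bid for day, bid in index.items() if start <= day <= end})


daily_rows = [
    (date(2019, 12, 2), "sports", 3),
    (date(2019, 12, 3), "sports", 2),
    (date(2019, 12, 3), "travel", 1),
    (date(2019, 12, 9), "sports", 4),
]
blocks, index = build_blocks_and_index(daily_rows)
needed = blocks_for_range(index, date(2019, 12, 2), date(2019, 12, 8))
```

A query over the first week of the sample data resolves to a single block ID through the index and reads the pre‑summed counts from that block, rather than scanning every daily row.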
