
Big Data Architecture Secrets: Storage-Compute Separation & Spark in Action

This article explores how enterprises can tackle the explosive growth of data by adopting modern big‑data architectures, including storage‑compute separation, data‑driven workflows, risk‑control frameworks, and real‑world Spark optimizations, offering practical guidance for scalable, high‑performance analytics.

Programmer DD

Big Data Business Normalization: Processing Methods and Architecture Evolution

An estimated 180 ZB of data will be generated worldwide by 2025, yet less than ten percent of it is effectively stored, used, and analyzed. Experts discussed how to choose an appropriate big‑data framework (such as HBase + Phoenix, Kudu + Impala, or Spark) for aggregating billions of records while minimizing operational costs and maximizing performance.

The complete big‑data stack is divided into data ingestion, storage, computation, and application layers, complemented by task scheduling, cluster monitoring, permission management, and metadata management.

Storage‑Compute Separation and Data Abstraction Practice

Traditional monolithic clusters that couple storage and compute lead to resource waste and operational complexity. The proposed storage‑compute separation architecture introduces independent storage clusters (e.g., HDFS + Hive for cold data) and compute clusters (e.g., Spark, Flink) connected via a non‑blocking network, enabling elastic scaling and easier hardware upgrades.
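The resource-waste argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the workload figures and node sizes are hypothetical, and it ignores the network cost that a separated design pays for remote reads.

```python
import math


def coupled_plan(storage_tb, cores, node_tb, node_cores):
    """Size a cluster whose nodes each carry both disk and CPU.

    The node count is driven by whichever resource needs more nodes,
    so the other resource is over-provisioned.
    """
    nodes = max(math.ceil(storage_tb / node_tb), math.ceil(cores / node_cores))
    return {
        "nodes": nodes,
        "idle_cores": nodes * node_cores - cores,
        "idle_tb": nodes * node_tb - storage_tb,
    }


def separated_plan(storage_tb, cores, node_tb, node_cores):
    """Size storage and compute clusters independently of each other."""
    storage_nodes = math.ceil(storage_tb / node_tb)
    compute_nodes = math.ceil(cores / node_cores)
    return {
        "storage_nodes": storage_nodes,
        "compute_nodes": compute_nodes,
        "idle_cores": compute_nodes * node_cores - cores,
        "idle_tb": storage_nodes * node_tb - storage_tb,
    }


# Hypothetical workload: 2000 TB of data, 1600 cores of sustained compute,
# nodes offering 20 TB of disk and/or 32 cores.
print(coupled_plan(2000, 1600, node_tb=20, node_cores=32))
print(separated_plan(2000, 1600, node_tb=20, node_cores=32))
```

With these assumed numbers the coupled cluster needs 100 nodes to hold the data and leaves 1600 cores idle, while the separated layout buys only the 50 compute nodes the workload actually needs.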

Data‑Driven: From Method to Practice

Data‑driven decision making follows four closed‑loop steps: data collection, data modeling, data analysis, and data feedback. A unified data ingestion API, consistent SDKs, and centralized processing pipelines are recommended to ensure clean, reusable data for downstream modeling and analytics.

Modeling involves defining events, users, and entities, with optional user segmentation. ETL challenges are mitigated by building generic scheduling, computation, quality, and metadata management components.
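As a rough illustration of the collection and modeling steps, the sketch below validates raw records at a single ingestion entry point, maps them onto an event model, and derives a simple user segment. The field names, the `parse_event` and `segment_users` helpers, and the sample events are all assumptions made for this example, not part of any SDK mentioned in the talks.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Fields every ingested record must carry (an assumed minimal schema).
REQUIRED_FIELDS = ("user_id", "name", "timestamp")


@dataclass
class Event:
    user_id: str
    name: str                 # what happened, e.g. "view_item"
    timestamp: int            # epoch seconds
    properties: dict = field(default_factory=dict)


def parse_event(raw: dict) -> Event:
    """Single ingestion entry point: reject malformed records early."""
    missing = [f for f in REQUIRED_FIELDS if f not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Everything outside the required schema is kept as free-form properties.
    props = {k: v for k, v in raw.items() if k not in REQUIRED_FIELDS}
    return Event(str(raw["user_id"]), str(raw["name"]), int(raw["timestamp"]), props)


def segment_users(events, event_name, min_count):
    """Users who triggered `event_name` at least `min_count` times."""
    counts = defaultdict(int)
    for event in events:
        if event.name == event_name:
            counts[event.user_id] += 1
    return {user for user, count in counts.items() if count >= min_count}


raw_records = [
    {"user_id": "u1", "name": "pay_order", "timestamp": 1700000000, "amount": 30},
    {"user_id": "u1", "name": "pay_order", "timestamp": 1700000600, "amount": 12},
    {"user_id": "u2", "name": "view_item", "timestamp": 1700000900},
]
events = [parse_event(r) for r in raw_records]
print(segment_users(events, "pay_order", min_count=2))
```

Validating at the entry point is what keeps downstream modeling reusable: every consumer can rely on the same three fields being present and typed.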

Analysis can rely on routine reporting for static metrics or on abstract models that empower end‑users to explore data flexibly, emphasizing interpretability and architectural simplicity.

Business Risk Control in the Digital Era

Modern enterprises face sophisticated black‑market threats that expose weaknesses in traditional rule‑based, offline, and slow‑to‑evolve risk controls. A full‑stack risk‑control system should comprise a deployment layer, strategy layer, profiling layer, and operation layer, all decoupled from core business services.

The architecture includes multi‑scenario policy engines, real‑time risk assessment platforms, and risk‑profile networks, deployed across seven global cloud clusters handling up to 30 billion daily requests and peak QPS exceeding 100 k.
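To show how a strategy layer and a profiling layer can stay decoupled from business code, here is a minimal sketch of a multi-scenario policy engine. The `Rule` and `PolicyEngine` names, the scores, the thresholds, and the example rules are hypothetical and chosen only for illustration; the production system described above is certainly more elaborate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    predicate: Callable[[dict, dict], bool]  # (request, profile) -> rule hit?
    score: int                               # risk points added on a hit


class PolicyEngine:
    """Strategy layer: rules are grouped per scenario and can change
    without touching the business service that calls assess()."""

    def __init__(self, rules_by_scenario, block_threshold=100):
        self.rules_by_scenario = rules_by_scenario
        self.block_threshold = block_threshold

    def assess(self, scenario, request, profile):
        rules = self.rules_by_scenario.get(scenario, [])
        hits = [r for r in rules if r.predicate(request, profile)]
        score = sum(r.score for r in hits)
        if score >= self.block_threshold:
            decision = "block"
        elif score > 0:
            decision = "review"
        else:
            decision = "pass"
        return {"decision": decision, "score": score, "hits": [r.name for r in hits]}


# Assumed example rules for a "login" scenario.
engine = PolicyEngine({
    "login": [
        Rule("new_device",
             lambda req, prof: req["device_id"] not in prof["known_devices"], 40),
        Rule("burst_logins",
             lambda req, prof: prof["logins_last_minute"] > 5, 70),
    ],
})

# The profile dict stands in for data served by the profiling layer.
profile = {"known_devices": {"d1"}, "logins_last_minute": 8}
print(engine.assess("login", {"device_id": "d9"}, profile))
```

The business service only passes a scenario name and a request; which rules fire and how they are scored stays entirely inside the engine.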

Spark in MobTech: Practical Sharing

MobTech processes petabyte‑scale data on YARN/Spark clusters, where jobs suffer from high resource consumption and long execution times. Two case studies were presented:

Dynamic partition pruning using Bloom filters to eliminate irrelevant rows before join operations.

Index‑driven retrieval for billions of records across thousands of tags, combining horizontal (tag‑level) and vertical (time‑level) data consolidation.

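Both approaches were reported to cut I/O and computation costs substantially while preserving query accuracy. In the Bloom filter case accuracy is preserved because a Bloom filter can yield false positives but never false negatives, so the join itself still discards any non-matching rows that slip through.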

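The Bloom filter case was expressed with a query hint of the following form. Note that bloomfilter is not among the join hints documented for stock Apache Spark, so it is presumably a MobTech extension:

SELECT /*+ bloomfilter(b.id) */ a.*, b.* FROM a JOIN b ON a.id = b.id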
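The idea behind the hinted join can be sketched in plain Python: build a Bloom filter over the join keys of the smaller side, use it to drop rows of the larger side that cannot match, and only then run the real join. This is a simplified, single-process illustration, not MobTech's Spark implementation; the `BloomFilter` class, its sizing, and the sample rows are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_prefiltered_join(a_rows, b_rows, key="id"):
    """Inner join of a_rows and b_rows, skipping a-rows the filter rules out."""
    bloom = BloomFilter()
    for row in b_rows:
        bloom.add(row[key])

    # Cheap pre-filter: most non-matching rows never reach the join.
    candidates = [row for row in a_rows if bloom.might_contain(row[key])]

    # Exact hash join on the survivors removes any false positives.
    b_by_key = {}
    for row in b_rows:
        b_by_key.setdefault(row[key], []).append(row)
    joined = [(a, b) for a in candidates for b in b_by_key.get(a[key], [])]
    return joined, len(candidates)


a = [{"id": i, "payload": f"a{i}"} for i in range(1000)]
b = [{"id": i, "tag": f"b{i}"} for i in (10, 20, 30)]
joined, scanned = bloom_prefiltered_join(a, b)
print(len(joined), "joined rows;", scanned, "of", len(a), "rows reached the join")
```

In a distributed engine the payoff comes from building the filter once, broadcasting it, and applying it before the shuffle, so the rows it rejects are never moved across the network.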

By integrating incremental indexing via UDAFs and consolidating daily tables into weekly/monthly aggregates, MobTech achieved efficient cold‑data access for long‑term tag back‑tracking.
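One way to picture the consolidation is a per-user, per-tag bitmask in which each bit marks a day on which the tag was present. The sketch below folds daily tables into such a weekly index one day at a time, loosely mirroring the update step of a UDAF. The table layout and the `add_day`, `consolidate_week`, and `days_with_tag` helpers are assumptions made for illustration; the source does not describe MobTech's actual index format.

```python
def add_day(index, day_offset, daily_table):
    """Fold one daily table into the index (comparable to a UDAF update step).

    daily_table maps user_id -> set of tags observed that day.
    index maps user_id -> {tag: bitmask of days on which the tag was seen}.
    """
    for user_id, tags in daily_table.items():
        user_tags = index.setdefault(user_id, {})
        for tag in tags:
            user_tags[tag] = user_tags.get(tag, 0) | (1 << day_offset)
    return index


def consolidate_week(daily_tables):
    """Merge up to seven daily tables into a single weekly index."""
    index = {}
    for offset, table in enumerate(daily_tables):
        add_day(index, offset, table)
    return index


def days_with_tag(index, user_id, tag, num_days=7):
    """Back-track a tag: on which days of the week did the user carry it?"""
    mask = index.get(user_id, {}).get(tag, 0)
    return [day for day in range(num_days) if (mask >> day) & 1]


# Hypothetical daily tag tables for three consecutive days.
daily_tables = [
    {"u1": {"sports"}},
    {"u1": {"sports", "travel"}, "u2": {"travel"}},
    {"u2": {"travel"}},
]
weekly = consolidate_week(daily_tables)
print(days_with_tag(weekly, "u1", "sports"))  # [0, 1]
print(days_with_tag(weekly, "u2", "travel"))  # [1, 2]
```

A back-tracking query then reads one consolidated row per user instead of scanning every daily table, which is where the I/O saving on cold data would come from.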

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
