Big Data Platform Architecture: Expert Insights on Components, Challenges, and Trends
An expert interview series examines the architecture of big data platforms, detailing core modules such as data integration, storage, computation, scheduling, and query analysis, while highlighting current challenges, best‑practice tools, and future trends like cloud‑native, object storage, and real‑time processing.
01 Big Data Platform Architecture
The article introduces the overall view of a big data platform, dividing it into core modules: data integration, storage & computation, distributed scheduling, and query analysis.
02 Data Integration
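1. Log Synchronization – The open‑source collection tools named here are Sqoop, Flume, Logstash, Filebeat, and Vector. Flume, Logstash, Filebeat, and Vector are log collectors; Sqoop is a bulk transfer tool between relational databases and Hadoop rather than a log collector. The interviewees describe Flume as popular in cloud‑native scenarios.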
Experts note that log synchronization must sustain large volumes and keep output flowing continuously, with buffering so that a slow or unavailable downstream does not cause data loss.
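The buffering requirement can be illustrated with a minimal sketch. It is not how Flume, Filebeat, or Vector are implemented; `ship_logs` and its parameters are hypothetical names, and the bounded in-memory queue stands in for whatever channel a real collector uses. The point it shows is that a full buffer blocks the producer (backpressure) instead of dropping lines.

```python
import queue
import threading


def ship_logs(lines, sink, buffer_size=4):
    """Move log lines from a producer to a sink through a bounded buffer."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded: put() blocks when full
    done = object()  # sentinel marking the end of input

    def collector():
        for line in lines:
            buf.put(line)  # waits for free space instead of dropping the line
        buf.put(done)

    def shipper():
        while True:
            item = buf.get()
            if item is done:
                break
            sink.append(item)  # stand-in for a write to Kafka, HDFS, etc.

    threads = [threading.Thread(target=collector),
               threading.Thread(target=shipper)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

With one collector and one shipper, every line arrives exactly once and in order, even though the buffer holds only a few lines at a time. A production collector would typically add a disk-backed buffer so that data also survives a process restart.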
2. Data Extraction Tools – Tools such as DataX and BitSail extract data from heterogeneous sources into analysis stores like HDFS.
Data integration is critical; slow or unreliable pipelines erode trust in the platform.
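Extraction from heterogeneous sources is usually organized as pluggable readers feeding a common writer. The sketch below illustrates that pattern only; it does not use DataX or BitSail APIs, the function names are hypothetical, and a JSON-lines stream stands in for a target such as HDFS.

```python
import csv
import io
import json
import sqlite3


def read_csv(text):
    """Reader for CSV text: yields one dict per row (all values are strings)."""
    yield from csv.DictReader(io.StringIO(text))


def read_sqlite(conn, query):
    """Reader for a SQLite connection: yields one dict per result row."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    for row in conn.execute(query):
        yield dict(row)


def write_json_lines(records, out):
    """Writer: serializes any record stream to JSON lines; returns the row count."""
    count = 0
    for record in records:
        out.write(json.dumps(record, sort_keys=True) + "\n")
        count += 1
    return count
```

Because every reader yields plain dicts, adding a new source means writing one more reader, and the writer stays unchanged.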
3. Data Transfer Queues – Common queues: Kafka (streaming), RabbitMQ (queue), Pulsar (stream + queue).
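Kafka is well‑known but, in the experts' view, less user‑friendly; Pulsar offers a more advanced architecture that separates its serving layer (brokers) from its storage layer (Apache BookKeeper).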
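The difference between "queue" and "stream" semantics can be sketched with a toy in-memory model. This is not the client API of Kafka, RabbitMQ, or Pulsar; `WorkQueue` and `StreamLog` are hypothetical classes that only illustrate the two delivery models.

```python
from collections import deque


class WorkQueue:
    """Queue semantics: a message goes to one consumer and is then gone."""

    def __init__(self):
        self._items = deque()

    def publish(self, msg):
        self._items.append(msg)

    def consume(self):
        return self._items.popleft() if self._items else None


class StreamLog:
    """Stream semantics: an append-only log; each consumer group keeps its own offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # group name -> next position to read

    def publish(self, msg):
        self._log.append(msg)

    def poll(self, group):
        pos = self._offsets.get(group, 0)
        if pos >= len(self._log):
            return None
        self._offsets[group] = pos + 1
        return self._log[pos]

    def seek(self, group, offset):
        self._offsets[group] = offset  # rewinding allows replay
```

In the stream model, two groups can each read the same message and a group can rewind to replay history; in the queue model, a consumed message is no longer available. A system described as "stream + queue" exposes both behaviors.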
03 Data Processing: Storage & Computation
1. Data Storage – HDFS – HDFS provides horizontal scalability and high fault tolerance.
Optimizing HDFS is vital; large clusters can suffer latency under heavy load. Emerging architectures like JuiceFS separate data and metadata to improve performance.
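Separating metadata from data can be pictured with a toy file system. This is a sketch under simplifying assumptions, not JuiceFS's actual design or API: `TinyFS` is a hypothetical class in which one dict stands in for a metadata engine and another for an object store.

```python
class TinyFS:
    """Toy file system that keeps metadata and data in separate stores."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.meta = {}     # path -> list of chunk keys (metadata engine stand-in)
        self.objects = {}  # chunk key -> bytes (object store stand-in)
        self._next_id = 0

    def write(self, path, data):
        keys = []
        for start in range(0, len(data), self.chunk_size):
            key = f"chunk-{self._next_id}"
            self._next_id += 1
            self.objects[key] = data[start:start + self.chunk_size]
            keys.append(key)
        self.meta[path] = keys

    def read(self, path):
        return b"".join(self.objects[key] for key in self.meta[path])

    def rename(self, old, new):
        # Metadata-only operation: no data chunk is copied or rewritten.
        self.meta[new] = self.meta.pop(old)
```

Operations such as rename or listing touch only the small metadata store, which is one reason this split can relieve pressure on a single, heavily loaded namespace service.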
2. Data Computation
(1) Batch engines – MapReduce, Hive, Spark. Flink, listed with the real‑time engines, supports both batch and streaming.
Hive is reliable but batch‑only; Spark is fast and supports near‑real‑time processing.
Experts see Spark + data lake as the future, but note that no single engine yet handles both batch and streaming efficiently.
(2) Real‑time engines – Storm, Spark Streaming, Flink. Flink is currently the most widely adopted for streaming.
Flink excels at real‑time computation but is weaker for large‑scale batch workloads; stability and latency improvements remain challenges.
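The batch versus streaming distinction can be made concrete with a small sketch in plain Python. It does not use Spark or Flink APIs; the function names are hypothetical, and events are assumed to be `(timestamp, key)` pairs arriving in timestamp order.

```python
from collections import Counter


def batch_count(events):
    """Batch: the whole bounded dataset is available before computing."""
    return Counter(key for _, key in events)


def tumbling_window_counts(events, window_size):
    """Streaming: consume events one at a time and emit a result
    each time a fixed-size (tumbling) window closes."""
    current_window = None
    counts = Counter()
    for ts, key in events:
        window = ts // window_size
        if current_window is not None and window != current_window:
            yield current_window * window_size, dict(counts)
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window * window_size, dict(counts)  # flush the last window
```

The batch function returns one answer after seeing everything, while the streaming function produces partial answers as time advances. Real streaming engines additionally have to deal with out-of-order events, state checkpointing, and failure recovery, which is where the stability and latency challenges mentioned above arise.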
04 Data Scheduling
1. Task Scheduling Systems – Crontab, Apache Airflow, Oozie, Azkaban, Kettle, XXL‑JOB, Apache DolphinScheduler, SeaTunnel, etc.
DolphinScheduler is Chinese‑friendly and suited for big‑data scenarios; Airflow is popular internationally.
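At their core, workflow schedulers such as those listed above run tasks in dependency order. The following sketch shows that idea with Python's standard `graphlib` module (Python 3.9+); it is not Airflow or DolphinScheduler code, and `run_workflow` and the task names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, tasks):
    """Run tasks in dependency order.

    dependencies maps a task name to the set of task names it depends on;
    tasks maps a task name to a zero-argument callable.
    """
    order = []
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Everything returned by get_ready() could run in parallel;
        # here it runs sequentially, sorted for a deterministic order.
        for name in sorted(sorter.get_ready()):
            tasks[name]()
            order.append(name)
            sorter.done(name)  # unblocks downstream tasks
    return order
```

For example, with `extract` feeding both `clean` and `dims`, `clean` feeding `aggregate`, and `aggregate` feeding `report`, the function runs `extract` first and `report` last. Production schedulers add what this sketch omits: time-based triggers, retries, distributed workers, and alerting.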
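2. Resource Scheduling Systems – Yarn and Azkaban are named. Yarn is the widely used resource manager; Azkaban, which is primarily a workflow scheduler and also appears in the task scheduling list above, is mentioned as a niche alternative.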
05 Big Data Query
1. OLAP Engines – Comparison of Presto, StarRocks, Impala.
StarRocks delivers the highest performance but consumes more CPU/memory; Impala can approach StarRocks after tuning; Presto is easy to use but slower.
2. Query Optimization Tools – Alluxio, JuiceFS, JindoFS.
Alluxio offers universal data orchestration; JuiceFS provides similar features optimized for cloud storage; JindoFS is limited to Alibaba Cloud OSS.
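The acceleration these layers provide rests largely on caching hot data close to the compute engine. The sketch below shows a read-through LRU cache as an illustration of that idea only; it is not the API of Alluxio, JuiceFS, or JindoFS, and `ReadThroughCache` and `backend_read` are hypothetical names.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve repeated reads locally; fall back to the slow backend on a miss."""

    def __init__(self, backend_read, capacity):
        self.backend_read = backend_read  # callable: key -> value (e.g. a remote read)
        self.capacity = capacity
        self._cache = OrderedDict()       # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self.backend_read(key)
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value
```

Repeated queries over the same partitions then hit local memory or disk instead of remote storage, which matters most when compute and storage are separated, as with object storage.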
06 Future Trends of Big Data Platform Architecture
Experts discuss four main trends:
OLAP will focus on faster computation, elastic scaling, and cloud‑native designs.
Object storage adoption grows for cost‑effectiveness and data‑lake use cases.
Cloud‑native architectures will improve elasticity, but stability and integration remain challenges.
Real‑time computing (e.g., Flink) continues to evolve with performance and reliability improvements.
Interviewees: Zhang Yaodong (Xiaomi), Zhu Jianghua (NetEase), Fan Yuchen (NetEase) – all senior engineers in big‑data platforms.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
