Author

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies

290 Articles · 1.1k Views

Latest from Big Data Technology Architecture

100 recent articles max

Big Data Technology Architecture

Sep 28, 2021 · Big Data

Integrating Apache Kyuubi with CDH 6 and Spark 3: Deployment, Configuration, and Performance Tuning

This guide explains how to deploy Apache Kyuubi on a CDH 6 cluster, replace HiveServer2 with Kyuubi, integrate Spark 3, apply necessary patches, configure environment and Spark settings, and optimize engine sharing for various workloads, providing complete code snippets and step‑by‑step instructions.

CDH · HiveServer2 · Kyuubi

0 likes · 19 min read

Big Data Technology Architecture

Sep 17, 2021 · Big Data

Real‑time Computing Platform Architecture, Flink Migration, and One‑stop Platform at 58.com

This article details the design and implementation of 58.com’s real‑time computing platform, covering its architecture, data ingestion, storage, Flink‑based stream processing, SQL extensions, performance optimizations, Storm‑to‑Flink migration tools, the Wstream management console, state handling, monitoring, and future roadmap.

Flink · Real-Time Computing · Storm Migration

0 likes · 16 min read

Big Data Technology Architecture

Aug 31, 2021 · Big Data

Real-time CDC Data Read/Write Solutions in Data Lake Architecture with Flink and Iceberg

This article, compiled by community volunteers, examines several approaches to reading and writing CDC data in real time in data lake architectures. It compares offline HBase, Apache Kudu, Hive, and Spark + Delta, and ultimately advocates Flink + Iceberg for efficient, correct, and scalable streaming ingestion and analytics.

CDC · Flink · Iceberg

0 likes · 18 min read

Big Data Technology Architecture

Aug 24, 2021 · Big Data

An Overview of Apache Parquet: Architecture, Storage Model, and Comparison with ORC

This article provides a comprehensive introduction to Apache Parquet, covering its origins, columnar storage advantages, nested schema support, internal architecture, storage model components, comparison with ORC, and practical tools for inspecting Parquet files.

Hadoop · ORC Comparison · columnar storage

0 likes · 10 min read

Big Data Technology Architecture

Aug 24, 2021 · Big Data

Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting

This article collects Spark performance‑tuning techniques, including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, and locality wait adjustments. It also presents systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.

JVM Tuning · Shuffle · Spark

0 likes · 21 min read

Big Data Technology Architecture

Aug 17, 2021 · Big Data

Detailed Overview of Flink CDC 2.0: Architecture, Features, and Future Roadmap

This article provides an in‑depth technical overview of Flink CDC 2.0, covering its CDC fundamentals, comparison of query‑based and log‑based approaches, the new lock‑free chunk algorithm, FLIP‑27 based parallel snapshot reading, performance benchmarks, documentation improvements, and future roadmap for stability and ecosystem integration.

Change Data Capture · Data Integration · Debezium

0 likes · 16 min read

Big Data Technology Architecture

Aug 12, 2021 · Big Data

Enterprise Data Lake Architecture, Delta Lake Core Capabilities, and Stream‑Batch Integrated Analytics on Alibaba Cloud

This article describes the rapid growth of data and the limitations of traditional warehouses, then explains how a cloud‑based data lake built on object storage with the Delta Lake format provides low‑cost, flexible, ACID‑compliant analytics. It closes with a step‑by‑step guide to ingesting, managing, and analyzing data using Alibaba Cloud DLF and Databricks DDI with Spark streaming and batch jobs.

Alibaba Cloud · Delta Lake · Spark

0 likes · 19 min read

Big Data Technology Architecture

Aug 12, 2021 · Databases

Understanding HBase HLog and Fault Recovery Mechanisms

This article explains HBase's write path using Memstore and HLog, details the lifecycle of HLog including construction, rolling, expiration, and deletion, and thoroughly analyzes the three fault‑recovery models—Log Splitting, Distributed Log Splitting, and Distributed Log Replay—highlighting their processes, advantages, and configuration nuances.

HBase · HLog · Log Splitting

0 likes · 14 min read

Big Data Technology Architecture

Aug 10, 2021 · Big Data

Building a Real‑Time Data Warehouse with Apache Flink and Apache Iceberg: Architecture, Challenges, and Best Practices

This article presents Tencent's practical experience of constructing a real‑time data warehouse by integrating Apache Flink with Apache Iceberg, covering background pain points of traditional Lambda architectures, Iceberg's table format and capabilities, Flink‑Iceberg sink design, small‑file handling, and future roadmap for a unified streaming‑batch data lake.

Apache Flink · Apache Iceberg · Real-Time Data Warehouse

0 likes · 20 min read

Big Data Technology Architecture

Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem: Hadoop’s storage and compute framework, the Hive data warehouse, the HBase NoSQL database, the Spark unified engine, the Kafka messaging platform, and the Elasticsearch search engine. It explains their core concepts, architectures, and typical use cases.

Elasticsearch · HBase · Hadoop

0 likes · 9 min read
