Author

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies

290 Articles · 1.1k Views

Latest from Big Data Technology Architecture

May 2, 2022 · R&D Management

Why Large Companies Frequently Rebuild Their Own Tools and How to Manage It Effectively

The article analyzes why large and medium‑sized companies often reinvent existing tools (unsatisfactory open‑source options, a desire for technical prestige, low‑priority requests, and skill development) and offers organizational strategies to evaluate, coordinate, and incentivize such efforts responsibly.

Organizational Strategy, R&D management, Tooling

7 min read

Apr 29, 2022 · Big Data

Halodoc’s Data Platform Evolution: From Redshift to a LakeHouse Architecture with Apache Hudi

This article describes how Halodoc’s data engineering team identified limitations of their Redshift‑based platform, evaluated a LakeHouse design, selected Apache Hudi for mutable data handling, and outlined the challenges and benefits of building a scalable, decoupled storage‑compute architecture for their growing healthcare services.

Apache Hudi, Data Engineering, data platform

9 min read

Nov 30, 2021 · Big Data

Building a Real-Time MySQL and PostgreSQL Streaming ETL with Flink CDC

This tutorial shows how to quickly construct a streaming ETL pipeline that captures changes from MySQL and PostgreSQL using Flink CDC, enriches order data with product and shipment information, and writes the results into Elasticsearch for real‑time visualization in Kibana.

CDC, Docker, Elasticsearch

11 min read

Nov 28, 2021 · Big Data

EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.

Airflow, Apache Spark, Data Engineering

9 min read

Nov 28, 2021 · Big Data

Investigation and Resolution of HiveServer2 JDBC Connection Failures and GC‑Induced Hang

The article analyzes why HiveServer2 experiences JDBC connection failures and task execution stalls under high concurrency, reproduces the issues using GC monitoring and large join queries, and presents memory‑ and GC‑tuning solutions including server migration and JVM parameter adjustments to improve stability.

GC Tuning, Hadoop, HiveServer2

7 min read

Nov 24, 2021 · Big Data

Using Iceberg Catalogs with HiveCatalog and HadoopCatalog: Table Creation, Data Ingestion, and Querying

This article explains the concept of Iceberg catalogs, compares HiveCatalog and HadoopCatalog, and provides step‑by‑step Spark examples for downloading the Iceberg jar, creating tables, loading data, querying, and examining the underlying metadata and directory structures.

HadoopCatalog, HiveCatalog, Iceberg

15 min read

Nov 23, 2021 · Big Data

Step-by-Step Guide to Setting Up Flink CDC with MySQL, Hudi, and Hive Integration on a Hadoop Cluster

This tutorial walks through configuring a Hadoop‑based environment (Flink 1.13.1, Scala 2.11, CDH 6.2.0, Hive 2.1.1, Hudi 0.10), compiling Hudi, setting up Flink and the MySQL binlog, creating CDC source and Hudi sink tables, running Flink jobs, and synchronizing the results to Hive partitions for querying via Hive and Presto.

CDC, Flink, Hive

15 min read

Nov 16, 2021 · Big Data

Understanding Adaptive Query Execution and Dynamic Partition Pruning in Apache Spark 3.0

This article explains how Apache Spark 3.0 improves SQL workload performance through Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP), detailing their design principles, runtime optimizations, configuration parameters, and practical examples that demonstrate reduced shuffle partitions, smarter join strategies, and handling of data skew.

Dynamic Partition Pruning, SQL Optimization, Spark

9 min read

Nov 16, 2021 · Databases

ByteHouse: ClickHouse Enterprise Edition Case Studies and Optimizations at ByteDance

ByteDance’s ByteHouse, a ClickHouse enterprise edition, showcases large‑scale real‑time analytics through two case studies (recommendation system metrics and ad‑delivery data), covering technical selection, challenges, the multi‑threaded Kafka Engine, async indexing, buffer engine enhancements, and the resulting performance gains.

ByteHouse, Kafka Engine, big data

10 min read

Nov 15, 2021 · Big Data

Flink Sort‑Shuffle: Design, Implementation, and Performance Evaluation

This article explains how Flink's new sort‑shuffle mechanism improves large‑scale batch processing by reducing file counts, optimizing I/O, lowering memory usage, and delivering up to tenfold speedups, while also detailing configuration tips and future enhancements.

Data Shuffle, Flink, Sort-Shuffle

16 min read
