Big Data Technology Architecture
Author

Exploring Open Source Big Data and AI Technologies

290 articles · 0 likes · 602 views · 0 comments
Recent Articles

May 2, 2022 · R&D Management

Why Large Companies Frequently Rebuild Their Own Tools and How to Manage It Effectively

The article analyzes why large and mid‑sized companies often rebuild tools that already exist: unsatisfactory open‑source options, the pursuit of technical prestige, the low priority given to feature requests, and engineers' skill development. It then offers organizational strategies to evaluate, coordinate, and incentivize such efforts responsibly.

Organizational Strategy · R&D Management · Tooling
0 likes · 7 min read
Apr 29, 2022 · Big Data

Halodoc’s Data Platform Evolution: From Redshift to a LakeHouse Architecture with Apache Hudi

This article describes how Halodoc’s data engineering team identified limitations of their Redshift‑based platform, evaluated a LakeHouse design, selected Apache Hudi for mutable data handling, and outlined the challenges and benefits of building a scalable, decoupled storage‑compute architecture for their growing healthcare services.

Apache Hudi · data engineering · data platform
0 likes · 9 min read
Nov 28, 2021 · Big Data

EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.

Airflow · Apache Spark · EMR Studio
0 likes · 9 min read
Nov 28, 2021 · Big Data

Investigation and Resolution of HiveServer2 JDBC Connection Failures and GC‑Induced Hang

The article analyzes why HiveServer2 experiences JDBC connection failures and task execution stalls under high concurrency, reproduces the issues using GC monitoring and large join queries, and presents memory‑ and GC‑tuning solutions including server migration and JVM parameter adjustments to improve stability.

GC tuning · Hadoop · HiveServer2
0 likes · 7 min read
Nov 23, 2021 · Big Data

Step-by-Step Guide to Setting Up Flink CDC with MySQL, Hudi, and Hive Integration on a Hadoop Cluster

This tutorial walks through configuring a Hadoop‑based environment (Flink 1.13.1, Scala 2.11, CDH 6.2.0, Hive 2.1.1, Hudi 0.10), compiling Hudi, setting up Flink and the MySQL binlog, creating CDC source and Hudi sink tables, running Flink jobs, and synchronizing the results to Hive partitions so they can be queried from Hive and Presto.

CDC · Flink · Hive
0 likes · 15 min read
Nov 16, 2021 · Big Data

Understanding Adaptive Query Execution and Dynamic Partition Pruning in Apache Spark 3.0

This article explains how Apache Spark 3.0 improves SQL workload performance through Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP), detailing their design principles, runtime optimizations, configuration parameters, and practical examples that demonstrate reduced shuffle partitions, smarter join strategies, and handling of data skew.

Adaptive Query Execution · Dynamic Partition Pruning · SQL Optimization
0 likes · 9 min read
Nov 16, 2021 · Databases

ByteHouse: ClickHouse Enterprise Edition Case Studies and Optimizations at ByteDance

ByteDance’s ByteHouse, an enterprise edition of ClickHouse, is presented through two large‑scale real‑time analytics case studies (recommendation system metrics and ad‑delivery data), covering technology selection, challenges, the multi‑threaded Kafka Engine, asynchronous indexing, Buffer engine enhancements, and the resulting performance gains.

ByteHouse · ClickHouse · Kafka engine
0 likes · 10 min read