Tagged articles
407 articles
Radish, Keep Going!
Jan 30, 2026 · Big Data

How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations

Uber tackled the challenge of replicating over 350 PB of data across on‑premise and cloud lakes by redesigning Hadoop Distcp, moving intensive tasks to the Application Master, parallelising copy‑listing and commit phases, and leveraging Uber‑mapper jobs to dramatically cut latency and improve resource efficiency.

Big Data, Distcp, Hadoop
17 min read
Big Data Tech Team
Sep 15, 2025 · Interview Experience

Top Data Warehouse Engineer Interview Questions & Answers Revealed

This article compiles three interview rounds for a data warehouse engineer role, covering fundamental concepts, practical skills, and leadership thinking with detailed Q&A on ETL, Hadoop components, schema design, data quality, data lake vs. warehouse, ACID properties, cloud solutions, SQL optimization, real‑time processing, security, and team management.

ETL, Hadoop, SQL Optimization
12 min read
Big Data Tech Team
Jul 23, 2025 · Big Data

From Beginner to Data Warehouse Architect: A Complete Roadmap

This guide walks you through every essential topic—from data warehouse architecture and layering, through ETL, OLAP, Hadoop, and Flink, to visualization tools, learning paths, recommended resources, and the management skills needed to become a proficient data warehouse architect.

ETL, Flink, Hadoop
9 min read
DataFunSummit
Jul 20, 2025 · Big Data

How Beike Scaled to 600 PB: The Evolution of a Data‑Fusion Architecture

This article details Beike's data‑fusion architecture evolution, covering industry trends, multi‑stage Hadoop upgrades, storage cost optimization with erasure coding, remote shuffle integration, GPU‑centric training stability, and future hybrid‑cloud strategies, while also sharing organizational and operational lessons learned.

AI, Data Architecture, Hadoop
16 min read
Big Data Tech Team
Jun 8, 2025 · Big Data

Master Hadoop: A Step-by-Step Learning Roadmap for Big Data Professionals

This guide outlines a comprehensive Hadoop learning roadmap, covering essential prerequisites, core concepts such as HDFS, MapReduce, and YARN, hands‑on projects, advanced ecosystem tools like Hive, Pig, HBase and Spark, plus curated resources and community channels for aspiring big‑data engineers.

HDFS, Hadoop, MapReduce
7 min read
Rare Earth Juejin Tech Community
Nov 23, 2024 · Big Data

Implementing a Basic Hadoop MapReduce Word Count with Extensible Design and Performance Tuning

This article explains Hadoop’s core concepts using a library analogy, details HDFS storage and MapReduce processing, provides complete Java implementations for a word‑count job with support for text, CSV, and JSON inputs, and discusses extensibility and performance optimizations such as combiners and custom partitioners.
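The summary mentions combiners and custom partitioners; the following is a minimal single-process sketch of where those two steps sit in a word count. It is written in Python rather than the article's Java, and every function name here (`map_words`, `combine`, `partition`, `run_word_count`) is hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import zlib


def map_words(line):
    """Map step: emit (word, 1) for each whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate counts on the map side to shrink shuffle traffic."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partition(word, num_reducers):
    """Hypothetical custom partitioner: stable hash of the key modulo reducer count."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def run_word_count(input_splits, num_reducers=2):
    """Simulate map -> combine -> partition -> reduce over in-memory splits."""
    # One bucket per reducer, holding word -> list of partial counts.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in input_splits:
        mapped = [pair for line in split for pair in map_words(line)]
        for word, count in combine(mapped):
            buckets[partition(word, num_reducers)][word].append(count)
    # Reduce step: each reducer sums the partial counts for its own keys.
    totals = {}
    for bucket in buckets:
        for word, counts in bucket.items():
            totals[word] = sum(counts)
    return totals
```

Because the combiner runs once per split, each reducer receives at most one partial count per word per split instead of one record per occurrence.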

Big Data, Hadoop, MapReduce
20 min read
ITPUB
Sep 11, 2024 · Big Data

Is Storage‑Compute Separation the Future? Unpacking the Lakehouse Debate

The article examines the concepts of storage‑compute separation and the lake‑warehouse (lakehouse) model, tracing their evolution from physical Hadoop clusters to containerized compute and object storage, and argues that true separation requires MPP systems to adopt open standards, effectively merging lake and warehouse architectures.

Big Data Architecture, Hadoop, Lakehouse
7 min read
DataFunSummit
Aug 13, 2024 · Big Data

Data Cost Reduction and Efficiency: Qichacha's Data Architecture and Multi‑Cloud Unified Design

This article presents Qichacha's comprehensive data‑cost‑reduction strategy, detailing its Hadoop‑based three‑pillar architecture, layered data warehouse, Hive upgrades, unified metadata across multi‑cloud clusters, middleware choices such as Alluxio and JuiceFS, version‑compatible hybrid clouds, and Kubernetes‑driven resource orchestration to achieve scalable, low‑cost data processing.

Big Data, Hadoop, data-warehouse
16 min read
DataFunTalk
May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Big Data, Gravitino, Hadoop
12 min read
DataFunSummit
Apr 26, 2024 · Big Data

Didi's Big Data Cost Governance Practices

This article details Didi's comprehensive big data cost governance framework, covering its data architecture, asset management scoring, Hadoop and Elasticsearch cost optimization methods, and practical insights on organizational processes and incentives for effective cost control.

Elasticsearch, Hadoop, Resource Optimization
17 min read
vivo Internet Technology
Apr 24, 2024 · Big Data

Analysis and Resolution of a FileSystem‑Induced Memory Leak Causing OOM in Production

The article details how repeatedly calling FileSystem.get(uri, conf, user) created distinct UserGroupInformation objects, inflating the static FileSystem cache and causing a heap‑memory leak that triggered an Out‑Of‑Memory error, and explains that using the two‑argument get method or explicitly closing instances resolves the issue.
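The leak described above can be modeled without Hadoop. The sketch below is a toy Python model, not Hadoop's code: the class names mirror the Java ones only for readability, the cache key layout is simplified, and `hdfs://nn1/data` is a hypothetical URI. It assumes, as the summary states, that each three-argument call produces a distinct identity object that becomes part of the cache key.

```python
class UserGroupInformation:
    """Stand-in identity object; Python compares and hashes it by object identity."""

    def __init__(self, user):
        self.user = user


class FileSystemCache:
    """Toy model of a static cache keyed by (scheme, authority, identity object)."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(uri, ugi):
        scheme, _, rest = uri.partition("://")
        authority = rest.split("/", 1)[0]
        return (scheme, authority, ugi)

    def get(self, uri, ugi):
        key = self._key(uri, ugi)
        if key not in self._entries:
            self._entries[key] = object()  # placeholder for a FileSystem instance
        return self._entries[key]

    def close(self, uri, ugi):
        """Explicitly closing an instance removes its cache entry."""
        self._entries.pop(self._key(uri, ugi), None)

    def __len__(self):
        return len(self._entries)


def get_for_user(cache, uri, user):
    """Models the three-argument call: a fresh identity object on every call."""
    return cache.get(uri, UserGroupInformation(user))


# Leaky pattern: every call adds a new entry, even for the same user name.
leaky = FileSystemCache()
for _ in range(1000):
    get_for_user(leaky, "hdfs://nn1/data", "alice")

# Shared identity (models the two-argument call): one entry is reused.
shared = FileSystemCache()
current_user = UserGroupInformation("alice")
for _ in range(1000):
    shared.get("hdfs://nn1/data", current_user)

print(len(leaky), len(shared))  # prints "1000 1"
```

The same user name yields 1000 entries in the first cache because the key compares identity objects, not names; reusing one identity object, or closing each instance after use, keeps the cache bounded.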

Hadoop, OutOfMemory, Performance debugging
13 min read
Efficient Ops
Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data, Cluster Setup, Configuration
11 min read
DataFunTalk
Mar 24, 2024 · Big Data

Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance

This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.

Big Data, Data Governance, Elasticsearch
17 min read
360 Smart Cloud
Jan 15, 2024 · Big Data

Design and Optimization of the Ozone Distributed Object Storage System

This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.

Big Data, Distributed Systems, Hadoop
15 min read
Architects Research Society
Jan 2, 2024 · Big Data

Understanding Data Lakes: Concepts, Benefits, Challenges, and Comparison with Data Warehouses

This article explains what a data lake is, its origins, key characteristics such as collecting all data, enabling diverse user access, and flexible processing, compares it with traditional data warehouses, discusses cost advantages, potential pitfalls like data swamps, and outlines best‑practice considerations for enterprise adoption.

Analytics, Data Architecture, Data Lake
10 min read
ITPUB
Dec 14, 2023 · Big Data

How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.
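For orientation, a streaming word count usually reduces to two small functions. The sketch below is an assumption about the general shape of such scripts, not the article's code: `mapper`, `reducer`, and `simulate` are hypothetical names, and `simulate` stands in locally for the sort that the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a streaming mapper would print."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum counts per word; relies on records arriving sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the shuffle: sort mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(simulate(["hello world", "hello hadoop"]))
# prints ['hadoop\t1', 'hello\t2', 'world\t1']
```

In an actual streaming job each function would live in its own script, read from `sys.stdin`, and print its records to standard output; the reducer logic only works because equal keys arrive adjacent to each other.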

Hadoop, MapReduce, Python
21 min read
Architects Research Society
Nov 26, 2023 · Big Data

Data Lake vs Data Warehouse: Key Differences and How to Choose

Data lakes and data warehouses serve different purposes in big‑data architectures; this article explains their definitions, core attributes, five major distinctions—including data retention, type support, user coverage, adaptability, and insight speed—and offers guidance on selecting or combining the two approaches.

Analytics, Data Architecture, Data Lake
12 min read
WeiLi Technology Team
Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data, Cluster Management, HDFS
10 min read
DevOps
Oct 25, 2023 · Big Data

An Introduction to Big Data: Origins, Definitions, 5V Characteristics, Applications, Hadoop Architecture, and Testing Strategies

This article provides a comprehensive overview of big data, covering its origins, definitions, 5V characteristics, data formats, real‑world applications, Hadoop architecture, testing challenges, functional and performance testing strategies, and the skills required for effective big data testing.

5V Characteristics, Big Data, Data Formats
35 min read
Zhengcaiyun Technology
Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architecture, Cluster Deployment, Distributed Systems
19 min read

How Lakehouse Architecture is Transforming Hadoop: A Deep Dive into Hudi, Iceberg, and Delta Lake

This article analyzes the rise of lake‑house architecture in the Hadoop ecosystem, compares the technical capabilities of Hudi, Iceberg and Delta Lake, details implementation enhancements such as MOR and multi‑writer support, showcases Flink integration, presents a real‑time marketing use case, and outlines future development directions.

Big Data, Data Governance, Delta Lake
14 min read
DataFunTalk
Jun 9, 2023 · Big Data

Cloud Music Data Governance Practice

This article presents a comprehensive case study of NetEase Cloud Music's data governance practice, covering data background, governance philosophy, detailed solutions across metadata, storage, compute, and model design, practical implementations, measurable cost savings, and future planning for sustainable data management.

Cost Optimization, Hadoop, Spark
15 min read
dbaplus Community
May 21, 2023 · Big Data

How Cloud Migration Transforms Big Data Architecture: Lessons from G‑Line

This article examines the limitations of traditional physical‑server Hadoop clusters and explains how adopting cloud‑native technologies, distributed object storage, and compute‑storage separation can improve resource utilization, disaster recovery, performance, security, observability, and cost efficiency for large‑scale big data workloads.

Hadoop, cloud migration, distributed storage
12 min read
Programmer DD
Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big Data, Hadoop, Spark
16 min read
DataFunSummit
Feb 6, 2023 · Product Management

Key Capabilities and Knowledge for Platform Data Product Managers in the Big Data Era

This article outlines the evolution of big data, defines the role of platform data product managers, details their core competencies—including general, professional thinking, and technical skills—covers the Hadoop ecosystem, and explains the end‑to‑end offline data‑warehouse construction process with practical examples and Q&A.

Hadoop, offline data warehouse, platform data
12 min read
JD Tech
Dec 29, 2022 · Big Data

Financial Enterprise Big Data Platform Construction Plan: Architecture, Design, and Implementation

This document outlines a comprehensive big‑data platform construction plan for a financial enterprise, describing the current data challenges, objectives, three‑layer architecture, recommended commercial Hadoop solution (TDH), detailed model‑design steps, implementation schedule, hardware/software specifications, and key success factors.

Financial Services, Hadoop, TDH
15 min read
DataFunTalk
Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data, Cost reduction, Hadoop
17 min read
Data Thinking Notes
Dec 5, 2022 · Big Data

How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance

This article details NetEase Cloud Music's year‑long data governance initiative, covering data background, governance strategy, project plan, practical actions, results, and future outlook, and shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.

Big Data, Cost Optimization, Data Governance
17 min read
ITPUB
Nov 9, 2022 · Backend Development

How to Scale a High‑Traffic Blog: From Nginx to MyRocks and Hadoop

This article explains how to overcome performance bottlenecks of a rapidly growing blog by progressively enhancing the traditional Nginx‑MySQL stack with load‑balanced app servers, Redis caching, read/write splitting, MySQL partitioning, MyRocks, and finally a hybrid NoSQL‑big‑data architecture using Hadoop and HBase.

Backend, Hadoop, Scalability
9 min read
DataFunSummit
Nov 5, 2022 · Big Data

2022 Open Source Big Data Heat Report: Trends, Moore’s Law, and Top 30 Projects

The 2022 Open Source Big Data Heat Report, released at the Yunqi Conference, analyzes 102 active projects, discovers a 40‑month “Moore’s law” doubling of project heat, highlights three major trends—diversification, integration, and cloud‑native—and ranks the top 30 hottest open‑source big‑data projects.

Cloud Native, Hadoop, trend analysis
6 min read
Python Crawling & Data Mining
Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data, Hadoop, Ozone
12 min read
DataFunSummit
Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data, Hadoop, YARN
20 min read
Hulu Beijing
Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data, Cluster Upgrade, Compatibility
17 min read
DataFunSummit
Jul 1, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

Shilong Fei from Xiaomi Data Platform presents an in‑depth exploration of elastic scheduling for Hadoop YARN, covering background, design of resource pools, auto‑scaling architecture, challenges such as job stability and user transparency, achieved cost reductions, and future plans for further optimization.

Auto Scaling, Big Data, Hadoop
20 min read
Architecture Digest
May 23, 2022 · Big Data

Overview of Core Technologies in a Big Data Platform Architecture

This article explains the main layers of a typical big data platform—data collection, storage and analysis, sharing, and application—detailing common tools such as Flume, DataX, Hive, Spark, SparkSQL, Impala, and Spark Streaming, and discusses task scheduling and monitoring in the ecosystem.

Data Platform, DataX, Hadoop
10 min read
DataFunTalk
May 21, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

This talk presents Xiaomi's design and deployment of an elastic scheduling system for Hadoop YARN, covering background analysis, resource‑pool strategy, auto‑scaling architecture, stability challenges, label‑based resource isolation, Spark shuffle handling, cost‑saving results and future plans.

Big Data, Hadoop, Resource Management
16 min read
dbaplus Community
May 12, 2022 · Big Data

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture: the multi‑engine SQL stack, dispatcher routing, and cluster scale; stability work such as coordinator HA, the real‑time punish mechanism, and query limits; and features including Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, the RaptorX local cache, JDK upgrades, and dynamic filtering. It closes with the future roadmap and shows how these changes raised query throughput and reduced latency.

Big Data, Cluster Management, Distributed Systems
32 min read
Big Data Technology & Architecture
Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data, Catalog, Flink
18 min read
Bilibili Tech
Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher, scaling to over 400 nodes that handle 160 k daily queries on 10 PB. The team added coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level enhancements, and plans auto‑scaling and materialized‑view automation next.

Big Data, Hadoop, Presto
30 min read
Bilibili Tech
Mar 25, 2022 · Big Data

Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling

Bilibili transformed its YARN CapacityScheduler from a heartbeat‑driven design to a multi‑threaded global scheduler by separating lock handling, adopting Weighted Round‑Robin with DRF, adding batch node selection, fixing proposal inconsistencies, tuning GC and logging, and thereby reduced application allocation time by about 38 % on clusters of up to 8,000 nodes.
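As background for the DRF part of that summary, the sketch below shows only the generic Dominant Resource Fairness ordering with a per-queue weight. It is not Bilibili's scheduler or YARN's implementation; the function names, the queue names, and the resource keys (`memory_gb`, `vcores`) are all hypothetical.

```python
def dominant_share(used, capacity):
    """Largest fraction of any single cluster resource that a queue consumes."""
    return max(used[resource] / capacity[resource] for resource in capacity)


def pick_next_queue(queues, capacity):
    """Choose the queue with the smallest weight-adjusted dominant share."""
    return min(
        queues,
        key=lambda name: dominant_share(queues[name]["used"], capacity)
        / queues[name]["weight"],
    )


capacity = {"memory_gb": 1000, "vcores": 500}
queues = {
    # Dominant resource is memory: 400 / 1000 = 0.4
    "etl": {"used": {"memory_gb": 400, "vcores": 50}, "weight": 1.0},
    # Dominant resource is vcores: 150 / 500 = 0.3
    "adhoc": {"used": {"memory_gb": 100, "vcores": 150}, "weight": 1.0},
}
print(pick_next_queue(queues, capacity))  # prints "adhoc"
```

With equal weights the queue holding the smaller dominant share is served next; lowering a queue's weight inflates its adjusted share, so it is picked less often.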

Big Data, CapacityScheduler, Hadoop
15 min read
IT Xianyu
Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data, Hadoop, Installation
6 min read
HomeTech
Jan 13, 2022 · Cloud Native

AutoKH: A Mixed‑Workload Resource Management Solution on Kubernetes and Hadoop

AutoKH is a cloud‑native mixed‑workload framework that integrates Kubernetes and Hadoop to dynamically schedule online and offline tasks, improve CPU and memory utilization, enforce priority classes, and ensure service stability through operators, CronHPA, and resource‑control components.

CPU Manager, Hadoop, Kubernetes
19 min read
HomeTech
Dec 24, 2021 · Big Data

Handling java.lang.OutOfMemoryError in Hadoop MapReduce

This article explains the four locations where java.lang.OutOfMemoryError can occur in Hadoop's MapReduce framework—client, ApplicationMaster, Map, and Reduce phases—and provides configuration adjustments and best‑practice solutions to mitigate each type of OOM issue.

Hadoop, MapReduce, OutOfMemoryError
11 min read
dbaplus Community
Dec 15, 2021 · Big Data

How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

Balance Optimization, Big Data Operations, Data Migration
14 min read
Big Data Technology Architecture
Nov 28, 2021 · Big Data

Investigation and Resolution of HiveServer2 JDBC Connection Failures and GC‑Induced Hang

The article analyzes why HiveServer2 experiences JDBC connection failures and task execution stalls under high concurrency, reproduces the issues using GC monitoring and large join queries, and presents memory‑ and GC‑tuning solutions including server migration and JVM parameter adjustments to improve stability.

GC tuning, Hadoop, HiveServer2
7 min read
Big Data Technology & Architecture
Nov 22, 2021 · Big Data

Comprehensive Big Data Learning Path and Resource Guide

This article presents a detailed learning roadmap for aspiring big‑data experts, covering foundational programming languages, data structures, Linux basics, databases, distributed system theory, and essential frameworks such as Hadoop, Spark, Flink, and Kafka, and provides curated Bilibili video links and reference materials.

Big Data, Flink, Hadoop
9 min read
DataFunTalk
Nov 20, 2021 · Big Data

How to Build a Big Data Platform from Zero to One: Architecture, Components, and Best Practices

This article provides a comprehensive guide to designing and implementing a big‑data platform, covering architecture overview, data ingestion with Flume, storage on HDFS/Hive/HBase, processing engines such as Hive, Spark and Flink, scheduling solutions like Azkaban and Airflow, and the construction of self‑service analytics systems.

Big Data, ETL, Hadoop
29 min read
Big Data Technology Architecture
Nov 13, 2021 · Big Data

Case Study: Migrating Baicaowei's On‑Premise Hadoop Data Platform to Alibaba Cloud Native Data Lake

This article details Baicaowei's migration from an IDC‑hosted Hadoop cluster to a cloud‑native data lake on Alibaba Cloud, outlining the business drivers, pain points of the legacy platform, architectural goals, design principles, solution selection, implementation steps, and future outlook for the new big‑data ecosystem.

Alibaba Cloud, Big Data, Delta Lake
16 min read
Architects' Tech Alliance
Nov 12, 2021 · Big Data

Understanding Data Lakes: Definitions, Evolution, and Architectural Patterns

The article explains what a data lake is, compares various vendor definitions, outlines its four essential components, describes three evolutionary architecture stages from self‑hosted Hadoop to cloud‑native storage‑compute separation, and discusses the benefits and challenges of adopting data lake solutions in modern big‑data platforms.

AWS, Data Lake, Hadoop
8 min read
21CTO
Oct 14, 2021 · Big Data

How LinkedIn Scaled Hadoop to 11,000 Nodes and Solved YARN Delays

LinkedIn’s engineers detail how they repeatedly doubled their Hadoop cluster to over 11,000 nodes, tackled YARN scheduling delays caused by workload imbalances, and created the DynoYARN simulation tool to predict performance impacts of massive scaling.

Big Data, DynoYARN, Hadoop
7 min read
Java High-Performance Architecture
Oct 12, 2021 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

This article breaks down a typical big data platform architecture into its four layers—data collection, storage and analysis, sharing, and real‑time computation—detailing the essential tools such as Flume, HDFS, Hive, Spark, DataX, and task scheduling systems that enable scalable, low‑latency data processing and delivery.

Big Data, Data Architecture, DataX
8 min read
Architecture Digest
Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data, Data Architecture, DataX
8 min read
Big Data Technology & Architecture
Oct 8, 2021 · Big Data

Hadoop HDFS Storage Optimization, Erasure Coding, Heterogeneous Storage, and Cluster Tuning Guide

This article provides a comprehensive guide to optimizing Hadoop HDFS storage through erasure coding and heterogeneous storage policies, explains fault‑tolerance techniques such as safe mode and slow‑disk monitoring, and shares practical MapReduce performance tuning and enterprise‑level configuration examples for large‑scale clusters.

Cluster Tuning, HDFS, Hadoop
32 min read
Java Architect Essentials
Sep 21, 2021 · Big Data

Interview on Kuaishou's Billion‑Scale Big Data Architecture Evolution and Practices

The interview with Kuaishou senior architect Zhao Jianbo details the three‑phase evolution of its trillion‑scale big data platform, covering foundational Hadoop services, real‑time and OLAP extensions, deep customizations, Spring Festival Gala challenges, scheduling innovations, Hadoop usage, and the relationship between big data and cloud architectures.

Big Data, Flink, Hadoop
19 min read
ITPUB
Sep 16, 2021 · Big Data

Understanding Hadoop: Architecture, HDFS, MapReduce, and Their Pros & Cons

This article explains how Hadoop revolutionized big data by providing a distributed architecture with HDFS for storage and MapReduce for processing, outlines its ecosystem components, describes the inner workings of HDFS and MapReduce, and discusses the strengths and limitations of this approach.

HDFSHadoopMapReduce
0 likes · 7 min read
Big Data Technology & Architecture
Sep 16, 2021 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer data structure to store serialized key/value pairs and their metadata in memory, describes its initialization, write path, spill handling, and the underlying algorithms that ensure efficient in‑memory sorting and disk spilling.

HadoopIn-Memory BufferMapReduce
0 likes · 24 min read
IT Architects Alliance
Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers (data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling), introducing key technologies such as Flume, HDFS, Hive, Spark, and DataX, and touching on monitoring considerations.

Big DataData PlatformHadoop
0 likes · 9 min read
Architects' Tech Alliance
Sep 2, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

The article outlines a typical big data platform architecture, detailing its core layers—data collection, storage and analysis, sharing, application, real-time computation, and task scheduling—while describing key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and Redis.

Data ArchitectureData IntegrationHadoop
0 likes · 9 min read
Big Data Technology & Architecture
Sep 1, 2021 · Big Data

Understanding Hadoop Data Splitting and InputFormat Mechanisms

This article explains Hadoop's data splitting concepts, the distinction between HDFS blocks and logical InputSplits, details the source code of various InputFormats such as TextInputFormat, CombineTextInputFormat, KeyValueTextInputFormat, NLineInputFormat, and custom InputFormats, and provides complete Java examples for Mapper, Reducer, and driver classes.

Data SplittingHadoopInputFormat
0 likes · 24 min read
Big Data Technology & Architecture
Aug 10, 2021 · Databases

Kudu Overview: Architecture, Features, and Use Cases

Kudu is an open‑source columnar storage engine from Cloudera that combines high‑throughput batch processing with low‑latency random reads, offering features such as C++/Java APIs, Raft‑based replication, flexible consistency, partitioning, and integration with Hadoop, Spark, Impala, and other ecosystem components.

Columnar StorageHadoopKudu
0 likes · 64 min read
The Dominant Programmer
Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big DataCentOS 7HBase
0 likes · 11 min read
Big Data Technology & Architecture
Aug 2, 2021 · Big Data

Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big DataFlinkHadoop
0 likes · 22 min read
Big Data Technology Architecture
Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big DataElasticsearchHBase
0 likes · 9 min read
UCloud Tech
Jul 13, 2021 · Big Data

Step‑by‑Step Guide to Deploy UCloud’s Free USDP Big Data Platform on CentOS

This article walks you through the complete installation and configuration of UCloud's free USDP (UCloud Data Platform) on a three‑node CentOS 7.2‑7.6 cluster, covering environment preparation, package download, repair scripts, MySQL setup, service startup, web UI activation, monitoring, and a quick Hive query example.

CentOSCluster DeploymentHadoop
0 likes · 19 min read