Tagged articles

Hadoop

413 articles · Page 1 of 5

May 13, 2026 · Big Data

How Vivo Upgraded a Million‑Node YARN Cluster: Architecture, Scheduler Switch, and Performance Optimizations

This article details Vivo's end‑to‑end upgrade of a YARN 2.6.0 cluster to a modern version for a million‑node, hundred‑thousand‑tasks‑per‑day platform, covering architectural evolution, scheduler migration, compatibility fixes, performance tuning, and service‑continuity strategies.

Big Data · Capacity Scheduler · Hadoop

0 likes · 28 min read

Radish, Keep Going!

Jan 30, 2026 · Big Data

How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations

Uber tackled the challenge of replicating over 350 PB of data across on‑premise and cloud lakes by redesigning Hadoop Distcp, moving intensive tasks to the Application Master, parallelising copy‑listing and commit phases, and leveraging Uber‑mapper jobs to dramatically cut latency and improve resource efficiency.

Big Data · Data Replication · Distcp

0 likes · 17 min read

Big Data Technology Tribe

Dec 19, 2025 · Big Data

Why Did Our HDFS Standby NameNode Crash? A Deep Dive into Block Recovery Bugs

A recent HDFS outage caused the Standby and Observer NameNodes to crash after heavy client load triggered block recovery failures, exposing a bug in commitBlockSynchronization that leads to mismatched block IDs and edit‑log inconsistencies, which can be fixed by applying HDFS‑17861.

BlockRecovery · Crash · HDFS

0 likes · 15 min read

Big Data Tech Team

Sep 15, 2025 · Interview Experience

From Beginner to Data Warehouse Architect: A Complete Roadmap

This guide walks you through every essential topic—from data warehouse architecture and layering, through ETL, OLAP, Hadoop, and Flink, to visualization tools, learning paths, recommended resources, and the management skills needed to become a proficient data warehouse architect.

Data Warehouse · ETL · Flink

0 likes · 9 min read

DataFunSummit

Jul 20, 2025 · Big Data

How Beike Scaled to 600 PB: The Evolution of a Data‑Fusion Architecture

This article details Beike's data‑fusion architecture evolution, covering industry trends, multi‑stage Hadoop upgrades, storage cost optimization with erasure coding, remote shuffle integration, GPU‑centric training stability, and future hybrid‑cloud strategies, while also sharing organizational and operational lessons learned.

AI · Cloud Computing · Data Architecture

0 likes · 16 min read

Big Data Tech Team

Jun 8, 2025 · Big Data

Master Hadoop: A Step-by-Step Learning Roadmap for Big Data Professionals

This guide outlines a comprehensive Hadoop learning roadmap, covering essential prerequisites, core concepts such as HDFS, MapReduce, and YARN, hands‑on projects, advanced ecosystem tools like Hive, Pig, HBase and Spark, plus curated resources and community channels for aspiring big‑data engineers.

Distributed Computing · HDFS · Hadoop

0 likes · 7 min read

Past Memory Big Data

Dec 27, 2024 · Big Data

How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet

Uber’s data lake on Hadoop stores hundreds of petabytes in Parquet files and, by adopting ZSTD compression, column pruning, and column reordering, achieves up to 79% storage reduction and significant vCore savings, with detailed benchmarks guiding optimal compression levels and open‑source contributions.

Apache Parquet · Big Data · Hadoop

0 likes · 14 min read

Rare Earth Juejin Tech Community

Dec 26, 2024 · Big Data

Understanding Hadoop HDFS and MapReduce: Principles, Architecture, and Sample Code

This article explains the origins of big‑data technologies, details the architecture and read/write mechanisms of Hadoop's HDFS, describes the MapReduce programming model, and provides complete Java code examples for a simple distributed file‑processing job using Maven dependencies.

Big Data · Distributed File System · HDFS

0 likes · 15 min read

Rare Earth Juejin Tech Community

Nov 23, 2024 · Big Data

Implementing a Basic Hadoop MapReduce Word Count with Extensible Design and Performance Tuning

This article explains Hadoop’s core concepts using a library analogy, details HDFS storage and MapReduce processing, provides complete Java implementations for a word‑count job with support for text, CSV, and JSON inputs, and discusses extensibility and performance optimizations such as combiners and custom partitioners.

Big Data · Hadoop · MapReduce

0 likes · 20 min read

ITPUB

Sep 11, 2024 · Big Data

Is Storage‑Compute Separation the Future? Unpacking the Lakehouse Debate

The article examines the concepts of storage‑compute separation and the lake‑warehouse (lakehouse) model, tracing their evolution from physical Hadoop clusters to containerized compute and object storage, and argues that true separation requires MPP systems to adopt open standards, effectively merging lake and warehouse architectures.

Big Data Architecture · Hadoop · Lakehouse

0 likes · 7 min read

DataFunSummit

Aug 13, 2024 · Big Data

Data Cost Reduction and Efficiency: Qichacha's Data Architecture and Multi‑Cloud Unified Design

This article presents Qichacha's comprehensive data‑cost‑reduction strategy, detailing its Hadoop‑based three‑pillar architecture, layered data warehouse, Hive upgrades, unified metadata across multi‑cloud clusters, middleware choices such as Alluxio and JuiceFS, version‑compatible hybrid clouds, and Kubernetes‑driven resource orchestration to achieve scalable, low‑cost data processing.

Big Data · Data Warehouse · Hadoop

0 likes · 16 min read

Big Data Technology & Architecture

Aug 3, 2024 · Big Data

Comprehensive Big Data Interview Questions and Topics

This article compiles a wide range of interview questions covering JVM garbage collection, Hadoop, Hive, Flink, HBase, data warehousing, real‑time processing, and HR topics, providing a thorough preparation guide for candidates targeting senior big‑data positions.

Flink · Hadoop · Hive

0 likes · 9 min read

Past Memory Big Data

Aug 2, 2024 · Big Data

How Haijing Tech Built a Real-Time Telecom Analytics Platform with ByConity

Haijing Technology faced Hadoop's real‑time limits and ClickHouse's operational pain points, so it adopted the open‑source ByConity platform, which provides a unified table engine, fast multi‑table joins, and seamless scaling to deliver a carrier‑grade real‑time analytics solution.

Big Data · ByConity · ClickHouse

0 likes · 11 min read

Mike Chen's Internet Architecture

Jul 15, 2024 · Big Data

Master Distributed Computing: Hadoop, Spark, and Flink Explained

This article introduces the fundamentals of distributed computing, compares major frameworks such as Hadoop, Spark, and Flink, and outlines their key components, performance characteristics, and typical application scenarios including big‑data analytics, cloud services, real‑time streaming, and scientific computing.

Big Data · Distributed Computing · Flink

0 likes · 7 min read

360 Smart Cloud

May 28, 2024 · Big Data

HDFS Upgrade from 2.6.0‑cdh to 3.1.2 with DataNode Federation and Mixed Deployment

This article details the background, planning, step‑by‑step procedures, encountered issues, and rollback strategies for upgrading a Hadoop HDFS cluster from version 2.6.0‑cdh to 3.1.2, including mixed‑deployment of DataNodes across different federations and necessary configuration changes.

DataNode · HDFS · Hadoop

0 likes · 16 min read

DataFunTalk

May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Access Control · Big Data · Gravitino

0 likes · 12 min read

DataFunSummit

Apr 26, 2024 · Big Data

Didi's Big Data Cost Governance Practices

This article details Didi's comprehensive big data cost governance framework, covering its data architecture, asset management scoring, Hadoop and Elasticsearch cost optimization methods, and practical insights on organizational processes and incentives for effective cost control.

Elasticsearch · Hadoop · resource optimization

0 likes · 17 min read

vivo Internet Technology

Apr 24, 2024 · Big Data

Analysis and Resolution of a FileSystem‑Induced Memory Leak Causing OOM in Production

The article details how repeatedly calling FileSystem.get(uri, conf, user) created distinct UserGroupInformation objects, inflating the static FileSystem cache and causing a heap‑memory leak that triggered an Out‑Of‑Memory error, and explains that using the two‑argument get method or explicitly closing instances resolves the issue.

Hadoop · OutOfMemory · Performance debugging

0 likes · 13 min read

Efficient Ops

Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data · Cluster Setup · Configuration

0 likes · 11 min read

DataFunTalk

Mar 24, 2024 · Big Data

Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance

This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.

Asset Management · Big Data · Data Governance

0 likes · 17 min read

360 Smart Cloud

Jan 15, 2024 · Big Data

Design and Optimization of the Ozone Distributed Object Storage System

This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.

Big Data · Hadoop · Optimization

0 likes · 15 min read

Architects Research Society

Jan 2, 2024 · Big Data

Understanding Data Lakes: Concepts, Benefits, Challenges, and Comparison with Data Warehouses

This article explains what a data lake is, its origins, key characteristics such as collecting all data, enabling diverse user access, and flexible processing, compares it with traditional data warehouses, discusses cost advantages, potential pitfalls like data swamps, and outlines best‑practice considerations for enterprise adoption.

Analytics · Data Architecture · Data Lake

0 likes · 10 min read

ITPUB

Dec 14, 2023 · Big Data

How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.

Hadoop · MapReduce · Python

0 likes · 21 min read

Tencent Cloud Developer

Dec 14, 2023 · Big Data

Master Word Count with Python & Hadoop: A Step‑by‑Step Guide

This tutorial walks you through Hadoop’s core components, sets up a single‑node Hadoop cluster on CentOS 7, installs Python 3, writes mapper and reducer scripts in Python, and runs a Hadoop‑Streaming word‑count job to demonstrate classic big‑data processing techniques.

Big Data · Hadoop · Linux

0 likes · 22 min read

Architects Research Society

Nov 26, 2023 · Big Data

Data Lake vs Data Warehouse: Key Differences and How to Choose

Data lakes and data warehouses serve different purposes in big‑data architectures; this article explains their definitions, core attributes, five major distinctions—including data retention, type support, user coverage, adaptability, and insight speed—and offers guidance on selecting or combining the two approaches.

Analytics · Data Architecture · Data Lake

0 likes · 12 min read

Big Data Technology & Architecture

Nov 10, 2023 · Big Data

MVP Learning Roadmap for Securing a Big Data Internship

This article offers a concise MVP learning plan for recent graduates aiming to secure a big‑data internship, covering essential computer fundamentals, core big‑data frameworks, project ideas, and algorithm/SQL practice, along with practical study tips and resource recommendations.

Flink · Hadoop · SQL

0 likes · 8 min read

WeiLi Technology Team

Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data · HDFS · Hadoop

0 likes · 10 min read

DevOps

Oct 25, 2023 · Big Data

An Introduction to Big Data: Origins, Definitions, 5V Characteristics, Applications, Hadoop Architecture, and Testing Strategies

This article provides a comprehensive overview of big data, covering its origins, definitions, 5V characteristics, data formats, real‑world applications, Hadoop architecture, testing challenges, functional and performance testing strategies, and the skills required for effective big data testing.

5V Characteristics · Big Data · Data Formats

0 likes · 35 min read

Past Memory Big Data

Oct 10, 2023 · Big Data

2023 Big Data Interview Guide: Hadoop, Hive, Doris, Data Warehouse Essentials

This comprehensive 2023 guide covers essential big‑data interview topics, providing detailed explanations and step‑by‑step processes for Hadoop HDFS read/write, YARN, Hive table types and optimizations, Doris architecture and data models, data‑warehouse layers, modeling techniques, quality monitoring, and classic algorithm design questions such as TOP‑K and duplicate detection.

Big Data · Data Warehouse · Doris

0 likes · 54 min read

Big Data Technology & Architecture

Sep 14, 2023 · Big Data

Big Data Interview Guide: Common Questions from Leading Companies

This article compiles real interview experiences from a top tech firm and other leading companies, presenting a detailed list of common big‑data interview questions covering Hadoop, Hive, Spark, Flink, Kafka, data skew, HDFS architecture, and related concepts to help candidates prepare effectively.

Big Data · Flink · Hadoop

0 likes · 8 min read

政采云技术

Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architecture · Cluster Deployment · Hadoop

0 likes · 19 min read

AsiaInfo Technology: New Tech Exploration

Aug 18, 2023 · Big Data

How Lakehouse Architecture is Transforming Hadoop: A Deep Dive into Hudi, Iceberg, and Delta Lake

This article analyzes the rise of lake‑house architecture in the Hadoop ecosystem, compares the technical capabilities of Hudi, Iceberg and Delta Lake, details implementation enhancements such as MOR and multi‑writer support, showcases Flink integration, presents a real‑time marketing use case, and outlines future development directions.

Big Data · Data Governance · Delta Lake

0 likes · 14 min read

DataFunTalk

Jun 9, 2023 · Big Data

Cloud Music Data Governance Practice

This article presents a comprehensive case study of NetEase Cloud Music's data governance practice, covering data background, governance philosophy, detailed solutions across metadata, storage, compute, and model design, practical implementations, measurable cost savings, and future planning for sustainable data management.

Hadoop · Metadata · Spark

0 likes · 15 min read

dbaplus Community

May 21, 2023 · Big Data

How Cloud Migration Transforms Big Data Architecture: Lessons from G‑Line

This article examines the limitations of traditional physical‑server Hadoop clusters and explains how adopting cloud‑native technologies, distributed object storage, and compute‑storage separation can improve resource utilization, disaster recovery, performance, security, observability, and cost efficiency for large‑scale big data workloads.

Cloud Migration · Distributed storage · Hadoop

0 likes · 12 min read

Big Data Technology & Architecture

May 19, 2023 · Big Data

Comprehensive Big Data Interview Q&A and Personal Project Summary

This article shares a recent graduate's successful job offer story, emphasizes preparing a detailed personal project summary, and provides extensive big‑data interview questions covering Hadoop, Spark, Flink, Kafka, Hive, ClickHouse, and related technologies to help candidates excel in interviews.

Big Data · Flink · Hadoop

0 likes · 15 min read

Big Data Technology & Architecture

May 16, 2023 · Big Data

Comprehensive Big Data Interview Questions and Preparation Guide for Campus Graduates

This article compiles extensive big‑data interview questions from companies like Bilibili, ByteDance, Ant Group, and Tencent, offers practical advice on project depth, open‑source contributions, and provides strategic insights for recent graduates navigating a tightening job market.

Flink · Hadoop · Kafka

0 likes · 10 min read

Big Data Technology Architecture

Mar 15, 2023 · Big Data

Ensuring Secure Write Paths in Hadoop S3A: Experiments, Benchmarks, and Best Practices

This article analyses the security of Hadoop S3A write paths in data lakes, explains fast upload mechanisms, demonstrates disk‑IO and network‑error simulations, compares checksum algorithms, and presents Alibaba Cloud EMR JindoSDK best‑practice results with performance and reliability evaluations.

Hadoop · Network Reliability · S3A

0 likes · 24 min read

Programmer DD

Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big Data · Hadoop · Performance

0 likes · 16 min read

JD Cloud Developers

Feb 23, 2023 · Big Data

How to Build a Local Hadoop & Spark Cluster from Scratch (Step‑by‑Step Guide)

This comprehensive tutorial walks you through setting up a three‑node Hadoop 3.3.4 and Spark 3.3.1 environment on CentOS 7 virtual machines, covering system preparation, JDK and Scala installation, Zookeeper configuration, Hadoop and Spark deployment, and verification with practical command‑line examples.

Big Data · Cluster Setup · Hadoop

0 likes · 10 min read

DataFunSummit

Feb 6, 2023 · Product Management

Key Capabilities and Knowledge for Platform Data Product Managers in the Big Data Era

This article outlines the evolution of big data, defines the role of platform data product managers, details their core competencies—including general, professional thinking, and technical skills—covers the Hadoop ecosystem, and explains the end‑to‑end offline data‑warehouse construction process with practical examples and Q&A.

Hadoop · offline data warehouse · platform data

0 likes · 12 min read

DataFunSummit

Dec 31, 2022 · Big Data

The Evolution of Data Platforms: From Early Computing to the Modern Big Data Stack

This article reviews the history of data platforms—from the first general‑purpose computers and early relational databases through traditional BI, agile BI, and big‑data technologies like Hadoop, Spark, and Flink, up to today’s cloud‑native modern data stack and its future outlook.

Big Data · Data Platform · Flink

0 likes · 26 min read

JD Tech

Dec 29, 2022 · Big Data

Financial Enterprise Big Data Platform Construction Plan: Architecture, Design, and Implementation

This document outlines a comprehensive big‑data platform construction plan for a financial enterprise, describing the current data challenges, objectives, three‑layer architecture, recommended commercial Hadoop solution (TDH), detailed model‑design steps, implementation schedule, hardware/software specifications, and key success factors.

Data Warehouse · Enterprise Architecture · Hadoop

0 likes · 15 min read

DataFunTalk

Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data · Hadoop · Performance Optimization

0 likes · 17 min read

Data Thinking Notes

Dec 5, 2022 · Big Data

How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance

This article details NetEase Cloud Music's year‑long data governance initiative, covering data background, governance strategy, project plan, practical actions, results, and future outlook, and shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.

Big Data · Data Governance · Hadoop

0 likes · 17 min read

Open Source Linux

Nov 11, 2022 · Big Data

Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This guide walks through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository addition, Docker image creation, Helm chart configuration, service adjustments, installation, verification commands, and clean uninstallation, complete with code snippets and screenshots.

Big Data · Docker · Hadoop

0 likes · 14 min read

Data Thinking Notes

Nov 9, 2022 · Operations

Why Did My Hadoop Node’s Memory Spike at 3 AM? A Step‑by‑Step Debug Guide

This article details a systematic investigation of a Hadoop NameNode/DataNode that showed high memory usage at 3 AM, identifies zombie crond/sendmail/postdrop processes caused by a failed Postfix service, and provides cleanup commands and preventive measures for memory, disk, and inode issues.

Hadoop · disk usage · inode

0 likes · 5 min read

ITPUB

Nov 9, 2022 · Backend Development

How to Scale a High‑Traffic Blog: From Nginx to MyRocks and Hadoop

This article explains how to overcome performance bottlenecks of a rapidly growing blog by progressively enhancing the traditional Nginx‑MySQL stack with load‑balanced app servers, Redis caching, read/write splitting, MySQL partitioning, MyRocks, and finally a hybrid NoSQL‑big‑data architecture using Hadoop and HBase.

Hadoop · backend · scalability

0 likes · 9 min read

DataFunSummit

Nov 5, 2022 · Big Data

2022 Open Source Big Data Heat Report: Trends, Moore’s Law, and Top 30 Projects

The 2022 Open Source Big Data Heat Report, released at the Yunqi Conference, analyzes 102 active projects, discovers a 40‑month “Moore’s law” doubling of project heat, highlights three major trends—diversification, integration, and cloud‑native—and ranks the top 30 hottest open‑source big‑data projects.

Cloud Native · Hadoop · trend analysis

0 likes · 6 min read

Python Crawling & Data Mining

Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data · Distributed storage · Hadoop

0 likes · 12 min read

Past Memory Big Data

Oct 29, 2022 · Big Data

How to Adapt Hadoop for Domestic Big Data Requirements

The article analyzes Hadoop’s declining relevance, the dominance of CDH/HDP, and security pressures from vulnerabilities. It then outlines ten technical steps required to create a domestic ARM‑based Hadoop distribution, including hardware adaptation, component selection, dependency resolution, compilation, Ambari integration, packaging, testing, and functional verification. The authors have released the result as a free HDP 3.3.1 build.

Ambari · Arm · Big Data

0 likes · 15 min read

Python Crawling & Data Mining

Oct 16, 2022 · Big Data

What Makes Hadoop the Backbone of Modern Big Data Processing?

This article provides a comprehensive overview of Hadoop, covering its history, core features, the HDFS storage framework, MapReduce computation engine, YARN resource manager, real‑world application scenarios, and the surrounding ecosystem of tools such as Hive, Spark and Kafka.

Distributed Computing · HDFS · Hadoop

0 likes · 20 min read

MaGe Linux Operations

Sep 26, 2022 · Big Data

Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This tutorial walks you through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository setup, Docker image creation, Helm chart customization, service configuration, installation, verification, and clean‑up, with all necessary commands and YAML snippets.

Big Data · Docker · Hadoop

0 likes · 14 min read

DataFunSummit

Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data · Hadoop · Resource Scheduling

0 likes · 20 min read

Big Data Technology & Architecture

Sep 19, 2022 · Big Data

Apache Iceberg Table and Catalog Configuration Guide for Hadoop

This article outlines the configuration settings for Apache Iceberg tables and catalogs on Hadoop, covering read and write properties, combine behavior for small HDFS files, reserved table properties, catalog lock options, and Hive Metastore connector Hadoop settings, supplemented with illustrative screenshots.

Big Data · Catalog · Hadoop

0 likes · 3 min read

政采云技术

Sep 6, 2022 · Big Data

Compiling and Deploying Spark 3.3.0 on CDH 6.3.2 (Cloudera) – Step‑by‑Step Guide

This guide explains how to download JDK, Maven, Scala and Spark 3.3.0, modify the Spark pom and configuration files for CDH 6.3.2, compile Spark with Maven, deploy the binaries to a client node, set up spark‑sql and spark‑submit scripts, and address common runtime issues.

CDH · Compilation · Hadoop

0 likes · 13 min read

Past Memory Big Data

Aug 15, 2022 · Big Data

How Pinterest Scaled a Hadoop Upgrade Across 17k Nodes

Pinterest’s Monarch batch‑processing platform, built on over 17 k YARN nodes in AWS, was upgraded from Hadoop 2.7.1 to 2.10.0 using a phased, cluster‑by‑cluster strategy that balanced minimal downtime, extensive validation, and custom patches to handle compatibility and dependency issues.

AWS EC2 · Big Data · Hadoop

0 likes · 18 min read

Big Data Technology & Architecture

Jul 27, 2022 · Big Data

Step-by-Step Guide to Installing and Using Flink with Iceberg for Real-Time Data Lake

This article provides a comprehensive tutorial on setting up Flink 1.11 with Iceberg 0.11.1, creating Hive catalogs, building databases and tables, inserting data, and exploring Iceberg components, file structures, partitioned tables, execution plans, and programmatic access via Scala.

Big Data · Data Lake · Flink

0 likes · 10 min read

Hulu Beijing

Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Flink · Hadoop

0 likes · 17 min read

DataFunSummit

Jul 1, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

Shilong Fei from Xiaomi Data Platform presents an in‑depth exploration of elastic scheduling for Hadoop YARN, covering background, design of resource pools, auto‑scaling architecture, challenges such as job stability and user transparency, achieved cost reductions, and future plans for further optimization.

Auto Scaling · Big Data · Hadoop

0 likes · 20 min read

Architecture Digest

May 23, 2022 · Big Data

Overview of Core Technologies in a Big Data Platform Architecture

This article explains the main layers of a typical big data platform—data collection, storage and analysis, sharing, and application—detailing common tools such as Flume, DataX, Hive, Spark, SparkSQL, Impala, and Spark Streaming, and discusses task scheduling and monitoring in the ecosystem.

Data Platform · DataX · Hadoop

0 likes · 10 min read

DataFunTalk

May 21, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

This talk presents Xiaomi's design and deployment of an elastic scheduling system for Hadoop YARN, covering background analysis, resource‑pool strategy, auto‑scaling architecture, stability challenges, label‑based resource isolation, Spark shuffle handling, cost‑saving results and future plans.

Big Data · Hadoop · Resource Management

0 likes · 16 min read

dbaplus Community

May 12, 2022 · Big Data

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements such as coordinator HA and real‑time penalization, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, the RaptorX local cache, JDK upgrades, dynamic filtering, and the future roadmap. It illustrates how these changes boosted query throughput and reduced latency.

Big Data · Hadoop · Performance Optimization

0 likes · 32 min read

vivo Internet Technology

May 11, 2022 · Big Data

How We Rolled Out a Massive HDFS 2.6→3.1 Upgrade on a 10,000‑Node Cluster

This article details the end‑to‑end process of migrating a 10,000‑node offline data‑warehouse from CDH 5.14.4 (HDFS 2.6.0) to HDP 3.1.4 (HDFS 3.1.1), covering version selection, rolling‑upgrade strategy, incompatibility fixes, client handling, tool coexistence, testing, automation, and lessons learned.

Big Data · Cluster Migration · Data Warehouse

0 likes · 25 min read

IEG Growth Platform Technology Team

Apr 18, 2022 · Big Data

Big Data Overview: Definitions, Applications, Technology Stack, and Core Components (Hadoop, HDFS, MapReduce, YARN, Hive, HBase)

This comprehensive article explains big data concepts, definitions from Gartner and IBM, real‑world use cases, the Hadoop ecosystem architecture, and detailed introductions to HDFS, MapReduce, YARN, Hive, and HBase, including practical examples and shell commands.

HBase · HDFS · Hadoop

0 likes · 42 min read

Big Data Technology & Architecture

Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data · Catalog · Flink

0 likes · 18 min read

Architect

Apr 11, 2022 · Big Data

Design, Optimization, and Future Roadmap of Bilibili's Presto SQL‑on‑Hadoop Architecture

This article details Bilibili's end‑to‑end Presto‑based SQL‑on‑Hadoop architecture, covering overall system components, query routing, Presto feature set, extensive stability and availability enhancements, performance boosts through caching and multi‑datacenter deployment, and outlines future development plans.

Hadoop · Kubernetes · Performance Optimization

0 likes · 28 min read

Bilibili Tech

Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher. It scales to over 400 nodes handling 160 k daily queries on 10 PB, and adds coordinator HA, resource‑group penalties, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level enhancements; auto‑scaling and materialized‑view automation are planned as future work.

Big Data · Hadoop · SQL

0 likes · 30 min read

Bilibili Tech

Mar 25, 2022 · Big Data

Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling

Bilibili transformed its YARN CapacityScheduler from a heartbeat‑driven design to a multi‑threaded global scheduler by separating lock handling, adopting Weighted Round‑Robin with DRF, adding batch node selection, fixing proposal inconsistencies, tuning GC and logging, and thereby reduced application allocation time by about 38 % on clusters of up to 8,000 nodes.

Big Data · CapacityScheduler · Hadoop

0 likes · 15 min read

DataFunTalk

Mar 18, 2022 · Big Data

Scaling LinkedIn’s Hadoop YARN Cluster Beyond 10,000 Nodes: Challenges and Solutions

This article examines how LinkedIn tackled severe scheduling slowdowns when its Hadoop YARN cluster grew to nearly 10,000 nodes, analyzes the root causes of resource‑manager bottlenecks, and describes the fairness‑redefinition and scheduling‑logic patches that restored throughput and scalability.

Big Data · Hadoop · Resource Management

0 likes · 13 min read

Big Data Technology & Architecture

Feb 9, 2022 · Big Data

Apache Ambari Project Retired: End of an Era for Hadoop Management Tool

The Apache Ambari project, once a leading web‑based management and monitoring tool for Hadoop clusters, has been officially retired and moved to the Apache Attic after a unanimous community vote, marking the end of its development despite continued access to its website, source code, and JIRA.

Apache Ambari · Big Data · Hadoop

0 likes · 4 min read

IT Xianyu

Jan 28, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hue on CentOS 7 with Hadoop, Hive, and YARN

This tutorial explains how to set up the Hue web UI on a CentOS 7 machine by installing required dependencies, compiling Hue, configuring HDFS, YARN and Hive integration files, starting Hive services, launching Hue, and accessing the interface, with all commands and configuration snippets provided.

Big Data · CentOS · Hadoop

0 likes · 6 min read

IT Xianyu

Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Hive

0 likes · 6 min read

HomeTech

Jan 13, 2022 · Cloud Native

AutoKH: A Mixed‑Workload Resource Management Solution on Kubernetes and Hadoop

AutoKH is a cloud‑native mixed‑workload framework that integrates Kubernetes and Hadoop to dynamically schedule online and offline tasks, improve CPU and memory utilization, enforce priority classes, and ensure service stability through operators, CronHPA, and resource‑control components.

CPU Manager · Hadoop · Kubernetes

0 likes · 19 min read

Practical DevOps Architecture

Jan 4, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hadoop 2.9.2 Cluster on Three Nodes

This article provides a detailed, step-by-step tutorial for installing Hadoop 2.9.2, configuring environment variables, editing XML configuration files, formatting the NameNode, starting HDFS and YARN services, testing the cluster, and setting up the MapReduce history server on a three‑node Linux environment.

Big Data · Cluster Setup · Hadoop

0 likes · 9 min read

DataFunTalk

Dec 27, 2021 · Big Data

Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies

This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.

Big Data · Hadoop · Hive

0 likes · 20 min read

HomeTech

Dec 24, 2021 · Big Data

Handling java.lang.OutOfMemoryError in Hadoop MapReduce

This article explains the four locations where java.lang.OutOfMemoryError can occur in Hadoop's MapReduce framework—client, ApplicationMaster, Map, and Reduce phases—and provides configuration adjustments and best‑practice solutions to mitigate each type of OOM issue.

Hadoop · MapReduce · OutOfMemoryError

0 likes · 11 min read

dbaplus Community

Dec 15, 2021 · Big Data

How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

Balance Optimization · Big Data Operations · Data Migration

0 likes · 14 min read

Big Data Technology Architecture

Nov 28, 2021 · Big Data

Investigation and Resolution of HiveServer2 JDBC Connection Failures and GC‑Induced Hang

The article analyzes why HiveServer2 experiences JDBC connection failures and task execution stalls under high concurrency, reproduces the issues using GC monitoring and large join queries, and presents memory‑ and GC‑tuning solutions including server migration and JVM parameter adjustments to improve stability.

GC Tuning · Hadoop · HiveServer2

0 likes · 7 min read

Big Data Technology & Architecture

Nov 22, 2021 · Big Data

Comprehensive Big Data Learning Path and Resource Guide

This article presents a detailed learning roadmap for aspiring big‑data experts, covering foundational programming languages, data structures, Linux basics, databases, distributed system theory, and essential frameworks such as Hadoop, Spark, Flink, and Kafka, and provides curated Bilibili video links and reference materials.

Big Data · Flink · Hadoop

0 likes · 9 min read

DataFunTalk

Nov 20, 2021 · Big Data

How to Build a Big Data Platform from Zero to One: Architecture, Components, and Best Practices

This article provides a comprehensive guide to designing and implementing a big‑data platform, covering architecture overview, data ingestion with Flume, storage on HDFS/Hive/HBase, processing engines such as Hive, Spark and Flink, scheduling solutions like Azkaban and Airflow, and the construction of self‑service analytics systems.

Big Data · Data Engineering · ETL

0 likes · 29 min read

Big Data Technology Architecture

Nov 13, 2021 · Big Data

Case Study: Migrating Baicaowei's On‑Premise Hadoop Data Platform to Alibaba Cloud Native Data Lake

This article details Baicaowei's migration from an IDC‑hosted Hadoop cluster to a cloud‑native data lake on Alibaba Cloud, outlining the business drivers, pain points of the legacy platform, architectural goals, design principles, solution selection, implementation steps, and future outlook for the new big‑data ecosystem.

Alibaba Cloud · Big Data · Cloud Migration

0 likes · 16 min read

Architects' Tech Alliance

Nov 12, 2021 · Big Data

Understanding Data Lakes: Definitions, Evolution, and Architectural Patterns

The article explains what a data lake is, compares various vendor definitions, outlines its four essential components, describes three evolutionary architecture stages from self‑hosted Hadoop to cloud‑native storage‑compute separation, and discusses the benefits and challenges of adopting data lake solutions in modern big‑data platforms.

AWS · Data Lake · Hadoop

0 likes · 8 min read

Tongcheng Travel Technology Center

Nov 2, 2021 · Big Data

Hadoop Cluster Cross-Data Center Migration Practice at Tongcheng Travel

This article details Tongcheng Travel’s month‑long, zero‑downtime migration of hundreds of petabytes of Hadoop HDFS and YARN clusters across data centers, describing the background, migration strategies, lessons learned, tool enhancements, and future plans to improve data locality, balance, and monitoring.

Big Data · Cluster Migration · Data Center

0 likes · 16 min read

DataFunTalk

Oct 18, 2021 · Big Data

Building an Intelligent Data Warehouse at Yixin Group: A Big Data Platform Case Study

The article describes how Yixin Group’s product team created an in‑house intelligent data warehouse using Hadoop, Flink/Spark, and standardized data services to transform scattered automotive‑finance data into a secure, scalable platform that supports real‑time analytics and drives business growth.

Big Data · Data Engineering · Flink

0 likes · 10 min read

21CTO

Oct 14, 2021 · Big Data

How LinkedIn Scaled Hadoop to 11,000 Nodes and Solved YARN Delays

LinkedIn’s engineers detail how they repeatedly doubled their Hadoop cluster to over 11,000 nodes, tackled YARN scheduling delays caused by workload imbalances, and created the DynoYARN simulation tool to predict performance impacts of massive scaling.

Big Data · DynoYARN · Hadoop

0 likes · 7 min read

Big Data Technology & Architecture

Oct 13, 2021 · Big Data

God of Big Data: A Comprehensive Learning Path and Systematic Resources for Big Data Engineers

The "God of Big Data" project, launched in 2019, offers a detailed learning roadmap, systematic column resources covering Hadoop, Spark, Kafka, and more, and invites engineers transitioning from backend to big‑data development to follow curated articles, GitHub code, and CSDN tutorials.

Data Engineering · Hadoop · Spark

0 likes · 6 min read

Java High-Performance Architecture

Oct 12, 2021 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

This article breaks down a typical big data platform architecture into its four layers—data collection, storage and analysis, sharing, and real‑time computation—detailing the essential tools such as Flume, HDFS, Hive, Spark, DataX, and task scheduling systems that enable scalable, low‑latency data processing and delivery.

Big Data · Data Architecture · DataX

0 likes · 8 min read

Architecture Digest

Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data Architecture · DataX

0 likes · 8 min read

Big Data Technology & Architecture

Oct 8, 2021 · Big Data

Hadoop HDFS Storage Optimization, Erasure Coding, Heterogeneous Storage, and Cluster Tuning Guide

This article provides a comprehensive guide to optimizing Hadoop HDFS storage through erasure coding and heterogeneous storage policies, explains fault‑tolerance techniques such as safe mode and slow‑disk monitoring, and shares practical MapReduce performance tuning and enterprise‑level configuration examples for large‑scale clusters.

Cluster Tuning · HDFS · Hadoop

0 likes · 32 min read

Big Data Technology & Architecture

Sep 23, 2021 · Big Data

Handling Non‑Splittable gzip Files in Hadoop and Spark: MapReduce Splits and Performance Considerations

This article explains how a 10 GB gzip file is stored and processed on HDFS, details the MapReduce split calculation using GzipCodec, and discusses why Spark reads such non‑splittable files with a single task, recommending file splitting or format conversion for better performance.

Data Splits · Hadoop · MapReduce

0 likes · 8 min read

Java Architect Essentials

Sep 21, 2021 · Big Data

Interview on Kuaishou's Billion‑Scale Big Data Architecture Evolution and Practices

The interview with Kuaishou senior architect Zhao Jianbo details the three‑phase evolution of its trillion‑scale big data platform, covering foundational Hadoop services, real‑time and OLAP extensions, deep customizations, Spring Festival Gala challenges, scheduling innovations, Hadoop usage, and the relationship between big data and cloud architectures.

Big Data · Flink · Hadoop

0 likes · 19 min read

ITPUB

Sep 16, 2021 · Big Data

Understanding Hadoop: Architecture, HDFS, MapReduce, and Their Pros & Cons

This article explains how Hadoop revolutionized big data by providing a distributed architecture with HDFS for storage and MapReduce for processing, outlines its ecosystem components, describes the inner workings of HDFS and MapReduce, and discusses the strengths and limitations of this approach.

HDFS · Hadoop · MapReduce

0 likes · 7 min read

Big Data Technology & Architecture

Sep 16, 2021 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer data structure to store serialized key/value pairs and their metadata in memory, describes its initialization, write path, spill handling, and the underlying algorithms that ensure efficient in‑memory sorting and disk spilling.

Hadoop · In-Memory Buffer · MapReduce

0 likes · 24 min read

IT Architects Alliance

Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data Platform · Hadoop

0 likes · 9 min read

Architects' Tech Alliance

Sep 2, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

The article outlines a typical big data platform architecture, detailing its core layers—data collection, storage and analysis, sharing, application, real-time computation, and task scheduling—while describing key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and Redis.

Data Architecture · Data Integration · Hadoop

0 likes · 9 min read

Big Data Technology & Architecture

Sep 1, 2021 · Big Data

Understanding Hadoop Data Splitting and InputFormat Mechanisms

This article explains Hadoop's data splitting concepts, the distinction between HDFS blocks and logical InputSplits, details the source code of various InputFormats such as TextInputFormat, CombineTextInputFormat, KeyValueTextInputFormat, NLineInputFormat, and custom InputFormats, and provides complete Java examples for Mapper, Reducer, and driver classes.

Data Splitting · Hadoop · InputFormat

0 likes · 24 min read

Big Data Technology Architecture

Aug 24, 2021 · Big Data

An Overview of Apache Parquet: Architecture, Storage Model, and Comparison with ORC

This article provides a comprehensive introduction to Apache Parquet, covering its origins, columnar storage advantages, nested schema support, internal architecture, storage model components, comparison with ORC, and practical tools for inspecting Parquet files.

Columnar Storage · Hadoop · ORC Comparison

0 likes · 10 min read

Big Data Technology & Architecture

Aug 10, 2021 · Databases

Kudu Overview: Architecture, Features, and Use Cases

Kudu is an open‑source columnar storage engine from Cloudera that combines high‑throughput batch processing with low‑latency random reads, offering features such as C++/Java APIs, Raft‑based replication, flexible consistency, partitioning, and integration with Hadoop, Spark, Impala, and other ecosystem components.

Columnar Storage · Database · Hadoop

0 likes · 64 min read

The Dominant Programmer

Aug 4, 2021 · Big Data

How to Set Up Hadoop Java Development on Windows and Access HDFS via Java API

This guide walks through installing Hadoop on Windows, configuring environment variables and XML files, adding the required winutils binaries, verifying the setup with HDFS shell commands, and then building a Maven project that uses the Java API to list and inspect files in HDFS.

Configuration · HDFS · Hadoop

0 likes · 11 min read
