Tagged articles
191 articles
Page 2 of 2
Tencent Cloud Developer
Jul 13, 2020 · Big Data

Building MVP: A Lightweight Big Data Analysis System for Product Growth

The article describes how a lightweight big‑data analysis platform called MVP was built from scratch—using a User‑Event‑Config model, HDFS + ClickHouse + Spark, and four modules for metric monitoring, root‑cause alerts, deep growth analysis, and A/B testing—enabling real‑time insights in seconds instead of days and dramatically accelerating product‑growth operations.

AARRR Model · ClickHouse · HDFS
0 likes · 9 min read
Big Data Technology & Architecture
May 13, 2020 · Big Data

Analysis of Hadoop HDFS Data Read and Write Process

This article explains the underlying principles of Hadoop HDFS read and write operations, detailing how the client interacts with the NameNode and DataNodes, the role of FSDataInputStream and FSDataOutputStream, block location retrieval, pipeline replication, and file closure steps.

Big Data · Data Read · Data Write
0 likes · 8 min read
Big Data Technology Architecture
May 6, 2020 · Big Data

Ozone vs HDFS: Why Ozone Cannot Replace Hadoop’s Core Storage

In this article, senior Alibaba engineer Zheng Kai analyzes Ozone’s role in the Hadoop ecosystem, arguing that despite its usefulness, Ozone cannot solve Hadoop’s core challenges of complexity, cost, and performance, and that Hadoop must focus on storage innovation, compute‑storage separation, and cloud integration to stay relevant.

HDFS · Hadoop · Ozone
0 likes · 14 min read
Big Data Technology & Architecture
May 6, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring a Hadoop Cluster on Three Virtual Machines

This article provides a comprehensive, hands‑on tutorial for preparing three VMs, installing JDK and Hadoop, configuring core‑site.xml, hdfs‑site.xml, mapred‑site.xml, yarn‑site.xml, setting environment variables, distributing the package, starting HDFS and YARN, and verifying the cluster via web UI and jps commands.

Big Data · Cluster Setup · HDFS
0 likes · 14 min read
Big Data Technology Architecture
Apr 20, 2020 · Big Data

Introduction to HDFS: Architecture, Features, Replication, Rack Awareness, and Metadata Management

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), covering its streaming data access model, key characteristics, master‑slave architecture, block storage and replication mechanisms, rack‑aware placement strategy, and how the NameNode manages metadata and checkpoints.

Distributed File System · HDFS · Hadoop
0 likes · 7 min read
dbaplus Community
Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · HDFS
0 likes · 19 min read
Youzan Coder
Apr 1, 2020 · Big Data

Presto Implementation and Practice at YouZan: A Big Data Query Engine Journey

The article outlines Presto’s high‑performance, coordinator‑worker architecture and query flow, describes YouZan’s migration from mixed Hadoop deployment to dedicated low‑latency clusters, details challenges such as small‑file handling and regex backtracking with their fixes, and previews future enhancements like Alluxio integration, session property managers, and Ranger‑based multi‑tenant isolation.

Facebook · HDFS · Performance Optimization
0 likes · 14 min read
Open Source Linux
Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup
0 likes · 13 min read
Sohu Tech Products
Mar 4, 2020 · Big Data

Introduction to HDFS: Architecture, Components, and Operations

This article provides a comprehensive overview of HDFS, covering its role as a distributed file system, the concepts of blocks, NameNode and DataNode responsibilities, replication, edit logs, snapshots, high‑availability mechanisms, and practical considerations for managing large‑scale data storage.

DataNode · Distributed File System · HDFS
0 likes · 11 min read
Ctrip Technology
Feb 27, 2020 · Big Data

Ctrip's Cross‑Datacenter Hadoop Architecture: Design, Implementation, and Lessons Learned

This article details Ctrip's cross‑datacenter Hadoop architecture, covering the evolution of its Hadoop platform, the challenges of multi‑site bandwidth and latency, design choices between multi‑cluster and single‑cluster solutions, and the concrete HDFS, YARN, balancer, migration, monitoring, and throttling implementations that enable transparent, consistent, and efficient multi‑datacenter operations.

Cross-DataCenter · Data Migration · HDFS
0 likes · 15 min read
dbaplus Community
Feb 25, 2020 · Backend Development

How to Merge Small Files in Flink Checkpoints to Reduce HDFS Load

This article explains a small‑file‑merging technique for Apache Flink checkpoints that reuses FSDataOutputStreams to combine multiple state files into a single HDFS file, detailing design considerations such as concurrent checkpoint support, reference‑counted deletion, space amplification reduction, fault handling, compatibility, and observed production performance gains.
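The reference-counted deletion mentioned above can be illustrated with a minimal sketch. It is not Flink's implementation: the class `MergedFileRegistry`, its methods, and the file names are hypothetical, and the `deleted` list stands in for an actual HDFS delete call. The idea shown is only that a merged physical file shared by several checkpoints is removed once the last checkpoint referencing it is released.

```python
class MergedFileRegistry:
    """Hypothetical tracker of which checkpoints reference each merged file."""

    def __init__(self):
        self.files_by_checkpoint = {}  # checkpoint id -> set of physical file paths
        self.ref_counts = {}           # physical file path -> number of live references
        self.deleted = []              # paths that would be deleted from HDFS

    def register(self, checkpoint_id, files):
        """Record that a completed checkpoint references the given merged files."""
        self.files_by_checkpoint[checkpoint_id] = set(files)
        for path in self.files_by_checkpoint[checkpoint_id]:
            self.ref_counts[path] = self.ref_counts.get(path, 0) + 1

    def release(self, checkpoint_id):
        """Drop a checkpoint; delete only files that no other checkpoint uses."""
        for path in sorted(self.files_by_checkpoint.pop(checkpoint_id)):
            self.ref_counts[path] -= 1
            if self.ref_counts[path] == 0:
                del self.ref_counts[path]
                self.deleted.append(path)


registry = MergedFileRegistry()
registry.register(1, ["merged-0"])
registry.register(2, ["merged-0", "merged-1"])

registry.release(1)
after_first_release = list(registry.deleted)  # merged-0 is still used by checkpoint 2

registry.release(2)
after_second_release = list(registry.deleted)
```

Counting references per physical file, rather than deleting files together with their checkpoint, is what lets concurrent checkpoints share one merged file safely.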

Apache Flink · Checkpoint · HDFS
0 likes · 13 min read
Big Data Technology & Architecture
Jan 7, 2020 · Big Data

Why Small Files Are a Problem in Big Data and How Delta Lake Compaction Solves It

This article examines the root causes and performance impact of massive small-file proliferation in traditional data warehouses, explains why HDFS metadata limits scalability, and details how Delta Lake’s custom compaction process can safely merge these files for append-only tables without disrupting reads or writes.

Delta Lake · HDFS · Small Files
0 likes · 5 min read
Didi Tech
Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. EditLog, Fsimage, StringTable and authentication incompatibilities were resolved by omitting EC data, using fallback images, rolling back commits and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode‑NameNode‑DataNode procedure, was validated with rehearsals and a custom trash‑management tool, and achieved uninterrupted service with improved stability, performance and cost efficiency.

Big Data · Cluster Migration · HDFS
0 likes · 11 min read
DataFunTalk
Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big Data · ByteDance · Federation
0 likes · 18 min read
Big Data Technology & Architecture
Dec 9, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.
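Partitioning by event time comes down to mapping each event's timestamp to a directory. The sketch below shows only that path logic in plain Python; it is not Flink's bucket assigner API, and the function name `bucket_path`, the `dt=/hour=` layout, and the base directory are assumptions made for illustration.

```python
from datetime import datetime, timezone


def bucket_path(base_dir, event_time_ms):
    """Map an event timestamp (epoch milliseconds) to an hourly directory in UTC."""
    ts = datetime.fromtimestamp(event_time_ms // 1000, tz=timezone.utc)
    return f"{base_dir}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


# 1575885600000 ms is 2019-12-09 10:00:00 UTC.
first = bucket_path("/warehouse/events", 1575885600000)
same_hour = bucket_path("/warehouse/events", 1575885600000 + 3_599_999)
next_hour = bucket_path("/warehouse/events", 1575885600000 + 3_600_000)
```

Because the bucket is derived from the event's own timestamp rather than the processing time, a late event still lands in the directory for the hour in which it occurred.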

Apache Flink · Big Data · Exactly-Once
0 likes · 11 min read
ITPUB
Dec 2, 2019 · Backend Development

How Xiaomi Built Talos: A Scalable, Stateless Message Queue for Billions of Events

This article details Xiaomi's journey from Kafka 0.8 to the home‑grown Talos system, covering business motivations, storage‑compute separation architecture, key challenges such as tail‑read and consistency, and extensive performance, resource, and platform optimizations that enable a high‑throughput, multi‑tenant messaging service.

Distributed Messaging · HDFS · Message Queue
0 likes · 16 min read
360 Tech Engineering
Sep 19, 2019 · Big Data

Understanding HDFS: Architecture, Read/Write Operations, Component Roles, Commands, and Pros & Cons

This article provides a comprehensive overview of HDFS, covering its purpose, architecture, read/write mechanisms, replication strategies, component responsibilities, common command‑line tools, and the advantages and disadvantages of using Hadoop Distributed File System for large‑scale data storage.

Distributed File System · HDFS · Hadoop
0 likes · 10 min read
360 Tech Engineering
Aug 22, 2019 · Big Data

Design and Implementation of XStore: A Hadoop‑Based Sample Storage System

This article details the design, architecture, and operational experience of XStore, a Hadoop‑backed sample storage system that handles billions of APK and other binary samples, addressing functional and non‑functional requirements such as real‑time upload, large‑scale storage, high‑performance reads, and disaster recovery.

HBase · HDFS · Hadoop
0 likes · 11 min read
21CTO
Jun 28, 2019 · Big Data

Master Hadoop High Availability: A Complete Step‑by‑Step HA HDFS & YARN Guide

This article provides a comprehensive, language‑agnostic tutorial on building a highly available Hadoop cluster, covering HDFS and YARN HA architectures, QJM shared storage, required components, configuration files, installation commands, startup procedures, verification steps, and troubleshooting references.

Cluster Setup · HDFS · Hadoop
0 likes · 20 min read
Big Data Technology Architecture
May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.
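The performance gap between reduceByKey and groupByKey comes from map-side combining: reduceByKey merges values within each partition before the shuffle, while groupByKey sends every record across it. The following is a pure-Python simulation of that difference, not Spark code; the function names and sample partitions are made up for illustration, and it counts the records that would cross the shuffle.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value         # map-side combine
        shuffled.extend(local.items())  # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]
grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)
```

Both paths produce the same per-key sums, but the combined version shuffles one record per key per partition instead of one per input record, so the saving grows with the number of repeated keys.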

HDFS · Hadoop · Kafka
0 likes · 10 min read
Qunar Tech Salon
May 16, 2019 · Big Data

Optimizing HDFS Federation Data Migration with FastCopy and qFastCopy at Qunar

This article describes the challenges of scaling Qunar's Hadoop NameNode, introduces HDFS Federation and the FastCopy tool, presents performance tests comparing FastCopy with DistCp, and details the development and evaluation of an optimized qFastCopy solution that shortens multi‑petabyte migrations that previously took hours.

Big Data · Data Migration · FastCopy
0 likes · 8 min read
dbaplus Community
May 13, 2019 · Big Data

Tackling HDFS Performance Bottlenecks: Real‑World Optimizations from VIP.com

This article examines the performance challenges encountered after upgrading a large‑scale HDFS cluster at VIP.com, explains the root causes of NameNode RPC latency, and presents concrete solutions—including delayed block reports, configurable block deletion, federation redesign, client monitoring, temp‑directory sharding, and small‑file handling—along with configuration snippets and real‑world results.

Big Data · Federation · HDFS
0 likes · 13 min read
Big Data Technology & Architecture
Apr 3, 2019 · Big Data

Understanding RAID and Its Role in HDFS Architecture

This article explains the storage challenges of big data, introduces RAID technologies and their variants, and shows how the principles of RAID are applied in the Hadoop Distributed File System (HDFS) to achieve scalable, reliable, and high‑performance data storage and processing.

Big Data · HDFS · RAID
0 likes · 10 min read
Big Data Technology & Architecture
Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop
0 likes · 15 min read
Youzan Coder
Mar 1, 2019 · Big Data

Flume Practice at YouZan: Data Collection and Pipeline Construction in Big Data Scenarios

YouZan’s experience with Flume shows how the at‑least‑once delivery model, combined with FileChannel storage and custom extensions such as an NsqSource, hourly‑based HdfsEventSink, metric reporting server, and timestamp interceptor, can reliably move MySQL binlog data to HDFS, while tuning transaction batch size and channel capacity boosts throughput and stability, paving the way for a unified management platform.

At-Least-Once · Flume · HDFS
0 likes · 11 min read
Didi Tech
Jan 31, 2019 · Big Data

Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop’s single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters; Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Big Data · HDFS · Hadoop
0 likes · 11 min read
Programmer DD
Nov 18, 2018 · Databases

How We Optimized HBase for 80 Billion Daily Logs: Real‑World Tuning Strategies

This article details the practical performance‑tuning steps applied to a large‑scale HBase deployment handling 80 billion daily log entries, covering rowkey redesign, region redistribution, HDFS write‑timeout fixes, network‑topology adjustments, and JVM parameter tweaks that together stabilized the system and dramatically improved throughput.

HBase · HDFS · Performance Tuning
0 likes · 14 min read
JD Tech
Sep 20, 2018 · Big Data

Optimizing Local Storage Systems for Large‑Scale Hadoop HDFS Clusters

This article explains the architecture of Hadoop HDFS, identifies performance bottlenecks in page cache and metadata handling on DataNodes, and presents four practical optimization techniques—including cache‑buffer separation, barrier disabling, directory restructuring, and real‑time monitoring—demonstrating significant throughput and latency improvements in large‑scale clusters.

HDFS · Hadoop · Linux kernel
0 likes · 14 min read
Tongcheng Travel Technology Center
Aug 14, 2018 · Big Data

Understanding HDFS Read and Write Mechanisms

This article explains how HDFS handles file reading and writing, detailing the roles of DFSClient, block selection, hedged reads, packet construction, checksum handling, and the interaction with NameNode and DataNode pipelines to ensure reliability and performance.

DFSClient · Distributed File System · HDFS
0 likes · 7 min read
JD Tech
Jul 10, 2018 · Big Data

Deploying Hadoop KMS for Transparent HDFS Encryption: A Step‑by‑Step Guide

This article details a complete, hands‑on deployment of Hadoop KMS on a CentOS‑based Hadoop 2.6.1 cluster, covering environment setup, configuration file changes, key generation, service startup, encryption‑zone creation, user permission tuning, verification procedures, and common troubleshooting tips.

HDFS · Hadoop · KMS
0 likes · 19 min read
dbaplus Community
Jun 7, 2018 · Operations

Why Ceph’s Unlimited Scalability Isn’t As Simple As It Looks

The article examines Ceph’s claimed infinite scalability, cost advantages, and operational stability from an SRE perspective, comparing it with centralized systems like HDFS, and reveals practical challenges such as expansion granularity, crushmap rebalancing, utilization limits, and maintenance overhead.

Ceph · HDFS · Operations
0 likes · 15 min read
dbaplus Community
Mar 7, 2018 · Big Data

Taming Massive HDFS Data Growth: Monitoring, Capacity Planning & Hive Optimization

The article outlines a systematic approach for large‑scale Hadoop clusters to monitor daily data growth, identify abnormal paths, manage rapid expansion, clean unused cold data, and implement capacity forecasts, while providing concrete daily and quarterly actions, Hive‑specific strategies, and practical examples to keep storage under control.

Big Data · Data Growth · HDFS
0 likes · 17 min read
dbaplus Community
Dec 14, 2017 · Big Data

Scaling Vipshop’s Big Data Platform: Monitoring, Multi‑HDFS, Yarn Optimization & Capping

In 2017, Vipshop’s senior big‑data architect shares how the company grew its Hadoop‑based platform from zero to a thousand‑node cluster, detailing cluster health monitoring, multi‑HDFS deployment via Hive, YARN container allocation improvements, and a hook‑driven Capping resource‑control system to boost stability and efficiency.

Big Data · HDFS · capping
0 likes · 15 min read
Full-Stack DevOps & Kubernetes
Oct 21, 2017 · Big Data

Deploy Hadoop CDH5.4 on CentOS 6: Install HDFS, YARN, and WebHDFS

This guide walks through preparing three CentOS 6.9 nodes, configuring hostnames, time sync, password‑less SSH, disabling IPv6, installing JDK, downloading CDH 5.4, setting up core‑site and hdfs‑site XML files, formatting the NameNode, starting HDFS services, configuring YARN and MapReduce, and verifying the installations via the Web UI.

Big Data · CDH · CentOS
0 likes · 18 min read
StarRing Big Data Open Lab
Jun 9, 2017 · Big Data

Secure HDFS with Guardian 5.0: Complete Permission and Quota Guide

This article explains why Hadoop security is critical, introduces Guardian 5.0’s unified authentication and authorization framework, and provides step‑by‑step instructions for configuring HDFS permissions and quotas through its web UI, helping administrators protect massive data assets efficiently.

Guardian5.0 · HDFS · Hadoop
0 likes · 9 min read
ITFLY8 Architecture Home
May 10, 2017 · Big Data

How Hadoop Implements Distributed File Systems: From GFS Theory to Practice

This article explains the fundamentals of distributed file systems by linking Google’s GFS, MapReduce, and BigTable concepts to Hadoop’s open‑source implementation, covering terminology, architecture, server roles, data distribution, RPC protocols, file operations, fault recovery, consistency, load balancing, and garbage collection.

GFS · HDFS · Hadoop
0 likes · 34 min read
MaGe Linux Operations
May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · MapReduce
0 likes · 13 min read
Qunar Tech Salon
Apr 21, 2017 · Big Data

Ensuring Exact‑Once Semantics in Spark Streaming with Kafka: Offline Repair and Data Deduplication Strategies

This article explains why Spark Streaming combined with Kafka can only guarantee at‑least‑once delivery, outlines the challenges of delayed and out‑of‑order events, and presents practical offline‑repair, deduplication, and output‑format techniques, including code examples, to achieve exactly‑once semantics in big‑data pipelines.

Exact-Once · HBase · HDFS
0 likes · 11 min read
Meituan Technology Team
Apr 14, 2017 · Big Data

Practical Experience of HDFS Federation at Meituan: Challenges, Improvements, and Automation

Meituan‑Dianping migrated its 2,000‑node HDFS cluster to Federation by fixing ViewFs compatibility, simplifying mount points, leveraging FastCopy for massive data moves, improving token handling, and automating split‑workflow steps, thereby overcoming single‑NameNode bottlenecks and providing a practical blueprint for large‑scale Hadoop deployments.

Big Data · FastCopy · Federation
0 likes · 22 min read
Meituan Technology Team
Mar 17, 2017 · Big Data

Optimizing Hadoop NameNode Restart in HA with QJM

By applying a series of JIRA patches and configuration tweaks—such as shrinking the fsLock scope, increasing checkpoint transaction thresholds, off‑loading quota calculations, simplifying BlockReport handling, and async processing of mis‑replicated blocks—the Hadoop HA NameNode restart time in a 540 MB metadata cluster drops from roughly 4000 seconds to about 2000 seconds, cutting total downtime to around 35 minutes and greatly improving cluster availability.

HA · HDFS · Hadoop
0 likes · 18 min read
Efficient Ops
Feb 9, 2017 · Big Data

Mastering HDFS Disk Balancer: Optimize DataNode Storage in Hadoop 3

This article explains the new HDFS disk balancer feature introduced in Hadoop 3, covering its purpose, supported volume‑selection policies, step‑by‑step usage, planning and execution commands, and how it helps maintain balanced storage across DataNode disks.

Disk Balancer · HDFS · Hadoop
0 likes · 8 min read
Huawei Cloud Developer Alliance
Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · HDFS · Hadoop
0 likes · 18 min read
Meituan Technology Team
Dec 9, 2016 · Big Data

Memory Usage Analysis of HDFS NameNode Core Data Structures

The article quantitatively breaks down HDFS NameNode memory consumption, showing that the Namespace tree and BlocksMap together dominate heap usage (≈53 GB in large clusters), provides detailed per‑object size estimates for NetworkTopology, INode and block structures, and proposes a simple formula to predict total heap requirements and tuning recommendations.

Big Data · HDFS · Memory Management
0 likes · 13 min read
dbaplus Community
Nov 20, 2016 · Databases

How to Slash HBase Read Latency: Proven Client, Server, and HDFS Tweaks

This article examines the common causes of high read latency in HBase—such as full GC, region‑server imbalance, low write throughput, and inefficient client settings—and provides concrete optimization steps for the client, server, column‑family design, and HDFS layers to dramatically improve performance.

Client Tuning · HBase · HDFS
0 likes · 16 min read
ITFLY8 Architecture Home
Nov 18, 2016 · Big Data

Understanding HDFS: Design Goals, Architecture, and Data Replication

This article explains HDFS’s core design principles, including fault tolerance, high‑throughput data access, master‑slave architecture with a NameNode and DataNodes, namespace management, block replication strategies, safe mode, metadata persistence, communication protocols, robustness mechanisms, and file operations such as creation, deletion, and space reclamation.

HDFS · distributed storage
0 likes · 16 min read
MaGe Linux Operations
Nov 7, 2016 · Big Data

How HDFS Achieves Low Cost, High Reliability, and Fault Tolerance

This article explains how HDFS, inspired by Google’s GFS, provides a low‑cost, highly reliable, fault‑tolerant, and high‑performance distributed file system for big‑data workloads by using replication, standby NameNodes, block storage, rack awareness, and compute‑close‑to‑data strategies.

Big Data · Distributed File System · HDFS
0 likes · 7 min read
High Availability Architecture
Oct 20, 2016 · Big Data

Understanding HDFS EditLog Format and Quorum Journal Manager Recovery Process

This article explains the HDFS EditLog file structure, the design of the Quorum Journal Manager for high‑availability, the write‑path optimizations such as batch flushing and double‑buffering, and the detailed Multi‑Paxos based recovery algorithm including isolation, segment selection, prepare and accept phases, and handling journal node failures.

Distributed Systems · EditLog · HDFS
0 likes · 12 min read
Meituan Technology Team
Aug 26, 2016 · Big Data

Memory Architecture and Analysis of Hadoop HDFS NameNode

The article dissects Hadoop 2.4.1’s HDFS NameNode memory architecture, detailing how the Namespace, BlockManager, NetworkTopology, and LeaseManager consume the heap, exposing scaling problems when metadata reaches hundreds of millions of inodes and blocks, and recommending file merging, block‑size tuning, federation, or external KV stores to mitigate heap pressure.

Big Data · HDFS · Memory Management
0 likes · 17 min read
MaGe Linux Operations
Aug 23, 2016 · Big Data

Step-by-Step Guide to Building a Hadoop Cluster on CentOS 6.5

This article provides a comprehensive, hands‑on tutorial for setting up a Hadoop 2.6.4 cluster on a CentOS 6.5 development server, covering SSH password‑less login, user/group creation, DNS configuration, JDK installation, environment variables, Hadoop installation, HDFS and YARN configuration, and troubleshooting native library warnings.

Big Data · CentOS · Cluster Setup
0 likes · 12 min read
Architecture Digest
Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article reviews Hadoop’s origins from Google’s pioneering papers, explains its architecture and ecosystem, evaluates its strengths such as scalability and benchmarks, discusses current limitations like single‑point failures and complex programming, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data · Future · HDFS
0 likes · 14 min read
ITPUB
Jun 15, 2016 · Databases

Understanding HBase’s Physical Architecture: Regions, Stores, and WAL

This article explains HBase’s internal architecture, covering the roles of HRegionServer, Client, ZooKeeper, Master, RegionServer, the physical storage layout, StoreFile and HFile structures, and the Write-Ahead Log mechanism that ensures data durability and fault tolerance.

HBase · HDFS · NoSQL
0 likes · 13 min read
Hulu Beijing
May 31, 2016 · Big Data

What’s New in Hadoop 3.0? Key Features and Improvements Explained

Hadoop 3.0, built on JDK 1.8, adds erasure‑coded HDFS, multi‑NameNode support, native MapReduce task optimizations, cgroup‑based YARN memory and disk isolation, and container resizing, with an alpha slated for summer and a GA release expected in November or December.

Big Data · HDFS · Hadoop
0 likes · 5 min read
Qunar Tech Salon
May 13, 2016 · Big Data

Overview and Architecture of Hadoop Distributed File System (HDFS)

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), detailing its design goals, architecture components such as NameNode, DataNode and SecondaryNameNode, data block handling, replication strategies, communication protocols, and the read, write, and delete processes.

Big Data · Distributed File System · HDFS
0 likes · 18 min read
Architect
Apr 28, 2016 · Big Data

Design and Architecture of Youzan Unified Log Platform

The article describes the design, components, and implementation details of Youzan's unified log platform, covering log ingestion via rsyslog, Logstash, and Flume, centralized processing with Kafka, real‑time analysis using Storm/Spark, and storage in HDFS, Elasticsearch, and Hawk, while also discussing challenges and future improvements.

Elasticsearch · HDFS · Kafka
0 likes · 10 min read
ITPUB
Mar 19, 2016 · Big Data

Inside HDFS: How NameNode and DataNode Manage Big Data Writes and Reads

This article explains the fundamentals of distributed file systems, focusing on Hadoop’s HDFS architecture, the separation of metadata and data via NameNode and DataNode, and detailed step‑by‑step write and read processes, including replication, fault recovery, and block splitting across nodes.

Big Data · DataNode · Distributed File System
0 likes · 8 min read
ITPUB
Jan 8, 2016 · Databases

How Facebook Scales MySQL Backups: Strategies, Storage, and Incremental Techniques

This article explains Facebook's large‑scale MySQL backup architecture, covering the Python‑based automation framework, master‑slave deployment, logical mysqldump backups, warm and cold storage locations, source selection heuristics, full and incremental backup pipelines, verification processes, and future RBR‑based improvements.

Facebook · HDFS · Incremental Backup
0 likes · 15 min read

Design Principles and Architecture of HDFS (Hadoop Distributed File System)

This article explains HDFS's design goals, master/slave architecture, namespace management, block replication strategies, fault tolerance mechanisms, metadata persistence, communication protocols, robustness features, data organization, access methods, and space reclamation, providing a comprehensive overview of Hadoop's distributed storage system.

DataNode · HDFS · NameNode
0 likes · 20 min read
MaGe Linux Operations
Apr 7, 2015 · Big Data

How Hadoop’s Tiered Storage Optimizes Data Based on Temperature

This article explains Hadoop’s tiered storage concept, describing how data is classified by temperature—hot, warm, cold, frozen—and automatically moved across disk and archive layers to optimize cost and performance, with examples from Hadoop versions and eBay’s large‑scale deployment.

Big Data · Data Temperature · HDFS
0 likes · 9 min read