Tagged articles

HDFS

192 articles · Page 1 of 2

Jun 24, 2026 · Big Data

Four Leading Distributed Storage Solutions Explained

The article reviews four major distributed storage systems—HDFS, Ceph, GlusterFS, and FastDFS—detailing their architectures, core strengths such as HDFS’s batch processing, Ceph’s unified object/block/file capabilities, GlusterFS’s horizontal scalability, and FastDFS’s lightweight handling of small files, while also noting each solution’s limitations.

Ceph · Distributed storage · FastDFS

0 likes · 6 min read

Big Data Technology Tribe

Jan 31, 2026 · Backend Development

Debugging Lance‑Spark & Lance‑Ray on HDFS: Build Wheels and Fix Common Errors

This guide walks through building custom pylance and lance‑namespace wheels to enable HDFS support, resolves common ModuleNotFoundError, Hive dependency, and native library issues, clarifies correct table_id usage, and provides a complete Python script that reads and modifies a Lance dataset with Ray.

HDFS · Python · lance-ray

0 likes · 9 min read

Raymond Ops

Jan 30, 2026 · Big Data

Build an Enterprise‑Grade HDFS HA and YARN Scheduler from Scratch

This guide walks you through designing and deploying a highly available HDFS architecture with dual NameNodes, ZooKeeper‑based failover, and a tuned YARN resource scheduler, covering detailed configuration files, failover testing, performance tuning, monitoring, automated health checks, capacity planning, and best‑practice checklists for production‑grade big‑data platforms.

Big Data · HA · HDFS

0 likes · 28 min read

Big Data Technology Tribe

Dec 19, 2025 · Big Data

Why Did Our HDFS Standby NameNode Crash? A Deep Dive into Block Recovery Bugs

A recent HDFS outage caused the Standby and Observer NameNodes to crash after heavy client load triggered block recovery failures, exposing a bug in commitBlockSynchronization that leads to mismatched block IDs and edit‑log inconsistencies, which can be fixed by applying HDFS‑17861.

BlockRecovery · Crash · HDFS

0 likes · 15 min read

Architect Chen

Oct 15, 2025 · Big Data

Comparing HDFS, Ceph, FastDFS, and GlusterFS: Which Distributed Storage Fits Your Needs?

This article provides a concise overview of four major distributed storage solutions—HDFS, Ceph, FastDFS, and GlusterFS—detailing their architectures, advantages, drawbacks, and ideal use cases to help engineers choose the right system for large‑scale data workloads.

Ceph · Distributed storage · FastDFS

0 likes · 7 min read

Mike Chen's Internet Architecture

Sep 18, 2025 · Fundamentals

Understanding Distributed Storage: File, Object, Block, and Key‑Value Systems Explained

This article explains the core concepts and architectures of distributed storage, covering file‑based systems like HDFS, object storage such as Ceph, block storage for high‑performance workloads, and key‑value stores like Redis and Cassandra, highlighting their use cases and design principles.

Ceph · Distributed storage · File System

0 likes · 4 min read

Mike Chen's Internet Architecture

Sep 10, 2025 · Fundamentals

Understanding Key Distributed Storage Systems: HDFS, Ceph, FastDFS, and TFS

This article provides a concise overview of four major distributed storage solutions—HDFS, Ceph, FastDFS, and TFS—highlighting their architectures, strengths, weaknesses, and typical use cases for large‑scale data and e‑commerce applications.

Big Data · Ceph · Distributed storage

0 likes · 4 min read

Mike Chen's Internet Architecture

Sep 9, 2025 · Fundamentals

Understanding 4 Major Distributed File Systems: HDFS, CephFS, GFS, and TFS

This article provides a concise overview of four key distributed file systems—HDFS, CephFS, GFS, and TFS—explaining their architectures, strengths, weaknesses, and typical application scenarios for large‑scale data storage and processing.

CephFS · Distributed File System · GFS

0 likes · 5 min read

MaGe Linux Operations

Sep 8, 2025 · Big Data

Build Enterprise‑Grade HDFS HA and Optimize YARN Scheduling from Scratch

This comprehensive guide walks you through constructing a fault‑tolerant HDFS high‑availability architecture, configuring dual NameNodes with ZooKeeper and JournalNode clusters, fine‑tuning YARN resource schedulers, and implementing monitoring, automated failover testing, and performance optimization, all backed by real‑world production experience and code examples.

Big Data Operations · HDFS · High Availability

0 likes · 24 min read

Mike Chen's Internet Architecture

Aug 29, 2025 · Fundamentals

Understanding Distributed Storage: HDFS, CephFS, GlusterFS, and FastDFS Compared

This article compares four major distributed storage solutions—HDFS, CephFS, GlusterFS, and FastDFS—detailing their architectures, strengths, weaknesses, and ideal use cases for big‑data processing, cloud-native environments, and high‑concurrency file services, and how they fit into modern infrastructure strategies.

Big Data · CephFS · Distributed storage

0 likes · 5 min read

Big Data Tech Team

Aug 24, 2025 · Big Data

Understanding Distributed Storage: HDFS, Ceph, GlusterFS, and FastDFS

This article provides a concise technical overview of four major distributed storage solutions—HDFS, Ceph, GlusterFS, and FastDFS—covering their architecture, key features, pros and cons, and typical use cases for large‑scale data processing and storage.

Ceph · Distributed storage · FastDFS

0 likes · 9 min read

Big Data Tech Team

Jun 8, 2025 · Big Data

Master Hadoop: A Step-by-Step Learning Roadmap for Big Data Professionals

This guide outlines a comprehensive Hadoop learning roadmap, covering essential prerequisites, core concepts such as HDFS, MapReduce, and YARN, hands‑on projects, advanced ecosystem tools like Hive, Pig, HBase and Spark, plus curated resources and community channels for aspiring big‑data engineers.

Distributed Computing · HDFS · Hadoop

0 likes · 7 min read

360 Zhihui Cloud Developer

May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Cluster Operations · Data Governance

0 likes · 9 min read

IT Services Circle

Feb 9, 2025 · Big Data

Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability

This article explains how HDFS, the Hadoop Distributed File System, splits large files into blocks, replicates them for fault tolerance, organizes the cluster into NameNode and DataNode components, and provides high‑availability and scalability mechanisms such as standby NameNode and federation, enabling reliable big‑data storage and access.

Big Data · DataNode · Distributed File System

0 likes · 11 min read

IT Architects Alliance

Jan 8, 2025 · Big Data

Understanding Distributed Storage: A Comparative Overview of HDFS, Ceph, and MinIO

This article explains the fundamentals, use cases, advantages, and trade‑offs of three major distributed storage solutions—HDFS, Ceph, and MinIO—guiding readers on how to select the most suitable system for big‑data, cloud‑native, and containerized environments.

Big Data · Ceph · Distributed storage

0 likes · 12 min read

Rare Earth Juejin Tech Community

Dec 26, 2024 · Big Data

Understanding Hadoop HDFS and MapReduce: Principles, Architecture, and Sample Code

This article explains the origins of big‑data technologies, details the architecture and read/write mechanisms of Hadoop's HDFS, describes the MapReduce programming model, and provides complete Java code examples for a simple distributed file‑processing job using Maven dependencies.

Big Data · Distributed File System · HDFS

0 likes · 15 min read

JD Retail Technology

Oct 29, 2024 · Big Data

JD Unified Storage Practice: Cross‑Region and Tiered Storage on HDFS

This article details JD's large‑scale HDFS unified storage implementation, covering cross‑region storage challenges, topology design, asynchronous block replication, flow‑control mechanisms, tiered storage strategies, automatic hot‑cold data migration, and the resulting performance and cost improvements for big‑data workloads.

Big Data · Cross-Region Storage · Data Management

0 likes · 20 min read

DataFunSummit

Oct 4, 2024 · Big Data

JD Retail HDFS Unified Storage: Cross‑Region and Tiered Storage Practices

This article presents JD Retail's large‑scale HDFS deployment, detailing its unified storage architecture, cross‑region data replication challenges and solutions, tiered storage strategies for hot, warm and cold data, and the operational modules that together improve performance, reliability and cost efficiency in a big‑data environment.

Big Data · Cross-Region Storage · Distributed File System

0 likes · 21 min read

dbaplus Community

Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big Data · Distributed storage · HDFS

0 likes · 23 min read

360 Zhihui Cloud Developer

Aug 8, 2024 · Big Data

How to Migrate HBase and HDFS Clusters Safely Without Downtime

This guide details a step‑by‑step migration plan for HBase and HDFS clusters, covering background, high‑availability architecture, role assignments, expansion and shrinkage of ZooKeeper and JournalNode, NameNode and DataNode migration, rolling restarts, and common upgrade pitfalls.

Big Data · Cluster Migration · HBase

0 likes · 12 min read

WeiLi Technology Team

Jun 28, 2024 · Big Data

How to Build a Robust Big Data Monitoring and Alerting System

This article explains why high‑availability design and comprehensive monitoring are essential for modern big‑data platforms, outlines a layered architecture, and provides practical guidance on health checks, alerting, and data‑quality monitoring across storage, compute, scheduling, and service layers.

Flink · HDFS · architecture

0 likes · 14 min read

360 Smart Cloud

May 28, 2024 · Big Data

HDFS Upgrade from 2.6.0‑cdh to 3.1.2 with DataNode Federation and Mixed Deployment

This article details the background, planning, step‑by‑step procedures, encountered issues, and rollback strategies for upgrading a Hadoop HDFS cluster from version 2.6.0‑cdh to 3.1.2, including mixed‑deployment of DataNodes across different federations and necessary configuration changes.

DataNode · HDFS · Hadoop

0 likes · 16 min read

DataFunTalk

May 27, 2024 · Big Data

JD Retail’s Unified HDFS Storage: Cross‑Region and Hierarchical Storage Practices

This article details JD Retail’s large‑scale HDFS deployment, describing how cross‑region storage challenges were solved with a full‑copy topology, asynchronous block replication, flow‑control mechanisms, and a tiered storage strategy that automatically moves hot, warm, and cold data among SSD, HDD, and high‑density HDD nodes to improve performance and cut costs.

Big Data · Data Management · Distributed storage

0 likes · 20 min read

Bilibili Tech

Apr 26, 2024 · Big Data

Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance

To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili's massive HDFS deployment, the team introduced hierarchical fine‑grained locking, splitting the lock into Namespace, BlockPool, and per‑INode levels. The change yielded up to three‑fold write throughput gains and a 90% drop in RPC queue time, and shifted the performance limit from lock contention to log synchronization.

Big Data · HDFS · Metadata

0 likes · 15 min read

Efficient Ops

Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data · Cluster Setup · HDFS

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Mar 22, 2024 · Artificial Intelligence

How AI Agents Diagnose HDFS Clusters: From Basics to Advanced Framework

This article explores the concept of AI agents, contrasts them with RAG, and demonstrates a LangChain‑based framework that uses specialized tools to automatically diagnose issues in an HDFS cluster through a series of practical experiments and advanced optimization ideas.

AI Agents · HDFS · LangChain

0 likes · 21 min read

Linux Code Review Hub

Mar 11, 2024 · Databases

How Didi Built a Next‑Gen Log Storage System with ClickHouse

Didi migrated its massive PB‑scale log data from Elasticsearch to ClickHouse, redesigning storage with separate Log and Trace clusters, optimizing partition and sorting keys, introducing native TCP connectors, and revamping HDFS cold‑hot separation, achieving up to four‑fold query speed gains and 30% lower hardware costs.

ClickHouse · Flink · HDFS

0 likes · 15 min read

DataFunSummit

Feb 6, 2024 · Big Data

Exploring ByteDance's EB‑Scale HDFS: Architecture, Multi‑Datacenter Challenges, Tiered Storage, and Data Protection Practices

This article presents an in‑depth overview of ByteDance's EB‑scale HDFS, covering its new features, multi‑datacenter architecture, tiered storage implementation, data management services, capacity and fault‑tolerance strategies, as well as practical data‑protection mechanisms and related Q&A.

Big Data · Data Protection · Distributed storage

0 likes · 22 min read

Mike Chen's Internet Architecture

Feb 1, 2024 · Big Data

Master Distributed Storage: HDFS, Ceph, and Swift Explained

This article introduces distributed storage concepts, outlines its five key characteristics, compares major architectures such as HDFS, Ceph, and Swift, and highlights common application scenarios like big‑data processing, cloud storage, databases, and distributed file systems.

Big Data · Ceph · Distributed storage

0 likes · 7 min read

WeiLi Technology Team

Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data · HDFS · Hadoop

0 likes · 10 min read

Su San Talks Tech

Oct 29, 2023 · Operations

What Are the Best Distributed File Storage Systems and How to Choose One?

This article introduces the concept of distributed storage, outlines its key advantages, reviews major distributed file systems such as GFS, HDFS, Ceph, Lustre, TFS, FastDFS, and GridFS, explains POSIX basics, and provides practical criteria for selecting the most suitable system for different workloads.

Ceph · Distributed storage · File System

0 likes · 12 min read

DataFunTalk

Jun 18, 2023 · Big Data

Evolution and Comparison of High‑Performance Cloud‑Native Lakehouse Storage Architecture: From HDFS to JuiceFS

This article examines the evolution of big‑data storage from on‑premise HDFS to cloud‑native object storage, compares their architectures and performance, outlines future lakehouse storage requirements, and demonstrates a practical implementation using the JuiceFS distributed file system.

Big Data · HDFS · JuiceFS

0 likes · 15 min read

Zhengcaiyun Technology

Apr 18, 2023 · Big Data

Implementing Data Cost Governance: Quantifying Storage and Compute Expenses with Hive, Spark, and HDFS FsImage

This article explains how to perform task‑level data cost governance by collecting storage and compute metrics from Hive tables, Spark jobs, and HDFS FsImage files, then estimating monthly expenses using replication factors and resource‑usage rates, while providing practical SQL and shell examples.

Data Cost Governance · HDFS · Hive

0 likes · 18 min read

Bilibili Tech

Mar 14, 2023 · Big Data

Bilibili HDFS Erasure Coding Strategy and Implementation

Bilibili reduced petabyte‑scale storage costs by back‑porting erasure‑coding patches to its HDFS 2.8.4 cluster, deploying a parallel EC‑enabled cluster, adding a data‑proxy service, intelligent routing and block‑checking, and automating cold‑data migration, while noting write overhead and planning native acceleration.

Big Data · Data Reliability · HDFS

0 likes · 14 min read

DataFunTalk

Feb 18, 2023 · Big Data

Xiaomi Data Governance Evolution: Cost Governance Practices for HDFS and HBase

The article outlines Xiaomi's data governance journey, focusing on storage‑service cost governance, describing the transition from simple cost‑centered governance to big‑data‑driven asset management, and detailing concrete HDFS and HBase practices that achieved significant resource and cost reductions.

Big Data · Data Governance · HBase

0 likes · 15 min read

DataFunSummit

Feb 12, 2023 · Big Data

Applying Erasure Coding in HDFS: Strategies, Performance, and Repair Techniques

This article explains how Zhihu adopted HDFS erasure coding to reduce storage costs, outlines cold‑hot file tiering policies, describes the EC conversion workflow and the custom EC Worker tool, and details methods for detecting and repairing damaged EC files in a Hadoop environment.

Big Data · HDFS · Performance

0 likes · 16 min read

dbaplus Community

Feb 8, 2023 · Big Data

How Bilibili Scaled Offline Processing Across Multiple Data Centers

This article details Bilibili's multi‑datacenter offline architecture, explaining the capacity challenges, the chosen scale‑out design, and the implementation of job placement, data replication, routing, versioning, throttling, and traffic analysis to efficiently handle massive batch workloads across geographically distributed clusters.

Data Replication · HDFS · bandwidth optimization

0 likes · 26 min read

StarRing Big Data Open Lab

Feb 1, 2023 · Big Data

Understanding HDFS vs Ceph: Architecture, Pros, and Use Cases

This in‑depth overview compares Hadoop's HDFS with the open‑source Ceph object storage, detailing their architectures, replication mechanisms, scalability, strengths, limitations, and real‑world enterprise adoption for handling massive datasets and unstructured data.

Ceph · Data Replication · HDFS

0 likes · 13 min read

ITPUB

Jan 4, 2023 · Databases

Can Cassandra Beat RDBMS Distributed Bottlenecks? A Deep Dive into Decentralized Databases

The article traces the evolution from Codd's relational model to modern RDBMS scaling limits, explains why centralized Hadoop/HBase architectures struggle with high‑concurrency workloads, and shows how Cassandra’s decentralized design—using consistent hashing, gossip, and virtual nodes—overcomes these bottlenecks while offering flexible consistency guarantees.

Cassandra · HBase · HDFS

0 likes · 22 min read

Data Thinking Notes

Dec 6, 2022 · Big Data

Why Did Multiple HDFS DataNodes Crash? Memory, GC, and Block Overload Explained

This article analyzes a midnight HDFS DataNode failure caused by excessive GC and OOM due to Spark batch jobs, examines how an unexpected surge in block count overloaded default memory settings, and presents concrete remediation steps and optimization recommendations to stabilize the cluster.

Block Overload · DataNode · Garbage Collection

0 likes · 6 min read

High Availability Architecture

Nov 30, 2022 · Big Data

Design and Implementation of Vivo's Bees Log Collection Agent

This article presents the design principles, core features, and implementation details of Vivo's self‑developed Bees log collection agent, covering file discovery, unique identification, real‑time and offline ingestion, resource control, platform management, and comparisons with open‑source solutions.

HDFS · Java · Kafka

0 likes · 22 min read

Data Thinking Notes

Nov 29, 2022 · Big Data

Understanding HDFS High Availability: Roles, Metadata Persistence, and Failover

This article explains the core concepts of HDFS High Availability, detailing primary and standby NameNode roles, failover mechanisms, shared storage systems, metadata persistence via EditLog and FsImage, and the processes for merging and synchronizing data across active and standby nodes.

EditLog · FsImage · HDFS

0 likes · 8 min read

vivo Internet Technology

Nov 23, 2022 · Big Data

Design and Implementation of Vivo's Bees Log Collection Agent

Vivo’s Bees‑agent is a custom, lightweight log‑collection service that discovers rotating files via inotify, uniquely identifies them with inode and hash signatures, supports real‑time and offline ingestion to Kafka and HDFS, offers checkpoint‑resume, resource isolation, rich metrics, and a centralized management platform, outperforming open‑source collectors in latency, memory usage, and scalability.

Agent Design · HDFS · Java

0 likes · 24 min read

ITPUB

Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Distributed Computing · HDFS

0 likes · 20 min read

ITPUB

Oct 20, 2022 · Big Data

Will HDFS Be Replaced? Analyzing Its Drawbacks and Future Alternatives

The article examines why Hadoop's Distributed File System may become obsolete by detailing its three main shortcomings—deployment complexity, metadata memory limits, and high replication overhead—and explores how newer architectures and erasure coding could address these issues.

Big Data · Distributed File System · HDFS

0 likes · 8 min read

Python Crawling & Data Mining

Oct 16, 2022 · Big Data

What Makes Hadoop the Backbone of Modern Big Data Processing?

This article provides a comprehensive overview of Hadoop, covering its history, core features, the HDFS storage framework, MapReduce computation engine, YARN resource manager, real‑world application scenarios, and the surrounding ecosystem of tools such as Hive, Spark and Kafka.

Distributed Computing · HDFS · Hadoop

0 likes · 20 min read

Big Data Technology & Architecture

Oct 13, 2022 · Big Data

Hudi Clustering After Batch Processing: Merging Small Files Before Streaming

This guide details how to execute Apache Hudi file clustering after a batch job and before streaming, using Spark commands to merge numerous small HDFS files into larger ones, configure clustering and cleaning policies, and verify the results with HDFS counts.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

DataFunTalk

Sep 4, 2022 · Big Data

Design and Implementation of Bilibili's Offline Multi‑Datacenter Solution

This article describes Bilibili's offline multi‑datacenter architecture, explaining why a scale‑out approach was chosen over scale‑up, and detailing the unit‑based design, job placement, data replication, routing, versioning, bandwidth throttling, traffic analysis, and the operational results and future directions.

Big Data · Data Replication · HDFS

0 likes · 24 min read

ITPUB

Jul 23, 2022 · Information Security

How Bilibili Secured Hadoop: Ranger‑Based HDFS and Hive Access Control Deep Dive

This article details Bilibili's implementation of Apache Ranger for fine‑grained access control across Hadoop, HDFS, Hive, Spark, and Presto, covering architecture, API redesign, admin optimizations, gray‑release strategies, permission pre‑checks, data masking, and future plans for incremental policy loading.

Access Control · Data Security · HDFS

0 likes · 16 min read

Bilibili Tech

Jul 22, 2022 · Information Security

Design and Optimization of Ranger‑Based Access Control for HDFS and Hive in Bilibili's Data Platform

Bilibili’s data platform redesigns Ranger‑based access control by simplifying HDFS and Hive policy APIs, parallelizing policy loading, adding gray‑release and pre‑check mechanisms, integrating fine‑grained Hive authorization with data‑masking, extending support to Spark and Presto, and planning incremental loading, policy fusion, and a NameNode proxy to boost security and performance.

Access Control · HDFS · Hive

0 likes · 15 min read

dbaplus Community

Jul 13, 2022 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform—covering collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use‑cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

Big Data · Data Integration · Data Warehouse

0 likes · 9 min read

ITPUB

Jul 13, 2022 · Big Data

How Bilibili Scaled Offline Processing Across Multiple Data Centers

This article details Bilibili's multi‑datacenter solution for offline big‑data workloads, covering the challenges of capacity limits, the design of a unit‑based architecture, job placement, data replication, routing, versioning, bandwidth throttling, traffic analysis, and future directions.

HDFS · bandwidth optimization · job placement

0 likes · 29 min read

Bilibili Tech

Jul 5, 2022 · Big Data

Multi‑Datacenter Architecture for Offline Big Data Processing at Bilibili

To overcome rapid data growth and on‑premise capacity limits, Bilibili adopted a scale‑out, unit‑based multi‑datacenter architecture that isolates failures, intelligently places jobs, replicates data via an enhanced DistCp service, routes reads with an IP‑aware HDFS router, and throttles cross‑site traffic, enabling stable offline big‑data processing of hundreds of petabytes while preserving throughput.

Data Replication · HDFS · YARN

0 likes · 28 min read

DataFunTalk

Jul 4, 2022 · Big Data

Apache Ozone: Architecture, Advantages, and New Features Overcoming HDFS Limitations

This article explains the shortcomings of HDFS at large scale, describes the Federation and Scaling approaches, and details how Apache Ozone redesigns metadata storage, introduces container abstraction, object semantics, and new features such as optimized OM, streaming writes, erasure coding, and RocksDB consolidation to improve scalability and performance.

Apache Ozone · HDFS · RocksDB

0 likes · 11 min read

DataFunSummit

Jul 2, 2022 · Big Data

Technical Evolution and Optimization of Kuaishou HDFS

Over the past four years, Kuaishou's data volume grew by a factor of dozens, creating scalability and storage‑cost challenges. This article details the architectural evolution, performance and cost optimizations, cross‑region expansion, and future plans of Kuaishou's HDFS system.

Big Data · Distributed storage · HDFS

0 likes · 20 min read

DataFunTalk

Jun 5, 2022 · Big Data

JD Big Data Platform: Cross‑Region and Tiered Storage Architecture and Practices

This article presents JD's large‑scale big‑data platform, detailing its overall architecture, the challenges of cross‑region storage, the design of a unified cross‑domain data synchronization mechanism, and the implementation of tiered storage to improve performance, cost efficiency, and data reliability across multi‑datacenter clusters.

Big Data · Data Platform · Distributed storage

0 likes · 15 min read

vivo Internet Technology

May 11, 2022 · Big Data

How We Rolled Out a Massive HDFS 2.6→3.1 Upgrade on a 10,000‑Node Cluster

This article details the end‑to‑end process of migrating a 10,000‑node offline data‑warehouse from CDH 5.14.4 (HDFS 2.6.0) to HDP 3.1.4 (HDFS 3.1.1), covering version selection, rolling‑upgrade strategy, incompatibility fixes, client handling, tool coexistence, testing, automation, and lessons learned.

Big Data · Cluster Migration · Data Warehouse

0 likes · 25 min read

ITPUB

May 7, 2022 · Big Data

How eBay Scaled HDFS to 800 PB Using Federation and Router‑Based Architecture

This article details eBay's evolution of its massive HDFS storage—from a single‑cluster design to ViewFS Federation, then to Router‑Based Federation—highlighting the performance bottlenecks, optimization techniques, FastCopy integration, and future plans for further scaling and automation.

Federation · HDFS · Router-based Federation

0 likes · 11 min read

DataFunSummit

May 4, 2022 · Big Data

NetEase Big Data Platform: HDFS Optimization and Practices

A senior big‑data engineer at NetEase shares how the company’s large‑scale data platform leverages Hadoop, HDFS, YARN and related technologies, detailing its multi‑layer architecture, cross‑cloud deployment, storage optimizations, NameNode performance enhancements, RPC prioritization, and practical lessons from operating petabyte‑scale clusters.

Cluster Optimization · HDFS · Performance Tuning

0 likes · 23 min read

IEG Growth Platform Technology Team

Apr 18, 2022 · Big Data

Big Data Overview: Definitions, Applications, Technology Stack, and Core Components (Hadoop, HDFS, MapReduce, YARN, Hive, HBase)

This comprehensive article explains big data concepts, definitions from Gartner and IBM, real‑world use cases, the Hadoop ecosystem architecture, and detailed introductions to HDFS, MapReduce, YARN, Hive, and HBase, including practical examples and shell commands.

HBase · HDFS · Hadoop

0 likes · 42 min read

DataFunTalk

Mar 30, 2022 · Big Data

NetEase Big Data Platform: HDFS Optimization and Practice

This article presents NetEase's big data platform architecture, detailing multi‑layer storage and compute design, HDFS deployment challenges, NameNode and NameSpace performance optimizations, cluster scaling strategies, data tiering, hardware upgrades, and real‑world business use cases, illustrating practical large‑scale big data engineering.

Big Data · Cluster Optimization · Data Management

0 likes · 23 min read

Bilibili Tech

Mar 30, 2022 · Big Data

HDFS Architecture, Optimizations, and Future Plans at Bilibili

Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.

Big Data · Distributed File System · HDFS

0 likes · 22 min read

Architecture Digest

Dec 28, 2021 · Big Data

HDFS Overview: Architecture, Features, Data Management and Storage Policies

This article provides a comprehensive overview of HDFS, covering basic file system concepts, HDFS architecture, high availability, federation, replica placement, storage policies, colocation, data integrity, and key design considerations for large‑scale distributed storage.

Big Data · Colocation · Data Replication

0 likes · 23 min read

dbaplus Community

Dec 15, 2021 · Big Data

How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

Balance Optimization · Big Data Operations · Data Migration

0 likes · 14 min read

HomeTech

Dec 7, 2021 · Big Data

Flink Task Auto-scaling Design and Implementation

This article presents the design and implementation of Flink task auto‑scaling, covering background, manual and automatic scaling mechanisms, architecture with RescaleCoordinator, persistence via ZooKeeper and HDFS, scaling policies for parallelism, CPU and memory, and future plans for fine‑grained and time‑based resource adjustments.

Auto Scaling · Flink · HDFS

0 likes · 4 min read

Tongcheng Travel Technology Center

Nov 2, 2021 · Big Data

Hadoop Cluster Cross-Data Center Migration Practice at Tongcheng Travel

This article details Tongcheng Travel’s month‑long, zero‑downtime migration of hundreds of petabytes of Hadoop HDFS and YARN clusters across data centers, describing the background, migration strategies, lessons learned, tool enhancements, and future plans to improve data locality, balance, and monitoring.

Big Data · Cluster Migration · Data Center

0 likes · 16 min read

Big Data Technology & Architecture

Oct 8, 2021 · Big Data

Hadoop HDFS Storage Optimization, Erasure Coding, Heterogeneous Storage, and Cluster Tuning Guide

This article provides a comprehensive guide to optimizing Hadoop HDFS storage through erasure coding and heterogeneous storage policies, explains fault‑tolerance techniques such as safe mode and slow‑disk monitoring, and shares practical MapReduce performance tuning and enterprise‑level configuration examples for large‑scale clusters.

Cluster Tuning · HDFS · Hadoop

0 likes · 32 min read

Big Data Technology & Architecture

Sep 17, 2021 · Big Data

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.

HDFS · MapReduce · Reliability

0 likes · 13 min read

ITPUB

Sep 16, 2021 · Big Data

Understanding Hadoop: Architecture, HDFS, MapReduce, and Their Pros & Cons

This article explains how Hadoop revolutionized big data by providing a distributed architecture with HDFS for storage and MapReduce for processing, outlines its ecosystem components, describes the inner workings of HDFS and MapReduce, and discusses the strengths and limitations of this approach.

HDFS · Hadoop · MapReduce

0 likes · 7 min read

The Dominant Programmer

Aug 4, 2021 · Big Data

How to Set Up Hadoop Java Development on Windows and Access HDFS via Java API

This guide walks through installing Hadoop on Windows, configuring environment variables and XML files, adding the required winutils binaries, verifying the setup with HDFS shell commands, and then building a Maven project that uses the Java API to list and inspect files in HDFS.

HDFS · Hadoop · Java

0 likes · 11 min read

The Dominant Programmer

Aug 4, 2021 · Big Data

Essential HDFS Shell Commands for Managing Hadoop Files

This guide explains how to use the HDFS shell (preferably invoked as hdfs dfs) to list, copy, move, delete, and snapshot files in a Hadoop cluster, detailing command syntax, URI handling, generic options, and practical examples for each operation.

Big Data · HDFS · Hadoop

0 likes · 9 min read

The Dominant Programmer

Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big Data · CentOS 7 · HBase

0 likes · 11 min read

Big Data Technology & Architecture

Jul 19, 2021 · Big Data

Understanding Hadoop: MapReduce, HDFS, YARN, and Core Big Data Concepts

This article provides a comprehensive overview of Hadoop’s core components—including MapReduce programming model, HDFS storage architecture, and YARN resource management—while discussing common challenges like data skew and small files, and offering learning resources for aspiring big‑data engineers.

Data Skew · HDFS · Hadoop

0 likes · 9 min read

DataFunTalk

Jul 8, 2021 · Big Data

Design and Evolution of ByteDance's Multi‑Datacenter HDFS Architecture

This article explains how ByteDance extended the Apache HDFS architecture with a multi‑datacenter design, introducing components such as DanceNN, NNProxy, and BookKeeper to achieve scalable storage, cross‑datacenter data placement, and rack‑level disaster recovery for petabyte‑scale workloads.

ByteDance · Disaster Recovery · HDFS

0 likes · 13 min read

Big Data Technology & Architecture

Jun 24, 2021 · Big Data

Comprehensive Overview of HBase Architecture, Design, and Operations

This article provides an in‑depth technical overview of HBase, covering its Bigtable origins, distributed column‑store design, core components such as ZooKeeper, HMaster and RegionServer, data flow, storage formats, row‑key design, bulk loading, SQL integration, indexing, coprocessors, and performance tuning for big‑data environments.

Columnar Database · Distributed storage · HBase

0 likes · 30 min read

58 Tech

May 28, 2021 · Big Data

Practical Upgrade Experience of Hadoop 3.2.1 in 58.com Data Platform: HDFS, YARN, and MR3

This article details the end‑to‑end upgrade of a 5000‑node Hadoop 2.6.0 cluster to Hadoop 3.2.1 at 58.com, covering HDFS migration, RBF and EC adoption, YARN federation and rolling upgrades, MR3 integration, extensive compatibility testing, and operational lessons learned for large‑scale big‑data platforms.

Big Data · HDFS · Hadoop

0 likes · 19 min read

Qu Tech

May 6, 2021 · Big Data

How JuiceFS Cut HDFS Load by 26% and Boosted Presto Query Speed by 13%

This case study details how integrating JuiceFS with Presto reduced HDFS cluster load by about 26%, achieved over 90% cache hit rate for ad‑hoc queries, and lowered average query latency by roughly 13%, while simplifying operations and improving system stability.

Big Data · Cache · HDFS

0 likes · 9 min read

Practical DevOps Architecture

Apr 28, 2021 · Big Data

Step-by-Step Hadoop Environment Setup and Configuration on Three Linux Servers

This guide walks through preparing three Linux servers, installing JDK 1.8, configuring Hadoop core, HDFS, MapReduce, and YARN XML files, setting Java environment variables, formatting HDFS, and starting all services to access the Hadoop web UI.

Big Data · HDFS · Hadoop

0 likes · 4 min read

Programmer DD

Apr 14, 2021 · Big Data

Understanding HDFS Architecture: Key Components, Protocols, and Limitations

This article explains HDFS’s master‑slave architecture, detailing the roles of NameNode and DataNode, namespace management, communication protocols, client functions, common configuration parameters, maintenance commands, and the inherent limitations of a single‑NameNode design.

Big Data · DataNode · HDFS

0 likes · 5 min read

Programmer DD

Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Federation · HDFS

0 likes · 17 min read

Open Source Linux

Mar 29, 2021 · Big Data

Which Open‑Source Distributed Storage System Is Right for You? HDFS, GlusterFS, Swift, Ceph Compared

This article outlines the fundamental concepts and key concerns of distributed storage, provides an overview of four open‑source systems—HDFS, GlusterFS, OpenStack Swift, and Ceph—and presents a detailed functional comparison to help you choose the best solution for your data infrastructure.

Ceph · Comparison · Distributed storage

0 likes · 4 min read

DataFunTalk

Mar 27, 2021 · Big Data

Kuaishou's HDFS Architecture, Scale, Challenges, and Practices

This article presents an in‑depth technical overview of Kuaishou's massive HDFS deployment, detailing its architecture, petabyte‑scale data and thousands‑of‑node clusters, the key scalability challenges faced, and the custom solutions—including FixedOrder, RBF balancer, observer read, slow‑node mitigation, and tiered protection—implemented to keep the system performant and reliable.

Big Data · Data Engineering · HDFS

0 likes · 12 min read

Big Data Technology Architecture

Mar 25, 2021 · Big Data

Implementing Erasure Coding in HDFS: Migration, Testing, and Data Lifecycle Management at JD

This article details JD's end‑to‑end implementation of HDFS erasure coding, covering the migration from replication to EC, the three‑phase upgrade and rollback process, comprehensive automated testing, a custom data‑lifecycle management system for hot‑warm‑cold data, and multi‑layer integrity safeguards to achieve significant storage cost reduction while maintaining reliability.

Data Lifecycle · HDFS · Storage Optimization

0 likes · 17 min read

JD Tech

Mar 20, 2021 · Big Data

Implementing Erasure Coding in HDFS: Migration Strategy, Testing Framework, and Data Lifecycle Management

This article details JD's practical experience migrating HDFS to erasure coding, covering the decision between upgrade and porting, the step‑by‑step upgrade and rollback procedures, automated testing, a custom data‑lifecycle management system for hot‑warm‑cold data, and comprehensive data‑integrity safeguards to achieve significant storage cost reductions while maintaining production reliability.

Data Lifecycle Management · HDFS · Storage Optimization

0 likes · 17 min read

dbaplus Community

Mar 17, 2021 · Big Data

How We Cut PBs of Waste and Optimized HDFS with Tiered Storage and Cloud Migration

This article summarizes a three‑part technical talk covering cost governance for offline Hadoop clusters, a large‑scale data‑center migration with architecture upgrades, and a tiered storage strategy using EC and COS to reduce storage costs and improve performance in a cloud‑native big‑data environment.

Big Data Migration · COS · Data Governance

0 likes · 10 min read

Big Data Technology Architecture

Mar 2, 2021 · Big Data

Understanding and Managing Small Files in Hadoop HDFS

This article explains what small files are in Hadoop HDFS, how they degrade NameNode memory, RPC performance, and application throughput, and provides practical strategies—including detection, configuration, and merging techniques—to mitigate their impact on storage and processing layers.

HDFS · Hadoop

0 likes · 12 min read

DataFunTalk

Feb 8, 2021 · Big Data

Ozone: The Next‑Generation Distributed Storage System Aiming to Replace HDFS

This article explains how Apache Ozone, built on the HDDS layer, addresses the scalability, memory, and performance limitations of HDFS by splitting metadata services, using RocksDB, implementing fine‑grained locking, RAFT‑based HA, and offering rich APIs, while outlining current challenges and future roadmap.

Big Data · HDDS · HDFS

0 likes · 29 min read

Full-Stack Internet Architecture

Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · Distributed Computing · HDFS

0 likes · 33 min read

Architects' Tech Alliance

Jan 24, 2021 · Big Data

Outline of Distributed Storage Systems: HDFS, GlusterFS, OpenStack Swift, and Ceph

This article outlines the fundamental concepts and key issues of distributed storage, provides an overview of four open‑source distributed file systems—HDFS, GlusterFS, OpenStack Swift, and Ceph—and compares their functionalities, accompanied by illustrative slide images.

Big Data · Ceph · GlusterFS

0 likes · 2 min read

Didi Tech

Jan 22, 2021 · Big Data

Erasure Coding Practice in HDFS at Didi: Principles, Implementation, and Lessons Learned

Didi migrated HDFS to Hadoop 3.2 and implemented erasure coding—using XOR and Reed‑Solomon RS(6,3) striping—to replace three‑replica storage for cold data, building back‑ported clients, automated conversion tools, and cross‑datacenter backup pipelines, while addressing operational bugs and noting performance trade‑offs.

Big Data · Didi · HDFS

0 likes · 11 min read

Big Data Technology & Architecture

Jan 22, 2021 · Big Data

Key New Features and Improvements in Hadoop 3.x

Hadoop 3.x raises the minimum required Java version to JDK 1.8 and introduces a range of enhancements across common components, HDFS, YARN, and MapReduce, including erasure coding, multi‑NameNode high availability, cgroup‑based resource isolation, native map‑output collectors, and split client libraries, while also adding support for Azure and Aliyun distributed file systems.

HDFS · Hadoop · MapReduce

0 likes · 7 min read

Big Data Technology & Architecture

Jan 12, 2021 · Big Data

Hadoop Interview Questions and Topics – HDFS, MapReduce, YARN, and Optimization

This article compiles a comprehensive set of Hadoop interview questions covering HDFS write and read processes, architecture, fault‑tolerance, NameNode metadata management, MapReduce scheduling, combiner and partition roles, YARN scheduling strategies, and various optimization techniques for both MapReduce and HDFS.

HDFS · Hadoop · MapReduce

0 likes · 5 min read

dbaplus Community

Dec 22, 2020 · Big Data

How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.

Big Data · Distcp · HDFS

0 likes · 16 min read

Tencent Cloud Developer

Dec 2, 2020 · Big Data

WeChat Pay Log System at Scale: Practices with Hermes

WeChat Pay’s Hermes‑based log system ingests trillions of entries daily, storing petabytes across a 200‑node HDFS cluster with four‑nines (99.99%) availability, while LSM‑style writes, separate inverted indexes and hot‑cold tiering cut memory, disk and cost by up to 70% and keep 95% of queries under five seconds.

HDFS · Hermes · Indexing

0 likes · 7 min read

Practical DevOps Architecture

Nov 27, 2020 · Big Data

Step-by-Step Guide to Install and Configure a Hadoop 2.8.2 Cluster

This tutorial provides a complete walkthrough for downloading Hadoop 2.8.2, setting up a three‑node master‑slave cluster, configuring core, HDFS, MapReduce and YARN settings, creating required directories, distributing the installation, starting the services, verifying the cluster status, and finally shutting it down.

Big Data · Cluster Setup · HDFS

0 likes · 5 min read

Big Data Technology & Architecture

Aug 16, 2020 · Big Data

Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features

This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.

Big Data · CLI · HA

0 likes · 10 min read

Big Data Technology & Architecture

Aug 13, 2020 · Big Data

Configuring Kerberos‑Enabled HDFS Access with Maven in a Hadoop Cluster

This guide walks through setting up a Maven project, adding Hadoop dependencies, configuring Kerberos (krb5.conf and keytab), loading core‑site.xml, and providing Java utility classes to initialize the HDFS client and list files in an HA‑enabled Hadoop cluster.

Big Data · HDFS · Hadoop

0 likes · 5 min read

Big Data Technology & Architecture

Jul 29, 2020 · Big Data

Sqoop Tutorial: Importing and Exporting Data between Relational Databases, HDFS, Hive, and HBase

This article provides a comprehensive guide to using Sqoop for importing data from relational databases into HDFS, Hive, and HBase, as well as exporting data back to databases, covering command syntax, options, and practical examples for big‑data workflows.

Big Data · HBase · HDFS

0 likes · 8 min read

Big Data Technology & Architecture

Jul 27, 2020 · Big Data

How to View Hadoop/YARN Application Logs via History Server and Yarn Commands

This guide explains how to retrieve Hadoop/YARN application logs using the History Server UI, YARN command‑line tools, and direct HDFS log access, including commands for listing applications, fetching specific logs, and locating the remote log directory.

Big Data · CLI · HDFS

0 likes · 4 min read
