Tagged articles
407 articles
Big Data Technology Architecture
Aug 16, 2019 · Big Data

In‑Depth Overview of HBase Architecture

This article provides a comprehensive, illustrated explanation of Apache HBase's architecture, covering its master‑slave components, region management, Zookeeper coordination, data flow for reads and writes, storage structures, compaction processes, fault recovery, and the system's strengths and limitations within the Hadoop ecosystem.

Distributed Systems · HBase · Hadoop
0 likes · 21 min read
Big Data Technology Architecture
Aug 5, 2019 · Big Data

Zookeeper in Distributed Systems: Roles in Kafka, Hadoop, HBase, and Solr

This article explains Zookeeper’s core concepts, its ZAB consensus protocol, and surveys its essential roles in major big‑data components such as Kafka, Hadoop, HBase, and Solr, illustrating how it provides configuration, naming, coordination, leader election, and high‑availability services across distributed architectures.

Distributed Systems · HBase · Hadoop
0 likes · 5 min read
Big Data Technology Architecture
Aug 2, 2019 · Big Data

The Rise and Decline of Hadoop: Market Shifts, Ecosystem Evolution, and Future Outlook

This article examines Hadoop’s historical development, the recent financial troubles of its three major vendors, the impact of public‑cloud services, competition from technologies like MongoDB and Elasticsearch, and how the evolving ecosystem and hybrid cloud strategies shape Hadoop’s relevance today.

Hadoop · Hadoop Vendors · cloud computing
0 likes · 23 min read
Meituan Technology Team
Aug 1, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

Meituan improved its custom Hadoop YARN Fair Scheduler by pre‑computing resource usage, filtering zero‑demand jobs, and parallelizing queue sorting, which reduced sorting time from 30 s to 5 s per minute, boosted container‑per‑second throughput to 50 k, enabled live roll‑backs, and prepared the system for clusters up to 10 k nodes and future scaling to hundreds of thousands.

Big Data · Fair Scheduler · Hadoop
0 likes · 24 min read
vivo Internet Technology
Jul 29, 2019 · Big Data

Is Hadoop Dead? An Analysis of Cloudera’s Move Toward an Enterprise Data Cloud

While Hadoop remains a powerful but complex batch‑processing engine, Cloudera’s merger with Hortonworks and its pivot toward an enterprise data cloud—offering hybrid, multi‑cloud analytics, security, and governance—signals a strategic shift that keeps Hadoop relevant yet no longer central amid rising competitors like MongoDB and Elasticsearch.

Cloudera · Data Governance · Enterprise Data Cloud
0 likes · 10 min read
dbaplus Community
Jul 24, 2019 · Big Data

Essential Open-Source Tools Every Big Data Engineer Should Know

This article compiles a comprehensive list of common open‑source tools for big data platforms—covering programming languages, data collection, ETL, storage, analysis, query, management, and monitoring—to help learners and practitioners quickly locate and understand the technologies they need.

Big Data · ETL · Hadoop
0 likes · 15 min read
System Architect Go
Jul 19, 2019 · Big Data

Introduction to HBase: Architecture, Data Model, and Operations

This article provides a comprehensive overview of HBase, covering its distributed column‑oriented architecture, data model components, storage mechanisms, read/write processes, WAL lifecycle, MemStore flushing, region splitting and merging, and failure recovery within the Hadoop ecosystem.

Big Data · HBase · Hadoop
0 likes · 20 min read
Big Data Technology & Architecture
Jun 30, 2019 · Big Data

Curated Collection of Big Data, Flink, Hadoop and Real‑Time Computing Articles from the “Big Data Technology and Architecture” Series

This article presents a carefully organized catalogue of over a hundred technical posts covering Flink source‑code analysis, fundamental and advanced big‑data structures, Hadoop ecosystem components, real‑time streaming with Spark and Kafka, as well as system design guidelines and miscellaneous insights, each linked to its original publication for easy reference.

Big Data · Distributed Systems · Flink
0 likes · 6 min read
21CTO
Jun 28, 2019 · Big Data

Master Hadoop High Availability: A Complete Step‑by‑Step HA HDFS & YARN Guide

This article provides a comprehensive, language‑agnostic tutorial on building a highly available Hadoop cluster, covering HDFS and YARN HA architectures, QJM shared storage, required components, configuration files, installation commands, startup procedures, verification steps, and troubleshooting references.

Cluster Setup · HDFS · Hadoop
0 likes · 20 min read
Beike Product & Technology
Jun 28, 2019 · Big Data

Hadoop NameNode Performance Bottlenecks and Solutions: Federation, ViewFS, FastCopy, Balance & Mover

This article analyzes the performance and stability bottlenecks of a Hadoop 2.7.3 NameNode caused by memory limits, RPC QPS, and long restart times, and presents a comprehensive solution stack—including HDFS federation, ViewFS, FastCopy, and tuned Balance/Mover tools—to improve scalability and reduce downtime.

Balance · FastCopy · Federation
0 likes · 11 min read
Didi Tech
Jun 22, 2019 · Big Data

Analysis of Hadoop RPC Architecture and Implementation

The article examines Hadoop’s RPC framework—detailing its client‑server workflow, core classes (RPC, Client, Server), dynamic proxy handling, NIO‑based server threading, configurable concurrency controls such as FairCallQueue, and a practical HDFS mkdir command example, illustrating high‑performance distributed communication.

Big Data · Hadoop · RPC
0 likes · 17 min read
DataFunTalk
Jun 17, 2019 · Big Data

Understanding Hadoop’s Core Competitiveness in the Trillion‑Scale Data Era

This article explores Hadoop’s role in the big‑data era, detailing its architecture, core components such as HDFS, YARN, MapReduce, Ozone and Submarine, the challenges of trillion‑scale data, and why its scalability, cost efficiency, and mature ecosystem give it a competitive edge.

Data Lake · Distributed Systems · Hadoop
0 likes · 11 min read
Big Data Technology & Architecture
May 22, 2019 · Big Data

Key Changes and New Features in Apache Flink 1.8.0 Release

Apache Flink 1.8.0 introduces incremental state cleanup with TTL, updates Hadoop support, deprecates TableEnvironment static methods, adds new Kafka deserialization schema, modifies Maven dependencies, and provides several configuration and Table API enhancements for better stream‑processing performance and compatibility.

Apache Flink · Hadoop · Table API
0 likes · 7 min read
Big Data Technology Architecture
May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

HDFS · Hadoop · Kafka
0 likes · 10 min read
dbaplus Community
May 13, 2019 · Big Data

Tackling HDFS Performance Bottlenecks: Real‑World Optimizations from VIP.com

This article examines the performance challenges encountered after upgrading a large‑scale HDFS cluster at VIP.com, explains the root causes of NameNode RPC latency, and presents concrete solutions—including delayed block reports, configurable block deletion, federation redesign, client monitoring, temp‑directory sharding, and small‑file handling—along with configuration snippets and real‑world results.

Big Data · Federation · HDFS
0 likes · 13 min read
Big Data Technology & Architecture
Apr 24, 2019 · Big Data

Hive SQL Optimization Techniques and Best Practices

This article provides a comprehensive guide to Hive SQL performance tuning, covering optimization goals, common pitfalls, execution flow, table and job settings, map, shuffle, reduce, and query-level improvements such as join, bucket join, group‑by, and count‑distinct optimizations.

Big Data · Hadoop · hive
0 likes · 11 min read
Big Data Technology & Architecture
Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big Data · Hadoop · Installation
0 likes · 8 min read
Big Data Technology & Architecture
Apr 16, 2019 · Big Data

Features, Configuration Parameters, and Implementation Details of Hadoop Capacity Scheduler

The article provides a comprehensive overview of Hadoop's Capacity Scheduler, describing its resource‑allocation features, configurable XML parameters, queue access controls, dynamic configuration updates, and the internal workflow of application initialization and resource scheduling within YARN.

CapacityScheduler · Hadoop · ResourceManagement
0 likes · 13 min read
Big Data Technology & Architecture
Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop
0 likes · 15 min read
dbaplus Community
Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC tuning
0 likes · 12 min read
dbaplus Community
Mar 13, 2019 · Operations

How We Upgraded a 100‑Node Hadoop Cluster with Ansible and Ambari

This article details the step‑by‑step process of modernizing a large‑scale Hadoop deployment—identifying legacy pain points, evaluating three migration strategies, selecting an in‑place upgrade using Ambari‑managed HDP, and automating the entire workflow with Ansible to minimize downtime and operational risk.

Ambari · Ansible · Cluster Upgrade
0 likes · 13 min read
Big Data Technology & Architecture
Feb 15, 2019 · Big Data

Big Data Mastery Roadmap

This article outlines a comprehensive series of over 500 planned tutorials covering Java advanced features, distributed theory, Hadoop, Spark, Flink, and various big‑data storage and processing technologies, designed to guide engineers transitioning into big‑data development from fundamentals to expert level.

Distributed Systems · Flink · Hadoop
0 likes · 4 min read
Didi Tech
Jan 31, 2019 · Big Data

Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop’s single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters; Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Big Data · HDFS · Hadoop
0 likes · 11 min read
Tencent Cloud Developer
Oct 30, 2018 · Big Data

Big Data Technology Trends and Cloud Data Warehouse Architecture Practices

The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.

Big Data · Data Lake · Hadoop
0 likes · 30 min read
JD Tech
Sep 20, 2018 · Big Data

Optimizing Local Storage Systems for Large‑Scale Hadoop HDFS Clusters

This article explains the architecture of Hadoop HDFS, identifies performance bottlenecks in page cache and metadata handling on DataNodes, and presents four practical optimization techniques—including cache‑buffer separation, barrier disabling, directory restructuring, and real‑time monitoring—demonstrating significant throughput and latency improvements in large‑scale clusters.

HDFS · Hadoop · Linux kernel
0 likes · 14 min read
Big Data and Microservices
Sep 4, 2018 · Big Data

Exploring Five Big Data Architectures—from Traditional to Unified AI Designs

The article examines the evolution of big‑data processing by comparing five prevalent architectures—traditional Hadoop‑based stacks, streaming‑only designs, Kappa, Lambda, and the unified Unifield model—highlighting their strengths, weaknesses, and suitable scenarios while discussing the limitations of classic BI systems and the role of distributed storage, computation, and machine‑learning integration.

Big Data · Data Architecture · Hadoop
0 likes · 14 min read
Architects Research Society
Jul 27, 2018 · Big Data

Overview of Apache Hive Features, Usage, and Management

Apache Hive is an open‑source data‑warehouse system built on Hadoop that enables users to read, write, and manage large distributed datasets using SQL‑like queries, offering features such as ETL support, various file‑format connectors, extensible UDFs, and integration with tools like Tez, Spark, and MapReduce.

Apache Hive · Big Data · ETL
0 likes · 5 min read
iQIYI Technical Product Team
Jul 26, 2018 · Big Data

Gear: An Internal Workflow Scheduling System for Hadoop at iQIYI

Gear is iQIYI’s internal, high‑availability workflow scheduler built on Apache Oozie and extended with a YAML‑based definition language, GitLab‑driven submission, and a web UI, enabling thousands of daily Hadoop/Spark jobs, complex dependencies, retries, and monitoring, and evolving from SSH‑centric 1.x to feature‑rich 2.x.

Hadoop · Job Management · Oozie
0 likes · 14 min read
Meitu Technology
Jul 24, 2018 · Big Data

Exploring Big Data Cluster Security: Evaluation of Kerberos, Apache Sentry, and Apache Ranger

The article evaluates Kerberos, Apache Sentry, and Apache Ranger for securing Meitu’s large‑scale Hadoop ecosystem, highlighting Ranger’s comprehensive, fine‑grained, policy‑based authorization across HDFS, HBase, Hive, YARN, Storm, and Kafka, and describing its configuration, LDAP integration, and custom SDK implementation.

Apache Ranger · Apache Sentry · Hadoop
0 likes · 12 min read
JD Tech
Jul 10, 2018 · Big Data

Deploying Hadoop KMS for Transparent HDFS Encryption: A Step‑by‑Step Guide

This article details a complete, hands‑on deployment of Hadoop KMS on a CentOS‑based Hadoop 2.6.1 cluster, covering environment setup, configuration file changes, key generation, service startup, encryption‑zone creation, user permission tuning, verification procedures, and common troubleshooting tips.

HDFS · Hadoop · KMS
0 likes · 19 min read
JD Tech
Jul 9, 2018 · Big Data

JD's Large‑Scale Hadoop Cluster Resource Management and Scheduling Architecture

This article describes how JD built a multi‑regional, ten‑thousand‑node Hadoop ecosystem, unified resource management with YARN, introduced a three‑level Router scheduling layer, optimized performance, and integrated deep‑learning frameworks to achieve high availability, cost efficiency, and scalable big‑data processing.

Distributed Scheduling · Hadoop · JD.com
0 likes · 12 min read
UCloud Tech
Jul 7, 2018 · Big Data

How UMStor and HAdapter Power Big Data Cloud Migration with Superior Performance

The article reports on a presentation by UCloud's subsidiary at ArchSummit 2018 in Shenzhen, covering the shift to the digital era, the challenges of PB‑scale data storage, and a solution built on NFS‑Ganesha, HAdapter, and UMStor that delivers efficient big‑data‑on‑cloud performance and a data‑lake model.

Data Lake · Hadoop · UMStor
0 likes · 10 min read
JD Tech
Jul 4, 2018 · Big Data

ClickHouse Overview: Features, Performance, Engines, and Comparison with Hadoop

This article introduces ClickHouse as a high‑performance, column‑oriented database designed for real‑time big‑data analytics, outlines its key features, performance characteristics, supported interfaces, differences from Hadoop, and explains its main storage engines—MergeTree and Distributed—while also noting its current limitations.

Columnar Database · Distributed Systems · Hadoop
0 likes · 11 min read
360 Quality & Efficiency
Jun 28, 2018 · Big Data

An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases

This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.

Hadoop · MapReduce · data-warehouse
0 likes · 5 min read
Architecture Digest
Jun 22, 2018 · Databases

Distributed Databases for OLAP: MPP, Hadoop Ecosystem, and Like‑Mesa (ClickHouse/Palo) Overview

This article examines the evolution and classification of distributed databases for OLAP workloads, comparing traditional RDBMS, MPP solutions such as Teradata and Greenplum, Hadoop‑based ecosystems, and newer architectures like ClickHouse and Palo, while highlighting their architectural traits, strengths, and limitations.

Hadoop · MPP · NewSQL
0 likes · 17 min read
ITPUB
Jun 19, 2018 · Big Data

Is Hadoop Still Relevant? Comparing Hadoop, PostgreSQL, and Storm

The article examines Hadoop's relevance by contrasting it with PostgreSQL and Storm, discussing when each technology fits big‑data challenges such as volume, velocity, and variety, and highlighting cost, complexity, and use‑case considerations for enterprises.

Batch Processing · Hadoop · Storm
0 likes · 8 min read
ITPUB
Jun 14, 2018 · Big Data

Why Suning.com Sticks with Hadoop: Insights into China’s Big Data Platform Choices

Amid reports of declining Hadoop usage, Suning.com’s 2018‑2020 big‑data platform case study reveals why the retailer still relies on Hadoop’s mature ecosystem, how it integrates HDFS, HBase, YARN, Hive, Spark, Flink and emerging tools, and what future resource‑management plans it envisions.

Data Platform · Flink · Hadoop
0 likes · 11 min read
ITPUB
Jun 10, 2018 · Big Data

13 Must‑Know Open‑Source Tools in the Hadoop Ecosystem

This article introduces Hadoop’s origins and core challenges, then presents thirteen essential open‑source tools spanning resource scheduling, real‑time query engines, and additional processing frameworks, detailing each project's purpose, key features, and repository locations to help practitioners choose the right component for big‑data workloads.

Hadoop · Impala · Spark
0 likes · 12 min read
ITPUB
Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 claim that Hadoop is nearing the end of its production maturity, a series of interviews with Chinese big‑data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Ecosystem · Gartner
0 likes · 9 min read
ITPUB
Jun 3, 2018 · Big Data

Spark vs Hadoop: Which Distributed System Fits Your Data Needs?

An in‑depth comparison of Hadoop and Spark examines their architectures, performance, cost, security, and machine‑learning capabilities, helping readers decide which open‑source distributed processing platform best matches their batch, streaming, and analytical workloads.

Big Data · Cost · Hadoop
0 likes · 13 min read
Beike Product & Technology
Jun 1, 2018 · Big Data

Design and Evolution of Lianjia's Big Data Platform: Architecture, Challenges, and Solutions

This article details Lianjia's journey from a Hadoop‑based 0.0 data platform to a sophisticated 2.0 architecture, describing the three‑layer design, OLAP engine choices, transparent compression techniques, operational challenges, and practical recommendations for building and maintaining large‑scale big data systems.

Hadoop · Kylin · OLAP
0 likes · 15 min read
21CTO
May 17, 2018 · Big Data

Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

Hadoop · Shuffle · YARN
0 likes · 12 min read
Architects' Tech Alliance
May 14, 2018 · Big Data

Understanding Hadoop MapReduce Architecture and YARN: Components, Workflow, and Optimization

This article explains Hadoop's distributed storage and processing framework, details the MapReduce programming model, describes the classic JobTracker/TaskTracker architecture, outlines the shuffle and combine phases, and introduces YARN as a scalable replacement with its ResourceManager, ApplicationMaster, and NodeManager components.

Big Data · Hadoop · MapReduce
0 likes · 13 min read
ITPUB
Mar 29, 2018 · Big Data

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

Big Data · Hadoop · MapReduce
0 likes · 15 min read
Snowball Engineer Team
Mar 23, 2018 · Big Data

Redesigning Snowball's Log Collection Architecture During Hadoop Cluster Expansion

The article details Snowball's challenges with a saturated CDH Hadoop cluster, outlines the limitations of the original Kafka‑based log pipeline, and explains how a comprehensive redesign using FlumeNG, Spillable Memory Channels, and custom HDFS sinks resolves latency, data loss, and high‑load issues while supporting future growth.

Cluster Migration · FlumeNG · Hadoop
0 likes · 6 min read
dbaplus Community
Mar 7, 2018 · Big Data

Taming Massive HDFS Data Growth: Monitoring, Capacity Planning & Hive Optimization

The article outlines a systematic approach for large‑scale Hadoop clusters to monitor daily data growth, identify abnormal paths, manage rapid expansion, clean unused cold data, and implement capacity forecasts, while providing concrete daily and quarterly actions, Hive‑specific strategies, and practical examples to keep storage under control.

Big Data · Data Growth · HDFS
0 likes · 17 min read
Architecture Digest
Feb 1, 2018 · Fundamentals

How Search Engines Work: Building Inverted Indexes

This article explains the core of search engine technology by describing what an inverted index is, how it is built using single‑pass memory and multi‑way merge methods, how indexes can be partitioned and incrementally updated, and how Hadoop can be used for large‑scale indexing.

Big Data · Hadoop · indexing
0 likes · 10 min read
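
As a rough illustration of the inverted-index idea summarized in this entry, the Python sketch here builds small in-memory indexes in a single pass and then combines them with a multi-way merge. It is not taken from the article: the function names `build_index` and `merge_indexes` and the sample documents are hypothetical, and tokenization is reduced to whitespace splitting.

```python
from collections import defaultdict
import heapq


def build_index(docs):
    """Single-pass, in-memory build: map each term to a sorted list of doc ids.

    `docs` is a hypothetical {doc_id: text} mapping used only for illustration.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Use a set so each document appears at most once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def merge_indexes(*partials):
    """Multi-way merge of partial indexes whose posting lists are already sorted.

    Assumes the partial indexes cover disjoint sets of document ids.
    """
    terms = set().union(*partials)
    return {
        term: list(heapq.merge(*(p.get(term, []) for p in partials)))
        for term in sorted(terms)
    }


part_a = build_index({1: "hadoop stores data", 2: "spark reads data"})
part_b = build_index({3: "hadoop runs spark"})
merged = merge_indexes(part_a, part_b)
```

Partitioning the documents and merging the partial results is also the shape a distributed build would take, with each partition indexed independently before the posting lists are combined.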
ITPUB
Nov 23, 2017 · Big Data

7 Typical Big Data Projects Every Hadoop Engineer Should Know

The article outlines seven common big‑data initiatives—data integration, specialized analytics, Hadoop‑as‑a‑service, stream processing, complex event handling, ETL pipelines, and SAS replacement—explaining their goals, typical technologies such as HDFS, Hive, Spark, Storm, Kafka, and practical considerations for enterprises adopting Hadoop ecosystems.

Data Integration · Hadoop · project types
0 likes · 8 min read
Full-Stack DevOps & Kubernetes
Oct 21, 2017 · Big Data

Deploy Hadoop CDH5.4 on CentOS 6: Install HDFS, YARN, and WebHDFS

This guide walks through preparing three CentOS 6.9 nodes, configuring hostnames, time sync, password‑less SSH, disabling IPv6, installing JDK, downloading CDH 5.4, setting up core‑site and hdfs‑site XML files, formatting the NameNode, starting HDFS services, configuring YARN and MapReduce, and verifying the installations via the Web UI.

Big Data · CDH · CentOS
0 likes · 18 min read
37 Interactive Technology Team
Oct 19, 2017 · Big Data

Ambari Technical Practice for Managing Hadoop Big Data Platforms

The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrade of their Hadoop‑centric big‑data platform, overcoming HA cluster takeover and custom Hive 2.1 integration through a three‑phase test, gray‑scale, and production rollout, thereby improving management efficiency and reducing O&M costs.

Ambari · Cluster Management · HDP
0 likes · 18 min read
21CTO
Sep 25, 2017 · Big Data

How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons

This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.

Big Data · Data Platform · Hadoop
0 likes · 16 min read
21CTO
Sep 5, 2017 · Big Data

Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide

This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.

Big Data · Gnuplot · Hadoop
0 likes · 10 min read
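
For readers unfamiliar with the map, shuffle, and reduce phases this entry refers to, here is a minimal single-process sketch of word count in Python. It is an assumption-laden illustration rather than the article's PHP scripts: the names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical, and a real Hadoop job would run these phases across many nodes.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


result = word_count(["big data", "Big Hadoop data"])
```

Keeping the mapper and reducer as separate functions mirrors how a streaming job splits the work into two independent scripts.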
Meituan Technology Team
Aug 25, 2017 · Big Data

Data Platform Integration and Multi‑Data‑Center Architecture at Meituan‑Dianping

After Meituan merged with Dianping, engineers unified two massive Hadoop ecosystems across Beijing and Shanghai by breaking the project into four phases—unify, copy, switch, fuse—standardizing versions, implementing zone‑aware transfers, cross‑realm Kerberos, and federated metadata to achieve a single, reliable multi‑data‑center platform.

Big Data · Cluster Fusion · Data Platform
0 likes · 32 min read
21CTO
Aug 21, 2017 · Big Data

Rethinking Hadoop: When to Use It and How Cloud Computing Changes the Game

This article reviews when Hadoop is appropriate, outlines its core features and limitations, explains cloud computing concepts and service models, and highlights the benefits of pre‑built Hadoop images for accelerating big‑data projects.

Big Data · Hadoop · Pre-built Images
0 likes · 13 min read
ITFLY8 Architecture Home
Jul 26, 2017 · Big Data

Inside Taobao’s Massive Data Architecture: From Hadoop “Cloud Ladder” to Real‑Time “Galaxy”

This article details Taobao’s multi‑layer massive data platform, covering its five‑tier architecture, the 1500‑node Hadoop “Cloud Ladder” for batch processing, the low‑latency “Galaxy” stream engine, MySQL‑based MyFOX, HBase‑based Prom storage, the glider middle‑layer, and sophisticated caching strategies that together support petabytes of data and millions of daily queries.

Big Data · Distributed Systems · HBase
0 likes · 16 min read
21CTO
Jun 14, 2017 · Big Data

How Apache Kylin Supercharges Big Data Analytics with Pre‑Computed Cubes

Apache Kylin is an open‑source, distributed OLAP engine built on Hadoop that uses pre‑computed cubes to deliver sub‑second, high‑concurrency SQL queries on massive datasets, integrates with popular BI tools, offers a modular architecture, recent 1.5.x enhancements, and extensive deployment options.

Apache Kylin · Hadoop · OLAP
0 likes · 17 min read
StarRing Big Data Open Lab
Jun 9, 2017 · Big Data

Secure HDFS with Guardian 5.0: Complete Permission and Quota Guide

This article explains why Hadoop security is critical, introduces Guardian 5.0’s unified authentication and authorization framework, and provides step‑by‑step instructions for configuring HDFS permissions and quotas through its web UI, helping administrators protect massive data assets efficiently.

Guardian5.0 · HDFS · Hadoop
0 likes · 9 min read
Architecture Digest
Jun 9, 2017 · Big Data

A Comprehensive Guide for Big Data Beginners: From Hadoop Fundamentals to Machine Learning

This guide walks beginners through the entire big‑data ecosystem, covering the 4V characteristics, core open‑source frameworks, Hadoop setup, Hive and SQL on Hadoop, data ingestion and export tools, task scheduling, real‑time processing with Kafka, Storm and Spark Streaming, and an introduction to machine‑learning applications.

Hadoop · Kafka · Spark
0 likes · 17 min read
dbaplus Community
Jun 7, 2017 · Big Data

Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This comprehensive guide walks you through MapReduce fundamentals, the complete execution flow, and seven hands‑on Hadoop projects—including WordCount, custom serialization, custom partitioning, grouping comparators, file merging, multiple outputs, join operations, and friend‑graph analysis—while providing environment setup steps, Maven commands, and Hadoop CLI examples.

CLI · Hadoop · MapReduce
0 likes · 28 min read
dbaplus Community
May 24, 2017 · Operations

How to Replace a ZooKeeper Node in a 5‑Node Cluster Without Downtime

This guide details the step‑by‑step process for replacing a faulty ZooKeeper node (myid 5) in a five‑node cluster, covering configuration updates in zoo.cfg, Hadoop’s hdfs‑site.xml and yarn‑site.xml, and HBase’s hbase‑site.xml, as well as the service restarts required to ensure continuous high availability.

Configuration · HBase · Hadoop
0 likes · 10 min read