Tagged articles

MapReduce

125 articles · Page 1 of 2

Apr 26, 2026 · Industry Insights

Martin Kleppmann on the New DDIA: How AI Will Disrupt Distributed Systems

In a deep interview, Martin Kleppmann explains why the upcoming second edition of Designing Data‑Intensive Applications rewrites core assumptions, declares MapReduce dead, predicts AI‑driven formal verification, warns of a talent gap, and champions local‑first software as the next frontier of distributed systems.

AI · Cloud Primitives · DDIA

0 likes · 10 min read

Big Data Tech Team

Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big Data · Data Governance · Flink

0 likes · 4 min read

Code Ape Tech Column

Aug 5, 2025 · Backend Development

Exploring PowerJob: A Lightweight Distributed Task Scheduler for Java

This article introduces PowerJob, a young yet powerful distributed task scheduling framework, covering the reasons for choosing it, its core concepts, high‑availability setup, workflow types, scheduling modes, deployment steps, and detailed code examples for single, broadcast, map, and MapReduce jobs.

Distributed Scheduling · Java · MapReduce

0 likes · 15 min read

Big Data Tech Team

Jun 8, 2025 · Big Data

Master Hadoop: A Step-by-Step Learning Roadmap for Big Data Professionals

This guide outlines a comprehensive Hadoop learning roadmap, covering essential prerequisites, core concepts such as HDFS, MapReduce, and YARN, hands‑on projects, advanced ecosystem tools like Hive, Pig, HBase and Spark, plus curated resources and community channels for aspiring big‑data engineers.

Distributed Computing · HDFS · Hadoop

0 likes · 7 min read

Rare Earth Juejin Tech Community

Dec 26, 2024 · Big Data

Understanding Hadoop HDFS and MapReduce: Principles, Architecture, and Sample Code

This article explains the origins of big‑data technologies, details the architecture and read/write mechanisms of Hadoop's HDFS, describes the MapReduce programming model, and provides complete Java code examples for a simple distributed file‑processing job using Maven dependencies.

Big Data · Distributed File System · HDFS

0 likes · 15 min read

Qunar Tech Salon

Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big Data · Hive · MapReduce

0 likes · 23 min read

Rare Earth Juejin Tech Community

Nov 23, 2024 · Big Data

Implementing a Basic Hadoop MapReduce Word Count with Extensible Design and Performance Tuning

This article explains Hadoop’s core concepts using a library analogy, details HDFS storage and MapReduce processing, provides complete Java implementations for a word‑count job with support for text, CSV, and JSON inputs, and discusses extensibility and performance optimizations such as combiners and custom partitioners.

Big Data · Hadoop · Java

0 likes · 20 min read

Alibaba Cloud Native

Jun 30, 2024 · Cloud Computing

Simplify Argo Workflows with Hera: Python SDK Guide for Kubernetes

This tutorial explains how to use the Hera Python SDK to create, submit, and manage Argo Workflows on an ACK One Serverless Argo cluster, covering installation, DAG diamond and MapReduce examples, and practical commands for token generation and workflow execution.

Argo Workflows · DAG · Hera SDK

0 likes · 11 min read

Code Ape Tech Column

Jun 16, 2024 · Backend Development

Introducing PowerJob: A Lightweight Distributed Task Scheduling Framework and Its Usage

This article introduces PowerJob, a young yet mature distributed task scheduling framework, explains why it was chosen, details its architecture, high‑availability design, deployment steps, and demonstrates various job types—including standalone, broadcast, map, and MapReduce—along with CRON, fixed‑rate, and fixed‑delay scheduling configurations.

Java · MapReduce · PowerJob

0 likes · 13 min read

ITPUB

Dec 14, 2023 · Big Data

How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.

Hadoop · MapReduce · Python

0 likes · 21 min read

Tencent Cloud Developer

Dec 14, 2023 · Big Data

Master Word Count with Python & Hadoop: A Step‑by‑Step Guide

This tutorial walks you through Hadoop’s core components, sets up a single‑node Hadoop cluster on CentOS 7, installs Python 3, writes mapper and reducer scripts in Python, and runs a Hadoop‑Streaming word‑count job to demonstrate classic big‑data processing techniques.

Big Data · Hadoop · Linux

0 likes · 22 min read

DaTaobao Tech

Dec 11, 2023 · Big Data

Design and Implementation of an Online Batch Processing Framework for Large-Scale Promotion Systems

The paper presents a centralized online batch‑processing framework for large‑scale promotion systems: applications integrate via an SDK, and a task center schedules and dispatches sub‑tasks through RocketMQ to Dubbo‑enabled containers, employing MapReduce‑style splitting, Guava rate‑limiting, and heartbeat health checks. The framework has handled over 1.3 million tasks during Double‑11.

Batch Processing · Big Data · Distributed Scheduling

0 likes · 9 min read

Code Ape Tech Column

Dec 9, 2023 · Backend Development

PowerJob Overview: Selection Rationale, Architecture, Task Types, and Scheduling Strategies with Code Samples

This article introduces the PowerJob distributed task framework, explains why it was chosen, details its architecture and high‑availability design, demonstrates various job types—including standalone, broadcast, map, and map‑reduce—with Java code examples, and covers scheduling options such as CRON, fixed‑rate, and fixed‑delay execution.

Distributed Scheduling · Java · MapReduce

0 likes · 14 min read

Shepherd Advanced Notes

Oct 24, 2023 · Backend Development

Why PowerJob Beats xxl-job: A More Powerful Distributed Scheduling Framework

PowerJob is presented as a third‑generation distributed task scheduler that adds workflow orchestration and Map/MapReduce compute to the basic CRON, fixed‑rate and API strategies, outperforming Quartz and xxl-job while offering a visual web console and extensive executor support.

Distributed Scheduling · Java · MapReduce

0 likes · 10 min read

Rare Earth Juejin Tech Community

Sep 29, 2023 · Backend Development

Concurrent Chunk Processing in Go: A MapReduce‑Style Solution

The article explains how to handle business scenarios that require splitting large data sets into concurrent I/O requests and sequential aggregation by presenting a Go‑based chunk processing framework with map and reduce functions, configurable concurrency, and example code.

Chunk Processing · Go · MapReduce

0 likes · 7 min read

Architecture Digest

Jun 3, 2023 · Backend Development

PowerJob: A Next‑Generation Distributed Task Scheduling and Computing Framework – Features, Comparison, and Quick‑Start Guide

PowerJob is a modern distributed job scheduling framework that addresses the limitations of Quartz, XXL‑Job and SchedulerX by offering a web UI, rich scheduling strategies, DAG workflow support, lock‑free high‑performance scheduling, multiple processor types and step‑by‑step quick‑start instructions for developers.

Distributed Scheduling · Java · MapReduce

0 likes · 10 min read

Programmer DD

Feb 24, 2023 · Artificial Intelligence

How Jeff Dean’s Journey Shaped Google’s AI and Big Data Revolution

Jeff Dean, a Google engineering legend, has mastered over 18 programming languages and pioneered transformative technologies such as MapReduce, Bigtable, Spanner, and TensorFlow, illustrating how his relentless pursuit of scalability and performance has driven the evolution of AI, big data, and modern cloud infrastructure.

AI · Jeff Dean · MapReduce

0 likes · 14 min read

StarRing Big Data Open Lab

Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big Data · Distributed Computing · MapReduce

0 likes · 12 min read

Architecture Digest

Dec 16, 2022 · Backend Development

PowerJob: A Next‑Generation Distributed Task Scheduling and Computing Framework – Introduction and Quick‑Start Guide

PowerJob is a third‑generation distributed job scheduler that adds workflow orchestration, map‑reduce style computation and rich execution modes to traditional CRON‑based scheduling, and this guide explains its advantages, core features, architecture, and provides step‑by‑step instructions with code samples to get started quickly.

Distributed Scheduling · Java · MapReduce

0 likes · 11 min read

ITPUB

Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Distributed Computing · HDFS

0 likes · 20 min read

Python Crawling & Data Mining

Oct 16, 2022 · Big Data

What Makes Hadoop the Backbone of Modern Big Data Processing?

This article provides a comprehensive overview of Hadoop, covering its history, core features, the HDFS storage framework, MapReduce computation engine, YARN resource manager, real‑world application scenarios, and the surrounding ecosystem of tools such as Hive, Spark and Kafka.

Distributed Computing · HDFS · Hadoop

0 likes · 20 min read

IEG Growth Platform Technology Team

Apr 18, 2022 · Big Data

Big Data Overview: Definitions, Applications, Technology Stack, and Core Components (Hadoop, HDFS, MapReduce, YARN, Hive, HBase)

This comprehensive article explains big data concepts, definitions from Gartner and IBM, real‑world use cases, the Hadoop ecosystem architecture, and detailed introductions to HDFS, MapReduce, YARN, Hive, and HBase, including practical examples and shell commands.

HBase · HDFS · Hadoop

0 likes · 42 min read

JavaEdge

Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD

0 likes · 7 min read

Practical DevOps Architecture

Jan 4, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hadoop 2.9.2 Cluster on Three Nodes

This article provides a detailed, step-by-step tutorial for installing Hadoop 2.9.2, configuring environment variables, editing XML configuration files, formatting the NameNode, starting HDFS and YARN services, testing the cluster, and setting up the MapReduce history server on a three‑node Linux environment.

Big Data · Cluster Setup · Hadoop

0 likes · 9 min read

DataFunTalk

Dec 27, 2021 · Big Data

Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies

This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.

Big Data · Hadoop · Hive

0 likes · 20 min read

HomeTech

Dec 24, 2021 · Big Data

Handling java.lang.OutOfMemoryError in Hadoop MapReduce

This article explains the four locations where java.lang.OutOfMemoryError can occur in Hadoop's MapReduce framework—client, ApplicationMaster, Map, and Reduce phases—and provides configuration adjustments and best‑practice solutions to mitigate each type of OOM issue.

Hadoop · Java · MapReduce

0 likes · 11 min read

Big Data Technology & Architecture

Oct 23, 2021 · Big Data

Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage

This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.

EXPLAIN · Hive · MapReduce

0 likes · 48 min read

Big Data Technology & Architecture

Oct 8, 2021 · Big Data

Hadoop HDFS Storage Optimization, Erasure Coding, Heterogeneous Storage, and Cluster Tuning Guide

This article provides a comprehensive guide to optimizing Hadoop HDFS storage through erasure coding and heterogeneous storage policies, explains fault‑tolerance techniques such as safe mode and slow‑disk monitoring, and shares practical MapReduce performance tuning and enterprise‑level configuration examples for large‑scale clusters.

Cluster Tuning · HDFS · Hadoop

0 likes · 32 min read

Big Data Technology & Architecture

Sep 23, 2021 · Big Data

Handling Non‑Splittable gzip Files in Hadoop and Spark: MapReduce Splits and Performance Considerations

This article explains how a 10 GB gzip file is stored and processed on HDFS, details the MapReduce split calculation using GzipCodec, and discusses why Spark reads such non‑splittable files with a single task, recommending file splitting or format conversion for better performance.

Data Splits · Hadoop · MapReduce

0 likes · 8 min read

Big Data Technology & Architecture

Sep 17, 2021 · Big Data

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.

HDFS · MapReduce · Reliability

0 likes · 13 min read

ITPUB

Sep 16, 2021 · Big Data

Understanding Hadoop: Architecture, HDFS, MapReduce, and Their Pros & Cons

This article explains how Hadoop revolutionized big data by providing a distributed architecture with HDFS for storage and MapReduce for processing, outlines its ecosystem components, describes the inner workings of HDFS and MapReduce, and discusses the strengths and limitations of this approach.

HDFS · Hadoop · MapReduce

0 likes · 7 min read

Big Data Technology & Architecture

Sep 16, 2021 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer data structure to store serialized key/value pairs and their metadata in memory, describes its initialization, write path, spill handling, and the underlying algorithms that ensure efficient in‑memory sorting and disk spilling.

Hadoop · In-Memory Buffer · MapReduce

0 likes · 24 min read

ITPUB

Sep 13, 2021 · Big Data

MapReduce vs MPP: Choosing the Right Engine for Global Data Warehousing

A team of engineers at MBI debates the merits of MapReduce, MPP, and Hive for their KeepS global data‑warehouse, discussing technical differences, scalability, concurrency, and the feasibility of mixed batch engines while navigating budget and operational constraints.

Cluster Computing · Grid Computing · Hive

0 likes · 20 min read

Big Data Technology & Architecture

Sep 1, 2021 · Big Data

Understanding Hadoop Data Splitting and InputFormat Mechanisms

This article explains Hadoop's data splitting concepts, the distinction between HDFS blocks and logical InputSplits, details the source code of various InputFormats such as TextInputFormat, CombineTextInputFormat, KeyValueTextInputFormat, NLineInputFormat, and custom InputFormats, and provides complete Java examples for Mapper, Reducer, and driver classes.

Data Splitting · Hadoop · InputFormat

0 likes · 24 min read

Qunar Tech Salon

Aug 26, 2021 · Big Data

Comprehensive Introduction to Apache Spark: History, Core Concepts, Architecture, and Performance Optimization

This article provides a thorough overview of Apache Spark, covering its origins, comparison with MapReduce, core concepts such as RDD, DAG, Jobs, Stages, and Tasks, the submission process, Web UI, and detailed performance tuning techniques including data skew mitigation.

Big Data · Data Skew · MapReduce

0 likes · 15 min read

The Dominant Programmer

Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big Data · CentOS 7 · HBase

0 likes · 11 min read

Big Data Technology & Architecture

Jul 19, 2021 · Big Data

Understanding Hadoop: MapReduce, HDFS, YARN, and Core Big Data Concepts

This article provides a comprehensive overview of Hadoop’s core components—including MapReduce programming model, HDFS storage architecture, and YARN resource management—while discussing common challenges like data skew and small files, and offering learning resources for aspiring big‑data engineers.

Data Skew · HDFS · Hadoop

0 likes · 9 min read

Big Data Technology & Architecture

Jul 15, 2021 · Big Data

Understanding Hive Architecture, Execution Flow, and the Shift to Tez and Spark

This article explains Hive's core components, execution architecture, how HiveQL is transformed into MapReduce jobs, the advantages of Tez over MapReduce in Hive 3.0+, and the integration of Spark with Hive for modern big‑data processing.

Data Warehouse · Hive · MapReduce

0 likes · 9 min read

Tech Musings

Jul 8, 2021 · Big Data

Building a Simple Single-Node MapReduce System: From Theory to Code

This article walks through implementing a lightweight single‑machine MapReduce framework inspired by the original MapReduce paper, covering the abstract Map/Reduce model, task scheduling between master and workers, core Go code for map, reduce, worker, and coordinator, and a brief reflection on its limitations.

Big Data · Lab · MapReduce

0 likes · 10 min read

Tech Musings

Jul 7, 2021 · Fundamentals

Unlock Distributed Systems Mastery with MIT’s 6.824 Course and Labs

The MIT 6.824 course offers rich video resources, low entry difficulty, and well‑structured labs covering MapReduce, Raft, a simple KV store, and sharding, while the author shares personal challenges and tips for tackling the coursework.

6.824 · KV store · MIT

0 likes · 4 min read

DataFunTalk

Jun 11, 2021 · Big Data

Comprehensive Guide to Fast and Stable Hive‑to‑HBase Data Transfer Using Bulkload, MapReduce, and Spark

This article explains how to efficiently move large volumes of data from Hive to HBase by leveraging HBase's bulkload mechanism, detailing the original MapReduce workflow, its performance bottlenecks, and a rewritten Spark‑based solution that simplifies ETL, improves partitioning, and achieves several‑fold speedup.

Big Data · ETL · HBase

0 likes · 17 min read

Big Data Technology & Architecture

Apr 15, 2021 · Big Data

Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

Data Warehouse · Hadoop · Hive

0 likes · 41 min read

Big Data Technology & Architecture

Apr 14, 2021 · Big Data

Understanding Spark Shuffle: Write and Read Mechanisms Compared to Hadoop MapReduce

This article explains how Spark implements shuffle write and shuffle read, compares its high‑level and low‑level processes with Hadoop MapReduce, and details the internal data structures, memory‑disk trade‑offs, and configuration options that affect performance.

MapReduce · MemoryManagement · RDD

0 likes · 21 min read

Full-Stack Internet Architecture

Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · Distributed Computing · HDFS

0 likes · 33 min read

Big Data Technology & Architecture

Jan 22, 2021 · Big Data

Key New Features and Improvements in Hadoop 3.x

Hadoop 3.x upgrades the platform to JDK 1.8 and introduces a range of enhancements across common components, HDFS, YARN, and MapReduce, including erasure coding, multi‑NameNode high availability, cgroup‑based resource isolation, native map‑output collectors, and split client libraries, while also adding support for Azure and Aliyun distributed file systems.

HDFS · Hadoop · MapReduce

0 likes · 7 min read

Big Data Technology & Architecture

Jan 12, 2021 · Big Data

Hadoop Interview Questions and Topics – HDFS, MapReduce, YARN, and Optimization

This article compiles a comprehensive set of Hadoop interview questions covering HDFS write and read processes, architecture, fault‑tolerance, NameNode metadata management, MapReduce scheduling, combiner and partition roles, YARN scheduling strategies, and various optimization techniques for both MapReduce and HDFS.

HDFS · Hadoop · MapReduce

0 likes · 5 min read

Big Data Technology & Architecture

Dec 3, 2020 · Big Data

Hive Query Optimization Techniques and Best Practices

This article presents a comprehensive guide to optimizing Hive queries, covering limit adjustments, join strategies, local mode execution, parallelism, strict mode, mapper and reducer tuning, JVM reuse, dynamic partitioning, speculative execution, data skew handling, and small‑file mitigation techniques.

Hive · MapReduce · Performance Tuning

0 likes · 20 min read

Big Data Technology & Architecture

Oct 31, 2020 · Big Data

Hive Performance Tuning: Understanding Map and Reduce Counts

This article explains how Hive determines the number of map and reduce tasks based on input file size and block configuration, discusses when to increase or decrease map counts, and provides practical commands for adjusting reducer settings to optimize large‑scale data processing.

Big Data · Hive · MapReduce

0 likes · 6 min read

Ctrip Technology

Sep 10, 2020 · Big Data

Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for terabyte‑scale daily log volumes.

Big Data · Camus · Data Governance

0 likes · 15 min read

Big Data Technology & Architecture

Sep 1, 2020 · Big Data

Configuring Hadoop to Support LZO Compression

This guide explains how to enable LZO compression in Hadoop by installing the Twitter‑provided hadoop‑lzo library, updating core‑site.xml, synchronizing files across nodes, creating LZO indexes, and running a WordCount MapReduce job with LZO‑compressed output.

Big Data · Configuration · Hadoop

0 likes · 6 min read

Big Data Technology & Architecture

Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big Data · Data Skew · Hive

0 likes · 19 min read

Big Data Technology & Architecture

May 28, 2020 · Big Data

Hadoop System Bottleneck Detection and MapReduce Optimization Guide

This article provides a comprehensive guide on detecting Hadoop system bottlenecks, analyzing resource constraints, and applying practical MapReduce performance tuning techniques—including baseline creation, counter analysis, combiner usage, compression, and proper Writable types—to achieve optimal big‑data processing efficiency.

Big Data · Hadoop · MapReduce

0 likes · 11 min read

Big Data Technology Architecture

May 27, 2020 · Big Data

Why Spark Outperforms Hadoop MapReduce: In‑Memory Computing, Task Scheduling, and Execution Strategies

The article explains that Spark’s in‑memory processing, thread‑based task model, selective shuffle sorting, and flexible RDD/DAG architecture give it a significant performance advantage over Hadoop MapReduce’s disk‑heavy, process‑based batch execution.

Distributed Processing · MapReduce · Spark

0 likes · 4 min read

Big Data Technology & Architecture

Apr 9, 2020 · Big Data

Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown

The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.

Big Data · Hadoop · Hive

0 likes · 4 min read

Big Data Technology Architecture

Mar 19, 2020 · Big Data

Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations

Data skew in Hive MapReduce jobs, caused by uneven key distribution during joins, group‑by, or COUNT(DISTINCT) operations, can severely slow tasks, and the article explains common scenarios and practical solutions such as using MapJoin, enabling map‑side aggregation, load‑balancing, and rewriting queries to mitigate skew.

Data Skew · Hive · MapJoin

0 likes · 7 min read

Big Data Technology & Architecture

Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · GC

0 likes · 11 min read

Big Data Technology & Architecture

Feb 9, 2020 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer to store serialized key/value pairs and their metadata, detailing its structure, initialization, write path, spill logic, and the background thread that sorts and writes data to disk.

Big Data · Hadoop · Java

0 likes · 24 min read

Architects Research Society

Feb 3, 2020 · Databases

MapReduce‑Style Parallel Query Processing with Citus

The article explains how Citus enables sharding and MapReduce‑style parallel query execution in PostgreSQL, showing performance gains, example bucket algorithms, and how standard SQL can replace custom MapReduce code for large‑scale data analytics.

Citus · Distributed PostgreSQL · MapReduce

0 likes · 6 min read

DataFunTalk

Oct 25, 2019 · Big Data

Migrating Data from HBase to Kafka Using MapReduce

This article explains how to reverse the typical data flow by extracting large volumes of rowkeys from HBase with MapReduce, storing them on HDFS, and then using batch Get operations to retrieve the full records and write them into Kafka, while handling retries and monitoring progress.

Big Data · Data Migration · HBase

0 likes · 9 min read

Sohu Tech Products

Oct 9, 2019 · Databases

MongoDB Aggregation Framework: Stages, Pipelines, and Examples

This article provides an in‑depth overview of MongoDB’s aggregation framework, explaining the concepts of pipelines and stages such as $match, $group, $project, $lookup, $unwind, and $out, and includes practical code examples, syntax details, and comparisons to SQL aggregation.

Aggregation · MapReduce · MongoDB

0 likes · 25 min read

360 Tech Engineering

Jul 31, 2019 · Backend Development

Design and Key Technologies of the 360 Search Engine for Billion‑Scale Web Retrieval

This article explains how 360 Search processes billions of web pages daily, detailing its backend architecture, offline indexing, online retrieval, index organization, and relevance models that enable efficient search over a hundred‑billion‑scale web corpus.

Big Data · HBase · Indexing

0 likes · 21 min read

Tencent Cloud Developer

Jul 16, 2019 · Big Data

Design and Challenges of Tencent iData Analysis Center Backend: Bitmap Storage and MapReduce Architecture

Tencent’s iData Analysis Center rebuilt its backend as TGMars, replacing a rigid row‑oriented bitmap store and single‑node MapReduce pipeline with a more extensible architecture that shards user behavior bitmaps, eliminates shuffle overhead, and adds columnar storage, iterative processing and SQL‑like capabilities using Spark to overcome scalability and flexibility limitations.

MapReduce · OLAP · bitmap storage

0 likes · 10 min read

Big Data Technology & Architecture

Jun 24, 2019 · Big Data

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, fine‑tuning join operations, and optimizing MapReduce parameters such as mapper/reducer counts, file merging, compression, JVM reuse, parallel execution, strict mode, and storage formats.

Big Data · Hive · MapReduce

0 likes · 19 min read

DataFunTalk

Jun 17, 2019 · Big Data

Understanding Hadoop’s Core Competitiveness in the Trillion‑Scale Data Era

This article explores Hadoop’s role in the big‑data era, detailing its architecture, core components such as HDFS, YARN, MapReduce, Ozone and Submarine, the challenges of trillion‑scale data, and why its scalability, cost efficiency, and mature ecosystem give it a competitive edge.

Data Lake · Hadoop · MapReduce

0 likes · 11 min read

Full-Stack Internet Architecture

Jun 8, 2019 · Big Data

The Story of Doug Cutting: From Stanford to Hadoop and Beyond

This article chronicles Doug Cutting's journey from his humble beginnings at Stanford through his pioneering work on Lucene, Nutch, and Hadoop, highlighting how his innovations in search and distributed computing reshaped the big data landscape and led to the rise of Cloudera.

Big Data · Cloudera · Doug Cutting

0 likes · 8 min read

Big Data Technology & Architecture

Apr 21, 2019 · Big Data

Overview of Hive Data Warehouse, Its Architecture, Query Processing, and Comparison with Impala

This article provides a comprehensive overview of Hive as a Hadoop‑based data warehouse, explains its architecture, query‑to‑MapReduce translation, high‑availability design, and compares its batch‑oriented processing with Impala's low‑latency SQL engine for big data analytics.

Big Data · Data Warehouse · High Availability

0 likes · 15 min read

Big Data Technology & Architecture

Apr 20, 2019 · Big Data

Weekly Hadoop Knowledge Points: Compression Formats, MapReduce Join, Hive Setup, and YARN Capacity Scheduler

This weekly bulletin summarizes four Hadoop knowledge points—compression formats, MapReduce join techniques, Hive installation, and YARN Capacity Scheduler—while also sharing personal updates about a PhD graduation, the upcoming May Day holiday, and a request for likes and shares.

Big Data · Hadoop · Hive

0 likes · 2 min read

Big Data Technology & Architecture

Apr 15, 2019 · Big Data

Map‑Side Join and Reduce‑Side Join Examples in Hadoop MapReduce (Java)

This article provides two reusable Java code samples that demonstrate how to perform a map‑side join and a reduce‑side join in Hadoop MapReduce, enabling efficient joining of a large dataset with a smaller reference table.

Big Data · Hadoop · JOIN

0 likes · 8 min read

Big Data Technology & Architecture

Apr 10, 2019 · Big Data

Understanding Hadoop DistributedCache: Concepts, API Usage, and Example

This article explains Hadoop's DistributedCache mechanism, its APIs for adding cache files and archives, common use cases, important considerations, the basic workflow, and provides a complete Java Map-side join example demonstrating how to distribute and access cached data in MapReduce jobs.

DistributedCache · Hadoop · Java

0 likes · 10 min read

Architecture Digest

Apr 5, 2019 · Fundamentals

An Overview of Recent Developments and Practical Topics in Distributed Systems

This article provides a comprehensive introduction to modern distributed systems, covering recent research trends, practical technologies such as Paxos, Consistent Hashing, MapReduce, Spark, various storage and computing paradigms, and offers guidance for beginners on how to navigate the field.

MapReduce · Parameter Server · Paxos

0 likes · 18 min read

Big Data Technology & Architecture

Apr 4, 2019 · Big Data

Weekly Knowledge Points: Interview Reflections, Hadoop Introduction, MapReduce and HDFS Overview

This weekly briefing shares five curated resources covering interview reflections, a concise Hadoop introduction, the principles of MapReduce, an overview of HDFS, and upcoming plans to study Hive and HBase, emphasizing the distributed nature of big‑data processing.

Big Data · HDFS · Hadoop

0 likes · 3 min read

Big Data Technology & Architecture

Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Distributed Computing · Hadoop

0 likes · 10 min read

Big Data Technology & Architecture

Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop

0 likes · 15 min read

dbaplus Community

Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC Tuning

0 likes · 12 min read

360 Quality & Efficiency

Jan 4, 2019 · Big Data

Overview of Big Data Processing Engines: MapReduce, Tez, Spark, and Flink

This article reviews the evolution and characteristics of major big‑data processing engines—from first‑generation Hadoop MapReduce to second‑generation DAG‑based Tez, third‑generation in‑memory Spark, and fourth‑generation real‑time Flink—highlighting their batch and streaming use cases.

Big Data · Flink · MapReduce

0 likes · 9 min read

Big Data Technology & Architecture

Dec 31, 2018 · Big Data

Overview of the Big Data Ecosystem and Core Technologies

This article provides a comprehensive overview of the big data ecosystem, explaining key components such as Hadoop, HDFS, Spark, Hive, Pig, HBase, and related tools, and describes how they work together to store, process, and analyze massive datasets efficiently.

Big Data · Hadoop · Hive

0 likes · 16 min read

Architects Research Society

Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Big Data · Distributed Computing · Flink

0 likes · 22 min read

21CTO

Dec 14, 2018 · Artificial Intelligence

Inside Jeff Dean and Sanjay Ghemawat’s Epic Journey: From Index Crashes to AI Powerhouses

The article chronicles Jeff Dean and Sanjay Ghemawat’s partnership at Google, from the 2000 index failure that threatened the company, through their pioneering work on MapReduce and large‑scale infrastructure, to the creation of TensorFlow and the rise of Google AI, highlighting their unique collaborative style and lasting impact on modern computing.

Google · Jeff Dean · MapReduce

0 likes · 29 min read

21CTO

Sep 21, 2018 · Big Data

Master Massive Data Processing: Key Techniques from Hash Maps to MapReduce

This comprehensive guide explores essential strategies for handling massive datasets, covering hash-based structures, bucket partitioning, heap and quicksort techniques, trie trees, Bloom filters, external sorting, and MapReduce, and demonstrates how to efficiently solve common interview problems such as top‑K queries and duplicate removal.

Data Structures · Hash · MapReduce

0 likes · 35 min read

dbaplus Community

Aug 6, 2018 · Big Data

Understanding RAID, HDFS, and MapReduce: From Storage to Distributed Computing

This article explains the storage challenges of big data, introduces RAID levels and their trade‑offs, describes the HDFS architecture with NameNode and DataNode replication, details the MapReduce programming model and execution flow, and shows how Hive translates SQL queries into MapReduce jobs.

Big Data · Distributed Computing · HDFS

0 likes · 23 min read

Big Data and Microservices

Jul 24, 2018 · Big Data

Why Hadoop Still Leads Big Data Processing: Core Advantages Explained

This article introduces Hadoop, an open‑source big‑data framework, explains its core components HDFS and MapReduce, and outlines four key advantages—ease of deployment, robustness, scalability, and simplicity—while also covering HBase, a column‑oriented database built on Hadoop.

Big Data · Distributed Computing · HBase

0 likes · 4 min read

360 Quality & Efficiency

Jun 28, 2018 · Big Data

An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases

This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.

Data Warehouse · Hadoop · Hive

0 likes · 5 min read

ITFLY8 Architecture Home

Jun 2, 2018 · Big Data

How Google’s Low‑Cost PC Cluster Powers Its Massive Search Engine

This article examines Google’s unconventional infrastructure, detailing how millions of inexpensive PC‑level servers, custom power supplies, and proprietary networking support massive scale, and explains the core platforms—Google File System, MapReduce, and BigTable—that enable fast, reliable search and data processing across the globe.

Bigtable · GFS · Google

0 likes · 29 min read

dbaplus Community

May 23, 2018 · Big Data

Understanding MapReduce: A Simple Analogy to Master Big Data Distributed Computing

This article uses a human‑computer analogy and a playing‑card counting example to explain the fundamentals of distributed computing, why single machines cannot handle massive data, and how the MapReduce model’s four steps—split, transform, shuffle, and merge—solve big‑data problems.

Big Data · Distributed Computing · MapReduce

0 likes · 15 min read

Architects' Tech Alliance

May 14, 2018 · Big Data

Understanding Hadoop MapReduce Architecture and YARN: Components, Workflow, and Optimization

This article explains Hadoop's distributed storage and processing framework, details the MapReduce programming model, describes the classic JobTracker/TaskTracker architecture, outlines the shuffle and combine phases, and introduces YARN as a scalable replacement with its ResourceManager, ApplicationMaster, and NodeManager components.

Big Data · Hadoop · MapReduce

0 likes · 13 min read

ITPUB

Mar 29, 2018 · Big Data

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

Big Data · Hadoop · MapReduce

0 likes · 15 min read

21CTO

Sep 5, 2017 · Big Data

Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide

This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.

Big Data · Gnuplot · Hadoop

0 likes · 10 min read

Architecture Digest

Aug 15, 2017 · Artificial Intelligence

Why AI Engineers Must Understand Basic Infrastructure: From Big Data to Deep Learning

The article explains why AI engineers need foundational infrastructure knowledge—covering big‑data processing, cloud services, containerization, MapReduce, and deep‑learning platforms—to effectively solve real‑world problems, collaborate with teams, and build scalable, maintainable AI solutions.

AI Infrastructure · Big Data · Cloud Computing

0 likes · 14 min read

Architecture Digest

Jun 15, 2017 · Big Data

Implementing a Distributed Stepwise Queue with Zookeeper for Hadoop Profit Calculation

This article demonstrates how to use Zookeeper as a distributed stepwise queue to coordinate multiple Hadoop MapReduce jobs for purchase, sales, and other cost calculations, automatically triggering a profit computation once all tasks complete, and provides full Java code examples and deployment instructions.

Hadoop · Java · MapReduce

0 likes · 21 min read

37 Interactive Technology Team

Jun 13, 2017 · Big Data

MapReduce Principles and Hadoop Execution Process with WordCount Example

The article explains MapReduce’s divide‑and‑conquer model and Hadoop’s execution pipeline—including map, partition, spill, merge, shuffle, and reduce phases—illustrated with a WordCount example that shows how mappers emit (word, 1) pairs and reducers aggregate the counts to produce final frequencies on HDFS.

Distributed Computing · Hadoop · MapReduce

0 likes · 7 min read

dbaplus Community

Jun 7, 2017 · Big Data

Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This comprehensive guide walks you through MapReduce fundamentals, the complete execution flow, and seven hands‑on Hadoop projects—including WordCount, custom serialization, custom partitioning, grouping comparators, file merging, multiple outputs, join operations, and friend‑graph analysis—while providing environment setup steps, Maven commands, and Hadoop CLI examples.

CLI · Distributed Computing · Hadoop

0 likes · 28 min read

MaGe Linux Operations

May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · Hive · Key-Value Store

0 likes · 9 min read

StarRing Big Data Open Lab

May 12, 2017 · Big Data

How to Master Hadoop Performance: A Real-World TPCx-HS Tuning Case Study

This article walks through a detailed Hadoop performance tuning case using the TPCx-HS benchmark, explaining the bottlenecks in TeraGen and TeraSort, the optimization strategies applied, hardware considerations, and the resulting improvements in CPU and network utilization.

Cluster Optimization · Hadoop · MapReduce

0 likes · 9 min read

Qunar Tech Salon

May 5, 2017 · Backend Development

WeChat MQ 2.0: Enhanced Asynchronous Queue Design and Optimizations

The article introduces WeChat's self‑developed MQ 2.0 asynchronous queue, detailing its architecture, cross‑machine consumption model, improved task scheduling, efficient processing frameworks—including a MapReduce‑style engine and streaming tasks—and robust overload protection mechanisms that together boost reliability and performance for large‑scale backend services.

MapReduce · Message Queue · Streaming Tasks

0 likes · 12 min read

MaGe Linux Operations

May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · Hive

0 likes · 13 min read

360 Quality & Efficiency

Apr 24, 2017 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and Common Commands

This article introduces Hadoop as a widely used big‑data framework, explains its core components HDFS and MapReduce, describes the cluster node roles, presents typical command‑line usage and a sample MapReduce workflow, and offers guidance for further learning.

Distributed Computing · HDFS · Hadoop

0 likes · 5 min read

Java High-Performance Architecture

Apr 4, 2017 · Big Data

Master MapReduce: Principles, Process, and 7 Hands‑On Examples

This tutorial quickly introduces the MapReduce model, explains its core principles and execution flow, and guides you through seven practical examples—from basic WordCount to custom serialization, partitioning, joins, and friend‑recommendation—while providing test data and an optional ready‑made Hadoop environment for hands‑on practice.

Distributed Computing · Hadoop · MapReduce

0 likes · 3 min read

ITPUB

Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · Distributed Computing · MapReduce

0 likes · 25 min read

Huawei Cloud Developer Alliance

Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · Distributed Computing · HDFS

0 likes · 18 min read

Art of Distributed System Architecture Design

Dec 31, 2016 · Big Data

Understanding Hadoop: Architecture, HDFS, and MapReduce

This article explains Hadoop as an Apache‑managed open‑source platform for storing massive data on distributed clusters and running robust, efficient analytics via its two core components—HDFS for storage and the Java‑based MapReduce framework for processing—highlighting modularity, high availability, and common tooling.

Distributed Computing · HDFS · Hadoop

0 likes · 6 min read
