Tagged articles
123 articles
Big Data Tech Team
Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big Data · Data Governance · Flink
4 min read
Code Ape Tech Column
Aug 5, 2025 · Backend Development

Exploring PowerJob: A Lightweight Distributed Task Scheduler for Java

This article introduces PowerJob, a young yet powerful distributed task scheduling framework, covering the reasons for choosing it, core concepts, high‑availability setup, workflow types, scheduling modes, deployment steps, and detailed code examples for single, broadcast, map, and MapReduce jobs.

Distributed Scheduling · Java · MapReduce
15 min read
Big Data Tech Team
Jun 8, 2025 · Big Data

Master Hadoop: A Step-by-Step Learning Roadmap for Big Data Professionals

This guide outlines a comprehensive Hadoop learning roadmap, covering essential prerequisites, core concepts such as HDFS, MapReduce, and YARN, hands‑on projects, advanced ecosystem tools like Hive, Pig, HBase and Spark, plus curated resources and community channels for aspiring big‑data engineers.

HDFS · Hadoop · MapReduce
7 min read
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS and why small files strain memory, compute, and overall cluster load, outlines common origins such as data sources, streaming, and dynamic partitioning, and provides detailed Hive and Spark solutions (including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions) to merge small files efficiently and improve job performance.

Big Data · Hive · MapReduce
23 min read
Rare Earth Juejin Tech Community
Nov 23, 2024 · Big Data

Implementing a Basic Hadoop MapReduce Word Count with Extensible Design and Performance Tuning

This article explains Hadoop’s core concepts using a library analogy, details HDFS storage and MapReduce processing, provides complete Java implementations for a word‑count job with support for text, CSV, and JSON inputs, and discusses extensibility and performance optimizations such as combiners and custom partitioners.

Big Data · Hadoop · Java
20 min read
Code Ape Tech Column
Jun 16, 2024 · Backend Development

Introducing PowerJob: A Lightweight Distributed Task Scheduling Framework and Its Usage

This article introduces PowerJob, a young yet mature distributed task scheduling framework, explains why it was chosen, details its architecture, high‑availability design, deployment steps, and demonstrates various job types—including standalone, broadcast, map, and MapReduce—along with CRON, fixed‑rate, and fixed‑delay scheduling configurations.

Java · MapReduce · powerjob
13 min read
ITPUB
Dec 14, 2023 · Big Data

How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.

Hadoop · MapReduce · Python
21 min read
DaTaobao Tech
Dec 11, 2023 · Big Data

Design and Implementation of an Online Batch Processing Framework for Large-Scale Promotion Systems

The paper presents a centralized online batch‑processing framework for large‑scale promotion systems: applications integrate via an SDK, and a task center schedules and dispatches sub‑tasks through RocketMQ to Dubbo‑enabled containers, employing MapReduce‑style splitting, Guava rate‑limiting, and heartbeat health checks. The framework has handled over 1.3 million tasks during Double‑11.

Batch Processing · Big Data · Distributed Scheduling
9 min read
Code Ape Tech Column
Dec 9, 2023 · Backend Development

PowerJob Overview: Selection Rationale, Architecture, Task Types, and Scheduling Strategies with Code Samples

This article introduces the PowerJob distributed task framework, explains why it was chosen, details its architecture and high‑availability design, demonstrates various job types—including standalone, broadcast, map, and map‑reduce—with Java code examples, and covers scheduling options such as CRON, fixed‑rate, and fixed‑delay execution.

Backend · Distributed Scheduling · Java
14 min read
Architecture Digest
Jun 3, 2023 · Backend Development

PowerJob: A Next‑Generation Distributed Task Scheduling and Computing Framework – Features, Comparison, and Quick‑Start Guide

PowerJob is a modern distributed job scheduling framework that addresses the limitations of Quartz, XXL‑Job and SchedulerX by offering a web UI, rich scheduling strategies, DAG workflow support, lock‑free high‑performance scheduling, multiple processor types and step‑by‑step quick‑start instructions for developers.

Distributed Scheduling · Java · MapReduce
10 min read
Programmer DD
Feb 24, 2023 · Artificial Intelligence

How Jeff Dean’s Journey Shaped Google’s AI and Big Data Revolution

Jeff Dean, a Google engineering legend, has mastered over 18 programming languages and pioneered transformative technologies such as MapReduce, Bigtable, Spanner, and TensorFlow, illustrating how his relentless pursuit of scalability and performance has driven the evolution of AI, big data, and modern cloud infrastructure.

AI · Jeff Dean · MapReduce
14 min read
StarRing Big Data Open Lab
Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big Data · MapReduce · Spark
12 min read
Architecture Digest
Dec 16, 2022 · Backend Development

PowerJob: A Next‑Generation Distributed Task Scheduling and Computing Framework – Introduction and Quick‑Start Guide

PowerJob is a third‑generation distributed job scheduler that adds workflow orchestration, map‑reduce style computation and rich execution modes to traditional CRON‑based scheduling, and this guide explains its advantages, core features, architecture, and provides step‑by‑step instructions with code samples to get started quickly.

Distributed Scheduling · Java · MapReduce
11 min read
ITPUB
Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Ecosystem · HDFS
20 min read
JavaEdge
Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD
7 min read
HomeTech
Dec 24, 2021 · Big Data

Handling java.lang.OutOfMemoryError in Hadoop MapReduce

This article explains the four locations where java.lang.OutOfMemoryError can occur in Hadoop's MapReduce framework—client, ApplicationMaster, Map, and Reduce phases—and provides configuration adjustments and best‑practice solutions to mitigate each type of OOM issue.

Hadoop · Java · MapReduce
11 min read
Big Data Technology & Architecture
Oct 23, 2021 · Big Data

Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage

This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.

Hive · MapReduce · SQL Optimization
48 min read
Big Data Technology & Architecture
Oct 8, 2021 · Big Data

Hadoop HDFS Storage Optimization, Erasure Coding, Heterogeneous Storage, and Cluster Tuning Guide

This article provides a comprehensive guide to optimizing Hadoop HDFS storage through erasure coding and heterogeneous storage policies, explains fault‑tolerance techniques such as safe mode and slow‑disk monitoring, and shares practical MapReduce performance tuning and enterprise‑level configuration examples for large‑scale clusters.

Cluster Tuning · HDFS · Hadoop
32 min read
Big Data Technology & Architecture
Sep 17, 2021 · Big Data

Key Reliability Mechanisms of HDFS, YARN Failover Strategies, and Hadoop Shuffle Process

This article explains HDFS reliability features such as replica policies, rack awareness, heartbeat, safe mode, checksums, trash, metadata protection and snapshots, then details YARN failover handling for ApplicationMaster, NodeManager and ResourceManager, and finally describes the Hadoop MapReduce shuffle workflow and tuning tips.

HDFS · MapReduce · Reliability
13 min read
ITPUB
Sep 16, 2021 · Big Data

Understanding Hadoop: Architecture, HDFS, MapReduce, and Their Pros & Cons

This article explains how Hadoop revolutionized big data by providing a distributed architecture with HDFS for storage and MapReduce for processing, outlines its ecosystem components, describes the inner workings of HDFS and MapReduce, and discusses the strengths and limitations of this approach.

HDFS · Hadoop · MapReduce
7 min read
Big Data Technology & Architecture
Sep 16, 2021 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer data structure to store serialized key/value pairs and their metadata in memory, describes its initialization, write path, spill handling, and the underlying algorithms that ensure efficient in‑memory sorting and disk spilling.

Hadoop · In-Memory Buffer · MapReduce
24 min read
ITPUB
Sep 13, 2021 · Big Data

MapReduce vs MPP: Choosing the Right Engine for Global Data Warehousing

A team of engineers at MBI debates the merits of MapReduce, MPP, and Hive for their KeepS global data‑warehouse, discussing technical differences, scalability, concurrency, and the feasibility of mixed batch engines while navigating budget and operational constraints.

Cluster Computing · Grid Computing · Hive
20 min read
Big Data Technology & Architecture
Sep 1, 2021 · Big Data

Understanding Hadoop Data Splitting and InputFormat Mechanisms

This article explains Hadoop's data splitting concepts, the distinction between HDFS blocks and logical InputSplits, details the source code of various InputFormats such as TextInputFormat, CombineTextInputFormat, KeyValueTextInputFormat, NLineInputFormat, and custom InputFormats, and provides complete Java examples for Mapper, Reducer, and driver classes.

Data Splitting · Hadoop · InputFormat
24 min read
The Dominant Programmer
Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big Data · CentOS 7 · HBase
11 min read
Tech Musings
Jul 8, 2021 · Big Data

Building a Simple Single-Node MapReduce System: From Theory to Code

This article walks through implementing a lightweight single‑machine MapReduce framework inspired by the original MapReduce paper, covering the abstract Map/Reduce model, task scheduling between master and workers, core Go code for map, reduce, worker, and coordinator, and a brief reflection on its limitations.

Big Data · Distributed Systems · Lab
10 min read
Big Data Technology & Architecture
Apr 15, 2021 · Big Data

Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

Data Warehouse · Hadoop · Hive
41 min read
Big Data Technology & Architecture
Jan 22, 2021 · Big Data

Key New Features and Improvements in Hadoop 3.x

Hadoop 3.x upgrades the platform to JDK 1.8 and introduces a range of enhancements across common components, HDFS, YARN, and MapReduce, including erasure coding, multi‑NameNode high availability, cgroup‑based resource isolation, native map‑output collectors, and split client libraries, while also adding support for Azure and Aliyun distributed file systems.

HDFS · Hadoop · MapReduce
7 min read
Big Data Technology & Architecture
Dec 3, 2020 · Big Data

Hive Query Optimization Techniques and Best Practices

This article presents a comprehensive guide to optimizing Hive queries, covering limit adjustments, join strategies, local mode execution, parallelism, strict mode, mapper and reducer tuning, JVM reuse, dynamic partitioning, speculative execution, data skew handling, and small‑file mitigation techniques.

Hive · MapReduce · SQL Optimization
20 min read
Ctrip Technology
Sep 10, 2020 · Big Data

Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for massive daily TB‑scale data.

Big Data · Camus · Data Governance
15 min read
Big Data Technology & Architecture
Sep 1, 2020 · Big Data

Configuring Hadoop to Support LZO Compression

This guide explains how to enable LZO compression in Hadoop by installing the twitter‑provided hadoop‑lzo library, updating core‑site.xml, synchronizing files across nodes, creating LZO indexes, and running a WordCount MapReduce job with LZO‑compressed output.

Big Data · Configuration · Hadoop
6 min read
Big Data Technology & Architecture
Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big Data · Data Skew · Hive
19 min read
Big Data Technology & Architecture
May 28, 2020 · Big Data

Hadoop System Bottleneck Detection and MapReduce Optimization Guide

This article provides a comprehensive guide on detecting Hadoop system bottlenecks, analyzing resource constraints, and applying practical MapReduce performance tuning techniques—including baseline creation, counter analysis, combiner usage, compression, and proper Writable types—to achieve optimal big‑data processing efficiency.

Big Data · Hadoop · MapReduce
11 min read
Big Data Technology Architecture
Mar 19, 2020 · Big Data

Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations

Data skew in Hive MapReduce jobs, caused by uneven key distribution during joins, group‑by, or COUNT(DISTINCT) operations, can severely slow tasks, and the article explains common scenarios and practical solutions such as using MapJoin, enabling map‑side aggregation, load‑balancing, and rewriting queries to mitigate skew.

Data Skew · Hive · MapJoin
7 min read
Big Data Technology & Architecture
Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · Hadoop
11 min read
Architects Research Society
Feb 3, 2020 · Databases

MapReduce‑Style Parallel Query Processing with Citus

The article explains how Citus enables sharding and MapReduce‑style parallel query execution in PostgreSQL, showing performance gains, example bucket algorithms, and how standard SQL can replace custom MapReduce code for large‑scale data analytics.

Citus · Distributed PostgreSQL · MapReduce
6 min read
DataFunTalk
Oct 25, 2019 · Big Data

Migrating Data from HBase to Kafka Using MapReduce

This article explains how to reverse the typical data flow by extracting massive Rowkeys from HBase with MapReduce, storing them on HDFS, and then using batch Get operations to retrieve the full records and write them into Kafka, while handling retries and monitoring progress.

Big Data · Data Migration · HBase
9 min read
Sohu Tech Products
Oct 9, 2019 · Databases

MongoDB Aggregation Framework: Stages, Pipelines, and Examples

This article provides an in‑depth overview of MongoDB’s aggregation framework, explaining the concepts of pipelines and stages such as $match, $group, $project, $lookup, $unwind, and $out, and includes practical code examples, syntax details, and comparisons to SQL aggregation.

MapReduce · MongoDB · Pipeline
25 min read
Tencent Cloud Developer
Jul 16, 2019 · Big Data

Design and Challenges of Tencent iData Analysis Center Backend: Bitmap Storage and MapReduce Architecture

Tencent’s iData Analysis Center rebuilt its backend as TGMars, replacing a rigid row‑oriented bitmap store and single‑node MapReduce pipeline with a more extensible architecture that shards user behavior bitmaps, eliminates shuffle overhead, and adds columnar storage, iterative processing and SQL‑like capabilities using Spark to overcome scalability and flexibility limitations.

MapReduce · OLAP · bitmap storage
10 min read
Big Data Technology & Architecture
Jun 24, 2019 · Big Data

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, fine‑tuning join operations, and optimizing MapReduce parameters such as mapper/reducer counts, file merging, compression, JVM reuse, parallel execution, strict mode, and storage formats.

Big Data · Hive · JOIN optimization
19 min read
DataFunTalk
Jun 17, 2019 · Big Data

Understanding Hadoop’s Core Competitiveness in the Trillion‑Scale Data Era

This article explores Hadoop’s role in the big‑data era, detailing its architecture, core components such as HDFS, YARN, MapReduce, Ozone and Submarine, the challenges of trillion‑scale data, and why its scalability, cost efficiency, and mature ecosystem give it a competitive edge.

Data Lake · Distributed Systems · Hadoop
11 min read
Big Data Technology & Architecture
Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop
15 min read
dbaplus Community
Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC tuning
12 min read
Architects Research Society
Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache · Big Data · Flink
22 min read
21CTO
Dec 14, 2018 · Artificial Intelligence

Inside Jeff Dean and Sanjay Ghemawat’s Epic Journey: From Index Crashes to AI Powerhouses

The article chronicles Jeff Dean and Sanjay Ghemawat’s partnership at Google, from the 2000 index failure that threatened the company, through their pioneering work on MapReduce and large‑scale infrastructure, to the creation of TensorFlow and the rise of Google AI, highlighting their unique collaborative style and lasting impact on modern computing.

Google · Jeff Dean · MapReduce
29 min read
21CTO
Sep 21, 2018 · Big Data

Master Massive Data Processing: Key Techniques from Hash Maps to MapReduce

This comprehensive guide explores essential strategies for handling massive datasets, covering hash-based structures, bucket partitioning, heap and quicksort techniques, trie trees, Bloom filters, external sorting, and MapReduce, and demonstrates how to efficiently solve common interview problems such as top‑K queries and duplicate removal.

Data Structures · Hash · Heap
35 min read
360 Quality & Efficiency
Jun 28, 2018 · Big Data

An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases

This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.

Data Warehouse · Hadoop · Hive
5 min read
ITFLY8 Architecture Home
Jun 2, 2018 · Big Data

How Google’s Low‑Cost PC Cluster Powers Its Massive Search Engine

This article examines Google’s unconventional infrastructure, detailing how millions of inexpensive PC‑level servers, custom power supplies, and proprietary networking support massive scale, and explains the core platforms—Google File System, MapReduce, and BigTable—that enable fast, reliable search and data processing across the globe.

Bigtable · GFS · Google
29 min read
Architects' Tech Alliance
May 14, 2018 · Big Data

Understanding Hadoop MapReduce Architecture and YARN: Components, Workflow, and Optimization

This article explains Hadoop's distributed storage and processing framework, details the MapReduce programming model, describes the classic JobTracker/TaskTracker architecture, outlines the shuffle and combine phases, and introduces YARN as a scalable replacement with its ResourceManager, ApplicationMaster, and NodeManager components.

Big Data · Hadoop · MapReduce
13 min read
ITPUB
Mar 29, 2018 · Big Data

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

Big Data · Hadoop · MapReduce
15 min read
21CTO
Sep 5, 2017 · Big Data

Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide

This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.

Big Data · Gnuplot · Hadoop
10 min read
Architecture Digest
Aug 15, 2017 · Artificial Intelligence

Why AI Engineers Must Understand Basic Infrastructure: From Big Data to Deep Learning

The article explains why AI engineers need foundational infrastructure knowledge—covering big‑data processing, cloud services, containerization, MapReduce, and deep‑learning platforms—to effectively solve real‑world problems, collaborate with teams, and build scalable, maintainable AI solutions.

AI Infrastructure · Big Data · MapReduce
14 min read
dbaplus Community
Jun 7, 2017 · Big Data

Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This comprehensive guide walks you through MapReduce fundamentals, the complete execution flow, and seven hands‑on Hadoop projects—including WordCount, custom serialization, custom partitioning, grouping comparators, file merging, multiple outputs, join operations, and friend‑graph analysis—while providing environment setup steps, Maven commands, and Hadoop CLI examples.

CLI · Hadoop · Java
28 min read
MaGe Linux Operations
May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · Hive · MapReduce
9 min read
Qunar Tech Salon
May 5, 2017 · Backend Development

WeChat MQ 2.0: Enhanced Asynchronous Queue Design and Optimizations

The article introduces WeChat's self‑developed MQ 2.0 asynchronous queue, detailing its architecture, cross‑machine consumption model, improved task scheduling, efficient processing frameworks—including a MapReduce‑style engine and streaming tasks—and robust overload protection mechanisms that together boost reliability and performance for large‑scale backend services.

Distributed Systems · MapReduce · Message Queue
12 min read
MaGe Linux Operations
May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · Hive
13 min read
Java High-Performance Architecture
Apr 4, 2017 · Big Data

Master MapReduce: Principles, Process, and 7 Hands‑On Examples

This tutorial quickly introduces the MapReduce model, explains its core principles and execution flow, and guides you through seven practical examples—from basic WordCount to custom serialization, partitioning, joins, and friend‑recommendation—while providing test data and an optional ready‑made Hadoop environment for hands‑on practice.

Hadoop · MapReduce · Tutorial
3 min read
ITPUB
Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · MapReduce · RDD
25 min read
Huawei Cloud Developer Alliance
Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · HDFS · Hadoop
18 min read
Java High-Performance Architecture
Oct 21, 2016 · Big Data

What Is Hive and How Does It Turn SQL into MapReduce?

This article explains Hive as a SQL‑based interface for Hadoop, shows why it simplifies large‑scale data analysis, provides practical command‑line examples for table creation, data loading, and queries, and details how HiveQL is internally converted into MapReduce jobs.

Data Warehouse · Hive · MapReduce
6 min read