Tagged articles

MapReduce

125 articles · Page 2 of 2

Oct 21, 2016 · Big Data

What Is Hive and How Does It Turn SQL into MapReduce?

This article explains Hive as a SQL‑based interface for Hadoop, shows why it simplifies large‑scale data analysis, provides practical command‑line examples for table creation, data loading, and queries, and details how HiveQL is internally converted into MapReduce jobs.

Data Warehouse · Hive · MapReduce

0 likes · 6 min read

Java High-Performance Architecture

Sep 24, 2016 · Big Data

Step-by-Step Guide to Building a Hadoop 2.7.3 Cluster on Three Servers

This tutorial walks you through preparing three Linux servers, configuring password‑less SSH, installing Hadoop 2.7.3, editing core XML files, distributing the installation, starting the services, and verifying HDFS and MapReduce functionality with practical commands and screenshots.

Big Data · Cluster Setup · HDFS

0 likes · 10 min read

Ctrip Technology

Aug 26, 2016 · Big Data

Exploring OLAP Engine with Apache Kylin: Architecture, Theory, and Practical Applications in Flight Ticket Big Data

This article presents a comprehensive overview of the Qdata session on OLAP engine exploration, detailing the limitations of traditional MySQL‑based solutions, the requirements for large‑scale analytics, the architecture and theoretical foundations of Apache Kylin, its cube construction process, storage in HBase, query rewriting, real‑world flight‑ticket data applications, and the encountered challenges with corresponding optimization practices.

Apache Kylin · Cube · Data Warehouse

0 likes · 7 min read

MaGe Linux Operations

Aug 11, 2016 · Big Data

Essential MapReduce, HBase, and Spark Configuration Parameters for Faster, More Stable Jobs

This article compiles the most frequently used configuration parameters for MapReduce, HBase, and Spark, explaining their purposes and recommended settings to improve job performance, reliability, and resource utilization in big‑data environments.

Big Data · Configuration · HBase

0 likes · 8 min read

MaGe Linux Operations

Aug 4, 2016 · Big Data

How Hadoop 2.0 Collects and Manages Job Logs with YARN

This article explains Hadoop 2.0's built‑in MRv2 log collection mechanism, detailing job‑run and task‑run logs, their generation steps, log aggregation, and the role of the JobHistory Server for centralized analysis.

Big Data · Hadoop · JobHistory

0 likes · 8 min read

Architecture Digest

Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article traces Hadoop’s origins to Google’s pioneering papers, explains its architecture and ecosystem, evaluates strengths such as scalability and benchmark performance, discusses current limitations like single points of failure and complex programming, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data · Distributed Computing · Future

0 likes · 14 min read

Hulu Beijing

May 31, 2016 · Big Data

What’s New in Hadoop 3.0? Key Features and Improvements Explained

Hadoop 3.0, built on JDK 1.8, adds erasure‑coded HDFS, multi‑NameNode support, native MapReduce task optimizations, cgroup‑based YARN memory and disk isolation, and container resizing, with an alpha slated for summer and a GA release expected in November or December.

Big Data · HDFS · Hadoop

0 likes · 5 min read

dbaplus Community

May 25, 2016 · Databases

How Parallel Execution Supercharges SQL Server Queries—and the Pitfalls to Avoid

This article explains the theory behind SQL Server's parallel execution, illustrates its performance gains with Amdahl's Law, lists operators that block parallelism, discusses configuration settings, warns of deadlocks and thread starvation, and presents practical MapReduce‑style optimizations for real‑world workloads.

Amdahl's Law · Deadlock · MapReduce

0 likes · 16 min read

Architect

May 11, 2016 · Big Data

Comprehensive Guide to Hadoop MapReduce Job Execution, Scheduling, and Optimization

This article provides an in‑depth explanation of Hadoop MapReduce architecture, covering the roles of JobClient, JobTracker, TaskTracker and HDFS, the complete job lifecycle from submission to completion, scheduling strategies, shuffle and sort mechanisms, fault tolerance, and performance tuning techniques.

Big Data · Hadoop · JobTracker

0 likes · 20 min read

21CTO

Apr 20, 2016 · Fundamentals

Why Algorithms Matter More Than Learning Every New Programming Language

The article argues that, despite the hype around ever‑changing programming languages, mastering core algorithms and computer science theory remains essential for building efficient, scalable solutions across fields—from search engines and parallel computing to scientific research—because algorithms are the enduring foundation of technology.

Data Structures · MapReduce · computer science fundamentals

0 likes · 11 min read

21CTO

Mar 31, 2016 · Big Data

Why Hadoop Isn’t the Silver Bullet for Big Data: Insights from Facebook

The article examines common misconceptions about Hadoop, compares it with relational databases, and shares Facebook's data‑analysis practices, highlighting when Hadoop is appropriate and the broader considerations of using open‑source big‑data frameworks.

Hadoop · MapReduce · Relational Databases

0 likes · 8 min read

ITPUB

Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing, and considers its future outlook.

Big Data · Distributed Computing · Doug Cutting

0 likes · 15 min read

21CTO

Feb 14, 2016 · Big Data

How PageRank Works: From Random Surfer Theory to MapReduce Implementation

This article explains the fundamental principles of Google's PageRank algorithm, modeling web pages as a directed graph and a random surfer, discusses matrix formulation, convergence issues like dangling nodes and traps, and demonstrates a practical MapReduce implementation with Python code for large‑scale rank computation.

Big Data · MapReduce · PageRank

0 likes · 15 min read

Java High-Performance Architecture

Jan 24, 2016 · Big Data

MapReduce Explained: From Library Book Counting to Word Count in Big Data

This article introduces the MapReduce parallel processing model, illustrates its core map and reduce operations with a library‑shelf analogy and a classic word‑count example, and walks through each processing stage using clear diagrams to show how massive data is aggregated efficiently.

Big Data · Hadoop · MapReduce

0 likes · 5 min read

21CTO

Jan 9, 2016 · Backend Development

What Jeff Dean Really Built: From MapReduce to Spanner

This article debunks humorous "facts" about Jeff Dean while highlighting his real contributions to Google’s infrastructure—such as MapReduce, the Google File System, BigTable, and Spanner—and explains how his work shaped modern backend development and big data processing.

Bigtable · Jeff Dean · MapReduce

0 likes · 13 min read

21CTO

Dec 30, 2015 · Big Data

Mastering Massive Data: MapReduce, Hadoop, and Taobao’s Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their roles in large‑scale data processing, and then examines Taobao’s massive‑data product architecture—including its data source, compute, storage, query, and product layers, as well as the MyFOX, Prom, and Glider components and caching strategies.

Data Architecture · Hadoop · MapReduce

0 likes · 16 min read

Qunar Tech Salon

Dec 13, 2015 · Big Data

Introduction to Distributed Computing: Sharding, Message Queues, Hadoop and MapReduce

This article explains the fundamentals of distributed computing, covering sharding algorithms, message‑queue based task distribution, an overview of Hadoop and its MapReduce model, and the characteristics of offline batch processing for large‑scale data workloads.

Distributed Computing · Hadoop · MapReduce

0 likes · 11 min read

21CTO

Dec 9, 2015 · Big Data

Mastering Hadoop: From MapReduce Basics to Taobao’s Massive Data Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their components such as HDFS, MapReduce, and HBase, and then examines Taobao’s large‑scale data product architecture—including storage, computation, query, and caching layers—to illustrate practical big‑data processing techniques.

Data Architecture · Hadoop · MapReduce

0 likes · 17 min read

21CTO

Nov 26, 2015 · Big Data

Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem

This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.

Data Architecture · Distributed Computing · Hadoop

0 likes · 7 min read

21CTO

Sep 19, 2015 · Artificial Intelligence

Why Distributed Machine Learning Needs More Data Than Speed

The article explains how distributed machine learning evolved from parallel computing to handle massive, long‑tail data sets, discusses the importance of scalability, fault recovery, and data‑parallel algorithms, and reviews frameworks such as MPI, MapReduce, and Pregel for building large‑scale AI systems.

Big Data · LDA · MPI

0 likes · 24 min read

MaGe Linux Operations

Aug 20, 2015 · Big Data

15 Must‑Try Resources to Master Hadoop Quickly

This article explains what Hadoop is, outlines its key features, and presents a curated list of 15 high‑quality tutorials, video courses, and books to help beginners and professionals efficiently learn Hadoop and its MapReduce ecosystem.

Data Engineering · Hadoop · Learning Resources

0 likes · 12 min read

21CTO

Aug 11, 2015 · Big Data

Understanding MapReduce Through a Pizza Sauce Analogy

The author recounts delivering a MapReduce talk, then uses a vivid pizza sauce preparation story to illustrate how mapping chops ingredients and reducing blends them, effectively explaining distributed data processing concepts to a non‑technical audience.

Analogy · Distributed Computing · MapReduce

0 likes · 7 min read

Efficient Ops

Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Distributed Computing

0 likes · 21 min read

Qunar Tech Salon

Dec 21, 2014 · Big Data

How to Prevent Java Heap Space Errors in Hadoop MapReduce by Managing Task Memory and Slots

This article outlines five essential steps to avoid Java heap space errors in Hadoop MapReduce by estimating memory consumption, verifying JVM availability and settings, limiting swap usage, and configuring instance slot numbers below the JobTracker's calculated values, ensuring stable cluster performance.

Hadoop · Java Heap · MapReduce

0 likes · 11 min read

MaGe Linux Operations

Nov 5, 2014 · Big Data

Quickly Get Hadoop 2.0 Up and Running: A Minimal Configuration Guide

This article walks through the essential steps to install and configure Hadoop 2.0 on a two‑node Linux cluster, covering version selection, directory setup, core XML files, YARN settings, service startup, verification commands, and basic troubleshooting tips.

Big Data · Cluster Setup · HDFS

0 likes · 9 min read
