
Introduction to Distributed Computing: Sharding, Message Queues, Hadoop and MapReduce

This article explains the fundamentals of distributed computing, covering sharding algorithms, message‑queue based task distribution, an overview of Hadoop and its MapReduce model, and the characteristics of offline batch processing for large‑scale data workloads.

Qunar Tech Salon

Distributed computing simply means breaking a large computational task into many smaller tasks that run on several machines and then aggregating the results, enabling analysis of massive data such as radar signals, e‑commerce transaction patterns, and more.

Historically, increasing single‑machine performance was the first solution, but explosive data growth outpaced hardware improvements, leading to the compromise of distributed computing, which introduces challenges like consistency, data integrity, communication, fault tolerance, and scheduling.

For example, a product manager may request analysis of 100 GB of purchase data to derive regional spending habits. A single‑machine program might finish in ten hours, but the same requirement could be accelerated to three hours by distributing the workload.

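One low‑cost approach is to split the data horizontally across five machines, each processing roughly one‑fifth of the data. Under ideal linear scaling, a ten‑hour job becomes about two hours per machine, which leaves some headroom within the three‑hour target.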

Using a simple hash‑modulo sharding algorithm based on user ID, where f maps a member ID to an integer and the remainder selects the server, the data can be partitioned as follows:

f(memberid) % 5 = ServerN

Each machine runs the same program but processes only the records whose ID matches its remainder, eliminating inter‑machine communication and simplifying failure handling: a crashed node can resume from its last checkpoint just like a single‑machine job.
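The partitioning rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the record layout (a dict with a `member_id` key), and the choice of CRC32 as the hash f are not from the original article.

```python
import zlib

NUM_SERVERS = 5  # assumed cluster size, matching the five machines above


def shard_for(member_id: str, num_servers: int = NUM_SERVERS) -> int:
    """Return the index of the server responsible for this member ID."""
    # zlib.crc32 gives the same value on every machine and every run,
    # unlike Python's built-in hash() for strings, which is randomized
    # per interpreter process by default.
    return zlib.crc32(member_id.encode("utf-8")) % num_servers


def records_for_server(records, server_index, num_servers=NUM_SERVERS):
    """Yield only the records whose shard matches this server's index."""
    for record in records:
        if shard_for(record["member_id"], num_servers) == server_index:
            yield record
```

Each machine would run the same script with a different `server_index`, so every record is processed by exactly one node. A stable hash matters here: if two machines computed different values for the same ID, records could be skipped or processed twice.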

However, pure sharding lacks load balancing and requires manual configuration changes on each node. Introducing a message queue (e.g., RabbitMQ) solves these issues by adding a Master that pushes user records to the queue and Workers that consume them.

Master: pushes messages containing user data.

Message queue: RabbitMQ (or similar).

Workers (or Slaves): consume messages and perform the computation.

This architecture provides task distribution, program consistency (all workers run identical code), easy horizontal scaling (add more workers), and fault tolerance (RabbitMQ’s acknowledgment mechanism can re‑queue unprocessed messages if a worker crashes).
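As a rough illustration of the Master/queue/Worker split, the sketch below uses Python's standard `queue.Queue` and threads as an in-process stand-in for RabbitMQ. It does not use a real broker or any RabbitMQ client API, and the record fields (`region`, `amount`) and function names are assumptions made for the example.

```python
import queue
import threading
from collections import defaultdict

STOP = None  # sentinel message telling a worker to exit


def run(records, num_workers=3):
    """Distribute records to workers through a queue and sum spend per region."""
    task_queue = queue.Queue()    # stand-in for the message queue
    totals = defaultdict(float)   # region -> total spend
    lock = threading.Lock()

    def worker():
        # Every worker runs identical code and pulls whatever message is next,
        # so faster workers naturally take on more of the load.
        while True:
            message = task_queue.get()
            try:
                if message is STOP:
                    return
                with lock:
                    totals[message["region"]] += message["amount"]
            finally:
                # Loosely analogous to acknowledging a message to the broker.
                task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Master: push one message per record, then one stop sentinel per worker.
    for record in records:
        task_queue.put(record)
    for _ in range(num_workers):
        task_queue.put(STOP)

    for t in threads:
        t.join()
    return dict(totals)
```

Scaling out in this model means raising `num_workers`; no per-node shard configuration is needed. With a real broker, a message that is never acknowledged because its worker crashed would be redelivered, which this in-process sketch does not attempt to reproduce.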

Hadoop is a widely used platform for massive data storage and computation. Its core components are:

MapReduce for distributed processing.

HDFS for storage, with higher‑level services such as HBase and Hive built on top.

A platform that allows multiple users to submit jobs via a defined API, handling scheduling and resource allocation.

A diagram in the original source (not reproduced here) compares the custom “little monk” design with Hadoop’s architecture, mapping concepts like “big data task”, “task splitting”, “sub‑tasks”, and “result aggregation” to Hadoop’s JobTracker, TaskTracker, and HDFS workflow.

MapReduce standardizes the programming model into a Map phase and a Reduce phase. Data is typically imported from relational databases into HDFS, processed via MapReduce jobs, and then exported back to databases. The workflow includes:

Import 100 GB of data into HDFS.

Write Map and Reduce logic according to the MapReduce API.

Package the program and submit it to the MapReduce platform.

The JobTracker (analogous to the Master) distributes tasks.

TaskTrackers (analogous to Workers) run on each compute node, loading the job’s packaged program (typically a JAR).

TaskTrackers execute the Map phase, writing intermediate results to the local disk of the node rather than to HDFS.

If a Reduce phase exists, TaskTrackers fetch the Map outputs over the network (HTTP in Hadoop’s shuffle), then run Reduce.

Reduce writes final results back to HDFS.

Export the results from HDFS to the target database.
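To make the Map and Reduce phases in the workflow above concrete, here is a small single-process model of the programming model applied to the regional spending example. It is plain Python, not the Hadoop API (which is Java), and the function names and record fields are assumptions for illustration. Storage, scheduling, and data transfer between nodes are not modeled.

```python
from collections import defaultdict


def map_purchase(record):
    """Map: emit one (region, amount) pair for a purchase record."""
    yield record["region"], record["amount"]


def shuffle(pairs):
    """Group mapped values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_region(region, amounts):
    """Reduce: collapse all amounts for one region into a single total."""
    return region, sum(amounts)


def run_job(records):
    """Run map, shuffle, and reduce in sequence and return region totals."""
    mapped = (pair for record in records for pair in map_purchase(record))
    grouped = shuffle(mapped)
    return dict(reduce_region(region, amounts) for region, amounts in grouped.items())
```

For example, `run_job([{"region": "north", "amount": 10}, {"region": "north", "amount": 5}])` groups both purchases under `"north"` before reducing them. The developer writes only the map and reduce functions; on a real cluster the grouping step is handled by the platform.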

While MapReduce is powerful for terabyte‑ or petabyte‑scale batch jobs across hundreds or thousands of nodes, it can be overkill for small workloads and has limitations such as lack of real‑time response, reliance on file‑system I/O, and a rigid two‑stage processing model.

These drawbacks motivated the development of real‑time stream processing frameworks like Storm, which will be discussed in the next chapter.

Source: Mushroom Mr. (http://www.cnblogs.com/mushroom/p/4959904.html)

