Big Data 35 min read

Master Massive Data Processing: Key Techniques from Hash Maps to MapReduce

This comprehensive guide explores essential strategies for handling massive datasets, covering hash-based structures, bucket partitioning, heap and quicksort techniques, trie trees, Bloom filters, external sorting, and MapReduce, and demonstrates how to efficiently solve common interview problems such as top‑K queries and duplicate removal.

21CTO

Sep 21, 2018

Master Massive Data Processing: Key Techniques from Hash Maps to MapReduce

Massive Data Processing Overview

Massive data processing involves storing, handling, and operating on data sets that are too large to fit into memory or be processed quickly using naive algorithms.

When data size exceeds memory, solutions must address both time and space constraints, often requiring more complex, scenario‑specific techniques.

Divide‑and‑Conquer / Hash Mapping

For time constraints, clever algorithms combined with suitable data structures (e.g., Bloom filter, hash, bitmap, heap, or inverted index) are used. For space constraints, the data is partitioned ("divide and conquer"), typically via hash modulo mapping, turning a large file into many smaller files.

Single‑Machine vs. Cluster

Single‑machine processing considers CPU, memory, and disk I/O.

Cluster processing distributes data across multiple machines for parallel computation.

Core Techniques

Divide‑and‑conquer + hash mapping + heap/quick/merge sort.

Double‑bucket partitioning.

Bloom filter / bitmap.

External sorting.

MapReduce.

Set/Map vs. Hash_Set/Hash_Map

Both set and map store elements sorted by key and do not allow duplicate keys. In a set, the element itself is the key; in a map, each element has a key and a value.

Hash‑based containers ( hash_set, hash_map) are built on hash tables and do not maintain order.

Hash Functions and Collision Handling

A hash function maps arbitrary‑length input to a fixed‑length output, compressing the input space. Common methods include:

Division method: index = value % 16.

Multiplication (square) method: index = (value * value) >> 28.

Fibonacci (golden ratio) method with constants such as 2654435769 for 32‑bit integers.

Collision handling can use open hashing (chaining) or closed hashing (open addressing).

Heap Basics

A heap is a special binary tree where each node is greater (max‑heap) or smaller (min‑heap) than its children, and the tree is complete. Insertion and deletion have O(log n) complexity.

Use a min‑heap of size k to keep the top k elements while scanning a large data set.

Typical Top‑K Problems and Solutions

Find the IP with the most accesses in a day's log: hash‑map counting + heap sorting.

Find top 10 queries from millions of search strings: hash‑map counting + min‑heap.

Find the 100 largest numbers among 1 million: maintain a min‑heap of size 100.

Double‑Bucket Partitioning (Technique 2)

When the value range is huge, repeatedly partition the data into buckets, narrowing the range until it fits into memory, then apply counting or sorting within the final bucket.

Bloom Filter / Bitmap (Technique 3)

Bloom filters use a bit array and multiple hash functions to test set membership with a controllable false‑positive rate. Bitmaps use one bit per possible value, enabling fast existence checks with minimal memory.

External Sorting (Technique 5)

For data that cannot fit into memory, external merge sort splits the file into sorted runs that fit into memory, writes them to disk, then merges the runs.

MapReduce (Technique 6)

MapReduce divides large tasks into parallel Map operations, then aggregates results with Reduce , effectively performing a distributed merge sort.

Sample Code: BitMap Sorting

void SetBit(char *p, int pos){
    p += pos/8;
    *p |= (0x01 << (pos%8));
}

void BitMapSortDemo(){
    int num[] = {3,5,2,10,6,12,8,14,9};
    const int BufferLen = 2; // enough for max value 14
    char *pBuffer = new char[BufferLen];
    memset(pBuffer, 0, BufferLen);
    for(int i=0;i<9;i++) SetBit(pBuffer, num[i]);
    for(int i=0;i<BufferLen;i++){
        for(int j=0;j<8;j++)
            if((*pBuffer & (0x01<<j)) == (0x01<<j))
                printf("%d ", i*8 + j);
        pBuffer++;
    }
}

Representative Images

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

algorithm Data Structures Hash heap MapReduce

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.