
Efficient Sorting and De‑duplication of Massive Datasets: Key Algorithms

This article explores practical algorithms for sorting, de‑duplicating, and extracting top‑K records from massive data sets, covering bitmap techniques, external sorting, hash‑based counting, min‑heap selection, divide‑and‑conquer, Bloom filters, and distributed processing strategies.

ITFLY8 Architecture Home

Sorting Massive Data Without Duplicates

Use the bitmap method: treat each phone number as an index into a bit array, set the bit for every number seen, then scan the bits in ascending order to emit the sorted result. Because each value maps to a single bit, this works only when the data contains no duplicates.
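A minimal sketch of the idea, assuming the values fit a known range (e.g. phone numbers read as integers); the function name is illustrative:

```python
# Bitmap sort sketch: each possible value gets one bit; setting and then
# scanning the bits yields the values in ascending order without comparisons.
# Assumes the input contains no duplicates (one bit cannot count repeats).

def bitmap_sort(values, max_value):
    bits = bytearray((max_value >> 3) + 1)  # one bit per possible value
    for v in values:
        bits[v >> 3] |= 1 << (v & 7)        # set bit v
    out = []
    for v in range(max_value + 1):
        if bits[v >> 3] & (1 << (v & 7)):   # bit set -> value was present
            out.append(v)
    return out
```

Memory is proportional to the value range, not the input size: 10^8 possible 8-digit numbers need about 12 MB of bits regardless of how many numbers are sorted.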

Sorting Massive Data With Duplicates

Example: sorting phone numbers that may repeat. Proposed solutions include:

External sorting: split data into smaller files, sort each, then merge (high I/O cost).

A more efficient option when the value range is bounded (as with phone numbers): generalize the bitmap to a counter per possible value, so the data is sorted in one counting pass plus one scan of the counter array.
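This counting-sort variant can be sketched as follows (function name illustrative; for a real phone-number range the counter array would be allocated once at the full range size):

```python
from array import array

# Counting-sort sketch for data with duplicates: instead of one bit per
# possible value, keep one counter per value, then emit each value as many
# times as it was seen. Memory scales with the value range, not with N.

def counting_sort(values, max_value):
    counts = array('L', [0] * (max_value + 1))  # one counter per value
    for v in values:
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)                     # repeat v as often as it occurred
    return out
```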

Finding the Top‑K Most Frequent Records in Massive Data

Note: the top‑K frequency problem has two stages: first compute the occurrence count of each record, then select the K records with the largest counts, for which a min‑heap is convenient.

Method 1 (data fits in memory): First, preprocess the data in O(N) time using a hash table to count frequencies; then use a min‑heap of size K to maintain the top‑K, achieving O(N log K) overall.
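A minimal sketch of Method 1 (names illustrative):

```python
import heapq
from collections import Counter

# Top-K sketch: O(N) counting with a hash table, then a size-K min-heap.
# The heap root is the smallest count among the current top K, so each new
# candidate needs only one comparison against the root: O(N log K) overall.

def top_k_frequent(records, k):
    counts = Counter(records)            # O(N) frequency pass
    heap = []                            # min-heap of (count, record)
    for rec, c in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (c, rec))
        elif c > heap[0][0]:             # beats the smallest of the top K
            heapq.heapreplace(heap, (c, rec))
    return sorted(heap, reverse=True)    # most frequent first
```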

Method 2 (insufficient memory for a full hash table): apply divide‑and‑conquer by partitioning the data into 1,000 files by hash(record) % 1000, so that all copies of the same record land in the same file; compute the top‑K within each file, then merge the 1,000 candidate lists.

Method 3: For finding the most frequent words in a large text, use a trie combined with a min‑heap of size K, maintaining an array of the current top‑K words sorted descending.
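A minimal sketch of the trie-plus-heap idea for Method 3. The `COUNT` marker key is an assumption of this sketch (any sentinel distinguishing a terminal count from child letters would do); a plain dict would also work, but the trie bounds lookup cost by word length and shares storage between common prefixes:

```python
import heapq

# Trie sketch for word counting: each node is a dict of child letters, plus
# an occurrence count stored under a sentinel key at the word's final node.
# A traversal of the trie then feeds a size-K selection.

COUNT = "#"  # sentinel key for a node's terminal count (illustrative)

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})   # walk/create the path for `word`
    node[COUNT] = node.get(COUNT, 0) + 1

def trie_walk(node, prefix=""):
    if COUNT in node:
        yield node[COUNT], prefix        # (count, word) pairs
    for ch, child in node.items():
        if ch != COUNT:
            yield from trie_walk(child, prefix + ch)

def top_k_words(words, k):
    root = {}
    for w in words:
        trie_insert(root, w)
    return heapq.nlargest(k, trie_walk(root))
```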

Special Case: Top‑1 (e.g., most frequent IP address)

Algorithm – divide‑and‑conquer + hash:

Hash each IP address by hash(IP) % 1024 and distribute the addresses into 1,024 small files (≈4 MB each); all occurrences of the same IP land in the same file.

For each small file, build a hash map (IP → count) and record the IP with the highest count.

Collect the 1,024 candidate (IP, count) pairs and pick the overall most frequent IP, either with a standard sorting algorithm or a single linear scan of the counts.
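The three steps above can be sketched in memory, with in‑process buckets standing in for the 1,024 small files (bucket count reduced for the example; names illustrative):

```python
from collections import Counter

# Divide-and-conquer sketch for the most frequent IP: partition by hash so
# all copies of an IP land in the same bucket, find each bucket's champion,
# then take the best champion overall.

def most_frequent_ip(ips, num_buckets=4):
    buckets = [[] for _ in range(num_buckets)]
    for ip in ips:
        buckets[hash(ip) % num_buckets].append(ip)   # same IP -> same bucket
    candidates = []
    for bucket in buckets:
        if bucket:
            ip, count = Counter(bucket).most_common(1)[0]
            candidates.append((count, ip))           # per-bucket champion
    return max(candidates)                           # overall (count, ip) winner
```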

Sorting Massive Data Across Multiple Files

Example: 10 files, each 1 GB, containing user queries that may repeat; sort queries by frequency.

Method 1:

Read the 10 files sequentially, hash each query by hash(query) % 10, and write it to one of 10 new files (≈1 GB each); identical queries land in the same file.

On a machine with ~2 GB of RAM, build a hash map (query → count) for each file, then sort the entries by count using quicksort, heapsort, or merge sort, writing each file's (query, count) pairs out in sorted order.

Merge the 10 sorted files using a combination of internal and external merge sort (the sorting step avoids loading all files into memory at once).
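The final merge can be sketched with `heapq.merge`, which keeps only one current record per input stream in memory, so the sorted runs are never loaded whole; here lists of (count, query) pairs stand in for the sorted files:

```python
import heapq

# K-way merge sketch: each run is already sorted by count descending, and
# heapq.merge(reverse=True) streams them into one descending sequence while
# holding just one head element per run.

def merge_by_count(runs):
    return list(heapq.merge(*runs, reverse=True))
```

In a real external merge, each `run` would be a generator reading one (count, query) line of its file at a time, and the output would be streamed to disk rather than collected into a list.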

Method 2: After hashing and partitioning, process the files in a distributed manner (e.g., MapReduce) and merge the final results.

Finding Common Records Between Two Large Files

Example: two files each containing 5 billion URLs (64 bytes per URL, so ≈320 GB per file) with a 4 GB memory limit.

Method 1:

Scan file A, hash each URL by hash(url) % 1000, and distribute URLs into 1,000 small files (≈300 MB each).

Do the same for file B, creating corresponding 1,000 files.

Only compare matching pairs of small files (a0 vs b0, a1 vs b1, …) to find common URLs.
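The key property is that both files use the same hash, so a common URL must fall into the same‑numbered pair of buckets. A minimal in‑memory sketch, with buckets standing in for the 1,000 small files (bucket count reduced; names illustrative):

```python
# Hash-partition sketch for the URL intersection: partition both inputs with
# the same hash, then intersect only the matching bucket pairs (a_i with b_i).

def common_urls(file_a, file_b, num_buckets=4):
    a_buckets = [set() for _ in range(num_buckets)]
    b_buckets = [set() for _ in range(num_buckets)]
    for url in file_a:
        a_buckets[hash(url) % num_buckets].add(url)
    for url in file_b:
        b_buckets[hash(url) % num_buckets].add(url)
    common = set()
    for a, b in zip(a_buckets, b_buckets):   # only same-numbered pairs compared
        common |= a & b
    return common
```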

Method 2: If a small error rate is acceptable, use a Bloom filter.
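A minimal Bloom‑filter sketch (sizes and hash count are illustrative, not tuned for 5 billion URLs): load one file's URLs into the filter, then test the other file's URLs against it, accepting that some matches may be false positives:

```python
import hashlib

# Minimal Bloom filter sketch: k hash functions set k bits per item. Lookups
# may report false positives (bits set by other items) but never false
# negatives, which is the trade-off the method accepts.

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits >> 3)

    def _positions(self, item):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))
```

For the 4 GB budget in the example, roughly 34 billion bits are available, about 6.8 bits per URL, which keeps the false-positive rate low with a handful of hash functions.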

Finding Non‑Duplicate Records in Massive Data

Example: identify unique integers among 250 million numbers when memory is insufficient.

Method 1 (2‑bit bitmap): Allocate 2 bits per possible integer (total 1 GB for 2³² values). Scan the numbers, updating bits: 00→01 (first occurrence), 01→10 (second occurrence), 10 stays unchanged. After scanning, output numbers whose bits are 01.
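The 2‑bit state machine can be sketched as follows (function name illustrative; a small `max_value` stands in for the full 2^32 range):

```python
# 2-bit bitmap sketch: per value, states 00 (unseen) -> 01 (seen once) ->
# 10 (seen more than once, saturating). Values left in state 01 after the
# scan occurred exactly once.

def unique_numbers(values, max_value):
    bits = bytearray((max_value >> 2) + 1)   # 2 bits per value, 4 values/byte
    for v in values:
        shift = (v & 3) * 2
        state = (bits[v >> 2] >> shift) & 3
        if state < 2:                        # 00 -> 01, 01 -> 10, 10 stays
            bits[v >> 2] = (bits[v >> 2] & ~(3 << shift)) | ((state + 1) << shift)
    return [v for v in range(max_value + 1)
            if (bits[v >> 2] >> ((v & 3) * 2)) & 3 == 1]
```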

Method 2: partition the data into small files by hash(value), so that all copies of a number land in the same file; find the numbers that occur exactly once within each file (e.g., by sorting each file, so nothing need be loaded all at once), then concatenate the per‑file results.

Searching in Massive Data

Example: given 4 billion unsorted unsigned integers, determine quickly whether a given number is present.

Method 1: Allocate 512 MB of memory, using one bit per possible integer. Set bits for all numbers; query by checking the corresponding bit.
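A minimal sketch of the bit‑set (2^32 bits is exactly 512 MB; a smaller universe is used here so the example stays light; the class name is illustrative):

```python
# 1-bit bitmap sketch: one bit per possible integer turns membership queries
# into a single bit test after one pass to set the bits.

class BitSet:
    def __init__(self, universe_size):
        self.bits = bytearray((universe_size >> 3) + 1)

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)        # set bit n

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))
```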

Method 2 (Programming Pearls approach): Apply a binary‑search‑style divide‑and‑conquer on files:

Split the numbers by the most significant bit into two files: values with the top bit 0 (i.e., < 2^31) and values with the top bit 1 (i.e., ≥ 2^31).

Compare the query's most significant bit and continue with the matching file.

Repeat the process for the next bit, halving the search space each time.

Continue until the number is located or the remaining file is empty; the number of splitting rounds is logarithmic in the value range (at most 32, one per bit), though each round still scans its remaining file.
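The splitting process can be sketched in memory, with lists standing in for the intermediate files (function name illustrative):

```python
# File-splitting search sketch: at each round, partition the remaining numbers
# on one bit (most significant first) and keep only the half whose bit matches
# the query's bit. Each round roughly halves the candidates.

def present(numbers, target, num_bits=32):
    candidates = numbers
    for bit in range(num_bits - 1, -1, -1):
        want = (target >> bit) & 1
        candidates = [n for n in candidates if (n >> bit) & 1 == want]
        if not candidates:          # no number shares the target's prefix
            return False
    return True                     # all bits matched -> target is present
```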

References

Jon Bentley, Programming Pearls

http://blog.csdn.net/v_JULY_v

Source: https://www.cnblogs.com/xawei/p/6726620.html

