
Techniques for Processing Massive Data: Sorting, Querying, Top‑K, and Deduplication

This article explains core concepts and practical solutions for handling massive datasets that cannot fit into memory, covering batch processing, distributed sorting, bitmap indexing, hash‑based lookups, top‑K extraction, and deduplication techniques with code examples and multi‑machine strategies.

政采云技术

What Is Massive Data?

Massive data refers to data volumes so large that they cannot be loaded into memory at once or processed quickly on a single machine.

Small‑Data Processing Solutions

When data fits into memory, many classic techniques can be used:

Sorting: quicksort, merge sort, bucket sort, etc.

Querying: linear scan, binary search, hash tables, red‑black trees, bitmaps.

TOP‑K for static data: sort then scan, or use quick‑select to find a partition point.

TOP‑K for dynamic data: maintain a heap of size K (a min‑heap to track the largest K elements, a max‑heap for the smallest K).

Deduplication / duplicate detection: build a hash table for one set and probe with the other, or sort both sets and use a two‑pointer technique.
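The quick‑select approach for static top‑K mentioned above can be sketched as follows. This is an illustrative implementation (not from the original article) using a Lomuto-style partition in descending order, so the K largest elements end up in the array's prefix without a full sort:

```java
public class QuickSelect {
  // Partition arr so the K largest elements occupy arr[0..k-1]
  // (in no particular order), in expected O(n) time.
  static void topK(int[] arr, int k) {
    int lo = 0, hi = arr.length - 1;
    while (lo < hi) {
      int p = partition(arr, lo, hi);   // elements >= pivot end up to the left
      if (p == k - 1) return;
      if (p < k - 1) lo = p + 1; else hi = p - 1;
    }
  }

  // Lomuto partition in descending order; returns the pivot's final index.
  private static int partition(int[] arr, int lo, int hi) {
    int pivot = arr[hi], i = lo;
    for (int j = lo; j < hi; j++) {
      if (arr[j] >= pivot) { int t = arr[i]; arr[i] = arr[j]; arr[j] = t; i++; }
    }
    int t = arr[i]; arr[i] = arr[hi]; arr[hi] = t;
    return i;
  }
}
```

Unlike sort-then-scan (O(n log n)), quick-select stops as soon as the partition point lands at index K‑1.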

Problems Faced by Massive Data Processing

Single‑machine memory is insufficient.

Single‑machine processing speed is too slow.

Core Ideas for Massive Data Processing

Do not load all data into memory; process it in batches.

If one machine cannot hold all data, shard the data across multiple machines and process in parallel.

Accelerate slow single‑machine processing by parallelizing the work across many machines.

Common Cases and Corresponding Solutions

Sorting Problem

Case: Sort a 10 GB order file by total amount

If the file fits into memory, sort directly. Otherwise, with 1.5 GB memory, read 1 GB chunks, sort each chunk in memory, write the sorted chunk back to disk, and repeat until ten sorted 1 GB files are produced.

Next, merge the ten sorted files using a multi‑way merge: keep the smallest element from each file in a size‑10 array, repeatedly extract the global minimum, write it to the output file, and replace it with the next element from the same source file.
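The multi‑way merge above can be sketched with a priority queue holding one cursor per sorted source. This is a minimal in‑memory sketch over int arrays; a real implementation would stream records from the ten sorted files instead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
  // Merge k sorted arrays; each heap entry is {value, sourceIndex, offsetInSource}.
  static List<Integer> merge(List<int[]> sources) {
    PriorityQueue<int[]> pq = new PriorityQueue<>((x, y) -> x[0] - y[0]);
    for (int s = 0; s < sources.size(); s++) {
      if (sources.get(s).length > 0) pq.offer(new int[]{sources.get(s)[0], s, 0});
    }
    List<Integer> out = new ArrayList<>();
    while (!pq.isEmpty()) {
      int[] min = pq.poll();            // extract the global minimum
      out.add(min[0]);
      int next = min[2] + 1;            // advance that source's cursor
      int[] src = sources.get(min[1]);
      if (next < src.length) pq.offer(new int[]{src[next], min[1], next});
    }
    return out;
  }
}
```

With ten sources the heap replaces the size-10 array described above, and each extract/replace costs O(log 10).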

Optimisations include adding a 100 MB read cache for each chunk, using a write cache, and scaling the approach to multiple machines (e.g., ten machines each handling 1 GB).

When the key range is known and uniformly distributed (e.g., order amount 0‑999), bucket sort can be applied: stream the file, place each record into the appropriate bucket, sort each bucket individually, then concatenate the buckets to obtain a fully sorted file.
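The bucketing idea can be sketched for integer amounts in 0–999 with ten buckets of width 100. This in-memory sketch is illustrative; the real pass would stream each record to a per‑bucket file on disk:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BucketSort {
  // Sort amounts in the known range [0, 1000) using 10 buckets of width 100.
  static List<Integer> sort(List<Integer> amounts) {
    List<List<Integer>> buckets = new ArrayList<>();
    for (int i = 0; i < 10; i++) buckets.add(new ArrayList<>());
    for (int a : amounts) buckets.get(a / 100).add(a);   // place into its bucket
    List<Integer> out = new ArrayList<>();
    for (List<Integer> b : buckets) {
      Collections.sort(b);   // sort each bucket individually
      out.addAll(b);         // concatenate buckets in order
    }
    return out;
  }
}
```

Because every element of bucket i is smaller than every element of bucket i+1, concatenating the sorted buckets yields a fully sorted result with no merge step.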

Query Problem

Case: Determine whether a login IP is in a whitelist of 1 billion IPs

On a single machine with 1 GB memory, a bitmap is the most space‑efficient structure. A 32‑bit IP address space requires 2³² bits = 512 MB, which fits into memory.

Construction: scan the whitelist file, set the bit corresponding to each IP address. Query: compute the index of the login IP in the bitmap and check whether the bit is set.
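A minimal bitmap sketch backed by a long[]. The capacity parameter is an illustrative knob: pass 1L << 32 to cover the full IPv4 space (512 MB of longs); here an IP is assumed to already be in its unsigned 32‑bit integer form:

```java
public class IpBitmap {
  private final long[] bits;

  // capacityBits = 1L << 32 covers the full IPv4 space (~512 MB).
  IpBitmap(long capacityBits) {
    bits = new long[(int) (capacityBits >>> 6)];   // 64 bits per long
  }

  // Set the bit for this IP (construction phase: one call per whitelist entry).
  void add(long ip) {
    bits[(int) (ip >>> 6)] |= 1L << (ip & 63);
  }

  // Query phase: check whether the IP's bit is set.
  boolean contains(long ip) {
    return (bits[(int) (ip >>> 6)] & (1L << (ip & 63))) != 0;
  }
}
```

Both operations are O(1): the high bits of the IP pick the long, the low 6 bits pick the bit within it.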

For a multi‑machine solution, distribute the bitmap across four machines using a hash (e.g., MD5 % 4) to assign each IP to a specific node. Queries are routed to the appropriate node after applying the same hash.

TOP‑K Problem

Case: Find the top 100 keywords in a 100 GB search‑keyword file

If the data fits in memory, either sort the keywords and scan, or use a hash table to count frequencies and a min‑heap of size 100 to keep the current top‑K.

public String[] top100(String[] keywords) {
  if (keywords == null) return null;
  // hash table for frequency
  Map<String, Integer> map = new HashMap<>();
  for (String keyword : keywords) {
    map.put(keyword, map.getOrDefault(keyword, 0) + 1);
  }
  // min‑heap of size 100
  PriorityQueue<String> heap = new PriorityQueue<>(100, (o1, o2) -> map.get(o1) - map.get(o2));
  for (Map.Entry<String, Integer> entry : map.entrySet()) {
    String keyword = entry.getKey();
    int frequency = entry.getValue();
    if (heap.size() < 100) {
      heap.offer(keyword);
    } else if (frequency > map.get(heap.peek())) {
      heap.poll();
      heap.offer(keyword);
    }
  }
  String[] res = new String[heap.size()]; // at most 100; avoids trailing nulls
  int i = 0;
  while (!heap.isEmpty()) {
    res[i++] = heap.poll();
  }
  return res;
}

For 100 GB that cannot be loaded, first compute frequencies using external sorting or a distributed hash‑based approach (e.g., shard the file into 10 parts by MD5 % 10, count frequencies in each shard, then merge the partial results). After obtaining a frequency file, apply the same min‑heap technique on the aggregated frequencies.
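The shard-assignment step might look like the following. The `shardOf` helper is illustrative: it computes MD5 via `java.security.MessageDigest` and interprets the digest as a non‑negative integer before taking the modulo, so identical keywords always land on the same shard:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sharder {
  // Map a keyword to one of nShards buckets via MD5. Equal keywords always
  // hash to the same shard, so per-shard counts can be computed independently.
  static int shardOf(String keyword, int nShards) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(keyword.getBytes(StandardCharsets.UTF_8));
      // Interpret the 16-byte digest as a non-negative integer, then reduce.
      return new BigInteger(1, digest).mod(BigInteger.valueOf(nShards)).intValue();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);   // MD5 is guaranteed on the JVM
    }
  }
}
```

Because all occurrences of a keyword fall into one shard, each shard's frequency counts are complete on their own, and the partial top‑K candidates can simply be merged at the end.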

Deduplication / Duplicate‑Finding Problem

Case: Find common URLs in two 5‑billion‑line files (each line 64 bytes) with only 4 GB memory per machine

When memory is sufficient, two approaches are common:

Sort both files and use a two‑pointer scan to emit matches.

Load one file into a hash table and probe the other.

public List<Integer> findDuplicate(int[] arr1, int[] arr2) {
  if (arr1 == null || arr2 == null) return null;
  Arrays.sort(arr1);
  Arrays.sort(arr2);
  List<Integer> res = new ArrayList<>();
  int a = 0, b = 0;
  while (a < arr1.length && b < arr2.length) {
    if (a < arr1.length - 1 && arr1[a] == arr1[a + 1]) { a++; continue; }
    if (b < arr2.length - 1 && arr2[b] == arr2[b + 1]) { b++; continue; }
    if (arr1[a] == arr2[b]) {
      res.add(arr1[a]);
      a++; b++;
    } else if (arr1[a] < arr2[b]) {
      a++;
    } else {
      b++;
    }
  }
  return res;
}

For massive files, the same ideas are applied after sharding:

Sort‑plus‑two‑pointer: split each file into manageable chunks, sort each chunk, then merge‑scan across machines.

Hash‑based: compute MD5 of each URL, take MD5 % N (e.g., N = 20) to assign URLs to N shards, process each shard on a separate node using an in‑memory hash table, and finally combine the results.

Using 20 machines, each handles one shard in parallel, so each node scans only 1/20 of the data and overall throughput improves roughly twentyfold.

Summary

When dealing with massive data, first consider solutions that work if the entire dataset fits into memory. Then, adopt generic big‑data techniques such as batch loading, sharding, and parallel processing across multiple machines to overcome memory and speed limitations.

These strategies enable reliable sorting, fast lookups, top‑K extraction, and deduplication for datasets ranging from tens of gigabytes to terabytes.

Tags: Big Data, Deduplication, distributed computing, sorting, top‑k, bitmap indexing, massive data processing
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
