
Techniques for Processing Massive Data: Sorting, Querying, Top‑K, and Deduplication

This article explains core concepts and practical solutions for handling massive datasets that cannot fit into memory, covering batch processing, distributed sorting, bitmap indexing, hash‑based lookups, top‑K extraction, and deduplication techniques with code examples and multi‑machine strategies.

政采云技术

What Is Massive Data?

Massive data refers to data volumes so large that they cannot be loaded into memory at once or processed quickly on a single machine.

Small‑Data Processing Solutions

When data fits into memory, many classic techniques can be used:

Sorting: quicksort, merge sort, bucket sort, etc.

Querying: linear scan, binary search, hash tables, red‑black trees, bitmaps.

TOP‑K for static data: sort then scan, or use quick‑select to find a partition point.

TOP‑K for dynamic data: maintain a heap of size K (a min‑heap to track the largest K elements, a max‑heap for the smallest K).

Deduplication / duplicate detection: build a hash table for one set and probe with the other, or sort both sets and use a two‑pointer technique.
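The quick‑select approach for static top‑K mentioned above can be sketched as follows. This is an illustrative implementation (not from the original article) using a Lomuto-style partition in descending order, so the K largest elements end up in the array's prefix without a full sort:

```java
public class QuickSelect {
  // Partition arr so the K largest elements occupy arr[0..k-1]
  // (in no particular order), in expected O(n) time.
  static void topK(int[] arr, int k) {
    int lo = 0, hi = arr.length - 1;
    while (lo < hi) {
      int p = partition(arr, lo, hi);   // elements >= pivot end up to the left
      if (p == k - 1) return;
      if (p < k - 1) lo = p + 1; else hi = p - 1;
    }
  }

  // Lomuto partition in descending order; returns the pivot's final index.
  private static int partition(int[] arr, int lo, int hi) {
    int pivot = arr[hi], i = lo;
    for (int j = lo; j < hi; j++) {
      if (arr[j] >= pivot) { int t = arr[i]; arr[i] = arr[j]; arr[j] = t; i++; }
    }
    int t = arr[i]; arr[i] = arr[hi]; arr[hi] = t;
    return i;
  }
}
```

Unlike sort-then-scan (O(n log n)), quick-select stops as soon as the partition point lands at index K‑1.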

Problems Faced by Massive Data Processing

Single‑machine memory is insufficient.

Single‑machine processing speed is too slow.

Core Ideas for Massive Data Processing

Do not load all data into memory; process it in batches.

If one machine cannot hold all data, shard the data across multiple machines and process in parallel.

Accelerate slow single‑machine processing by parallelizing the work across many machines.

Common Cases and Corresponding Solutions

Sorting Problem

Case: Sort a 10 GB order file by total amount

If the file fits into memory, sort directly. Otherwise, with 1.5 GB memory, read 1 GB chunks, sort each chunk in memory, write the sorted chunk back to disk, and repeat until ten sorted 1 GB files are produced.

Next, merge the ten sorted files using a multi‑way merge: keep the smallest element from each file in a size‑10 array, repeatedly extract the global minimum, write it to the output file, and replace it with the next element from the same source file.
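The multi‑way merge above can be sketched with a priority queue holding one cursor per sorted source. This is a minimal in‑memory sketch over int arrays; a real implementation would stream records from the ten sorted files instead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
  // Merge k sorted arrays; each heap entry is {value, sourceIndex, offsetInSource}.
  static List<Integer> merge(List<int[]> sources) {
    PriorityQueue<int[]> pq = new PriorityQueue<>((x, y) -> x[0] - y[0]);
    for (int s = 0; s < sources.size(); s++) {
      if (sources.get(s).length > 0) pq.offer(new int[]{sources.get(s)[0], s, 0});
    }
    List<Integer> out = new ArrayList<>();
    while (!pq.isEmpty()) {
      int[] min = pq.poll();            // extract the global minimum
      out.add(min[0]);
      int next = min[2] + 1;            // advance that source's cursor
      int[] src = sources.get(min[1]);
      if (next < src.length) pq.offer(new int[]{src[next], min[1], next});
    }
    return out;
  }
}
```

With ten sources the heap replaces the size-10 array described above, and each extract/replace costs O(log 10).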

Optimisations include adding a 100 MB read cache for each chunk, using a write cache, and scaling the approach to multiple machines (e.g., ten machines each handling 1 GB).

When the key range is known and uniformly distributed (e.g., order amount 0‑999), bucket sort can be applied: stream the file, place each record into the appropriate bucket, sort each bucket individually, then concatenate the buckets to obtain a fully sorted file.
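The bucketing idea can be sketched for integer amounts in 0–999 with ten buckets of width 100. This in-memory sketch is illustrative; the real pass would stream each record to a per‑bucket file on disk:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BucketSort {
  // Sort amounts in the known range [0, 1000) using 10 buckets of width 100.
  static List<Integer> sort(List<Integer> amounts) {
    List<List<Integer>> buckets = new ArrayList<>();
    for (int i = 0; i < 10; i++) buckets.add(new ArrayList<>());
    for (int a : amounts) buckets.get(a / 100).add(a);   // place into its bucket
    List<Integer> out = new ArrayList<>();
    for (List<Integer> b : buckets) {
      Collections.sort(b);   // sort each bucket individually
      out.addAll(b);         // concatenate buckets in order
    }
    return out;
  }
}
```

Because every element of bucket i is smaller than every element of bucket i+1, concatenating the sorted buckets yields a fully sorted result with no merge step.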

Query Problem

Case: Determine whether a login IP is in a whitelist of 1 billion IPs

On a single machine with 1 GB memory, a bitmap is the most space‑efficient structure. A 32‑bit IP address space requires 2³² bits = 512 MB, which fits into memory.

Construction: scan the whitelist file, set the bit corresponding to each IP address. Query: compute the index of the login IP in the bitmap and check whether the bit is set.
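A minimal bitmap sketch backed by a long[]. The capacity parameter is an illustrative knob: pass 1L << 32 to cover the full IPv4 space (512 MB of longs); here an IP is assumed to already be in its unsigned 32‑bit integer form:

```java
public class IpBitmap {
  private final long[] bits;

  // capacityBits = 1L << 32 covers the full IPv4 space (~512 MB).
  IpBitmap(long capacityBits) {
    bits = new long[(int) (capacityBits >>> 6)];   // 64 bits per long
  }

  // Set the bit for this IP (construction phase: one call per whitelist entry).
  void add(long ip) {
    bits[(int) (ip >>> 6)] |= 1L << (ip & 63);
  }

  // Query phase: check whether the IP's bit is set.
  boolean contains(long ip) {
    return (bits[(int) (ip >>> 6)] & (1L << (ip & 63))) != 0;
  }
}
```

Both operations are O(1): the high bits of the IP pick the long, the low 6 bits pick the bit within it.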

For a multi‑machine solution, distribute the bitmap across four machines using a hash (e.g., MD5 % 4) to assign each IP to a specific node. Queries are routed to the appropriate node after applying the same hash.

TOP‑K Problem

Case: Find the top 100 keywords in a 100 GB search‑keyword file

If the data fits in memory, either sort the keywords and scan, or use a hash table to count frequencies and a min‑heap of size 100 to keep the current top‑K.

public String[] top100(String[] keywords) {
  if (keywords == null) return null;
  // hash table for frequency
  Map<String, Integer> map = new HashMap<>();
  for (String keyword : keywords) {
    map.put(keyword, map.getOrDefault(keyword, 0) + 1);
  }
  // min‑heap of size 100
  PriorityQueue<String> heap = new PriorityQueue<>(100, (o1, o2) -> map.get(o1) - map.get(o2));
  for (Map.Entry<String, Integer> entry : map.entrySet()) {
    String keyword = entry.getKey();
    int frequency = entry.getValue();
    if (heap.size() < 100) {
      heap.offer(keyword);
    } else if (frequency > map.get(heap.peek())) {
      heap.poll();
      heap.offer(keyword);
    }
  }
  String[] res = new String[heap.size()]; // at most 100; avoids trailing nulls
  int i = 0;
  while (!heap.isEmpty()) {
    res[i++] = heap.poll();
  }
  return res;
}

For 100 GB that cannot be loaded, first compute frequencies using external sorting or a distributed hash‑based approach (e.g., shard the file into 10 parts by MD5 % 10, count frequencies in each shard, then merge the partial results). After obtaining a frequency file, apply the same min‑heap technique on the aggregated frequencies.
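The shard-assignment step might look like the following. The `shardOf` helper is illustrative: it computes MD5 via `java.security.MessageDigest` and interprets the digest as a non‑negative integer before taking the modulo, so identical keywords always land on the same shard:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sharder {
  // Map a keyword to one of nShards buckets via MD5. Equal keywords always
  // hash to the same shard, so per-shard counts can be computed independently.
  static int shardOf(String keyword, int nShards) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(keyword.getBytes(StandardCharsets.UTF_8));
      // Interpret the 16-byte digest as a non-negative integer, then reduce.
      return new BigInteger(1, digest).mod(BigInteger.valueOf(nShards)).intValue();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);   // MD5 is guaranteed on the JVM
    }
  }
}
```

Because all occurrences of a keyword fall into one shard, each shard's frequency counts are complete on their own, and the partial top‑K candidates can simply be merged at the end.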

Deduplication / Duplicate‑Finding Problem

Case: Find common URLs in two 5‑billion‑line files (each line 64 bytes) with only 4 GB memory per machine

When memory is sufficient, two approaches are common:

Sort both files and use a two‑pointer scan to emit matches.

Load one file into a hash table and probe the other.

public List<Integer> findDuplicate(int[] arr1, int[] arr2) {
  if (arr1 == null || arr2 == null) return null;
  Arrays.sort(arr1);
  Arrays.sort(arr2);
  List<Integer> res = new ArrayList<>();
  int a = 0, b = 0;
  while (a < arr1.length && b < arr2.length) {
    if (a < arr1.length - 1 && arr1[a] == arr1[a + 1]) { a++; continue; }
    if (b < arr2.length - 1 && arr2[b] == arr2[b + 1]) { b++; continue; }
    if (arr1[a] == arr2[b]) {
      res.add(arr1[a]);
      a++; b++;
    } else if (arr1[a] < arr2[b]) {
      a++;
    } else {
      b++;
    }
  }
  return res;
}

For massive files, the same ideas are applied after sharding:

Sort‑plus‑two‑pointer: split each file into manageable chunks, sort each chunk, then merge‑scan across machines.

Hash‑based: compute MD5 of each URL, take MD5 % N (e.g., N = 20) to assign URLs to N shards, process each shard on a separate node using an in‑memory hash table, and finally combine the results.

Using 20 machines, each handles one shard in parallel, so each node scans only 1/20 of the data and overall throughput improves roughly twentyfold.

Summary

When dealing with massive data, first consider solutions that work if the entire dataset fits into memory. Then, adopt generic big‑data techniques such as batch loading, sharding, and parallel processing across multiple machines to overcome memory and speed limitations.

These strategies enable reliable sorting, fast lookups, top‑K extraction, and deduplication for datasets ranging from tens of gigabytes to terabytes.

Tags: Big Data, Deduplication, distributed computing, sorting, top‑k, bitmap indexing, massive data processing
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
