Tagged articles
7 articles
NiuNiu MaTe
Sep 22, 2025 · Big Data

How to De‑duplicate 4 Billion QQ Numbers with Only 1 GB RAM

Learn four practical techniques for removing duplicate QQ numbers from a 4‑billion‑record file under a strict 1 GB memory limit, or even a tighter 100 MB one: simple sorting, hashmap deduplication, external merge sort, and bitmap (bit‑set) optimization.

Big Data · Bitmap · algorithm
0 likes · 9 min read
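
As a rough illustration of the bitmap technique named in the summary above, here is a minimal Python sketch. It assumes every ID is a non-negative integer no larger than a known `max_value`; the function name `dedupe_with_bitmap` and the sample data are hypothetical and not taken from the article. One bit is reserved per possible value, so covering the whole unsigned 32‑bit range would take 2^32 / 8 bytes = 512 MiB, which is how an approach like this can stay under a 1 GB limit.

```python
def dedupe_with_bitmap(numbers, max_value):
    """Yield each number the first time it appears, using one bit per possible value."""
    # One bit for every value in [0, max_value]; all bits start at 0 (unseen).
    bitmap = bytearray((max_value >> 3) + 1)
    for n in numbers:
        byte_index = n >> 3        # which byte holds the bit for n
        mask = 1 << (n & 7)        # which bit inside that byte
        if not bitmap[byte_index] & mask:
            bitmap[byte_index] |= mask  # mark n as seen
            yield n


# Hypothetical small input; real input would be streamed from a file.
sample = [10001, 42, 10001, 7, 42, 99999]
unique = list(dedupe_with_bitmap(sample, max_value=100_000))
```

This version preserves first-seen order. Scanning the bitmap after a single pass would instead emit the unique values in sorted order.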
Big Data Technology & Architecture
Dec 24, 2020 · Big Data

Common Techniques for Processing Massive Data Sets

This article summarizes practical methods commonly used to handle, deduplicate, and query extremely large data volumes in big‑data applications, including Bloom filters, hashing, bitmaps, heaps, bucket partitioning, database indexes, inverted indexes, external sorting, tries, and MapReduce.

Big Data · Hashing · Heap
0 likes · 11 min read
Top Architect
Feb 25, 2020 · Big Data

External Sorting of a 4.6 GB File Containing 500 Million Integers: Strategies, Implementations, and Performance

The article presents a practical case of sorting a 4.6 GB file with 500 million random integers, evaluates in‑memory quicksort and merge‑sort implementations, discusses bitmap sorting, and finally details a multi‑phase external‑sort algorithm with measured runtimes and resource considerations.

Sorting Algorithm · bitmap sort · external sort
0 likes · 11 min read
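
The general shape of an external sort like the one the summary above mentions can be sketched as follows. This is a generic two-phase version (sorted runs written to disk, then a k-way merge), not the article's own multi-phase implementation; the names `external_sort`, `_write_run`, `_read_run`, and the `chunk_size` parameter are assumptions made for illustration.

```python
import heapq
import os
import tempfile


def _write_run(sorted_values, tmp_dir):
    """Write one sorted run to a temporary text file, one integer per line."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_values)
    return path


def _read_run(path):
    """Stream integers back from a run file without loading it whole."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values, chunk_size, tmp_dir):
    """Sort an iterable of ints while holding at most chunk_size of them in memory."""
    run_paths, chunk = [], []
    # Phase 1: cut the input into chunks, sort each in memory, spill to disk.
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            run_paths.append(_write_run(sorted(chunk), tmp_dir))
            chunk = []
    if chunk:
        run_paths.append(_write_run(sorted(chunk), tmp_dir))
    # Phase 2: k-way merge of the sorted runs, reading them lazily.
    yield from heapq.merge(*(_read_run(p) for p in run_paths))


with tempfile.TemporaryDirectory() as tmp:
    result = list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3, tmp_dir=tmp))
```

In practice `chunk_size` would be chosen from the available memory, and a binary on-disk format would usually be faster than text lines.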
Big Data Technology & Architecture
Jun 26, 2019 · Big Data

Common Techniques for Processing Massive Data Sets

This article summarizes practical methods for efficiently handling and analyzing extremely large data volumes in real‑world scenarios, including Bloom filters, hashing, bitmaps, heaps, bucket partitioning, database indexes, inverted indexes, external sorting, tries, and MapReduce.

Data Structures · Hashing · external sort
0 likes · 15 min read