Finding Common URLs in Billions of Records: From Naive Search to Hash Divide‑and‑Conquer
This article walks through a classic interview problem of locating common URLs in two massive files, explains why a naïve O(m·n) approach fails, and presents a scalable solution using hash‑based divide‑and‑conquer and file partitioning techniques.
1. Initial Thinking
The interview question asks for the common URLs between file A (3.2 billion URLs) and file B (6.4 billion URLs). A naïve solution would load both files into memory and compare each element, yielding a time complexity of O(m·n). The article illustrates this with a small example of five items, showing step‑by‑step comparisons and identifying the common elements (黄蓉, 赵敏, 白发魔女).
2. Advanced Thinking
The naïve traversal is inefficient; faster search methods such as binary search and, especially, hash lookup (O(1) time) are introduced. A diagram demonstrates how a hash table can locate a target without repeated comparisons.
Even with hash lookup, the sheer size of the data exceeds available memory, so a different strategy is required.
3. Ultimate Thinking
Because the problem involves massive data, the article proposes a divide‑and‑conquer approach: split the large files into many smaller chunks that fit into memory. One simple partitioning rule is to group URLs by their length; another, more robust method, is to hash each URL (e.g., MD5) and partition by the hash prefix.
Using the first two characters of the MD5 hash yields up to 256 buckets (aa, ab, …, zz), ensuring each bucket is small enough to be processed in memory. After partitioning, the common URLs are found within each corresponding pair of small files.
4. Solving the Original Problem
The final solution combines hash‑based partitioning with pairwise comparison of the resulting small files, effectively turning an impossible in‑memory operation into a series of manageable tasks. This hash‑divide‑and‑conquer technique is the key insight for handling billions of URLs.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
NiuNiu MaTe
Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
