Big Data 6 min read

Finding Common URLs in Billions of Records: From Naive Search to Hash Divide‑and‑Conquer

This article walks through a classic interview problem of locating common URLs in two massive files, explains why a naïve O(m·n) approach fails, and presents a scalable solution using hash‑based divide‑and‑conquer and file partitioning techniques.

NiuNiu MaTe

1. Initial Thinking

The interview question asks for the common URLs between file A (3.2 billion URLs) and file B (6.4 billion URLs). A naïve solution would load both files into memory and compare each element against every other, yielding a time complexity of O(m·n). The article illustrates this with a small example of five items, showing the step‑by‑step comparisons and identifying the common elements (黄蓉, 赵敏, 白发魔女 — Huang Rong, Zhao Min, and the White‑Haired Witch).
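The naïve comparison can be sketched in a few lines. This is a minimal illustration with made‑up placeholder URLs (the names echo the article's small example), not the article's actual files:

```python
# Naive approach: compare every element of A with every element of B,
# for O(m * n) comparisons in total -- hopeless at billions of records.
def common_urls_naive(a, b):
    common = []
    for x in a:          # m outer iterations
        for y in b:      # up to n inner iterations each
            if x == y:
                common.append(x)
                break
    return common

a = ["url/huang-rong", "url/guo-jing", "url/zhao-min", "url/bai-fa-mo-nv"]
b = ["url/zhao-min", "url/huang-rong", "url/bai-fa-mo-nv", "url/yang-guo"]
print(common_urls_naive(a, b))
# ['url/huang-rong', 'url/zhao-min', 'url/bai-fa-mo-nv']
```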

2. Advanced Thinking

The naïve traversal is inefficient; faster search methods such as binary search and, especially, hash lookup (O(1) average time) are introduced. A diagram demonstrates how a hash table can locate a target without repeated comparisons.

Even with hash lookup, the sheer size of the data exceeds available memory, so a different strategy is required.
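A sketch of the hash‑lookup version makes both the speedup and the memory problem concrete: build a hash set from A once, then probe it for each URL in B. That brings the total work down to O(m + n), but the set built from A must fit entirely in memory, which is exactly what fails at billions of URLs:

```python
# Hash-lookup version: O(m) to build the set, O(1) average per probe,
# O(m + n) overall -- but set(a) must fit entirely in memory.
def common_urls_hashed(a, b):
    seen = set(a)                              # hash table over file A
    return [url for url in b if url in seen]   # one O(1) probe per URL in B
```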

3. Ultimate Thinking

Because the problem involves massive data, the article proposes a divide‑and‑conquer approach: split the large files into many smaller chunks that fit into memory. One simple partitioning rule is to group URLs by their length; another, more robust method, is to hash each URL (e.g., MD5) and partition by the hash prefix.

Using the first two hexadecimal characters of the MD5 hash yields 16 × 16 = 256 buckets (00, 01, …, ff), ensuring each bucket is small enough to be processed in memory. After partitioning, the common URLs are found within each corresponding pair of small files.
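The partitioning step can be sketched as follows. The bucket rule (first two hex characters of the MD5 digest) is the one the article describes; the streaming loop and the output file naming (`a_00` … `a_ff`) are illustrative assumptions:

```python
import hashlib

def bucket_of(url: str) -> str:
    # First two hex characters of the MD5 digest select one of
    # 16 * 16 = 256 buckets, labeled "00" through "ff".
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:2]

def partition_file(path: str, out_prefix: str) -> None:
    # Stream the huge file line by line (never fully in memory) and
    # append each URL to its bucket file, e.g. "a_3f".
    # The file-naming scheme is an assumption for illustration.
    handles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            b = bucket_of(url)
            if b not in handles:
                handles[b] = open(f"{out_prefix}_{b}", "a", encoding="utf-8")
            handles[b].write(url + "\n")
    for h in handles.values():
        h.close()
```

Because MD5 is deterministic, the same URL always lands in the same bucket, so a URL common to A and B can only appear in a matching bucket pair.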

4. Solving the Original Problem

The final solution combines hash‑based partitioning with pairwise comparison of the resulting small files, effectively turning an impossible in‑memory operation into a series of manageable tasks. This hash‑divide‑and‑conquer technique is the key insight for handling billions of URLs.
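The whole pipeline can be sketched end to end. For brevity this in‑memory version uses dictionaries of sets to stand in for the 256 small files on disk; in production each bucket would be partitioned to disk first, then each matching file pair intersected separately:

```python
import hashlib
from collections import defaultdict

def bucket_of(url: str) -> str:
    # Same 256-way MD5-prefix partition described in the article.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:2]

def common_urls(urls_a, urls_b):
    # Each dict entry plays the role of one small bucket file.
    buckets_a = defaultdict(set)
    buckets_b = defaultdict(set)
    for u in urls_a:
        buckets_a[bucket_of(u)].add(u)
    for u in urls_b:
        buckets_b[bucket_of(u)].add(u)
    # Only matching bucket pairs can share URLs, so intersect pairwise.
    common = set()
    for key, group in buckets_a.items():
        common |= group & buckets_b.get(key, set())
    return common
```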

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Big Data, divide and conquer, hash algorithm, URL deduplication, interview question
Written by

NiuNiu MaTe

Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.
