Big Data 6 min read

Finding Common URLs in Billions of Records: From Naive Search to Hash Divide‑and‑Conquer

This article walks through a classic interview problem of locating common URLs in two massive files, explains why a naïve O(m·n) approach fails, and presents a scalable solution using hash‑based divide‑and‑conquer and file partitioning techniques.

NiuNiu MaTe

1. Initial Thinking

The interview question asks for the common URLs between file A (3.2 billion URLs) and file B (6.4 billion URLs). A naïve solution would load both files into memory and compare each element against every other, yielding a time complexity of O(m·n). The article illustrates this with a small example of five items, showing the step‑by‑step comparisons and identifying the common elements (黄蓉, 赵敏, 白发魔女 — Huang Rong, Zhao Min, and the White‑Haired Witch).
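The naïve comparison can be sketched in a few lines. This is a minimal illustration with made‑up placeholder URLs (the names echo the article's small example), not the article's actual files:

```python
# Naive approach: compare every element of A with every element of B,
# for O(m * n) comparisons in total -- hopeless at billions of records.
def common_urls_naive(a, b):
    common = []
    for x in a:          # m outer iterations
        for y in b:      # up to n inner iterations each
            if x == y:
                common.append(x)
                break
    return common

a = ["url/huang-rong", "url/guo-jing", "url/zhao-min", "url/bai-fa-mo-nv"]
b = ["url/zhao-min", "url/huang-rong", "url/bai-fa-mo-nv", "url/yang-guo"]
print(common_urls_naive(a, b))
# ['url/huang-rong', 'url/zhao-min', 'url/bai-fa-mo-nv']
```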

2. Advanced Thinking

The naïve traversal is inefficient; faster search methods such as binary search and, especially, hash lookup (O(1) average time) are introduced. A diagram demonstrates how a hash table can locate a target without repeated comparisons.

Even with hash lookup, the sheer size of the data exceeds available memory, so a different strategy is required.
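A sketch of the hash‑lookup version makes both the speedup and the memory problem concrete: build a hash set from A once, then probe it for each URL in B. That brings the total work down to O(m + n), but the set built from A must fit entirely in memory, which is exactly what fails at billions of URLs:

```python
# Hash-lookup version: O(m) to build the set, O(1) average per probe,
# O(m + n) overall -- but set(a) must fit entirely in memory.
def common_urls_hashed(a, b):
    seen = set(a)                              # hash table over file A
    return [url for url in b if url in seen]   # one O(1) probe per URL in B
```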

3. Ultimate Thinking

Because the problem involves massive data, the article proposes a divide‑and‑conquer approach: split the large files into many smaller chunks that fit into memory. One simple partitioning rule is to group URLs by their length; another, more robust method, is to hash each URL (e.g., MD5) and partition by the hash prefix.

Using the first two hexadecimal characters of the MD5 hash yields 16 × 16 = 256 buckets (00, 01, …, ff), ensuring each bucket is small enough to be processed in memory. After partitioning, the common URLs are found within each corresponding pair of small files.
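The partitioning step can be sketched as follows. The bucket rule (first two hex characters of the MD5 digest) is the one the article describes; the streaming loop and the output file naming (`a_00` … `a_ff`) are illustrative assumptions:

```python
import hashlib

def bucket_of(url: str) -> str:
    # First two hex characters of the MD5 digest select one of
    # 16 * 16 = 256 buckets, labeled "00" through "ff".
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:2]

def partition_file(path: str, out_prefix: str) -> None:
    # Stream the huge file line by line (never fully in memory) and
    # append each URL to its bucket file, e.g. "a_3f".
    # The file-naming scheme is an assumption for illustration.
    handles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            b = bucket_of(url)
            if b not in handles:
                handles[b] = open(f"{out_prefix}_{b}", "a", encoding="utf-8")
            handles[b].write(url + "\n")
    for h in handles.values():
        h.close()
```

Because MD5 is deterministic, the same URL always lands in the same bucket, so a URL common to A and B can only appear in a matching bucket pair.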

4. Solving the Original Problem

The final solution combines hash‑based partitioning with pairwise comparison of the resulting small files, effectively turning an impossible in‑memory operation into a series of manageable tasks. This hash‑divide‑and‑conquer technique is the key insight for handling billions of URLs.
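The whole pipeline can be sketched end to end. For brevity this in‑memory version uses dictionaries of sets to stand in for the 256 small files on disk; in production each bucket would be partitioned to disk first, then each matching file pair intersected separately:

```python
import hashlib
from collections import defaultdict

def bucket_of(url: str) -> str:
    # Same 256-way MD5-prefix partition described in the article.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:2]

def common_urls(urls_a, urls_b):
    # Each dict entry plays the role of one small bucket file.
    buckets_a = defaultdict(set)
    buckets_b = defaultdict(set)
    for u in urls_a:
        buckets_a[bucket_of(u)].add(u)
    for u in urls_b:
        buckets_b[bucket_of(u)].add(u)
    # Only matching bucket pairs can share URLs, so intersect pairwise.
    common = set()
    for key, group in buckets_a.items():
        common |= group & buckets_b.get(key, set())
    return common
```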

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Big Data, divide and conquer, hash algorithm, URL deduplication, interview question
Written by

NiuNiu MaTe

Joined Tencent (nicknamed "Goose Factory") through campus recruitment at a second‑tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Started as an interviewer at the foreign firm and hopes to help others.
