How to Efficiently Find Common URLs in Billions of Records
This article explains how to handle the massive‑data problem of intersecting two files containing billions of URLs by using hash‑based divide‑and‑conquer techniques, file partitioning, and in‑memory hash lookups to achieve scalable performance beyond naive O(m·n) approaches.
1. Initial Thinking
Some may think that loading both files into memory and comparing each URL would solve the problem, but with 3.2 billion URLs in file A and 6.4 billion in file B this is impossible.
To illustrate the naive approach, the article uses five fictional characters as URLs. The step‑by‑step comparison requires many checks and results in a time complexity of O(m·n), which cannot pass the interview.
Step 1 – Example with a character that does not appear in the second line.
Step 2 – Example with a character that appears in both lines.
Step 3 – Another non‑matching example.
Step 4 – Matching example.
Step 5 – Final matching example.
The naive method requires 25 comparisons and yields the common URLs 黄蓉、赵敏、白发魔女, but its O(m·n) complexity is far from acceptable.
2. Advanced Thinking
The bottleneck lies in the search strategy. Linear scan is terrible; binary search is better, but hash lookup provides O(1) time per check.
Even with hash lookup, the data still cannot fit into memory, so the problem remains a massive‑data challenge.
3. Ultimate Thinking
Apply divide‑and‑conquer: split the huge files into many small files that can fit into memory. One simple criterion is the length of the URL; a more robust method is to compute the MD5 hash of each URL and partition by the first one or two hex digits.
After partitioning, each small file can be loaded and processed with a hash table to find intersections.
4. Solving the Original Problem
The original interview question: file A contains 3.2 billion URLs, file B contains 6.4 billion URLs. Direct loading is impossible, so we must partition the URLs.
One approach is to group URLs by their length, but this may still produce large groups. A better approach is to hash each URL (e.g., MD5) and split by the first two hex characters, yielding up to 256 small files.
For each pair of corresponding small files (e.g., A1 vs B1, A2 vs B2, …), we load them into memory and use a hash table to find common URLs.
Hash‑based divide‑and‑conquer splits the large files into many manageable pieces.
Within each piece we locate the common URLs efficiently.
md5 first two letters
A
VS
B
aa
A1 file
VS
B1 file
ab
A2 file
VS
B2 file
ac
A3 file
VS
B3 file
ad
A4 file
VS
B4 file
...
...
VS
...
zz
A256 file
VS
B256 file
By applying this hash‑based partitioning, the memory limitation is removed and the common URLs (黄蓉、赵敏、白发魔女) can be obtained efficiently.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
