Big Data 8 min read

How to Efficiently Find Common URLs in Billions of Records

This article explains how to handle the massive‑data problem of intersecting two files containing billions of URLs by using hash‑based divide‑and‑conquer techniques, file partitioning, and in‑memory hash lookups to achieve scalable performance beyond naive O(m·n) approaches.

Java Backend Technology

Dec 9, 2021

How to Efficiently Find Common URLs in Billions of Records

1. Initial Thinking

Some may think that loading both files into memory and comparing each URL would solve the problem, but with 3.2 billion URLs in file A and 6.4 billion in file B this is impossible.

To illustrate the naive approach, the article uses five fictional characters as URLs. The step‑by‑step comparison requires many checks and results in a time complexity of O(m·n), which cannot pass the interview.

Step 1 – Example with a character that does not appear in the second line.

Step 2 – Example with a character that appears in both lines.

Step 3 – Another non‑matching example.

Step 4 – Matching example.

Step 5 – Final matching example.

The naive method requires 25 comparisons and yields the common URLs 黄蓉、赵敏、白发魔女, but its O(m·n) complexity is far from acceptable.

2. Advanced Thinking

The bottleneck lies in the search strategy. Linear scan is terrible; binary search is better, but hash lookup provides O(1) time per check.

Even with hash lookup, the data still cannot fit into memory, so the problem remains a massive‑data challenge.

3. Ultimate Thinking

Apply divide‑and‑conquer: split the huge files into many small files that can fit into memory. One simple criterion is the length of the URL; a more robust method is to compute the MD5 hash of each URL and partition by the first one or two hex digits.

After partitioning, each small file can be loaded and processed with a hash table to find intersections.

4. Solving the Original Problem

The original interview question: file A contains 3.2 billion URLs, file B contains 6.4 billion URLs. Direct loading is impossible, so we must partition the URLs.

One approach is to group URLs by their length, but this may still produce large groups. A better approach is to hash each URL (e.g., MD5) and split by the first two hex characters, yielding up to 256 small files.

For each pair of corresponding small files (e.g., A1 vs B1, A2 vs B2, …), we load them into memory and use a hash table to find common URLs.

Hash‑based divide‑and‑conquer splits the large files into many manageable pieces.

Within each piece we locate the common URLs efficiently.

md5 first two letters

A1 file

B1 file

A2 file

B2 file

A3 file

B3 file

A4 file

B4 file

...

A256 file

B256 file

By applying this hash‑based partitioning, the memory limitation is removed and the common URLs (黄蓉、赵敏、白发魔女) can be obtained efficiently.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

algorithm Hash divide and conquer URL intersection

Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.