How to Find Common URLs in 5 Billion-Entry Files with Only 4 GB RAM
Given two files each containing 5 billion 64‑byte URLs (≈320 GB total) and only 4 GB of memory, the solution partitions the URLs by hash modulo 1000 into 1,000 smaller files, then uses hash sets to identify the intersecting URLs efficiently.
Problem Description
Given two files a and b, each storing 5 billion URLs where each URL occupies 64 bytes (≈320 GB per file), and only 4 GB of RAM is available, find the URLs that appear in both files.
Solution Idea
Because the data size far exceeds available memory, a divide‑and‑conquer (partition) strategy is used. Each file is split into many smaller files such that each sub‑file fits into memory (≤4 GB).
Approach
1. Scan file a and compute hash(URL) % 1000 for each URL. Write the URL into one of the files a0, a1, …, a999. Each resulting file is about 300 MB.
2. Perform the same partitioning on file b, producing b0, b1, …, b999. After this step, any URL that could be common must reside in the same indexed pair of sub‑files (e.g., a0 and b0).
3. For each i from 0 to 999:
Load all URLs from ai into a HashSet.
Stream through bi; for each URL, check if it exists in the HashSet. If it does, write the URL to an output file of common URLs.
Method Summary
Divide the data set using hash modulo partitioning.
For each pair of sub‑files, use a HashSet to compute the intersection.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
