Big Data 3 min read

How to Find Common URLs in 5 Billion-Entry Files with Only 4 GB RAM

Given two files each containing 5 billion 64‑byte URLs (≈320 GB total) and only 4 GB of memory, the solution partitions the URLs by hash modulo 1000 into 1,000 smaller files, then uses hash sets to identify the intersecting URLs efficiently.

Code Ape Tech Column
Code Ape Tech Column
Code Ape Tech Column
How to Find Common URLs in 5 Billion-Entry Files with Only 4 GB RAM

Problem Description

Given two files a and b, each storing 5 billion URLs where each URL occupies 64 bytes (≈320 GB per file), and only 4 GB of RAM is available, find the URLs that appear in both files.

Solution Idea

Because the data size far exceeds available memory, a divide‑and‑conquer (partition) strategy is used. Each file is split into many smaller files such that each sub‑file fits into memory (≤4 GB).

Approach

1. Scan file a and compute hash(URL) % 1000 for each URL. Write the URL into one of the files a0, a1, …, a999. Each resulting file is about 300 MB.

2. Perform the same partitioning on file b, producing b0, b1, …, b999. After this step, any URL that could be common must reside in the same indexed pair of sub‑files (e.g., a0 and b0).

3. For each i from 0 to 999:

Load all URLs from ai into a HashSet.

Stream through bi; for each URL, check if it exists in the HashSet. If it does, write the URL to an output file of common URLs.

Method Summary

Divide the data set using hash modulo partitioning.

For each pair of sub‑files, use a HashSet to compute the intersection.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataMemory Optimizationhash partitionhash setintersection algorithm
Code Ape Tech Column
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.