
How to Find Common URLs in 5 Billion‑Entry Files with Only 4 GB RAM

This article explains how to find the URLs common to two files of 5 billion records each (≈320 GB per file, ≈640 GB in total) using a hash‑based divide‑and‑conquer method that fits within a strict 4 GB memory limit.

Programmer DD

Problem Description

Given two files, a and b, each containing 5 billion URLs of 64 B each (≈320 GB per file), and only 4 GB of memory, find the URLs that appear in both files.
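The sizes in the problem statement can be checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and decimal units (1 GB = 10^9 B) are assumed for simplicity.

```python
# Back-of-the-envelope sizing for the problem statement.
# Assumption: decimal units (1 GB = 10**9 bytes); names are illustrative.
URLS_PER_FILE = 5_000_000_000
BYTES_PER_URL = 64
MEMORY_LIMIT_BYTES = 4 * 10**9

file_size_bytes = URLS_PER_FILE * BYTES_PER_URL
size_ratio = file_size_bytes // MEMORY_LIMIT_BYTES

print(file_size_bytes // 10**9)  # size of ONE file in GB: 320
print(size_ratio)                # one file is 80 times the memory limit
```

Because a single file is already 80 times larger than the available memory, neither file can be loaded whole, and the work has to be split into pieces that are processed independently.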

Solution Idea

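At 64 B per URL, each file is about 320 GB, far more than fits in memory. Use a divide‑and‑conquer (hash partition) strategy: for each URL in file a, compute hash(URL) % 1000 and append the URL to the small file with that index, giving a0…a999 of about 320 MB each if the hash spreads URLs evenly. Do the same for file b with the same hash function, producing b0…b999. Identical URLs always hash to the same index, so a URL present in both files must land in a pair ai and bi with the same i; no other pairs need to be compared.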
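A minimal sketch of the partitioning step follows, assuming one URL per line in the input file. The function names, the output layout, and the choice of `zlib.crc32` as the hash are assumptions of this example, not something the article specifies. A stable hash matters here: Python's built-in `hash()` for strings is salted per process, so it could send the same URL to different partitions when files a and b are processed in separate runs.

```python
import os
import zlib

NUM_PARTITIONS = 1000  # the article's choice; any count that keeps partitions small works


def partition_index(url: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a URL to a partition number using a hash that is stable across runs."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


def partition_file(src_path: str, out_dir: str, prefix: str,
                   num_partitions: int = NUM_PARTITIONS) -> None:
    """Stream src_path line by line and append each URL to <out_dir>/<prefix><index>."""
    os.makedirs(out_dir, exist_ok=True)
    # All partition files are kept open at once (hypothetical layout: a0, a1, ...).
    outputs = [
        open(os.path.join(out_dir, f"{prefix}{i}"), "w", encoding="utf-8")
        for i in range(num_partitions)
    ]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                url = line.rstrip("\n")
                if url:  # skip blank lines
                    outputs[partition_index(url, num_partitions)].write(url + "\n")
    finally:
        for f in outputs:
            f.close()
```

Calling `partition_file("a.txt", "parts", "a")` and then `partition_file("b.txt", "parts", "b")` would produce the two sets of small files. Holding 1000 files open at once can exceed the default open-file limit on some systems, in which case the limit has to be raised or the partitioning done in several passes.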

Processing Steps

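For each i from 0 to 999, load ai into a HashSet, then stream bi line by line and write every URL found in the set to a result file. Only one partition of a is held in memory at a time; even with per‑entry overhead, a HashSet of roughly 5 million URLs should stay well under the 4 GB limit.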
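The per-pair step can be sketched as below, with Python's built-in `set` playing the role of the HashSet. The function names and the `a<i>` / `b<i>` file naming are assumptions of this example. Removing a URL from the set once it has matched, so that a URL repeated in bi is reported only once, is also a choice of this sketch rather than something the article prescribes.

```python
import os


def intersect_partition(a_path: str, b_path: str, out) -> int:
    """Write URLs present in both partition files to `out`; return how many were written."""
    # Load one partition of file a into memory.
    with open(a_path, encoding="utf-8") as fa:
        seen = {line.rstrip("\n") for line in fa if line.strip()}
    matches = 0
    # Stream the matching partition of file b; it is never held in memory as a whole.
    with open(b_path, encoding="utf-8") as fb:
        for line in fb:
            url = line.rstrip("\n")
            if url in seen:
                out.write(url + "\n")
                seen.discard(url)  # assumption: report each common URL once
                matches += 1
    return matches


def intersect_all(parts_dir: str, result_path: str, num_partitions: int = 1000) -> int:
    """Run intersect_partition over every pair (a<i>, b<i>) found in parts_dir."""
    total = 0
    with open(result_path, "w", encoding="utf-8") as out:
        for i in range(num_partitions):
            a_path = os.path.join(parts_dir, f"a{i}")
            b_path = os.path.join(parts_dir, f"b{i}")
            total += intersect_partition(a_path, b_path, out)
    return total
```

Each pair is independent of the others, so the set built for one partition can be discarded before the next one is loaded, which is what keeps peak memory bounded by the largest single partition.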

Method Summary

Hash‑partition each of the two large files into 1000 smaller files, using the same hash function for both.

For each pair of small files, use a HashSet to find common URLs.
