
How to Find Common URLs in Two 5‑Billion‑Entry Files with Only 4 GB RAM

This article explains a memory‑efficient, divide‑and‑conquer approach using hash partitioning and HashSet intersection to identify shared URLs between two massive 5‑billion‑record files while limited to just 4 GB of RAM.

Programmer DD

Problem Description

Given two files a and b, each containing 5 billion URLs (each 64 bytes), and only 4 GB of memory, find the URLs that appear in both files.

Solution Idea

Each URL occupies 64 B, so each file of 5 billion URLs takes about 320 GB (5 × 10^9 × 64 B), far more than the 4 GB of memory available. We therefore cannot load either file at once. Instead we use a divide‑and‑conquer strategy: partition each file into many smaller files so that a single partition fits into memory.
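The sizing can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 GB = 10^9 B) and uses the partition count of 1000 that the next step of this article picks; the variable names are illustrative.

```python
URL_BYTES = 64                    # size of one URL, from the problem statement
URLS_PER_FILE = 5_000_000_000     # 5 billion URLs in each of a and b
MEMORY_BYTES = 4 * 10**9          # 4 GB of RAM (decimal units assumed)
NUM_PARTITIONS = 1000             # partition count used in the next step

total_bytes = URL_BYTES * URLS_PER_FILE
partition_bytes = total_bytes // NUM_PARTITIONS
urls_per_partition = URLS_PER_FILE // NUM_PARTITIONS

print(total_bytes / 10**9)        # 320.0 (GB per input file)
print(partition_bytes / 10**6)    # 320.0 (MB per partition, if evenly spread)
print(urls_per_partition)         # 5000000
print(f"{partition_bytes / MEMORY_BYTES:.0%}")  # 8% of RAM for the raw bytes
```

The raw bytes of one partition use only a small share of memory, which leaves room for the overhead of an in-memory set.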

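First, scan file a and compute hash(URL) % 1000 for each URL, then append the URL to the file whose index equals that value: a0, a1, …, a999. If the hash spreads URLs evenly, each resulting file holds about 5 million URLs, roughly 320 MB. Do the same for file b with the same hash function, creating b0, …, b999. Because identical URLs always produce the same hash value, any common URL must land in a pair with the same index (ai, bi), so files with different indices never need to be compared.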
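The partitioning step might look like the following in Python. This is a minimal sketch, not the author's implementation: it assumes one URL per line, and the names `bucket_of` and `partition` are hypothetical. It uses `zlib.crc32` as the hash because Python's built-in `hash()` for strings is randomized per process, so it would not place the same URL in the same bucket across separate runs.

```python
import os
import zlib

def bucket_of(url: str, num_buckets: int) -> int:
    # Stable hash: the same URL always maps to the same bucket index.
    return zlib.crc32(url.encode("utf-8")) % num_buckets

def partition(src_path: str, out_dir: str, prefix: str, num_buckets: int = 1000) -> None:
    """Split src_path into out_dir/<prefix>0 .. <prefix>{num_buckets-1}."""
    os.makedirs(out_dir, exist_ok=True)
    # Assumption: the OS allows this many open files at once; with 1000
    # buckets the per-process file descriptor limit may need raising.
    outs = [
        open(os.path.join(out_dir, f"{prefix}{i}"), "w", encoding="utf-8")
        for i in range(num_buckets)
    ]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:          # streams the file, one line in memory at a time
                url = line.strip()
                if url:
                    outs[bucket_of(url, num_buckets)].write(url + "\n")
    finally:
        for f in outs:
            f.close()
```

Calling `partition("a", "parts", "a")` and then `partition("b", "parts", "b")` would produce the two sets of sub-files described above; the paths here are placeholders.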

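Then, for each i from 0 to 999, load ai into a HashSet, scan bi, and write every URL that is already in the set to a result file. The set adds per-entry overhead on top of the raw 320 MB, but about 5 million URLs should still fit comfortably within 4 GB. Only one pair is held in memory at a time.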
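The per-pair intersection can be sketched as follows, with Python's built-in `set` standing in for the HashSet. The function names and the `a{i}` / `b{i}` file layout in a single directory are assumptions for illustration. Removing a URL from the set after its first match is an addition of this sketch, so that a URL repeated in bi is reported only once.

```python
import os

def intersect_pair(a_path: str, b_path: str, out) -> None:
    # Load one partition of a into memory as a set.
    with open(a_path, encoding="utf-8") as fa:
        seen = {line.strip() for line in fa}
    seen.discard("")  # ignore blank lines
    # Stream the matching partition of b and emit URLs present in both.
    with open(b_path, encoding="utf-8") as fb:
        for line in fb:
            url = line.strip()
            if url in seen:
                out.write(url + "\n")
                seen.discard(url)  # report each common URL once

def intersect_all(part_dir: str, result_path: str, num_buckets: int = 1000) -> None:
    with open(result_path, "w", encoding="utf-8") as out:
        for i in range(num_buckets):
            intersect_pair(
                os.path.join(part_dir, f"a{i}"),
                os.path.join(part_dir, f"b{i}"),
                out,
            )
```

The set for one pair goes out of scope when `intersect_pair` returns, so memory use is bounded by the largest single partition rather than by the whole input.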

Method Summary

Divide the data using hash modulo to create manageable sub‑files.

For each sub‑file pair, use a HashSet to detect intersections.
