How to Test Membership in 4 Billion Numbers Using Only 1 GB Memory
This article explains how to determine whether a given unsigned integer belongs to a set of 4 billion distinct numbers within a 1 GB memory limit, comparing a bitmap approach with a Bloom filter, providing detailed implementation steps and C++ code examples for both methods.
Problem
We have 4 billion distinct unsigned 32‑bit integers stored unsorted. The task is to answer membership queries (does a given number belong to the set?) while using no more than 1 GB of RAM.
Approach Comparison
Bitmap (bit‑set) method – one bit for each possible value (0 … 2³²‑1). Exact answers, O(1) query time, memory usage = 2³² bits ≈ 512 MiB.
Bloom filter – multiple hash functions map each element to bits in a smaller array. Saves memory (≈10 % of bitmap) but introduces a configurable false‑positive probability.
Bitmap Solution
Because 512 MiB fits comfortably within the 1 GB limit, a plain bitmap is the simplest exact solution.
Memory Requirement
2³² bits = 4,294,967,296 bits ≈ 512 MiB.
Implementation Steps
Create an array of uint32_t elements large enough to hold 2³² bits (i.e., 2³² / 32 = 134,217,728 words).
For each input number n, set the corresponding bit:
index = n / 32;
offset = n % 32;
bitmap[index] |= (1U << offset);To answer a query, test the same bit:
bool exists = (bitmap[index] & (1U << offset)) != 0;Complete example (C++17):
#include <iostream>
#include <vector>
#include <cstdint>
const std::size_t WORD_COUNT = 1ULL << 30; // 2^32 / 32
std::vector<uint32_t> bitmap(WORD_COUNT, 0);
inline void setBit(uint32_t n) {
std::size_t idx = n / 32;
uint32_t off = n % 32;
bitmap[idx] |= (1U << off);
}
inline bool checkBit(uint32_t n) {
std::size_t idx = n / 32;
uint32_t off = n % 32;
return (bitmap[idx] & (1U << off)) != 0;
}
int main() {
// Assume "numbers" holds the 4 billion values
std::vector<uint32_t> numbers = {/* ... */};
for (uint32_t v : numbers) setBit(v);
uint32_t query = 123456789U;
if (checkBit(query))
std::cout << "Number exists." << std::endl;
else
std::cout << "Number does not exist." << std::endl;
return 0;
}Bloom Filter Alternative
If memory must be reduced further, a Bloom filter can be used at the cost of false positives.
Typical Parameters
Bit array size ≈ 100 million bits (≈12.5 MiB).
Number of hash functions = 5 (gives a false‑positive rate around 1 % for 4 billion items).
Implementation Steps
Allocate HASH_COUNT vectors of bool of length BIT_ARRAY_SIZE.
Initialize an array of std::hash<uint32_t> (or any fast 32‑bit hash) for the HASH_COUNT functions.
For each input value, compute each hash, take modulo BIT_ARRAY_SIZE, and set the corresponding bit.
To query, compute the same hashes; if any of the bits is 0 the element is definitely absent, otherwise it may be present.
Example (C++17):
#include <array>
#include <vector>
#include <functional>
#include <cstdint>
#include <iostream>
constexpr std::size_t BIT_ARRAY_SIZE = 100'000'000; // 100 M bits
constexpr std::size_t HASH_COUNT = 5;
std::array<std::vector<bool>, HASH_COUNT> filter;
std::array<std::hash<uint32_t>, HASH_COUNT> hasher;
inline void add(uint32_t x) {
for (std::size_t i = 0; i < HASH_COUNT; ++i) {
std::size_t idx = hasher[i](x) % BIT_ARRAY_SIZE;
filter[i][idx] = true;
}
}
inline bool possiblyContains(uint32_t x) {
for (std::size_t i = 0; i < HASH_COUNT; ++i) {
std::size_t idx = hasher[i](x) % BIT_ARRAY_SIZE;
if (!filter[i][idx]) return false; // definitely not present
}
return true; // may be present (false positive possible)
}
int main() {
for (std::size_t i = 0; i < HASH_COUNT; ++i) {
filter[i] = std::vector<bool>(BIT_ARRAY_SIZE, false);
hasher[i] = std::hash<uint32_t>();
}
// Insert the 4 billion numbers here …
uint32_t query = 123456789U;
if (possiblyContains(query))
std::cout << "Number might exist (false positive possible)." << std::endl;
else
std::cout << "Number definitely does not exist." << std::endl;
return 0;
}Performance Considerations
Bitmap: O(1) query and update, exact answer, 512 MiB memory.
Bloom filter: O(k) query where k = number of hash functions (e.g., 5), memory ≈ 12.5 MiB, false‑positive probability configurable (≈1 % with the parameters above).
Conclusion
With a 1 GB memory budget, the bitmap method is the most straightforward and efficient way to achieve exact membership testing for 4 billion unsigned integers. A Bloom filter is an alternative when memory must be reduced further and a small false‑positive rate is acceptable.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
