How to Test Membership in 4 Billion Integers with Bitmap and Distributed Techniques
An interview question about checking whether a new integer belongs to a set of 4 billion numbers leads to a discussion of distributed loading across eight machines, bitmap representation using 500 MB of memory, and interval‑based external sorting, illustrating practical big‑data algorithm design.
Interview Scenario
Fresh graduate Xiao Shi interviews at a BAT subsidiary. After a brief self‑introduction, the interviewer asks: "I have 4 billion integers and a new integer; how would you determine whether the new integer is in the set?"
Distributed Solution Suggested by the Teacher
Teacher Lv explains that loading the entire dataset from disk eight times is extremely slow because each disk I/O operation is costly. He hints that the interviewer expects a distributed approach: split the data across eight machines, load each partition into memory once, and let all machines search in parallel, then aggregate the results. This reduces the query time to seconds.
Bitmap (Bit‑set) Method
Lv then introduces a classic big‑data technique called the bitmap (bit‑set) method. By allocating one bit for each possible integer value, the presence of a number is recorded as a 1, otherwise 0. Since a 32‑bit integer range is about 2³² ≈ 4.2 billion, the bitmap requires roughly 2³² bits ≈ 500 MB of memory, which easily fits in RAM.
To query a new integer, convert it to its index, check the corresponding bit, and instantly know whether it exists.
Alternative Interval‑Based External Sorting
Another student, Dan, proposes an alternative: sort the 4 billion numbers externally, then compress consecutive numbers into intervals represented by a pair (start, length). For example, the sequence 1 2 3 4 6 7 … becomes intervals (1,4) and (6,2). A binary search over these intervals answers membership queries. In the worst case, the number of intervals equals the number of gaps (≈2 billion), each interval occupying 8 bytes, totaling about 1.6 GB—still manageable in memory.
Memory and Performance Analysis
The bitmap solution uses ~500 MB, while the interval method may need up to 1.6 GB. Both approaches avoid repeated disk reads and achieve near‑constant‑time queries, dramatically faster than the naïve eight‑pass disk loading.
Classroom Discussion
In a follow‑up class, Lv presents the bitmap algorithm, and students discuss both solutions, emphasizing the importance of choosing data structures that balance memory consumption and query speed for massive datasets.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
