Big Data 7 min read

How to Test Membership in 4 Billion Integers with Bitmap and Distributed Techniques

An interview question about checking whether a new integer belongs to a set of 4 billion numbers leads to a discussion of distributed loading across eight machines, bitmap representation using 500 MB of memory, and interval‑based external sorting, illustrating practical big‑data algorithm design.

Senior Brother's Insights
Senior Brother's Insights
Senior Brother's Insights
How to Test Membership in 4 Billion Integers with Bitmap and Distributed Techniques

Interview Scenario

Fresh graduate Xiao Shi interviews at a BAT subsidiary. After a brief self‑introduction, the interviewer asks: "I have 4 billion integers and a new integer; how would you determine whether the new integer is in the set?"

Distributed Solution Suggested by the Teacher

Teacher Lv explains that loading the entire dataset from disk eight times is extremely slow because each disk I/O operation is costly. He hints that the interviewer expects a distributed approach: split the data across eight machines, load each partition into memory once, and let all machines search in parallel, then aggregate the results. This reduces the query time to seconds.

Bitmap (Bit‑set) Method

Lv then introduces a classic big‑data technique called the bitmap (bit‑set) method. By allocating one bit for each possible integer value, the presence of a number is recorded as a 1, otherwise 0. Since a 32‑bit integer range is about 2³² ≈ 4.2 billion, the bitmap requires roughly 2³² bits ≈ 500 MB of memory, which easily fits in RAM.

To query a new integer, convert it to its index, check the corresponding bit, and instantly know whether it exists.

Alternative Interval‑Based External Sorting

Another student, Dan, proposes an alternative: sort the 4 billion numbers externally, then compress consecutive numbers into intervals represented by a pair (start, length). For example, the sequence 1 2 3 4 6 7 … becomes intervals (1,4) and (6,2). A binary search over these intervals answers membership queries. In the worst case, the number of intervals equals the number of gaps (≈2 billion), each interval occupying 8 bytes, totaling about 1.6 GB—still manageable in memory.

Memory and Performance Analysis

The bitmap solution uses ~500 MB, while the interval method may need up to 1.6 GB. Both approaches avoid repeated disk reads and achieve near‑constant‑time queries, dramatically faster than the naïve eight‑pass disk loading.

Classroom Discussion

In a follow‑up class, Lv presents the bitmap algorithm, and students discuss both solutions, emphasizing the importance of choosing data structures that balance memory consumption and query speed for massive datasets.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataBitmapData StructuresDistributed Computingexternal sortingmembership test
Senior Brother's Insights
Written by

Senior Brother's Insights

A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.