Big Data Interview Questions and Solutions for Massive Data Processing
This article presents ten big-data interview problems, covering scenarios such as finding the most frequent IP in a log, answering top-K queries, and counting word frequencies under tight memory limits, and shows how techniques such as hashing, bitmaps, tries, heaps, and external sorting solve them efficiently.
1. Find the IP with the most Baidu visits on a given day
With at most 2^32 distinct IPv4 addresses, the counts may fit in memory directly; otherwise, split the log into 1000 files by hash(ip) % 1000 so that every record for a given IP lands in the same file, count frequencies per file with a hash map, and select the IP with the highest overall count.
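The split-and-count step can be sketched as follows. This is a minimal in-memory sketch: the in-memory buckets stand in for the 1000 on-disk files, and the function name is my own.

```python
from collections import Counter

def most_frequent_ip(ip_lines, num_buckets=1000):
    # Partition by hash so all records for one IP land in one bucket
    # (each bucket stands in for one of the 1000 on-disk files).
    buckets = [[] for _ in range(num_buckets)]
    for ip in ip_lines:
        buckets[hash(ip) % num_buckets].append(ip)
    # Count each bucket independently, then take the global maximum.
    best_ip, best_count = None, 0
    for bucket in buckets:
        for ip, count in Counter(bucket).items():
            if count > best_count:
                best_ip, best_count = ip, count
    return best_ip, best_count
```

Because hashing is consistent within a run, no IP's records are ever split across buckets, so the per-bucket maxima are safe to compare globally.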
2. Identify the top 10 most popular query strings from 10 million logs (memory ≤1 GB)
Use a two-step approach: first count frequencies with a hash table in O(N), then maintain a min-heap of size K (here 10) while scanning the roughly 3 million distinct queries, for O(N + N'·log K) time overall; a trie can replace the hash table for the counting step.
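The two steps can be sketched like this (function and parameter names are my own):

```python
import heapq
from collections import Counter

def top_k_queries(queries, k=10):
    # Step 1: hash-map counting, O(N) over the raw log.
    counts = Counter(queries)
    # Step 2: size-k min-heap over the distinct queries, O(N' log k).
    heap = []  # entries are (count, query); heap[0] is the smallest kept
    for query, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, query))
    return sorted(heap, reverse=True)  # highest count first
```

The heap only ever holds K entries, so memory stays tiny even when the distinct-query count is in the millions.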
3. Return the 100 most frequent words from a 1 GB file (word size ≤16 bytes, memory 1 MB)
Partition words into 5000 small files by hash(word) % 5000, recursively splitting any file that is still too large for the 1 MB limit; count frequencies within each file using a trie or hash map, keep each file's top 100 with a min-heap, then merge the per-file results.
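A sketch of the partition-count-merge pipeline, with in-memory partitions standing in for the on-disk files (names are illustrative):

```python
import heapq
from collections import Counter

def top_words(words, num_files=5000, k=100):
    # hash(word) % num_files sends every copy of a word to the same
    # partition, so partitions can be counted independently.
    partitions = {}
    for word in words:
        partitions.setdefault(hash(word) % num_files, Counter())[word] += 1
    # Each partition's local top k are the only possible global winners,
    # since a word's full count lives entirely in one partition.
    candidates = []
    for counter in partitions.values():
        candidates.extend(counter.most_common(k))
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])
```

The key invariant is that no word's count is split between files, which is exactly what hashing on the word guarantees.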
4. Sort queries by frequency across ten 1 GB files
Hash each query into one of ten new files so that all copies of a query land in the same file, count frequencies per file with hash maps on a machine with ~2 GB of RAM, then sort the (query, count) pairs with quicksort, heapsort, or merge sort; if the distinct queries fit in memory, count them directly, or distribute the work with MapReduce.
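Once the hash partitioning has placed all copies of each query together, the merge-and-sort step reduces to combining per-file counts. A minimal sketch, with lists of queries standing in for the ten files:

```python
from collections import Counter

def queries_by_frequency(files):
    # `files` stands in for the ten on-disk query files; after hash
    # partitioning, every copy of a query sits in the same file, so
    # merging the per-file counters yields exact global counts.
    total = Counter()
    for queries in files:
        total.update(queries)
    # most_common() returns (query, count) pairs sorted by count, descending.
    return total.most_common()
```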
5. Find common URLs in two files each containing 5 billion URLs (64 bytes each, memory 4 GB)
Split each file into 1000 buckets using hash(url) % 1000 (roughly 320 MB per bucket), then for each matching bucket pair load one side into a hash set and probe it with the other; a Bloom filter over all the URLs fits in 4 GB if a small false-positive rate is acceptable.
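The bucketed intersection can be sketched as follows (sets stand in for the on-disk buckets; names are my own):

```python
def common_urls(file_a, file_b, num_buckets=1000):
    # Using the same hash on both sides guarantees that a shared URL
    # falls into the same-numbered bucket in each file, so only
    # matching bucket pairs ever need to be compared.
    def bucketize(urls):
        buckets = [set() for _ in range(num_buckets)]
        for url in urls:
            buckets[hash(url) % num_buckets].add(url)
        return buckets
    common = set()
    for side_a, side_b in zip(bucketize(file_a), bucketize(file_b)):
        common |= side_a & side_b  # load one side, probe with the other
    return common
```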
6. Detect non‑repeating integers among 250 million numbers with insufficient memory
Use a 2-bit-per-value bitmap (00 = unseen, 01 = seen once, 10 = seen more than once): scan the data once to update states, then output every number whose state is 01; alternatively, hash-partition the numbers into smaller files and process each independently.
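A small sketch of the 2-bit bitmap; `max_value` is a parameter so the example stays tiny, whereas the real problem would allocate 2 bits for each possible 32-bit integer:

```python
def singletons(nums, max_value):
    # Two bits per value: 00 = unseen, 01 = seen once, 10 = seen 2+ times.
    bitmap = bytearray((max_value * 2 + 7) // 8)
    for n in nums:
        byte, shift = (n * 2) // 8, (n * 2) % 8
        state = (bitmap[byte] >> shift) & 0b11
        if state < 2:  # saturate at "seen 2+"
            bitmap[byte] = (bitmap[byte] & ~(0b11 << shift) & 0xFF) | ((state + 1) << shift)
    # One scan of the bitmap emits every value whose state is 01.
    return [n for n in range(max_value)
            if (bitmap[(n * 2) // 8] >> ((n * 2) % 8)) & 0b11 == 1]
```

For 32-bit integers this bitmap costs 2^32 × 2 bits = 1 GB, which is why the partitioning alternative matters when memory is tighter still.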
7. Determine if a given number exists in an unsorted set of 4 billion unsigned ints
Allocate a 512 MB bitmap (one bit per possible 32-bit value) and set bits while reading the dataset; a query then checks the corresponding bit in O(1). Another method recursively partitions the numbers by their leading bit, narrowing the search to the half that could contain the target, so the answer is found in at most 32 passes.
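The bitmap approach can be sketched as follows; `max_value` keeps the example small, while the real problem would use 2^32 / 8 bytes = 512 MB:

```python
def build_bitmap(values, max_value):
    # One bit per possible value in [0, max_value).
    bitmap = bytearray((max_value + 7) // 8)
    for v in values:
        bitmap[v // 8] |= 1 << (v % 8)
    return bitmap

def contains(bitmap, v):
    # O(1) membership test: inspect the single bit assigned to v.
    return bool(bitmap[v // 8] & (1 << (v % 8)))
```

Building the bitmap is a single pass over the dataset; every query afterwards is constant time.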
8. Find the element with the highest duplicate count in massive data
Hash‑partition data into small files, compute the most frequent element in each, then select the overall maximum.
9. Retrieve the top N most frequent items from tens of millions or billions of records
If the data fits in memory, count frequencies with a hash map or balanced tree and keep the top N with a min-heap; otherwise, hash-partition the records to disk first, as in the earlier problems.
10. List the top 10 most frequent words in a 10 000‑line text file
Read the file, count word occurrences with a hash map, then extract the ten highest counts; this runs in O(N) time and O(M) space, where M is the number of distinct words.
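Since everything fits in memory here, the whole solution collapses to one counting pass plus a top-10 selection (a minimal sketch; the function name and whitespace tokenization are my own assumptions):

```python
from collections import Counter

def top_ten_words(lines):
    # One hash-map counting pass over all words, then pick the
    # ten largest counts.
    counts = Counter(word for line in lines for word in line.split())
    return counts.most_common(10)
```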
This article has been distilled and summarized from source material, then republished for learning and reference.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.
