High‑Performance Challenge: Optimizing a C‑Based User Information Service for Tens of Millions of Records
This article recounts a 2017 internal high‑performance competition where a C server handling a 40 million‑row user‑info dataset was progressively optimized through data compression, custom hash tables, memory layout redesign, and a thread‑pool model, ultimately achieving a four‑fold throughput increase.
In December 2017, the New‑House R&D department at Lianjia organized a high‑performance challenge to assess what the team had recently learned about the LNMP stack. Participants had to implement a user‑information query API that returns JSON over HTTP, includes a custom header, and follows a fixed URL format.
The official dataset contained 40 million rows (≈5.8 GB). Initial analysis showed that holding the raw records in memory would be prohibitive on the target machine, a single‑core VM with 4 GB of RAM. PHP's own data structures make matters worse: an array of 1 million integers occupies about 144 MB, far more than the 8 MB that 1 million 8‑byte integers need in theory.
Because the traditional LNMP stack (Client → Nginx → FPM → MySQL) was too heavyweight, the team decided to write a lightweight HTTP server in C and focus on data compression.
Each record consists of an unsigned‑int ID (4 bytes), a 4‑character username (12 bytes in UTF‑8), a 16‑byte hexadecimal string, a 1‑byte gender flag, and a 100‑character "extra" field that can be encoded in 6 bits per character (100 × 6 bits = 75 bytes). After re‑encoding, a record shrinks from 155 bytes to 4 + 12 + 16 + 1 + 75 = 108 bytes, so the encoded record is roughly 70 % of its original size.
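The 108‑byte layout and the 6‑bit encoding of the "extra" field can be sketched as follows. This is a minimal illustration, not the competition code: the 64‑symbol alphabet, the field names, and the function names `sym_index` and `pack6` are all assumptions, since the article does not give the actual character set or struct definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-symbol alphabet; the article does not list the real one. */
static const char ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

#define EXTRA_CHARS 100
#define EXTRA_BYTES (EXTRA_CHARS * 6 / 8) /* 600 bits = 75 bytes */

/* Illustrative field names. Every member is a byte array, so the
 * compiler has no reason to insert alignment padding. */
typedef struct {
    uint8_t id[4];              /* unsigned-int ID              */
    uint8_t name[12];           /* 4 characters in UTF-8        */
    uint8_t hex[16];            /* 16-byte hexadecimal field    */
    uint8_t gender;             /* 1-byte flag                  */
    uint8_t extra[EXTRA_BYTES]; /* 100 characters, 6 bits each  */
} packed_record;

_Static_assert(sizeof(packed_record) == 108, "record must be 108 bytes");

/* Returns the 6-bit code of c, or -1 if c is not in the alphabet. */
int sym_index(char c) {
    const char *p = (c != '\0') ? strchr(ALPHABET, c) : NULL;
    return p ? (int)(p - ALPHABET) : -1;
}

/* Packs n characters (n must be a multiple of 4) into n * 6 / 8 bytes.
 * Returns 0 on success, -1 if a character is outside the alphabet. */
int pack6(const char *in, size_t n, uint8_t *out) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t bits = 0;
        for (int k = 0; k < 4; k++) {
            int v = sym_index(in[i + k]);
            if (v < 0)
                return -1;
            bits = (bits << 6) | (uint32_t)v; /* append 6 bits */
        }
        *out++ = (uint8_t)(bits >> 16); /* 24 bits -> 3 bytes */
        *out++ = (uint8_t)(bits >> 8);
        *out++ = (uint8_t)bits;
    }
    return 0;
}
```

Calling `pack6(text, EXTRA_CHARS, rec.extra)` on a 100‑character field fills exactly the 75‑byte `extra` member, because 100 is a multiple of 4.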
The first lookup design was a hash table. However, the hash function produced high collision rates when constrained to a 4 KB bucket space, which led to excessive memory usage (≈3 GB) and poor performance.
Switching to binary search over two parallel arrays (a sorted ID list and a record list) removed the hash table's overhead. Each record still occupies 108 bytes, but the index is now just the ID array, so about 34 million records stay in memory and the rest are read from disk. Overall memory usage is around 3.7 GB, in line with 34 million × 108 bytes.
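A lookup over two parallel arrays might look like the sketch below. It assumes the ID array is sorted in ascending order and that record `i` sits at byte offset `i * 108` in the record array; the names `find_index`, `find_record`, and `RECORD_SIZE` are hypothetical, and the handling of records that live on disk is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RECORD_SIZE 108 /* bytes per encoded record, as stated in the article */

/* Returns the position of id in the ascending array ids[0..n),
 * or -1 if it is not present. */
ptrdiff_t find_index(const uint32_t *ids, size_t n, uint32_t id) {
    assert(ids != NULL || n == 0);
    size_t lo = 0, hi = n; /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2; /* avoids overflow of lo + hi */
        if (ids[mid] < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < n && ids[lo] == id) ? (ptrdiff_t)lo : -1;
}

/* Maps an ID to its record in the parallel record array, or NULL. */
const uint8_t *find_record(const uint32_t *ids, const uint8_t *records,
                           size_t n, uint32_t id) {
    ptrdiff_t i = find_index(ids, n, id);
    return i < 0 ? NULL : records + (size_t)i * RECORD_SIZE;
}
```

For 40 million IDs the loop runs at most 26 times, and the only per‑record index cost is the 4‑byte ID itself.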
To further improve concurrency, a simple producer‑consumer thread pool was introduced. Incoming requests are enqueued, and worker threads dequeue them for processing. This change raised throughput to about 8,000 requests per second, roughly a four‑fold improvement.
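The producer‑consumer pattern described here can be sketched with POSIX threads as a bounded queue guarded by a mutex and two condition variables. This is an assumption‑laden illustration rather than the team's server: the queue capacity, the names (`task_queue`, `queue_push`, `queue_close`, `worker_main`, `count_task`), and the shutdown protocol are all invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_CAP 64 /* hypothetical capacity */

typedef void (*task_fn)(void *arg);
typedef struct { task_fn fn; void *arg; } task;

typedef struct {
    task items[QUEUE_CAP]; /* ring buffer of pending tasks */
    size_t head, count;
    int closed;            /* set once no more tasks will arrive */
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
} task_queue;

void queue_init(task_queue *q) {
    q->head = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Producer side: blocks while the ring buffer is full. */
void queue_push(task_queue *q, task_fn fn, void *arg) {
    assert(fn != NULL);
    pthread_mutex_lock(&q->mu);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->mu);
    q->items[(q->head + q->count) % QUEUE_CAP] = (task){fn, arg};
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

/* Marks the queue closed so workers exit once it has drained. */
void queue_close(task_queue *q) {
    pthread_mutex_lock(&q->mu);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

/* Consumer side: thread entry point, arg is a task_queue pointer. */
void *worker_main(void *arg) {
    task_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->mu);
        while (q->count == 0 && !q->closed)
            pthread_cond_wait(&q->not_empty, &q->mu);
        if (q->count == 0) { /* closed and fully drained */
            pthread_mutex_unlock(&q->mu);
            return NULL;
        }
        task t = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->mu);
        t.fn(t.arg); /* run the task outside the lock */
    }
}

/* Example task: atomically bumps a shared counter. */
void count_task(void *arg) { atomic_fetch_add((atomic_int *)arg, 1); }
```

In a server, the accepting thread would call `queue_push` with a request handler and the connection as its argument, while a fixed set of threads started with `pthread_create(..., worker_main, &q)` does the lookups. Pushing after `queue_close` is not handled in this sketch.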
Additional optimizations included decoding the "extra" field more efficiently by processing three bytes at a time: 24 bits is the least common multiple of 6 and 8, so every three packed bytes yield exactly four characters. Minor tweaks to TCP settings, process priority, and the I/O model were also tried, but they had limited impact because the test ran at low concurrency.
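Decoding in three‑byte groups can be sketched as below: each group is loaded into one 24‑bit value and split into four 6‑bit indexes. The alphabet and the function name `unpack6` are assumptions made for illustration, as the article does not specify the character set or the bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-symbol alphabet; the article does not list the real one. */
static const char ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/* Expands nbytes packed bytes (nbytes must be a multiple of 3) into
 * nbytes / 3 * 4 characters followed by a terminating NUL.
 * out must have room for nbytes / 3 * 4 + 1 chars. */
void unpack6(const uint8_t *in, size_t nbytes, char *out) {
    assert(nbytes % 3 == 0);
    for (size_t i = 0; i < nbytes; i += 3) {
        /* Load 24 bits, most significant byte first. */
        uint32_t bits = ((uint32_t)in[i] << 16) |
                        ((uint32_t)in[i + 1] << 8) |
                        (uint32_t)in[i + 2];
        /* Split into four 6-bit table indexes. */
        *out++ = ALPHABET[(bits >> 18) & 0x3F];
        *out++ = ALPHABET[(bits >> 12) & 0x3F];
        *out++ = ALPHABET[(bits >> 6) & 0x3F];
        *out++ = ALPHABET[bits & 0x3F];
    }
    *out = '\0';
}
```

A 75‑byte "extra" field decodes in 25 iterations with no bit‑by‑bit bookkeeping, since no character ever straddles a group boundary.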
The author reflects on the experience, emphasizing the importance of a systematic performance analysis framework, the value of C for precise memory control, and the benefit of understanding lower‑level implementations even when using higher‑level languages like PHP.
Beike Product & Technology
