How to Query Phone Number Suffix in Milliseconds on 50 Million Records
When faced with 50 million user rows, using LIKE '%1234' triggers a full table scan, but adding a suffix column, reversing the phone number, leveraging function indexes, or integrating Elasticsearch can reduce query time from minutes to a few milliseconds, each with its own trade‑offs.
Why LIKE '%1234' Fails
The naïve query forces a full table scan because the leading wildcard prevents the B+‑tree index from being used. An EXPLAIN shows type: ALL and an estimated rows: 50000000, leading to 30 seconds‑plus execution time and 100 % CPU usage.
Solution 1 – Redundant Suffix Column
Add a phone_suffix column that stores the last four digits, populate it, and create an index on it.
-- 1. Add column
ALTER TABLE users ADD COLUMN phone_suffix CHAR(4) NOT NULL DEFAULT '';
-- 2. Populate data
UPDATE users SET phone_suffix = RIGHT(phone_number, 4);
-- 3. Create index
CREATE INDEX idx_phone_suffix ON users(phone_suffix);Query becomes an equality lookup:
SELECT * FROM users WHERE phone_suffix = '1234';
-- EXPLAIN shows type: ref, rows: ~5000Performance drops from 30 seconds to about 5 ms (≈6000× speed‑up). The trade‑off is an extra 4 bytes per row (~200 MB for 50 M rows) and the need to keep the suffix column in sync with the phone number.
@Transactional
public void updatePhone(Long userId, String newPhone) {
String suffix = newPhone.substring(newPhone.length() - 4);
userMapper.updatePhoneAndSuffix(userId, newPhone, suffix);
}Solution 2 – Store Reversed Phone Number
Insert a reversed_phone column containing the phone number reversed, index it, and query the reversed suffix as a prefix.
-- Add reversed column
ALTER TABLE users ADD COLUMN reversed_phone CHAR(11) NOT NULL DEFAULT '';
-- Populate
UPDATE users SET reversed_phone = REVERSE(phone_number);
-- Index
CREATE INDEX idx_reversed_phone ON users(reversed_phone);Original suffix query LIKE '%5678' becomes:
SELECT * FROM users WHERE reversed_phone LIKE '8765%';This supports variable‑length suffixes (e.g., 4‑, 5‑, 6‑digit searches) with a modest storage increase (11 bytes per row) and typical query times under 10 ms.
Solution 3 – Function Index (MySQL 8.0+ / PostgreSQL)
Create an index on an expression, e.g., REVERSE(phone_number), without adding a new column.
CREATE INDEX idx_phone_reverse_func ON users ((REVERSE(phone_number)));Query using the same expression:
SELECT * FROM users WHERE REVERSE(phone_number) LIKE '8765%';Works only on MySQL 8.0+ (or PostgreSQL) and requires the SQL to match the indexed expression exactly; otherwise the index is ignored.
Solution 4 – Introduce Elasticsearch
For complex, multi‑field fuzzy searches, offload the query to Elasticsearch, which uses inverted indexes.
Application → MySQL (writes) → Canal → Elasticsearch (reads)Example ES mapping includes phone_suffix as a keyword field. Queries can combine terms, e.g., name and phone suffix. The downside is added infrastructure, data‑sync latency (1‑2 s), and maintenance overhead, so it’s recommended only for large‑scale or highly complex search scenarios.
Solution 5 – Sharding & Specialized Structures
When data exceeds 50 M rows, combine sharding (hash‑by‑suffix) with the redundant suffix column. For ultra‑low latency, use Redis bitmaps or Bloom filters to pre‑filter candidates.
Choosing the Right Approach
Based on data volume and query requirements:
< 1 M rows : Simple LIKE with pagination is acceptable.
1 M – 5 M rows, fixed‑length suffix: Redundant suffix column (Solution 1) gives the best performance.
Flexible suffix length: Reversed field method (Solution 2).
MySQL 8.0+ and desire for schema cleanliness: Function index (Solution 3).
> 50 M rows, single‑suffix queries: Sharding + suffix column.
Complex multi‑field fuzzy search: MySQL + Elasticsearch (Solution 4).
Real‑time ultra‑low latency: Redis bitmap / Bloom filter.
Monitoring
After any optimization, monitor slow‑query logs, index hit rates, and P95/P99 response times. Edge cases can still cause high tail latency if not accounted for.
Conclusion
The interview question tests understanding of index mechanics, space‑time trade‑offs, and architectural choices. Demonstrating the step‑by‑step reasoning—from why LIKE '%...' is bad, through multiple viable solutions, to selecting the most appropriate one—shows deep competence in database design and performance engineering.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
