Why Storing 2 Billion Phone Numbers Requires VARCHAR(20) Over INT/BIGINT
When designing a database for billions of phone numbers, using INT or BIGINT leads to overflow, data‑integrity loss, and inefficient queries, while VARCHAR(20) preserves formatting, supports international extensions, simplifies validation, and avoids common development pitfalls.
Introduction
A friend was asked in a ByteDance interview how to store 2 billion phone numbers and whether to use int, bigint, varchar or char. The answer should demonstrate awareness of business scalability, data fault‑tolerance, and comprehensive problem‑solving.
1. Problems with Using int or bigint
1.1 Can int hold an 11‑digit phone number?
In Java, int is a 32‑bit signed integer with a maximum value of 2,147,483,647 (≈2×10⁹), which cannot store an 11‑digit mobile number such as 13728199213. A 64‑bit long (SQL BIGINT) is required.
1.2 Data integrity issues
Storing a phone number like 01324567890 in a LONG loses the leading zero, corrupting the original data. International prefixes (+86), hyphens, or other symbols also prevent safe storage in a pure numeric type.
Long phoneNumber = 01324567890L; // Compile error: leading zero not allowed1.3 Query complications
Searching for numbers that start with a specific prefix (e.g., 137) requires converting BIGINT to a string before performing a LIKE query, which dramatically reduces performance.
2. Advantages of Using VARCHAR
Fidelity : Preserves digits, symbols, and leading zeros exactly as entered.
Flexibility : Supports fuzzy searches, international numbers, and future extensions without schema changes.
Ease of use : No overflow or format‑conversion concerns.
CREATE TABLE user_tab (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'User ID',
phone_number VARCHAR(20) NOT NULL COMMENT 'Phone number',
PRIMARY KEY (id),
UNIQUE KEY idx_phone (phone_number)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='User table';3. Hidden interview assessment points
The interviewer evaluates business extensibility, data fault‑tolerance, and holistic thinking. Demonstrating why VARCHAR(20) is preferred over VARCHAR(11) shows these qualities.
3.1 Why not VARCHAR(11) ?
Phone numbers can be longer than 11 characters in real scenarios:
International format: +8613822223333 (14 characters)
Extended format with country code: 008613822223333 (15 characters)
Extension numbers: 13822223333#123 (exceeds 11)
Using VARCHAR(11) would cause errors or truncation in these cases.
3.2 Business extensibility considerations
Support landline numbers with hyphens (e.g., 010-62223333)
Support virtual numbers (e.g., 17012341234-5678)
Allow mixed login identifiers such as email + phone.
Choosing a length that leaves room for future changes avoids frequent schema migrations.
3.3 Data fault‑tolerance
Input may contain spaces or symbols (e.g., 138 2222 3333); storing the raw string simplifies cleaning.
Forcing VARCHAR(11) would require strict filtering, increasing code complexity.
3.4 Comprehensive cost thinking
VARCHAR(11)occupies up to 11 bytes (digits and plus sign each 1 byte in utf8mb4). VARCHAR(20) occupies up to 20 bytes.
The extra 9 bytes per row amount to roughly 18 GB for 2 billion rows, still acceptable compared with the 16 GB required for a BIGINT column.
reasonable_length = base_requirement + international_extension + fault_tolerance_bufferThe exact formula is not fixed; the key is to articulate the three thinking dimensions during the interview.
4. Daily development pitfalls to avoid
4.1 Insufficient field length
Using VARCHAR(11) truncates international or extension numbers.
4.2 Character set and collation
Choosing utf8 prevents storing emojis or special symbols. Use utf8mb4 with utf8mb4_unicode_ci for full Unicode support.
utf8mb4 + utf8mb4_unicode_ci handles characters like + , * , # .
4.3 Index design
Missing a unique index on the phone column allows duplicate entries.
ALTER TABLE user ADD UNIQUE INDEX idx_phone (phone_number);4.4 Data cleaning and validation
Users may input numbers with spaces or hyphens. Clean data before insertion and apply regex validation.
Standardize: remove spaces, hyphens, keep only + and digits. Regex example (strict 11‑digit): ^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\d{8}$ Regex example (loose with country code): ^(\+\d{1,3})?1(3\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\d|9[0-35-9])\d{8}$
4.5 Privacy and security
Storing phone numbers in plain text risks user privacy. Encrypt the column or mask the output (e.g., 138****3333).
Use AES encryption or built‑in DB encryption functions. Mask sensitive parts when displaying results.
4.6 Risk control validation
// Strict validation (11‑digit pure number, no country code)
String regex = "^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\\d{8}$";
// Loose validation (allows optional country code, e.g., +86 13812345678)
String looseRegex = "^(\\+\\d{1,3})?1(3\\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\\d|9[0-35-9])\\d{8}$";Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
