Databases 10 min read

Why Storing 2 Billion Phone Numbers Requires VARCHAR(20) Over INT/BIGINT

When designing a database for billions of phone numbers, using INT or BIGINT leads to overflow, data‑integrity loss, and inefficient queries, while VARCHAR(20) preserves formatting, supports international extensions, simplifies validation, and avoids common development pitfalls.

ITPUB
ITPUB
ITPUB
Why Storing 2 Billion Phone Numbers Requires VARCHAR(20) Over INT/BIGINT

Introduction

A friend was asked in a ByteDance interview how to store 2 billion phone numbers and whether to use int, bigint, varchar or char. The answer should demonstrate awareness of business scalability, data fault‑tolerance, and comprehensive problem‑solving.

1. Problems with Using int or bigint

1.1 Can int hold an 11‑digit phone number?

In Java, int is a 32‑bit signed integer with a maximum value of 2,147,483,647 (≈2×10⁹), which cannot store an 11‑digit mobile number such as 13728199213. A 64‑bit long (SQL BIGINT) is required.

1.2 Data integrity issues

Storing a phone number like 01324567890 in a LONG loses the leading zero, corrupting the original data. International prefixes (+86), hyphens, or other symbols also prevent safe storage in a pure numeric type.

Long phoneNumber = 01324567890L; // Compile error: leading zero not allowed

1.3 Query complications

Searching for numbers that start with a specific prefix (e.g., 137) requires converting BIGINT to a string before performing a LIKE query, which dramatically reduces performance.

2. Advantages of Using VARCHAR

Fidelity : Preserves digits, symbols, and leading zeros exactly as entered.

Flexibility : Supports fuzzy searches, international numbers, and future extensions without schema changes.

Ease of use : No overflow or format‑conversion concerns.

CREATE TABLE user_tab (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'User ID',
  phone_number VARCHAR(20) NOT NULL COMMENT 'Phone number',
  PRIMARY KEY (id),
  UNIQUE KEY idx_phone (phone_number)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='User table';

3. Hidden interview assessment points

The interviewer evaluates business extensibility, data fault‑tolerance, and holistic thinking. Demonstrating why VARCHAR(20) is preferred over VARCHAR(11) shows these qualities.

3.1 Why not VARCHAR(11) ?

Phone numbers can be longer than 11 characters in real scenarios:

International format: +8613822223333 (14 characters)

Extended format with country code: 008613822223333 (15 characters)

Extension numbers: 13822223333#123 (exceeds 11)

Using VARCHAR(11) would cause errors or truncation in these cases.

3.2 Business extensibility considerations

Support landline numbers with hyphens (e.g., 010-62223333)

Support virtual numbers (e.g., 17012341234-5678)

Allow mixed login identifiers such as email + phone.

Choosing a length that leaves room for future changes avoids frequent schema migrations.

3.3 Data fault‑tolerance

Input may contain spaces or symbols (e.g., 138 2222 3333); storing the raw string simplifies cleaning.

Forcing VARCHAR(11) would require strict filtering, increasing code complexity.

3.4 Comprehensive cost thinking

VARCHAR(11)

occupies up to 11 bytes (digits and plus sign each 1 byte in utf8mb4). VARCHAR(20) occupies up to 20 bytes.

The extra 9 bytes per row amount to roughly 18 GB for 2 billion rows, still acceptable compared with the 16 GB required for a BIGINT column.

reasonable_length = base_requirement + international_extension + fault_tolerance_buffer

The exact formula is not fixed; the key is to articulate the three thinking dimensions during the interview.

4. Daily development pitfalls to avoid

4.1 Insufficient field length

Using VARCHAR(11) truncates international or extension numbers.

4.2 Character set and collation

Choosing utf8 prevents storing emojis or special symbols. Use utf8mb4 with utf8mb4_unicode_ci for full Unicode support.

utf8mb4 + utf8mb4_unicode_ci handles characters like + , * , # .

4.3 Index design

Missing a unique index on the phone column allows duplicate entries.

ALTER TABLE user ADD UNIQUE INDEX idx_phone (phone_number);

4.4 Data cleaning and validation

Users may input numbers with spaces or hyphens. Clean data before insertion and apply regex validation.

Standardize: remove spaces, hyphens, keep only + and digits. Regex example (strict 11‑digit): ^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\d{8}$ Regex example (loose with country code): ^(\+\d{1,3})?1(3\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\d|9[0-35-9])\d{8}$

4.5 Privacy and security

Storing phone numbers in plain text risks user privacy. Encrypt the column or mask the output (e.g., 138****3333).

Use AES encryption or built‑in DB encryption functions. Mask sensitive parts when displaying results.

4.6 Risk control validation

// Strict validation (11‑digit pure number, no country code)
String regex = "^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\\d{8}$";

// Loose validation (allows optional country code, e.g., +86 13812345678)
String looseRegex = "^(\\+\\d{1,3})?1(3\\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\\d|9[0-35-9])\\d{8}$";
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

sqlDatabase designphone number storagevarchardata integrity
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.