Why Storing Phone Numbers as VARCHAR Beats BIGINT in Billion‑Row Databases
This article explains why phone numbers should be stored as VARCHAR(15~20) rather than INT/BIGINT, covering underlying principles, performance tests, internationalization, security, large‑scale optimization techniques, and a concrete best‑practice checklist for enterprise systems.
Background and Core Conclusion
When designing a user table, developers often debate whether to store phone numbers as BIGINT to save space or as VARCHAR for flexibility. The article argues that, especially at scales of tens of millions to billions of rows, the correct choice is VARCHAR(15~20), not an integer type.
Why a Phone Number Is Not a Numeric Value
Numeric fields imply arithmetic participation, sortable mathematical meaning, and compact storage. Phone numbers, however, are identifiers with format constraints (leading zeros, country codes) and never participate in calculations. Storing them as numbers leads to:
Loss of leading zeros (e.g., 013811122233 becomes 13811122233).
Inability to store international formats containing ‘+’ or variable length.
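Awkward prefix queries: LIKE '138%' on a numeric column forces an implicit cast to string, which prevents normal index use.
Loss of formatting: a numeric column stores only the value, so MySQL discards everything else and turns a semantically rich string into a plain number.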
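The leading-zero and "+" losses listed above can be reproduced outside the database. The following is a minimal Java sketch, assuming the application parses the number into a 64-bit integer before writing it to a BIGINT column; the class and method names are hypothetical.

```java
// Hypothetical demo class; names and sample values are for illustration only.
final class LeadingZeroDemo {

    /** Round-trips a phone number through a 64-bit integer, as a BIGINT column would. */
    static String roundTripThroughLong(String phone) {
        // Long.parseLong accepts a leading '+' and leading zeros, but keeps neither.
        long asNumber = Long.parseLong(phone);
        return Long.toString(asNumber);
    }

    public static void main(String[] args) {
        String withZero = "013811122233";
        String withPlus = "+8613800138000";
        System.out.println(withZero + " -> " + roundTripThroughLong(withZero));
        System.out.println(withPlus + " -> " + roundTripThroughLong(withPlus));
    }
}
```

Neither input survives the round trip unchanged, which is the semantic loss a string column avoids.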
Why VARCHAR(15~20) Is the Optimal Solution
Preserves the exact format, including leading zeros and optional "+86" international prefix.
Complies with the E.164 global standard (at most 15 digits including the country code, conventionally written with a leading "+").
Supports flexible queries: exact match, prefix LIKE '138%', and fuzzy matching.
Storage overhead is negligible at massive scale: BIGINT uses 8 bytes per value (≈16 GB for 2 billion rows), while an 11‑ to 15‑digit VARCHAR value uses 12‑16 bytes including its 1‑byte length prefix (≈24‑32 GB). The difference is trivial in TB‑level databases.
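The storage figures above are simple arithmetic and can be checked with a short Java sketch. It assumes a 1-byte VARCHAR length prefix, one byte per ASCII digit, and decimal gigabytes (10^9 bytes), and it ignores row headers and index overhead; the class and method names are hypothetical.

```java
// Hypothetical helper for back-of-the-envelope column sizing; not a MySQL API.
final class StorageEstimate {

    /** Fixed size of a BIGINT value in bytes. */
    static final long BIGINT_BYTES = 8;

    /** Bytes for a VARCHAR value made of ASCII digits: payload plus an assumed 1-byte length prefix. */
    static long varcharBytes(int digits) {
        return digits + 1L;
    }

    /** Raw column size in decimal gigabytes (10^9 bytes), excluding row and index overhead. */
    static double columnGigabytes(long rows, long bytesPerValue) {
        return rows * bytesPerValue / 1e9;
    }

    public static void main(String[] args) {
        long rows = 2_000_000_000L;
        System.out.printf("BIGINT:            %.0f GB%n", columnGigabytes(rows, BIGINT_BYTES));
        System.out.printf("VARCHAR, 11 chars: %.0f GB%n", columnGigabytes(rows, varcharBytes(11)));
        System.out.printf("VARCHAR, 15 chars: %.0f GB%n", columnGigabytes(rows, varcharBytes(15)));
    }
}
```

Real tables will be larger once secondary indexes and per-row overhead are counted, but the ratio between the two column types stays in the same range.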
Performance Analysis
Benchmarks show VARCHAR(11) vs BIGINT query speed differences of less than 1 %. The dominant factors for performance are index design, sharding strategy, table schema bloat, and query patterns—not the column type itself.
Optimizations for 20‑Billion‑Row Datasets
Sharding / Partitioning
Two common strategies:
Prefix‑based sharding (e.g., 130‑139 → user_1, 150‑159 → user_2, 180‑189 → user_3).
Hash‑based sharding:
table_index = hash(mobile) % 64;
Index Enhancements
Single‑column index:
CREATE INDEX idx_mobile ON users(mobile);
High‑selectivity prefix index (first 7 digits):
CREATE INDEX idx_mobile_prefix ON users(mobile(7));
Hash Field for Massive Concurrency
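The article proposes adding a mobile_hash CHAR(32) column (e.g., the MD5 hex digest of the number) as a fixed‑width lookup key to improve QPS. Note that a 32‑character hex digest is longer than the 11 to 15 digits it replaces, so the index does not get shorter unless the digest is stored more compactly; measure before adopting it: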
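ALTER TABLE users ADD COLUMN mobile_hash CHAR(32);
-- Backfill the digest for existing rows.
UPDATE users SET mobile_hash = MD5(mobile);
-- The lookup only benefits if the hash column is indexed (index name is illustrative).
CREATE INDEX idx_mobile_hash ON users(mobile_hash);
SELECT * FROM users WHERE mobile_hash = MD5('13800138000');
Pre‑Insert Normalization Workflow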
Remove noise characters:
mobile = mobile.replaceAll("[\\s\\-()]", "");
Strip country code to a unified 11‑digit domestic format:
if (mobile.startsWith("+86")) mobile = mobile.substring(3);
if (mobile.startsWith("86") && mobile.length() > 11) mobile = mobile.substring(2);
Validate pattern:
mobile.matches("^1[3-9]\\d{9}$");
Security and Compliance Requirements
Masking for display:
SELECT CONCAT(LEFT(mobile,3),'****',RIGHT(mobile,4)) AS mobile_mask FROM users;
Avoid logging plain numbers.
Optionally encrypt the column with AES when business demands.
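Masking is often needed in application code as well, for example before a number reaches a log line. Below is a minimal Java sketch that mirrors the SQL expression above (first three and last four characters kept); the class name, the method name, and the choice to fully mask short or null inputs are assumptions, not part of the original article.

```java
// Hypothetical helper; mirrors CONCAT(LEFT(mobile,3),'****',RIGHT(mobile,4)).
final class MobileMasker {

    private static final String FULL_MASK = "****";

    /** Returns the number with its middle replaced by "****", or a full mask for short or null input. */
    static String mask(String mobile) {
        // With 7 or fewer characters, the first 3 plus the last 4 would reveal everything.
        if (mobile == null || mobile.length() < 8) {
            return FULL_MASK;
        }
        String head = mobile.substring(0, 3);
        String tail = mobile.substring(mobile.length() - 4);
        return head + FULL_MASK + tail;
    }

    public static void main(String[] args) {
        System.out.println(mask("13800138000")); // prints "138****8000"
    }
}
```

Calling mask(...) at the logging boundary, rather than in each caller, keeps unmasked numbers from leaking through a forgotten code path.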
Uniqueness Considerations
Whether the mobile column should be unique depends on the business scenario (login accounts usually require uniqueness, while shared or family accounts may not). The DDL for a unique constraint is:
ALTER TABLE users ADD UNIQUE KEY idx_mobile_unique (mobile);
Common Pitfalls (Checklist)
Using BIGINT leads to loss of leading zeros, no international support, and cumbersome queries.
Relying on a leading wildcard such as LIKE '%xxx' prevents index usage.
Inconsistent formatting before storage causes query mismatches.
Treating the phone field as an integer creates type mismatches across the stack.
Logging plain numbers can violate GDPR and local data‑protection laws.
Final Best‑Practice Specification
Column type: VARCHAR(20) (recommended).
Store pure digits without spaces or punctuation; in the domestic workflow above, the "+86" country code is stripped as well.
Create a single‑column index idx_mobile.
For >20 billion rows, add a mobile_hash field to accelerate lookups.
Normalize and validate format before insertion.
Uniqueness is business‑driven.
Prefer exact match queries; avoid fuzzy patterns.
Always mask numbers in UI and logs.
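The "normalize and validate before insertion" item can be packaged as a single helper so that every write path applies the same rules. This is a minimal Java sketch built from the article's three steps (remove noise, strip the "+86" or "86" prefix, validate against the mainland pattern); the class name and the use of Optional for invalid input are assumptions.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper combining the article's normalization steps for mainland China numbers.
final class MobileNormalizer {

    // Whitespace, hyphens, and parentheses are treated as noise.
    private static final Pattern NOISE = Pattern.compile("[\\s\\-()]");
    // 11 digits, starting with 1, second digit 3 to 9.
    private static final Pattern MAINLAND_MOBILE = Pattern.compile("^1[3-9]\\d{9}$");

    /** Returns the 11-digit domestic form, or empty if the input is not a valid mainland mobile number. */
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String mobile = NOISE.matcher(raw).replaceAll("");
        if (mobile.startsWith("+86")) {
            mobile = mobile.substring(3);
        }
        if (mobile.startsWith("86") && mobile.length() > 11) {
            mobile = mobile.substring(2);
        }
        return MAINLAND_MOBILE.matcher(mobile).matches()
                ? Optional.of(mobile)
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(normalize("+86 138-0013-8000")); // prints "Optional[13800138000]"
        System.out.println(normalize("12345"));             // prints "Optional.empty"
    }
}
```

Returning an empty Optional instead of throwing lets the caller decide whether an invalid number is a validation error or simply a missing field.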
Adopting this guideline ensures semantic correctness, international compatibility, security compliance, query flexibility, and scalability for systems ranging from millions to tens of billions of rows.
Ray's Galactic Tech
