Databases 12 min read

How to Choose and Use Prefix Indexes for String Columns in MySQL

This article explains why and how to create prefix indexes for string fields in MySQL, compares full‑column and prefix indexes with concrete examples, shows how to pick an optimal prefix length using cardinality statistics, and discusses alternative solutions such as reverse storage and hash fields.

JavaEdge

Jul 26, 2020

How to Choose and Use Prefix Indexes for String Columns in MySQL

Modern web applications often allow login with phone numbers or email addresses, so designing an efficient index for string columns becomes essential. The article uses a MySQL user table (ID, email, …) as a running example.

create table SUser(
    ID bigint unsigned primary key,
    email varchar(64),
    ...
) engine=innodb;

When querying by email without an index, MySQL must perform a full table scan: select f1, f2 from SUser where email='xxx'; MySQL supports prefix indexes, where a part of a string column is indexed. If no length is specified, the index stores the entire string.

Two index definitions are compared:

alter table SUser add index index1(email);
alter table SUser add index index2(email(6));

index1 stores the full email string (illustrated by the first image).

index2 stores only the first 6 bytes of each email (illustrated by the second image).

The prefix index saves space but may increase the number of row scans. The article demonstrates this with a query selecting a specific email:

select id,name,email from SUser where email='[email protected]';

Execution with a full index (index1)

Find the matching entry in the index tree and obtain the primary‑key ID.

Fetch the row by primary key, verify the email, and add it to the result set.

Move to the next index entry; it no longer matches, so stop.

This requires only one primary‑key lookup.

Execution with a prefix index (index2)

Find entries matching the prefix ‘zhangs’; the first match is ID1.

Fetch the row by primary key; the email does not match, so discard it.

Continue to the next index entry (still ‘zhangs’), fetch ID2, verify the full email, and add it to the result.

Repeat until the prefix no longer matches.

This process may require multiple primary‑key lookups (four in the example), illustrating the trade‑off between space savings and query cost.

Choosing an appropriate prefix length

The key is the column’s cardinality: higher cardinality means better discrimination. First, count distinct values for the whole column: select count(distinct email) as L from SUser; Then evaluate distinct counts for various prefix lengths, e.g., 4‑7 bytes:

select count(distinct left(email,4)) as L4,
       count(distinct left(email,5)) as L5,
       count(distinct left(email,6)) as L6,
       count(distinct left(email,7)) as L7
from SUser;

Accept a loss of up to 5 % of distinct values (i.e., keep a prefix whose distinct count is at least 95 % of L). Choose the shortest prefix that satisfies this threshold.

Impact on covering indexes

If a query only requests columns that are fully contained in an index (e.g., select id,email ...), a full index can serve as a covering index, eliminating the need to read the table rows. A prefix index cannot be used as a covering index because it does not store the complete column values, even if the prefix length is long enough.

Alternative solutions for low‑cardinality strings

When the natural prefix has poor discrimination (e.g., ID numbers where the first six digits are identical), two techniques are presented:

Reverse storage

Store the string reversed and index the reversed column. Queries then use the reverse() function:

select field_list from t where id_card = reverse('input_id_card_string');

This leverages the more variable suffix of the original string.

Hash field

Add an integer column to store a checksum (e.g., CRC32) of the original string and index that column:

alter table t add id_card_crc int unsigned, add index(id_card_crc);

Insert statements compute crc32() for the new row. Queries combine the hash lookup with a verification of the full string to avoid false positives:

select field_list from t where id_card_crc = crc32('input_id_card_string') and id_card = 'input_id_card_string';

The hash column occupies only 4 bytes, offering stable query performance despite occasional collisions.

Comparison

Both methods support only equality searches.

Reverse storage adds no extra column but may require additional CPU for the reverse() call.

Hash field adds storage overhead but incurs a cheap crc32() computation and provides near‑constant scan cost.

Summary of indexing options for string columns

Create a full index – simple but consumes more space.

Create a prefix index – saves space but can increase row scans and prevents covering indexes.

Store the string reversed and index the reversed column – helps when the leading characters have low cardinality.

Add a separate hash column and index it – stable performance with modest extra storage; does not support range queries.

Choose the method that best matches the column’s characteristics and the application’s query patterns.

Thought exercise

For a student‑information system where login names follow the pattern [email protected] and the numeric student ID encodes city, school, year, and sequence, the article suggests storing only the year and sequence as a 9‑digit number (e.g., 201100001) in an integer column, effectively using a simple hash that occupies 4 bytes.

Because the table size is modest (under a million rows over 50 years), keeping the original string may be acceptable for simplicity.

Reference: “MySQL 实战 45 讲”.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL Indexing MySQL database optimization Prefix Index

Written by

JavaEdge

First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.