
How to Perform Fuzzy Searches on Encrypted Data: Methods, Pros, and Cons

This article examines why encrypted data hampers fuzzy queries, categorizes three implementation approaches—from naïve in‑memory decryption to conventional token‑based indexing and advanced algorithmic schemes—evaluates their performance, storage overhead, and security trade‑offs, and provides practical references.

Architecture Digest

Encryption is widely used to protect sensitive fields such as passwords, phone numbers, and credit‑card details, but traditional encryption makes fuzzy searching difficult. This article explores practical ways to enable fuzzy queries on encrypted values.

Classification of approaches

Naïve approaches

Load all encrypted records into memory, decrypt them, and perform fuzzy matching in application code.

Create a plaintext "tag" table that maps each ciphertext to its original value, then query the tag table.

Using DES as an example, the plaintext 13800138000 (11 bytes) becomes the ciphertext HE9T75xNx6c5yLmS5l4r6Q== (24 bytes). For 1 million records this consumes about 22.9 MB, for 10 million about 228.9 MB, and for 100 million roughly 2.3 GB, quickly leading to out‑of‑memory failures for large datasets.
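The first naïve approach (decrypt everything in application memory, then substring‑match) can be sketched as follows. A trivial XOR‑plus‑Base64 transform stands in for a real cipher such as DES, purely to keep the example self‑contained; the function names and key are illustrative assumptions, not part of the original article.

```python
import base64

KEY = 0x5A  # toy key; a stand-in for a real DES/AES key


def encrypt(plaintext: str) -> str:
    """Toy XOR 'cipher' + Base64, mimicking the ciphertext expansion of DES."""
    xored = bytes(b ^ KEY for b in plaintext.encode())
    return base64.b64encode(xored).decode()


def decrypt(ciphertext: str) -> str:
    xored = base64.b64decode(ciphertext)
    return bytes(b ^ KEY for b in xored).decode()


def fuzzy_search(encrypted_rows, fragment):
    """Naive approach: decrypt every row in memory, then substring-match."""
    return [ct for ct in encrypted_rows if fragment in decrypt(ct)]


rows = [encrypt(p) for p in ["13800138000", "13912345678", "15500001111"]]
hits = fuzzy_search(rows, "0013")
```

Every row must be loaded and decrypted on each query, which is exactly why the memory footprint above scales linearly with the dataset and fails for large tables.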

Conventional approaches

Implement the same encryption/decryption algorithm inside the database and use a condition such as decode(key) like '%partial%' to perform fuzzy matching.
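The database‑side decrypt‑and‑match idea can be demonstrated with SQLite, which lets an application register a custom SQL function; here a hypothetical `decode` function (backed by the same toy XOR stand‑in cipher, not a real DES implementation) is exposed to the query engine.

```python
import base64
import sqlite3

KEY = 0x5A  # toy key standing in for the real cipher key


def encrypt(plaintext: str) -> str:
    return base64.b64encode(bytes(b ^ KEY for b in plaintext.encode())).decode()


def decode(ciphertext: str) -> str:
    """Decrypt function registered with the database engine."""
    return bytes(b ^ KEY for b in base64.b64decode(ciphertext)).decode()


conn = sqlite3.connect(":memory:")
conn.create_function("decode", 1, decode)  # expose decrypt to SQL
conn.execute("CREATE TABLE users (phone TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [(encrypt(p),) for p in ["13800138000", "13912345678"]],
)

rows = conn.execute(
    "SELECT phone FROM users WHERE decode(phone) LIKE '%0013%'"
).fetchall()
```

Note the trade‑off this makes visible: the `decode(...) LIKE` predicate forces a full table scan and decrypts every row inside the database, so no index can be used.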

Tokenize the plaintext into fixed‑length groups (e.g., every 4 English characters or 2 Chinese characters), encrypt each token, store the encrypted tokens in an extra column, and query with key like '%partial%'.

Example of tokenization: the string ningyu1 is split into "ning", "ingy", "ngyu", "gyu1". To find records containing the fragment "ingy", the fragment is encrypted with the same (deterministic) algorithm and the token column is searched with key like '%&lt;encrypted fragment&gt;%'. Encryption expands data size; DES increases length by a factor of about 2.18, so storage cost grows accordingly. The method works well when the search fragment is at least 4 English characters (or 2 Chinese characters); shorter fragments cause token explosion and higher storage overhead.
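The tokenization scheme above can be sketched as follows, again using the toy XOR‑plus‑Base64 transform as a stand‑in for DES (the helper names `tokenize` and `index_value` are illustrative, not from the article). Because each fixed‑length token is encrypted deterministically and independently, an encrypted fragment matches as an exact substring of the stored token column.

```python
import base64

KEY = 0x5A  # toy key standing in for the DES key in the article's example


def encrypt(token: str) -> str:
    return base64.b64encode(bytes(b ^ KEY for b in token.encode())).decode()


def tokenize(plaintext: str, size: int = 4):
    """Sliding 4-character window, as in the 'ningyu1' example."""
    return [plaintext[i:i + size] for i in range(len(plaintext) - size + 1)]


def index_value(plaintext: str) -> str:
    """Value for the extra column: the concatenated encrypted tokens."""
    return "".join(encrypt(t) for t in tokenize(plaintext))


key_column = index_value("ningyu1")
# To find rows containing "ingy", encrypt the fragment and LIKE-match it,
# i.e. SQL: WHERE key LIKE '%<encrypted fragment>%'
found = encrypt("ingy") in key_column
```

In a real system `index_value` would populate the extra database column at write time, and the query side would build the `LIKE` pattern from the encrypted fragment; the minimum fragment length of 4 characters falls directly out of the window size.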

Several e‑commerce platforms (Taobao, Alibaba, Pinduoduo, JD) adopt similar token‑based encrypted search techniques; their documentation links are listed for reference.

Advanced ("super") approaches

These solutions operate at the algorithmic level, often requiring custom designs such as order‑preserving encryption, FMES (Fuzzy Matching Encryption Scheme), Hill cipher adaptations, Bloom‑filter‑based encrypted indexes, or Lucene‑based encrypted search. They aim to keep ciphertext orderable or searchable without excessive storage growth, but they demand deep cryptographic expertise and are typically research‑grade.
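As one concrete flavor of these schemes, a Bloom‑filter‑based encrypted index can be sketched in a few lines: each record's n‑grams are hashed through a keyed HMAC into a bit filter, so the server can test "might this row contain the fragment?" without ever seeing plaintext n‑grams. This is a simplified illustration under assumed parameters (`INDEX_KEY`, filter size, hash count are all hypothetical), not a production design.

```python
import hashlib
import hmac

INDEX_KEY = b"index-key"  # hypothetical secret used only for the index
FILTER_BITS = 256
NUM_HASHES = 3


def _positions(ngram: str):
    """Keyed hash positions: the server sees bit indices, not n-grams."""
    for i in range(NUM_HASHES):
        digest = hmac.new(INDEX_KEY, f"{i}:{ngram}".encode(),
                          hashlib.sha256).digest()
        yield int.from_bytes(digest[:4], "big") % FILTER_BITS


def build_filter(plaintext: str, n: int = 4) -> int:
    """Bloom filter over all n-grams of the plaintext, stored per record."""
    bits = 0
    for i in range(len(plaintext) - n + 1):
        for pos in _positions(plaintext[i:i + n]):
            bits |= 1 << pos
    return bits


def may_contain(bloom: int, fragment: str, n: int = 4) -> bool:
    """True means 'possibly contains' (Bloom filters admit false positives).

    Fragments shorter than n are not supported, mirroring the minimum
    fragment length of the token-based scheme.
    """
    return all(
        all(bloom & (1 << pos) for pos in _positions(fragment[i:i + n]))
        for i in range(len(fragment) - n + 1)
    )


bloom = build_filter("13800138000")
```

The appeal over plain tokenization is compactness (a fixed‑size filter per record instead of a growing token column), at the cost of false positives that must be screened out after decryption.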

Conclusion

Naïve approaches are discouraged except for tiny datasets. The token‑based conventional method (approach 2) offers a balanced trade‑off between security, storage cost, and query performance and is recommended for most practical scenarios. When algorithmic talent is available, exploring advanced schemes can further improve performance and security.

Tags: Security, fuzzy-search, Tokenization
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
