How to Perform Fuzzy Search on Encrypted Data
This article examines the challenges of fuzzy searching encrypted data and compares three implementation approaches—naïve, conventional, and advanced—detailing their principles, performance implications, storage costs, and security trade‑offs, ultimately recommending the conventional token‑based method for most practical applications.
We know that encrypted data is not friendly to fuzzy queries; this article explores ideas for implementing fuzzy search on encrypted data.
For data security, important fields such as passwords, phone numbers, addresses, bank cards are often stored encrypted. Passwords typically use irreversible slow hash algorithms to resist brute‑force attacks.
When retrieving data we can match ciphertext directly, but for fields like phone numbers we need to view the original value and support fuzzy search, so we examine methods for reversible encrypted data.
Many online posts discuss “fuzzy search on encrypted data”, some of which are unreliable; we categorize three types of solutions—naïve, conventional, and advanced—and discuss their pros and cons.
How to Perform Fuzzy Search on Encrypted Data
Naïve Approaches
Load all data into memory, decrypt it, and perform fuzzy matching in the application.
Create a plaintext mapping table (often called a tag table) and query the tags to locate the ciphertext.
Naïve One
For small data sets this can work, but memory consumption grows quickly. For example, the phone number 13800138000 encrypted with DES becomes HE9T75xNx6c5yLmS5l4r6Q== , which occupies 24 bytes.
Count
Bytes
MB
1,000,000
24,000,000
22.89
10,000,000
240,000,000
228.89
100,000,000
2,400,000,000
2,288.89
With hundreds of megabytes to several gigabytes of memory required, this approach easily leads to out‑of‑memory errors for large data sets.
Naïve Two
Creating a plaintext mapping table defeats the purpose of encryption and introduces security and maintenance problems, so this method is strongly discouraged.
Conventional Approaches
Implement encryption/decryption functions in the database and use a query such as decode(key) like '%partial%' .
Tokenize the ciphertext, encrypt each token, store the encrypted tokens in an extra column, and query with key like '%partial%' .
Conventional One
Embedding the same encryption algorithm in the database and modifying fuzzy‑search conditions to decrypt before matching has low implementation cost, but it cannot leverage indexes and may suffer from algorithm mismatches between the application and the database.
If performance requirements are modest and security needs are average, standard algorithms like AES or DES are acceptable.
Conventional Two
This method splits a field into fixed‑length groups (e.g., four English characters or two Chinese characters), encrypts each group, and stores the encrypted groups in an auxiliary column. Queries use key like "%partial%" to match the encrypted tokens.
ningyu1 can be tokenized as ning , ingy , ngyu , gyu1 , etc.
When searching for a substring such as “ingy”, the encrypted token is matched with key like "%partial%" . Encryption typically expands data size; using DES, the 11‑byte plaintext 13800138000 becomes a 24‑byte ciphertext, a 2.18× increase.
The token length must be at least four English characters (or two Chinese characters); shorter tokens increase storage overhead and reduce security.
Several e‑commerce platforms (Taobao, Alibaba, Pinduoduo, JD) use similar schemes for encrypted fields while supporting fuzzy search.
Taobao encrypted field search: https://open.taobao.com/docV3.htm?docId=106213&docType=1
Alibaba encrypted field search: https://jaq-doc.alibaba.com/docs/doc.htm?treeId=1&articleId=106213&docType=1
Pinduoduo encrypted field search: https://open.pinduoduo.com/application/document/browse?idStr=3407B605226E77F2
JD encrypted field search: https://jos.jd.com/commondoc?listId=345
This approach balances implementation complexity, storage cost, and the ability to use database indexes, making it a practical choice.
Advanced Approaches
These solutions involve deeper algorithmic research, such as designing new encryption schemes that preserve order or using specialized structures like Bloom filters. References include Hill cipher‑based fuzzy encryption (FMES), BloomFilter‑enhanced encrypted search, and encrypted fuzzy search mechanisms based on Lucene.
Database fuzzy matching encryption methods: https://www.jiamisoft.com/blog/6542-zifushujumohupipeijiamifangfa.html
Hill cipher and FMES: see above link.
BloomFilter‑based encrypted fuzzy search: http://kzyjc.cnjournals.com/html/2019/1/20190112.htm
Fast query encrypted databases: https://www.jiamisoft.com/blog/5961-kuaisuchaxunshujukujiami.html
Lucene‑based cloud search on ciphertext: https://www.cnblogs.com/arthurqin/p/6307153.html
Verifiable fuzzy query encryption in cloud storage: http://jeit.ie.ac.cn/fileDZYXXXB/journal/article/dzyxxxb/2017/7/PDF/160971.pdf
The Lucene‑based idea is similar to Conventional Two, but stores encrypted tokens in a search engine rather than a relational database.
Summary
We have reviewed naïve, conventional, and advanced methods for fuzzy searching encrypted data. Naïve methods are discouraged; conventional methods, especially the token‑based approach, are recommended for most scenarios due to their reasonable trade‑off between cost, performance, and security.
Overall, the second conventional method offers the best balance of implementation effort, storage overhead, and query efficiency.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.