Approaches to Fuzzy Search on Encrypted Data
This article examines why encrypted data hinders fuzzy queries and compares three categories of solutions—naïve in‑memory decryption, conventional database‑level techniques, and advanced algorithmic methods—highlighting their trade‑offs in security, performance, and storage overhead.
Encrypted data is inherently unfriendly to fuzzy queries: once a value is ciphertext, a pattern match against the plaintext no longer works. This article explores practical ideas for implementing fuzzy search over encrypted data.
How to Perform Fuzzy Search on Encrypted Data
Three categories of approaches are presented: "silly", "conventional", and "advanced".
Silly Approaches
Load all data into memory, decrypt it, then apply fuzzy matching algorithms.
Create a plaintext tag table that maps to ciphertext and query the tag table for fuzzy matches.
These methods work only for very small datasets; at larger volumes they quickly trigger OutOfMemoryError failures. A table in the original article shows memory consumption ranging from roughly 22 MB for 1 million records to over 2 GB for 100 million records.
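The first "silly" approach can be sketched as follows. This is a minimal illustration, not a recommendation: it assumes a hard-coded AES key and a `fuzzySearch` helper invented here for demonstration, and it decrypts every row into memory before filtering, which is exactly why it collapses at scale.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

// "Silly" approach #1: load every ciphertext, decrypt it in memory,
// then filter with a plain substring match. Only viable for tiny datasets.
public class InMemoryFuzzySearch {
    // Demo-only key; a real system would use proper key management.
    private static final byte[] KEY = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);

    static String encrypt(String plain) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        return Base64.getEncoder().encodeToString(
                cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8)));
    }

    static String decrypt(String base64Cipher) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        return new String(cipher.doFinal(Base64.getDecoder().decode(base64Cipher)),
                StandardCharsets.UTF_8);
    }

    // Decrypt every row, then keep the ones containing the pattern.
    // Memory cost grows linearly with the whole table.
    static List<String> fuzzySearch(List<String> ciphertexts, String pattern) {
        return ciphertexts.stream()
                .map(c -> {
                    try { return decrypt(c); }
                    catch (Exception e) { throw new RuntimeException(e); }
                })
                .filter(p -> p.contains(pattern))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = List.of(encrypt("13812345678"), encrypt("13987654321"));
        System.out.println(fuzzySearch(rows, "1234")); // matches only the first number
    }
}
```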
Conventional Approaches
Two widely used techniques are described:
Implement decryption functions directly in the database and use queries such as decode(key) like '%partial%'.
Tokenize the plaintext, encrypt each token, and store the encrypted tokens in an auxiliary column; at query time, encrypt the search fragment the same way and match it with key like '%partial%' against that column.
Method 1 is easy to implement but cannot leverage database indexes, leading to poor performance. Method 2 adds storage overhead (encrypted tokens increase data size) but enables index‑based searches. Common reversible algorithms like AES and DES are mentioned as suitable choices.
Advanced Approaches
These solutions involve deeper algorithmic research, such as designing new reversible encryption schemes, Bloom-filter-based searchable encryption, and adaptations of the Hill cipher and FMES. Papers on "encrypted fuzzy search using Bloom filters", "cloud-based verifiable fuzzy query encryption", and "Lucene-based encrypted search" are referenced. The Lucene-style method resembles the conventional token-encrypt approach but differs mainly in the storage backend (a relational database versus Elasticsearch).
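To give a flavor of the Bloom-filter direction, here is a toy sketch, not any specific paper's construction: each record stores a Bloom filter of keyed hashes (HMAC) of its n-grams, so a server holding only the filters can test whether a keyword's grams are probably present without ever seeing plaintext. All sizes, the gram length, and the class name are assumptions chosen for illustration.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Toy Bloom-filter searchable index: n-grams are hashed with a keyed HMAC
// so the index leaks no plaintext; membership tests can yield rare false
// positives but never false negatives.
public class BloomSearchIndex {
    private static final int BITS = 1024;   // filter size per record (assumed)
    private static final int HASHES = 3;    // hash rounds per gram (assumed)
    private static final int GRAM = 2;      // gram length (assumed)
    // Demo-only key; a real system would use proper key management.
    private static final byte[] KEY = "secret-hmac-key".getBytes(StandardCharsets.UTF_8);

    static int hashToBit(String token, int round) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(KEY, "HmacSHA256"));
        byte[] d = mac.doFinal((round + ":" + token).getBytes(StandardCharsets.UTF_8));
        // Interpret the first 4 digest bytes as an int, mapped into the filter.
        int v = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
              | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        return Math.floorMod(v, BITS);
    }

    // Built client-side from the plaintext; only this filter is stored remotely.
    static BitSet buildFilter(String plaintext) throws Exception {
        BitSet bs = new BitSet(BITS);
        for (int i = 0; i + GRAM <= plaintext.length(); i++) {
            String gram = plaintext.substring(i, i + GRAM);
            for (int r = 0; r < HASHES; r++) bs.set(hashToBit(gram, r));
        }
        return bs;
    }

    // True if every gram of the keyword is (probably) in the record's filter.
    static boolean mightContain(BitSet filter, String keyword) throws Exception {
        for (int i = 0; i + GRAM <= keyword.length(); i++) {
            String gram = keyword.substring(i, i + GRAM);
            for (int r = 0; r < HASHES; r++)
                if (!filter.get(hashToBit(gram, r))) return false;
        }
        return true;
    }
}
```

Matches flagged by the filter would still be decrypted and verified client-side, since Bloom filters admit false positives.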
Conclusion
Silly methods are discouraged due to security and scalability issues. The conventional token‑encrypt method (method 2) offers a good balance of implementation cost, storage overhead, and query performance, and is recommended for most practical scenarios. Advanced algorithmic approaches may be considered when specialized expertise is available.