Emoji Search at iQIYI Douya: From Elasticsearch to Lucene and Semantic Retrieval
iQIYI Douya’s emoji search evolved from Elasticsearch to a pure Lucene implementation and added semantic vector retrieval, enabling fast, scalable, and more accurate text‑based search of AI‑generated images for small‑to‑medium‑sized businesses by combining custom tokenization, dense embeddings, and hybrid ranking.
With the rapid development of the Internet, emojis have become an essential communication tool. iQIYI Douya’s emoji search product evolved from Elasticsearch to Lucene and finally incorporated semantic retrieval, providing practical insights for vertical‑domain search at small‑to‑medium‑sized businesses.
The core function of Douya emoji search is text‑based retrieval of expressive images generated by AI algorithms from UGC/PGC sources. Typical query types include entity names (e.g., celebrity or movie titles), colloquial emotion or action phrases (e.g., “happy”, “hug”), combined entity‑action phrases, popular memes, and full sentences.
Initially, the system relied on Elasticsearch (ES), a distributed near‑real‑time search engine built on Lucene. An ES index is divided into shards, each composed of segments; new documents are buffered in memory and flushed to new segments, enabling fast indexing and search. Each image document contains three groups of fields: metadata (size, resolution, CDN URL), operational info (source, ingestion time, audit status), and tag information (captions, characters, emotions, actions, categories). Boolean queries filter by business rules, while function_score custom scoring adjusts the weights of different fields.
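To make the filter-plus-weighting pattern concrete, here is a hedged sketch of the kind of function_score query body this stage could use, written as a Python dict of the ES query DSL. The field names, boosts, and weights are illustrative assumptions, not Douya's actual schema.

```python
def build_emoji_query(text: str) -> dict:
    """Build an illustrative Elasticsearch function_score query body.

    Field names ("caption", "emotion", "audit_status", ...) are hypothetical
    stand-ins for the three field groups described in the article.
    """
    return {
        "query": {
            "function_score": {
                # Base relevance: match the query text against tag fields,
                # with per-field boosts (caption weighted highest).
                "query": {
                    "bool": {
                        "must": [
                            {"multi_match": {
                                "query": text,
                                "fields": ["caption^3", "emotion^2", "action"],
                            }}
                        ],
                        # Business-rule filters; they constrain results
                        # without contributing to the relevance score.
                        "filter": [{"term": {"audit_status": "approved"}}],
                    }
                },
                # Extra boost when an entity tag matches the query exactly.
                "functions": [
                    {"filter": {"term": {"characters": text}}, "weight": 2.0}
                ],
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }
```

In practice a body like this would be sent via an ES client (for example `Elasticsearch.search(index=..., body=...)` in the Python client), with the filters keeping unaudited images out of the scored set.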
To improve scalability, iQIYI combined ES with HBase as the master image repository. HBase stores images at every processing stage, while ES holds only a deduplicated, audited subset for online search. Separate ES clusters for production and operations reduce index size (from tens of millions to ~1 million documents) and keep index files in the OS cache, improving latency and stability.
As query volume and latency requirements grew, the ES‑based service reached its limits. The team migrated to a pure Lucene implementation, abandoning the ES HTTP API in favor of direct Java library calls. Lucene’s analyzer framework allowed custom tokenization (HanLP + Bi‑MM) and fine‑grained control over scoring, while eliminating the need for distributed shard management.
Key reasons for switching to Lucene were:
Search does not require near‑real‑time index updates; read‑only indexes of a few hundred megabytes can be rebuilt offline daily in under ten minutes.
Indexes of this size fit comfortably in a single machine’s memory, removing the overhead of sharding.
Isolation between business units enables containerized deployment and easy horizontal or cross‑region scaling.
Custom tokenization, weighting, and ranking become straightforward.
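The tokenization pipeline pairs HanLP with bidirectional maximum matching (Bi‑MM). As a minimal sketch, here is a generic dictionary‑based Bi‑MM segmenter; the vocabulary, maximum word length, and tie‑breaking rule are illustrative assumptions, not HanLP's actual implementation.

```python
def forward_mm(text, vocab, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary word
    starting at each position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def backward_mm(text, vocab, max_len=4):
    """Backward maximum matching: same idea, scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in vocab or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return tokens[::-1]

def bi_mm(text, vocab, max_len=4):
    """Bi-MM: run both directions and keep the segmentation with fewer
    tokens, breaking ties by fewer single-character tokens (backward wins
    further ties, a common heuristic for Chinese)."""
    fwd = forward_mm(text, vocab, max_len)
    bwd = backward_mm(text, vocab, max_len)
    cost = lambda toks: (len(toks), sum(1 for t in toks if len(t) == 1))
    return bwd if cost(bwd) <= cost(fwd) else fwd
```

For example, with a toy vocabulary containing "开心" (happy), `bi_mm("很开心", vocab)` splits the phrase into "很" + "开心" rather than three single characters.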
Semantic recall was introduced to handle colloquial phrases that TF‑IDF/BM25 struggle with (e.g., “兴高采烈” vs. “眉开眼笑”, two near‑synonymous idioms for “elated”). The approach encodes short text (a user query or image tag) into a dense vector using sentence embeddings. Word vectors from pre‑trained models (Word2Vec/FastText, BERT, etc.) are averaged to obtain fixed‑length sentence vectors, which are indexed with ANN libraries such as Annoy or Faiss for millisecond‑scale nearest‑neighbor search.
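The averaging-plus-nearest-neighbor idea can be sketched in pure Python with toy 2‑D vectors; in production, real pre‑trained embeddings and an ANN library such as Annoy or Faiss replace the brute‑force scan shown here.

```python
import math

def embed(tokens, word_vecs, dim):
    """Average pre-trained word vectors into one fixed-length sentence vector.
    Out-of-vocabulary tokens are skipped; an all-OOV input yields a zero vector."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    if not vecs:
        return [0.0] * dim
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity; zero vectors score 0 by convention."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_n(query_vec, tag_index, n=5):
    """Brute-force nearest neighbours over {tag: vector}; an ANN index
    would answer the same query in roughly O(log N) instead of O(N)."""
    ranked = sorted(tag_index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [tag for tag, _ in ranked[:n]]
```

With toy vectors where "joyful" lies close to "happy" and "sad" points the opposite way, a query embedding for "happy" retrieves "joyful" first, which is exactly the behavior keyword matching cannot provide.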
Douya adopts a simple pipeline: tokenization + Bi‑MM, average of pre‑trained word vectors, and a custom counter‑fitted word‑vector fine‑tuning to separate antonyms (e.g., “sad” vs. “happy”). The vector index (≈8 GB) is memory‑mapped, allowing multiple processes to share it efficiently. The semantic service, written in Python, exposes a gRPC API; average query latency is ~15 ms with ~500 QPS per node.
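The memory‑mapping trick that lets multiple processes share one vector index can be sketched with the standard library alone. The flat file layout and tiny dimensionality here are illustrative assumptions; the real index is roughly 8 GB of float32 vectors, and a long‑lived service would keep the mapping open rather than reopening it per lookup.

```python
import array
import mmap

DIM = 4  # toy dimensionality for illustration

def write_index(path, vectors):
    """Serialize float32 vectors to a flat binary file, one row per vector."""
    with open(path, "wb") as f:
        for vec in vectors:
            array.array("f", vec).tofile(f)

def load_vector(path, row, dim=DIM):
    """Read one vector through mmap. The OS page cache backs the mapping,
    so multiple processes mapping the same file share physical memory."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        offset = row * dim * 4  # 4 bytes per float32
        vec = array.array("f")
        vec.frombytes(mm[offset:offset + dim * 4])
        mm.close()
        return list(vec)
```

Because pages are loaded lazily on access, a worker that only touches a subset of the index never pays for the full 8 GB in resident memory.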
Ranking combines entity‑focused Lucene recall with semantic recall. Entity matches are scored strictly by term length and order; semantic matches retrieve the top‑N most similar tags, which are then re‑checked against Lucene for exact matches. Final scores are normalized and blended with an image‑quality score (derived from source, freshness, and popularity) to produce the final ranking. This hybrid strategy improves NDCG by roughly 20%.
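The normalize‑and‑blend step can be sketched as follows. The min‑max normalization and the specific weights are illustrative assumptions; the article does not specify the exact blending formula.

```python
def normalize(scores):
    """Min-max normalize one recall channel's scores into [0, 1] so that
    Lucene scores and cosine similarities become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def blend(entity_scores, semantic_scores, quality,
          w_entity=0.5, w_semantic=0.3, w_quality=0.2):
    """Blend entity recall, semantic recall, and image quality into one
    ranking. Weights are hypothetical; a missing channel contributes 0."""
    e = normalize(entity_scores)
    s = normalize(semantic_scores)
    docs = set(e) | set(s)
    final = {d: w_entity * e.get(d, 0.0)
                + w_semantic * s.get(d, 0.0)
                + w_quality * quality.get(d, 0.0)
             for d in docs}
    return sorted(final, key=final.get, reverse=True)
```

A document recalled by both channels accumulates score from each, so exact entity matches that are also semantically close naturally rise to the top.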
The “Shen Pei Tu” feature overlays user‑provided text onto retrieved images. Its retrieval pipeline mirrors the semantic recall flow, and the actual text rendering occurs lazily on the CDN when the user accesses the image, preserving API performance.
In summary, ElasticSearch suffices for low‑complexity, low‑throughput scenarios. When higher customization and performance are required, a Lucene‑based architecture with optional semantic vector search offers a scalable, cost‑effective solution. Future work includes better annotation data for sentence encoders, learning‑to‑rank optimization, incremental indexing, and expanding search‑related applications such as personalized recommendation, dialogue‑driven emojis, and visual‑semantic image search.
iQIYI Technical Product Team