Optimizing Ctrip Hotel Search System: Storage, Intelligent Query, Error Correction, and DSL Design
This article details how Ctrip's hotel search system was optimized through storage compression, spatial indexing, KV storage, semantic query generation, context‑aware error correction, and a custom domain‑specific language, balancing performance, flexibility, and user experience for large‑scale online travel services.
Overview: As Ctrip's online travel business grows, the hotel search system—built on Lucene and similar to Solr—faces increasing data volume and higher user expectations. The article describes four optimization areas: storage, intelligent query, error correction, and a redesigned search DSL.
1. Storage Optimization
1.1 Data Compression
In Lucene 8, long‑type fields are automatically compressed. By unifying integer fields to long, memory and disk usage are reduced, lowering operational costs.
1.2 Spatial Index
PointValues replace GeoHash for geographic queries, using a kd‑tree to accelerate point‑in‑polygon filtering.
Pitfalls
PointValues perform well for 2‑D searches but can be slower than inverted indexes for 1‑D data due to decompression overhead. High‑dimensional searches (e.g., word2vec vectors) also degrade performance.
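The kd‑tree narrows candidates to the polygon's bounding region, but each surviving point still needs an exact point‑in‑polygon check. A minimal sketch of that final filtering step, using the classic ray‑casting test (illustrative only, not Lucene's internal implementation):

```java
// Ray-casting point-in-polygon test: the per-candidate check applied after a
// kd-tree (PointValues) search narrows documents to the polygon's bounding box.
final class PointInPolygon {
    // Polygon vertices given as parallel arrays of latitudes and longitudes.
    static boolean contains(double[] lats, double[] lons, double lat, double lon) {
        boolean inside = false;
        for (int i = 0, j = lats.length - 1; i < lats.length; j = i++) {
            // Count how many polygon edges a horizontal ray from the point crosses:
            // an odd number of crossings means the point is inside.
            if ((lats[i] > lat) != (lats[j] > lat)
                    && lon < (lons[j] - lons[i]) * (lat - lats[i]) / (lats[j] - lats[i]) + lons[i]) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```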
1.3 KV Storage
Search requires both inverted and forward indexes. Lucene's FieldCache (used by early Solr) was superseded by DocValues in Lucene 4, and Lucene 7 moved DocValues to an iterator-based (DISI-style) access model. In Lucene 8, DocValues added jump tables for faster advancing over sparse values.
To handle billions of hotel‑POI association records, a custom embedded Java KV store was built on memory‑mapped direct buffers (MappedByteBuffer). It stores data on disk as a compacted B‑Tree, offering logarithmic lookup time and linear merge time while keeping memory pressure low.
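The core idea can be sketched in a few lines: map a read‑only file of sorted fixed‑width records and binary‑search it, getting logarithmic lookups without heap pressure. This is a minimal illustration only; the production store layers a compacted B‑Tree and merge logic on top, and the record layout here (8‑byte key, 8‑byte value) is an assumption:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal read-only memory-mapped KV lookup: fixed-width (long key, long value)
// records sorted by key, searched in O(log n) directly on the mapped pages.
final class MmapKv {
    private static final int REC = 16; // 8-byte key + 8-byte value
    private final MappedByteBuffer buf;
    private final int count;

    MmapKv(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Read-only mapping: the OS never sees dirty pages, so no flush storms.
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            count = (int) (ch.size() / REC);
        }
    }

    /** Binary search over the mapped records; returns the value or null. */
    Long get(long key) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long k = buf.getLong(mid * REC);
            if (k == key) return buf.getLong(mid * REC + 8);
            if (k < key) lo = mid + 1; else hi = mid - 1;
        }
        return null;
    }
}
```

Because the mapping is read‑only and files are append‑only, lookups never dirty pages, which sidesteps the OS flush issue noted below.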
Pitfalls
DocValues still load metadata into memory, causing thread contention; they enforce sequential docid access, which can be mitigated by caching field values or modifying Lucene source. Mapped buffers can trigger OS‑level flushes when dirty pages exceed thresholds; using read‑only append‑only files avoids this.
2. Intelligent Query
Simple text recall is insufficient; the system now performs semantic analysis to understand user intent.
2.1 Semantic Query Generation Process
1) Entity annotation tags each token with type and ID. 2) Core semantics are extracted by discarding hierarchical location tokens (e.g., "Zhejiang" and "Hangzhou"), keeping only the essential entities (e.g., "West Lake" and "Hilton"). 3) Rules generate a query that prioritizes hotels near West Lake belonging to the Hilton brand.
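The three steps above can be sketched as a small pipeline. The entity types (PROVINCE, CITY, POI, BRAND) and the "drop coarser locations when a POI anchor exists" rule are illustrative assumptions, not the production rule set:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pipeline: annotated tokens -> core entities -> generated query.
final class SemanticQuery {
    record Entity(String text, String type) {}

    // Step 2: if a fine-grained anchor (POI) is present, hierarchical
    // location tokens (PROVINCE/CITY) are redundant and dropped.
    static List<Entity> coreEntities(List<Entity> annotated) {
        boolean hasPoi = annotated.stream().anyMatch(e -> e.type().equals("POI"));
        List<Entity> core = new ArrayList<>();
        for (Entity e : annotated) {
            if (hasPoi && (e.type().equals("PROVINCE") || e.type().equals("CITY"))) continue;
            core.add(e);
        }
        return core;
    }

    // Step 3: rule-based generation — POIs become proximity constraints,
    // brands become filters.
    static String toQuery(List<Entity> core) {
        StringBuilder q = new StringBuilder("hotels");
        for (Entity e : core) {
            if (e.type().equals("POI")) q.append(" NEAR(").append(e.text()).append(")");
            else if (e.type().equals("BRAND")) q.append(" WHERE brand=").append(e.text());
        }
        return q.toString();
    }
}
```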
2.2 Common Semantic Algorithms
2.2.1 Context‑Free Grammar (CFG)
Pros: fast automaton conversion. Cons: rigid rules, unsuitable for flexible natural language.
2.2.2 Dependency Parsing
Connects head words to their dependents and supports non‑projective structures, but exhaustive parsing can become prohibitively expensive in the worst case.
2.2.3 Simplified Dependency for Hotel Suggestion Engine
Uses knowledge‑graph‑driven buckets to group entities, reducing complexity and avoiding combinatorial explosion.
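One way to picture the bucketing: entities are grouped by knowledge‑graph type, and candidate dependency arcs are then drawn between buckets rather than between every token pair, so the search space grows with the number of buckets instead of the number of tokens. A minimal sketch (bucket names and the arc count are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Group entities into knowledge-graph buckets by type; dependency candidates
// are then enumerated per ordered bucket pair, O(B^2) for B buckets, instead
// of per token pair — avoiding combinatorial explosion on long queries.
final class EntityBuckets {
    static Map<String, List<String>> bucket(Map<String, String> entityToType) {
        Map<String, List<String>> buckets = new TreeMap<>();
        entityToType.forEach((entity, type) ->
                buckets.computeIfAbsent(type, t -> new ArrayList<>()).add(entity));
        return buckets;
    }

    /** Candidate dependency arcs: one per ordered pair of distinct buckets. */
    static int candidateArcs(Map<String, List<String>> buckets) {
        int b = buckets.size();
        return b * (b - 1);
    }
}
```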
3. Intelligent Error Correction
Lucene’s built‑in n‑gram based spelling correction is enhanced with context awareness and richer dictionaries to approach Bing/Google‑level performance.
3.1 Locality Sensitive Hashing (LSH)
Introduces additional hash buckets (e.g., keyed on word length) and tunes n‑gram parameters to improve both candidate recall and precision for large vocabularies.
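A toy version of the idea: index each word under composite (length‑band, n‑gram) keys, so a candidate must both share character n‑grams with the query and fall into the same coarse length bucket. The parameters here (bigrams, length band of width 2, shared‑bucket threshold) are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// n-gram bucketing with an extra length-band signal: words of very different
// lengths never share buckets, shrinking the candidate set for large vocabularies.
final class NgramIndex {
    private static final int N = 2; // bigrams
    private final Map<String, Set<String>> buckets = new HashMap<>();

    private static List<String> keys(String word) {
        List<String> ks = new ArrayList<>();
        int band = word.length() / 2; // coarse length bucket
        for (int i = 0; i + N <= word.length(); i++) {
            ks.add(band + ":" + word.substring(i, i + N));
        }
        return ks;
    }

    void add(String word) {
        for (String k : keys(word)) {
            buckets.computeIfAbsent(k, x -> new HashSet<>()).add(word);
        }
    }

    /** Candidates sharing at least minShared (lengthBand, ngram) buckets. */
    Set<String> candidates(String query, int minShared) {
        Map<String, Integer> hits = new HashMap<>();
        for (String k : keys(query)) {
            for (String w : buckets.getOrDefault(k, Set.of())) hits.merge(w, 1, Integer::sum);
        }
        Set<String> out = new TreeSet<>();
        hits.forEach((w, c) -> { if (c >= minShared) out.add(w); });
        return out;
    }
}
```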
3.2 Contextual Correction
Considers surrounding tokens, enabling correction of phrases like "southcoase" to "south coast" and handling missing or extra spaces.
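The missing‑space case alone can be sketched simply: an out‑of‑vocabulary token is split at each position, and a split is accepted when both halves are dictionary words. Per‑fragment typo correction (needed for the "coase" part of the article's example) is deliberately omitted from this sketch:

```java
import java.util.Set;

// Minimal space-error handler: "southcoast" -> "south coast" when both halves
// are known words; unknown or unsplittable tokens pass through unchanged.
final class SpaceCorrector {
    private final Set<String> dict;

    SpaceCorrector(Set<String> dict) { this.dict = dict; }

    /** Returns "left right" if a valid split exists, else the token unchanged. */
    String correct(String token) {
        if (dict.contains(token)) return token; // already a word, nothing to do
        for (int i = 1; i < token.length(); i++) {
            String left = token.substring(0, i), right = token.substring(i);
            if (dict.contains(left) && dict.contains(right)) return left + " " + right;
        }
        return token;
    }
}
```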
3.3 Enhanced Edit Distance
Extends Levenshtein to a 3×3 window (2‑order edit distance) to detect character swaps, double‑typing, and other errors, similar to Damerau‑Levenshtein. Higher‑order extensions increase accuracy but also computational cost.
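The swap‑detection described above corresponds to the optimal‑string‑alignment variant of Damerau‑Levenshtein: the classic Levenshtein dynamic program gains one extra case so an adjacent transposition costs a single edit rather than two. A self‑contained sketch:

```java
// Optimal-string-alignment variant of Damerau-Levenshtein: the standard
// insert/delete/substitute DP plus a transposition case, so an adjacent
// character swap counts as one edit instead of two.
final class EditDistance {
    static int osa(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
                if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        }
        return d[a.length()][b.length()];
    }
}
```

For "hitlon" vs "hilton" this returns 1 (one swap), where plain Levenshtein returns 2 — exactly the gap the wider comparison window closes.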
4. Search DSL
A domain‑specific language is designed to replace Lucene’s limited query syntax, aiming for SQL‑like familiarity, high performance, polymorphism, geographic capabilities, security, and business‑process description.
4.1 Design Considerations
4.1.1 Lower Learning Cost
Align syntax with SQL to leverage developers’ existing knowledge.
4.1.2 High‑Performance Scenarios
Use primitive types (long, double), columnar DocValues, and remove cost‑based optimizers in favor of rule‑based execution.
4.1.3 Polymorphism
Support function and operator overloading (e.g., max returns same type as input; "+" concatenates strings when appropriate).
4.1.4 Security
Enable parameterized queries to prevent script injection, with support for expression‑level parameters.
4.1.5 Business Process Description
Allow doc‑independent expressions for rule engines, similar to PL/SQL stored procedures.
Pitfalls
Avoid early commitment to a two‑stage lexer‑parser architecture; prefer a recursive‑descent automaton for easier future modifications.
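The appeal of recursive descent is that each grammar rule maps to one method, so extending the language is a local edit rather than regenerating a lexer/parser pair. A minimal evaluator for arithmetic expressions illustrates the shape; the grammar here is a stand‑in, not the actual hotel‑search DSL:

```java
// Recursive-descent evaluator: expr / term / factor each correspond to one
// grammar rule, with lexing (space skipping, number scanning) folded inline.
final class Calc {
    private final String src;
    private int pos;

    Calc(String src) { this.src = src; }

    static double eval(String s) { return new Calc(s).expr(); }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (true) {
            if (eat('+')) v += term();
            else if (eat('-')) v -= term();
            else return v;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (true) {
            if (eat('*')) v *= factor();
            else if (eat('/')) v /= factor();
            else return v;
        }
    }

    // factor := number | '(' expr ')'
    private double factor() {
        if (eat('(')) {
            double v = expr();
            eat(')');
            return v;
        }
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) pos++;
        return Double.parseDouble(src.substring(start, pos));
    }

    // Consume the expected character if present (skipping leading spaces).
    private boolean eat(char c) {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }
}
```

Adding, say, a comparison operator or a function call means adding one method and one call site — no separate grammar file to keep in sync.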
Conclusion
The search engine balances CPU‑intensive computation with low latency, requiring both intelligent features and performance. Leveraging mature products (Lucene, Elasticsearch) and storage engines (HBase, LevelDB, RocksDB) while understanding their underlying principles enables flexible, high‑performance solutions.
Ctrip Technology
The official Ctrip Technology account, sharing technical practice and discussion.