
Optimizing Ctrip Hotel Search System: Storage, Intelligent Query, Error Correction, and DSL Design

This article details how Ctrip's hotel search system was optimized through storage compression, spatial indexing, KV storage, semantic query generation, context‑aware error correction, and a custom domain‑specific language, balancing performance, flexibility, and user experience for large‑scale online travel services.

Ctrip Technology

Overview: As Ctrip's online travel business grows, the hotel search system—built on Lucene and similar to Solr—faces increasing data volume and higher user expectations. The article describes four optimization areas: storage, intelligent query, error correction, and a redesigned search DSL.

1. Storage Optimization

1.1 Data Compression

In Lucene 8, long-type fields are automatically compressed. Unifying all integer fields to long lets every numeric field benefit from this compression, reducing memory and disk usage and lowering operational costs.

1.2 Spatial Index

PointValues replace GeoHash for geographic queries, using a kd‑tree to accelerate point‑in‑polygon filtering.
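Lucene's PointValues organize coordinates in a BKD (kd-tree) structure, so whole leaf cells can be accepted or rejected against the polygon at once; only points in cells that cross the polygon boundary need an exact test. The sketch below shows that per-point test (standard ray casting). The class and method names are illustrative, not Lucene's.

```java
// Illustrative point-in-polygon test via ray casting (even-odd rule).
// A kd-tree prunes whole cells first; a test like this only runs for
// points in cells that straddle the polygon boundary.
public class PointInPolygon {
    // polyX/polyY hold the polygon vertices in order; returns true
    // if (x, y) lies inside the polygon.
    public static boolean contains(double[] polyX, double[] polyY, double x, double y) {
        boolean inside = false;
        int n = polyX.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            // Does the edge (j -> i) cross the horizontal ray from (x, y)?
            if ((polyY[i] > y) != (polyY[j] > y)
                    && x < (polyX[j] - polyX[i]) * (y - polyY[i]) / (polyY[j] - polyY[i]) + polyX[i]) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```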

Pitfalls

PointValues perform well for 2‑D searches but can be slower than inverted indexes for 1‑D data due to decompression overhead. High‑dimensional searches (e.g., word2vec vectors) also degrade performance.

1.3 KV Storage

Search requires both inverted and forward indexes. Lucene's FieldCache (used by early Solr) was superseded by DocValues in Lucene 4, whose access API became iterator-based (DISI) in Lucene 7. In Lucene 8, DocValues added a jump table for faster random access.

To handle billions of hotel‑POI association records, a custom embedded Java KV store was built on memory-mapped direct buffers (MappedByteBuffer). It stores data on disk in a compacted B‑tree, offering logarithmic query time and linear merge time while keeping heap pressure low.
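The article gives no implementation details of that store; the sketch below only illustrates its core primitive, a read-only lookup over a memory-mapped file (java.nio) of sorted fixed-width records, binary-searched without copying data onto the Java heap. The 16-byte (long key, long value) record layout and class name are assumptions.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal read-only KV lookup over a memory-mapped file of sorted
// (long key, long value) pairs, 16 bytes per record. The OS page cache
// serves reads; nothing is loaded onto the heap. A real store holding
// billions of records would shard across multiple mappings, since one
// MappedByteBuffer is limited to 2 GB.
public class MappedKv {
    private final MappedByteBuffer buf;
    private final int count;

    public MappedKv(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            count = (int) (ch.size() / 16);
        }
    }

    // Binary search by key; returns the value, or null if absent.
    public Long get(long key) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long k = buf.getLong(mid * 16);
            if (k == key) return buf.getLong(mid * 16 + 8);
            if (k < key) lo = mid + 1; else hi = mid - 1;
        }
        return null;
    }
}
```

A read-only mapping like this also sidesteps the dirty-page flush problem mentioned below: pages that are never written are never flushed.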

Pitfalls

DocValues still load metadata into memory, causing thread contention; they enforce sequential docid access, which can be mitigated by caching field values or modifying Lucene source. Mapped buffers can trigger OS‑level flushes when dirty pages exceed thresholds; using read‑only append‑only files avoids this.

2. Intelligent Query

Simple text recall is insufficient; the system now performs semantic analysis to understand user intent.

2.1 Semantic Query Generation Process

The process has three steps:

1) Entity annotation tags each token with its type and ID.
2) Core-semantics extraction discards hierarchical location tokens (e.g., "Zhejiang" and "Hangzhou"), keeping only the essential entities (e.g., "West Lake" and "Hilton").
3) Rule-based generation produces a query that prioritizes hotels near West Lake belonging to the Hilton brand.
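A toy rendering of those steps, with hard-coded dictionaries standing in for the real annotator and knowledge graph (token types, the hierarchy map, and the query syntax are all illustrative):

```java
import java.util.*;

// Toy semantic query generation: annotate tokens, drop location tokens
// that are ancestors of a more specific entity, and emit a structured
// query. The dictionaries are stand-ins for a real entity knowledge graph.
public class SemanticQuery {
    static final Map<String, String> TYPE = Map.of(
            "zhejiang", "PROVINCE", "hangzhou", "CITY",
            "west lake", "POI", "hilton", "BRAND");
    // child -> parent in the location hierarchy
    static final Map<String, String> PARENT = Map.of(
            "hangzhou", "zhejiang", "west lake", "hangzhou");

    public static String generate(List<String> tokens) {
        // Step 2: collect every token that is an ancestor of another token.
        Set<String> ancestors = new HashSet<>();
        for (String t : tokens) {
            for (String p = PARENT.get(t); p != null; p = PARENT.get(p)) ancestors.add(p);
        }
        // Step 3: keep only the essential entities and emit query clauses.
        StringBuilder q = new StringBuilder();
        for (String t : tokens) {
            if (ancestors.contains(t)) continue;   // discards "zhejiang", "hangzhou"
            q.append(TYPE.get(t).equals("POI") ? "near:" : "brand:").append(t).append(' ');
        }
        return q.toString().trim();
    }
}
```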

2.2 Common Semantic Algorithms

2.2.1 Context‑Free Grammar (CFG)

Pros: fast automaton conversion. Cons: rigid rules, unsuitable for flexible natural language.

2.2.2 Dependency Parsing

Connects head words with their dependents; supports non‑projective structures but can be exponential in the worst case.

2.2.3 Simplified Dependency for Hotel Suggestion Engine

Uses knowledge‑graph‑driven buckets to group entities, reducing complexity and avoiding combinatorial explosion.

3. Intelligent Error Correction

Lucene’s built‑in n‑gram based spelling correction is enhanced with context awareness and richer dictionaries to approach Bing/Google‑level performance.

3.1 Locality Sensitive Hashing (LSH)

Introduces additional hash buckets (e.g., keyed on word length) and tunes the n‑gram parameters to improve both recall and precision for large vocabularies.
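A sketch of the bucketing idea: words are indexed under (length-bucket, n-gram) keys, so a lookup only scans candidates of similar length that share at least one n-gram with the query. The parameters (n = 2, length-bucket width 2) are illustrative, not Ctrip's.

```java
import java.util.*;

// Candidate retrieval for spelling correction: index each word under
// composite (length-bucket, bigram) keys so that a query only pulls in
// words of similar length sharing at least one bigram with it.
public class NgramIndex {
    private final Map<String, Set<String>> buckets = new HashMap<>();

    private static List<String> keys(String w) {
        List<String> out = new ArrayList<>();
        int lenBucket = w.length() / 2;      // coarse length hash
        for (int i = 0; i + 2 <= w.length(); i++) {
            out.add(lenBucket + ":" + w.substring(i, i + 2));
        }
        return out;
    }

    public void add(String word) {
        for (String k : keys(word)) {
            buckets.computeIfAbsent(k, x -> new HashSet<>()).add(word);
        }
    }

    public Set<String> candidates(String query) {
        Set<String> out = new HashSet<>();
        for (String k : keys(query)) out.addAll(buckets.getOrDefault(k, Set.of()));
        return out;
    }
}
```

Only the surviving candidates go on to the expensive edit-distance ranking, which is where the precision gain for large vocabularies comes from.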

3.2 Contextual Correction

Considers surrounding tokens, enabling correction of phrases like "southcoase" to "south coast" and handling missing or extra spaces.
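One way to recover a missing space, sketched below with a tiny hard-coded vocabulary and a plain Levenshtein helper: try every split point and keep the split whose halves are cheapest to correct. The real system would score candidates against surrounding context rather than take the cheapest dictionary hit.

```java
import java.util.*;

// Missing-space correction: try every split point, fuzzily match each
// half against the vocabulary, and keep the cheapest total edit cost,
// so "southcoase" becomes "south coast".
public class SpaceSplit {
    static final Set<String> VOCAB = Set.of("south", "coast", "north", "beach");

    public static String correct(String input) {
        String best = input;
        int bestCost = Integer.MAX_VALUE;
        for (int i = 1; i < input.length(); i++) {
            String l = nearest(input.substring(0, i)), r = nearest(input.substring(i));
            int cost = dist(input.substring(0, i), l) + dist(input.substring(i), r);
            if (cost < bestCost) { bestCost = cost; best = l + " " + r; }
        }
        return best;
    }

    static String nearest(String w) {
        return VOCAB.stream().min(Comparator.comparingInt(v -> dist(w, v))).orElse(w);
    }

    // Plain Levenshtein edit distance.
    static int dist(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }
}
```

The extra-space case is the mirror image: join adjacent tokens and check whether the concatenation matches the vocabulary more cheaply than the parts do.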

3.3 Enhanced Edit Distance

Extends Levenshtein to a 3×3 window (a second-order edit distance) to detect adjacent character swaps, doubled characters, and similar errors, much like Damerau‑Levenshtein. Higher‑order extensions increase accuracy but also computational cost.
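The swap-detection part corresponds to the adjacent-transposition case of the restricted Damerau-Levenshtein (optimal string alignment) distance, sketched here:

```java
// Restricted Damerau-Levenshtein (optimal string alignment): plain
// Levenshtein plus one extra case for swapping two adjacent characters,
// so a transposition like "hitlon" -> "hilton" costs 1 instead of 2.
public class DamerauLevenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + sub);
                // The extra case vs. plain Levenshtein: adjacent swap.
                if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }
}
```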

4. Search DSL

A domain‑specific language is designed to replace Lucene’s limited query syntax, aiming for SQL‑like familiarity, high performance, polymorphism, geographic capabilities, security, and business‑process description.

4.1 Design Considerations

4.1.1 Lower Learning Cost

Align syntax with SQL to leverage developers’ existing knowledge.

4.1.2 High‑Performance Scenarios

Use primitive types (long, double), columnar DocValues, and remove cost‑based optimizers in favor of rule‑based execution.

4.1.3 Polymorphism

Support function and operator overloading (e.g., max returns same type as input; "+" concatenates strings when appropriate).
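A sketch of how such an interpreter might dispatch a single "+" node at evaluation time based on operand types; the boxed-Object representation and method name are assumptions, not the actual DSL implementation:

```java
// Toy polymorphic "+" for a DSL interpreter: the operator inspects
// operand types at evaluation time and dispatches to string
// concatenation, long addition, or widened double addition.
public class PlusOp {
    public static Object plus(Object l, Object r) {
        if (l instanceof String && r instanceof String)
            return (String) l + (String) r;                 // string concatenation
        if (l instanceof Long && r instanceof Long)
            return (Long) l + (Long) r;                     // result stays a long
        if (l instanceof Number && r instanceof Number)
            return ((Number) l).doubleValue() + ((Number) r).doubleValue(); // widen
        throw new IllegalArgumentException("no overload of + for these operand types");
    }
}
```

Keeping long + long in the long domain, rather than widening everything to double, matches the high-performance goal above: primitive types stay primitive until mixing forces a conversion.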

4.1.4 Security

Enable parameterized queries to prevent script injection, with support for expression‑level parameters.

4.1.5 Business Process Description

Allow doc‑independent expressions for rule engines, similar to PL/SQL stored procedures.

Pitfalls

Avoid early commitment to a rigid two‑stage lexer‑parser architecture; a hand‑written recursive‑descent parser is easier to modify as the language evolves.
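To make the recursive-descent recommendation concrete, here is a minimal parser for a hypothetical SQL-like condition grammar (comparisons, AND/OR with the usual precedence, parentheses); the grammar, tokenization, and prefix output format are all illustrative:

```java
import java.util.*;

// Recursive-descent parser for a tiny SQL-like condition grammar:
//   expr   := term ("OR" term)*
//   term   := factor ("AND" factor)*
//   factor := "(" expr ")" | IDENT OP VALUE
// Emits a prefix-notation string so the tree shape is visible.
// Extending the grammar means adding or editing a single method,
// which is why this style is easy to evolve.
public class CondParser {
    private final List<String> toks;
    private int pos;

    public CondParser(String s) {
        toks = new ArrayList<>(Arrays.asList(
                s.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+")));
    }

    public String parse() { return expr(); }

    private String expr() {
        String left = term();
        while (pos < toks.size() && toks.get(pos).equals("OR")) {
            pos++;
            left = "(OR " + left + " " + term() + ")";
        }
        return left;
    }

    private String term() {
        String left = factor();
        while (pos < toks.size() && toks.get(pos).equals("AND")) {
            pos++;
            left = "(AND " + left + " " + factor() + ")";
        }
        return left;
    }

    private String factor() {
        if (toks.get(pos).equals("(")) {
            pos++;                       // consume "("
            String e = expr();
            pos++;                       // consume ")"
            return e;
        }
        // IDENT OP VALUE, e.g.  price < 300
        String c = "(" + toks.get(pos + 1) + " " + toks.get(pos) + " " + toks.get(pos + 2) + ")";
        pos += 3;
        return c;
    }
}
```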

Conclusion

The search engine balances CPU‑intensive computation with low latency, requiring both intelligent features and performance. Leveraging mature products (Lucene, Elasticsearch) and storage engines (HBase, LevelDB, RocksDB) while understanding their underlying principles enables flexible, high‑performance solutions.

Tags: DSL · Search Engine · Lucene · backend optimization · error correction · semantic query