
Intelligent Content Extraction and Generation Practices on Ctrip's Marco Polo AI Platform

This article details Ctrip's AI‑driven Marco Polo platform, describing how large‑scale NLP pipelines combine extraction, richness evaluation, semantic matching and deep‑learning generation (CopyNet, TA‑seq2seq) to produce high‑quality recommendation reasons across multiple product scenarios.

Ctrip Technology

With the rapid growth of content‑driven user experiences, Ctrip faces the challenge of discovering, extracting, and generating high‑quality textual recommendations from massive, noisy data sources.

The Marco Polo middle‑platform, developed by Ctrip’s AI R&D team, integrates data, algorithm, and application layers; it connects over 50 data sources (billions of records) and runs algorithms on Spark‑based big‑data infrastructure.

Intelligent content extraction follows three stages: preprocessing (sentiment filtering, sensitive‑word detection, spelling correction), content‑richness assessment (measuring information entropy, part‑of‑speech distribution, dependency structures, and product‑specific features via knowledge graphs and NER), and result optimization (deduplication, semantic matching, aesthetic scoring).
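The preprocessing stage can be illustrated with a minimal sketch. The blocklist, sentiment cues, and correction table below are stand-in assumptions; production systems would use trained classifiers and much larger lexicons.

```python
import re
from typing import Optional

SENSITIVE_WORDS = {"scam", "fraud"}       # assumed example blocklist
NEGATIVE_MARKERS = {"terrible", "awful"}  # assumed sentiment cue words
SPELLING_FIXES = {"hotell": "hotel"}      # assumed correction table

def preprocess(sentence: str) -> Optional[str]:
    """Return a cleaned sentence, or None if it should be filtered out."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if any(t in SENSITIVE_WORDS for t in tokens):
        return None  # sensitive-word detection: drop the sentence
    if any(t in NEGATIVE_MARKERS for t in tokens):
        return None  # sentiment filtering: drop negative sentences
    fixed = [SPELLING_FIXES.get(t, t) for t in tokens]  # spelling correction
    return " ".join(fixed)
```

Sentences that survive these filters move on to the richness-assessment stage.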

Richness evaluation is organized into three tiers—sentence‑level, product‑level, and scenario‑level—using statistical metrics, knowledge‑graph‑enhanced feature extraction, and dimension‑scoring models to ensure extracted sentences highlight distinctive product attributes.
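Information entropy is one of the sentence-level statistical signals mentioned above. A minimal sketch (not Ctrip's actual metric) computes Shannon entropy over a sentence's character distribution; repetitive, low-information sentences score near zero, varied ones score higher.

```python
import math
from collections import Counter

def char_entropy(sentence: str) -> float:
    """Shannon entropy (in bits) of the sentence's character distribution.
    One of several possible statistical richness signals."""
    counts = Counter(sentence)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In practice this would be combined with part-of-speech distribution and dependency-structure features rather than used alone.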

Semantic matching employs a two‑phase approach: an unsupervised recall phase that scores candidates by cosine similarity between averaged word vectors, followed by a supervised model (trained with a pointwise or pairwise loss) that re‑ranks candidates on learned relevance, using architectures such as LSTM‑Attention and CNN.
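The unsupervised first phase can be sketched as follows. The tiny 4-dimensional embedding table is an assumption for illustration; a real system would load pretrained word vectors.

```python
import numpy as np

# Assumed toy embeddings; real word vectors would be pretrained.
EMBEDDINGS = {
    "sea":   np.array([0.9, 0.1, 0.0, 0.0]),
    "view":  np.array([0.8, 0.2, 0.1, 0.0]),
    "ocean": np.array([0.9, 0.0, 0.1, 0.0]),
    "cheap": np.array([0.0, 0.0, 0.1, 0.9]),
}

def sentence_vector(tokens):
    """Average the word vectors of known tokens (zero vector if none)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

Candidates ranked highly here are then passed to the supervised re-ranker for the final relevance ordering.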

For generation, Ctrip experiments with CopyNet (seq2seq with copy mechanism) and TA‑seq2seq (topic‑aware neural response generation) to overcome the rigidity of pure extraction, incorporating topic keywords derived from LDA and dual‑probability decoding.
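The dual-probability idea behind the copy mechanism can be sketched numerically: the final word distribution mixes the generator's vocabulary softmax with a copy distribution induced by attention over source tokens, letting out-of-vocabulary source words (e.g. a product name) be emitted directly. All numbers and token names below are illustrative assumptions.

```python
import numpy as np

VOCAB = ["the", "hotel", "is", "great", "<unk>"]

def mix_distributions(p_vocab, attention, source_tokens, p_gen):
    """p(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on source
    positions holding w; copied OOV words extend the output space."""
    out = {w: p_gen * p for w, p in zip(VOCAB, p_vocab)}
    for a, tok in zip(attention, source_tokens):
        out[tok] = out.get(tok, 0.0) + (1.0 - p_gen) * a
    return out

p_vocab = np.array([0.1, 0.4, 0.2, 0.2, 0.1])  # generator's softmax (toy)
attention = np.array([0.7, 0.3])               # attention over source (toy)
dist = mix_distributions(p_vocab, attention, ["seaview", "hotel"], p_gen=0.6)
```

Note how "seaview", absent from the vocabulary, still receives probability mass via the copy path, while "hotel" accumulates mass from both paths.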

The resulting recommendation reasons are deployed in four main scenarios—hotel homepage, short‑highlight carousel, restaurant recommendation, and IM‑based hotel suggestions—demonstrating improved product exposure and reduced operational effort.

Despite successes, challenges remain: occasional generic or syntactically incorrect outputs, lack of CTR‑based validation, scalability limits of deep models on Spark, and the need to better fuse extraction and generation results.

Tags: deep learning, recommendation systems, NLP, text generation, Spark, content extraction
Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.
