Upgrading WanFang Academic Paper Retrieval System with PaddleNLP
WanFang upgraded its academic paper retrieval system with PaddleNLP's Chinese pre‑trained Sentence‑BERT models: it trained on weakly supervised SimCSE data, indexed document vectors in Milvus, and compressed the transformer for TensorRT‑accelerated inference. The result was a 70% improvement in matching quality and 2,600 QPS with low latency.
At the beginning of the new academic year, many students face the challenge of literature search and plagiarism checking for their theses. WanFang Data Knowledge Service Platform, which aggregates billions of high‑quality knowledge resources, aims to improve its paper retrieval system by leveraging Baidu's PaddlePaddle PaddleNLP.
Business background: The core problem of WanFang's retrieval system is large‑scale text matching: given a user query, it must quickly find similar documents among billions of records, and the relevance ranking directly affects user experience. The key difficulties are scarce labeled data, accurate semantic similarity calculation, and low latency under massive query volume.
Technical selection and practice: The team adopted PaddleNLP's Chinese pre‑trained models and deployed services with Paddle Serving. Weakly supervised training data were generated using high‑quality Chinese word embeddings and SimCSE, while supervised signals were mined from user behavior logs. For the matching model, Sentence‑BERT (BERT fine‑tuned in a twin‑tower setup) was chosen over FastText, word2vec, and similar baselines, yielding a 70% improvement in matching quality.
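The article doesn't include training code, but the SimCSE idea it relies on is compact: encode each sentence twice (two dropout masks), treat the pair as a positive, and use every other sentence in the batch as a negative. A minimal numpy sketch of that in-batch contrastive objective follows; the function name and the temperature of 0.05 are illustrative choices, not taken from WanFang's implementation.

```python
import numpy as np

def simcse_loss(emb_a: np.ndarray, emb_b: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss over two views of a batch.

    emb_a[i] and emb_b[i] are two embeddings of sentence i (e.g. two
    dropout-perturbed forward passes in unsupervised SimCSE); every
    other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature                     # (batch, batch) similarities
    # Cross-entropy with the diagonal (the true pair) as the positive class.
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))
```

In practice the embeddings would come from the twin-tower encoder; the loss drives paired views together and unrelated sentences apart without any manual labels, which is what makes the weakly supervised setup workable.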
Document vectors were pre‑computed with Sentence‑BERT and indexed in the open‑source vector database Milvus, enabling fast similarity recall. Model inference was accelerated by compressing the 12‑layer transformer to 6 layers, applying TensorRT and Paddle Inference, and serving through Paddle Serving, reaching 2600 QPS without accuracy loss.
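To make the recall step concrete, here is a brute-force stand-in for what Milvus does at scale with ANN indexes (e.g. IVF or HNSW): score a query vector against pre-computed document vectors by cosine similarity and return the top-k. The function name and shapes are hypothetical; a production system would call the Milvus client rather than scan vectors in memory.

```python
import numpy as np

def top_k_recall(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Brute-force cosine top-k over pre-computed document vectors.

    A toy stand-in for ANN search in a vector database like Milvus.
    Returns (indices, scores) of the k most similar documents.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    top = np.argsort(-scores)[:k]        # indices of the k best matches
    return top, scores[top]
```

Pre-computing document embeddings offline means the online path only has to embed the query and run this similarity search, which is what keeps latency low at billion-record scale.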
Overall solution: The architecture consists of three parts – data construction, model selection, and industrial deployment. Data construction leverages massive unsupervised corpora and weak supervision via SimCSE. Model selection uses domain‑adapted pre‑training and R‑Drop data augmentation for ranking. Semantic indexing combines unsupervised (SimCSE) and supervised strategies to improve recall even when labeled data are limited.
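The R-Drop augmentation mentioned above regularizes the ranker by running the same input through the model twice with different dropout masks and penalizing the divergence between the two output distributions. A small numpy sketch of that symmetric-KL regularizer, with illustrative names, follows; in training it is added to the ordinary task loss with a weighting coefficient.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def rdrop_regularizer(logits1: np.ndarray, logits2: np.ndarray) -> float:
    """Symmetric KL divergence between two forward passes of the same
    batch under different dropout masks -- the extra term R-Drop adds
    on top of the task loss to keep the two predictions consistent."""
    p, q = softmax(logits1), softmax(logits2)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    kl_qp = np.sum(q * (np.log(q) - np.log(p)), axis=-1)
    return float(np.mean(kl_pq + kl_qp) / 2.0)
```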
For high‑performance online deployment, the system integrates FasterTransformer and provides a simple Python API for rapid model rollout.
The technical team invites interested users to follow PaddleNLP, star the GitHub repository, and join a live broadcast on September 14 (19:00‑20:00) for a deeper discussion and Q&A.
Baidu Geek Talk