
Upgrading WanFang Academic Paper Retrieval System with PaddleNLP

WanFang upgraded its academic paper retrieval system with PaddleNLP's Chinese pre‑trained Sentence‑BERT models. By generating weakly supervised training data with SimCSE, indexing document vectors in Milvus, and compressing the transformer for TensorRT‑accelerated inference, the team improved matching quality by 70% and reached 2,600 QPS at low latency.

Baidu Geek Talk

At the beginning of the new academic year, many students face the challenge of literature search and plagiarism checking for their theses. WanFang Data Knowledge Service Platform, which aggregates billions of high‑quality knowledge resources, aims to improve its paper retrieval system by leveraging Baidu's PaddlePaddle PaddleNLP.

Business background: The core problem of WanFang's retrieval system is large‑scale text matching. Given a user query, it must quickly find similar documents among billions of records, and the quality of the relevance ranking directly affects user experience. The key difficulties are threefold: labeled data are scarce, semantic similarity must be computed accurately, and massive query volumes must be served at low latency.

Technical selection and practice: The team adopted PaddleNLP's Chinese pre‑trained models and deployed services with Paddle Serving. Weakly supervised training data were generated with high‑quality Chinese word embeddings and SimCSE, while supervised signals were mined from user behavior logs. For the matching model, Sentence‑BERT, a bi‑encoder ("twin‑tower") architecture fine‑tuned from BERT, was chosen over alternatives such as FastText and word2vec, yielding a 70% improvement in matching quality.
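The twin‑tower idea can be sketched independently of any framework: each tower maps a text to a single dense vector, and relevance is the cosine similarity of the two vectors. Below is a minimal illustration in which a hypothetical hashed‑bigram encoder stands in for the fine‑tuned Sentence‑BERT tower; only the interface (text in, vector out) reflects the real system.

```python
import numpy as np

def toy_encode(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical stand-in for a Sentence-BERT tower: hash character
    # bigrams into a fixed-size vector. The production encoder is a
    # fine-tuned transformer; only the text -> dense-vector interface
    # is the same.
    v = np.zeros(dim)
    for a, b in zip(text, text[1:]):
        v[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return v

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

query = toy_encode("graph neural networks for recommendation")
doc_a = toy_encode("recommendation with graph neural networks")
doc_b = toy_encode("protein folding dynamics")
```

Because the two towers encode query and document independently, every document vector can be pre‑computed offline, which is what makes indexed vector recall possible.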

Document vectors were pre‑computed with Sentence‑BERT and indexed in the open‑source vector database Milvus, enabling fast similarity recall. Model inference was accelerated by compressing the 12‑layer transformer to 6 layers, applying TensorRT and Paddle Inference, and serving through Paddle Serving, reaching 2600 QPS without accuracy loss.
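At the recall stage the heavy lifting is a nearest‑neighbour search over the pre‑computed document vectors. The sketch below shows the brute‑force equivalent of that search in NumPy; Milvus replaces the linear scan with an approximate index (e.g. IVF or HNSW) so the same ranking stays fast at billion scale. The sizes and data here are purely illustrative.

```python
import numpy as np

# Illustrative pre-computed document matrix (n_docs x dim), normalized so
# that inner product equals cosine similarity, the quantity being ranked.
rng = np.random.default_rng(42)
doc_vecs = rng.normal(size=(10_000, 128))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def recall_top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    # Brute-force equivalent of the ANN search: score every document and
    # return the ids of the k best, best first.
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

ids = recall_top_k(rng.normal(size=128))
```

In the deployed system these candidate ids would then be passed to the ranking model for fine‑grained relevance scoring.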

Overall solution: The architecture consists of three parts: data construction, model selection, and industrial deployment. Data construction leverages massive unsupervised corpora together with weak supervision via SimCSE. Model selection uses domain‑adapted pre‑training and R‑Drop data augmentation for the ranking model. Semantic indexing combines unsupervised (SimCSE) and supervised strategies so that recall improves even when labeled data are limited.
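The unsupervised SimCSE objective underlying the data‑construction step is simple to state: encode each sentence twice with different dropout masks, treat the resulting pair as positives, and use every other sentence in the batch as a negative. A NumPy sketch of that InfoNCE loss follows; the actual training runs inside PaddleNLP, and this only illustrates the objective being optimized.

```python
import numpy as np

def simcse_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.05) -> float:
    # z1[i] and z2[i] are two dropout-augmented embeddings of sentence i;
    # all other rows of z2 act as in-batch negatives for row i of z1.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / temperature  # (batch, batch) cosine similarities
    # InfoNCE: cross-entropy where the diagonal is the correct class,
    # computed with a stabilized log-sum-exp.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return float(-np.mean(np.diag(log_prob)))
```

The loss is minimized when each sentence's two views are closer to each other than to any other sentence in the batch, which is exactly the property the semantic index needs.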

For high‑performance online deployment, the system integrates FasterTransformer and provides a simple Python API for rapid model rollout.
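Sustaining thousands of QPS on a transformer typically depends on batching concurrent requests into single forward passes. The sketch below illustrates that micro‑batching idea in plain Python; Paddle Serving and FasterTransformer implement it natively in optimized C++/CUDA, and the queue, wait budget, and function names here are illustrative assumptions, not their actual API.

```python
import time
from queue import Queue, Empty

def micro_batcher(request_queue: Queue, infer_fn, max_batch: int = 32,
                  max_wait_ms: float = 5.0):
    # Collect requests until the batch is full or the wait budget expires,
    # then run one batched forward pass. Amortizing the model call over
    # many requests is the core trick behind high-QPS transformer serving.
    batch = []
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=timeout))
        except Empty:
            break
    return infer_fn(batch) if batch else []
```

The wait budget trades a few milliseconds of added latency for much larger batches, which is why throughput and latency targets are usually tuned together.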

The technical team invites interested users to follow PaddleNLP, star the GitHub repository, and join a live broadcast on September 14 (19:00‑20:00) for a deeper discussion and Q&A.

Tags: Model Deployment · semantic search · Text Matching · PaddleNLP · academic retrieval · Sentence-BERT