Upgrading WanFang Academic Paper Retrieval System with PaddleNLP
WanFang upgraded its academic paper retrieval system with PaddleNLP's Chinese pre‑trained Sentence‑BERT models: it trained on weakly supervised SimCSE data, indexed document vectors in Milvus, and compressed the transformer for TensorRT‑accelerated inference. The result was a 70% improvement in matching quality and 2,600 QPS with low latency.
At the beginning of the new academic year, many students face the challenge of literature search and plagiarism checking for their theses. WanFang Data Knowledge Service Platform, which aggregates billions of high‑quality knowledge resources, aims to improve its paper retrieval system by leveraging Baidu's PaddlePaddle PaddleNLP.
Business background: The core problem of WanFang's retrieval system is large‑scale text matching: given a user query, it must quickly find similar documents among billions of records, and the relevance ranking directly affects user experience. The key difficulties are scarce labeled data, accurate semantic similarity calculation, and low latency under massive query volume.
Technical selection and practice: The team adopted PaddleNLP's Chinese pre‑trained models and deployed services with Paddle Serving. Weakly supervised training data were generated using high‑quality Chinese word embeddings and SimCSE, while supervised signals were mined from user behavior logs. For the matching model, Sentence‑BERT (BERT fine‑tuned in a twin‑tower setup) was chosen over FastText, word2vec, and similar baselines, yielding a 70% improvement in matching quality.
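The article doesn't include training code, but the SimCSE idea it relies on is compact: encode each sentence twice (two dropout masks), treat the pair as a positive, and use every other sentence in the batch as a negative. A minimal numpy sketch of that in-batch contrastive objective follows; the function name and the temperature of 0.05 are illustrative choices, not taken from WanFang's implementation.

```python
import numpy as np

def simcse_loss(emb_a: np.ndarray, emb_b: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss over two views of a batch.

    emb_a[i] and emb_b[i] are two embeddings of sentence i (e.g. two
    dropout-perturbed forward passes in unsupervised SimCSE); every
    other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature                     # (batch, batch) similarities
    # Cross-entropy with the diagonal (the true pair) as the positive class.
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))
```

In practice the embeddings would come from the twin-tower encoder; the loss drives paired views together and unrelated sentences apart without any manual labels, which is what makes the weakly supervised setup workable.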
Document vectors were pre‑computed with Sentence‑BERT and indexed in the open‑source vector database Milvus, enabling fast similarity recall. Model inference was accelerated by compressing the 12‑layer transformer to 6 layers, applying TensorRT and Paddle Inference, and serving through Paddle Serving, reaching 2600 QPS without accuracy loss.
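To make the recall step concrete, here is a brute-force stand-in for what Milvus does at scale with ANN indexes (e.g. IVF or HNSW): score a query vector against pre-computed document vectors by cosine similarity and return the top-k. The function name and shapes are hypothetical; a production system would call the Milvus client rather than scan vectors in memory.

```python
import numpy as np

def top_k_recall(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Brute-force cosine top-k over pre-computed document vectors.

    A toy stand-in for ANN search in a vector database like Milvus.
    Returns (indices, scores) of the k most similar documents.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    top = np.argsort(-scores)[:k]        # indices of the k best matches
    return top, scores[top]
```

Pre-computing document embeddings offline means the online path only has to embed the query and run this similarity search, which is what keeps latency low at billion-record scale.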
Overall solution: The architecture consists of three parts – data construction, model selection, and industrial deployment. Data construction leverages massive unsupervised corpora and weak supervision via SimCSE. Model selection uses domain‑adapted pre‑training and R‑Drop data augmentation for ranking. Semantic indexing combines unsupervised (SimCSE) and supervised strategies to improve recall even when labeled data are limited.
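The R-Drop augmentation mentioned above regularizes the ranker by running the same input through the model twice with different dropout masks and penalizing the divergence between the two output distributions. A small numpy sketch of that symmetric-KL regularizer, with illustrative names, follows; in training it is added to the ordinary task loss with a weighting coefficient.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def rdrop_regularizer(logits1: np.ndarray, logits2: np.ndarray) -> float:
    """Symmetric KL divergence between two forward passes of the same
    batch under different dropout masks -- the extra term R-Drop adds
    on top of the task loss to keep the two predictions consistent."""
    p, q = softmax(logits1), softmax(logits2)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    kl_qp = np.sum(q * (np.log(q) - np.log(p)), axis=-1)
    return float(np.mean(kl_pq + kl_qp) / 2.0)
```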
For high‑performance online deployment, the system integrates FasterTransformer and provides a simple Python API for rapid model rollout.
The technical team invites interested users to follow PaddleNLP, star the GitHub repository, and join a live broadcast on September 14 (19:00‑20:00) for a deeper discussion and Q&A.
Baidu Geek Talk