Solution Overview for the Scientific Paper Recommendation Matching Competition
Based on a presentation at the China Computer Conference, this article presents a complete solution to a competition that requires matching description paragraphs with the three most relevant papers from a corpus of roughly 200,000 papers. It covers the competition background, task definition, evaluation metric, modeling strategy, and the core algorithms used: SIF sentence embeddings, InferSent, Bi-LSTM, and BERT.
Competition Background
Scientific research generates massive amounts of data, and understanding how papers cite one another can enhance knowledge graphs and question answering systems. The competition provides a corpus of roughly 200,000 papers along with description paragraphs; participants must retrieve the three most relevant papers for each paragraph.
Task Description
Given a description paragraph, participants must match it to three papers. Example paragraph: "An efficient implementation based on BERT[1] and graph neural network (GNN) [2] is introduced.", where the bracketed markers point to the cited papers that must be retrieved.
Evaluation Scheme
Accuracy is measured by Mean Average Precision at 3 (MAP@3): for each description, the precision at every rank k (up to 3) at which a correct paper appears is accumulated and normalized by the number of ground-truth papers (capped at 3), and these per-query scores are averaged over all queries.
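The metric can be made concrete with a small sketch. The helper names are illustrative, and it assumes each description has up to three ground-truth papers; the organizers' exact normalization may differ slightly.

```python
def average_precision_at_3(predicted, relevant):
    """AP@3 for one query: `predicted` is an ordered list of paper ids,
    `relevant` is the set of ground-truth paper ids."""
    score, hits = 0.0, 0
    for k, paper_id in enumerate(predicted[:3], start=1):
        if paper_id in relevant:
            hits += 1
            score += hits / k          # precision at rank k, counted only when a hit occurs
    return score / min(len(relevant), 3)

def map_at_3(all_predictions, all_relevant):
    """MAP@3: mean of AP@3 over all queries."""
    aps = [average_precision_at_3(p, r) for p, r in zip(all_predictions, all_relevant)]
    return sum(aps) / len(aps)

# Example: two queries, three candidate papers predicted for each.
print(map_at_3(
    [["p1", "p7", "p3"], ["p9", "p2", "p4"]],
    [{"p1", "p3", "p5"}, {"p2"}],
))
```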
Task Analysis
The problem resembles a fill-in-the-blank (cloze) task, except that the blanks are filled with sentences (paper titles) rather than single words, and computing similarity between every description and every paper in the 200,000-paper corpus poses a scalability challenge.
Modeling Core Idea
The solution is split into a recall stage and a ranking stage. In the recall stage, two kinds of sentence embeddings are built:
Word2Vec word vectors weighted by TF-IDF (later replaced by Smooth Inverse Frequency weighting).
Pure TF-IDF sentence embeddings.
Cosine similarity between a description's embedding and each paper's embedding selects 3–6 candidate papers per paragraph (see the recall sketch below). In the ranking stage, a Bi-LSTM encodes the (Description, PaperText) pair; the two sentence vectors are combined via element-wise difference and inner product, then passed through a dense layer and a softmax to produce relevance scores.
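A minimal sketch of the recall stage using the pure TF-IDF variant and cosine similarity; corpus loading, tokenization, and the SIF-weighted variant are omitted, and the variable names (paper_texts, descriptions, top_k) are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paper_texts = ["bert pre-training of deep bidirectional transformers ...",
               "graph neural networks a review ...",
               "a simple but tough-to-beat baseline for sentence embeddings ..."]
descriptions = ["an efficient implementation based on bert and graph neural networks"]

vectorizer = TfidfVectorizer()
paper_vecs = vectorizer.fit_transform(paper_texts)    # (n_papers, vocab)
desc_vecs = vectorizer.transform(descriptions)        # (n_descriptions, vocab)

top_k = 3  # the article keeps 3-6 candidates per description
sims = cosine_similarity(desc_vecs, paper_vecs)       # (n_descriptions, n_papers)
candidates = np.argsort(-sims, axis=1)[:, :top_k]     # indices of the most similar papers
print(candidates)
```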
Algorithm Core Ideas
4.1 SIF Sentence Embedding
Pretrained word vectors are combined by weighted averaging, with each word weighted by a / (a + p(w)), where p(w) is the word's corpus frequency and a is a small smoothing constant; the projection of each sentence vector onto the corpus's first principal component is then removed. This method achieved the best recall accuracy on the validation set.
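A minimal sketch of SIF embeddings [1], assuming `word_vectors` maps tokens to numpy arrays and `word_freq` maps tokens to relative corpus frequencies; the smoothing constant a = 1e-3 follows the paper's recommendation.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_embeddings(sentences, word_vectors, word_freq, dim, a=1e-3):
    # Weighted average of word vectors: weight(w) = a / (a + p(w)).
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        vecs = [a / (a + word_freq.get(w, 0.0)) * word_vectors[w]
                for w in tokens if w in word_vectors]
        if vecs:
            emb[i] = np.mean(vecs, axis=0)
    # Remove the projection onto the first principal component.
    svd = TruncatedSVD(n_components=1, n_iter=7)
    svd.fit(emb)
    u = svd.components_[0]                      # first singular vector (unit norm)
    return emb - emb @ np.outer(u, u)

# `sentences` are pre-tokenized word lists, e.g. [["bert", "graph", "network"], ...]
```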
4.2 InferSent
The ranking stage adapts Facebook's InferSent model [2]: the description and the paper text are each encoded into sentence embeddings u and v, which are combined through element-wise difference and inner product features. Two modifications are made: the original 3-way softmax over NLI labels is changed to a 2-way (relevant / not relevant) softmax, and the raw u and v vectors are dropped from the feature concatenation. A Bi-LSTM with max pooling is chosen as the sentence encoder.
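A minimal PyTorch sketch of this adapted InferSent-style ranker: a shared Bi-LSTM encoder with max pooling, features [|u - v|, u * v], a dense layer, and a 2-way softmax. Hidden sizes, vocabulary size, and padding handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    def __init__(self, vocab_size=50000, emb_dim=300, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2 * hidden)
        return h.max(dim=1).values                # max pooling over time

class PairRanker(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = BiLSTMMaxEncoder(hidden=hidden)
        # Features are |u - v| and u * v only (the raw u, v vectors are dropped).
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, 512), nn.ReLU(),
            nn.Linear(512, 2),                    # 2-way relevant / not-relevant logits
        )

    def forward(self, desc_ids, paper_ids):
        u = self.encoder(desc_ids)                # description embedding
        v = self.encoder(paper_ids)               # paper-text embedding
        feats = torch.cat([torch.abs(u - v), u * v], dim=-1)
        return self.classifier(feats)             # softmax over logits gives relevance scores

# Candidates from the recall stage are re-scored with softmax(logits)[:, 1].
```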
4.3 Others: BERT
Two BERT-based approaches [3] were tried: (1) encoding the description and the paper text separately with BERT and computing their cosine similarity, and (2) replacing the InferSent component in the overall model with BERT. These approaches were also used in a related CCL competition and achieved top MAP@3 scores after a limited number of online submissions.
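A minimal sketch of approach (1) using the Hugging Face transformers library: encode the two texts separately and compare them by cosine similarity. The checkpoint name and mean pooling over token embeddings are assumptions; the article does not specify them.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def encode(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool token embeddings

desc = "An efficient implementation based on BERT and graph neural networks is introduced."
paper = "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
score = torch.cosine_similarity(encode(desc), encode(paper), dim=0)
print(float(score))
```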
Conclusion
The combined pipeline of recall (SIF and TF-IDF embeddings) and ranking (Bi-LSTM pair classification) effectively addresses the large-scale paper matching task, with BERT offering potential for further improvement.
References
[1] Arora S., Liang Y., Ma T. "A Simple but Tough-to-Beat Baseline for Sentence Embeddings," ICLR 2017.
[2] Conneau A., Kiela D., Schwenk H., et al. "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data," 2017.
[3] Devlin J., Chang M. W., Lee K., et al. "BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding," 2018.
