How BlackPearl Dominated All Three KDD 2024 OAG‑Challenge Tracks with Large‑Model Techniques

The BlackPearl team from Meituan’s Search & Content Intelligence group detailed their award‑winning solutions for the three KDD 2024 OAG‑Challenge tasks—paper name disambiguation, source tracing, and academic QA—showcasing large‑model driven pipelines, iterative self‑refinement, grafting‑learning, and extensive hard‑negative mining that outperformed traditional feature‑engineered and BERT‑based baselines.

Meituan Technology Team

Task Overview

Paper Name Disambiguation (WhoIsWho‑IND): Detect papers incorrectly assigned to an author, using metadata such as title, abstract, authors, keywords, venue, and year.

Paper Source Tracing (PST): For a given paper, predict its most influential reference(s) (the ref‑source) and output an importance score in the range [0, 1].

Academic Question Answering (AQA): Retrieve the most relevant papers for a professional query; performance is measured by MAP@20.

WhoIsWho‑IND Solution

The clustering problem was reformulated as a pairwise comparison task: for each target paper, the model receives a set of candidate reference papers and predicts whether each candidate belongs to the same author as the target.
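The pairwise setup can be sketched as a prompt builder that pairs one candidate against the author's profile; the field names and template below are illustrative assumptions, not the team's exact format.

```python
def build_pairwise_example(candidate, profile_papers, max_refs=20):
    """Format one candidate paper against an author's profile papers as a
    binary-classification prompt (hypothetical template)."""
    refs = "\n".join(
        f"- {p['title']} ({p.get('year', 'n/a')})" for p in profile_papers[:max_refs]
    )
    return (
        "Author profile papers:\n" + refs + "\n\n"
        "Candidate paper: " + candidate["title"] + "\n"
        "Does the candidate belong to this author? Answer yes or no."
    )
```

The model's yes/no probability for each candidate then serves as its per-paper assignment score.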

Key Techniques

Train‑Time Difficulty Increase (TTDI): During fine‑tuning, the maximum input length is gradually reduced and the proportion of noisy references is increased, forcing the model to handle harder examples.
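A minimal TTDI schedule might interpolate both knobs linearly over training; the specific lengths and noise ratios below are assumptions for illustration.

```python
def ttdi_schedule(step, total_steps, max_len=4096, min_len=1024,
                  noise_start=0.1, noise_end=0.5):
    """Shrink the max input length and raise the noisy-reference ratio
    linearly as training progresses (train-time difficulty increase)."""
    frac = step / max(1, total_steps)
    max_input_len = int(max_len - frac * (max_len - min_len))
    noise_ratio = noise_start + frac * (noise_end - noise_start)
    return max_input_len, noise_ratio
```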

Test‑Time Augmentation (TTA): The order of candidate references is shuffled at inference time, and predictions from multiple shuffles are averaged.
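The shuffle-and-average loop is straightforward; `score_fn` below stands in for a model call returning a per-candidate score and is an assumed interface.

```python
import random

def tta_predict(score_fn, target, candidates, n_shuffles=5, seed=0):
    """Average per-candidate scores over several random orderings of the
    candidate list (test-time augmentation)."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in candidates}
    for _ in range(n_shuffles):
        order = candidates[:]
        rng.shuffle(order)  # new candidate ordering per pass
        for cand, score in score_fn(target, order).items():
            totals[cand] += score
    return {c: s / n_shuffles for c, s in totals.items()}
```

Averaging over orderings reduces any position bias the model picked up during fine-tuning.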

Iterative Self‑Refinement (IRF): After each inference round, the candidates are ranked by predicted correctness probability; the top‑ranked correct papers are fed back as the reference set for the next round, improving confidence without additional training.
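The refinement loop can be sketched as follows; `score_fn` is an assumed model interface that scores each candidate against the current reference set, and the keep fraction is illustrative.

```python
def iterative_refine(score_fn, target, candidates, rounds=3, keep_frac=0.5):
    """Each round: score candidates against the current references, then
    promote the top-scoring candidates to be next round's reference set."""
    references = list(candidates)  # round 1 uses all candidates as references
    scores = {}
    for _ in range(rounds):
        scores = score_fn(target, candidates, references)
        ranked = sorted(candidates, key=scores.get, reverse=True)
        references = ranked[: max(1, int(len(ranked) * keep_frac))]
    return scores
```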

Efficient Fine‑Tuning: DeepSpeed ZeRO‑1, LoRA, and QLoRA are used to fit large models on a single GPU. Dedicated models are fine‑tuned on only the most informative fields (title and author); another model is trained on all fields, and the variants are later ensembled.

The pipeline ensembles several independently fine‑tuned models and repeats the IRF step K times.

WhoIsWho‑IND solution diagram

PST Solution

The source‑tracing task suffers from heterogeneous label distributions, extremely long HTML‑encoded identifiers, and a massive unlabeled auxiliary corpus (DBLP). Three complementary techniques address these challenges.

Grafting‑Learning for Dataset

A BERT model is first fine‑tuned on the large, noisy rule‑annotated dataset. Its final hidden states are then used as additional features for a second BERT that is fine‑tuned on the high‑quality human‑annotated data. This grafting preserves useful signals from the noisy set while discarding noise.
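In spirit, the graft concatenates the noisy-data model's representation onto the clean model's features before the classifier head; the sketch below uses plain lists and assumed embedding callables rather than real BERT forward passes.

```python
def graft_features(noisy_embed, clean_embed, text):
    """Stage 1: embed the text with the model fine-tuned on the noisy
    rule-annotated data. Stage 2: concatenate that vector onto the clean
    model's own features as extra input to the final classifier."""
    return clean_embed(text) + noisy_embed(text)  # list concatenation
```

Only the frozen stage-1 representations carry over, so label noise from the rule-annotated set cannot directly corrupt stage-2 gradients.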

Grafting‑Learning for LongText

Long papers are split into segments; each segment is processed by a separate BERT model. Segment‑level prediction probabilities are aggregated and fed to a ChatGLM‑3 model, allowing the final decision to be made with short inputs and avoiding quadratic attention costs.
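A minimal version of the split-then-aggregate step is shown below; segment length, overlap, and the summary statistics passed downstream are assumptions for illustration.

```python
def split_segments(text, seg_len=512, overlap=64):
    """Chunk a long document into overlapping fixed-size segments."""
    step = seg_len - overlap
    return [text[i:i + seg_len] for i in range(0, max(1, len(text) - overlap), step)]

def aggregate(seg_probs):
    """Reduce segment-level probabilities to a compact summary that the
    downstream model (ChatGLM-3 in the write-up) can consume."""
    return {"max": max(seg_probs), "mean": sum(seg_probs) / len(seg_probs)}
```

Because only the aggregated summary reaches the final model, its input stays short regardless of paper length.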

Automatic RAG & Feature Engineering

Relevant auxiliary information is automatically retrieved from the DBLP corpus and transformed into engineered features. This reduces input length and improves prediction confidence.
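One way to picture this step: look the reference up in an in-memory DBLP index and emit a handful of compact features instead of raw text. The feature set and index layout below are entirely hypothetical.

```python
def rag_features(paper, reference, dblp_index):
    """Turn retrieved DBLP metadata into compact engineered features
    (hypothetical feature names) instead of feeding raw long text."""
    meta = dblp_index.get(reference["title"].lower(), {})
    return {
        "ref_in_dblp": int(bool(meta)),
        "venue_match": int(bool(meta) and meta.get("venue") == paper.get("venue")),
        "year_gap": abs(paper.get("year", 0) - meta.get("year", paper.get("year", 0))),
    }
```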

PST solution diagram

AQA Solution

The academic QA task is a noisy retrieval problem. The solution combines high‑quality dense vectors, hard‑negative mining, and iterative boosting.

LLM for Vector

Dense representations are generated with the 7B SFR‑Embedding‑Mistral model, which outperforms smaller encoder‑based embedding models.

Hard Example Mining

During contrastive fine‑tuning, for each positive sample three hard negatives are sampled from a pool of 100 candidates, encouraging the model to discriminate difficult cases.
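The mining step can be sketched as: retrieve a top-k pool with the current model, drop known positives, and sample a few of the remainder as hard negatives. `retrieve_fn` is an assumed interface standing in for the embedding model's retrieval call.

```python
import random

def mine_hard_negatives(query, positives, retrieve_fn, pool_size=100, n_neg=3, seed=0):
    """Sample hard negatives from the model's own top-k retrievals:
    highly ranked documents that are not labeled positive."""
    pool = [d for d in retrieve_fn(query, pool_size) if d not in positives]
    return random.Random(seed).sample(pool, min(n_neg, len(pool)))
```

These near-miss documents give the contrastive loss far more signal than random negatives, which the model already separates easily.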

Boosting Iterations

An iterative pipeline alternates between mining harder negatives and fine‑tuning both recall and ranking models. Each iteration uses the improved model to retrieve a new set of hard negatives, and predictions are aggregated (e.g., rank‑average fusion).
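The rank-average fusion mentioned above can be sketched as follows; the penalty assigned to documents missing from a ranking is an assumption.

```python
from collections import defaultdict

def rank_average(rankings):
    """Fuse several models' ranked lists by averaging each document's rank
    position (lower average rank = better)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        for pos, doc in enumerate(ranking):
            totals[doc] += pos
            counts[doc] += 1
    worst = max(len(r) for r in rankings)  # penalty rank for missing docs
    n = len(rankings)
    avg = {d: (totals[d] + (n - counts[d]) * worst) / n for d in totals}
    return sorted(avg, key=avg.get)
```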

Training details:

Recall model: SFR‑Embedding‑Mistral, instruction‑tuned with contrastive loss, 10 epochs per iteration, learning rate 1e‑4, QLoRA for single‑GPU fine‑tuning.

Ranking model: SOLAR‑10.7B‑Instruct‑v1.0, instruction‑tuned with cross‑entropy loss, same hyper‑parameters as the recall model.

AQA solution diagram

Eight boosting iterations lifted MAP@20 from the base model's score to 0.301 after rank‑average fusion.

AQA experimental results

Results and Resources

All three tracks were won by the described large‑model pipelines, demonstrating the effectiveness of iterative self‑refinement, grafting‑learning, and hard‑negative boosting.

Full code, configuration files and reproducible scripts are available at:

https://github.com/BlackPearl-Lab/KddCup-2024-OAG-Challenge-1st-Solutions

References

Wang et al., 2019. Automatic brain tumor segmentation using CNNs with test‑time augmentation.

Rasley et al., 2020. DeepSpeed: System optimizations for training >100B‑parameter models.

Hu et al., 2021. LoRA: Low‑rank adaptation of large language models.

Dettmers et al., 2024. QLoRA: Efficient fine‑tuning of quantized LLMs.

Jiangli Club. Grafting‑learning proposal and use cases.

Meng et al., 2024. SFR‑Embedding‑Mistral: Enhance text retrieval with transfer learning.

Zhang et al., 2024. OAG‑Bench: A human‑curated benchmark for academic graph mining.
