
Technical Summary of the 2021 Sohu Campus Text Matching Algorithm Competition

This article presents a comprehensive technical summary of the 2021 Sohu Campus Text Matching Algorithm Competition, detailing data characteristics, preprocessing strategies, tokenization choices, positional encoding methods, model architectures using relative encodings such as WoBERT and RoFormer, experimental results, and reflections on future improvements.


Author Introduction

Chen Zhuo, a second‑year graduate student at Harbin Institute of Technology (Shenzhen), has won several awards, including top 2 in the 2021 Tencent Game Security Competition (machine‑learning track), top 15 in the 2021 "JiTu" AI Algorithm Contest, and top 16 in the 2021 AIWIN public‑opinion risk‑control contest. He also won third prize in the 2021 Sohu Campus Text Matching Algorithm Competition; this article is his technical summary of that competition.

Competition Overview

The task is text matching, a common NLP problem, but the dataset is unusually diverse: it contains many sub‑categories such as long‑to‑long, short‑to‑long, and short‑to‑short matches. Each sub‑task is further divided into two matching standards, A (lenient) and B (strict), demanding a highly robust model.

# A‑class sample
{
    "source": "...",
    "target": "...",
    "labelA": "1"
}
# B‑class sample
{
    "source": "...",
    "target": "...",
    "labelB": "0"
}

Data Exploration

Text Length

Some documents exceed ten thousand characters, so long texts must be truncated or split into windows before encoding.

The corpus also contains special characters and noisy text that need handling.
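A quick way to quantify this length skew before choosing a truncation or window length (a sketch over character counts; the percentile choices are illustrative):

```python
def length_stats(texts):
    """Character-length percentiles over a corpus, used to pick a
    truncation/window length for the encoder."""
    lengths = sorted(len(t) for t in texts)

    def pct(p):
        # nearest-rank percentile, clamped to the last element
        idx = min(len(lengths) - 1, int(p / 100 * len(lengths)))
        return lengths[idx]

    return {"p50": pct(50), "p95": pct(95), "max": lengths[-1]}
```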

Special Symbol Removal

Example: "巴蒂勇夺赛季首冠🏅️#WTA500#雅拉山谷精英赛女单决赛" → "巴蒂勇夺赛季首冠,雅拉山谷精英赛女单决赛"

Tokenization Granularity

Word‑level tokenization (example): 巴蒂|勇夺|赛季|首冠|,雅拉山谷|精英赛|女单|决赛

Character‑level tokenization (example): 巴|蒂|勇|夺|赛|季|首|冠|,雅|拉|山|谷|精|英|赛|女|单|决|赛

We chose word‑level tokenization: it preserves more semantic information per token, and the resulting shorter sequences reduce computational cost.
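To make the granularity difference concrete, here is a toy forward maximum-matching segmenter (the actual solution would rely on a pretrained word-level tokenizer such as WoBERT's; the vocabulary below is illustrative):

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest vocabulary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for span in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + span]
            if span == 1 or word in vocab:
                tokens.append(word)
                i += span
                break
    return tokens

vocab = {"巴蒂", "勇夺", "赛季", "首冠"}
print(max_match("巴蒂勇夺赛季首冠", vocab))  # word-level
print(list("巴蒂勇夺赛季首冠"))             # character-level
```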

Model Design Details

Positional Encoding

Absolute positional encoding: Simple and fast but limited to the maximum pre‑training length (e.g., 512 tokens) and lacks extrapolation ability.

Relative positional encoding: Considers the distance between the current token and the attention target, offering greater flexibility and allowing longer texts to be processed.
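RoFormer's rotary position embedding (RoPE) illustrates the relative property: rotating queries and keys by position-dependent angles makes their dot product depend only on the offset between positions. A NumPy sketch (the dimension sizes are arbitrary):

```python
import numpy as np

def rotary_embed(x, positions, base=10000):
    """Rotary position embedding (RoPE, as in RoFormer): rotate
    each pair of dimensions by an angle proportional to position."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    angles = np.outer(positions, inv_freq)         # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# q·k after rotation depends only on the relative offset
# (2 in both cases below), not on the absolute positions:
rng = np.random.default_rng(0)
q, k = rng.standard_normal((1, 8)), rng.standard_normal((1, 8))
s1 = rotary_embed(q, [3]) @ rotary_embed(k, [5]).T
s2 = rotary_embed(q, [10]) @ rotary_embed(k, [12]).T
print(np.allclose(s1, s2))  # True
```

Because no position index is ever added to the embeddings themselves, texts longer than the pre-training length degrade more gracefully.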

Model Architecture

The dataset contains two main classes (A and B) each with three sub‑categories, forming a multi‑task learning problem. We selected models with relative positional encoding—WoBERT and RoFormer.

Direction 1: Encode the task type as an embedding, concatenate it with the model output, and feed the combined vector into a fully‑connected layer for prediction.
Direction 2: Use a conditional LayerNorm to enable a single model to handle six sub‑tasks.
Direction 3: Transform the conditional vector to the same dimension as the hidden states and add it to the input.
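Direction 2 can be sketched as follows (the shapes and the identity-initialized gain are assumptions; the competition code itself is not shown here):

```python
import numpy as np

def conditional_layer_norm(h, cond, W_gamma, W_beta, eps=1e-6):
    """LayerNorm whose gain and bias are generated from a task
    condition vector, so one encoder can serve all six sub-tasks.

    h: (seq, hidden) hidden states; cond: (cond_dim,) task
    embedding; W_gamma, W_beta: (cond_dim, hidden) projections."""
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    gamma = 1.0 + cond @ W_gamma   # near-identity at initialization
    beta = cond @ W_beta
    return h_norm * gamma + beta
```

With a zero condition vector this reduces to plain LayerNorm, so the task conditioning is a learned perturbation of the shared encoder.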

Results and Analysis

The two proposed structures were evaluated on offline and online test sets. The scores are shown below:

Solution     Pre-round Offline   Pre-round Online   Final Offline   Final Online
Solution 1   0.771               0.73532            0.7932          0.7832
Solution 2   0.782               0.73764            0.7944          0.7869

Considering deployment cost, we did not use model ensembling and finally selected Solution 2 as the overall approach.

Conclusion and Reflections

Data processing could be further improved, e.g., by data augmentation to increase sample size.

Model ensembling was not explored due to resource constraints; it could boost performance in future work.

For long texts, weighted sliding windows performed better than uniform windows, suggesting more sophisticated segmentation strategies.
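One way to realize such weighting (the window size, stride, and geometric decay below are all illustrative; the article does not specify the exact scheme used):

```python
def sliding_windows(text, size=256, stride=192):
    """Split a long document into overlapping character windows."""
    return [text[i:i + size]
            for i in range(0, max(1, len(text) - size + stride), stride)]

def weighted_score(window_scores, decay=0.8):
    """Combine per-window match scores with geometrically decaying
    weights, emphasizing the document's opening (news articles tend
    to front-load key content)."""
    weights = [decay ** i for i in range(len(window_scores))]
    return sum(w * s for w, s in zip(weights, window_scores)) / sum(weights)
```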

We thank the organizers for their support and the many talented participants from whom we learned a great deal. Continuous learning and improvement remain our goals.

Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
