Media Domain Named Entity Recognition: Techniques, Evolution, and Sohu’s Practical Implementation
This article reviews the challenges of media‑domain named entity recognition, outlines the evolution from rule‑based methods through traditional machine‑learning and deep‑learning models to attention‑based Transformers, and details Sohu’s practical Bi‑LSTM‑CRF system with data‑annotation strategies and performance results.
Named Entity Recognition (NER) is a core technology in natural language processing, essential for understanding news articles and powering downstream applications such as question answering and search. In the Chinese media domain, NER must handle a wide variety of entity types—people, locations, organizations, and domain‑specific terms—while coping with rapidly emerging new entities.
The development of NER techniques can be divided into four stages: (1) rule‑based and dictionary methods, which rely on manually crafted patterns and scale poorly as new entities appear; (2) traditional machine‑learning approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF), which treat NER as a sequence labeling problem but are limited by the HMM's strong independence assumptions and, for CRFs, by reliance on hand‑engineered features; (3) deep‑learning models, especially Bi‑LSTM combined with CRF, which learn features automatically, capture long‑range context, and achieve superior performance; and (4) attention‑based models like Transformers, which further improve contextual representation and are often paired with a CRF decoding layer to infer a globally consistent label sequence.
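To make the role of the CRF decoding layer concrete, here is a minimal sketch of Viterbi decoding over BIO tags. The tag set, the scores, and the function names are illustrative assumptions, not taken from Sohu's system; in a Bi-LSTM-CRF the per-token scores would come from the network and the transition scores would be learned.

```python
from typing import Dict, List, Tuple

# Hypothetical tag set for a single entity type (person), in BIO format.
TAGS = ["O", "B-PER", "I-PER"]


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str] = TAGS,
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] is the score of `tag` at position t.
    transitions[(prev, cur)] is the score of moving from prev to cur;
    pairs that are not listed score 0.
    """
    if not emissions:
        return []
    # best[tag] = score of the best path that ends in `tag` at the current step.
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointer[cur] = prev
        best = new_best
        backpointers.append(pointer)
    # Walk the backpointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Made-up scores for a three-token sentence.
EXAMPLE_EMISSIONS = [
    {"O": 1.0, "B-PER": 0.8, "I-PER": 0.0},
    {"O": 0.2, "B-PER": 0.1, "I-PER": 1.0},
    {"O": 1.0, "B-PER": 0.0, "I-PER": 0.1},
]
# Penalize the invalid BIO transition O -> I-PER.
EXAMPLE_TRANSITIONS = {("O", "I-PER"): -10.0}

greedy = [max(TAGS, key=lambda t: scores[t]) for scores in EXAMPLE_EMISSIONS]
decoded = viterbi(EXAMPLE_EMISSIONS, EXAMPLE_TRANSITIONS)
print(greedy)   # ['O', 'I-PER', 'O']
print(decoded)  # ['B-PER', 'I-PER', 'O']
```

Picking the best tag per token independently yields an invalid sequence in which I-PER follows O. Decoding with transition scores instead selects the valid sequence B-PER, I-PER, O, which is the kind of label consistency a CRF layer contributes.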
Sohu’s media NER practice builds on the Bi‑LSTM‑CRF architecture and augments it with domain‑specific lexicons and rule sets to improve recognition of out‑of‑vocabulary and ambiguous entities. Data preparation involves a two‑stage pipeline: first, using existing open‑source NER tools to pre‑label long articles, then having annotators refine the results; second, applying a boosting‑style sampling strategy to focus manual effort on low‑confidence segments.
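The second stage can be approximated by a simple uncertainty-based selection. The article does not spell out the exact sampling rule, so the sketch below is an assumption: each pre-labeled segment carries the model's per-token confidence, a segment is scored by its least confident token, and the lowest-scoring segments are sent to annotators first. The names `segment_confidence` and `select_for_annotation` are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_confidence(token_probs: Sequence[float]) -> float:
    """Score a segment by its least confident token (an assumed heuristic)."""
    return min(token_probs) if token_probs else 0.0


def select_for_annotation(
    segments: List[Tuple[str, List[float]]], budget: int
) -> List[str]:
    """Return the ids of the `budget` least confident segments.

    Each segment is (segment_id, per-token confidence of the predicted tag).
    """
    ranked = sorted(segments, key=lambda seg: segment_confidence(seg[1]))
    return [seg_id for seg_id, _ in ranked[:budget]]


# Made-up confidences from a hypothetical pre-labeling model.
example_segments = [
    ("s1", [0.99, 0.98]),
    ("s2", [0.95, 0.41, 0.90]),
    ("s3", [0.70, 0.88]),
]
to_review = select_for_annotation(example_segments, budget=2)
print(to_review)  # ['s2', 's3']
```

Using the minimum rather than the mean keeps a single doubtful token from being averaged away in a long segment; other scoring rules would fit the same interface.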
The resulting system achieves approximately 95% precision and 94% recall on internally annotated datasets, demonstrating strong capability in both contextual understanding and new‑entity discovery. Visual examples illustrate the effectiveness of the combined approach compared with purely manual annotation.
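For reference, precision and recall for NER are commonly computed at the entity level. The sketch below assumes exact-match scoring on (start, end, type) spans, which the article does not confirm as Sohu's evaluation protocol, and the example spans are invented. It also shows the F1 score implied by the reported precision and recall.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(
    gold: Iterable[Span], predicted: Iterable[Span]
) -> Tuple[float, float, float]:
    """Entity-level scores; a prediction counts only on an exact span and type match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented spans: two of four predictions match the three gold entities.
gold_spans = [(0, 2, "PER"), (5, 7, "ORG"), (9, 11, "LOC")]
pred_spans = [(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "LOC"), (12, 14, "ORG")]
p, r, f = precision_recall_f1(gold_spans, pred_spans)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571

# F1 implied by the reported ~95% precision and ~94% recall.
print(round(f1_score(0.95, 0.94), 3))  # 0.945
```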
In conclusion, the article highlights the importance of integrating rule‑based, statistical, and deep‑learning methods for media‑domain NER, and invites participation in a collaborative content‑recognition competition jointly organized by Sohu and Tsinghua University.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.