DeepDive Powers Knowledge Graph Relation Extraction for Shenma Search
This article explains how Alibaba’s Shenma Search team builds and refines a large‑scale knowledge graph using open information extraction, detailing relation‑extraction techniques, distant supervision challenges, and the DeepDive system’s architecture, custom Chinese NLP pipeline, iterative improvements, and empirical results across millions of triples.
Background Overview
To continuously improve search experience, the Shenma Search knowledge graph team explores open information extraction (OIE) to extract structured information from large-scale unstructured text, which is a core technology for sustainable knowledge graph expansion.
Relation Extraction Overview
Classification of Relation Extraction Techniques
Existing methods are divided into three categories: supervised learning, which treats extraction as a classification problem but requires large labeled corpora; semi‑supervised learning, which uses bootstrapping with seed instances; and unsupervised learning, which clusters entity pairs based on similar contexts. Supervised methods are currently the most widely applied in industry due to higher precision and recall.
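The bootstrapping idea behind semi-supervised extraction can be made concrete with a toy sketch: seed entity pairs find textual patterns, and those patterns in turn find new pairs. All sentences, entities, and the naive string matching below are illustrative assumptions, not the actual Shenma pipeline.

```python
# Toy bootstrapping for relation extraction: seed pairs -> patterns -> new pairs.
# Data and matching logic are illustrative only.

def extract_pattern(sentence, e1, e2):
    """Return the text between two entity mentions, or None."""
    i, j = sentence.find(e1), sentence.find(e2)
    if i == -1 or j == -1 or i >= j:
        return None
    return sentence[i + len(e1):j].strip()

def bootstrap(sentences, seeds, rounds=2):
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        # 1. Learn patterns from sentences that contain a known pair.
        for s in sentences:
            for e1, e2 in pairs:
                p = extract_pattern(s, e1, e2)
                if p:
                    patterns.add(p)
        # 2. Apply patterns to discover new pairs (naive: take the words
        #    immediately left and right of the matched pattern).
        for s in sentences:
            for p in patterns:
                if p in s:
                    left, _, right = s.partition(p)
                    lw, rw = left.split(), right.split()
                    if lw and rw:
                        pairs.add((lw[-1], rw[0]))
    return pairs

sentences = [
    "Alice married Bob in 1999 .",
    "Carol married Dave in 2003 .",
]
print(bootstrap(sentences, {("Alice", "Bob")}))
```

Starting from the single seed ("Alice", "Bob"), the pattern "married" is learned and the second sentence yields ("Carol", "Dave") — and, as the article notes, such naive pattern propagation is exactly where semantic drift creeps in.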
Distant Supervision Algorithm
Distant supervision aligns text with a large knowledge graph, automatically labeling sentences that contain a known triple (E1, E2, R) as positive examples. Although it solves the data‑labeling scale problem, the strong assumption often introduces noisy labels, known as the “wrong label problem”. Various improvements—rule‑based filtering, graph‑model methods, and multi‑instance learning—have been proposed to mitigate this issue.
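The labeling assumption, and the wrong-label problem it introduces, can be shown in a few lines. The knowledge-base triple and sentences below are hypothetical examples, not Shenma data.

```python
# Sketch of the distant-supervision assumption: any sentence mentioning both
# entities of a known triple (E1, E2, R) is labeled positive for relation R.

kb = {("Barack Obama", "Michelle Obama"): "spouse"}

sentences = [
    "Barack Obama and Michelle Obama attended the gala .",  # correct label
    "Barack Obama praised Michelle Obama 's new book .",    # wrong label!
    "Barack Obama gave a speech in Chicago .",              # no entity pair
]

def distant_label(sentence, kb):
    for (e1, e2), rel in kb.items():
        if e1 in sentence and e2 in sentence:
            return rel
    return None

labels = [distant_label(s, kb) for s in sentences]
print(labels)  # the second sentence illustrates the "wrong label problem"
```

The second sentence mentions both entities but does not express the spouse relation; it still gets a positive label, which is precisely the noise that rule-based filtering and multi-instance learning try to remove.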
Choosing Relation Extraction Methods for Shenma Knowledge Graph
Unstructured text accounts for the majority of data sources. Shenma’s knowledge graph, now containing ~50 million entities and ~3 billion relations, relies heavily on distant supervision combined with existing graph data to achieve large‑scale, accurate extraction.
Two representative solutions were adopted: a DeepDive‑based system and a deep‑learning approach. DeepDive leverages NLP tools, feature engineering, and iterative human‑in‑the‑loop refinement; the deep‑learning method uses word embeddings and convolutional neural networks for high‑throughput extraction.
DeepDive System Introduction
Overview
DeepDive (Stanford) is an information‑extraction platform that processes text, tables, and images to produce structured facts. Its pipeline consists of data processing, labeling, learning/inference, and interactive iteration.
Architecture and Workflow
Data Processing
Input and segmentation: raw text is split into sentences and assigned global identifiers.
NLP annotation: tokenization, lemmatization, POS, NER, and dependency parsing using Stanford CoreNLP.
Candidate entity‑pair extraction: mentions are located and paired according to predefined rules.
Feature extraction: contextual token sequences, NER tag sequences, surrounding n‑grams, etc., are generated automatically (DDlib) or via user‑defined functions.
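The candidate-pair and feature steps above can be sketched as follows. The tokens, NER tags, and feature names are illustrative stand-ins for what DeepDive's DDlib would produce, not its actual output format.

```python
# Sketch of candidate-pair generation and DDlib-style feature extraction.
# Tokenization and NER tags are hard-coded for illustration.

tokens = ["Alice", "married", "Bob", "in", "Paris", "."]
ner    = ["PERSON", "O", "PERSON", "O", "LOCATION", "O"]

# Candidate entity pairs: every ordered pair of PERSON mentions.
persons = [i for i, t in enumerate(ner) if t == "PERSON"]
candidates = [(i, j) for i in persons for j in persons if i < j]

def features(tokens, ner, i, j):
    """Contextual features for the mention pair at positions i and j."""
    return [
        "WORD_SEQ=" + "_".join(tokens[i + 1:j]),   # words between mentions
        "NER_SEQ="  + "_".join(ner[i + 1:j]),      # NER tags between mentions
        "NGRAM_LEFT=" + (tokens[i - 1] if i > 0 else "<s>"),  # left context
    ]

for i, j in candidates:
    print((tokens[i], tokens[j]), features(tokens, ner, i, j))
```

For the single PERSON-PERSON candidate here, the dominant feature is the word sequence "married" between the mentions — the kind of signal the factor graph later learns to weight.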
Data Labeling
Distant supervision and heuristic rules assign positive or negative labels to candidate pairs. Positive examples are drawn from known triples (e.g., spouse relationships) in the graph; negative examples may be generated from absent triples or mutually exclusive relations.
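A minimal sketch of this labeling step, assuming a spouse knowledge base for positives and a parent-child knowledge base as a mutually exclusive relation for negatives (all pairs are invented):

```python
# Sketch of training-label assignment: positives from known spouse triples,
# negatives from pairs holding a mutually exclusive relation (parent-child).

spouse_kb = {("Alice", "Bob")}
parent_kb = {("Alice", "Carol")}   # parent-child excludes spouse

def label_candidate(pair):
    if pair in spouse_kb:
        return +1   # distant-supervision positive
    if pair in parent_kb:
        return -1   # heuristic-rule negative
    return 0        # unlabeled; the factor graph infers its probability

pairs = [("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Dave")]
print([label_candidate(p) for p in pairs])
```

Unlabeled candidates are not discarded: they are exactly the variables whose truth probability the learning step estimates.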
Learning and Inference
Factor‑graph models learn feature weights and predict the probability that a candidate triple is true. Gibbs sampling and stochastic gradient descent are used for scalable inference.
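When each candidate has only its own per-candidate feature factors, the factor-graph marginal reduces to a logistic model: the probability that a triple is true is the sigmoid of its weighted feature sum. The weights below are invented for illustration; in DeepDive they are learned via SGD, with Gibbs sampling used for inference over the full graph.

```python
# Sketch of per-candidate inference: sigmoid of the weighted feature sum.
# Weights are illustrative, not learned values.

import math

weights = {"WORD_SEQ=married": 2.0, "WORD_SEQ=met": -1.0}

def probability(feats):
    """Marginal probability that the candidate triple is true."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

print(round(probability(["WORD_SEQ=married"]), 3))  # sigmoid(2.0) ≈ 0.881
```

These per-triple probabilities are what the article later reports in the [0.95, 1] range for high-confidence extractions.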
Interactive Iteration
After each iteration, error analysis guides rule refinement, feature adjustment, or weight tuning, gradually improving precision and recall.
DeepDive Improvements for Shenma Knowledge Graph
Chinese NLP Annotation
Since CoreNLP is English‑focused, a custom Chinese pipeline replaces tokenization with Ali word segmentation, retains NER, and merges tokens into entity‑level units. Long sentences are re‑segmented based on heuristics.
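The token-merging step can be sketched as a BIO-span merge, assuming the segmenter and NER stage emit BIO tags (the example tokens and tags are illustrative, not Ali word segmentation output):

```python
# Sketch of merging word-segmented tokens into entity-level units using
# BIO-style NER spans, so a multi-token entity becomes a single unit.

tokens = ["马", "云", "出生", "于", "杭州"]
tags   = ["B-PER", "I-PER", "O", "O", "B-LOC"]

def merge_entities(tokens, tags):
    merged = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("I-") and merged:
            merged[-1] += tok   # extend the current entity span
        else:
            merged.append(tok)  # start a new unit
    return merged

print(merge_entities(tokens, tags))  # ['马云', '出生', '于', '杭州']
```

Working at entity-level units keeps downstream candidate-pairing rules from splitting a multi-token Chinese name across token boundaries.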
Automatic Subject Augmentation
Approximately 40 % of Chinese encyclopedia sentences lack explicit subjects. An algorithm adds missing subjects by borrowing from the previous sentence or the article title, achieving ~92 % accuracy and reducing extraction errors.
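The borrowing heuristic might look like the following sketch. Real subject detection would use POS and dependency parses; here a toy "starts with a known entity" check stands in, and the sentences are invented.

```python
# Sketch of subject augmentation: if a sentence lacks a subject, prepend the
# previous sentence's subject, falling back to the article title.

def has_subject(sentence, known_entities):
    """Toy check: does the sentence start with a known entity mention?"""
    return any(sentence.startswith(e) for e in known_entities)

def augment_subjects(sentences, title, known_entities):
    out, last_subject = [], title   # the title seeds the first borrow
    for s in sentences:
        if has_subject(s, known_entities):
            last_subject = next(e for e in known_entities if s.startswith(e))
            out.append(s)
        else:
            out.append(last_subject + " " + s)  # prepend borrowed subject
    return out

title = "Lu Xun"
sents = ["Lu Xun was born in Shaoxing .", "Later moved to Beijing ."]
print(augment_subjects(sents, title, {"Lu Xun"}))
```

The second sentence, subjectless in the original, becomes "Lu Xun Later moved to Beijing .", giving the extractor an explicit E1 mention to pair against.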
Input Filtering by Relation‑Relevant Keywords
Filtering candidate sentences with keywords related to the target relation shrinks the input set dramatically (e.g., marriage extraction input reduced to 13 % of the original) while preserving recall.
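The pre-filter is a cheap keyword pass before the expensive NLP pipeline; the keyword list below for the marriage relation is an illustrative assumption.

```python
# Sketch of relation-relevant keyword pre-filtering: keep only sentences
# containing at least one keyword before running NLP annotation.

MARRIAGE_KEYWORDS = {"married", "wife", "husband", "wedding", "spouse"}

def keyword_filter(sentences, keywords):
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

sentences = [
    "Alice married Bob in 1999 .",
    "Alice works at a hospital .",
    "Bob is Alice 's husband .",
]
kept = keyword_filter(sentences, MARRIAGE_KEYWORDS)
print(len(kept), "of", len(sentences))  # 2 of 3 sentences survive
```

On this toy input the filter keeps 2 of 3 sentences; the article reports the real marriage task shrinking to 13 % of the original input with recall preserved.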
Entity‑Group Expansion
To handle multi‑entity relations (e.g., person‑institution‑position), DeepDive was extended to extract entity groups instead of just pairs.
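Candidate generation for entity groups can be sketched as taking one mention per required type per sentence; the types and mentions below are illustrative.

```python
# Sketch of entity-group candidate generation: the Cartesian product of one
# mention per required type, instead of ordered entity pairs.

from itertools import product

mentions = {
    "PERSON":      ["Jack Ma"],
    "INSTITUTION": ["Alibaba"],
    "POSITION":    ["chairman", "founder"],
}

required = ["PERSON", "INSTITUTION", "POSITION"]
groups = list(product(*(mentions[t] for t in required)))
print(groups)
```

Each group then flows through the same labeling and factor-graph scoring as a pair would, so the rest of the pipeline needs no structural change.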
Application Results
Experiments on a marriage-relation task show that distant supervision combined with heuristic rules yields thousands of positive examples with a 1:2 positive-negative ratio. After iterative refinement, the system achieves probability predictions in the [0.95, 1] range for high-confidence triples.
Precision estimates exceed 95 %, and recall was measured via three sampling methods, yielding rates between 0.49 and 0.62. Error analysis identified common failure modes such as missing entity mentions, insufficient discriminative features, and noisy distant‑supervision labels.
Overall, the DeepDive‑based pipeline processes 80 k–1 M sentences per task, generates 30 k–500 k candidate triples, and completes an iteration in 1–8 hours. To date, the system has produced nearly 30 million candidate triples for Shenma’s knowledge graph across domains such as people, history, organizations, books, and movies.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
