How Cognitive Concept Graphs Power Modern Search Understanding
This article explains the motivation, challenges, architecture, and algorithms behind building a large-scale cognitive concept graph for search. It covers data construction (concept mining, fusion, confidence scoring, hierarchy building, and validation), the service algorithms, platform access, and real-world applications such as intent recognition and entity recommendation.
Background
Concepts are the fundamental units of human cognition, representing abstract reflections of objective things and serving as the building blocks of thought. Constructing a classification system for billions of entities and linking them in a cognitive concept graph is a crucial step toward endowing machines with cognitive abilities.
What Is a Cognitive Concept?
Since Aristotle, humans have organized concepts using taxonomies. Modern knowledge bases like Cyc, WordNet, and HowNet provide high‑quality but limited‑scale concept hierarchies. In search, a cognitive concept refers to the abstract description represented by a user‑mentioned phrase or entity.
Challenges
Massive redundancy: a single instance may have hundreds of concept tags (e.g., “song” vs. “track”).
Varying confidence levels for different tags (e.g., “company” vs. “role”).
Difficulty mining long‑tail domain terms and entities.
Need to extract concepts from non‑entity phrases such as symptoms.
Building hierarchical relationships for millions of concepts.
Filtering out noisy concepts like “hope” or “ideal”.
System Overview
Leveraging the Shenma Search Knowledge Graph and its entity repository, we built a cognitive concept graph that connects user search intent with external commonsense and domain knowledge, providing unified data for search, recommendation, and knowledge‑driven intelligent services.
The graph contains rich concept instances (both entities like "Liu Dehua" and non‑entities like "we"), multi‑granularity concepts (e.g., "actor", "pink‑boy in the entertainment circle"), and hierarchical relations (isA).
Levels:
Level 1: Domain nodes (e.g., “medical”, “music”).
Level 2: Specific cognitive concepts (e.g., “actor”).
Level 3: Fine‑grained user‑oriented concepts (e.g., “pink‑boy in the entertainment circle”).
Instance layer: mentions of concepts.
Features and Advantages
Dynamic: each concept instance carries weighted candidates generated from query distribution (e.g., "Zhou Jielun" → artist 0.58, singer 0.26, actor 0.13).
Highly automated: daily updates of high‑level concepts and evaluation.
Fine granularity: most instances include detailed user‑level concepts.
Broad coverage across domains such as people, medicine, history, automotive, music.
Algorithm Framework
The framework consists of data construction and algorithm services.
Data Construction Process
Steps include concept mining & fusion, confidence calculation, hierarchy building, and concept validation. After importing data, the graph can infer new hierarchical relations and feed back into the knowledge base, forming a closed data loop.
Domain Concept Mining
Entity attributes, encyclopedia tags, and rule-based extraction provide concept tags for regular entities. For long-tail domain entities, we train a skip-gram model on domain texts, select seed words, cluster words by transitive similarity, and filter the clusters with rules. This process extracted over 10,000 domain terms.
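The clustering step can be illustrated with a minimal sketch. It assumes "transitive similarity" means that two words join the same cluster whenever a chain of sufficiently similar neighbors connects them, which is implemented here with union-find. The toy 2-d vectors stand in for skip-gram embeddings, and the function names, the threshold of 0.9, and the sample words are all hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_transitively(vectors, threshold):
    """Group words so that any two words linked by a chain of pairwise
    similarities >= threshold end up in the same cluster (union-find)."""
    words = list(vectors)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())

def expand_seeds(vectors, seeds, threshold):
    """Rule-style filter (an assumption): keep only clusters containing a seed word."""
    seed_set = set(seeds)
    return [c for c in cluster_transitively(vectors, threshold) if c & seed_set]

# Made-up 2-d vectors standing in for skip-gram embeddings.
TOY_VECTORS = {
    "aspirin": (1.0, 0.0),
    "ibuprofen": (0.9, 0.3),
    "naproxen": (0.7, 0.6),
    "guitar": (0.0, 1.0),
}
DRUG_TERMS = expand_seeds(TOY_VECTORS, seeds=["aspirin"], threshold=0.9)
```

In this toy data "naproxen" is not directly similar enough to the seed "aspirin", but it is reached through "ibuprofen", which is the behavior that makes the clustering transitive.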
Phrase Concept Mining
Search queries contain many phrases lacking corresponding entities. We perform unsupervised phrase mining, then train a classifier to label concepts. Frequent‑pattern mining (TopMine) and topic modeling are used to segment text and merge tokens based on contextual scores.
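As a rough illustration of the token-merging step, the sketch below counts contiguous n-grams and then greedily merges adjacent tokens whose co-occurrence is well above what independence would predict. The score is written in the spirit of TopMine's significance measure, not copied from the production system; the topic-modeling and classifier stages are omitted, and the corpus, threshold, and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def count_ngrams(corpus, max_len=3):
    """Count every contiguous n-gram (as a tuple) up to max_len tokens."""
    counts = Counter()
    for tokens in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def significance(left, right, counts, total):
    """How far the observed count of left+right exceeds the count expected
    if the two phrases occurred independently, in standard-deviation-like units."""
    joint = counts[left + right]
    if joint == 0:
        return float("-inf")
    expected = counts[left] * counts[right] / total
    return (joint - expected) / sqrt(joint)

def segment(tokens, counts, total, threshold=1.0):
    """Bottom-up merging: repeatedly join the best-scoring adjacent pair
    until no pair reaches the threshold."""
    phrases = [(t,) for t in tokens]
    while len(phrases) > 1:
        scores = [significance(phrases[i], phrases[i + 1], counts, total)
                  for i in range(len(phrases) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break
        phrases[best:best + 2] = [phrases[best] + phrases[best + 1]]
    return [" ".join(p) for p in phrases]

# Hypothetical tokenized queries.
QUERIES = [
    ["sore", "throat", "remedy"],
    ["sore", "throat", "causes"],
    ["sore", "throat", "in", "children"],
    ["remedy", "for", "cough"],
    ["causes", "of", "cough"],
]
COUNTS = count_ngrams(QUERIES)
TOTAL = sum(len(q) for q in QUERIES)
SEGMENTED = segment(["sore", "throat", "remedy"], COUNTS, TOTAL)
```

Here "sore throat" is merged into one phrase because it recurs across queries, while "remedy" stays separate; a downstream classifier could then label the mined phrase with a concept such as a symptom.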
Concept Fusion
Redundant tags from different sources are merged: level 2 concepts use synonym dictionaries; level 3 concepts use character‑ and word‑level embeddings with a similarity threshold (1e‑3) followed by manual review.
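A minimal sketch of the two fusion paths follows. Level 2 tags are canonicalized through a synonym dictionary, and level 3 tags are paired up for manual review when they look alike. Character-bigram count vectors stand in for the character- and word-level embeddings mentioned above, and the cutoff of 0.8 is an assumption chosen for this toy data rather than the threshold quoted in the text; the dictionary contents and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical synonym dictionary for level 2 concepts.
SYNONYMS = {"track": "song", "tune": "song"}

def canonical(tag, synonyms=SYNONYMS):
    """Map a level 2 tag to its canonical form, or leave it unchanged."""
    return synonyms.get(tag, tag)

def char_bigrams(text):
    """Bag of character bigrams, padded so word edges also contribute."""
    padded = f" {text} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def merge_candidates(tags, threshold=0.8):
    """Return pairs of level 3 tags similar enough to send to manual review."""
    unique = sorted(set(tags))
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if cosine(char_bigrams(a), char_bigrams(b)) >= threshold:
                pairs.append((a, b))
    return pairs

REVIEW_QUEUE = merge_candidates(
    ["hong kong actor", "hong kong actors", "folk singer"]
)
```

Surface-level similarity only catches near-duplicate spellings; learned embeddings would also be needed to pair tags that share meaning but not characters.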
Concept Confidence
Entity-level confidence is derived from popularity signals (e.g., BM25 scores) and normalized. Query-level confidence aggregates query-tag frequencies over one month and normalizes them. The two confidences are then fused; stop-words and noisy concepts are filtered out, and domain-specific concepts are re-weighted.
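The fusion step can be sketched as follows, under stated assumptions: the text does not specify how the two confidences are combined, so this example uses a simple linear interpolation with a hypothetical weight `alpha`, drops noisy concepts, and renormalizes. Domain-specific re-weighting is omitted, and all names and numbers are illustrative.

```python
from collections import Counter

# Noisy concepts to drop; these two are the examples given under Challenges.
NOISY_CONCEPTS = {"hope", "ideal"}

def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total else {}

def query_confidence(tag_observations):
    """Turn a list of observed query-tag labels into a normalized distribution."""
    return normalize(Counter(tag_observations))

def fuse(entity_conf, query_conf, alpha=0.5, noisy=NOISY_CONCEPTS):
    """Linear interpolation of the two confidences (an assumed fusion rule),
    with noisy concepts removed before renormalizing."""
    tags = (entity_conf.keys() | query_conf.keys()) - noisy
    fused = {
        t: alpha * entity_conf.get(t, 0.0) + (1 - alpha) * query_conf.get(t, 0.0)
        for t in tags
    }
    return normalize(fused)

# Hypothetical inputs for a single instance.
ENTITY_CONF = normalize({"singer": 3.0, "actor": 1.0})
QUERY_CONF = query_confidence(["singer", "singer", "actor", "hope"])
FUSED = fuse(ENTITY_CONF, QUERY_CONF)
```

The result is a weighted candidate list per instance, the same shape as the "artist 0.58, singer 0.26, actor 0.13" example given earlier, although the numbers here are invented.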
Hierarchy Construction
Concept instances and candidates are attached to hierarchies using two methods: (1) mapping tables built from Shenma information flow topics and Tencent concept graph topics for level 1–2; (2) query classification and rules for level 3. Instances with a similarity score > 0.3 are linked.
Concept Validation
Noise is filtered using rules and a GBDT model that consider features such as instance length, confidence, and part of speech, removing about 500,000 non-concept instances. Ongoing validation addresses ambiguous or heterogeneous concepts.
Service Algorithms
Dictionary Matching & NER Boundary Tagging
We build dictionaries from graph instances and level 2/3 concepts, match queries against them using a trie with bidirectional maximum matching, and refine the boundaries with an NER service.
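The dictionary-matching step can be sketched as below. The example builds one trie for forward matching and one over reversed words for backward matching, then keeps whichever segmentation has fewer spans, breaking ties by fewer out-of-dictionary spans and then preferring the backward result. That selection heuristic is a common convention and an assumption here; the NER refinement is not shown, and the word list is a textbook toy example.

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}
        self.terminal = False

def build_trie(words):
    """Insert every word into a character-level trie."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root

def longest_match(trie, text, start):
    """Length of the longest dictionary word starting at text[start], or 0."""
    node, best = trie, 0
    for offset in range(start, len(text)):
        node = node.children.get(text[offset])
        if node is None:
            break
        if node.terminal:
            best = offset - start + 1
    return best

def forward_match(trie, text):
    """Greedy left-to-right maximum matching; unknown characters become single spans."""
    spans, i = [], 0
    while i < len(text):
        length = longest_match(trie, text, i) or 1
        spans.append(text[i:i + length])
        i += length
    return spans

def backward_match(reversed_trie, text):
    """Right-to-left maximum matching, done by matching the reversed text
    against a trie of reversed words and flipping the result back."""
    spans = forward_match(reversed_trie, text[::-1])
    return [s[::-1] for s in reversed(spans)]

def bidirectional_match(words, text):
    vocabulary = set(words)
    fwd = forward_match(build_trie(vocabulary), text)
    bwd = backward_match(build_trie(w[::-1] for w in vocabulary), text)

    def cost(spans):
        # Fewer spans first, then fewer spans that are not dictionary words.
        return (len(spans), sum(s not in vocabulary for s in spans))

    return fwd if cost(fwd) < cost(bwd) else bwd

# Toy dictionary: "research", "graduate student", "life", "origin".
WORDS = ["研究", "研究生", "生命", "起源"]
TEXT = "研究生命起源"  # "research the origin of life"
FORWARD = forward_match(build_trie(WORDS), TEXT)
BEST = bidirectional_match(WORDS, TEXT)
```

Forward matching alone greedily takes "研究生" and strands "命", whereas the backward pass yields three dictionary words, which is why the two directions are compared.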
BERT‑Based Entity Typing Disambiguation
A BERT model trained on ~1 billion search logs (without the next-sentence prediction loss) provides embeddings for out-of-vocabulary (OOV) words; a downstream MLP classifies entity mentions into disambiguated concepts.
Platform Access
Data can be accessed via a web UI (https://concept.proxy.taobao.org), API, ODPS tables, or Pangu data dumps.
Applications
Intent Recognition
By mapping query concepts (e.g., "Song Jiang" → "unpopular Liangshan hero", "fictional character"), the system accurately infers user intent, improving retrieval, ranking, and recommendation.
Entity Recommendation
When users search for multiple universities in Shanghai, the graph discovers the shared concept "Shanghai higher education institutions" and recommends related schools such as Shanghai International Studies University and Tongji University.
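One way to picture the shared-concept lookup is the sketch below. It intersects the concept sets of the searched entities, picks the most specific shared concept (assumed here to be the one with the fewest instances), and recommends that concept's other instances. The toy isA edges, the specificity rule, and the function names are hypothetical and are not taken from the production graph.

```python
# Hypothetical isA edges: entity -> set of concepts.
CONCEPTS_OF = {
    "Fudan University": {"university", "Shanghai higher education institutions"},
    "Shanghai Jiao Tong University": {"university", "Shanghai higher education institutions"},
    "Tongji University": {"university", "Shanghai higher education institutions"},
    "Shanghai International Studies University": {"university", "Shanghai higher education institutions"},
    "Peking University": {"university", "Beijing higher education institutions"},
}

def shared_concept(searched, concepts_of):
    """Most specific concept shared by all searched entities, or None."""
    if not searched:
        return None
    common = set.intersection(*(concepts_of[e] for e in searched))
    if not common:
        return None

    def member_count(concept):
        return sum(concept in tags for tags in concepts_of.values())

    # Fewer members is treated as more specific; sorting keeps ties deterministic.
    return min(sorted(common), key=member_count)

def recommend(searched, concepts_of):
    """Other instances of the shared concept, excluding what was already searched."""
    concept = shared_concept(searched, concepts_of)
    if concept is None:
        return concept, []
    related = sorted(
        entity for entity, tags in concepts_of.items()
        if concept in tags and entity not in searched
    )
    return concept, related

SEARCHED = ["Fudan University", "Shanghai Jiao Tong University"]
CONCEPT, RELATED = recommend(SEARCHED, CONCEPTS_OF)
```

Both searched schools also share the broader concept "university", so some notion of specificity is needed to avoid recommending every university in the graph; concept confidence weights could play the same role.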
Future Plans and Reflections
Knowledge empowerment: apply the cognitive concept graph to more business scenarios.
Deeper user understanding: extract finer‑grained user needs from queries.
Content comprehension: link content with concepts for better text understanding.
Knowledge expansion: enrich the graph with more concepts.
Intelligent reasoning: automate large‑scale hierarchical construction.
References
Foundations of Cognitive Intelligence (认知智能基础) – https://www.atatech.org/articles/136437
A User-Centered Concept Mining System for Query and Document Understanding at Tencent – KDD2019
Concept‑based Short Text Classification and Ranking – CIKM2014
Query Understanding through Knowledge‑Based Conceptualization – IJCAI2015
Deep Short Text Classification with Knowledge Powered Attention – AAAI2019
Inferring Concept Hierarchies from Text Corpora via Hyperbolic Embeddings – ACL2019
Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification – IJCAI2017
COMET: Commonsense Transformers for Automatic Knowledge Graph Construction – ACL2019
Entity Suggestion with Conceptual Explanation – IJCAI2017
Matching Article Pairs with Graphical Decomposition and Convolutions – ACL2019
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
