How Hyperbolic Space and Contrastive Learning Boost Domain-Specific Language Models
This article introduces KANGAROO, a knowledge-enhanced pre-trained language model for vertical domains. KANGAROO injects hierarchical semantic information through hyperbolic-space embeddings and applies contrastive learning over dense subgraph structures to overcome the global sparsity of vertical-domain knowledge graphs. Its performance is evaluated on finance and medical downstream tasks.
Background
Knowledge‑enhanced pre‑trained language models (KEPLMs) improve downstream NLP tasks by injecting factual knowledge from large‑scale knowledge graphs (KGs). However, vertical‑domain KGs exhibit global sparsity (low entity coverage) together with local density (highly connected subgraphs), which makes open‑domain KEPLM methods a poor fit. To address these characteristics, the authors propose a unified framework for learning KEPLMs across various vertical domains.
Algorithm Overview
The KANGAROO model tackles the sparsity‑density dilemma with two modules:
Hyperbolic Knowledge‑aware Aggregator: embeds the hierarchical semantic information of vertical‑domain KG data into hyperbolic space (the Poincaré ball) to compensate for global sparsity.
Multi‑Level Knowledge‑aware Augmenter: uses contrastive learning on point‑biconnected subgraphs to exploit locally dense structures.
Hyperbolic Knowledge‑aware Aggregator
Euclidean embeddings struggle with complex hierarchical patterns. Inspired by the Poincaré ball model, the aggregator learns hyperbolic entity embeddings that capture both structural and semantic relations. The distance between two entities e_i and e_j is defined in hyperbolic space, and synonymous entities are brought closer by minimizing this distance.
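To make the hyperbolic distance concrete, here is a minimal sketch of the standard Poincaré ball geodesic distance in NumPy. This is the textbook formula, not KANGAROO's actual implementation; the example points `root` and `leaf` are hypothetical.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball:
    d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2) (1 - ||v||^2)))."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / max(denom, eps))

# Points near the origin model general concepts; points near the boundary
# model specific entities, so hierarchy depth maps onto the radius.
root = np.array([0.0, 0.0])   # hypothetical general concept
leaf = np.array([0.0, 0.9])   # hypothetical specific entity
print(poincare_distance(root, leaf))
```

Distances blow up near the boundary, which is what lets a low-dimensional ball embed tree-like hierarchies that Euclidean space distorts.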
Domain Knowledge Encoder
Entity Space Infusion
Hyperbolic entity embeddings are concatenated with token embeddings to inject hierarchical knowledge into contextual representations.
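The infusion step above amounts to widening each token's representation with its matched entity's embedding. A minimal sketch, with hypothetical dimensions and zero vectors for tokens that match no KG entity:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_token, d_entity = 8, 768, 64   # hypothetical dimensions

token_embs = rng.normal(size=(seq_len, d_token))
# One aligned entity slot per token; zeros where no entity matches.
entity_embs = np.zeros((seq_len, d_entity))
entity_embs[2] = rng.normal(size=d_entity) * 0.1  # token 2 matches a KG entity

# Concatenate along the feature axis to inject hierarchical knowledge.
fused = np.concatenate([token_embs, entity_embs], axis=-1)
print(fused.shape)
```

The downstream encoder then consumes the widened `(seq_len, d_token + d_entity)` representations.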
Entity Knowledge Injector
Heterogeneous features of entity embeddings {h_ej} are fused via a multi‑head attention layer, using overlapping word thresholds to match relevant entities from the domain KG.
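The "overlapping word threshold" matching can be illustrated with a simple word-overlap ratio; the function name, threshold value, and example entities below are hypothetical, and KANGAROO's real matcher may differ.

```python
def match_entities(mention, kg_entities, threshold=0.5):
    """Match a text mention to KG entities whose word overlap (Jaccard
    ratio) meets a threshold (illustrative sketch)."""
    m_words = set(mention.lower().split())
    matches = []
    for ent in kg_entities:
        e_words = set(ent.lower().split())
        overlap = len(m_words & e_words) / max(len(m_words | e_words), 1)
        if overlap >= threshold:
            matches.append(ent)
    return matches

print(match_entities("acute myocardial infarction",
                     ["myocardial infarction", "cerebral infarction", "asthma"]))
```

Only entities clearing the overlap threshold feed into the multi-head attention fusion.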
Multi‑Level Knowledge‑aware Augmenter
This component learns fine‑grained semantic relations of injected knowledge triples by constructing positive and negative samples of varying difficulty based on point‑biconnected subgraph structures.
Positive Sample Construction
For a target entity, the K nearest neighboring triples are extracted, concatenated into a sentence, and encoded with a shared text encoder (e.g., BERT). Position embeddings are adjusted so that tokens belonging to the same triple share identical indices, and the [CLS] token represents the positive sample embedding.
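The position-sharing trick above can be sketched as follows; the tokenization is simplified to whole words and the triples are hypothetical, but the key point, identical position indices within one triple, is shown.

```python
def build_positive_sample(triples):
    """Concatenate a target entity's neighboring triples into one sequence.
    All tokens of the same triple share a single position index, so the
    encoder treats a triple as an unordered knowledge unit (sketch)."""
    tokens, positions = ["[CLS]"], [0]
    for pos, (head, rel, tail) in enumerate(triples, start=1):
        for tok in (head, rel, tail):
            tokens.append(tok)
            positions.append(pos)  # identical index within one triple
    return tokens, positions

toks, pos = build_positive_sample([
    ("aspirin", "treats", "fever"),
    ("aspirin", "interacts_with", "warfarin"),
])
print(toks)
print(pos)
```

The `[CLS]` slot at position 0 is where the shared text encoder's output would be read off as the positive sample embedding.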
Negative Sample Construction (Point‑Biconnected Component‑based)
Negative samples are generated by exploring nodes farther from the target entity using hop distances. Different hop levels correspond to varying difficulty; shorter hops yield harder negatives. Multiple negative samples per entity are built across several difficulty levels.
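Grouping candidate negatives by hop distance is a plain breadth-first search over the KG. A minimal sketch with a hypothetical adjacency list:

```python
from collections import deque

def nodes_by_hop(adj, target, max_hop):
    """BFS from the target entity, grouping reachable nodes by hop distance."""
    seen, frontier = {target}, deque([(target, 0)])
    by_hop = {}
    while frontier:
        node, hop = frontier.popleft()
        if hop > 0:
            by_hop.setdefault(hop, []).append(node)
        if hop < max_hop:
            for nb in adj.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, hop + 1))
    return by_hop

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
levels = nodes_by_hop(adj, "A", max_hop=3)
print(levels)
```

Nodes at hop 1 sit in the same dense neighborhood as the target and so make the hardest negatives; deeper hops yield progressively easier ones, giving the multi-level difficulty curriculum.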
Training Objectives
The loss combines a standard token‑level masked language modeling (MLM) objective with a contrastive learning objective based on point‑biconnected components.
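The combined objective can be sketched with a generic InfoNCE-style contrastive term added to an MLM loss; the temperature, vector sizes, and placeholder MLM value below are hypothetical, not KANGAROO's reported hyperparameters.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: cross-entropy of picking the positive among negatives
    by cosine similarity (generic sketch)."""
    def sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)]
                      + [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)
pos = anchor + 0.01 * rng.normal(size=16)       # near-duplicate positive
negs = [rng.normal(size=16) for _ in range(4)]  # subgraph-derived negatives
mlm_loss = 2.3                                  # placeholder MLM loss value
total = mlm_loss + info_nce(anchor, pos, negs)  # joint objective (sketch)
print(total)
```

In training, both terms are computed per batch and summed (possibly weighted), so the encoder is optimized for token recovery and triple-level discrimination at once.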
Algorithm Evaluation
KANGAROO is evaluated on full‑data and few‑shot fine‑tuning scenarios across finance and medical downstream tasks. Results show superior performance compared to baseline models. Additional experiments compare Euclidean versus hyperbolic distance metrics, confirming the advantage of hyperbolic representations.
Open‑Source Release
The KANGAROO algorithm will be contributed to the EasyNLP framework, enabling NLP practitioners and researchers to adopt the model.
References
Chengyu Wang et al., "EasyNLP: A Comprehensive and Easy‑to‑use Toolkit for Natural Language Processing", EMNLP 2022.
Zhengyan Zhang et al., "ERNIE: Enhanced Language Representation with Informative Entities", ACL 2019.
Xiaozhi Wang et al., "KEPLER: A Unified Model for Knowledge Embedding and Pre‑trained Language Representation", TACL 2021.
Yusheng Su et al., "CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre‑trained Language Models", AI Open 2021.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.