
Domain Knowledge Enhanced Pretrained Language Model for Medicinal Product Vertical Search

This article presents a domain‑knowledge‑enhanced pretrained language model that combines ELECTRA‑style pretraining with entity‑level masking and a novel product‑attribute prediction (PAP) task to improve query understanding, intent classification, and relevance matching in vertical drug e‑commerce search, and validates its effectiveness through extensive experiments on public and proprietary datasets.


Business Background: In e‑commerce, drug search is a vertical search that must accurately match users' medication needs with product supply; it demands high precision, professional medical knowledge, and robust handling of ambiguous user queries.

Technical Background: The approach builds on pretrained language models (PLMs) such as BERT and ELECTRA, as well as domain‑specific models (BioBERT, PubMedBERT). It targets three application scenarios: query entity recognition (NER), query intent classification, and query‑title relevance classification.

Domain Knowledge Enhancement: Two common methods are employed – (1) continued pre‑training on domain‑specific corpora and (2) integrating knowledge graphs (e.g., ERNIE, KGLM). The authors also leverage structured product data from relational databases.

Method Overview:

RTD (Replaced Token Detection) task based on ELECTRA, enhanced with entity‑level masking for medical terms.
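The entity‑level masking idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name, the `(start, end)` span format, and the masking probability are all assumptions.

```python
import random

def entity_level_mask(tokens, entity_spans, mask_token="[MASK]", mask_prob=0.15):
    """Mask whole medical-entity spans instead of independent tokens.

    tokens:       list of token strings
    entity_spans: list of (start, end) index pairs marking domain entities
                  (span format is an illustrative assumption)

    Masking an entity as one unit forces the model to recover the full
    medical term from context, rather than guessing isolated subwords.
    """
    masked = list(tokens)
    for start, end in entity_spans:
        # decide once per entity, then mask the entire span
        if random.random() < mask_prob:
            for i in range(start, end):
                masked[i] = mask_token
    return masked
```

For example, with the span covering "otitis media" and `mask_prob=1.0`, both tokens of the entity are replaced together, whereas ordinary token‑level masking could leave half the term visible.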

PAP (Product Attribute Prediction) task that jointly trains on (title, attribute name, attribute value) triples, encouraging the model to pull matching title‑attribute pairs closer while pushing negative samples apart.

The overall loss is the sum of RTD and PAP losses, with a hyper‑parameter controlling their relative weight.
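The combined objective described above can be sketched in pure Python. The InfoNCE‑style contrastive formulation for PAP, the cosine‑similarity scoring, and the temperature parameter are assumptions for illustration; the paper's exact objective may differ. Only the weighting scheme `L = L_RTD + lam * L_PAP` comes from the summary.

```python
import math

def pap_loss(title_embs, attr_embs, temperature=0.1):
    """Contrastive product-attribute prediction loss (illustrative sketch).

    title_embs[i] and attr_embs[i] form a matching (title, attribute) pair;
    other rows in the batch serve as in-batch negatives, so matching pairs
    are pulled together and mismatched pairs pushed apart.
    """
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) *
                      math.sqrt(sum(x * x for x in v)))

    n = len(title_embs)
    loss = 0.0
    for i in range(n):
        sims = [cos(title_embs[i], attr_embs[j]) / temperature for j in range(n)]
        # softmax cross-entropy with the matching attribute as the target
        log_z = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_z)
    return loss / n

def total_loss(rtd_loss, title_embs, attr_embs, lam=1.0):
    """Overall objective: L = L_RTD + lam * L_PAP, where lam is the
    weighting hyper-parameter mentioned in the summary."""
    return rtd_loss + lam * pap_loss(title_embs, attr_embs)
```

With aligned embeddings (each title closest to its own attribute) the PAP loss is near zero; shuffling the attribute rows drives it up, which is the behavior the multitask objective rewards.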

Experiments:

Datasets: the public ChineseBLUE medical NLP benchmark and a proprietary drug‑search dataset (query‑title relevance, intent classification, NER).

Baselines: standard ELECTRA, BERT‑base, and variants with entity masking.

Results: The proposed model consistently outperforms baselines across most metrics, especially on hard relevance sets where PAP contributes significant gains.

Ablation studies confirm the importance of both entity masking and the PAP task.

Case Study: For the query “中耳炎” (otitis media), the model correctly scores relevant drug titles high and irrelevant ones low, demonstrating that it has learned treatment relationships.

Conclusion:

Introduced a knowledge‑enhanced PLM for vertical drug search.

Added entity‑level masking to ELECTRA.

Proposed a novel product‑attribute prediction task integrated via multitask learning.

Experimental results show notable improvements and validate the effectiveness of the PAP task.

Paper: Liu, Kesong, Jianhui Jiang, and Feifei Lyu. “A Domain Knowledge Enhanced Pre‑Trained Language Model for Vertical Search: Case Study on Medicinal Products.” COLING (2022).

Tags: pretrained language model, multitask learning, ELECTRA, domain knowledge, medical NLP, vertical search
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
