Query Term Weighting Techniques for Medical Search: Statistical, Supervised, and Neural Approaches

This article reviews the challenges of short‑text query understanding in medical search and surveys a range of term‑weighting methods—including statistical models, supervised weighting, knowledge‑graph‑enhanced extraction, and neural network‑based approaches—highlighting their assumptions, implementations, and practical considerations for improving retrieval relevance.

DataFunTalk

Computing query term weight (also called Term Necessity) is a fundamental problem in information retrieval, especially for short medical queries where each word may carry a different importance for recall.

When a query contains a single word (e.g., "考研", the postgraduate entrance exam), the user intent is broad and any document containing that term is acceptable. With multiple words, the problem becomes much harder because the query may express entity attributes, combined entities, multiple related entities, or complex logical statements, so weighting its terms correctly requires deeper semantic understanding.

The article first introduces a simple statistical term‑weight model that uses query‑document features such as term frequency, co‑occurrence, and click data to assign weights, aiming to prioritize more informative terms. For example, in the query "血粘度 的 判断标准" ("diagnostic criteria for blood viscosity"), "血粘度" (blood viscosity) should receive a higher weight than "判断标准" (diagnostic criteria).
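As a rough illustration of the statistical idea, the sketch below weights query terms by a smoothed inverse document frequency and normalizes the weights to sum to 1. It is a simplification: the model described above also uses co‑occurrence and click data, which are omitted here. The function name `idf_term_weights`, the toy corpus, and the smoothing choice are all assumptions made for this example.

```python
import math


def idf_term_weights(query_terms, documents):
    """Weight each query term by a smoothed IDF, normalized to sum to 1.

    `documents` is a list of token lists. Rarer terms get larger weights.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    raw = {}
    for term in query_terms:
        # Document frequency: number of documents that contain the term.
        df = sum(1 for doc in doc_sets if term in doc)
        # Add-one smoothing keeps the weight finite and positive.
        raw[term] = math.log((n_docs + 1) / (df + 1)) + 1.0
    total = sum(raw.values())
    return {term: weight / total for term, weight in raw.items()}


# Hypothetical toy corpus, already tokenized and translated.
corpus = [
    ["blood_viscosity", "test", "criteria"],
    ["hypertension", "diagnosis", "criteria"],
    ["diabetes", "criteria", "guideline"],
    ["blood_viscosity", "treatment"],
]
weights = idf_term_weights(["blood_viscosity", "criteria"], corpus)
print(weights)
```

In this toy corpus "criteria" appears in more documents than "blood_viscosity", so it receives the smaller weight, which matches the intuition in the example query.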

It then surveys supervised term‑weighting techniques. The classic "Term Necessity Prediction" work treats term weight as a regression problem, extracting features like topic centrality, synonymy, and abstraction. A flexible supervised model combines three factors—local term presence, global document frequency (penalizing overly common terms), and a normalization factor—to learn weights from labeled data.
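To make the regression framing concrete, the sketch below computes a training target of the kind such a model could be fit to: the fraction of judged‑relevant documents that contain each query term. This follows a common formulation of term necessity; the feature extraction and the regressor itself are not shown, and the function name and toy relevance judgments are assumptions for illustration.

```python
def term_necessity(query_terms, relevant_docs):
    """Return, for each query term, the share of relevant docs containing it.

    `relevant_docs` is a list of token lists judged relevant to the query.
    """
    if not relevant_docs:
        raise ValueError("at least one relevant document is required")
    doc_sets = [set(doc) for doc in relevant_docs]
    return {
        term: sum(1 for doc in doc_sets if term in doc) / len(doc_sets)
        for term in query_terms
    }


# Hypothetical relevance judgments for the query "blood_viscosity criteria".
relevant = [
    ["blood_viscosity", "high", "risk"],
    ["blood_viscosity", "criteria", "range"],
    ["blood_viscosity", "test"],
    ["viscosity", "criteria"],
]
targets = term_necessity(["blood_viscosity", "criteria"], relevant)
print(targets)  # {'blood_viscosity': 0.75, 'criteria': 0.5}
```

A supervised model would then learn to predict these targets for unseen queries from term features such as the ones listed above.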

The article also summarizes several papers that enrich supervised weighting with additional signals: mutual information, chi‑square, information gain, inverse class frequency, and probabilistic descriptions of term discriminativeness. These methods integrate document‑type tags, click popularity, and other supervised cues.
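One of the signals listed above, the chi‑square statistic, can be sketched from a 2x2 contingency table of term presence against class membership. The example uses the standard formula; the labeled documents and class names are invented for illustration and do not come from the surveyed papers.

```python
def chi_square(term, target_class, labeled_docs):
    """Chi-square score of `term` with respect to `target_class`.

    `labeled_docs` is a list of (tokens, label) pairs.
    """
    a = b = c = d = 0
    for tokens, label in labeled_docs:
        has_term = term in tokens
        in_class = label == target_class
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical labeled documents.
docs = [
    (["blood_viscosity", "elevated"], "hematology"),
    (["blood_viscosity", "criteria"], "hematology"),
    (["anemia", "criteria"], "hematology"),
    (["fracture", "criteria"], "orthopedics"),
    (["fracture", "cast"], "orthopedics"),
    (["sprain", "criteria"], "orthopedics"),
]
print(chi_square("blood_viscosity", "hematology", docs))  # 3.0
print(chi_square("criteria", "hematology", docs))         # 0.0
```

Here "criteria" is spread evenly across both classes and scores 0, while "blood_viscosity" occurs only in one class and scores higher, so it is the more discriminative term.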

To overcome the limitations of pure statistical methods, the article discusses keyword‑extraction‑based weighting. The MIKE framework integrates multidimensional information (co‑occurrence, topic distribution, Word2Vec) into a modified random‑walk graph. Knowledge‑graph‑enhanced extraction further enriches the graph with entity‑relation edges, allowing more expressive semantic contexts.
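The random‑walk component can be sketched as a plain TextRank‑style walk over a weighted co‑occurrence graph. This is not the MIKE model itself, which modifies the walk with topic and embedding information; the window size, damping factor, and function names are assumptions for this example.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph weighted by co-occurrence within `window`."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                graph[left][right] += 1.0
                graph[right][left] += 1.0
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """Score nodes with a damped random walk (power iteration)."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(
                scores[nb] * graph[nb][node] / out_weight[nb]
                for nb in graph[node]
            )
            updated[node] = (1.0 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


# Hypothetical tokenized text.
tokens = ["blood_viscosity", "high", "blood_viscosity",
          "criteria", "blood_viscosity", "test"]
scores = random_walk_scores(cooccurrence_graph(tokens, window=1))
print(max(scores, key=scores.get))  # blood_viscosity
```

Knowledge‑graph‑enhanced variants would add further edges (for example between related entities) to the same graph before running the walk.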

Neural approaches are also covered. SIFRank, an unsupervised method, combines sentence embeddings (SIF) with ELMo to compute similarity between candidate noun phrases and the whole document. DeepCT, which is trained with supervision, uses BERT to generate contextual token vectors and a linear regressor to predict term importance for both queries and passages, producing weights that can be stored in a traditional inverted index.
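The last step, storing predicted weights in an inverted index, can be sketched as a simple scaling of real‑valued weights into integer, TF‑like values. The predicted weights below are made up (no model is run), and the scale factor of 100 and the function name `to_index_tf` are assumptions for illustration.

```python
def to_index_tf(predicted_weights, scale=100):
    """Convert predicted term weights into integer TF-like index values.

    Weights are clipped to [0, 1], scaled, and rounded half up. Terms that
    round to zero are dropped, so they would not be indexed at all.
    """
    index_tf = {}
    for term, weight in predicted_weights.items():
        clipped = min(max(weight, 0.0), 1.0)
        value = int(clipped * scale + 0.5)
        if value > 0:
            index_tf[term] = value
    return index_tf


# Hypothetical regressor outputs for one passage.
predicted = {"blood_viscosity": 0.62, "criteria": 0.18, "the": 0.001, "of": -0.02}
passage_tf = to_index_tf(predicted)
print(passage_tf)  # {'blood_viscosity': 62, 'criteria': 18}
```

Because the output looks like ordinary term frequencies, an existing ranking function such as BM25 can consume it without changes to the retrieval engine.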

Finally, the article outlines a practical pipeline for the DXY medical search engine: a knowledge‑graph‑aware TextRank model that boosts entity terms, statistical features (TF‑IDF, attribute word lists, stop‑word lists), TFDeepCT scaling for high‑variance queries, and integration of MIKE‑style click and topic features. Future work includes merging knowledge‑graph structures and the TeKET tree‑based method to better suit medical domain characteristics.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
