Query Term Weighting Techniques for Medical Search: Statistical, Supervised, and Neural Approaches
This article reviews the challenges of short‑text query understanding in medical search and surveys a range of term‑weighting methods—including statistical models, supervised weighting, knowledge‑graph‑enhanced extraction, and neural network‑based approaches—highlighting their assumptions, implementations, and practical considerations for improving retrieval relevance.
Computing query term weight (also called Term Necessity) is a fundamental problem in information retrieval, especially for short medical queries where each word may carry a different importance for recall.
When a query contains a single word (e.g., "考研", the postgraduate entrance exam), the user intent is broad and any document containing that term is acceptable. With multiple words, the problem becomes considerably harder because the query may express entity attributes, combined entities, multiple related entities, or complex logical statements, all of which require deeper semantic understanding.
The article first introduces a simple statistical term‑weight model that uses query‑document features such as term frequency, co‑occurrence, and click data to assign weights, so that more informative terms rank higher. For example, in the query "血粘度 的 判断标准" ("criteria for assessing blood viscosity"), "血粘度" (blood viscosity) should receive a higher weight than "判断标准" (assessment criteria).
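As a minimal sketch of the statistical idea, the snippet below weights query terms by a smoothed inverse document frequency, normalized over the query. It covers only one of the features the article lists (it ignores co‑occurrence and click data), and the function name, the toy corpus, and the English stand-in tokens ("blood_viscosity" for "血粘度", "criteria" for "判断标准") are assumptions made for illustration.

```python
import math


def idf_term_weights(query_terms, documents):
    """Give each query term a smoothed-IDF weight, normalized to sum to 1."""
    n_docs = len(documents)
    raw = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for doc in documents if term in doc)
        raw[term] = math.log((n_docs + 1) / (df + 1)) + 1.0
    total = sum(raw.values())
    return {term: value / total for term, value in raw.items()}


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"blood_viscosity", "criteria", "test"},
    {"hypertension", "criteria"},
    {"diabetes", "criteria", "diet"},
    {"anemia", "criteria"},
]
weights = idf_term_weights(["blood_viscosity", "criteria"], docs)
print(weights)
```

Because "criteria" appears in every toy document while "blood_viscosity" appears in one, the rarer term ends up with the larger share of the weight.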
It then surveys supervised term‑weighting techniques. The classic "Term Necessity Prediction" work treats term weight as a regression problem, extracting features such as topic centrality, synonymy, and abstractness. A more flexible supervised model combines three factors to learn weights from labeled data: local term presence, global document frequency (which penalizes overly common terms), and a normalization factor.
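To make the regression framing concrete, here is a small sketch that fits a linear model from per-term features to a necessity label using plain batch gradient descent. The three feature columns (topic centrality, synonymy, abstractness), their values, and the labels are all hypothetical; the original work's feature extraction and learner are not reproduced here.

```python
def predict(weights, bias, x):
    """Linear prediction of term necessity from a feature vector."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def train_linear_regressor(features, targets, learning_rate=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            error = predict(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += 2 * error * xi / n_samples
            grad_b += 2 * error / n_samples
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical features per term: [topic_centrality, synonymy, abstractness].
X = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.3, 0.6, 0.7],
    [0.2, 0.7, 0.8],
    [0.6, 0.3, 0.4],
]
# Hypothetical necessity labels in [0, 1].
y = [0.95, 0.90, 0.40, 0.30, 0.70]

model_weights, model_bias = train_linear_regressor(X, y)
mse = sum((predict(model_weights, model_bias, x) - t) ** 2 for x, t in zip(X, y)) / len(X)
print(model_weights, model_bias, mse)
```

In practice a library regressor would replace the hand-written loop; the point here is only that term weight is predicted as a continuous value from term-level features.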
Several papers are summarized that enrich supervised weighting with additional signals: mutual information, chi‑square, information gain, inverse class frequency, and probabilistic descriptions of term discriminativeness. These methods integrate document‑type tags, click popularity, and other supervised cues.
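Two of the signals named above, chi‑square and (pointwise) mutual information, can be computed from a 2x2 term/class contingency table. The sketch below uses the standard textbook formulas; the function names and the counts are illustrative assumptions, not taken from the surveyed papers.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def pointwise_mutual_information(a, b, c, d):
    """PMI (in bits) between 'term present' and 'document in class'."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + c) * (a + b)))


# Hypothetical counts: the term is concentrated in the class.
print(chi_square(40, 10, 10, 40))
print(pointwise_mutual_information(40, 10, 10, 40))
# Hypothetical counts: the term is independent of the class.
print(chi_square(25, 25, 25, 25))
```

A term that is evenly spread across classes scores zero on both measures, which is why these statistics are used to capture discriminativeness.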
To overcome the limitations of pure statistical methods, the article discusses keyword‑extraction‑based weighting. The MIKE framework integrates multidimensional information (co‑occurrence, topic distribution, Word2Vec) into a modified random‑walk graph. Knowledge‑graph‑enhanced extraction further enriches the graph with entity‑relation edges, allowing more expressive semantic contexts.
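The sketch below illustrates the general idea of fusing several signals into the edge weights of a random‑walk graph. It is a generic weighted PageRank over terms, not the MIKE algorithm itself: the edge weight simply mixes co‑occurrence counts with embedding cosine similarity, the topic-distribution signal is omitted, and the mixing parameter `alpha`, the two-dimensional vectors, and the counts are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fused_random_walk(cooccurrence, embeddings, alpha=0.5, damping=0.85, iterations=50):
    """Rank terms by a random walk whose edges mix two signals."""
    terms = sorted(embeddings)
    weight = {t: {} for t in terms}
    for i, u in enumerate(terms):
        for v in terms[i + 1:]:
            count = cooccurrence.get((u, v), 0) + cooccurrence.get((v, u), 0)
            similarity = max(cosine(embeddings[u], embeddings[v]), 0.0)
            w = alpha * count + (1 - alpha) * similarity
            if w > 0:
                weight[u][v] = w
                weight[v][u] = w
    strength = {t: sum(weight[t].values()) for t in terms}
    scores = {t: 1.0 / len(terms) for t in terms}
    for _ in range(iterations):
        updated = {}
        for t in terms:
            # Mass flowing in from each neighbor, split by edge weight.
            incoming = sum(scores[m] * w / strength[m] for m, w in weight[t].items())
            updated[t] = (1 - damping) / len(terms) + damping * incoming
        scores = updated
    return scores


# Hypothetical 2-d embeddings and co-occurrence counts.
embeddings = {
    "blood_viscosity": [0.9, 0.1],
    "criteria": [0.2, 0.9],
    "test": [0.7, 0.4],
    "high": [0.8, 0.3],
}
cooccurrence = {
    ("blood_viscosity", "test"): 3,
    ("blood_viscosity", "high"): 2,
    ("blood_viscosity", "criteria"): 1,
}
scores = fused_random_walk(cooccurrence, embeddings)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Terms that co-occur with many others and sit close to them in embedding space accumulate the most probability mass.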
Neural approaches are also covered. SIFRank, an unsupervised method, combines SIF sentence embeddings with ELMo to score each candidate noun phrase by its similarity to the whole document. DeepCT, by contrast, is supervised: it uses BERT to generate contextual token vectors and a linear regression layer to predict term importance for both queries and passages, producing weights that can be stored in a traditional inverted index.
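The last step, storing predicted weights in an inverted index, can be sketched as follows. Real-valued predictions are scaled to integer pseudo term frequencies so that a conventional index and ranking function can consume them. The predicted weights below are hand-written placeholders rather than BERT outputs, and the scale factor of 100 and the clamping of negative predictions to zero are assumptions of this sketch.

```python
def to_index_term_frequencies(predicted_weights, scale=100):
    """Convert real-valued term weights into integer pseudo term frequencies."""
    frequencies = {}
    for term, weight in predicted_weights.items():
        # Assumption: negative predictions are clamped to zero and dropped.
        value = round(max(weight, 0.0) * scale)
        if value > 0:
            frequencies[term] = value
    return frequencies


def build_inverted_index(passages):
    """Map each term to a postings list of (passage_id, pseudo_tf) pairs."""
    index = {}
    for passage_id, predicted_weights in passages.items():
        for term, tf in to_index_term_frequencies(predicted_weights).items():
            index.setdefault(term, []).append((passage_id, tf))
    return index


# Hypothetical per-passage predictions standing in for model output.
passages = {
    "p1": {"blood_viscosity": 0.82, "criteria": 0.31, "the": -0.02},
    "p2": {"blood_viscosity": 0.12, "diet": 0.64},
}
index = build_inverted_index(passages)
print(index)
```

The appeal of this design is that the expensive model runs once at indexing time, while query-time retrieval stays as cheap as ordinary term-frequency lookup.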
Finally, the article outlines a practical pipeline for the DXY medical search engine: a knowledge‑graph‑aware TextRank model that boosts entity terms, statistical features (TF‑IDF, attribute word lists, stop‑word lists), TFDeepCT scaling for high‑variance queries, and integration of MIKE‑style click and topic features. Future work includes merging knowledge‑graph structures and the TeKET tree‑based method to better suit medical domain characteristics.
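One plausible way to make TextRank favor entity terms is to bias the teleport distribution toward terms found in the knowledge graph (personalized PageRank). The sketch below shows that mechanism only; it is an assumption about how such a boost could work, not a description of the DXY implementation, and the `boost` value, the entity set, and the toy graph are hypothetical.

```python
def entity_boosted_textrank(edges, entity_terms, boost=3.0, damping=0.85, iterations=50):
    """TextRank whose teleport distribution favors knowledge-graph entities."""
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    nodes = sorted(neighbors)
    # Teleport prior: entity terms get `boost` times the mass of other terms.
    prior = {n: (boost if n in entity_terms else 1.0) for n in nodes}
    total = sum(prior.values())
    prior = {n: p / total for n, p in prior.items()}
    strength = {n: sum(neighbors[n].values()) for n in nodes}
    scores = dict(prior)
    for _ in range(iterations):
        scores = {
            n: (1 - damping) * prior[n]
            + damping * sum(scores[m] * w / strength[m] for m, w in neighbors[n].items())
            for n in nodes
        }
    return scores


# Hypothetical, fully symmetric term graph: without a boost all terms tie.
edges = {
    ("blood_viscosity", "criteria"): 1.0,
    ("blood_viscosity", "normal"): 1.0,
    ("criteria", "normal"): 1.0,
}
boosted = entity_boosted_textrank(edges, entity_terms={"blood_viscosity"})
unboosted = entity_boosted_textrank(edges, entity_terms=set())
print(boosted)
print(unboosted)
```

Using a symmetric graph isolates the effect of the prior: any difference between the two runs comes from the entity boost alone.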
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
