How AI Understands Your Queries: Core Techniques of Semantic Vector Search

This article explains why traditional keyword search fails when user questions are worded differently from the knowledge base, introduces semantic search that matches queries to documents via vector similarity, walks through query-understanding and query-rewriting techniques, lists common pitfalls, provides a full Python implementation, and closes with best-practice recommendations.


Traditional keyword search breaks down when a user asks a question in natural language that does not share exact words with the documents in a knowledge base. For example, the policy "7‑day no‑reason return" and the user query "Can I return it?" have no overlapping characters, yet a semantic search system can embed both sentences, compute a similarity of 0.92, and retrieve the correct document.
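The similarity score in this example comes from comparing dense vectors. A minimal sketch of the underlying computation, using toy hand-written vectors in place of real model embeddings (the 0.92 figure above is the article's illustration, not reproduced here):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; a real system would obtain these
# from an embedding model such as a sentence-transformers encoder.
doc_vec = [0.8, 0.1, 0.5, 0.3]    # stands in for "7-day no-reason return"
query_vec = [0.7, 0.2, 0.6, 0.2]  # stands in for "Can I return it?"
print(round(cosine_similarity(doc_vec, query_vec), 2))
```

In production the two vectors would come from the same embedding model, so queries and documents live in a shared semantic space even when they share no surface wording.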

Core concepts of semantic search

Semantic search relies on three key technologies:

Vectorization : Convert both queries and documents into dense vectors.

Query understanding : Clean the input, detect intent, and extract key terms.

Query rewriting : Expand synonyms, generate hypothetical answers (HyDE), and decompose complex questions.

Query understanding workflow

User input: "我想问一下就是那个退货的流程是咋回事儿" ("Um, I wanted to ask, you know, what's the deal with the return process?")
    ↓
Query preprocessing
    ↓
Structured query: "退货流程" ("return process")

The preprocessing pipeline consists of four steps:

Text cleaning : lower-case, remove punctuation and filler words, keep core terms (e.g., "退货", "return").

Intent detection : determine the user's goal (e.g., intent = "退货", return).

Keyword extraction : pull out important words such as "退货" ("return") and "流程" ("process").

Query rewriting : perform synonym expansion, HyDE, or query decomposition.
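The cleaning step above can be sketched with plain string handling. The filler-word list below is illustrative, not the article's actual dictionary:

```python
import re

# Illustrative filler-word list for colloquial Chinese queries.
FILLER_WORDS = ["我想问一下", "就是", "那个", "咋回事儿", "请问", "一下"]

def clean_query(query: str) -> str:
    """Lower-case, strip filler words, and drop punctuation."""
    query = query.lower().strip()
    for filler in FILLER_WORDS:
        query = query.replace(filler, "")
    # Remove both ASCII and common CJK punctuation marks.
    query = re.sub(r"[,。?!,.?!、]", "", query)
    return query.strip()

print(clean_query("我想问一下就是那个退货的流程是咋回事儿"))
```

A production cleaner would use word segmentation rather than naive substring replacement, but the shape of the step is the same: colloquial input in, compact core terms out.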

Query rewriting strategies

Three main strategies are illustrated with concrete code snippets.

# 1. Synonym expansion
query = "电脑很卡"  # "my computer is laggy"
expanded = ["电脑很卡", "计算机卡顿", "PC运行慢", ...]

# 2. HyDE (hypothetical document)
if "怎么" in query:  # "how to ..."
    # "According to {query}, follow these steps: step 1... step 2..."
    return f"根据{query},需要按照以下步骤操作:第一步...第二步..."

# 3. Query decomposition
# "苹果公司创始人是谁" ("Who founded Apple?") → ["苹果公司", "创始人"]

Five common pitfalls and fixes

Query too short : "电脑" ("computer") returns millions of results. Solution : expand with related terms ("笔记本电脑" laptop, "台式机" desktop, etc.).

Ambiguity : "苹果" ("apple") matches both the fruit and the phone brand. Solution : use intent detection plus user history to disambiguate.

Colloquial noise : "就是那个啥我想退一下就是买的东西" ("so, um, I want to return the thing I bought") yields no hits. Solution : extract the key phrase "退货" ("return").

Mixed terminology : "显示器颜色发黄" ("the monitor looks yellowish") vs. "屏幕色偏暖色调" ("the screen's colour cast is warm"). Solution : align terms across domains.

Negation handling : "不要苹果手机" ("no Apple phones") still returns Apple phones. Solution : parse the negation to include "手机" (phone) and exclude "苹果" (Apple).
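The negation fix can be sketched as follows. The negation pattern and brand list are illustrative assumptions; a real system would use NER or a product catalogue to separate brand from category:

```python
import re

# Illustrative brand list (hypothetical, not from the article).
BRANDS = ["苹果", "华为", "小米"]

def handle_negation(query: str) -> tuple[list[str], list[str]]:
    """Split a query into include-terms and exclude-terms.

    For "不要苹果手机" the goal is to keep the category ("手机")
    while excluding only the negated brand ("苹果").
    """
    m = re.search(r"(?:不要|不需要|别)(\S+)", query)
    if not m:
        return [query], []
    negated = m.group(1)
    for brand in BRANDS:
        if negated.startswith(brand):
            rest = negated[len(brand):]
            # Keep the remaining category term, e.g. "手机".
            return ([rest] if rest else []), [brand]
    return [], [negated]

print(handle_negation("不要苹果手机"))  # → (['手机'], ['苹果'])
```

The exclude-terms then become a negative filter at retrieval or reranking time rather than part of the embedded query.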

Full Python implementation

The article provides a complete, runnable pipeline:

from typing import Dict, List, Tuple

class QueryCleaner:
    def clean(self, query: str) -> str:
        # remove filler words and punctuation, lower-case, etc.
        ...
    def remove_negations(self, query: str) -> Tuple[str, List[str]]:
        ...
    def extract_keywords(self, query: str) -> List[str]:
        ...

class QueryRewriter:
    def __init__(self, embedder=None):
        self.synonym_dict = {
            '电脑': ['计算机', 'PC', '笔记本', '笔记本电脑'],
            '手机': ['移动电话', '智能手机', '移动端'],
            '退货': ['退换货', '退款', '退货退款'],
            # ... more entries
        }
    def expand_with_synonyms(self, query: str, top_k: int = 5) -> List[str]:
        ...
    def generate_hypothetical_answer(self, query: str) -> str:
        ...
    def decompose_query(self, query: str) -> List[str]:
        ...

class SemanticSearchPreprocessor:
    def __init__(self, embedder=None):
        self.cleaner = QueryCleaner()
        self.rewriter = QueryRewriter(embedder)
    def preprocess(self, query: str) -> Dict:
        # runs cleaning, synonym expansion, HyDE, decomposition
        ...
    def build_multi_strategy_query(self, query: str) -> List[str]:
        # combine original, cleaned, and expanded queries
        ...
    def debug_query(self, query: str) -> str:
        # pretty‑print each step
        ...

def demo():
    preprocessor = SemanticSearchPreprocessor()
    for q in ["就是那个我想退一下货", "电脑很卡怎么办", "苹果公司是干什么的", "不要苹果手机", "请问一下激活流程是什么"]:
        print(preprocessor.debug_query(q))
        print(f"Multi‑recall queries: {preprocessor.build_multi_strategy_query(q)}")

if __name__ == "__main__":
    demo()

An advanced demo shows how to load the sentence-transformers model BAAI/bge-base-zh-v1.5, embed the multi-strategy queries, and inspect the first five dimensions of each vector.

Best practices

Preprocessing can boost recall by 10‑30% regardless of the embedding model.

Combine synonym expansion, HyDE, and query decomposition for the strongest effect.

Configure strategies per scenario (short queries, oral queries, domain‑specific terminology, complex questions).

Continuously analyze search logs to identify high‑frequency failures and refine rules.

Pair preprocessing with a reranking stage to improve final precision.
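The log-analysis recommendation above amounts to counting queries that return nothing. A minimal sketch with invented log records (the data below is purely illustrative):

```python
from collections import Counter

# Hypothetical search-log records: (query, number_of_results) pairs.
search_log = [
    ("退货流程", 12),
    ("就是那个我想退一下货", 0),
    ("电脑", 1_000_000),
    ("就是那个我想退一下货", 0),
]

# Count queries that returned zero hits; the most frequent failures
# are the best candidates for new cleaning or synonym rules.
failed = Counter(q for q, hits in search_log if hits == 0)
print(failed.most_common(3))
```

Reviewing the top failures periodically closes the loop: each recurring zero-hit query suggests a filler word, synonym, or negation pattern the preprocessor does not yet handle.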

Thought questions

Design a context‑aware disambiguation strategy for the ambiguous query "苹果".

Extend QueryRewriter with a Levenshtein‑based spelling‑correction module.

Handle combined negation and demand, e.g., user says "不需要了" while the document mentions "退款".

Next episode preview

In the next level we will compare vector databases such as Milvus, Chroma, and Weaviate to help you choose the right one.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, AI, RAG, Vector Search, semantic search, sentence-transformers, query preprocessing
Written by AI Architect Hub

Discussing AI and architecture; a ten-year veteran of major tech companies now transitioning to AI and continuing the journey.
