Building an ASR+LLM+Vector Knowledge Base for Precise Video Ad Category Detection

This article presents a layered pipeline that combines ASR, large language models, and a vector knowledge base. The pipeline cleans speech transcripts, semantically repairs the text, performs hierarchical exact and fuzzy matching, and iteratively refines its mappings to accurately identify product categories in video advertisements. Along the way it details each module's function, the technical choices behind it, and LLM parameter tuning.


Introduction

Rapid growth of video advertising creates a need for automatic product‑category identification. Three main technical pain points are:

Low accuracy of ASR transcription, resulting in poor input quality.

Heterogeneous ways products are mentioned (abbreviations, misspellings, order variations).

Weak model generalization and limited capacity for iterative improvement.

The proposed solution combines an ASR engine, large language models (LLM), and a vector‑based knowledge base (KB) in a layered pipeline that performs semantic repair, exact matching, fuzzy retrieval, and a closed‑loop case‑mapping feedback.

Technical Architecture

[Video/audio] → [ASR transcription] → [Text preprocessing] → [Precise matching] → [Fuzzy retrieval] → [Recognition output]
                                             ↑                                                              ↓
                              [Case‑mapping feedback] ← ─────────────────────── [Human review feedback]

ASR Transcription Layer

Audio → [Far‑field denoising + dual‑channel separation] → Speech‑to‑text → Raw transcript

Function: Convert spoken content to raw text.

Technology choice: ASR engine configured with far‑field denoising and colloquial cleaning parameters.

Text Preprocessing Layer

Raw transcript → [Semantic error correction (typos, accent fixes)] → [Word‑order normalization (e.g., "Apple phone" → "phone Apple")] → [Field validation (filter non‑core tokens)] → Structured text

Function: LLM‑driven repair of ASR output, fixing contextual errors such as "phone Apple IPHONE15" → "phone Apple iPhone 15".
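In the actual pipeline this repair is LLM-driven; as a minimal sketch of the deterministic side of the step, a token-level correction table (entries hypothetical) can patch recurring ASR errors before or after the LLM pass:

```python
# Minimal sketch of transcript repair. The article's pipeline uses an LLM
# for contextual fixes; here a hand-written correction table (hypothetical
# entries) stands in for that learned behavior.

CORRECTIONS = {
    "IPHONE15": "iPhone 15",   # casing/spacing fix from the example above
    "aple": "Apple",           # hypothetical typo fix
}

def repair_transcript(text: str) -> str:
    """Replace known-bad tokens, leaving everything else untouched."""
    return " ".join(CORRECTIONS.get(tok, tok) for tok in text.split())

print(repair_transcript("phone Apple IPHONE15"))  # phone Apple iPhone 15
```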

Precise Matching Layer

Function: Map the normalized text to known product records in two stages.

First‑level matching: Full‑string and prefix search against a MySQL product catalog (fields: product_id, standard_model, category, brand). A hit directly yields the category.

Structured text → [Exact string / prefix match] → Category

Second‑level matching: Lookup in a human‑verified error‑statement library ("incorrect expression → standard model") to resolve residual misspellings, e.g., "phone Asus 11U" → "phone Asus ZenFone 11 Ultra".

Structured text → [Error‑statement mapping] → Category
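A compact sketch of the two-stage matcher, with SQLite standing in for the MySQL catalog (the schema follows the fields listed above) and the error-statement library reduced to a dict; all rows and entries are illustrative:

```python
import sqlite3

# Stage 1: product catalog lookup (SQLite stands in for MySQL here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products "
             "(product_id INTEGER, standard_model TEXT, category TEXT, brand TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'Apple iPhone 15', 'phone', 'Apple')")
conn.execute("INSERT INTO products VALUES (2, 'Asus ZenFone 11 Ultra', 'phone', 'Asus')")

# Stage 2: human-verified "incorrect expression -> standard model" library.
ERROR_STATEMENTS = {"Asus 11U": "Asus ZenFone 11 Ultra"}  # hypothetical entry

def precise_match(text: str):
    # Full-string or prefix hit against the catalog yields the category directly.
    row = conn.execute(
        "SELECT category FROM products "
        "WHERE standard_model = ? OR ? LIKE standard_model || '%'",
        (text, text)).fetchone()
    if row:
        return row[0]
    # Otherwise resolve via the error-statement mapping and retry the catalog.
    fixed = ERROR_STATEMENTS.get(text)
    return precise_match(fixed) if fixed else None
```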

Fuzzy Retrieval Layer

Fuzzy text → [Vectorization] → [Cosine similarity search (Top‑10)] → [LLM‑based mapping (model + category)] → Recognition result

Function: Handle expressions that miss the precise matcher by searching a vector KB.

KB construction: Store each standard model as an embedding in a vector database.

Retrieval logic: Convert the fuzzy expression (e.g., "brand 14‑inch notebook", "Mi 14U") to an embedding, retrieve the top‑10 most similar items, then let the LLM select the best match, typically from the top‑3 survivors of a keyword filter.
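The similarity search itself reduces to cosine ranking over stored embeddings. A dependency-free sketch, using toy 3-dimensional vectors in place of real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, kb, k=10):
    """kb: list of (standard_model, embedding) pairs; returns the k nearest models."""
    ranked = sorted(kb, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

A real vector database performs the same ranking with an approximate-nearest-neighbor index instead of a full sort.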

Common‑Case Mapping Closed‑Loop

Recognition result → [Human review (label error case)] → [Error correction] → [Automatic ingestion into common‑case library] → [Sync dictionary to preprocessing layer] → Effect monitoring (error recurrence stats)

Function: Recycle mislabeled cases to improve future runs.

Human reviewers label "incorrect expression → standard model" pairs via a lightweight UI.

Approved pairs are automatically added to the common‑case mapping library and propagated to the preprocessing dictionary.
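A minimal sketch of the ingestion step, with in-memory dicts standing in for the case library and the preprocessing dictionary, and a counter mirroring the effect-monitoring box in the diagram:

```python
from collections import Counter

case_library = {}       # "incorrect expression" -> "standard model"
preprocess_dict = {}    # dictionary synced to the text-preprocessing layer
recurrence = Counter()  # effect monitoring: how often each known error reappears

def ingest_approved_pair(wrong: str, standard: str) -> None:
    """Add a human-approved pair and propagate it (the 'sync dictionary' step)."""
    case_library[wrong] = standard
    preprocess_dict[wrong] = standard

def observe(expression: str) -> str:
    """Apply the mapping at runtime and count recurrences for monitoring."""
    if expression in case_library:
        recurrence[expression] += 1
        return case_library[expression]
    return expression
```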

Key Technical Challenges & Solutions

1. Splitting a Single LLM Node

Problem: A monolithic LLM handling repair, extraction, and recognition leads to overly long contexts and degraded accuracy.

Solution: Decompose into three independent modules:

Text‑repair module – only fixes ASR errors.

Product‑extraction module – extracts candidate product keywords from repaired text.

Result‑generation module – merges precise‑match and fuzzy‑retrieval outputs into a standardized category.
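The decomposition can be expressed as three small functions chained by an orchestrator, so each LLM call sees only its own short context. The bodies below are deterministic stubs (the stopword list and matching rules are hypothetical); in practice each module issues one focused LLM prompt:

```python
def repair_text(raw: str) -> str:
    """Module 1: fix ASR errors only (stub; an LLM call in practice)."""
    return raw.replace("IPHONE15", "iPhone 15")

STOPWORDS = {"phone", "buy", "now", "the"}  # hypothetical filler words

def extract_products(repaired: str) -> list:
    """Module 2: pull candidate product keywords (stub)."""
    return [t for t in repaired.split() if t.lower() not in STOPWORDS]

def generate_result(keywords: list) -> str:
    """Module 3: merge match/retrieval outputs into a category (stub)."""
    return "phone" if any(k.lower().startswith("iphone") for k in keywords) else "unknown"

def pipeline(raw: str) -> str:
    """Orchestrator: each stage gets only the previous stage's short output."""
    return generate_result(extract_products(repair_text(raw)))
```

Keeping the modules separate also lets each one carry its own generation parameters, which matters for the tuning discussed below.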

2. Knowledge‑Base Retrieval Optimization

Problem: Fuzzy mentions such as "Mi 14U" or "brand robot vacuum" have low recall with pure keyword search.

Solution: Enrich the KB with additional dimensions (common error expressions, abbreviations, brand aliases) and apply a hybrid retrieval strategy:

Vector search to obtain the Top‑10 most similar embeddings.

Keyword filter on brand name, model number, and known aliases to select the final match.

Example: Query "Mi 14U" → vector top‑3 results "Xiaomi 14 Ultra", "Xiaomi 13 Ultra", "Redmi 14 Pro" → keyword filter yields "Xiaomi 14 Ultra".
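The keyword-filter stage of this hybrid strategy can be sketched with a small alias table (entries hypothetical) that expands abbreviations before scoring token overlap against the vector candidates:

```python
ALIASES = {"mi": "xiaomi", "14u": "14 ultra"}  # hypothetical alias table

def expand(query: str) -> str:
    """Rewrite abbreviations so the query shares tokens with standard models."""
    return " ".join(ALIASES.get(t, t) for t in query.lower().split())

def keyword_filter(query: str, candidates: list) -> str:
    """Pick the candidate with the largest token overlap with the expanded query."""
    q = set(expand(query).split())
    return max(candidates, key=lambda c: len(q & set(c.lower().split())))

print(keyword_filter("Mi 14U",
                     ["Xiaomi 14 Ultra", "Xiaomi 13 Ultra", "Redmi 14 Pro"]))
# Xiaomi 14 Ultra
```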

3. LLM Parameter Tuning

LLM generation parameters (temperature, Top‑K, Top‑P, length penalty) directly affect repair quality, extraction completeness, and retrieval precision. Recommended settings:

ASR text repair: temperature 0.2, Top‑K 20, Top‑P 0.7, length penalty 1.1 – low temperature reduces hallucination while allowing enough diversity for correction.

Product keyword extraction: temperature 0.1, Top‑K 10, Top‑P 0.6, length penalty 1.1 – stricter sampling to avoid redundant or missing keywords.

Rationale: lower temperature yields deterministic outputs; Top‑K limits candidate pool; Top‑P provides flexible probability cutoff; length penalty controls verbosity.
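Collected as a per-task configuration table (parameter names follow common inference-API conventions; exact names vary by serving framework):

```python
# Per-module generation settings from the recommendations above.
GEN_PARAMS = {
    "asr_repair":         {"temperature": 0.2, "top_k": 20, "top_p": 0.7, "length_penalty": 1.1},
    "keyword_extraction": {"temperature": 0.1, "top_k": 10, "top_p": 0.6, "length_penalty": 1.1},
}

def params_for(task: str) -> dict:
    """Look up the sampling settings for a pipeline module."""
    return GEN_PARAMS[task]
```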

Effect Feedback

Accuracy ranking observed: exact keyword matching > historical common‑case mapping > vector‑KB matching. As the common‑case library grows, the second‑level matcher captures more cases, gradually improving overall precision.

[Effect diagram]

Summary

The "ASR + LLM + vector knowledge base" pipeline addresses the core challenges of video‑ad product‑category identification by:

Improving input quality through far‑field denoising and LLM‑based semantic repair.

Providing a two‑stage exact matcher backed by a MySQL catalog and an error‑statement library.

Handling fuzzy mentions via vector embeddings and hybrid retrieval.

Closing the loop with human‑reviewed case mapping to continuously enrich the preprocessing dictionary.

This architecture is applicable to other AI‑driven recognition tasks that require high‑quality textual input, fine‑tuned LLM parameters, and iterative knowledge‑base enrichment.

Tags: LLM, vector database, knowledge base, semantic matching, ASR, video ad classification
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
