How LLMs Power the “Find Data Assistant” for Smarter Data Retrieval
This article explains how the Volcano Engine DataLeap team leveraged large language models (LLMs) to build the “Find Data Assistant”, detailing its design, challenges, embedding‑and‑reranker enhancements, LLM‑driven semantic search, mixing architecture, and practical lessons for improving data asset management and retrieval.
Introduction
In the digital era, data is a critical asset, but the explosion of data volume makes finding and using the right data a major challenge for enterprises.
The Volcano Engine DataLeap team applied large‑model technology to build “Find Data Assistant”, a tool that addresses this problem.
Background and Scenarios
The data asset platform is an upgraded data map that serves as a one‑stop portal for data consumption, simplifying data production and usage. It supports data consumption, metadata management, and asset‑center capabilities.
Key scenarios include locating target data quickly in massive datasets and ensuring the data matches business needs, such as finding the best‑performing streamers in the last 30 days.
Challenges
Traditional keyword search lacks semantic understanding and cannot handle complex queries.
Metadata may be incomplete, making governance difficult.
Industry jargon and varied expressions require high model generalisation.
Solution Overview
The system combines keyword retrieval, semantic retrieval, and large language model (LLM) processing. User queries pass through a dialogue framework, intent and entity recognition, and optional multi‑turn merging, all powered by LLMs.
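The intent‑and‑entity step can be sketched as a prompt plus defensive parsing of the model's reply. The prompt wording, intent labels, and entity fields below are illustrative assumptions, not the DataLeap team's actual prompts:

```python
import json

# Hypothetical prompt for intent classification and entity extraction.
# The labels (find_table / find_metric / chitchat) are assumptions for
# illustration; the real system's intent taxonomy is not public.
INTENT_PROMPT = """Classify the user's intent (find_table | find_metric | chitchat)
and extract entities. Reply with JSON:
{{"intent": "...", "entities": {{"metric": "...", "time_range": "..."}}}}
Query: {query}"""

def parse_intent_response(raw: str) -> dict:
    """Parse the LLM's JSON reply; fall back to a safe default on bad output,
    since LLMs occasionally return malformed JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"intent": "unknown", "entities": {}}
    return {"intent": data.get("intent", "unknown"),
            "entities": data.get("entities", {})}
```

Defensive parsing matters here: a single malformed reply should degrade to a fallback intent rather than crash the dialogue pipeline.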
Retrieval uses three storage back‑ends: a Vector DB for embedding‑based semantic recall, Elasticsearch for keyword matching, and MySQL for conversation history.
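Merging the keyword and semantic recall channels can be sketched as a weighted score over both signals. The stand‑ins below are toy in‑memory versions of what would be Elasticsearch and a Vector DB in production, and the weighting scheme is an assumption for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_recall(query_terms, query_vec, docs, alpha=0.5, top_k=3):
    """Merge keyword recall and embedding recall with a weighted score.

    docs: list of {"id": ..., "terms": set, "vec": [...]}.
    In production the two channels would be Elasticsearch (keyword) and a
    Vector DB (semantic); here both are simplified to in-memory scoring.
    """
    scored = []
    for d in docs:
        kw = len(query_terms & d["terms"]) / max(len(query_terms), 1)
        sem = cosine(query_vec, d["vec"])
        scored.append((alpha * kw + (1 - alpha) * sem, d["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```

The `alpha` weight trades off exact keyword matches against semantic similarity; a real deployment would tune it (or learn it) per scenario.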
LLM‑Enhanced Retrieval
Traditional keyword search suffers from limited accuracy, inability to handle jargon, and synonym variations. The solution integrates LLM‑driven semantic analysis, intent detection, and answer ranking, followed by LLM‑based mixing (reranking) to improve relevance.
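The reranking step can be sketched as scoring each recall candidate against the query and sorting. The `score_fn` below is a stand‑in for a real LLM or cross‑encoder call (the model DataLeap actually uses is not public); the token‑overlap scorer is only a cheap illustrative proxy:

```python
def llm_rerank(query, candidates, score_fn):
    """Rerank recall candidates by a relevance score.

    score_fn(query, doc) -> float is a placeholder for an LLM- or
    cross-encoder-based scorer; any callable with that shape works.
    """
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)

def overlap_score(query, doc):
    """Illustrative scorer: fraction of query tokens that appear in the doc.
    A real reranker would use a fine-tuned model instead."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

Keeping the scorer behind a callable lets the cheap proxy be swapped for a fine‑tuned reranker without touching the pipeline.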
Embedding and Reranker Improvements
To overcome training‑data scarcity, LLM‑generated data is used to fine‑tune the embedding and reranker models. Knowledge extraction converts documents into question‑answer (Q‑A) pairs, enabling more precise retrieval.
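One way to turn such LLM‑generated Q‑A pairs into embedding fine‑tuning data is to build (query, positive, negative) triples, the layout commonly used by contrastive trainers such as sentence‑transformers. The exact format DataLeap uses is not public; this is a sketch under that assumption:

```python
def to_training_triples(qa_pairs, corpus, negatives_per_pair=1):
    """Build contrastive training triples from LLM-generated Q-A pairs.

    qa_pairs: list of (question, positive_doc_id) produced by knowledge
    extraction; corpus: {doc_id: document_text}.
    Negatives here are naively sampled from the rest of the corpus; a real
    pipeline would mine hard negatives instead.
    """
    triples = []
    for question, pos_id in qa_pairs:
        other_ids = [d for d in corpus if d != pos_id]
        for neg_id in other_ids[:negatives_per_pair]:
            triples.append({"query": question,
                            "pos": corpus[pos_id],
                            "neg": corpus[neg_id]})
    return triples
```

Hard‑negative mining (picking negatives the current model confuses with positives) is the usual next refinement, since random negatives are often too easy to teach the model much.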
Mixing Architecture
The LLM mixing layer addresses token limits, hallucination, and latency by selecting models with strong generalisation, fine‑tuning them for specific scenarios, and streaming output so users see partial answers sooner.
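The streaming‑output idea can be sketched as a generator that flushes partial answers to the UI in small batches instead of waiting for the full completion. The batch size and token source are assumptions for illustration:

```python
def stream_answer(token_iter, flush_every=3):
    """Flush partial answers as tokens arrive, so the user sees output
    before generation finishes (the latency mitigation described above).

    token_iter: any iterable of generated tokens; flush_every controls how
    many tokens accumulate before each UI update.
    """
    buffer = []
    for token in token_iter:
        buffer.append(token)
        if len(buffer) >= flush_every:
            yield "".join(buffer)
            buffer.clear()
    if buffer:  # flush whatever remains at end of generation
        yield "".join(buffer)
```

In practice the token iterator would wrap a model's streaming API, and each yielded chunk would be pushed to the client over SSE or WebSocket.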
Other Applications
LLMs are also used for answer summarisation, refusal handling, and automatic FAQ generation, helping to keep business knowledge up‑to‑date.
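Refusal handling can be sketched as a confidence gate: if the best retrieval score falls below a threshold, the assistant declines rather than risk a hallucinated answer. The threshold value and message are assumptions for illustration:

```python
def answer_or_refuse(scored_candidates, threshold=0.6):
    """Return the best answer, or refuse when confidence is too low.

    scored_candidates: list of (score, answer) sorted best-first.
    The 0.6 threshold is a hypothetical value; a real system would tune it
    against labeled refusal cases.
    """
    if not scored_candidates or scored_candidates[0][0] < threshold:
        return "Sorry, I couldn't find a reliable answer to that question."
    return scored_candidates[0][1]
```

A gate like this trades a little recall for trust: declining on weak evidence is usually cheaper than a confidently wrong answer about a data asset.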
Summary and Recommendations
Key lessons include trusting LLM capabilities while recognising the need for small‑model optimisation, fine‑tuning to reduce hallucinations, and introducing agents for multi‑turn dialogue handling.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.