How LLMs Power the “Find Data Assistant” for Smarter Data Retrieval
This article explains how the Volcano Engine DataLeap team leveraged large language models (LLMs) to build the “Find Data Assistant”, detailing its design, challenges, embedding‑and‑reranker enhancements, LLM‑driven semantic search, mixing architecture, and practical lessons for improving data asset management and retrieval.
Introduction
In the digital era, data is a critical asset, but the explosion of data volume makes finding and using the right data a major challenge for enterprises.
The Volcano Engine DataLeap team applied large‑model technology to build “Find Data Assistant”, a tool that addresses this problem.
Background and Scenarios
The data asset platform is an upgraded data map that serves as a one‑stop portal for data consumption, simplifying data production and usage. It supports data consumption, metadata management, and asset‑center capabilities.
Key scenarios include locating target data quickly in massive datasets and ensuring the data matches business needs, such as finding the best‑performing streamers in the last 30 days.
Challenges
Traditional keyword search lacks semantic understanding and cannot handle complex queries.
Metadata may be incomplete, making governance difficult.
Industry jargon and varied expressions require high model generalisation.
Solution Overview
The system combines keyword retrieval, semantic retrieval, and large language model (LLM) processing. User queries pass through a dialogue framework, intent and entity recognition, and optional multi‑turn merging, all powered by LLMs.
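The intent‑and‑entity step can be sketched as a prompt plus defensive parsing of the model's reply. The prompt wording, intent labels, and entity fields below are illustrative assumptions, not the DataLeap team's actual prompts:

```python
import json

# Hypothetical prompt for intent classification and entity extraction.
# The labels (find_table / find_metric / chitchat) are assumptions for
# illustration; the real system's intent taxonomy is not public.
INTENT_PROMPT = """Classify the user's intent (find_table | find_metric | chitchat)
and extract entities. Reply with JSON:
{{"intent": "...", "entities": {{"metric": "...", "time_range": "..."}}}}
Query: {query}"""

def parse_intent_response(raw: str) -> dict:
    """Parse the LLM's JSON reply; fall back to a safe default on bad output,
    since LLMs occasionally return malformed JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"intent": "unknown", "entities": {}}
    return {"intent": data.get("intent", "unknown"),
            "entities": data.get("entities", {})}
```

Defensive parsing matters here: a single malformed reply should degrade to a fallback intent rather than crash the dialogue pipeline.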
Retrieval uses three storage back‑ends: a Vector DB for embedding‑based semantic recall, Elasticsearch for keyword matching, and MySQL for conversation history.
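Merging the keyword and semantic recall channels can be sketched as a weighted score over both signals. The stand‑ins below are toy in‑memory versions of what would be Elasticsearch and a Vector DB in production, and the weighting scheme is an assumption for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_recall(query_terms, query_vec, docs, alpha=0.5, top_k=3):
    """Merge keyword recall and embedding recall with a weighted score.

    docs: list of {"id": ..., "terms": set, "vec": [...]}.
    In production the two channels would be Elasticsearch (keyword) and a
    Vector DB (semantic); here both are simplified to in-memory scoring.
    """
    scored = []
    for d in docs:
        kw = len(query_terms & d["terms"]) / max(len(query_terms), 1)
        sem = cosine(query_vec, d["vec"])
        scored.append((alpha * kw + (1 - alpha) * sem, d["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]
```

The `alpha` weight trades off exact keyword matches against semantic similarity; a real deployment would tune it (or learn it) per scenario.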
LLM‑Enhanced Retrieval
Traditional keyword search suffers from limited accuracy, inability to handle jargon, and synonym variations. The solution integrates LLM‑driven semantic analysis, intent detection, and answer ranking, followed by LLM‑based mixing (reranking) to improve relevance.
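The reranking step can be sketched as scoring each recall candidate against the query and sorting. The `score_fn` below is a stand‑in for a real LLM or cross‑encoder call (the model DataLeap actually uses is not public); the token‑overlap scorer is only a cheap illustrative proxy:

```python
def llm_rerank(query, candidates, score_fn):
    """Rerank recall candidates by a relevance score.

    score_fn(query, doc) -> float is a placeholder for an LLM- or
    cross-encoder-based scorer; any callable with that shape works.
    """
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)

def overlap_score(query, doc):
    """Illustrative scorer: fraction of query tokens that appear in the doc.
    A real reranker would use a fine-tuned model instead."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

Keeping the scorer behind a callable lets the cheap proxy be swapped for a fine‑tuned reranker without touching the pipeline.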
Embedding and Reranker Improvements
To overcome training‑data scarcity, LLM‑generated data is used to fine‑tune the embedding and reranker models. Knowledge extraction converts documents into question‑answer (Q‑A) pairs, enabling more precise retrieval.
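One way to turn such LLM‑generated Q‑A pairs into embedding fine‑tuning data is to build (query, positive, negative) triples, the layout commonly used by contrastive trainers such as sentence‑transformers. The exact format DataLeap uses is not public; this is a sketch under that assumption:

```python
def to_training_triples(qa_pairs, corpus, negatives_per_pair=1):
    """Build contrastive training triples from LLM-generated Q-A pairs.

    qa_pairs: list of (question, positive_doc_id) produced by knowledge
    extraction; corpus: {doc_id: document_text}.
    Negatives here are naively sampled from the rest of the corpus; a real
    pipeline would mine hard negatives instead.
    """
    triples = []
    for question, pos_id in qa_pairs:
        other_ids = [d for d in corpus if d != pos_id]
        for neg_id in other_ids[:negatives_per_pair]:
            triples.append({"query": question,
                            "pos": corpus[pos_id],
                            "neg": corpus[neg_id]})
    return triples
```

Hard‑negative mining (picking negatives the current model confuses with positives) is the usual next refinement, since random negatives are often too easy to teach the model much.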
Mixing Architecture
The LLM mixing layer addresses token limits, hallucination, and latency by selecting models with strong generalisation, fine‑tuning them for specific scenarios, and streaming output so users see partial answers sooner.
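The streaming‑output idea can be sketched as a generator that flushes partial answers to the UI in small batches instead of waiting for the full completion. The batch size and token source are assumptions for illustration:

```python
def stream_answer(token_iter, flush_every=3):
    """Flush partial answers as tokens arrive, so the user sees output
    before generation finishes (the latency mitigation described above).

    token_iter: any iterable of generated tokens; flush_every controls how
    many tokens accumulate before each UI update.
    """
    buffer = []
    for token in token_iter:
        buffer.append(token)
        if len(buffer) >= flush_every:
            yield "".join(buffer)
            buffer.clear()
    if buffer:  # flush whatever remains at end of generation
        yield "".join(buffer)
```

In practice the token iterator would wrap a model's streaming API, and each yielded chunk would be pushed to the client over SSE or WebSocket.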
Other Applications
LLMs are also used for answer summarisation, refusal handling, and automatic FAQ generation, helping to keep business knowledge up‑to‑date.
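Refusal handling can be sketched as a confidence gate: if the best retrieval score falls below a threshold, the assistant declines rather than risk a hallucinated answer. The threshold value and message are assumptions for illustration:

```python
def answer_or_refuse(scored_candidates, threshold=0.6):
    """Return the best answer, or refuse when confidence is too low.

    scored_candidates: list of (score, answer) sorted best-first.
    The 0.6 threshold is a hypothetical value; a real system would tune it
    against labeled refusal cases.
    """
    if not scored_candidates or scored_candidates[0][0] < threshold:
        return "Sorry, I couldn't find a reliable answer to that question."
    return scored_candidates[0][1]
```

A gate like this trades a little recall for trust: declining on weak evidence is usually cheaper than a confidently wrong answer about a data asset.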
Summary and Recommendations
Key lessons include trusting LLM capabilities while recognising the need for small‑model optimisation, fine‑tuning to reduce hallucinations, and introducing agents for multi‑turn dialogue handling.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.