Artificial Intelligence · 27 min read

Semantic Embedding with Large Language Models: A Comprehensive Survey

This survey traces the evolution of semantic embedding from Word2vec and GloVe through BERT and Sentence-BERT to recent contrastive methods, then examines how large language models improve embeddings via synthetic data generation and as backbone architectures. It details techniques such as contrastive prompting, in-context learning, and knowledge distillation, and discusses challenges around computational resources, privacy, and interpretability.


This article provides a comprehensive survey of semantic embedding techniques enhanced by Large Language Models (LLMs). Semantic embedding, which maps text into a semantic vector space, is a core technology for capturing deep semantic information in natural language processing, information retrieval, and recommendation systems.
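To make "maps text into a semantic vector space" concrete, the minimal sketch below (our illustration, not code from the surveyed papers) encodes three sentences with an off-the-shelf encoder and compares them by cosine similarity; the model name is just a common default, and any sentence encoder would serve.

```python
# Minimal illustration of semantic embedding: text -> dense vector,
# with similarity measured in the resulting vector space.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small BERT-based encoder

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "The weather is nice today.",
]

# encode() returns one dense vector per sentence (here 384-dimensional);
# normalizing lets a plain dot product act as cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)

sims = embeddings @ embeddings.T
print(np.round(sims, 2))
# Expected behavior: the two password-related sentences score much
# closer to each other than either does to the weather sentence.
```

Semantically related sentences end up close together in the vector space while unrelated ones do not; every method surveyed below is, at bottom, a better way of producing these vectors.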

The article begins by reviewing the evolution of semantic embedding methods from early approaches like Word2vec and GloVe to context-sensitive models like BERT, RoBERTa, and Sentence-BERT, and more recent contrastive learning frameworks like SimCSE. It then compares LLM-based semantic embeddings with traditional approaches, highlighting differences in model structure, training methods, embedding quality, and application scenarios.
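To illustrate the contrastive-learning stage that SimCSE popularized, here is a simplified sketch of its unsupervised objective, under stated assumptions: `encoder` is any model returning one pooled vector per sentence, and dropout is active, so two forward passes over the same batch yield two slightly different views that form positive pairs against in-batch negatives. This is a teaching sketch, not the official implementation.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, input_ids, attention_mask, temperature=0.05):
    # Two forward passes in train mode: dropout makes the same sentence
    # produce two slightly different embeddings (the positive pair).
    z1 = encoder(input_ids, attention_mask)   # (batch, dim)
    z2 = encoder(input_ids, attention_mask)   # (batch, dim)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)

    # Pairwise cosine similarities between the two views of the batch.
    sim = z1 @ z2.T / temperature             # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)

    # Diagonal entries are positives; every other sentence in the batch
    # serves as a negative (the InfoNCE objective).
    return F.cross_entropy(sim, labels)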

The core content focuses on two main research directions for improving semantic embeddings using LLMs: (1) Synthetic Data Generation - leveraging LLMs to generate high-quality training data, including models like E5-mistral-7b-instruct, SFR-Embedding-Mistral, and Gecko; (2) Model Backbone - using LLMs as the core architecture for embedding, including NV-Embed-v2, BGE-EN-ICL, Echo-mistral, LLM2Vec, GRIT, GTE-Qwen1.5-7B-instruct, and stella_en_1.5B_v5.
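For direction (1), the sketch below outlines the data-synthesis loop in hedged form: an LLM is prompted to emit (query, positive, hard negative) triples that later feed contrastive fine-tuning, broadly in the spirit of E5-mistral-7b-instruct and Gecko but greatly simplified. `call_llm` and the prompt template are placeholders of our own, not any specific API.

```python
import json

# Hypothetical prompt template; real recipes use many task families
# and much stricter formatting constraints.
PROMPT = """Generate one retrieval training example as a JSON object with keys
"query", "positive" (a passage that answers the query), and
"negative" (a passage on the same topic that does NOT answer it).
Task: {task}"""

def synthesize_example(call_llm, task: str) -> dict:
    """call_llm: placeholder taking a prompt string, returning the reply."""
    raw = call_llm(PROMPT.format(task=task))
    example = json.loads(raw)  # assumes the model returned valid JSON
    missing = {"query", "positive", "negative"} - example.keys()
    if missing:
        raise ValueError(f"incomplete example, missing {missing}")
    return example
```

Triples produced this way can then be consumed by an InfoNCE objective like the one sketched above, with the LLM-written negative serving as the hard negative.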

Key techniques discussed include: contrastive learning with prompts, in-context learning (ICL), bidirectional attention mechanisms, two-stage instruction tuning, knowledge distillation from LLMs, and multi-task training. The article also addresses challenges including computational resource requirements, data privacy concerns, prompt quality dependency, and the need for better interpretability.
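Two of these techniques, prompt-conditioned contrastive embedding and a decoder-only LLM backbone, can be illustrated together. The sketch below follows the published usage pattern of E5-mistral-7b-instruct in simplified form: the task instruction is prepended to the input, and the hidden state of the last non-padding token is taken as the embedding. Fine-tuning details (EOS handling, LoRA adapters, two-stage schedules) are omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "intfloat/e5-mistral-7b-instruct"  # one of the models surveyed
tok = AutoTokenizer.from_pretrained(MODEL)
tok.padding_side = "right"                 # keep last-token indexing simple
if tok.pad_token is None:                  # decoder tokenizers often lack one
    tok.pad_token = tok.eos_token
lm = AutoModel.from_pretrained(MODEL, torch_dtype=torch.float16)

def embed(texts, instruction="Given a web search query, retrieve relevant passages"):
    # Prompt-conditioned input: the task instruction steers the embedding.
    prompts = [f"Instruct: {instruction}\nQuery: {t}" for t in texts]
    batch = tok(prompts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**batch).last_hidden_state          # (batch, seq, dim)
    # Last-token pooling: use the final non-padding position of each
    # sequence as the sentence embedding.
    last = batch["attention_mask"].sum(dim=1) - 1
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)
```

Last-token pooling is the natural choice for a causal model, since only the final position has attended to the whole input; methods such as LLM2Vec and NV-Embed-v2 instead enable bidirectional attention so that other pooling strategies become viable.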

Tags: contrastive learning, information retrieval, NLP, text representation, In-Context Learning, semantic embedding, transformer models
Written by Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
