Artificial Intelligence · 27 min read

Semantic Embedding with Large Language Models: A Comprehensive Survey

This survey traces the evolution of semantic embedding from Word2vec and GloVe through BERT and Sentence-BERT to recent contrastive methods, then examines how large language models improve embeddings via synthetic data generation and as backbone architectures. It details techniques such as contrastive prompting, in-context learning, and knowledge distillation, and discusses challenges around computational resources, privacy, and interpretability.


This article provides a comprehensive survey of semantic embedding techniques enhanced by Large Language Models (LLMs). Semantic embedding, which maps text into a semantic vector space, is a core technology for capturing deep semantic information in natural language processing, information retrieval, and recommendation systems.
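To make "maps text into a semantic vector space" concrete, the minimal sketch below (our illustration, not code from the surveyed papers) encodes three sentences with an off-the-shelf encoder and compares them by cosine similarity; the model name is just a common default, and any sentence encoder would serve.

```python
# Minimal illustration of semantic embedding: text -> dense vector,
# with similarity measured in the resulting vector space.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small BERT-based encoder

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "The weather is nice today.",
]

# encode() returns one dense vector per sentence (here 384-dimensional);
# normalizing lets a plain dot product act as cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)

sims = embeddings @ embeddings.T
print(np.round(sims, 2))
# Expected behavior: the two password-related sentences score much
# closer to each other than either does to the weather sentence.
```

Semantically related sentences end up close together in the vector space while unrelated ones do not; every method surveyed below is, at bottom, a better way of producing these vectors.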

The article begins by reviewing the evolution of semantic embedding methods from early approaches like Word2vec and GloVe to context-sensitive models like BERT, RoBERTa, and Sentence-BERT, and more recent contrastive learning frameworks like SimCSE. It then compares LLM-based semantic embeddings with traditional approaches, highlighting differences in model structure, training methods, embedding quality, and application scenarios.
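To illustrate the contrastive-learning stage that SimCSE popularized, here is a simplified sketch of its unsupervised objective, under stated assumptions: `encoder` is any model returning one pooled vector per sentence, and dropout is active, so two forward passes over the same batch yield two slightly different views that form positive pairs against in-batch negatives. This is a teaching sketch, not the official implementation.

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, input_ids, attention_mask, temperature=0.05):
    # Two forward passes in train mode: dropout makes the same sentence
    # produce two slightly different embeddings (the positive pair).
    z1 = encoder(input_ids, attention_mask)   # (batch, dim)
    z2 = encoder(input_ids, attention_mask)   # (batch, dim)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)

    # Pairwise cosine similarities between the two views of the batch.
    sim = z1 @ z2.T / temperature             # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)

    # Diagonal entries are positives; every other sentence in the batch
    # serves as a negative (the InfoNCE objective).
    return F.cross_entropy(sim, labels)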

The core content focuses on two main research directions for improving semantic embeddings using LLMs: (1) Synthetic Data Generation - leveraging LLMs to generate high-quality training data, including models like E5-mistral-7b-instruct, SFR-Embedding-Mistral, and Gecko; (2) Model Backbone - using LLMs as the core architecture for embedding, including NV-Embed-v2, BGE-EN-ICL, Echo-mistral, LLM2Vec, GRIT, GTE-Qwen1.5-7B-instruct, and stella_en_1.5B_v5.
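For direction (1), the sketch below outlines the data-synthesis loop in hedged form: an LLM is prompted to emit (query, positive, hard negative) triples that later feed contrastive fine-tuning, broadly in the spirit of E5-mistral-7b-instruct and Gecko but greatly simplified. `call_llm` and the prompt template are placeholders of our own, not any specific API.

```python
import json

# Hypothetical prompt template; real recipes use many task families
# and much stricter formatting constraints.
PROMPT = """Generate one retrieval training example as a JSON object with keys
"query", "positive" (a passage that answers the query), and
"negative" (a passage on the same topic that does NOT answer it).
Task: {task}"""

def synthesize_example(call_llm, task: str) -> dict:
    """call_llm: placeholder taking a prompt string, returning the reply."""
    raw = call_llm(PROMPT.format(task=task))
    example = json.loads(raw)  # assumes the model returned valid JSON
    missing = {"query", "positive", "negative"} - example.keys()
    if missing:
        raise ValueError(f"incomplete example, missing {missing}")
    return example
```

Triples produced this way can then be consumed by an InfoNCE objective like the one sketched above, with the LLM-written negative serving as the hard negative.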

Key techniques discussed include: contrastive learning with prompts, in-context learning (ICL), bidirectional attention mechanisms, two-stage instruction tuning, knowledge distillation from LLMs, and multi-task training. The article also addresses challenges including computational resource requirements, data privacy concerns, prompt quality dependency, and the need for better interpretability.
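Two of these techniques, prompt-conditioned contrastive embedding and a decoder-only LLM backbone, can be illustrated together. The sketch below follows the published usage pattern of E5-mistral-7b-instruct in simplified form: the task instruction is prepended to the input, and the hidden state of the last non-padding token is taken as the embedding. Fine-tuning details (EOS handling, LoRA adapters, two-stage schedules) are omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "intfloat/e5-mistral-7b-instruct"  # one of the models surveyed
tok = AutoTokenizer.from_pretrained(MODEL)
tok.padding_side = "right"                 # keep last-token indexing simple
if tok.pad_token is None:                  # decoder tokenizers often lack one
    tok.pad_token = tok.eos_token
lm = AutoModel.from_pretrained(MODEL, torch_dtype=torch.float16)

def embed(texts, instruction="Given a web search query, retrieve relevant passages"):
    # Prompt-conditioned input: the task instruction steers the embedding.
    prompts = [f"Instruct: {instruction}\nQuery: {t}" for t in texts]
    batch = tok(prompts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**batch).last_hidden_state          # (batch, seq, dim)
    # Last-token pooling: use the final non-padding position of each
    # sequence as the sentence embedding.
    last = batch["attention_mask"].sum(dim=1) - 1
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)
```

Last-token pooling is the natural choice for a causal model, since only the final position has attended to the whole input; methods such as LLM2Vec and NV-Embed-v2 instead enable bidirectional attention so that other pooling strategies become viable.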

Tags: contrastive learning, information retrieval, NLP, text representation, In-Context Learning, semantic embedding, transformer models
Written by Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
