How We Boosted Embedding Service Throughput 16× with Cloud‑Native Optimizations

This article details the cost and speed challenges of generating embedding vectors for large‑scale log data, analyzes the choice of inference framework, describes GPU‑utilization, priority‑queuing, and pipeline redesigns, and reports a 16‑fold throughput increase with dramatically lower per‑request cost.

Background and Challenges

In semantic indexing, embeddings determine recall quality but also dominate cost: processing 1 GB of log data cost several hundred CNY and ran at only about 100 KB/s, which made large‑scale streaming log workloads impractical in production.

To address performance and cost pressure, a systematic analysis of the embedding inference bottleneck was performed, targeting a 16× throughput boost while reducing per‑request resource consumption.

Key Technical Challenges

Inference framework selection

Multiple frameworks exist (vLLM, sglang, llama.cpp, TensorRT, sentence‑transformers); the chosen framework must maximize GPU performance for embedding workloads.

Framework efficiency (continuous batching, kernel optimizations) is a critical performance factor.

Maximizing GPU utilization

Batch processing is essential; single requests are far less efficient than batched inference.

Parallelism requires decoupling CPU preprocessing (tokenization), network I/O, and GPU computation.

Embedding models are relatively small, so a single A10 GPU can host multiple replicas (each consuming roughly 15% of compute and 13% of memory, i.e. six or more replicas per card in principle); packing replicas efficiently is key to cost reduction.

Priority scheduling

Index building (large‑batch, low priority) and online queries (small‑batch, high priority) must coexist without query requests being blocked by build tasks.

End‑to‑end bottlenecks

After improving GPU throughput, tokenization became the new bottleneck.

Optimization Strategies

1. Adopt vLLM as the core inference engine (replace llama.cpp)

Testing showed that vLLM and sglang achieve twice the throughput of llama.cpp with 60% lower average GPU utilization, thanks to continuous batching and highly optimized CUDA kernels.
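
For context, a minimal sketch of running an embedding model through vLLM's offline API is shown below. The model name is a placeholder and the exact task flag varies between vLLM releases; the production setup instead drives the engine from Triton, as described next.

```python
# Minimal sketch: batch embedding with vLLM's offline API.
# Assumes a recent vLLM release with pooling/embedding support; the model
# name is a placeholder and the exact task flag varies across versions.
from vllm import LLM

llm = LLM(model="BAAI/bge-m3", task="embed")

log_lines = [
    "ERROR: connection reset by peer",
    "user login succeeded from 10.0.0.12",
]

# vLLM applies continuous batching internally; each output holds a pooled vector.
outputs = llm.embed(log_lines)
vectors = [o.outputs.embedding for o in outputs]
print(len(vectors), len(vectors[0]))
```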

2. Deploy multiple model replicas on a single GPU using Triton Inference Server

Triton controls the number of replicas and provides dynamic batching, allowing requests to be routed to different replicas while bypassing the vLLM HTTP server. The core library is invoked directly via LLMEngine in Triton's Python backend, reducing overhead.
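
A simplified sketch of such a Python‑backend model.py is shown below. The tensor names (TEXT, EMBEDDING) and model paths are illustrative, and for brevity it uses vLLM's high‑level LLM wrapper rather than driving LLMEngine step by step.

```python
# models/embedding/1/model.py - simplified Triton Python-backend sketch.
# Tensor names and model paths are illustrative; error handling is omitted.
import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import LLM


class TritonPythonModel:
    def initialize(self, args):
        # One engine per Triton model instance; instance_group in config.pbtxt
        # controls how many replicas are packed onto the GPU.
        self.llm = LLM(model="/models/embedding-hf", task="embed",
                       gpu_memory_utilization=0.13)

    def execute(self, requests):
        responses = []
        for request in requests:
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            batch = [t.decode("utf-8") for t in texts.reshape(-1)]
            outs = self.llm.embed(batch)  # in-process call, no HTTP hop
            emb = np.array([o.outputs.embedding for o in outs], dtype=np.float32)
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("EMBEDDING", emb)]))
        return responses
```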

3. Decouple tokenization from model inference

Tokenization became the performance bottleneck after scaling GPU usage. llama.cpp’s tokenizer is six times faster than vLLM’s, so tokenization is performed with llama.cpp and the resulting token IDs are fed to vLLM, eliminating the tokenizer bottleneck.
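
A hedged sketch of this split is shown below. It assumes llama-cpp-python plus a GGUF copy of the same model, so that token IDs line up with vLLM's vocabulary; all paths are placeholders.

```python
# Sketch: tokenize on CPU with llama.cpp, run inference with vLLM.
# Assumes a GGUF copy of the *same* model so token IDs match vLLM's vocabulary.
from llama_cpp import Llama
from vllm import LLM

# vocab_only loads only the tokenizer (no weights), so it stays cheap on CPU.
tokenizer = Llama(model_path="/models/embedding.gguf", vocab_only=True)
llm = LLM(model="/models/embedding-hf", task="embed")


def embed(texts):
    # CPU-side tokenization with llama.cpp.
    token_batches = [tokenizer.tokenize(t.encode("utf-8")) for t in texts]
    # Hand raw token IDs to vLLM so its own (slower) tokenizer is skipped.
    outs = llm.embed([{"prompt_token_ids": ids} for ids in token_batches])
    return [o.outputs.embedding for o in outs]
```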

4. Use priority queues and dynamic batching in Triton

Embedding queries are assigned higher priority, reducing latency, while dynamic batching aggregates requests to improve overall throughput.
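
As an illustration, the client-side call might look like the sketch below. It assumes the model's config.pbtxt enables dynamic_batching with priority_levels; the model and tensor names are placeholders.

```python
# Sketch: sending a latency-sensitive query to Triton at high priority.
# Assumes dynamic_batching with priority_levels is enabled in config.pbtxt;
# model name and tensor names are illustrative.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([b"failed to connect to redis"], dtype=np.object_)
query = httpclient.InferInput("TEXT", text.shape, "BYTES")
query.set_data_from_numpy(text)

# In Triton, a lower priority value means higher priority; index-build
# batches would be submitted with a larger value (or the default level).
result = client.infer("embedding", inputs=[query], priority=1)
embedding = result.as_numpy("EMBEDDING")
```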

5. Full pipeline redesign

The semantic index construction was broken into a DAG of asynchronous tasks: DeserializeData → Chunking (parallel) → GenerateBatch → Embedding (parallel) → CollectResult → BuildIndex → Serialize → Finish. This enables concurrent CPU, network, and GPU utilization.

Additionally, a data‑driven scheduling framework was built to drive the pipeline efficiently.
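
A minimal asyncio sketch of the idea, with queues connecting stubbed-out stages, might look like the following; stage bodies, dimensions, and the batch size are placeholders rather than the production implementation.

```python
# Minimal asyncio sketch of the staged pipeline; queues connect the stages so
# CPU chunking, GPU embedding, and index building overlap. Stage bodies are stubs.
import asyncio


async def chunking(src, dst):
    while (doc := await src.get()) is not None:
        for chunk in doc.split("\n"):            # stand-in for real chunking
            await dst.put(chunk)
    await dst.put(None)


async def embed_batch(batch):
    await asyncio.sleep(0)                       # placeholder for the Triton call
    return [[0.0] * 768 for _ in batch]


async def embedding(src, dst, batch_size=32):
    batch = []
    while (chunk := await src.get()) is not None:
        batch.append(chunk)
        if len(batch) == batch_size:             # GenerateBatch
            await dst.put(await embed_batch(batch))
            batch = []
    if batch:
        await dst.put(await embed_batch(batch))
    await dst.put(None)


async def build_index(src):
    while (vectors := await src.get()) is not None:
        pass                                     # CollectResult -> BuildIndex (stub)


async def main(docs):
    q1, q2, q3 = asyncio.Queue(64), asyncio.Queue(64), asyncio.Queue(64)

    async def deserialize():                     # DeserializeData
        for doc in docs:
            await q1.put(doc)
        await q1.put(None)

    await asyncio.gather(deserialize(), chunking(q1, q2),
                         embedding(q2, q3), build_index(q3))


asyncio.run(main(["line1\nline2", "line3"]))
```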

Results

Throughput increased from 170 KB/s to 3 MB/s (≈16× improvement).

Cost per million tokens dropped to 0.01 CNY, two orders of magnitude cheaper than typical industry solutions.

These optimizations transformed the embedding service into a high‑performance, cost‑effective component suitable for large‑scale log processing.
