How We Boosted Embedding Throughput 16× and Cut Vector Index Costs in a Cloud‑Native Setup

This article examines the high cost and low throughput of embedding vectors in log‑processing scenarios, analyzes the performance bottlenecks of inference frameworks, and details a series of cloud‑native optimizations—including switching to vLLM, deploying multiple model replicas with Triton, decoupling tokenization, and priority queuing—that together raise throughput by 16× and reduce per‑token pricing by two orders of magnitude.


In log‑processing scenarios, vector‑index cost and throughput are major challenges. Embedding is the key to semantic recall and also the core cost driver: embedding 1 GB of data can cost hundreds of yuan and proceeds at only about 100 KB/s, which makes index building and storage costs negligible by comparison. That may be acceptable for a static knowledge base, but SLS ingests a continuous stream of data, and at this cost and speed the approach is not production‑ready.

To address the performance and cost pressure of large‑scale applications, we systematically optimized the embedding service’s inference bottleneck, ultimately achieving a 16× increase in throughput and a significant reduction in per‑request resource cost.

Technical challenges and optimization ideas

1. Inference framework: The market offers many inference frameworks (vLLM, SGLang, llama.cpp, TensorRT, sentence‑transformers, etc.), each with different strengths (general‑purpose vs. specialized, CPU vs. GPU). Selecting the framework that best fits embedding workloads and extracts the most from the GPU is critical.

2. Maximizing GPU utilization: Embedding inference is highly sensitive to batching; single‑request efficiency is far lower than batched processing. We need an efficient request‑batching mechanism, CPU preprocessing (e.g., tokenization) that runs in parallel with GPU computation, and the ability to run multiple model replicas on a single GPU. Because embedding models are relatively small, a single replica on an A10 GPU consumes only about 15 % of its compute and 13 % of its memory, so several replicas can fill the GPU, lowering cost while increasing throughput.

3. Priority scheduling: Semantic indexing consists of two stages: large‑batch, low‑priority index building and small‑batch, high‑priority online queries. A fine‑grained priority‑queue scheduler is required so that query embeddings are not blocked by index‑building tasks.

4. End‑to‑end bottlenecks: Once GPU utilization improves, other stages, such as tokenization, become the new performance bottlenecks.

Optimization solutions

Optimization 1: Choose vLLM as the core inference engine (replacing llama.cpp) – vLLM and SGLang deliver roughly twice the throughput of llama.cpp and raise average GPU utilization by 60 %. The advantage stems from vLLM’s continuous batching and highly optimized CUDA kernels.
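
As a rough illustration of the batched style this relies on, here is a minimal sketch of offline embedding with vLLM’s Python API. The model name is only an example, and the embedding entry points (task="embed", LLM.embed) are assumptions that vary across vLLM versions.

```python
# Minimal sketch: offline batch embedding with vLLM.
# Assumptions: a recent vLLM release where LLM(..., task="embed") and
# LLM.embed() exist; the model name below is only an example.
from vllm import LLM

llm = LLM(model="BAAI/bge-m3", task="embed")

chunks = [
    "2024-05-01 10:00:01 ERROR payment-service timeout after 3000ms",
    "2024-05-01 10:00:02 INFO checkout-service order 8231 created",
]

# The whole list is submitted at once so vLLM's continuous batching can pack
# requests together; this is where most of the gain over per-request calls comes from.
outputs = llm.embed(chunks)
vectors = [o.outputs.embedding for o in outputs]
print(len(vectors), len(vectors[0]))
```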

Optimization 2: Deploy multiple model replicas on a single GPU – We use Triton Inference Server as the serving framework; it makes it easy to control the number of model replicas per GPU and provides dynamic batching. By invoking the vLLM core library (LLMEngine) directly from Triton’s Python backend, we also eliminate an extra HTTP hop.

Optimization 3: Decouple tokenization from model inference – After scaling vLLM, tokenization becomes the new bottleneck. llama.cpp’s tokenizer is about six times faster than vLLM’s, so we perform tokenization with llama.cpp and feed the resulting token IDs to vLLM, bypassing vLLM’s tokenizer limitation.
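
A sketch of the decoupling idea, under the assumption that llama-cpp-python provides the tokenizer and that the installed vLLM version accepts pre-tokenized input via TokensPrompt and skip_tokenizer_init; the model paths are illustrative, and the two tokenizers must of course share the same vocabulary.

```python
# Sketch: tokenize on the CPU with llama.cpp, then hand token IDs to vLLM so the
# GPU engine skips its slower tokenizer.
# Assumptions: llama-cpp-python is installed, the vLLM version exposes
# TokensPrompt and skip_tokenizer_init, and both tokenizers share one vocabulary.
from llama_cpp import Llama
from vllm import LLM
from vllm.inputs import TokensPrompt

# vocab_only=True loads only the tokenizer/vocabulary, not the model weights.
tokenizer = Llama(model_path="/models/bge-m3.gguf", vocab_only=True)

def tokenize(text: str) -> list[int]:
    return tokenizer.tokenize(text.encode("utf-8"), add_bos=True)

# skip_tokenizer_init avoids loading vLLM's own tokenizer at all.
llm = LLM(model="BAAI/bge-m3", task="embed", skip_tokenizer_init=True)

texts = ["ERROR payment-service timeout", "INFO order 8231 created"]
prompts = [TokensPrompt(prompt_token_ids=tokenize(t)) for t in texts]
outputs = llm.embed(prompts)
vectors = [o.outputs.embedding for o in outputs]
```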

Optimization 4: Priority queue and dynamic batching – Triton’s built‑in priority queuing and dynamic batching match the needs of embedding services. Query embeddings receive higher priority, reducing latency, while dynamic batching improves overall throughput.
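
A hedged client-side sketch of the priority split, assuming the model’s Triton configuration enables dynamic batching with priority levels; the model and tensor names (embedder, TEXT, EMBEDDING) are illustrative.

```python
# Sketch: send online queries at a higher priority than bulk index building.
# Assumptions: the model's config.pbtxt enables dynamic_batching with
# priority_levels; "embedder", "TEXT", and "EMBEDDING" are illustrative names.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

def embed(texts, priority):
    data = np.array([t.encode("utf-8") for t in texts], dtype=np.object_).reshape(-1, 1)
    inp = grpcclient.InferInput("TEXT", data.shape, "BYTES")
    inp.set_data_from_numpy(data)
    # In Triton, lower priority values are scheduled ahead of higher ones.
    result = client.infer(model_name="embedder", inputs=[inp], priority=priority)
    return result.as_numpy("EMBEDDING")

query_vectors = embed(["user query: pod OOMKilled"], priority=1)  # online query
build_vectors = embed(["archived log chunk"], priority=2)         # index building
```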

Final architecture design

The embedding service is exposed as an independent remote service. The end‑to‑end pipeline now includes asynchronous and parallel stages: data reading → chunking → embedding request → result processing/storage, all driven by a DAG‑based task scheduler that supports parallel execution of individual tasks.
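
To make the shape of the pipeline concrete, here is a condensed sketch of queue-connected asynchronous stages; the stage names, batch size, and queues are illustrative and stand in for the actual DAG scheduler rather than reproducing it.

```python
# Sketch: asynchronous pipeline stages connected by queues so that reading,
# chunking, embedding, and storage overlap instead of running sequentially.
import asyncio

async def read_stage(out_q: asyncio.Queue):
    for record in ["log line A", "log line B", "log line C"]:
        await out_q.put(record)
    await out_q.put(None)  # end-of-stream marker

async def chunk_stage(in_q: asyncio.Queue, out_q: asyncio.Queue):
    while (record := await in_q.get()) is not None:
        for chunk in [record]:  # real chunking would split long records
            await out_q.put(chunk)
    await out_q.put(None)

async def embed_stage(in_q: asyncio.Queue, out_q: asyncio.Queue):
    batch = []
    while (chunk := await in_q.get()) is not None:
        batch.append(chunk)
        if len(batch) >= 2:  # accumulate a batch before calling the service
            await out_q.put(await call_embedding_service(batch))
            batch = []
    if batch:
        await out_q.put(await call_embedding_service(batch))
    await out_q.put(None)

async def store_stage(in_q: asyncio.Queue):
    while (vectors := await in_q.get()) is not None:
        print("indexing", len(vectors), "vectors")

async def call_embedding_service(batch):
    await asyncio.sleep(0.01)  # placeholder for the remote embedding request
    return [[0.0] * 4 for _ in batch]

async def main():
    q1, q2, q3 = asyncio.Queue(8), asyncio.Queue(8), asyncio.Queue(8)
    await asyncio.gather(
        read_stage(q1), chunk_stage(q1, q2), embed_stage(q2, q3), store_stage(q3),
    )

asyncio.run(main())
```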

Key components:

Pipeline task orchestration (deserialization, chunking, batch generation, embedding, result collection, index building, serialization, completion)

Data‑ and event‑driven scheduling framework

Fully revamped build process with extensive code refactoring

After the pipeline overhaul, benchmark results show:

Throughput increased from 170 KB/s to 3 MB/s (≈ 16× improvement)

Vector‑index pricing reduced to 0.01 CNY per million tokens, a cost advantage of two orders of magnitude over comparable industry solutions.

For more details, refer to the SLS vector index documentation: https://help.aliyun.com/zh/sls/vector-index

Tags: performance optimization, GPU inference, embedding, vector index
Written by

Alibaba Cloud Observability

Driving continuous progress in observability technology!
