How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%
The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving 50‑60% GPU cost reduction through systematic resource allocation, model quantization, PD‑separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.
Background
Large language model (LLM) inference at Huolala faces high request concurrency, long context lengths, and strict latency constraints. Efficient resource usage and cost reduction are required for stable production services.
Solution Overview
The Dolphin platform provides a cloud‑native inference stack organized into four layers (Business, AI Capability, AI Engine, Infrastructure). It unifies resource scheduling, model acceleration, and evaluation, supporting multiple model engines and runtime governance.
Key Technical Capabilities
Resource Allocation Strategy
GPU memory is divided into three parts:
Model weights – permanent allocation.
System/activation buffers – a small reserve (≤5% of GPU memory) for scheduling spikes.
KV‑Cache – dominates memory usage and grows linearly with context length, number of layers, heads, head dimension, and concurrent requests.
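For intuition, the KV‑Cache term can be estimated directly from the model shape. A minimal Python sketch (the Llama‑3.1‑8B‑like dimensions are illustrative, not Huolala's production values):

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   concurrency, dtype_bytes=2):
    """Per-token KV-Cache = 2 (K and V) * layers * KV heads * head_dim * dtype size."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len * concurrency

# Example: a Llama-3.1-8B-like model (32 layers, 8 KV heads, head_dim 128, FP16)
# serving 32 concurrent requests at 8K context:
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       context_len=8192, concurrency=32)
print(f"{total / 1e9:.1f} GB")  # ~34.4 GB -- more than the ~16 GB of weights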
Allocation follows a business‑driven workflow:
Online workload profiling: Continuously collect context‑length distribution, peak concurrency, and request patterns.
KV‑Cache hard lower bound: Compute the minimal memory needed for the target concurrency and enforce it as a hard constraint.
Dynamic configuration: Based on GPU model (e.g., A10, L20, H20) and deployment mode (single‑card, multi‑card, distributed), set per‑instance memory limits and concurrency caps to reduce idle memory and enable mixed‑model workloads (see the sketch after this list).
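A sketch of how the hard lower bound and per‑GPU limits could combine into a concurrency cap (the reserve fraction, weight size, and per‑request KV footprint below are assumptions for illustration, not the platform's actual policy):

def max_concurrency(gpu_mem_gb, weights_gb, reserve_frac=0.05,
                    kv_gb_per_request=1.07):
    """Concurrency cap = KV budget left after weights and the system/activation
    reserve, divided by the KV footprint of one typical request."""
    kv_budget_gb = gpu_mem_gb * (1 - reserve_frac) - weights_gb
    return max(0, int(kv_budget_gb // kv_gb_per_request))

# Illustrative: an 8B FP16 model (~16 GB weights, ~1.07 GB KV per 8K-context
# request, as computed above) on three GPU models:
for gpu, mem_gb in [("A10", 24), ("L20", 48), ("H20", 96)]:
    print(gpu, max_concurrency(mem_gb, weights_gb=16))  # A10: 6, L20: 27, H20: 70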
Inference Optimization System
Model‑Level Optimizations
Quantization: Deploy FP16, FP8, or INT4 (via GPTQ/AWQ) according to latency‑accuracy trade‑offs. FP16 gives the highest precision, FP8 balances accuracy and throughput on Hopper‑series GPUs, and INT4 offers maximal memory savings but requires stricter accuracy validation.
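For example, serving a pre‑quantized INT4 (AWQ) checkpoint with vLLM's Python API is a one‑argument change. A minimal sketch (the model name is illustrative):

from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint; vLLM picks up the quantization config
# from the model files, and quantization="awq" makes the choice explicit.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ", quantization="awq")

outputs = llm.generate(["Summarize PD separation in one sentence."],
                       SamplingParams(max_tokens=64, temperature=0.7))
print(outputs[0].outputs[0].text)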
Distillation: Train a smaller student model from a large teacher. Example: a 70‑billion‑parameter general model distilled into a lightweight driver‑assistant model that retains >90% hit rate and user satisfaction at a fraction of the original model size.
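The core of such distillation is training the student against the teacher's softened output distribution. A minimal PyTorch sketch of the standard soft‑target loss (illustrative, not Huolala's training code):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target KL term against the teacher with the usual
    hard-label cross-entropy; the T**2 factor keeps gradient magnitudes
    comparable across temperatures."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard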
Framework‑Level Optimizations
Prefill‑Decode (PD) Separation: Split inference into a Prefill instance (context encoding, KV‑Cache generation) that uses tensor parallelism for low time‑to‑first‑token (TTFT), and a Decode instance (token‑by‑token generation) that uses data parallelism for higher throughput. This mitigates the GPU under‑utilization that arises when the compute‑bound Prefill phase and the memory‑bound Decode phase share the same hardware.
# Launch Prefill node on GPU 0
export CUDA_VISIBLE_DEVICES=0
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 20003 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--kv-transfer-config '{
"kv_connector": "P2pNcclConnector",
"kv_role": "kv_producer",
"kv_port": "21001",
"kv_buffer_size": "1e1",
"kv_connector_extra_config": {
"proxy_ip": "0.0.0.0",
"proxy_port": "30001",
"send_type": "PUT_ASYNC",
"nccl_num_channels": "16"
}
}'

# Launch Decode node on GPU 1
export CUDA_VISIBLE_DEVICES=1
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 20005 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.7 \
--trust-remote-code \
--kv-transfer-config '{
"kv_connector": "P2pNcclConnector",
"kv_role": "kv_consumer",
"kv_port": "22001",
"kv_buffer_size": "8e9",
"kv_connector_extra_config": {
"proxy_ip": "0.0.0.0",
"proxy_port": "30001",
"send_type": "PUT_ASYNC",
"nccl_num_channels": "16"
}
}'

A lightweight proxy service registers nodes, balances load, and handles KV‑Cache transfer:
# Start proxy service
python3 disagg_proxy_p2p_nccl_xpyd.py

Test the end‑to‑end pipeline with a simple curl request:
curl http://localhost:30001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "请简要介绍一下大模型推理中的 Prefill 和 Decode 阶段。",
"max_tokens": 100,
"temperature": 0.7,
"stream": false
}'

Speculative Sampling
Speculative sampling (Draft → Verify → Accept/Reject) uses a small draft model to generate multiple tokens in parallel, then validates them with the large target model, reducing end‑to‑end latency by 30‑60% without degrading output quality.
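Conceptually, one Draft → Verify → Accept/Reject round looks like the sketch below (a simplified greedy‑verification variant; EAGLE‑style systems use probabilistic rejection sampling and tree drafting, and both model interfaces here are hypothetical):

def speculative_step(draft_next, target_argmax_batch, prefix, k=4):
    """One round: draft k tokens cheaply, verify them in a single target-model
    forward pass, keep the longest agreeing prefix plus one corrected token.

    draft_next(tokens) -> next token from the small draft model (hypothetical);
    target_argmax_batch(prefix, drafted) -> the target model's argmax token at
    each of the k+1 positions, from one batched forward pass (hypothetical).
    """
    # 1. Draft: the small model proposes k tokens autoregressively.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))
    # 2. Verify: one large-model pass scores every drafted position at once.
    target_tokens = target_argmax_batch(prefix, drafted)
    # 3. Accept/Reject: keep matches; on the first mismatch, substitute the
    #    target's own token; if everything matched, the (k+1)-th token is free.
    accepted = []
    for d, t in zip(drafted, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    else:
        accepted.append(target_tokens[k])
    return prefix + accepted

The SGLang launch command below enables EAGLE3‑based speculative decoding in practice: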
python3 -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.65

Model Evaluation Capability
Performance is measured with objective metrics:
Time‑to‑first‑token (TTFT)
Time‑per‑output‑token (TPOT)
Inter‑token latency (ITL)
End‑to‑end latency (E2EL)
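All four metrics follow directly from per‑token timestamps of a streamed response; a minimal sketch, assuming the wall‑clock arrival times have already been collected:

def latency_metrics(request_start, token_times):
    """Derive TTFT, TPOT, ITL, and E2EL (all in seconds) from the arrival
    timestamps of one streamed response's output tokens."""
    ttft = token_times[0] - request_start                        # first token
    e2el = token_times[-1] - request_start                       # full response
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # per-gap list
    tpot = (e2el - ttft) / max(1, len(token_times) - 1)          # mean gap
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}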
Metrics are reported at percentile values (P90‑P99). The evaluation workflow:
Capture business input/output length distributions and separate fixed prefixes from user inputs.
Generate a scenario‑aligned dataset matching those lengths.
Gradually increase QPS until a target latency percentile (e.g., P99 E2EL ≤ 2 s) is reached, establishing the maximum sustainable throughput.
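Step 3 amounts to a search over request rates. A sketch of the ramp loop, where run_benchmark is a hypothetical helper that replays the scenario‑aligned dataset at a given QPS and returns the measured P99 E2EL in seconds:

def find_max_qps(run_benchmark, slo_p99_e2el_s=2.0, start=1.0, step=1.0, limit=50.0):
    """Raise QPS until the P99 E2EL SLO is first violated; the last passing
    rate is reported as the maximum sustainable throughput."""
    best = 0.0
    qps = start
    while qps <= limit:
        if run_benchmark(qps) <= slo_p99_e2el_s:
            best = qps
            qps += step
        else:
            break
    return best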
Example benchmark command (vLLM's benchmark_serving.py) for a 32‑billion‑parameter Qwen model:
python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model /models/qwen3-32b \
--tokenizer /models/qwen3-32b \
--host 0.0.0.0 \
--port 8000 \
--num-prompts 200 \
--percentile-metrics ttft,tpot,itl,e2el \
--metric-percentiles 90,95,99 \
--request-rate 3 \
--dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \
--sonnet-input-len 4900 \
--sonnet-output-len 10 \
--sonnet-prefix-len 4500 \
--seed 1772174119 \
--trust-remote-code

Results
By combining quantization, distillation, PD‑separation, speculative sampling, PagedAttention, and FlashAttention, the Dolphin platform achieved:
GPU memory utilization >95% on mid‑range GPUs (A10, L20).
Latency reductions of 10‑40% for long‑context workloads.
Throughput gains of 1.5‑2×.
GPU cost savings of 20‑50% and reduced operational complexity.
Future Outlook
Continued engineering of the Dolphin platform will further unlock large‑model value for business efficiency, user experience, and cost optimization.