How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

This article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads. It reports a 50‑60% reduction in GPU cost through systematic resource allocation, model quantization, prefill/decode (PD) separation, speculative sampling, and kernel‑level optimizations, while maintaining service stability.

GPU utilization · Model Quantization · Performance Evaluation

18 min read

Tencent Technical Engineering

Oct 31, 2025 · Artificial Intelligence

How SpecExit Cuts LLM Reasoning Chains by 66% and Boosts Inference Speed 2.5×

SpecExit combines speculative sampling with a lightweight draft model that predicts early‑exit signals, shortening the reasoning chains of large reasoning models by up to two‑thirds and delivering up to 2.5× end‑to‑end inference speedup on vLLM without sacrificing accuracy.

AI efficiency · Early Stopping · Inference Optimization

12 min read

AI Frontier Lectures

Jul 29, 2025 · Industry Insights

SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge is an open‑source training framework built on Eagle3 that enables end‑to‑end speculative sampling for ultra‑large language models. It integrates tightly with the SGLang inference engine, offers online and offline training modes, and supports advanced parallelism strategies. Benchmark tests show up to 2.18× inference speedup, and all code and pretrained draft models are available on GitHub and Hugging Face.

AI Performance · Inference Acceleration · Open Source

9 min read

DataFunSummit

Nov 4, 2024 · Artificial Intelligence

Performance Optimization Techniques for Large Model Inference Frameworks

This article outlines four key optimization areas for large model inference frameworks: quantization, speculative sampling, time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT) improvements, and communication optimization. It details specific techniques, experimental results, and practical benefits such as reduced memory usage, lower latency, and higher throughput.

AI · Performance · Speculative Sampling

12 min read

Xiaohongshu Tech REDtech

Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns training and decoding contexts and objectives for speculative sampling, using harmonized objective distillation and multi-step context alignment, achieving 2.81–4.05× speedup and 8%–20% improvement over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.

AI · HASS · Inference Acceleration

11 min read

Alibaba Cloud Developer

Feb 20, 2024 · Artificial Intelligence

Boost LLM Inference Speed with KV‑Cache Reuse and Speculative Sampling

This article explains two production‑grade optimization techniques for large language model inference—KV‑cache reuse across multi‑turn dialogues and speculative sampling with a small draft model—detailing their design, implementation, and performance impact.

AI · Inference Optimization · KV Cache

14 min read
