Tagged articles
6 articles
Page 1 of 1
Huolala Tech
Huolala Tech
Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving 50‑60% GPU cost reduction through systematic resource allocation, model quantization, PD‑separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilizationModel QuantizationPerformance Evaluation
0 likes · 18 min read
How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%
AI Frontier Lectures
AI Frontier Lectures
Jul 29, 2025 · Industry Insights

SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge, an open‑source training framework built on Eagle3, enables end‑to‑end speculative sampling for ultra‑large language models, integrates tightly with the SGLang inference engine, offers online and offline training modes, supports advanced parallelism strategies, and demonstrates up to 2.18× inference speedup on benchmark tests, with all code and pretrained drafts available on GitHub and Hugging Face.

AI PerformanceInference AccelerationOpen Source
0 likes · 9 min read
SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×
DataFunSummit
DataFunSummit
Nov 4, 2024 · Artificial Intelligence

Performance Optimization Techniques for Large Model Inference Frameworks

This article outlines four key optimization areas for large model inference frameworks—quantization, speculative sampling, TTFT/TPOT improvements, and communication optimization—detailing specific techniques, experimental results, and practical benefits such as reduced memory usage, lower latency, and higher throughput.

AIPerformanceSpeculative Sampling
0 likes · 12 min read
Performance Optimization Techniques for Large Model Inference Frameworks
Xiaohongshu Tech REDtech
Xiaohongshu Tech REDtech
Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns training and decoding contexts and objectives for speculative sampling, using harmonized objective distillation and multi-step context alignment, achieving 2.81–4.05× speedup and 8%–20% improvement over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.

AIHASSInference Acceleration
0 likes · 11 min read
Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference