Tagged articles

inference acceleration

36 articles

Jun 30, 2026 · Artificial Intelligence

Beyond DeepSeek: Open‑Source JetSpec and Other Projects Accelerate Large‑Model Decoding Up to 10×

The article compares DSpark and JetSpec, two recent open‑source speculative decoding frameworks that approach inference efficiency from different angles: DSpark reduces verification work at the system level, while JetSpec improves token acceptance at the algorithm level. It reports up to 9.64× end‑to‑end speedup on Qwen3‑8B and significant gains across math, code, and dialogue benchmarks.

DSpark · JetSpec · causal consistency

0 likes · 14 min read

Geek Labs

Jun 29, 2026 · Artificial Intelligence

DeepSpec Boosts Large-Model Inference Speed by 2–5× with Speculative Decoding

DeepSpec, an open‑source framework from DeepSeek, accelerates large‑language‑model inference by 2–5× through speculative decoding, where a lightweight draft model generates candidate tokens that the target model validates in parallel, reducing the serial bottleneck of autoregressive decoding and offering a full‑stack pipeline from data preparation to evaluation.

DeepSpec · GPU · Python

0 likes · 6 min read
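The draft-then-verify loop described in this summary can be sketched as follows. This is a minimal illustration of greedy speculative decoding in general, not DeepSpec's actual API: `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are toy functions that map a token prefix to the next token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Toy 'large' model (assumption for illustration only)."""
    return (prefix[-1] * 2 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target except after token 3."""
    return 5 if prefix[-1] == 3 else toy_target(prefix)


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals target-only greedy decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position. A real engine scores all
        #    k positions in one batched forward pass; the loop is for clarity.
        accepted: List[int] = []
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every draft token matched, so the target contributes a bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:len(prompt) + max_new]


result = speculative_decode(toy_target, toy_draft, [1], max_new=10)
print(result == greedy_decode(toy_target, [1], max_new=10))
```

Each round emits at least one token that the target itself would have produced, so the output never diverges from plain greedy decoding; the speedup comes from how many draft tokens are accepted per target pass.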

DataFunSummit

Jun 13, 2026 · Artificial Intelligence

Beyond General LLMs: Efficient Adaptation and Data Value Mining for Finance

The article details a systematic practice—starting from the “iceberg” challenges of finance, through data and knowledge engineering, reverse knowledge extraction with REER, multi‑dimensional synthetic data generation, prompt engineering (APO), cost‑aware fine‑tuning, inference acceleration, and emotion‑value evaluation—culminating in actionable guidelines for deploying large models in banking scenarios.

Knowledge Engineering · emotion evaluation · finance

0 likes · 14 min read

Machine Heart

Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Efficient Attention · KV cache reduction · LCA

0 likes · 10 min read
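To put the 90% figure in perspective, the sketch below estimates KV-cache size for a single 128K-token sequence using the standard size formula for key and value tensors. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical assumption for illustration and is not taken from the article; `kv_cache_bytes` is likewise an illustrative helper name.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape (assumption, not from the article).
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=128 * 1024)
reduced = baseline * (1 - 0.90)  # the 90% reduction quoted in the summary

print(f"baseline: {baseline / 2**30:.1f} GiB, after 90% cut: {reduced / 2**30:.1f} GiB")
# prints "baseline: 16.0 GiB, after 90% cut: 1.6 GiB"
```

Under these assumed dimensions, the cache for one long sequence drops from 16 GiB to about 1.6 GiB, which is why cache reduction matters most at long context lengths.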

Old Zhang's AI Learning

Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on a single GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlash · Qwen3.5 · SGLang

0 likes · 12 min read

Machine Heart

Apr 10, 2026 · Artificial Intelligence

Keeping Image Quality with Only 20 Diffusion Steps: The TC‑Padé Acceleration Method

TC‑Padé uses a Padé‑based residual prediction framework, step‑aware strategies, and a trajectory‑stability indicator to accelerate diffusion sampling to as few as 20 steps while preserving visual fidelity, achieving up to 2.88× speed‑up on image generation and 1.72× on video generation.

Padé approximation · TC-Padé · image generation

0 likes · 12 min read

Machine Heart

Apr 1, 2026 · Artificial Intelligence

SSD Framework Doubles Inference Speed Over Top Engines, Breaking the Serial Bottleneck

The SSD framework and its SAGUARO optimization, developed by researchers from Stanford, Princeton, and Together AI, parallelize drafting and verification in speculative decoding, eliminating serial dependencies and achieving up to 2× faster inference than the world’s strongest engines and up to 5× speedup over standard autoregressive generation, while addressing challenges such as prediction accuracy, acceptance‑rate trade‑offs, and fallback strategies.

SAGUARO · SSD · inference acceleration

0 likes · 7 min read

AI Code to Success

Mar 27, 2026 · Artificial Intelligence

How Google’s TurboQuant Cuts LLM Memory by 6× and Speeds Up Inference 8×

Google Research’s TurboQuant algorithm compresses large‑language‑model KV caches from 32‑bit to 3‑bit, achieving a six‑fold reduction in memory usage and an eight‑fold inference speedup on H100 GPUs while preserving 100% accuracy, and it also improves vector search performance without requiring large codebooks.

AI efficiency · LLM compression · TurboQuant

0 likes · 10 min read

Old Zhang's AI Learning

Feb 24, 2026 · Industry Insights

How Taalas HC1 Embeds Llama 3.1 8B in Silicon to Achieve 17k tokens/s

Taalas embeds the Llama 3.1 8B model directly into a 6nm ASIC, delivering 17,000 tokens per second, nearly ten times faster than top NVIDIA GPUs, while cutting system cost more than tenfold and power consumption tenfold, albeit with limited flexibility and quantization trade‑offs.

AI hardware · ASIC · Llama 3.1

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Feb 12, 2026 · Artificial Intelligence

How Huawei’s MindScale Cuts Agent Token Usage 5.7× and Automates Prompt & Workflow Design

The article outlines the four major obstacles hindering industry‑specific LLM agents (manual workflow maintenance, poor knowledge reuse, training‑inference inefficiency, and complex reasoning evaluation) and explains how Huawei Noah’s MindScale package tackles them with self‑evolving workflows, automated prompt optimization, and a novel KV‑Embedding cache that cuts token consumption by 5.7× while boosting inference speed by up to 70%.

Industry Agent · KV-Embedding · Large Language Model

0 likes · 7 min read

HyperAI Super Neural

Feb 10, 2026 · Artificial Intelligence

WeDLM Diffusion Language Model Tutorial: 3× Faster Inference Than vLLM AR Models

The Tencent WeChat AI team introduces WeDLM, a diffusion language model that, through topological reordering, surpasses autoregressive models on the industrial‑grade vLLM engine with over threefold speedup on math reasoning and up to tenfold in low‑entropy scenarios, and provides a step‑by‑step online tutorial with GPU compute credits.

Diffusion language model · GPU compute · Large Language Model

0 likes · 5 min read

AI2ML AI to Machine Learning

Feb 4, 2026 · Artificial Intelligence

Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

The article analyzes Google’s shift from scaling‑law to efficiency‑law, detailing how speculative decoding, language‑model cascades, distillation, CALM, accurate quantized training, and the Mixture‑of‑Recursions architecture together form a multi‑layered strategy to cut inference cost, boost throughput, and sustain the company’s AI moat.

Google TPU · Language Model Cascades · Quantization

0 likes · 8 min read

Bilibili Tech

Jan 28, 2026 · Artificial Intelligence

Boosting Video Generation Inference: Full Graph Compilation with torch.compile

This article examines the challenges of optimizing video generation model inference, moving from operator-level tweaks to full-graph compilation using torch.compile, and details systematic strategies to eliminate Graph Breaks, handle dynamic shapes, KV-Cache indexing, and Python-side caches, achieving a 47.6% speedup on a 14B model without accuracy loss.

AI · graph optimization · inference acceleration

0 likes · 14 min read

AI Frontier Lectures

Jan 25, 2026 · Artificial Intelligence

Turning Chain‑of‑Thought into Images: The Render‑of‑Thought Breakthrough

Render‑of‑Thought (RoT) proposes a novel visual‑latent reasoning framework that compresses textual chain‑of‑thought into dense image embeddings, achieving faster inference, better interpretability, and plug‑and‑play integration without costly pre‑training, as demonstrated on multiple math and logic benchmarks.

Chain-of-Thought · Implicit CoT · LLM

0 likes · 11 min read

AI2ML AI to Machine Learning

Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Continuous Batching · Draft-Target Model · KV cache

0 likes · 8 min read
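The "lossless" property mentioned in this summary rests on the rejection sampling rule used in speculative sampling: a draft token `x` drawn from the draft distribution `q` is accepted with probability `min(1, p[x] / q[x])`, where `p` is the target distribution, and on rejection a replacement is drawn from the normalized residual `max(0, p - q)`. The sketch below is a minimal illustration of that rule; the function name and the example distributions are assumptions, not code from the article.

```python
import random
from typing import List


def accept_or_resample(p: List[float], q: List[float], x: int,
                       rng: random.Random) -> int:
    """Return a token distributed exactly according to p.

    p: target-model distribution, q: draft-model distribution,
    x: a token that was sampled from q (so q[x] > 0).
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual max(0, p - q). A rejection implies
    # p[x] < q[x], so the residual has positive total mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


# Hypothetical three-token vocabulary for illustration.
p = [0.6, 0.3, 0.1]   # target distribution
q = [0.2, 0.5, 0.3]   # draft distribution
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    x = rng.choices(range(3), weights=q, k=1)[0]   # draft proposes a token
    counts[accept_or_resample(p, q, x, rng)] += 1
print([round(c / 20000, 2) for c in counts])  # empirical frequencies, close to p
```

For each token `i`, the accept path contributes `min(p[i], q[i])` and the resample path contributes `max(0, p[i] - q[i])`, which sum to `p[i]`. This is why the accepted stream follows the target model's distribution even though most tokens come from the draft.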

Huawei Cloud Developer Alliance

Nov 24, 2025 · Artificial Intelligence

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

This article explains why Transformer models dominate modern AI agents, outlines the challenges of large parameter counts and latency, and presents a comprehensive guide to model compression (parameter sharing, knowledge distillation, quantization, pruning) and inference acceleration (parallel computing, optimized attention, TensorRT deployment), complete with PyTorch code examples and a real‑world case study showing speed‑up and storage savings.

AI Agent · PyTorch · Transformer

0 likes · 34 min read

Tencent Tech

Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.

AI · early exit · inference acceleration

0 likes · 6 min read

Data Party THU

Oct 10, 2025 · Artificial Intelligence

How DPad Cuts Inference Time 61× While Boosting Accuracy in Diffusion LLMs

The article analyzes a recent Duke University paper that reveals a "scratchpad" mechanism in diffusion large language models, proposes the DPad method to prune redundant suffix tokens before decoding, and demonstrates up to 61.4× faster inference with unchanged or even improved accuracy across multiple benchmarks.

DPad · diffusion LLM · inference acceleration

0 likes · 10 min read

AI Frontier Lectures

Jul 29, 2025 · Industry Insights

SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge, an open‑source training framework built on Eagle3, enables end‑to‑end speculative sampling for ultra‑large language models, integrates tightly with the SGLang inference engine, offers online and offline training modes, supports advanced parallelism strategies, and demonstrates up to 2.18× inference speedup on benchmark tests, with all code and pretrained drafts available on GitHub and Hugging Face.

AI performance · Speculative Sampling · Training Framework

0 likes · 9 min read

AIWalker

Apr 28, 2025 · Artificial Intelligence

SimpleAR: Autoregressive Visual Generation at 1024×1024 Using Only 0.5B Parameters

SimpleAR is a minimalist autoregressive visual generation framework that, with only 0.5B parameters, achieves competitive 1024×1024 image synthesis through a three‑stage pipeline of large‑scale pretraining, supervised fine‑tuning, and GRPO‑based reinforcement learning, and demonstrates significant inference speedups using KV‑cache, vLLM, and speculative decoding.

autoregressive generation · benchmark · inference acceleration

0 likes · 14 min read

DataFunTalk

Apr 2, 2025 · Artificial Intelligence

Trends, Applications, and Future Directions of Large Models and Inference Acceleration

This article examines the current state and future prospects of large AI models and inference acceleration, covering technology trends, diverse application scenarios from research to industry, and the challenges and opportunities that lie ahead for intelligent data governance, multimodal agents, and AGI.

AGI · AI · Data Governance

0 likes · 11 min read

Bilibili Tech

Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution (operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks like MindIE‑LLM) to boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous hardware directions.

Continuous Batching · Multi-modal · Operator fusion

0 likes · 21 min read

Baobao Algorithm Notes

Jan 15, 2025 · Artificial Intelligence

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

This article reviews the evolution of Multi‑Token Prediction (MTP) techniques—from early blockwise parallel decoding to Meta's and DeepSeek's implementations—explaining their architectures, training and inference workflows, and the speed‑up gains they offer for large language models.

DeepSeek · LLM · MTP

0 likes · 20 min read

360 Tech Engineering

Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI Infrastructure · Distributed Computing · GPU Cluster

0 likes · 21 min read

Xiaohongshu Tech REDtech

Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns training and decoding contexts and objectives for speculative sampling, using harmonized objective distillation and multi-step context alignment, achieving 2.81–4.05× speedup and 8%–20% improvement over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.

AI · HASS · Speculative Sampling

0 likes · 11 min read

NewBeeNLP

May 21, 2024 · Industry Insights

How EcomXL Supercharges E‑commerce Image Generation with SDXL Optimizations and 3‑Second Inference

This article details how Alibaba's Wanxiang Lab adapted the SDXL diffusion model for large‑scale e‑commerce image generation, introducing the EcomXL series, a weighted‑distillation fine‑tuning method, hierarchical model fusion, specialized ControlNet variants, and the SLAM inference accelerator to achieve high‑quality, controllable product images within three seconds while boosting business metrics.

AIGC · ControlNet · EcomXL

0 likes · 14 min read

Alimama Tech

May 15, 2024 · Artificial Intelligence

EcomXL: Optimizing SDXL for Large‑Scale E‑commerce Image Generation

EcomXL enhances SDXL for large‑scale e‑commerce image generation by leveraging tens of millions of curated images, a two‑stage fine‑tuning with denoising‑weighted distillation and layer‑wise fusion, specialized ControlNets for inpainting and soft‑edge consistency, and the SLAM inference accelerator to achieve sub‑second generation while boosting visual quality and adoption metrics.

AIGC · ControlNet · EcomXL

0 likes · 15 min read

DeWu Technology

May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations (FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism) can accelerate large language model inference by up to 50% for models as large as 70B parameters while cutting deployment costs.

FlashAttention · Mixture of Experts · PageAttention

0 likes · 17 min read

DataFunTalk

May 8, 2024 · Artificial Intelligence

Intelligent NPCs: Infusing Soul into Game Characters with AI and the Art and Science of Deep Model Inference Acceleration

This talk explores how large‑model AI can give game NPCs personality, outlines the opportunities and challenges of intelligent NPCs, presents a case study of the "Jue Zhi An Nuan" NPC, and discusses future directions, safety compliance, and real‑time multimodal interaction solutions.

AI · Game Development · Game NPC

0 likes · 3 min read

Alibaba Cloud Native

Aug 6, 2023 · Artificial Intelligence

Boost Bloom‑7B1 Inference 2.5× Faster with FasterTransformer on ACK

This guide shows how to accelerate Bloom‑7B1 inference on Alibaba Cloud ACK by converting the model to FasterTransformer format, deploying it with Triton Server, and comparing performance against the original HuggingFace checkpoint, achieving roughly a 2.5‑fold speedup.

Bloom-7B1 · FasterTransformer · Triton Server

0 likes · 17 min read

Tencent Cloud Developer

Dec 12, 2022 · Artificial Intelligence

Performance Optimization of Tencent Cloud OCR Service: Reducing Latency and Improving Throughput

Tencent Cloud’s OCR team cut average response time from 1.8 seconds to under one second and boosted throughput by over 50% by redesigning the model with self‑attention, accelerating inference with a Tensor‑Network accelerator, shrinking RPC payloads, enabling asynchronous logging, and optimizing multi‑region GPU memory utilization.

AI model · Cloud Services · Latency Reduction

0 likes · 13 min read

DataFunSummit

Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Dynamic Batching · Operator fusion · Quantization

0 likes · 12 min read

ITPUB

Apr 27, 2022 · Artificial Intelligence

How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

AI platform · Elastic Scaling · GPU Utilization

0 likes · 26 min read

58 Tech

Jan 10, 2022 · Artificial Intelligence

Resource Utilization Optimization Practices for the 58.com Machine Learning Platform (WPAI)

This article details the 58.com WPAI machine learning platform's architecture and the optimizations applied to training task scheduling, inference service elastic scaling, and offline‑online resource mixing, demonstrating how these techniques significantly improve GPU/CPU utilization and inference performance across both GPU and CPU environments.

AI · Elastic Scaling · inference acceleration

0 likes · 27 min read

Ctrip Technology

Sep 16, 2021 · Artificial Intelligence

Automated AI Model Optimization Platform for Travel Services

This article describes the design, automated workflow, functional modules, and performance results of a comprehensive AI model optimization platform built for Ctrip's travel business, covering operator libraries, graph optimization, model compression techniques such as distillation, quantization, pruning, and deployment integration.

AutoML · ai-optimization · inference acceleration

0 likes · 16 min read

HomeTech

Sep 4, 2019 · Artificial Intelligence

Accelerating TensorFlow Model Inference with NVIDIA TensorRT: Methods, Experiments, and Results

This article explains how to use NVIDIA TensorRT to accelerate TensorFlow model inference by describing TensorRT architecture, optimization techniques such as layer fusion and precision calibration, detailing the conversion of frozen_graph and saved_model formats, presenting experimental setup and performance comparisons, and summarizing the achieved speed‑up.

Model Optimization · TensorFlow · TensorRT

0 likes · 13 min read
