Tagged articles
33 articles
Machine Heart
Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Inference Acceleration · KV cache reduction · LCA
0 likes · 10 min read
Old Zhang's AI Learning
Apr 14, 2026 · Artificial Intelligence

Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on a single GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.

DFlash · Inference Acceleration · SGLang
0 likes · 12 min read
Machine Heart
Apr 10, 2026 · Artificial Intelligence

Keeping Image Quality with Only 20 Diffusion Steps: The TC‑Padé Acceleration Method

TC‑Padé uses a Padé‑based residual prediction framework, step‑aware strategies, and a trajectory‑stability indicator to accelerate diffusion sampling to as few as 20 steps while preserving visual fidelity, achieving up to 2.88× speed‑up on image generation and 1.72× on video generation.

Inference Acceleration · Padé approximation · TC-Padé
0 likes · 12 min read
Machine Heart
Apr 1, 2026 · Artificial Intelligence

SSD Framework Doubles Inference Speed Over Top Engines, Breaking the Serial Bottleneck

The SSD framework and its SAGUARO optimization, developed by researchers from Stanford, Princeton, and Together AI, parallelize drafting and verification in speculative decoding, eliminating serial dependencies and achieving up to 2× faster inference than the world’s strongest engines and up to 5× speedup over standard autoregressive generation, while addressing challenges such as prediction accuracy, acceptance‑rate trade‑offs, and fallback strategies.

Inference Acceleration · SAGUARO · SSD
0 likes · 7 min read
AI Code to Success
Mar 27, 2026 · Artificial Intelligence

How Google’s TurboQuant Cuts LLM Memory by 6× and Speeds Up Inference 8×

Google Research’s TurboQuant algorithm compresses large‑language‑model KV caches from 32‑bit to 3‑bit, achieving a six‑fold reduction in memory usage and an eight‑fold inference speedup on H100 GPUs while preserving 100 % accuracy, and it also improves vector search performance without requiring large codebooks.

AI efficiency · Inference Acceleration · LLM compression
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

How Huawei’s MindScale Cuts Agent Token Usage 5.7× and Automates Prompt & Workflow Design

The article outlines the four major obstacles hindering industry‑specific LLM agents—manual workflow maintenance, poor knowledge reuse, training‑inference inefficiency, and complex reasoning evaluation—and explains how Huawei Noah’s MindScale package tackles each with self‑evolving workflows, automated prompt optimization, and a novel KV‑Embedding cache that cuts token consumption by 5.7× while boosting inference speed by up to 70%.

Industry Agent · Inference Acceleration · KV-Embedding
0 likes · 7 min read
HyperAI Super Neural
Feb 10, 2026 · Artificial Intelligence

WeDLM Diffusion Language Model Tutorial: 3× Faster Inference Than vLLM AR Models

The Tencent WeChat AI team introduces WeDLM, a diffusion language model that, through topological reordering, surpasses autoregressive models on the industrial‑grade vLLM engine with over threefold speedup on math reasoning and up to tenfold in low‑entropy scenarios, and provides a step‑by‑step online tutorial with GPU compute credits.

Diffusion Language Model · GPU compute · Inference Acceleration
0 likes · 5 min read
AI2ML AI to Machine Learning
Feb 4, 2026 · Artificial Intelligence

Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

The article analyzes Google’s shift from scaling‑law to efficiency‑law, detailing how speculative decoding, language‑model cascades, distillation, CALM, accurate quantized training, and the Mixture‑of‑Recursions architecture together form a multi‑layered strategy to cut inference cost, boost throughput, and sustain the company’s AI moat.

Google TPU · Inference Acceleration · Language Model Cascades
0 likes · 8 min read
Bilibili Tech
Jan 28, 2026 · Artificial Intelligence

Boosting Video Generation Inference: Full Graph Compilation with torch.compile

This article examines the challenges of optimizing video generation model inference, moving from operator-level tweaks to full-graph compilation using torch.compile, and details systematic strategies to eliminate Graph Breaks, handle dynamic shapes, KV-Cache indexing, and Python-side caches, achieving a 47.6% speedup on a 14B model without accuracy loss.

AI · Inference Acceleration · Video Generation
0 likes · 14 min read
AI Frontier Lectures
Jan 25, 2026 · Artificial Intelligence

Turning Chain‑of‑Thought into Images: The Render‑of‑Thought Breakthrough

Render‑of‑Thought (RoT) proposes a novel visual‑latent reasoning framework that compresses textual chain‑of‑thought into dense image embeddings, achieving faster inference, better interpretability, and plug‑and‑play integration without costly pre‑training, as demonstrated on multiple math and logic benchmarks.

Chain-of-Thought · Implicit CoT · Inference Acceleration
0 likes · 11 min read
AI2ML AI to Machine Learning
Dec 27, 2025 · Artificial Intelligence

Why Jeff Dean Champions Speculative Decoding: The Underlying Ideas

Jeff Dean highlighted speculative decoding as a lossless inference acceleration technique that can boost large language model throughput by 2–3×, and the article breaks down its core concepts—including parallel token verification, draft‑target model collaboration, rejection sampling theory, and practical optimizations such as continuous batching and tree‑based verification.

Continuous Batching · Draft-Target Model · Inference Acceleration
0 likes · 8 min read
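The rejection-sampling idea named in the summary above can be illustrated with a minimal sketch. It is not taken from the article: the function names `accept_or_resample` and `empirical_distribution` are hypothetical, and the toy vocabulary is an assumption. It shows the standard acceptance rule, in which a draft token is kept with probability min(1, p/q) and otherwise replaced by a sample from the renormalized residual max(0, p − q), which is why the technique is described as lossless.

```python
import random


def accept_or_resample(token, p_target, q_draft, rng):
    """One speculative-sampling step over a small vocabulary.

    token: index proposed by the draft model (assumed sampled from q_draft)
    p_target, q_draft: probability lists over the same vocabulary
    Returns (final_token, accepted_flag).
    """
    # Accept the draft token with probability min(1, p/q).
    ratio = p_target[token] / q_draft[token]
    if rng.random() < min(1.0, ratio):
        return token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0], False


def empirical_distribution(p_target, q_draft, trials=50_000, seed=0):
    """Estimate the output distribution of the accept/resample step."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(trials):
        # The draft model proposes a token from its own distribution.
        draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
        final, _ = accept_or_resample(draft, p_target, q_draft, rng)
        counts[final] += 1
    return [c / trials for c in counts]
```

Calling `empirical_distribution([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])` should return frequencies close to the target distribution rather than the draft distribution, which is the property that lets the draft model speed up generation without changing what the target model would have produced.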
Huawei Cloud Developer Alliance
Nov 24, 2025 · Artificial Intelligence

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

This article explains why Transformer models dominate modern AI agents, outlines the challenges of large parameter counts and latency, and presents a comprehensive guide to model compression (parameter sharing, knowledge distillation, quantization, pruning) and inference acceleration (parallel computing, optimized attention, TensorRT deployment), complete with PyTorch code examples and a real‑world case study showing speed‑up and storage savings.

AI Agent · Inference Acceleration · PyTorch
0 likes · 34 min read
Tencent Tech
Oct 27, 2025 · Artificial Intelligence

How SpecExit Cuts Large Reasoning Model Inference Time by Up to 2.5×

SpecExit combines early‑exit and speculative decoding to let large reasoning models detect when they have almost finished thinking, trimming redundant chain‑of‑thought steps, reducing over‑thinking by 72% and achieving up to 2.5× faster end‑to‑end inference without noticeable accuracy loss.

AI · Inference Acceleration · early exit
0 likes · 6 min read
Data Party THU
Oct 10, 2025 · Artificial Intelligence

How DPad Cuts Inference Time 61× While Boosting Accuracy in Diffusion LLMs

The article analyzes a recent Duke University paper that reveals a "scratchpad" mechanism in diffusion large language models, proposes the DPad method to prune redundant suffix tokens before decoding, and demonstrates up to 61.4× faster inference with unchanged or even improved accuracy across multiple benchmarks.

DPad · Inference Acceleration · diffusion LLM
0 likes · 10 min read
AI Frontier Lectures
Jul 29, 2025 · Industry Insights

SpecForge: Open‑Source Framework Boosts Large‑Model Speculative Sampling by 2.18×

SpecForge, an open‑source training framework built on Eagle3, enables end‑to‑end speculative sampling for ultra‑large language models, integrates tightly with the SGLang inference engine, offers online and offline training modes, supports advanced parallelism strategies, and demonstrates up to 2.18× inference speedup on benchmark tests, with all code and pretrained drafts available on GitHub and Hugging Face.

AI Performance · Inference Acceleration · Speculative Sampling
0 likes · 9 min read
AIWalker
Apr 28, 2025 · Artificial Intelligence

SimpleAR: Autoregressive Visual Generation at 1024×1024 Using Only 0.5B Parameters

SimpleAR is a minimalist autoregressive visual generation framework that, with only 0.5 B parameters, achieves competitive 1024×1024 image synthesis through a three‑stage pipeline of large‑scale pretraining, supervised fine‑tuning, and GRPO‑based reinforcement learning, and demonstrates significant inference speedups using KV‑cache, vLLM, and speculative decoding.

Benchmark · Inference Acceleration · autoregressive generation
0 likes · 14 min read
Bilibili Tech
Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution—operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, continuous batching, PagedAttention, and specialized frameworks like MindIE‑LLM—to dramatically boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous hardware directions.

Continuous Batching · Hardware Optimization · Inference Acceleration
0 likes · 21 min read
Baobao Algorithm Notes
Jan 15, 2025 · Artificial Intelligence

How Multi-Token Prediction Boosts LLM Training and Inference Efficiency

This article reviews the evolution of Multi‑Token Prediction (MTP) techniques—from early blockwise parallel decoding to Meta's and DeepSeek's implementations—explaining their architectures, training and inference workflows, and the speed‑up gains they offer for large language models.

DeepSeek · Inference Acceleration · LLM
0 likes · 20 min read
360 Tech Engineering
Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI Infrastructure · GPU cluster · Inference Acceleration
0 likes · 21 min read
Xiaohongshu Tech REDtech
Oct 11, 2024 · Artificial Intelligence

Harmonized Speculative Sampling (HASS): Aligning Training and Decoding for Efficient Large Language Model Inference

HASS aligns training and decoding contexts and objectives for speculative sampling, using harmonized objective distillation and multi-step context alignment, achieving 2.81–4.05× speedup and 8%–20% improvement over EAGLE‑2 while preserving generation quality in real-world deployments at Xiaohongshu.

AI · HASS · Inference Acceleration
0 likes · 11 min read
NewBeeNLP
May 21, 2024 · Industry Insights

How EcomXL Supercharges E‑commerce Image Generation with SDXL Optimizations and 3‑Second Inference

This article details how Alibaba's Wanxiang Lab adapted the SDXL diffusion model for large‑scale e‑commerce image generation, introducing the EcomXL series, a weighted‑distillation fine‑tuning method, hierarchical model fusion, specialized ControlNet variants, and the SLAM inference accelerator to achieve high‑quality, controllable product images within three seconds while boosting business metrics.

AIGC · ControlNet · EcomXL
0 likes · 14 min read
Alimama Tech
May 15, 2024 · Artificial Intelligence

EcomXL: Optimizing SDXL for Large‑Scale E‑commerce Image Generation

EcomXL enhances SDXL for large‑scale e‑commerce image generation by leveraging tens of millions of curated images, a two‑stage fine‑tuning with denoising‑weighted distillation and layer‑wise fusion, specialized ControlNets for inpainting and soft‑edge consistency, and the SLAM inference accelerator to achieve sub‑second generation while boosting visual quality and adoption metrics.

AIGC · ControlNet · EcomXL
0 likes · 15 min read
DeWu Technology
May 15, 2024 · Artificial Intelligence

Accelerating Large Language Model Inference: Techniques and Framework Recommendations

Deploying a dedicated inference cluster and applying four key optimizations—FlashAttention‑based attention computation, PagedAttention KV‑cache management, Mixture‑of‑Experts parameter reduction, and tensor parallelism—can accelerate large language model inference by up to 50% for models as large as 70 B parameters while cutting deployment costs.

FlashAttention · Inference Acceleration · Mixture of Experts
0 likes · 17 min read
DataFunTalk
May 8, 2024 · Artificial Intelligence

Intelligent NPCs: Infusing Soul into Game Characters with AI and the Art and Science of Deep Model Inference Acceleration

This talk explores how large‑model AI can give game NPCs personality, outlines the opportunities and challenges of intelligent NPCs, presents a case study of the "Jue Zhi An Nuan" NPC, and discusses future directions, safety compliance, and real‑time multimodal interaction solutions.

AI · Game Development · Game NPC
0 likes · 3 min read
Alibaba Cloud Native
Aug 6, 2023 · Artificial Intelligence

Boost Bloom‑7B1 Inference 2.5× Faster with FasterTransformer on ACK

This guide shows how to accelerate Bloom‑7B1 inference on Alibaba Cloud ACK by converting the model to FasterTransformer format, deploying it with Triton Server, and comparing performance against the original HuggingFace checkpoint, achieving roughly a 2.5‑fold speedup.

Bloom-7B1 · FasterTransformer · Inference Acceleration
0 likes · 17 min read
Tencent Cloud Developer
Dec 12, 2022 · Artificial Intelligence

Performance Optimization of Tencent Cloud OCR Service: Reducing Latency and Improving Throughput

Tencent Cloud’s OCR team cut average response time from 1.8 seconds to under one second and boosted throughput by over 50 % by redesigning the model with self‑attention, accelerating inference with a Tensor‑Network accelerator, shrinking RPC payloads, enabling asynchronous logging, and optimizing multi‑region GPU memory utilization.

AI model · Cloud Services · Inference Acceleration
0 likes · 13 min read
DataFunSummit
Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Deep Learning · Dynamic Batching · Inference Acceleration
0 likes · 12 min read
ITPUB
Apr 27, 2022 · Artificial Intelligence

How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

AI Platform · GPU utilization · Inference Acceleration
0 likes · 26 min read
58 Tech
Jan 10, 2022 · Artificial Intelligence

Resource Utilization Optimization Practices for the 58.com Machine Learning Platform (WPAI)

This article details the 58.com WPAI machine learning platform's architecture and the optimizations applied to training task scheduling, inference service elastic scaling, and offline‑online resource mixing, demonstrating how these techniques significantly improve GPU/CPU utilization and inference performance across both GPU and CPU environments.

AI · Inference Acceleration · Kubernetes
0 likes · 27 min read
Ctrip Technology
Sep 16, 2021 · Artificial Intelligence

Automated AI Model Optimization Platform for Travel Services

This article describes the design, automated workflow, functional modules, and performance results of a comprehensive AI model optimization platform built for Ctrip's travel business, covering operator libraries, graph optimization, model compression techniques such as distillation, quantization, pruning, and deployment integration.

AI Optimization · AutoML · Inference Acceleration
0 likes · 16 min read
HomeTech
Sep 4, 2019 · Artificial Intelligence

Accelerating TensorFlow Model Inference with NVIDIA TensorRT: Methods, Experiments, and Results

This article explains how to use NVIDIA TensorRT to accelerate TensorFlow model inference by describing TensorRT architecture, optimization techniques such as layer fusion and precision calibration, detailing the conversion of frozen_graph and saved_model formats, presenting experimental setup and performance comparisons, and summarizing the achieved speed‑up.

Deep Learning · Inference Acceleration · Model Optimization
0 likes · 13 min read