Tagged articles
92 articles
SuanNi
May 17, 2026 · Industry Insights

Cerebras' $5.55B IPO Unveils the World’s Largest AI Chip Challenging Nvidia

Cerebras Systems raised $5.55 billion in the largest 2026 IPO, debuting the wafer‑scale WSE‑3 chip that promises unprecedented inference bandwidth and could erode Nvidia’s dominance, while navigating CFIUS scrutiny, a dramatic financial turnaround, and a shifting AI‑chip market landscape.

AI Chip · Cerebras · IPO
0 likes · 15 min read
Old Zhang's AI Learning
May 16, 2026 · Artificial Intelligence

vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models

The vLLM 0.21.0 release brings five major updates—including Transformers v4 deprecation, a C++20 build requirement, KV offload with hybrid memory, speculative decoding that respects thinking budgets, and a Blackwell token‑speed backend—while offering detailed upgrade guidance for different user groups.

C++20 · Inference · KV cache
0 likes · 12 min read
Machine Heart
May 16, 2026 · Artificial Intelligence

Why More Compute Can't Fix LLM Inference Lag and Why RL Leads to Overtraining

In a deep interview, former Google TPU architect Reiner Pope explains that low‑concurrency fast‑mode services trade higher fees for faster streaming but are limited by memory‑bandwidth bottlenecks, that optimal concurrency balances compute and memory costs, and that pipeline‑parallel sparse expert models and reinforcement‑learning fine‑tuning introduce new inefficiencies and overtraining risks.

Inference · LLM · Memory Bandwidth
0 likes · 7 min read
Geek Labs
May 13, 2026 · Artificial Intelligence

Two LLM Inference Acceleration Projects: A Mac‑Local Engine vs a Data‑Center Engine

This article compares two recent GitHub LLM inference engines—ds4.c, a Metal‑optimized engine for DeepSeek V4 Flash on Apple Silicon Macs, and TokenSpeed, a Python/C++‑based, data‑center‑grade engine for GPU clusters—detailing their design choices, performance numbers, usage instructions, and suitable scenarios.

DeepSeek · GPU · Inference
0 likes · 8 min read
Machine Heart
May 10, 2026 · Artificial Intelligence

Why SRAM Is Key to Overcoming GPU Limits in Inference as Demand Soars

As large‑model inference demand outpaces training, the decode stage hits a memory‑wall that GPUs cannot efficiently cross; SRAM’s on‑chip bandwidth and low‑energy access open a path forward, though capacity and process limits still pose challenges.

AI hardware · Compute Architecture · GPU
0 likes · 7 min read
SuanNi
Apr 29, 2026 · Artificial Intelligence

Why Google’s Split 8th‑Gen TPU Could Out‑Earn General‑Purpose GPUs

Google’s Cloud Next 2026 reveal splits the 8th‑generation TPU into training‑focused Sunfish and inference‑focused Zebrafish, highlighting Ironwood’s record‑breaking performance, a multi‑vendor supply chain, Anthropic’s multi‑gigawatt order, and a broader industry shift toward custom AI chips that promise far higher profit margins than generic GPUs.

AI · Custom ASIC · Google
0 likes · 8 min read
Code Mala Tang
Apr 25, 2026 · Artificial Intelligence

Why Claude Feels Nerfed Without a Formal Downgrade: A Deep Dive into System‑Level Performance Changes

The article examines the recent Claude performance controversy, showing that engineering adjustments to inference parameters, cache handling, and system prompts rewrote the model’s behavior, making it answer faster but think shallower, leading users to perceive a degradation despite no official model downgrade.

AI · Cache · Claude
0 likes · 14 min read
Machine Heart
Apr 23, 2026 · Artificial Intelligence

Google's TPU 8t and 8i: Training Powerhouse vs. Inference Specialist

Google unveiled its eighth‑generation TPU line at Cloud Next 2026, introducing the training‑focused TPU 8t with a 2.7× performance boost and massive scaling, and the inference‑optimized TPU 8i featuring three‑times more on‑chip SRAM and an 80% performance uplift for agentic AI workloads, while positioning the chips as a complement—not a replacement—to Nvidia's offerings.

AI hardware · Agentic AI · Google Cloud
0 likes · 9 min read
AI Tech Publishing
Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

Fine-tuning · Inference · LLM
0 likes · 13 min read
Baidu Intelligent Cloud Tech Hub
Mar 18, 2026 · Artificial Intelligence

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details the vLLM‑Kunlun open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU, covering platform overview, model‑porting workflow, plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.

AI · Hardware · Inference
0 likes · 12 min read
AI Info Trend
Mar 16, 2026 · Industry Insights

What 2025’s AI Landscape Reveals: Five Game-Changing Trends

The 2025 State of AI report from Artificial Analysis outlines five core trends—intensified competition, the rise of autonomous agents, native speech models, mainstream inference models, and booming image/video generation—showing how costs have plummeted, capabilities have surged, and AI is reshaping every industry.

2025 · AI · Cost reduction
0 likes · 9 min read
SuanNi
Mar 14, 2026 · Industry Insights

How Meta’s MTIA Chips Achieved 25× Compute Boost in Just Two Years

This article analyzes Meta's rapid evolution of four generations of MTIA AI chips, detailing how modular hardware, inference‑first design, deep software integration, and aggressive iteration cycles delivered up to 30 PFLOPs of performance and dramatically reshaped the AI compute landscape.

AI chips · Hardware acceleration · Industry analysis
0 likes · 13 min read
Ops Community
Mar 13, 2026 · Backend Development

How to Diagnose and Fix Slow LLM Inference: A Full‑Stack Performance Guide

This article presents a comprehensive, step‑by‑step methodology for troubleshooting and optimizing large‑language‑model inference performance, covering GPU, CPU, memory, network, configuration, and application layers, with concrete benchmark scripts, diagnostic commands, and real‑world case studies.

CPU · Debugging · GPU
0 likes · 48 min read
Woodpecker Software Testing
Mar 1, 2026 · Artificial Intelligence

Automating Regression Tests for TensorRT Inference Services

The article outlines a comprehensive, repeatable regression testing framework for TensorRT inference pipelines, covering engine build validation, functional correctness against golden outputs, performance monitoring, common pitfalls, and CI/CD integration to ensure model updates remain both fast and reliable.

Automated Testing · INT8 Quantization · Inference
0 likes · 12 min read
Old Zhang's AI Learning
Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

Inference · MoE · large language model
0 likes · 14 min read
Machine Learning Algorithms & Natural Language Processing
Feb 26, 2026 · Artificial Intelligence

Why Longer Token Chains Don't Mean Better Reasoning: Google's Deep Thinking Ratio

Google’s recent study shows that the length of a model’s token chain is negatively correlated with inference accuracy, and introduces the Deep Thinking Ratio (DTR) metric to identify truly reasoning tokens, enabling the Think@n strategy to halve compute cost without sacrificing performance.

Deep Thinking Ratio · Inference · LLM
0 likes · 6 min read
AI Tech Publishing
Feb 6, 2026 · Artificial Intelligence

2026 Large Model Engineering Roadmap: From Foundations to Production

This roadmap outlines a step‑by‑step learning path for building, optimizing, and safely deploying large language model systems, covering fundamentals, vector stores, RAG, advanced techniques, fine‑tuning, inference speed, deployment, observability, agents, and production safeguards.

Deployment · Fine-tuning · Inference
0 likes · 5 min read
AI Waka
Feb 1, 2026 · Artificial Intelligence

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

GPU · Inference · LLM
0 likes · 20 min read
Baidu Intelligent Cloud Tech Hub
Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPO · Inference · Kunlun P800
0 likes · 32 min read
AI Cyberspace
Jan 26, 2026 · Artificial Intelligence

How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

This article explains the NVFP4 4‑bit floating‑point quantization technique, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance across NVFP4, AWQ and INT8 quantizations, and provides practical profiling commands for NVIDIA DGX systems.

Inference · LLM · NVFP4
0 likes · 23 min read
Alibaba Cloud Infrastructure
Jan 21, 2026 · Artificial Intelligence

Boost LLM Performance: Deploy Qwen3‑235B with PD‑Separation, MoE, SGLang & RBG

This article details how to deploy the 235‑billion‑parameter Qwen3‑235B model using PD‑separation and MoE techniques, explains the associated challenges, and demonstrates a production‑grade solution built on the high‑performance SGLang inference engine and the RoleBasedGroup (RBG) orchestration framework, complete with benchmark results and best‑practice YAML examples.

AI · Inference · Kubernetes
0 likes · 21 min read
MaGe Linux Operations
Jan 18, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference on Kubernetes with GPU Autoscaling

This guide walks through building a production‑grade Kubernetes GPU cluster for large language model inference, covering hardware sizing, GPU resource scheduling, model storage options, automated scaling with HPA, health checks, monitoring, troubleshooting, and multi‑model deployment strategies.

Docker · GPU · Inference
0 likes · 49 min read
Fun with Large Models
Jan 14, 2026 · Artificial Intelligence

Understanding Large Language Model Files: Structure, Tokens, and Inference with Qwen3

This article walks through the complete workflow of loading and running the open‑source Qwen3‑8B model, explaining each core file (weights, config, generation config, tokenizer), how the model tokenizes input, applies chat templates, generates responses, and decodes output, all illustrated with code and diagrams.

Inference · ModelScope · Python
0 likes · 16 min read
MaGe Linux Operations
Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPU · Inference · LLM
0 likes · 48 min read
Alibaba Cloud Infrastructure
Dec 22, 2025 · Artificial Intelligence

Boost LLM Inference with KV‑Cache‑Aware Routing on Alibaba Cloud ACK GIE

This article explains why KV‑Cache hit rate is critical for large‑model inference, describes vLLM's automatic prefix caching, outlines the distributed cache challenges, and provides a step‑by‑step guide to deploying Alibaba Cloud ACK Gateway with Inference Extension's precise‑mode prefix‑cache‑aware routing, backed by benchmark results.

Alibaba Cloud · Inference · KV cache
0 likes · 18 min read
Baidu Intelligent Cloud Tech Hub
Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

Inference · Kunlun · LLM
0 likes · 8 min read
DataFunSummit
Nov 22, 2025 · Artificial Intelligence

Breaking the Recommendation Filter Bubble: Alibaba 1688’s Inference‑Driven AI

Alibaba’s 1688 platform leverages inference‑based large language models to enhance recommendation discovery, addressing the filter‑bubble problem by analyzing long‑term buyer behavior, compressing extensive activity streams, generating nuanced demand queries, and integrating multimodal data and market trend agents to deliver more diverse, explainable product suggestions for B‑type buyers.

AI · E‑commerce · Inference
0 likes · 23 min read
Code Wrench
Oct 16, 2025 · Artificial Intelligence

Build a Go‑Powered Stock Trend Predictor with ONNX Runtime in Minutes

This guide walks you through setting up an Ubuntu environment, training a LightGBM stock‑movement model in Python, exporting it to ONNX, and deploying fast, cross‑platform inference in Go using ONNX Runtime, complete with code snippets and project structure.

AI · Go · Inference
0 likes · 11 min read
AntTech
Oct 9, 2025 · Artificial Intelligence

Ling-1T: The Trillion‑Parameter AI Model Redefining Efficient Reasoning

Ling-1T, a trillion‑parameter flagship non‑thinking model, combines 50 billion active parameters per token, 128 K context, Evo‑CoT reasoning, and FP8 mixed‑precision training to achieve state‑of‑the‑art performance on complex reasoning, code generation, and multimodal tasks while outlining its architecture, benchmarks, limitations, and future roadmap.

AI · Benchmark · FP8
0 likes · 11 min read
Baobao Algorithm Notes
Sep 23, 2025 · Artificial Intelligence

How LongCat-Flash-Thinking Sets New SOTA in Open‑Source AI Inference

LongCat-Flash-Thinking, the latest open‑source model from Meituan, introduces domain‑parallel RL training, a high‑throughput DORA infra, and a dual‑path inference framework that together achieve state‑of‑the‑art performance on logical, mathematical, coding, and agentic tasks while maintaining top‑tier speed.

Benchmark · Inference · LongCat
0 likes · 10 min read
Baidu Intelligent Cloud Tech Hub
Sep 4, 2025 · Artificial Intelligence

Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations

Baidu’s Baige 5.0 AI Computing Platform introduces FP8 mixed‑precision training, MoE‑aware distributed strategies, adaptive parallelism, and a three‑tier KV‑Cache, delivering over 30% training speedup and 50% inference throughput gains while keeping token latency under half a second for large‑scale models.

AI · FP8 · Inference
0 likes · 16 min read
Architecture Development Notes
Jul 21, 2025 · Artificial Intelligence

Why Rust’s Burn Framework Is Redefining Deep Learning Performance

Burn, a native Rust deep learning framework by Tracel AI, combines extreme flexibility, high computational efficiency, and cross‑platform portability through a modular backend abstraction, type‑safe tensor operations, asynchronous execution, and extensive tooling, offering performance‑competitive alternatives to Python‑based frameworks for both training and inference.

Burn · Deep Learning · GPU
0 likes · 23 min read
AIWalker
Jun 18, 2025 · Artificial Intelligence

Six New Directions for Large Language Models

Large language models are booming, and this article highlights six cutting‑edge research directions—LLM‑plus synthetic data, reward modeling, inference techniques, LLM‑as‑a‑Judge, safety alignment, and long‑context handling—each illustrated with recent papers, experimental results, and links to code repositories.

Inference · LLM · Reward Modeling
0 likes · 9 min read
DataFunTalk
May 23, 2025 · Artificial Intelligence

2025 AI Landscape: Inference Models Dominate, Open‑Source Momentum Accelerates

The 2025 Q1 AI report from Artificial Analysis highlights six major trends—including a thousand‑fold drop in inference cost, the rise of MoE models, the growing parity of Chinese open‑source labs, the emergence of autonomous AI agents, native multimodal capabilities, and the trade‑off between performance, cost, and context windows—painting a picture of a rapidly evolving, increasingly competitive AI ecosystem.

AI · Inference · agents
0 likes · 11 min read
Meituan Technology Team
May 8, 2025 · Artificial Intelligence

Building a Mixed OR+ML Inference Framework with TritonServer: Architecture, Challenges, and Solutions

The article describes how a large‑scale dispatch system was re‑engineered with NVIDIA TritonServer to unify GPU‑accelerated operations‑research kernels and deep‑learning models, detailing a three‑stage architecture (in‑process, cross‑process, cross‑node), the performance, stability and memory challenges addressed, and future plans for heterogeneous GPU scaling.

GPU · Inference · Performance Optimization
0 likes · 11 min read
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

Inference · LLM · MLX
0 likes · 17 min read
ITPUB
Apr 13, 2025 · Operations

How Cursor Scaled Its AI Code Editor: Lessons from Indexing to Object Storage

Cursor, the AI‑powered code editor, grew to handle billions of document queries and over a hundred‑million model calls daily, prompting a multi‑stage infrastructure overhaul that moved from a failing YugaByte setup to PostgreSQL RDS, then to object‑storage‑backed databases, while tackling indexing, inference scaling, and cold‑start challenges.

AI · Inference · Infrastructure
0 likes · 11 min read
Baidu Geek Talk
Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

DeepSeek-VL2 · Inference · Mixture of Experts
0 likes · 36 min read
DataFunSummit
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu's ZhiLight Large‑Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu's machine‑learning platform lead Wang Xin's presentation on the ZhiLight large‑model inference framework, covering model execution mechanisms, GPU workload analysis, pipeline and tensor parallelism, GPU architecture evolution, open‑source engine comparisons, ZhiLight's compute‑communication overlap and quantization optimizations, benchmark results, supported models, and future directions.

GPU · Inference · LLM
0 likes · 13 min read
Alibaba Cloud Infrastructure
Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · Benchmark
0 likes · 17 min read
JD Retail Technology
Mar 4, 2025 · Artificial Intelligence

JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

AI · Distributed Training · GPU
0 likes · 19 min read
JD Tech Talk
Mar 3, 2025 · Artificial Intelligence

AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

AI · Distributed Training · GPU
0 likes · 20 min read
JD Cloud Developers
Mar 3, 2025 · Artificial Intelligence

How JD.com Leverages Domestic NPU Chips to Power Large‑Scale AI Models

This article details JD.com's challenges and solutions for deploying domestic NPU chips across heterogeneous GPU‑NPU clusters, covering architecture, scheduling, high‑performance training and inference engines, real‑world case studies, and future plans to scale AI workloads securely and efficiently.

AI · Domestic Chips · Inference
0 likes · 19 min read
Architect
Feb 27, 2025 · Artificial Intelligence

Understanding Inference Large Language Models: DeepSeek‑R1 and the Rise of Test‑Time Computation

This article explains how inference‑oriented large language models such as DeepSeek‑R1 and OpenAI o1‑mini shift AI research from training‑time scaling to test‑time computation, detailing the underlying principles, new scaling laws, verification techniques, reinforcement‑learning pipelines, and practical methods for distilling reasoning capabilities into smaller models.

DeepSeek-R1 · Inference · large language models
0 likes · 18 min read
Alibaba Cloud Developer
Feb 14, 2025 · Artificial Intelligence

Unlock Faster LLM Inference: Full Stack of Chips, Frameworks & Services

The article examines the end‑to‑end architecture for large‑model inference, detailing seven layers—from chip hardware and programming toolkits to deep‑learning frameworks, inference accelerators, model providers, compute platforms, application orchestration, and traffic management—highlighting key vendors, open‑source projects, and performance‑optimizing techniques.

AI hardware · Inference · LLM
0 likes · 12 min read
Baidu Geek Talk
Feb 12, 2025 · Artificial Intelligence

Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to Baidu Baige AI heterogeneous computing platform, deploying the vLLM tool, loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen, and provides recommended configuration lists to achieve low‑cost, high‑performance inference.

AI Model Deployment · Baidu Baige · Cloud AI
0 likes · 3 min read
DevOps
Feb 9, 2025 · Artificial Intelligence

DeepSeek’s Impact on the Large Model Ecosystem and the Resurgence of AI PCs

The article examines DeepSeek’s rapid rise, its open‑source R1 model and distilled variants, the resurgence of AI PCs, hardware support from Nvidia, AMD and others, and how this ecosystem is reshaping personal AI experiences and the broader large‑model landscape.

AI PC · DeepSeek · Hardware
0 likes · 11 min read
AIWalker
Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization · Inference · Multimodal AI
0 likes · 13 min read
Alibaba Cloud Infrastructure
Jan 17, 2025 · Artificial Intelligence

Elastic Scaling of Large Language Model Inference on Alibaba Cloud ACK with Knative, ResourcePolicy, and Fluid

This article explains how to reduce inference cost and improve performance for large language models on Alibaba Cloud ACK by using Knative's request‑based autoscaling, custom ResourcePolicy priority scheduling, and Fluid data‑caching to achieve elastic scaling, resource pre‑emption, and faster model loading.

Fluid · Inference · Knative
0 likes · 22 min read
Baobao Algorithm Notes
Dec 31, 2024 · Artificial Intelligence

Can China’s GLM‑Zero‑Preview Beat OpenAI’s o3? A Deep Dive into Inference Model Tests

The article evaluates the Chinese GLM‑Zero‑Preview inference model by subjecting it to a wide range of math, logic, language, coding, and multimodal questions, compares its token efficiency and reasoning style to other models, and discusses its current strengths, limitations, and public availability.

AI benchmarking · GLM-Zero · Inference
0 likes · 9 min read
Baobao Algorithm Notes
Oct 10, 2024 · Artificial Intelligence

How MCTS Powers Inference in OpenAI’s o1: A Deep Dive with rStar

This article explains how the inference component of OpenAI’s o1 model can be implemented using Monte‑Carlo Tree Search, detailing the action space, rollout process, UCT scoring, and best‑path selection, with a concrete walkthrough of Microsoft’s open‑source rStar code.

Inference · MCTS · OpenAI o1
0 likes · 26 min read
Architect
Sep 28, 2024 · Artificial Intelligence

How Does OpenAI’s o1 Model Leverage Self‑Play RL and New Scaling Laws?

The article provides an in‑depth technical analysis of OpenAI’s multimodal o1 model, explaining its self‑play reinforcement‑learning pipeline, the novel train‑time and test‑time compute scaling laws, its long‑think reasoning abilities demonstrated through a cipher example, and speculative architectures for generator‑verifier systems.

Inference · OpenAI · RL scaling
0 likes · 35 min read
Baobao Algorithm Notes
Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPO · Inference · KV cache
0 likes · 32 min read
JD Retail Technology
Aug 30, 2024 · Artificial Intelligence

GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.

Distributed Systems · GPU Optimization · Inference
0 likes · 13 min read
DataFunSummit
Aug 12, 2024 · Artificial Intelligence

Design and Application of Xiaohongshu Heterogeneous Training and Inference Engine

This article presents a comprehensive overview of Xiaohongshu's heterogeneous training and inference engine, covering the challenges of model engineering, the design of elastic heterogeneous engines, future HPC training frameworks, AI compilation techniques, and an outlook on scalability and performance.

AI · AI Compilation · HPC
0 likes · 19 min read
NewBeeNLP
Jul 24, 2024 · Industry Insights

From Black Iron to Silver: The Evolution of Large Model Infrastructure (2019‑2024)

The article traces the evolution of large‑model training and inference infrastructure from the early “black‑iron” era (2019‑2021) through the “golden” boom (2022‑2023) to the emerging “silver” phase (2024‑), highlighting key research breakthroughs, open‑source frameworks, hardware trends, market dynamics, and practical challenges for engineers entering the field.

AI Infrastructure · Inference · Large Model
0 likes · 22 min read
Architects' Tech Alliance
Jun 22, 2024 · Artificial Intelligence

Rising Compute Demand of Generative AI Models and GPU Accelerator Trends in 2024

The article analyzes how generative AI models from GPT‑1 to the upcoming GPT‑5 are driving exponential growth in compute requirements, prompting massive cloud capital expenditures and intense competition among GPU vendors such as NVIDIA, AMD, Google, and emerging domestic chip makers, while also highlighting interconnect innovations and cost‑effective solutions.

AI · Accelerators · Compute
0 likes · 12 min read
DataFunTalk
May 18, 2024 · Artificial Intelligence

Tencent FinTech AI Development Platform: Architecture, Challenges, and Solutions

This article details the background, goals, and evolution of Tencent's FinTech AI development platform, outlines the technical challenges faced in feature engineering, model training, and inference services, and presents the comprehensive solutions and future plans implemented to improve efficiency, stability, and scalability.

Cloud Native · FinTech · Inference
0 likes · 13 min read
Top Architect
Apr 18, 2024 · Artificial Intelligence

Understanding Transformers: Architecture, Attention Mechanism, Training and Inference

This article provides a comprehensive overview of Transformer models, covering their attention-based architecture, encoder-decoder structure, training procedures including teacher forcing, inference workflow, advantages over RNNs, and various applications in natural language processing such as translation, summarization, and classification.

Attention Mechanism · Deep Learning · Inference
0 likes · 11 min read
DataFunSummit
Apr 10, 2024 · Artificial Intelligence

Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, describing the prefill and decoding stages, key performance metrics such as throughput, latency and QPS, and detailing a series of system-level optimizations—including pipeline parallelism, dynamic batching, KV‑cache quantization, and hardware considerations—to significantly improve inference efficiency on modern GPUs.

GPU · Inference · Latency
0 likes · 23 min read
Xiaohongshu Tech REDtech
Apr 10, 2024 · Artificial Intelligence

Early‑Stopping Self‑Consistency (ESC): Reducing Sampling Cost for Large Language Model Reasoning

Early‑Stopping Self‑Consistency (ESC) dynamically halts sampling once a sliding‑window answer distribution reaches zero entropy, cutting the number of required LLM reasoning samples by 34‑84 % across arithmetic, commonsense, and symbolic benchmarks while preserving accuracy and offering a theoretically‑bounded, robust, budget‑adaptive alternative to traditional Self‑Consistency.

AI · Early Stopping · Inference
0 likes · 14 min read
Architect
Mar 26, 2024 · Artificial Intelligence

Why Transformers Outperform RNNs: A Deep Dive into Architecture and Training

This article explains the Transformer model’s core architecture, self‑attention mechanism, encoder‑decoder workflow, training with teacher forcing, inference steps, and why it surpasses RNNs and CNNs, while also outlining its major NLP applications.

Attention Mechanism · Inference · Model Training
0 likes · 14 min read
DataFunTalk
Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · Inference
0 likes · 20 min read
Huawei Cloud Developer Alliance
Dec 14, 2023 · Artificial Intelligence

Unlocking LLaMA: Key Innovations, Architecture Insights, and MindSpore Inference Guide

This article reviews the LLaMA large‑language‑model series, covering its background, architectural innovations such as Add&Norm, SwiGLU, and RoPE, a known reversal‑curse bug, and provides step‑by‑step MindSpore Transformers code for model configuration, inference, and pipeline usage while previewing the upcoming LLaMA‑2 session.

AI · Inference · LLaMA
0 likes · 6 min read
DataFunTalk
Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU Optimization · Inference · Model Serving
0 likes · 16 min read
JD Tech
Aug 4, 2023 · Artificial Intelligence

Deploying and Evaluating the Vicuna Open‑Source Large Language Model on a Single Machine

This article details a step‑by‑step guide to deploying the Vicuna open‑source LLM on a single server, covering model preparation, environment setup, dependency installation, GPU and CUDA configuration, inference commands, performance evaluation, and attempted fine‑tuning, while sharing practical observations and results.

Fine‑tuning · GPU · Inference
0 likes · 16 min read
JD Tech
Jul 31, 2023 · Artificial Intelligence

Local Deployment, Fine‑tuning, and Inference of the Open‑source Alpaca‑LoRA Model on GPU Servers

This article details the step‑by‑step process of installing GPU drivers, setting up a Python environment, deploying the open‑source Alpaca‑LoRA large language model, fine‑tuning it with Chinese data on a multi‑GPU server, and running inference, while discussing practical challenges and performance observations.

Alpaca · Fine-tuning · GPU
0 likes · 14 min read
High Availability Architecture
Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

AI Optimization · CUDA · Inference
0 likes · 10 min read
JD Retail Technology
May 18, 2023 · Artificial Intelligence

Local Deployment, Inference, and Fine‑tuning of the Vicuna‑7B Large Language Model

This article details the step‑by‑step process of preparing the environment, merging weights, installing dependencies, running inference, evaluating Vicuna‑7B against other models, and attempting fine‑tuning, while highlighting performance results, encountered issues, and future work for large language model deployment.

Fine-tuning · GPU · Inference
0 likes · 11 min read
DataFunTalk
Mar 31, 2023 · Artificial Intelligence

Estimating the Resource and Cost Requirements for Large Language Model Training and Inference

The article analyses the computational resources, hardware costs, and human investment needed to train and serve large language models such as GPT‑3, discusses practical cost calculations, highlights the challenges faced by Chinese AI teams, and argues for sustained, long‑term funding to achieve meaningful breakthroughs.

AI Infrastructure · China AI · Inference
0 likes · 14 min read
58 Tech
Dec 21, 2021 · Artificial Intelligence

dl_inference: Open‑Source Deep Learning Inference Service with TensorRT and MKL Acceleration

dl_inference is an open‑source, production‑grade deep learning inference platform that supports TensorFlow, PyTorch and Caffe models, offering GPU and CPU deployment, TensorRT and MKL acceleration, multi‑node load balancing, and extensive Q&A on model conversion, hardware requirements, INT8 quantization, and performance gains.

CPU · GPU · Inference
0 likes · 8 min read
DataFunTalk
Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AI · GPU utilization · Inference
0 likes · 4 min read
WeChat Backend Team
Jun 7, 2021 · Artificial Intelligence

How WeChat’s TFCC Boosts Deep Learning Inference Performance Across Platforms

The TFCC framework, developed by WeChat's backend team, delivers high‑performance, easy‑to‑use, and universal deep‑learning inference by supporting numerous ONNX and TensorFlow operations, optimizing model structures, constants, and operators, and providing a versatile runtime and math library for both CPU and GPU platforms.

Deep Learning · Framework · Inference
0 likes · 8 min read
DataFunSummit
Dec 14, 2020 · Artificial Intelligence

LightSeq: High‑Performance Open‑Source Inference Engine for Transformers, GPT and Other NLP Models

This article introduces LightSeq, an open‑source, GPU‑accelerated inference engine that dramatically speeds up Transformer‑based models such as BERT and GPT by up to 14× over TensorFlow, supports multiple decoding strategies, integrates seamlessly with major deep‑learning frameworks, and provides detailed performance benchmarks and technical optimizations.

Deep Learning · GPU · Inference
0 likes · 15 min read
JD Tech Talk
Nov 16, 2020 · Artificial Intelligence

Practical Guide to Deploying Federated Learning: Architecture, Deployment, Training, and Inference

This article provides a comprehensive overview of federated learning engineering, covering deployment via Docker containers, the design of training and inference frameworks, key services such as communication, training, model management, and registration, and practical considerations for scaling and reliability in production environments.

AI · Deployment · Docker
0 likes · 11 min read
JD Tech Talk
Nov 13, 2020 · Artificial Intelligence

Practical Engineering Guide to Federated Learning: Deployment, Training, and Inference

This article provides a comprehensive engineering overview of federated learning, covering its core distributed‑learning concept, Docker‑based deployment, detailed training‑service architecture with validation, scheduling, metadata, and model‑management components, as well as a complete inference framework and workflow for production use.

AI Engineering · Distributed Systems · Docker
0 likes · 12 min read
Didi Tech
Jul 5, 2019 · Artificial Intelligence

How Didi’s Jianshu Machine Learning Platform Boosts AI Development Efficiency

An in‑depth look at Didi’s Jianshu Machine Learning Platform reveals its end‑to‑end AI workflow—from experiment environments and batch training to high‑availability online serving—highlighting resource‑efficient Kubernetes scheduling, Docker‑based reproducible environments, a custom parameter server, and the IFX inference engine that together accelerate development, training, and deployment.

AIPlatform · Docker · Inference
0 likes · 11 min read