Tagged articles

AI inference

123 articles · Page 1 of 2

Jun 30, 2026 · Artificial Intelligence

Running DeepSeek V4 on M5 Max: 5 tps Speedup Without Large Memory

Developer Anemll demonstrates that the DS4 IQ2_Q2 version of DeepSeek V4 gains 5 tokens per second (tps) of throughput on an Apple M5 Max by using SSD‑streamed MoE sidecar loading, which lets large models run without large amounts of memory, and provides full build and execution instructions.

AI inference · Apple Silicon · DS4

0 likes · 8 min read

Black & White Path

Jun 29, 2026 · Artificial Intelligence

Why Pay for AI? Access 100+ Free Models on NVIDIA Build

NVIDIA Build offers more than 100 free AI models with no credit card required at sign‑up, 1,000 free inference credits, and a limit of 40 requests per minute. This guide shows how to obtain an API key and configure the service in Cursor and VS Code as a free alternative to paid AI subscriptions.

AI inference · API key · Cline plugin

0 likes · 5 min read

DataFunSummit

Jun 17, 2026 · Artificial Intelligence

Why Agentic AI Inference Is Slow and How NVIDIA Dynamo 1.1 Solves It

Developers deploying Agentic AI face multi‑turn latency caused by repeated token recomputation, KV‑cache eviction, and cold‑starts, and NVIDIA Dynamo 1.1 addresses these issues with KV‑cache‑aware routing, multi‑level cache offload, priority scheduling, and Prefill/Decode separation, as demonstrated in an upcoming Kubernetes‑based live session.

AI inference · Agentic AI · Distributed Inference

0 likes · 3 min read

Lao Guo's Learning Space

Jun 11, 2026 · Industry Insights

Why the M5 Ultra Is Poised to Be 2026’s Most Powerful Desktop Workstation

The M5 Ultra combines a 36‑core full‑performance CPU, an 80‑core GPU with built‑in neural accelerators, 256 GB of unified memory and ~1100 GB/s bandwidth in a silent, compact box, delivering unmatched desktop AI inference performance compared with RTX 5090 and DGX Spark.

AI inference · Apple Silicon · Desktop workstation

0 likes · 12 min read

Fighter's World

Jun 6, 2026 · Industry Insights

Inference Foundry: Token Physical Cost and Exploding Demand Force Heterogeneous Division

The article analyzes how the immutable physical cost of each AI token and the exponential rise in inference demand outpace hardware improvements, driving a shift toward heterogeneous compute architectures, disaggregation, and ultimately an inference foundry model exemplified by NVIDIA's rapid acquisition of Groq.

AI inference · Groq · NVIDIA

0 likes · 26 min read

Architects' Tech Alliance

Jun 1, 2026 · Industry Insights

Intel’s World‑First 1.8nm Data‑Center CPU Packs 288 Cores – A Performance Leap

Intel unveiled the world’s first 1.8nm data‑center CPU, the Xeon 6+ with 288 cores, leveraging RibbonFET, PowerVia and 3D chiplet stacking to achieve up to 2.26× higher performance and 55% better performance‑per‑watt than the previous generation, alongside SGX/TDX security, 200 GbE networking, and a new AI‑focused GPU.

1.8nm · AI inference · Intel

0 likes · 10 min read

Machine Heart

May 30, 2026 · Artificial Intelligence

How Abstract Symbols Cut AI Inference Cost by 11×

The article examines IBM Research's Abstract‑CoT approach, which replaces verbose natural‑language chain‑of‑thought reasoning with a compact abstract token vocabulary, achieving up to an 11‑fold reduction in inference tokens while maintaining comparable accuracy across math, instruction‑following, and multi‑hop QA benchmarks.

AI inference · Abstract-CoT · Chain-of-Thought

0 likes · 11 min read

AI Step-by-Step

May 26, 2026 · Artificial Intelligence

How Prompt Caching Works in LLMs and How to Write More Efficient Prompts

The article explains that LLM prompt caching reuses internal KV states rather than full answers, compares provider implementations, quantifies cost and latency savings, and provides concrete guidelines for structuring prompts to maximize cache hits, along with monitoring signals and a practical evaluation checklist.

AI inference · LLM · Prompt Engineering

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

May 24, 2026 · Artificial Intelligence

Inference Set to Consume 70% of AI Compute Power, Leaving 30% for Training

Zhang Lu, a Silicon Valley investor, argues that AI's focus is shifting from training to inference—now accounting for half of current compute and projected to reach 70%—while communication energy, data quality, physical AI, and edge deployment become the next critical bottlenecks and opportunities across medical, space, and nano‑robotics applications.

AI Applications · AI inference · Data Quality

0 likes · 19 min read

Machine Heart

May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66% on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inference · DeepSeek · GPU

0 likes · 12 min read

Architects' Tech Alliance

May 9, 2026 · Artificial Intelligence

Fractile Claims 90% Cost Cut and 100× Speed Over Nvidia GPUs

Fractile, a UK AI‑chip startup founded in 2022, says its SRAM‑compute‑on‑die architecture eliminates data movement, promising up to 100‑fold faster inference and 90% lower cost than Nvidia GPUs, yet the chip is still in simulation and not expected to ship until 2027, sparking both investor hype and industry skepticism.

AI hardware market · AI inference · Anthropic

0 likes · 6 min read

AI Explorer

May 7, 2026 · Artificial Intelligence

Nvidia Endorses Open-Source “Light-Speed” Inference Engine for Coding Agents

The article examines how Nvidia’s open-source ‘light-speed’ inference engine tackles the token-bloat and compute bottlenecks of modern coding agents by redesigning attention and memory management, enabling order-of-magnitude speed gains without losing accuracy, and reshaping the AI-as-a-service ecosystem.

AI inference · Attention optimization · NVIDIA

0 likes · 6 min read

大转转FE

May 7, 2026 · Artificial Intelligence

Running AI Inference Directly in the Browser with WebNN

WebNN brings hardware‑accelerated AI inference to web pages, letting developers run millisecond‑level face detection, real‑time filters, and semantic segmentation locally without cloud calls, while improving latency, privacy, and cost through a unified JavaScript API that maps to CPUs, GPUs or NPUs.

AI inference · Edge · GPU

0 likes · 16 min read

Architects' Tech Alliance

May 6, 2026 · Artificial Intelligence

How DeepSeek V4 and Huawei Ascend 950 Redefined China’s AI Chip Landscape

The article details how DeepSeek V4 became the first top‑level large model to run on Huawei's Ascend 950 PR chip, delivering up to 2.87× the performance of Nvidia H20, cutting inference cost by up to 90%, and spurring a booming domestic AI‑chip ecosystem and supply‑chain surge.

AI chip performance · AI inference · CANN Next

0 likes · 10 min read

Amazon Cloud Developers

Apr 28, 2026 · Cloud Computing

How AWS Achieved Day‑0 Adaptation of Xiaomi’s MiMo‑V2.5‑Pro on Trainium

AWS has completed a Day‑0 rapid adaptation of Xiaomi’s open‑source MiMo‑V2.5‑Pro model, enabling developers worldwide to run the 1‑trillion‑parameter, 1‑million‑token model on Amazon Trainium chips with high‑throughput, low‑latency inference via Neuron SDK integration, and offers three deployment paths—EC2, SageMaker, and EKS/ECS.

AI inference · AWS · Amazon Trainium

0 likes · 6 min read

Lao Guo's Learning Space

Apr 27, 2026 · Artificial Intelligence

DeepSeek V4 & Huawei Ascend 950PR: Is Domestic Compute Ready for Enterprise AI?

DeepSeek V4, paired with Huawei’s Ascend 950PR chip, delivers inference speed up to 2.87× that of Nvidia H20 and introduces a CSA+HCA attention compression that cuts KV cache usage to under 10%, but its 94‑96% hallucination rate and high token consumption raise concerns for production use.

AI inference · CSA+HCA · DeepSeek-V4

0 likes · 13 min read

ITPUB

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unleashed: 1M‑Token Context Becomes Commodity, Teams with Ascend to Challenge Compute Dominance

DeepSeek released two V4 models—Pro and Flash—both supporting 1‑million‑token context as a standard feature, showcasing top‑tier agentic coding, world‑knowledge, and inference performance, while introducing DSA sparse attention and announcing upcoming large‑scale deployment on Huawei Ascend hardware.

1M context · AI inference · DSA sparse attention

0 likes · 6 min read

Machine Heart

Apr 24, 2026 · Artificial Intelligence

Cambricon Achieves Day‑0 Native Support for DeepSeek‑V4, Uniting Two Chinese AI Leaders

Cambricon leveraged its NeuWare stack and vLLM framework to deliver Day‑0 native support for DeepSeek‑V4‑flash (285 B) and DeepSeek‑V4‑pro (1.6 T), open‑sourcing the adaptation and showcasing rapid model migration alongside extreme performance optimizations across software and hardware layers.

AI inference · Cambricon · DeepSeek-V4

0 likes · 5 min read

ITPUB

Apr 22, 2026 · Artificial Intelligence

Unveiling the ‘Elephant’: Ant’s Ling‑2.6‑flash LLM Delivers 1M Tokens for $0.10

Ant’s newly released Ling‑2.6‑flash model, hidden as the anonymous “Elephant Alpha,” combines a 104B‑parameter MoE design with only 7.4B active weights per inference, achieving ten‑fold token savings, top‑tier benchmark scores and a $0.10 per‑million‑token price that dramatically cuts inference costs for developers and enterprises.

AI inference · Large Language Model · benchmark

0 likes · 6 min read

Architect's Must-Have

Apr 19, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Compression & 8× Speedup Break the AI Memory Wall

With LLM context windows soaring to millions of tokens, the KV‑cache memory wall threatens scalable inference; Google’s TurboQuant tackles this by compressing KV data up to six‑fold without precision loss and accelerating attention up to eight‑fold, using PolarQuant and 1‑bit QJL techniques, reshaping hardware costs and edge AI possibilities.

AI inference · KV compression · TurboQuant

0 likes · 25 min read

IT Services Circle

Mar 31, 2026 · Artificial Intelligence

How Google’s TurboQuant Cuts KV‑Cache Memory by 83% and Boosts LLM Speed

Google’s newly released TurboQuant algorithm compresses KV‑Cache from 16‑bit to 3‑bit, slashing memory usage to one‑sixth with no accuracy loss, dramatically accelerating large‑language‑model inference on GPUs and reshaping the memory market.

AI inference · Google Research · KV cache

0 likes · 7 min read

Lao Guo's Learning Space

Mar 30, 2026 · Artificial Intelligence

The 2026 Complete Guide to Free Large‑Model APIs and One‑Click OpenClaw Setup

This article compiles over 15 domestic and international free large‑model API providers, explains why they offer free tiers, presents detailed OpenClaw configuration snippets for each platform, and offers practical usage strategies and cautions for achieving near‑unlimited access.

AI inference · Free API · OpenClaw

0 likes · 11 min read

PaperAgent

Mar 26, 2026 · Artificial Intelligence

TurboQuant: How Google’s New Vector Quantization Cuts KV Memory 6× and Boosts Speed

TurboQuant, presented at ICLR 2026, introduces a theoretically grounded vector quantization technique that reduces large‑language‑model key‑value cache memory by at least six times and achieves up to eight‑fold speedups with no accuracy loss. It combines PolarQuant’s polar‑coordinate compression with a 1‑bit QJL error‑correction step, as demonstrated on benchmarks such as LongBench and GloVe.

AI inference · Benchmarking · Memory compression

0 likes · 10 min read

Alibaba Cloud Developer

Mar 17, 2026 · Backend Development

How RocketMQ LiteTopic Eliminates AI Inference Queue Bottlenecks with Millisecond‑Level Flow Control

This article explains why traditional message‑queue throttling fails in AI inference workloads, introduces Apache RocketMQ 5.x LiteTopic’s lightweight topic model, and details its four core features—physical isolation, elastic scaling, precise flow control, and consumption suspension—that together provide millisecond‑level real‑time throttling and minute‑level busy‑idle scheduling for personalized traffic management.

AI inference · Flow Control · LiteTopic

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · INT4 · INT8

0 likes · 16 min read

SuanNi

Mar 4, 2026 · Artificial Intelligence

How to Fit Large Language Models into Cars and Robots: A Hardware‑Aware Scaling Law

This article presents a hardware‑aware co‑design framework for edge‑deployed large language models, revealing a scaling law that balances model accuracy and inference latency, and demonstrates how Pareto‑optimal architectures can be discovered quickly using roofline analysis and systematic search on devices like NVIDIA Jetson Orin.

AI inference · Pareto optimization · Roofline Model

0 likes · 15 min read

AI Engineering

Feb 27, 2026 · Artificial Intelligence

Ubuntu 26.04 LTS Optimized for Local AI with Plug‑and‑Play GPU Drivers and Sandbox Inference

Ubuntu 26.04 LTS adds automatic detection and installation of NVIDIA CUDA or AMD ROCm drivers and introduces pre‑configured Inference Snaps sandbox containers, building on the AI groundwork laid by 24.04 LTS to dramatically lower the setup barrier for local AI development.

AI inference · CUDA · GPU drivers

0 likes · 4 min read

Fun with Large Models

Feb 17, 2026 · Artificial Intelligence

Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features

Qwen3.5‑397B‑A17B, the newly open‑sourced multimodal giant, combines a 397‑billion‑parameter sparse MoE architecture with FP8 pipelines and an asynchronous RL framework to deliver GPT‑5.2‑level capabilities, 60% lower memory usage, up to 19× higher throughput, and extensive image, video, and agent support, while outlining its deployment requirements and API pricing.

AI inference · FP8 · Qwen3.5

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Feb 17, 2026 · Artificial Intelligence

Deploy Alibaba’s Qwen3.5‑397B‑A17B Model in One Click with PAI‑Model Gallery

Alibaba's open‑source Qwen3.5‑397B‑A17B model, featuring 397 billion parameters and a hybrid Gated Delta Network/MoE architecture, delivers superior performance and reduced memory usage, and can be deployed instantly through the PAI‑Model Gallery with step‑by‑step guidance and enterprise‑grade security.

AI inference · Alibaba Cloud · Large Language Model

0 likes · 5 min read

AI Engineering

Jan 22, 2026 · Industry Insights

SGLang Spins Out as RadixArk with $400M Valuation Amid Inference Infrastructure Boom

SGLang, the open‑source inference accelerator, has been spun out into RadixArk—a $400 million‑valued startup aiming to democratize AI infrastructure, while the broader market sees a surge of funding for inference‑focused companies.

AI Infrastructure · AI inference · RadixArk

0 likes · 5 min read

Amazon Cloud Developers

Jan 8, 2026 · Artificial Intelligence

18 New Open‑Source Models on Amazon Bedrock—Switch Without Code Changes

Amazon Bedrock now offers 18 additional fully managed open‑source models from providers such as Google, Mistral AI, NVIDIA and OpenAI, bringing the total to nearly 100 serverless models; the new offerings include Mistral Large 3 and three Ministral 3 variants optimized for edge deployment, and can be accessed via a unified API without modifying existing application code or infrastructure, while Amazon’s Guardrails and evaluation tools help ensure security and compliance.

AI inference · Amazon Bedrock · Mistral AI

0 likes · 6 min read

Baidu Geek Talk

Jan 7, 2026 · Artificial Intelligence

How Baidu’s vLLM‑Kunlun Plugin Powered MiMo Flash V2 on Kunlun XPU in 2 Days

Within two days, Baidu’s Baige and Kunlun Chip teams adapted the 309‑billion‑parameter MiMo Flash V2 model—featuring a hybrid SWA+Sink and Full Attention mechanism—to run efficiently on the Kunlun P800 XPU using the vLLM‑Kunlun Plugin, achieving lossless performance comparable to GPU inference.

AI inference · Kunlun XPU · MiMo Flash V2

0 likes · 7 min read

Network Intelligence Research Center (NIRC)

Dec 31, 2025 · Artificial Intelligence

Why AI Inference Is Slow and How Cutting‑Edge Tech Boosts It in Industrial Settings

The article analyzes the severe inference bottlenecks of large language models, CNNs, and recommendation systems and presents a suite of research‑driven accelerations—including token‑level pipeline parallelism (HPipe), KV‑cache clustering (ClusterAttn), quantization (QoKV), heterogeneous edge frameworks (DeepZoning, PICO), delay‑aware edge‑cloud scheduling (DECC), and operator choreography (RACE)—validated on real‑world industrial workloads.

AI inference · Recommendation Systems · edge AI

0 likes · 16 min read

Alibaba Cloud Infrastructure

Dec 23, 2025 · Cloud Native

How Knative Serverless Cuts AI Inference Costs in Half and Doubles Efficiency

This article explains how the cloud‑native Knative serverless framework reduces GPU waste, enables request‑driven autoscaling to zero, accelerates AI model versioning and startup with Fluid, and integrates protocols like MCP and A2A to deliver cost‑effective, high‑performance AI inference services.

AI inference · Cloud Native · GPU

0 likes · 17 min read

Alibaba Cloud Developer

Dec 17, 2025 · Cloud Native

How 3FS Powers High‑Performance KVCache for AI Inference: Architecture, Optimizations, and Cloud‑Native Deployment

This article details the design and engineering of the 3FS distributed file system as a scalable KVCache backend for large‑language‑model inference, covering its architecture, performance tuning, reliability fixes, integration with SGLang/vLLM, and cloud‑native Kubernetes operator deployment.

3FS · AI inference · Cloud Native

0 likes · 30 min read

Raymond Ops

Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for Ollama: From Setup to Production

This guide walks you through configuring Ollama for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inference · CUDA · GPU

0 likes · 15 min read

Instant Consumer Technology Team

Nov 28, 2025 · Artificial Intelligence

How to Run Powerful AI Locally with Open‑Source LocalAI: A Complete Guide

LocalAI is an open‑source, self‑hosted alternative to OpenAI that lets you run large language, image and audio models on your own CPU or GPU, offering full data privacy, zero cloud costs, and offline capability while remaining compatible with the OpenAI API ecosystem.

AI inference · Docker deployment · LocalAI

0 likes · 11 min read

Alibaba Cloud Infrastructure

Oct 20, 2025 · Artificial Intelligence

How ACK Inference Gateway Tripled Large‑Model Performance for an Insurance Giant

This article details how Guotai Insurance tackled the high latency and cost of large‑model inference by deploying Alibaba Cloud's ACK Inference Gateway, which uses load‑aware, prefix‑aware routing, intelligent queuing, and comprehensive observability to boost efficiency threefold while reducing expenses.

ACK Gateway · AI inference · Cloud Native

0 likes · 18 min read

Programmer DD

Oct 13, 2025 · Artificial Intelligence

Running ONNX AI Inference Natively in Java Without Python

This article explains how enterprise architects can integrate ONNX‑based machine‑learning inference directly into Java applications, covering tokenizer integration, GPU acceleration, deployment patterns, and lifecycle management to achieve secure, scalable, and observable AI services without relying on Python runtimes.

AI inference · Enterprise Architecture · GPU

0 likes · 16 min read

Tencent Technical Engineering

Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization for large language models that tackles the dead‑zone trap by reactivating zero‑weight biases with dynamic offline offsets, achieving near‑full‑precision performance, faster convergence, and up to three‑fold CPU inference speedups.

AI inference · LLM Quantization · dynamic bias

0 likes · 9 min read

Architect's Alchemy Furnace

Sep 27, 2025 · Artificial Intelligence

How to Set Up Xinference with NVIDIA RTX 4090 on Oracle Linux: A Step‑by‑Step Guide

This guide walks you through configuring a high‑performance AI inference server on Oracle Linux, covering hardware specs, NVIDIA driver and CUDA installation, Conda environment setup, Xinference deployment, service startup, and example model loading commands, all with clear code snippets and images.

AI inference · CUDA · Conda

0 likes · 10 min read

Architects' Tech Alliance

Sep 19, 2025 · Artificial Intelligence

Why Nvidia’s Rubin CPX GPU Could Revolutionize Long-Context AI Inference

Nvidia's Rubin CPX GPU, unveiled in September 2025, uses GDDR7 memory and a split‑stage architecture to dramatically boost token‑per‑second rates for long‑context inference, while its integration into third‑generation Oberon servers promises higher power density, improved ROI, and scalable data‑center deployments.

AI inference · Data Center · GPU architecture

0 likes · 9 min read

Instant Consumer Technology Team

Aug 20, 2025 · Artificial Intelligence

Nvidia Unveils Nemotron‑Nano‑9B‑v2: Tiny Open‑Source LLM with Switchable Reasoning

Nvidia’s newly released Nemotron‑Nano‑9B‑v2, a 9‑billion‑parameter open‑source LLM optimized for a single Nvidia A10 GPU, introduces a toggleable reasoning mode and budget controls, delivering up to six‑fold speed gains, multilingual support, and strong benchmark results across various tasks.

AI inference · Mamba · NVIDIA

0 likes · 5 min read

Baidu Geek Talk

Aug 11, 2025 · Artificial Intelligence

FLUX-Lightning Slashes Diffusion Inference to 4 Steps, Doubling Speed

FLUX-Lightning, introduced by PaddleMIX, combines phased consistency distillation, adversarial learning, distribution‑matching distillation, and reflow loss to reduce diffusion model inference to just four steps while preserving image quality, and leverages the CINN compiler to achieve over 30% speed gains on A800 GPUs, surpassing existing SOTA acceleration methods.

AI inference · CINN · Diffusion Models

0 likes · 21 min read

Code Wrench

Aug 10, 2025 · Cloud Native

Boost Go Performance with Nuclio: A Serverless Platform for High‑Throughput Edge and AI Workloads

Nuclio is an open‑source, Go‑friendly serverless platform that delivers high‑throughput, low‑latency function execution on local machines, Kubernetes, or edge environments, offering native Go support, flexible triggers, built‑in observability, and easy deployment steps for streaming, API, and AI inference use cases.

AI inference · Nuclio · Serverless

0 likes · 6 min read

Data Thinking Notes

Aug 6, 2025 · Artificial Intelligence

OpenAI Unveils gpt-oss 120B & 20B: Open‑Source MoE Models with 4‑Bit Quantization

OpenAI's gpt-oss series introduces two open‑source large language models—gpt‑oss‑120b and gpt‑oss‑20b—featuring Mixture‑of‑Experts architecture, 4‑bit MXFP4 quantization, extensive benchmark results, and broad deployment options across cloud and consumer hardware.

4-bit quantization · AI inference · GPT-OSS

0 likes · 11 min read

Alibaba Cloud Infrastructure

Aug 6, 2025 · Artificial Intelligence

How Multi-Cluster Smart Scheduling Cuts AI Inference Costs with ACK One

This article explains how Alibaba Cloud's ACK One fleet uses inventory‑aware multi‑cluster elastic scheduling to dynamically allocate GPU resources across regions, reducing AI inference costs while ensuring high availability and seamless scaling for large‑model services.

AI inference · Elastic Scaling · kubernetes

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Jul 24, 2025 · Artificial Intelligence

How Alibaba Cloud’s Asynchronous Inference Transforms AI Model Deployment

This article explains how Alibaba Cloud's PAI platform uses an asynchronous inference framework with dedicated queue and inference services to overcome high‑latency challenges, enable load‑balanced request distribution, provide health‑check failover, and support automatic scaling for large‑model AI workloads.

AI inference · Alibaba Cloud · Scalable Architecture

0 likes · 7 min read

Tencent Technical Engineering

Jul 8, 2025 · Artificial Intelligence

Why GPUs Power Large‑Model Inference: From Graphics to GPGPU

This article explains how modern GPUs evolved from graphics rendering to general‑purpose computing, details the CPU‑GPU heterogeneous architecture, walks through a CUDA demo that adds two billion‑element arrays, compares CPU and GPU performance, and describes the compilation, loading, and execution pipeline of CUDA kernels.

AI inference · CUDA · GPGPU

0 likes · 33 min read

Alibaba Cloud Big Data AI Platform

Jun 26, 2025 · Artificial Intelligence

Master Cloud AI Inference: Load‑Testing Strategies with Alibaba PAI‑EAS

This article explains how Alibaba Cloud’s PAI‑EAS platform enables efficient, scalable AI inference by detailing distributed architecture, serverless resource scheduling, comprehensive load‑testing modes, key performance metrics, and step‑by‑step usage instructions, helping developers optimize latency, throughput, and cost for large language models.

AI inference · Alibaba PAI · Cloud Computing

0 likes · 7 min read

JD Cloud Developers

Jun 24, 2025 · Artificial Intelligence

How JD Retail’s xLLM Architecture Revolutionizes AI Inference for E‑Commerce

At GAITC2025, JD Retail’s AI Infra lead Zhang Ke detailed the challenges of e‑commerce AI inference and introduced the xLLM edge‑cloud unified large‑model architecture, highlighting adaptive scheduling, offline unified scheduling, multi‑layer pipelines, and agent collaboration that boost performance, cut costs, and pave the way for future AI advancements.

AI inference · Model Optimization · e-commerce

0 likes · 6 min read

AntTech

Jun 21, 2025 · Artificial Intelligence

Ring-lite: Open‑Source Lightweight MoE Model Sets SOTA on AIME and LiveCodeBench

Ring-lite, an open‑source lightweight Mixture‑of‑Experts inference model built on Ling‑lite‑1.5, introduces the C3PO reinforcement‑learning training method and achieves state‑of‑the‑art results on benchmarks such as AIME24/25, LiveCodeBench, Codeforces, and GPQA‑diamond, while offering full transparency of weights, code, and data.

AI inference · C3PO · benchmark

0 likes · 11 min read

JD Retail Technology

Jun 20, 2025 · Artificial Intelligence

How JD Retail’s xLLM Architecture Revolutionizes AI Inference for E‑Commerce

The article details JD Retail’s collaboration with Tsinghua University to build the xLLM edge‑cloud unified large‑model inference framework, addressing e‑commerce AI challenges such as diverse inputs, task scheduling, model compression, and cost, while outlining future research directions and performance gains.

AI inference · Model Optimization · e-commerce

0 likes · 7 min read

Alibaba Cloud Big Data AI Platform

Jun 13, 2025 · Artificial Intelligence

How EasyDistill Cuts LLM Costs: Mastering DistilQwen-ThoughtX on Alibaba Cloud

EasyDistill, an open-source framework from Alibaba Cloud PAI, streamlines knowledge distillation for large language models, introducing the DistilQwen-ThoughtX series with variable-length chain-of-thought reasoning, and provides comprehensive best-practice guidance for training, fine-tuning, evaluation, compression, and deployment via the PAI-ModelGallery.

AI inference · LLM · knowledge distillation

0 likes · 12 min read

Baidu Geek Talk

May 19, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations

To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built a 4 µs end‑to‑end low‑latency HPN cluster, optimized traffic management, adaptive routing, and custom Alltoall operators, resulting in up to 20 % throughput gains and reduced latency for both Prefill and Decode stages.

AI inference · Alltoall optimization · HPN

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

May 16, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs End-to-End Latency for Large-Scale PD Inference

Baidu Intelligent Cloud built a 4µs end-to-end low‑latency HPN cluster, optimized traffic management and communication operators, and introduced dynamic expert balancing to dramatically improve the performance of large‑scale PD‑separated inference services, showcasing the deep integration of network infrastructure with AI workloads.

AI inference · All-to-All · HPN

0 likes · 14 min read

Alibaba Cloud Native

Apr 29, 2025 · Artificial Intelligence

Qwen3 Unveiled: 8 Open‑Source Hybrid Inference Models Redefine AI Capabilities

Qwen3 introduces eight fully open‑source hybrid inference models—including two MoE and six dense variants—offering massive parameter scales, dual reasoning modes, 119‑language support, and record‑breaking agent performance that rival top‑tier LLMs.

AI inference · Qwen3 · multilingual

0 likes · 4 min read

AI Frontier Lectures

Apr 12, 2025 · Artificial Intelligence

How ByteDance Scales Attn/MoE: Cost Models, Mesh Communication, and Network Hacks

The article analyzes ByteDance's MegaScale‑Infer paper, detailing micro‑batching, M:N Attn‑MoE ratios, cost‑driven constraint search, communication redesign with Mesh All‑2‑All, network latency challenges, and innovative NIC and routing solutions for large‑scale mixture‑of‑experts inference.

AI inference · ByteDance · M:N scaling

0 likes · 7 min read

Volcano Engine Developer Services

Apr 8, 2025 · Artificial Intelligence

Which Cloud Platform Delivers the Fastest DeepSeek‑R1 API? A Comprehensive Benchmark

This article aggregates multiple independent evaluations of DeepSeek‑R1 across major cloud providers, comparing accuracy on AIME math problems, tokens‑per‑second throughput, first‑token latency, stability under high concurrency, and overall service reliability, ultimately highlighting Volcano Engine as the top performer.

AI inference · API performance · DeepSeek

0 likes · 12 min read

Code Mala Tang

Apr 3, 2025 · Artificial Intelligence

Intel Core Ultra 5 vs Apple M1: Which Wins for Large Language Model Inference?

This article compares the inference performance of a high‑end Intel Core Ultra 5 AI workstation with an Apple M1 MacBook Air using the IPEX‑LLM library, detailing installation steps, minimal code changes, resource usage, and benchmark results for small and large language models.

AI inference · Apple M1 · Hardware Comparison

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

Mar 29, 2025 · Artificial Intelligence

How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

The article introduces the DistilQwen2.5‑R1 series, which leverages a novel knowledge‑distillation pipeline—including CoT data evaluation, improvement, and validation—to transfer deep reasoning abilities from large models like DeepSeek‑R1 to compact models, achieving superior performance across math, code, and scientific benchmarks and providing open‑source checkpoints and deployment guides for practical use.

AI inference · benchmark evaluation · knowledge distillation

0 likes · 17 min read

Architects' Tech Alliance

Mar 28, 2025 · Artificial Intelligence

How DeepSeek Leverages Huawei Ascend to Boost AI Inference Efficiency

The report analyzes DeepSeek's latest V3 and R1 models, highlights their scaling‑law‑driven cost reductions, explains how Huawei Ascend optimizes inference by cutting KV‑Cache storage and improving compute efficiency, and surveys the model’s deployments across finance, government, manufacturing, and healthcare sectors.

AI efficiency · AI inference · DeepSeek

0 likes · 4 min read

Alibaba Cloud Developer

Mar 26, 2025 · Artificial Intelligence

Why DeepSeek Is Shaking Up the LLM Landscape: Architecture, Performance, and Cost

DeepSeek, a Chinese AI startup, offers open‑source large language models—DeepSeek‑V3 for general tasks and DeepSeek‑R1 for intensive reasoning—featuring MoE, MLA, low‑cost training, and competitive performance against OpenAI’s GPT‑4o, while providing detailed usage guidance and cost analysis.

AI inference · DeepSeek · cost efficiency

0 likes · 21 min read

Alibaba Cloud Observability

Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Ray Serve

0 likes · 19 min read

Java Tech Enthusiast

Mar 18, 2025 · Artificial Intelligence

Can Apple’s M3 Ultra Mac Studio Run Full‑Scale DeepSeek R1 at 11 Tokens/s?

Early adopters benchmarked the M3 Ultra‑powered Mac Studio running the 671‑billion‑parameter DeepSeek R1 model, achieving around 11 tokens per second in practice (up to 20 tokens/s theoretically), and compared its performance and cost against GPU‑based solutions and the newer M4 Max hardware.

AI inference · DeepSeek · LLM Benchmark

0 likes · 5 min read

Alibaba Cloud Infrastructure

Mar 18, 2025 · Cloud Native

Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes

This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases for both LoRA fine‑tuned models and base models such as QwQ‑32B and DeepSeek‑R1, including step‑by‑step commands and validation procedures.

ACK Gateway · AI inference · Cloud Native

0 likes · 25 min read

Alibaba Cloud Developer

Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Ray Serve · prometheus

0 likes · 21 min read

Alibaba Cloud Developer

Mar 14, 2025 · Artificial Intelligence

Solving Rate Limiting, Load Balancing, and Data Challenges in AI Inference with Tair

This article explains how AI inference services can tackle five core problems—rate limiting, load balancing, asynchronous processing, user data management, and index enhancement—by leveraging Tair's rich data structures, offering practical code examples, architectural diagrams, and a comparison with alternative solutions.

AI inference · RAG · Tair

0 likes · 20 min read

Programmer DD

Mar 6, 2025 · Artificial Intelligence

Discover QwQ-32B: A 32B LLM Matching 671B DeepSeek‑R1 Performance

The QwQ-32B model, released by Alibaba Cloud, delivers DeepSeek‑R1‑level results with only 32 billion parameters, offers integrated agent capabilities, is open‑source under Apache 2.0, and can be quickly deployed locally via Ollama or integrated into Java applications using Spring AI.

AI inference · Large Language Model · Model Deployment

0 likes · 4 min read

Software Engineering 3.0 Era

Feb 19, 2025 · Artificial Intelligence

Three Breakthroughs in AI Inference Models: 1% Data for 99% Performance and More

The article reviews three recent AI inference model advances—open‑source models surpassing OpenAI, the LIMO approach that gains 99% performance with just 1% of the data, and the CoAT framework that combines Monte‑Carlo tree search with associative memory to enable iterative, self‑correcting reasoning.

AI inference · Benchmarking · CoAT

0 likes · 7 min read

Architects' Tech Alliance

Feb 18, 2025 · Artificial Intelligence

How to Distill DeepSeek LLMs into Lightweight Models for Local Deployment

This article explains DeepSeek's knowledge‑distillation approach for compressing large language models into small, efficient student models, details step‑by‑step local deployment requirements, performance optimizations, and highlights the cost, privacy, and application benefits of running the distilled model on‑premise.

AI inference · DeepSeek · LLM

0 likes · 10 min read

AIWalker

Feb 15, 2025 · Artificial Intelligence

How 1.58‑bit Quantization Cuts FLUX Parameters by 99.5% While Matching Full‑Precision Quality

This article presents 1.58‑bit FLUX, which quantizes 99.5% of the 11.9 B parameters of the FLUX.1‑dev text‑to‑image model to 1.58 bits, introduces a custom low‑bit kernel, and achieves storage, memory, and latency improvements while preserving generation quality on standard benchmarks.

1.58-bit · AI inference · Flux

0 likes · 8 min read

Java Tech Enthusiast

Feb 15, 2025 · Artificial Intelligence

DeepSeek-R1: High-Performance AI Inference Model

DeepSeek‑R1 is a high‑performance AI inference model that leverages reinforcement‑learning techniques to boost reasoning on complex tasks, has become a Chinese‑New‑Year sensation, and requires substantial hardware resources for local deployment, especially the full‑scale 671‑billion‑parameter version.

AI Deployment · AI inference · AI model

0 likes · 4 min read

Data Thinking Notes

Feb 11, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining LLM Efficiency and Power

This article analyzes DeepSeek's V3 and R1 large language models, detailing their low‑cost Mixture‑of‑Experts architecture, Multi‑Head Latent Attention redesign, distributed training optimizations, and reasoning‑focused innovations that together challenge traditional GPU/NPU compute demands.

AI inference · DeepSeek · MLA

0 likes · 15 min read

Baidu Geek Talk

Feb 10, 2025 · Artificial Intelligence

How Baidu Cloud Slashes Inference Costs: DeepSeek Model Optimizations Unveiled

Baidu Cloud's Qianfan platform launched DeepSeek‑R1 and DeepSeek‑V3 with ultra‑low inference pricing, leveraging advanced engine performance tweaks, a split Prefill/Decode architecture, and comprehensive security measures that together boost throughput, cut costs, and ensure enterprise‑grade reliability.

AI inference · Baidu Cloud · Performance Optimization

0 likes · 5 min read

Architect

Feb 8, 2025 · Artificial Intelligence

DeepSeek‑R1: From Zero to Full‑Featured AI Model via Cold‑Start Data and Multi‑Stage Training

The article explains how DeepSeek‑R1 improves upon the Zero version by introducing expert‑crafted cold‑start data and a four‑phase multi‑stage training pipeline, resulting in markedly better reasoning, coding, and general knowledge performance across benchmark tests.

AI inference · DeepSeek · cold-start data

0 likes · 8 min read

Huawei Cloud Developer Alliance

Feb 8, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining Low‑Cost AI: Architecture, Training Tricks, and Industry Impact

This article analyses DeepSeek's V3 and R1 models, explaining how their innovative MoE architecture, Multi‑Head Latent Attention, low‑cost training strategies, and distributed‑training optimizations deliver high‑performance large language models while reducing GPU/NPU demand and sparking industry excitement.

AI inference · DeepSeek · Mixture of Experts

0 likes · 16 min read

Infra Learning Club

Feb 6, 2025 · Artificial Intelligence

Getting Started with Huawei Ascend AI Accelerators

This guide walks through the fundamentals of Huawei Ascend NPU hardware, the CANN software stack, driver and firmware installation, Kubernetes integration via Docker runtime and device plugin, and a complete ResNet‑50 inference demo on Ascend 310P.

AI inference · CANN · Docker Runtime

0 likes · 12 min read

Huawei Cloud Developer Alliance

Feb 5, 2025 · Artificial Intelligence

Deploy DeepSeek‑V3 on Ascend: Step‑by‑Step Guide for Fast AI Inference

This guide walks developers through obtaining the DeepSeek‑V3 model on the Ascend community, converting weights for GPU and NPU, loading the appropriate MindIE Docker image, launching the container, and configuring service‑level parameters to achieve efficient, out‑of‑the‑box AI inference on Ascend hardware.

AI inference · Ascend · DeepSeek

0 likes · 4 min read

Radish, Keep Going!

Feb 4, 2025 · Artificial Intelligence

How DeepSeek Is Redefining AI: Efficiency, Open‑Source Impact, and Future Trends

The article reviews DeepSeek's breakthrough in inference efficiency, explores the trade‑offs of model distillation, compares open‑source and closed‑source ecosystems, examines shifting compute demands, highlights Chinese engineering innovations, and outlines future directions for AI development.

AI inference · DeepSeek · Multimodal AI

0 likes · 9 min read

Tencent Tech

Feb 4, 2025 · Artificial Intelligence

Deploy and Test DeepSeek Large Language Models on Tencent Cloud TI in Minutes

This guide walks you through quickly deploying DeepSeek series models on the Tencent Cloud TI platform, covering model selection, resource planning, step‑by‑step service creation, free online trial, API testing via built‑in tools or curl, and managing inference services for both large and compact models.

AI inference · DeepSeek · Model Deployment

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Feb 1, 2025 · Artificial Intelligence

Deploy DeepSeek-V3 and R1 Models with One-Click on Alibaba Cloud PAI Model Gallery

This article introduces Alibaba Cloud's PAI Model Gallery, detailing the DeepSeek-V3 and DeepSeek‑R1 large language models, their architectures and parameters, and provides a step‑by‑step guide for one‑click deployment of these models and their distilled variants using vLLM or BladeLLM.

AI inference · Alibaba Cloud · DeepSeek

0 likes · 6 min read

Baobao Algorithm Notes

Jan 9, 2025 · Artificial Intelligence

How to Efficiently Deploy and Manage 100 LoRA‑Enhanced LLMs with vLLM

A technical walkthrough shows how to use vLLM to load multiple LoRA adapters for role‑playing LLMs, analyzes the massive GPU and labor costs of naïve deployment, and presents a hosted multi‑LoRA platform as a cost‑effective solution.

AI inference · LLM · LoRA

0 likes · 11 min read

DevOps

Jan 6, 2025 · Artificial Intelligence

Ten Popular Large Language Model Deployment Engines and Tools: Features, Advantages, and Limitations

This article reviews ten mainstream LLM deployment solutions—including WebLLM, LM Studio, Ollama, vLLM, LightLLM, OpenLLM, Hugging Face TGI, GPT4All, llama.cpp, and Triton Inference Server—detailing their technical characteristics, strengths, drawbacks, and example deployment workflows for both personal and enterprise environments.

AI inference · GPU Acceleration · LLM

0 likes · 16 min read

Architects' Tech Alliance

Jan 6, 2025 · Industry Insights

How Nvidia’s GB300 GPU Is Shaping AI Inference and Cloud Supply Chains

The article provides a detailed technical analysis of Nvidia’s new GB300 and B300 GPUs, comparing their performance, memory architecture, and power consumption to previous generations, and examines how these changes affect AI inference workloads, NVL72 accelerator systems, and the supply‑chain strategies of major cloud providers.

AI inference · Cloud Computing · GPU

0 likes · 12 min read

DevOps Cloud Academy

Dec 2, 2024 · Artificial Intelligence

Key Kubernetes Features that Benefit AI Inference Workloads

This article explains how Kubernetes’ native scalability, resource optimization, performance tuning, portability, and fault‑tolerance features align with the demands of AI inference, helping organizations run large ML models efficiently, cost‑effectively, and reliably across diverse environments.

AI inference · Portability · fault tolerance

0 likes · 15 min read

AI Large Model Application Practice

Nov 28, 2024 · Artificial Intelligence

Can Tiny Multimodal Models Power Edge AI? Meet OmniVision-968M

This article explores how compact multimodal models like OmniVision-968M enable efficient generative AI on edge devices, detailing their architectural advantages, benchmark superiority over larger models, and step‑by‑step instructions for local installation and visual inference using NexaSDK.

AI inference · OmniVision-968M · edge AI

0 likes · 9 min read

Architects' Tech Alliance

Nov 12, 2024 · Artificial Intelligence

How Retrieval‑Augmented Generation Boosts Enterprise AI with Intel Optimizations

This article explains the fundamentals of Retrieval‑Augmented Generation (RAG), its four‑step workflow, architecture, and how Intel’s hardware and software optimizations—including vector search, quantized embeddings, and advanced inference extensions—enhance performance, security, and scalability for enterprise LLM applications.

AI inference · Embedding Quantization · Intel Optimization

0 likes · 14 min read

Alibaba Cloud Infrastructure

Nov 8, 2024 · Industry Insights

Unlocking Efficient LLM Inference: Insights from China’s Cloud Computing Conference

The 5th China Cloud Computing Infrastructure Developer Conference in Beijing highlighted cutting‑edge AI inference optimization, Knative‑based serverless acceleration, AMD PMU virtualization, and CDI‑driven GPU management, offering detailed technical insights and real‑world case studies that illustrate how cloud providers are tackling performance and cost challenges of modern workloads.

AI inference · AMD virtualization · Cloud Native

0 likes · 9 min read

Sohu Tech Products

Oct 18, 2024 · Artificial Intelligence

Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes

This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.

AI inference · GPU Optimization · Performance Tuning

0 likes · 16 min read

dbaplus Community

Aug 13, 2024 · Artificial Intelligence

Why Kubernetes Is the Ideal Platform for AI Inference: 5 Key Benefits

Kubernetes aligns perfectly with AI inference demands by offering built‑in scalability, resource and performance optimization, seamless portability across clouds, and robust fault‑tolerance, making it a cost‑effective, high‑availability foundation for deploying large‑scale machine‑learning models.

AI inference · fault tolerance · kubernetes

0 likes · 10 min read

Open Source Linux

Jul 2, 2024 · Fundamentals

Why GPUs Power AI and Gaming: A Beginner’s Guide to Their Architecture

This article explains what a GPU is, how it differs from a CPU, its internal architecture, and why its massive parallel processing makes it essential for graphics rendering, scientific computation, and AI inference, illustrated with examples such as NVIDIA RTX 3090.

AI inference · GPU · Graphics Rendering

0 likes · 8 min read

Alibaba Cloud Native

Jun 29, 2024 · Cloud Native

Deploy TensorRT‑LLM Optimized Llama‑2 on KServe with Alibaba Cloud ASM

This guide walks through enabling KServe on Alibaba Cloud ASM, preparing the Llama‑2‑7B model with TensorRT‑LLM, creating the necessary Kubernetes resources, and deploying a serverless AI inference service that can be queried via a simple curl request.

AI inference · KServe · LLM

0 likes · 14 min read

Alibaba Cloud Infrastructure

Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inference · KServe · Llama 2

0 likes · 13 min read

JD Tech

Mar 18, 2024 · Artificial Intelligence

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch, deep‑learning compiler techniques, and a multi‑stream GPU architecture, achieving significant throughput and latency improvements.

AI inference · Deep Learning Compiler · Distributed Computing

0 likes · 14 min read

Open Source Tech Hub

Mar 12, 2024 · Artificial Intelligence

Step-by-Step Guide to Install ModelScope and Perform NLP Inference in Python & PHP

This guide walks you through setting up a Conda Python environment, installing PyTorch and the ModelScope library, running NLP pipelines for tasks like word segmentation and text classification, and calling ModelScope models from PHP using the PHPY extension, complete with code examples and troubleshooting tips.

AI inference · ModelScope · NLP

0 likes · 14 min read

Open Source Tech Hub

Jan 20, 2024 · Artificial Intelligence

How to Set Up ModelScope with Anaconda and Run OCR Inference via PHP

This guide walks through installing Anaconda, creating a Python 3.10 conda environment, adding PyTorch and ModelScope libraries, installing domain-specific dependencies, verifying NLP pipelines, and using PHPY to call ModelScope's OCR model from PHP, complete with code snippets and troubleshooting tips.

AI inference · Anaconda · ModelScope

0 likes · 10 min read

php Courses

Jan 8, 2024 · Artificial Intelligence

Setting Up and Using LocalAI as an Open‑Source Alternative to the ChatGPT API

LocalAI is an open‑source, cost‑effective alternative to the ChatGPT API that lets you download and run thousands of language models locally via Docker or compiled binaries, offering privacy, customization, and easy integration into projects through a compatible API.

AI inference · API · Docker

0 likes · 7 min read

Kuaishou Tech

Dec 20, 2023 · Artificial Intelligence

SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision

SAMP is an adaptive mixed-precision inference toolkit that automatically controls floating-point and integer operations to accelerate model inference while maintaining computational accuracy.

AI inference · NLP acceleration · mixed-precision computing

0 likes · 9 min read

Baobao Algorithm Notes

Dec 1, 2023 · Operations

Deploy Hugging Face Transformers with One Click Using LMDeploy

This article explains how LMDeploy streamlines the deployment of Hugging Face transformer models by adding online conversion, offering an OpenAI‑compatible API server, a Gradio WebUI, and 4‑bit weight‑only quantization with AWQ, providing step‑by‑step commands, code examples, and performance insights.

AI inference · API Server · Hugging Face

0 likes · 9 min read
