Tagged articles
110 articles
Page 1 of 2
Machine Heart
May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66% on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inference · DeepSeek · GPU
0 likes · 12 min read
Architects' Tech Alliance
May 9, 2026 · Artificial Intelligence

Fractile Claims 90% Cost Cut and 100× Speed Over Nvidia GPUs

Fractile, a UK AI‑chip startup founded in 2022, says its SRAM‑compute‑on‑die architecture eliminates data movement, promising up to 100‑fold faster inference and 90% lower cost than Nvidia GPUs, yet the chip is still in simulation and not expected to ship until 2027, sparking both investor hype and industry skepticism.

AI hardware market · AI inference · Anthropic
0 likes · 6 min read
AI Explorer
May 7, 2026 · Artificial Intelligence

Nvidia Endorses Open-Source “Light-Speed” Inference Engine for Coding Agents

The article examines how Nvidia’s open-source ‘light-speed’ inference engine tackles the token-bloat and compute bottlenecks of modern coding agents by redesigning attention and memory management, enabling order-of-magnitude speed gains without losing accuracy, and reshaping the AI-as-a-service ecosystem.

AI inference · Attention optimization · Nvidia
0 likes · 6 min read
ITPUB
Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Unleashed: 1M‑Token Context Becomes Commodity, Teams with Ascend to Challenge Compute Dominance

DeepSeek released two V4 models—Pro and Flash—both supporting 1‑million‑token context as a standard feature, showcasing top‑tier agentic coding, world‑knowledge, and inference performance, while introducing DSA sparse attention and announcing upcoming large‑scale deployment on Huawei Ascend hardware.

1M context · AI inference · DSA sparse attention
0 likes · 6 min read
Machine Heart
Apr 24, 2026 · Artificial Intelligence

Cambricon Achieves Day‑0 Native Support for DeepSeek‑V4, Uniting Two Chinese AI Leaders

Cambricon leveraged its NeuWare stack and vLLM framework to deliver Day‑0 native support for DeepSeek‑V4‑flash (285 B) and DeepSeek‑V4‑pro (1.6 T), open‑sourcing the adaptation and showcasing rapid model migration alongside extreme performance optimizations across software and hardware layers.

AI inference · Cambricon · DeepSeek-V4
0 likes · 5 min read
ITPUB
Apr 22, 2026 · Artificial Intelligence

Unveiling the ‘Elephant’: Ant’s Ling‑2.6‑flash LLM Delivers 1M Tokens for $0.10

Ant’s newly released Ling‑2.6‑flash model, hidden as the anonymous “Elephant Alpha,” combines a 104B‑parameter MoE design with only 7.4B active weights per inference, achieving ten‑fold token savings, top‑tier benchmark scores and a $0.10 per‑million‑token price that dramatically cuts inference costs for developers and enterprises.

AI inference · Benchmark · Token efficiency
0 likes · 6 min read
Architect's Must-Have
Apr 19, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Compression & 8× Speedup Break the AI Memory Wall

With LLM context windows soaring to millions of tokens, the KV‑cache memory wall threatens scalable inference; Google’s TurboQuant tackles this by compressing KV data up to six‑fold without precision loss and accelerating attention up to eight‑fold, using PolarQuant and 1‑bit QJL techniques, reshaping hardware costs and edge AI possibilities.

AI inference · KV compression · Memory Wall
0 likes · 25 min read
PaperAgent
Mar 26, 2026 · Artificial Intelligence

TurboQuant: How Google’s New Vector Quantization Cuts KV Memory 6× and Boosts Speed

TurboQuant, presented at ICLR 2026, introduces a theoretically grounded vector quantization technique that reduces large‑language‑model key‑value cache memory by at least six times, achieves up to eight‑fold speedups, and maintains zero accuracy loss by combining PolarQuant’s polar‑coordinate compression with a 1‑bit QJL error‑correction step, as demonstrated on benchmarks such as LongBench and GloVe.

AI inference · Benchmarking · TurboQuant
0 likes · 10 min read
Alibaba Cloud Developer
Mar 17, 2026 · Backend Development

How RocketMQ LiteTopic Eliminates AI Inference Queue Bottlenecks with Millisecond‑Level Flow Control

This article explains why traditional message‑queue throttling fails in AI inference workloads, introduces Apache RocketMQ 5.x LiteTopic’s lightweight topic model, and details its four core features—physical isolation, elastic scaling, precise flow control, and consumption suspension—that together provide millisecond‑level real‑time throttling and minute‑level busy‑idle scheduling for personalized traffic management.

AI inference · Flow Control · LiteTopic
0 likes · 14 min read
Baidu Intelligent Cloud Tech Hub
Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · Hardware acceleration · INT4
0 likes · 16 min read
SuanNi
Mar 4, 2026 · Artificial Intelligence

How to Fit Large Language Models into Cars and Robots: A Hardware‑Aware Scaling Law

This article presents a hardware‑aware co‑design framework for edge‑deployed large language models, revealing a scaling law that balances model accuracy and inference latency, and demonstrates how Pareto‑optimal architectures can be discovered quickly using roofline analysis and systematic search on devices like NVIDIA Jetson Orin.

AI inference · Edge Computing · Pareto optimization
0 likes · 15 min read
Fun with Large Models
Feb 17, 2026 · Artificial Intelligence

Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features

Qwen3.5‑397B‑A17B, the newly open‑sourced multimodal giant, combines a 400‑billion‑parameter sparse MoE architecture with FP8 pipelines and an asynchronous RL framework to deliver GPT‑5.2‑level capabilities, 60% lower memory usage, up to 19× higher throughput, and extensive image, video, and agent support, while outlining its deployment requirements and API pricing.

AI inference · FP8 · multimodal model
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Feb 17, 2026 · Artificial Intelligence

Deploy Alibaba’s Qwen3.5‑397B‑A17B Model in One Click with PAI‑Model Gallery

Alibaba's open‑source Qwen3.5‑397B‑A17B model, featuring 397 billion parameters and a hybrid Gated Delta Network/MoE architecture, delivers superior performance and reduced memory usage, and can be deployed instantly through the PAI‑Model Gallery with step‑by‑step guidance and enterprise‑grade security.

AI inference · Alibaba Cloud · One‑Click Deployment
0 likes · 5 min read
Baidu Geek Talk
Jan 7, 2026 · Artificial Intelligence

How Baidu’s vLLM‑Kunlun Plugin Powered MiMo Flash V2 on Kunlun XPU in 2 Days

Within two days, Baidu’s Baige and Kunlun Chip teams adapted the 309‑billion‑parameter MiMo Flash V2 model—featuring a hybrid SWA+Sink and Full Attention mechanism—to run efficiently on the Kunlun P800 XPU using the vLLM‑Kunlun Plugin, achieving lossless performance comparable to GPU inference.

AI inference · Kunlun XPU · MiMo Flash V2
0 likes · 7 min read
Network Intelligence Research Center (NIRC)
Dec 31, 2025 · Artificial Intelligence

Why AI Inference Is Slow and How Cutting‑Edge Tech Boosts It in Industrial Settings

The article analyzes the severe inference bottlenecks of large language models, CNNs, and recommendation systems and presents a suite of research‑driven accelerations—including token‑level pipeline parallelism (HPipe), KV‑cache clustering (ClusterAttn), quantization (QoKV), heterogeneous edge frameworks (DeepZoning, PICO), delay‑aware edge‑cloud scheduling (DECC), and operator choreography (RACE)—validated on real‑world industrial workloads.

AI inference · Recommendation Systems · edge AI
0 likes · 16 min read
Alibaba Cloud Developer
Dec 17, 2025 · Cloud Native

How 3FS Powers High‑Performance KVCache for AI Inference: Architecture, Optimizations, and Cloud‑Native Deployment

This article details the design and engineering of the 3FS distributed file system as a scalable KVCache backend for large‑language‑model inference, covering its architecture, performance tuning, reliability fixes, integration with SGLang/vLLM, and cloud‑native Kubernetes operator deployment.

3FS · AI inference · Cloud Native
0 likes · 30 min read
Raymond Ops
Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inference · CUDA · GPU
0 likes · 15 min read
Alibaba Cloud Infrastructure
Oct 20, 2025 · Artificial Intelligence

How ACK Inference Gateway Tripled Large‑Model Performance for an Insurance Giant

This article details how Guotai Insurance tackled the high latency and cost of large‑model inference by deploying Alibaba Cloud's ACK Inference Gateway, which uses load‑aware, prefix‑aware routing, intelligent queuing, and comprehensive observability to boost efficiency threefold while reducing expenses.

ACK Gateway · AI inference · Cloud Native
0 likes · 18 min read
Programmer DD
Oct 13, 2025 · Artificial Intelligence

Running ONNX AI Inference Natively in Java Without Python

This article explains how enterprise architects can integrate ONNX‑based machine‑learning inference directly into Java applications, covering tokenizer integration, GPU acceleration, deployment patterns, and lifecycle management to achieve secure, scalable, and observable AI services without relying on Python runtimes.

AI inference · GPU · Java
0 likes · 16 min read
Tencent Technical Engineering
Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization for large language models that tackles the dead‑zone trap by reactivating zero‑weight biases with dynamic offline offsets, achieving near‑full‑precision performance, faster convergence, and up to three‑fold CPU inference speedups.

AI inference · LLM quantization · dynamic bias
0 likes · 9 min read
Architects' Tech Alliance
Sep 19, 2025 · Artificial Intelligence

Why Nvidia’s Rubin CPX GPU Could Revolutionize Long-Context AI Inference

Nvidia's Rubin CPX GPU, unveiled in September 2025, uses GDDR7 memory and a split‑stage architecture to dramatically boost token‑per‑second rates for long‑context inference, while its integration into third‑generation Oberon servers promises higher power density, improved ROI, and scalable data‑center deployments.

AI inference · Data center · GPU architecture
0 likes · 9 min read
Instant Consumer Technology Team
Aug 20, 2025 · Artificial Intelligence

Nvidia Unveils Nemotron‑Nano‑9B‑v2: Tiny Open‑Source LLM with Switchable Reasoning

Nvidia’s newly released Nemotron‑Nano‑9B‑v2, a 9‑billion‑parameter open‑source LLM optimized for a single Nvidia A10 GPU, introduces a toggleable reasoning mode and budget controls, delivering up to six‑fold speed gains, multilingual support, and strong benchmark results across various tasks.

AI inference · Mamba · Nvidia
0 likes · 5 min read
Baidu Geek Talk
Aug 11, 2025 · Artificial Intelligence

FLUX-Lightning Slashes Diffusion Inference to 4 Steps, Doubling Speed

FLUX-Lightning, introduced by PaddleMIX, combines phased consistency distillation, adversarial learning, distribution‑matching distillation, and reflow loss to reduce diffusion model inference to just four steps while preserving image quality, and leverages the CINN compiler to achieve over 30% speed gains on A800 GPUs, surpassing existing SOTA acceleration methods.

AI inference · CINN · Distillation
0 likes · 21 min read
Code Wrench
Aug 10, 2025 · Cloud Native

Boost Go Performance with Nuclio: A Serverless Platform for High‑Throughput Edge and AI Workloads

Nuclio is an open‑source, Go‑friendly serverless platform that delivers high‑throughput, low‑latency function execution on local machines, Kubernetes, or edge environments, offering native Go support, flexible triggers, built‑in observability, and easy deployment steps for streaming, API, and AI inference use cases.

AI inference · Edge Computing · Kubernetes
0 likes · 6 min read
Alibaba Cloud Big Data AI Platform
Jul 24, 2025 · Artificial Intelligence

How Alibaba Cloud’s Asynchronous Inference Transforms AI Model Deployment

This article explains how Alibaba Cloud's PAI platform uses an asynchronous inference framework with dedicated queue and inference services to overcome high‑latency challenges, enable load‑balanced request distribution, provide health‑check failover, and support automatic scaling for large‑model AI workloads.

AI inference · Alibaba Cloud · Cloud AI
0 likes · 7 min read
Tencent Technical Engineering
Jul 8, 2025 · Artificial Intelligence

Why GPUs Power Large‑Model Inference: From Graphics to GPGPU

This article explains how modern GPUs evolved from graphics rendering to general‑purpose computing, details the CPU‑GPU heterogeneous architecture, walks through a CUDA demo that adds two billion‑element arrays, compares CPU and GPU performance, and describes the compilation, loading, and execution pipeline of CUDA kernels.

AI inference · CUDA · GPGPU
0 likes · 33 min read
Alibaba Cloud Big Data AI Platform
Jun 26, 2025 · Artificial Intelligence

Master Cloud AI Inference: Load‑Testing Strategies with Alibaba PAI‑EAS

This article explains how Alibaba Cloud’s PAI‑EAS platform enables efficient, scalable AI inference by detailing distributed architecture, serverless resource scheduling, comprehensive load‑testing modes, key performance metrics, and step‑by‑step usage instructions, helping developers optimize latency, throughput, and cost for large language models.

AI inference · Alibaba PAI · Load Testing
0 likes · 7 min read
JD Cloud Developers
Jun 24, 2025 · Artificial Intelligence

How JD Retail’s xLLM Architecture Revolutionizes AI Inference for E‑Commerce

At GAITC2025, JD Retail’s AI Infra lead Zhang Ke detailed the challenges of e‑commerce AI inference and introduced the xLLM edge‑cloud unified large‑model architecture, highlighting adaptive scheduling, offline unified scheduling, multi‑layer pipelines, and agent collaboration that boost performance, cut costs, and pave the way for future AI advancements.

AI inference · Large Model · Model Optimization
0 likes · 6 min read
AntTech
Jun 21, 2025 · Artificial Intelligence

Ring-lite: Open‑Source Lightweight MoE Model Sets SOTA on AIME and LiveCodeBench

Ring-lite, an open‑source lightweight Mixture‑of‑Experts inference model built on Ling‑lite‑1.5, introduces the C3PO reinforcement‑learning training method and achieves state‑of‑the‑art results on benchmarks such as AIME24/25, LiveCodeBench, CodeForce, and GPQA‑diamond, while offering full transparency of weights, code, and data.

AI inference · Benchmark · C3PO
0 likes · 11 min read
JD Retail Technology
Jun 20, 2025 · Artificial Intelligence

How JD Retail’s xLLM Architecture Revolutionizes AI Inference for E‑Commerce

The article details JD Retail’s collaboration with Tsinghua University to build the xLLM edge‑cloud unified large‑model inference framework, addressing e‑commerce AI challenges such as diverse inputs, task scheduling, model compression, and cost, while outlining future research directions and performance gains.

AI inference · Model Optimization · edge-cloud
0 likes · 7 min read
Alibaba Cloud Big Data AI Platform
Jun 13, 2025 · Artificial Intelligence

How EasyDistill Cuts LLM Costs: Mastering DistilQwen-ThoughtX on Alibaba Cloud

EasyDistill, an open-source framework from Alibaba Cloud PAI, streamlines knowledge distillation for large language models, introducing the DistilQwen-ThoughtX series with variable-length chain-of-thought reasoning, and provides comprehensive best-practice guidance for training, fine-tuning, evaluation, compression, and deployment via the PAI-ModelGallery.

AI inference · LLM · knowledge distillation
0 likes · 12 min read
Baidu Geek Talk
May 19, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs Low-Latency PD Inference with HPN Network Optimizations

To meet the demanding network requirements of large‑scale PD‑separated inference, Baidu Cloud built a 4 µs end‑to‑end low‑latency HPN cluster, optimized traffic management, adaptive routing, and custom Alltoall operators, resulting in up to 20% throughput gains and reduced latency for both Prefill and Decode stages.

AI inference · Alltoall optimization · Distributed Training
0 likes · 14 min read
Baidu Intelligent Cloud Tech Hub
May 16, 2025 · Artificial Intelligence

How Baidu Cloud Achieved 4µs End-to-End Latency for Large-Scale PD Inference

Baidu Intelligent Cloud built a 4µs end-to-end low‑latency HPN cluster, optimized traffic management and communication operators, and introduced dynamic expert balancing to dramatically improve the performance of large‑scale PD‑separated inference services, showcasing the deep integration of network infrastructure with AI workloads.

AI inference · All-to-All · HPN
0 likes · 14 min read
AI Frontier Lectures
Apr 12, 2025 · Artificial Intelligence

How ByteDance Scales Attn/MoE: Cost Models, Mesh Communication, and Network Hacks

The article analyzes ByteDance's MegaScale‑Infer paper, detailing micro‑batching, M:N Attn‑MoE ratios, cost‑driven constraint search, communication redesign with Mesh All‑2‑All, network latency challenges, and innovative NIC and routing solutions for large‑scale mixture‑of‑experts inference.

AI inference · ByteDance · Cost Optimization
0 likes · 7 min read
Volcano Engine Developer Services
Apr 8, 2025 · Artificial Intelligence

Which Cloud Platform Delivers the Fastest DeepSeek‑R1 API? A Comprehensive Benchmark

This article aggregates multiple independent evaluations of DeepSeek‑R1 across major cloud providers, comparing accuracy on AIME math problems, token‑per‑second throughput, first‑token latency, stability under high concurrency, and overall service reliability, ultimately highlighting Volcano Engine as the top performer.

AI inference · API performance · Benchmark
0 likes · 12 min read
Code Mala Tang
Apr 3, 2025 · Artificial Intelligence

Intel Core Ultra 5 vs Apple M1: Which Wins for Large Language Model Inference?

This article compares the inference performance of a high‑end Intel Core Ultra 5 AI workstation with an Apple M1 MacBook Air using the IPEX‑LLM library, detailing installation steps, minimal code changes, resource usage, and benchmark results for small and large language models.

AI inference · Apple M1 · IPEX-LLM
0 likes · 9 min read
Alibaba Cloud Big Data AI Platform
Mar 29, 2025 · Artificial Intelligence

How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

The article introduces the DistilQwen2.5‑R1 series, which leverages a novel knowledge‑distillation pipeline—including CoT data evaluation, improvement, and validation—to transfer deep reasoning abilities from large models like DeepSeek‑R1 to compact models, achieving superior performance across math, code, and scientific benchmarks and providing open‑source checkpoints and deployment guides for practical use.

AI inference · benchmark evaluation · knowledge distillation
0 likes · 17 min read
Architects' Tech Alliance
Mar 28, 2025 · Artificial Intelligence

How DeepSeek Leverages Huawei Ascend to Boost AI Inference Efficiency

The report analyzes DeepSeek's latest V3 and R1 models, highlights their scaling‑law‑driven cost reductions, explains how Huawei Ascend optimizes inference by cutting KV‑Cache storage and improving compute efficiency, and surveys the model’s deployments across finance, government, manufacturing, and healthcare sectors.

AI efficiency · AI inference · DeepSeek
0 likes · 4 min read
Alibaba Cloud Developer
Mar 26, 2025 · Artificial Intelligence

Why DeepSeek Is Shaking Up the LLM Landscape: Architecture, Performance, and Cost

DeepSeek, a Chinese AI startup, offers open‑source large language models—DeepSeek‑V3 for general tasks and DeepSeek‑R1 for intensive reasoning—featuring MoE, MLA, low‑cost training, and competitive performance against OpenAI’s GPT‑4o, while providing detailed usage guidance and cost analysis.

AI inference · DeepSeek · Model architecture
0 likes · 21 min read
Alibaba Cloud Observability
Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Prometheus
0 likes · 19 min read
Alibaba Cloud Infrastructure
Mar 18, 2025 · Cloud Native

Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes

This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases for both LoRA fine‑tuned models and base models such as QwQ‑32B and DeepSeek‑R1, including step‑by‑step commands and validation procedures.

ACK Gateway · AI inference · Cloud Native
0 likes · 25 min read
Alibaba Cloud Developer
Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Prometheus · Ray Serve
0 likes · 21 min read
Alibaba Cloud Developer
Mar 14, 2025 · Artificial Intelligence

Solving Rate Limiting, Load Balancing, and Data Challenges in AI Inference with Tair

This article explains how AI inference services can tackle five core problems—rate limiting, load balancing, asynchronous processing, user data management, and index enhancement—by leveraging Tair's rich data structures, offering practical code examples, architectural diagrams, and a comparison with alternative solutions.

AI inference · RAG · Tair
0 likes · 20 min read
Programmer DD
Mar 6, 2025 · Artificial Intelligence

Discover QwQ-32B: A 32B LLM Matching 671B DeepSeek‑R1 Performance

The QwQ-32B model, released by Alibaba Cloud, delivers DeepSeek‑R1‑level results with only 32 billion parameters, offers integrated agent capabilities, is open‑source under Apache 2.0, and can be quickly deployed locally via Ollama or integrated into Java applications using Spring AI.

AI inference · Model Deployment · Ollama
0 likes · 4 min read
Architects' Tech Alliance
Feb 18, 2025 · Artificial Intelligence

How to Distill DeepSeek LLMs into Lightweight Models for Local Deployment

This article explains DeepSeek's knowledge‑distillation approach for compressing large language models into small, efficient student models, details step‑by‑step local deployment requirements, performance optimizations, and highlights the cost, privacy, and application benefits of running the distilled model on‑premise.

AI inference · DeepSeek · LLM
0 likes · 10 min read
Java Tech Enthusiast
Feb 15, 2025 · Artificial Intelligence

DeepSeek-R1: High-Performance AI Inference Model

DeepSeek‑R1 is a high‑performance AI inference model that uses reinforcement‑learning techniques to boost reasoning on complex tasks. It became a sensation over the Chinese New Year, and local deployment requires substantial hardware resources, especially for the full‑scale 671‑billion‑parameter version.

AI deployment · AI inference · AI model
0 likes · 4 min read
Data Thinking Notes
Feb 11, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining LLM Efficiency and Power

This article analyzes DeepSeek's V3 and R1 large language models, detailing their low‑cost Mixture‑of‑Experts architecture, Multi‑Head Latent Attention redesign, distributed training optimizations, and reasoning‑focused innovations that together challenge traditional GPU/NPU compute demands.

AI inference · DeepSeek · MLA
0 likes · 15 min read
Baidu Geek Talk
Feb 10, 2025 · Artificial Intelligence

How Baidu Cloud Slashes Inference Costs: DeepSeek Model Optimizations Unveiled

Baidu Cloud's Qianfan platform launched DeepSeek‑R1 and DeepSeek‑V3 with ultra‑low inference pricing, leveraging advanced engine performance tweaks, a split Prefill/Decode architecture, and comprehensive security measures that together boost throughput, cut costs, and ensure enterprise‑grade reliability.

AI inference · Baidu Cloud · Model Serving
0 likes · 5 min read
Huawei Cloud Developer Alliance
Feb 8, 2025 · Artificial Intelligence

Why DeepSeek V3 and R1 Are Redefining Low‑Cost AI: Architecture, Training Tricks, and Industry Impact

This article analyses DeepSeek's V3 and R1 models, explaining how their innovative MoE architecture, Multi‑Head Latent Attention, low‑cost training strategies, and distributed‑training optimizations deliver high‑performance large language models while reducing GPU/NPU demand and sparking industry excitement.

AI inference · DeepSeek · Mixture of Experts
0 likes · 16 min read
Infra Learning Club
Feb 6, 2025 · Artificial Intelligence

Getting Started with Huawei Ascend AI Accelerators

This guide walks through the fundamentals of Huawei Ascend NPU hardware, the CANN software stack, driver and firmware installation, Kubernetes integration via Docker runtime and device plugin, and a complete ResNet‑50 inference demo on Ascend 310P.

AI inference · CANN · Docker Runtime
0 likes · 12 min read
Huawei Cloud Developer Alliance
Feb 5, 2025 · Artificial Intelligence

Deploy DeepSeek‑V3 on Ascend: Step‑by‑Step Guide for Fast AI Inference

This guide walks developers through obtaining the DeepSeek‑V3 model on the Ascend community, converting weights for GPU and NPU, loading the appropriate MindIE Docker image, launching the container, and configuring service‑level parameters to achieve efficient, out‑of‑the‑box AI inference on Ascend hardware.

AI inference · Ascend · DeepSeek
0 likes · 4 min read
Tencent Tech
Feb 4, 2025 · Artificial Intelligence

Deploy and Test DeepSeek Large Language Models on Tencent Cloud TI in Minutes

This guide walks you through quickly deploying DeepSeek series models on the Tencent Cloud TI platform, covering model selection, resource planning, step‑by‑step service creation, free online trial, API testing via built‑in tools or curl, and managing inference services for both large and compact models.

AI inference · DeepSeek · Model Deployment
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Feb 1, 2025 · Artificial Intelligence

Deploy DeepSeek-V3 and R1 Models with One-Click on Alibaba Cloud PAI Model Gallery

This article introduces Alibaba Cloud's PAI Model Gallery, detailing the DeepSeek-V3 and DeepSeek‑R1 large language models, their architectures and parameters, and provides a step‑by‑step guide for one‑click deployment of these models and their distilled variants using vLLM or BladeLLM.

AI inference · Alibaba Cloud · DeepSeek
0 likes · 6 min read
DevOps
Jan 6, 2025 · Artificial Intelligence

Ten Popular Large Language Model Deployment Engines and Tools: Features, Advantages, and Limitations

This article reviews ten mainstream LLM deployment solutions—including WebLLM, LM Studio, Ollama, vLLM, LightLLM, OpenLLM, HuggingFace TGI, GPT4ALL, llama.cpp, and Triton Inference Server—detailing their technical characteristics, strengths, drawbacks, and example deployment workflows for both personal and enterprise environments.

AI inference · GPU Acceleration · LLM
0 likes · 16 min read
Architects' Tech Alliance
Jan 6, 2025 · Industry Insights

How Nvidia’s GB300 GPU Is Shaping AI Inference and Cloud Supply Chains

The article provides a detailed technical analysis of Nvidia’s new GB300 and B300 GPUs, comparing their performance, memory architecture, and power consumption to previous generations, and examines how these changes affect AI inference workloads, NVL72 accelerator systems, and the supply‑chain strategies of major cloud providers.

AI inference · GPU · Nvidia
0 likes · 12 min read
DevOps Cloud Academy
Dec 2, 2024 · Artificial Intelligence

Key Kubernetes Features that Benefit AI Inference Workloads

This article explains how Kubernetes’ native scalability, resource optimization, performance tuning, portability, and fault‑tolerance features align with the demands of AI inference, helping organizations run large ML models efficiently, cost‑effectively, and reliably across diverse environments.

AI inference · Kubernetes · Portability
0 likes · 15 min read
AI Large Model Application Practice
Nov 28, 2024 · Artificial Intelligence

Can Tiny Multimodal Models Power Edge AI? Meet OmniVision-968M

This article explores how compact multimodal models like OmniVision-968M enable efficient generative AI on edge devices, detailing their architectural advantages, benchmark superiority over larger models, and step‑by‑step instructions for local installation and visual inference using NexaSDK.

AI inference · OmniVision-968M · Tutorial
0 likes · 9 min read
Architects' Tech Alliance
Nov 12, 2024 · Artificial Intelligence

How Retrieval‑Augmented Generation Boosts Enterprise AI with Intel Optimizations

This article explains the fundamentals of Retrieval‑Augmented Generation (RAG), its four‑step workflow, architecture, and how Intel’s hardware and software optimizations—including vector search, quantized embeddings, and advanced inference extensions—enhance performance, security, and scalability for enterprise LLM applications.

AI inference · Embedding Quantization · Intel Optimization
0 likes · 14 min read
Alibaba Cloud Infrastructure
Nov 8, 2024 · Industry Insights

Unlocking Efficient LLM Inference: Insights from China’s Cloud Computing Conference

The 5th China Cloud Computing Infrastructure Developer Conference in Beijing highlighted cutting‑edge AI inference optimization, Knative‑based serverless acceleration, AMD PMU virtualization, and CDI‑driven GPU management, offering detailed technical insights and real‑world case studies that illustrate how cloud providers are tackling performance and cost challenges of modern workloads.

AI inference · AMD virtualization · Cloud Native
0 likes · 9 min read
Sohu Tech Products
Oct 18, 2024 · Artificial Intelligence

Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes

This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.

AI inference · GPU Optimization · Kubernetes
0 likes · 16 min read
dbaplus Community
Aug 13, 2024 · Artificial Intelligence

Why Kubernetes Is the Ideal Platform for AI Inference: 5 Key Benefits

Kubernetes aligns perfectly with AI inference demands by offering built‑in scalability, resource and performance optimization, seamless portability across clouds, and robust fault‑tolerance, making it a cost‑effective, high‑availability foundation for deploying large‑scale machine‑learning models.

AI inference · Kubernetes · Resource Optimization
0 likes · 10 min read
Alibaba Cloud Infrastructure
Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inference · KServe · Kubernetes
0 likes · 13 min read
JD Tech
Mar 18, 2024 · Artificial Intelligence

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch, deep‑learning compiler techniques, and a multi‑stream GPU architecture, achieving significant throughput and latency improvements.

AI inference · Deep Learning Compiler · GPU Optimization
0 likes · 14 min read
Open Source Tech Hub
Mar 12, 2024 · Artificial Intelligence

Step-by-Step Guide to Install ModelScope and Perform NLP Inference in Python & PHP

This guide walks you through setting up a Conda Python environment, installing PyTorch and the ModelScope library, running NLP pipelines for tasks like word segmentation and text classification, and calling ModelScope models from PHP using the PHPY extension, complete with code examples and troubleshooting tips.

AI inference · ModelScope · NLP
0 likes · 14 min read
Open Source Tech Hub
Jan 20, 2024 · Artificial Intelligence

How to Set Up ModelScope with Anaconda and Run OCR Inference via PHP

This guide walks through installing Anaconda, creating a Python 3.10 conda environment, adding PyTorch and ModelScope libraries, installing domain-specific dependencies, verifying NLP pipelines, and using PHPY to call ModelScope's OCR model from PHP, complete with code snippets and troubleshooting tips.

AI inference · Anaconda · ModelScope
0 likes · 10 min read
Baobao Algorithm Notes
Dec 1, 2023 · Operations

Deploy Hugging Face Transformers with One Click Using LMDeploy

This article explains how LMDeploy streamlines the deployment of Hugging Face transformer models by adding online conversion, offering an OpenAI‑compatible API server, a Gradio WebUI, and 4‑bit weight‑only quantization with AWQ, providing step‑by‑step commands, code examples, and performance insights.

AI inference · API Server · Hugging Face
0 likes · 9 min read
DataFunSummit
Jul 4, 2023 · Artificial Intelligence

PPL: A Full‑Platform Deep Learning Deployment Framework by SenseTime

The article presents SenseTime's PPL framework, detailing its toolchain, inference engine, multi‑backend operator library, quantization tools, CUDA optimizations, performance benchmarks across CPUs, GPUs, DSPs and DSAs, and outlines future plans for broader chip support and AI for Science.

AI inference · CUDA optimization · Deep Learning Deployment
0 likes · 23 min read
Bilibili Tech
Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inference · GPU utilization · InferX
0 likes · 10 min read
Baidu Tech Salon
Mar 29, 2023 · Artificial Intelligence

Punica System: Enhancing AI Inference Service Efficiency Through FaaS Architecture

The Punica system unifies AI inference development, testing, deployment, and maintenance on a FaaS‑based one‑stop platform that automates resource scheduling, self‑healing, and monitoring, supporting multiple frameworks and GPUs, thereby doubling onboarding speed, quintupling scaling efficiency, and reclaiming hundreds of GPU cards.

AI inference · FaaS architecture · GPU scheduling
0 likes · 13 min read
Baidu Geek Talk
Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Service Orchestration
0 likes · 14 min read
Baidu Geek Talk
Jan 5, 2023 · Artificial Intelligence

How Baidu’s AIAK‑Inference Supercharges AI Model Inference on GPUs

This article provides an end‑to‑end analysis of AI inference bottlenecks, reviews common industry acceleration techniques, and details Baidu Intelligent Cloud’s AIAK‑Inference suite—including its architecture, optimization strategies such as model pruning, operator fusion, and single‑operator tuning—followed by a demo showing significant latency reductions on ResNet‑50 and other models.

AI inference · AIAK-Inference · Baidu Cloud
0 likes · 16 min read
Baidu Intelligent Cloud Tech Hub
Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Inference: End‑to‑End Acceleration Strategies and Baidu’s AIAK‑Inference

This article presents a comprehensive analysis of AI inference bottlenecks, explores industry acceleration techniques such as model simplification, operator fusion, and single‑operator optimization, and details Baidu Cloud's AIAK‑Inference suite with practical demos showing up to 90% latency reduction.

AI inference · AIAK-Inference · Baidu Cloud
0 likes · 16 min read
ITPUB
Dec 22, 2022 · Cloud Native

How 58 Tongcheng Built a Cloud‑Native Deep Learning Inference Platform with Istio

This article details the evolution of 58 Tongcheng's deep learning inference platform—from the initial WPAI‑based architecture to a cloud‑native, Istio‑powered design—covering its background, technical challenges, architectural redesign, traffic‑management features, adaptive rate limiting, model warm‑up, and observability improvements.

AI inference · Istio · Kubernetes
0 likes · 24 min read
DataFunTalk
Dec 7, 2022 · Artificial Intelligence

Vivo's Self‑Developed Streaming Speech‑Recognition Inference Engine and KunlunChip High‑Performance Inference Library

The article details vivo's development of a high‑accuracy, high‑performance streaming speech‑recognition inference engine built on the WeNet framework, its optimization techniques such as dynamic batching and memory pooling, collaborative acceleration with KunlunChip's high‑performance inference library, and extensive performance benchmarks demonstrating multi‑batch GPU and XPU gains.

AI inference · Kunlun chip · Performance Optimization
0 likes · 10 min read
Tencent Architect
Jun 9, 2022 · Artificial Intelligence

From Zero to Chip: Tencent’s Multi‑Year Journey in AI, FPGA, and Smart‑NIC Development

Tencent’s hardware teams evolved from a lack of verification tools in 2019 to building AI inference chips, video‑encoding silicon, and intelligent NICs, overcoming FPGA challenges, scaling cloud infrastructure, and delivering high‑performance, low‑cost solutions for massive multimedia and AI workloads.

AI inference · Chip Design · FPGA
0 likes · 16 min read
Alipay Experience Technology
Feb 10, 2022 · Frontend Development

How Ant Group Supercharged Front‑End AI with Cross‑Platform Smart Apps

This talk explains how Ant Group’s frontend engineers built edge‑AI services that run directly in browsers, boosting real‑time performance, preserving privacy, and cutting cloud costs, while showcasing two real‑world cases—pet identification and screen‑break insurance—and detailing the WebGL‑based engine optimizations that lifted device coverage from 30% to 93%.

AI inference · Performance Optimization · WebGL
0 likes · 8 min read
Yiche Technology
Jan 27, 2022 · Backend Development

C++ Multithreaded Service Architecture for High‑Throughput AI Inference

The article explains how to design a C++‑based multithreaded service that uses Pthreads, channels, and TensorRT to parallelize deep‑learning inference tasks, thereby reducing latency and dramatically increasing throughput for AI applications such as facial‑recognition access control systems.

AI inference · C · TensorRT
0 likes · 11 min read
Tencent Cloud Developer
Jul 20, 2021 · Artificial Intelligence

Deploying AI Inference Functions on Tencent Cloud Serverless with Rust and WebAssembly

Michael Yuan’s ServerlessDays China 2021 talk shows how combining Rust with WebAssembly on Tencent Cloud Serverless lets developers deploy TensorFlow AI models in just a few lines, achieving 10‑20 fps inference, 100× faster cold starts than Python, and offering lightweight, secure, portable runtimes that could eventually supplant containers for edge and AI workloads.

AI inference · Cloud Native · Edge Computing
0 likes · 19 min read