Tagged articles

92 articles

May 17, 2026 · Industry Insights

Cerebras' $5.55B IPO Unveils the World’s Largest AI Chip Challenging Nvidia

Cerebras Systems raised $5.55 billion in the largest 2026 IPO, debuting the wafer‑scale WSE‑3 chip that promises unprecedented inference bandwidth and could erode Nvidia’s dominance, while navigating CFIUS scrutiny, a dramatic financial turnaround, and a shifting AI‑chip market landscape.

AI Chip · Cerebras · IPO

15 min read

Old Zhang's AI Learning

May 16, 2026 · Artificial Intelligence

vLLM 0.21.0 Arrives: Speculative Decoding Now Supports Reasoning Models

The vLLM 0.21.0 release brings five major updates—including Transformers v4 deprecation, a C++20 build requirement, KV offload with hybrid memory, speculative decoding that respects thinking budgets, and a Blackwell token‑speed backend—while offering detailed upgrade guidance for different user groups.

C++20 · Inference · KV cache

12 min read

Machine Heart

May 16, 2026 · Artificial Intelligence

Why More Compute Can't Fix LLM Inference Lag and Why RL Leads to Overtraining

In a deep interview, former Google TPU architect Reiner Pope explains that low‑concurrency fast‑mode services trade higher fees for faster streaming but are limited by memory‑bandwidth bottlenecks, that optimal concurrency balances compute and memory costs, and that pipeline‑parallel sparse expert models and reinforcement‑learning fine‑tuning introduce new inefficiencies and overtraining risks.

Inference · LLM · Memory Bandwidth

7 min read

Geek Labs

May 13, 2026 · Artificial Intelligence

Two LLM Inference Acceleration Projects: A Mac‑Local Engine vs a Data‑Center Engine

This article compares two recent GitHub LLM inference engines—ds4.c, a Metal‑optimized engine for DeepSeek V4 Flash on Apple Silicon Macs, and TokenSpeed, a Python/C++‑based, data‑center‑grade engine for GPU clusters—detailing their design choices, performance numbers, usage instructions, and suitable scenarios.

DeepSeek · GPU · Inference

8 min read

Machine Heart

May 10, 2026 · Artificial Intelligence

Why SRAM Is Key to Overcoming GPU Limits in Inference as Demand Soars

As large‑model inference demand outpaces training, the decode stage hits a memory wall that GPUs cannot efficiently cross; SRAM’s on‑chip bandwidth and low‑energy access open a path forward, though capacity and process limits still pose challenges.

AI hardware · Compute Architecture · GPU

7 min read

SuanNi

Apr 29, 2026 · Artificial Intelligence

Why Google’s Split 8th‑Gen TPU Could Out‑Earn General‑Purpose GPUs

Google’s Cloud Next 2026 reveal splits the 8th‑generation TPU into training‑focused Sunfish and inference‑focused Zebrafish, highlighting Ironwood’s record‑breaking performance, a multi‑vendor supply chain, Anthropic’s multi‑gigawatt order, and a broader industry shift toward custom AI chips that promise far higher profit margins than generic GPUs.

AI · Custom ASIC · Google

8 min read

Code Mala Tang

Apr 25, 2026 · Artificial Intelligence

Why Claude Feels Nerfed Without a Formal Downgrade: A Deep Dive into System‑Level Performance Changes

The article examines the recent Claude performance controversy, showing that engineering adjustments to inference parameters, cache handling, and system prompts rewrote the model’s behavior, making it answer faster but think shallower, leading users to perceive a degradation despite no official model downgrade.

AI · Cache · Claude

14 min read

Machine Heart

Apr 23, 2026 · Artificial Intelligence

Google's TPU 8t and 8i: Training Powerhouse vs. Inference Specialist

Google unveiled its eighth‑generation TPU line at Cloud Next 2026, introducing the training‑focused TPU 8t with a 2.7× performance boost and massive scaling, and the inference‑optimized TPU 8i featuring three times more on‑chip SRAM and an 80% performance uplift for agentic AI workloads, while positioning the chips as a complement to Nvidia's offerings rather than a replacement.

AI hardware · Agentic AI · Google Cloud

9 min read

Ray's Galactic Tech

Apr 18, 2026 · Operations

How to Build a Resilient GPU Inference Autoscaling System on Kubernetes

This article explains why scaling GPU inference services on Kubernetes is challenging and presents a multi‑layer control architecture, metric upgrades, and production‑ready implementations using HPA, KEDA, KServe, and Karpenter to achieve stable, cost‑effective autoscaling.

GPU · HPA · Inference

29 min read

Architects' Tech Alliance

Apr 16, 2026 · Industry Insights

Why Inference, Not Training, Will Dominate the AI Chip Race by 2026

By 2026 inference will consume over 70% of AI compute, prompting a shift from GPU‑centric training to specialized, low‑latency, low‑cost inference chips, with Nvidia, Google, Amazon, Microsoft, Intel and newcomers like Groq and CoreWeave racing to capture the new battlefield.

AI chips · GPU · Hardware

10 min read

AI Tech Publishing

Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

Fine-tuning · Inference · LLM

13 min read

Baidu Intelligent Cloud Tech Hub

Mar 18, 2026 · Artificial Intelligence

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details the vLLM‑Kunlun open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU, covering platform overview, model‑porting workflow, plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.

AI · Hardware · Inference

12 min read

AI Info Trend

Mar 16, 2026 · Industry Insights

What 2025’s AI Landscape Reveals: Five Game-Changing Trends

The 2025 State of AI report from Artificial Analysis outlines five core trends (intensified competition, the rise of autonomous agents, native speech models, mainstream reasoning models, and booming image/video generation), showing how costs have plummeted, capabilities have surged, and AI is reshaping every industry.

2025 · AI · Cost reduction

9 min read

SuanNi

Mar 14, 2026 · Industry Insights

How Meta’s MTIA Chips Achieved 25× Compute Boost in Just Two Years

This article analyzes Meta's rapid evolution of four generations of MTIA AI chips, detailing how modular hardware, inference‑first design, deep software integration, and aggressive iteration cycles delivered up to 30 PFLOPs of performance and dramatically reshaped the AI compute landscape.

AI chips · Hardware acceleration · Industry analysis

13 min read

Ops Community

Mar 13, 2026 · Backend Development

How to Diagnose and Fix Slow LLM Inference: A Full‑Stack Performance Guide

This article presents a comprehensive, step‑by‑step methodology for troubleshooting and optimizing large‑language‑model inference performance, covering GPU, CPU, memory, network, configuration, and application layers, with concrete benchmark scripts, diagnostic commands, and real‑world case studies.

CPU · Debugging · GPU

48 min read

Woodpecker Software Testing

Mar 1, 2026 · Artificial Intelligence

Automating Regression Tests for TensorRT Inference Services

The article outlines a comprehensive, repeatable regression testing framework for TensorRT inference pipelines, covering engine build validation, functional correctness against golden outputs, performance monitoring, common pitfalls, and CI/CD integration to ensure model updates remain both fast and reliable.

Automated Testing · INT8 Quantization · Inference

12 min read

Old Zhang's AI Learning

Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains its hybrid reasoning and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

Inference · MoE · large language model

14 min read

Machine Learning Algorithms & Natural Language Processing

Feb 26, 2026 · Artificial Intelligence

Why Longer Token Chains Don't Mean Better Reasoning: Google's Deep Thinking Ratio

Google’s recent study shows that the length of a model’s token chain is negatively correlated with reasoning accuracy, and introduces the Deep Thinking Ratio (DTR) metric to identify truly reasoning tokens, enabling the Think@n strategy to halve compute cost without sacrificing performance.

Deep Thinking Ratio · Inference · LLM

6 min read

AI Tech Publishing

Feb 6, 2026 · Artificial Intelligence

2026 Large Model Engineering Roadmap: From Foundations to Production

This roadmap outlines a step‑by‑step learning path for building, optimizing, and safely deploying large language model systems, covering fundamentals, vector stores, RAG, advanced techniques, fine‑tuning, inference speed, deployment, observability, agents, and production safeguards.

Deployment · Fine-tuning · Inference

5 min read

AI Waka

Feb 1, 2026 · Artificial Intelligence

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

GPU · Inference · LLM

20 min read

Baidu Intelligent Cloud Tech Hub

Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPO · Inference · Kunlun P800

32 min read

AI Cyberspace

Jan 26, 2026 · Artificial Intelligence

How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

This article explains the NVFP4 4‑bit floating‑point quantization technique, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance across NVFP4, AWQ and INT8 quantizations, and provides practical profiling commands for NVIDIA DGX systems.

Inference · LLM · NVFP4

23 min read

Alibaba Cloud Infrastructure

Jan 21, 2026 · Artificial Intelligence

Boost LLM Performance: Deploy Qwen3‑235B with PD‑Separation, MoE, SGLang & RBG

This article details how to deploy the 235‑billion‑parameter Qwen3‑235B model using PD‑separation and MoE techniques, explains the associated challenges, and demonstrates a production‑grade solution built on the high‑performance SGLang inference engine and the RoleBasedGroup (RBG) orchestration framework, complete with benchmark results and best‑practice YAML examples.

AI · Inference · Kubernetes

21 min read

MaGe Linux Operations

Jan 18, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference on Kubernetes with GPU Autoscaling

This guide walks through building a production‑grade Kubernetes GPU cluster for large language model inference, covering hardware sizing, GPU resource scheduling, model storage options, automated scaling with HPA, health checks, monitoring, troubleshooting, and multi‑model deployment strategies.

Docker · GPU · Inference

49 min read

AI Engineering

Jan 17, 2026 · Artificial Intelligence

Can Tiny LLMs Compute Accurately? WorldModel‑Qwen Inference‑Time WASM Execution

The article details how the small Qwen‑0.6B model was adapted to generate and run WebAssembly code during inference, achieving deterministic calculations and revealing both the promise and current limitations of integrating world‑model reasoning into tiny LLMs.

Inference · LLM · Qwen-0.6B

5 min read

Fun with Large Models

Jan 14, 2026 · Artificial Intelligence

Understanding Large Language Model Files: Structure, Tokens, and Inference with Qwen3

This article walks through the complete workflow of loading and running the open‑source Qwen3‑8B model, explaining each core file (weights, config, generation config, tokenizer), how the model tokenizes input, applies chat templates, generates responses, and decodes output, all illustrated with code and diagrams.

Inference · ModelScope · Python

16 min read

MaGe Linux Operations

Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPU · Inference · LLM

48 min read

Alibaba Cloud Infrastructure

Dec 22, 2025 · Artificial Intelligence

Boost LLM Inference with KV‑Cache‑Aware Routing on Alibaba Cloud ACK GIE

This article explains why KV‑Cache hit rate is critical for large‑model inference, describes vLLM's automatic prefix caching, outlines the distributed cache challenges, and provides a step‑by‑step guide to deploying Alibaba Cloud ACK Gateway with Inference Extension's precise‑mode prefix‑cache‑aware routing, backed by benchmark results.

Alibaba Cloud · Inference · KV cache

18 min read

Baidu Intelligent Cloud Tech Hub

Dec 10, 2025 · Artificial Intelligence

Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU instantly without modifying vLLM core code, dramatically shortening migration time, providing high‑performance fusion operators, and offering open‑source tools for precision verification and profiling.

Inference · Kunlun · LLM

8 min read

DataFunSummit

Nov 22, 2025 · Artificial Intelligence

Breaking the Recommendation Filter Bubble: Alibaba 1688’s Inference‑Driven AI

Alibaba’s 1688 platform leverages inference‑based large language models to enhance recommendation discovery, addressing the filter‑bubble problem by analyzing long‑term buyer behavior, compressing extensive activity streams, generating nuanced demand queries, and integrating multimodal data and market trend agents to deliver more diverse, explainable product suggestions for business (B2B) buyers.

AI · E‑commerce · Inference

23 min read

Code Wrench

Oct 16, 2025 · Artificial Intelligence

Build a Go‑Powered Stock Trend Predictor with ONNX Runtime in Minutes

This guide walks you through setting up an Ubuntu environment, training a LightGBM stock‑movement model in Python, exporting it to ONNX, and deploying fast, cross‑platform inference in Go using ONNX Runtime, complete with code snippets and project structure.

AI · Go · Inference

11 min read

AntTech

Oct 9, 2025 · Artificial Intelligence

Ling-1T: The Trillion‑Parameter AI Model Redefining Efficient Reasoning

Ling-1T, a trillion‑parameter flagship non‑thinking model, combines 50 billion active parameters per token, 128 K context, Evo‑CoT reasoning, and FP8 mixed‑precision training to achieve state‑of‑the‑art performance on complex reasoning, code generation, and multimodal tasks while outlining its architecture, benchmarks, limitations, and future roadmap.

AI · Benchmark · FP8

11 min read

Baobao Algorithm Notes

Sep 23, 2025 · Artificial Intelligence

How LongCat-Flash-Thinking Sets New SOTA in Open‑Source AI Inference

LongCat-Flash-Thinking, the latest open‑source model from Meituan, introduces domain‑parallel RL training, a high‑throughput DORA infra, and a dual‑path inference framework that together achieve state‑of‑the‑art performance on logical, mathematical, coding, and agentic tasks while maintaining top‑tier speed.

Benchmark · Inference · LongCat

10 min read

Baidu Intelligent Cloud Tech Hub

Sep 4, 2025 · Artificial Intelligence

Unlocking MoE Model Power: Baidu’s Baige 5.0 AI Platform’s FP8 and Distributed Innovations

Baidu’s Baige 5.0 AI Computing Platform introduces FP8 mixed‑precision training, MoE‑aware distributed strategies, adaptive parallelism, and a three‑tier KV‑Cache, delivering over 30% training speedup and 50% inference throughput gains while keeping token latency under half a second for large‑scale models.

AI · FP8 · Inference

16 min read

Java Architecture Diary

Jul 30, 2025 · Artificial Intelligence

What’s New in LangChain4j 1.2.0? Key AI Features and Enhancements

LangChain4j 1.2.0 introduces a suite of stable modules, advanced inference and thinking capabilities, streaming tool calls, and extensive AI service enhancements, offering developers finer control, lower latency, and richer debugging for LLM‑driven applications.

AI · Inference · Java

7 min read

Architecture Development Notes

Jul 21, 2025 · Artificial Intelligence

Why Rust’s Burn Framework Is Redefining Deep Learning Performance

Burn, a native Rust deep learning framework by Tracel AI, combines extreme flexibility, high computational efficiency, and cross‑platform portability through a modular backend abstraction, type‑safe tensor operations, asynchronous execution, and extensive tooling, offering performance‑competitive alternatives to Python‑based frameworks for both training and inference.

Burn · Deep Learning · GPU

23 min read

AIWalker

Jun 18, 2025 · Artificial Intelligence

Six New Directions for Large Language Models

Large language models are booming, and this article highlights six cutting‑edge research directions—LLM‑plus synthetic data, reward modeling, inference techniques, LLM‑as‑a‑Judge, safety alignment, and long‑context handling—each illustrated with recent papers, experimental results, and links to code repositories.

Inference · LLM · Reward Modeling

9 min read

DataFunTalk

May 23, 2025 · Artificial Intelligence

2025 AI Landscape: Reasoning Models Dominate, Open‑Source Momentum Accelerates

The 2025 Q1 AI report from Artificial Analysis highlights six major trends—including a thousand‑fold drop in inference cost, the rise of MoE models, the growing parity of Chinese open‑source labs, the emergence of autonomous AI agents, native multimodal capabilities, and the trade‑off between performance, cost, and context windows—painting a picture of a rapidly evolving, increasingly competitive AI ecosystem.

AI · Inference · agents

11 min read

Meituan Technology Team

May 8, 2025 · Artificial Intelligence

Building a Mixed OR+ML Inference Framework with TritonServer: Architecture, Challenges, and Solutions

The article describes how a large‑scale dispatch system was re‑engineered with NVIDIA TritonServer to unify GPU‑accelerated operations‑research kernels and deep‑learning models, detailing a three‑stage architecture (in‑process, cross‑process, cross‑node), the performance, stability and memory challenges addressed, and future plans for heterogeneous GPU scaling.

GPU · Inference · Performance Optimization

11 min read

Architect's Alchemy Furnace

May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of seven popular large‑language‑model inference engines—Transformers, vLLM, Llama.cpp, SGLang, MLX, Ollama and others—detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use‑cases, plus practical installation guidance for Xinference.

Inference · LLM · MLX

17 min read

ITPUB

Apr 13, 2025 · Operations

How Cursor Scaled Its AI Code Editor: Lessons from Indexing to Object Storage

Cursor, the AI‑powered code editor, grew to handle billions of document queries and over a hundred‑million model calls daily, prompting a multi‑stage infrastructure overhaul that moved from a failing YugaByte setup to PostgreSQL RDS, then to object‑storage‑backed databases, while tackling indexing, inference scaling, and cold‑start challenges.

AI · Inference · Infrastructure

11 min read

Baidu Geek Talk

Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

DeepSeek-VL2 · Inference · Mixture of Experts

36 min read

Architect's Alchemy Furnace

Mar 31, 2025 · Artificial Intelligence

How to Deploy and Run Large Language Models with Xinference: A Step‑by‑Step Guide

Xinference is a powerful distributed inference framework that enables quick deployment and efficient serving of open‑source large language models via Docker or source installation, offering Web UI, CLI, and API interfaces with detailed setup, model launching, and Chatbox integration instructions.

API · Docker · Inference

11 min read

DataFunSummit

Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu's ZhiLight Large‑Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

The article summarizes Zhihu's machine‑learning platform lead Wang Xin's presentation on the ZhiLight large‑model inference framework, covering model execution mechanisms, GPU workload analysis, pipeline and tensor parallelism, GPU architecture evolution, open‑source engine comparisons, ZhiLight's compute‑communication overlap and quantization optimizations, benchmark results, supported models, and future directions.

GPU · Inference · LLM

13 min read

Alibaba Cloud Developer

Mar 11, 2025 · Artificial Intelligence

How to Deploy the Open‑Source QwQ‑32B Reasoning Model on Alibaba Cloud CAP

This guide walks you through deploying the open‑source QwQ‑32B reasoning model using Alibaba Cloud's Serverless AI platform CAP, covering benchmark highlights, preparation steps, two deployment methods (application template and model service), verification, and project cleanup.

AI Model Deployment · Alibaba Cloud · CAP

7 min read

Alibaba Cloud Infrastructure

Mar 9, 2025 · Cloud Computing

Deploy QwQ-32B LLM Inference on Alibaba Cloud ACS with vLLM: Step‑by‑Step Guide

This guide walks you through using Alibaba Cloud Container Compute Service (ACS) to provision GPU resources, prepare the QwQ-32B model, configure persistent storage, deploy the model with vLLM, set up OpenWebUI, verify the service, and optionally benchmark its performance, all with detailed commands and YAML examples.

ACS · Alibaba Cloud · Benchmark

17 min read

Alibaba Cloud Infrastructure

Mar 8, 2025 · Artificial Intelligence

Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.

ACK · Benchmark · Inference

17 min read

JD Retail Technology

Mar 4, 2025 · Artificial Intelligence

JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

AI · Distributed Training · GPU

19 min read

JD Tech Talk

Mar 3, 2025 · Artificial Intelligence

AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

AI · Distributed Training · GPU

20 min read

JD Cloud Developers

Mar 3, 2025 · Artificial Intelligence

How JD.com Leverages Domestic NPU Chips to Power Large‑Scale AI Models

This article details JD.com's challenges and solutions for deploying domestic NPU chips across heterogeneous GPU‑NPU clusters, covering architecture, scheduling, high‑performance training and inference engines, real‑world case studies, and future plans to scale AI workloads securely and efficiently.

AI · Domestic Chips · Inference

19 min read

Architect

Feb 27, 2025 · Artificial Intelligence

Understanding Reasoning Large Language Models: DeepSeek‑R1 and the Rise of Test‑Time Computation

This article explains how reasoning‑oriented large language models such as DeepSeek‑R1 and OpenAI o1‑mini shift AI research from training‑time scaling to test‑time computation, detailing the underlying principles, new scaling laws, verification techniques, reinforcement‑learning pipelines, and practical methods for distilling reasoning capabilities into smaller models.

DeepSeek-R1 · Inference · large language models

18 min read

Baobao Algorithm Notes

Feb 25, 2025 · Artificial Intelligence

FlashMLA vs FlashInfer: DeepSeek Inference Performance Benchmarks Revealed

The author benchmarks DeepSeek's FlashMLA against FlashInfer and several Triton-based implementations, detailing setup challenges, decode‑only bandwidth results, and observations that the official DeepSeek version leads while Triton optimizations show mixed performance across different head sizes.

AI · Benchmark · DeepSeek

6 min read

Java Architecture Diary

Feb 24, 2025 · Artificial Intelligence

Run Large Language Models Directly in Java with Jlama – Quick Start Guide

This article introduces Jlama, an open‑source Java LLM inference engine, outlines its key features, provides step‑by‑step CLI and Maven integration instructions, shows code examples, run logs, and special setup notes for using large language models efficiently within Java applications.

AI · Inference · Jlama

6 min read

Alibaba Cloud Developer

Feb 14, 2025 · Artificial Intelligence

Unlock Faster LLM Inference: Full Stack of Chips, Frameworks & Services

The article examines the end‑to‑end architecture for large‑model inference, detailing seven layers—from chip hardware and programming toolkits to deep‑learning frameworks, inference accelerators, model providers, compute platforms, application orchestration, and traffic management—highlighting key vendors, open‑source projects, and performance‑optimizing techniques.

AI hardware · Inference · LLM

12 min read

Baidu Geek Talk

Feb 12, 2025 · Artificial Intelligence

Deploy DeepSeek, Llama, Qwen Models Fast on Baidu Baige AI Heterogeneous Platform

This guide walks you through creating a lightweight compute instance, adding it to the Baidu Baige AI heterogeneous computing platform, deploying the vLLM tool, and loading and serving small‑scale dense models such as DeepSeek, Llama and Qwen. It also provides recommended configuration lists for low‑cost, high‑performance inference.

AI Model Deployment · Baidu Baige · Cloud AI

0 likes · 3 min read

DevOps

Feb 9, 2025 · Artificial Intelligence

DeepSeek’s Impact on the Large Model Ecosystem and the Resurgence of AI PCs

The article examines DeepSeek’s rapid rise, its open‑source R1 model and distilled variants, the resurgence of AI PCs, hardware support from Nvidia, AMD and others, and how this ecosystem is reshaping personal AI experiences and the broader large‑model landscape.

AI PC · DeepSeek · Hardware

0 likes · 11 min read

Alibaba Cloud Infrastructure

Feb 8, 2025 · Artificial Intelligence

Deploying a Production‑Ready DeepSeek‑R1 Inference Service on Alibaba Cloud ACK with KServe

This guide explains how to deploy a production‑ready DeepSeek‑R1 inference service on Alibaba Cloud ACK using KServe, covering model preparation, storage configuration, service deployment, observability, autoscaling, model acceleration, gray‑release and GPU‑shared inference.

DeepSeek · GPU · Inference

0 likes · 13 min read

AIWalker

Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization · Inference · Multimodal AI

0 likes · 13 min read

Alibaba Cloud Infrastructure

Jan 31, 2025 · Cloud Computing

How to Deploy DeepSeek‑R1 on Alibaba Cloud Compute Nest in Minutes

This guide walks you through deploying the open‑source DeepSeek‑R1 inference model on Alibaba Cloud's Compute Nest platform, covering service creation, instance configuration, login procedures, and API calls with sample curl commands for text generation and chat.

AI model · Alibaba Cloud · Compute Nest

0 likes · 4 min read

Alibaba Cloud Infrastructure

Jan 17, 2025 · Artificial Intelligence

Elastic Scaling of Large Language Model Inference on Alibaba Cloud ACK with Knative, ResourcePolicy, and Fluid

This article explains how to reduce inference cost and improve performance for large language models on Alibaba Cloud ACK by using Knative's request‑based autoscaling, custom ResourcePolicy priority scheduling, and Fluid data‑caching to achieve elastic scaling, resource pre‑emption, and faster model loading.

Fluid · Inference · Knative

0 likes · 22 min read

Baobao Algorithm Notes

Dec 31, 2024 · Artificial Intelligence

Can China’s GLM‑Zero‑Preview Beat OpenAI’s o3? A Deep Dive into Inference Model Tests

The article evaluates the Chinese GLM‑Zero‑Preview inference model by subjecting it to a wide range of math, logic, language, coding, and multimodal questions, compares its token efficiency and reasoning style to other models, and discusses its current strengths, limitations, and public availability.

AI benchmarking · GLM-Zero · Inference

0 likes · 9 min read

Architects' Tech Alliance

Dec 25, 2024 · Artificial Intelligence

Performance Analysis of NVIDIA H20 and L20 AI Inference Chips

This article evaluates NVIDIA's China‑specific H20 and L20 inference chips, comparing their compute and memory‑bandwidth characteristics against A100, H100 and H200, and shows how they achieve superior throughput in large‑model inference despite reduced specifications.

AI · GPU · H20

0 likes · 6 min read

Baobao Algorithm Notes

Oct 10, 2024 · Artificial Intelligence

How MCTS Powers Inference in OpenAI’s o1: A Deep Dive with rStar

This article explains how the inference component of OpenAI’s o1 model can be implemented using Monte‑Carlo Tree Search, detailing the action space, rollout process, UCT scoring, and best‑path selection, with a concrete walkthrough of Microsoft’s open‑source rStar code.

Inference · MCTS · OpenAI o1

0 likes · 26 min read

Architect

Sep 28, 2024 · Artificial Intelligence

How Does OpenAI’s o1 Model Leverage Self‑Play RL and New Scaling Laws?

The article provides an in‑depth technical analysis of OpenAI’s multimodal o1 model, explaining its self‑play reinforcement‑learning pipeline, the novel train‑time and test‑time compute scaling laws, its long‑think reasoning abilities demonstrated through a cipher example, and speculative architectures for generator‑verifier systems.

Inference · OpenAI · RL scaling

0 likes · 35 min read

Baobao Algorithm Notes

Sep 28, 2024 · Artificial Intelligence

Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization

This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.

DPO · Inference · KV cache

0 likes · 32 min read

JD Retail Technology

Aug 30, 2024 · Artificial Intelligence

GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.

Distributed Systems · GPU Optimization · Inference

0 likes · 13 min read

DataFunSummit

Aug 12, 2024 · Artificial Intelligence

Design and Application of Xiaohongshu Heterogeneous Training and Inference Engine

This article presents a comprehensive overview of Xiaohongshu's heterogeneous training and inference engine, covering the challenges of model engineering, the design of elastic heterogeneous engines, future HPC training frameworks, AI compilation techniques, and a forward‑looking outlook on scalability and performance.

AI · AI Compilation · HPC

0 likes · 19 min read

NewBeeNLP

Jul 24, 2024 · Industry Insights

From Black Iron to Silver: The Evolution of Large Model Infrastructure (2019‑2024)

The article traces the evolution of large‑model training and inference infrastructure from the early “black‑iron” era (2019‑2021) through the “golden” boom (2022‑2023) to the emerging “silver” phase (2024‑), highlighting key research breakthroughs, open‑source frameworks, hardware trends, market dynamics, and practical challenges for engineers entering the field.

AI Infrastructure · Inference · Large Model

0 likes · 22 min read

Architects' Tech Alliance

Jun 22, 2024 · Artificial Intelligence

Rising Compute Demand of Generative AI Models and GPU Accelerator Trends in 2024

The article analyzes how generative AI models from GPT‑1 to the upcoming GPT‑5 are driving exponential growth in compute requirements, prompting massive cloud capital expenditures and intense competition among GPU vendors such as NVIDIA, AMD, Google, and emerging domestic chip makers, while also highlighting interconnect innovations and cost‑effective solutions.

AI · Accelerators · Compute

0 likes · 12 min read

DataFunTalk

May 18, 2024 · Artificial Intelligence

Tencent FinTech AI Development Platform: Architecture, Challenges, and Solutions

This article details the background, goals, and evolution of Tencent's FinTech AI development platform, outlines the technical challenges faced in feature engineering, model training, and inference services, and presents the comprehensive solutions and future plans implemented to improve efficiency, stability, and scalability.

Cloud Native · FinTech · Inference

0 likes · 13 min read

Top Architect

Apr 18, 2024 · Artificial Intelligence

Understanding Transformers: Architecture, Attention Mechanism, Training and Inference

This article provides a comprehensive overview of Transformer models, covering their attention-based architecture, encoder-decoder structure, training procedures including teacher forcing, inference workflow, advantages over RNNs, and various applications in natural language processing such as translation, summarization, and classification.

Attention Mechanism · Deep Learning · Inference

0 likes · 11 min read

DataFunSummit

Apr 10, 2024 · Artificial Intelligence

Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, describing the prefill and decoding stages, key performance metrics such as throughput, latency and QPS, and detailing a series of system-level optimizations—including pipeline parallelism, dynamic batching, KV‑cache quantization, and hardware considerations—to significantly improve inference efficiency on modern GPUs.

GPU · Inference · Latency

0 likes · 23 min read

Xiaohongshu Tech REDtech

Apr 10, 2024 · Artificial Intelligence

Early‑Stopping Self‑Consistency (ESC): Reducing Sampling Cost for Large Language Model Reasoning

Early‑Stopping Self‑Consistency (ESC) dynamically halts sampling once a sliding‑window answer distribution reaches zero entropy, cutting the number of required LLM reasoning samples by 34‑84 % across arithmetic, commonsense, and symbolic benchmarks while preserving accuracy and offering a theoretically‑bounded, robust, budget‑adaptive alternative to traditional Self‑Consistency.

AI · Early Stopping · Inference

0 likes · 14 min read

Architect

Mar 26, 2024 · Artificial Intelligence

Why Transformers Outperform RNNs: A Deep Dive into Architecture and Training

This article explains the Transformer model’s core architecture, self‑attention mechanism, encoder‑decoder workflow, training with teacher forcing, inference steps, and why it surpasses RNNs and CNNs, while also outlining its major NLP applications.

Attention Mechanism · Inference · Model Training

0 likes · 14 min read

Open Source Tech Hub

Mar 25, 2024 · Artificial Intelligence

Quick Guide to Using ModelScope Library for Multi‑Modal AI Model Inference

This article introduces the ModelScope Python library, explains its support for PyTorch and TensorFlow, and provides step‑by‑step code examples for loading models, running inference pipelines on text and images, handling batch inputs, and customizing preprocessors in both Python and PHP.

AI · Inference · ModelScope

0 likes · 14 min read

DataFunTalk

Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · Inference

0 likes · 20 min read

Huawei Cloud Developer Alliance

Dec 14, 2023 · Artificial Intelligence

Unlocking LLaMA: Key Innovations, Architecture Insights, and MindSpore Inference Guide

This article reviews the LLaMA large‑language‑model series, covering its background, architectural innovations such as Add&Norm, SwiGLU, and RoPE, a known reversal‑curse bug, and provides step‑by‑step MindSpore Transformers code for model configuration, inference, and pipeline usage while previewing the upcoming LLaMA‑2 session.

AI · Inference · LLaMA

0 likes · 6 min read

DataFunTalk

Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU Optimization · Inference · Model Serving

0 likes · 16 min read

Architects' Tech Alliance

Nov 19, 2023 · Artificial Intelligence

NVIDIA H100 vs L40S: AI‑Focused GPU Comparison and Practical Alternatives

This article compares NVIDIA's high‑end AI GPUs—H100, A100, and the newer L40S—detailing their specifications, performance trade‑offs, pricing, availability, and suitability for training and inference workloads, while highlighting why L40S can be a cost‑effective alternative for many enterprises.

AI · GPU · H100

0 likes · 10 min read

JD Tech

Aug 4, 2023 · Artificial Intelligence

Deploying and Evaluating the Vicuna Open‑Source Large Language Model on a Single Machine

This article details a step‑by‑step guide to deploying the Vicuna open‑source LLM on a single server, covering model preparation, environment setup, dependency installation, GPU and CUDA configuration, inference commands, performance evaluation, and attempted fine‑tuning, while sharing practical observations and results.

Fine‑tuning · GPU · Inference

0 likes · 16 min read

JD Tech

Jul 31, 2023 · Artificial Intelligence

Local Deployment, Fine‑tuning, and Inference of the Open‑source Alpaca‑LoRA Model on GPU Servers

This article details the step‑by‑step process of installing GPU drivers, setting up a Python environment, deploying the open‑source Alpaca‑LoRA large language model, fine‑tuning it with Chinese data on a multi‑GPU server, and running inference, while discussing practical challenges and performance observations.

Alpaca · Fine-tuning · GPU

0 likes · 14 min read

High Availability Architecture

Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with Nvidia Triton dramatically improves throughput, latency, and GPU utilization.

AI Optimization · CUDA · Inference

0 likes · 10 min read

JD Retail Technology

May 18, 2023 · Artificial Intelligence

Local Deployment, Inference, and Fine‑tuning of the Vicuna‑7B Large Language Model

This article details the step‑by‑step process of preparing the environment, merging weights, installing dependencies, running inference, evaluating Vicuna‑7B against other models, and attempting fine‑tuning, while highlighting performance results, encountered issues, and future work for large language model deployment.

Fine-tuning · GPU · Inference

0 likes · 11 min read

DataFunTalk

Mar 31, 2023 · Artificial Intelligence

Estimating the Resource and Cost Requirements for Large Language Model Training and Inference

The article analyses the computational resources, hardware costs, and human investment needed to train and serve large language models such as GPT‑3, discusses practical cost calculations, highlights the challenges faced by Chinese AI teams, and argues for sustained, long‑term funding to achieve meaningful breakthroughs.

AI Infrastructure · China AI · Inference

0 likes · 14 min read

58 Tech

Dec 21, 2021 · Artificial Intelligence

dl_inference: Open‑Source Deep Learning Inference Service with TensorRT and MKL Acceleration

dl_inference is an open‑source, production‑grade deep learning inference platform that supports TensorFlow, PyTorch and Caffe models, offering GPU and CPU deployment, TensorRT and MKL acceleration, multi‑node load balancing, and extensive Q&A on model conversion, hardware requirements, INT8 quantization, and performance gains.

CPU · GPU · Inference

0 likes · 8 min read

58 Tech

Dec 8, 2021 · Artificial Intelligence

dl_inference: A General Deep Learning Inference Service with TensorRT and Intel MKL Acceleration

The article introduces dl_inference, an open‑source deep learning inference platform that integrates TensorRT GPU acceleration, Intel MKL CPU optimization, and Caffe support, detailing its features, model conversion workflow, deployment steps, performance gains, and how developers can contribute.

Inference · Intel MKL · TensorRT

0 likes · 12 min read

DataFunTalk

Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AI · GPU utilization · Inference

0 likes · 4 min read

WeChat Backend Team

Jun 7, 2021 · Artificial Intelligence

How WeChat’s TFCC Boosts Deep Learning Inference Performance Across Platforms

The TFCC framework, developed by WeChat's backend team, delivers high‑performance, easy‑to‑use, and universal deep‑learning inference by supporting numerous ONNX and TensorFlow operations, optimizing model structures, constants, and operators, and providing a versatile runtime and math library for both CPU and GPU platforms.

Deep Learning · Framework · Inference

0 likes · 8 min read

DataFunSummit

Dec 14, 2020 · Artificial Intelligence

LightSeq: High‑Performance Open‑Source Inference Engine for Transformers, GPT and Other NLP Models

This article introduces LightSeq, an open‑source, GPU‑accelerated inference engine that dramatically speeds up Transformer‑based models such as BERT and GPT by up to 14× over TensorFlow, supports multiple decoding strategies, integrates seamlessly with major deep‑learning frameworks, and provides detailed performance benchmarks and technical optimizations.

Deep Learning · GPU · Inference

0 likes · 15 min read

JD Tech Talk

Nov 16, 2020 · Artificial Intelligence

Practical Guide to Deploying Federated Learning: Architecture, Deployment, Training, and Inference

This article provides a comprehensive overview of federated learning engineering, covering deployment via Docker containers, the design of training and inference frameworks, key services such as communication, training, model management, and registration, and practical considerations for scaling and reliability in production environments.

AI · Deployment · Docker

0 likes · 11 min read

JD Tech Talk

Nov 13, 2020 · Artificial Intelligence

Practical Engineering Guide to Federated Learning: Deployment, Training, and Inference

This article provides a comprehensive engineering overview of federated learning, covering its core distributed‑learning concept, Docker‑based deployment, detailed training‑service architecture with validation, scheduling, metadata, and model‑management components, as well as a complete inference framework and workflow for production use.

AI Engineering · Distributed Systems · Docker

0 likes · 12 min read

Didi Tech

Jul 5, 2019 · Artificial Intelligence

How Didi’s Jianshu Machine Learning Platform Boosts AI Development Efficiency

An in‑depth look at Didi’s Jianshu Machine Learning Platform reveals its end‑to‑end AI workflow—from experiment environments and batch training to high‑availability online serving—highlighting resource‑efficient Kubernetes scheduling, Docker‑based reproducible environments, a custom parameter server, and the IFX inference engine that together accelerate development, training, and deployment.

AIPlatform · Docker · Inference

0 likes · 11 min read
