Tag: Inference

DataFunTalk
May 23, 2025 · Artificial Intelligence

2025 AI Landscape: Inference Models Dominate, Open‑Source Momentum Accelerates

The 2025 Q1 AI report from Artificial Analysis highlights six major trends: a thousand‑fold drop in inference cost, the rise of MoE models, the growing parity of Chinese open‑source labs, the emergence of autonomous AI agents, native multimodal capabilities, and the trade‑off between performance, cost, and context windows. Together, these trends paint a picture of a rapidly evolving, increasingly competitive AI ecosystem.

AI · Agents · Inference
11 min read
Baidu Geek Talk
Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture. It combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention. Trained in three stages on diverse visual‑language and text data, it achieves strong results on benchmarks such as DocVQA and TextVQA, with the full implementation and inference code available in PaddleMIX.
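
As a rough illustration of the dynamic‑tiling idea, the sketch below splits a high‑resolution image into fixed‑size tiles whose grid best matches the image's aspect ratio, plus a downsampled global thumbnail. The 384‑pixel tile size and the grid‑selection heuristic are assumptions for illustration, not DeepSeek‑VL2's exact preprocessing.

```python
# Illustrative dynamic tiling: pick the tile grid closest to the image's
# aspect ratio, resize, cut into fixed-size tiles, and prepend a global
# thumbnail. Tile size and candidate grids are assumptions, not the
# model's actual configuration.
from PIL import Image

TILE = 384  # assumed per-tile resolution for the vision encoder

def pick_grid(w, h, max_tiles=9):
    """Choose the (cols, rows) grid whose aspect ratio best matches the image."""
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            diff = abs(w / h - cols / rows)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

def tile_image(img: Image.Image):
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((TILE, TILE))  # global view fed alongside the tiles
    return [thumbnail] + tiles
```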

Code · DeepSeek-VL2 · Inference
36 min read
DataFunSummit
Mar 14, 2025 · Artificial Intelligence

Insights from Zhihu's ZhiLight Large‑Model Inference Framework: Architecture, Parallelism, and Performance Optimizations

This article summarizes a presentation by Wang Xin, Zhihu's machine‑learning platform lead, on the ZhiLight large‑model inference framework. It covers model execution mechanisms, GPU workload analysis, pipeline and tensor parallelism, the evolution of GPU architectures, comparisons with open‑source engines, ZhiLight's compute‑communication overlap and quantization optimizations, benchmark results, supported models, and future directions.
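
To make the tensor‑parallelism piece concrete, here is a minimal numpy sketch of a column‑parallel linear layer: the weight matrix is split column‑wise across devices, each device computes its own output slice, and the slices are gathered back. The shapes and in‑process "devices" are illustrative, not ZhiLight's implementation.

```python
# Column-parallel linear layer: split W's columns across devices, compute
# partial outputs, then concatenate (an all-gather on real multi-GPU setups).
import numpy as np

def column_parallel_linear(x, W, n_devices):
    shards = np.split(W, n_devices, axis=1)   # one column shard per device
    partial = [x @ w for w in shards]         # runs in parallel across devices
    return np.concatenate(partial, axis=-1)   # gather the output slices

x = np.random.randn(4, 1024)     # (batch, hidden)
W = np.random.randn(1024, 4096)  # (hidden, 4 * hidden), e.g. an MLP up-projection
assert np.allclose(column_parallel_linear(x, W, n_devices=4), x @ W)
```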

GPU · Inference · LLM
13 min read
Alibaba Cloud Infrastructure
Mar 8, 2025 · Artificial Intelligence

Deploying QwQ-32B LLM with vLLM on Alibaba Cloud ACK and Configuring Intelligent Routing

This guide explains how to deploy the QwQ-32B large language model using vLLM on an Alibaba Cloud ACK Kubernetes cluster, configure storage, set up OpenWebUI, enable ACK Gateway with AI Extension for intelligent routing, and benchmark the inference service performance.
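
Since vLLM exposes an OpenAI‑compatible API, a quick smoke test of the deployed service can look like the sketch below; the gateway address and served‑model name are placeholders for your own deployment.

```python
# Smoke-test a vLLM deployment through its OpenAI-compatible endpoint.
# The base_url and model name below are placeholders for your cluster.
from openai import OpenAI

client = OpenAI(
    base_url="http://<gateway-address>/v1",  # hypothetical ACK gateway endpoint
    api_key="EMPTY",                         # vLLM does not check the key by default
)

resp = client.chat.completions.create(
    model="qwq-32b",                         # placeholder served-model name
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```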

ACK · Inference · Kubernetes
17 min read
JD Retail Technology
Mar 4, 2025 · Artificial Intelligence

JD Retail End-to-End AI Engine Compatible with GPU and Domestic NPU: Architecture, Optimization, and Applications

JD Retail’s Nine‑Number Algorithm Platform delivers an end‑to‑end AI engine that unifies GPU and domestic NPU resources across a thousand‑card cluster, offering zero‑cost model migration, optimized training and inference pipelines, support for over 40 LLM and multimodal models, and proven business‑level performance that reduces dependence on overseas chips.

AI · GPU · Inference
19 min read
JD Tech Talk
Mar 3, 2025 · Artificial Intelligence

AI Engine Technology Based on Domestic Chips for JD Retail

This article describes JD Retail's AI engine built on domestic NPU chips, covering challenges, heterogeneous GPU‑NPU scheduling, high‑performance training and inference engines, extensive model support, real‑world deployment cases, and future plans for large‑scale chip clusters and ecosystem development.

AI · GPU · Inference
20 min read
Architect
Feb 27, 2025 · Artificial Intelligence

Understanding Reasoning Large Language Models: DeepSeek‑R1 and the Rise of Test‑Time Computation

This article explains how reasoning‑oriented large language models such as DeepSeek‑R1 and OpenAI o1‑mini shift AI research from training‑time scaling to test‑time computation, detailing the underlying principles, new scaling laws, verification techniques, reinforcement‑learning pipelines, and practical methods for distilling reasoning capabilities into smaller models.
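
A minimal sketch of the test‑time‑computation idea: sample N candidate solutions and keep the one a verifier scores highest. Both generate and verifier_score below are hypothetical stand‑ins for an LLM sampler and a verifier model.

```python
# Best-of-N at test time: spend compute on sampling many candidate reasoning
# traces, then let a verifier pick the winner. Both helpers are stand-ins.
import random

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Stand-in for sampling one chain-of-thought + answer from an LLM."""
    return f"answer-{random.randint(0, 3)}"

def verifier_score(prompt: str, candidate: str) -> float:
    """Stand-in for a verifier model scoring a candidate solution."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]  # test-time compute
    return max(candidates, key=lambda c: verifier_score(prompt, c))

print(best_of_n("What is 17 * 24?"))
```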

DeepSeek-R1 · Inference · large language models
18 min read
Java Architecture Diary
Feb 24, 2025 · Artificial Intelligence

Run Large Language Models Directly in Java with Jlama – Quick Start Guide

This article introduces Jlama, an open‑source Java LLM inference engine, outlines its key features, provides step‑by‑step CLI and Maven integration instructions, and walks through code examples, run logs, and special setup notes for running large language models efficiently within Java applications.

AI · Inference · Java
6 min read
DevOps
Feb 9, 2025 · Artificial Intelligence

DeepSeek’s Impact on the Large Model Ecosystem and the Resurgence of AI PCs

The article examines DeepSeek's rapid rise, its open‑source R1 model and distilled variants, the resurgence of AI PCs, hardware support from NVIDIA, AMD and others, and how this ecosystem is reshaping personal AI experiences and the broader large‑model landscape.

AI PC · DeepSeek · Hardware
11 min read
Alibaba Cloud Infrastructure
Feb 8, 2025 · Artificial Intelligence

Deploying a Production‑Ready DeepSeek‑R1 Inference Service on Alibaba Cloud ACK with KServe

This guide explains how to deploy a production‑ready DeepSeek‑R1 inference service on Alibaba Cloud ACK using KServe, covering model preparation, storage configuration, service deployment, observability, autoscaling, model acceleration, canary (gray) releases, and GPU‑sharing inference.
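
For orientation, a minimal sketch of creating a KServe InferenceService with the Kubernetes Python client is shown below; the image, arguments, and resource requests are placeholders, and the ACK‑specific storage, acceleration, and release settings the guide covers are omitted.

```python
# Minimal sketch: create a KServe InferenceService programmatically.
# Image, args, and resources are placeholders for your deployment.
from kubernetes import client, config

config.load_kube_config()

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "deepseek-r1", "namespace": "default"},
    "spec": {
        "predictor": {
            "containers": [{
                "name": "kserve-container",
                "image": "vllm/vllm-openai:latest",          # placeholder image
                "args": ["--model", "/models/DeepSeek-R1"],  # placeholder model path
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }]
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="default", plural="inferenceservices", body=isvc,
)
```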

Alibaba Cloud · DeepSeek · GPU
13 min read
Alibaba Cloud Infrastructure
Jan 17, 2025 · Artificial Intelligence

Elastic Scaling of Large Language Model Inference on Alibaba Cloud ACK with Knative, ResourcePolicy, and Fluid

This article explains how to reduce inference cost and improve performance for large language models on Alibaba Cloud ACK by using Knative's request‑based autoscaling, custom ResourcePolicy priority scheduling, and Fluid data‑caching to achieve elastic scaling, resource pre‑emption, and faster model loading.

Fluid · Inference · Knative
22 min read
Architects' Tech Alliance
Dec 25, 2024 · Artificial Intelligence

Performance Analysis of NVIDIA H20 and L20 AI Inference Chips

This article evaluates NVIDIA's China‑specific H20 and L20 inference chips, comparing their compute and memory‑bandwidth characteristics against A100, H100 and H200, and shows how they achieve superior throughput in large‑model inference despite reduced specifications.
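
The throughput result is easiest to see from a back‑of‑the‑envelope model: single‑stream decoding is memory‑bandwidth‑bound, since each generated token must stream roughly all model weights from HBM once, so tokens/s is capped at bandwidth divided by weight bytes. The bandwidth figures below are commonly cited specs used for illustration, not the article's measured data.

```python
# Upper bound on single-stream decode: tokens/s <= bandwidth / weight_bytes.
# Bandwidths are commonly cited specs (illustrative, not benchmark results).
PARAMS = 14e9        # 14B-parameter model
BYTES_PER_PARAM = 2  # FP16/BF16 weights

weight_bytes = PARAMS * BYTES_PER_PARAM

for gpu, bw_tb_s in {"A100": 2.0, "H100": 3.35, "H20": 4.0}.items():
    tokens_per_s = bw_tb_s * 1e12 / weight_bytes
    print(f"{gpu}: <= {tokens_per_s:.0f} tokens/s per stream")
```

This is why a chip with reduced compute but high memory bandwidth can still lead in decode throughput.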

AI · GPU · H20
6 min read
JD Retail Technology
Aug 30, 2024 · Artificial Intelligence

GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.
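
As a concrete (and deliberately simplified) illustration of multi‑stream execution, the PyTorch sketch below enqueues two independent batches on separate CUDA streams so their kernels can overlap; it shows the mechanism only, not JD's actual implementation.

```python
# Two independent batches on separate CUDA streams, so kernels may overlap.
import torch

assert torch.cuda.is_available()
model = torch.nn.Linear(4096, 4096).cuda().eval()
a = torch.randn(64, 4096, device="cuda")
b = torch.randn(64, 4096, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.no_grad():
    with torch.cuda.stream(s1):
        out_a = model(a)      # enqueued on stream 1
    with torch.cuda.stream(s2):
        out_b = model(b)      # enqueued on stream 2, may overlap with stream 1
torch.cuda.synchronize()      # wait for both streams to finish
```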

Distributed Systems · GPU optimization · Inference
13 min read
DataFunSummit
Aug 12, 2024 · Artificial Intelligence

Design and Application of Xiaohongshu Heterogeneous Training and Inference Engine

This article presents a comprehensive overview of Xiaohongshu's heterogeneous training and inference engine, covering the challenges of model engineering, the design of elastic heterogeneous engines, future HPC training frameworks, AI compilation techniques, and a forward‑looking outlook on scalability and performance.

AI · AI Compilation · HPC
19 min read
Architects' Tech Alliance
Jun 22, 2024 · Artificial Intelligence

Rising Compute Demand of Generative AI Models and GPU Accelerator Trends in 2024

The article analyzes how generative AI models from GPT‑1 to the upcoming GPT‑5 are driving exponential growth in compute requirements, prompting massive cloud capital expenditures and intense competition among GPU vendors such as NVIDIA, AMD, Google, and emerging domestic chip makers, while also highlighting interconnect innovations and cost‑effective solutions.

AI · Accelerators · Compute
12 min read
DataFunTalk
May 18, 2024 · Artificial Intelligence

Tencent FinTech AI Development Platform: Architecture, Challenges, and Solutions

This article details the background, goals, and evolution of Tencent's FinTech AI development platform, outlines the technical challenges faced in feature engineering, model training, and inference services, and presents the comprehensive solutions and future plans implemented to improve efficiency, stability, and scalability.

Artificial Intelligence · FinTech · Inference
13 min read
Top Architect
Apr 18, 2024 · Artificial Intelligence

Understanding Transformers: Architecture, Attention Mechanism, Training and Inference

This article provides a comprehensive overview of Transformer models, covering their attention-based architecture, encoder-decoder structure, training procedures including teacher forcing, inference workflow, advantages over RNNs, and various applications in natural language processing such as translation, summarization, and classification.
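
The centerpiece of that architecture is scaled dot‑product attention, softmax(QKᵀ/√d_k)·V; here it is rendered in plain numpy for a single head.

```python
# Scaled dot-product attention for one head: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

Q = np.random.randn(5, 64)   # 5 query positions, d_k = 64
K = np.random.randn(7, 64)   # 7 key positions
V = np.random.randn(7, 64)
print(attention(Q, K, V).shape)  # (5, 64)
```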

Inference · NLP · Transformer
11 min read
DataFunSummit
Apr 10, 2024 · Artificial Intelligence

Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, describing the prefill and decoding stages, key performance metrics such as throughput, latency and QPS, and detailing a series of system-level optimizations—including pipeline parallelism, dynamic batching, KV‑cache quantization, and hardware considerations—to significantly improve inference efficiency on modern GPUs.
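
A toy sketch of the two stages and the metrics they drive: prefill latency determines time‑to‑first‑token (TTFT), while per‑step decode latency determines tokens per second. The "model" below is a stub, so the timings show structure rather than real benchmarks.

```python
# Prefill vs decode, and the metrics each stage drives. The model is a stub.
import time

def prefill(prompt_tokens):   # processes the whole prompt in one pass
    time.sleep(0.001 * len(prompt_tokens))
    return {"kv_cache": list(prompt_tokens)}

def decode_step(state):       # generates one token, reusing the KV cache
    time.sleep(0.002)
    state["kv_cache"].append("tok")
    return "tok"

prompt = ["t"] * 512
t0 = time.perf_counter()
state = prefill(prompt)
ttft = time.perf_counter() - t0   # time to first token ~= prefill latency

n_new = 64
t1 = time.perf_counter()
for _ in range(n_new):
    decode_step(state)
decode_s = time.perf_counter() - t1

print(f"TTFT: {ttft*1e3:.1f} ms, decode: {n_new/decode_s:.0f} tokens/s")
```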

GPU · Inference · latency
23 min read
Xiaohongshu Tech REDtech
Apr 10, 2024 · Artificial Intelligence

Early‑Stopping Self‑Consistency (ESC): Reducing Sampling Cost for Large Language Model Reasoning

Early‑Stopping Self‑Consistency (ESC) dynamically halts sampling once a sliding‑window answer distribution reaches zero entropy, cutting the number of required LLM reasoning samples by 34–84% across arithmetic, commonsense, and symbolic benchmarks while preserving accuracy, and offering a theoretically bounded, robust, budget‑adaptive alternative to traditional Self‑Consistency.
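
The control loop ESC describes is small enough to sketch directly: draw samples in fixed‑size windows and stop as soon as one window's answers all agree (zero entropy), falling back to a majority vote over everything sampled. sample_answer is a stand‑in for one LLM reasoning sample, and the window and budget sizes are illustrative.

```python
# ESC control loop: sample in windows, stop at the first zero-entropy window,
# answer by majority vote over all samples drawn so far.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in: one sampled reasoning trace reduced to its final answer."""
    return random.choice(["42", "42", "42", "41"])

def esc(question: str, window: int = 5, max_samples: int = 40) -> str:
    answers = []
    while len(answers) < max_samples:
        batch = [sample_answer(question) for _ in range(window)]
        answers.extend(batch)
        if len(set(batch)) == 1:   # zero entropy within the window
            break                  # early stop: further samples are unlikely to flip the vote
    return Counter(answers).most_common(1)[0][0]  # majority vote, as in plain SC

print(esc("What is the answer?"))
```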

AI · Chain-of-Thought · Early-Stopping
14 min read
DataFunTalk
Feb 19, 2024 · Artificial Intelligence

Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, detailing the prefill and decoding stages, key performance metrics such as throughput, latency and QPS, and a series of system-level optimizations—including pipeline parallelism, dynamic batching, specialized attention kernels, virtual memory allocation, KV‑cache quantization, and mixed‑precision strategies—to improve GPU utilization and overall inference efficiency.
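
Of the optimizations listed, KV‑cache quantization is easy to show in isolation: store keys and values as int8 plus a scale, and dequantize on use. The sketch below uses a symmetric per‑tensor scale for simplicity; real engines typically use finer (per‑head or per‑channel) granularity.

```python
# Symmetric per-tensor int8 quantization of a KV-cache tensor (illustrative).
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale               # int8 payload is 2x/4x smaller than FP16/FP32

def dequantize(q, scale):
    return q.astype(np.float32) * scale

k = np.random.randn(8, 128, 64).astype(np.float32)  # (heads, seq, head_dim)
qk, s = quantize_int8(k)
err = np.abs(dequantize(qk, s) - k).max()
print(f"max abs error: {err:.4f}, bytes: {qk.nbytes} vs {k.nbytes}")
```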

GPU · Inference · LLM
24 min read