Insights from Zhihu’s ZhiLight Large Model Inference Framework: Architecture, Parallelism, and Performance Optimizations
This article summarizes Zhihu’s technical talk on the ZhiLight large‑model inference framework, covering model execution mechanisms, GPU load analysis, multi‑GPU parallelism strategies, comparisons with open‑source inference engines, compute‑communication overlap, quantization techniques, benchmark results, and future directions for scalable LLM deployment.