Tagged articles
536 articles
Baidu Tech Salon
Nov 22, 2024 · Artificial Intelligence

How GPU‑Accelerated ANN Search Cuts Costs and Boosts Throughput in High‑Volume Retrieval

This article analyzes a GPU‑based approximate nearest neighbor (ANN) retrieval solution built on NVIDIA's RAFT library, detailing algorithm selection, offline indexing tricks, batch online search design, performance results on a 25‑million‑vector workload, and cost‑saving implications for large‑scale search services.

ANN · GPU · IVF_INT8
0 likes · 21 min read
Architects' Tech Alliance
Nov 21, 2024 · Fundamentals

El Capitan Supercomputer and the Rise of AMD GPU‑Driven HPC: Architecture, Performance, and Market Impact

The article examines the El Capitan supercomputer unveiled at SC24, detailing its AMD CPU‑GPU hybrid architecture, benchmark results, its dominance in the November 2024 Top500 list, and the broader implications for high‑performance computing, AI workloads, and the competitive landscape between AMD and NVIDIA.

AI · AMD · CPU
0 likes · 20 min read
Baidu Geek Talk
Nov 20, 2024 · Artificial Intelligence

Boosting ANN Search with GPU: Inside RAFT’s IVF_INT8 Implementation

This article examines how Baidu and NVIDIA leveraged the open‑source RAFT library to build a GPU‑accelerated approximate nearest neighbor (ANN) retrieval system, detailing algorithm choices, offline indexing, online batch processing, performance results, and practical guidelines for deploying ANN on GPUs.

ANN · GPU · IVF_INT8
0 likes · 20 min read
AntTech
Nov 16, 2024 · Information Security

WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores Accepted at HPCA 2025

Ant Group’s Computing Systems Lab announced that its GPU‑accelerated fully homomorphic encryption framework WarpDrive, which exploits Tensor and CUDA cores for high‑throughput NTT operations and parallel kernel designs, has been accepted as a paper at the IEEE HPCA 2025 conference.

CUDA · Fully Homomorphic Encryption · GPU
0 likes · 4 min read
Alibaba Cloud Infrastructure
Nov 13, 2024 · Industry Insights

Why GPU Scale‑Up Interconnects Need a New Protocol – Inside UALink and Alibaba’s Alink

The article analyzes the growing demand for high‑bandwidth, low‑latency GPU Scale‑Up interconnects in AI clusters, explains why existing Ethernet and RDMA solutions fall short, and examines the industry‑wide UALink alliance and Alibaba's Alink System as a new open‑ecosystem solution.

AI Infrastructure · Alink System · GPU
0 likes · 12 min read
Xiaohongshu Tech REDtech
Nov 7, 2024 · Artificial Intelligence

RTAMS-GANNS: A Real-Time Adaptive Multi-Stream GPU System for Online Approximate Nearest Neighbor Search

RTAMS‑GANNS, the award‑winning real‑time adaptive multi‑stream GPU system for online approximate nearest neighbor search, eliminates costly memory allocations and serial execution by using a dynamic memory‑block insertion algorithm and separate CUDA streams, cutting latency by 40‑80% and reliably serving over 100 million daily users in production.

GPU · Performance Evaluation · Vector Insertion
0 likes · 19 min read
Linux Kernel Journey
Nov 5, 2024 · Artificial Intelligence

Understanding AI Flame Graphs: Insights from Brendan Gregg

The article introduces Intel's AI Flame Graph, a low‑overhead profiling tool that visualizes AI accelerator and GPU workloads across the full software stack, explains its design, demonstrates SYCL matrix‑multiply benchmarks, discusses challenges of AI instruction analysis, and outlines future adoption and impact.

AI profiling · GPU · Intel
0 likes · 16 min read
Linux Code Review Hub
Nov 2, 2024 · Artificial Intelligence

Inside Intel’s AI Flame Graph: Low‑Overhead Profiling for Faster, Greener AI

The article introduces Intel’s AI Flame Graph, a low‑overhead profiling tool that visualizes AI accelerator and GPU execution alongside the full software stack, explains its design, shows SYCL matrix‑multiply examples, discusses challenges of AI workload analysis, and outlines future adoption and impact on performance and energy savings.

AI profiling · GPU · Intel
0 likes · 16 min read
Architecture and Beyond
Nov 2, 2024 · Artificial Intelligence

Step-by-Step Guide to Training a LoRA Model with Flux1_dev on ComfyUI

This tutorial walks programmers through preparing a GPU cloud environment, installing ComfyUI, downloading Flux1_dev models, integrating a custom LoRA, labeling generated images, and finally training the LoRA using ai‑toolkit, providing detailed commands, configuration tips, and practical cost estimates.

AI image generation · ComfyUI · Flux
0 likes · 12 min read
Kuaishou Tech
Oct 16, 2024 · Frontend Development

How Kola2d’s WebGL Engine Achieves 50+ FPS for Million‑Cell Spreadsheets

This article details the design and optimization of Kola2d, a custom WebGL rendering engine for Docs online spreadsheets, explaining why WebGL was chosen, how the system separates business and rendering layers, and the many performance tricks that enable smooth 50+ FPS rendering of tables with up to a million cells.

GPU · Kola2d · Online Spreadsheet
0 likes · 19 min read
21CTO
Oct 15, 2024 · Artificial Intelligence

Why Mojo Could Redefine AI Programming: Insights from Chris Lattner

The article explores Chris Lattner’s vision for Mojo—a Python‑compatible language designed for AI, GPU, and accelerator workloads—detailing its performance claims, SIMD support, complex‑number handling, and the growing developer community behind it.

AI · GPU · Mojo
0 likes · 9 min read
Architects' Tech Alliance
Oct 7, 2024 · Industry Insights

What AMD Unveiled at Computex 2024: Zen 5, XDNA NPU, Ryzen 9000 and AI‑Focused Innovations

At Computex 2024, AMD showcased its latest CPU, GPU, and AI‑accelerated technologies, including the high‑performance Zen 5 core, the second‑generation XDNA NPU with 50 TOPS, the Ryzen 9000 consumer processors, the AI‑PC Strix Point platform, Versal AI Edge Gen 2, the upcoming MI‑series AI GPUs, and the new UALink interconnect, and outlined its roadmap for next‑generation computing and AI workloads.

AI · AMD · CPU
0 likes · 5 min read
Java Tech Enthusiast
Sep 30, 2024 · Artificial Intelligence

The AI Smile Curve: Profit Distribution and Future Outlook

The AI industry’s profit landscape mirrors a smile curve, with upstream GPU manufacturers and downstream application developers capturing most returns while costly large‑model R&D yields low margins, prompting predictions of GPU valuation corrections, a push for consumer‑facing killer apps, and massive application turnover through creative destruction.

AI · GPU · Industry analysis
0 likes · 11 min read
Architects' Tech Alliance
Sep 29, 2024 · Industry Insights

Why Super‑Heterogeneous Computing Is the Next Frontier in Computing Architecture

The article analyzes the limits of the von Neumann model and Moore's law, explains how instruction set complexity defines processor categories, and argues that integrating CPUs, GPUs, FPGAs, DPUs and ASICs into a super‑heterogeneous ecosystem—driven by Intel, NVIDIA, ARM and emerging trends—will shape the future of computing through diverse workloads, AI demand, green efficiency and a global compute network by 2030.

AI · ARM · CPU
0 likes · 12 min read
Architects' Tech Alliance
Sep 25, 2024 · Fundamentals

NVIDIA Quantum‑2 InfiniBand Platform: Technical Overview, Q&A, and Deployment Guidance

This article explains the growing demand for high‑performance computing, introduces NVIDIA's Quantum‑2 InfiniBand platform with its high‑speed, low‑latency capabilities, provides a curated list of related technical articles, and offers an extensive Q&A covering compatibility, cabling, UFM, PCIe limits, and best‑practice deployment for AI and HPC workloads.

AI · GPU · InfiniBand
0 likes · 11 min read
Huawei Cloud Developer Alliance
Sep 18, 2024 · Artificial Intelligence

How Distributed Training Powers Massive Language Models: Concepts, Strategies, and Code

This article explains why single‑machine resources are insufficient for training ever‑larger language models, introduces the fundamentals of distributed training systems, details various parallel strategies such as data, model, pipeline, and hybrid parallelism, and provides practical PyTorch code and memory‑optimization techniques to accelerate large‑scale model training.

Deep Learning · GPU · Parallelism
0 likes · 29 min read
Infra Learning Club
Sep 16, 2024 · Cloud Native

Survey of GPU Sharing and Virtualization Solutions for Kubernetes

The article surveys open‑source GPU sharing and virtualization approaches for AI workloads, comparing soft isolation, CUDA‑level isolation, NVIDIA MPS, driver‑level isolation, GPU pooling and deep‑learning memory sharing, and highlights their architectures, isolation guarantees, and performance trade‑offs.

Device Plugin · GPU · Kubernetes
0 likes · 5 min read
Architects' Tech Alliance
Aug 29, 2024 · Industry Insights

How NVIDIA Builds 256‑GPU and 576‑GPU SuperPods with H100, GH200, and GB200 Interconnects

The article analyzes NVIDIA's DGX SuperPOD architectures across three GPU generations—H100, GH200, and GB200—detailing their NVLink/NVSwitch topologies, bandwidth calculations, scalability limits, and the practical challenges of constructing 256‑GPU and 576‑GPU supercomputing clusters.

Data center · GPU · High‑performance computing
0 likes · 11 min read
Architects' Tech Alliance
Aug 25, 2024 · Industry Insights

Why GPUs May Lose the AI Race: TPU, FPGA, and Future Hardware Trends

While GPUs have driven AI acceleration for years, this article analyzes their architectural constraints, compares emerging alternatives such as Google's TPU and high‑end FPGAs, and explores future application niches like VR/AR, cloud gaming, and military systems where GPUs may still thrive or be replaced.

AI hardware · Deep Learning · FPGA
0 likes · 15 min read
OPPO Kernel Craftsman
Aug 23, 2024 · Mobile Development

GPU Command and Syncpoint Analysis on SM8650 Platform

On the SM8650 platform, GLES issues synchronization and draw commands, which the kernel‑mode driver translates into kgsl_drawobj structures and queues in per‑context dispatch lists. Dedicated kernel threads process fence, timestamp, and timeline syncpoints, and the draw objects are then submitted to the GPU firmware. A call to eglSwapBuffers triggers a fence syncpoint, a draw command, and the creation of a GPU fence.

Android · GPU · Graphics
0 likes · 12 min read
Baidu Geek Talk
Aug 19, 2024 · Artificial Intelligence

PaddlePaddle Neural Network Compiler (CINN): Architecture, Optimization Techniques, and Performance Gains

The PaddlePaddle Neural Network Compiler (CINN) combines a PIR‑based frontend that performs graph‑level optimizations such as constant folding, dead‑code elimination and operator fusion with a backend that applies schedule transformations and auto‑tuning, delivering up to 4× faster RMSNorm kernels and 30‑60% overall speed‑ups for generative AI and scientific‑computing workloads.

CINN · Deep Learning · GPU
0 likes · 18 min read
ByteDance Cloud Native
Aug 12, 2024 · Cloud Native

How to Deploy NVIDIA NIM AI Models on Volcengine VKE in Minutes

This guide walks you through deploying large language models with NVIDIA NIM on Volcengine's Kubernetes Engine (VKE), covering environment setup, model optimization, Helm chart deployment, monitoring integration, and the key advantages of using NIM as a cloud‑native AI micro‑service.

AI deployment · GPU · Kubernetes
0 likes · 12 min read
Architects' Tech Alliance
Jul 25, 2024 · Artificial Intelligence

NVIDIA H20 AI Chip Launch and the Rapid Growth of China's AI Chip Market

The article reviews NVIDIA's newly released H20 AI accelerator for China, compares its performance and pricing with domestic chips, and outlines the expanding Chinese AI chip ecosystem (Huawei, Cambricon, Hygon, Alibaba, ByteDance, and Baidu), highlighting market growth, multi‑chip heterogeneous strategies, and strong forecast demand through 2024.

AI chips · AI compute · China
0 likes · 8 min read
360 Smart Cloud
Jul 17, 2024 · Artificial Intelligence

Parallelism and Memory‑Optimization Techniques for Distributed Large‑Scale Transformer Training

This article reviews the principles and practical implementations of data, pipeline, tensor, sequence, and context parallelism together with memory‑saving strategies such as recomputation and ZeRO, and demonstrates how the QLM framework leverages these techniques to accelerate large‑model training and fine‑tuning on multi‑GPU clusters.

GPU · Megatron-LM · Memory Optimization
0 likes · 18 min read
Architects' Tech Alliance
Jul 9, 2024 · Industry Insights

How Nvidia’s Accelerated GPU Roadmap Is Shaping AI‑Scale Networking

Nvidia plans to shorten its GPU generation cycle to one year, launching Blackwell Ultra in 2025, Rubin in 2026, and Rubin Ultra in 2027, while boosting token‑generation efficiency and introducing AI‑optimized Ethernet solutions like Spectrum‑X800, aiming to dominate large‑scale AI clusters and reshape the high‑performance networking market.

AI · GPU · Nvidia
0 likes · 6 min read
Architects' Tech Alliance
Jun 22, 2024 · Artificial Intelligence

Rising Compute Demand of Generative AI Models and GPU Accelerator Trends in 2024

The article analyzes how generative AI models from GPT‑1 to the upcoming GPT‑5 are driving exponential growth in compute requirements, prompting massive cloud capital expenditures and intense competition among GPU vendors such as NVIDIA, AMD, Google, and emerging domestic chip makers, while also highlighting interconnect innovations and cost‑effective solutions.

AI · Accelerators · Compute
0 likes · 12 min read
Architects' Tech Alliance
Jun 16, 2024 · Industry Insights

How Nvidia’s Blackwell GPUs Aim to Slash AI Training Costs and Power

The article analyzes Nvidia’s historic advantage, the massive performance and energy efficiency gains from Pascal to Blackwell GPUs, the economics of training large language models like GPT‑4, and the detailed roadmap of upcoming GPU, memory, and interconnect technologies shaping the future of data‑center AI.

AI · GPU · Nvidia
0 likes · 14 min read
Java Tech Enthusiast
Jun 7, 2024 · Fundamentals

Engineer Builds GPU from Scratch in Two Weeks

In just two weeks, engineer Adam Majmudar designed and implemented a minimalist GPU called tiny‑gpu, with a custom 11‑instruction ISA and Verilog RTL verified via OpenLane. He shared the open‑source project on GitHub, where it earned thousands of stars, and prepared it for fabrication through Tiny Tapeout 7, showing how modern tools make DIY chip design increasingly accessible.

Chip Design · EDA · GPU
0 likes · 8 min read
IT Services Circle
Jun 6, 2024 · Artificial Intelligence

Nvidia Unveils Blackwell GPU and AI Supercomputing Roadmap

Nvidia’s latest Blackwell GPU, presented by Jensen Huang, promises unprecedented performance and energy efficiency for large‑scale AI models, while the company also showcases accelerated computing, NVLink interconnects, AI‑optimized DGX servers, the NIM platform for rapid LLM deployment, and ambitious projects such as Earth‑2 digital twins and next‑generation embodied AI robots.

AI · Blackwell · GPU
0 likes · 18 min read
Architects' Tech Alliance
Jun 5, 2024 · Industry Insights

How HBM Is Transforming GPU Power and Driving the AI Memory Boom

HBM's near‑memory architecture, stacked design, and TSV integration dramatically cut latency and space while boosting bandwidth, leading NVIDIA and AMD to adopt it across multiple GPU generations, spurring fierce competition among SK Hynix, Samsung, and Micron, with the market projected to grow four‑fold to $16.9 billion in 2024.

AI · GPU · HBM
0 likes · 11 min read
Alibaba Cloud Infrastructure
May 31, 2024 · Cloud Native

Best Practices for Deploying AI Model Inference on Knative

This guide explains how to efficiently deploy AI model inference services on Knative by externalizing model data, using Fluid for accelerated loading, configuring secrets, ImageCache, graceful shutdown, probes, autoscaling parameters, mixed ECS/ECI resources, shared GPU scheduling, and observability features to achieve fast scaling, low cost, and high elasticity.

AI Model Inference · Cloud Native · GPU
0 likes · 19 min read
Bilibili Tech
May 24, 2024 · Cloud Computing

Understanding and Optimizing NCCL Collective Communication Libraries for Large‑Scale Model Training

The article explains how the NCCL collective communication library enables efficient large‑scale model training, covering how it parses GPU‑to‑NIC topology and forms flat‑ring and tree structures, how its logging and bandwidth metrics can be improved, how the Ring AllReduce primitives work, and proposed fixes for missing topology, metric, and mapping information.

Distributed Training · GPU · NCCL
0 likes · 23 min read
Open Source Linux
May 22, 2024 · Artificial Intelligence

Why GPUs Are the Powerhouse Behind Modern AI: A Deep Dive

This article explains how GPUs, with their parallel architecture and extensive software ecosystem, have become essential for accelerating AI training and inference, outperforming CPUs and shaping the future of artificial intelligence across various industries.

Deep Learning · GPU · Hardware acceleration
0 likes · 10 min read
Architects' Tech Alliance
May 15, 2024 · Artificial Intelligence

Detailed Overview of GPU Server Architectures: A100/A800 and H100/H800 Nodes

This article provides a comprehensive technical overview of large‑scale GPU server architectures, detailing the component topology of 8‑GPU A100/A800 and H100/H800 nodes, explaining storage network cards, NVSwitch interconnects, bandwidth calculations, and the trade‑offs between RoCEv2 and InfiniBand for AI workloads.

GPU · High‑performance computing · NVLink
0 likes · 13 min read
Architects' Tech Alliance
May 14, 2024 · Artificial Intelligence

Why GPUs Are Essential for Modern Artificial Intelligence and How They Compare with CPUs, ASICs, and FPGAs

This article explains the pivotal role of GPUs in today’s generative AI era, describes their architecture and applications, compares them with CPUs, ASICs, and FPGAs, and offers guidance on selecting the right processor for AI workloads while also noting related reference resources.

Deep Learning · GPU · Hardware
0 likes · 12 min read
DataFunTalk
May 10, 2024 · Artificial Intelligence

GPU Performance Optimization Practices for Tencent PCG Recommendation Model Training Framework

This article presents a comprehensive overview of Tencent PCG's GPU‑based recommendation model training framework, detailing why GPU adoption is essential, the hardware and software challenges faced, the multi‑level data architecture, pipeline design, and a series of network, storage, and compute optimizations, followed by future directions.

Distributed Training · GPU · Model Training
0 likes · 13 min read
Architects' Tech Alliance
May 9, 2024 · Artificial Intelligence

AI Servers: Market Opportunities, Architecture, and Future Demand Driven by Generative AI

The article examines how the surge of generative AI (AIGC) is fueling rapid growth in AI server demand, detailing the emerging AIGC ecosystem, server hardware composition, model scaling, heterogeneous computing, training vs. inference workloads, market size forecasts, and the competitive landscape of AI server manufacturers.

AI Infrastructure · AI servers · GPU
0 likes · 15 min read
NetEase Cloud Music Tech Team
May 8, 2024 · Frontend Development

How We Halved Cloud Music Desktop Startup Time and Fixed UI Lag with a React Refactor

This article details the migration of the Cloud Music desktop client from a legacy NEJ‑CEF hybrid to a React‑based architecture, outlines four major performance challenges, and explains the step‑by‑step optimizations—including API preloading, render memoization, virtual‑list replacement, and resource‑usage reductions—that cut startup latency by 48%, eliminated interaction stutter, and dramatically lowered CPU, GPU, and memory consumption.

CPU · GPU · Hybrid App
0 likes · 30 min read
Rare Earth Juejin Tech Community
Apr 24, 2024 · Artificial Intelligence

Training MNIST with Burn on wgpu: From PyTorch to Rust Backend

This tutorial demonstrates how to train a MNIST digit‑recognition model using the Rust‑based Burn framework on top of the cross‑platform wgpu API, covering model export from PyTorch to ONNX, code generation, data loading, training loops, and performance comparison across CPU, GPU, and other backends.

Burn · Deep Learning · GPU
0 likes · 13 min read
Architects' Tech Alliance
Apr 21, 2024 · Fundamentals

Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training

This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.

AI · GPU · High‑Performance Computing
0 likes · 11 min read
Architects' Tech Alliance
Apr 12, 2024 · Industry Insights

Why AI Server Demand Is Set to Explode by 2025 – Key Trends and Market Drivers

The article analyzes the rapid evolution of AI servers, detailing the shift from general‑purpose to GPU‑enhanced AI hardware, the split between training and inference workloads, cost structures, forecasted compute needs for large models like GPT‑4, and the impact of US export restrictions and domestic competition on the global market.

AI servers · GPU · Market analysis
0 likes · 6 min read
Architects' Tech Alliance
Apr 10, 2024 · Industry Insights

Inside the GPU Server: Architecture of A100/A800 and H100/H800 Nodes

This article provides a detailed technical breakdown of modern multi‑GPU server nodes, covering component composition, storage network cards, NVSwitch interconnects, bandwidth calculations, and the architectural differences between NVIDIA A100/A800 and H100/H800 configurations for AI training workloads.

A100 · AI training · GPU
0 likes · 12 min read
DataFunSummit
Apr 10, 2024 · Artificial Intelligence

Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, describing the prefill and decoding stages, key performance metrics such as throughput, latency and QPS, and detailing a series of system-level optimizations—including pipeline parallelism, dynamic batching, KV‑cache quantization, and hardware considerations—to significantly improve inference efficiency on modern GPUs.

GPU · Inference · Latency
0 likes · 23 min read
Python Programming Learning Circle
Apr 3, 2024 · Fundamentals

Accelerating Python Code with Taichi: Up to 100× Speed Boosts

This article introduces Taichi, a Python‑embedded DSL that compiles kernel functions for CPU and GPU execution, and demonstrates through three practical examples how importing the library and adding decorators can accelerate Python code by up to a hundredfold, with detailed performance numbers and installation instructions.

DSL · GPU · Python
0 likes · 7 min read
360 Smart Cloud
Apr 3, 2024 · Backend Development

Understanding FFmpeg Hardware Acceleration Architecture and Implementation

FFmpeg provides a comprehensive, cross‑platform hardware acceleration framework that abstracts diverse GPU and dedicated video codec interfaces, defines HWContext types, device and frame contexts, and various codec configuration methods, enabling efficient video encoding, decoding, and filtering while addressing performance, compatibility, and pipeline complexity challenges.

GPU · Hardware acceleration · Multimedia
0 likes · 10 min read
Architects' Tech Alliance
Mar 30, 2024 · Industry Insights

How NVIDIA’s B200 GPU Redefines AI Compute and What It Means for the Chip Market

The article analyzes the latest AI‑compute announcements from NVIDIA, AMD and Intel—including NVIDIA’s B200 GPU with 20 petaFLOPS FP4 performance, AMD’s MI300/MI400 roadmap, and Intel’s Gaudi 3 and Falcon Shores—while examining pricing, launch timelines, supply‑chain capacity, and the shifting market share among major cloud providers.

AI compute · AMD · GPU
0 likes · 10 min read
Architects' Tech Alliance
Mar 20, 2024 · Industry Insights

What Nvidia’s B100 and GB200 Reveal About the Future of AI GPUs

The GTC 2024 recap highlights Nvidia’s upcoming B100 and GB200 GPUs, their Blackwell architecture, performance breakthroughs, embodied‑intelligence initiatives, and the expanding AI application ecosystem across industries, offering a clear view of the next wave in accelerated computing.

AI · B100 · Embodied Intelligence
0 likes · 7 min read
21CTO
Mar 20, 2024 · Artificial Intelligence

Nvidia Unveils Blackwell GPU: A Quantum Leap for Generative AI

Nvidia introduced the Blackwell GPU architecture at GTC, highlighting six breakthrough technologies, a 4nm process, massive performance gains, and its integration into DGX SuperPOD systems that promise to accelerate generative AI, data processing, and high‑performance computing across industries.

AI · Blackwell · GPU
0 likes · 14 min read
Architects' Tech Alliance
Mar 18, 2024 · Industry Insights

Why Nvidia’s NVLink C2C Is Redefining GPU‑CPU Interconnects

The article provides an in‑depth technical analysis of Nvidia’s NVLink C2C interconnect, comparing its latency, bandwidth, power efficiency, density and cost against traditional SerDes solutions and examining its role in building SuperChip architectures with Grace CPUs and Hopper GPUs.

GPU · NVLink · cost analysis
0 likes · 12 min read
Architects' Tech Alliance
Mar 17, 2024 · Industry Insights

Why GPUs Remain the Dominant AI Compute Engine: Trends, Risks, and Future Outlook

The article analyzes current AI hardware options, explains why GPUs continue to dominate model training due to architectural compatibility, ecosystem support, and market maturity, and outlines emerging trends such as model miniaturization, optical interconnects, and chiplet technology that will shape the next generation of AI compute.

AI compute · Chiplet · GPU
0 likes · 6 min read
iQIYI Technical Product Team
Mar 15, 2024 · Artificial Intelligence

Optimizing GPU Inference for CTR Models: Kernel Fusion, Multi‑Stream Execution, and Batch Merging

By fusing sparse‑feature operators, enabling multi‑stream execution, consolidating data copies, and merging inference batches, iQIYI reduced GPU CTR model latency to CPU‑level, boosted throughput over sixfold, and cut operational costs by more than 40%, overcoming launch‑overhead bottlenecks.

CTR · GPU · Inference Optimization
0 likes · 10 min read
Tencent Cloud Developer
Mar 14, 2024 · Mobile Development

Aurora Animation and 3D Penguin Effects in Mobile QQ: Noise Algorithms, Color Mapping, Performance Optimization, and Rendering Techniques

The new QQ 9.0 introduces aurora‑style animations generated by continuous, smoothed noise algorithms with uniform‑probability color mapping, and a spring‑driven 3D penguin rendered via Filament’s PBR materials and GPU compute shaders, achieving sub‑2 ms performance on most Android and iOS devices.

3D · GPU · Mobile
0 likes · 17 min read
NewBeeNLP
Mar 8, 2024 · Industry Insights

Why Building LLMs Is Like Buying a Hardware Lottery – Lessons from a Startup

The article recounts Yi Tay’s experience founding Reka and building large language models from scratch, highlighting the unpredictable quality of GPU clusters, the challenges of multi‑cluster orchestration, code‑base choices, and how startups must rely on fast, intuition‑driven experimentation to succeed.

Cluster Management · GPU · Hardware
0 likes · 12 min read
MaGe Linux Operations
Mar 5, 2024 · Cloud Native

How to Run GPU‑Accelerated AI Workloads on Kubernetes

This article explains how Kubernetes supports GPU workloads for AI and machine learning, covering device plugins, pod GPU requests, oversubscription, security isolation, cloud‑provider node setup, and protecting GPU nodes from non‑GPU pods.

AI workloads · Cloud Native · Device Plugin
0 likes · 8 min read
OPPO Kernel Craftsman
Mar 1, 2024 · Mobile Development

GPU Frequency Scaling on Qualcomm Adreno Using the Linux devfreq Framework

Using Qualcomm’s Adreno GPU as a case study, the article explains how the Linux devfreq framework enables GPU frequency scaling by creating a kgsl devfreq device and an msm‑adreno‑tz governor, detailing their initialization, event handling, target‑frequency computation, and the kernel callbacks that apply the new rates.

Adreno · GPU · Linux kernel
0 likes · 5 min read
DataFunTalk
Feb 19, 2024 · Artificial Intelligence

Large Language Model Inference Overview and Performance Optimizations

This article presents a comprehensive overview of large language model inference, detailing the prefill and decoding stages, key performance metrics such as throughput, latency and QPS, and a series of system-level optimizations—including pipeline parallelism, dynamic batching, specialized attention kernels, virtual memory allocation, KV‑cache quantization, and mixed‑precision strategies—to improve GPU utilization and overall inference efficiency.

GPU · LLM · Latency
0 likes · 24 min read
DataFunSummit
Feb 11, 2024 · Artificial Intelligence

GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu

This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.

GPU · Model Serving · Training
0 likes · 18 min read
Architects' Tech Alliance
Jan 30, 2024 · Industry Insights

Why Computing Power Leasing Is Booming: 2024 Industry Framework & Trends

The article outlines the 2024 computing‑power leasing industry framework, explains three main rental models, highlights the surge in demand driven by generative AI, the shortage of high‑end GPUs, and provides an extensive collection of links to reports and analyses on GPU technology, market dynamics, and future development paths.

AI · GPU · Industry analysis
0 likes · 5 min read
DataFunTalk
Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · Inference
0 likes · 20 min read
Architects' Tech Alliance
Jan 14, 2024 · Fundamentals

Overview of CPU, GPU, and Storage Fundamentals in the Xinchuang Industry

This article introduces the Xinchuang (information technology innovation) industry, outlines its hardware components, and provides concise explanations of CPU concepts, instruction sets, GPU architecture and operation, as well as storage classifications, while also linking to related research reports and promotional resources.

CPU · GPU · Information Technology
0 likes · 8 min read
Architects' Tech Alliance
Jan 4, 2024 · Industry Insights

China’s 2023 Xinchuang Boom: Key Trends in CPUs, GPUs, DPU & Cloud

The 2023 Xinchuang industry report outlines how China's information‑technology innovation sector entered a rapid growth phase, highlighting market expansion, dominant keywords, the evolving hardware ecosystem (CPUs, GPUs, AI chips, DPUs, and cloud databases), and the strategic shift toward full‑industry adoption across eight critical sectors.

CPU · China · DPU
0 likes · 14 min read