Tagged articles

GPU

562 articles · Page 1 of 6
Alibaba Cloud Infrastructure
Jul 4, 2026 · Artificial Intelligence

How Xpeng Built AI-Driven Cars on Alibaba Cloud: A Deep Dive

The article examines Xpeng Motors' cloud‑native AI strategy, detailing its 10,000‑GPU cluster with over 95% utilization, AI‑powered digital employees in customer service and finance, AI coding acceleration, and the global 10 EFLOPS compute infrastructure that sustains high‑traffic car launches.

AI · AI coding · Alibaba Cloud
0 likes · 7 min read
Black & White Path
Jun 30, 2026 · Artificial Intelligence

A 27B Red‑Team AI Model That Runs on Just 12 GB VRAM

The BugTraceAI CORE Ultra 27B model, fine‑tuned on 2,541 real vulnerability reports, generates fully functional Nuclei templates, CVE PoCs, webshell bypasses, JWT cracking tools, and kernel exploits with a 0 % rejection rate, and its quantized Q4 version runs on a single 24 GB GPU, making advanced red‑team automation accessible.

BugTraceAI · GPU · LLM
0 likes · 7 min read
Geek Labs
Jun 29, 2026 · Artificial Intelligence

DeepSpec Boosts Large-Model Inference Speed by 2–5× with Speculative Decoding

DeepSpec, an open‑source framework from DeepSeek, accelerates large‑language‑model inference by 2–5× through speculative decoding, where a lightweight draft model generates candidate tokens that the target model validates in parallel, reducing the serial bottleneck of autoregressive decoding and offering a full‑stack pipeline from data preparation to evaluation.

DeepSpec · GPU · Python
0 likes · 6 min read
Machine Heart
Jun 28, 2026 · Artificial Intelligence

When the Memory Wall Locks AI Compute, Is HBM the Key or Another Lock?

The article analyzes how the growing memory‑wall bottleneck forces GPUs to idle while waiting for data, compares on‑chip SRAM and high‑bandwidth memory (HBM) as remedies, and examines HBM’s technical advantages, supply constraints, and divergent manufacturing routes that may turn it into a new limitation.

AI compute · GPU · HBM
0 likes · 6 min read
21CTO
Jun 25, 2026 · Industry Insights

Can OpenAI’s Jalapeño Chip Disrupt Nvidia’s GPU Dominance?

OpenAI unveiled its custom AI inference chip, Jalapeño, co‑designed with Broadcom, claiming far better power efficiency than existing high‑end GPUs and signaling a strategic shift that could erode Nvidia’s near‑monopoly in AI hardware.

AI chip · ASIC · Broadcom
0 likes · 9 min read
DeepHub IMBA
Jun 15, 2026 · Artificial Intelligence

Flash-KMeans: Fast, Memory-Efficient Exact K-Means for Billion-Scale Clustering on a Single GPU

Flash‑KMeans is a newly proposed framework that re‑designs exact K‑Means for GPUs by eliminating distance‑matrix materialization, using FlashAssign’s online argmin and Sort‑Inverse Update to cut memory bandwidth and atomic‑write contention, achieving up to 12.5× speedup and dramatically lower VRAM usage on billion‑point datasets.

Clustering · FlashAssign · GPU
0 likes · 23 min read
Ubuntu
Jun 15, 2026 · Artificial Intelligence

Running AI/ML Models on WSL with CUDA Acceleration: A PyTorch Hands‑On Guide

This guide shows how to enable NVIDIA GPU passthrough in WSL 2, install the CUDA toolkit, set up a PyTorch GPU environment, verify GPU visibility, and run real‑world AI/ML workloads such as LLM inference, YOLO object detection, and Jupyter monitoring, while providing performance comparisons, optimization tips, and troubleshooting FAQs.

AI · CUDA · GPU
0 likes · 13 min read
Old Zhang's AI Learning
Jun 13, 2026 · Cloud Computing

Google’s Low‑Key Launch: The Google Colab CLI Brings Notebooks to the Terminal

The article introduces Google Colab CLI, a command‑line interface that moves Colab notebooks from the browser to the terminal, detailing its installation on Linux/macOS, authentication steps, core features like instant VM provisioning, kernel state persistence, shebang GPU scripts, and practical examples such as fine‑tuning Gemma 3‑1B.

AI Agents · Automation · CLI
0 likes · 9 min read
Geek Labs
Jun 12, 2026 · Artificial Intelligence

Boost Developer Productivity with NBD VRAM, Reg Factory, and vLLM Studio

This article introduces three open‑source tools that improve GPU‑centric development: NBD VRAM, which turns GPU memory into Linux swap space; Reg Factory, a scheduler and monitor for multi‑GPU clusters; and vLLM Studio, a web UI for deploying and managing large‑model inference.

GPU · Model Deployment · Resource Scheduling
0 likes · 4 min read
Lao Guo's Learning Space
Jun 10, 2026 · Artificial Intelligence

2026 Top 10 Local LLMs Ranked by Real Downloads, GPU Fit, and License Risks

The article analyzes why local large‑language‑model deployment is essential for privacy, offline use, and cost control, then ranks the ten most popular models in 2026 using Ollama download counts, GitHub stars, benchmark scores, and hardware requirements, and finally provides a GPU‑based selection guide, deployment‑tool comparison, license‑risk table, decision‑tree and quick‑start instructions.

GPU · LLM · benchmark
0 likes · 19 min read
Architects' Tech Alliance
Jun 7, 2026 · Industry Insights

2026 China GPU Chip Industry: Market Share, Technology Trends, and Future Outlook

The 2026 analysis shows China's GPU market capturing a growing share of the $1.12 trillion global AI GPU market, with Huawei Ascend leading at 44%, domestic firms leveraging 7nm processes, Chiplet and FP8 breakthroughs, while Nvidia and AMD face increasing competition from Chinese players expanding into inference, edge and enterprise segments.

AI chips · China · Chiplet
0 likes · 5 min read
Architects' Tech Alliance
Jun 4, 2026 · Industry Insights

2026 Global GPU Chip Landscape: Domestic AI Accelerators Surge Past 60% Share

In 2026 the GPU market pivots as domestic AI accelerators capture over 60% share, slashing Nvidia’s hold to roughly 8%, while companies like Huawei Ascend, Biren, Moore Threads, HaiGuang and MuXi compete with 7 nm chiplets, petaflop performance and emerging software ecosystems to chase the trillion‑dollar AI chip opportunity.

AI accelerator · Biren · Chiplet
0 likes · 6 min read
Lao Guo's Learning Space
Jun 3, 2026 · Industry Insights

Can Apple’s M5 Ultra Still Compete After NVIDIA’s RTX Spark Launch?

The RTX Spark desktop processor delivers 1 PFLOP of AI compute—about 14 times the M5 Ultra—while the M5 Ultra retains a three‑times higher memory bandwidth and twice the memory capacity, making it superior for certain inference workloads; the article breaks down specs, benchmarks, ecosystem differences, pricing and market positioning to show how each platform fits distinct AI use cases.

AI compute · Apple M5 Ultra · CUDA
0 likes · 12 min read
Machine Heart
Jun 1, 2026 · Industry Insights

Nvidia Redefines PCs with the Ultra‑Efficient RTX Spark CPU

Nvidia and Microsoft unveiled RTX Spark‑powered Windows PCs, thin‑and‑light laptops and desktops that combine an ARM‑based Vera CPU, a Blackwell RTX GPU with 6144 CUDA cores, up to 1 petaflop of AI performance, and 128 GB of unified memory to enable local AI agents, high‑end creative workloads, and next‑gen gaming.

AI Agents · Arm · CPU
0 likes · 8 min read
Architects' Tech Alliance
May 31, 2026 · Industry Insights

Huawei AI Data Center Reference Design – Downloadable Blueprint

The Huawei AI Data Center Reference Design offers a standardized, integrated, high‑performance compute infrastructure for large‑model training and inference, built on GB/T 50174, featuring modular GPU/HBM servers, 20–50 kW per rack, leaf‑spine 100/200/400 Gbps networking, liquid cooling, redundant power, and intelligent management, with a downloadable package for replication.

AI · Data Center · GPU
0 likes · 4 min read
TonyBai
May 26, 2026 · Artificial Intelligence

Why NVIDIA Chose Go for Its GPU Cloud Platform: Inside the AI Infrastructure Rewrite

NVIDIA quietly rewrote its AI cloud platform using Go, open‑sourcing NVCF, AICR, and AIStore, where Go accounts for over 80% of the code, enabling a three‑plane architecture, scale‑to‑zero via NATS JetStream, and a cloud‑native stack that balances performance, maintainability, and rapid iteration.

AI Infrastructure · Cloud Native · GPU
0 likes · 15 min read
Machine Heart
May 14, 2026 · Artificial Intelligence

How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66 % on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

AI inference · DeepSeek · GPU
0 likes · 12 min read
Architects' Tech Alliance
May 14, 2026 · Artificial Intelligence

Jensen Huang’s China Visit: Could It Revive GPU Prospects? Inside Nvidia’s DGX H200 Cluster Design

The article reviews the US‑approved export of Nvidia's DGX H200, the lack of deliveries, Jensen Huang’s surprise China trip that may speed approvals, and then provides a detailed technical breakdown of the DGX H200 cluster’s compute and storage networking, topology, optical link choices, and cable count estimates.

AI Infrastructure · DGX H200 · Data Center Networking
0 likes · 8 min read
Baidu Geek Talk
May 13, 2026 · Artificial Intelligence

LoongForge Boosts Multimodal Training Speed by 45% on GPU and Kunlun XPU

LoongForge, Baidu Baige’s open‑source full‑modal training framework, unifies LLM, VLM and VLA workloads, runs unchanged on NVIDIA GPUs and Kunlun XPU, and delivers 15‑45% end‑to‑end speedups with up to 90% linear scaling on 5,000‑plus card clusters, while simplifying model integration via YAML.

AI Infrastructure · GPU · Kunlun XPU
0 likes · 23 min read
Geek Labs
May 13, 2026 · Artificial Intelligence

Two LLM Inference Acceleration Projects: A Mac‑Local Engine vs a Data‑Center Engine

This article compares two recent GitHub LLM inference engines—ds4.c, a Metal‑optimized engine for DeepSeek V4 Flash on Apple Silicon Macs, and TokenSpeed, a Python/C++‑based, data‑center‑grade engine for GPU clusters—detailing their design choices, performance numbers, usage instructions, and suitable scenarios.

DeepSeek · GPU · LLM
0 likes · 8 min read
21CTO
May 11, 2026 · Artificial Intelligence

Mojo 1.0 Beta: A New Era of Python‑C++ Performance

Mojo 1.0 beta combines familiar Python syntax with C/Rust‑level speed, introduces API‑stabilizing language changes, expands cross‑vendor GPU support, and delivers measurable AI/ML performance gains, while offering a decision framework that weighs its early‑stage ecosystem against production needs.

AI · C# · GPU
0 likes · 10 min read
Machine Heart
May 10, 2026 · Artificial Intelligence

Why SRAM Is Key to Overcoming GPU Limits in Inference as Demand Soars

As large‑model inference demand outpaces training, the decode stage hits a memory‑wall that GPUs cannot efficiently cross; SRAM’s on‑chip bandwidth and low‑energy access open a path forward, though capacity and process limits still pose challenges.

AI hardware · Compute Architecture · GPU
0 likes · 7 min read
SuanNi
May 7, 2026 · Industry Insights

Musk Gives 220k GPUs to Claude; Anthropic’s $1.2T Valuation Crowned AI King

Elon Musk redirected 220,000 GPUs to Anthropic’s Claude, fueling a dramatic 80‑fold Q1 usage surge and a $1.2 trillion valuation that now eclipses OpenAI, while the article dissects the compute‑capacity crunch, Colossus data‑center dynamics, and the broader AI market power shift.

AI valuation · Anthropic · Claude
0 likes · 8 min read
大转转FE
May 7, 2026 · Artificial Intelligence

Running AI Inference Directly in the Browser with WebNN

WebNN brings hardware‑accelerated AI inference to web pages, letting developers run millisecond‑level face detection, real‑time filters, and semantic segmentation locally without cloud calls, while improving latency, privacy, and cost through a unified JavaScript API that maps to CPUs, GPUs or NPUs.

AI inference · Edge · GPU
0 likes · 16 min read
Old Zhang's AI Learning
May 1, 2026 · Artificial Intelligence

NVIDIA’s Open‑Source Multimodal Nemotron 3 Nano Omni: Run Locally on Consumer GPUs (English‑Only)

NVIDIA’s Nemotron 3 Nano Omni 30B‑A3B‑Reasoning model, an open‑source multimodal LLM with 30 B parameters, 256K context and video‑audio‑image‑text capabilities, outperforms comparable models by up to 9.2× in video throughput, runs on consumer GPUs via 4‑bit GGUF quantization, but currently supports only English input.

GGUF · GPU · Multimodal
0 likes · 17 min read
SuanNi
Apr 30, 2026 · Artificial Intelligence

Deploy a 24/7 Document Recognition Toolbox with the PaddleOCR Image on the Cloud

This guide explains how to use Baidu's open‑source PaddleOCR engine—its full OCR and layout analysis pipeline, multi‑language support, and output formats—to set up a continuously running document recognition service on the 算网 GPU cloud platform, including environment preparation, model configuration, and inference execution.

Document processing · GPU · MagicMind
0 likes · 6 min read
Baidu Intelligent Cloud Tech Hub
Apr 24, 2026 · Artificial Intelligence

LoongForge: Open‑Source Multimodal Training Framework Runs on GPU and Kunlun XPU with 45% Speedup

LoongForge is an open‑source, Megatron‑based multimodal training framework that unifies LLM, VLM, VLA and diffusion models, runs seamlessly on NVIDIA GPUs and Baidu Kunlun XPU, and delivers 15%‑45% end‑to‑end training acceleration while scaling linearly on thousands of cards.

GPU · Kunlun XPU · LoongForge
0 likes · 23 min read
DataFunTalk
Apr 19, 2026 · Industry Insights

Why Nvidia Still Rules AI Hardware: Inside Jensen Huang’s Strategic Interview

In a candid two‑hour podcast, Nvidia CEO Jensen Huang explains how the company’s focus on accelerated computing, a massive CUDA ecosystem, strategic supply‑chain partnerships and a philosophy of doing only what’s essential have built a durable moat that outpaces rivals like TPU, while also revealing why Nvidia prefers to empower cloud providers rather than become one itself.

AI hardware · Cloud Computing · GPU
0 likes · 36 min read
Machine Learning Algorithms & Natural Language Processing
Apr 17, 2026 · Artificial Intelligence

Can Table Modeling Scale? Rethinking Tree Models in the Age of Massive Compute

The article examines how the dramatic increase in GPU compute power—illustrated by a single H100 GPU equaling about 200 Hadoop instances—challenges the dominance of tree‑based models for structured data, presents scaling‑law experiments with KMLP and FOUND, and argues that pre‑training can redefine the balance between compute, data, and algorithms.

FOUND · GPU · KMLP
0 likes · 10 min read
Baidu Geek Talk
Apr 13, 2026 · Artificial Intelligence

How Baidu’s 7th‑Gen AI Confidential VM Delivers Full‑Stack Secure Compute

Baidu Cloud’s 7th‑generation AI confidential virtual machine combines Intel TDX‑based CPU trusted execution, GPU confidential computing, and DPU‑offloaded I/O to provide end‑to‑end encrypted data paths, multi‑GPU scaling, and near‑native performance for high‑sensitivity AI workloads, redefining secure cloud AI infrastructure.

AI · Cloud · Confidential Computing
0 likes · 15 min read
Alibaba Cloud Infrastructure
Apr 13, 2026 · Industry Insights

How UALink 2.0 and CXL Are Redefining AI Scale‑Up Interconnects

At the 2026 Open AI Infra Summit, Alibaba Cloud showcased the evolution of the UALink 2.0 protocol and its integration with CXL, detailing new specifications, in‑network compute capabilities, and ecosystem developments that aim to overcome scale‑up bottlenecks in AI training and inference.

AI Infrastructure · CXL · Cloud Computing
0 likes · 8 min read
Old Zhang's AI Learning
Apr 12, 2026 · Artificial Intelligence

Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide

MiniMax‑M2.7, the newly open‑sourced 230‑billion‑parameter MoE model, offers self‑evolution, professional software engineering and agent capabilities, and can be deployed locally using Ollama, vLLM, SGLang or Docker with 4‑8 H200 GPUs, while the article details hardware needs, performance gains and tool‑calling/Thinking features.

GPU · LLM · MiniMax M2.7
0 likes · 11 min read
Old Zhang's AI Learning
Apr 10, 2026 · Artificial Intelligence

How a 9B‑Parameter Qwen3.5 Model Achieves Full‑Auto Data Analysis on a Consumer GPU

The open‑source CoPaw‑Flash‑9B‑DataAnalyst‑LoRA model, fine‑tuned via LoRA, can autonomously load, explore, statistically analyze, visualize, and generate structured reports for CSV/Excel/JSON datasets, achieving a 90% success rate with an average of 26 iteration rounds, and it runs on a single consumer‑grade GPU using vLLM and the Data Analyst framework.

Agent · Data Analyst · GPU
0 likes · 10 min read
Old Zhang's AI Learning
Apr 7, 2026 · Artificial Intelligence

vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

The vLLM 0.19.0 release adds first‑day Gemma 4 support, merges zero‑bubble asynchronous scheduling with speculative decoding, matures Model Runner V2, introduces full‑CUDA‑graph acceleration for ViT, generalizes DBO, brings CPU KV cache offload, and expands hardware and Transformers compatibility, offering substantial performance and flexibility gains for production LLM inference.

CPU KV offload · GPU · Gemma 4
0 likes · 18 min read
AI Info Trend
Mar 24, 2026 · Industry Insights

NVIDIA’s DLSS 5 & CUDA Flywheel: Transforming AI in Gaming and Enterprise

The GTC 2026 keynote revealed NVIDIA’s latest DLSS 5 technology using 3‑D guided neural rendering to deliver cinematic‑quality graphics in real time, outlined a 20‑year CUDA ecosystem flywheel that fuels AI acceleration across structured and unstructured data, showcased enterprise case studies like Nestlé’s data‑refresh breakthrough, and highlighted a vast partner network, illustrating how AI is moving from experimental labs to everyday production.

AI · CUDA · DLSS
0 likes · 5 min read
Ops Community
Mar 13, 2026 · Backend Development

How to Diagnose and Fix Slow LLM Inference: A Full‑Stack Performance Guide

This article presents a comprehensive, step‑by‑step methodology for troubleshooting and optimizing large‑language‑model inference performance, covering GPU, CPU, memory, network, configuration, and application layers, with concrete benchmark scripts, diagnostic commands, and real‑world case studies.

CPU · GPU · debugging
0 likes · 48 min read
MaGe Linux Operations
Mar 12, 2026 · Backend Development

How to Deploy vLLM Inference Service on Kubernetes with Ingress and Service Load Balancing

This guide walks through deploying a production‑grade vLLM inference service on Kubernetes, covering GPU resource scheduling, Service and Ingress configuration, session affinity, health checks, performance tuning, scaling, monitoring, fault‑tolerance, and best‑practice recommendations for high‑availability AI workloads.

GPU · High Availability · Ingress
0 likes · 47 min read
TonyBai
Mar 9, 2026 · Cloud Native

A Decade of Docker: How It Reshaped Cloud‑Native Infrastructure

The article reviews Docker’s ten‑year evolution—from early Linux namespace tricks and layered images to Mac/Windows support via HyperKit, network handling with SLIRP/vpnkit, storage bridging with virtio‑fs, and recent extensions for ARM, TEE, GPU and AI agents—highlighting the engineering compromises that made containers the backbone of modern cloud‑native platforms.

AI Agents · Cloud Native · Containers
0 likes · 13 min read
Old Zhang's AI Learning
Mar 7, 2026 · Artificial Intelligence

vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

ASR · Anthropic API · FlashAttention
0 likes · 12 min read
SpringMeng
Mar 2, 2026 · Backend Development

Deep Dive into an Asynchronous Spring Boot + Tesseract OCR Pipeline for Invoice Recognition

This article presents a complete design and implementation of a high‑throughput, asynchronous OCR pipeline built with Spring Boot and Tesseract, covering distributed architecture, thread‑pool tuning, image‑preprocessing, multi‑engine recognition, data extraction strategies, Kubernetes deployment, security compliance, chaos testing, and future AI‑driven enhancements.

Asynchronous · GPU · Java
0 likes · 10 min read
MaGe Linux Operations
Feb 27, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference with vLLM on Kubernetes and GPU Scheduling

This guide explains how to deploy vLLM for large‑language‑model serving on Kubernetes, covering GPU resource management, tensor‑parallel configuration, continuous batching, quantization choices, autoscaling with HPA and KEDA, multi‑model routing, and best‑practice recommendations for performance, cost control, and high availability.

GPU · LLM Inference · kubernetes
0 likes · 48 min read
Data STUDIO
Feb 21, 2026 · Big Data

Boost Python Performance Up to 50× Without Changing Your Code

Python’s reputation for slowness can be overcome by selecting the right tools—Numba, PyPy, CuPy, JAX, Ray, Joblib, async I/O, memory profilers, and big‑data frameworks—delivering speedups from 6× to over 50× with minimal or no code modifications.

GPU · Profiling · Ray
0 likes · 22 min read
Old Zhang's AI Learning
Feb 21, 2026 · Artificial Intelligence

Why Fine‑Tuning Large Models Is Now Ridiculously Easy

The article explains how Unsloth dramatically lowers the barrier to fine‑tuning large language models, offering one‑click installation, free Colab GPU support, extensive model coverage, impressive speed and memory gains, and detailed step‑by‑step guides that let anyone with basic Python skills train powerful models.

Colab · GPU · LoRA
0 likes · 14 min read
dbaplus Community
Feb 9, 2026 · Artificial Intelligence

How EffectiveGPU Cuts GPU Costs with Fine‑Grained Partitioning and Volcano Scheduling

This article details how SF Tech's EffectiveGPU (EGPU) platform redesigns GPU resource management on Kubernetes, introducing fine‑grained memory and compute partitioning, priority‑based scheduling, Volcano integration, and monitoring pipelines to dramatically improve utilization and reduce hardware costs for AI workloads.

AI platform · GPU · GPU partitioning
0 likes · 23 min read
AI Waka
Feb 1, 2026 · Artificial Intelligence

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

GPU · LLM · Quantization
0 likes · 20 min read
Old Zhang's AI Learning
Jan 28, 2026 · Artificial Intelligence

How to Deploy DeepSeek‑OCR‑2 Locally: A Hands‑On Walkthrough

The article details a step‑by‑step local deployment of DeepSeek‑OCR‑2, covering GPU memory requirements, accuracy on complex tables, long inference times, dependency hurdles like GCC, GLIBC and flash‑attn, and provides concrete solutions using conda environments and symlinks.

Conda · DeepSeek-OCR 2 · GPU
0 likes · 7 min read
21CTO
Jan 26, 2026 · Artificial Intelligence

What’s New in PyTorch 2.10? Deep Dive into GPU and CUDA Enhancements

PyTorch 2.10 introduces extensive upgrades for AMD ROCm, Intel XPU, and NVIDIA CUDA, adds new Torch XPU APIs, expands Python 3.14 support, and brings performance‑focused improvements such as fused kernels and enhanced quantization, all available via the official GitHub release.

CUDA · GPU · PyTorch
0 likes · 4 min read
MaGe Linux Operations
Jan 18, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference on Kubernetes with GPU Autoscaling

This guide walks through building a production‑grade Kubernetes GPU cluster for large language model inference, covering hardware sizing, GPU resource scheduling, model storage options, automated scaling with HPA, health checks, monitoring, troubleshooting, and multi‑model deployment strategies.

Docker · GPU · LLM
0 likes · 49 min read
Architects' Tech Alliance
Jan 16, 2026 · Artificial Intelligence

Why Do GPUs and NPUs Produce Different FP16 Results? Uncovering AI Chip Precision Secrets

Engineers training large AI models often see noticeable FP16/BF16 result differences between GPUs and NPUs, and even between generations of the same chip, due to floating‑point representation limits, hardware design choices, software library implementations, compiler optimizations, and parallel execution nondeterminism.

AI · GPU · Hardware Design
0 likes · 10 min read
Architects' Tech Alliance
Jan 1, 2026 · Artificial Intelligence

Why Nvidia’s Blackwell B200 Could Redefine AI GPU Performance

The article provides an in‑depth technical analysis of Nvidia’s Blackwell B200 GPU, detailing its multi‑chip architecture, cache hierarchy, memory bandwidth, atomic operation latency, compute throughput, and tensor memory features, and compares these metrics against Nvidia H100, A100 and AMD MI300X to assess its suitability for AI workloads.

AI · AMD · GPU
0 likes · 19 min read
Past Memory Big Data
Dec 31, 2025 · Industry Insights

NVIDIA Data‑Center GPU Evolution: V100 to B300 – A Programmer’s Selection Guide

The article maps the evolution of NVIDIA’s data‑center GPUs—from the Volta‑based V100 through Ampere A100, Hopper H100, specialized A800/H800/H20, up to the Blackwell B200/B300—detailing architectures, memory, interconnect, performance trade‑offs, and offers a decision framework for programmers to match each model to specific AI workloads, budgets and regulatory constraints.

AI · Data Center · GPU
0 likes · 11 min read
Architects' Tech Alliance
Dec 31, 2025 · Artificial Intelligence

Why Google’s TPUv7 Is Outsmarting Nvidia GPUs: From Performance to System Efficiency

The article examines the shifting AI‑chip landscape, explaining how Google’s TPUv7, backed by massive pod architecture and optical circuit switching, challenges Nvidia’s GPU dominance by offering superior system‑level efficiency and lower total cost of ownership for large‑scale model training.

AI hardware · GPU · Large-scale AI training
0 likes · 12 min read
MaGe Linux Operations
Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPU · LLM · Performance Tuning
0 likes · 48 min read
MaGe Linux Operations
Dec 26, 2025 · Operations

Taming vLLM OOM: Real‑World Causes and Proven Fixes for Production

This article examines why vLLM experiences out‑of‑memory errors in production, explains memory fragmentation caused by PagedAttention, outlines four typical OOM scenarios with concrete command‑line solutions, and provides deep analysis, configuration scripts, dynamic tuning, troubleshooting flowcharts, monitoring alerts, and best‑practice recommendations.

GPU · Memory Fragmentation · OOM
0 likes · 24 min read
MaGe Linux Operations
Dec 19, 2025 · Artificial Intelligence

Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks

Only a few vLLM settings truly affect performance. This guide shows how tuning gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill can raise Qwen2.5‑72B‑Instruct throughput from ~1800 to over 2500 tokens/s and improve latency, and it provides comprehensive deployment, monitoring, and troubleshooting instructions.

Docker · GPU · Inference Optimization
0 likes · 30 min read
Raymond Ops
Dec 16, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.

AI inference · CUDA · GPU
0 likes · 15 min read
Data STUDIO
Dec 9, 2025 · Artificial Intelligence

20 Core PyTorch Concepts to Accelerate Your AI Projects

This article walks through twenty essential PyTorch concepts—from basic Tensor creation and manipulation, through autograd and neural‑network construction, to data loading, GPU acceleration, model saving, and practical training tricks—providing concrete code examples and clear explanations for developers eager to build and deploy AI models.

Autograd · DataLoader · GPU
0 likes · 16 min read
Sohu Tech Products
Dec 3, 2025 · Frontend Development

Recreating Stunning Strange Attractor, Fibonacci Sphere & Galaxy Animations in Flutter with Pure Dart

This article explains how to implement three complex visual effects—Strange Attractor, Fibonacci Sphere, and Galaxy animations—in Flutter using only Dart code, covering the underlying differential equations, Euler integration, 3D‑to‑2D projection, rotation, perspective, performance optimizations, and solutions to common GPU tile‑artifact issues.

DART · Flutter · GPU
0 likes · 16 min read
AntTech
Nov 27, 2025 · Artificial Intelligence

How AMem NCCL‑Plugin Cuts GPU Memory Overhead for Trillion‑Parameter RL Models

The article explains the design, implementation, and performance of the AMem NCCL‑Plugin, a lightweight extension to NVIDIA's NCCL that enables transparent offloading and rapid recovery of GPU memory during reinforcement‑learning training of trillion‑parameter models, detailing its architecture, APIs, benchmarks, installation steps, and integration guidelines.

ASystem · GPU · NCCL
0 likes · 18 min read
Network Intelligence Research Center (NIRC)
Nov 24, 2025 · Artificial Intelligence

Simplifying AI Operator Development with TileLang DSL

TileLang is a Python‑style DSL built on TVM that separates algorithm logic from hardware scheduling, offers beginner to expert interfaces, supports multiple GPU and CPU backends, and delivers performance on par with or better than existing AI kernels, as demonstrated with GEMM, FlashAttention and other benchmarks.

AI operators · GEMM · GPU
0 likes · 10 min read
Deepin Linux
Nov 10, 2025 · Fundamentals

How the Linux DRM GPU Driver Framework Powers Modern Graphics

An in‑depth look at Linux’s DRM GPU driver framework reveals how Direct Rendering Manager, libdrm, KMS, GEM and related components collaborate to manage GPU resources, render graphics, and support multi‑display setups, complete with illustrative code examples and practical debugging tips.

DRM · GPU · Graphics
0 likes · 47 min read
IT Services Circle
Nov 9, 2025 · Fundamentals

Why Nvidia’s GPUs Are the Secret Key to the Quantum Computing Era

Nvidia is applying its GPUs to quantum computing’s error‑correction problem, introducing the ultra‑fast NVQLink interconnect and the CUDA‑Q programming platform and creating a feedback loop that secures its dominance in both traditional computing and the emerging quantum market.

CUDA-Q · GPU · NVQLink
0 likes · 6 min read
IT Services Circle
Nov 7, 2025 · Artificial Intelligence

Why Microsoft’s GPU Fleet Is Sitting Idle – The Power Crisis Behind AI’s Growth

Microsoft CEO Satya Nadella admits that the tech giant’s massive stock of Nvidia GPUs is sitting idle because of insufficient electricity and a lack of ready‑to‑use data‑center facilities, highlighting a broader industry shift in which AI’s soaring compute demand is now constrained by power and infrastructure limits.

AI · Cloud Computing · Data Centers
0 likes · 8 min read
Linux Kernel Journey
Nov 4, 2025 · Operations

How to Use Kernel Tracepoints for Zero‑Overhead GPU Driver Monitoring

This tutorial explains how to leverage Linux kernel tracepoints with eBPF and bpftrace to capture real‑time GPU driver activity—including job scheduling, memory management, and command submission—across Intel, AMD, Nouveau, and NVIDIA GPUs, providing detailed examples, scripts, and analysis of the resulting data.

DRM · GPU · bpftrace
0 likes · 20 min read
Open Source Linux
Nov 4, 2025 · Artificial Intelligence

Why NVIDIA Left China and How Domestic AI Chips Are Rising to Lead

After NVIDIA’s abrupt exit from the Chinese market, domestic AI chip makers such as Huawei Ascend, Cambricon, Moore Threads, and Muxi are rapidly filling the gap, with growing market share, diverse architectures, and ambitious production goals that could soon put them ahead of foreign competitors.

AI chips · China Market · Domestic semiconductor
0 likes · 6 min read
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

Why Nvidia’s $5 Trillion Valuation Marks a New Era for AI Infrastructure

Nvidia just became the first company to break the $5 trillion market‑cap threshold, a milestone that underscores its rapid growth, ambitious AI‑factory vision, 6G edge‑AI plans, autonomous‑driving initiatives, digital‑twin manufacturing, and the strategic importance of its CUDA ecosystem.

AI · GPU · NVIDIA
0 likes · 8 min read
Efficient Ops
Oct 28, 2025 · Fundamentals

What Is Computing Power and Why It Drives AI, Cloud, and Blockchain

This article explains the concept of computing power, its measurement units, its classification into general and specialized types, the roles of CPUs, GPUs, FPGAs, and ASICs, and how it underpins AI model training, blockchain mining, and scientific research.

AI · ASIC · Cloud Computing
0 likes · 7 min read
Linux Kernel Journey
Oct 21, 2025 · Industry Insights

Bridging the GPU Observability Gap: Why eBPF on GPUs Matters

The article explains how bpftime extends eBPF to NVIDIA and AMD GPUs, exposing fine‑grained execution details that traditional CPU‑side tools miss, and demonstrates a unified, programmable observability stack that overcomes the limitations of existing GPU profilers in both synchronous and asynchronous workloads.

CUDA · GPU · Observability
0 likes · 23 min read
Programmer DD
Oct 13, 2025 · Artificial Intelligence

Running ONNX AI Inference Natively in Java Without Python

This article explains how enterprise architects can integrate ONNX‑based machine‑learning inference directly into Java applications, covering tokenizer integration, GPU acceleration, deployment patterns, and lifecycle management to achieve secure, scalable, and observable AI services without relying on Python runtimes.

AI inference · Enterprise Architecture · GPU
0 likes · 16 min read
Programmer DD
Oct 12, 2025 · Backend Development

Boost Java Performance: Integrate CUDA GPU Acceleration via JNI

This guide explains why Java struggles with high‑performance or data‑intensive workloads, introduces GPU acceleration with CUDA, compares integration options such as JCuda, JNI, and JNA, walks through a practical encryption use case with performance benchmarks, and provides production‑grade best practices for memory, threading, testing, security, and deployment.

CUDA · GPU · High-performance computing
0 likes · 23 min read
DataFunTalk
Oct 10, 2025 · Artificial Intelligence

Is Oracle’s AI Cloud a Hidden Money‑Sink? Uncovering the Real Profit Margins

An in‑depth analysis reveals that Oracle’s AI‑focused cloud business, built on expensive Nvidia GPU rentals for OpenAI and other AI developers, generates massive revenue but suffers from alarmingly low profit margins, creating a systemic risk that could ripple through the entire AI infrastructure ecosystem.

AI cloud · Cloud Computing · GPU
0 likes · 14 min read
21CTO
Oct 7, 2025 · Artificial Intelligence

Why Microsoft Is Shifting AI Workloads from GPUs to Its Own Maia Accelerators

Microsoft, after buying massive GPU inventories from Nvidia and AMD, is accelerating its move to custom AI accelerators like Maia to improve cost‑performance in its data centers, even though its first‑generation chips still lag behind industry leaders.

AI accelerator · Cloud Computing · GPU
0 likes · 5 min read
Java Tech Enthusiast
Oct 6, 2025 · Artificial Intelligence

How China’s New GPU Startup Moore Threads Is Accelerating the AI Race

Amid US export restrictions, China’s five‑year‑old GPU pioneer Moore Threads is racing to fill the high‑end GPU gap. The article covers the role of GPUs in AI, the company’s ecosystem strategy, and what its fast‑track IPO means for the domestic semiconductor and AI compute landscape.

AI computing · China · GPU
0 likes · 10 min read