Tagged articles
11 articles
Page 1 of 1
AI Explorer
AI Explorer
Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache separates the KV cache from a vLLM instance into a shared service, dramatically cutting first‑token latency for repeated text, enabling multiple GPU instances to reuse cached vectors, improving hardware utilization, and supporting use cases such as long‑document QA, multi‑GPU load balancing, and prompt‑engineering, with a quick Docker‑based demo.

DockerKV cacheLLM inference
0 likes · 6 min read
How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency
MaGe Linux Operations
MaGe Linux Operations
Jul 21, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.

AI deploymentCUDAOllama
0 likes · 16 min read
Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production
Instant Consumer Technology Team
Instant Consumer Technology Team
Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI PerformanceData ParallelGPU inference
0 likes · 11 min read
Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Apr 23, 2025 · Artificial Intelligence

DeepQueueNet in Practice: Quickly Achieve High‑Precision Network Simulation

This article walks through using DeepQueueNet—a deep‑learning‑enhanced network performance estimator—to set up a device model, train the PyTorch version, configure a fattree16 topology, and run multi‑GPU simulations that deliver minute‑level, packet‑accurate results in as little as 1 minute 27 seconds.

Deep LearningDeepQueueNetPyTorch
0 likes · 6 min read
DeepQueueNet in Practice: Quickly Achieve High‑Precision Network Simulation
AI Algorithm Path
AI Algorithm Path
Mar 10, 2025 · Artificial Intelligence

How Much GPU Memory Does an LLM Service Really Need?

This article explains a simple formula for estimating the GPU VRAM required to serve large language models, demonstrates the calculation with a 7‑billion‑parameter example, clarifies why a 20% safety buffer is needed, and offers practical strategies such as quantization, CPU offload, and multi‑GPU parallelism to reduce memory usage.

DeploymentGPU MemoryLLM
0 likes · 6 min read
How Much GPU Memory Does an LLM Service Really Need?
Architect
Architect
Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a comprehensive set of engineering techniques—including CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding—to dramatically boost inference throughput and cut response latency.

LLM inferencePaged Attentionchunked prefill
0 likes · 23 min read
How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism
DataFunSummit
DataFunSummit
Feb 17, 2025 · Artificial Intelligence

NorthStar Large‑Model Training Framework: Architecture, APIs, Pipeline and Multi‑GPU Strategies

The article introduces the NorthStar large‑model training framework developed by DeWu, detailing its background challenges, pipeline architecture, rich API support, multi‑GPU training modes, multi‑level embedding storage, hardware selection considerations, and a brief Q&A on data versus model parallelism.

AI FrameworkEmbedding Storagelarge model training
0 likes · 9 min read
NorthStar Large‑Model Training Framework: Architecture, APIs, Pipeline and Multi‑GPU Strategies
NetEase Media Technology Team
NetEase Media Technology Team
Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inferenceModel OptimizationProfiling
0 likes · 44 min read
GPU Model Inference Optimization Practices in NetEase News Recommendation System
Python Programming Learning Circle
Python Programming Learning Circle
Aug 23, 2021 · Artificial Intelligence

Efficient PyTorch Training Pipeline: Tips, Profiling, and Multi‑GPU Strategies

This article presents practical strategies for building high‑performance PyTorch training pipelines, covering bottleneck identification, efficient data loading, RAM‑based datasets, profiling tools, multi‑GPU training with DataParallel and DistributedDataParallel, custom loss implementation, and hardware‑vs‑software trade‑offs to accelerate deep‑learning workloads.

Custom LossDataLoaderDeep Learning
0 likes · 13 min read
Efficient PyTorch Training Pipeline: Tips, Profiling, and Multi‑GPU Strategies
Liulishuo Tech Team
Liulishuo Tech Team
Mar 25, 2017 · Artificial Intelligence

Building a Student Model with TensorFlow: Deep Knowledge Tracing for Adaptive Learning

This article reviews how Liulishuo applied TensorFlow to implement a Deep Knowledge Tracing (DKT) student model for an adaptive learning system, covering the problem background, model architecture, TensorFlow implementation details, multi‑GPU training, and practical deployment considerations.

Deep Knowledge TracingRNNStudent Modeling
0 likes · 12 min read
Building a Student Model with TensorFlow: Deep Knowledge Tracing for Adaptive Learning