Tagged articles

multi‑GPU

11 articles · Page 1 of 1

Mar 3, 2026 · Artificial Intelligence

How LMCache’s Lightning‑Fast KV Cache Slashes LLM First‑Token Latency

LMCache moves the KV cache out of the vLLM instance into a shared service, which cuts first‑token latency for repeated text and lets multiple GPU instances reuse cached vectors, improving hardware utilization. Use cases include long‑document QA, multi‑GPU load balancing, and prompt engineering, and the article includes a quick Docker‑based demo.

Docker · KV cache · LLM Inference

0 likes · 6 min read

MaGe Linux Operations

Jul 21, 2025 · Artificial Intelligence

Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring Ollama for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.

AI Deployment · CUDA · Ollama

0 likes · 16 min read

Instant Consumer Technology Team

Jul 11, 2025 · Artificial Intelligence

Why NVLink Boosts Multi‑GPU Inference: Tensor Parallelism Explained

A recent migration of a multimodal image inference system from an internal network to a cloud environment revealed that NVLink bridges dramatically improve multi‑GPU inference speed by reducing inter‑GPU communication overhead, while tensor‑parallel and data‑parallel strategies each have distinct trade‑offs for model deployment.

AI performance · Data Parallel · GPU inference

0 likes · 11 min read

Network Intelligence Research Center (NIRC)

Apr 23, 2025 · Artificial Intelligence

DeepQueueNet in Practice: Quickly Achieve High‑Precision Network Simulation

This article walks through using DeepQueueNet, a deep‑learning‑enhanced network performance estimator, to set up a device model, train the PyTorch version, configure a fattree16 topology, and run multi‑GPU simulations that deliver packet‑accurate results within minutes, in as little as 1 minute 27 seconds.

Deep Learning · DeepQueueNet · PyTorch

0 likes · 6 min read

AI Algorithm Path

Mar 10, 2025 · Artificial Intelligence

How Much GPU Memory Does an LLM Service Really Need?

This article explains a simple formula for estimating the GPU VRAM required to serve large language models, demonstrates the calculation with a 7‑billion‑parameter example, clarifies why a 20% safety buffer is needed, and offers practical strategies such as quantization, CPU offload, and multi‑GPU parallelism to reduce memory usage.

Deployment · GPU memory · LLM

0 likes · 6 min read

Architect

Mar 1, 2025 · Artificial Intelligence

How to Build a High‑Performance, Scalable LLM Inference Engine: From Paged Attention to Multi‑GPU Parallelism

This article analyzes the challenges of deploying large language models locally and presents a set of engineering techniques (CPU/GPU process separation, Paged Attention, Radix Attention, chunked prefill, output‑length reduction, multi‑GPU tensor parallelism, and speculative decoding) that boost inference throughput and cut response latency.

LLM Inference · Performance Optimization · chunked prefill

0 likes · 23 min read

DataFunSummit

Feb 17, 2025 · Artificial Intelligence

NorthStar Large‑Model Training Framework: Architecture, APIs, Pipeline and Multi‑GPU Strategies

The article introduces the NorthStar large‑model training framework developed by DeWu, detailing its background challenges, pipeline architecture, rich API support, multi‑GPU training modes, multi‑level embedding storage, hardware selection considerations, and a brief Q&A on data versus model parallelism.

AI Framework · Embedding Storage · large model training

0 likes · 9 min read

NetEase Media Technology Team

Aug 9, 2023 · Artificial Intelligence

GPU Model Inference Optimization Practices in NetEase News Recommendation System

The article outlines practical GPU inference optimization for NetEase’s news recommendation, covering model analysis with Netron, multi‑GPU parallelism, memory‑copy reduction, batch sizing, TensorRT conversion and tuning, custom plugins, and the GRPS serving framework to achieve significant latency and utilization gains.

GPU inference · Model Optimization · Performance

0 likes · 44 min read

HomeTech

Feb 15, 2022 · Artificial Intelligence

Horovod Distributed Deep Learning Training: Architecture, Performance, and Kubernetes Deployment

This article provides a comprehensive overview of Horovod, Uber's open-source distributed deep learning framework, covering its architecture, communication mechanisms, performance benchmarks, and deployment on Kubernetes and Spark for accelerated multi-GPU training.

Deep Learning · GPU Acceleration · Horovod

0 likes · 17 min read

Python Programming Learning Circle

Aug 23, 2021 · Artificial Intelligence

Efficient PyTorch Training Pipeline: Tips, Profiling, and Multi‑GPU Strategies

This article presents practical strategies for building high‑performance PyTorch training pipelines, covering bottleneck identification, efficient data loading, RAM‑based datasets, profiling tools, multi‑GPU training with DataParallel and DistributedDataParallel, custom loss implementation, and hardware‑vs‑software trade‑offs to accelerate deep‑learning workloads.

Custom Loss · DataLoader · Deep Learning

0 likes · 13 min read

Liulishuo Tech Team

Mar 25, 2017 · Artificial Intelligence

Building a Student Model with TensorFlow: Deep Knowledge Tracing for Adaptive Learning

This article reviews how Liulishuo applied TensorFlow to implement a Deep Knowledge Tracing (DKT) student model for an adaptive learning system, covering the problem background, model architecture, TensorFlow implementation details, multi‑GPU training, and practical deployment considerations.

Deep Knowledge Tracing · RNN · Student Modeling

0 likes · 12 min read
