Tagged articles
8 articles
AI Cyberspace
Jan 29, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Efficient LLM Fine‑Tuning with LoRA, QLoRA, and Llama‑Factory

This tutorial explains the concepts, methods, and practical commands for fine‑tuning large language models using efficient techniques like LoRA and QLoRA, covering model selection, resource considerations, Docker deployment, dataset preparation, training configuration, evaluation metrics, model merging, and deployment with GGUF and Ollama.

GGUF · GPU memory optimization · LLM fine-tuning
27 min read
Alibaba Cloud Developer
Jan 15, 2026 · Artificial Intelligence

How Hierarchical Sparse Attention Breaks KVCache Limits for Ultra‑Long Context LLMs

This article explains how a hierarchical sparse‑attention framework redesigns KVCache storage across GPU, CPU, and remote memory, eliminates bandwidth and capacity bottlenecks, and enables efficient inference for 128K‑token and larger contexts with dramatically reduced GPU memory usage and higher throughput.

Dynamic Sparse Attention · GPU memory optimization · Hierarchical Storage
20 min read
Baobao Algorithm Notes
Mar 13, 2025 · Artificial Intelligence

Why EP Outperforms TP for DeepSeek V3/R1 Inference: Cost, Performance, and Reliability

This article analyzes DeepSeek's inference architecture for the V3/R1 models, which is built on expert parallelism (EP), and compares it with tensor parallelism (TP). It details how EP reduces memory and compute overhead, permits larger batch sizes, and cuts GPU memory usage, and it discusses the reliability, scalability, and maintainability challenges that EP introduces for large‑scale deployments.

AI Infrastructure · Expert Parallelism · GPU memory optimization
18 min read
DataFunTalk
Dec 6, 2023 · Artificial Intelligence

Distributed Training Techniques and Quantitative Analysis for Large Language Models (GPT‑175B)

This article presents a comprehensive overview of state‑of‑the‑art distributed training methods for large language models, using GPT‑175B as a case study to analyze memory, communication, and compute overheads and to recommend practical optimization strategies such as tensor, pipeline, and sequence parallelism, the ZeRO‑1 optimizer, and selective activation checkpointing.

Distributed Training · GPU memory optimization · LLM
22 min read
Baobao Algorithm Notes
Oct 15, 2023 · Artificial Intelligence

Run a 70B FP16 Model on a Single 16 GB GPU with PyTorch Meta Device

This article explains how to overcome GPU memory limits by using the meta device introduced in PyTorch 1.9 to create an empty model without allocating its weights, load the weights layer by layer, move each layer to a 16 GB GPU for inference, and then release its memory, so that a 70B FP16 model can run on a single consumer‑grade GPU.

GPU memory optimization · PyTorch · meta device
12 min read
Kuaishou Large Model
Jul 29, 2022 · Fundamentals

How Automatic Quantization Slashes Memory Use in High‑Resolution Physical Simulations

This article explains how researchers applied quantization techniques to high‑resolution physical simulations, enabling over 50% memory reduction without noticeable visual loss, by modeling error propagation, using constrained optimization, and introducing dithering, with results demonstrated on GPU‑based smoke, fluid, and elastic body simulations.

GPU memory optimization · Physical Simulation · SIGGRAPH
6 min read
Alibaba Cloud Developer
Dec 9, 2017 · Artificial Intelligence

How to Train Deeper TensorFlow Models by Optimizing GPU Memory

This article summarizes a NIPS 2017 paper that introduces GPU memory‑optimization techniques (swap‑out/in and a memory‑efficient attention layer) integrated into TensorFlow, enabling significantly larger batch sizes and deeper models without sacrificing accuracy.

Deep Learning · GPU memory optimization · NIPS 2017
8 min read