Tagged articles
14 articles
AI Waka
Feb 1, 2026 · Artificial Intelligence

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

GPU · Inference · LLM
20 min read
Wu Shixiong's Large Model Academy
Aug 23, 2025 · Artificial Intelligence

Why LoRA, QLoRA, Prompt & Prefix Tuning Are Changing Large‑Model Fine‑Tuning

This article explains the mathematical basis of LoRA, compares it with QLoRA, Prompt Tuning, Prefix Tuning and P‑tuning, shows practical PyTorch implementations, and provides mixed‑precision training tips so readers can choose the most memory‑efficient fine‑tuning method for their large language models.

LoRA · Prompt Tuning · QLoRA
17 min read
AI Algorithm Path
Mar 16, 2025 · Artificial Intelligence

Speed Up Your PyTorch Model Training: Practical Tips and Tricks

This article walks through concrete techniques to accelerate PyTorch training, covering mixed‑precision with torch.cuda.amp, profiling with torch.profiler, DataLoader tuning, torch.compile, distributed strategies like DataParallel and DDP, gradient accumulation, and advanced libraries such as Lightning, Apex, and DeepSpeed, plus model‑level optimizations and monitoring tips.

DataLoader · Distributed Training · Profiling
12 min read
AI Algorithm Path
Mar 16, 2025 · Artificial Intelligence

How to Train PyTorch Models Using Far Less GPU Memory

This article walks through a suite of PyTorch techniques—including automatic mixed precision, BF16, gradient checkpointing, gradient accumulation, tensor sharding, efficient data loading, in‑place ops, lightweight optimizers, memory profiling, TorchScript, and kernel fusion—that together can cut peak GPU memory usage by up to twenty‑fold while preserving model accuracy.

GPU Memory · PyTorch · data loading
13 min read
DataFunTalk
Mar 3, 2025 · Artificial Intelligence

FlightVGM: FPGA-Accelerated Inference for Video Generation Models Wins Best Paper at FPGA 2025

The FlightVGM paper, awarded Best Paper at FPGA 2025, details a novel FPGA-based inference IP for video generation models that leverages time‑space activation sparsity, mixed‑precision DSP58 extensions, and adaptive scheduling to achieve up to 1.30× higher performance and 4.49× better energy efficiency than an NVIDIA 3090 GPU while preserving model accuracy.

AI · FPGA · Hardware acceleration
11 min read
NewBeeNLP
Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines various techniques for compressing and accelerating the KV cache in transformer models, including MQA, GQA, MLA, sliding‑window and linear attention, FlashAttention, paged and ring attention, as well as mixed‑precision training and ZeRO parallelism, and provides code snippets, implementation details, and practical trade‑offs.

FlashAttention · KV cache · Model Parallelism
17 min read
Rare Earth Juejin Tech Community
May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

Distributed Training · GPU Memory · Pipeline Parallel
9 min read
NewBeeNLP
Feb 5, 2024 · Artificial Intelligence

How HiFT Slashes GPU Memory for LLM Fine‑Tuning with Hierarchical Optimization

HiFT introduces a layer‑wise hierarchical fine‑tuning strategy that freezes most parameters per step, reduces optimizer state memory, and adapts mixed‑precision training, enabling 7B and 13B models to be fine‑tuned on 16‑31 GB GPUs while maintaining competitive performance.

GPU Memory · HiFT · LLM fine-tuning
12 min read
Baidu Geek Talk
Jan 16, 2023 · Artificial Intelligence

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

This technical walkthrough explains how Swin Transformer training and inference can be dramatically accelerated on NVIDIA GPUs by using Nsight Systems profiling, mixed‑precision tensor‑core kernels, Apex‑based and custom CUDA operator fusion, half2 vectorization, register‑array caching, and INT8 quantization, achieving up to 2.85× training and 7.34× inference speedups while preserving model accuracy.

GPU performance · INT8 Quantization · Nsight Profiling
23 min read
Baidu Intelligent Cloud Tech Hub
Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI Performance · GPU Optimization · Operator fusion
23 min read
21CTO
Oct 2, 2021 · Artificial Intelligence

How PyTorch Lightning Can Make Your Deep Learning Pipeline 10× Faster

This article explains six practical techniques—parallel data loading, distributed multi‑GPU training, mixed precision, early stopping, sharded training, and inference optimizations—using PyTorch Lightning to dramatically accelerate deep‑learning pipelines, turning days‑long experiments into minute‑scale runs.

Deep Learning · GPU · PyTorch Lightning
7 min read
Meituan Technology Team
Nov 14, 2019 · Artificial Intelligence

MT-BERT: Pre‑training and Fine‑tuning Practices at Meituan‑Dianping

MT‑BERT at Meituan‑Dianping combines mixed‑precision training, domain‑adapted continual pre‑training, knowledge‑graph‑aware masking, and extensive compression techniques to produce fast, accurate BERT models that power fine‑grained sentiment analysis, intent classification, recommendation reasoning, and other NLP tasks across the platform.

BERT · Knowledge Graph · MT-BERT
33 min read
Tencent Architect
Jul 30, 2018 · Artificial Intelligence

Four‑Minute ImageNet Training: Tencent’s AI Platform Sets a New World Record

Tencent’s intelligent machine‑learning platform set a world record by training AlexNet in 4 minutes and ResNet‑50 in 6.6 minutes on ImageNet, using large batch sizes, mixed‑precision training, LARS optimization, hierarchical synchronization, gradient fusion, and pipelined I/O to overcome accuracy and scalability challenges.

AI acceleration · Deep Learning · ImageNet
24 min read