Tagged articles

14 articles

Page 1 of 1

Feb 1, 2026 · Artificial Intelligence

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

GPUInferenceLLM

0 likes · 20 min read

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

Wu Shixiong's Large Model Academy

Aug 23, 2025 · Artificial Intelligence

Why LoRA, QLoRA, Prompt & Prefix Tuning Are Changing Large‑Model Fine‑Tuning

This article explains the mathematical basis of LoRA, compares it with QLoRA, Prompt Tuning, Prefix Tuning and P‑tuning, shows practical PyTorch implementations, and provides mixed‑precision training tips so readers can choose the most memory‑efficient fine‑tuning method for their large language models.

LoRAPrompt TuningQLoRA

0 likes · 17 min read

Why LoRA, QLoRA, Prompt & Prefix Tuning Are Changing Large‑Model Fine‑Tuning

Python Programming Learning Circle

Apr 3, 2025 · Artificial Intelligence

Accelerating PyTorch Model Training: Techniques, Benchmarks, and Code

This article explains how to dramatically speed up PyTorch model training using code optimizations, mixed‑precision, torch.compile, distributed data parallelism, and DeepSpeed, presenting benchmark results that show up to 11.5× acceleration on multiple GPUs while maintaining high accuracy.

Deep LearningDeepSpeedDistributed Training

0 likes · 6 min read

Accelerating PyTorch Model Training: Techniques, Benchmarks, and Code

AI Algorithm Path

Mar 16, 2025 · Artificial Intelligence

Speed Up Your PyTorch Model Training: Practical Tips and Tricks

This article walks through concrete techniques to accelerate PyTorch training, covering mixed‑precision with torch.cuda.amp, profiling with torch.profiler, DataLoader tuning, torch.compile, distributed strategies like DataParallel and DDP, gradient accumulation, and advanced libraries such as Lightning, Apex, and DeepSpeed, plus model‑level optimizations and monitoring tips.

DataLoaderDistributed TrainingProfiling

0 likes · 12 min read

Speed Up Your PyTorch Model Training: Practical Tips and Tricks

AI Algorithm Path

Mar 16, 2025 · Artificial Intelligence

How to Train PyTorch Models Using Far Less GPU Memory

This article walks through a suite of PyTorch techniques—including automatic mixed precision, BF16, gradient checkpointing, gradient accumulation, tensor sharding, efficient data loading, in‑place ops, lightweight optimizers, memory profiling, TorchScript, and kernel fusion—that together can cut peak GPU memory usage by up to twenty‑fold while preserving model accuracy.

GPU MemoryPyTorchdata loading

0 likes · 13 min read

How to Train PyTorch Models Using Far Less GPU Memory

DataFunTalk

Mar 3, 2025 · Artificial Intelligence

FlightVGM: FPGA-Accelerated Inference for Video Generation Models Wins Best Paper at FPGA 2025

The FlightVGM paper, awarded Best Paper at FPGA 2025, details a novel FPGA-based inference IP for video generation models that leverages time‑space activation sparsity, mixed‑precision DSP58 extensions, and adaptive scheduling to achieve up to 1.30× performance and 4.49× energy‑efficiency gains over a NVIDIA 3090 GPU while preserving model accuracy.

AIFPGAHardware acceleration

0 likes · 11 min read

FlightVGM: FPGA-Accelerated Inference for Video Generation Models Wins Best Paper at FPGA 2025

NewBeeNLP

Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines various techniques for compressing and accelerating the KV cache in transformer models—including MQA, GQA, MLA, sliding‑window and linear attention, flash attention, page and ring attention, as well as mixed‑precision training and ZeRO parallelism—providing code snippets, implementation details, and practical trade‑offs.

FlashAttentionKV cacheModel Parallelism

0 likes · 17 min read

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

Rare Earth Juejin Tech Community

May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

Distributed TrainingGPU MemoryPipeline Parallel

0 likes · 9 min read

GPU Memory Analysis and Distributed Training Strategies

NewBeeNLP

Feb 5, 2024 · Artificial Intelligence

How HiFT Slashes GPU Memory for LLM Fine‑Tuning with Hierarchical Optimization

HiFT introduces a layer‑wise hierarchical fine‑tuning strategy that freezes most parameters per step, reduces optimizer state memory, and adapts mixed‑precision training, enabling 7B and 13B models to be fine‑tuned on 16‑31 GB GPUs while maintaining competitive performance.

GPU MemoryHiFTLLM fine-tuning

0 likes · 12 min read

How HiFT Slashes GPU Memory for LLM Fine‑Tuning with Hierarchical Optimization

Baidu Geek Talk

Jan 16, 2023 · Artificial Intelligence

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

This technical walkthrough explains how Swin Transformer training and inference can be dramatically accelerated on NVIDIA GPUs by using Nsight Systems profiling, mixed‑precision tensor‑core kernels, Apex‑based and custom CUDA operator fusion, half2 vectorization, register‑array caching, and INT8 quantization, achieving up to 2.85× training and 7.34× inference speedups while preserving model accuracy.

GPU performanceINT8 QuantizationNsight Profiling

0 likes · 23 min read

Boosting Swin Transformer Speed: Profiling, Mixed Precision, and Kernel Fusion Secrets

Baidu Intelligent Cloud Tech Hub

Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI PerformanceGPU OptimizationOperator fusion

0 likes · 23 min read

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

21CTO

Oct 2, 2021 · Artificial Intelligence

How PyTorch Lightning Can Make Your Deep Learning Pipeline 10× Faster

This article explains six practical techniques—parallel data loading, distributed multi‑GPU training, mixed precision, early stopping, sharded training, and inference optimizations—using PyTorch Lightning to dramatically accelerate deep‑learning pipelines, turning days‑long experiments into minute‑scale runs.

Deep LearningGPUPyTorch Lightning

0 likes · 7 min read

How PyTorch Lightning Can Make Your Deep Learning Pipeline 10× Faster

Meituan Technology Team

Nov 14, 2019 · Artificial Intelligence

MT-BERT: Pre‑training and Fine‑tuning Practices at Meituan‑Dianping

MT‑BERT at Meituan‑Dianping combines mixed‑precision, domain‑adapted continual pre‑training, knowledge‑graph‑aware masking, and extensive compression techniques to produce fast, accurate BERT models that power fine‑grained sentiment analysis, intent classification, recommendation reasoning, and other NLP tasks across the platform.

BERTKnowledge GraphMT-BERT

0 likes · 33 min read

MT-BERT: Pre‑training and Fine‑tuning Practices at Meituan‑Dianping

Tencent Architect

Jul 30, 2018 · Artificial Intelligence

Four‑Minute ImageNet Training: Tencent’s AI Platform Sets a New World Record

Tencent’s intelligent machine‑learning platform achieved a world‑record by training AlexNet in 4 minutes and ResNet‑50 in 6.6 minutes on ImageNet, using large batch sizes, mixed‑precision, LARS optimization, hierarchical synchronization, gradient fusion, and pipeline I/O techniques to overcome accuracy and scalability challenges.

AI accelerationDeep LearningImageNet

0 likes · 24 min read

Four‑Minute ImageNet Training: Tencent’s AI Platform Sets a New World Record