Tag: Mixed Precision


Python Programming Learning Circle
Apr 3, 2025 · Artificial Intelligence

Accelerating PyTorch Model Training: Techniques, Benchmarks, and Code

This article explains how to dramatically speed up PyTorch model training through code-level optimizations, mixed-precision training, torch.compile, distributed data parallelism, and DeepSpeed, with benchmark results showing up to 11.5× acceleration across multiple GPUs while maintaining high accuracy.

DeepSpeed · GPU · Mixed Precision
6 min read
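As a rough companion to the techniques this article benchmarks, here is a minimal sketch of a mixed-precision training loop combined with torch.compile (PyTorch 2.x). The toy model, shapes, and hyperparameters are invented for illustration and are not the article's benchmark setup.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, for illustration only (assumes a CUDA device).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
model = torch.compile(model)  # JIT-compile the forward pass (PyTorch 2.x)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in float16 where it is numerically safe.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()                 # adjusts the scale factor for the next step
```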
DataFunTalk
Mar 3, 2025 · Artificial Intelligence

FlightVGM: FPGA-Accelerated Inference for Video Generation Models Wins Best Paper at FPGA 2025

The FlightVGM paper, awarded Best Paper at FPGA 2025, presents an FPGA-based inference accelerator for video generation models that exploits spatio-temporal activation sparsity, mixed-precision DSP58 extensions, and adaptive scheduling to deliver up to a 1.30× speedup and 4.49× better energy efficiency than an NVIDIA RTX 3090 GPU while preserving model accuracy.

AI · FPGA · Mixed Precision
11 min read
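The sparsity idea at the heart of FlightVGM is implemented in FPGA hardware, but a software analogy may help: tile the activation matrix and skip tiles that are entirely zero. The sketch below is only that analogy, with toy shapes, and does not reflect the paper's DSP58-level design.

```python
import torch

# Toy activation matrix with some all-zero row tiles, standing in for
# temporally inactive regions of a video feature map (invented shapes).
acts = torch.relu(torch.randn(512, 512))
acts[128:256] = 0          # pretend this temporal segment is inactive
weight = torch.randn(512, 256)

TILE = 64
out = torch.zeros(512, 256)
for r in range(0, acts.shape[0], TILE):
    tile = acts[r:r + TILE]
    if tile.count_nonzero() == 0:  # skip all-zero tiles: no work spent on them
        continue
    out[r:r + TILE] = tile @ weight
```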
Rare Earth Juejin Tech Community
May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine-tuning, introduces the collective communication primitives behind distributed training, and compares data parallelism, model parallelism, ZeRO, pipeline parallelism, mixed-precision training, and activation checkpointing as techniques for reducing memory consumption in large-scale AI training.

GPU Memory · Mixed Precision · Pipeline Parallel
9 min read
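Among the techniques the article compares, activation checkpointing is easy to show in isolation: activations are dropped during the forward pass and recomputed during backward, trading compute for memory. The sketch below uses torch.utils.checkpoint.checkpoint_sequential on a hypothetical stack of layers; the sizes and segment count are invented for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical deep stack of blocks; sizes are invented for illustration.
blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(24)]
model = nn.Sequential(*blocks)

x = torch.randn(32, 1024)

# Split the 24 blocks into 4 segments: only segment-boundary activations
# stay alive during the forward pass; the rest are recomputed on backward.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```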
Tencent Architect
Jul 30, 2018 · Artificial Intelligence

Four‑Minute ImageNet Training: Tencent’s AI Platform Sets a New World Record

Tencent's intelligent machine-learning platform set a world record by training AlexNet in 4 minutes and ResNet-50 in 6.6 minutes on ImageNet, combining large batch sizes, mixed-precision training, LARS optimization, hierarchical synchronization, gradient fusion, and pipelined I/O to overcome the accuracy and scalability challenges of large-batch training.

AI Acceleration · ImageNet · Large Batch Training
24 min read
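Of the techniques listed, LARS is compact enough to sketch: each layer's update is scaled by a trust ratio of weight norm to gradient norm, which keeps large-batch updates proportional in size to the weights they modify. The helper lars_trust_ratio and the toy model below are a simplified illustration, not Tencent's production optimizer.

```python
import torch

def lars_trust_ratio(param, grad, weight_decay=1e-4, trust_coef=0.001):
    """Hypothetical helper: the LARS trust ratio
    trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = param.norm()
    g_norm = grad.norm()
    if w_norm == 0 or g_norm == 0:
        return 1.0
    return float(trust_coef * w_norm / (g_norm + weight_decay * w_norm))

# One SGD step with per-layer LARS scaling on top of the global LR
# (toy model and data, for illustration only).
model = torch.nn.Linear(256, 256)
model(torch.randn(8, 256)).sum().backward()
lr = 0.1
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.add_(p.grad, alpha=-lr * lars_trust_ratio(p, p.grad))
```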