Tagged articles

ViT³

10 articles · Page 1 of 1
Machine Heart
Jun 12, 2026 · Artificial Intelligence

ViT³ Reaches CVPR 2026 Best‑Paper Finalist Using Test‑Time Training to Break Transformer Complexity

The ViT³ paper, a CVPR 2026 best‑paper finalist, uses test‑time training to compress visual context, achieving 4.6× faster inference and 90% lower GPU memory use on 1248×1248 images. It also outlines six design principles and demonstrates adaptability to classification, detection, segmentation, and generation tasks.

CVPR 2026 · Efficient Attention · High-Resolution Vision
16 min read
Baidu Geek Talk
May 25, 2026 · Artificial Intelligence

Accelerating Multimodal Model Training: LoongForge's DP Load‑Balancing Optimization Explained

The article analyzes how data‑parallel (DP) load imbalance hampers large‑scale multimodal model training, details LoongForge's two‑stage adaptive data‑reallocation method that builds a precise compute‑cost model and dynamically redistributes samples, and presents experimental results showing up to 10% throughput gains on massive DP clusters.

DP load balancing · Data Parallel · LoongForge
16 min read
AIWalker
May 19, 2026 · Artificial Intelligence

How EUPE’s Three‑Stage Distillation Lets an 86M Model Run Classification, Segmentation and VLM on iPhone in 62 ms (SOTA)

EUPE introduces a three‑stage “scale‑then‑shrink” distillation pipeline that first trains a large proxy model to absorb heterogeneous expert knowledge and then compresses it into an 86M encoder, achieving state‑of‑the‑art performance on image classification, dense prediction and vision‑language tasks on an iPhone with only 62 ms latency.

EUPE · Knowledge Distillation · ViT³
16 min read
xkx's Tech General Store
Apr 16, 2026 · Artificial Intelligence

Understanding Vision Transformers: Core ViT Principles and Multimodal Applications

This article explains the Vision Transformer (ViT) architecture, compares it with CNNs and traditional NLP Transformers, details its encoding process and attention mechanisms, and demonstrates a practical leaf‑disease classification project that showcases ViT’s role in multimodal AI systems.

AI Fundamentals · Deep Learning · Multimodal AI
10 min read
Data Party THU
Mar 25, 2026 · Artificial Intelligence

How Knowledge‑Guided Context Optimization Boosts Zero‑Shot Vision‑Language Models

The article analyzes the Base‑to‑New generalization problem of CLIP‑based vision‑language models and explains why standard prompt tuning (CoOp) forgets base knowledge. It then presents the KgCoOp framework, which adds a knowledge‑guided loss that keeps learned prompts close to hand‑crafted ones, improving unseen‑class performance while preserving efficiency.

CLIP · Knowledge-guided Optimization · Prompt Tuning
12 min read
DeepHub IMBA
Mar 23, 2026 · Artificial Intelligence

How KgCoOp Uses Knowledge‑Guided Context Optimization to Prevent Prompt Tuning Forgetting

The article analyzes why standard prompt tuning (CoOp) causes catastrophic forgetting in vision‑language models and introduces the KgCoOp framework, which adds a knowledge‑guided loss to regularize prompts. Experiments on 11 benchmarks show that KgCoOp improves unseen‑class accuracy, harmonic mean, and efficiency; the article also discusses trade‑offs and limitations.

Catastrophic Forgetting · Knowledge-guided Optimization · Prompt Tuning
11 min read
Rare Earth Juejin Tech Community
Jul 12, 2023 · Artificial Intelligence

Comprehensive Guide to Vision Transformer (ViT): Architecture, Patch Tokenization, Embedding, Fine‑tuning, and Performance

This article provides an in‑depth, English‑language overview of Vision Transformer (ViT), covering its Transformer‑based architecture, patch‑to‑token conversion, token and position embeddings, fine‑tuning strategies such as 2‑D interpolation, experimental results versus CNNs, and the model’s broader significance for multimodal AI research.

Deep Learning · Fine‑tuning · Patch Embedding
25 min read
Rare Earth Juejin Tech Community
Oct 18, 2022 · Artificial Intelligence

Practical Implementation of Vision Transformer (ViT) for Image Classification in PyTorch

This article walks readers through building, training, and evaluating a Vision Transformer (ViT) model for a five‑class flower classification task, providing detailed code snippets, model architecture explanations, training script adjustments, and experimental results that highlight the importance of pre‑trained weights.

Deep Learning · Pretrained Models · PyTorch
13 min read