Tagged articles
35 articles
Page 1 of 1
Machine Heart
Machine Heart
Apr 2, 2026 · Artificial Intelligence

LongCat-Next: Turning Images, Audio, and Text into Tokens – What’s Next?

LongCat-Next is a 68.5‑billion‑parameter discrete‑native autoregressive multimodal model that tokenizes images, audio and text, challenges the belief that visual tokenization loses detail, matches specialized models on fine‑grained tasks, and demonstrates that joint understanding‑generation training can even improve generation quality.

Audio SynthesisLarge ModelLongCat-Next
0 likes · 21 min read
LongCat-Next: Turning Images, Audio, and Text into Tokens – What’s Next?
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 30, 2026 · Artificial Intelligence

Meituan’s Fully Discrete Multimodal Base (LongCat-Next) Shows All Physical Signals Can Converge to Tokens

LongCat-Next, a 3‑billion‑parameter multimodal model released by Meituan, adopts a pure discrete token‑based architecture (DiNA) and next‑token prediction, outperforming same‑size rivals on OmniDocBench‑EN, CharXivRQ, and matching QwenVL on visual tasks, while avoiding catastrophic forgetting and achieving a SWE‑Bench score of 43.0, as demonstrated through extensive benchmarks, receipt extraction, OCR, audio dialect reasoning, and image generation experiments.

DiNALongCat-NextOmniDocBench
0 likes · 10 min read
Meituan’s Fully Discrete Multimodal Base (LongCat-Next) Shows All Physical Signals Can Converge to Tokens
AI Frontier Lectures
AI Frontier Lectures
Mar 19, 2026 · Artificial Intelligence

Can Circulant Attention Reduce Vision Transformer Cost by 7×?

The article reviews the AAAI 2026 paper "Vision Transformers are Circulant Attention Learners", explaining how modeling self‑attention as a Block‑Circulant matrix enables FFT‑based multiplication that cuts the quadratic complexity of standard attention, achieving up to seven‑fold inference speed‑up while preserving accuracy across ImageNet, COCO and ADE20K benchmarks.

BCCB MatrixCirculant AttentionComputer Vision
0 likes · 15 min read
Can Circulant Attention Reduce Vision Transformer Cost by 7×?
AI Frontier Lectures
AI Frontier Lectures
Mar 19, 2026 · Artificial Intelligence

Why Sharing Parameters in Vision Transformers Hurts Performance—and How Layer Specialization Fixes It

The article analyzes the hidden conflict between [CLS] and patch tokens in Vision Transformers, reveals how shared normalization and linear layers cause computational friction, and demonstrates that layer‑specific parameters dramatically improve dense prediction tasks without increasing inference FLOPs.

Computer VisionDense PredictionLayer Specialization
0 likes · 9 min read
Why Sharing Parameters in Vision Transformers Hurts Performance—and How Layer Specialization Fixes It
AIWalker
AIWalker
Feb 26, 2026 · Artificial Intelligence

Overcoming Vision Transformer Bottlenecks: The Plug‑and‑Play Upgrade of ViT‑5

ViT‑5 systematically revisits five years of Transformer architecture advances, introducing seven plug‑and‑play components—LayerScale, RMSNorm, GeLU, dual positional encodings, high‑frequency RoPE for register tokens, QK‑Norm, and bias‑free projections—that together raise ImageNet‑1k Top‑1 accuracy to 84.2% (Base) and achieve superior performance across classification, generation, and segmentation tasks.

Computer VisionViT-5Vision Transformer
0 likes · 14 min read
Overcoming Vision Transformer Bottlenecks: The Plug‑and‑Play Upgrade of ViT‑5
HyperAI Super Neural
HyperAI Super Neural
Feb 11, 2026 · Artificial Intelligence

Reduce Memory by 75% Using D‑CHAG’s Cross‑Channel Hierarchical Aggregation

Researchers at Oak Ridge National Laboratory introduced D‑CHAG, a distributed cross‑channel hierarchical aggregation method that cuts memory consumption by up to 75% and more than doubles throughput when training massive multi‑channel foundation models on up to 1,024 AMD GPUs, as demonstrated on hyperspectral imaging and weather‑forecasting workloads.

D-CHAGDistributed TrainingMemory Optimization
0 likes · 14 min read
Reduce Memory by 75% Using D‑CHAG’s Cross‑Channel Hierarchical Aggregation
Sohu Tech Products
Sohu Tech Products
Dec 17, 2025 · Artificial Intelligence

How We Cut Vision Transformer Inference Latency from 53 ms to 8 ms

Facing 53.64 ms per‑image latency in a Flask‑served Vision Transformer classifier, we iteratively optimized the pipeline—switching to ONNX Runtime, leveraging TensorRT, replacing Pillow with OpenCV, eliminating URL downloads, and finally batching requests—reducing average server‑side processing to 8.34 ms, a 6.4× speedup.

BatchingFlaskONNX
0 likes · 28 min read
How We Cut Vision Transformer Inference Latency from 53 ms to 8 ms
AI Frontier Lectures
AI Frontier Lectures
Nov 22, 2025 · Artificial Intelligence

Can Vision Transformers Crack the ARC Puzzle? Introducing VARC

MIT researchers argue that the ARC benchmark is essentially a visual problem and present the Vision ARC (VARC) framework, which reformulates ARC as an image‑to‑image translation task using a Vision Transformer, achieving human‑level accuracy through a novel canvas representation and test‑time training.

ARCImage-to-Image TranslationTest-Time Training
0 likes · 9 min read
Can Vision Transformers Crack the ARC Puzzle? Introducing VARC
AI Algorithm Path
AI Algorithm Path
Nov 1, 2025 · Artificial Intelligence

Deep Dive into Vision Transformer Patch Embedding Mechanisms

This article explains how Vision Transformers convert images into patch embeddings, compares flattening versus convolutional approaches, discusses position and CLS tokens, analyzes the effect of patch size, explores pixel‑level tokens, and contrasts ViT’s inductive bias with CNNs.

Computer VisionConvolutionInductive Bias
0 likes · 10 min read
Deep Dive into Vision Transformer Patch Embedding Mechanisms
Data Party THU
Data Party THU
Jul 31, 2025 · Artificial Intelligence

How LaVin-DiT Revolutionizes Vision Generation with ST‑VAE and Joint Diffusion Transformer

The LaVin-DiT paper introduces a large‑scale vision diffusion transformer that combines a spatiotemporal variational auto‑encoder, a joint diffusion transformer with full‑sequence joint attention, and 3D rotary position encoding to enable unified, efficient generation across diverse visual tasks such as segmentation and video prediction.

3D RoPEComputer VisionVision Transformer
0 likes · 11 min read
How LaVin-DiT Revolutionizes Vision Generation with ST‑VAE and Joint Diffusion Transformer
Amap Tech
Amap Tech
Jul 11, 2025 · Artificial Intelligence

Unified Self‑Supervised Pretraining Boosts Image Generation and Understanding

The USP framework introduces masked latent modeling within a VAE space to pretrain ViT encoders, enabling seamless weight transfer to both image classification and diffusion‑based generation tasks, dramatically accelerating training while preserving strong performance across multiple benchmarks.

Vision Transformerdiffusion modelsimage generation
0 likes · 10 min read
Unified Self‑Supervised Pretraining Boosts Image Generation and Understanding
Network Intelligence Research Center (NIRC)
Network Intelligence Research Center (NIRC)
Jul 7, 2025 · Artificial Intelligence

Exploring Collaborative Perception with V2X‑ViT: Architecture, Innovations, and Practical Insights

This article reviews the V2X‑ViT collaborative perception framework for autonomous driving, detailing its end‑to‑end pipeline, the novel HMSA and MSwin attention mechanisms, and the delay‑aware positional encoding that together enable high‑accuracy 3D object detection across vehicles and infrastructure.

3D Object DetectionCollaborative PerceptionHMSA
0 likes · 10 min read
Exploring Collaborative Perception with V2X‑ViT: Architecture, Innovations, and Practical Insights
AI Algorithm Path
AI Algorithm Path
Jul 5, 2025 · Artificial Intelligence

Beginner’s Guide to Vision‑Language Models Day 7: How CLIP Achieves Joint Visual‑Language Understanding

This article explains CLIP’s dual‑encoder architecture—using a Vision Transformer for images and a Transformer for text—how both encoders map inputs into a shared embedding space, the role of cosine similarity, and the InfoNCE contrastive loss that drives joint visual‑language learning.

CLIPInfoNCEMulti-modal Embedding
0 likes · 8 min read
Beginner’s Guide to Vision‑Language Models Day 7: How CLIP Achieves Joint Visual‑Language Understanding
AI Algorithm Path
AI Algorithm Path
Jun 29, 2025 · Artificial Intelligence

Understanding CLIP: Theory, Architecture, and Zero‑Shot Vision

CLIP (Contrastive Language‑Image Pre‑training) is an OpenAI model that learns visual concepts from 400 million image‑text pairs using a dual‑encoder architecture, enabling zero‑shot classification, flexible text‑driven search, and cross‑modal reasoning, while its strengths, limitations, and emerging applications are examined in detail.

CLIPContrastive Language-Image PretrainingDual Encoder
0 likes · 15 min read
Understanding CLIP: Theory, Architecture, and Zero‑Shot Vision
AIWalker
AIWalker
Jun 18, 2025 · Artificial Intelligence

SeNaTra: Nvidia’s Spatial Grouping Layer Pushes Semantic Segmentation Past Swin Transformer

Nvidia introduces SeNaTra, a native‑segmentation vision transformer that replaces uniform down‑sampling with a content‑aware spatial grouping layer, delivering superior zero‑shot and supervised segmentation performance while cutting parameters and FLOPs compared with Swin Transformer and other backbones.

NvidiaVision Transformersemantic segmentation
0 likes · 29 min read
SeNaTra: Nvidia’s Spatial Grouping Layer Pushes Semantic Segmentation Past Swin Transformer
AI Algorithm Path
AI Algorithm Path
Mar 20, 2025 · Artificial Intelligence

Understanding Multimodal Large Language Models: Recent Advances and Comparative Analysis

This article surveys the latest multimodal large language model research, dissecting the design, training strategies, and performance trade‑offs of models such as Llama 3.2, Molmo, NVLM, Qwen2‑VL, Pixtral, MM1.5, Emu3, and Janus, and highlights the challenges of fair cross‑model evaluation.

AI researchCross-AttentionModel Training Strategies
0 likes · 16 min read
Understanding Multimodal Large Language Models: Recent Advances and Comparative Analysis
AIWalker
AIWalker
Feb 28, 2025 · Artificial Intelligence

FlexTok: Reconstruct Images with as Few as 8 Tokens – Variable‑Length Tokenizer Beats TiTok

FlexTok is a flexible‑length 1‑D image tokenizer that can resample pictures into as few as 1‑256 discrete tokens, achieving superior reconstruction (FID) and autoregressive generation quality compared with TiTok, thanks to nested random dropout, causal masks and a flow‑based decoder evaluated on ImageNet and DFN.

FlexTokVision Transformerautoregressive generation
0 likes · 21 min read
FlexTok: Reconstruct Images with as Few as 8 Tokens – Variable‑Length Tokenizer Beats TiTok
AIWalker
AIWalker
Feb 23, 2025 · Artificial Intelligence

U‑ViT: How a ViT‑Based Diffusion Model Beats DiT and Redefines Image Generation

U‑ViT replaces the convolutional U‑Net backbone of diffusion models with a Vision Transformer, treats time, condition and noisy patches as tokens, adds long skip connections and a lightweight 3×3 convolution, and through extensive ablations and scaling studies achieves state‑of‑the‑art FID scores on unconditional, class‑conditional and text‑to‑image generation tasks.

AdaLNFIDLong Skip Connections
0 likes · 16 min read
U‑ViT: How a ViT‑Based Diffusion Model Beats DiT and Redefines Image Generation
AIWalker
AIWalker
Feb 9, 2025 · Artificial Intelligence

Douyin’s BDVQAGroup Secures Global Runner‑Up in DXOMARK Image Quality Challenge at CVPR 2024

At CVPR 2024 NTIRE, Douyin’s BDVQAGroup achieved second place worldwide in the DXOMARK portrait quality track using their SampleIQA model, which combines data‑re‑sampling, a Swin‑Transformer backbone, twin‑network ranking loss and content‑aware cropping to outperform existing IQA state‑of‑the‑art methods.

Computer VisionDXOMARKDeep Learning
0 likes · 10 min read
Douyin’s BDVQAGroup Secures Global Runner‑Up in DXOMARK Image Quality Challenge at CVPR 2024
AIWalker
AIWalker
Jan 13, 2025 · Artificial Intelligence

Multi-View Transformer (MVFormer) Sets New Top‑1 Accuracy Records in Classification, Detection, and Segmentation

The paper proposes MVFormer, a Vision Transformer that combines a Multi‑View Normalization (MVN) module and a Multi‑View Token Mixer (MVTM) to diversify feature learning, achieving state‑of‑the‑art Top‑1 accuracy of 83.4%‑84.6% on ImageNet‑1K and superior performance on COCO detection and ADE20K segmentation while using comparable or fewer parameters and MACs.

Computer VisionDeep LearningMulti-View Normalization
0 likes · 25 min read
Multi-View Transformer (MVFormer) Sets New Top‑1 Accuracy Records in Classification, Detection, and Segmentation
AsiaInfo Technology: New Tech Exploration
AsiaInfo Technology: New Tech Exploration
Jan 29, 2024 · Artificial Intelligence

Can Vision Transformers Revolutionize Edge AI Video Analysis?

This article examines the rapid rise of edge AI video analytics, explains how Vision Transformers (ViT) overcome the limitations of traditional CNNs, details a technical pre‑research and POC conducted by a Chinese AI firm, evaluates several open‑source large models, and concludes that the OFA model best meets current edge deployment needs.

OFAVision Transformeredge AI
0 likes · 14 min read
Can Vision Transformers Revolutionize Edge AI Video Analysis?
Ximalaya Technology Team
Ximalaya Technology Team
Oct 10, 2023 · Artificial Intelligence

MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis

MiniGPT-5 is a novel multimodal generation model using generative vokens to interleave text and image synthesis, integrating Stable Diffusion and LLMs with a two-stage training that requires no domain-specific annotations, achieving state‑of‑the‑art coherence and quality on benchmarks like CC3M, VIST, and MMDialog.

AI researchStable DiffusionVision Transformer
0 likes · 9 min read
MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Oct 8, 2023 · Artificial Intelligence

Why the Scale‑Aware Modulation Transformer Outperforms CNNs and Vision Transformers with Fewer Parameters

The Scale‑Aware Modulation Transformer (SMT) introduces a lightweight SAM module and an Evolutionary Hybrid Network that together achieve higher accuracy on ImageNet, COCO, and ADE20K while using significantly fewer parameters and FLOPs than existing CNN and Transformer baselines.

Image ClassificationSMTScale‑Aware Modulation
0 likes · 12 min read
Why the Scale‑Aware Modulation Transformer Outperforms CNNs and Vision Transformers with Fewer Parameters
Rare Earth Juejin Tech Community
Rare Earth Juejin Tech Community
Jul 24, 2023 · Artificial Intelligence

Understanding Slide-Transformer: An Efficient Local Attention Module for Vision Transformers

This article explains the Slide-Transformer paper, describing how the proposed Slide Attention replaces inefficient Im2Col‑based local attention with depthwise convolutions and a deformable shift module, achieving high efficiency, flexibility, and hardware‑agnostic performance for Vision Transformers.

Computer VisionDeep LearningDeformable Shift
0 likes · 13 min read
Understanding Slide-Transformer: An Efficient Local Attention Module for Vision Transformers
Rare Earth Juejin Tech Community
Rare Earth Juejin Tech Community
Jul 12, 2023 · Artificial Intelligence

Comprehensive Guide to Vision Transformer (ViT): Architecture, Patch Tokenization, Embedding, Fine‑tuning, and Performance

This article provides an in‑depth, English‑language overview of Vision Transformer (ViT), covering its Transformer‑based architecture, patch‑to‑token conversion, token and position embeddings, fine‑tuning strategies such as 2‑D interpolation, experimental results versus CNNs, and the model’s broader significance for multimodal AI research.

Computer VisionDeep LearningFine‑tuning
0 likes · 25 min read
Comprehensive Guide to Vision Transformer (ViT): Architecture, Patch Tokenization, Embedding, Fine‑tuning, and Performance
DataFunSummit
DataFunSummit
Jun 23, 2023 · Artificial Intelligence

Frontiers of Video Action Recognition: Concepts, Algorithms, and Applications

This article introduces video action recognition, covering its basic definition, downstream tasks, major algorithmic families—including CNN‑based, Vision‑Transformer, self‑supervised, and multimodal approaches—and discusses practical deployment scenarios and open challenges in the field.

CNNVision Transformermultimodal models
0 likes · 16 min read
Frontiers of Video Action Recognition: Concepts, Algorithms, and Applications
Rare Earth Juejin Tech Community
Rare Earth Juejin Tech Community
Oct 18, 2022 · Artificial Intelligence

Practical Implementation of Vision Transformer (ViT) for Image Classification in PyTorch

This article walks readers through building, training, and evaluating a Vision Transformer (ViT) model for a five‑class flower classification task, providing detailed code snippets, model architecture explanations, training script adjustments, and experimental results that highlight the importance of pre‑trained weights.

Deep LearningImage ClassificationPyTorch
0 likes · 13 min read
Practical Implementation of Vision Transformer (ViT) for Image Classification in PyTorch
Rare Earth Juejin Tech Community
Rare Earth Juejin Tech Community
Oct 10, 2022 · Artificial Intelligence

A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers

This article introduces the fundamentals of Vision Transformers (ViT) for computer‑vision developers, starting with an overview of the transformer architecture, detailed explanation of self‑attention and multi‑head attention, and step‑by‑step PyTorch code examples that illustrate query, key, value computation and attention scoring.

PyTorchSelf-AttentionTransformer
0 likes · 12 min read
A Beginner’s Journey into Vision Transformers (ViT) for Computer Vision Engineers
AntTech
AntTech
Jun 15, 2022 · Artificial Intelligence

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM introduces a layout‑aware multimodal network that improves visually‑rich document understanding by augmenting XY‑Cut for robust reading order generation and employing a Dilated Conditional Position Encoding to handle variable‑length inputs, achieving state‑of‑the‑art performance on XFUN and FUNSD datasets.

Vision TransformerXYCutdocument understanding
0 likes · 10 min read
XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding
DataFunTalk
DataFunTalk
Jun 9, 2022 · Artificial Intelligence

Understanding and Reproducing MAE (Masked AutoEncoder) for Self‑Supervised Vision Learning with EasyCV

This article introduces the MAE (Masked AutoEncoder) self‑supervised learning method, explains its asymmetric encoder‑decoder design and high masking ratio, evaluates its performance, and provides a step‑by‑step guide to reproduce MAE using Alibaba’s EasyCV framework, including code snippets, training tips, and troubleshooting.

EasyCVMAEPyTorch
0 likes · 15 min read
Understanding and Reproducing MAE (Masked AutoEncoder) for Self‑Supervised Vision Learning with EasyCV
Baobao Algorithm Notes
Baobao Algorithm Notes
Apr 11, 2022 · Artificial Intelligence

Can ResNet Still Beat Transformers? A Deep Dive into Modern Training Tricks

This article reviews recent research and official PyTorch blog updates that modify ResNet architectures and training tricks, compares their performance against EfficientNet, ConvNeXt, and Vision Transformers using extensive ImageNet benchmarks, and provides both literature‑based and local evaluation results to assess whether classic CNNs remain competitive.

CNNResNetVision Transformer
0 likes · 13 min read
Can ResNet Still Beat Transformers? A Deep Dive into Modern Training Tricks
Baidu Geek Talk
Baidu Geek Talk
Mar 28, 2022 · Artificial Intelligence

Robust Input Visualization Methods for Vision Transformers

The paper proposes a robust Grad‑CAM‑inspired visualization for Vision Transformers that combines attention weights and gradients to generate class‑specific saliency maps, demonstrates superior alignment with discriminative regions across ViT, Swin and Volo models, and shows a 76% false‑positive reduction in Baidu’s porn‑content risk control system.

Deep LearningGrad-CAMInput Visualization
0 likes · 11 min read
Robust Input Visualization Methods for Vision Transformers