Tagged articles
6 articles
Page 1 of 1
Data Party THU
Data Party THU
Apr 14, 2026 · Artificial Intelligence

Heterogeneous Hyperbolic Manifolds for Better Vision-Language Tree Alignment

This paper introduces a novel framework that constructs and aligns dual visual‑textual trees on heterogeneous hyperbolic manifolds, addressing asymmetric modality alignment in hierarchical classification tasks and achieving state‑of‑the‑art performance on benchmarks such as CIFAR‑100, ImageNet and Rare Species datasets.

Cross-AttentionHierarchical ClassificationVision-Language Models
0 likes · 8 min read
Heterogeneous Hyperbolic Manifolds for Better Vision-Language Tree Alignment
Alimama Tech
Alimama Tech
Dec 17, 2025 · Artificial Intelligence

How VeM Achieves Precise Semantic, Temporal, and Rhythmic Alignment in Video-to-Music Generation

The VeM model introduces a latent diffusion framework that leverages hierarchical video parsing, scene‑guided cross‑attention, and a transition‑beat alignment adapter to generate high‑fidelity background music perfectly synchronized with video semantics, timing, and rhythm, outperforming existing baselines on extensive quantitative and qualitative evaluations.

Cross-AttentionLatent Diffusionaudio generation
0 likes · 14 min read
How VeM Achieves Precise Semantic, Temporal, and Rhythmic Alignment in Video-to-Music Generation
AI Algorithm Path
AI Algorithm Path
Mar 20, 2025 · Artificial Intelligence

Understanding Multimodal Large Language Models: Recent Advances and Comparative Analysis

This article surveys the latest multimodal large language model research, dissecting the design, training strategies, and performance trade‑offs of models such as Llama 3.2, Molmo, NVLM, Qwen2‑VL, Pixtral, MM1.5, Emu3, and Janus, and highlights the challenges of fair cross‑model evaluation.

AI researchCross-AttentionModel Training Strategies
0 likes · 16 min read
Understanding Multimodal Large Language Models: Recent Advances and Comparative Analysis
AI Algorithm Path
AI Algorithm Path
Mar 19, 2025 · Artificial Intelligence

Understanding Multimodal Large Language Models: Part 1

This article explains the fundamentals of multimodal large language models, covering their definition, typical applications, two main architectural approaches—unified embedding decoder and cross‑modal attention—along with detailed component breakdowns, a PyTorch implementation of image‑patch projection, and training considerations, ending with a discussion of trade‑offs between the methods.

Cross-AttentionImage EncoderLinear Projection
0 likes · 14 min read
Understanding Multimodal Large Language Models: Part 1
DaTaobao Tech
DaTaobao Tech
Oct 13, 2023 · Artificial Intelligence

Understanding Stable Diffusion: Core Principles and Technical Architecture

The article demystifies Stable Diffusion by explaining its low‑cost latent‑space design and conditioning mechanisms, comparing it to autoregressive, VAE, flow‑based and GAN models, detailing the iterative noise‑to‑image process, token‑based text‑to‑image control, version differences, common generation issues, and providing implementation code examples.

AI image generationComputer VisionCross-Attention
0 likes · 15 min read
Understanding Stable Diffusion: Core Principles and Technical Architecture