Heterogeneous Hyperbolic Manifolds for Better Vision-Language Tree Alignment

This paper introduces a novel framework that constructs and aligns dual visual‑textual trees on heterogeneous hyperbolic manifolds, addressing asymmetric modality alignment in hierarchical classification tasks and achieving state‑of‑the‑art performance on benchmarks such as CIFAR‑100, ImageNet and Rare Species datasets.

Data Party THU

Introduction

The authors propose a method for building and aligning paired image and text trees on heterogeneous hyperbolic manifolds, that is, manifolds with different curvatures, and report improvements over prior state‑of‑the‑art results on hierarchical classification tasks.

Motivation

Real‑world multimodal data often exhibit inherent hierarchical structures (e.g., biological taxonomy). Existing visual‑language models (VLMs) align a single visual feature (typically the [CLS] token) with multi‑level textual features, leading to asymmetric alignment and information loss.

Problem Definition

In Taxonomic Open Set (TOS) classification, class labels form a semantic tree. Given an input image, the classifier must predict labels at multiple semantic levels by maximizing the posterior probability of a leaf node.
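The prediction rule can be sketched as follows. This is a minimal illustration, assuming that coarser labels are read off as the ancestors of the most probable leaf; the toy taxonomy, the posterior values, and the helper names `path_to_root` and `predict_hierarchy` are made up for the example and do not come from the paper.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "sparrow": "bird",
    "eagle": "bird",
    "salmon": "fish",
    "bird": "animal",
    "fish": "animal",
    "animal": None,
}

def path_to_root(leaf):
    """Return the labels from the leaf up to the root, finest first."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def predict_hierarchy(leaf_posterior):
    """Pick the leaf with the highest posterior and read off its ancestors."""
    best_leaf = max(leaf_posterior, key=leaf_posterior.get)
    return path_to_root(best_leaf)

# Made-up posterior over leaf classes for a single image.
posterior = {"sparrow": 0.2, "eagle": 0.7, "salmon": 0.1}
print(predict_hierarchy(posterior))  # ['eagle', 'bird', 'animal']
```

A single leaf decision fixes the label at every level here, which is why an error at the leaf propagates up the tree.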

Limitations of Baselines

Current methods (e.g., MaPLe, PromptSRC, BioCLIP) extract rich hierarchical textual features but pair them with a single visual feature, causing asymmetric alignment.

Visual features are noisy and span coarse to fine semantics, while textual features are relatively clean, resulting in a modality gap that Euclidean or uniform‑curvature hyperbolic spaces cannot model accurately.

Proposed Methodology

Semantic‑Aware Visual Feature Extraction

Observing that shallow ViT layers encode coarse semantics and deeper layers encode fine semantics, the authors introduce a cross‑attention mechanism:

Query: textual features from each level of the text tree.

Key/Value: intermediate ViT layer tokens and the final class token, linearly projected into a common space.

Each query yields one visual feature, so the outputs across levels form a multi‑level visual feature tree.
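The query/key/value roles described above can be sketched in plain Python. This is a single-head toy version under stated assumptions: keys and values share one projection matrix (`w_proj`), the text queries already live in the common space, and all numbers are invented. The paper's actual module may differ in every one of these choices.

```python
import math

def linear(x, weight):
    """Project vector x with a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, visual_tokens, w_proj):
    """Return one visual feature per text level (text attends over visual tokens)."""
    # Assumption: keys and values share the same projection into the common space.
    keys = [linear(tok, w_proj) for tok in visual_tokens]
    scale = math.sqrt(len(keys[0]))
    visual_tree = []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        feature = [sum(w * k[d] for w, k in zip(weights, keys))
                   for d in range(len(q))]
        visual_tree.append(feature)
    return visual_tree

# Toy inputs: 3-dim ViT tokens projected into a 2-dim common space.
w_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_tokens = [
    [1.0, 0.0, 0.5],  # token from a shallow layer
    [0.0, 1.0, 0.5],  # token from a deeper layer
    [0.7, 0.7, 0.1],  # final class token
]
text_queries = [[4.0, 0.0], [0.0, 4.0]]  # coarse-level and fine-level text features

for level, feat in enumerate(cross_attention(text_queries, visual_tokens, w_proj)):
    print(level, [round(v, 3) for v in feat])
```

Because each text level issues its own query, the two output features weight the same visual tokens differently, which is the mechanism that turns one image into a feature per level.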

Heterogeneous Manifold Alignment Algorithm

Both visual and textual trees are embedded in separate hyperbolic manifolds with learnable curvatures. To bridge the curvature gap, an intermediate manifold is introduced.
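To make the role of curvature concrete, the sketch below computes geodesic distance in a Poincaré ball of curvature -c. The choice of the Poincaré ball model is an assumption made for illustration (the paper may work in a different model such as the Lorentz model), and the points and curvature values are invented.

```python
import math

def poincare_distance(x, y, c):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0).

    Requires c * ||x||^2 < 1 and c * ||y||^2 < 1.
    """
    def sq(v):
        return sum(a * a for a in v)

    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return math.acosh(1.0 + 2.0 * c * sq(diff) / denom) / math.sqrt(c)

# The same pair of points is farther apart on a more strongly curved manifold.
x, y = [0.1, 0.2], [0.4, -0.3]
for c in (0.1, 1.0, 2.0):
    print(c, round(poincare_distance(x, y, c), 4))
```

The same coordinates give different distances under different curvatures, so two modalities embedded with different curvatures cannot be compared directly. That is the gap the intermediate manifold is meant to bridge.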

Theorem 1: With the data on each hyperbolic manifold modeled as a wrapped normal distribution, an approximate KL‑based distance is derived.

Optimization Objective: Minimize the sum of distances from the textual and visual manifolds to the intermediate manifold, effectively learning optimal curvatures.

Propositions 1 and 2: Prove that the global optimum is unique by showing that the objective is strictly convex, which is established through its second‑order derivative.

Loss and Geometric Constraints

Inter‑Modal Constraint: Visual features must lie inside the entailment cones of their corresponding textual features.

Intra‑Modal Constraint: Fine‑grained visual features must be contained within the cones of coarse‑grained features, mirroring hierarchical consistency.
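Both constraints reduce to the same test: does a child point lie inside the cone rooted at its parent? The sketch below uses the Poincaré ball entailment cone formulas of Ganea et al. (2018) at unit curvature. That choice, the constant `K`, and the example points are assumptions for illustration; the paper's cones may be defined in another model and depend on the learned curvature.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def half_aperture(x, K=0.1):
    """Half-aperture of the entailment cone rooted at x (unit Poincare ball)."""
    norm = math.sqrt(dot(x, x))
    return math.asin(min(1.0, K * (1.0 - norm ** 2) / norm))

def exterior_angle(x, y):
    """Angle at x between the cone axis (the ray from the origin through x) and y."""
    xx, yy, xy = dot(x, x), dot(y, y), dot(x, y)
    diff = math.sqrt(dot([a - b for a, b in zip(x, y)],
                         [a - b for a, b in zip(x, y)]))
    num = xy * (1.0 + xx) - xx * (1.0 + yy)
    den = math.sqrt(xx) * diff * math.sqrt(1.0 + xx * yy - 2.0 * xy)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, num / den)))

def cone_violation(parent, child, K=0.1):
    """Zero when child lies inside the parent's cone, positive otherwise."""
    return max(0.0, exterior_angle(parent, child) - half_aperture(parent, K))

# Inter-modal use: parent = text feature, child = visual feature.
# Intra-modal use: parent = coarse visual feature, child = fine visual feature.
parent = [0.5, 0.0]
print(cone_violation(parent, [0.8, 0.0]))  # farther out along the same ray: inside
print(cone_violation(parent, [0.2, 0.0]))  # closer to the origin: outside
```

Used as a loss term, the violation is zero for well-placed pairs and grows with the angle by which a child leaves its parent's cone.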

The curvature is found with a golden‑section search, which is derivative‑free and therefore not differentiable on its own; gradients through the resulting optimum are obtained efficiently with the Implicit Function Theorem.
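The combination of a derivative-free search with implicit gradients can be sketched on a toy problem. The quadratic `toy_objective` below is only a stand-in for the sum of manifold distances, not the paper's KL-based distance, and the curvature values are invented. At the optimum the first derivative in the searched variable is zero, so the Implicit Function Theorem gives the sensitivity of the optimum as the negated ratio of the mixed second derivative to the pure second derivative.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # 1 / golden ratio, about 0.618

def golden_section_min(f, lo, hi, tol=1e-8):
    """Derivative-free minimizer of a unimodal function on [lo, hi]."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                      # minimum lies in [a, d]
            c = b - INV_PHI * (b - a)
        else:
            a, c = c, d                      # minimum lies in [c, b]
            d = a + INV_PHI * (b - a)
    return 0.5 * (a + b)

def toy_objective(c_mid, c_text, c_vis):
    """Stand-in for distance(text, mid) + distance(visual, mid)."""
    return (c_mid - c_text) ** 2 + (c_mid - c_vis) ** 2

def implicit_grad_wrt_text(c_mid, c_text, c_vis, h=1e-3):
    """d c_mid* / d c_text via the Implicit Function Theorem (central differences)."""
    f = toy_objective
    d2_mid = (f(c_mid + h, c_text, c_vis) - 2.0 * f(c_mid, c_text, c_vis)
              + f(c_mid - h, c_text, c_vis)) / h ** 2
    d2_cross = (f(c_mid + h, c_text + h, c_vis) - f(c_mid + h, c_text - h, c_vis)
                - f(c_mid - h, c_text + h, c_vis)
                + f(c_mid - h, c_text - h, c_vis)) / (4.0 * h ** 2)
    return -d2_cross / d2_mid

c_text, c_vis = 0.5, 1.5  # made-up curvatures of the two modality manifolds
c_star = golden_section_min(lambda c: toy_objective(c, c_text, c_vis), 0.01, 5.0)
print(round(c_star, 4))                                          # close to 1.0
print(round(implicit_grad_wrt_text(c_star, c_text, c_vis), 4))   # close to 0.5
```

The search itself is never differentiated. Only the optimality condition at its output is, which keeps the backward pass cheap and independent of the number of search iterations.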

Experiments

Setup

The method is evaluated on Taxonomic Open Set classification across CIFAR‑100, SUN, ImageNet and the challenging Rare Species dataset (7‑level taxonomy). Metrics include Leaf Accuracy (LA), Hierarchical Consistent Accuracy (HCA) and Mean Treecut Accuracy (MTA).
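The difference between LA and HCA is easiest to see on a small example. The sketch below assumes the usual definitions: LA checks only the finest label, while HCA counts a sample as correct only if every level is correct. The labels and predictions are invented, and MTA is left out because its definition is not spelled out here.

```python
def leaf_accuracy(preds, targets):
    """Fraction of samples whose finest-level (last) label is correct."""
    hits = sum(p[-1] == t[-1] for p, t in zip(preds, targets))
    return hits / len(targets)

def hierarchical_consistent_accuracy(preds, targets):
    """Fraction of samples that are correct at every level of the label path."""
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)

# Made-up label paths, ordered from coarse to fine.
targets = [
    ["animal", "bird", "eagle"],
    ["animal", "bird", "sparrow"],
    ["animal", "fish", "salmon"],
    ["animal", "bird", "eagle"],
]
preds = [
    ["animal", "bird", "eagle"],    # correct at every level
    ["animal", "fish", "sparrow"],  # leaf correct, middle level wrong
    ["animal", "fish", "salmon"],   # correct at every level
    ["animal", "bird", "sparrow"],  # leaf wrong
]

print(leaf_accuracy(preds, targets))                     # 0.75
print(hierarchical_consistent_accuracy(preds, targets))  # 0.5
```

The second sample shows why HCA is the stricter metric: a correct leaf with an inconsistent parent still counts as an error.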

Results

The proposed method outperforms prior prompt‑learning baselines (MaPLe, PromptSRC, ProTeCt) in both 1‑shot and 16‑shot settings.

In the 16‑shot setting, HCA improves by 28.83%.

Compared with HyCoCLIP, a recent hyperbolic multimodal model, the approach improves MTA by 50.79%.

Key Takeaways

Respecting the intrinsic geometry of hierarchical data is crucial; forcing data into flat Euclidean space harms performance.

The cross‑attention module effectively enriches visual features with hierarchical depth.

Learning separate curvatures for each modality and aligning them via an intermediate manifold yields substantial gains.

The golden‑section search with implicit gradients offers an elegant solution for curvature optimization.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
