Tagged articles

cross-modal attention

3 articles · Page 1 of 1

Jan 8, 2026 · Artificial Intelligence

LTX-2 Open‑Source: The First Model That Generates Video and Audio Together

LTX-2, an open‑source multimodal diffusion model from Lightricks, jointly generates synchronized video and audio using an asymmetric dual‑stream architecture, achieving 49.18 processing steps per minute—far faster than many pure video models—while supporting about 20 seconds of high‑resolution output.

LTX-2audio-visual diffusioncross-modal attention

0 likes · 3 min read

LTX-2 Open‑Source: The First Model That Generates Video and Audio Together

Baidu Geek Talk

Dec 25, 2024 · Industry Insights

How to Build a Multimodal Web Page Model for the LLM Era

This article examines the unique multimodal and multi‑granular nature of web pages, compares fusion strategies, proposes a cross‑modal attention approach, outlines fine‑ and coarse‑grained pre‑training tasks, and explores low‑cost adaptor methods for adapting large multimodal models to web‑page modeling in the LLM era.

AIHTMLLLM adaptation

0 likes · 10 min read

How to Build a Multimodal Web Page Model for the LLM Era

Alibaba Cloud Developer

Apr 10, 2019 · Artificial Intelligence

Bilinear Residual Layers: Boosting Text‑Guided Image Editing

This article explores multimodal representation learning by introducing a Bilinear Residual Layer that automatically fuses image and text features, demonstrates its superiority over traditional concatenation and FiLM methods on text‑guided image editing and fashion synthesis tasks, and reports state‑of‑the‑art results on several benchmark datasets.

GaNText-to-Image Generationbilinear residual layer

0 likes · 17 min read

Bilinear Residual Layers: Boosting Text‑Guided Image Editing