AI Engineering
Jan 8, 2026 · Artificial Intelligence

LTX-2 Open‑Source: The First Model That Generates Video and Audio Together

LTX-2, an open‑source multimodal diffusion model from Lightricks, jointly generates synchronized video and audio using an asymmetric dual‑stream architecture. It achieves 49.18 processing steps per minute, far faster than many pure video models, and supports about 20 seconds of high‑resolution output.

LTX-2 · audio-visual diffusion · cross-modal attention
3 min read
Baidu Geek Talk
Dec 25, 2024 · Industry Insights

How to Build a Multimodal Web Page Model for the LLM Era

This article examines the multimodal, multi‑granular nature of web pages, compares fusion strategies, proposes a cross‑modal attention approach, outlines fine‑ and coarse‑grained pre‑training tasks, and explores low‑cost adapter methods for applying large multimodal models to web‑page modeling in the LLM era.

AI · HTML · LLM adaptation
10 min read
Alibaba Cloud Developer
Apr 10, 2019 · Artificial Intelligence

Bilinear Residual Layers: Boosting Text‑Guided Image Editing

This article explores multimodal representation learning by introducing a Bilinear Residual Layer that automatically fuses image and text features, demonstrates its superiority over traditional concatenation and FiLM methods on text‑guided image editing and fashion synthesis tasks, and reports state‑of‑the‑art results on several benchmark datasets.

GAN · Multimodal Learning · bilinear residual layer
17 min read