Tagged articles

Multimodal Generation

17 articles · Page 1 of 1

Jun 14, 2026 · Artificial Intelligence

GaussianDWM: 3D Gaussian Representation for Driving Understanding and Generation

GaussianDWM introduces a unified 3D Gaussian scene model that simultaneously supports autonomous‑driving perception and multimodal generation, embedding geometry, appearance and language semantics into LLM‑compatible tokens, and demonstrates superior visual‑grounding and RGB‑D generation performance on NuInteract and nuScenes compared with prior methods.

3D Gaussian · LLM · Multimodal Generation

0 likes · 10 min read


Machine Learning Algorithms & Natural Language Processing

May 25, 2026 · Artificial Intelligence

VeRL-Omni: A Universal RL Post‑Training Framework for Diffusion and Multimodal Generation Models

VeRL-Omni introduces a universal reinforcement‑learning post‑training framework that extends the verl and vLLM‑Omni stacks to support diffusion transformers, hybrid AR‑DiT, and unified understanding‑generation models, offering high‑throughput multimodal rollout, flexible reward engines, modular trainers, and broad hardware compatibility.

FlowGRPO · Multimodal Generation · RL

0 likes · 9 min read


Machine Heart

May 25, 2026 · Artificial Intelligence

VeRL-Omni: Universal RL Post‑Training for Diffusion and Multimodal Models

VeRL-Omni is an open‑source RL post‑training framework built on verl and vLLM‑Omni that enables efficient, high‑throughput rollout and flexible reward computation for diffusion, AR‑DiT, and unified multimodal generation models. It supports diverse hardware and modular trainers, and benchmark experiments show up to 14% latency reduction and high training throughput.

Diffusion Models · FlowGRPO · Multimodal Generation

0 likes · 9 min read


Old Zhang's AI Learning

May 18, 2026 · Artificial Intelligence

Testing a Cloud AI Agent: From Data Analysis to PPT to Video with a Single Input

The author walks through a hands‑on test of the Skywork cloud AI Agent, showing how it can ingest exported Excel data, generate a data‑analysis report, automatically create a PPT, and produce narrated video and images, all from a single input and without any local deployment.

AI Agent · Multimodal Generation · PPT generation

0 likes · 8 min read


Machine Heart

May 8, 2026 · Artificial Intelligence

How an Agentic Loop Turns Text‑to‑3D Scene Generation into an Iterative Planning Process

Scenethesis, a new ICLR 2026 framework from NVIDIA and Purdue, combines language, vision, and physics in a closed‑loop agent to turn one‑shot text‑to‑3D generation into a repeatable plan‑check‑repair workflow, dramatically improving spatial realism and physical plausibility.

Language Models · Multimodal Generation · Vision models

0 likes · 9 min read


Machine Learning Algorithms & Natural Language Processing

Apr 23, 2026 · Artificial Intelligence

ControlAudio: Script‑Driven, Time‑Precise Text‑to‑Audio Generation Presented at ACL 2026

ControlAudio, a progressive diffusion framework introduced by Tsinghua researchers, unifies text, timing, and phoneme modeling to enable precise control over when sounds occur and what is spoken, achieving superior alignment and intelligibility while preserving high‑fidelity audio generation.

ACL 2026 · ControlAudio · Multimodal Generation

0 likes · 11 min read


Machine Heart

Apr 9, 2026 · Artificial Intelligence

From Direct Generation to Agentic Text-to-Image: Introducing the Open-Source Gen-Searcher

Gen-Searcher equips text-to-image models with search, reasoning, and web‑browsing capabilities, turning the traditional direct‑generation pipeline into an agentic system that fetches and verifies real‑world knowledge, dramatically improving accuracy and quality across multiple benchmarks.

Gen-Searcher · KnowGen · Multimodal Generation

0 likes · 7 min read


AI Engineering

Jan 8, 2026 · Artificial Intelligence

LTX-2 Open‑Source: The First Model That Generates Video and Audio Together

LTX-2, an open‑source multimodal diffusion model from Lightricks, jointly generates synchronized video and audio using an asymmetric dual‑stream architecture, achieving 49.18 processing steps per minute—far faster than many pure video models—while supporting about 20 seconds of high‑resolution output.

LTX-2 · Multimodal Generation · Open-source AI

0 likes · 3 min read


Kuaishou Tech

Sep 17, 2025 · Artificial Intelligence

How MIDAS Achieves Real‑Time Multimodal Digital‑Human Video Generation

The MIDAS framework introduced by the Kling Team combines autoregressive video generation with a lightweight diffusion denoising head to deliver real‑time, high‑quality digital‑human synthesis under multimodal control, achieving sub‑500 ms latency, 64× compression, and robust performance across multilingual dialogue, singing, and interactive world modeling tasks.

AI · Multimodal Generation · autoregressive

0 likes · 6 min read


Tencent Tech

Aug 21, 2025 · Artificial Intelligence

Yan: Tencent’s Real‑Time High‑Fidelity Interactive Video Generation

Tencent’s newly released Yan system advances interactive video generation by delivering high‑fidelity, real‑time, editable content for games, virtual worlds and AIGC, featuring a three‑module architecture—Yan‑Sim for AAA‑level simulation, Yan‑Gen for multimodal generation, and Yan‑Edit for granular editing—while also introducing a large‑scale high‑quality dataset and efficient inference optimizations.

Generative AI · Interactive Video · Multimodal Generation

0 likes · 12 min read


AntTech

Nov 27, 2024 · Artificial Intelligence

EchoMimicV2: An End-to-End Audio‑Driven Semi‑Body Human Animation Framework

EchoMimicV2, an open‑source project from Ant Group's Alipay AI team, introduces an end‑to‑end audio‑driven framework that generates high‑quality semi‑body portrait videos by jointly coordinating audio, pose, and image inputs, while addressing challenges of condition complexity, model stability, and computational cost.

Diffusion Models · Multimodal Generation · audio-driven animation

0 likes · 16 min read


Baidu Tech Salon

Nov 14, 2024 · Artificial Intelligence

How Baidu’s Wenxin Model Hit 430 Million Users and What Its New Tech Means for AI

At Baidu World 2024, CTO Wang Haifeng revealed that Wenxin Yiyan has reached 430 million users, detailed the model’s retrieval‑augmented and multimodal generation breakthroughs, showcased intelligent‑agent‑driven coding tools, and highlighted expanding AI applications across education, sports, and industry.

AI · Intelligent agents · Large Language Model

0 likes · 7 min read


Tencent Cloud Developer

May 15, 2024 · Artificial Intelligence

Tencent Open-Sources HunYuan DiT: First Chinese-Native Text-to-Image Model with 1.5B Parameters

Tencent has open‑sourced its upgraded 1.5‑billion‑parameter HunYuan DiT model, the first Chinese‑native, bilingual (Chinese‑English) text‑to‑image diffusion transformer (DiT) system. It offers about a 20% improvement in visual quality, multi‑round generation, and potential for video generation, and is free for commercial use, with full weights, inference code, and algorithms available on Hugging Face and GitHub for developers and enterprises.

Chinese-native AI · DiT architecture · Multimodal Generation

0 likes · 6 min read


DataFunTalk

Jan 31, 2024 · Artificial Intelligence

Industry Trends and Challenges of Large Language Models in Enterprise Applications (2023 Review)

The article reviews the rapid development of large language models in enterprise settings, covering internal collaboration tools, AI assistants for development and marketing, multimodal generation, inference speed bottlenecks, resource constraints, and future directions such as open‑source models and academic‑industry cooperation.

AI assistants · AI in marketing · Enterprise AI

0 likes · 8 min read


Ximalaya Technology Team

Oct 10, 2023 · Artificial Intelligence

MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis

MiniGPT-5 is a multimodal generation model that uses generative vokens to interleave text and image synthesis. It integrates Stable Diffusion with LLMs through a two-stage training strategy that requires no domain-specific annotations, achieving state‑of‑the‑art coherence and quality on benchmarks such as CC3M, VIST, and MMDialog.

AI research · Multimodal Generation · Stable Diffusion

0 likes · 9 min read


Alimama Tech

Aug 2, 2023 · Artificial Intelligence

Can AI Fully Automate Advertising Poster Creation and Video Outpainting?

This article reviews four ACM MM 2023 papers that introduce AI‑driven systems for automatic advertising poster generation, multimodal text‑image creation, few‑shot style‑guided visual captioning, and hierarchical 3D diffusion models for video outpainting, detailing their methods, datasets, and experimental results.

AI-generated design · Diffusion Models · Multimodal Generation

0 likes · 9 min read


DataFunTalk

Mar 4, 2023 · Artificial Intelligence

Advances in AIGC: AliceMind Text Generation Models and Multimodal mPLUG from Alibaba DAMO Academy

This article reviews recent AIGC progress, introducing the AliceMind series of text generation models—including PALM, PLUG, and a Chinese GPT‑3—alongside the multimodal mPLUG architecture, and discusses their training strategies, performance results, and practical deployment insights.

AIGC · AliceMind · Multimodal Generation

0 likes · 16 min read
