Tagged articles

MLLM

9 articles

Apr 24, 2026 · Artificial Intelligence

Audio-Omni: A Unified Multimodal Model for Understanding, Generating, and Editing Audio Across Sound, Music, and Speech

Audio-Omni, a unified multimodal audio model presented at SIGGRAPH 2026, combines a frozen multimodal large language model with a trainable diffusion generator to achieve state‑of‑the‑art understanding, generation, and instruction‑based editing across general sounds, music, and speech. It relies on a million‑scale AudioEdit dataset and a hybrid conditioning architecture.

Audio-Omni · AudioEdit · Diffusion Generation

0 likes · 11 min read

Machine Heart

Apr 21, 2026 · Artificial Intelligence

Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings. Using three‑stage supervised fine‑tuning followed by a novel visual‑latent policy optimization, it demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning.

Latent Embedding · MLLM · Multimodal

0 likes · 15 min read

Advanced AI Application Practice

Apr 16, 2026 · Artificial Intelligence

Can AI Deliver Scalable, High‑Quality Test Assets for Enterprises?

The article analyzes enterprise testing challenges and presents the AIO intelligent testing platform, which combines a cloud‑native architecture, MLLM‑RAG dual engines, and a knowledge graph to automate test case generation, improve coverage, and cut maintenance costs. The claims are backed by concrete benchmarks, and the platform accepts multi‑modal inputs.

AI testing · Cloud Native · Knowledge Graph

0 likes · 18 min read

Tencent Advertising Technology

Nov 28, 2025 · Artificial Intelligence

How Retrv-R1 Redefines Universal Multimodal Retrieval with Reasoning‑Driven MLLM

Retrv‑R1, a reasoning‑driven multimodal large language model framework, tackles the precision‑efficiency dilemma of universal multimodal retrieval by introducing a two‑stage coarse‑to‑fine pipeline, an information‑compression module, a detail‑inspection mechanism, and a three‑stage training strategy, achieving SOTA performance across accuracy, efficiency, and generalization benchmarks.

Efficiency · MLLM · detail inspection

0 likes · 21 min read

Data Party THU

Sep 21, 2025 · Artificial Intelligence

How the New ECD Dataset Supercharges Multimodal LLM Chart Understanding

The paper introduces the Effective Chart Dataset (ECD), a large, high‑quality, diverse synthetic chart collection and the ECDBench benchmark, detailing a five‑stage modular synthesis pipeline, extensive QA generation, and experiments that show consistent performance gains for open‑source multimodal large language models on chart‑understanding tasks.

AI · Benchmark · MLLM

0 likes · 9 min read

AI Frontier Lectures

Jul 30, 2025 · Artificial Intelligence

How MetaQuery Bridges MLLMs and Diffusion Models for Superior Multimodal Generation

MetaQuery introduces learnable queries that connect a frozen multimodal LLM with diffusion models, enabling knowledge‑enhanced image generation, reconstruction, and editing while preserving state‑of‑the‑art multimodal understanding, and achieves new SOTA results across multiple benchmarks.

AI research · MLLM · MetaQuery

0 likes · 18 min read

Architect

Mar 24, 2025 · Artificial Intelligence

How Multimodal Alignment Is Shaping the Future of Large Language Models

This article provides a systematic review of recent advances in multimodal alignment for large language models, covering key contributions, application scenarios, dataset construction, evaluation benchmarks, future challenges, and insights from LLM alignment research to guide both academia and industry.

AI safety · Dataset Construction · MLLM

0 likes · 26 min read

JD Tech Talk

Mar 19, 2025 · Artificial Intelligence

Reliable Advertising Image Generation and Creative Selection Using Multimodal Feedback and MLLM Representations

In 2024, the advertising team introduced a suite of AI‑driven techniques, including a trustworthy feedback network, a large‑scale human‑annotated dataset, multimodal large language model representations, and online ranking architecture upgrades, to improve the quality, coverage, and personalization of generated ad creatives.

AIGC · Advertising · MLLM

0 likes · 10 min read

AntTech

Mar 14, 2025 · Artificial Intelligence

MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding

The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents a novel algorithm that enhances multimodal large language models' ability to perceive and reason about graphical user interfaces by integrating text, visual, and spatial signals through specialized perception modules and a dynamic fusion gate, achieving state‑of‑the‑art performance on multiple GUI benchmarks.

CVPR2025 · GUI Understanding · MLLM

0 likes · 5 min read
