Tagged articles
70 articles
Machine Learning Algorithms & Natural Language Processing
May 14, 2026 · Artificial Intelligence

Turning Multi‑Teacher Conflict into Dynamic Constraints: Robust Reasoning Alignment for Multimodal LLMs (ICML 2026)

APO (Autonomous Preference Optimization) converts the drift and conflict among multiple teacher multimodal LLMs into dynamic negative constraints while treating consensus as a positive preference, enabling robust concept alignment and superior diagnostic accuracy on the CXR‑MAX benchmark, as demonstrated by extensive ICML‑2026 experiments.

APO, ICML 2026, concept drift
0 likes · 11 min read
Machine Heart
May 13, 2026 · Artificial Intelligence

Turning Multi-Teacher Conflict into Dynamic Constraints for Precise Multimodal Model Alignment (ICML 2026)

The paper introduces APO, a novel autonomous preference optimization framework that converts concept drift among multiple teacher multimodal LLMs into dynamic negative constraints and treats consensus as a positive preference, achieving robust concept alignment and surpassing strong teachers on a high‑risk medical X‑ray benchmark.

APO, CXR-MAX, ICML 2026
0 likes · 11 min read
Machine Heart
May 13, 2026 · Artificial Intelligence

Super‑Charging MiniCPM‑V 4.6 on One RTX 4090: 1B‑Parameter Multimodal Model Sets New Efficiency Bar

MiniCPM‑V 4.6, a 1.3B‑parameter multimodal LLM, outperforms rivals such as Qwen3.5‑0.8B and Gemma 4 on both accuracy and speed, thanks to early ViT token compression and 4×/16× visual token reduction. It delivers sub‑100 ms latency and over 2.6k tokens/s throughput on a single RTX 4090 and also runs offline on mobile devices.

MiniCPM-V, RTX 4090, Token Compression
0 likes · 16 min read
Machine Heart
May 11, 2026 · Artificial Intelligence

Why Visual Perception Limits STEM Large Models and How CodePercept Breaks the Barrier

The authors demonstrate that visual perception, not reasoning, is the primary bottleneck for STEM multimodal large language models, introduce the CodePercept paradigm and the ICC-1M dataset, and show that code‑driven perception dramatically improves performance, surpassing much larger models on new benchmarks.

Benchmark, CVPR2026, CodePercept
0 likes · 9 min read
Data Party THU
Apr 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Emotions? Inside the MME-Emotion Benchmark

The MME-Emotion benchmark, introduced by researchers from CUHK and Alibaba Tongyi and accepted at ICLR 2026, provides a large‑scale, multimodal evaluation of emotional intelligence in large language models, revealing current models’ limited emotion recognition and reasoning abilities across diverse real‑world scenarios.

AI, Benchmark, MME-Emotion
0 likes · 10 min read
PaperAgent
Apr 8, 2026 · Artificial Intelligence

How Dynamic Computation Cuts Redundancy in Decoder-Only Multimodal LLMs

This article examines the visual token redundancy in decoder-only multimodal large language models and introduces a training-free dynamic computation reduction framework—featuring Probe-Activated Dynamic FFN, Hollow Attention, and a Layer Ranking Algorithm—that significantly lowers inference cost while preserving performance.

decoder-only architecture, dynamic computation, efficient inference
0 likes · 12 min read
Machine Heart
Apr 3, 2026 · Artificial Intelligence

Google Open‑Sources Gemma 4, Outperforming a 13×‑Larger Qwen 3.5

Google DeepMind released the open‑source Gemma 4 family: four model sizes ranging from 2B to 31B parameters, supporting text, images, video and audio, with up to a 256k‑token context and Apache 2.0 licensing. Benchmark results place it on par with the 397B Qwen 3.5 despite being far smaller.

Apache 2.0, Benchmark, Gemma 4
0 likes · 11 min read
Weekly Large Model Application
Mar 23, 2026 · Artificial Intelligence

Inside Step‑Audio2: End‑to‑End Multimodal Audio LLM Architecture and Deployment

This article dissects Step‑Audio2, an industrial‑grade multimodal large language model that unifies speech understanding, translation, dialogue and audio generation in a single causal LM, detailing its inference pipeline, key implementation tricks, deployment modes, strengths, limitations, and suitable application scenarios.

Python, Speech synthesis, Step-Audio2
0 likes · 10 min read
AI Frontier Lectures
Mar 16, 2026 · Artificial Intelligence

Can Multimodal LLMs Truly Understand Human Emotions? Introducing the MME-Emotion Benchmark

This article presents MME-Emotion, a large‑scale multimodal benchmark that evaluates both emotion recognition and reasoning abilities of multimodal large language models across 27 real‑world scenarios, revealing current models’ significant gaps in emotional intelligence and outlining future research directions.

AI, Benchmark, Dataset
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Mar 13, 2026 · Artificial Intelligence

Can Multimodal LLMs Beat Humans in Real Web Search? GPT‑5.2 Scores Only 36% on New BrowseComp‑V3 Benchmark

A new multimodal browsing benchmark, BrowseComp‑V3, reveals that human experts achieve a 68.03% success rate while the strongest closed‑source model, GPT‑5.2, manages just 36.17%, highlighting current limitations in deep web‑scale visual‑text reasoning and the critical role of tool‑augmented agents.

GPT-5.2, OmniSeeker, human performance
0 likes · 12 min read
Machine Learning Algorithms & Natural Language Processing
Mar 12, 2026 · Artificial Intelligence

LongHorizonUI: A Unified Robust Framework for Long‑Horizon GUI Agent Automation

LongHorizonUI tackles the steep success‑rate drop of GUI agents on tasks longer than 10‑15 steps by introducing three tightly coupled modules—enhanced perception, deep reflective decision, and compensatory execution—and validates the approach on the new LongGUIBench benchmark with consistent performance gains across both app and game scenarios.

Benchmark, GUI automation, ICLR 2026
0 likes · 12 min read
Huolala Tech
Mar 4, 2026 · Artificial Intelligence

How Lalamove Built an AI‑Powered Edge‑Cloud Review System for Global Driver Verification

Lalamove tackled the scalability and accuracy challenges of worldwide driver onboarding by designing a layered edge‑cloud AI architecture that combines lightweight mobile models, cloud‑based large‑language and computer‑vision models, OCR, and multimodal LLMs to filter low‑quality inputs, automate identity checks, and reduce manual effort while maintaining data compliance.

AI, Driver Verification, OCR
0 likes · 12 min read
AIWalker
Mar 3, 2026 · Artificial Intelligence

RetouchIQ’s Instruction‑Driven AI Editing Overcomes Traditional Retouching Limits

RetouchIQ introduces an instruction‑driven AI retouching system that uses a general reward model to interpret abstract user commands, delivering precise image adjustments with higher semantic consistency and visual naturalness than existing multimodal large language models, thereby lowering the technical barrier for cinematic‑style edits.

AI Image Editing, RetouchIQ, Reward model
0 likes · 3 min read
Data Party THU
Feb 25, 2026 · Artificial Intelligence

Why Multimodal LLMs Miss Tiny Objects—and How to Fix It

This article analyzes why multimodal large language models often fail to detect small objects, identifies three core bottlenecks, and presents a four‑tiered optimization roadmap—from zero‑cost inference tricks to data augmentation, model fine‑tuning, and engineering safeguards—backed by three real‑world case studies and actionable guidelines.

Inference Optimization, data augmentation, model fine-tuning
0 likes · 20 min read
AI Engineering
Feb 16, 2026 · Artificial Intelligence

Qwen3.5-397B: 397B‑Parameter Multimodal LLM Boosts Inference Speed 8‑19×

Alibaba’s Qwen3.5-397B-A17B, a 397‑billion‑parameter open‑source multimodal LLM, combines mixed linear attention with a sparse MoE architecture to achieve 8.6‑19× higher decoding throughput than Qwen3‑Max, supports 201 languages, and can be deployed via vLLM, Docker, Transformers, or SGLang with various optimization presets.

Inference Optimization, large language model, multimodal LLM
0 likes · 8 min read
JD Tech
Jan 27, 2026 · Artificial Intelligence

How Uni-Layout Unifies Cross‑Task Layout Generation with Human‑Like Evaluation

Uni-Layout introduces a unified framework that integrates a universal layout generator, a human‑feedback‑simulating evaluator, and a dynamic margin preference optimization technique to align generation and evaluation across diverse e‑commerce design tasks, backed by a new 100k human‑annotated dataset.

Human Feedback, dynamic margin optimization, e-commerce design
0 likes · 11 min read
JD Cloud Developers
Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Unifying Layout Generation with Human Feedback and Dynamic Alignment

Uni-Layout introduces a unified framework that combines a multimodal large language model‑based generator, a human‑like evaluator trained on the large Layout‑HF100k dataset, and a Dynamic Margin Preference Optimization (DMPO) method to align generation and evaluation, achieving state‑of‑the‑art results across diverse layout tasks.

DMPO, Human Feedback, evaluation
0 likes · 11 min read
JD Tech Talk
Jan 15, 2026 · Artificial Intelligence

Uni-Layout: Harnessing Human Feedback for Unified Layout Generation and Evaluation

Uni-Layout introduces a unified framework that generates layouts across diverse tasks, simulates human evaluation with a novel feedback dataset, and aligns generation and assessment through dynamic margin preference optimization, achieving state‑of‑the‑art performance on multiple benchmarks.

AI design, Human Feedback, evaluation
0 likes · 11 min read
JD Retail Technology
Jan 8, 2026 · Artificial Intelligence

Uni-Layout: Unified Cross-Task Layout Generation with Human-Aligned Evaluation

Uni-Layout introduces a unified layout generation framework that consolidates diverse design tasks, leverages multimodal large language models for flexible generation, and aligns outputs with human perception through a novel human‑feedback dataset (Layout‑HF100k) and a dynamic margin preference optimization (DMPO) evaluator.

ACM Multimedia, Human Feedback, dynamic margin optimization
0 likes · 11 min read
Xiaohongshu Tech REDtech
Dec 4, 2025 · Artificial Intelligence

CrossVid: A New Benchmark Reveals the Limits of Multimodal LLMs in Cross‑Video Reasoning

CrossVid is an open‑source benchmark that evaluates multimodal large language models on cross‑video reasoning tasks, providing 5,331 videos, 9,015 QA pairs, four high‑level dimensions and ten specific tasks, and exposing significant performance gaps between current models and humans.

AI Evaluation, cross-video reasoning, multimodal LLM
0 likes · 9 min read
HyperAI Super Neural
Nov 28, 2025 · Artificial Intelligence

Weekly AI paper roundup: protein design, open‑source agent, HunyuanOCR, Olmo 3

This weekly roundup highlights five recent AI papers—including HumanSense for multimodal LLM evaluation, JAM‑2 for de novo antibody design, the open‑source Olmo 3 language models, the Lumine generalist 3D agent, and the lightweight HunyuanOCR vision‑language model—summarizing their core contributions, results, and links.

OCR, generalist agents, multimodal LLM
0 likes · 6 min read
Baobao Algorithm Notes
Nov 13, 2025 · Artificial Intelligence

Introducing UNO‑Bench: The First Unified Omni‑Modal LLM Evaluation Suite

UNO‑Bench, an open‑source benchmark from Meituan’s LongCat team, provides the first high‑quality, low‑redundancy unified evaluation framework for omni‑modal large language models, featuring 1,250 manually annotated cross‑modal samples and 2,480 enhanced single‑modal samples covering 44 fine‑grained tasks and five modality combinations.

AI Scaling Law, Benchmark, data pipeline
0 likes · 15 min read
Tencent Technical Engineering
Nov 5, 2025 · Artificial Intelligence

iDetex: The Winning AI Model Transforming Image Quality Assessment

iDetex, the champion solution of the ICCV 2025 MIPI Detailed Image Quality Assessment Challenge, introduces a novel multimodal LLM-driven framework that precisely locates, describes, and grades image distortions, outperforming traditional IQA models and enabling practical deployments across video, live streaming, e‑commerce, and image‑processing pipelines.

AI, Computer Vision, ICCV 2025
0 likes · 18 min read
HyperAI Super Neural
Oct 27, 2025 · Artificial Intelligence

Weekly AI Paper Digest: New OCR Model, Multimodal LLM, Next‑Gen DNA Sequencing

This week’s AI roundup highlights five recent papers: DeepSeek‑OCR’s context‑compression model for large‑scale data generation, Rex‑Omni’s 3‑billion‑parameter multimodal LLM achieving state‑of‑the‑art object perception, Alpha‑Service’s proactive AI‑glass framework, a bias‑variance approach to narrowing cross‑lingual gaps, and GATK’s MapReduce‑based toolkit for next‑generation DNA sequencing.

AI Glasses, Cross-lingual NLP, DNA Sequencing
0 likes · 6 min read
Bighead's Algorithm Notes
Oct 17, 2025 · Artificial Intelligence

Exploring MLLM4TS: A Universal Multimodal Framework for Time‑Series Analysis

This article reviews the MLLM4TS framework, which fuses visual representations of multivariate time series with large language models to address complex temporal dependencies, cross‑channel interactions, and task generalization, and demonstrates superior performance on classification, anomaly detection, forecasting, and few‑shot scenarios across multiple benchmarks.

Ablation Study, Benchmark results, Few‑Shot Learning
0 likes · 11 min read
Amap Tech
Oct 4, 2025 · Artificial Intelligence

How JanusVLN Redefines Vision‑Language Navigation with Dual Implicit Memory

JanusVLN presents a groundbreaking Vision‑and‑Language Navigation framework that decouples semantic understanding from spatial geometry using dual implicit memory, eliminates explicit memory overhead, achieves state‑of‑the‑art performance with only RGB video input, and dramatically improves efficiency and generalization across VLN benchmarks.

3D spatial reasoning, Dual Implicit Memory, multimodal LLM
0 likes · 10 min read
Bighead's Algorithm Notes
Oct 2, 2025 · Artificial Intelligence

FinZero: Multimodal Large‑Model Reasoning for Financial Time‑Series Forecasting

FinZero is a multimodal large model that leverages a 3‑billion‑parameter Qwen2.5‑VL backbone fine‑tuned with the UARPO strategy on the FVLDB dataset, enabling accurate financial time‑series prediction and uncertainty quantification, and outperforming larger models such as GPT‑4o by about 13.5% in high‑confidence groups.

FinZero, GPT-4o comparison, Qwen2.5-VL-3B
0 likes · 10 min read
Data Party THU
Sep 26, 2025 · Artificial Intelligence

How Keye‑VL‑1.5 Redefines Video Understanding with Slow‑Fast Encoding

Keye‑VL‑1.5, an 8‑billion‑parameter multimodal large language model, introduces a Slow‑Fast video encoding strategy, a four‑stage progressive pre‑training pipeline with 128K context, and a sophisticated post‑training regime that together achieve state‑of‑the‑art performance on video and vision‑language benchmarks while maintaining strong general capabilities.

Benchmark, large language model, multimodal LLM
0 likes · 21 min read
HyperAI Super Neural
Sep 19, 2025 · Artificial Intelligence

Weekly AI Paper Roundup: RL Advances, Tree‑Structured QA, and GraphRAG Breakthroughs

This article surveys five recent AI papers, covering reinforcement learning for large reasoning models, a tree‑structured table QA framework (ST‑Raptor), visual representation alignment for multimodal LLMs, GraphRAG‑based generation, and an LLM‑driven cryptographic vulnerability detector, each with key insights and links.

cryptographic vulnerability detection, graph retrieval, large language models
0 likes · 5 min read
Kuaishou Tech
Sep 16, 2025 · Artificial Intelligence

How Kling-Avatar Generates Long, Emotionally Rich Digital Human Videos with Multimodal LLMs

Kuaishou's Kling-Avatar leverages a multimodal large‑language‑model‑driven two‑stage generation framework to produce minute‑long digital‑human videos that synchronize lip movements, facial expressions, and body gestures with audio, achieving high visual quality, identity consistency, and controllable storytelling across diverse scenarios.

AI Avatar, Digital Human, Video Generation
0 likes · 9 min read
Kuaishou Large Model
Sep 8, 2025 · Artificial Intelligence

Keye-VL-1.5-8B: The New Multimodal LLM That Beats GPT-4o on Vision Benchmarks

Kwai's newly released Keye-VL-1.5-8B multimodal large language model dramatically improves visual, reasoning, and temporal understanding, achieving top scores on public video benchmarks and surpassing closed‑source models like GPT‑4o, while offering an open‑source release and detailed technical documentation.

benchmark performance, multimodal LLM, open-source
0 likes · 11 min read
DaTaobao Tech
Sep 3, 2025 · Artificial Intelligence

Why a Simple Workflow Beats Complex Agents in AI‑Powered Insurance Audits

A retrospective of an AI‑based insurance claim audit project shows that a well‑designed workflow, precise prompt engineering, and rule‑based pre‑filtering can achieve stable, high‑accuracy results, while overly complex agent architectures often become fragile patchwork solutions.

AI audit, Prompt Engineering, insurance claim
0 likes · 24 min read
Network Intelligence Research Center (NIRC)
Aug 27, 2025 · Artificial Intelligence

Perception‑R1: A Rule‑Based RL Method that Elevates Multimodal Model Vision

Perception‑R1, a post‑training framework that applies rule‑based reinforcement learning to existing multimodal LLMs, dramatically improves visual perception tasks such as grounding, OCR, counting and object detection, as demonstrated by extensive benchmarks and ablation studies.

GRPO, Perception Policy, Reward Modeling
0 likes · 10 min read
vivo Internet Technology
Aug 25, 2025 · Artificial Intelligence

How DiMo-GUI Boosts Multimodal LLMs for GUI Grounding Without Training

DiMo-GUI is a plug‑and‑play framework that dramatically improves multimodal large language models' ability to locate GUI elements by using a hierarchical dynamic visual reasoning loop and modality‑aware optimization, achieving up to double the performance on high‑resolution GUI benchmarks without any additional training data.

GUI grounding, Test-Time Scaling, dynamic visual reasoning
0 likes · 7 min read
AIWalker
Aug 13, 2025 · Artificial Intelligence

Look-Back Triggers Visual Reflection in Qwen-2.5-VL, +6.3% Perception

Look-Back is an implicit training paradigm that enables the Qwen‑2.5‑VL‑7B multimodal LLM to autonomously re‑focus on visual inputs during reasoning, achieving a 6.3% boost on perception tasks and outperforming prior baselines while requiring no extra image tokens or changes to the model architecture.

Look-Back, Qwen-2.5-VL, implicit training
0 likes · 26 min read
AI Algorithm Path
Aug 9, 2025 · Artificial Intelligence

How LoRA Enables Multimodal Capabilities in Large Language Models

This article compares two ways to add vision to large language models—training a native multimodal model from scratch or attaching a visual module to a pretrained LLM—then details the VoRA approach that uses LoRA adapters to inject visual knowledge without extra inference cost.

Chameleon, LLaVA, LoRA
0 likes · 7 min read
Bilibili Tech
Aug 8, 2025 · Artificial Intelligence

Can Language‑Centric Tree Reasoning Transform Video Question Answering?

This article introduces a language‑centric tree reasoning (LTR) framework that recursively decomposes VideoQA queries into perceptual sub‑questions and performs bottom‑up logical inference with video assistance, achieving significantly higher accuracy and explainability across eleven benchmark datasets.

Tree Reasoning, VideoQA, artificial intelligence
0 likes · 17 min read
AIWalker
Aug 5, 2025 · Artificial Intelligence

Perception‑R1: RL Gives Visual Insight Without Chain‑of‑Thought, Beats Four Tasks

The paper introduces Perception‑R1, a rule‑based reinforcement‑learning framework that trains multimodal large language models for visual perception tasks without relying on chain‑of‑thought reasoning, and demonstrates up to 17.9% performance gains on RefCOCO+, PixMo‑Count, PageOCR and COCO2017, while analyzing the key roles of perception confusion and reward design.

Benchmark, RLHF, Visual Perception
0 likes · 24 min read
AIWalker
Aug 3, 2025 · Artificial Intelligence

CVPR 2025: DeQA-Score Lets LLMs Predict Image Quality Score Distributions

DeQA-Score introduces a soft‑label discretization that lets multimodal large language models regress continuous image‑quality scores as Gaussian distributions, achieving 30× lower mean error and preserving variance and inter‑image relationships, with KL‑divergence and fidelity losses driving state‑of‑the‑art performance.

CVPR2025, DeQA-Score, image quality assessment
0 likes · 8 min read
Meituan Technology Team
Jul 31, 2025 · Artificial Intelligence

8 Must-Read ACL 2025 Papers from Meituan: Generative Retrieval, Multimodal LLMs & More

Meituan’s research team showcases eight ACL 2025 papers spanning generative retrieval, multi‑objective preference alignment, rich‑text image understanding, cross‑language transfer, multimodal math reasoning, and more, offering insights and breakthroughs that can inspire and aid fellow researchers.

ACL 2025, Code-Switching, Generative Retrieval
0 likes · 15 min read
Amap Tech
Jul 24, 2025 · Artificial Intelligence

FingER: Fine-Grained Evaluation and Reasoning for AI-Generated Videos

The paper introduces FingER, an entity-level evaluation framework and the FingER-Instruct-60k dataset for assessing AI-generated video quality with fine-grained reasoning, and demonstrates state-of-the-art zero-shot performance on multiple benchmarks using novel training strategies.

AI-generated video, Dataset, fine-grained evaluation
0 likes · 9 min read
AntTech
Jul 2, 2025 · Artificial Intelligence

How Multimodal Large Models Revolutionize UI Automation Testing

This article details how Ant Group leverages multimodal large‑language models and multi‑agent architectures to create a low‑code, AI‑driven UI automation testing framework that improves test coverage, reduces manual effort, and scales across diverse mobile mini‑program scenarios.

AI testing, Software quality, UI automation
0 likes · 9 min read
AIWalker
Jun 30, 2025 · Artificial Intelligence

ICCV 2025 MIPI Workshop Launches ViDA-UGC: A New UGC Image Quality Assessment Challenge

The ICCV MIPI workshop introduces the ViDA-UGC competition, presenting a richly annotated UGC image quality dataset, a benchmark suite covering degradation detection, region perception, and quality description, detailed evaluation metrics, submission formats, prize information, and open participation for researchers worldwide.

Benchmark, Dataset, ICCV
0 likes · 15 min read
Amap Tech
Apr 21, 2025 · Artificial Intelligence

Lenna: Language‑Enhanced Reasoning Detection Assistant and a Chain‑of‑Thought Image Editing Framework Using Multimodal Large Language Models

At ICASSP 2025, Gaode’s two accepted papers present Lenna, a language‑enhanced reasoning detection assistant that adds a DET token to multimodal LLMs and achieves state‑of‑the‑art accuracy on RefCOCO benchmarks, and a chain‑of‑thought image‑editing framework that converts complex prompts into segmented masks and repair prompts for diffusion‑based inpainting, surpassing existing methods.

AI, Computer Vision, ICASSP
0 likes · 10 min read
AIWalker
Apr 11, 2025 · Artificial Intelligence

Teaching Large Language Models to Predict Image Quality Scores with DeQA-Score

DeQA-Score, a CVPR 2025 work, shows how to train multimodal large language models to regress continuous image quality scores by discretizing scores into soft-label level tokens, preserving Gaussian distribution statistics and achieving state‑of‑the‑art performance.

CVPR2025, DeQA-Score, image quality assessment
0 likes · 8 min read
AI Frontier Lectures
Apr 3, 2025 · Artificial Intelligence

How ChartMoE Uses Sparse MoE to Master Chart Understanding and Preserve General Knowledge

ChartMoE, an oral paper at ICLR 2025, introduces a multi‑stage alignment training pipeline and a diversified MoE Connector that dramatically improves chart comprehension while maintaining performance on general multimodal tasks, backed by extensive data construction, training recipes, and thorough evaluations.

Chart Understanding, ChartMoE, Mixture of Experts
0 likes · 10 min read
Snowball Engineer Team
Mar 31, 2025 · Frontend Development

Leveraging Multimodal Large Language Models for Frontend Automated Testing (NL2Test)

This article explores how multimodal large language models (MM‑LLMs) combined with structured prompt engineering can transform frontend regression testing by enabling natural‑language‑driven test case generation, visual verification, and script self‑healing, thereby reducing maintenance costs and improving coverage across dynamic UI scenarios.

AI automation, NL2Test, multimodal LLM
0 likes · 17 min read
JD Tech
Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR optimization, Reward model, advertising image generation
0 likes · 10 min read
AI Frontier Lectures
Mar 25, 2025 · Artificial Intelligence

What Drives Alignment in Multimodal Large Language Models? A Comprehensive Review

This article provides an in‑depth review of alignment algorithms for multimodal large language models, covering application scenarios, dataset construction methods, evaluation benchmarks, current challenges, and future research directions, while summarizing contributions from leading academic institutions.

AI research, alignment algorithms, dataset construction
0 likes · 22 min read
AI Algorithm Path
Mar 20, 2025 · Artificial Intelligence

Understanding Multimodal Large Language Models: Recent Advances and Comparative Analysis

This article surveys the latest multimodal large language model research, dissecting the design, training strategies, and performance trade‑offs of models such as Llama 3.2, Molmo, NVLM, Qwen2‑VL, Pixtral, MM1.5, Emu3, and Janus, and highlights the challenges of fair cross‑model evaluation.

AI research, Cross-Attention, Model Training Strategies
0 likes · 16 min read
AI Algorithm Path
Mar 19, 2025 · Artificial Intelligence

Understanding Multimodal Large Language Models: Part 1

This article explains the fundamentals of multimodal large language models, covering their definition, typical applications, two main architectural approaches—unified embedding decoder and cross‑modal attention—along with detailed component breakdowns, a PyTorch implementation of image‑patch projection, and training considerations, ending with a discussion of trade‑offs between the methods.

Cross-Attention · Image Encoder · Linear Projection
0 likes · 14 min read
AntTech
Mar 14, 2025 · Artificial Intelligence

MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding

The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents a novel algorithm that enhances multimodal large language models' ability to perceive and reason about graphical user interfaces by integrating text, visual, and spatial signals through specialized perception modules and a dynamic fusion gate, achieving state‑of‑the‑art performance on multiple GUI benchmarks.

CVPR2025 · Computer Vision · GUI Understanding
0 likes · 5 min read
JD Retail Technology
Mar 14, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models

The paper presents CAIG, a CTR‑driven advertising image generation pipeline that pre‑trains a multimodal LLM on e‑commerce data, trains a reward model on CTR‑labeled image pairs, and fine‑tunes generation via product‑centric preference optimization, achieving state‑of‑the‑art online and offline performance.

AI · CTR · ad image generation
0 likes · 11 min read
JD Cloud Developers
Mar 13, 2025 · Artificial Intelligence

Can Multimodal LLMs Boost Ad Click‑Through Rates? Introducing CTR‑Driven Image Generation

This paper presents a CTR‑driven advertising image generation framework that leverages multimodal large language models, reward modeling, and reinforcement learning to produce product‑centric ad visuals with higher click‑through performance, validated by extensive offline and online experiments.

CTR optimization · Reward model · advertising image generation
0 likes · 13 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Distributed Training · PPO · PRM
0 likes · 21 min read
Full-Stack Cultivation Path
Nov 25, 2024 · Artificial Intelligence

Get High-Quality OCR with Ollama-OCR in Just a Few Lines of Code

This guide shows how to set up the open‑source Ollama‑OCR tool, which leverages the Llama 3.2‑Vision multimodal model to perform high‑quality OCR, covering installation of Ollama, the vision model, the OCR package, and example code for plain‑text and Markdown outputs.

Llama 3.2-Vision · Node.js · OCR
0 likes · 6 min read
NewBeeNLP
Nov 11, 2024 · Artificial Intelligence

What Do Recent Multimodal LLM Papers Reveal About Vision‑Language Models?

This article surveys ten recent multimodal large language model papers, covering vision representation laws, a stricter instruction benchmark, safety impacts of visual adaptation, the Mini‑Gemini architecture, automatic pruning, vision capability boosting, long‑context transfer, efficient token sparsification, math reasoning, and hallucination mitigation.

Benchmark · Training Strategies · Vision-Language Models
0 likes · 18 min read
DataFunSummit
Nov 1, 2024 · Artificial Intelligence

Progress in Multimodal Large Language Models: Background, Architecture, Evolution, Team Work, and Future Outlook

This article reviews recent advances in multimodal large language models, covering their background, architectural components, training strategies, application scenarios, evaluation benchmarks, team research on hallucination mitigation and long‑video understanding, and outlines promising future research directions.

Model architecture · evaluation benchmarks · future research
0 likes · 15 min read
Baobao Algorithm Notes
Oct 24, 2024 · Artificial Intelligence

How NoteLLM-2 Boosts Multimodal Recommendations with In-Context Learning

NoteLLM-2 introduces multimodal In-Context Learning and Late Fusion to overcome visual‑modality bias in end‑to‑end fine‑tuned large representation models, delivering significant gains over baseline multimodal LLMs and traditional retrieval methods in recommendation tasks.

AI research · Recommendation Systems · contrastive learning
0 likes · 11 min read
Baobao Algorithm Notes
Oct 17, 2024 · Artificial Intelligence

How Meta’s Movie Gen Pushes Text‑to‑Video Generation to New Heights

Meta’s newly released 92‑page Movie Gen paper introduces a multimodal LLM that unifies text‑to‑image, text‑to‑video, personalized video, precise video editing, and audio generation, detailing its dual‑model architecture, training pipeline, temporal auto‑encoder design, scaling strategies, evaluation benchmark, and ablation studies.

Deep Learning · Model Scaling · Video Generation
0 likes · 34 min read
360 Tech Engineering
Jun 25, 2023 · Artificial Intelligence

Visual Capability as a Fundamental Requirement for AGI and the SEEChat Multimodal Dialogue Model

The article reviews why visual ability is essential for artificial general intelligence, compares native multimodal and expert‑stitching integration approaches, and details the architectures of models such as KOSMOS‑1, PaLM‑E, Flamingo, BLIP‑2, LLaVA, and MiniGPT‑4. It then introduces the SEEChat project, which fuses a CLIP vision encoder with ChatGLM‑6B via a projection layer, and presents its training pipeline, experimental results, and future directions.

AGI · Image Captioning · SEEChat
0 likes · 13 min read