Tagged articles

multimodal model

21 articles · Page 1 of 1

Jun 9, 2026 · Artificial Intelligence

Run Gemma 4 12B on a 16 GB Laptop – Near‑26B MoE Performance via Encoder‑Free Design

Google DeepMind’s Gemma 4 12B model uses a novel encoder‑free architecture that unifies text, image, and audio processing. It delivers performance close to a 26B MoE model while running on a consumer‑grade laptop with only 16 GB of memory, and HyperAI provides a one‑click notebook for easy deployment.

16GB laptop · AI Deployment · Gemma 4

0 likes · 5 min read

Machine Heart

Jun 4, 2026 · Artificial Intelligence

Is Google I/O’s Biggest Winner Not Google? Inside Gemini Omni Flash

Google’s Gemini Omni Flash, unveiled at I/O, lets users generate and edit videos from any modality via natural‑language prompts, but user tests reveal smooth editing alongside notable limits in facial consistency, long‑shot detail, and usage quotas, especially when compared with competing models like Seedance 2.0.

AI video generation · Gemini Omni Flash · Google I/O

0 likes · 8 min read

SuanNi

May 22, 2026 · Artificial Intelligence

All‑In‑One Image & Video: ByteDance’s Deployable Native Multimodal Model Lance

Lance, ByteDance’s newly open‑sourced 3‑billion‑parameter multimodal model, runs on a single 40 GB GPU, tops HuggingFace trend charts, and achieves leading scores on DPG Bench, GenEval, and video generation benchmarks while surpassing several state‑of‑the‑art single‑modal models.

AI research · ByteDance · Lance

0 likes · 3 min read

Machine Heart

Apr 10, 2026 · Artificial Intelligence

Alibaba’s HappyHorse-1.0 Tops AI Video Generation Leaderboard

Alibaba’s HappyHorse-1.0, a new text‑to‑video and image‑to‑video model from the ATH team, claimed the #1 spot on Artificial Analysis’s video arena rankings, matching ByteDance’s Dreamina Seedance 2.0 in Elo score. It supports four generation modes, and its API is expected to open around April 30.

AI video generation · Alibaba · Artificial Analysis

0 likes · 3 min read

AI Explorer

Apr 7, 2026 · Artificial Intelligence

MedGRPO Redefines Med Video Understanding, Shifting AI from Assistant to Partner

MedGRPO, a multimodal large model, achieves a breakthrough in medical video understanding by introducing clinical semantic parsing that aligns visual cues with structured medical knowledge, boosting performance and raising ethical questions about AI’s evolving role from a supportive assistant to a collaborative clinical partner.

AI ethics · Clinical Semantic Parsing · medical AI

0 likes · 6 min read

SuanNi

Apr 6, 2026 · Artificial Intelligence

How OmniLottie Turns Text and Images into High‑Quality Vector Animations

OmniLottie, a collaborative framework from Fudan, HKU, and Queensland University, uses a specialized tokenizer and a large multimodal model to compress Lottie files and generate vector animations from text, images, or video, achieving state‑of‑the‑art performance on custom benchmarks and in extensive evaluations.

AI · Lottie · Vector Animation

0 likes · 11 min read

AI Engineering

Mar 31, 2026 · Artificial Intelligence

Qwen3.5-Omni Introduces Audio‑Visual Vibe Coding: Code by Speaking and Gesturing

Alibaba's newly released Qwen3.5-Omni multimodal model adds an Audio‑Visual Vibe Coding feature that lets users describe a website or game with speech and gestures to generate code, while offering advanced audio comprehension, long‑duration media support, multilingual capabilities, fine‑grained voice control, and voice cloning, though its weights remain closed‑source.

AI · Alibaba · Audio-Visual Vibe Coding

0 likes · 3 min read

Xiaomi Tech

Mar 18, 2026 · Artificial Intelligence

Xiaomi’s MiMo‑V2‑Omni: A Full‑Modal Agent Base that Sees, Listens, and Acts

Xiaomi unveiled MiMo‑V2‑Omni, a full‑modal agent base that unifies text, image, video and audio perception with tool‑calling and GUI actions, outperforming leading models such as Gemini 3 Pro and Claude Opus 4.6 on benchmarks, and offering a 256K‑context API for diverse real‑world tasks.

API · Agent AI · MiMo-V2-Omni

0 likes · 8 min read

Fun with Large Models

Feb 17, 2026 · Artificial Intelligence

Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features

Qwen3.5‑397B‑A17B, the newly open‑sourced multimodal giant, combines a 400‑billion‑parameter sparse MoE architecture with FP8 pipelines and an asynchronous RL framework to deliver GPT‑5.2‑level capabilities, 60% lower memory usage, up to 19× higher throughput, and extensive image, video, and agent support, while outlining its deployment requirements and API pricing.

AI inference · FP8 · Qwen3.5

0 likes · 11 min read

DataFunSummit

Oct 31, 2025 · Artificial Intelligence

How OPPO’s AndesVL Is Revolutionizing On‑Device Multimodal AI

OPPO AI Center introduces AndesVL, an open‑source, fully‑adapted multimodal large model ranging from 0.6B to 4B parameters, designed for high‑performance, privacy‑preserving, low‑latency AI on mobile devices, with advanced architecture, training pipelines, on‑device optimizations, and state‑of‑the‑art benchmark results.

Large Language Model · mobile AI · model compression

0 likes · 21 min read

Alipay Experience Technology

Sep 30, 2025 · Artificial Intelligence

How UI-UG Unifies UI Understanding and Generation with a 7B Multimodal Model

The open‑source UI‑UG‑7B multimodal model from Alipay combines UI understanding and generation in a single framework, delivering state‑of‑the‑art performance across referring, grounding, captioning, and code generation tasks while dramatically speeding up UI creation for developers.

Artificial Intelligence · UI Understanding · UI generation

0 likes · 12 min read

HyperAI Super Neural

Sep 26, 2025 · Artificial Intelligence

Redefining Next‑Gen OCR: IBM’s Open‑Source Granite‑Docling‑258M for Unified Structure and Content Understanding

IBM’s newly released open‑source model Granite‑Docling‑258M tackles the long‑standing challenge of converting diverse digital documents into machine‑readable, structured data. It preserves layout, tables, and formulas and supports multiple languages, while remaining lightweight at 258M parameters and outperforming its predecessor SmolDocling‑256M‑Preview.

Docling · IBM · Multilingual

0 likes · 5 min read

Amap Tech

Jul 24, 2025 · Artificial Intelligence

FingER: Fine-Grained, Reasoning‑Based Evaluation of AI‑Generated Videos

This article introduces FingER, a novel entity‑level evaluation framework and the FingER‑Instruct‑60k dataset for assessing AI‑generated video quality with fine‑grained reasoning, and demonstrates its state‑of‑the‑art performance on multiple benchmarks using advanced training strategies such as GRPO.

AI-generated video · fine-grained evaluation · multimodal model

0 likes · 9 min read

Alibaba Cloud Big Data AI Platform

May 22, 2025 · Artificial Intelligence

Deploy NVIDIA Cosmos Reason-1: Zero‑Code Physical AI on Alibaba Cloud PAI

Cosmos Reason-1, a customizable multimodal physical AI model from NVIDIA, can be deployed quickly on Alibaba Cloud’s PAI‑Model Gallery with zero code. The deployment offers automatic cloud resource adaptation, ready‑to‑use APIs, and enterprise‑grade security, and the model demonstrates superior reasoning on video tasks; upcoming tools will enable fine‑tuning via SFT and RL.

Alibaba Cloud · NVIDIA · Zero‑Code Deployment

0 likes · 8 min read

AI Frontier Lectures

Apr 11, 2025 · Artificial Intelligence

How Q-Insight Uses Reinforcement Learning to Make AI Truly Understand Image Quality

Q-Insight, a multimodal large model introduced by Peking University and Volcano Engine, applies reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm to evaluate image quality, providing detailed reasoning, degradation detection, and zero‑shot comparison, and outperforming state‑of‑the‑art methods on multiple benchmarks.

AI · Video Cloud · computer vision

0 likes · 10 min read

Baidu Geek Talk

Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

DeepSeek-VL2 · Mixture of Experts · PaddleMIX

0 likes · 36 min read

MaGe Linux Operations

Mar 26, 2025 · Artificial Intelligence

Why Qwen2.5‑VL‑32B Is the New AI Breakthrough for Vision and Math

Alibaba's newly released Qwen2.5‑VL‑32B multimodal model delivers state‑of‑the‑art visual and textual performance, offering human‑aligned responses, superior mathematical reasoning, fine‑grained image understanding, and efficient deployment features that make it a compelling tool for developers and AI researchers alike.

AI research · Large Language Model · Qwen2.5-VL-32B

0 likes · 9 min read

21CTO

Mar 19, 2025 · Artificial Intelligence

Mistral Small 3.1: How a 24B‑Parameter Open‑Source Model Challenges GPT‑4

Paris‑based startup Mistral AI has open‑sourced Mistral Small 3.1, a 24 billion‑parameter multimodal model that claims superior performance to OpenAI and Google equivalents, runs on modest hardware, processes up to 128k tokens, and highlights a sustainable, accessible AI strategy.

AI sustainability · Mistral AI · Open-source

0 likes · 4 min read

Full-Stack Cultivation Path

Jan 23, 2025 · Artificial Intelligence

Introducing UI‑TARS: An Open‑Source Model for Automated UI Interaction

UI‑TARS is a native GUI‑agent model that takes screenshots and natural‑language commands to predict the next UI action, and its integration with Midscene.js addresses the bottlenecks of generic multimodal LLMs, offering target‑driven planning, lower token usage, open‑source 7B/72B models, and detailed deployment guidance.

AI · Midscene.js · Open-source

0 likes · 13 min read

AI Large Model Application Practice

Nov 28, 2024 · Artificial Intelligence

Can Tiny Multimodal Models Power Edge AI? Meet OmniVision-968M

This article explores how compact multimodal models like OmniVision-968M enable efficient generative AI on edge devices, detailing their architectural advantages, benchmark superiority over larger models, and step‑by‑step instructions for local installation and visual inference using NexaSDK.

AI inference · OmniVision-968M · edge AI

0 likes · 9 min read

Sohu Tech Products

May 21, 2024 · Artificial Intelligence

OPPO Multimodal Pretrained Model Deployment in Cloud-Edge Scenarios: Practices and Optimizations

OPPO details how it deploys multimodal pretrained models on resource‑constrained edge devices by compressing CLIP‑based image‑text retrieval, adapting Chinese text‑to‑image generation with LoRA and adapters, and lightweighting diffusion models through layer pruning and progressive distillation, achieving sub‑3‑second generation while preserving cloud‑level quality.

CLIP · Distillation · Edge deployment

0 likes · 18 min read
