Tagged articles
17 articles
Machine Heart
Apr 10, 2026 · Artificial Intelligence

Alibaba’s HappyHorse-1.0 Tops AI Video Generation Leaderboard

Alibaba’s HappyHorse-1.0, a new text‑to‑video and image‑to‑video model from the ATH team, has claimed the #1 spot on Artificial Analysis’s video arena rankings, matching ByteDance’s Dreamina Seedance 2.0 in Elo score. It supports four generation modes, and its API will open around April 30.

AI video generation · Alibaba · Artificial Analysis
0 likes · 3 min read
AI Explorer
Apr 7, 2026 · Artificial Intelligence

MedGRPO Redefines Med Video Understanding, Shifting AI from Assistant to Partner

MedGRPO, a multimodal large model, achieves a breakthrough in medical video understanding by introducing clinical semantic parsing that aligns visual cues with structured medical knowledge, boosting performance and raising ethical questions about AI’s evolving role from a supportive assistant to a collaborative clinical partner.

AI ethics · Clinical Semantic Parsing · medical-ai
0 likes · 6 min read
SuanNi
Apr 6, 2026 · Artificial Intelligence

How OmniLottie Turns Text and Images into High‑Quality Vector Animations

OmniLottie, a collaborative framework from Fudan, HKU, and Queensland University, uses a specialized tokenizer and a large multimodal model to compress Lottie files and generate vector animations from text, images, or video, achieving state‑of‑the‑art performance on custom benchmarks and in extensive evaluations.

AI · Dataset · Lottie
0 likes · 11 min read
AI Engineering
Mar 31, 2026 · Artificial Intelligence

Qwen3.5-Omni Introduces Audio‑Visual Vibe Coding: Code by Speaking and Gesturing

Alibaba's newly released Qwen3.5-Omni multimodal model adds an Audio‑Visual Vibe Coding feature that lets users describe a website or game with speech and gestures to generate code, while offering advanced audio comprehension, long‑duration media support, multilingual capabilities, fine‑grained voice control, and voice cloning, though its weights remain closed‑source.

AI · Alibaba · Audio-Visual Vibe Coding
0 likes · 3 min read
Fun with Large Models
Feb 17, 2026 · Artificial Intelligence

Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features

Qwen3.5‑397B‑A17B, the newly open‑sourced multimodal giant, combines a 400‑billion‑parameter sparse MoE architecture with FP8 pipelines and an asynchronous RL framework to deliver GPT‑5.2‑level capabilities, 60% lower memory usage, up to 19× higher throughput, and extensive image, video, and agent support. The article also outlines the model’s deployment requirements and API pricing.

AI inference · FP8 · multimodal model
0 likes · 11 min read
DataFunSummit
Oct 31, 2025 · Artificial Intelligence

How OPPO’s AndesVL Is Revolutionizing On‑Device Multimodal AI

OPPO AI Center introduces AndesVL, an open‑source, fully‑adapted multimodal large model ranging from 0.6B to 4B parameters, designed for high‑performance, privacy‑preserving, low‑latency AI on mobile devices, with advanced architecture, training pipelines, on‑device optimizations, and state‑of‑the‑art benchmark results.

Mobile AI · large language model · model compression
0 likes · 21 min read
Alipay Experience Technology
Sep 30, 2025 · Artificial Intelligence

How UI-UG Unifies UI Understanding and Generation with a 7B Multimodal Model

The open‑source UI‑UG‑7B multimodal model from Alipay combines UI understanding and generation in a single framework, delivering state‑of‑the‑art performance across referring, grounding, captioning, and code generation tasks while dramatically speeding up UI creation for developers.

UI Generation · UI Understanding · artificial intelligence
0 likes · 12 min read
HyperAI Super Neural
Sep 26, 2025 · Artificial Intelligence

Redefining Next‑Gen OCR: IBM’s Open‑Source Granite‑Docling‑258M for Unified Structure and Content Understanding

IBM’s newly released open‑source model Granite‑Docling‑258M tackles the long‑standing challenge of converting diverse digital documents into machine‑readable, structured data. It preserves layout, tables, and formulas and supports multiple languages, while remaining lightweight at 258M parameters and outperforming its predecessor SmolDocling‑256M‑Preview.

Docling · Document AI · IBM
0 likes · 5 min read
Amap Tech
Jul 24, 2025 · Artificial Intelligence

FingER: Fine-Grained, Reasoning‑Based Evaluation of AI‑Generated Videos

This article introduces FingER, a novel entity‑level evaluation framework and the FingER‑Instruct‑60k dataset for assessing AI‑generated video quality with fine‑grained reasoning, and demonstrates its state‑of‑the‑art performance on multiple benchmarks using advanced training strategies such as GRPO.

AI-generated video · fine-grained evaluation · multimodal model
0 likes · 9 min read
Alibaba Cloud Big Data AI Platform
May 22, 2025 · Artificial Intelligence

Deploy NVIDIA Cosmos Reason-1: Zero‑Code Physical AI on Alibaba Cloud PAI

Cosmos Reason-1, a customizable multimodal physical AI model from NVIDIA, can be deployed quickly on Alibaba Cloud’s PAI‑Model Gallery with zero code. The deployment offers automatic cloud resource adaptation, ready‑to‑use APIs, and enterprise‑grade security, and the model demonstrates superior reasoning on video tasks; upcoming tools will enable fine‑tuning via SFT and RL.

Alibaba Cloud · Nvidia · Physical AI
0 likes · 8 min read
AI Frontier Lectures
Apr 11, 2025 · Artificial Intelligence

How Q-Insight Uses Reinforcement Learning to Make AI Truly Understand Image Quality

Q-Insight, a multimodal large model introduced by Peking University and Volcano Engine, leverages reinforcement learning and a novel Group Relative Policy Optimization algorithm to evaluate image quality. It provides detailed reasoning, degradation detection, and zero‑shot comparison, and outperforms state‑of‑the‑art methods on multiple benchmarks.

AI · Computer Vision · Video Cloud
0 likes · 10 min read
Baidu Geek Talk
Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

DeepSeek-VL2 · Inference · Mixture of Experts
0 likes · 36 min read
MaGe Linux Operations
Mar 26, 2025 · Artificial Intelligence

Why Qwen2.5‑VL‑32B Is the New AI Breakthrough for Vision and Math

Alibaba's newly released Qwen2.5‑VL‑32B multimodal model delivers state‑of‑the‑art visual and textual performance, offering human‑aligned responses, superior mathematical reasoning, fine‑grained image understanding, and efficient deployment features that make it a compelling tool for developers and AI researchers alike.

AI research · Qwen2.5-VL-32B · large language model
0 likes · 9 min read
21CTO
Mar 19, 2025 · Artificial Intelligence

Mistral Small 3.1: How a 24B‑Parameter Open‑Source Model Challenges GPT‑4

Paris‑based startup Mistral AI has open‑sourced Mistral Small 3.1, a 24‑billion‑parameter multimodal model that claims superior performance to OpenAI and Google equivalents, runs on modest hardware, processes up to 128k tokens, and highlights a sustainable, accessible AI strategy.

AI sustainability · Mistral AI · lightweight LLM
0 likes · 4 min read
Full-Stack Cultivation Path
Jan 23, 2025 · Artificial Intelligence

Introducing UI‑TARS: An Open‑Source Model for Automated UI Interaction

UI‑TARS is a native GUI‑agent model that takes screenshots and natural‑language commands to predict the next UI action, and its integration with Midscene.js addresses the bottlenecks of generic multimodal LLMs, offering target‑driven planning, lower token usage, open‑source 7B/72B models, and detailed deployment guidance.

AI · Midscene.js · UI automation
0 likes · 13 min read
AI Large Model Application Practice
Nov 28, 2024 · Artificial Intelligence

Can Tiny Multimodal Models Power Edge AI? Meet OmniVision-968M

This article explores how compact multimodal models like OmniVision-968M enable efficient generative AI on edge devices, detailing their architectural advantages, benchmark superiority over larger models, and step‑by‑step instructions for local installation and visual inference using NexaSDK.

AI inference · OmniVision-968M · Tutorial
0 likes · 9 min read
Sohu Tech Products
May 21, 2024 · Artificial Intelligence

OPPO Multimodal Pretrained Model Deployment in Cloud-Edge Scenarios: Practices and Optimizations

OPPO details how it deploys multimodal pretrained models on resource‑constrained edge devices by compressing CLIP‑based image‑text retrieval, adapting Chinese text‑to‑image generation with LoRA and adapters, and lightweighting diffusion models through layer pruning and progressive distillation, achieving sub‑3‑second generation while preserving cloud‑level quality.

CLIP · Distillation · LoRA
0 likes · 18 min read