Tagged articles
303 articles
Meituan Technology Team
Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Benchmarking · Multimodal AI
10 min read

Amap Tech
Oct 2, 2025 · Artificial Intelligence

How FantasyWorld Unifies Video Generation and 3D Geometry for Consistent Virtual Worlds

FantasyWorld introduces a geometry‑enhanced framework that augments a frozen video diffusion model with a trainable geometry branch, enabling simultaneous video representation and implicit 3D field generation, achieving spatially consistent, high‑quality virtual worlds and outperforming recent baselines in multi‑view coherence and geometric fidelity.

3D modeling · Computer Vision · Multimodal AI
11 min read

AI2ML AI to Machine Learning
Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85% KV‑cache reduction, and a suite of new techniques (from loss‑free MoE balancing to multi‑token prediction) that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Cost reduction · Multimodal AI · attention mechanisms
13 min read

AI2ML AI to Machine Learning
Sep 30, 2025 · Artificial Intelligence

Dynamic Multimodal Video Generation: Prioritizing Stability and High Quality

The article surveys the evolution of video generation models—from early GANs and DCGAN to diffusion‑based approaches like Stable Diffusion and DiT—highlighting how stability, high quality, massive compute, and multimodal data pipelines are shaping the current and future paths of dynamic multimodal video generation.

Latent Diffusion · Multimodal AI · Stable Diffusion
7 min read

DataFunTalk
Sep 29, 2025 · Artificial Intelligence

How Glint-MVT Powers City‑Scale Multimodal AI: Insights from a Tech VP

In an interview before the DACon conference, Dr. Feng Ziyong reveals how Glint‑MVT and novel data‑synthesis techniques overcome distribution gaps, improve compositional understanding, and enable retrieval over billions of items within seconds for city‑level surveillance, while balancing model efficiency and effectiveness.

Embedding Retrieval · Multimodal AI · city surveillance
11 min read

DataFunSummit
Sep 28, 2025 · Artificial Intelligence

Unlocking Enterprise Knowledge: Building Multimodal AI Systems with LLMs

This article examines the challenges of processing massive multimodal data in enterprises and presents a knowledge‑augmentation framework that leverages Retrieval‑Augmented Generation, memory‑inspired architecture, and feedback loops to enable reliable, scalable AI‑driven decision making across diverse business scenarios.

Enterprise Knowledge · Knowledge Graph · LLM
29 min read

Instant Consumer Technology Team
Sep 25, 2025 · Artificial Intelligence

Inside Qwen’s Midnight Release: New Guard, Travel Agent, LiveTranslate, Code & Vision Models Unveiled

Late at night on the 23rd, Lin Junyang of Tongyi Lab announced six AI model releases—including a safety‑audit guard, a personal travel planner, a real‑time multilingual translator, upgraded coding models, a powerful vision‑language model, and the flagship Qwen3‑Max—each detailed with capabilities, highlights, and direct download links.

Multimodal AI · Safety · artificial intelligence
11 min read

Instant Consumer Technology Team
Sep 23, 2025 · Artificial Intelligence

What’s Driving the Latest Tech Frontier? Vite’s Speed Boost, AI Coding Agents, and End‑to‑End Generative Search

This roundup highlights Vite’s next‑gen Rolldown engine delivering up to 45% faster builds, AI‑powered coding tools like Comate Zulu and Claude Code enabling solo developers, Browser‑Use’s AI web automation, Alipay’s AI travel assistant built by a four‑person team, ROMA’s dark‑mode adaptation, Kuaishou’s OneSearch generative framework, MIDAS multimodal digital‑human breakthroughs, and the open‑source VoxCPM speech model.

AI Coding · Browser Automation · Generative Search
6 min read

DataFunTalk
Sep 19, 2025 · Artificial Intelligence

GenAI Summit 2025: Large Model Innovations & Real-World Applications

The DataFun GenAI Summit 2025 brings together leading experts from Alibaba, Tencent, Ant Financial, and other tech giants to showcase the latest breakthroughs in large-model research, generative AI, multimodal understanding, and real-world deployments across finance, e-commerce, media, and enterprise services.

AI applications · Enterprise AI · GenAI
25 min read

DaTaobao Tech
Sep 17, 2025 · Artificial Intelligence

Boosting ID Card Photo Quality with Multimodal AI: A Practical Deployment Guide

This article details how a multimodal AI model was integrated to detect and improve ID card photo quality, covering common image issues, differences between OCR and multimodal extraction, deployment strategies, performance metrics, cost estimation, and the resulting business and technical benefits.

ID verification · Model Deployment · Multimodal AI
13 min read

Raymond Ops
Sep 14, 2025 · Artificial Intelligence

Create AI Videos with DeepSeek + Tongyi Wanxiang: Step-by-Step Guide

This article explains how to leverage the Chinese AI multimodal platform Tongyi Wanxiang together with DeepSeek to generate high-quality AI videos, covering AI video fundamentals, core features, application scenarios, detailed workflow, script creation, video synthesis, and Java API integration with code examples.

AI video generation · DeepSeek · Java SDK
25 min read

DataFunSummit
Sep 12, 2025 · Artificial Intelligence

How AI Dressing and Digital Humans Are Revolutionizing Home Service Experiences

In an exclusive interview, AI expert Wang Mingzhong details the technical challenges and breakthroughs behind AI dressing, AI video resumes, short‑video templates, and digital‑human live streaming for 58 Home services, highlighting model choices, multimodal architectures, modular design, and future directions for emotional interaction.

AI dressing · AI video resume · Multimodal AI
9 min read

Kuaishou Tech
Sep 5, 2025 · Artificial Intelligence

How Keye‑VL‑1.5‑8B Sets New Benchmarks in Multimodal AI

Kuaishou has open‑sourced Keye‑VL‑1.5, an 8‑billion‑parameter multimodal LLM that introduces slow‑fast frame encoding, a progressive four‑stage pre‑training pipeline, and an automated data construction workflow, achieving state‑of‑the‑art results on video and vision‑language benchmarks and surpassing many closed‑source models.

Multimodal AI · benchmark performance · large language model
12 min read

Data Party THU
Sep 3, 2025 · Artificial Intelligence

Exploring Multimodal Generative AI: A Tsinghua Tutorial at IJCAI 2025

This article introduces a 1.5‑hour tutorial presented by Tsinghua researchers at IJCAI 2025, covering the latest advances in multimodal generative AI, including multimodal large language models, diffusion models, post‑training generalization techniques, and unified understanding‑generation frameworks.

Generative Models · IJCAI 2025 · Multimodal AI
5 min read

AntTech
Aug 27, 2025 · Artificial Intelligence

How AI Is Revolutionizing Content Safety – The Tech Behind Shanghai’s Top Award

Shanghai’s 2024 Science and Technology Award honored a joint effort by Shanghai Jiao Tong University and Ant Group for pioneering AI-driven technologies—multimodal hallucination mitigation, controllable data generation, integrated content security monitoring, and adversarial model protection—that set international standards in detecting harmful online media and AIGC content.

AI content safety · AIGC detection · Multimodal AI
6 min read

Baidu Geek Talk
Aug 25, 2025 · Artificial Intelligence

How ERNIE‑4.5‑VL Redefines Multimodal AI with 100+ Language Support

The ERNIE‑4.5‑VL visual‑language model breaks single‑modality limits by delivering breakthrough image, video, and text understanding across more than 100 languages, offering lightweight yet competitive performance against models like Qwen2.5‑VL, supporting 128K context, dual “thinking” modes, and extensive deployment resources.

AI research · Ernie · Multimodal AI
4 min read

Data Party THU
Aug 22, 2025 · Artificial Intelligence

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of original LVLM accuracy, as demonstrated on LLaVA‑1.5‑7B and other benchmarks.

LVLM · Multimodal AI · Token Pruning
14 min read

37 Interactive Technology Team
Aug 20, 2025 · Artificial Intelligence

Unlocking ChatGPT‑4o: How the New Multimodal Model Revolutionizes Image Generation

ChatGPT‑4o, OpenAI’s latest multimodal model, dramatically enhances text and image generation with higher quality visuals, flexible style control, faster response, and integrated image editing, and the article showcases diverse real‑world use cases—from advertising graphics to game UI design—demonstrating its practical impact across industries.

AI applications · ChatGPT4o · Multimodal AI
11 min read

AntTech
Aug 19, 2025 · Artificial Intelligence

How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks

Ant Group's open‑source native GUI agent UI‑Venus leverages multimodal large‑model and reinforcement‑learning techniques to outperform prior models on grounding and navigation benchmarks, while using a high‑quality data pipeline and a self‑evolving alignment mechanism to push the limits of GUI automation.

Benchmark · GUI Agent · Multimodal AI
7 min read

Data Party THU
Aug 15, 2025 · Artificial Intelligence

What’s Next for Visual Reinforcement Learning? A Comprehensive 2024‑2025 Survey

This article provides a critical, up‑to‑date overview of visual reinforcement learning: it formalizes the problem, traces the evolution of policy optimization, categorizes over 200 recent works into four pillars, analyzes algorithms, reward design, and benchmarks, and highlights open challenges and future research directions.

Multimodal AI · RLHF · diffusion models
7 min read

Ctrip Technology
Aug 14, 2025 · Artificial Intelligence

How Multimodal Large Models Can Auto-Generate UI Test Cases End‑to‑End

Leveraging multimodal large‑model AI, this article outlines a four‑stage evolution from text‑based UI element identification to fully autonomous, end‑to‑end generation of executable UI automation scripts, detailing system architecture, intelligent reasoning engine, and real‑world Ctrip hotel refund test case results.

Multimodal AI · Software Testing · Test Case Generation
17 min read

Data Party THU
Aug 13, 2025 · Artificial Intelligence

How Large Language Models Are Revolutionizing Automated Scholarly Paper Review

This survey examines the rapid rise of large language models in automated scholarly paper review (ASPR), analyzing model types, technical breakthroughs such as long‑text, multimodal, and multi‑turn capabilities, new generation methods, datasets, open‑source tools, current challenges, publisher policies, and future research directions.

ASPR · Multimodal AI · automated paper review
19 min read

Volcano Engine Developer Services
Aug 8, 2025 · Artificial Intelligence

Master PromptPilot: Step‑by‑Step Guide to Build, Optimize, and Debug AI Prompts

This comprehensive tutorial walks you through the entire PromptPilot workflow—from initial setup and prompt generation to iterative optimization, visual debugging, batch testing, and intelligent refinement—showcasing how to create high‑quality, production‑ready prompts for AI agents and applications.

AI tools · Multimodal AI · Prompt engineering
10 min read

AIWalker
Aug 6, 2025 · Artificial Intelligence

Why ByteDance’s 7B BAGEL Model Rivals GPT‑4o in Unified Multimodal Understanding and Generation

The article provides an in‑depth technical analysis of ByteDance’s 7‑billion‑parameter BAGEL model, detailing its MoT architecture, high‑quality interleaved multimodal pre‑training data, multi‑stage training strategy, emergent capabilities, and extensive benchmark results that show BAGEL matching or surpassing GPT‑4o on vision‑language tasks.

BAGEL · GPT-4o comparison · Multimodal AI
24 min read

Bilibili Tech
Aug 5, 2025 · Artificial Intelligence

How Bilibili’s IndexTTS2 Achieves Real‑Time, Emotion‑Rich Voice Translation

IndexTTS2 introduces a cross‑modal, multi‑language voice translation system that preserves speaker identity, acoustic space, and multi‑source timbre, while tackling challenges like voice personality loss, subtitle cognitive load, localization costs, multi‑speaker diarization, and cultural adaptation through novel time‑coding, adversarial RL, and diffusion‑based lip‑sync techniques.

Multimodal AI · Speech synthesis · adversarial reinforcement learning
20 min read

Baidu MEUX
Jul 30, 2025 · Artificial Intelligence

What’s New in AI? 10 Breakthrough Tools and Models Shaping 2024

This roundup highlights ten recent AI breakthroughs—including Perplexity’s Comet browser, Google’s T5Gemma series, xAI’s Grok‑4, a revamped PNG format, Bilibili’s Code‑H creator, JD’s AI social apps, ByteDance’s AI doctor, Vivo’s edge‑side multimodal model, Tencent’s payment‑enabled platform, and ByteDance’s Xverse image generator—showcasing rapid advances across browsing, modeling, multimedia, and commerce.

AI · Multimodal AI · machine learning
7 min read

AI Info Trend
Jul 24, 2025 · Industry Insights

What’s Driving AI Adoption in 2025? Six Key Trends Uncovered

The AI Adoption Survey H1 2025 reveals that nearly half of organizations have deployed AI in production, engineering and R&D lead usage, Chinese LLMs gain overseas interest, and cost, reliability and intelligence remain the top challenges, while tool preferences and multimodal trends reshape the market.

AI Infrastructure · AI adoption · AI trends
7 min read

FunTester
Jul 21, 2025 · Artificial Intelligence

How First Principles Shape the Future of AI Agents: Evolution, Capabilities, and Emerging Trends

This article explores how first‑principle reasoning underpins the development of AI agents, traces their collaborative technology evolution, details core capabilities such as compute, memory, prediction and action, and forecasts future directions like multimodal models, reduced prompting, and extensive data sharing.

AI agents · Agent Collaboration · Future AI
15 min read

AntTech
Jul 17, 2025 · Artificial Intelligence

How M2-Reasoning-7B Achieves State‑of‑the‑Art Spatial Reasoning in Multimodal AI

M2-Reasoning-7B, an open‑source 7B multimodal model from Ant Group, combines a high‑quality data pipeline with dynamic multi‑task training and a novel reward function to deliver state‑of‑the‑art performance on both general and spatial reasoning benchmarks, surpassing many larger competitors.

Benchmark · M2-Reasoning · Multimodal AI
9 min read

Tencent Cloud Developer
Jul 16, 2025 · Artificial Intelligence

How First Principles Shape the Future of AI Agents: Evolution, Capabilities, and Trends

This article explores how first‑principle thinking underpins AI agents, traces their development from single‑craftsman tools to enterprise‑level collaborations, outlines core capabilities such as compute, memory, prediction and action, and forecasts future directions like multimodal models, reduced prompting, and extensive data sharing.

AI agents · Agent Collaboration · Future AI
15 min read

Kuaishou Large Model
Jul 11, 2025 · Artificial Intelligence

How MODA’s Modular Duplex Attention Boosts Multimodal Emotion Understanding

The paper introduces MODA, a new multimodal model that tackles attention imbalance across modalities with a modular duplex attention mechanism, achieving significant performance gains on perception, cognition, and emotion tasks across 21 benchmarks and demonstrating strong potential for human‑machine interaction.

Deep Learning · MODA model · Multimodal AI
13 min read

DataFunTalk
Jul 11, 2025 · Artificial Intelligence

When AI Sees Six Fingers: Why Vision Models Miss the Mark

The article examines how multimodal AI models repeatedly miscount a six‑finger image, explores the underlying bias revealed in the paper “Vision Language Models are Biased,” and warns that such prior‑knowledge‑driven errors can have serious safety implications in real‑world applications.

AI bias · Multimodal AI · Vision-Language Models
10 min read

Kuaishou Tech
Jul 10, 2025 · Artificial Intelligence

How MODA’s Modular Duplex Attention Solves Multimodal Attention Imbalance and Boosts Emotion Understanding

The paper introduces MODA, a modular duplex attention multimodal model that addresses severe cross‑modal attention imbalance in existing large multimodal models, proposes a novel attention paradigm and masking scheme, and demonstrates significant performance gains across 21 benchmarks in perception, cognition, and emotion tasks; it was selected as a Spotlight paper at ICML 2025.

Emotion Recognition · MoDA · Multimodal AI
13 min read

DataFunSummit
Jul 8, 2025 · Artificial Intelligence

Explore Cutting-Edge AI Knowledge Graphs: From Multimodal GraphRAG to Industry Applications

This article presents a curated catalog of cutting‑edge AI resources, covering multimodal GraphRAG, knowledge‑graph and large‑model integration, financial industry AI products, Chinese‑medicine decision support, AI‑driven knowledge‑graph evolution, private‑domain Q&A pipelines, and emerging trends and standards, with a QR code for the full ebook.

Document Intelligence · Knowledge Graph · Multimodal AI
2 min read

DataFunTalk
Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI agents · GLM-4.1V · Multimodal AI
10 min read

DataFunTalk
Jul 2, 2025 · Artificial Intelligence

How Multimodal Large Models Are Revolutionizing Complex Document OCR

In a detailed interview, Zhao Chenyang explains how multimodal large models (VLMs) overcome the limitations of traditional OCR in mixed layouts, table reconstruction, and handwritten text by leveraging self‑supervised pre‑training, lightweight fine‑tuning, and hybrid pipelines that dramatically cut annotation costs and improve recall rates.

AI deployment · Multimodal AI · document OCR
13 min read

DataFunTalk
Jun 30, 2025 · Artificial Intelligence

Wenxin 4.5 Series: Open‑Source Multimodal MoE Models and FastDeploy Guide

The Wenxin 4.5 series introduces ten open‑source models—including large‑scale MoE and dense variants—featuring a novel multimodal heterogeneous architecture, high training efficiency, SOTA benchmark performance, and comprehensive toolkits (ERNIEKit, FastDeploy) for fine‑tuning and multi‑hardware deployment.

ERNIEKit · FastDeploy · MoE
8 min read

Network Intelligence Research Center (NIRC)
Jun 29, 2025 · Artificial Intelligence

Multimodal AI Assistant Boosts Network Config: 96.6% Accuracy, 26× Labor Cut

The paper presents NLI2Conf, an intent‑driven network configuration model that fuses configuration files, topology and performance data via a multimodal interface, using large language and graph neural models to align natural‑language intents with forwarding and performance constraints, achieving 96.6% accuracy and a 26‑fold reduction in manual effort.

Graph Neural Network · Multimodal AI · NLI2Conf
6 min read

AI Algorithm Path
Jun 29, 2025 · Artificial Intelligence

Understanding CLIP: Theory, Architecture, and Zero‑Shot Vision

CLIP (Contrastive Language‑Image Pre‑training) is an OpenAI model that learns visual concepts from 400 million image‑text pairs using a dual‑encoder architecture, enabling zero‑shot classification, flexible text‑driven search, and cross‑modal reasoning, while its strengths, limitations, and emerging applications are examined in detail.

CLIP · Contrastive Language-Image Pretraining · Dual Encoder
15 min read

AI Algorithm Path
Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 1: What They Are and Why They Matter

This article introduces visual‑language models (VLMs), explaining how they combine large language models with visual encoders, why they overcome the rigidity of traditional computer‑vision systems, their key advantages, modular architecture, training methods, and practical applications such as image captioning and visual question answering.

AI applications · Computer Vision · Multimodal AI
8 min read

AntTech
Jun 18, 2025 · Artificial Intelligence

How Ant Group’s Bailing Models Push Toward AGI with MoE and Multimodal Innovations

In a detailed AICon talk, Ant Group’s Bailing team leader Zhou Jun outlines their latest large‑model training techniques, MoE architecture optimizations, multimodal breakthroughs, open‑source releases, and the strategic roadmap needed to make AI an everyday assistant as ubiquitous as scanning a QR code.

AI Infrastructure · Mixture of Experts · Multimodal AI
25 min read

Kuaishou Tech
Jun 10, 2025 · Artificial Intelligence

Top 12 Cutting-Edge Video Generation Papers from Kuaishou at CVPR 2025

The article highlights CVPR 2025’s acceptance statistics and showcases twelve cutting‑edge video‑generation papers from Kuaishou, spanning datasets, quality assessment, style control, scaling laws, 4D simulation, interleaved image‑text data, vision‑language acceleration, high‑fidelity avatars, patch‑wise super‑resolution, narrative‑driven benchmarks, sketch‑based editing, and spatio‑temporal diffusion, each with links and abstracts.

CVPR2025 · Computer Vision · Kuaishou
20 min read

Software Development Quality
Jun 10, 2025 · Frontend Development

How Midscene.js Leverages Multimodal AI for Zero‑Code UI Automation

Midscene.js, an open‑source UI automation framework from ByteDance’s Web Infra team, combines multimodal AI inference with Chrome extensions, YAML scripts, and JavaScript SDKs to enable zero‑code testing across Web, Android, Playwright, and Puppeteer, offering key interfaces for actions, queries, and assertions.

JavaScript · Multimodal AI · Playwright
8 min read

Kuaishou Large Model
Jun 5, 2025 · Artificial Intelligence

7 Kuaishou Papers Accepted at ACL 2025 Reveal Cutting‑Edge AI Advances

Kuaishou's foundational large‑model team secured seven papers at the prestigious ACL 2025 conference, covering alignment bias during model training, safety in inference, decoding strategies, fine‑grained video‑temporal understanding, and new evaluation benchmarks that push the frontier of multimodal large language models.

ACL 2025 · Benchmark · Multimodal AI
16 min read

Alibaba Cloud Developer
Jun 5, 2025 · Databases

Enable Native Multimodal AI Search with SQL on PolarDB

This article explains how to use standard SQL on PolarDB PostgreSQL to directly invoke multimodal AI services for image feature extraction and vectorization, eliminating data migration and complex toolchains while providing low‑threshold integration, flexible scenario adaptation, full‑link security, and serverless, pay‑as‑you‑go deployment.

Multimodal AI · Polardb · SQL
6 min read

AntTech
Jun 4, 2025 · Artificial Intelligence

LLaDA and LLaDA‑V: Large Language Diffusion Models and Their Multimodal Extensions

This article presents the LLaDA series of diffusion‑based large language models, explains how their generative‑modeling principle yields language intelligence comparable to autoregressive models, and details the multimodal LLaDA‑V architecture, training methods, experimental results, and broader implications for AI research.

Generative Modeling · Multimodal AI · diffusion models
10 min read

Tencent Technical Engineering
May 23, 2025 · Artificial Intelligence

Can a 3B Open‑Source Multimodal Model Beat GPT‑4V in Math? A Deep Dive into VLR1‑3B

The preview release of the 3‑billion‑parameter VLR1‑3B multimodal model demonstrates state‑of‑the‑art reasoning on math benchmarks, outperforms many commercial closed‑source models, and shows promising results on geometry, physics, and general vision tasks, while also revealing typical hallucination issues.

Benchmark · Multimodal AI · VLR1-3B
8 min read

Kuaishou Tech
May 13, 2025 · Artificial Intelligence

How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality

This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.

Benchmark · KuaiMod · Multimodal AI
19 min read

AIWalker
May 11, 2025 · Artificial Intelligence

Unified Multimodal Understanding and Generation: A 30K‑Word Survey of Recent Advances

This comprehensive survey reviews the rapid progress of multimodal understanding and text‑to‑image generation models, categorises existing unified architectures into diffusion‑based, autoregressive, and hybrid paradigms, analyses their tokenisation strategies, datasets and benchmarks, and highlights current challenges and future research directions.

Autoregressive Models · Datasets · Multimodal AI
64 min read

Baidu Tech Salon
Apr 28, 2025 · Artificial Intelligence

Inside Baidu’s Wenxin 4.5 Turbo & X1 Turbo: Architecture, Training Tricks, and Real-World Impact

At the Create2025 AI Developer Conference, Baidu unveiled the multimodal Wenxin 4.5 Turbo and X1 Turbo models, detailing their innovative architecture, self‑feedback post‑training, composite reasoning chains, data pipelines, and the new Wenxin KuaiMa 3.5 code assistant, while also showcasing ecosystem growth and cultural AI applications.

AI Conference · Baidu · Code Generation
9 min read

Meituan Technology Team
Apr 24, 2025 · Artificial Intelligence

Meituan AI Recruitment: Join Our Advanced Technology Teams

Meituan's AI recruitment page showcases diverse opportunities across AI infrastructure, intelligent interaction, visual intelligence, and intelligent products, featuring roles from algorithm engineers to product managers working on cutting-edge technologies including large models, intelligent agents, and multimodal systems.

AI RecruitmentComputer VisionIntelligent agents
0 likes · 5 min read
Tencent Cloud Developer
Apr 24, 2025 · Industry Insights

How RAG, AI Agents, and Multimodal Models Are Reshaping Industry – Trends, Challenges, and Real‑World Cases

The article analyzes the rapid evolution of large‑model technologies—Retrieval‑Augmented Generation, autonomous agents, and multimodal AI—detailing their technical foundations, practical challenges, industry applications such as unified multimodal tasks, open‑world detection, and video moderation, and forecasting future development directions.

AI agents · Multimodal AI · RAG
0 likes · 15 min read
DaTaobao Tech
Apr 14, 2025 · Artificial Intelligence

Taobao AIGC Content Generation: Short Video Production Techniques

Taobao’s Content AI team leverages a proprietary multimodal Mixture‑of‑Experts model to automatically generate short‑form videos—extracting highlights from live streams and creating customized product explainers—using two‑stage CLIP/VideoBLIP training, character‑level timestamps, LLM re‑segmentation and OCR masking, now producing over 100 k daily videos with a 12 % approval boost and notable conversion gains.

AIGC · Multimodal AI · content AI
0 likes · 20 min read
Tencent Cloud Developer
Apr 10, 2025 · Artificial Intelligence

The Magic of GPT‑4o: Technical Overview and Speculated Architecture

GPT‑4o combines extremely long‑form text generation, high‑quality image creation and interactive editing by likely using an autoregressive multimodal transformer that tokenizes visuals via VQ‑VAE/GAN pipelines, trained on massive data and refined through fine‑tuning and RLHF, offering a unified model for generation, editing, and understanding.

GPT-4o · Multimodal AI · VQ-VAE
0 likes · 17 min read
21CTO
Apr 7, 2025 · Artificial Intelligence

Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities

Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models—featuring Mixture‑of‑Experts architectures, unprecedented 10‑million‑token context windows, and state‑of‑the‑art performance across vision, language, and multimodal benchmarks—while emphasizing efficient training, open‑source availability, and robust safety safeguards.

AI Safety · Llama 4 · Mixture of Experts
0 likes · 14 min read
DataFunTalk
Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick and Behemoth—featuring multimodal capabilities, Mixture‑of‑Experts design, up to 10 million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4 · Mixture of Experts · Model Training
0 likes · 14 min read
Architects' Tech Alliance
Apr 1, 2025 · Artificial Intelligence

What’s New in Large Language Models? DeepSeek V3, Qwen2.5‑Omni, Gemini 2.5 Pro, and GPT‑4o Unpacked

This article reviews the latest updates from major LLM providers—DeepSeek V3’s parameter boost and longer context, Qwen2.5‑Omni’s open‑source multimodal 7B model, Google Gemini 2.5 Pro’s 1 M‑token window and multimodal prowess, and OpenAI GPT‑4o’s image generation and reduced pricing—highlighting technical specs, capabilities, and availability.

DeepSeek · GPT-4o · Gemini
0 likes · 9 min read
Architects' Tech Alliance
Mar 31, 2025 · Artificial Intelligence

A Comprehensive History of Large Language Models from the Transformer Era (2017) to DeepSeek‑R1 (2025)

This article reviews the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, alignment techniques, multimodal extensions, open‑weight releases, and the cost‑efficient DeepSeek‑R1 in 2025, highlighting key technical advances, scaling trends, and their societal impact.

AI Alignment · LLM evolution · Multimodal AI
0 likes · 26 min read
Architect
Mar 30, 2025 · Artificial Intelligence

What Is Retrieval-Augmented Generation? A Deep Dive into RAG Techniques

This article provides a comprehensive survey of Retrieval‑Augmented Generation (RAG), covering its basic principles, key components, seven technical variants, challenges, evaluation methods, and future research directions across multimodal, graph‑based, and agentic extensions.

AI Survey · Knowledge Retrieval · Multimodal AI
0 likes · 9 min read
MaGe Linux Operations
Mar 28, 2025 · Artificial Intelligence

How to Create AI-Generated Videos with Tongyi Wanxiang and DeepSeek: A Step‑by‑Step Guide

This article explains the fundamentals of AI video technology, details the features of Alibaba Cloud's Tongyi Wanxiang platform, demonstrates how to use DeepSeek for script generation, and provides a complete workflow—including code examples—for producing high‑quality AI‑generated videos.

AI video generation · DeepSeek · Java SDK
0 likes · 24 min read
Sohu Tech Products
Mar 26, 2025 · Artificial Intelligence

How SpatialLM Turns 3D Point Clouds into Structured Scene Understanding

SpatialLM is a large language model designed for 3D spatial understanding that converts point‑cloud data from videos, RGB‑D images or LiDAR into structured scene descriptions, and this guide explains its architecture, model versions, repository links, and step‑by‑step deployment on Ubuntu with PyTorch.

3D point cloud · Multimodal AI · PyTorch
0 likes · 7 min read
ByteDance Web Infra
Mar 21, 2025 · Artificial Intelligence

Midscene.js: An AI‑Driven UI Automation Framework from ByteDance

Midscene.js is an open‑source UI automation framework that leverages multimodal AI to simplify web UI testing and interaction, offering three core interfaces—Action, Query, and Assert—along with a JavaScript SDK, support for multiple AI models, YAML scripting, and future‑focused features for stable, scalable automation.

AI · JavaScript · Midscene.js
0 likes · 21 min read
DevOps
Mar 19, 2025 · Artificial Intelligence

From Claude 3.5 Sonnet to Manus: The Evolution and Landscape of Computer‑Use AI Agents

This article surveys the rapid development of computer‑use AI agents—from Anthropic’s Claude 3.5 Sonnet and OpenAI’s Operator to the multi‑agent Manus platform—detailing their capabilities, benchmark results, open‑source alternatives, practical challenges, and future prospects for autonomous digital assistants.

AI agents · Anthropic · Automation
0 likes · 24 min read
Baidu Geek Talk
Mar 19, 2025 · Artificial Intelligence

Inside Baidu’s New Wenxin 4.5 & X1: Multimodal Breakthroughs and Tool‑Enabled AI

Baidu officially launched the Wenxin 4.5 and X1 large language models, showcasing native multimodal foundations, advanced attention masks, heterogeneous expert extensions, and tool‑calling capabilities, while offering low‑cost API access on the Qianfan platform and outlining the underlying technical innovations that drive their performance gains.

AI Platform · Baidu · Multimodal AI
0 likes · 8 min read
AIWalker
Mar 17, 2025 · Artificial Intelligence

How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance

The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation that supports pairwise ranking and pointwise scoring, builds a 236K human‑preference dataset across image and video tasks, and uses DPO to align VLMs and diffusion models, achieving significant performance gains on both image and video benchmarks.

Direct Preference Optimization · Multimodal AI · Preference Modeling
0 likes · 19 min read
DaTaobao Tech
Mar 12, 2025 · Artificial Intelligence

Multimodal Automatic Layout Generation for E-commerce

The project develops a multimodal automatic layout generation system for e‑commerce by fine‑tuning the qwen‑vl‑7b vision‑language model with LoRA on poster and Taobao image‑layout data, employing diffusion‑based image generation and coordinate‑prediction methods to produce structured layouts that power poster, marketing image, and video‑cover creation with over 90% adoption, while exploring multi‑image, style‑aware, and iterative refinement extensions.

LLM · Multimodal AI · diffusion
0 likes · 12 min read
AI Code to Success
Mar 6, 2025 · Artificial Intelligence

How Monica’s ‘Manus’ AI Agent Redefines Human‑Computer Collaboration

Monica’s new AI agent Manus, unveiled on March 6, claims to autonomously handle complex tasks through multimodal processing, continuous learning, and intelligent decision‑making, with real‑world demos ranging from strategic planning to a smart home buying assistant, while sparking market hype, competitive comparisons, and debates on AI’s future role in the workforce.

AI Agent · AI Market · Competitive analysis
0 likes · 5 min read
Ma Wei Says
Mar 4, 2025 · Artificial Intelligence

Microsoft’s Open‑Source Multimodal AI Agent Model Magma: Capabilities and Innovations

On February 25, 2025, Microsoft open‑sourced its first multimodal AI agent foundation model, Magma, which extends multimodal processing to images, video, and text, introduces Set‑of‑Mark and Trace‑of‑Mark techniques for spatial‑temporal reasoning, optimizes modular inference for edge devices, and integrates reinforcement learning for adaptive task execution.

Edge Computing · Magma · Multimodal AI
0 likes · 6 min read
Architects' Tech Alliance
Mar 1, 2025 · Artificial Intelligence

Decoding DeepSeek: A Four‑Tier Capability Framework for Multimodal AI

The article outlines DeepSeek's four‑level capability hierarchy—basic multimodal data fusion and dynamic governance, intermediate domain modeling with causal reasoning and multi‑objective optimization, advanced complex system modeling with digital twins and multi‑agent coordination, and ultimate autonomous evolution features such as concept‑space exploration and self‑programming.

DeepSeek · Digital Twin · Model Capability
0 likes · 5 min read
Huolala Tech
Feb 27, 2025 · Artificial Intelligence

How Huolala’s Wukong Platform Solves Large‑Model Deployment Challenges

Huolala’s Wukong platform tackles the common “technology hype, implementation difficulty” dilemma of generative AI by unifying multimodal enterprise knowledge, enabling dynamic multi‑agent workflows, and providing low‑code tools, observability, and stable deployment across dozens of business scenarios.

AI workflow · Enterprise AI · Large Model
0 likes · 11 min read
DataFunSummit
Feb 26, 2025 · Artificial Intelligence

Applying Multimodal Large Models to Music Recommendation at NetEase Cloud Music

This article details how NetEase Cloud Music leverages multimodal large language models to improve music recommendation across daily, personalized, and playlist scenarios by extracting rich audio, text, and visual features, addressing data skew, cold‑start challenges, and achieving measurable gains in user engagement and distribution efficiency.

Multimodal AI · NetEase Cloud Music · feature extraction
0 likes · 12 min read
DaTaobao Tech
Feb 26, 2025 · Artificial Intelligence

How Taobao’s AI Turns Static Clothing Images into Seamless Virtual Try‑On Videos

This article analyzes Taobao’s AIGC video virtual try‑on pipeline, detailing the challenges of frame‑level realism and continuity, the upgraded DiT‑based model, 3D‑VAE compression, large‑scale data collection, template‑matching mechanisms, and the resulting product capabilities for automated marketing and personalized shopper experiences.

AI video generation · Content Generation · Multimodal AI
0 likes · 13 min read
JD Retail Technology
Feb 25, 2025 · Artificial Intelligence

How JD’s “JingDianDian” AI Platform Revolutionizes E‑commerce Content Creation

JD Retail’s self‑built AIGC platform ‘JingDianDian’ leverages multimodal diffusion models, ControlNet, RAG and reinforcement learning to automatically generate high‑quality product images, videos and marketing copy, cutting production time from days to seconds and reducing costs by over 99% for more than 350 k merchants.

AIGC · Content Generation · Multimodal AI
0 likes · 15 min read
DaTaobao Tech
Feb 24, 2025 · Artificial Intelligence

AIGC Video Generation Techniques for E‑commerce: Lip‑Sync, Head/Body Driving, and Business Applications

The article surveys recent AIGC video generation advances for Taobao e‑commerce, detailing lip‑sync models like Wav2Lip and MuseTalk, head‑driven systems such as Hallo and EchoMimic, body‑driven pipelines including AnimateAnyone and Tango, and a four‑stage production workflow that boosts click‑through rates and enables virtual try‑on.

AIGC · Deep Learning · Multimodal AI
0 likes · 21 min read
DataFunSummit
Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Multimodal AI · RAG · Tensor Retrieval
0 likes · 11 min read
DataFunTalk
Feb 19, 2025 · Artificial Intelligence

Large Models: Concepts, Principles, Classifications and Applications

This report provides a comprehensive overview of large-scale AI models, explaining their definition, massive parameter and data requirements, underlying transformer architecture, classification into language, vision and multimodal models, notable examples such as DeepSeek, and a survey of popular AIGC tools and practical use cases.

AIGC tools · Deep Learning · Multimodal AI
0 likes · 9 min read
Architects' Tech Alliance
Feb 18, 2025 · Artificial Intelligence

How DeepSeek’s Latest Models Redefine AI Performance and Industry Adoption

The DeepSeek report details rapid model releases from 2024 onward, highlighting innovations such as model distillation, a 671 B MoE architecture, FP8 mixed‑precision, and the Janus‑Pro multimodal framework, while also documenting major cloud and chip providers' integration of these models into their services.

AI industry adoption · DeepSeek · MoE architecture
0 likes · 10 min read
Xiaohongshu Tech REDtech
Feb 17, 2025 · Artificial Intelligence

WorldSense: A New Benchmark for Evaluating Multimodal Large Models in Real‑World Scenarios

WorldSense, a new benchmark of 1,662 real‑world video‑audio clips and 3,172 QA pairs across 26 cognitive tasks, reveals that current multimodal large models achieve only 25%–48% accuracy, highlighting the crucial role of combined visual‑audio input and the difficulty of audio‑ and emotion‑related reasoning.

Multimodal AI · benchmark dataset · large models
0 likes · 12 min read
AIWalker
Feb 12, 2025 · Artificial Intelligence

Goku: How HKU and ByteDance’s New Model Sets New Benchmarks in Commercial Image and Video Generation

The paper presents Goku, a rectified‑flow transformer that jointly generates high‑quality images and videos at commercial scale, detailing its novel architecture, massive high‑quality data pipeline, efficient large‑scale training tricks, and state‑of‑the‑art results on GenEval, DPG‑Bench, VBench and UCF‑101.

Large-Scale Training · Multimodal AI · Video Generation
0 likes · 29 min read
IT Architects Alliance
Feb 8, 2025 · Artificial Intelligence

Inside DeepSeek: How Its Innovative Architecture Redefines AI Performance

This article examines DeepSeek's advanced Transformer‑based architecture, dynamic routing, MoE system, multi‑stage training, efficient inference, multimodal capabilities, real‑world applications, technical challenges, and future prospects, providing a comprehensive technical analysis of the model's strengths and limitations.

AI Architecture · DeepSeek · Model Optimization
0 likes · 15 min read
AIWalker
Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization · Inference · Multimodal AI
0 likes · 13 min read
Architect
Jan 29, 2025 · Artificial Intelligence

How Janus‑Pro Redefines Multimodal AI with Bigger Models and New Training Strategies

DeepSeek’s newly released Janus‑Pro series (1B and 7B) advances multimodal AI by decoupling visual understanding and generation, employing optimized three‑stage training, massive data expansion, and larger LLM backbones, achieving performance that matches or exceeds leading models from Meta, Google, OpenAI, and Stability AI.

DeepSeek · Janus-Pro · Model Scaling
0 likes · 6 min read
ByteDance Web Infra
Jan 22, 2025 · Artificial Intelligence

Introducing UI‑TARS: A Native GUI Agent Model Integrated with Midscene.js for Multimodal UI Automation

The article presents UI‑TARS, a native GUI‑agent model that combines multimodal large‑language models with the open‑source Midscene.js framework to enable more accurate, token‑efficient, and privacy‑preserving UI automation, while discussing its architecture, advantages, limitations, and integration steps.

GUI Agent · Midscene.js · Multimodal AI
0 likes · 11 min read
NewBeeNLP
Jan 17, 2025 · Artificial Intelligence

Unlocking Multimodal Intelligence: A Deep Dive into Next Token Prediction

This comprehensive survey examines the foundations, tokenization techniques, model architectures, training paradigms, evaluation benchmarks, and open challenges of multimodal next‑token prediction (MMNTP), offering researchers a clear roadmap for future advances in multimodal AI.

Model architecture · Multimodal AI · Next Token Prediction
0 likes · 9 min read
DataFunTalk
Jan 1, 2025 · Artificial Intelligence

Applying Large Language Models to Financial Risk Control at Akulaku

This article details Akulaku’s deployment of large language models across multimodal financial risk‑control scenarios—covering business background, a three‑module intelligent‑agent architecture, concrete tool‑ and planning‑enhancement case studies, and future outlook—demonstrating how LLMs boost efficiency, reduce labeling effort, and enable copilot‑style assistance.

Agent Architecture · KYC verification · Multimodal AI
0 likes · 15 min read
AntTech
Dec 23, 2024 · Artificial Intelligence

Ant Group’s AIGC Security Detection System Earns Top Rating in China ICT Academy’s Multimodal Evaluation

Ant Group’s AIGC security detection system was evaluated by the China Academy of Information and Communications Technology (CAICT), achieving the highest "Excellent" rating with a 0.99 F1 score across image, video, and audio modalities; Ant Group also released large‑scale detection datasets for the research community.

AIGC detection · Ant Group · Benchmark
0 likes · 5 min read
360 Tech Engineering
Dec 17, 2024 · Artificial Intelligence

Innovative Multimodal Architectures: IAA for Extending Language Models and BDM for Chinese-Native AI Painting

The article introduces two 360 AI Research Institute projects—IAA, an architecture that equips frozen language models with multimodal capabilities via plug‑in layers, and BDM, a Chinese‑native diffusion model compatible with the Stable Diffusion ecosystem—detailing their motivations, designs, benchmark results, and open‑source resources.

Chinese AI painting · Language Model · Multimodal AI
0 likes · 6 min read
ByteDance Web Infra
Dec 17, 2024 · Frontend Development

Midscene.js: Multimodal AI‑Powered UI Automation for Web Frontend Testing

Midscene.js, an open‑source UI automation framework from ByteDance Web Infra, leverages multimodal AI to simplify writing, maintaining, and debugging web UI tests with JavaScript or YAML integrations, while discussing its origins, usage patterns, limitations, cost, and security considerations.

JavaScript · Midscene.js · Multimodal AI
0 likes · 11 min read
AI Large Model Application Practice
Dec 9, 2024 · Artificial Intelligence

How GUI Agents Use Large Models to Automate Any Desktop Task

This article explains why GUI agents are needed, defines their multimodal capabilities, walks through a high‑level automation scenario, details the architecture of large‑model‑driven GUI agents, highlights recent open‑source projects, and compares them with traditional RPA solutions.

AI automation · GUI Agent · Human-Computer Interaction
0 likes · 10 min read