Tagged articles

303 articles

Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI research · Benchmarking · Multimodal AI

0 likes · 10 min read

Amap Tech

Oct 2, 2025 · Artificial Intelligence

How FantasyWorld Unifies Video Generation and 3D Geometry for Consistent Virtual Worlds

FantasyWorld introduces a geometry‑enhanced framework that augments a frozen video diffusion model with a trainable geometry branch, enabling simultaneous video representation and implicit 3D field generation, achieving spatially consistent, high‑quality virtual worlds and outperforming recent baselines in multi‑view coherence and geometric fidelity.

3D modeling · Computer Vision · Multimodal AI

0 likes · 11 min read

AI2ML AI to Machine Learning

Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85% KV‑cache reduction, and a suite of new techniques, from loss‑free MoE balancing to multi‑token prediction, that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Cost reduction · Multimodal AI · attention mechanisms

0 likes · 13 min read

AI2ML AI to Machine Learning

Sep 30, 2025 · Artificial Intelligence

Dynamic Multimodal Video Generation: Prioritizing Stability and High Quality

The article surveys the evolution of video generation models—from early GANs and DCGAN to diffusion‑based approaches like Stable Diffusion and DiT—highlighting how stability, high quality, massive compute, and multimodal data pipelines are shaping the current and future paths of dynamic multimodal video generation.

Latent Diffusion · Multimodal AI · Stable Diffusion

0 likes · 7 min read

DataFunTalk

Sep 29, 2025 · Artificial Intelligence

How Glint-MVT Powers City‑Scale Multimodal AI: Insights from a Tech VP

In an interview before the DACon conference, Dr. Feng Ziyong explains how Glint‑MVT and novel data‑synthesis techniques overcome distribution gaps, improve compositional understanding, and enable retrieval across billions of items within seconds for city‑level surveillance, while balancing model efficiency and effectiveness.

Embedding Retrieval · Multimodal AI · city surveillance

0 likes · 11 min read

DataFunSummit

Sep 28, 2025 · Artificial Intelligence

Unlocking Enterprise Knowledge: Building Multimodal AI Systems with LLMs

This article examines the challenges of processing massive multimodal data in enterprises and presents a knowledge‑augmentation framework that leverages Retrieval‑Augmented Generation, memory‑inspired architecture, and feedback loops to enable reliable, scalable AI‑driven decision making across diverse business scenarios.

Enterprise Knowledge · Knowledge Graph · LLM

0 likes · 29 min read

Instant Consumer Technology Team

Sep 25, 2025 · Artificial Intelligence

Inside Qwen’s Midnight Release: New Guard, Travel Agent, LiveTranslate, Code & Vision Models Unveiled

Late at night on the 23rd, Lin Junyang of Tongyi Lab announced six AI model releases—including a safety‑audit guard, a personal travel planner, a real‑time multilingual translator, upgraded coding models, a powerful vision‑language model, and the flagship Qwen3‑Max—each detailed with capabilities, highlights, and direct download links.

Multimodal AI · Safety · artificial intelligence

0 likes · 11 min read

Instant Consumer Technology Team

Sep 23, 2025 · Artificial Intelligence

What’s Driving the Latest Tech Frontier? Vite’s Speed Boost, AI Coding Agents, and End‑to‑End Generative Search

This roundup highlights Vite’s next‑gen Rolldown engine delivering up to 45% faster builds, AI‑powered coding tools like Comate Zulu and Claude Code enabling solo developers, Browser‑Use’s AI web automation, Alipay’s AI travel assistant built by a four‑person team, ROMA’s dark‑mode adaptation, Kuaishou’s OneSearch generative framework, MIDAS multimodal digital‑human breakthroughs, and the open‑source VoxCPM speech model.

AI Coding · Browser Automation · Generative Search

0 likes · 6 min read

DataFunTalk

Sep 19, 2025 · Artificial Intelligence

GenAI Summit 2025: Large Model Innovations & Real-World Applications

The DataFun GenAI Summit 2025 brings together leading experts from Alibaba, Tencent, Ant Financial, and other tech giants to showcase the latest breakthroughs in large-model research, generative AI, multimodal understanding, and real-world deployments across finance, e-commerce, media, and enterprise services.

AI applications · Enterprise AI · GenAI

0 likes · 25 min read

DaTaobao Tech

Sep 17, 2025 · Artificial Intelligence

Boosting ID Card Photo Quality with Multimodal AI: A Practical Deployment Guide

This article details how a multimodal AI model was integrated to detect and improve ID card photo quality, covering common image issues, differences between OCR and multimodal extraction, deployment strategies, performance metrics, cost estimation, and the resulting business and technical benefits.

ID verification · Model Deployment · Multimodal AI

0 likes · 13 min read

Raymond Ops

Sep 14, 2025 · Artificial Intelligence

Create AI Videos with DeepSeek + Tongyi Wanxiang: Step-by-Step Guide

This article explains how to leverage the Chinese AI multimodal platform Tongyi Wanxiang together with DeepSeek to generate high-quality AI videos, covering AI video fundamentals, core features, application scenarios, detailed workflow, script creation, video synthesis, and Java API integration with code examples.

AI video generation · DeepSeek · Java SDK

0 likes · 25 min read

DataFunSummit

Sep 12, 2025 · Artificial Intelligence

How AI Dressing and Digital Humans Are Revolutionizing Home Service Experiences

In an exclusive interview, AI expert Wang Mingzhong details the technical challenges and breakthroughs behind AI dressing, AI video resumes, short‑video templates, and digital‑human live streaming for 58 Home services, highlighting model choices, multimodal architectures, modular design, and future directions for emotional interaction.

AI dressing · AI video resume · Multimodal AI

0 likes · 9 min read

Kuaishou Tech

Sep 5, 2025 · Artificial Intelligence

How Keye‑VL‑1.5‑8B Sets New Benchmarks in Multimodal AI

Kuaishou has open‑sourced Keye‑VL‑1.5, an 8‑billion‑parameter multimodal LLM that introduces slow‑fast frame encoding, a progressive four‑stage pre‑training pipeline, and an automated data construction workflow, achieving state‑of‑the‑art results on video and vision‑language benchmarks and surpassing many closed‑source models.

Multimodal AI · benchmark performance · large language model

0 likes · 12 min read

Data Party THU

Sep 3, 2025 · Artificial Intelligence

Exploring Multimodal Generative AI: A Tsinghua Tutorial at IJCAI 2025

This article introduces a 1.5‑hour tutorial presented by Tsinghua researchers at IJCAI 2025, covering the latest advances in multimodal generative AI, including multimodal large language models, diffusion models, post‑training generalization techniques, and unified understanding‑generation frameworks.

Generative Models · IJCAI 2025 · Multimodal AI

0 likes · 5 min read

AntTech

Aug 27, 2025 · Artificial Intelligence

How AI Is Revolutionizing Content Safety – The Tech Behind Shanghai’s Top Award

Shanghai’s 2024 Science and Technology Award honored a joint effort by Shanghai Jiao Tong University and Ant Group for pioneering AI-driven technologies—multimodal hallucination mitigation, controllable data generation, integrated content security monitoring, and adversarial model protection—that set international standards in detecting harmful online media and AIGC content.

AI content safety · AIGC detection · Multimodal AI

0 likes · 6 min read

Baidu Geek Talk

Aug 25, 2025 · Artificial Intelligence

How ERNIE‑4.5‑VL Redefines Multimodal AI with 100+ Language Support

The ERNIE‑4.5‑VL visual‑language model breaks single‑modality limits by delivering breakthrough image, video, and text understanding across more than 100 languages, offering lightweight yet competitive performance against models like Qwen2.5‑VL, supporting 128K context, dual “thinking” modes, and extensive deployment resources.

AI research · Ernie · Multimodal AI

0 likes · 4 min read

Data Party THU

Aug 22, 2025 · Artificial Intelligence

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of original LVLM accuracy, as demonstrated on LLaVA‑1.5‑7B and other benchmarks.

LVLM · Multimodal AI · Token Pruning

0 likes · 14 min read

37 Interactive Technology Team

Aug 20, 2025 · Artificial Intelligence

Unlocking ChatGPT‑4o: How the New Multimodal Model Revolutionizes Image Generation

ChatGPT‑4o, OpenAI’s latest multimodal model, dramatically enhances text and image generation with higher quality visuals, flexible style control, faster response, and integrated image editing, and the article showcases diverse real‑world use cases—from advertising graphics to game UI design—demonstrating its practical impact across industries.

AI applications · ChatGPT4o · Multimodal AI

0 likes · 11 min read

AntTech

Aug 19, 2025 · Artificial Intelligence

How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks

Ant Group's open‑source native GUI agent UI‑Venus leverages multimodal large‑model and reinforcement‑learning techniques to outperform prior models on grounding and navigation benchmarks, while using a high‑quality data pipeline and a self‑evolving alignment mechanism to push the limits of GUI automation.

Benchmark · GUI Agent · Multimodal AI

0 likes · 7 min read

Data Party THU

Aug 15, 2025 · Artificial Intelligence

What’s Next for Visual Reinforcement Learning? A Comprehensive 2024‑2025 Survey

This article provides a critical, up‑to‑date overview of visual reinforcement learning, formalizes the problem, traces policy‑optimization evolution, categorizes over 200 recent works into four pillars, analyzes algorithms, reward design, benchmarks, and highlights open challenges and future research directions.

Multimodal AI · RLHF · diffusion models

0 likes · 7 min read

Ctrip Technology

Aug 14, 2025 · Artificial Intelligence

How Multimodal Large Models Can Auto-Generate UI Test Cases End‑to‑End

Leveraging multimodal large‑model AI, this article outlines a four‑stage evolution from text‑based UI element identification to fully autonomous, end‑to‑end generation of executable UI automation scripts, detailing system architecture, intelligent reasoning engine, and real‑world Ctrip hotel refund test case results.

Multimodal AI · Software Testing · Test Case Generation

0 likes · 17 min read

Data Party THU

Aug 13, 2025 · Artificial Intelligence

How Large Language Models Are Revolutionizing Automated Scholarly Paper Review

This survey examines the rapid rise of large language models in automated scholarly paper review (ASPR), analyzing model types, technical breakthroughs such as long‑text, multimodal, and multi‑turn capabilities, new generation methods, datasets, open‑source tools, current challenges, publisher policies, and future research directions.

ASPR · Multimodal AI · automated paper review

0 likes · 19 min read

Volcano Engine Developer Services

Aug 8, 2025 · Artificial Intelligence

Master PromptPilot: Step‑by‑Step Guide to Build, Optimize, and Debug AI Prompts

This comprehensive tutorial walks you through the entire PromptPilot workflow—from initial setup and prompt generation to iterative optimization, visual debugging, batch testing, and intelligent refinement—showcasing how to create high‑quality, production‑ready prompts for AI agents and applications.

AI tools · Multimodal AI · Prompt engineering

0 likes · 10 min read

AIWalker

Aug 6, 2025 · Artificial Intelligence

Why ByteDance’s 7B BAGEL Model Rivals GPT‑4o in Unified Multimodal Understanding and Generation

The article provides an in‑depth technical analysis of ByteDance’s 7‑billion‑parameter BAGEL model, detailing its MoT architecture, high‑quality interleaved multimodal pre‑training data, multi‑stage training strategy, emergent capabilities, and extensive benchmark results that show BAGEL matching or surpassing GPT‑4o on vision‑language tasks.

BAGEL · GPT-4o comparison · Multimodal AI

0 likes · 24 min read

Bilibili Tech

Aug 5, 2025 · Artificial Intelligence

How Bilibili’s IndexTTS2 Achieves Real‑Time, Emotion‑Rich Voice Translation

IndexTTS2 introduces a cross‑modal, multi‑language voice translation system that preserves speaker identity, acoustic space, and multi‑source timbre, while tackling challenges like voice personality loss, subtitle cognitive load, localization costs, multi‑speaker diarization, and cultural adaptation through novel time‑coding, adversarial RL, and diffusion‑based lip‑sync techniques.

Multimodal AI · Speech synthesis · adversarial reinforcement learning

0 likes · 20 min read

Baidu MEUX

Jul 30, 2025 · Artificial Intelligence

What’s New in AI? 10 Breakthrough Tools and Models Shaping 2024

This roundup highlights ten recent AI breakthroughs—including Perplexity’s Comet browser, Google’s T5Gemma series, xAI’s Grok‑4, a revamped PNG format, Bilibili’s Code‑H creator, JD’s AI social apps, ByteDance’s AI doctor, Vivo’s edge‑side multimodal model, Tencent’s payment‑enabled platform, and ByteDance’s Xverse image generator—showcasing rapid advances across browsing, modeling, multimedia, and commerce.

AI · Multimodal AI · machine learning

0 likes · 7 min read

AI Info Trend

Jul 24, 2025 · Industry Insights

What’s Driving AI Adoption in 2025? Six Key Trends Uncovered

The AI Adoption Survey H1 2025 reveals that nearly half of organizations have deployed AI in production, engineering and R&D lead usage, Chinese LLMs gain overseas interest, and cost, reliability and intelligence remain the top challenges, while tool preferences and multimodal trends reshape the market.

AI Infrastructure · AI adoption · AI trends

0 likes · 7 min read

FunTester

Jul 21, 2025 · Artificial Intelligence

How First Principles Shape the Future of AI Agents: Evolution, Capabilities, and Emerging Trends

This article explores how first‑principle reasoning underpins the development of AI agents, traces their collaborative technology evolution, details core capabilities such as compute, memory, prediction and action, and forecasts future directions like multimodal models, reduced prompting, and extensive data sharing.

AI agents · Agent Collaboration · Future AI

0 likes · 15 min read

AntTech

Jul 17, 2025 · Artificial Intelligence

How M2-Reasoning-7B Achieves State‑of‑the‑Art Spatial Reasoning in Multimodal AI

M2-Reasoning-7B, an open‑source 7B multimodal model from Ant Group, combines a high‑quality data pipeline with dynamic multi‑task training and a novel reward function to deliver state‑of‑the‑art performance on both general and spatial reasoning benchmarks, surpassing many larger competitors.

Benchmark · M2-Reasoning · Multimodal AI

0 likes · 9 min read

Tencent Cloud Developer

Jul 16, 2025 · Artificial Intelligence

How First Principles Shape the Future of AI Agents: Evolution, Capabilities, and Trends

This article explores how first‑principle thinking underpins AI agents, traces their development from single‑craftsman tools to enterprise‑level collaborations, outlines core capabilities such as compute, memory, prediction and action, and forecasts future directions like multimodal models, reduced prompting, and extensive data sharing.

AI agents · Agent Collaboration · Future AI

0 likes · 15 min read

Kuaishou Large Model

Jul 11, 2025 · Artificial Intelligence

How MODA’s Modular Duplex Attention Boosts Multimodal Emotion Understanding

The paper introduces MODA, a new multimodal model that tackles attention imbalance across modalities with a modular duplex attention mechanism, achieving significant performance gains on perception, cognition, and emotion tasks across 21 benchmarks and demonstrating strong potential for human‑machine interaction.

Deep Learning · MODA model · Multimodal AI

0 likes · 13 min read

DataFunTalk

Jul 11, 2025 · Artificial Intelligence

When AI Sees Six Fingers: Why Vision Models Miss the Mark

The article examines how multimodal AI models repeatedly miscount a six‑finger image, explores the underlying bias revealed in the paper “Vision Language Models are Biased,” and warns that such prior‑knowledge‑driven errors can have serious safety implications in real‑world applications.

AI bias · Multimodal AI · Vision-Language Models

0 likes · 10 min read

Kuaishou Tech

Jul 10, 2025 · Artificial Intelligence

How MODA’s Modular Duplex Attention Solves Multimodal Attention Imbalance and Boosts Emotion Understanding

The paper introduces MODA, a modular duplex attention multimodal model that addresses severe cross‑modal attention imbalance in existing large multimodal models, proposes a novel attention paradigm and masking scheme, and demonstrates significant performance gains across 21 benchmarks in perception, cognition, and emotion tasks, earning a Spotlight paper at ICML 2025.

Emotion Recognition · MoDA · Multimodal AI

0 likes · 13 min read

Baobao Algorithm Notes

Jul 10, 2025 · Industry Insights

Grok 4 Unveiled: Why xAI Claims Its New Model Beats the Competition

On July 10, xAI launched Grok 4, a multimodal LLM with a 256K‑token context window, tool‑use upgrades and benchmark scores that surpass existing models, while pricing it at $30/month for the standard tier and $300/month for the heavy tier.

AI benchmarks · Grok 4 · Industry analysis

0 likes · 6 min read

DataFunSummit

Jul 8, 2025 · Artificial Intelligence

Explore Cutting-Edge AI Knowledge Graphs: From Multimodal GraphRAG to Industry Applications

This article presents a curated catalog of cutting‑edge AI resources, covering multimodal GraphRAG, knowledge‑graph and large‑model integration, financial industry AI products, Chinese‑medicine decision support, AI‑driven knowledge‑graph evolution, private‑domain Q&A pipelines, and emerging trends and standards, with a QR code for the full ebook.

Document Intelligence · Knowledge Graph · Multimodal AI

0 likes · 2 min read

DataFunTalk

Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI agents · GLM-4.1V · Multimodal AI

0 likes · 10 min read

DataFunTalk

Jul 2, 2025 · Artificial Intelligence

How Multimodal Large Models Are Revolutionizing Complex Document OCR

In a detailed interview, Zhao Chenyang explains how multimodal large models (VLMs) overcome the limitations of traditional OCR in mixed layouts, table reconstruction, and handwritten text by leveraging self‑supervised pre‑training, lightweight fine‑tuning, and hybrid pipelines that dramatically cut annotation costs and improve recall rates.

AI deployment · Multimodal AI · document OCR

0 likes · 13 min read

DataFunTalk

Jun 30, 2025 · Artificial Intelligence

Wenxin 4.5 Series: Open‑Source Multimodal MoE Models and FastDeploy Guide

The Wenxin 4.5 series introduces ten open‑source models—including large‑scale MoE and dense variants—featuring a novel multimodal heterogeneous architecture, high training efficiency, SOTA benchmark performance, and comprehensive toolkits (ERNIEKit, FastDeploy) for fine‑tuning and multi‑hardware deployment.

ERNIEKit · FastDeploy · MoE

0 likes · 8 min read

Network Intelligence Research Center (NIRC)

Jun 29, 2025 · Artificial Intelligence

Multimodal AI Assistant Boosts Network Config: 96.6% Accuracy, 26× Labor Cut

The paper presents NLI2Conf, an intent‑driven network configuration model that fuses configuration files, topology and performance data via a multimodal interface, using large language and graph neural models to align natural‑language intents with forwarding and performance constraints, achieving 96.6% accuracy and a 26‑fold reduction in manual effort.

Graph Neural Network · Multimodal AI · NLI2Conf

0 likes · 6 min read

AI Algorithm Path

Jun 29, 2025 · Artificial Intelligence

Understanding CLIP: Theory, Architecture, and Zero‑Shot Vision

CLIP (Contrastive Language‑Image Pre‑training) is an OpenAI model that learns visual concepts from 400 million image‑text pairs using a dual‑encoder architecture, enabling zero‑shot classification, flexible text‑driven search, and cross‑modal reasoning, while its strengths, limitations, and emerging applications are examined in detail.

CLIP · Contrastive Language-Image Pretraining · Dual Encoder

0 likes · 15 min read

AI Algorithm Path

Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 1: What They Are and Why They Matter

This article introduces visual‑language models (VLMs), explaining how they combine large language models with visual encoders, why they overcome the rigidity of traditional computer‑vision systems, their key advantages, modular architecture, training methods, and practical applications such as image captioning and visual question answering.

AI applications · Computer Vision · Multimodal AI

0 likes · 8 min read

AntTech

Jun 18, 2025 · Artificial Intelligence

How Ant Group’s Bailing Models Push Toward AGI with MoE and Multimodal Innovations

In a detailed AICon talk, Ant Group’s Bailing team leader Zhou Jun outlines their latest large‑model training techniques, MoE architecture optimizations, multimodal breakthroughs, open‑source releases, and the strategic roadmap needed to turn AI into a ubiquitous, “scan‑code‑level” everyday assistant.

AI Infrastructure · Mixture of Experts · Multimodal AI

0 likes · 25 min read

Kuaishou Tech

Jun 10, 2025 · Artificial Intelligence

How Midscene.js Leverages Multimodal AI for Zero‑Code UI Automation

Midscene.js, an open‑source UI automation framework from ByteDance’s Web Infra team, combines multimodal AI inference with Chrome extensions, YAML scripts, and JavaScript SDKs to enable zero‑code testing across Web, Android, Playwright, and Puppeteer, offering key interfaces for actions, queries, and assertions.

JavaScript · Multimodal AI · Playwright

0 likes · 8 min read

Kuaishou Large Model

Jun 5, 2025 · Artificial Intelligence

7 Kuaishou Papers Accepted at ACL 2025 Reveal Cutting‑Edge AI Advances

Kuaishou's foundational large‑model team secured seven papers at the prestigious ACL 2025 conference, covering alignment bias during model training, safety in inference, decoding strategies, fine‑grained video‑temporal understanding, and new evaluation benchmarks that push the frontier of multimodal large language models.

ACL 2025 · Benchmark · Multimodal AI

0 likes · 16 min read

Alibaba Cloud Developer

Jun 5, 2025 · Databases

Enable Native Multimodal AI Search with SQL on PolarDB

This article explains how to use standard SQL on PolarDB PostgreSQL to directly invoke multimodal AI services for image feature extraction and vectorization, eliminating data migration and complex toolchains while providing low‑threshold integration, flexible scenario adaptation, full‑link security, and serverless, pay‑as‑you‑go deployment.

Multimodal AI · Polardb · SQL

0 likes · 6 min read

AntTech

Jun 4, 2025 · Artificial Intelligence

LLaDA and LLaDA‑V: Large Language Diffusion Models and Their Multimodal Extensions

This article presents the LLaDA series of diffusion‑based large language models, explains how their generative‑modeling principle yields language intelligence comparable to autoregressive models, and details the multimodal LLaDA‑V architecture, training methods, experimental results, and broader implications for AI research.

Generative Modeling · Multimodal AI · diffusion models

0 likes · 10 min read

Tencent Technical Engineering

May 23, 2025 · Artificial Intelligence

Can a 3B Open‑Source Multimodal Model Beat GPT‑4V in Math? A Deep Dive into VLR1‑3B

The preview release of the 3‑billion‑parameter VLR1‑3B multimodal model demonstrates state‑of‑the‑art reasoning on math benchmarks, outperforms many commercial closed‑source models, and shows promising results on geometry, physics, and general vision tasks, while also revealing typical hallucination issues.

Benchmark · Multimodal AI · VLR1-3B

0 likes · 8 min read

AI Frontier Lectures

May 21, 2025 · Artificial Intelligence

How BGE’s New Code and Multimodal Vector Models Set New Retrieval Benchmarks

The article introduces three BGE vector models—BGE‑Code‑v1, BGE‑VL‑v1.5, and BGE‑VL‑Screenshot—detailing their architectures, open‑source resources, benchmark results on CoIR, Code‑RAG, MMEB, and MVRB, and their impact on code and multimodal retrieval research.

AI research · Multimodal AI · Open-source models

0 likes · 8 min read

Kuaishou Tech

May 13, 2025 · Artificial Intelligence

How KuaiMod Uses Multimodal AI to Revolutionize Short‑Video Content Quality

This article analyzes KuaiMod, a multimodal large‑model solution developed by Kuaishou for short‑video content quality assessment, detailing its benchmark dataset, chain‑of‑thought data construction, offline SFT + DPO training, online reinforcement‑learning updates, evaluation results, and large‑scale deployment impact.

Benchmark · KuaiMod · Multimodal AI

0 likes · 19 min read

AIWalker

May 11, 2025 · Artificial Intelligence

Unified Multimodal Understanding and Generation: A 30K‑Word Survey of Recent Advances

This comprehensive survey reviews the rapid progress of multimodal understanding and text‑to‑image generation models, categorises existing unified architectures into diffusion‑based, autoregressive, and hybrid paradigms, analyses their tokenisation strategies, datasets and benchmarks, and highlights current challenges and future research directions.

Autoregressive Models · Datasets · Multimodal AI

0 likes · 64 min read

Baidu Tech Salon

Apr 28, 2025 · Artificial Intelligence

Inside Baidu’s Wenxin 4.5 Turbo & X1 Turbo: Architecture, Training Tricks, and Real-World Impact

At the Create2025 AI Developer Conference, Baidu unveiled the multimodal Wenxin 4.5 Turbo and X1 Turbo models, detailing their innovative architecture, self‑feedback post‑training, composite reasoning chains, data pipelines, and the new Wenxin KuaiMa 3.5 code assistant, while also showcasing ecosystem growth and cultural AI applications.

AI Conference · Baidu · Code Generation

0 likes · 9 min read

AI Algorithm Path

Apr 26, 2025 · Artificial Intelligence

OpenAI Launches GPT-Image-1: Bringing ChatGPT‑Style Image Generation to Developers

OpenAI has opened the GPT‑Image‑1 API, a multimodal model that supports both image generation and editing, offers configurable quality, size, and format options, provides JavaScript code samples, outlines token‑based pricing, and is already being integrated by platforms such as Adobe, Canva, and HeyGen.

API · GPT-Image-1 · JavaScript

0 likes · 9 min read

Meituan Technology Team

Apr 24, 2025 · Artificial Intelligence

Meituan AI Recruitment: Join Our Advanced Technology Teams

Meituan's AI recruitment page showcases diverse opportunities across AI infrastructure, intelligent interaction, visual intelligence, and intelligent products, featuring roles from algorithm engineers to product managers working on cutting-edge technologies including large models, intelligent agents, and multimodal systems.

AI Recruitment · Computer Vision · Intelligent agents

0 likes · 5 min read

Tencent Cloud Developer

Apr 24, 2025 · Industry Insights

How RAG, AI Agents, and Multimodal Models Are Reshaping Industry – Trends, Challenges, and Real‑World Cases

The article analyzes the rapid evolution of large‑model technologies—Retrieval‑Augmented Generation, autonomous agents, and multimodal AI—detailing their technical foundations, practical challenges, industry applications such as unified multimodal tasks, open‑world detection, and video moderation, and forecasting future development directions.

AI agents · Multimodal AI · RAG

0 likes · 15 min read

DaTaobao Tech

Apr 14, 2025 · Artificial Intelligence

Taobao AIGC Content Generation: Short Video Production Techniques

Taobao’s Content AI team leverages a proprietary multimodal Mixture‑of‑Experts model to automatically generate short‑form videos—extracting highlights from live streams and creating customized product explainers—using two‑stage CLIP/VideoBLIP training, character‑level timestamps, LLM re‑segmentation and OCR masking, now producing over 100 k daily videos with a 12 % approval boost and notable conversion gains.

AIGC · Multimodal AI · content AI

0 likes · 20 min read

Tencent Cloud Developer

Apr 10, 2025 · Artificial Intelligence

The Magic of GPT‑4o: Technical Overview and Speculated Architecture

GPT‑4o combines extremely long‑form text generation, high‑quality image creation and interactive editing by likely using an autoregressive multimodal transformer that tokenizes visuals via VQ‑VAE/GAN pipelines, trained on massive data and refined through fine‑tuning and RLHF, offering a unified model for generation, editing, and understanding.

GPT-4o · Multimodal AI · VQ-VAE

0 likes · 17 min read

21CTO

Apr 7, 2025 · Artificial Intelligence

Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities

Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models—featuring Mixture‑of‑Experts architectures, unprecedented 10‑million‑token context windows, and state‑of‑the‑art performance across vision, language, and multimodal benchmarks—while emphasizing efficient training, open‑source availability, and robust safety safeguards.

AI Safety · Llama 4 · Mixture of Experts

0 likes · 14 min read

DataFunTalk

Apr 6, 2025 · Artificial Intelligence

Meta Unveils Llama 4: New Multimodal AI Models with Mixture‑of‑Experts Architecture and 10 Million‑Token Context

Meta announced the Llama 4 series—Scout, Maverick and Behemoth—featuring multimodal capabilities, Mixture‑of‑Experts design, up to 10 million‑token context windows, and state‑of‑the‑art performance on STEM, multilingual and image benchmarks, with models now downloadable from llama.com and Hugging Face.

Llama 4 · Mixture of Experts · Model Training

0 likes · 14 min read

Baobao Algorithm Notes

Apr 6, 2025 · Artificial Intelligence

Inside Llama 4: How Meta’s New Multimodal MoE Models Achieve 10M‑Token Contexts

Meta unveils Llama 4 Scout, Maverick, and the upcoming Behemoth, detailing their Mixture‑of‑Experts architecture, massive 10‑million‑token context windows, efficient FP8 training, safety mechanisms, and competitive benchmark results that surpass leading multimodal models.

AI Safety · Llama 4 · Mixture of Experts

0 likes · 16 min read

Architects' Tech Alliance

Apr 1, 2025 · Artificial Intelligence

What’s New in Large Language Models? DeepSeek V3, Qwen2.5‑Omni, Gemini 2.5 Pro, and GPT‑4o Unpacked

This article reviews the latest updates from major LLM providers—DeepSeek V3’s parameter boost and longer context, Qwen2.5‑Omni’s open‑source multimodal 7B model, Google Gemini 2.5 Pro’s 1 M‑token window and multimodal prowess, and OpenAI GPT‑4o’s image generation and reduced pricing—highlighting technical specs, capabilities, and availability.

DeepSeek · GPT-4o · Gemini

0 likes · 9 min read

Architects' Tech Alliance

Mar 31, 2025 · Artificial Intelligence

A Comprehensive History of Large Language Models from the Transformer Era (2017) to DeepSeek‑R1 (2025)

This article reviews the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, alignment techniques, multimodal extensions, open‑weight releases, and the cost‑efficient DeepSeek‑R1 in 2025, highlighting key technical advances, scaling trends, and their societal impact.

AI Alignment · LLM evolution · Multimodal AI

0 likes · 26 min read

Architect

Mar 30, 2025 · Artificial Intelligence

What Is Retrieval-Augmented Generation? A Deep Dive into RAG Techniques

This article provides a comprehensive survey of Retrieval‑Augmented Generation (RAG), covering its basic principles, key components, seven technical variants, challenges, evaluation methods, and future research directions across multimodal, graph‑based, and agentic extensions.

AI Survey · Knowledge Retrieval · Multimodal AI

0 likes · 9 min read

MaGe Linux Operations

Mar 28, 2025 · Artificial Intelligence

How to Create AI-Generated Videos with Tongyi Wanxiang and DeepSeek: A Step‑by‑Step Guide

This article explains the fundamentals of AI video technology, details the features of Alibaba Cloud's Tongyi Wanxiang platform, demonstrates how to use DeepSeek for script generation, and provides a complete workflow—including code examples—for producing high‑quality AI‑generated videos.

AI video generation · DeepSeek · Java SDK

0 likes · 24 min read

Sohu Tech Products

Mar 26, 2025 · Artificial Intelligence

How SpatialLM Turns 3D Point Clouds into Structured Scene Understanding

SpatialLM is a large language model designed for 3D spatial understanding that converts point‑cloud data from videos, RGB‑D images or LiDAR into structured scene descriptions, and this guide explains its architecture, model versions, repository links, and step‑by‑step deployment on Ubuntu with PyTorch.

3D point cloud · Multimodal AI · PyTorch

0 likes · 7 min read

ByteDance Web Infra

Mar 21, 2025 · Artificial Intelligence

Midscene.js: An AI‑Driven UI Automation Framework from ByteDance

Midscene.js is an open‑source UI automation framework that leverages multimodal AI to simplify web UI testing and interaction, offering three core interfaces—Action, Query, and Assert—along with a JavaScript SDK, support for multiple AI models, YAML scripting, and future‑focused features for stable, scalable automation.

AI · JavaScript · Midscene.js

0 likes · 21 min read

AI Frontier Lectures

Mar 20, 2025 · Artificial Intelligence

Why Multimodal LLMs Still Struggle with Multi-Image Math Reasoning: Insights from MV‑MATH

This article introduces the MV‑MATH dataset, a large‑scale multi‑image math benchmark, and evaluates 24 open‑source and closed‑source multimodal large language models, revealing significant performance gaps, especially on complex visual dependencies and higher difficulty levels.

Dataset · Model Evaluation · Multimodal AI

0 likes · 8 min read

DevOps

Mar 19, 2025 · Artificial Intelligence

From Claude 3.5 Sonnet to Manus: The Evolution and Landscape of Computer‑Use AI Agents

This article surveys the rapid development of computer‑use AI agents—from Anthropic’s Claude 3.5 Sonnet and OpenAI’s Operator to the multi‑agent Manus platform—detailing their capabilities, benchmark results, open‑source alternatives, practical challenges, and future prospects for autonomous digital assistants.

AI agents · Anthropic · Automation

0 likes · 24 min read

Baidu Geek Talk

Mar 19, 2025 · Artificial Intelligence

Inside Baidu’s New Wenxin 4.5 & X1: Multimodal Breakthroughs and Tool‑Enabled AI

Baidu officially launched the Wenxin 4.5 and X1 large language models, showcasing native multimodal foundations, advanced attention masks, heterogeneous expert extensions, and tool‑calling capabilities, while offering low‑cost API access on the Qianfan platform and outlining the underlying technical innovations that drive their performance gains.

AI Platform · Baidu · Multimodal AI

0 likes · 8 min read

AIWalker

Mar 17, 2025 · Artificial Intelligence

How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance

The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation that supports pairwise ranking and pointwise scoring, builds a 236K human‑preference dataset across image and video tasks, and uses DPO to align VLMs and diffusion models, achieving significant performance gains on both image and video benchmarks.

Direct Preference Optimization · Multimodal AI · Preference Modeling

0 likes · 19 min read

DaTaobao Tech

Mar 12, 2025 · Artificial Intelligence

Multimodal Automatic Layout Generation for E-commerce

The project develops a multimodal automatic layout generation system for e‑commerce by fine‑tuning the qwen‑vl‑7b vision‑language model with LoRA on poster and Taobao image‑layout data, employing diffusion‑based image generation and coordinate‑prediction methods to produce structured layouts that power poster, marketing image, and video‑cover creation with over 90% adoption, while exploring multi‑image, style‑aware, and iterative refinement extensions.

LLM · Multimodal AI · diffusion

0 likes · 12 min read

Full-Stack Cultivation Path

Mar 9, 2025 · Artificial Intelligence

Why Computer Use Agents Like Manus Signal a New Era for AI Automation

The article examines the emerging Computer Use paradigm—LLMs that can see and control a computer screen—detailing its technical foundations, three implementation approaches, performance trade‑offs, and why it could become a dominant design pattern for future AI agents.

AI Agent · Computer Use · Multimodal AI

0 likes · 9 min read

AI Code to Success

Mar 6, 2025 · Artificial Intelligence

How Monica’s ‘Manus’ AI Agent Redefines Human‑Computer Collaboration

Monica’s new AI agent Manus, unveiled on March 6, claims to autonomously handle complex tasks through multimodal processing, continuous learning, and intelligent decision‑making, with real‑world demos ranging from strategic planning to a smart home buying assistant, while sparking market hype, competitive comparisons, and debates on AI’s future role in the workforce.

AI Agent · AI Market · Competitive analysis

0 likes · 5 min read

Ma Wei Says

Mar 4, 2025 · Artificial Intelligence

Microsoft’s Open‑Source Multimodal AI Agent Model Magma: Capabilities and Innovations

On February 25, 2025, Microsoft open‑sourced its first multimodal AI agent foundation model, Magma, which extends multimodal processing to images, video, and text, introduces Set‑of‑Mark and Trace‑of‑Mark techniques for spatial‑temporal reasoning, optimizes modular inference for edge devices, and integrates reinforcement learning for adaptive task execution.

Edge Computing · Magma · Multimodal AI

0 likes · 6 min read

Architects' Tech Alliance

Mar 1, 2025 · Artificial Intelligence

Decoding DeepSeek: A Four‑Tier Capability Framework for Multimodal AI

The article outlines DeepSeek's four‑level capability hierarchy—basic multimodal data fusion and dynamic governance, intermediate domain modeling with causal reasoning and multi‑objective optimization, advanced complex system modeling with digital twins and multi‑agent coordination, and ultimate autonomous evolution features such as concept‑space exploration and self‑programming.

DeepSeek · Digital Twin · Model Capability

0 likes · 5 min read

Huolala Tech

Feb 27, 2025 · Artificial Intelligence

How Huolala’s Wukong Platform Solves Large‑Model Deployment Challenges

Huolala’s Wukong platform tackles the common “technology hype, implementation difficulty” dilemma of generative AI by unifying multimodal enterprise knowledge, enabling dynamic multi‑agent workflows, and providing low‑code tools, observability, and stable deployment across dozens of business scenarios.

AI workflow · Enterprise AI · Large Model

0 likes · 11 min read

DataFunSummit

Feb 26, 2025 · Artificial Intelligence

Applying Multimodal Large Models to Music Recommendation at NetEase Cloud Music

This article details how NetEase Cloud Music leverages multimodal large language models to improve music recommendation across daily, personalized, and playlist scenarios by extracting rich audio, text, and visual features, addressing data skew, cold‑start challenges, and achieving measurable gains in user engagement and distribution efficiency.

Multimodal AI · NetEase Cloud Music · feature extraction

0 likes · 12 min read

DaTaobao Tech

Feb 26, 2025 · Artificial Intelligence

How Taobao’s AI Turns Static Clothing Images into Seamless Virtual Try‑On Videos

This article analyzes Taobao’s AIGC video virtual try‑on pipeline, detailing the challenges of frame‑level realism and continuity, the upgraded DiT‑based model, 3D‑VAE compression, large‑scale data collection, template‑matching mechanisms, and the resulting product capabilities for automated marketing and personalized shopper experiences.

AI video generation · Content Generation · Multimodal AI

0 likes · 13 min read

Architecture & Thinking

Feb 26, 2025 · Artificial Intelligence

Unlocking DeepSeek: A Comprehensive Guide to China’s Cutting-Edge AI Chat Model

This article provides an in‑depth overview of DeepSeek, covering its core multimodal and multilingual features, long‑context capabilities, domain optimizations, security, main functions, diverse application scenarios, and practical usage via web interface or API integration.

AI chatbot · DeepSeek · Multimodal AI

0 likes · 6 min read

JD Retail Technology

Feb 25, 2025 · Artificial Intelligence

How JD’s “JingDianDian” AI Platform Revolutionizes E‑commerce Content Creation

JD Retail’s self‑built AIGC platform ‘JingDianDian’ leverages multimodal diffusion models, ControlNet, RAG and reinforcement learning to automatically generate high‑quality product images, videos and marketing copy, cutting production time from days to seconds and reducing costs by over 99% for more than 350 k merchants.

AIGC · Content Generation · Multimodal AI

0 likes · 15 min read

DaTaobao Tech

Feb 24, 2025 · Artificial Intelligence

AIGC Video Generation Techniques for E‑commerce: Lip‑Sync, Head/Body Driving, and Business Applications

The article surveys recent AIGC video generation advances for Taobao e‑commerce, detailing lip‑sync models like Wav2Lip and MuseTalk, head‑driven systems such as Hallo and EchoMimic, body‑driven pipelines including AnimateAnyone and Tango, and a four‑stage production workflow that boosts click‑through rates and enables virtual try‑on.

AIGC · Deep Learning · Multimodal AI

0 likes · 21 min read

DataFunSummit

Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Multimodal AI · RAG · Tensor Retrieval

0 likes · 11 min read

DataFunTalk

Feb 19, 2025 · Artificial Intelligence

Large Models: Concepts, Principles, Classifications and Applications

This report provides a comprehensive overview of large-scale AI models, explaining their definition, massive parameter and data requirements, underlying transformer architecture, classification into language, vision and multimodal models, notable examples such as DeepSeek, and a survey of popular AIGC tools and practical use cases.

AIGC tools · Deep Learning · Multimodal AI

0 likes · 9 min read

Architects' Tech Alliance

Feb 18, 2025 · Artificial Intelligence

How DeepSeek’s Latest Models Redefine AI Performance and Industry Adoption

The DeepSeek report details rapid model releases from 2024 onward, highlighting innovations such as model distillation, a 671 B MoE architecture, FP8 mixed‑precision, and the Janus‑Pro multimodal framework, while also documenting major cloud and chip providers' integration of these models into their services.

AI industry adoption · DeepSeek · MoE architecture

0 likes · 10 min read

Xiaohongshu Tech REDtech

Feb 17, 2025 · Artificial Intelligence

WorldSense: A New Benchmark for Evaluating Multimodal Large Models in Real‑World Scenarios

WorldSense, a new benchmark of 1,662 real‑world video‑audio clips and 3,172 QA pairs across 26 cognitive tasks, reveals that current multimodal large models achieve only 25%–48% accuracy, highlighting the crucial role of combined visual‑audio input and the difficulty of audio‑ and emotion‑related reasoning.

Multimodal AI · benchmark dataset · large models

0 likes · 12 min read

AIWalker

Feb 15, 2025 · Artificial Intelligence

Janus-Pro Unveiled: A Unified Architecture for Multimodal Understanding and Generation

Janus-Pro, the open‑source successor to Janus, introduces a decoupled visual encoder and scaled training data to boost both multimodal understanding and text‑to‑image generation, achieving state‑of‑the‑art results on benchmarks such as GQA, GenEval and DPG‑Bench.

Janus-Pro · Model Scaling · Multimodal AI

0 likes · 13 min read

AIWalker

Feb 12, 2025 · Artificial Intelligence

Goku: How HKU and ByteDance’s New Model Sets New Benchmarks in Commercial Image and Video Generation

The paper presents Goku, a rectified‑flow transformer that jointly generates high‑quality images and videos at commercial scale, detailing its novel architecture, massive high‑quality data pipeline, efficient large‑scale training tricks, and state‑of‑the‑art results on GenEval, DPG‑Bench, VBench and UCF‑101.

Large-Scale Training · Multimodal AI · Video Generation

0 likes · 29 min read

IT Architects Alliance

Feb 8, 2025 · Artificial Intelligence

Inside DeepSeek: How Its Innovative Architecture Redefines AI Performance

This article examines DeepSeek's advanced Transformer‑based architecture, dynamic routing, MoE system, multi‑stage training, efficient inference, multimodal capabilities, real‑world applications, technical challenges, and future prospects, providing a comprehensive technical analysis of the model's strengths and limitations.

AI Architecture · DeepSeek · Model Optimization

0 likes · 15 min read

AIWalker

Feb 4, 2025 · Artificial Intelligence

How Chain‑of‑Thought Boosts Text‑to‑Image Generation: The New o1 Inference Scheme

This article reviews a comprehensive study that applies Chain‑of‑Thought reasoning to autoregressive text‑to‑image generation, introducing extended test‑time computation, direct preference optimization, and two custom reward models (PARM and PARM++) that together improve generation quality by up to 15% over Stable Diffusion 3.

Direct Preference Optimization · Inference · Multimodal AI

0 likes · 13 min read

Radish, Keep Going!

Feb 4, 2025 · Artificial Intelligence

How DeepSeek Is Redefining AI: Efficiency, Open‑Source Impact, and Future Trends

The article reviews DeepSeek's breakthrough in inference efficiency, explores the trade‑offs of model distillation, compares open‑source and closed‑source ecosystems, examines shifting compute demands, highlights Chinese engineering innovations, and outlines future directions for AI development.

AI inference · DeepSeek · Multimodal AI

0 likes · 9 min read

Architect

Jan 29, 2025 · Artificial Intelligence

How Janus‑Pro Redefines Multimodal AI with Bigger Models and New Training Strategies

DeepSeek’s newly released Janus‑Pro series (1B and 7B) advances multimodal AI by decoupling visual understanding and generation, employing optimized three‑stage training, massive data expansion, and larger LLM backbones, achieving performance that matches or exceeds leading models from Meta, Google, OpenAI, and Stability AI.

DeepSeek · Janus-Pro · Model Scaling

0 likes · 6 min read

ByteDance Web Infra

Jan 22, 2025 · Artificial Intelligence

Introducing UI‑TARS: A Native GUI Agent Model Integrated with Midscene.js for Multimodal UI Automation

The article presents UI‑TARS, a native GUI‑agent model that combines multimodal large‑language models with the open‑source Midscene.js framework to enable more accurate, token‑efficient, and privacy‑preserving UI automation, while discussing its architecture, advantages, limitations, and integration steps.

GUI Agent · Midscene.js · Multimodal AI

0 likes · 11 min read

NewBeeNLP

Jan 17, 2025 · Artificial Intelligence

Unlocking Multimodal Intelligence: A Deep Dive into Next Token Prediction

This comprehensive survey examines the foundations, tokenization techniques, model architectures, training paradigms, evaluation benchmarks, and open challenges of multimodal next‑token prediction (MMNTP), offering researchers a clear roadmap for future advances in multimodal AI.

Model architecture · Multimodal AI · Next Token Prediction

0 likes · 9 min read

DataFunTalk

Jan 1, 2025 · Artificial Intelligence

Applying Large Language Models to Financial Risk Control at Akulaku

This article details Akulaku’s deployment of large language models across multimodal financial risk‑control scenarios—covering business background, a three‑module intelligent‑agent architecture, concrete tool‑ and planning‑enhancement case studies, and future outlook—demonstrating how LLMs boost efficiency, reduce labeling effort, and enable copilot‑style assistance.

Agent Architecture · KYC verification · Multimodal AI

0 likes · 15 min read

Baobao Algorithm Notes

Dec 25, 2024 · Artificial Intelligence

Create a Free Multimodal Calorie Counter with GLM‑4V‑Flash in Minutes

This guide shows how to install the ZhipuAI SDK, obtain a free GLM‑4V‑Flash API key, craft prompts for image‑based calorie estimation, and build a Python demo that calculates food calories, BMI, and personalized diet advice using a multimodal large model.

GLM-4V-Flash · Multimodal AI · Python

0 likes · 9 min read

AntTech

Dec 23, 2024 · Artificial Intelligence

Ant Group’s AIGC Security Detection System Earns Top Rating in China ICT Academy’s Multimodal Evaluation

Ant Group’s AIGC security detection system was evaluated by the China Academy of Information and Communications Technology (CAICT), achieving the highest "Excellent" rating with a 0.99 F1 score across image, video, and audio modalities, while also releasing large‑scale detection datasets for the research community.

AIGC detection · Ant Group · Benchmark

0 likes · 5 min read

Full-Stack Cultivation Path

Dec 18, 2024 · Frontend Development

Midscene.js: An AI‑Powered UI Automation Framework for Web Testing

Midscene.js leverages multimodal AI to simplify web UI automation by providing .ai, .aiQuery and .aiAssert methods, supporting JavaScript and YAML integrations, a Chrome extension, and detailed cost analysis while acknowledging latency, interaction limits, and prompt‑engineering challenges.

Chrome Extension · JavaScript · LLM

0 likes · 9 min read

360 Tech Engineering

Dec 17, 2024 · Artificial Intelligence

Innovative Multimodal Architectures: IAA for Extending Language Models and BDM for Chinese-Native AI Painting

The article introduces two 360 AI Research Institute projects—IAA, an architecture that equips frozen language models with multimodal capabilities via plug‑in layers, and BDM, a Chinese‑native diffusion model compatible with the Stable Diffusion ecosystem—detailing their motivations, designs, benchmark results, and open‑source resources.

Chinese AI painting · Language Model · Multimodal AI

0 likes · 6 min read

ByteDance Web Infra

Dec 17, 2024 · Frontend Development

Midscene.js: Multimodal AI‑Powered UI Automation for Web Frontend Testing

Midscene.js, an open‑source UI automation framework from ByteDance Web Infra, leverages multimodal AI to simplify writing, maintaining, and debugging web UI tests with JavaScript or YAML integrations, while discussing its origins, usage patterns, limitations, cost, and security considerations.

JavaScript · Midscene.js · Multimodal AI

0 likes · 11 min read

AI Large Model Application Practice

Dec 9, 2024 · Artificial Intelligence

How GUI Agents Use Large Models to Automate Any Desktop Task

This article explains why GUI agents are needed, defines their multimodal capabilities, walks through a high‑level automation scenario, details the architecture of large‑model‑driven GUI agents, highlights recent open‑source projects, and compares them with traditional RPA solutions.

AI automation · GUI Agent · Human-Computer Interaction

0 likes · 10 min read
