Tagged articles
38 articles
Machine Heart
Apr 30, 2026 · Artificial Intelligence

How DeepSeek’s Visual‑Primitive Paradigm Redefines Multimodal Reasoning

DeepSeek has released a multimodal model built on a visual‑primitive reasoning paradigm that treats coordinates and bounding boxes as reasoning units, dramatically compresses visual tokens, and achieves state‑of‑the‑art performance on counting, spatial, and topological tasks, while exposing current limits of multimodal inference.

AI reasoning · Compressed Sparse Attention · DeepSeek
0 likes · 12 min read
Machine Heart
Apr 24, 2026 · Artificial Intelligence

Vision Banana Shows That Image Generation Equals Understanding – DeepMind’s GPT‑like Leap

DeepMind’s Vision Banana model demonstrates that large‑scale image‑generation pre‑training can produce powerful, universal visual representations, achieving state‑of‑the‑art results on segmentation, depth, and normal estimation without task‑specific heads, thereby supporting the hypothesis that generation and understanding are fundamentally linked.

DeepMind · Vision Banana · generative AI
0 likes · 13 min read
PaperAgent
Apr 14, 2026 · Artificial Intelligence

Can Neural Computers Replace Traditional CPUs? Inside the Latest AI Harness Designs

This article analyzes the emerging concept of Neural Computers, explains how Harness engineering unifies compute, memory, and I/O into a single learned runtime, reviews recent multimodal models from Anthropic, Meta, and OpenAI, and presents detailed experimental results from the NCCLIGen and NCGUIWorld prototypes.

Neural computer · harness design · multimodal models
0 likes · 8 min read
DataFunTalk
Apr 7, 2026 · Artificial Intelligence

How a Champion Quantized a 150 GB Multimodal Model in Just 4 Hours

In a four‑hour competition, algorithm engineer Zhang Zhen from a Chinese EV company detailed his end‑to‑end workflow for quantizing the massive Qwen3‑Next‑80B model, covering sensitive‑layer analysis, iterative smoothing, fallback strategies, and parallel "horse‑race" debugging that led his team to win the GeekDay challenge.

Iterative Smooth · Model Quantization · large language models
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
Mar 31, 2026 · Artificial Intelligence

Unified Multimodal Modeling: How LongCat-Next Bridges Understanding and Generation

The article analyzes why text models naturally combine understanding and generation, explains the fundamental conflicts that prevent images from sharing the same tokenization, and details LongCat-Next’s discrete autoregressive approach—using SAE visual encoders, residual vector quantization, and a unified LLM backbone—to achieve a single model that can both comprehend and create multimodal content.

LongCat-Next · RVQ · dNaViT
0 likes · 21 min read
Woodpecker Software Testing
Mar 15, 2026 · Industry Insights

Five Major AI Testing Tool Trends Shaping 2026

A 2026 study of 137 leading tech firms reveals that AI is deeply embedded across the software testing lifecycle, replacing manual exploration with intent‑understanding, autonomous verification, and causal attribution, and outlines five concrete trends—from native AI test engines to edge‑cloud collaborative architectures and AI‑on‑AI trust verification.

AI testing · AI trust · Software quality
0 likes · 9 min read
AIWalker
Mar 13, 2026 · Artificial Intelligence

Towards AI That Truly Understands Art: Introducing the ArtiMuse Aesthetic Understanding Model

ArtiMuse, a new image aesthetic model unveiled at CVPR 2026 by Shanghai AI Lab and the China Academy of Art, combines a massive 10K fine‑grained dataset, a Token‑As‑Score scoring scheme, and unified textual‑and‑numeric feedback to deliver culturally aware, expert‑level art analysis and robust quantitative ratings.

AI aesthetics · Token-As-Score · art analysis
0 likes · 7 min read
AIWalker
Mar 5, 2026 · Artificial Intelligence

How ViDA-UGC Leverages Large Multimodal Models for Fine-Grained Visual Quality Assessment

The article introduces ViDA-UGC, a large‑scale UGC visual‑quality dataset and its companion benchmark ViDA‑Bench, explains the MILP‑driven sampling, expert annotation pipeline, and CoT‑based evaluation framework, and shows how fine‑tuning popular multimodal LLMs on this data markedly improves low‑level quality perception, grounding, and description capabilities.

Benchmark · Dataset · chain-of-thought
0 likes · 12 min read
AI Frontier Lectures
Feb 6, 2026 · Artificial Intelligence

Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?

The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates pure‑text and visually‑grounded inference modes within a single model, and presents the two‑stage AdaVaR training framework with a novel AdaGRPO reinforcement‑learning algorithm to automatically select the optimal mode for each visual‑language task, achieving consistent gains across eight benchmarks and surpassing strong baselines including GPT‑4o.

AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning
0 likes · 16 min read
Baidu Geek Talk
Feb 2, 2026 · Artificial Intelligence

How Cloud AI Infra Powers the Next Wave of Embodied Intelligence

This article outlines the rapid rise of embodied intelligence, the explosion of Vision‑Language‑Action (VLA) research, and how cloud‑based AI infrastructure—including multi‑level IaaS, data pipelines, dual‑system model designs, and reinforcement‑learning workflows—addresses emerging scaling and deployment challenges.

VLA · multimodal models · reinforcement learning
0 likes · 13 min read
PaperAgent
Dec 10, 2025 · Artificial Intelligence

How AI Agents Like UFO, Mobile-Agent, and UI-TARS Are Shaping 2025 Smartphones

The article examines the underlying GUI‑Agent technologies behind the 2025 “Doubao” smartphone, comparing Microsoft’s UFO series, Alibaba’s Mobile‑Agent v2/v3, and ByteDance’s UI‑TARS, detailing their model foundations, input modalities, action spaces, planning mechanisms, learning strategies, open‑source status, and multi‑agent frameworks.

AI agents · GUI automation · Mobile AI
0 likes · 8 min read
Tencent Technical Engineering
Sep 6, 2025 · Artificial Intelligence

ARC Lab’s Blueprint: Turning Multimodal AI Research into Real-World Impact

The article outlines ARC Lab’s evolution from its 2019 founding as an internal corporate research unit to a high‑impact AI team that pursues difficult multimodal understanding and generation problems, measures success through a technology‑impact funnel, publishes 30‑40 top‑tier papers annually, and translates research into open‑source tools and products that drive academic, industry, business, and societal value.

AI research · corporate research · multimodal models
0 likes · 19 min read
ZhongAn Tech Team
Aug 11, 2025 · Artificial Intelligence

What’s New in AI? GPT‑5, SWE‑Swiss, Agentic Web, and More This Week

This week’s tech roundup highlights major AI breakthroughs—including OpenAI’s GPT‑5 launch, the SWE‑Swiss code‑fixing model from Peking University and ByteDance, Pinduoduo’s AI talent hiring surge, the emerging Agentic Web paradigm, Google’s Genie 3 world model, multimodal railway design AI, DJI’s first robot vacuum, AI‑enhanced smart glasses, and a new humanoid robot perception system—all reflecting rapid advances across generative, multimodal, and applied AI.

AI · AI hiring · Agentic Web
0 likes · 20 min read
AI Frontier Lectures
Jul 27, 2025 · Information Security

Can Hidden Activations Expose Multimodal Model Jailbreaks?

The paper reveals that large multimodal language models retain refusal signals in their hidden states even after jailbreak attempts, and proposes a training‑free detection method that leverages these signals to identify unsafe inputs across text and image modalities with strong generalization.

AI Safety · LVLM security · hidden activation analysis
0 likes · 7 min read
AIWalker
Jun 30, 2025 · Artificial Intelligence

Chinese Team Builds First AI That Understands Film, Using 440K Shot Library for Director‑Level Camera Moves

FilMaster is a pioneering AI system that learns cinematic principles from a 440,000‑shot movie database, combines multimodal LLMs, RAG, and audience‑centric rhythm control to generate editable, high‑quality films, and outperforms prior methods by over 50% on the new FilmEval benchmark.

AI film generation · FilmEval benchmark · Retrieval Augmented Generation
0 likes · 18 min read
21CTO
Jun 19, 2025 · Artificial Intelligence

How ByteDance’s Seedance 1.0 Outperforms Google’s Veo 3 in AI Video Generation

ByteDance’s newly released Seedance 1.0, a bilingual text‑to‑video and image‑to‑video model, surpasses Google’s Veo 3 in visual consistency, motion realism, and inference speed, achieving top rankings on multiple benchmarks while requiring significantly less compute time per 1080p clip.

AI video generation · benchmark comparison · inference speed
0 likes · 7 min read
AntTech
Jun 15, 2025 · Artificial Intelligence

21 Ant Research Papers Shaping CVPR 2025: AI Image & Video Generation Breakthroughs

The Interactive Intelligence Lab of Ant Technology Research Institute presented 21 accepted CVPR 2025 papers covering visual generation, editing, 3D vision, digital humans and multimodal AI, highlighting tools such as MagicQuill, Lumos, Aurora, FLARE, LeviTor, MangaNinja, AniDoc, Mimir, AvatarArtist, DiffListener, MotionStone, TensorialGaussianAvatars, DualTalk, CompreCap and Uni-AD.

CVPR2025 · Computer Vision · Video Generation
0 likes · 20 min read
AntTech
May 30, 2025 · Artificial Intelligence

Insights from Ant Group’s 10th Technical Open Day: Multimodal, Embodied, and Future Model Architectures for AGI

Ant Group’s 10th Technical Open Day gathered leading AI experts who examined the current state and future directions of multimodal large models, embodied AI, world models, transformer architectures, and vertical applications, offering a comprehensive view of the challenges and opportunities on the path toward AGI.

AGI · AI Safety · Embodied AI
0 likes · 16 min read
DevOps
May 6, 2025 · Artificial Intelligence

PPTAgent: An Open‑Source AI System for Automated Presentation Generation Using a Two‑Stage Editing Approach

PPTAgent, an open‑source AI tool jointly developed by the Chinese Academy of Sciences and Shanghai Jiexin Technology, automatically creates high‑quality PowerPoint slides by analyzing reference decks, extracting layout patterns, and iteratively editing content with a self‑correction mechanism, achieving superior content, design, and coherence scores compared to existing methods.

AI · PPTAgent · multimodal models
0 likes · 6 min read
Baidu MEUX
Apr 28, 2025 · Artificial Intelligence

Top 10 AI Model Breakthroughs of 2024: From ChatGPT‑4o to 3D Digital Humans

This article surveys the latest AI breakthroughs, covering ChatGPT‑4o's native image generation, Runway's Gen‑4 video model, Midjourney V7, AnimeGamer's infinite anime simulation, JiMeng 3.0 poster creator, ComfyUI‑Copilot workflow assistant, DomoAI's voice‑image digital humans, Ready AI web builder, DeepSeek‑V3, and Alibaba's ultra‑realistic 3D digital human model.

AI · Video Generation · digital humans
0 likes · 8 min read
Alipay Experience Technology
Apr 25, 2025 · Artificial Intelligence

Creating Lifelike Talking Avatars from Voice and Photo with EchoMimic

This article introduces EchoMimic V1 and V2, open‑source generative digital‑human systems that turn a single voice clip and a portrait photo into synchronized talking avatars, covering their technical background, architecture, training strategies, performance comparisons, and potential application scenarios.

digital avatar · generative AI · multimodal models
0 likes · 13 min read
JavaScript
Mar 20, 2025 · Artificial Intelligence

How MiniMax’s Linear‑Attention Architecture Is Redefining Long‑Context AI Models

MiniMax’s rapid 2025 releases—including a video model, open‑source LLM, and high‑fidelity voice model—showcase its multimodal linear‑attention architecture that handles up to 4 million tokens, earns a16z recognition, and signals China’s growing influence in open‑source AI innovation.

Linear Attention · artificial intelligence · large language models
0 likes · 8 min read
Alibaba Cloud Big Data AI Platform
Mar 7, 2025 · Artificial Intelligence

How Pai‑Megatron‑Patch Boosts Qwen2‑VL Multimodal Training Efficiency

This article explains how the Pai‑Megatron‑Patch toolkit enhances the usability and training performance of the Qwen2‑VL multimodal large model by introducing model‑parallel weight conversion, user‑friendly data loading, visual feature processing optimizations, optimizer offloading, and pipeline parallelism techniques, supported by extensive experimental analysis.

Megatron · Pipeline Parallelism · Qwen2-VL
0 likes · 25 min read
Rare Earth Juejin Tech Community
Nov 29, 2024 · Big Data

How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray

The article details ByteDance's use of Ray and RayData to construct scalable audio and video data processing pipelines for multimodal AI models, addressing challenges of massive data volume, resource constraints, and fault tolerance through pipeline design, RayCore enhancements, and custom scheduling optimizations.

AI · Big Data · ByteDance
0 likes · 16 min read
HyperAI Super Neural
Nov 20, 2024 · Artificial Intelligence

From Computer Vision to Medical AI: Prof. Xie's Work Hits Nature, NeurIPS, CVPR

Professor Xie's team at Shanghai Jiao Tong University reports rapid progress in AI for Science, detailing multimodal medical AI models, large open datasets, language and vision‑language models, and knowledge‑enhanced representations that outperform existing baselines across multiple benchmarks.

Knowledge Graphs · Open Datasets · large language models
0 likes · 14 min read
IT Services Circle
Jun 9, 2024 · Artificial Intelligence

Plagiarism Allegations Between Stanford's Llama3‑V and China's MiniCPM‑Llama3‑V 2.5 Model

The article details the controversy surrounding Stanford's Llama3‑V team admitting to copying the architecture and code of the Chinese MiniCPM‑Llama3‑V 2.5 model, presents new evidence of weight similarity, compares performance metrics, and discusses broader concerns about the recognition of Chinese AI research in the open‑source community.

AI ethics · Llama3-V · MiniCPM
0 likes · 9 min read
Tencent Tech
Oct 20, 2023 · Artificial Intelligence

Tencent OCR's AI Triumph at ICDAR 2023: Four Championship Wins

At ICDAR 2023, Tencent's OCR team leveraged self‑developed algorithms and large‑model backbones to clinch four official championship titles across the DSText and SVRD tracks, showcasing breakthroughs in dense video text detection, tracking, end‑to‑end recognition, and structured information extraction.

ICDAR 2023 · OCR · Structured Information Extraction
0 likes · 14 min read
DataFunSummit
Jun 23, 2023 · Artificial Intelligence

Frontiers of Video Action Recognition: Concepts, Algorithms, and Applications

This article introduces video action recognition, covering its basic definition, downstream tasks, major algorithmic families—including CNN‑based, Vision‑Transformer, self‑supervised, and multimodal approaches—and discusses practical deployment scenarios and open challenges in the field.

CNN · Vision Transformer · multimodal models
0 likes · 16 min read
Kuaishou Tech
Apr 23, 2023 · Artificial Intelligence

Kuaishou & Renmin AI Institute: Driving Multimodal Large Model Innovation

The article details how Kuaishou’s multimodal AI research, including its K7 trillion‑parameter model and VLUA algorithm, partners with Renmin University’s Gaoling AI Institute to launch a joint lab, produce cutting‑edge papers such as WebBrain and ChatImg, and advance recommendation and search technologies across the short‑video ecosystem.

AI · Industry collaboration · Recommendation Systems
0 likes · 17 min read
Kuaishou Large Model
Mar 31, 2023 · Artificial Intelligence

How Kuaishou Elevates Video Quality and AI Performance at NVIDIA GTC 2023

At NVIDIA GTC 2023, Kuaishou engineers unveiled cutting‑edge solutions ranging from video quality assessment and enhancement, 3D digital‑human live streaming, a custom TensorRT‑based performance framework, large‑scale recommendation model acceleration, to multimodal massive‑model deployment for short‑video scenarios.

AI Optimization · Digital Human · Recommendation Systems
0 likes · 9 min read
JD Retail Technology
Dec 12, 2022 · Artificial Intelligence

Keynote Presentations from the 2022 Global AI Technology Conference – First Industrial Vision Frontier Forum

The 2022 Global AI Technology Conference’s First Industrial Vision Frontier Forum in Hangzhou gathered leading experts to discuss advances in industrial AI visual defect detection, multimodal pre‑training models, smart meteorology, digital intelligence in retail, third‑generation compound semiconductor detection, meta‑imaging, and broader industrial AI applications, highlighting the future of intelligent manufacturing.

AI · Industrial Vision · Meta Imaging
0 likes · 12 min read
DataFunTalk
Nov 23, 2022 · Artificial Intelligence

Lightweight Adaptation Techniques for Multimodal Large Models

This article presents a comprehensive overview of lightweight adaptation methods—including language, domain, and optimization‑goal adapters and structured prompts—to overcome language mismatch, low domain fit, and objective differences when deploying open‑source multimodal large models in real‑world AI applications.

AI · Adapter · Model Adaptation
0 likes · 14 min read
Zuoyebang Tech Team
Aug 12, 2022 · Artificial Intelligence

How End-to-End Speech Recognition is Transforming AI Voice Applications

The AISummit AI conference highlighted advances in intelligent voice, with experts from ZuoYeBang, ByteDance, Microsoft and others discussing end‑to‑end speech recognition, pronunciation correction, and high‑quality speech synthesis, and exploring how multimodal pre‑trained models will shape the future of voice AI.

AI Conference · end-to-end AI · intelligent voice
0 likes · 6 min read