Tagged articles

303 articles

Nov 15, 2024 · Artificial Intelligence

Advances in Multimodal Large Models and Document Understanding Presented at the 2024 Global Machine Learning Conference (Beijing)

At the 2024 Global Machine Learning Conference in Beijing, 360 AI Research Institute showcased cutting‑edge multimodal large‑model research, fine‑grained open‑world object detection, and document understanding technologies, highlighting open‑source releases, real‑world deployments, and competitive achievements in AI competitions.

AI research · Multimodal AI · document understanding

0 likes · 7 min read

iQIYI Technical Product Team

Nov 7, 2024 · Artificial Intelligence

Multimodal Speaker Diarization for Long-Form Video Scripts

iQIYI’s multimodal speaker diarization system splits long‑form video using subtitle timestamps and scene detection, extracts voiceprints with a custom model, clusters them hierarchically, and applies an Active Speaker Detection algorithm combined with face recognition to assign speakers, achieving around 90% precision and recall and boosting downstream tasks such as summarization, translation, and dubbing.

Multimodal AI · dialogue recognition · iQIYI

0 likes · 8 min read

Alimama Tech

Nov 6, 2024 · Artificial Intelligence

How AI Generates Synchronized Video Narrations for E‑Commerce

This article presents the research behind Synchronized Video Storytelling, introducing the E‑SyncVidStory dataset, the VideoNarrator multimodal architecture, and extensive experiments that demonstrate high‑quality, product‑aware video narration generation for e‑commerce applications.

Dataset · LLM · Multimodal AI

0 likes · 12 min read

DataFunTalk

Nov 2, 2024 · Artificial Intelligence

Embodied Intelligence: Core Concepts, Three Elements, and Four Functional Modules

This article introduces embodied intelligence, explains its basic definition, three essential elements (body, intelligence, environment), and details the four functional modules—perception, decision, action, and feedback—while describing the sensors and algorithms that enable physical AI systems to interact with the real world.

AI robotics · Embodied Intelligence · Feedback Loop

0 likes · 13 min read

DaTaobao Tech

Nov 1, 2024 · Artificial Intelligence

Multimodal Large Model for Voucher Verification: Prompt Engineering and Fine‑Tuning

By leveraging multimodal large models such as GPT‑4o and fine‑tuned Qwen‑VL, the study builds a prompt‑engineered and SFT‑enhanced voucher verification system that classifies product categories, detects diverse defects, and estimates problem counts, achieving up to 90% accuracy and meeting real‑time business throughput requirements.

Multimodal AI · Prompt engineering · e‑commerce

0 likes · 10 min read

Tencent Cloud Developer

Oct 30, 2024 · Artificial Intelligence

Comprehensive Survey of AIGC Research: Papers, Resources, and Technical Overview

This survey acts as a comprehensive portal that organizes AIGC research across seven domains—text, image, and audio generation, cross‑modal association, text‑guided image and audio synthesis, and supporting resources—detailing seminal models such as GPT, Diffusion, CLIP, DALL·E, Stable Diffusion, MusicLM, and key papers that shaped each field.

AIGC · CLIP · Computer Vision

0 likes · 19 min read

Ops Development & AI Practice

Oct 4, 2024 · Artificial Intelligence

How ChatGPT 4.0 with Canvas Redefines Multimodal Human‑AI Interaction

ChatGPT 4.0 with Canvas introduces a visual "canvas" that blends language and graphics, enabling multimodal dialogue, real‑time visual feedback, and collaborative workflows across education, design, and business, while posing technical challenges in vision‑language integration, context consistency, and performance optimization.

AI applications · Canvas · ChatGPT

0 likes · 10 min read

Sohu Tech Products

Sep 25, 2024 · Artificial Intelligence

Multimodal AI-Powered Video Content Moderation System Using Chinese CLIP and Vector Search

The article describes a multimodal AI video moderation system built on Alibaba’s Chinese‑CLIP model and hybrid RedisSearch/ElasticSearch vector databases, enabling real‑time violation detection and historical recall, with fine‑tuned black‑market ad detection, FP16 quantization, and OpenVINO acceleration to boost speed and cut storage.

Chinese CLIP · Multimodal AI · OpenVINO optimization

0 likes · 16 min read

Ops Development & AI Practice

Sep 16, 2024 · Industry Insights

Why Mistral AI Is Shaping the Future of Open‑Source Large Language Models

Mistral AI, a French startup founded in 2023, leverages open‑source large language models, efficient architecture, and multimodal research to offer scalable AI solutions across enterprises, content creation, and healthcare, while pursuing a community‑driven strategy that positions it as a rising force in the competitive AI landscape.

AI industry · Mistral AI · Multimodal AI

0 likes · 9 min read

DataFunTalk

Sep 1, 2024 · Artificial Intelligence

Building Multi‑Scenario AI Assistants with Large Models at Huolala

Huolala, a logistics technology company, shares how it leverages large language models to create personal and office AI assistants across dozens of real‑world scenarios, detailing the underlying platform, prompt engineering, multimodal capabilities, multi‑agent coordination, and the resulting business empowerment.

AI assistants · Large Language Models · Multimodal AI

0 likes · 13 min read

Baobao Algorithm Notes

Aug 12, 2024 · Industry Insights

Why AI Search Is the New Swiss‑Army Knife: Baidu’s Edge in the Red‑Ocean Market

The article examines how AI‑enhanced search, driven by large language models, is reshaping the competitive landscape, highlights Baidu’s multi‑modal features and market positioning, and explains why intelligent search is emerging as a high‑potential, user‑centric product in a crowded industry.

AI agents · AI search · Baidu

0 likes · 10 min read

Kuaishou Tech

Jul 31, 2024 · Artificial Intelligence

Kuaishou’s Kolors Text‑to‑Image Model: Architecture, Evaluation, and Real‑World Applications

The article presents a comprehensive overview of Kuaishou’s Kolors (formerly 可图) multimodal generative model, detailing its data collection strategy, diffusion‑based architecture, evaluation metrics, derived capabilities such as prompt refinement and interactive generation, and a range of practical applications from AI‑powered live‑stream gifts to virtual try‑on, while also offering strategic advice for the domestic visual‑generation community.

AI applications · Diffusion Models · Kolors

0 likes · 27 min read

DataFunSummit

Jul 30, 2024 · Artificial Intelligence

Multimodal Mobile AI Agent (Mobile‑Agent): From V1 to V2 and Open‑Source Practice

This article introduces Alibaba Tongyi Lab's multimodal mobile AI agent, Mobile‑Agent, covering the background of large‑model agents, the design and capabilities of V1 and V2, the multi‑agent framework, evaluation results, open‑source resources, and future development directions.

AI Planning · Multi-Agent · Multimodal AI

0 likes · 13 min read

Java Tech Enthusiast

Jul 23, 2024 · Industry Insights

Can Baidu’s Orange Paper Outperform Kimi? A Deep Dive into AI Writing Tools

This article compares Baidu’s new AI writing platform Orange Paper with Kimi, evaluating their long‑text understanding, multimodal editing, document upload limits, outline generation, and overall usability for research and academic writing, highlighting Orange Paper’s advantages in knowledge retrieval, large‑scale content creation, and deep editing capabilities.

AI writing · Knowledge Retrieval · Long Text Generation

0 likes · 11 min read

DataFunSummit

Jul 18, 2024 · Artificial Intelligence

Tencent Music Tianqin Lab’s Practice and Applications of Audio Representation Large Models

This article reviews Tencent Music Tianqin Lab’s research on audio representation large models, covering background, the evolution of audio features, self‑supervised methods such as SimCLR, BYOL, MAE, MLM, benchmark results, multimodal extensions, and real‑world applications like song authenticity detection and search ranking.

Multimodal AI · Tencent Music · audio representation

0 likes · 20 min read

Kuaishou Tech

Jul 17, 2024 · Artificial Intelligence

Key Technical Innovations in Kuaishou’s “Kuaiyi” Large Model and Its Real-World Applications

The article details Kuaishou’s development of the 175B “Kuaiyi” multimodal large model, presenting eight novel technical innovations—from Temporal Scaling Law and MiLe Loss to MoE‑enhanced reward modeling—and describes how these advances enable high‑performance AI services such as the AI Xiao Kuai chatbot across diverse real‑world scenarios.

AI applications · Model Optimization · Multimodal AI

0 likes · 12 min read

DataFunSummit

Jun 23, 2024 · Artificial Intelligence

Tongyi Xingchen Personalized Large Model: Technical Overview and Applications

This article summarizes the development background of large language models, Alibaba's progression in foundational and personalized AI, the design and capabilities of the Tongyi Xingchen personalized model, its multimodal and agent-based architecture, various industry use cases, and the safety and responsibility measures applied to ensure trustworthy AI deployment.

AI Safety · Large Language Models · Multimodal AI

0 likes · 13 min read

21CTO

Jun 2, 2024 · Artificial Intelligence

Geoff Hinton on Scaling Laws, Multimodal AI, and the Future of Intelligence

In a candid interview, Geoff Hinton reflects on his AI journey—from early disappointments in physiology and philosophy to breakthroughs in neural networks, scaling laws, multimodal learning, fast‑weight concepts, and the ethical challenges shaping the future of artificial intelligence.

AI ethics · Deep Learning · Geoff Hinton

0 likes · 25 min read

NewBeeNLP

May 29, 2024 · Artificial Intelligence

How Ant’s Multimodal Team Boosted Video‑Text Retrieval by 24% and Cut Copyright Search Costs 85%

This article presents Ant Group's multimodal research on video retrieval, detailing a large Chinese video‑text pre‑training dataset, three techniques that raise video‑text semantic search performance by up to 24.5%, and an end‑to‑end video‑video copyright detection system that reduces storage by 85% and speeds up inference 18‑fold.

Multimodal AI · copyright detection · fine-grained modeling

0 likes · 40 min read

21CTO

May 23, 2024 · Artificial Intelligence

How xAI’s Grok 1.5V Adds Multimodal Image Input for Developers

xAI’s Grok 1.5V is set to support multimodal image input, allowing developers to upload pictures and receive text‑based answers via the Python SDK, marking a major upgrade that narrows the gap with leading models like GPT‑4 and signals a new frontier for AI chatbots.

AI chatbots · Multimodal AI · Python SDK

0 likes · 4 min read

21CTO

May 21, 2024 · Artificial Intelligence

How Google’s ScreenAI Could Redefine UI Understanding and UX Design

Google’s new ScreenAI visual‑language model, built on the PaLI architecture, can interpret user interfaces and infographics, answer UI‑related questions, generate summaries and navigate screens, and sets new benchmarks that may reshape future user‑experience research and applications.

Google AI · Multimodal AI · ScreenAI

0 likes · 9 min read

DataFunTalk

May 15, 2024 · Artificial Intelligence

Advances in Video Multimodal Retrieval: Video‑Text Semantic Search and Video‑Video Same‑Source Search

This article presents Ant Group's multimodal research on video retrieval, detailing video‑text semantic search and video‑video same‑source search, introducing a large Chinese pre‑training dataset, novel pre‑training, hard‑sample mining, fine‑grained modeling techniques, and an efficient end‑to‑end copyright detection framework.

Multimodal AI · copyright detection · fine-grained modeling

0 likes · 38 min read

Rare Earth Juejin Tech Community

May 15, 2024 · Artificial Intelligence

OpenAI Unveils GPT‑4o: An Omni‑Capable Multimodal Model Offered Free to All Users

OpenAI introduced GPT‑4o, a free, omni‑capable multimodal model that processes text, audio, and images together, delivers near‑human response latency, showcases impressive live demos, and will soon be available via a discounted API, marking a significant step forward in end‑to‑end AI research.

AI research · GPT-4o · Multimodal AI

0 likes · 7 min read

21CTO

May 14, 2024 · Artificial Intelligence

What Makes OpenAI’s New GPT‑4o a Game‑Changing Multimodal AI?

OpenAI’s latest flagship model GPT‑4o combines text, audio, image and video processing in a single, faster, cheaper multimodal system that delivers near‑human response times, expanded API access, and new safety measures, reshaping how developers and users interact with AI.

AI model · Audio Processing · GPT-4o

0 likes · 10 min read

JD Tech

Apr 29, 2024 · Artificial Intelligence

Relation-Aware Diffusion Models for Automated Poster Layout and Product Background Generation

This article presents JD Advertising's 2023 AI-driven framework that uses a relation‑aware diffusion model with visual‑text and geometric modules, combined with category‑common and personalized generators and a planning‑and‑rendering network, to automate high‑quality, scalable e‑commerce poster creation and background synthesis.

Diffusion Models · Image Generation · Multimodal AI

0 likes · 18 min read

Open Source Linux

Apr 16, 2024 · Artificial Intelligence

How Sora’s Text-to-Video Model Is Redefining AI‑Generated Video

Sora, a new text‑to‑video AI model, can create one‑minute videos from textual prompts or static images, delivering industry‑leading fidelity, resolution, and coherent motion by using spatial‑temporal patches inspired by ViViT, and shows emergent capabilities that hint at universal physical simulation.

Multimodal AI · Sora model · ViViT

0 likes · 4 min read

NewBeeNLP

Apr 13, 2024 · Artificial Intelligence

How a Multimodal ‘Joke‑King’ Model Beats GPT‑4 at Humor Generation

A research team from Sun Yat‑sen University, Sea AI Lab and Harvard built a multimodal large model that learns to generate creative jokes and memes by training on the Oogiri‑GO dataset, introducing a Leap‑of‑Thought (LoT) paradigm and CLoT fine‑tuning, which outperforms GPT‑4 and other state‑of‑the‑art models in humor tasks.

CLoT · Large Language Models · Leap-of-Thought

0 likes · 9 min read

JD Retail Technology

Mar 12, 2024 · Artificial Intelligence

Multimodal Large Models: Recent Advances, Industry Impact, and Challenges – An Expert Interview

In a detailed interview, Tsinghua researcher Zhao Sicheng and JD Retail senior director Peng Changping discuss the latest progress in multimodal large models, their practical applications in advertising and e‑commerce, persistent challenges such as hallucinations and data alignment, and the skills engineers need to thrive in the emerging AI era.

AI research · Multimodal AI · e‑commerce

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Mar 12, 2024 · Artificial Intelligence

AAAI‑2024 Highlights: Alibaba Cloud’s Deep Tabular Learning & Multi‑Modal Fusion

Alibaba Cloud’s AI platform PAI showcased four cutting‑edge papers at AAAI‑2024—introducing AMFormer for deep tabular learning via arithmetic feature interaction, MuLTI for efficient video‑language understanding, M2SD for few‑shot class‑incremental learning, and M2Doc for multi‑modal document layout analysis—demonstrating the platform’s growing impact on artificial‑intelligence research.

Deep Learning · Few‑Shot Learning · Multimodal AI

0 likes · 9 min read

NewBeeNLP

Mar 7, 2024 · Artificial Intelligence

How Sora is Redefining Large Vision Models: A Deep Dive into Technology, Limits, and Opportunities

This comprehensive review examines Sora, the first model capable of generating minute‑long, high‑quality videos from text, covering its historical background, core diffusion‑Transformer architecture, data preprocessing strategies, prompt engineering techniques, diverse applications, and the ethical and technical limitations that shape its future.

Multimodal AI · Prompt engineering · Sora

0 likes · 28 min read

Rare Earth Juejin Tech Community

Mar 7, 2024 · Artificial Intelligence

Anthropic Announces Claude 3 Model Family: Opus, Sonnet, and Haiku

Anthropic has launched the Claude 3 family of large language models (Opus, Sonnet, and Haiku), offering varying balances of intelligence, speed, and cost, with enhanced reasoning, multilingual, and vision capabilities, fewer refusals, and improved safety, now available via API in 159 countries.

AI Safety · Anthropic · Claude 3

0 likes · 11 min read

DataFunSummit

Mar 6, 2024 · Artificial Intelligence

Document Intelligence: Background, Technology, Large Models, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, technical evolution, large‑model advancements, and practical enterprise digital transformation use cases, with a focus on multimodal processing, unified document representation, and industry‑specific applications such as legal contract automation.

Document Intelligence · Enterprise Automation · Large Language Models

0 likes · 14 min read

Java Tech Enthusiast

Mar 5, 2024 · Artificial Intelligence

Claude 3 vs GPT‑4: A Deep Dive into the New AI Giant’s Multimodal Edge

Claude 3 has arrived, outperforming GPT‑4 across benchmark scores, offering free Sonnet and paid Opus tiers, and showcasing unprecedented multimodal, long‑context, and code‑generation abilities that reshape competitive dynamics in large‑language‑model research.

Anthropic · Claude 3 · Context Window

0 likes · 12 min read

AntTech

Mar 1, 2024 · Artificial Intelligence

Ant Group Unveils SkySense: 2.06‑Billion‑Parameter Multimodal Remote‑Sensing Foundation Model Accepted at CVPR 2024

Ant Group introduced SkySense, a 2.06‑billion‑parameter multimodal remote‑sensing foundation model that outperformed 18 international rivals across 17 benchmark tasks, was accepted to CVPR 2024, and aims to support applications such as agriculture, urban planning, and disaster response.

Ant Group · CVPR 2024 · Multimodal AI

0 likes · 6 min read

Architects' Tech Alliance

Feb 25, 2024 · Artificial Intelligence

How Sora Redefined Video Generation: Breakthroughs and Industry Impact

The article provides an in‑depth technical analysis of OpenAI's Sora, highlighting its 60‑second 1080p video generation capability, the novel patches‑vectorization and transformer training pipeline that leverages GPT‑generated prompts for multimodal alignment, and its potential to become a universal video‑generation base model that could reshape the AI industry.

AGI · Multimodal AI · Sora

0 likes · 6 min read

NewBeeNLP

Feb 17, 2024 · Artificial Intelligence

How Sora Highlights the Next Leap Toward AGI and Shifts AI Competition

The article analyzes OpenAI's Sora video model, arguing that its integration of large‑language‑model reasoning with diffusion techniques marks a major step toward true world understanding, reshapes creative workflows, widens the AI talent gap, and accelerates the path to artificial general intelligence.

AGI · AI trends · Large Language Models

0 likes · 7 min read

DataFunTalk

Feb 5, 2024 · Artificial Intelligence

Mobile-Agent: An Autonomous Multi‑Modal Mobile Device Agent with Visual Perception

The Mobile-Agent paper presents a vision‑only, autonomous multi‑modal AI system that can interpret user commands, locate UI elements on a smartphone screen, and execute complex tasks such as browsing, commenting, and content creation through a defined operation space, self‑planning, and self‑reflection mechanisms, achieving high success rates across diverse Chinese and English scenarios.

Mobile Automation · Multimodal AI · Visual Perception

0 likes · 7 min read

21CTO

Jan 31, 2024 · Artificial Intelligence

Unlocking LLaVA: A Hands‑On Guide to the Open‑Source Visual Language Model

This article introduces LLaVA (Large Language and Vision Assistant), an open‑source multimodal model that aims to replicate GPT‑4V‑style capabilities, explains its architecture, training process, and key features, and provides step‑by‑step instructions for using the web demo, running it locally via Ollama or HuggingFace, and building a simple Gradio chatbot with code examples.

Gradio · LLaVA · Multimodal AI

0 likes · 11 min read

DataFunSummit

Jan 20, 2024 · Artificial Intelligence

Cross‑Modal Video Open‑Tag Mining: Techniques, Methods, and Applications

The article presents a comprehensive overview of cross‑modal video open‑tag mining, detailing its technical background, related multimodal research methods, a four‑stage open‑tag solution from 360 AI Research Institute, and future application prospects such as unsupervised tag coverage, semantic retrieval, and content moderation.

Cross-modal · Multimodal AI · label extraction

0 likes · 15 min read

Xiaohongshu Tech REDtech

Jan 20, 2024 · Artificial Intelligence

Decoding Xiaohongshu’s Recommendation System: How Ordinary Users Gain Visibility

Xiaohongshu’s recommendation system uses large‑scale multimodal embeddings, dual‑tower and graph models, and diversity techniques like DPP and SSD to quickly surface high‑quality user‑generated content, enabling ordinary users to gain visibility while balancing personalization, exploration, and efficient LLM‑augmented pipelines.

Large Language Models · Multimodal AI · Xiaohongshu

0 likes · 15 min read

DataFunSummit

Jan 5, 2024 · Artificial Intelligence

Multimodal Large Model Platform: History, Architecture, Practices, and Future Outlook by Jiuzhang Yunji DataCanvas

This article reviews the evolution of multimodal large models, introduces Jiuzhang Yunji DataCanvas' multimodal model platform—including AI foundation software, model tools, serving, and prompt management—shares practical building methods, memory‑augmented models, ETL pipelines, knowledge‑base applications, and offers a forward‑looking perspective on enterprise data management and intelligent agents.

AI Foundation Software · Knowledge Base · Multimodal AI

0 likes · 14 min read

DataFunSummit

Jan 1, 2024 · Artificial Intelligence

Advances in Image and Video Enhancement, Quality Assessment, and Multimodal AI Techniques

This article reviews the latest research from Alibaba DAMO Academy on real-world image quality problems, covering spatial, temporal, and color enhancement methods, advanced quality assessment metrics, multimodal diffusion models, and future directions toward large‑model integration and lightweight deployment.

Deep Learning · MOS regression · Multimodal AI

0 likes · 24 min read

Programmer DD

Dec 8, 2023 · Artificial Intelligence

Is Google’s Gemini Demo a Staged Illusion? The Truth Behind the AI Showcase

The article examines Google’s Gemini multimodal AI demo, revealing that the striking video was largely fabricated using static image frames and engineered prompts, which misleads viewers about the model’s real‑time capabilities and raises concerns about trust in AI demonstrations.

AI demonstration · AI trust · Gemini

0 likes · 8 min read

21CTO

Dec 7, 2023 · Artificial Intelligence

Google Gemini vs GPT‑4: Can the New AI Model Outperform ChatGPT?

Google's Gemini AI suite, unveiled in December, brings three model sizes—Nano, Pro, and Ultra—to power Bard and other services, claims superior performance over GPT‑4 across most benchmarks, and introduces multimodal capabilities that signal a major shift in the AI landscape.

AI language model · GPT-4 comparison · Google Gemini

0 likes · 6 min read

DataFunTalk

Oct 19, 2023 · Artificial Intelligence

Multimodal Large Model Platform: History, Architecture, and Practice by Jiuzhang Yunji DataCanvas

This article presents Jiuzhang Yunji DataCanvas's insights and practices on multimodal large model platforms, covering their historical development, platform components such as AI Foundation Software and Prompt Manager, practical implementations like memory-augmented models and ETL pipelines, and future prospects for enterprise knowledge bases and agents.

AI Platform · Knowledge Base · Multimodal AI

0 likes · 13 min read

Bilibili Tech

Oct 13, 2023 · Artificial Intelligence

Multimodal Video High‑Energy Segment Extraction for Dynamic Video Covers

The authors present a multimodal system that automatically extracts high‑energy video segments for dynamic covers by analyzing subtitles, audio, visual frames, and danmu, employing LLM prompt‑tuning, scene‑cut detection, and aesthetic scoring to reduce manual effort and boost click‑through rates.

ASR · Multimodal AI · OCR

0 likes · 14 min read

21CTO

Sep 27, 2023 · Artificial Intelligence

How ChatGPT’s New Voice and Image Features Transform AI Interaction

OpenAI’s latest update adds multimodal voice and image capabilities to ChatGPT, letting users speak or upload pictures for more natural, context‑rich conversations powered by advanced GPT‑3.5 and GPT‑4 models.

AI assistants · ChatGPT · Multimodal AI

0 likes · 6 min read

DataFunSummit

Sep 5, 2023 · Artificial Intelligence

Document Intelligence: Background, Technology Stack, Large‑Model Advances, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, the evolution of related technologies, large‑model approaches such as multimodal pre‑training and domain‑specific models, and concrete enterprise use cases across various business functions.

Document Intelligence · Enterprise AI · Multimodal AI

0 likes · 14 min read

DataFunSummit

Aug 16, 2023 · Artificial Intelligence

Kuaipedia: Building a Short‑Video Encyclopedia with Multimodal Knowledge Extraction

This article introduces Kuaipedia, Kuaishou's multimodal short‑video encyclopedia, detailing its background, system architecture, knowledge‑video recognition pipeline, multimodal entity linking techniques, and downstream applications, while also providing implementation insights and a brief Q&A.

Kuaipedia · Multimodal AI · entity linking

0 likes · 12 min read

DataFunTalk

Aug 11, 2023 · Artificial Intelligence

Multimodal Dialogue Large Model mPLUG-Owl: Technology, Applications, and Evaluation

mPLUG-Owl is a modular multimodal dialogue large model from Alibaba DAMO Academy that builds on the mPLUG series, offering advanced image, video, OCR, and multilingual capabilities, with extensive evaluations showing superior performance over MiniGPT‑4, LLaVA, and other multimodal LLMs across various tasks.

Multimodal AI · evaluation · mPLUG-Owl

0 likes · 17 min read

Baidu Geek Talk

Jul 26, 2023 · Artificial Intelligence

Insights on AIGC Development and Commercial Applications by Baidu's Chief Architect

Baidu’s chief architect Li Shuanglong outlined how AIGC, driven by advanced large‑language and multimodal models, is already powering commercial tools such as automated copywriting, 2D digital‑human video creation and lead‑generation chatbots, while emphasizing future progress in engineering scalability, algorithmic fidelity, data quality, and scenario‑focused applications.

AI commercialization · AI research · AIGC

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Jul 10, 2023 · Artificial Intelligence

Alibaba’s PAI Platform Powers Three Groundbreaking ACL 2023 AI Papers

Three papers from Alibaba Cloud's PAI platform were selected for ACL 2023, showcasing FashionKLIP for e‑commerce image‑text retrieval, ConaCLIP's lightweight dual‑encoder distillation, and a fast domain‑specific diffusion model, all of which will be open‑sourced for the AI community.

ACL 2023 · Diffusion Models · Multimodal AI

0 likes · 8 min read

21CTO

Jul 8, 2023 · Artificial Intelligence

What Developers Need to Know About GPT‑4’s New 8K Context and Multimodal Capabilities

OpenAI has opened GPT‑4’s API to all paid users, offering an 8K‑token context window (up to 32K), multimodal image input, enhanced creativity, longer text handling, and upcoming fine‑tuning options, while also outlining phased deprecation of older models and current limitations.

AI Safety · API · GPT-4

0 likes · 10 min read

DataFunSummit

May 19, 2023 · Artificial Intelligence

Expert Roundtable on the Impact of GPT‑4 and Large Models on Knowledge Graphs

In this expert roundtable, leading AI researchers discuss GPT‑4’s multimodal breakthroughs, the future convergence of large models with knowledge graphs, practical integration strategies, and the evolving relevance of traditional NLP tasks, offering deep insights into the direction of artificial intelligence research.

Artificial Intelligence · GPT-4 · Knowledge Graphs

0 likes · 44 min read

AntTech

May 10, 2023 · Artificial Intelligence

Brainwave and Behavior Recognition: Multi‑Modal Biometric Authentication with Adversarial Contrastive Transfer Learning

This article presents Ant Security's research on two novel biometric methods, brainwave ("brainprint") recognition and behavior recognition, detailing their scientific background, data collection, multi‑modal deep‑learning algorithms, adversarial and contrastive training strategies, experimental results, and practical applications for inclusive, secure identity verification.

Multimodal AI · accessibility · adversarial learning

0 likes · 17 min read

ITPUB

Apr 14, 2023 · Artificial Intelligence

How Do Generative, Perceptual, and Decision AI Interact? Insights from Jina AI’s Founder

In this interview, Jina AI’s founder Shao Han examines the relationships among generative, perceptual, and decision AI, compares single‑modal and multimodal approaches, discusses large language model development, and evaluates the impact of ChatGPT on search and future AI commercialization.

AI commercialization · Large Language Models · Multimodal AI

0 likes · 11 min read

Python Programming Learning Circle

Apr 3, 2023 · Artificial Intelligence

Key Highlights of GPT‑4: Multimodal Capabilities, Benchmark Performance, and Future Implications

GPT‑4, the new multimodal AI model, can process images and text, generate code and natural language, achieve human‑level scores on standardized exams, handle up to 32K tokens, and demonstrate advanced reasoning, while OpenAI emphasizes its safety improvements and current limitations as a still‑emerging technology.

AI Safety · GPT-4 · Multimodal AI

0 likes · 6 min read

DataFunTalk

Mar 27, 2023 · Artificial Intelligence

GPT-4 Shows Early Signs of Artificial General Intelligence: Insights from the "Sparks of AGI" Paper

A recent 154‑page Microsoft paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT‑4" argues that GPT‑4, despite being an early prototype, already exhibits many capabilities—multimodal reasoning, programming, mathematics, and human‑like interaction—suggesting it may be an early form of AGI, though experts highlight significant limitations and ongoing debates.

AI Evaluation · Artificial General Intelligence · GPT-4

0 likes · 15 min read

Baidu Geek Talk

Mar 23, 2023 · Artificial Intelligence

Advanced Image Search in Baidu Netdisk: Semantic Vector Retrieval and Multi-Modal Fusion

Baidu Netdisk’s new image search combines ERNIE‑ViL‑based semantic vectors, cross‑modal matching and metadata such as timestamps, GPS and facial tags, using LSH‑optimized indexing to let users find specific photos among billions with natural‑language queries, delivering faster, more accurate results without manual tagging.

ERNIE-ViL · LSH hashing · Multimodal AI

0 likes · 11 min read

Programmer DD

Mar 19, 2023 · Artificial Intelligence

How Visual ChatGPT Adds Image Interaction to ChatGPT – A Deep Dive

Microsoft's open‑source Visual ChatGPT extends ChatGPT with the ability to send and receive images. This article explains its multimodal architecture, demo scenarios, and the visual models it uses, points to the arXiv paper, and notes its rapid growth in popularity on GitHub.

LLM · Microsoft · Multimodal AI

0 likes · 4 min read

Huawei Cloud Developer Alliance

Mar 18, 2023 · Artificial Intelligence

Unveiling NetEase’s ‘YuZhi’ Multimodal Model: Boosting Personalized Recommendations

NetEase’s Fuxi team developed the multimodal ‘YuZhi’ model, a large‑scale image‑text dual‑tower system optimized with the EET inference framework, which powers personalized recommendations in NetEase News and Cloud Music, while a partnership with Huawei Ascend AI and MindSpore enables further model acceleration, compression, and the new ‘YuZhi‑Wukong’ model that improves video recommendation metrics by about 5%.

Huawei Ascend AI · Large Model · MindSpore

0 likes · 5 min read

Tencent Cloud Developer

Mar 16, 2023 · Artificial Intelligence

What Makes GPT‑4 a Game‑Changer? 10 Expert Insights on Its Capabilities and Impact

This article provides a detailed analysis of GPT‑4, covering its multimodal abilities, performance gains, training innovations, safety improvements, new application scenarios, impact on developers, and future trends in large language models.

AI Safety · GPT-4 · LLM trends

0 likes · 16 min read

21CTO

Mar 15, 2023 · Artificial Intelligence

What Makes OpenAI’s New GPT‑4 a Game‑Changer for Multimodal AI?

OpenAI’s GPT‑4, a multimodal large language model that accepts text and image inputs, powers ChatGPT and Bing, offers improved creativity and problem‑solving while still facing hallucination risks, and is now available via ChatGPT Plus and an open API for developers.

AI Safety · GPT-4 · Multimodal AI

0 likes · 5 min read

21CTO

Mar 11, 2023 · Artificial Intelligence

Microsoft Announces Multimodal GPT-4: A New ‘iPhone Moment’ for AI

Microsoft Germany's CTO announced the imminent release of a multimodal GPT‑4, highlighting its ability to process text, images and video, while executives liken the breakthrough to an “iPhone moment” for AI, emphasizing new capabilities, industry disruption, and responsible data use.

AI Development · GPT-4 · Large Language Models

0 likes · 6 min read

Programmer DD

Feb 18, 2023 · Artificial Intelligence

Alibaba’s New Multimodal ChatGPT Rival: How the Tongyi Model Achieves Unified AI

Alibaba’s internally tested “DAMO Academy version of ChatGPT” showcases a multimodal AI that combines text, image, code, and creative generation, built on the Tongyi large model’s unified architecture, while other Chinese tech giants rush to launch their own ChatGPT‑style products.

ChatGPT · Multimodal AI · Tongyi

0 likes · 7 min read

Architect

Feb 18, 2023 · Artificial Intelligence

Paradigm Shifts in Large Language Models: From Pre‑training to AGI and Future Research Directions

The article reviews the evolution of large language models, highlighting two major paradigm shifts after GPT‑3, the role of scaling laws, knowledge acquisition, prompting techniques, reasoning abilities, and outlines future research priorities for building more capable and efficient AI systems.

AI reasoning · In-Context Learning · Model Scaling

0 likes · 71 min read

AntTech

Jan 18, 2023 · Artificial Intelligence

Ant Security's Tianjian Content Risk Control System Receives Five‑Star Rating in 2022 Content Review Service Evaluation

On January 17, the China Academy of Information and Communications Technology announced that Ant Security's self‑developed Tianjian multimodal content risk control system achieved the highest five‑star rating in both text and image assessments of the 2022 content review service evaluation, highlighting its advanced AI‑driven moderation capabilities.

Image Analysis · Multimodal AI · ant security

0 likes · 4 min read

DataFunSummit

Dec 19, 2022 · Artificial Intelligence

Multimodal Large‑Model Driven Virtual Digital Humans: Background, Methods, and Applications

This article introduces the rapid development of multimodal digital humans powered by large AI models, covering their background, current challenges, NeRF‑GAN based modeling methods, multimodal dialogue capabilities, and real‑world application cases such as virtual assistants, tourism guides, and sign‑language avatars.

AIGC · Human-Computer Interaction · Large Model

0 likes · 14 min read

21CTO

Dec 15, 2022 · Artificial Intelligence

Sam Altman & Reid Hoffman on AI’s Future: Business, Multimodal Models, Society

In a candid conversation, Sam Altman and Reid Hoffman explore the next stage of AI, discussing commercial opportunities of large language models, the rise of AI‑plus applications in science and the metaverse, future directions such as multimodal and continuously learning models, and the societal challenges of AGI, wealth distribution and universal basic income.

AGI · AI commercialization · Large Language Models

0 likes · 16 min read

NetEase Smart Enterprise Tech+

Dec 14, 2022 · Artificial Intelligence

Boosting AI Efficiency in Digital Content Risk Control: Insights from QCon

In this interview, NetEase AI expert Li Yuke shares how lightweight, cost‑effective AI solutions improve digital content risk control, audio‑video processing, and conversational systems, while discussing technical committees, data standards, and future AI trends such as multimodal and unsupervised learning.

AI efficiency · AI production · Multimodal AI

0 likes · 11 min read

DataFunSummit

Dec 9, 2022 · Artificial Intelligence

Volcano Engine Virtual Digital Human Technology Overview

This article provides a comprehensive overview of Volcano Engine's virtual digital human platform, detailing its definition, AI‑driven and human‑driven classifications, 2D and 3D technical architectures, multi‑modal perception, interaction capabilities, application scenarios, and future development directions.

2D avatar · 3D Avatar · Computer Vision

0 likes · 15 min read

DataFunSummit

Nov 26, 2022 · Artificial Intelligence

Multimodal Digital Human Driving: Motionverse Engine and Metaverse Applications

This article introduces the evolution of digital human technology, explains the five maturity levels (L1‑L5), describes the Motionverse multimodal motion‑generation platform and its large‑scale data and AI models, and outlines SDK integration strategies for diverse metaverse scenarios.

Metaverse · Multimodal AI · motion generation

0 likes · 11 min read

Xiaohongshu Tech REDtech

Nov 25, 2022 · Artificial Intelligence

Youth AI Technology Salon: Multimodal Learning, AIGC, and Career Guidance

At the REDtech Youth AI Technology Salon in Beijing, leading AI experts and top university students discussed the evolution of multimodal learning, Xiaohongshu’s practical applications, autonomous‑driving perception, and offered career guidance, emphasizing solid fundamentals, user value, and opportunities within Xiaohongshu’s talent‑development programs.

AI talent development · AIGC · Multimodal AI

0 likes · 16 min read

Tencent Cloud Developer

Nov 11, 2022 · Artificial Intelligence

Tencent Advertising Multimedia AI Technology: Research and Application

Liu Wei outlines Tencent’s Advertising Multimedia AI ecosystem on the Taiji platform, describing a five‑platform matrix—Jue for content understanding, Qiankun for automated video creation, Shenzhen for AI‑driven review, Tianyin for hierarchical fingerprinting, and Hunyuan as a multimodal large model—featuring innovations such as massive multimodal pre‑training, logo retrieval, QA‑style attribute extraction, spatiotemporal video analysis, advanced auto‑judgment, and high‑performance hashing that achieve top cross‑modal retrieval results.

Computer Vision · Multimodal AI · advertising technology

0 likes · 18 min read

Xiaohongshu Tech REDtech

Nov 11, 2022 · Artificial Intelligence

Large-Scale Deep Learning Systems and Their Application at Xiaohongshu (RED)

Xiaohongshu’s in‑house LarC platform powers real‑time, multimodal recommendation, life‑search, and generative‑AI commercial content for its 200 million‑user community by processing billions of daily feedback samples, employing conflict‑free parameter servers, diversified sequence modeling, and large‑scale representation learning to deliver personalized, fresh, and diverse user experiences.

AI Infrastructure · Machine Learning Platform · Multimodal AI

0 likes · 13 min read

DataFunSummit

Oct 9, 2022 · Artificial Intelligence

Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

The article introduces the GIT image‑to‑text (image captioning) model, explains its transformer‑based architecture, showcases multiple example outputs, discusses training details, compares its performance with Flamingo and on the COCO benchmark, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

GIT model · Image Captioning · Multimodal AI

0 likes · 12 min read

DataFunTalk

Sep 24, 2022 · Artificial Intelligence

Cross‑Modal Image‑Text Representation: The Zero Dataset and R2D2 Pre‑training Framework

This article introduces the importance of image‑text cross‑modal representation, presents the Chinese Zero dataset with two pre‑training subsets and five downstream tasks, describes the R2D2 dual‑tower‑plus‑single‑tower pre‑training framework with multiple loss functions, and reports extensive experiments and real‑world deployment insights.

Cross-modal · Multimodal AI · R2D2 framework

0 likes · 19 min read

Alimama Tech

Aug 17, 2022 · Artificial Intelligence

How Multimodal AI Transforms Advertising Copy: From Image Text to Video Scripts

Alibaba’s advertising AI team presents a comprehensive study of four new multimodal copywriting tasks—image overlay text generation, video narration, text style transfer, and detail-page extraction—detailing model architectures, training on billions of images, experimental results, and practical deployment in the “Xiyu” product.

Large-Scale Training · Multimodal AI · Style Transfer

0 likes · 17 min read

DataFunSummit

Apr 14, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human Technology: Construction, Performance, Interaction, and the MMTK Multimodal Algorithm Library

This article reviews Alibaba's digital‑human (virtual avatar) research over the past few years, covering the product’s evolution, a six‑stage pipeline for building digital humans, solutions to key challenges in realism, multimodal interaction, and the open‑source MMTK algorithm library.

Digital Human · Emotion Modeling · Multimodal AI

0 likes · 12 min read

DataFunTalk

Mar 26, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human (XiaoMi) Technology: Development, Construction, and Interaction

This article reviews Alibaba's XiaoMi digital human technology, covering its evolution since 2019, a six‑stage pipeline for building avatars, methods to enhance emotional, textual, vocal, and motion expressiveness, and approaches for improving long‑term interactive capabilities such as controllable script generation, multimodal QA, sign‑language translation, and intelligent behavior decision, culminating in the release of the MMTK multimodal algorithm library.

Digital Human · Emotion Modeling · Multimodal AI

0 likes · 11 min read

iQIYI Technical Product Team

Feb 25, 2022 · Artificial Intelligence

Short Video Content Tagging: Multimodal AI Model Framework and Applications

The framework tags short videos by fusing text, image and audio‑video features through specialized extraction, classification, generative and retrieval modules, then ranking candidates with a multimodal BERT model, delivering accurate, business‑specific tags that boost recommendation, search and advertising.

Deep Learning · Multimodal AI · content tagging

0 likes · 10 min read

DataFunTalk

Jan 3, 2022 · Artificial Intelligence

How Taobao’s Cutting‑Edge 3D XR, Multimodal AI, and Low‑Carbon Tech Redefined Double 11

Taobao’s Double 11 showcased a suite of advanced technologies—including 3D immersive XR live rooms, large‑scale multimodal product search, ultra‑low‑latency streaming, AI‑driven content moderation, and low‑carbon model compression—demonstrating how e‑commerce can innovate while promoting sustainability and inclusivity.

3D XR · E‑commerce Innovation · Multimodal AI

0 likes · 11 min read

Volcano Engine Developer Services

Oct 12, 2021 · Artificial Intelligence

How ByteDance’s AI‑Powered Audio Signal Processing Elevates Voice, VR, and VoIP

This article reviews ByteDance’s intelligent audio signal processing technologies, covering foundational algorithms, multimodal audio scaling, sound‑field reconstruction, and high‑quality low‑latency VoIP, and explains how these advances improve audio capture, immersive media, and smart voice interaction across devices.

AR/VR audio · Multimodal AI · VoIP

0 likes · 13 min read

DataFunTalk

Sep 14, 2021 · Artificial Intelligence

Multimodal and Human‑Computer Interaction Technologies for E‑commerce Live Streaming: From Q&A to Live Broadcast

This talk explores how multimodal AI, knowledge‑graph‑enhanced script generation, and advanced reading‑comprehension techniques enable virtual anchors to transform e‑commerce live streaming from simple Q&A bots into interactive, content‑rich live broadcasts, addressing challenges of material sourcing, personalization, and low‑latency response.

Content Generation · LiveQA · Multimodal AI

0 likes · 19 min read

DataFunTalk

Aug 14, 2021 · Artificial Intelligence

Multimodal Advertisement Detection System for WeChat "KanKan" Articles

This article introduces a multimodal advertisement detection framework for WeChat KanKan that decomposes the problem into text, image, and article‑structure dimensions, presents novel models for ad text and image recognition, and describes how sequence classification and visualisation are used to filter severe ad‑spam articles.

Image Classification · Multimodal AI · WeChat

0 likes · 16 min read

ITPUB

Jun 25, 2021 · Artificial Intelligence

How Alibaba’s Low‑Carbon M6 Model Trains a Trillion‑Parameter AI with 80% Less Energy

Alibaba’s DAMO Academy unveiled the low‑carbon M6 multimodal model, a trillion‑parameter AI trained on just 480 V100 GPUs, achieving over 80% energy reduction and 11‑fold speedup compared to prior trillion‑parameter efforts, and already powering e‑commerce and manufacturing design tools.

GPU efficiency · Large Model · M6

0 likes · 5 min read

Baidu Geek Talk

Jun 21, 2021 · Artificial Intelligence

Detecting Pornographic Videos with Dual‑Modal AI: Images + Audio

This article presents a technical overview of a multimodal AI framework that combines image and audio analysis to identify pornographic video content, detailing model architectures, feature extraction methods, and experimental results achieving 93.4% accuracy on a 3,000‑sample test set.

Audio Analysis · Deep Learning · Multimodal AI

0 likes · 6 min read

DataFunTalk

Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · Distributed Training · GPU clustering

0 likes · 17 min read

DataFunSummit

Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

Distributed Training · GPU cluster · Model Serving

0 likes · 18 min read

DataFunTalk

Feb 9, 2021 · Artificial Intelligence

Multimodal AI Research: Video-Aware Dialog, Dual-Channel Reasoning, and Multimodal Machine Translation

This article surveys recent multimodal AI research, covering video scene‑aware dialog with a GPT‑2 based unified pre‑training framework, dual‑channel multi‑hop reasoning for visual dialog, capsule‑network‑enhanced multimodal machine translation, and graph‑neural‑network‑driven multimodal translation, highlighting experimental results and future directions.

Graph Neural Network · Multimodal AI · Multimodal Learning

0 likes · 12 min read

JD Tech

Feb 2, 2021 · Artificial Intelligence

Advances and Trends in Multimodal Digital Content Generation and Automatic Text Summarization

The article reviews recent research on multimodal digital content generation and automatic text summarization, outlining the evolution from extractive to abstractive methods, highlighting four key technology trends such as pretrained language models, transformer dominance, knowledge‑enhanced generation, and multimodal‑knowledge joint modeling, and describing an industrial e‑commerce application built on these advances.

Generative Models · Multimodal AI · e‑commerce

0 likes · 12 min read

DataFunTalk

Oct 22, 2020 · Artificial Intelligence

Analyzing Video Excitement: Methods, Frameworks, and Applications

This article presents a comprehensive overview of video excitement analysis, covering quality, aesthetics, and narrative factors, describing a multimodal framework with supervised, weakly supervised, and multi‑task models, and illustrating practical applications such as preview generation, clipping, and automatic cover creation.

Multimodal AI · Weak Supervision · content recommendation

0 likes · 14 min read

Meituan Technology Team

Oct 15, 2020 · Artificial Intelligence

Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

The paper introduces the Answer‑Driven Visual State Estimator (ADVSE), which uses answer‑driven focusing attention and conditional visual information fusion to dynamically incorporate answers into visual dialogue, overcoming static encoding limitations and achieving state‑of‑the‑art performance on the GuessWhat?! question‑generation and guessing tasks.

Attention Mechanism · Multimodal AI · State Estimation

0 likes · 10 min read

DataFunTalk

Jul 31, 2020 · Artificial Intelligence

WeChat 'Kan Kan' Content Understanding: Architecture and Techniques for Recommendation

This article details the technical architecture behind WeChat's 'Kan Kan' content understanding platform, covering text and multimedia analysis, tag extraction, entity recognition, knowledge graph construction, and how these components enhance recommendation recall, ranking, and user engagement across the ecosystem.

Multimodal AI · Recommendation Systems · content understanding

0 likes · 46 min read

Xianyu Technology

Jul 9, 2020 · Product Management

Xianyu Product Structuring: Evolution, Current Strategies, and Future Directions

Xianyu’s product‑information structuring has progressed from simple text mining to multimodal AI pipelines that now boost coverage by nearly 50%, while facing precision and engineering hurdles, and it plans to adopt a standardized VID attribute system, plug‑in multimodal models, and rule‑based input assistance to enable seamless, photo‑driven publishing.

Multimodal AI · data engineering · e‑commerce

0 likes · 10 min read

Suning Technology

Apr 9, 2020 · Artificial Intelligence

Affective Computing in Retail: Boosting Customer Experience with Emotion AI

This article explores the development and application of affective computing in the retail sector, covering its psychological foundations, emotion recognition algorithms for facial expressions, speech, and text, multimodal fusion techniques, market players, and future prospects for enhancing shopper experiences, staff service quality, and sales performance.

Affective Computing · Emotion Recognition · Multimodal AI

0 likes · 20 min read

JD Tech Talk

Mar 9, 2020 · Artificial Intelligence

Advances in Deep Learning for Content Recommendation and User Behavior Modeling by JD Digits

The article reviews recent deep‑learning breakthroughs in personalized content recommendation, covering news and e‑commerce systems, JD Digits' multi‑dimensional user behavior prediction models, knowledge‑graph meta‑learning, and the impact of multimodal AI on future recommendation technologies.

Deep Learning · Multimodal AI · Recommendation Systems

0 likes · 6 min read

Amap Tech

Dec 6, 2019 · Artificial Intelligence

Semantic Understanding of Merchant Signboards for Automatic POI Name Generation at Amap

Amap's POI naming automation uses a two-stage cascade model: Stage 1 extracts token and sentence features with POS tags and domain-adapted BERT‑POI; Stage 2 employs a Bi‑LSTM to model line relationships, achieving over 95% semantic accuracy and 3‑6% recall improvements, thereby enhancing automatic signboard‑based POI name generation.

BERT · LSTM · Multimodal AI

0 likes · 7 min read

iQIYI Technical Product Team

Apr 12, 2019 · Artificial Intelligence

iQIYI Multimodal Technology: Datasets, Applications, and Future Directions

iQIYI leverages multimodal AI—combining audio, visual, and textual cues—to advance video understanding, releasing the world’s largest celebrity dataset (iQIYI‑VID), powering applications such as actor‑focused playback, AI Radar, emoji generation, and rapid automated editing, while pursuing future research in emoji captioning, cross‑modal retrieval, visual question answering, and broader health‑care and education uses.

Datasets · Multimodal AI · iQIYI

0 likes · 13 min read
