Tagged articles
303 articles
360 Tech Engineering
Nov 15, 2024 · Artificial Intelligence

Advances in Multimodal Large Models and Document Understanding Presented at the 2024 Global Machine Learning Conference (Beijing)

At the 2024 Global Machine Learning Conference in Beijing, 360 AI Research Institute showcased cutting‑edge multimodal large‑model research, fine‑grained open‑world object detection, and document understanding technologies, highlighting open‑source releases, real‑world deployments, and competitive achievements in AI competitions.

AI research · Multimodal AI · document understanding
0 likes · 7 min read
iQIYI Technical Product Team
Nov 7, 2024 · Artificial Intelligence

Multimodal Speaker Diarization for Long-Form Video Scripts

iQIYI’s multimodal speaker diarization system splits long‑form video using subtitle timestamps and scene detection, extracts voiceprints with a custom model, clusters them hierarchically, and applies an Active Speaker Detection algorithm combined with face recognition to assign speakers, achieving around 90% precision and recall and boosting downstream tasks such as summarization, translation, and dubbing.

Multimodal AI · dialogue recognition · iQIYI
0 likes · 8 min read
Alimama Tech
Nov 6, 2024 · Artificial Intelligence

How AI Generates Synchronized Video Narrations for E‑Commerce

This article presents the research behind Synchronized Video Storytelling, introducing the E‑SyncVidStory dataset, the VideoNarrator multimodal architecture, and extensive experiments that demonstrate high‑quality, product‑aware video narration generation for e‑commerce applications.

Dataset · LLM · Multimodal AI
0 likes · 12 min read
DataFunTalk
Nov 2, 2024 · Artificial Intelligence

Embodied Intelligence: Core Concepts, Three Elements, and Four Functional Modules

This article introduces embodied intelligence, explaining its basic definition and three essential elements (body, intelligence, environment), and details the four functional modules (perception, decision, action, and feedback) while describing the sensors and algorithms that enable physical AI systems to interact with the real world.

AI robotics · Embodied Intelligence · Feedback Loop
0 likes · 13 min read
DaTaobao Tech
Nov 1, 2024 · Artificial Intelligence

Multimodal Large Model for Voucher Verification: Prompt Engineering and Fine‑Tuning

By leveraging multimodal large models such as GPT‑4o and fine‑tuned Qwen‑VL, the study builds a prompt‑engineered and SFT‑enhanced voucher verification system that classifies product categories, detects diverse defects, and estimates problem counts, achieving up to 90% accuracy and meeting real‑time business throughput requirements.

Multimodal AI · Prompt engineering · e‑commerce
0 likes · 10 min read
Tencent Cloud Developer
Oct 30, 2024 · Artificial Intelligence

Comprehensive Survey of AIGC Research: Papers, Resources, and Technical Overview

This survey acts as a comprehensive portal that organizes AIGC research across seven domains—text, image, and audio generation, cross‑modal association, text‑guided image and audio synthesis, and supporting resources—detailing seminal models such as GPT, Diffusion, CLIP, DALL·E, Stable Diffusion, MusicLM, and key papers that shaped each field.

AIGC · CLIP · Computer Vision
0 likes · 19 min read
Ops Development & AI Practice
Oct 4, 2024 · Artificial Intelligence

How ChatGPT 4.0 with Canvas Redefines Multimodal Human‑AI Interaction

ChatGPT 4.0 with Canvas introduces a visual "canvas" that blends language and graphics, enabling multimodal dialogue, real‑time visual feedback, and collaborative workflows across education, design, and business, while posing technical challenges in vision‑language integration, context consistency, and performance optimization.

AI applications · Canvas · ChatGPT
0 likes · 10 min read
Sohu Tech Products
Sep 25, 2024 · Artificial Intelligence

Multimodal AI-Powered Video Content Moderation System Using Chinese CLIP and Vector Search

The article describes a multimodal AI video moderation system built on Alibaba’s Chinese‑CLIP model and hybrid RedisSearch/ElasticSearch vector databases, enabling real‑time violation detection and historical recall, with fine‑tuned black‑market ad detection, FP16 quantization, and OpenVINO acceleration to boost speed and cut storage.

Chinese CLIP · Multimodal AI · OpenVINO optimization
0 likes · 16 min read
Ops Development & AI Practice
Sep 16, 2024 · Industry Insights

Why Mistral AI Is Shaping the Future of Open‑Source Large Language Models

Mistral AI, a French startup founded in 2023, leverages open‑source large language models, efficient architecture, and multimodal research to offer scalable AI solutions across enterprises, content creation, and healthcare, while pursuing a community‑driven strategy that positions it as a rising force in the competitive AI landscape.

AI industry · Mistral AI · Multimodal AI
0 likes · 9 min read
DataFunTalk
Sep 1, 2024 · Artificial Intelligence

Building Multi‑Scenario AI Assistants with Large Models at Huolala

Huolala, a logistics technology company, shares how it leverages large language models to create personal and office AI assistants across dozens of real‑world scenarios, detailing the underlying platform, prompt engineering, multimodal capabilities, multi‑agent coordination, and the resulting business empowerment.

AI assistants · Large Language Models · Multimodal AI
0 likes · 13 min read
Kuaishou Tech
Jul 31, 2024 · Artificial Intelligence

Kuaishou’s Kolors Text‑to‑Image Model: Architecture, Evaluation, and Real‑World Applications

The article presents a comprehensive overview of Kuaishou’s Kolors (formerly 可图) multimodal generative model, detailing its data collection strategy, diffusion‑based architecture, evaluation metrics, derived capabilities such as prompt refinement and interactive generation, and a range of practical applications from AI‑powered live‑stream gifts to virtual try‑on, while also offering strategic advice for the domestic visual‑generation community.

AI applications · Diffusion Models · Kolors
0 likes · 27 min read
Java Tech Enthusiast
Jul 23, 2024 · Industry Insights

Can Baidu’s Orange Paper Outperform Kimi? A Deep Dive into AI Writing Tools

This article compares Baidu’s new AI writing platform Orange Paper with Kimi, evaluating their long‑text understanding, multimodal editing, document upload limits, outline generation, and overall usability for research and academic writing, highlighting Orange Paper’s advantages in knowledge retrieval, large‑scale content creation, and deep editing capabilities.

AI writing · Knowledge Retrieval · Long Text Generation
0 likes · 11 min read
DataFunSummit
Jul 18, 2024 · Artificial Intelligence

Tencent Music Tianqin Lab’s Practice and Applications of Audio Representation Large Models

This article reviews Tencent Music Tianqin Lab’s research on audio representation large models, covering background, the evolution of audio features, self‑supervised methods such as SimCLR, BYOL, MAE, MLM, benchmark results, multimodal extensions, and real‑world applications like song authenticity detection and search ranking.

Multimodal AI · Tencent Music · audio representation
0 likes · 20 min read
Kuaishou Tech
Jul 17, 2024 · Artificial Intelligence

Key Technical Innovations in Kuaishou’s “Kuaiyi” Large Model and Its Real-World Applications

The article details Kuaishou’s development of the 175B “Kuaiyi” multimodal large model, presenting eight novel technical innovations—from Temporal Scaling Law and MiLe Loss to MoE‑enhanced reward modeling—and describes how these advances enable high‑performance AI services such as the AI Xiao Kuai chatbot across diverse real‑world scenarios.

AI applications · Model Optimization · Multimodal AI
0 likes · 12 min read
DataFunSummit
Jun 23, 2024 · Artificial Intelligence

Tongyi Xingchen Personalized Large Model: Technical Overview and Applications

This article summarizes the development background of large language models, Alibaba's progression in foundational and personalized AI, the design and capabilities of the Tongyi Xingchen personalized model, its multimodal and agent-based architecture, various industry use cases, and the safety and responsibility measures applied to ensure trustworthy AI deployment.

AI Safety · Large Language Models · Multimodal AI
0 likes · 13 min read
21CTO
Jun 2, 2024 · Artificial Intelligence

Geoff Hinton on Scaling Laws, Multimodal AI, and the Future of Intelligence

In a candid interview, Geoff Hinton reflects on his AI journey—from early disappointments in physiology and philosophy to breakthroughs in neural networks, scaling laws, multimodal learning, fast‑weight concepts, and the ethical challenges shaping the future of artificial intelligence.

AI ethics · Deep Learning · Geoff Hinton
0 likes · 25 min read
NewBeeNLP
May 29, 2024 · Artificial Intelligence

How Ant’s Multimodal Team Boosted Video‑Text Retrieval by 24% and Cut Copyright Search Costs 85%

This article presents Ant Group's multimodal research on video retrieval, detailing a large Chinese video‑text pre‑training dataset, three techniques that raise video‑text semantic search performance by up to 24.5%, and an end‑to‑end video‑video copyright detection system that reduces storage by 85% and speeds up inference 18‑fold.

Multimodal AI · copyright detection · fine-grained modeling
0 likes · 40 min read
21CTO
May 23, 2024 · Artificial Intelligence

How xAI’s Grok 1.5V Adds Multimodal Image Input for Developers

xAI’s Grok 1.5V is set to support multimodal image input, allowing developers to upload pictures and receive text‑based answers via the Python SDK, marking a major upgrade that narrows the gap with leading models like GPT‑4 and signals a new frontier for AI chatbots.

AI chatbots · Multimodal AI · Python SDK
0 likes · 4 min read
21CTO
May 21, 2024 · Artificial Intelligence

How Google’s ScreenAI Could Redefine UI Understanding and UX Design

Google’s new ScreenAI visual‑language model, built on the PaLI architecture, can interpret user interfaces and infographics, answer UI‑related questions, generate summaries and navigate screens, and sets new benchmarks that may reshape future user‑experience research and applications.

Google AI · Multimodal AI · ScreenAI
0 likes · 9 min read
DataFunTalk
May 15, 2024 · Artificial Intelligence

Advances in Video Multimodal Retrieval: Video‑Text Semantic Search and Video‑Video Same‑Source Search

This article presents Ant Group's multimodal research on video retrieval, detailing video‑text semantic search and video‑video same‑source search, introducing a large Chinese pre‑training dataset, novel pre‑training, hard‑sample mining, fine‑grained modeling techniques, and an efficient end‑to‑end copyright detection framework.

Multimodal AI · copyright detection · fine-grained modeling
0 likes · 38 min read
Rare Earth Juejin Tech Community
May 15, 2024 · Artificial Intelligence

OpenAI Unveils GPT‑4o: An Omni‑Capable Multimodal Model Offered Free to All Users

OpenAI introduced GPT‑4o, a free, omni‑capable multimodal model that processes text, audio, and images together, delivers near‑human response latency, showcases impressive live demos, and will soon be available via a discounted API, marking a significant step forward in end‑to‑end AI research.

AI research · GPT-4o · Multimodal AI
0 likes · 7 min read
21CTO
May 14, 2024 · Artificial Intelligence

What Makes OpenAI’s New GPT‑4o a Game‑Changing Multimodal AI?

OpenAI’s latest flagship model GPT‑4o combines text, audio, image and video processing in a single, faster, cheaper multimodal system that delivers near‑human response times, expanded API access, and new safety measures, reshaping how developers and users interact with AI.

AI model · Audio Processing · GPT-4o
0 likes · 10 min read
JD Tech
Apr 29, 2024 · Artificial Intelligence

Relation-Aware Diffusion Models for Automated Poster Layout and Product Background Generation

This article presents JD Advertising's 2023 AI-driven framework that uses a relation‑aware diffusion model with visual‑text and geometric modules, combined with category‑common and personalized generators and a planning‑and‑rendering network, to automate high‑quality, scalable e‑commerce poster creation and background synthesis.

Diffusion Models · Image Generation · Multimodal AI
0 likes · 18 min read
Open Source Linux
Apr 16, 2024 · Artificial Intelligence

How Sora’s Text-to-Video Model Is Redefining AI‑Generated Video

Sora, a new text‑to‑video AI model, can create one‑minute videos from textual prompts or static images, delivering industry‑leading fidelity, resolution, and coherent motion by using spatial‑temporal patches inspired by ViViT, and shows emergent capabilities that hint at universal physical simulation.

Multimodal AI · Sora model · ViViT
0 likes · 4 min read
NewBeeNLP
Apr 13, 2024 · Artificial Intelligence

How a Multimodal ‘Joke‑King’ Model Beats GPT‑4 at Humor Generation

A research team from Sun Yat‑sen University, Sea AI Lab and Harvard built a multimodal large model that learns to generate creative jokes and memes by training on the Oogiri‑GO dataset, introducing a Leap‑of‑Thought (LoT) paradigm and CLoT fine‑tuning, which outperforms GPT‑4 and other state‑of‑the‑art models in humor tasks.

CLoT · Large Language Models · Leap-of-Thought
0 likes · 9 min read
JD Retail Technology
Mar 12, 2024 · Artificial Intelligence

Multimodal Large Models: Recent Advances, Industry Impact, and Challenges – An Expert Interview

In a detailed interview, Tsinghua researcher Zhao Sicheng and JD Retail senior director Peng Changping discuss the latest progress in multimodal large models, their practical applications in advertising and e‑commerce, persistent challenges such as hallucinations and data alignment, and the skills engineers need to thrive in the emerging AI era.

AI research · Multimodal AI · e‑commerce
0 likes · 19 min read
Alibaba Cloud Big Data AI Platform
Mar 12, 2024 · Artificial Intelligence

AAAI‑2024 Highlights: Alibaba Cloud’s Deep Tabular Learning & Multi‑Modal Fusion

Alibaba Cloud’s AI platform PAI showcased four cutting‑edge papers at AAAI‑2024—introducing AMFormer for deep tabular learning via arithmetic feature interaction, MuLTI for efficient video‑language understanding, M2SD for few‑shot class‑incremental learning, and M2Doc for multi‑modal document layout analysis—demonstrating the platform’s growing impact on artificial‑intelligence research.

Deep Learning · Few‑Shot Learning · Multimodal AI
0 likes · 9 min read
NewBeeNLP
Mar 7, 2024 · Artificial Intelligence

How Sora is Redefining Large Vision Models: A Deep Dive into Technology, Limits, and Opportunities

This comprehensive review examines Sora, the first model capable of generating minute‑long, high‑quality videos from text, covering its historical background, core diffusion‑Transformer architecture, data preprocessing strategies, prompt engineering techniques, diverse applications, and the ethical and technical limitations that shape its future.

Multimodal AI · Prompt engineering · Sora
0 likes · 28 min read
DataFunSummit
Mar 6, 2024 · Artificial Intelligence

Document Intelligence: Background, Technology, Large Models, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, technical evolution, large‑model advancements, and practical enterprise digital transformation use cases, with a focus on multimodal processing, unified document representation, and industry‑specific applications such as legal contract automation.

Document Intelligence · Enterprise Automation · Large Language Models
0 likes · 14 min read
AntTech
Mar 1, 2024 · Artificial Intelligence

Ant Group Unveils SkySense: 2.06‑Billion‑Parameter Multimodal Remote‑Sensing Foundation Model Accepted at CVPR 2024

Ant Group introduced SkySense, a 2.06‑billion‑parameter multimodal remote‑sensing foundation model that outperformed 18 international rivals across 17 benchmark tasks, was accepted to CVPR 2024, and aims to support applications such as agriculture, urban planning, and disaster response.

Ant Group · CVPR 2024 · Multimodal AI
0 likes · 6 min read
Architects' Tech Alliance
Feb 25, 2024 · Artificial Intelligence

How Sora Redefined Video Generation: Breakthroughs and Industry Impact

The article provides an in‑depth technical analysis of OpenAI's Sora, highlighting its 60‑second 1080p video generation capability, the novel patches‑vectorization and transformer training pipeline that leverages GPT‑generated prompts for multimodal alignment, and its potential to become a universal video‑generation base model that could reshape the AI industry.

AGI · Multimodal AI · Sora
0 likes · 6 min read
NewBeeNLP
Feb 17, 2024 · Artificial Intelligence

How Sora Highlights the Next Leap Toward AGI and Shifts AI Competition

The article analyzes OpenAI's Sora video model, arguing that its integration of large‑language‑model reasoning with diffusion techniques marks a major step toward true world understanding, reshapes creative workflows, widens the AI talent gap, and accelerates the path to artificial general intelligence.

AGI · AI trends · Large Language Models
0 likes · 7 min read
DataFunTalk
Feb 5, 2024 · Artificial Intelligence

Mobile-Agent: An Autonomous Multi‑Modal Mobile Device Agent with Visual Perception

The Mobile-Agent paper presents a vision‑only, autonomous multi‑modal AI system that can interpret user commands, locate UI elements on a smartphone screen, and execute complex tasks such as browsing, commenting, and content creation through a defined operation space, self‑planning, and self‑reflection mechanisms, achieving high success rates across diverse Chinese and English scenarios.

Mobile Automation · Multimodal AI · Visual Perception
0 likes · 7 min read
21CTO
Jan 31, 2024 · Artificial Intelligence

Unlocking LLaVA: A Hands‑On Guide to the Open‑Source Visual Language Model

This article introduces LLaVA (Large Language and Vision Assistant), an open‑source multimodal model that aims to replicate GPT‑4V capabilities, explains its architecture, training process, and key features, and provides step‑by‑step instructions for using the web demo, running it locally via Ollama or HuggingFace, and building a simple Gradio chatbot with code examples.

Gradio · LLaVA · Multimodal AI
0 likes · 11 min read
DataFunSummit
Jan 20, 2024 · Artificial Intelligence

Cross‑Modal Video Open‑Tag Mining: Techniques, Methods, and Applications

The article presents a comprehensive overview of cross‑modal video open‑tag mining, detailing its technical background, related multimodal research methods, a four‑stage open‑tag solution from 360 AI Research Institute, and future application prospects such as unsupervised tag coverage, semantic retrieval, and content moderation.

Cross-modal · Multimodal AI · label extraction
0 likes · 15 min read
Xiaohongshu Tech REDtech
Jan 20, 2024 · Artificial Intelligence

Decoding Xiaohongshu’s Recommendation System: How Ordinary Users Gain Visibility

Xiaohongshu’s recommendation system uses large‑scale multimodal embeddings, dual‑tower and graph models, and diversity techniques like DPP and SSD to quickly surface high‑quality user‑generated content, enabling ordinary users to gain visibility while balancing personalization, exploration, and efficient LLM‑augmented pipelines.

Large Language Models · Multimodal AI · Xiaohongshu
0 likes · 15 min read
DataFunSummit
Jan 5, 2024 · Artificial Intelligence

Multimodal Large Model Platform: History, Architecture, Practices, and Future Outlook by Jiuzhang Yunji DataCanvas

This article reviews the evolution of multimodal large models, introduces Jiuzhang Yunji DataCanvas' multimodal model platform—including AI foundation software, model tools, serving, and prompt management—shares practical building methods, memory‑augmented models, ETL pipelines, knowledge‑base applications, and offers a forward‑looking perspective on enterprise data management and intelligent agents.

AI Foundation Software · Knowledge Base · Multimodal AI
0 likes · 14 min read
DataFunSummit
Jan 1, 2024 · Artificial Intelligence

Advances in Image and Video Enhancement, Quality Assessment, and Multimodal AI Techniques

This article reviews the latest research from Alibaba DAMO Academy on real-world image quality problems, covering spatial, temporal, and color enhancement methods, advanced quality assessment metrics, multimodal diffusion models, and future directions toward large‑model integration and lightweight deployment.

Deep Learning · MOS regression · Multimodal AI
0 likes · 24 min read
Programmer DD
Dec 8, 2023 · Artificial Intelligence

Is Google’s Gemini Demo a Staged Illusion? The Truth Behind the AI Showcase

The article examines Google’s Gemini multimodal AI demo, revealing that the striking video was largely fabricated using static image frames and engineered prompts, which misleads viewers about the model’s real‑time capabilities and raises concerns about trust in AI demonstrations.

AI demonstration · AI trust · Gemini
0 likes · 8 min read
21CTO
Dec 7, 2023 · Artificial Intelligence

Google Gemini vs GPT‑4: Can the New AI Model Outperform ChatGPT?

Google's Gemini AI suite, unveiled in December 2023, brings three model sizes (Nano, Pro, and Ultra) to power Bard and other services, claims superior performance over GPT‑4 across most benchmarks, and introduces multimodal capabilities that signal a major shift in the AI landscape.

AI language model · GPT-4 comparison · Google Gemini
0 likes · 6 min read
DataFunTalk
Oct 19, 2023 · Artificial Intelligence

Multimodal Large Model Platform: History, Architecture, and Practice by Jiuzhang Yunji DataCanvas

This article presents Jiuzhang Yunji DataCanvas's insights and practices on multimodal large model platforms, covering their historical development, platform components such as AI Foundation Software and Prompt Manager, practical implementations like memory-augmented models and ETL pipelines, and future prospects for enterprise knowledge bases and agents.

AI Platform · Knowledge Base · Multimodal AI
0 likes · 13 min read
Bilibili Tech
Oct 13, 2023 · Artificial Intelligence

Multimodal Video High‑Energy Segment Extraction for Dynamic Video Covers

The authors present a multimodal system that automatically extracts high‑energy video segments for dynamic covers by analyzing subtitles, audio, visual frames, and danmu, employing LLM prompt‑tuning, scene‑cut detection, and aesthetic scoring to reduce manual effort and boost click‑through rates.

ASR · Multimodal AI · OCR
0 likes · 14 min read
21CTO
Sep 27, 2023 · Artificial Intelligence

How ChatGPT’s New Voice and Image Features Transform AI Interaction

OpenAI’s latest update adds multimodal voice and image capabilities to ChatGPT, letting users speak or upload pictures for more natural, context‑rich conversations powered by advanced GPT‑3.5 and GPT‑4 models.

AI assistants · ChatGPT · Multimodal AI
0 likes · 6 min read
DataFunSummit
Sep 5, 2023 · Artificial Intelligence

Document Intelligence: Background, Technology Stack, Large‑Model Advances, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, the evolution of related technologies, large‑model approaches such as multimodal pre‑training and domain‑specific models, and concrete enterprise use cases across various business functions.

Document Intelligence · Enterprise AI · Multimodal AI
0 likes · 14 min read
DataFunSummit
Aug 16, 2023 · Artificial Intelligence

Kuaipedia: Building a Short‑Video Encyclopedia with Multimodal Knowledge Extraction

This article introduces Kuaipedia, Kuaishou's multimodal short‑video encyclopedia, detailing its background, system architecture, knowledge‑video recognition pipeline, multimodal entity linking techniques, and downstream applications, while also providing implementation insights and a brief Q&A.

Kuaipedia · Multimodal AI · entity linking
0 likes · 12 min read
DataFunTalk
Aug 11, 2023 · Artificial Intelligence

Multimodal Dialogue Large Model mPLUG-Owl: Technology, Applications, and Evaluation

mPLUG-Owl is a modular multimodal dialogue large model from Alibaba DAMO Academy that builds on the mPLUG series, offering advanced image, video, OCR, and multilingual capabilities, with extensive evaluations showing superior performance over MiniGPT‑4, LLaVA, and other multimodal LLMs across various tasks.

Multimodal AI · evaluation · mPLUG-Owl
0 likes · 17 min read
Baidu Geek Talk
Jul 26, 2023 · Artificial Intelligence

Insights on AIGC Development and Commercial Applications by Baidu's Chief Architect

Baidu’s chief architect Li Shuanglong outlined how AIGC, driven by advanced large‑language and multimodal models, is already powering commercial tools such as automated copywriting, 2D digital‑human video creation and lead‑generation chatbots, while emphasizing future progress in engineering scalability, algorithmic fidelity, data quality, and scenario‑focused applications.

AI commercialization · AI research · AIGC
0 likes · 8 min read
DataFunSummit
May 19, 2023 · Artificial Intelligence

Expert Roundtable on the Impact of GPT‑4 and Large Models on Knowledge Graphs

In this expert roundtable, leading AI researchers discuss GPT‑4’s multimodal breakthroughs, the future convergence of large models with knowledge graphs, practical integration strategies, and the evolving relevance of traditional NLP tasks, offering deep insights into the direction of artificial intelligence research.

Artificial Intelligence · GPT-4 · Knowledge Graphs
0 likes · 44 min read
AntTech
May 10, 2023 · Artificial Intelligence

Brainwave and Behavior Recognition: Multi‑Modal Biometric Authentication with Adversarial Contrastive Transfer Learning

This article presents Ant Security's research on novel biometric methods—brainwave (脑纹) and behavior recognition—detailing their scientific background, data collection, multi‑modal deep‑learning algorithms, adversarial and contrastive training strategies, experimental results, and practical applications for inclusive, secure identity verification.

Multimodal AI · accessibility · adversarial learning
0 likes · 17 min read
ITPUB
Apr 14, 2023 · Artificial Intelligence

How Do Generative, Perceptual, and Decision AI Interact? Insights from Jina AI’s Founder

In this interview, Jina AI’s founder Shao Han examines the relationships among generative, perceptual, and decision AI, compares single‑modal and multimodal approaches, discusses large language model development, and evaluates the impact of ChatGPT on search and future AI commercialization.

AI commercialization · Large Language Models · Multimodal AI
0 likes · 11 min read
Python Programming Learning Circle
Apr 3, 2023 · Artificial Intelligence

Key Highlights of GPT‑4: Multimodal Capabilities, Benchmark Performance, and Future Implications

GPT‑4, the new multimodal AI model, can process images and text, generate code and natural language, achieve human‑level scores on standardized exams, handle up to 32K tokens, and demonstrate advanced reasoning, while OpenAI emphasizes its safety improvements and current limitations as a still‑emerging technology.

AI Safety · GPT-4 · Multimodal AI
0 likes · 6 min read
DataFunTalk
Mar 27, 2023 · Artificial Intelligence

GPT-4 Shows Early Signs of Artificial General Intelligence: Insights from the "Sparks of AGI" Paper

A recent 154‑page Microsoft paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT‑4" argues that GPT‑4, despite being an early prototype, already exhibits many capabilities—multimodal reasoning, programming, mathematics, and human‑like interaction—suggesting it may be an early form of AGI, though experts highlight significant limitations and ongoing debates.

AI Evaluation · Artificial General Intelligence · GPT-4
0 likes · 15 min read
Baidu Geek Talk
Mar 23, 2023 · Artificial Intelligence

Advanced Image Search in Baidu Netdisk: Semantic Vector Retrieval and Multi-Modal Fusion

Baidu Netdisk’s new image search combines ERNIE‑ViL‑based semantic vectors, cross‑modal matching and metadata such as timestamps, GPS and facial tags, using LSH‑optimized indexing to let users find specific photos among billions with natural‑language queries, delivering faster, more accurate results without manual tagging.

ERNIE-ViL · LSH hashing · Multimodal AI
0 likes · 11 min read
Programmer DD
Mar 19, 2023 · Artificial Intelligence

How Visual ChatGPT Adds Image Interaction to ChatGPT – A Deep Dive

Microsoft's open‑source Visual ChatGPT extends ChatGPT with the ability to send and receive images. This article explains its multimodal architecture, demo scenarios, and the visual models it uses, points to the arXiv paper, and notes its rapid rise in popularity on GitHub.

LLM · Microsoft · Multimodal AI
0 likes · 4 min read
Huawei Cloud Developer Alliance
Mar 18, 2023 · Artificial Intelligence

Unveiling NetEase’s ‘YuZhi’ Multimodal Model: Boosting Personalized Recommendations

NetEase’s Fuxi team developed the multimodal ‘YuZhi’ model, a large‑scale image‑text dual‑tower system optimized with the EET inference framework, which powers personalized recommendations in NetEase News and Cloud Music, while a partnership with Huawei Ascend AI and MindSpore enables further model acceleration, compression, and the new ‘YuZhi‑Wukong’ model that improves video recommendation metrics by about 5%.

Huawei Ascend AI · Large Model · MindSpore
0 likes · 5 min read
21CTO
Mar 15, 2023 · Artificial Intelligence

What Makes OpenAI’s New GPT‑4 a Game‑Changer for Multimodal AI?

OpenAI’s GPT‑4, a multimodal large language model that accepts text and image inputs, powers ChatGPT and Bing, offers improved creativity and problem‑solving while still facing hallucination risks, and is now available via ChatGPT Plus and an open API for developers.

AI Safety · GPT-4 · Multimodal AI
0 likes · 5 min read
21CTO
Mar 11, 2023 · Artificial Intelligence

Microsoft Announces Multimodal GPT-4: A New ‘iPhone Moment’ for AI

Microsoft Germany's CTO announced the imminent release of a multimodal GPT‑4, highlighting its ability to process text, images and video, while executives liken the breakthrough to an “iPhone moment” for AI, emphasizing new capabilities, industry disruption, and responsible data use.

AI Development · GPT-4 · Large Language Models
0 likes · 6 min read
Architect
Feb 18, 2023 · Artificial Intelligence

Paradigm Shifts in Large Language Models: From Pre‑training to AGI and Future Research Directions

The article reviews the evolution of large language models, highlighting two major paradigm shifts after GPT‑3, the role of scaling laws, knowledge acquisition, prompting techniques, reasoning abilities, and outlines future research priorities for building more capable and efficient AI systems.

AI reasoning · In-Context Learning · Model Scaling
0 likes · 71 min read
AntTech
Jan 18, 2023 · Artificial Intelligence

Ant Security's Tianjian Content Risk Control System Receives Five‑Star Rating in 2022 Content Review Service Evaluation

On January 17, the China Academy of Information and Communications Technology announced that Ant Security's self‑developed Tianjian multimodal content risk control system achieved the highest five‑star rating in both text and image assessments of the 2022 content review service evaluation, highlighting its advanced AI‑driven moderation capabilities.

Image Analysis · Multimodal AI · ant security
0 likes · 4 min read
DataFunSummit
Dec 19, 2022 · Artificial Intelligence

Multimodal Large‑Model Driven Virtual Digital Humans: Background, Methods, and Applications

This article introduces the rapid development of multimodal digital humans powered by large AI models, covering their background, current challenges, NeRF‑GAN based modeling methods, multimodal dialogue capabilities, and real‑world application cases such as virtual assistants, tourism guides, and sign‑language avatars.

AIGC · Human-Computer Interaction · Large Model
0 likes · 14 min read
21CTO
Dec 15, 2022 · Artificial Intelligence

Sam Altman & Reid Hoffman on AI’s Future: Business, Multimodal Models, Society

In a candid conversation, Sam Altman and Reid Hoffman explore the next stage of AI, discussing commercial opportunities of large language models, the rise of AI‑plus applications in science and the metaverse, future directions such as multimodal and continuously learning models, and the societal challenges of AGI, wealth distribution and universal basic income.

AGI · AI commercialization · Large Language Models
0 likes · 16 min read
NetEase Smart Enterprise Tech+
Dec 14, 2022 · Artificial Intelligence

Boosting AI Efficiency in Digital Content Risk Control: Insights from QCon

In this interview, NetEase AI expert Li Yuke shares how lightweight, cost‑effective AI solutions improve digital content risk control, audio‑video processing, and conversational systems, while discussing technical committees, data standards, and future AI trends such as multimodal and unsupervised learning.

AI efficiency · AI production · Multimodal AI
0 likes · 11 min read
DataFunSummit
Dec 9, 2022 · Artificial Intelligence

Volcano Engine Virtual Digital Human Technology Overview

This article provides a comprehensive overview of Volcano Engine's virtual digital human platform, detailing its definition, AI‑driven and human‑driven classifications, 2D and 3D technical architectures, multi‑modal perception, interaction capabilities, application scenarios, and future development directions.

2D avatar · 3D Avatar · Computer Vision
0 likes · 15 min read
DataFunSummit
Nov 26, 2022 · Artificial Intelligence

Multimodal Digital Human Driving: Motionverse Engine and Metaverse Applications

This article introduces the evolution of digital human technology, explains the five maturity levels (L1‑L5), describes the Motionverse multimodal motion‑generation platform and its large‑scale data and AI models, and outlines SDK integration strategies for diverse metaverse scenarios.

Metaverse · Multimodal AI · motion generation
0 likes · 11 min read
Xiaohongshu Tech REDtech
Nov 25, 2022 · Artificial Intelligence

Youth AI Technology Salon: Multimodal Learning, AIGC, and Career Guidance

At the REDtech Youth AI Technology Salon in Beijing, leading AI experts and top university students discussed the evolution of multimodal learning, Xiaohongshu’s practical applications, autonomous‑driving perception, and offered career guidance, emphasizing solid fundamentals, user value, and opportunities within Xiaohongshu’s talent‑development programs.

AI talent development · AIGC · Multimodal AI
0 likes · 16 min read
Tencent Cloud Developer
Nov 11, 2022 · Artificial Intelligence

Tencent Advertising Multimedia AI Technology: Research and Application

Liu Wei outlines Tencent’s Advertising Multimedia AI ecosystem on the Taiji platform, describing a five‑platform matrix—Jue for content understanding, Qiankun for automated video creation, Shenzhen for AI‑driven review, Tianyin for hierarchical fingerprinting, and Hunyuan as a multimodal large model—featuring innovations such as massive multimodal pre‑training, logo retrieval, QA‑style attribute extraction, spatiotemporal video analysis, advanced auto‑judgment, and high‑performance hashing that achieve top cross‑modal retrieval results.

Computer Vision · Multimodal AI · advertising technology
0 likes · 18 min read
Xiaohongshu Tech REDtech
Nov 11, 2022 · Artificial Intelligence

Large-Scale Deep Learning Systems and Their Application at Xiaohongshu (RED)

Xiaohongshu’s in‑house LarC platform powers real‑time, multimodal recommendation, life‑search, and generative‑AI commercial content for its 200 million‑user community by processing billions of daily feedback samples, employing conflict‑free parameter servers, diversified sequence modeling, and large‑scale representation learning to deliver personalized, fresh, and diverse user experiences.

AI Infrastructure · Machine Learning Platform · Multimodal AI
0 likes · 13 min read
DataFunSummit
Oct 9, 2022 · Artificial Intelligence

Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

The article introduces the GIT image‑to‑text (image captioning) model, explains its transformer‑based architecture, showcases multiple example outputs, discusses training details, compares its performance with Flamingo on benchmarks such as COCO, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

GIT model · Image Captioning · Multimodal AI
0 likes · 12 min read
DataFunTalk
Sep 24, 2022 · Artificial Intelligence

Cross‑Modal Image‑Text Representation: The Zero Dataset and R2D2 Pre‑training Framework

This article introduces the importance of image‑text cross‑modal representation, presents the Chinese Zero dataset with two pre‑training subsets and five downstream tasks, describes the R2D2 dual‑tower‑plus‑single‑tower pre‑training framework with multiple loss functions, and reports extensive experiments and real‑world deployment insights.

Cross-modal · Multimodal AI · R2D2 framework
0 likes · 19 min read
Alimama Tech
Aug 17, 2022 · Artificial Intelligence

How Multimodal AI Transforms Advertising Copy: From Image Text to Video Scripts

Alibaba’s advertising AI team presents a comprehensive study of four new multimodal copywriting tasks—image overlay text generation, video narration, text style transfer, and detail-page extraction—detailing model architectures, training on billions of images, experimental results, and practical deployment in the “Xiyu” product.

Large-Scale Training · Multimodal AI · Style Transfer
0 likes · 17 min read
DataFunSummit
Apr 14, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human Technology: Construction, Performance, Interaction, and the MMTK Multimodal Algorithm Library

This article reviews Alibaba's digital‑human (virtual avatar) research over the past few years, covering the product’s evolution, a six‑stage pipeline for building digital humans, solutions to key challenges in realism, multimodal interaction, and the open‑source MMTK algorithm library.

Digital Human · Emotion Modeling · Multimodal AI
0 likes · 12 min read
DataFunTalk
Mar 26, 2022 · Artificial Intelligence

Advances in Alibaba's Digital Human (XiaoMi) Technology: Development, Construction, and Interaction

This article reviews Alibaba's XiaoMi digital human technology, covering its evolution since 2019, a six‑stage pipeline for building avatars, methods to enhance emotional, textual, vocal, and motion expressiveness, and approaches for improving long‑term interactive capabilities such as controllable script generation, multimodal QA, sign‑language translation, and intelligent behavior decision, culminating in the release of the MMTK multimodal algorithm library.

Digital Human · Emotion Modeling · Multimodal AI
0 likes · 11 min read
iQIYI Technical Product Team
Feb 25, 2022 · Artificial Intelligence

Short Video Content Tagging: Multimodal AI Model Framework and Applications

The framework tags short videos by fusing text, image and audio‑video features through specialized extraction, classification, generative and retrieval modules, then ranking candidates with a multimodal BERT model, delivering accurate, business‑specific tags that boost recommendation, search and advertising.

Deep Learning · Multimodal AI · content tagging
0 likes · 10 min read
DataFunTalk
Jan 3, 2022 · Artificial Intelligence

Top AI Stories of 2021: Large‑Scale Pretrained Models, Transformers, Multimodal AI, and Emerging Challenges

The article reviews the 2021 AI landscape, highlighting the race for ever‑larger pretrained models, the dominance of Transformers across modalities, the promise and limits of large models, the rise of multimodal systems, regulatory considerations, and the still‑nascent progress in reinforcement learning.

AI Governance · AI industry · Large Language Models
0 likes · 12 min read
Xianyu Technology
Nov 12, 2021 · Industry Insights

How Taobao’s Cutting‑Edge 3D XR, Multimodal AI, and Low‑Carbon Tech Redefined Double 11

Taobao’s Double 11 showcased a suite of advanced technologies—including 3D immersive XR live rooms, large‑scale multimodal product search, ultra‑low‑latency streaming, AI‑driven content moderation, and low‑carbon model compression—demonstrating how e‑commerce can innovate while promoting sustainability and inclusivity.

3D XR · E‑commerce Innovation · Multimodal AI
0 likes · 11 min read
Volcano Engine Developer Services
Oct 12, 2021 · Artificial Intelligence

How ByteDance’s AI‑Powered Audio Signal Processing Elevates Voice, VR, and VoIP

This article reviews ByteDance’s intelligent audio signal processing technologies, covering foundational algorithms, multimodal audio scaling, sound‑field reconstruction, and high‑quality low‑latency VoIP, and explains how these advances improve audio capture, immersive media, and smart voice interaction across devices.

AR/VR audio · Multimodal AI · VoIP
0 likes · 13 min read
DataFunTalk
Sep 14, 2021 · Artificial Intelligence

Multimodal and Human‑Computer Interaction Technologies for E‑commerce Live Streaming: From Q&A to Live Broadcast

This talk explores how multimodal AI, knowledge‑graph‑enhanced script generation, and advanced reading‑comprehension techniques enable virtual anchors to transform e‑commerce live streaming from simple Q&A bots into interactive, content‑rich live broadcasts, addressing challenges of material sourcing, personalization, and low‑latency response.

Content Generation · LiveQA · Multimodal AI
0 likes · 19 min read
DataFunTalk
Aug 14, 2021 · Artificial Intelligence

Multimodal Advertisement Detection System for WeChat "KanKan" Articles

This article introduces a multimodal advertisement detection framework for WeChat KanKan that decomposes the problem into text, image, and article‑structure dimensions, presents novel models for ad text and image recognition, and describes how sequence classification and visualisation are used to filter severe ad‑spam articles.

Image Classification · Multimodal AI · WeChat
0 likes · 16 min read
ITPUB
Jun 25, 2021 · Artificial Intelligence

How Alibaba’s Low‑Carbon M6 Model Trains a Trillion‑Parameter AI with 80% Less Energy

Alibaba’s DAMO Academy unveiled the low‑carbon M6 multimodal model, a trillion‑parameter AI trained on just 480 V100 GPUs, achieving over 80% energy reduction and 11‑fold speedup compared to prior trillion‑parameter efforts, and already powering e‑commerce and manufacturing design tools.

GPU efficiency · Large Model · M6
0 likes · 5 min read
Baidu Geek Talk
Jun 21, 2021 · Artificial Intelligence

Detecting Pornographic Videos with Dual‑Modal AI: Images + Audio

This article presents a technical overview of a multimodal AI framework that combines image and audio analysis to identify pornographic video content, detailing model architectures, feature extraction methods, and experimental results achieving 93.4% accuracy on a 3,000‑sample test set.

Audio Analysis · Deep Learning · Multimodal AI
0 likes · 6 min read
DataFunTalk
Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · Distributed Training · GPU clustering
0 likes · 17 min read
DataFunSummit
Mar 9, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's multimodal content understanding platform, covering its massive data challenges, heterogeneous model support, standardized pipelines, platformization, workflow architecture, GPU heterogeneous cluster management, resource scheduling, performance optimization, and full‑stack monitoring to achieve stable, low‑latency AI services at scale.

Distributed Training · GPU cluster · Model Serving
0 likes · 18 min read
DataFunTalk
Feb 9, 2021 · Artificial Intelligence

Multimodal AI Research: Video-Aware Dialog, Dual-Channel Reasoning, and Multimodal Machine Translation

This article surveys recent multimodal AI research, covering video scene‑aware dialog with a GPT‑2 based unified pre‑training framework, dual‑channel multi‑hop reasoning for visual dialog, capsule‑network‑enhanced multimodal machine translation, and graph‑neural‑network‑driven multimodal translation, highlighting experimental results and future directions.

Graph Neural Network · Multimodal AI · Multimodal Learning
0 likes · 12 min read
JD Tech
Feb 2, 2021 · Artificial Intelligence

Advances and Trends in Multimodal Digital Content Generation and Automatic Text Summarization

The article reviews recent research on multimodal digital content generation and automatic text summarization, outlining the evolution from extractive to abstractive methods, highlighting four key technology trends such as pretrained language models, transformer dominance, knowledge‑enhanced generation, and multimodal‑knowledge joint modeling, and describing an industrial e‑commerce application built on these advances.

Generative Models · Multimodal AI · e‑commerce
0 likes · 12 min read
DataFunTalk
Oct 22, 2020 · Artificial Intelligence

Analyzing Video Excitement: Methods, Frameworks, and Applications

This article presents a comprehensive overview of video excitement analysis, covering quality, aesthetics, and narrative factors, describing a multimodal framework with supervised, weakly supervised, and multi‑task models, and illustrating practical applications such as preview generation, clipping, and automatic cover creation.

Multimodal AI · Weak Supervision · content recommendation
0 likes · 14 min read
Meituan Technology Team
Oct 15, 2020 · Artificial Intelligence

Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue

The paper introduces the Answer‑Driven Visual State Estimator (ADVSE), which uses answer‑driven focusing attention and conditional visual information fusion to dynamically incorporate answers into visual dialogue, overcoming static encoding limitations and achieving state‑of‑the‑art performance on the GuessWhat?! question‑generation and guessing tasks.

Attention Mechanism · Multimodal AI · State Estimation
0 likes · 10 min read
DataFunTalk
Jul 31, 2020 · Artificial Intelligence

WeChat 'Kan Kan' Content Understanding: Architecture and Techniques for Recommendation

This article details the technical architecture behind WeChat's 'Kan Kan' content understanding platform, covering text and multimedia analysis, tag extraction, entity recognition, knowledge graph construction, and how these components enhance recommendation recall, ranking, and user engagement across the ecosystem.

Multimodal AI · Recommendation Systems · content understanding
0 likes · 46 min read
Xianyu Technology
Jul 9, 2020 · Product Management

Xianyu Product Structuring: Evolution, Current Strategies, and Future Directions

Xianyu’s product‑information structuring has progressed from simple text mining to multimodal AI pipelines that now boost coverage by nearly 50 %, while facing precision and engineering hurdles, and it plans to adopt a standardized VID attribute system, plug‑in multimodal models, and rule‑based input assistance to enable seamless, photo‑driven publishing.

Multimodal AI · data engineering · e‑commerce
0 likes · 10 min read
Suning Technology
Apr 9, 2020 · Artificial Intelligence

Affective Computing in Retail: Boosting Customer Experience with Emotion AI

This article explores the development and application of affective computing in the retail sector, covering its psychological foundations, emotion recognition algorithms for facial expressions, speech, and text, multimodal fusion techniques, market players, and future prospects for enhancing shopper experiences, staff service quality, and sales performance.

Affective Computing · Emotion Recognition · Multimodal AI
0 likes · 20 min read
JD Tech Talk
Mar 9, 2020 · Artificial Intelligence

Advances in Deep Learning for Content Recommendation and User Behavior Modeling by JD Digits

The article reviews recent deep‑learning breakthroughs in personalized content recommendation, covering news and e‑commerce systems, JD Digits' multi‑dimensional user behavior prediction models, knowledge‑graph meta‑learning, and the impact of multimodal AI on future recommendation technologies.

Deep Learning · Multimodal AI · Recommendation Systems
0 likes · 6 min read
Amap Tech
Dec 6, 2019 · Artificial Intelligence

Semantic Understanding of Merchant Signboards for Automatic POI Name Generation at Amap

Amap's POI naming automation uses a two-stage cascade model: Stage 1 extracts token and sentence features with POS tags and domain-adapted BERT‑POI; Stage 2 employs a Bi‑LSTM to model line relationships, achieving over 95% semantic accuracy and 3‑6% recall improvements, thereby enhancing automatic signboard‑based POI name generation.

BERT · LSTM · Multimodal AI
0 likes · 7 min read
iQIYI Technical Product Team
Apr 12, 2019 · Artificial Intelligence

iQIYI Multimodal Technology: Datasets, Applications, and Future Directions

iQIYI leverages multimodal AI—combining audio, visual, and textual cues—to advance video understanding, releasing the world’s largest celebrity dataset (iQIYI‑VID), powering applications such as actor‑focused playback, AI Radar, emoji generation, and rapid automated editing, while pursuing future research in emoji captioning, cross‑modal retrieval, visual question answering, and broader health‑care and education uses.

Datasets · Multimodal AI · iQIYI
0 likes · 13 min read