Tagged articles

Multimodal Learning

70 articles · Page 1 of 1

Jul 3, 2026 · Artificial Intelligence

ICML 2026: Enabling Multimodal Large Models to Reason Over Time with the Open‑Source TaRO Framework

The paper introduces the Temporal‑Aware Reasoning Optimization (TaRO) framework, which equips multimodal video large models with time‑aware reasoning via template‑based exploration, a temporal‑sensitivity reward, and progressive curriculum learning, achieving state‑of‑the‑art zero‑shot performance on several video temporal grounding benchmarks, including long‑video datasets.

Multimodal Learning · TaRO · Temporal Reasoning

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

Jul 1, 2026 · Artificial Intelligence

SAME: Stabilizing MoE to Reduce Dual Forgetting in Multimodal Continual Instruction Tuning

The paper identifies routing drift and expert drift as the two main causes of forgetting in multimodal continual instruction tuning (MCIT) and proposes SAME, which combines spectral‑aware routing, curvature‑aware scaling, and adaptive expert activation to keep MoE models stable, efficient, and less forgetful across long task sequences.

Continual Learning · ICML 2026 · Instruction Tuning

0 likes · 19 min read

Machine Learning Algorithms & Natural Language Processing

Jun 14, 2026 · Artificial Intelligence

Deep Pre-Alignment (DPA): Tsinghua’s New VLM Architecture Aligns Vision Before Language Understanding

The paper introduces Deep Pre‑Alignment (DPA), a novel Vision‑Language Model architecture that inserts a perceiver VLM to pre‑align visual features with the LLM’s text space, reducing alignment cost, preserving language ability, and delivering consistent multimodal performance gains across multiple benchmarks with minimal inference overhead.

Deep Pre-Alignment · LLM · Multimodal Learning

0 likes · 10 min read

JD Retail Technology

May 19, 2026 · Artificial Intelligence

Spectral Disentanglement and Enhancement: Teaching Multimodal Models to Denoise and Purify

The paper introduces the Spectral Disentanglement and Enhancement (SDE) framework, which uses singular value decomposition to separate strong semantic signals, weak auxiliary signals, and noise, applies curriculum‑based spectral enhancement, and jointly optimizes a dual‑domain contrastive loss, achieving markedly improved robustness and generalization on large‑scale multimodal benchmarks.

Multimodal Learning · contrastive learning · dual-domain loss

0 likes · 14 min read

Machine Heart

May 16, 2026 · Artificial Intelligence

Embodied AI Breakthrough: Beijing Humanoid’s Pelican‑Unify 1.0 Tops WorldArena and Wins Dual Crown

The article details how Beijing Humanoid’s Pelican‑Unify 1.0 model achieved top scores on WorldArena—including a 66.03 overall rating and 98.12% 3D accuracy—by unifying perception, reasoning, imagination and action in a single latent space, marking a milestone for model‑based end‑to‑end embodied intelligence.

Embodied AI · Multimodal Learning · Pelican-Unify

0 likes · 17 min read

Machine Heart

May 8, 2026 · Artificial Intelligence

Omni2Sound Beats Multi-Modal Audio ‘Generalist’ Dilemma via Data Alignment

Omni2Sound tackles the long‑standing “generalist” dilemma of unified audio generation by constructing a high‑quality V‑T‑A dataset (SoundAtlas), employing a three‑stage progressive training pipeline, and using a simple Diffusion Transformer backbone, ultimately achieving state‑of‑the‑art performance on T2A, V2A and VT2A tasks and strong robustness on off‑screen scenarios.

Data Alignment · Diffusion Models · Multimodal Learning

0 likes · 16 min read

Kuaishou Tech

Apr 24, 2026 · Artificial Intelligence

ICLR 2026: Kuaishou Tech Team’s Cutting‑Edge AI Research Highlights

This article reviews eight Kuaishou‑authored papers accepted at ICLR 2026, summarizing their problem statements, novel methods such as front‑door causal attribution, visual table retrieval, denoising rerankers, difficulty‑adaptive reasoning, diffusion code infilling, generative ordinal regression, multimodal video retrieval, e‑commerce dialogue benchmarks, and a new LLM creativity evaluator, together with reported experimental gains.

Diffusion Models · ICLR 2026 · Kuaishou

0 likes · 19 min read

Data Party THU

Apr 22, 2026 · Artificial Intelligence

LARYBench: The ImageNet‑Scale Benchmark Bridging Vision and Action for Embodied AI

LARYBench, the first large‑scale benchmark for embodied intelligence, quantifies implicit action representations across 1.2 million video clips, evaluates vision‑only and robot‑specific models, and reveals how general visual encoders can close the vision‑action modality gap.

Embodied AI · LARYBench · Multimodal Learning

0 likes · 12 min read

Machine Heart

Apr 21, 2026 · Artificial Intelligence

ControlAudio Enables Scripted Timing and Speech Control in Text-to-Audio Generation

ControlAudio, a progressive diffusion model presented at ACL 2026, jointly models text, timing, and phoneme information to achieve precise event timing and intelligible speech in text-to-audio generation, backed by a large mixed real‑synthetic dataset and competitive experimental results.

ControlAudio · Multimodal Learning · Progressive Diffusion

0 likes · 10 min read

Machine Heart

Mar 31, 2026 · Artificial Intelligence

Point‑VLA: Overcoming Embodied AI’s Language Bottleneck with Visual Grounding

The Point‑VLA method introduced by Qianxun AI’s Gaoyang team tackles the fundamental limits of language‑only instruction in vision‑language‑action models by adding visual grounding via bounding‑box cues, boosting real‑robot success rates from 32.4% to 92.5% across six challenging tasks.

Multimodal Learning · Point-VLA · Visual Grounding

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

GigaWorld-Policy Boosts Inference Speed 10× and Success Rate 30%

The newly released GigaWorld-Policy world‑action model replaces traditional video‑prediction‑heavy WAM designs with an action‑centered architecture, achieving a ten‑fold inference speedup, ten‑fold training efficiency gain, and a 30% increase in real‑robot task success rate while reducing memory usage compared with Motus and Cosmos‑Policy.

Action-Centered Architecture · Inference Optimization · Multimodal Learning

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

Mar 26, 2026 · Artificial Intelligence

Can Uni‑X Eliminate Multimodal Gradient Conflict with a Pure Autoregressive Design?

The paper reveals that standard shared‑parameter Transformers suffer severe gradient conflict when jointly processing low‑entropy text and high‑entropy visual tokens, and proposes Uni‑X—a two‑end‑separated, middle‑shared autoregressive model that isolates modality‑specific layers, reduces conflict, improves efficiency, and achieves strong results on image generation and editing benchmarks.

Autoregressive Model · Gradient Conflict · ICLR 2026

0 likes · 8 min read

AI Explorer

Mar 15, 2026 · Artificial Intelligence

Large Models May Break Language Training Dependence, Redefining Intelligence

A new study suggests that large AI models could reduce their reliance on massive text corpora by early‑fusing multimodal data such as video and sensor streams, potentially slashing training costs, improving generalization, and prompting a shift toward more embodied notions of intelligence.

AI research · Embodied Intelligence · Multimodal Learning

0 likes · 6 min read

Bighead's Algorithm Notes

Mar 3, 2026 · Artificial Intelligence

How HORAI Uses Large‑Scale Multimodal Pretraining to Boost Time‑Series Forecasting and Anomaly Detection

The article reviews the HORAI model, which introduces a frequency‑enhanced multimodal pretraining paradigm and the massive MM‑TS dataset, showing that integrating derived images, endogenous text, and real‑world news dramatically improves zero‑shot forecasting and anomaly detection across six domains.

Anomaly Detection · HORAI · Multimodal Learning

0 likes · 23 min read

Xiaomi Tech

Feb 3, 2026 · Artificial Intelligence

Xiaomi’s AI Research Secures Spots on ICLR 2026 – Papers and Key Findings

The International Conference on Learning Representations (ICLR) 2026 accepted multiple Xiaomi papers covering multimodal reasoning, reinforcement learning, GUI agents, autonomous driving, audio generation and benchmark design, each presenting novel frameworks, data‑centric training tricks and strong experimental results that advance the state of the art.

ICLR 2026 · Multimodal Learning · Xiaomi

0 likes · 17 min read

Bighead's Algorithm Notes

Dec 23, 2025 · Artificial Intelligence

How H3M‑SSMoEs Combines Hypergraph Multimodal Learning and LLM Reasoning to Predict Stock Direction

The paper introduces H3M‑SSMoEs, a framework that integrates a multi‑context hypergraph for fine‑grained spatio‑temporal dynamics with a frozen Llama‑3.2‑1B LLM adapter, and a style‑structured expert mixture to jointly model stock relationships, multimodal semantics, and market regimes, achieving superior accuracy and investment returns on DJIA, NASDAQ‑100, and S&P‑100 benchmarks.

Hypergraph · LLM · Multimodal Learning

0 likes · 14 min read

AI Algorithm Path

Dec 23, 2025 · Artificial Intelligence

Fine‑Tuning Qwen‑Video‑8B with LLaMA‑Factory for Domain‑Specific Video Understanding

This article details how the Qwen‑Video‑8B model, built on Qwen3‑VL‑8B‑Instruct, is fine‑tuned with the LLaMA‑Factory framework using a curated city‑scenery dataset, addresses challenges of domain knowledge, temporal modeling and multimodal fusion, and demonstrates improved video captioning across baseline, English‑fine‑tuned and Chinese‑fine‑tuned versions.

AI fine-tuning · LLaMA-Factory · LoRA

0 likes · 10 min read

Bighead's Algorithm Notes

Dec 2, 2025 · Artificial Intelligence

Dual-Relation Fusion Network (DRFN) for Accurate Stock Prediction

The paper introduces DRFN, a dual‑relation fusion network that jointly models static and dynamic stock relationships using multimodal BERT and GRU encodings, achieving significantly lower RMSE and MAE than baseline models on both Chinese and US market datasets.

BERT · GRU · Graph Neural Network

0 likes · 11 min read

Data Party THU

Nov 16, 2025 · Artificial Intelligence

How X‑VLA Enables 120‑Minute Unassisted Robot Clothing Folding with a 0.9B Model

The X‑VLA paper introduces a 0.9‑billion‑parameter, fully open‑source embodied model that uses a learnable soft‑prompt and divide‑and‑conquer encoding to handle heterogeneous robot vision inputs, achieving a record‑breaking 120‑minute autonomous clothing‑folding task while surpassing benchmarks across five simulation environments.

Embodied AI · Multimodal Learning · X-VLA

0 likes · 7 min read

Bighead's Algorithm Notes

Oct 31, 2025 · Artificial Intelligence

Weekly Quantitative Paper Digest (Oct 25‑31 2025)

This article summarizes six recent arXiv papers that explore how large language models, graph‑theoretic methods, generative frameworks, hypergraph multimodal architectures, GroupSHAP‑enhanced forecasting, and multi‑agent LLM workflows can improve financial signal extraction, portfolio optimization, and stock‑price prediction, providing empirical results on S&P 500 data.

LLM · Multimodal Learning · financial AI

0 likes · 13 min read

Bighead's Algorithm Notes

Oct 24, 2025 · Artificial Intelligence

Weekly AI‑Finance Paper Digest (Oct 18‑24 2025)

This digest presents seven recent arXiv papers that explore large‑language‑model‑driven portfolio scoring, hybrid ResNet‑RMT covariance denoising for crypto, LLM‑enhanced financial causal analysis, multilingual news alignment for stock returns, three‑step bubble prediction with news and macro data, multimodal volatility forecasting, and news‑aware reinforcement trading, each with reported performance gains.

LLM · Multimodal Learning · causal inference

0 likes · 15 min read

Bighead's Algorithm Notes

Sep 6, 2025 · Artificial Intelligence

Time Series Paper Digest (Aug 23–Sep 5 2025)

This digest presents concise summaries of six recent arXiv papers on unsupervised domain adaptation, efficient forecasting, SHAP explanations, text‑reinforced multimodal forecasting, online prediction with feature adjustment, zero‑shot forecasting zoo, and a new anomaly‑detection metric, highlighting methods, datasets, and results.

Anomaly Detection · Multimodal Learning · SHAP

0 likes · 16 min read

Kuaishou Tech

Jul 29, 2025 · Artificial Intelligence

How Kuaishou’s 8 Groundbreaking Papers Are Shaping AI at KDD 2025

Eight Kuaishou research papers covering recommendation systems, multi‑task learning, multimodal large models, large language models, and combinatorial optimization have been accepted to the premier AI data‑mining conference KDD 2025, highlighting the company’s cutting‑edge innovations and their potential impact on the field.

AI · Multimodal Learning · Recommendation Systems

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Jul 28, 2025 · Databases

How Multimodal Ranking Cuts Slow Query Optimization Time by 14%

The VLDB‑2025 paper RCRank introduces a multimodal framework that collects slow‑query data, uses rule‑based and LLM analysis to identify root causes, quantifies their impact, and ranks them, achieving a 14% boost in optimization efficiency for cloud databases.

Cloud Databases · Multimodal Learning · Root Cause Analysis

0 likes · 6 min read

Data Thinking Notes

Jul 8, 2025 · Artificial Intelligence

How Xiaohongshu Leverages Large Models to Revolutionize Content Recommendation

This article details Xiaohongshu's multi‑stage recommendation pipeline—using massive multi‑modal pre‑training, long‑sequence modeling, real‑time context features, reinforcement learning and online deep learning—to precisely surface valuable content, address cold‑start challenges, and break information bubbles for billions of users.

Large Language Model · Multimodal Learning · online deep learning

0 likes · 16 min read

DataFunSummit

Jun 19, 2025 · Artificial Intelligence

How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights

In a detailed interview, ByteDance AI specialist Cai Conghuai explains how large‑model techniques such as SFT, DPO and RAG address Douyin’s multimodal user‑experience challenges, improve signal detection, root‑cause analysis, and outline future AI‑agent breakthroughs for content platforms.

AI Algorithms · Evaluation · Multimodal Learning

0 likes · 11 min read

Su San Talks Tech

Feb 23, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks AI Model Limits: Core Principles & Performance

This article explores DeepSeek’s cutting‑edge distillation technology, detailing its definition, underlying principles, innovative data‑model fusion, architecture choices, training strategies, performance gains over large language models, and the remaining challenges in knowledge transfer and multimodal data processing.

DeepSeek · Multimodal Learning · ai-optimization

0 likes · 16 min read

DataFunSummit

Feb 4, 2025 · Artificial Intelligence

Training Optimization for Large-Scale Multimodal Models in Content Safety

This article examines the challenges of content safety, outlines the limitations of current task‑specific multimodal models, and proposes large‑model‑inspired training optimizations—including diversified data construction, automated annotation, parameter fine‑tuning, and multi‑task evaluation—to improve efficiency, accuracy, and scalability of multimodal AI systems.

Content Safety · Multimodal Learning · ai-optimization

0 likes · 26 min read

Alibaba Cloud Big Data AI Platform

Nov 7, 2024 · Artificial Intelligence

How VideoCLIP‑XL Boosts Long‑Description Understanding in Video CLIP Models

VideoCLIP‑XL, a new video CLIP model introduced by Alibaba Cloud AI Platform and Sun Yat‑sen University, enhances long‑text description comprehension through a large‑scale VILD dataset, a text‑similarity guided principal component matching method, and novel DDR and HDR ranking tasks, achieving superior performance on multiple video‑text benchmarks.

Long Description · Multimodal Learning · Video CLIP

0 likes · 13 min read

DataFunSummit

Oct 29, 2024 · Artificial Intelligence

Decentralized Distribution in Xiaohongshu: Strengthening Sideinfo, Multimodal Fusion, and Interest Exploration

This article details Xiaohongshu's technical approaches to decentralized content distribution, covering business background, core challenges, high‑frequency recommendation pipelines, link‑level analysis, sideinfo decoupling, graph‑model integration, multimodal signal fusion, explicit interest exploration, interest protection, and future research directions.

Multimodal Learning · decentralized-distribution · graph models

0 likes · 24 min read

DataFunSummit

Sep 16, 2024 · Artificial Intelligence

Multimodal Content Understanding and Cold-Start Practices in NetEase Cloud Music Community Recommendation System

This article details how NetEase Cloud Music leverages multimodal content understanding—using audio models like MusicCLIP and Audio MAE and image‑text fusion via FLAVA—to improve recommendation performance for new content and new users, covering system architecture, cold‑start solutions, and future AI‑driven directions.

AI models · Multimodal Learning · audio representation

0 likes · 15 min read

Bilibili Tech

Aug 27, 2024 · Artificial Intelligence

Multimodal Video Scene Classification for Adaptive Video Processing

The paper presents a multimodal video scene classification system that leverages CLIP‑generated pseudo‑labels and a fine‑tuned image encoder to automatically identify nature, animation/game, and document scenes, enabling more effective adaptive transcoding, intelligent restoration, and quality assessment for user‑generated content on platforms such as Bilibili.

Bilibili multimedia · CLIP · Multimodal Learning

0 likes · 17 min read

AntTech

Aug 16, 2024 · Artificial Intelligence

PC²: Pseudo‑Classification Based Pseudo‑Captioning for Noisy Correspondence Learning in Cross‑Modal Retrieval

The paper introduces PC², a novel framework that combines pseudo‑classification and pseudo‑captioning to mitigate noisy correspondence in cross‑modal retrieval, presents a large‑scale web‑page/image‑meta‑description dataset called Noise of Web (NoW), and demonstrates significant performance gains on multiple benchmark datasets including Flickr30K, MS‑COCO, and the newly released NoW.

Multimodal Learning · PC2 · cross-modal retrieval

0 likes · 16 min read

Alimama Tech

Aug 2, 2024 · Artificial Intelligence

Multimodal Representations Boost Taobao Display Advertising CTR

Alibaba’s advertising team introduces semantic‑aware contrastive learning to pre‑train multimodal image‑text embeddings, integrates them via SimTier and MAKE into ID‑based CTR models, achieving up to 6.9% lift in Taobao display ad click‑through rates and improving long‑tail item performance.

CTR Prediction · Multimodal Learning · Recommendation Systems

0 likes · 21 min read

Kuaishou Tech

Apr 17, 2024 · Artificial Intelligence

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

The paper presented at AAAI introduces the EERCF method, a coarse‑to‑fine visual representation and two‑stage recall‑then‑rerank strategy that dramatically reduces cross‑modal matching FLOPs while preserving state‑of‑the‑art retrieval performance on multiple video benchmarks.

AI · Efficiency · Multimodal Learning

0 likes · 8 min read

Baobao Algorithm Notes

Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI agents · Multimodal Learning · Performance Optimization

0 likes · 23 min read

NetEase Cloud Music Tech Team

Dec 21, 2023 · Artificial Intelligence

Video and Image Technologies in NetEase Cloud Music: Architecture, Algorithms, and Applications

The article examines NetEase Cloud Music’s video and image technology stack—covering a four‑module architecture, algorithms for content understanding, intelligent production, moderation, and interactive effects—and explains how these systems enhance user experience, streamline backend processing, and position the platform for future AIGC‑driven innovations.

AI Algorithms · Multimodal Learning · Video Processing

0 likes · 11 min read

DataFunTalk

Nov 10, 2023 · Artificial Intelligence

Multimodal Cold-Start Techniques for Music Recommendation at NetEase Cloud Music

This article presents NetEase Cloud Music's multimodal cold-start recommendation approach, detailing the problem's significance, feature extraction using CLIP, I2I2U indirect modeling, U2I DSSM direct modeling with contrastive learning and interest‑boundary mechanisms, deployment pipeline, evaluation results, and future optimization directions.

Multimodal Learning · cold-start · contrastive learning

0 likes · 14 min read

Xiaohongshu Tech REDtech

Jun 20, 2023 · Artificial Intelligence

Open-Vocabulary Object Attribute Recognition with OvarNet: A Unified Framework for Detection and Attribute Classification

At CVPR 2023 the Xiaohongshu team presented OvarNet, a unified one‑stage Faster‑RCNN model built on CLIP that uses prompt learning and knowledge distillation to jointly detect objects and recognize open‑vocabulary attributes, achieving state‑of‑the‑art results on VAW, MS‑COCO, LSA and OVAD datasets.

Multimodal Learning · attribute recognition · computer vision

0 likes · 12 min read

DataFunSummit

May 5, 2023 · Artificial Intelligence

Advances in Virtual Humans, Multimodal Technology, and General AI – Insights from OPPO

The article presents OPPO's latest research on virtual human audio‑lip and RGB driving, multimodal learning breakthroughs such as CETNETs and cross‑modal matching, and a reflective discussion on the challenges and future directions of general artificial intelligence, highlighting the interconnections among these three domains.

AI Engineering · General AI · Multimodal Learning

0 likes · 9 min read

NetEase LeiHuo Testing Center

May 27, 2022 · Artificial Intelligence

Multimodal Model for Game Frame Rate Prediction

This article explains how a multimodal deep learning model combines static and temporal game data to predict frame rates, helping identify performance bottlenecks and improve client smoothness through feature fusion, data pipelines, and real‑time inference in modern games.

AI · Multimodal Learning · deep learning

0 likes · 7 min read

DataFunTalk

May 22, 2022 · Artificial Intelligence

Advances in Information‑Flow Recommendation: Pre‑trained Models and Multimodal User‑Interface Modeling

This article reviews Huawei Noah's Ark Lab's work on modern information‑flow recommendation, covering the evolution from collaborative filtering to deep learning, the application of BERT‑based pre‑training for news ranking, multimodal user‑interface modeling, practical deployment challenges, and future research directions.

AI · BERT · Huawei

0 likes · 19 min read

NetEase Media Technology Team

Apr 11, 2022 · Artificial Intelligence

Multimodal Video Tagging: Challenges and a Two‑Stage Recall‑Ranking Solution

To tackle the massive, multimodal tagging challenge of short‑video platforms—characterized by a huge long‑tail tag set, sparse annotations, and uneven modality contributions—the authors propose a two‑stage recall‑ranking system that first retrieves candidates via text, visual, audio and classification cues, then refines them with contrastive learning and extensive hard‑negative sampling, achieving 0.884 tag accuracy in a real‑world news video recommender.

Embedding · Multimodal Learning · Recommendation Systems

0 likes · 12 min read

IEG Growth Platform Technology Team

Feb 14, 2022 · Artificial Intelligence

Multimodal Evolution and Application in Tencent Game Advertising System

This article describes the end‑to‑end multimodal modeling pipeline—covering text, image, and video understanding, model evolution from shallow to deep networks, key‑frame extraction, fine‑tuning, and multimodal fusion—used in Tencent's game ad exchange platform, along with practical deployment challenges and solutions.

Advertising · CNN · Multimodal Learning

0 likes · 22 min read

Baidu Geek Talk

Jan 17, 2022 · Artificial Intelligence

Unlocking Video AI: PaddleVideo’s Open‑Source Solutions for Sports, Media, and Safety

This article surveys PaddleVideo, Baidu's open‑source video AI toolkit, detailing its industry‑focused models for sports action recognition, multimodal tagging, intelligent production, interactive segmentation, drone detection, and medical imaging, while providing performance metrics and GitHub resources for each solution.

Multimodal Learning · PaddleVideo · Video AI

0 likes · 14 min read

DataFunTalk

Jan 15, 2022 · Artificial Intelligence

Multimodal + Music: MMatch Series Technologies and Their Applications at Tencent Music

This article presents the multimodal learning demands of QQ Music, introduces the MMatch series of multimodal matching technologies—including image‑text matching, music similarity, AI tagging, and video scoring—and details their practical applications in business scenarios such as merchant public‑play, search, recommendation, and future product ideas.

Multimodal Learning · Recommendation Systems · Tencent Music

0 likes · 25 min read

Amap Tech

Nov 4, 2021 · Artificial Intelligence

POI Signboard Image Retrieval: Technical Solution, Model Design, and Future Directions

To efficiently filter unchanged POI signboards, the authors propose a multimodal image‑retrieval system that combines enhanced global and local visual features with BERT‑encoded OCR text, using metric learning and alignment techniques to achieve over 95% accuracy while handling occlusion, viewpoint variation, and subtle text changes.

Multimodal Learning · POI · computer vision

0 likes · 17 min read

Alibaba Cloud Developer

Nov 4, 2021 · Artificial Intelligence

How AI Powers POI Signboard Image Retrieval for Map Services

This article explains the challenges of POI signboard image retrieval, describes a multimodal deep‑learning solution that combines visual and OCR‑based text features, details data generation, model architecture, loss functions, and presents impressive accuracy improvements and future research directions.

Multimodal Learning · POI mapping · deep learning

0 likes · 17 min read

Meituan Technology Team

Oct 28, 2021 · Artificial Intelligence

Supply Standardization for Script‑Murder Business Using a Knowledge Graph

Meituan’s To‑Store Integrated data team built an end‑to‑end supply‑standardization pipeline for the rapidly growing script‑murder market by extending the GENE knowledge graph to mine merchant supply, construct a unified script library through rule‑based, semantic, and multimodal clustering, and link products and user‑generated content to standardized scripts, enabling a dedicated category, personalized recommendations, filter tags, and improved ranking.

BERT · Knowledge Graph · Multimodal Learning

0 likes · 23 min read

Kuaishou Tech

Oct 20, 2021 · Artificial Intelligence

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

This paper proposes HiT, a hierarchical transformer model with momentum contrast for video-text retrieval, addressing limitations in existing multimodal learning methods by introducing hierarchical cross-modal contrast matching and momentum cross-modal contrast to improve retrieval performance.

HCM · MCC · MoCo

0 likes · 9 min read

DataFunSummit

Oct 12, 2021 · Artificial Intelligence

Intelligent Grading: Technical Exploration and Practice in AI‑Powered Education

This article presents a comprehensive overview of AI‑driven intelligent grading technologies, covering background, typical educational challenges, multimodal NLP solutions for essay, spelling and grammar correction, adaptive learning, and related research, illustrating how deep learning and multimodal models improve automated assessment across K‑12 scenarios.

AI · Education Technology · Essay Scoring

0 likes · 24 min read

Tencent Advertising Technology

Jun 22, 2021 · Artificial Intelligence

Technical Insights and Solution Strategies from the Tencent Advertising Algorithm Competition – Video Ad Track

The article outlines the Tencent Advertising Algorithm Competition’s video ad challenge, details the paper submission guidelines, and shares a participant’s step‑by‑step technical approach—including baseline experiments, model re‑implementation with Paddle, multimodal feature extraction, optimizer choices, and future improvement directions—providing practical AI insights for multimedia video classification.

Multimodal Learning · Tencent competition · deep learning

0 likes · 7 min read

DataFunTalk

Feb 9, 2021 · Artificial Intelligence

Multimodal AI Research: Video-Aware Dialog, Dual-Channel Reasoning, and Multimodal Machine Translation

This article surveys recent multimodal AI research, covering video scene‑aware dialog with a GPT‑2 based unified pre‑training framework, dual‑channel multi‑hop reasoning for visual dialog, capsule‑network‑enhanced multimodal machine translation, and graph‑neural‑network‑driven multimodal translation, highlighting experimental results and future directions.

Graph Neural Network · Machine Translation · Multimodal AI

0 likes · 12 min read

Amap Tech

Jan 15, 2021 · Artificial Intelligence

Solution Overview of the AMAP-TECH Algorithm Competition: Dynamic Road Condition Analysis from In‑Vehicle Video Images

To tackle the AMAP‑TECH competition’s dynamic road‑condition classification from scarce, imbalanced vehicle‑video frames, the team combined YOLOv5 object detection, ResNeXt101‑based semantic embeddings, and engineered temporal detection statistics, feeding the fused features into a five‑fold LightGBM model that achieved top weighted‑F1 performance.

LightGBM · Multimodal Learning · ResNeXt

0 likes · 10 min read

Meituan Technology Team

Sep 24, 2020 · Artificial Intelligence

Meituan Search Ads Team's Solution for KDD Cup 2020 Multimodalities Recall Track

Meituan’s Search Ads team placed third in the KDD Cup 2020 Multimodalities Recall track. The team tackled the training‑test distribution mismatch with diversified negative sampling and distillation learning, improved text‑image matching via gated fully‑connected layers, bidirectional attention, and diversified fusion, and ensembled neural and tree models to raise NDCG; the approach was later applied to the team’s ad creative‑selection system.

Distillation · Information Retrieval · KDD Cup

0 likes · 19 min read

Meituan Technology Team

Aug 6, 2020 · Artificial Intelligence

Meituan SIGIR2020 Workshop: MT‑BERT, KDD Cup Solutions, and Knowledge Graph Applications

At the SIGIR 2020 Meituan workshop, researchers unveiled MT‑BERT’s large‑scale pre‑training and compression techniques, a KDD Cup winning solution that tackles bias with graph‑ and multimodal learning for search advertising, and a massive food‑delivery knowledge graph powering personalized recommendations, all demonstrating significant real‑world performance gains.

Multimodal Learning · model compression · pretrained language models

0 likes · 18 min read

Qunar Tech Salon

Mar 5, 2020 · Artificial Intelligence

Content Tagging Technology for Short Videos at iQIYI: Challenges and Model Evolution

This article describes iQIYI's short‑video content tagging system, outlining the challenges of extracting type and abstract tags from multimodal data, detailing the evolution from text‑only models to image‑fusion, BERT‑enhanced, and video‑frame models, and discussing their applications and future directions.

BERT · Multimodal Learning · Transformer

0 likes · 11 min read

DataFunTalk

Feb 27, 2020 · Artificial Intelligence

Content Tagging Technology for Short Videos: Challenges and Model Evolution at iQIYI

This article examines the challenges of short‑video content tagging and describes iQIYI's multi‑stage evolution from simple text‑only models to sophisticated multimodal architectures that fuse cover images, BERT embeddings, and video frames to improve tag generation accuracy.

BERT · Multimodal Learning · Transformer

0 likes · 12 min read

iQIYI Technical Product Team

Feb 14, 2020 · Artificial Intelligence

Content Tagging Technology for Short Videos: Challenges and Multi‑Modal Model Evolution at iQIYI

iQIYI’s short‑video tagging system tackles multimodal fusion, open‑set and abstract tags by evolving from a text‑only model through cover‑image, BERT‑vector, and video‑frame fusion architectures, enabling automated labeling, personalized recommendation, and semantic search while planning to add OCR, audio, and knowledge‑graph enhancements.

BERT · Multimodal Learning · Transformer

0 likes · 13 min read

Alibaba Cloud Developer

Dec 26, 2019 · Artificial Intelligence

How Decomposed Linguistic Representations Overcome Language Priors in VQA

This article reviews an AAAI 2020 paper that introduces a language‑attention based Visual Question Answering model which decomposes questions into type, object, and concept expressions to mitigate language bias, explains its modular architecture, and demonstrates superior performance on VQA‑CP v2 through extensive experiments and ablations.

Attention Mechanism · Multimodal Learning · VQA-CP

0 likes · 14 min read

DataFunTalk

Sep 29, 2019 · Artificial Intelligence

UC Information Flow Video Tag Recognition: System Architecture and Multi‑Modal Algorithms

This article presents a comprehensive overview of UC's information‑flow video tag recognition technology, detailing tag usage scenarios, the end‑to‑end system architecture, multi‑modal feature extraction, advanced deep‑learning models such as NeXtVLAD, behavior and person tagging methods, and future research directions.

Multimodal Learning · Recommendation Systems · computer vision

0 likes · 14 min read

Alibaba Cloud Developer

Aug 27, 2019 · Artificial Intelligence

How Transformers Enable Personalized Outfit Generation for Fashion Recommendation

This article presents a Transformer‑based framework that simultaneously generates visually compatible outfits and personalizes recommendations by leveraging multimodal item embeddings and user behavior, achieving significant gains in compatibility prediction, fill‑in‑the‑blank accuracy, and click‑through rate on Alibaba's iFashion platform.

Multimodal Learning · Transformer · deep learning

0 likes · 15 min read

iQIYI Technical Product Team

Jul 5, 2019 · Artificial Intelligence

Residual Dense Network with Feature Fusion for Multimodal Video Person Identification (iQIYI-VID-2019)

The authors introduce a feature‑fusion pipeline and a Residual Dense Net that leverages multi‑frame face embeddings to identify persons in iQIYI‑VID‑2019 videos, achieving 0.9035 mAP (second place) with only ≈0.5 GFLOPs and processing the full test set in minutes.

Multimodal Learning · feature fusion · iQIYI-VID-2019

0 likes · 11 min read

iQIYI Technical Product Team

Jun 28, 2019 · Artificial Intelligence

Watchdog Team's TOP1 Solution for the iQIYI & ACMMM2019 Multimodal Video Person Recognition Challenge

The Watchdog team took first place in the iQIYI & ACMMM2019 multimodal video person recognition challenge using pre‑extracted multimodal features, a 2048‑dim classifier with BCE loss, re‑ranking, DALI‑accelerated re‑detection, a fine‑tuned InsightFace model, and multi‑model ensembling, achieving ~91% test accuracy.

Multimodal Learning · Re‑ranking · feature fusion

0 likes · 12 min read

Alibaba Cloud Developer

Apr 10, 2019 · Artificial Intelligence

Bilinear Residual Layers: Boosting Text‑Guided Image Editing

This article explores multimodal representation learning by introducing a Bilinear Residual Layer that automatically fuses image and text features, demonstrates its superiority over traditional concatenation and FiLM methods on text‑guided image editing and fashion synthesis tasks, and reports state‑of‑the‑art results on several benchmark datasets.

GAN · Multimodal Learning · Text-to-Image Generation

0 likes · 17 min read

JD Tech

Jan 30, 2019 · Artificial Intelligence

JD AI Presents Eight Papers at AAAI 2019 Showcasing Advances in Machine Learning, NLP, and Computer Vision

At AAAI 2019 in Hawaii, JD AI Research Institute had eight papers accepted covering machine learning, natural language processing, computer vision, and multimodal AI, highlighting innovations such as AutoZOOM black‑box attacks, SACN for knowledge base completion, and temporally aware video captioning models.

Multimodal Learning · artificial-intelligence · computer vision

0 likes · 11 min read

iQIYI Technical Product Team

Jan 25, 2019 · Artificial Intelligence

Multimodal Video Quality Assessment Models for Short Video Platforms

The paper presents an integrated multimodal quality assessment system for short‑video platforms that evaluates cover images, video content, and accompanying text using deep‑learning and handcrafted features—combining ResNet‑50, NetVLAD, TSN, VGGish, and XGBoost—to improve user experience, recommendation accuracy, and operational efficiency, with plans for optimization and modular deployment.

Image Analysis · Multimodal Learning · Text Classification

0 likes · 11 min read

Alibaba Cloud Developer

Oct 25, 2017 · Artificial Intelligence

How Hierarchical Multimodal LSTM Boosts Image Captioning Accuracy

This article reviews an ICCV paper introducing a hierarchical multimodal LSTM that jointly embeds images, phrases, and whole sentences, enabling detailed image descriptions and superior performance on Flickr30K, MS‑COCO, and region‑phrase datasets compared to previous methods.

Image Captioning · Multimodal Learning · computer vision

0 likes · 8 min read

Alibaba Cloud Developer

Sep 29, 2017 · Artificial Intelligence

Alibaba iDST’s Winning Strategy in ACM MM2017 Large-Scale Video Classification

The Alibaba iDST team clinched first place in the ACM MM2017 LSVC competition by leveraging Alibaba Cloud’s ODPS to extract eight multimodal features, achieving a 0.8485 mAP on the validation set, and demonstrating the critical role of rich modality fusion in large‑scale video classification.

Alibaba · Multimodal Learning · ODPS

0 likes · 5 min read

Alibaba Cloud Developer

Jul 28, 2017 · Artificial Intelligence

Inside Alibaba AI Lab: Dr. Wang Gang on Multimodal AI and Edge Computing

In an exclusive interview, Alibaba AI Lab's distinguished scientist Dr. Wang Gang discusses the lab's research on multimodal AI, edge computing, AI hardware, bio‑inspired cognition, quantum‑deep‑learning integration, and the challenges of moving from recognition to true understanding, while also outlining Alibaba's AI talent recruitment plans.

AI research · AI talent recruitment · Multimodal Learning

0 likes · 25 min read
