Tagged articles
65 articles
Machine Heart
May 16, 2026 · Artificial Intelligence

Embodied AI Breakthrough: Beijing Humanoid’s Pelican‑Unify 1.0 Tops WorldArena and Wins Dual Crown

The article details how Beijing Humanoid’s Pelican‑Unify 1.0 model achieved top scores on WorldArena—including a 66.03 overall rating and 98.12% 3D accuracy—by unifying perception, reasoning, imagination and action in a single latent space, marking a milestone for model‑based end‑to‑end embodied intelligence.

Benchmark · Embodied AI · Multimodal Learning
17 min read
Machine Heart
May 8, 2026 · Artificial Intelligence

Omni2Sound Beats Multi-Modal Audio ‘Generalist’ Dilemma via Data Alignment

Omni2Sound tackles the long‑standing “generalist” dilemma of unified audio generation by constructing a high‑quality V‑T‑A dataset (SoundAtlas), employing a three‑stage progressive training pipeline, and using a simple Diffusion Transformer backbone, ultimately achieving state‑of‑the‑art performance on T2A, V2A and VT2A tasks and strong robustness on off‑screen scenarios.

Data Alignment · Multimodal Learning · Omni2Sound
16 min read
Kuaishou Tech
Apr 24, 2026 · Artificial Intelligence

ICLR 2026: Kuaishou Tech Team’s Cutting‑Edge AI Research Highlights

This article reviews eight Kuaishou‑authored papers accepted at ICLR 2026, summarizing their problem statements, novel methods such as front‑door causal attribution, visual table retrieval, denoising rerankers, difficulty‑adaptive reasoning, diffusion code infilling, generative ordinal regression, multimodal video retrieval, e‑commerce dialogue benchmarks, and a new LLM creativity evaluator, together with reported experimental gains.

Causal Attribution · ICLR 2026 · Kuaishou
19 min read
Machine Heart
Apr 21, 2026 · Artificial Intelligence

ControlAudio Enables Scripted Timing and Speech Control in Text-to-Audio Generation

ControlAudio, a progressive diffusion model presented at ACL 2026, jointly models text, timing, and phoneme information to achieve precise event timing and intelligible speech in text-to-audio generation, backed by a large mixed real‑synthetic dataset and competitive experimental results.

ControlAudio · Multimodal Learning · Progressive Diffusion
10 min read
Machine Heart
Mar 31, 2026 · Artificial Intelligence

Point‑VLA: Overcoming Embodied AI’s Language Bottleneck with Visual Grounding

The Point‑VLA method introduced by Qianxun AI’s Gaoyang team tackles the fundamental limits of language‑only instruction in vision‑language‑action models by adding visual grounding via bounding‑box cues, boosting real‑robot success rates from 32.4% to 92.5% across six challenging tasks.

Multimodal Learning · Point-VLA · Robotics
13 min read
Machine Learning Algorithms & Natural Language Processing
Mar 28, 2026 · Artificial Intelligence

GigaWorld-Policy Boosts Inference Speed 10× and Success Rate 30%

The newly released GigaWorld-Policy world‑action model replaces traditional video‑prediction‑heavy WAM designs with an action‑centered architecture, achieving a ten‑fold inference speedup, ten‑fold training efficiency gain, and a 30% increase in real‑robot task success rate while reducing memory usage compared with Motus and Cosmos‑Policy.

Action-Centered Architecture · Inference Optimization · Multimodal Learning
8 min read
Machine Learning Algorithms & Natural Language Processing
Mar 26, 2026 · Artificial Intelligence

Can Uni‑X Eliminate Multimodal Gradient Conflict with a Pure Autoregressive Design?

The paper reveals that standard shared‑parameter Transformers suffer severe gradient conflict when jointly processing low‑entropy text and high‑entropy visual tokens, and proposes Uni‑X—a two‑end‑separated, middle‑shared autoregressive model that isolates modality‑specific layers, reduces conflict, improves efficiency, and achieves strong results on image generation and editing benchmarks.

Autoregressive Model · Gradient Conflict · ICLR 2026
8 min read
AI Explorer
Mar 15, 2026 · Artificial Intelligence

Large Models May Break Language Training Dependence, Redefining Intelligence

A new study suggests that large AI models could reduce their reliance on massive text corpora by early‑fusing multimodal data such as video and sensor streams, potentially slashing training costs, improving generalization, and prompting a shift toward more embodied notions of intelligence.

AI research · Embodied Intelligence · Multimodal Learning
6 min read
Bighead's Algorithm Notes
Mar 3, 2026 · Artificial Intelligence

How HORAI Uses Large‑Scale Multimodal Pretraining to Boost Time‑Series Forecasting and Anomaly Detection

The article reviews the HORAI model, which introduces a frequency‑enhanced multimodal pretraining paradigm and the massive MM‑TS dataset, showing that integrating derived images, endogenous text, and real‑world news dramatically improves zero‑shot forecasting and anomaly detection across six domains.

HORAI · Multimodal Learning · Time Series
23 min read
Bighead's Algorithm Notes
Dec 23, 2025 · Artificial Intelligence

How H3M‑SSMoEs Combines Hypergraph Multimodal Learning and LLM Reasoning to Predict Stock Direction

The paper introduces H3M‑SSMoEs, a framework that integrates a multi‑context hypergraph for fine‑grained spatio‑temporal dynamics with a frozen Llama‑3.2‑1B LLM adapter, and a style‑structured expert mixture to jointly model stock relationships, multimodal semantics, and market regimes, achieving superior accuracy and investment returns on DJIA, NASDAQ‑100, and S&P‑100 benchmarks.

Financial AI · Hypergraph · LLM
14 min read
AI Algorithm Path
Dec 23, 2025 · Artificial Intelligence

Fine‑Tuning Qwen‑Video‑8B with LLaMA‑Factory for Domain‑Specific Video Understanding

This article details how the Qwen‑Video‑8B model, built on Qwen3‑VL‑8B‑Instruct, is fine‑tuned with the LLaMA‑Factory framework using a curated city‑scenery dataset, addresses challenges of domain knowledge, temporal modeling and multimodal fusion, and demonstrates improved video captioning across baseline, English‑fine‑tuned and Chinese‑fine‑tuned versions.

AI fine-tuning · LLaMA-Factory · LoRA
10 min read
Data Party THU
Nov 16, 2025 · Artificial Intelligence

How X‑VLA Enables 120‑Minute Unassisted Robot Clothing Folding with a 0.9B Model

The X‑VLA paper introduces a 0.9‑billion‑parameter, fully open‑source embodied model that uses a learnable soft‑prompt and divide‑and‑conquer encoding to handle heterogeneous robot vision inputs, achieving a record‑breaking 120‑minute autonomous clothing‑folding task while surpassing benchmarks across five simulation environments.

Embodied AI · Multimodal Learning · Robotics
7 min read
Bighead's Algorithm Notes
Oct 31, 2025 · Artificial Intelligence

Weekly Quantitative Paper Digest (Oct 25‑31 2025)

This article summarizes six recent arXiv papers that explore how large language models, graph‑theoretic methods, generative frameworks, hypergraph multimodal architectures, GroupSHAP‑enhanced forecasting, and multi‑agent LLM workflows can improve financial signal extraction, portfolio optimization, and stock‑price prediction, providing empirical results on S&P 500 data.

Financial AI · LLM · Multimodal Learning
13 min read
Bighead's Algorithm Notes
Oct 24, 2025 · Artificial Intelligence

Weekly AI‑Finance Paper Digest (Oct 18‑24 2025)

This digest presents seven recent arXiv papers that explore large‑language‑model‑driven portfolio scoring, hybrid ResNet‑RMT covariance denoising for crypto, LLM‑enhanced financial causal analysis, multilingual news alignment for stock returns, three‑step bubble prediction with news and macro data, multimodal volatility forecasting, and news‑aware reinforcement trading, each with reported performance gains.

Financial AI · LLM · Multimodal Learning
15 min read
Bighead's Algorithm Notes
Sep 6, 2025 · Artificial Intelligence

Time Series Paper Digest (Aug 23–Sep 5 2025)

This digest presents concise summaries of six recent arXiv papers on unsupervised domain adaptation, efficient forecasting, SHAP explanations, text‑reinforced multimodal forecasting, online prediction with feature adjustment, a zero‑shot forecasting zoo, and a new anomaly‑detection metric, highlighting methods, datasets, and results.

Multimodal Learning · Online Learning · SHAP
16 min read
Kuaishou Tech
Jul 29, 2025 · Artificial Intelligence

How Kuaishou’s 8 Groundbreaking Papers Are Shaping AI at KDD 2025

Eight Kuaishou research papers covering recommendation systems, multi‑task learning, multimodal large models, large language models, and combinatorial optimization have been accepted to the premier AI data‑mining conference KDD 2025, highlighting the company’s cutting‑edge innovations and their potential impact on the field.

AI · Multimodal Learning · Recommendation Systems
18 min read
Data Thinking Notes
Jul 8, 2025 · Artificial Intelligence

How Xiaohongshu Leverages Large Models to Revolutionize Content Recommendation

This article details Xiaohongshu's multi‑stage recommendation pipeline—using massive multi‑modal pre‑training, long‑sequence modeling, real‑time context features, reinforcement learning and online deep learning—to precisely surface valuable content, address cold‑start challenges, and break information bubbles for billions of users.

Multimodal Learning · large language model · online deep learning
16 min read
DataFunSummit
Jun 19, 2025 · Artificial Intelligence

How Large Models Are Revolutionizing Douyin’s User Experience – Expert Insights

In a detailed interview, ByteDance AI specialist Cai Conghuai explains how large‑model techniques such as SFT, DPO and RAG address Douyin’s multimodal user‑experience challenges and improve signal detection and root‑cause analysis, and outlines future AI‑agent breakthroughs for content platforms.

AI Algorithms · Multimodal Learning · RAG
11 min read
Su San Talks Tech
Feb 23, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks AI Model Limits: Core Principles & Performance

This article explores DeepSeek’s cutting‑edge distillation technology, detailing its definition, underlying principles, innovative data‑model fusion, architecture choices, training strategies, performance gains over large language models, and the remaining challenges in knowledge transfer and multimodal data processing.

AI Optimization · DeepSeek · Multimodal Learning
16 min read
DataFunSummit
Feb 4, 2025 · Artificial Intelligence

Training Optimization for Large-Scale Multimodal Models in Content Safety

This article examines the challenges of content safety, outlines the limitations of current task‑specific multimodal models, and proposes large‑model‑inspired training optimizations—including diversified data construction, automated annotation, parameter fine‑tuning, and multi‑task evaluation—to improve efficiency, accuracy, and scalability of multimodal AI systems.

AI Optimization · Content Safety · Multimodal Learning
26 min read
Alibaba Cloud Big Data AI Platform
Nov 7, 2024 · Artificial Intelligence

How VideoCLIP‑XL Boosts Long‑Description Understanding in Video CLIP Models

VideoCLIP‑XL, a new video CLIP model introduced by Alibaba Cloud AI Platform and Sun Yat‑sen University, enhances long‑text description comprehension through a large‑scale VILD dataset, a text‑similarity guided principal component matching method, and novel DDR and HDR ranking tasks, achieving superior performance on multiple video‑text benchmarks.

Benchmark · Dataset · Long Description
13 min read
DataFunSummit
Oct 29, 2024 · Artificial Intelligence

Decentralized Distribution in Xiaohongshu: Strengthening Sideinfo, Multimodal Fusion, and Interest Exploration

This article details Xiaohongshu's technical approaches to decentralized content distribution, covering business background, core challenges, high‑frequency recommendation pipelines, link‑level analysis, sideinfo decoupling, graph‑model integration, multimodal signal fusion, explicit interest exploration, interest protection, and future research directions.

Multimodal Learning · decentralized-distribution · graph models
24 min read
DataFunSummit
Sep 16, 2024 · Artificial Intelligence

Multimodal Content Understanding and Cold-Start Practices in NetEase Cloud Music Community Recommendation System

This article details how NetEase Cloud Music leverages multimodal content understanding—using audio models like MusicCLIP and Audio MAE and image‑text fusion via FLAVA—to improve recommendation performance for new content and new users, covering system architecture, cold‑start solutions, and future AI‑driven directions.

AI models · Multimodal Learning · audio representation
15 min read
Bilibili Tech
Aug 27, 2024 · Artificial Intelligence

Multimodal Video Scene Classification for Adaptive Video Processing

The paper presents a multimodal video scene classification system that leverages CLIP‑generated pseudo‑labels and a fine‑tuned image encoder to automatically identify nature, animation/game, and document scenes, enabling more effective adaptive transcoding, intelligent restoration, and quality assessment for user‑generated content on platforms such as Bilibili.

Bilibili multimedia · CLIP · Computer Vision
17 min read
AntTech
Aug 16, 2024 · Artificial Intelligence

PC²: Pseudo‑Classification Based Pseudo‑Captioning for Noisy Correspondence Learning in Cross‑Modal Retrieval

The paper introduces PC², a novel framework that combines pseudo‑classification and pseudo‑captioning to mitigate noisy correspondence in cross‑modal retrieval, presents a large‑scale web‑page/image‑meta‑description dataset called Noise of Web (NoW), and demonstrates significant performance gains on multiple benchmark datasets including Flickr30K, MS‑COCO, and the newly released NoW.

Multimodal Learning · PC2 · cross-modal retrieval
16 min read
Alimama Tech
Aug 2, 2024 · Artificial Intelligence

Multimodal Representations Boost Taobao Display Advertising CTR

Alibaba’s advertising team introduces semantic‑aware contrastive learning to pre‑train multimodal image‑text embeddings, integrates them via SimTier and MAKE into ID‑based CTR models, achieving up to 6.9% lift in Taobao display ad click‑through rates and improving long‑tail item performance.

CTR prediction · Multimodal Learning · Recommendation Systems
21 min read
Kuaishou Tech
Apr 17, 2024 · Artificial Intelligence

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

The paper presented at AAAI introduces the EERCF method, a coarse‑to‑fine visual representation and two‑stage recall‑then‑rerank strategy that dramatically reduces cross‑modal matching FLOPs while preserving state‑of‑the‑art retrieval performance on multiple video benchmarks.

AI · Multimodal Learning · coarse-to-fine representation
8 min read
Baobao Algorithm Notes
Dec 24, 2023 · Artificial Intelligence

Must‑Read AI Agent and LLM Research Papers for Deep Understanding

This curated reading list compiles essential papers on AI agents, task planning, hallucination mitigation, multimodal models, image/video generation, foundational LLM research, open‑source large models, fine‑tuning techniques, and performance optimization, providing a comprehensive roadmap for anyone aiming to master modern generative AI.

AI agents · Multimodal Learning · Performance Optimization
23 min read
NetEase Cloud Music Tech Team
Dec 21, 2023 · Artificial Intelligence

Video and Image Technologies in NetEase Cloud Music: Architecture, Algorithms, and Applications

The article examines NetEase Cloud Music’s video and image technology stack—covering a four‑module architecture, algorithms for content understanding, intelligent production, moderation, and interactive effects—and explains how these systems enhance user experience, streamline backend processing, and position the platform for future AIGC‑driven innovations.

AI Algorithms · Multimodal Learning · Video processing
11 min read
DataFunTalk
Nov 10, 2023 · Artificial Intelligence

Multimodal Cold-Start Techniques for Music Recommendation at NetEase Cloud Music

This article presents NetEase Cloud Music's multimodal cold-start recommendation approach, detailing the problem's significance, feature extraction using CLIP, I2I2U indirect modeling, U2I DSSM direct modeling with contrastive learning and interest‑boundary mechanisms, deployment pipeline, evaluation results, and future optimization directions.

Multimodal Learning · cold start · contrastive learning
14 min read
Xiaohongshu Tech REDtech
Jun 20, 2023 · Artificial Intelligence

Open-Vocabulary Object Attribute Recognition with OvarNet: A Unified Framework for Detection and Attribute Classification

At CVPR 2023 the Xiaohongshu team presented OvarNet, a unified one‑stage Faster‑RCNN model built on CLIP that uses prompt learning and knowledge distillation to jointly detect objects and recognize open‑vocabulary attributes, achieving state‑of‑the‑art results on VAW, MS‑COCO, LSA and OVAD datasets.

Computer Vision · Multimodal Learning · attribute recognition
12 min read
DataFunSummit
May 5, 2023 · Artificial Intelligence

Advances in Virtual Humans, Multimodal Technology, and General AI – Insights from OPPO

The article presents OPPO's latest research on virtual human audio‑lip and RGB driving, multimodal learning breakthroughs such as CETNETs and cross‑modal matching, and a reflective discussion on the challenges and future directions of general artificial intelligence, highlighting the interconnections among these three domains.

AI Engineering · General AI · Multimodal Learning
9 min read
NetEase LeiHuo Testing Center
May 27, 2022 · Artificial Intelligence

Multimodal Model for Game Frame Rate Prediction

This article explains how a multimodal deep learning model combines static and temporal game data to predict frame rates, helping identify performance bottlenecks and improve client smoothness through feature fusion, data pipelines, and real‑time inference in modern games.

AI · Deep Learning · Multimodal Learning
7 min read
DataFunTalk
May 22, 2022 · Artificial Intelligence

Advances in Information‑Flow Recommendation: Pre‑trained Models and Multimodal User‑Interface Modeling

This article reviews Huawei Noah's Ark Lab's work on modern information‑flow recommendation, covering the evolution from collaborative filtering to deep learning, the application of BERT‑based pre‑training for news ranking, multimodal user‑interface modeling, practical deployment challenges, and future research directions.

AI · BERT · Huawei
19 min read
NetEase Media Technology Team
Apr 11, 2022 · Artificial Intelligence

Multimodal Video Tagging: Challenges and a Two‑Stage Recall‑Ranking Solution

To tackle the massive, multimodal tagging challenge of short‑video platforms—characterized by a huge long‑tail tag set, sparse annotations, and uneven modality contributions—the authors propose a two‑stage recall‑ranking system that first retrieves candidates via text, visual, audio and classification cues, then refines them with contrastive learning and extensive hard‑negative sampling, achieving 0.884 tag accuracy in a real‑world news video recommender.

Embedding · Multimodal Learning · Recommendation Systems
12 min read
IEG Growth Platform Technology Team
Feb 14, 2022 · Artificial Intelligence

Multimodal Evolution and Application in Tencent Game Advertising System

This article describes the end‑to‑end multimodal modeling pipeline—covering text, image, and video understanding, model evolution from shallow to deep networks, key‑frame extraction, fine‑tuning, and multimodal fusion—used in Tencent's game ad exchange platform, along with practical deployment challenges and solutions.

Advertising · CNN · Multimodal Learning
22 min read
Baidu Geek Talk
Jan 17, 2022 · Artificial Intelligence

Unlocking Video AI: PaddleVideo’s Open‑Source Solutions for Sports, Media, and Safety

This article surveys PaddleVideo, Baidu's open‑source video AI toolkit, detailing its industry‑focused models for sports action recognition, multimodal tagging, intelligent production, interactive segmentation, drone detection, and medical imaging, while providing performance metrics and GitHub resources for each solution.

Computer Vision · Multimodal Learning · PaddleVideo
14 min read
DataFunTalk
Jan 15, 2022 · Artificial Intelligence

Multimodal + Music: MMatch Series Technologies and Their Applications at Tencent Music

This article presents the multimodal learning demands of QQ Music, introduces the MMatch series of multimodal matching technologies—including image‑text matching, music similarity, AI tagging, and video scoring—and details their practical applications in business scenarios such as merchant public‑play, search, recommendation, and future product ideas.

Multimodal Learning · Recommendation Systems · Tencent Music
25 min read
Amap Tech
Nov 4, 2021 · Artificial Intelligence

POI Signboard Image Retrieval: Technical Solution, Model Design, and Future Directions

To efficiently filter unchanged POI signboards, the authors propose a multimodal image‑retrieval system that combines enhanced global and local visual features with BERT‑encoded OCR text, using metric learning and alignment techniques to achieve over 95% accuracy while handling occlusion, viewpoint variation, and subtle text changes.

Computer Vision · Deep Learning · Multimodal Learning
17 min read
Alibaba Cloud Developer
Nov 4, 2021 · Artificial Intelligence

How AI Powers POI Signboard Image Retrieval for Map Services

This article explains the challenges of POI signboard image retrieval, describes a multimodal deep‑learning solution that combines visual and OCR‑based text features, details data generation, model architecture, and loss functions, and presents the resulting accuracy improvements and future research directions.

Deep Learning · Multimodal Learning · POI mapping
17 min read
Meituan Technology Team
Oct 28, 2021 · Artificial Intelligence

Supply Standardization for Script‑Murder Business Using a Knowledge Graph

Meituan’s To‑Store Integrated data team built an end‑to‑end supply‑standardization pipeline for the rapidly growing script‑murder market by extending the GENE knowledge graph to mine merchant supply, construct a unified script library through rule‑based, semantic, and multimodal clustering, and link products and user‑generated content to standardized scripts, enabling a dedicated category, personalized recommendations, filter tags, and improved ranking.

BERT · Knowledge Graph · Multimodal Learning
23 min read
Kuaishou Tech
Oct 20, 2021 · Artificial Intelligence

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

This paper proposes HiT, a hierarchical transformer model with momentum contrast for video-text retrieval, addressing limitations in existing multimodal learning methods by introducing hierarchical cross-modal contrast matching and momentum cross-modal contrast to improve retrieval performance.

HCM · MCC · MoCo
9 min read
DataFunSummit
Oct 12, 2021 · Artificial Intelligence

Intelligent Grading: Technical Exploration and Practice in AI‑Powered Education

This article presents a comprehensive overview of AI‑driven intelligent grading technologies, covering background, typical educational challenges, multimodal NLP solutions for essay, spelling and grammar correction, adaptive learning, and related research, illustrating how deep learning and multimodal models improve automated assessment across K‑12 scenarios.

AI · Education Technology · Essay Scoring
24 min read
Tencent Advertising Technology
Jun 22, 2021 · Artificial Intelligence

Technical Insights and Solution Strategies from the Tencent Advertising Algorithm Competition – Video Ad Track

The article outlines the Tencent Advertising Algorithm Competition’s video ad challenge, details the paper submission guidelines, and shares a participant’s step‑by‑step technical approach—including baseline experiments, model re‑implementation with Paddle, multimodal feature extraction, optimizer choices, and future improvement directions—providing practical AI insights for multimedia video classification.

Deep Learning · Multimodal Learning · Tencent competition
7 min read
DataFunTalk
Feb 9, 2021 · Artificial Intelligence

Multimodal AI Research: Video-Aware Dialog, Dual-Channel Reasoning, and Multimodal Machine Translation

This article surveys recent multimodal AI research, covering video scene‑aware dialog with a GPT‑2 based unified pre‑training framework, dual‑channel multi‑hop reasoning for visual dialog, capsule‑network‑enhanced multimodal machine translation, and graph‑neural‑network‑driven multimodal translation, highlighting experimental results and future directions.

Graph Neural Network · Multimodal AI · Multimodal Learning
12 min read
Amap Tech
Jan 15, 2021 · Artificial Intelligence

Solution Overview of the AMAP-TECH Algorithm Competition: Dynamic Road Condition Analysis from In‑Vehicle Video Images

To tackle the AMAP‑TECH competition’s dynamic road‑condition classification from scarce, imbalanced vehicle‑video frames, the team combined YOLOv5 object detection, ResNeXt101‑based semantic embeddings, and engineered temporal detection statistics, feeding the fused features into a five‑fold LightGBM model that achieved top weighted‑F1 performance.

Computer Vision · LightGBM · Multimodal Learning
10 min read
Meituan Technology Team
Sep 24, 2020 · Artificial Intelligence

Meituan Search Ads Team's Solution for KDD Cup 2020 Multimodalities Recall Track

Meituan’s Search Ads team placed third in the KDD Cup 2020 Multimodalities Recall track by tackling training‑test distribution mismatch with diversified negative sampling and distillation learning, and improving text‑image matching via gated fully‑connected layers, bidirectional attention, and diversified fusion, then ensembling neural and tree models for strong NDCG gains later applied to their ad creative‑selection system.

Distillation · KDD Cup · Multimodal Learning
19 min read
Meituan Technology Team
Aug 6, 2020 · Artificial Intelligence

Meituan SIGIR2020 Workshop: MT‑BERT, KDD Cup Solutions, and Knowledge Graph Applications

At the SIGIR 2020 Meituan workshop, researchers unveiled MT‑BERT’s large‑scale pre‑training and compression techniques, a KDD Cup winning solution that tackles bias with graph‑ and multimodal learning for search advertising, and a massive food‑delivery knowledge graph powering personalized recommendations, all demonstrating significant real‑world performance gains.

Multimodal Learning · model compression · pretrained language models
18 min read
Qunar Tech Salon
Mar 5, 2020 · Artificial Intelligence

Content Tagging Technology for Short Videos at iQIYI: Challenges and Model Evolution

This article describes iQIYI's short‑video content tagging system, outlining the challenges of extracting type and abstract tags from multimodal data, detailing the evolution from text‑only models to image‑fusion, BERT‑enhanced, and video‑frame models, and discussing their applications and future directions.

BERT · Multimodal Learning · Transformer
0 likes · 11 min read
iQIYI Technical Product Team
Feb 14, 2020 · Artificial Intelligence

Content Tagging Technology for Short Videos: Challenges and Multi‑Modal Model Evolution at iQIYI

iQIYI’s short‑video tagging system tackles multimodal fusion, open‑set and abstract tags by evolving from a text‑only model through cover‑image, BERT‑vector, and video‑frame fusion architectures, enabling automated labeling, personalized recommendation, and semantic search while planning to add OCR, audio, and knowledge‑graph enhancements.

BERT · Multimodal Learning · Transformer
0 likes · 13 min read
Alibaba Cloud Developer
Dec 26, 2019 · Artificial Intelligence

How Decomposed Linguistic Representations Overcome Language Priors in VQA

This article reviews an AAAI 2020 paper that introduces a language‑attention based Visual Question Answering model which decomposes questions into type, object, and concept expressions to mitigate language bias, explains its modular architecture, and demonstrates superior performance on VQA‑CP v2 through extensive experiments and ablations.

Attention Mechanism · Multimodal Learning · VQA-CP
0 likes · 14 min read
DataFunTalk
Sep 29, 2019 · Artificial Intelligence

UC Information Flow Video Tag Recognition: System Architecture and Multi‑Modal Algorithms

This article presents a comprehensive overview of UC's information‑flow video tag recognition technology, detailing tag usage scenarios, the end‑to‑end system architecture, multi‑modal feature extraction, advanced deep‑learning models such as NeXtVLAD, behavior and person tagging methods, and future research directions.

Computer Vision · Deep Learning · Multimodal Learning
0 likes · 14 min read
Alibaba Cloud Developer
Aug 27, 2019 · Artificial Intelligence

How Transformers Enable Personalized Outfit Generation for Fashion Recommendation

This article presents a Transformer‑based framework that simultaneously generates visually compatible outfits and personalizes recommendations by leveraging multimodal item embeddings and user behavior, achieving significant gains in compatibility prediction, fill‑in‑the‑blank accuracy, and click‑through rate on Alibaba's iFashion platform.

Deep Learning · Multimodal Learning · Transformer
0 likes · 15 min read
iQIYI Technical Product Team
Jul 5, 2019 · Artificial Intelligence

Residual Dense Network with Feature Fusion for Multimodal Video Person Identification (iQIYI-VID-2019)

The authors introduce a feature‑fusion pipeline and a Residual Dense Net that leverages multi‑frame face embeddings to identify persons in iQIYI‑VID‑2019 videos, achieving 0.9035 mAP (second place) with only ≈0.5 GFLOPs and processing the full test set in minutes.

Multimodal Learning · feature fusion · iQIYI-VID-2019
0 likes · 11 min read
iQIYI Technical Product Team
Jun 28, 2019 · Artificial Intelligence

Watchdog Team's TOP1 Solution for the iQIYI & ACMMM2019 Multimodal Video Person Recognition Challenge

The Watchdog team took first place in the iQIYI & ACMMM2019 multimodal video person recognition challenge using pre‑extracted multimodal features, a 2048‑dim classifier with BCE loss, re‑ranking, DALI‑accelerated re‑detection, fine‑tuned InsightFace, and multi‑model ensembling, achieving ~91% test accuracy.

Multimodal Learning · feature fusion · model ensemble
0 likes · 12 min read
Alibaba Cloud Developer
Apr 10, 2019 · Artificial Intelligence

Bilinear Residual Layers: Boosting Text‑Guided Image Editing

This article explores multimodal representation learning by introducing a Bilinear Residual Layer that automatically fuses image and text features, demonstrates its superiority over traditional concatenation and FiLM methods on text‑guided image editing and fashion synthesis tasks, and reports state‑of‑the‑art results on several benchmark datasets.

GAN · Multimodal Learning · bilinear residual layer
0 likes · 17 min read
JD Tech
Jan 30, 2019 · Artificial Intelligence

JD AI Presents Eight Papers at AAAI 2019 Showcasing Advances in Machine Learning, NLP, and Computer Vision

At AAAI 2019 in Hawaii, JD AI Research Institute had eight papers accepted covering machine learning, natural language processing, computer vision, and multimodal AI, highlighting innovations such as AutoZOOM black‑box attacks, SACN for knowledge base completion, and temporally aware video captioning models.

Computer Vision · Multimodal Learning · artificial intelligence
0 likes · 11 min read
iQIYI Technical Product Team
Jan 25, 2019 · Artificial Intelligence

Multimodal Video Quality Assessment Models for Short Video Platforms

The paper presents an integrated multimodal quality assessment system for short‑video platforms that evaluates cover images, video content, and accompanying text using deep‑learning and handcrafted features (combining ResNet‑50, NetVLAD, TSN, VGGish, and XGBoost) to improve user experience, recommendation accuracy, and operational efficiency, with plans for optimization and modular deployment.

Image Analysis · Multimodal Learning · text classification
0 likes · 11 min read
Alibaba Cloud Developer
Oct 25, 2017 · Artificial Intelligence

How Hierarchical Multimodal LSTM Boosts Image Captioning Accuracy

This article reviews an ICCV paper introducing a hierarchical multimodal LSTM that jointly embeds images, phrases, and whole sentences, enabling detailed image descriptions and superior performance on Flickr30K, MS‑COCO, and region‑phrase datasets compared to previous methods.

Computer Vision · Image Captioning · Multimodal Learning
0 likes · 8 min read
Alibaba Cloud Developer
Jul 28, 2017 · Artificial Intelligence

Inside Alibaba AI Lab: Dr. Wang Gang on Multimodal AI and Edge Computing

In an exclusive interview, Alibaba AI Lab's distinguished scientist Dr. Wang Gang discusses the lab's research on multimodal AI, edge computing, AI hardware, bio‑inspired cognition, quantum‑deep‑learning integration, and the challenges of moving from recognition to true understanding, while also outlining Alibaba's AI talent recruitment plans.

AI research · AI talent recruitment · Computer Vision
0 likes · 25 min read