Tagged articles

Multimodal

422 articles · Page 3 of 5

Jul 7, 2025 · Artificial Intelligence

8 Kuaishou Papers Spotlighted at ICML 2025: Multimodal AI, Causal Inference and More

Kuaishou has had eight cutting‑edge papers accepted at the International Conference on Machine Learning 2025, covering breakthroughs in multimodal emotion modeling, monotonic probability learning, causal effect generalization, cascade ranking, multimodal LLM alignment, ultra‑low‑rate image compression, and visual autoregressive super‑resolution, with links to each work and accompanying code repositories.

AI · Multimodal · Ranking

0 likes · 13 min read


DataFunSummit

Jul 6, 2025 · Artificial Intelligence

AI-Driven Knowledge Graphs: Key Insights from Multimodal GraphRAG Research

This article presents a comprehensive overview of cutting‑edge research on integrating large language models with knowledge graphs, covering multimodal GraphRAG, financial AI solutions, traditional Chinese medicine decision support, and industry‑specific knowledge services, guiding readers through emerging paradigms and practical implementations.

AI · Enterprise AI · Multimodal

0 likes · 2 min read


AntTech

Jul 3, 2025 · Artificial Intelligence

How Ant Group’s AI Multimodal Evaluation Transforms Image, Speech, and Video Quality Testing

In a QECon 2025 talk, Ant Group’s AI team detailed a comprehensive multimodal evaluation framework that leverages large‑model metrics, custom pipelines, and benchmark datasets to assess image generation, speech recognition, and video quality, while also contributing to industry standards and academic research.

AI evaluation · Multimodal · image assessment

0 likes · 16 min read


DataFunTalk

Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

Evaluation · LLM · Multimodal

0 likes · 15 min read


DataFunTalk

Jun 29, 2025 · Artificial Intelligence

Large Models Boost Douyin User Experience: Expert Insights

In an interview at the DA Digital Intelligence Conference, ByteDance AI specialist Cai Conghuai explains how large language models, combined with techniques like SFT, DPO, and RAG, are reshaping Douyin's user‑experience signal detection, root‑cause analysis, and evaluation, while outlining future AI‑agent breakthroughs.

AI · DPO · Large Language Models

0 likes · 12 min read


Alibaba Cloud Developer

Jun 26, 2025 · Artificial Intelligence

How to Build a Multi‑Dimensional Evaluation Framework for AI‑Powered Data Analysis Platforms

This article outlines the design of a scientific, quantifiable, multi‑dimensional evaluation system for the DataV‑Note intelligent analysis platform, addressing the lack of unified standards and accuracy challenges in AI‑driven data reporting, and proposes concrete metrics, model architecture, and future automation plans.

AI evaluation · Model Design · Multimodal

0 likes · 13 min read


Open Source Linux

Jun 12, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: The Evolution of Large Language Models (2017‑2025)

This article chronicles the rapid development of large language models from the 2017 Transformer breakthrough through the rise of BERT, GPT‑3, multimodal models, alignment techniques like RLHF, and finally the cost‑efficient DeepSeek‑R1 in 2025, highlighting key innovations, scaling trends, and real‑world impacts.

AI alignment · Deep Learning · Large Language Models

0 likes · 26 min read


AI Algorithm Path

Jun 11, 2025 · Artificial Intelligence

OpenAI's O3‑Pro Model: Deep Reasoning, Pricing, Benchmarks, and Access Guide

OpenAI introduced the O3‑Pro multimodal deep‑reasoning model alongside an 80% price cut for O3; this article details its training via large‑scale reinforcement learning, compares its capabilities and costs against GPT‑4o, GPT‑4.1 and O3, lists its core specs, limitations, and access methods, and presents benchmark tests that highlight both strengths and weaknesses.

AI · Benchmark · Multimodal

0 likes · 10 min read


Baidu Tech Salon

Jun 11, 2025 · Artificial Intelligence

Why Baidu’s Wenxin Model Dominates IDC’s 2025 Large Model Evaluation

IDC’s 2025 China foundational large‑model evaluation crowns Baidu’s Wenxin as the top performer, scoring perfect marks in seven of eight criteria and highlighting its superior multimodal, dialogue, and ecosystem capabilities among twelve leading models.

AI · Baidu Wenxin · IDC evaluation

0 likes · 5 min read


Kuaishou Audio & Video Technology

Jun 11, 2025 · Artificial Intelligence

Kuaishou Showcases 12 Cutting-Edge CVPR 2025 Papers on Video Generation and AI

Kuaishou presented twelve peer‑reviewed papers at CVPR 2025 covering video quality assessment, large‑scale video datasets, dynamic 3D avatar reconstruction, 4D scene simulation, controllable video generation, scaling laws for diffusion transformers, multimodal foundations, and more, highlighting the company's leading research in computer vision and AI.

AI research · CVPR2025 · Deep Learning

0 likes · 21 min read


DataFunSummit

Jun 10, 2025 · Artificial Intelligence

How Quwan’s Kaitian Model Tackles Emotional AI for Social Apps – Architecture, Training Tricks, and Safety

Quwan Technology presents its Kaitian social large model, designed for personalized, emotionally rich, multimodal AI interactions, detailing its scene‑specific goals, CPT+SFT+RLHF training pipeline, data desensitization, LoRA fine‑tuning, evaluation methods, pruning, latency trade‑offs, safety mechanisms, and future feedback loops.

AI safety · LoRA · Model Pruning

0 likes · 13 min read


IT Services Circle

Jun 7, 2025 · Artificial Intelligence

Run Powerful Multimodal AI Offline on a 2 GB Android Phone with Google AI Edge Gallery

Google AI Edge Gallery lets a 2 GB RAM Android phone run sophisticated multimodal AI models completely offline, offering image understanding, text generation, and conversational capabilities without any network connection, and provides a quick three‑step setup to start experimenting with on‑device AI.

AI · Android · Google

0 likes · 5 min read


Kuaishou Tech

Jun 5, 2025 · Artificial Intelligence

7 Kuaishou AI Papers Accepted at ACL 2025: Video Understanding & Safe LLM Decoding

Kuaishou’s foundational large-model team has secured seven papers at ACL 2025, spanning alignment bias in training, safety defenses during inference, decoding strategies, fine-grained video-temporal understanding, reward fairness in RLHF, multimodal captioning benchmarks, and methods to curb hallucinations in vision-language models.

ACL · AI safety · Benchmark

0 likes · 13 min read


Fighter's World

Jun 2, 2025 · Artificial Intelligence

Why Is Context King for Large Language Models?

This article provides a comprehensive technical analysis of LLM context, covering its definition, types, tokenization, window‑size evolution, diminishing returns, management techniques such as RAG, CoT, memory‑as‑a‑service, and future challenges like multimodal fusion, privacy, and autonomous agent memory.

Agent Memory · Context Management · LLM

0 likes · 48 min read


Baidu MEUX

May 28, 2025 · Artificial Intelligence

Top 10 AI Breakthroughs This Week: New Models, Tools, and Industry Moves

This roundup highlights ten recent AI developments, from Apple's Matrix3D model that creates 3D scenes from photos, to Qwen's Deep Research assistant, Tencent's CodeBuddy 3.0, ByteDance's Seed1.5‑VL, StepFun's open‑source Step1X‑3D, Google's iOS icon refresh, Apple's eye‑tracking scrolling test, Chrome's upcoming Gemini AI assistant, Shanghai's AI Identity Ecosystem Alliance, and Kuaishou's Kling AI 2.0 topping the global video‑generation leaderboard.

3D generation · AI assistants · AI models

0 likes · 5 min read


DataFunTalk

May 23, 2025 · Artificial Intelligence

2025 AI Landscape: Inference Models Dominate, Open‑Source Momentum Accelerates

The 2025 Q1 AI report from Artificial Analysis highlights six major trends—including a thousand‑fold drop in inference cost, the rise of MoE models, the growing parity of Chinese open‑source labs, the emergence of autonomous AI agents, native multimodal capabilities, and the trade‑off between performance, cost, and context windows—painting a picture of a rapidly evolving, increasingly competitive AI ecosystem.

AI · Agents · Multimodal

0 likes · 11 min read


Baidu Tech Salon

May 21, 2025 · Artificial Intelligence

Baidu AI Day 2024: Wenxin X1 Turbo Sets New Benchmark with Top‑Level Evaluation and Advanced Multimodal Capabilities

At Baidu AI Day in Beijing, the company unveiled the Wenxin 4.5 Turbo and X1 Turbo models, detailing multimodal training breakthroughs, self‑feedback loops, enhanced reasoning and tool‑calling, while the China Academy of Information and Communications Technology awarded X1 Turbo the highest "4+" rating across 24 capability tests, highlighting its leading position in domestic large‑model performance.

Baidu · Multimodal · Wenxin

0 likes · 9 min read


Tencent Technical Engineering

May 19, 2025 · Artificial Intelligence

RAG, Agents, and Multimodal Large Models: Evolution, Challenges, and Future Trends

This article examines the evolution of large model technologies—including Retrieval‑Augmented Generation, AI agents, and multimodal models—detailing their technical foundations, practical challenges, industry applications, and future development trends, offering a comprehensive perspective for AI practitioners and researchers.

AI Agent · Multimodal · RAG

0 likes · 14 min read


Bilibili Tech

May 16, 2025 · Artificial Intelligence

How FineVQ Sets New Standards for Fine‑Grained UGC Video Quality Assessment

The article introduces FineVD, the first large‑scale multi‑dimensional UGC video quality dataset, and presents FineVQ, a unified model that predicts quality scores, attributes, and distortion types across six dimensions, achieving state‑of‑the‑art performance on multiple benchmarks and cross‑dataset evaluations.

Deep Learning · FineVQ · Multimodal

0 likes · 9 min read


Network Intelligence Research Center (NIRC)

May 14, 2025 · Artificial Intelligence

Hands‑On CLIP: Implementing Multimodal Vision‑Language Understanding

This article introduces OpenAI’s CLIP multimodal model, explains its architecture and contrastive training, details hardware and installation steps, and demonstrates a hands‑on zero‑shot image classification workflow that achieves 97% confidence on a cat image without any task‑specific fine‑tuning.

CLIP · Multimodal · Python

0 likes · 6 min read


DevOps

May 13, 2025 · Artificial Intelligence

The Rise of AI Agents: Current Trends, Core Capabilities, and Future Outlook

This article surveys the rapid emergence of AI agents, outlining their projected 2025 breakthrough, market momentum, key frameworks such as Manus and MCP, the four core abilities of perception, planning, tool use, and memory, and the evolving landscape of multimodal and autonomous AI systems.

AI agents · Artificial Intelligence · Multimodal

0 likes · 11 min read


DataFunSummit

May 13, 2025 · Artificial Intelligence

Integrating Large Language Models and Knowledge Graphs for Financial Applications: Challenges, Solutions, and Future Directions

This talk explores the technical challenges of applying large language models and knowledge graphs in finance, discusses solutions such as RAG enhancements, graph‑guided retrieval, multimodal extensions, and presents future research directions including multimodal graph integration, agentic systems, and decision‑making applications.

AI · Agentic Systems · Multimodal

0 likes · 33 min read


Alimama Tech

May 12, 2025 · Artificial Intelligence

Universal Recommendation Model (URM): A General Large‑Model Recall System for Advertising

The article presents the Universal Recommendation Model (URM), a large‑language‑model‑based recall framework that integrates world knowledge and e‑commerce expertise through knowledge injection and prompt‑driven alignment, achieving significant offline recall gains and a 3.1% increase in ad consumption while meeting high‑QPS, low‑latency production constraints.

Advertising · Multimodal · Prompt engineering

0 likes · 17 min read


AntTech

May 12, 2025 · Industry Insights

How AI Large Models Are Revolutionizing Multimodal Content Safety

An award‑winning joint project by Shanghai Jiao Tong University and Ant Group unveils a multimodal foundation model and advanced detection techniques that dramatically improve AI‑driven content risk governance across massive online services.

AI · Ant Group · Content Safety

0 likes · 3 min read


Alibaba Cloud Developer

May 9, 2025 · Information Security

What’s New in MCP 2025‑03‑26? Deep Dive into OAuth 2.1, Streamable HTTP, and JSON‑RPC Enhancements

The MCP 2025‑03‑26 release introduces mandatory OAuth 2.1 with PKCE, a single‑endpoint Streamable HTTP transport, required JSON‑RPC batch processing, richer tool metadata, structured progress notifications, audio multimodal support, and robust session management, all backed by extensive security hardening and performance gains.

API Security · JSON-RPC · MCP

0 likes · 14 min read


Tencent Cloud Developer

May 8, 2025 · Artificial Intelligence

Advances and Future of AI Agents: Capabilities, Trends, and Applications

AI agents are rapidly evolving toward a 2025 breakthrough in perception, autonomous planning, tool use and memory, driven by multimodal models, neural‑symbolic reasoning and embodied intelligence. Investment is forecast at $27 billion, with general‑purpose agents like Manus and emerging applications in code generation, research, healthcare, and risk analysis as examples.

AI Agent · Agent framework · Autonomous Planning

0 likes · 12 min read


Spring Full-Stack Practical Cases

May 7, 2025 · Artificial Intelligence

Unlock Multimodal AI with Spring AI: Hands‑On Image & ID Recognition Cases

This article introduces Spring AI's multimodal capabilities, explains the Message API for handling text, image, audio, and video inputs, and provides step‑by‑step Spring Boot examples for image analysis, ID card extraction, and structured JSON output of car‑color counts.

Artificial Intelligence · Java · Multimodal

0 likes · 8 min read


AI Algorithm Path

May 2, 2025 · Artificial Intelligence

Qwen3 Launch: Open-Source Models Redefine General AI

The Qwen3 series introduces eight open‑source large language models ranging from 0.6B to 235B parameters, combines dense and Mixture‑of‑Experts architectures, supports multimodal input, offers mixed inference modes, and demonstrates benchmark superiority over leading models such as OpenAI o1 and Gemini 2.5 Pro.

AI agents · Benchmark · Mixture of Experts

0 likes · 10 min read


Data Thinking Notes

Apr 29, 2025 · Artificial Intelligence

From Transformers to DeepSeek‑R1: How LLMs Evolved to 2025

This article chronicles the evolution of large language models from the 2017 Transformer breakthrough through BERT, GPT series, multimodal models, and recent cost‑efficient innovations like DeepSeek‑R1, highlighting key architectures, training methods, alignment techniques, and their transformative impact on AI applications.

AI alignment · Large Language Models · Multimodal

0 likes · 29 min read


DevOps

Apr 27, 2025 · Artificial Intelligence

Large Model Technologies: RAG, AI Agents, Multimodal Applications, and Future Trends

This article examines how Retrieval‑Augmented Generation (RAG), AI agents, and multimodal large‑model techniques are reshaping AI‑industry integration, discusses their technical challenges and practical implementations, and outlines future development directions across algorithms, products, and domain‑specific applications.

AI agents · Artificial Intelligence · Multimodal

0 likes · 14 min read


Kuaishou Tech

Apr 23, 2025 · Artificial Intelligence

Kuaishou's Accepted Papers at ICLR 2025 and Their Summaries

The article highlights Kuaishou's eleven high‑quality papers accepted at ICLR 2025, covering advances in streaming video understanding, 3D trajectory control, multimodal talking‑face animation, transformer indexing, efficient video generation, industrial recommendation datasets, token gradient conflict in MoE, stable segmentation, multi‑camera video synthesis, large‑scale multimodal instruction tuning, and hallucination detection in retrieval‑augmented generation.

AIResearch · DeepLearning · ICLR2025

0 likes · 20 min read


Liangxu Linux

Apr 22, 2025 · Artificial Intelligence

Top 10 Open-Source OCR Projects on GitHub Ranked by Stars

This article compiles a ranked list of ten popular open-source OCR projects on GitHub, summarizing each tool’s key capabilities—such as multimodal text extraction, PDF linearization, layout analysis, and multilingual support—along with star counts and direct repository links for developers seeking ready-to-use OCR solutions.

GitHub · Multimodal · OCR

0 likes · 9 min read


Swan Home Tech Team

Apr 21, 2025 · Artificial Intelligence

How Front-End Teams Leverage AI: FastGPT Platform, Intelligent Search, and Video Synthesis

This article examines how a front‑end team uses AI innovations—FastGPT visual platform, AI‑powered semantic search, and AI video synthesis—to rebuild business workflows, cut costs, and boost efficiency, highlighting architecture, technical highlights, and practical use cases.

AI · Low‑code platform · Multimodal

0 likes · 7 min read


DataFunTalk

Apr 18, 2025 · Artificial Intelligence

Applying ByteDance’s Doubao‑1.5 Vision Model for Image Counting and Automated Annotation

The article demonstrates how ByteDance’s new Doubao‑1.5 multimodal model can be used to locate and count objects in images—such as sushi plates, street signs, and cartoon hats—by generating coordinates and overlaying visual annotations through a concise Python script.

AI · Doubao · Image Annotation

0 likes · 5 min read


AIWalker

Apr 17, 2025 · Artificial Intelligence

Unveiling DeepSeek’s Janus Series: Decoupled Visual Encoding for Unified Multimodal Understanding and Generation

This article provides an in‑depth analysis of DeepSeek’s Janus and Janus‑Pro models, explaining how decoupling visual encoding resolves the conflict between multimodal understanding and generation, detailing training stages, data scaling, architectural choices, and presenting extensive benchmark results that demonstrate significant performance gains.

Benchmark · DeepSeek · Janus

0 likes · 23 min read


58UXD

Apr 17, 2025 · Artificial Intelligence

How Zero‑UI and Gemini’s Multimodal AI Are Redefining Human‑Computer Interaction

Zero‑UI, powered by multimodal AI models like Google Gemini, is shifting design from screen‑based interfaces to natural voice, gesture, and environmental interactions, prompting a fundamental redesign of how devices understand user intent across smart homes, cars, and immersive experiences.

AI · Human-Computer Interaction · Multimodal

0 likes · 9 min read


Baidu Tech Salon

Apr 16, 2025 · Artificial Intelligence

Release of the 'Fangsheng' Large Model Benchmark Results (Q1 2025) and Overview of Baidu's Wenxin 4.5 and X1 Models

The China AI Industry Alliance unveiled its Q1 2025 Fangsheng benchmark, showing Baidu’s new multimodal models—Wenxin 4.5 leading basic abilities and Wenxin X1 excelling in reasoning—available for free on the Wenxin Yiyan platform, while Baidu pledges major 2025 investments in AI, data‑center and cloud infrastructure.

AI · Benchmark · FactTesting

0 likes · 4 min read


JD Tech

Apr 15, 2025 · Artificial Intelligence

Reliable Advertising Creative Generation and Personalized Recommendation via Multimodal Feedback and Offline Representation

The article presents a series of technical breakthroughs by JD's advertising team that improve the quality and coverage of AI‑generated ad images through a trustworthy multimodal feedback network, introduce a large human‑annotated image dataset, and enhance creative ranking with offline multimodal representations and online architecture optimizations, ultimately achieving more precise and scalable ad personalization.

AI · AIGC · Advertising

0 likes · 10 min read


58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · Quantization

0 likes · 19 min read


AntTech

Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI research · Large Language Models · Long Context

0 likes · 6 min read


Baidu Geek Talk

Apr 9, 2025 · Artificial Intelligence

Baidu's Wenxin X1 Large Model Officially Launches on Qianfan Platform

On April 2, Baidu released its Wenxin X1 large model on the Qianfan platform, offering enterprise users and developers a multimodal, deep‑thinking AI with superior math, coding, and reasoning scores, low token‑price API access, batch inference, one‑click distillation, and rapid RAG/Agent application building.

AI · API Service · Baidu

0 likes · 4 min read


AIWalker

Apr 7, 2025 · Artificial Intelligence

Is CLIP Obsolete? LeCun and Xie's New Multimodal Model Beats Language Supervision

A recent study by LeCun, Xie, and collaborators shows that large‑scale visual self‑supervised learning (Web‑SSL) can match or surpass CLIP on diverse VQA tasks, even without any language supervision, by scaling model size and data volume.

CLIP · Model Scaling · Multimodal

0 likes · 13 min read


AI Algorithm Path

Apr 6, 2025 · Artificial Intelligence

Meta’s Open-Source Llama 4: 2‑Trillion‑Parameter Behemoth Redefines AI

Meta’s newly released Llama 4 models, Maverick with 402 billion total parameters and Scout with 109 billion, feature a 128‑expert MoE, 10 million‑token context, native multimodal fusion, and FP8 training, delivering benchmark‑leading performance that outpaces GPT‑4o, Gemini 2.0 Flash and DeepSeek v3, while being openly available on Hugging Face and GitHub.

Benchmark · FP8 training · Llama 4

0 likes · 8 min read


Fighter's World

Apr 5, 2025 · Artificial Intelligence

Is Gemini 2.5 Pro the Turning Point for Google’s AI Strategy?

The article analyses Google’s Gemini 2.5 Pro as a decisive shift toward a “Reasoning Model”, detailing its architectural focus on inference, benchmark breakthroughs such as Humanity’s Last Exam and GPQA Diamond, long‑context capability, multimodal strengths, Vibe‑coding experience, and the roadmap for future Gemini models.

AI Strategy · Benchmark · Gemini 2.5 Pro

0 likes · 25 min read


Nightwalker Tech

Apr 1, 2025 · Artificial Intelligence

Evaluation of AutoGLM: Features, Architecture, and Practical Test Results

This article reviews AutoGLM, the first "think‑while‑doing" AI agent released by Zhipu AI, detailing its core capabilities, full‑stack architecture, user experience, identified limitations, and the outcomes of three hands‑on tests using both the client application and a Chrome extension.

AI Agent · AutoGLM · Evaluation

0 likes · 4 min read


AIWalker

Mar 31, 2025 · Artificial Intelligence

VBench-2.0: A Next‑Generation Benchmark for Intrinsic Faithfulness in AI Video Generation

VBench-2.0 expands the original VBench suite by introducing six fine‑grained dimensions—Human Fidelity, Controllability, Creativity, Physics, Commonsense, and more—to evaluate not only the visual quality of generated videos but also their intrinsic faithfulness to physical laws, common sense, and narrative coherence, providing open‑source tools, prompts, and human‑aligned metrics for the research community.

AI evaluation · Benchmark · Intrinsic Faithfulness

0 likes · 12 min read


Nightwalker Tech

Mar 28, 2025 · Artificial Intelligence

Comprehensive Evaluation of GPT-4o Multimodal Image Generation Capabilities

This article presents a thorough assessment of GPT‑4o’s new image generation features, detailing multiple test scenarios—from simple portrait creation and style transfer to UI design, product rendering, and educational illustrations—comparing its output with Claude‑3.7‑Sonnet, highlighting strengths in realism and weaknesses in Chinese text handling.

AI evaluation · GPT-4o · Multimodal

0 likes · 16 min read


Meituan Technology Team

Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100 k AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation—including a “vague‑to‑specific” strategy for long prompts—surpassing existing benchmarks.

AIGC · Evaluation · Multimodal

0 likes · 9 min read


37 Interactive Technology Team

Mar 26, 2025 · Artificial Intelligence

LUI vs GUI: Choosing the Right Interface for AI Product Design

When designing AI products, choosing between a Language User Interface—leveraging speech recognition, NLP, and conversational flexibility—and a Graphical User Interface—relying on visual icons, layouts, and intuitive interaction—depends on technology maturity, response speed, and user learning cost, while emerging multimodal designs increasingly blend both for richer, context‑aware experiences.

AI · GUI · Interaction

0 likes · 11 min read


JD Retail Technology

Mar 25, 2025 · Artificial Intelligence

2024 Advances in Advertising Creative Generation and Selection

In 2024 the advertising team deployed an end‑to‑end AIGC pipeline that automatically creates high‑quality ad images, uses the multimodal Reliable Feedback Network and the million‑size RF1M dataset to filter outputs, builds rich offline and online multimodal representations with contrastive and list‑wise learning, and optimizes ranking architecture to deliver scalable, personalized creative selection.

AI · AIGC · Advertising

0 likes · 10 min read


AI Large Model Application Practice

Mar 24, 2025 · Artificial Intelligence

How to Build a Multimodal RAG Pipeline for PPT Documents with Vision LLMs

This article explains a step‑by‑step implementation of a multimodal Retrieval‑Augmented Generation system that parses PPT/PDF files, extracts rich text and images with vision models, indexes them in a vector store, and generates answers that combine markdown and relevant slide screenshots.

LLM · Multimodal · Python

0 likes · 9 min read


Alibaba Cloud Big Data AI Platform

Mar 21, 2025 · Artificial Intelligence

How to Build Multimodal Image Tagging with RAM and BERT in DataWorks Notebook

This tutorial walks through using DataWorks Notebook with GPU support to combine the open‑vocabulary visual model RAM and the language model BERT for zero‑shot multimodal image tagging, covering environment setup, model installation, dataset preparation, tagging code, and result visualization.

BERT · DataWorks · Multimodal

0 likes · 13 min read


Amap Tech

Mar 19, 2025 · Artificial Intelligence

Driving by the Rules: Integrating Lane-Level Traffic Regulations into Online HD Maps

Gaode Map and Xi'an Jiaotong University introduce the “Driving by the Rules” task, releasing the MapDR benchmark that integrates lane‑level traffic‑sign regulations into online‑constructed HD maps, and provide modular (VLE‑MEE) and end‑to‑end (RuleVLM) baselines to evaluate rule extraction and lane association.

AI · Benchmark · HD maps

0 likes · 8 min read

IT Services Circle

Mar 19, 2025 · Artificial Intelligence

ByteDance’s AI Video Generation Model Goku, Streamer‑Sales Live‑Selling Model, and MimicTalk 3D Talking‑Head Project

ByteDance and partners open‑source three AI projects—Goku for high‑quality text‑to‑video generation, Streamer‑Sales for multimodal live‑selling LLMs, and MimicTalk for rapid 3D talking‑head creation—detailing their core features, underlying transformer‑based architectures, training pipelines, and public repositories.

AI video generation · Multimodal · Transformer

0 likes · 5 min read

JD Tech Talk

Mar 19, 2025 · Artificial Intelligence

Reliable Advertising Image Generation and Creative Selection Using Multimodal Feedback and MLLM Representations

The 2024 advertising team introduced a suite of AI‑driven techniques—including a trustworthy feedback network, a large‑scale human‑annotated dataset, multimodal large language model representations, and online ranking architecture upgrades—to dramatically improve the quality, coverage, and personalization of generated ad creatives.

AIGC · Advertising · MLLM

0 likes · 10 min read

JD Cloud Developers

Mar 19, 2025 · Artificial Intelligence

How AIGC Boosts Ad Creative Quality: Trustworthy Image Generation & Selection

2024 saw the advertising team achieve major breakthroughs in AI-generated ad creatives by introducing a multimodal reliable feedback network to improve image usability, releasing a large human-annotated dataset, and leveraging multimodal large language models for richer representation and more effective online/offline creative selection.

AIGC · Multimodal · ad optimization

0 likes · 10 min read

NewBeeNLP

Mar 18, 2025 · Interview Experience

How to Ace Multimodal Model Interviews at Taobao's Search AI Division

This article recounts a three‑stage interview for a multimodal large‑model position at Taobao's Search AI division, detailing typical questions on CLIP, LoRA, BLIP, Qwen‑VL, Transformer fundamentals, RLHF, and coding challenges, and offers insights on what interviewers focus on.

AI · CLIP · LoRA

0 likes · 5 min read

Code Mala Tang

Mar 15, 2025 · Artificial Intelligence

What Makes Google’s New Gemma 3 Model a Game‑Changer for AI Developers?

Google’s Gemma 3, a lightweight open‑source model with up to 27 billion parameters, offers multimodal input, 128K token context, and broad language support, outperforming leading rivals on single‑GPU benchmarks and providing flexible deployment options for developers and researchers alike.

AI model · Gemma 3 · Google AI

0 likes · 9 min read

AIWalker

Mar 7, 2025 · Artificial Intelligence

How GIFNet’s Low‑Level Interaction Breakthrough Enables Universal Multimodal Fusion Across Tasks

The paper introduces GIFNet, a three‑branch network that leverages low‑level visual tasks and a cross‑fusion gating mechanism to achieve a single, task‑agnostic image‑fusion model with dramatically reduced computation, strong generalization to unseen modalities, and even single‑modal enhancement capabilities.

CVPR2025 · GIFNet · Image Fusion

0 likes · 20 min read

DaTaobao Tech

Mar 7, 2025 · Artificial Intelligence

Taobao Content AI: Summary of AIGC Content Generation and Multimodal Model Techniques

Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.

AIGC · Content Generation · Multimodal

0 likes · 10 min read

Cognitive Technology Team

Mar 7, 2025 · Artificial Intelligence

From Word Embeddings to Large Language Models: A Comprehensive Overview of AI Model Evolution

This article traces the development of AI models—from early word embeddings like Word2Vec and ELMo, through transformer‑based encoders such as BERT and decoder‑only models like GPT‑1/2/3, to recent multimodal systems and scaling laws—explaining their architectures, training methods, and impact on modern AI applications.

AI · Embedding · Large Language Models

0 likes · 22 min read

DaTaobao Tech

Mar 5, 2025 · Artificial Intelligence

Multimodal Large‑Model Cover Generation AI Agent for Taobao Video and Live Streams

Taobao’s new multimodal AI Agent automatically creates high‑quality static and dynamic video covers by planning tasks, consulting a memory of quality criteria, executing frame selection with ReKV streaming and dual‑stage evaluation, generating marketing copy via fine‑tuned Qwen2.5‑7B, and refining layout, resulting in significantly higher click‑through rates, lower latency, and reduced manual effort.

AI · Multimodal · Video Processing

0 likes · 17 min read

DaTaobao Tech

Mar 3, 2025 · Artificial Intelligence

How Taobao’s “Faxiang” AI Model Revolutionizes E‑Commerce Video Generation

Taobao’s AIGC video generation platform, built on a large‑scale “Faxiang” model that evolved from UNet to DiT, leverages over 2 billion curated e‑commerce videos, expert alignment, LoRA fine‑tuning, and multi‑control capabilities to deliver diverse, high‑quality product videos that dramatically boost conversion metrics across the marketplace.

AI video generation · AIGC · Multimodal

0 likes · 11 min read

JD Retail Technology

Mar 1, 2025 · Industry Insights

How JD Retail’s AI Assistant Uses Multimodal LLMs to Boost E‑Commerce

JD Retail’s AI assistant combines a Master‑Sub agent framework, ReAct paradigm, multimodal integration and MoE architecture to improve sales forecasting, pricing, and recommendation accuracy, while the team’s collaborative culture and open talent pathways illustrate how cutting‑edge AI is applied in real‑world e‑commerce.

AI · JD Retail · LLM

0 likes · 8 min read

AIWalker

Feb 20, 2025 · Artificial Intelligence

Transfusion: A Single Model for Unified Image Generation and Understanding

Transfusion is a 7B‑parameter transformer that jointly trains language modeling and diffusion losses on mixed text‑image data, enabling seamless text generation, image generation, and image understanding within one model and outperforming prior multimodal approaches such as Chameleon across multiple benchmarks.

AI research · Language Modeling · Multimodal

0 likes · 20 min read

Architect

Feb 16, 2025 · Artificial Intelligence

DeepSeek-V3, DeepSeek-R1, and Janus‑Pro: Architecture, Training Techniques, and Performance Insights

This article provides an in‑depth technical overview of DeepSeek‑V3, DeepSeek‑R1 and Janus‑Pro models, covering their Mixture‑of‑Experts architecture, novel MLA attention, auxiliary‑loss‑free load balancing, multi‑token prediction, FP8 mixed‑precision training, efficient cross‑node communication, reinforcement‑learning pipelines, multimodal modeling strategies, performance comparisons, cost statistics, and current limitations.

AI Architecture · DeepSeek-V3 · FP8 training

0 likes · 18 min read

AIWalker

Feb 16, 2025 · Artificial Intelligence

VARGPT: A Unified Autoregressive Architecture for Multimodal Understanding and Generation

VARGPT is a novel multimodal large language model that unifies visual understanding and autoregressive image generation within a single architecture, extending LLaVA with next‑token and next‑scale prediction, trained through three staged data‑curated phases and achieving superior performance on numerous vision‑language benchmarks.

AI research · Multimodal · VARGPT

0 likes · 20 min read

Architects' Tech Alliance

Feb 16, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks Bottlenecks and Boosts Multimodal AI Performance

This article provides an in‑depth technical analysis of DeepSeek’s model distillation technology, covering its core principles, innovative data‑model fusion strategies, architecture design, training optimizations, performance benchmarks, and the remaining challenges of scaling distillation to multimodal tasks.

DeepSeek · Large Language Models · Multimodal

0 likes · 16 min read

Ops Development & AI Practice

Feb 10, 2025 · Artificial Intelligence

What’s Inside Google Gemini 2.0 Pro? Free Pricing, Multimodal Power & Real‑Time Streaming

The article reviews Google Gemini 2.0 Pro Experimental, detailing its free‑during‑experiment pricing, multimodal understanding, real‑time streaming, native tool integration, usage limits, latency controls, and practical scenarios such as large‑scale code processing and live media handling.

AI · Gemini · Multimodal

0 likes · 5 min read

AIWalker

Feb 8, 2025 · Artificial Intelligence

Introducing Ola: A Full‑Modal Language Model from Tsinghua & Tencent that Unifies Image, Video, and Audio Understanding

The article presents Ola, an open‑source full‑modal LLM that uses progressive modality alignment to jointly process text, images, video, and audio, and demonstrates competitive performance across image, video, and audio benchmarks, surpassing many specialized models.

Benchmark · Multimodal · Ola

0 likes · 22 min read

AIWalker

Feb 4, 2025 · Artificial Intelligence

Meta’s Open‑Source MILS Enables LLMs to See and Hear Without Training – SOTA on Images, Video, and Audio

The paper introduces MILS, a training‑free multimodal iterative LLM solver that lets large language models perceive and generate across image, video, and audio domains, achieving new state‑of‑the‑art results without any task‑specific data or fine‑tuning.

AI research · LLM · MILS

0 likes · 18 min read

AI Code to Success

Jan 23, 2025 · Industry Insights

Core Tech vs Application Optimization: Where’s the Real Battleground in the AI Large‑Model Race?

The article analyzes the 2025 AI large‑model landscape, contrasting slowing foundational breakthroughs with fierce application competition, highlighting MiniMax’s low‑cost linear‑attention models, multimodal advances, and the strategic shift from price wars to sustainable, technology‑driven growth.

AI · Industry Analysis · Multimodal

0 likes · 7 min read

Software Engineering 3.0 Era

Jan 22, 2025 · Artificial Intelligence

When Will China Overtake the US in Large‑Model AI? A Technical Comparison

The article analyzes the US‑China large‑model race, detailing algorithmic and architectural strengths of OpenAI, Google and Microsoft versus Chinese innovations like Doubao 1.5, MiniMax‑01 and Vidu, and projects a timeline from 2025 to 2033 for China to close the gap.

AI competition · Benchmark · China

0 likes · 12 min read

DataFunSummit

Jan 22, 2025 · Artificial Intelligence

RAG2.0 Engine Design Challenges and Implementation

This article presents a comprehensive overview of the RAG2.0 engine design, covering RAG1.0 limitations, effective chunking methods, accurate retrieval techniques, advanced multimodal processing, hybrid search strategies, database indexing choices, and future directions such as agentic RAG and memory‑enhanced models.

Chunking · Hybrid Search · Multimodal

0 likes · 23 min read

AI Code to Success

Jan 16, 2025 · Industry Insights

How MiniMax’s Open‑Source Linear‑Attention Model Is Shaking Up the Global AI Landscape

MiniMax, a Shanghai‑based AI unicorn, has open‑sourced its MiniMax‑01 series featuring large‑scale linear attention, secured $600 million in funding, launched multimodal products like Talkie and Hailuo AI, and is positioning itself as a competitive force amid rising geopolitical tensions in the global artificial‑intelligence market.

AI · China AI · Linear Attention

0 likes · 4 min read

ZhongAn Tech Team

Jan 12, 2025 · Artificial Intelligence

AI Weekly Digest Issue 10: Market Insights, Industry Solutions, and Notable Technologies

This issue reviews recent AI industry developments, including Kai‑Fu Lee’s clarification of the strategy at 01.AI (Zero‑One), Microsoft’s open‑source Phi‑4 model, the multimodal VITA‑1.5 release, and Hailuo AI’s advanced Chinese voice‑cloning technology, providing technical details and market implications.

AI · Multimodal · Voice Cloning

0 likes · 10 min read

Infra Learning Club

Jan 2, 2025 · Artificial Intelligence

Three Major LLM Trends in 2025: Ubiquitous Agents, Rising Small Models, and Multimodal Fusion

In 2025, large language models will see three key trends—agents becoming pervasive in daily life and industry, the emergence of efficient small models for edge and specialized tasks, and the integration of multimodal capabilities that combine text, images, and audio to enable more natural human‑machine interaction.

AI trends · Agents · LLM

0 likes · 4 min read

Programmer DD

Dec 31, 2024 · Artificial Intelligence

Build an AI‑Powered Expense Tracker with GLM‑4V‑Flash and MaxKB

This article demonstrates how to create an AI‑driven personal expense‑tracking assistant by leveraging Zhipu's GLM‑4V‑Flash multimodal model for receipt OCR, generating SQL statements, and integrating them with MaxKB workflows and a MySQL database, complete with code snippets and deployment steps.

AI · GLM-4V-Flash · MaxKB

0 likes · 13 min read

Baidu Geek Talk

Dec 25, 2024 · Industry Insights

How to Build a Multimodal Web Page Model for the LLM Era

This article examines the unique multimodal and multi‑granular nature of web pages, compares fusion strategies, proposes a cross‑modal attention approach, outlines fine‑ and coarse‑grained pre‑training tasks, and explores low‑cost adaptor methods for adapting large multimodal models to web‑page modeling in the LLM era.

AI · HTML · LLM adaptation

0 likes · 10 min read

DevOps

Dec 23, 2024 · Artificial Intelligence

Understanding AIGC Agents: Definition, Core Features, Underlying Logic, and Commercial Applications

This article explains what AIGC agents are, outlines their four main characteristics, describes the underlying transformer‑based architecture, dual‑stage learning, probabilistic generation and feedback optimization, and explores their current and future commercial use cases across content creation, knowledge bases, customer service, internal operations, and product design.

AIGC · Agent · Artificial Intelligence

0 likes · 14 min read

Tencent Cloud Developer

Dec 5, 2024 · Industry Insights

Why Most RAG Projects Fail and How Tencent’s LeXiang AI Assistant Overcomes Them

The article analyses the rapid growth of Retrieval‑Augmented Generation (RAG) in enterprises, explains why self‑built RAG solutions often collapse under cost and maintenance pressures, and demonstrates how Tencent LeXiang AI Assistant addresses these issues through a robust knowledge‑management core, extensive industry experience, scalable resources, and advanced multimodal capabilities.

AI assistant · Enterprise AI · Knowledge Management

0 likes · 16 min read

21CTO

Dec 4, 2024 · Artificial Intelligence

Introducing Pi-zero: A General‑Purpose AI Foundation Model for Robotics

Physical Intelligence's new Pi-zero model, built on a vision‑language foundation and fine‑tuned with extensive robot data, outperforms prior baselines across multiple tasks, showcasing the promise of large multimodal foundation models for flexible, robust robot control.

AI · Foundation Models · Multimodal

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Dec 4, 2024 · Artificial Intelligence

How EasyAnimate V5 Advances AI Video Generation with Multimodal Control

EasyAnimate V5, an Alibaba Cloud AI video generation framework, expands model size to 7B/12B, introduces multimodal control, token‑length based training, and inpaint‑based image‑to‑video strategies, while providing easy deployment via PAI, DSW, and local ComfyUI integration.

AI · LoRA · MMDiT

0 likes · 11 min read

NewBeeNLP

Dec 2, 2024 · Artificial Intelligence

What Are Today’s Unified Generation-and-Understanding Multimodal Model Architectures?

This article surveys current unified generation-and-understanding multimodal large-model architectures, compares LLM-centric and LLM-plus-diffusion designs, extracts common insights, details large-scale training tricks from models like Emu3, Chameleon and Janus, and outlines open research directions for visual encoders.

Large Language Models · Multimodal · diffusion

0 likes · 5 min read

JD Retail Technology

Nov 14, 2024 · Artificial Intelligence

Improving Advertisement Image Generation with a Multimodal Reliable Feedback Network (ECCV 2024)

The paper introduces a Multimodal Reliable Feedback Network (RFNet) and a consistency‑condition regularization technique that together boost the usable rate of automatically generated advertisement images while preserving visual quality, supported by a new million‑image annotated dataset and extensive ECCV‑2024 experiments.

AI · Diffusion Models · ECCV2024

0 likes · 8 min read

Bilibili Tech

Nov 8, 2024 · Artificial Intelligence

AI-Powered Game Recognition for League of Legends Live Streaming on Bilibili

Bilibili’s AI‑driven game‑recognition system extracts real‑time LoL events through OCR, hero detection and hot‑spot tagging, generating high‑energy timestamps and interactive overlays that let viewers jump to key moments and view detailed statistics, enhancing spectator engagement and analytical capabilities across major esports tournaments.

AI · Game Recognition · Multimodal

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Nov 6, 2024 · Artificial Intelligence

Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Two papers from Alibaba Cloud’s AI platform PAI were accepted at EMNLP 2024: VideoCLIP‑XL, which enhances video‑text representation for long descriptions using a large video‑long‑description dataset and novel pre‑training tasks, and TAPIR, a curriculum‑planning framework that distills the instruction‑following abilities of large language models. The team also released the associated models, datasets, and integration tools for users.

Distillation · EMNLP2024 · Multimodal

0 likes · 8 min read

DataFunSummit

Nov 1, 2024 · Big Data

DataFun Summit Session Overview and E‑book Access Instructions

The article outlines how to obtain the DataFun Summit e‑book by following the public account instructions and provides concise English summaries of twelve technical sessions covering data lineage, integration, AI language models, multimodal content, game AI agents, lake‑warehouse governance, big‑data architecture, and cluster management.

AI · Big Data · Data Integration

0 likes · 5 min read

AntTech

Oct 28, 2024 · Artificial Intelligence

Highlights of AI Large‑Model Sessions at CNCC 2024

The CNCC 2024 conference featured a series of expert talks on AI large‑model research, covering paradigm shifts in scientific discovery, knowledge enhancement and governance, data‑infrastructure analytics, vertical‑domain inference, diffusion‑model advances, multimodal model progress, and medical AI applications, illustrating the breadth and impact of large‑model technologies across multiple domains.

AI · Knowledge Governance · Multimodal

0 likes · 9 min read

JD Retail Technology

Oct 15, 2024 · Artificial Intelligence

Large‑Model‑Driven Evolution of E‑commerce Search and Recommendation at JD Retail

The article examines how large language models are reshaping JD Retail's e‑commerce search and recommendation pipelines, detailing industry evolution, technical challenges such as knowledge hallucination, intent understanding, personalization, cost, and safety, and presenting JD's end‑to‑end AIGC architecture, data preprocessing, alignment, evaluation, and next‑generation AI search solutions.

AI · Multimodal · e-commerce

0 likes · 36 min read

DataFunTalk

Oct 1, 2024 · Artificial Intelligence

From Early AI to Superintelligence: Challenges and Prospects

The article reviews the evolution of artificial intelligence from early statistical models through deep learning and Transformer architectures, examines current breakthroughs like multimodal models, and discusses the technical, computational, and safety challenges that must be overcome before achieving artificial superintelligence (ASI).

AI · Artificial Intelligence · Multimodal

0 likes · 8 min read

Data Thinking Notes

Sep 26, 2024 · Big Data

How Data Platforms Are Shifting from Cost Efficiency to Value in the AI Era

The talk reviews the evolution of data technologies from early database storage to today’s generative AI-driven era, highlighting how massive data, multimodal processing, and advanced analytics are transforming data systems from cost‑centered infrastructures to value‑focused ecosystems that empower intelligent agents, open data ecosystems, and new application paradigms.

Big Data · Data Platforms · Data Value

0 likes · 19 min read

JD Tech Talk

Sep 23, 2024 · Artificial Intelligence

JD Advertising R&D: AI‑Driven Solutions for Traffic Valuation, Multimodal Understanding, Auction Mechanisms, Generative Recommendation, and Large‑Model Engineering

The JD Advertising R&D team applies cutting‑edge AI techniques—including query intent models, multimodal representation pipelines, reinforcement‑learning‑based auction mechanisms, generative recommendation with quantized product tokens, and large‑model infrastructure—to boost traffic valuation, ad relevance, revenue, and creative generation across the platform.

AI · Advertising · Graph Neural Networks

0 likes · 19 min read

JD Cloud Developers

Sep 23, 2024 · Artificial Intelligence

How JD’s Advertising Lab Leverages Large‑Scale AI to Transform E‑Commerce Ads

JD's advertising research team combines deep learning, multimodal modeling, reinforcement‑learning auctions, and generative recommendation to boost ad relevance, improve long‑tail product exposure, and overcome large‑model inference challenges in a high‑traffic e‑commerce environment.

Graph Neural Network · Multimodal · advertising AI

0 likes · 22 min read

AntData

Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies, traces the historical development of data infrastructure, and argues that modern data systems are shifting from a cost‑efficiency focus to a value‑centric paradigm driven by multimodal, unstructured data, vector search and machine‑oriented services.

@Data · Artificial Intelligence · Big Data

0 likes · 18 min read

JD Retail Technology

Sep 4, 2024 · Artificial Intelligence

Multimodal Recommendation Algorithms and System Architecture at JD.com

This article presents JD.com’s multimodal recommendation system architecture, covering content understanding, multimodal ranking and recall models, practical deployment pipelines, and future research directions such as large‑model integration and supply‑side generation, all illustrated with detailed diagrams and Q&A.

AI · JD.com · Multimodal

0 likes · 14 min read

AI Large Model Application Practice

Aug 29, 2024 · Artificial Intelligence

8 Essential Indexing Strategies to Boost Enterprise RAG Performance

This article presents eight practical optimization recommendations for the indexing stage of enterprise‑level Retrieval‑Augmented Generation (RAG) applications, covering chunk creation, abbreviation handling, multimodal document processing, semantic enrichment, metadata usage, alternative index types, and embedding model selection.

Chunking · Indexing · Metadata

0 likes · 15 min read

DataFunSummit

Aug 29, 2024 · Artificial Intelligence

Intelligent NPC Practices in Tencent Games: Multi‑Modal LLM Solutions and System Optimizations

This article details Tencent Games' end‑to‑end approach to building intelligent NPCs, covering the opportunities brought by AI, the practical implementation of multimodal LLM‑driven dialogue, knowledge‑augmented retrieval, long‑context handling, safety measures, multimodal expression (voice and facial animation), and system‑level performance optimizations for real‑time deployment.

AI · LLM · Multimodal

0 likes · 18 min read

DataFunSummit

Aug 25, 2024 · Artificial Intelligence

Applying Large AI Models to Financial Data Governance and Innovative Use Cases

This article presents a comprehensive technical overview of how large AI models are reshaping financial data production, governance, multimodal document understanding, lakehouse storage, private‑domain model deployment, data‑centric engineering methods, and multi‑agent intelligent advisory within the finance sector.

AI · Multimodal · RAG

0 likes · 21 min read

NewBeeNLP

Aug 15, 2024 · Industry Insights

Decoding Xiaohongshu’s Decentralized Recommendation: Sideinfo and Multimodal Fusion

This article analyzes how Xiaohongshu addresses the decentralization challenge in its recommendation system by strengthening side‑information usage, integrating multimodal signals across the full pipeline, and implementing interest exploration and protection mechanisms, while also outlining future research directions such as generative recommendation and large‑model‑driven user profiling.

Multimodal · decentralized-distribution · graph

0 likes · 25 min read
