Tag: Vision-Language

Baidu Geek Talk
Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture. It combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model that uses Multi‑head Latent Attention. Trained in three stages on diverse vision‑language and text data, it achieves strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

Code · DeepSeek-VL2 · Inference
36 min read
DataFunSummit
Nov 1, 2024 · Artificial Intelligence

Progress in Multimodal Large Language Models: Background, Architecture, Evolution, Team Work, and Future Outlook

This article reviews recent advances in multimodal large language models, covering their background, architectural components, training strategies, application scenarios, evaluation benchmarks, team research on hallucination mitigation and long‑video understanding, and outlines promising future research directions.

Model Architecture · Vision-Language · evaluation benchmarks
15 min read
DataFunSummit
Oct 28, 2024 · Artificial Intelligence

Exploration and Practice of Multimodal Large Models at 360

This article presents 360's comprehensive exploration of image‑text multimodal large models, covering background concepts, research routes, three generations of model development, proprietary architectures like SEEChat, 360VL and Inner‑Adaptor, and real‑world AI applications across various products and services.

AI applications · Model Architecture · Vision-Language
19 min read
360 Tech Engineering
May 17, 2024 · Artificial Intelligence

360VL: An Open‑Source Multimodal Large Language Model Based on Llama‑3‑70B

The article introduces 360VL, an open‑source multimodal large language model built on Llama‑3‑70B, describes its novel C‑abs bridge architecture for high‑resolution visual understanding, outlines the two‑stage training with bilingual data, and presents benchmark results showing superior performance over prior LMMs.

AI research · Llama3 · Vision-Language
8 min read
DataFunSummit
Mar 27, 2024 · Artificial Intelligence

Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings

This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.

Large Models · OFA · ONE-PEACE
15 min read
DataFunTalk
Sep 26, 2023 · Artificial Intelligence

MiniGPT-4: Enhancing Vision‑Language Understanding with Large Language Models

This article presents MiniGPT-4, a multimodal system that combines a frozen visual encoder (Q‑Former + ViT) with an open‑source large language model (Vicuna), describes its motivation, training pipeline, demo capabilities, observed limitations, and includes a brief Q&A session.

AI research · MiniGPT-4 · Vision-Language
15 min read
DataFunTalk
Aug 11, 2023 · Artificial Intelligence

Multimodal Dialogue Large Model mPLUG-Owl: Technology, Applications, and Evaluation

mPLUG-Owl is a modular multimodal dialogue large model from Alibaba DAMO Academy that builds on the mPLUG series, offering advanced image, video, OCR, and multilingual capabilities, with extensive evaluations showing superior performance over MiniGPT‑4, LLaVA, and other multimodal LLMs across various tasks.

Vision-Language · evaluation · large language model
17 min read
DataFunTalk
Oct 13, 2022 · Artificial Intelligence

Multimodal Attribute-Level Sentiment Analysis for Social Media: Background, Tasks, and Recent Advances

This article reviews the rapid development of multimodal attribute-level sentiment analysis on social media, outlining its background, defining four core sub‑tasks, summarizing representative recent models—including unified multimodal transformers, coarse‑to‑fine image‑target matching, and vision‑language pre‑training—and discussing experimental results and future research directions.

NLP · Vision-Language · aspect-based sentiment
21 min read
DataFunSummit
Oct 9, 2022 · Artificial Intelligence

Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

The article introduces the GIT image‑to‑text (image captioning) model, explains its transformer‑based architecture, showcases multiple example outputs, discusses training details, compares its performance with models such as Flamingo on benchmarks including COCO, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

GIT model · Transformer · Vision-Language
12 min read
Taobao Tech
May 24, 2022 · Artificial Intelligence

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

GEN‑VLKT introduces a Guided‑Embedding Network with position‑ and instance‑guided embeddings to remove costly post‑processing and leverages CLIP‑based visual‑linguistic knowledge transfer for interaction understanding, achieving state‑of‑the‑art HOI detection performance and zero‑shot capability, now deployed in Alibaba’s Taobao services.

CLIP · Computer Vision · HOI detection
7 min read
DataFunTalk
Mar 20, 2020 · Artificial Intelligence

UNITER: Unified Image‑Text Representation Learning for Vision‑Language Tasks

This article introduces UNITER, a unified image‑text representation learning framework pretrained on four large multimodal datasets, describes its three pretraining tasks (MLM, ITM, MRM), details model architecture, training optimizations, and evaluates performance across six vision‑language downstream tasks, achieving state‑of‑the‑art results.

AI · ITM · MLM
11 min read