Tagged articles

CLIP

33 articles

Jul 4, 2026 · Artificial Intelligence

ICML 2026: Certifying VLM Robustness with Text‑Prompted Semantic Intervals

This paper introduces a semantic robustness certification framework for vision‑language models that leverages paired text prompts as semantic proxies to define a continuous transformation in the shared embedding space, derives closed‑form interval bounds where predictions remain unchanged, and validates the method on CLIP ViT‑B/32 with both synthetic and real‑world datasets.

CLIP · embedding geometry · robustness certification

0 likes · 13 min read

James' Growth Diary

May 13, 2026 · Artificial Intelligence

Multimodal RAG: A Complete Guide to Ingesting Images, Tables, and PDFs

This article examines the blind spot of pure‑text RAG for visual content, compares three multimodal ingestion strategies—CLIP embeddings, image‑to‑text captioning with a MultiVectorRetriever, and ColPali visual retrieval—covers table‑specific handling, presents end‑to‑end TypeScript implementations, and lists common pitfalls to avoid when deploying production‑grade multimodal RAG pipelines.

CLIP · ColPali · Image Captioning

0 likes · 22 min read

Data Party THU

Mar 25, 2026 · Artificial Intelligence

How Knowledge‑Guided Context Optimization Boosts Zero‑Shot Vision‑Language Models

The article analyzes the Base‑to‑New generalization problem of CLIP‑based visual‑language models, explains why standard prompt tuning (CoOp) forgets base knowledge, and presents the KgCoOp framework that adds a knowledge‑guided loss to keep learned prompts close to hand‑crafted ones, dramatically improving unseen‑class performance while preserving efficiency.

CLIP · Knowledge-guided Optimization · Prompt Tuning

0 likes · 12 min read

AI Algorithm Path

Feb 17, 2026 · Artificial Intelligence

Why Contrastive Learning Is the Core Foundation of Visual Language Models

The article explains how contrastive learning replaces fixed‑category visual training with a relationship‑based approach, detailing the dual‑encoder architecture, cosine similarity loss, batch scaling, temperature control, zero‑shot capabilities, scalability from web data, and the method's strengths and limitations in modern multimodal AI.

CLIP · Multimodal AI · contrastive learning

0 likes · 25 min read

AntTech

Feb 5, 2026 · Artificial Intelligence

How Triple Alignment and Rationale Generation Supercharge Knowledge‑Based VQA

This paper presents a lightweight, high‑efficiency framework called Triple Alignment with Rationale Generation (TAG) that transforms knowledge‑based visual question answering into a contrastive learning task, dramatically reducing trainable parameters while achieving state‑of‑the‑art performance on major KVQA benchmarks.

CLIP · Multimodal · VQA

0 likes · 7 min read

xkx's Tech General Store

Jan 29, 2026 · Artificial Intelligence

Understanding CLIP: The Image‑Text Translator Behind Text‑to‑Image Models

This article explains CLIP’s dual‑encoder architecture, contrastive training, and zero‑shot inference, then demonstrates its use through image‑text matching and CIFAR‑10 classification experiments with code examples, highlighting strengths and limitations such as resolution mismatch.

CLIP · Image-Text Matching · PyTorch

0 likes · 11 min read

Sohu Tech Products

Jul 23, 2025 · Artificial Intelligence

Boosting Video Moderation with Multimodal CLIP and Efficient Vector Search

This article describes how a video review system combines multimodal CLIP models, image‑text feature alignment, and optimized vector‑search databases such as RediSearch and Elasticsearch to detect prohibited content in real time and perform large‑scale historical recall, while addressing challenges of generalization, storage cost, and inference speed.

AI · CLIP · model fine-tuning

0 likes · 18 min read

AI Algorithm Path

Jul 15, 2025 · Artificial Intelligence

Day 8: Fine‑Tuning CLIP for Image‑Text Tasks – A Beginner’s Guide

This tutorial walks through fine‑tuning OpenAI's CLIP ViT‑B/32 on a small image‑text dataset in a Kaggle notebook, covering environment setup, model loading, data preprocessing with CLIPProcessor, training a linear head, and observing loss convergence to align visual and textual embeddings.

CLIP · HuggingFace · Kaggle

0 likes · 5 min read

Instant Consumer Technology Team

Jul 10, 2025 · Artificial Intelligence

How LLMs and Vector Search Power Real-Time Icon Recommendations

This article explains a system that combines large language models with multimodal vector retrieval to automatically understand user intent and instantly recommend the most relevant icons, detailing the workflow, semantic vectorization, offline indexing, online inference, and evaluation methods.

CLIP · HNSW · LLM

0 likes · 13 min read

AI Algorithm Path

Jul 5, 2025 · Artificial Intelligence

Beginner’s Guide to Vision‑Language Models Day 7: How CLIP Achieves Joint Visual‑Language Understanding

This article explains CLIP’s dual‑encoder architecture—using a Vision Transformer for images and a Transformer for text—how both encoders map inputs into a shared embedding space, the role of cosine similarity, and the InfoNCE contrastive loss that drives joint visual‑language learning.

CLIP · InfoNCE · Multi-modal Embedding

0 likes · 8 min read
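
A minimal sketch of the symmetric InfoNCE loss described in the summary above, in pure Python. The toy 2-D embeddings, the helper names, and the temperature of 0.07 are assumptions for illustration and are not taken from the article.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_entropy_row(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def info_nce(image_embs, text_embs, temperature=0.07):
    # Pair k in the batch is the positive; every other pairing is a negative.
    n = len(image_embs)
    logits = [[cosine_similarity(img, txt) / temperature for txt in text_embs]
              for img in image_embs]
    # Image-to-text direction: each row should pick its own column.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should pick its own row.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(images, [[1.0, 0.1], [0.1, 1.0]])   # matching pairs line up
shuffled = info_nce(images, [[0.1, 1.0], [1.0, 0.1]])  # pairs are swapped
print(aligned < shuffled)
```

The loss is small when each image is most similar to its own caption and large when the pairing is swapped, which is the signal that pulls matching pairs together in the shared embedding space.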

AI Algorithm Path

Jul 1, 2025 · Artificial Intelligence

Beginner’s Guide to CLIP Inference: Step‑by‑Step with Hugging Face

This tutorial walks through loading the openai/clip‑vit‑base‑patch32 model with Hugging Face, preprocessing images and text, encoding them into a shared embedding space, computing cosine similarity for zero‑shot image‑text matching, and visualizing the results, all with concrete code examples.

CLIP · Cosine Similarity · Hugging Face

0 likes · 6 min read
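
The matching step in the summary above can be sketched without loading a model. The 3-D vectors below stand in for real CLIP embeddings, and the prompt strings and the `logit_scale` of 100 are assumptions for illustration, not values from the tutorial.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    # Score the image against every prompt, then apply a softmax.
    sims = {label: cosine_similarity(image_emb, emb)
            for label, emb in text_embs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(logit_scale * (s - top)) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical embeddings; a real pipeline would take these from the encoders.
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.0, 0.1, 1.0],
}

probs = zero_shot_probs(image_emb, text_embs)
best = max(probs, key=probs.get)
print(best)
```

The prompt with the highest cosine similarity to the image wins, so adding a new class only requires adding a new prompt.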

AI Algorithm Path

Jun 29, 2025 · Artificial Intelligence

Understanding CLIP: Theory, Architecture, and Zero‑Shot Vision

CLIP (Contrastive Language‑Image Pre‑training) is an OpenAI model that learns visual concepts from 400 million image‑text pairs using a dual‑encoder architecture, enabling zero‑shot classification, flexible text‑driven search, and cross‑modal reasoning, while its strengths, limitations, and emerging applications are examined in detail.

CLIP · Contrastive Language-Image Pretraining · Dual Encoder

0 likes · 15 min read

Network Intelligence Research Center (NIRC)

May 14, 2025 · Artificial Intelligence

Hands‑On CLIP: Implementing Multimodal Vision‑Language Understanding

This article introduces OpenAI’s CLIP multimodal model, explains its architecture and contrastive training, details hardware and installation steps, and demonstrates a hands‑on zero‑shot image classification workflow that achieves 97% confidence on a cat image without any task‑specific fine‑tuning.

CLIP · Multimodal · Python

0 likes · 6 min read

AIWalker

Apr 7, 2025 · Artificial Intelligence

Is CLIP Obsolete? LeCun and Xie's New Multimodal Model Beats Language Supervision

A recent study by LeCun, Xie, and collaborators shows that large‑scale visual self‑supervised learning (Web‑SSL) can match or surpass CLIP on diverse VQA tasks, even without any language supervision, by scaling model size and data volume.

CLIP · Model Scaling · Multimodal

0 likes · 13 min read

NewBeeNLP

Mar 18, 2025 · Interview Experience

How to Ace Multimodal Model Interviews at Taobao's Search AI Division

This article recounts a three‑stage interview for a multimodal large‑model position at Taobao's Search AI division, detailing typical questions on CLIP, LoRA, BLIP, Qwen‑VL, Transformer fundamentals, RLHF, and coding challenges, and offers insights on what interviewers focus on.

AI · CLIP · LoRA

0 likes · 5 min read

AIWalker

Jan 10, 2025 · Artificial Intelligence

How a Simplified Transformer Enables Lightweight CLIP Training on a Single RTX3090

This paper presents SiCLIP, a framework that simplifies the Transformer architecture, combines weight‑sharing, multi‑stage knowledge distillation, and a novel pair‑matching loss with synthetic captions to train a competitive CLIP model using only one RTX3090 GPU and 1 TB of storage, achieving state‑of‑the‑art data‑size‑parameter‑accuracy trade‑offs.

CLIP · Data Augmentation · Lightweight Training

0 likes · 19 min read

Meituan Technology Team

Nov 21, 2024 · Frontend Development

AutoConsis: Automated UI Consistency Detection for Mobile Apps Using Multimodal AI

AutoConsis is a research‑driven, AI‑powered workflow that automatically detects UI content inconsistencies across mobile app pages by combining target region recognition, OCR‑based extraction, and large language model reasoning, achieving low cost, high generalization, and high confidence as demonstrated on Meituan's large‑scale marketing scenarios.

CLIP · ICSE 2024 · Large Language Model

0 likes · 15 min read

Tencent Cloud Developer

Oct 30, 2024 · Artificial Intelligence

Comprehensive Survey of AIGC Research: Papers, Resources, and Technical Overview

This survey acts as a comprehensive portal that organizes AIGC research across seven domains—text, image, and audio generation, cross‑modal association, text‑guided image and audio synthesis, and supporting resources—detailing seminal models such as GPT, Diffusion, CLIP, DALL·E, Stable Diffusion, MusicLM, and key papers that shaped each field.

AIGC · CLIP · Diffusion Models

0 likes · 19 min read

Bilibili Tech

Aug 27, 2024 · Artificial Intelligence

Multimodal Video Scene Classification for Adaptive Video Processing

The paper presents a multimodal video scene classification system that leverages CLIP‑generated pseudo‑labels and a fine‑tuned image encoder to automatically identify nature, animation/game, and document scenes, enabling more effective adaptive transcoding, intelligent restoration, and quality assessment for user‑generated content on platforms such as Bilibili.

Bilibili multimedia · CLIP · Multimodal Learning

0 likes · 17 min read

Sohu Tech Products

May 21, 2024 · Artificial Intelligence

OPPO Multimodal Pretrained Model Deployment in Cloud-Edge Scenarios: Practices and Optimizations

OPPO details how it deploys multimodal pretrained models on resource‑constrained edge devices by compressing CLIP‑based image‑text retrieval, adapting Chinese text‑to‑image generation with LoRA and adapters, and lightweighting diffusion models through layer pruning and progressive distillation, achieving sub‑3‑second generation while preserving cloud‑level quality.

CLIP · Distillation · Edge deployment

0 likes · 18 min read

Architecture and Beyond

Feb 8, 2024 · Artificial Intelligence

Mastering AIGC: 15 Essential AI Terms and Key Technologies Explained

This article provides a comprehensive overview of core AI concepts, from basic definitions of AI, AGI, and AIGC to detailed explanations of GPUs, major generative models, leading AI products, and influential companies, helping readers quickly grasp the landscape of AI-generated content.

AI · AIGC · CLIP

0 likes · 24 min read

Ximalaya Technology Team

Feb 1, 2024 · Artificial Intelligence

Understanding AI Image Generation: Diffusion Models, CLIP, and Control Techniques

This guide explains how AI image generators such as Stable Diffusion and DALL·E 3 turn text prompts into pictures by using diffusion models, CLIP‑aligned embeddings, and optional controls like negative prompts, fine‑tuned LoRA checkpoints and ControlNet conditioning, highlighting their differences, workflow, and practical customization.

AI image generation · CLIP · ControlNet

0 likes · 18 min read

Zhuanzhuan Tech

Nov 29, 2023 · Artificial Intelligence

Applying CLIP and Milvus for Image Similarity Search in E‑commerce Risk Control

The article explains how an e‑commerce risk‑control team leverages OpenAI's CLIP model to generate image and text embeddings and stores them in the Milvus cloud‑native vector database to enable fast, scalable similarity searches for compliance verification and risk detection.

AI · CLIP · Milvus

0 likes · 11 min read

dbaplus Community

Nov 27, 2023 · Artificial Intelligence

Build an Image‑Search Engine with Elasticsearch 8.x and CLIP

This guide explains how to implement reverse image search by extracting visual features with a multilingual CLIP model, storing the vectors in Elasticsearch 8.x, and using its k‑NN search to retrieve similar images, covering architecture, tools, code snippets, and results.

CLIP · deep learning · image search

0 likes · 9 min read

DataFunTalk

Nov 24, 2023 · Artificial Intelligence

Open Vocabulary Detection Contest 2023: Summary of Winning Teams' Technical Solutions

The article reviews the Open Vocabulary Detection Contest organized by the Chinese Society of Image and Graphics and 360 AI Institute, describing the competition setup, dataset characteristics, and detailed winning approaches that combine Detic, CLIP, prompt learning, and multi‑stage pipelines to achieve strong few‑shot and zero‑shot object detection performance.

CLIP · Open-Vocabulary Detection · competition

0 likes · 17 min read

Volcano Engine Developer Services

Aug 11, 2023 · Artificial Intelligence

Build an End-to-End Image-and-Text Search Engine with CLIP and ESCloud

This guide shows how to quickly create a complete image-and-text search solution using Volcano Engine's ESCloud, the CLIP model for feature extraction, and Python, covering data preparation, environment setup, index mapping, bulk indexing, and both text-to-image and image-to-image queries.

CLIP · Elasticsearch · Python

0 likes · 8 min read

360 Tech Engineering

May 6, 2023 · Artificial Intelligence

Open‑Vocabulary Object Detection: Overview of OVR‑CNN, RegionCLIP, and CORA

This article reviews the evolution of open‑vocabulary object detection, describing the OVR‑CNN paradigm, the RegionCLIP enhancements, and the CORA model with region prompting and anchor pre‑matching, and discusses their impact on future multimodal AI systems.

CLIP · CORA · OVR-CNN

0 likes · 14 min read

58UXD

Mar 7, 2023 · Artificial Intelligence

How Diffusion Models Power AI Image Generation: From Prompts to Pictures

This article explains how modern AI image generators like Midjourney and Stable Diffusion use diffusion models, large training datasets, deep learning, latent spaces, and CLIP to transform textual prompts into high‑quality images, while also discussing the impact on designers and future collaboration opportunities.

CLIP · Midjourney · Stable Diffusion

0 likes · 7 min read

IT Services Circle

Jun 6, 2022 · Artificial Intelligence

AI Image Generation Showdown: Google Imagen vs OpenAI DALL·E on the "Tiger Wearing VR" Prompt

The article compares Google’s Imagen and OpenAI’s DALL·E by feeding them the whimsical "Tiger Wearing VR" prompt, showcasing each model’s visual style, underlying architecture—including CLIP, diffusion, and T5‑XXL—and community reactions to the resulting AI‑generated artwork.

AI · CLIP · Diffusion Models

0 likes · 5 min read

DaTaobao Tech

May 24, 2022 · Artificial Intelligence

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

GEN‑VLKT introduces a Guided‑Embedding Network with position‑ and instance‑guided embeddings to remove costly post‑processing and leverages CLIP‑based visual‑linguistic knowledge transfer for interaction understanding, achieving state‑of‑the‑art HOI detection performance and zero‑shot capability, now deployed in Alibaba’s Taobao services.

CLIP · HOI detection · Transformer

0 likes · 7 min read

Baobao Algorithm Notes

Mar 7, 2022 · Artificial Intelligence

How CLIP Uses Natural Language Supervision for Powerful Zero‑Shot Vision

This article explains CLIP’s multimodal contrastive pre‑training, its simple yet effective architecture, code implementation, and how its zero‑shot capability can surpass supervised ImageNet models by leveraging a 400‑million image‑text dataset and shared semantic embeddings.

AI · CLIP · Multimodal

0 likes · 7 min read

MaGe Linux Operations

Apr 14, 2021 · Fundamentals

5 Elegant NumPy Functions for Efficient Data Processing

This article introduces five lesser‑known but powerful NumPy functions—reshape with -1, argpartition, clip, extract, and setdiff1d—explaining their behavior, showcasing code examples, and highlighting how they simplify complex data manipulation tasks.

CLIP · Extract · argpartition

0 likes · 7 min read
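
The five functions named in the summary above can be sketched as follows. The array values are made up for illustration and are not the article's own examples.

```python
import numpy as np

values = np.array([10, 7, 4, 3, 2, 2, 5, 9, 0, 4, 6, 0])

# reshape with -1: NumPy infers the missing dimension (12 items -> 3 x 4).
grid = values.reshape(3, -1)

# argpartition: indices of the 3 largest values without a full sort.
# The three indices come back in no guaranteed order, so sort the values after.
top3_idx = np.argpartition(values, -3)[-3:]
top3 = np.sort(values[top3_idx])

# clip: clamp every element into the closed range [2, 6].
clipped = np.clip(values, 2, 6)

# extract: keep the elements where the condition is true, in original order.
odd = np.extract(values % 2 == 1, values)

# setdiff1d: sorted unique values in the first array that are not in the second.
diff = np.setdiff1d(np.array([1, 2, 3, 4, 5]), np.array([2, 4, 6]))

print(grid.shape, top3, clipped, odd, diff)
```

Note that `np.clip` here is the NumPy clamping function, unrelated to the CLIP vision-language model that the rest of this tag covers.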

Efficient Ops

Oct 19, 2015 · Operations

Step-by-Step Guide to Installing and Using Clip Server and SDK on Linux

This article provides a comprehensive tutorial on installing the Clip Server (Apache, PHP, MySQL), configuring its virtual host, setting up the Clip SDK with Python, and using various Clip commands to manage IP relationships, all illustrated with command examples and screenshots.

CLIP · Installation · Linux

0 likes · 12 min read
