Tagged articles
650 articles
Machine Heart
May 17, 2026 · Artificial Intelligence

ViT³: Vision Test‑Time Training Architecture Breaking Transformer Complexity (CVPR 2026 Oral)

The paper systematically studies Test‑Time Training (TTT) for vision, derives six design principles, and introduces ViT³—a pure TTT architecture that uses full‑batch internal training, a learning rate of 1.0, and lightweight SwiGLU‑Depthwise convolution modules, achieving state‑of‑the‑art linear‑complexity performance across classification, detection, segmentation and generation tasks.

Computer Vision · Linear Complexity · Sequence Modeling
0 likes · 14 min read
Data Party THU
May 15, 2026 · Artificial Intelligence

94% Precision: YOLO11‑Based Detection of Near‑Earth Object and Satellite Streaks

The StreakMind system built by the Spanish Royal Navy Academy uses a YOLO11‑OBB detector trained on over 2,000 real astronomical images and 280 synthetic streaks to automatically identify satellite and asteroid streaks with 94% precision and 97% recall, delivering standardized database entries and robust frame‑to‑frame tracking.

Computer Vision · StreakMind · YOLO11
0 likes · 10 min read
Machine Heart
May 14, 2026 · Artificial Intelligence

Breaking the 3D Perception Bottleneck: VGGT Series Enables Dynamic High‑Fidelity Reconstruction

The VGGT series from KOKONI 3D and collaborators tackles three core 3D perception limits—unbounded sequence memory, dynamic‑static entanglement, and compute‑precision trade‑offs—by introducing StreamCacheVGGT, progressive decoupling, and HD‑VGGT, achieving O(1) memory streaming, 15%+ accuracy gains on dynamic benchmarks, and record‑high AUC on RealEstate10K.

3D reconstruction · Computer Vision · VGGT
0 likes · 10 min read
Machine Heart
May 6, 2026 · Artificial Intelligence

Scal3R Enables Stable Kilometer-Scale 3D Reconstruction of Long Videos

Scal3R introduces test‑time training with a global‑context memory and synchronization mechanism that lets models train on and infer over ultra‑long video sequences, achieving accurate camera poses and dense point clouds for kilometer‑scale scenes while outperforming prior SLAM, SfM and streaming baselines on multiple benchmarks.

3D reconstruction · Computer Vision · Scal3R
0 likes · 11 min read
Machine Heart
May 3, 2026 · Artificial Intelligence

How LEADER Beats Traditional LiDAR Relocalization in Accuracy and Speed

The LEADER framework achieves ten‑millisecond "eye‑open" LiDAR relocalization while surpassing the decimeter‑level accuracy of classic retrieval‑registration pipelines, using cylindrical projection, sparse convolution, and a Truncated Relative Reliability loss, as demonstrated on the NCLT benchmark.

Computer Vision · LEADER · LiDAR
0 likes · 9 min read
AI Explorer
May 2, 2026 · Artificial Intelligence

How DeepSeek’s “Cyber Finger” Gives AI a Physical Sense of the World

DeepSeek introduces a “cyber finger” that lets AI not only recognize objects but also infer their spatial relationships, orientations, and manipulability, turning visual perception into a digital simulation of touch and enabling more realistic interaction in robotics, AR, and assistive technologies.

AI · Computer Vision · DeepSeek
0 likes · 6 min read
Geek Labs
Apr 30, 2026 · Artificial Intelligence

Why the 14-Year-Old ccv Library Remains a Top Choice for Modern Computer Vision

The ccv library, created in 2010 and still actively maintained, offers a highly portable C‑based computer‑vision toolkit with minimal dependencies, a built‑in cache for preprocessing, a full libnnc neural‑network runtime, and easy builds via Bazel, Make, or Swift Package Manager.

C library · Computer Vision · Neural Network
0 likes · 5 min read
Machine Heart
Apr 27, 2026 · Artificial Intelligence

Google DeepMind Open‑Sources TIPSv2: State‑of‑the‑Art Patch‑Text Alignment at CVPR 2026

The DeepMind team unveils TIPSv2, a vision‑language pre‑training model that dramatically improves patch‑level image‑text alignment through iBOT++, Head‑only EMA, and multi‑granularity captions, achieving record‑breaking results on nine tasks across twenty datasets while remaining fully open‑source.

Computer Vision · DeepMind · Multimodal Pretraining
0 likes · 12 min read
Machine Heart
Apr 19, 2026 · Artificial Intelligence

How Google Turns Your CAPTCHA Clicks into Training Data for the Next Generation of AI

The article explains how YouTube’s AI‑video rating and Google’s reCAPTCHA system covertly collect billions of user interactions each day, converting them into labeled data that fuels Google’s computer‑vision models such as Veo, Maps and Waymo, effectively turning routine security checks into a massive, unpaid AI training workforce.

AI training · Computer Vision · Google
0 likes · 7 min read
AI Explorer
Apr 16, 2026 · Artificial Intelligence

AI Tech Daily: Top AI Research and Industry Updates on April 16 2026

This roundup highlights recent AI breakthroughs such as NVIDIA‑MIT’s Sol‑RL framework for faster diffusion model training, Peking University’s CPL++ visual localization improvement, DeepMind’s TIPSv2 for image recognition, Boston Dynamics Spot’s AI upgrade, Anthropic’s safety paper, a major MCP protocol vulnerability, OpenAI’s GPT‑5.4 release, and the shifting AI video landscape.

AI · AI Safety · Computer Vision
0 likes · 5 min read
Machine Heart
Apr 16, 2026 · Artificial Intelligence

CPL++: A Self‑Aware, Self‑Correcting Framework for Weakly Supervised Visual Grounding

The CPL++ framework equips weakly supervised visual grounding models with confidence‑aware pseudo‑label learning, self‑supervised association correction, and dynamic validation, enabling the model to detect and amend erroneous region‑query links during training, which yields absolute performance gains of 1–6% across five benchmark datasets.

Computer Vision · Visual Grounding · Weak Supervision
0 likes · 9 min read
AIWalker
Apr 10, 2026 · Artificial Intelligence

How RealRestorer Bridges the Gap in Real‑World Image Restoration

RealRestorer leverages large‑scale image‑editing models, a hybrid synthetic‑and‑real degradation pipeline, and a two‑stage training strategy to deliver state‑of‑the‑art open‑source restoration that generalizes across nine real‑world degradation types while preserving content consistency.

Benchmark · Computer Vision · Deep Learning
0 likes · 13 min read
HyperAI Super Neural
Apr 9, 2026 · Artificial Intelligence

Cornell’s EMSeek Generates Insights from EM Images in 2–5 Minutes, 50× Faster Than Experts

EMSeek, a modular multi‑agent platform from Cornell, integrates perception, structural reconstruction, property prediction, and literature reasoning to automate electron microscopy analysis across 20 material systems and five tasks, achieving up to twice the speed of Segment Anything, over 90% structural similarity, and a 50‑fold reduction in processing time compared with expert workflows, while requiring only about 2% labeled data for calibration.

Computer Vision · Deep Learning · EMSeek
0 likes · 16 min read
JD Cloud Developers
Apr 8, 2026 · Artificial Intelligence

How JoyAI-Image-Edit Brings Spatial Intelligence to Open‑Source Image Editing

JoyAI-Image-Edit, an open‑source multimodal foundation model from JD Research Institute, integrates text‑to‑image generation, image understanding, and instruction‑driven spatial editing, achieving world‑leading spatial perception and editing capabilities that unlock new applications across e‑commerce, robotics, 3D reconstruction, and design.

Computer Vision · Generative Models · Multimodal AI
0 likes · 7 min read
AIWalker
Apr 6, 2026 · Artificial Intelligence

BIPNet: Adaptive Progressive Upsampling Drives a Leap in Burst Image Restoration (TPAMI 2025)

The TPAMI 2025 paper introduces BIPNet, a unified burst‑image framework that tackles alignment, fusion, and upsampling challenges with edge‑enhanced alignment, pseudo‑burst feature fusion, and adaptive group upsampling, achieving state‑of‑the‑art results across super‑resolution, low‑light enhancement, and denoising while offering lightweight mobile variants.

BIPNet · Burst Image Processing · Computer Vision
0 likes · 13 min read
AIWalker
Apr 6, 2026 · Artificial Intelligence

How TIR‑Agent Turns Image‑Restoration Tools into a Learnable Decision‑Making Agent

The paper introduces TIR‑Agent, an image‑restoration agent that learns a tool‑calling policy via supervised fine‑tuning and reinforcement learning, addressing exploration stagnation and multi‑objective reward imbalance, and demonstrates over 2.5× faster inference and superior multi‑metric performance on synthetic and real degradation datasets.

Computer Vision · Image Restoration · agent-based AI
0 likes · 18 min read
Data Party THU
Apr 1, 2026 · Artificial Intelligence

How SwiftTailor Accelerates Realistic 3D Garment Generation

SwiftTailor introduces a two‑stage, geometry‑centric framework that unifies pattern inference and mesh synthesis, dramatically cutting inference time to seconds while achieving state‑of‑the‑art accuracy and visual realism on the Multimodal GarmentCodeData benchmark for digital fashion.

3D garment generation · AI · Computer Vision
0 likes · 4 min read
Machine Heart
Mar 30, 2026 · Artificial Intelligence

InfoTok: Information-Theoretic Adaptive Video Tokenizer Redefines Efficient Tokenization (ICLR 2026 Oral)

InfoTok, a collaborative effort by Stanford, NVIDIA Cosmos, and NUS, leverages information theory and an ELBO‑based router to allocate tokens adaptively, achieving 2.3× higher compression, 11× faster inference, and superior reconstruction quality on benchmarks such as TokenBench and DAVIS.

Computer Vision · ELBO · ICLR 2026
0 likes · 11 min read
Data Party THU
Mar 29, 2026 · Artificial Intelligence

How LoGeR Enables Minute‑Long 3D Reconstruction with Hybrid Memory

The article presents LoGeR, a long‑context geometric reconstruction framework that combines test‑time‑training memory and sliding‑window attention to achieve minute‑scale, fully‑feedforward 3D reconstruction with superior accuracy on benchmarks such as KITTI and VBR.

3D reconstruction · Computer Vision · Hybrid Memory
0 likes · 11 min read
AIWalker
Mar 23, 2026 · Artificial Intelligence

Dynamic Dense Computing and Minimal End‑to‑End Design: YOLO-Master & YOLO26

By introducing a dynamic mixture‑of‑experts routing scheme and an end‑to‑end architecture that eliminates NMS and DFL, YOLO‑Master and YOLO26 dramatically cut compute waste and latency on edge devices, achieving up to 43% faster CPU inference while keeping model accuracy, with all code openly released.

Computer Vision · Mixture of Experts · Model Optimization
0 likes · 7 min read
AI Frontier Lectures
Mar 19, 2026 · Artificial Intelligence

Can Circulant Attention Reduce Vision Transformer Cost by 7×?

The article reviews the AAAI 2026 paper "Vision Transformers are Circulant Attention Learners", explaining how modeling self‑attention as a Block‑Circulant matrix enables FFT‑based multiplication that cuts the quadratic complexity of standard attention, achieving up to seven‑fold inference speed‑up while preserving accuracy across ImageNet, COCO and ADE20K benchmarks.

BCCB Matrix · Circulant Attention · Computer Vision
0 likes · 15 min read
AI Frontier Lectures
Mar 19, 2026 · Artificial Intelligence

Why Sharing Parameters in Vision Transformers Hurts Performance—and How Layer Specialization Fixes It

The article analyzes the hidden conflict between [CLS] and patch tokens in Vision Transformers, reveals how shared normalization and linear layers cause computational friction, and demonstrates that layer‑specific parameters dramatically improve dense prediction tasks without increasing inference FLOPs.

Computer Vision · Dense Prediction · Layer Specialization
0 likes · 9 min read
AIWalker
Mar 18, 2026 · Artificial Intelligence

7× Faster Inference: Tsinghua’s Huang‑Gao Team Redesigns Vision‑Transformer Attention via Fourier Transforms

The AAAI 2026 paper by Tsinghua’s Huang‑Gao team shows that modeling Vision‑Transformer attention as a Block‑Circulant matrix and computing it with FFT reduces the quadratic complexity to O(N log N), delivering up to seven‑fold real‑world speedups without sacrificing accuracy.

AAAI 2026 · Circulant Matrices · Computer Vision
0 likes · 15 min read
SuanNi
Mar 16, 2026 · Artificial Intelligence

How NaLaFormer Revives Linear Attention with Query‑Norm Awareness

NaLaFormer introduces a norm‑aware linear attention mechanism that restores the query‑norm‑driven sharpness of softmax attention, achieving up to 7.5% higher ImageNet accuracy and 92% memory reduction in super‑resolution, while delivering strong results across classification, detection, segmentation, and language modeling tasks.

AI · Computer Vision · Linear Attention
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
Mar 15, 2026 · Artificial Intelligence

A 17‑Year‑Old High‑Schooler Becomes First‑Author on a CVPR Paper

A 17‑year‑old high‑school student from Anhui Ansheng School is the first author of the CVPR 2026 paper "CraftMesh," which presents a novel 3D mesh editing framework that combines image editing, mesh generation, and Poisson seamless fusion and achieves superior quantitative metrics, showcasing the growing impact of young researchers at top AI conferences.

3D mesh generation · CVPR · Computer Vision
0 likes · 7 min read
AIWalker
Mar 7, 2026 · Artificial Intelligence

YOLO-Master v2026.02 Unveils Four Innovations for SOTA Object Detection

Tencent’s YOLO-Master v2026.02 adds a Mixture‑of‑Experts architecture, zero‑overhead LoRA fine‑tuning, Sparse SAHI inference for large images, and Cluster‑Weighted NMS, delivering 3‑5× faster inference, up to 70% reduced training resources, and markedly higher detection accuracy across diverse benchmarks.

Computer Vision · LoRA · Mixture of Experts
0 likes · 15 min read
Code Mala Tang
Mar 5, 2026 · Artificial Intelligence

Master YOLOv12: A Step‑by‑Step Guide to Build, Train, and Deploy Custom Models

This tutorial walks readers through the fundamentals of YOLOv12, covering model variants, dataset preparation with Roboflow, optional FlashAttention acceleration, installation, model selection, training commands, post‑training tasks such as tracking, validation, inference, exporting to ONNX, and benchmarking, all with concrete code snippets and practical tips.

Computer Vision · FlashAttention · Model Training
0 likes · 8 min read
Code Mala Tang
Mar 1, 2026 · Artificial Intelligence

Why YOLO Dominates Real-Time Object Detection: A Complete Guide

This article provides a comprehensive overview of the YOLO (You Only Look Once) algorithm, explaining its core principles, architecture, version history, training workflow, real‑world applications, strengths, and current limitations for modern computer‑vision tasks.

Computer Vision · Deep Learning · Real-Time
0 likes · 9 min read
AIWalker
Feb 26, 2026 · Artificial Intelligence

Overcoming Vision Transformer Bottlenecks: The Plug‑and‑Play Upgrade of ViT‑5

ViT‑5 systematically revisits five years of Transformer architecture advances, introducing seven plug‑and‑play components—LayerScale, RMSNorm, GeLU, dual positional encodings, high‑frequency RoPE for register tokens, QK‑Norm, and bias‑free projections—that together raise ImageNet‑1k Top‑1 accuracy to 84.2% (Base) and achieve superior performance across classification, generation, and segmentation tasks.

Computer Vision · ViT-5 · Vision Transformer
0 likes · 14 min read
Data Party THU
Feb 19, 2026 · Artificial Intelligence

How Data Priors and Scene Parameterization Boost 3D Indoor Reconstruction

This thesis investigates the two core challenges of data prior utilization and scene parameterization in multi‑view RGB‑based 3D indoor reconstruction, proposing novel representations and learning‑based methods to improve reconstruction quality, generalization, and applicability across AR, robotics, and autonomous navigation.

3D reconstruction · Computer Vision · data priors
0 likes · 8 min read
AI Algorithm Path
Feb 18, 2026 · Artificial Intelligence

Using Autoencoders for Industrial Defect Detection

This article explains how to train a simple fully‑connected autoencoder on defect‑free images, use reconstruction error to highlight anomalies in industrial parts, and convert the error into a single metric that cleanly separates good from defective components.

Autoencoder · Computer Vision · Keras
0 likes · 7 min read
AI Cyberspace
Feb 13, 2026 · Artificial Intelligence

How Attention Mechanisms Revolutionized Computer Vision and Machine Translation

This article traces the evolution of attention mechanisms from their first applications in computer vision and machine translation to their central role in modern Transformer models, detailing the underlying RNN‑Attention designs, the breakthrough in sequence alignment, and the innovations that enabled high‑performance, parallelizable deep learning architectures.

Attention Mechanism · Computer Vision · Deep Learning
0 likes · 14 min read
php Courses
Dec 9, 2025 · Artificial Intelligence

How to Supercharge Your PHP Apps with AI: A Practical Guide

This guide explains why PHP applications need AI, outlines core AI use cases such as intelligent content processing, computer vision, personalization, and chatbots, and provides step‑by‑step implementation paths, tools, best‑practice recommendations, real‑world case studies, and future trends for developers.

AI integration · Computer Vision · NLP
0 likes · 10 min read
Kuaishou Tech
Dec 4, 2025 · Artificial Intelligence

Can a Tree‑Reasoned Model Master Video Emotion Understanding?

The paper introduces VidEmo, a multimodal video foundation model that uses a two‑stage emotion‑clue‑guided reasoning framework and a large emotion‑centric dataset (Emo‑CFG) to achieve state‑of‑the‑art performance on facial attribute, expression, and fine‑grained emotion tasks, surpassing Gemini 2.0.

AI · Computer Vision · Dataset
0 likes · 15 min read
Tencent Technical Engineering
Nov 5, 2025 · Artificial Intelligence

iDetex: The Winning AI Model Transforming Image Quality Assessment

iDetex, the champion solution of the ICCV 2025 MIPI Detailed Image Quality Assessment Challenge, introduces a novel multimodal LLM-driven framework that precisely locates, describes, and grades image distortions, outperforming traditional IQA models and enabling practical deployments across video, live streaming, e‑commerce, and image‑processing pipelines.

AI · Computer Vision · ICCV 2025
0 likes · 18 min read
JD Tech Talk
Nov 4, 2025 · Artificial Intelligence

How AI-Powered Virtual Try-On Transforms Fashion E‑Commerce

The article explains how JD.com's AI virtual try‑on system Oxygen Tryon uses advanced computer‑vision and generative models to let shoppers instantly preview clothing on their own photos, dramatically improving purchase decisions, reducing return rates, and outlining technical challenges, innovations, and future development plans.

AI · Computer Vision · Deep Learning
0 likes · 7 min read
JD Cloud Developers
Nov 4, 2025 · Artificial Intelligence

How AI-Powered Virtual Try‑On Is Revolutionizing Fashion E‑Commerce

The article explains how JD.com's AI try‑on system Oxygen Tryon uses advanced computer‑vision models to let shoppers instantly preview garments on their own photos, dramatically improving fit perception, reducing return rates, and outlining future technical and business expansions.

AI · Computer Vision · Fashion E‑commerce
0 likes · 6 min read
AsiaInfo Technology: New Tech Exploration
Nov 4, 2025 · Artificial Intelligence

How Multimodal Large Models Are Revolutionizing Video Analysis

This article examines the evolution from single‑frame video analysis to multimodal large models, detailing their architecture, optimization techniques, experimental validation on edge devices, and practical scenarios, while highlighting current limitations and future directions for AI‑driven video understanding.

AI · Computer Vision · Edge Computing
0 likes · 20 min read
AI Algorithm Path
Nov 1, 2025 · Artificial Intelligence

Deep Dive into Vision Transformer Patch Embedding Mechanisms

This article explains how Vision Transformers convert images into patch embeddings, compares flattening versus convolutional approaches, discusses position and CLS tokens, analyzes the effect of patch size, explores pixel‑level tokens, and contrasts ViT’s inductive bias with CNNs.

Computer Vision · Convolution · Inductive Bias
0 likes · 10 min read
Liangxu Linux
Oct 29, 2025 · Artificial Intelligence

7 Must‑Try Open‑Source Tools for Remote Jobs, AI, and Dev Productivity

This article curates seven open‑source projects—including a remote‑work company list, a versatile file‑conversion platform, a personal finance manager, an AI‑powered resume optimizer, Claude Code resources, a computer‑vision toolbox, and a lightweight AI assistant—each with key features and GitHub links for easy adoption.

AI tools · Computer Vision · file conversion
0 likes · 7 min read
Network Intelligence Research Center (NIRC)
Oct 24, 2025 · Artificial Intelligence

Next‑Gen VR Interaction via Micro‑Gesture Recognition: The “MiaoKong Virtual Realm” Demo

At Beijing University of Posts and Telecommunications' 70th anniversary, the Network Intelligence Research Center showcased a micro‑gesture‑driven VR system that captures millimeter‑scale finger motions with high‑precision, low‑latency hand tracking, delivering efficient, fatigue‑reducing interactions and earning strong audience approval.

Computer Vision · VR interaction · XR
0 likes · 8 min read
Alimama Tech
Oct 22, 2025 · Artificial Intelligence

How Alibaba’s AIGC Model Revolutionizes Virtual Fashion Try‑On

This article details Alibaba’s Taobao Star fashion AIGC model, explaining its data pipeline, captioning strategy, multi‑stage training, and impressive virtual try‑on results for users and merchants, while showcasing model‑based and model‑free generation and pose‑transfer capabilities.

AI · AIGC · Computer Vision
0 likes · 11 min read
Amap Tech
Oct 2, 2025 · Artificial Intelligence

How FantasyWorld Unifies Video Generation and 3D Geometry for Consistent Virtual Worlds

FantasyWorld introduces a geometry‑enhanced framework that augments a frozen video diffusion model with a trainable geometry branch, enabling simultaneous video representation and implicit 3D field generation, achieving spatially consistent, high‑quality virtual worlds and outperforming recent baselines in multi‑view coherence and geometric fidelity.

3D Modeling · Computer Vision · Multimodal AI
0 likes · 11 min read
HyperAI Super Neural
Sep 29, 2025 · Artificial Intelligence

8 Popular Remote Sensing Object Detection Datasets with One-Click Downloads

This article presents a curated list of eight widely used remote sensing object detection datasets covering indoor scenes, landslides, drone imagery, crop diseases, safety vests, human fractures, urban issues, and plant diseases, each with size estimates and direct download links for researchers.

AI · Computer Vision · Datasets
0 likes · 10 min read
Data Party THU
Sep 27, 2025 · Artificial Intelligence

How Depth-Guided Texture Diffusion Boosts Image Semantic Segmentation

This article reviews the depth‑guided texture diffusion method, detailing its texture extraction, diffusion, structural consistency optimization, and integration into segmentation networks, and shows how it narrows the depth‑RGB gap to achieve state‑of‑the‑art performance on various semantic segmentation tasks.

Computer Vision · depth-guided diffusion · semantic segmentation
0 likes · 13 min read
AntTech
Sep 25, 2025 · Artificial Intelligence

ICCV Spotlight: Pixel Tracing for Copy Detection and Skip-Vision Model Acceleration

The ICCV 2025 live session will deep‑dive into two cutting‑edge papers—PixTrace with CopyNCE for precise image copy detection and Skip‑Vision for dramatically faster training and inference of vision‑language models—showcasing their methods, results, and real‑world impact.

Computer Vision · ICCV 2025 · Vision-Language Models
0 likes · 5 min read
Data Party THU
Sep 16, 2025 · Artificial Intelligence

How Dynamic Snake Convolution Boosts Tubular Segmentation and Infrared Small Target Detection

This article reviews two recent AI papers that introduce dynamic convolution kernels guided by geometric or statistical priors and adaptive loss mechanisms, demonstrating significant improvements in tubular structure segmentation and infrared small‑target detection across multiple 2D and 3D datasets.

Computer Vision · Medical Image Segmentation · dynamic convolution
0 likes · 6 min read
AIWalker
Sep 2, 2025 · Artificial Intelligence

BEVANet’s Triple Boost for Real-Time Segmentation: Field, Edge, Speed

BEVANet tackles the efficiency‑accuracy trade‑off in real‑time semantic segmentation by integrating large‑kernel attention, an efficient visual attention (EVA) module, a bilateral architecture, and boundary‑guided adaptive fusion, delivering up to 81% mIoU on Cityscapes at 33 FPS and surpassing prior state‑of‑the‑art models on both accuracy and speed.

Computer Vision · Real-Time · efficiency
0 likes · 19 min read
AntTech
Aug 21, 2025 · Artificial Intelligence

How the Mixture-of-Queries Transformer Tackles Camouflaged Instance Segmentation

The IJCAI 2025 paper showcase introduces the Mixture‑of‑Queries Transformer, a novel model that combines frequency‑domain feature enhancement with collaborative query decoding to achieve state‑of‑the‑art camouflaged instance segmentation across multiple datasets.

Computer Vision · IJCAI 2025 · Transformer
0 likes · 4 min read
AIWalker
Aug 18, 2025 · Artificial Intelligence

UniConvNet: Expanding Effective Receptive Field for a SOTA CNN Vision Backbone (ICCV 2025)

UniConvNet introduces a three‑layer receptive‑field aggregator that combines small kernels to enlarge the effective receptive field while preserving its Gaussian distribution, achieving state‑of‑the‑art results on ImageNet‑1K, COCO2017 and ADE20K with only 30M parameters and 5.1G FLOPs.

CNN · Computer Vision · Effective Receptive Field
0 likes · 6 min read
AI Algorithm Path
Aug 16, 2025 · Artificial Intelligence

Meta Unveils DINOv3: A Universal Self‑Supervised Visual AI for All Image Tasks

Meta's DINOv3 is a 7‑billion‑parameter self‑supervised visual foundation model trained on 1.7 billion Instagram images without any labels, introducing dense feature extraction, Gram‑Anchoring to prevent feature collapse, high‑resolution adaptation, and multi‑student distillation that together enable out‑of‑the‑box performance on segmentation, depth estimation, 3D matching, and tracking while surpassing prior models such as DINOv2, CLIP, and SAM.

Computer Vision · DINOv3 · Gram Anchoring
0 likes · 8 min read
AIWalker
Aug 13, 2025 · Artificial Intelligence

One‑Model‑For‑All: Inception‑Level AI Try‑On/Off with Arbitrary Poses and No Masks

The paper presents OMFA, a diffusion‑based unified framework for virtual try‑on and try‑off that removes the need for garment templates, segmentation masks, and fixed poses by leveraging a novel partial‑diffusion mechanism and SMPL‑X pose conditioning, achieving state‑of‑the‑art results on VITON‑HD and DeepFashion‑MultiModal datasets.

AI try-on · Computer Vision · SMPL-X
0 likes · 15 min read
AIWalker
Aug 3, 2025 · Artificial Intelligence

Tree-Guided CNN Boosts Image Super-Resolution in Joint University Study

A collaborative team from five universities proposes a tree-structured convolutional neural network that leverages binary‑tree guidance, cosine cross‑domain extraction, and an adaptive Nesterov momentum optimizer to markedly improve image super‑resolution performance.

Computer Vision · Deep Learning · adaptive optimizer
0 likes · 5 min read
Data Party THU
Jul 31, 2025 · Artificial Intelligence

How LaVin-DiT Revolutionizes Vision Generation with ST‑VAE and Joint Diffusion Transformer

The LaVin-DiT paper introduces a large‑scale vision diffusion transformer that combines a spatiotemporal variational auto‑encoder, a joint diffusion transformer with full‑sequence joint attention, and 3D rotary position encoding to enable unified, efficient generation across diverse visual tasks such as segmentation and video prediction.

3D RoPE · Computer Vision · Vision Transformer
0 likes · 11 min read
AI Frontier Lectures
Jul 26, 2025 · Artificial Intelligence

Training-Free Universal Virtual Try-On: OmniVTON’s Multi-Person Breakthrough

OmniVTON introduces a training‑free universal virtual try‑on framework that decouples garment texture and human pose, achieving high‑fidelity results across both in‑shop and in‑the‑wild scenarios, and uniquely supporting multi‑person virtual dressing, as demonstrated by extensive quantitative and qualitative experiments.

Computer Vision · Multi-Person · artificial intelligence
0 likes · 9 min read
AI Frontier Lectures
Jul 17, 2025 · Artificial Intelligence

Top 8 Tencent Youtu Papers Accepted at ICCV 2025: Innovations in AI and Vision

ICCV 2025, the 20th edition of the conference, accepted eight papers from Tencent Youtu Lab covering stylized face recognition, AI‑generated image detection, heterogeneous knowledge distillation, multi‑conditional diffusion, multimodal LLM distillation, palmprint recognition, low‑light vision, and oracle bone script decipherment, each pushing the frontier of computer vision and AI research.

Computer Vision · Dataset · ICCV 2025
0 likes · 17 min read
AIWalker
Jul 15, 2025 · Artificial Intelligence

Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

This article presents Dynamic Vision Mamba (DyVM), a method that tackles token and block redundancy in Mamba‑based visual models through a novel re‑ordering pruning strategy and dynamic block selection, achieving a 35.2% FLOPs reduction with only a 1.7% accuracy loss while demonstrating strong generalization across tasks and architectures.

Computer Vision · Dynamic Block Selection · FLOPs Reduction
0 likes · 22 min read
Amap Tech
Jul 14, 2025 · Artificial Intelligence

How UPRE Achieves Zero-Shot Domain Adaptation for Object Detection with Unified Prompts

The UPRE paper, presented at ICCV, introduces a multi‑view domain prompt and a unified representation enhancement to enable zero‑shot domain adaptation for object detection, achieving state‑of‑the‑art performance across diverse weather, geographic, and synthetic‑to‑real scenarios.

Computer Vision · Prompt Engineering · object detection
0 likes · 10 min read
Baidu Geek Talk
Jul 9, 2025 · Artificial Intelligence

PaddleOCR 3.1 Unveils Multilingual PP‑OCRv5, Document Translation, and MCP Server Integration

PaddleOCR 3.1 introduces three major upgrades—a multilingual PP‑OCRv5 model supporting 37 languages with over 30% accuracy gain, a PP‑DocTranslation pipeline for high‑quality multi‑language document translation, and MCP server support for flexible AI application integration—accompanied by detailed CLI usage, demo scenarios, and open‑source resources.

AI · Computer Vision · MCP
0 likes · 11 min read
AI Frontier Lectures
Jul 8, 2025 · Artificial Intelligence

How LaVin-DiT Unifies Vision Tasks with a Large Diffusion Transformer

The LaVin-DiT paper presents a large vision diffusion transformer that integrates a spatio‑temporal variational auto‑encoder, a joint diffusion transformer with full‑sequence joint attention, and 3D rotary position encoding to enable unified, efficient multi‑task generation for images and videos, and details its training via flow‑matching and experimental results.

3D RoPE · Computer Vision · Generative Modeling
0 likes · 12 min read
Huolala Tech
Jul 2, 2025 · Artificial Intelligence

Can Diffusion Models Revolutionize Salient Object Detection?

This article introduces a diffusion‑based framework for salient object detection, discusses its background, challenges, and motivations, details the model architecture and training, presents extensive experiments and ablation studies, and outlines limitations and future research directions.

Computer Vision · Deep Learning · diffusion model
0 likes · 11 min read
Qborfy AI
Jul 1, 2025 · Artificial Intelligence

Why CNNs Outperform Fully Connected Networks: A Deep Dive into Architecture and Applications

This article explains the fundamentals of convolutional neural networks (CNNs), detailing their definition, advantages over fully connected networks, architectural components such as input, hidden, and output layers, key operations like convolution, pooling, and activation, and showcases practical applications and notable insights.

CNN · Computer Vision · Deep Learning
0 likes · 5 min read
Amap Tech
Jun 30, 2025 · Artificial Intelligence

SeqGrowGraph: Chain-of-Graph Expansion for Precise Lane Topology

SeqGrowGraph introduces a novel chain-of-graph expansion framework that incrementally builds lane topology graphs using a Transformer-based autoregressive model, achieving state‑of‑the‑art performance on large autonomous‑driving datasets such as nuScenes and Argoverse 2 by accurately modeling complex road structures.

Computer Vision · Sequence Modeling · Transformer
0 likes · 10 min read
Rare Earth Juejin Tech Community
Jun 27, 2025 · Artificial Intelligence

Image Encryption, Watermarking, Detection & Green Screen Removal in Python

This tutorial walks through Python-based computer‑vision techniques—including XOR‑based image encryption, mask and ROI methods, digital watermark embedding via bit‑plane and LSB, sensitivity‑driven object detection, and HSV‑based green‑screen removal—providing complete code snippets and practical guidance for rapid AI‑assisted learning.

Computer Vision · OpenCV · Python
0 likes · 17 min read
AntTech
Jun 25, 2025 · Artificial Intelligence

CVPR 2025: Semi-Body Digital Humans, Video Upscaling, Mobile Super‑Res

In this CVPR 2025 showcase, Ant Group presents three cutting‑edge papers—EchoMimicV2 introducing an open‑source semi‑body digital human generation framework, RivuletMLP offering an efficient MLP‑based architecture for compressed video quality enhancement, and a quantized super‑resolution model that achieves real‑time 3× upscaling on mobile NPUs.

AI · CVPR · Computer Vision
0 likes · 6 min read
AIWalker
Jun 24, 2025 · Artificial Intelligence

How Multimodal Fusion Accelerates Paper Publication: Key Insights and Resources

The article surveys 117 recent multimodal‑fusion papers, classifies them into improvement‑based and combination‑based approaches, highlights representative works such as TimeXL, OGP‑Net, MMR‑Mamba and FusionSight, and provides a free collection of papers, classic models and code repositories for researchers.

AI research · Computer Vision · Deep Learning
0 likes · 8 min read
AI Algorithm Path
Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 1: What They Are and Why They Matter

This article introduces visual‑language models (VLMs), explaining how they combine large language models with visual encoders, why they overcome the rigidity of traditional computer‑vision systems, their key advantages, modular architecture, training methods, and practical applications such as image captioning and visual question answering.

AI applications · Computer Vision · Multimodal AI
0 likes · 8 min read
AntTech
Jun 15, 2025 · Artificial Intelligence

21 Ant Research Papers Shaping CVPR 2025: AI Image & Video Generation Breakthroughs

The Interactive Intelligence Lab of Ant Technology Research Institute presented 21 accepted CVPR 2025 papers covering visual generation, editing, 3D vision, digital humans and multimodal AI, highlighting tools such as MagicQuill, Lumos, Aurora, FLARE, LeviTor, MangaNinja, AniDoc, Mimir, AvatarArtist, DiffListener, MotionStone, TensorialGaussianAvatars, DualTalk, CompreCap and Uni-AD.

CVPR2025 · Computer Vision · Video Generation
0 likes · 20 min read
AI Frontier Lectures
Jun 14, 2025 · Industry Insights

CVPR 2025 Awards Unveiled: Breakthrough Papers and Rising Stars

The CVPR 2025 awards spotlight groundbreaking research, honoring young scholars and top papers such as VGGT, Neural Inverse Rendering, and several honorable mentions, while summarizing each work's core contributions, methodologies, and potential impact on computer vision and related fields.

2025 · CVPR · Computer Vision
0 likes · 13 min read
Kuaishou Tech
Jun 10, 2025 · Artificial Intelligence

Top 12 Cutting-Edge Video Generation Papers from Kuaishou at CVPR 2025

The article highlights CVPR 2025’s acceptance statistics and showcases twelve cutting‑edge video‑generation papers from Kuaishou, spanning datasets, quality assessment, style control, scaling laws, 4D simulation, interleaved image‑text data, vision‑language acceleration, high‑fidelity avatars, patch‑wise super‑resolution, narrative‑driven benchmarks, sketch‑based editing, and spatio‑temporal diffusion, each with links and abstracts.

CVPR2025 · Computer Vision · Kuaishou
0 likes · 20 min read
AI Frontier Lectures
Jun 7, 2025 · Artificial Intelligence

Can MaIR’s Locality‑Preserving Mamba Boost Image Restoration?

The article presents MaIR, a locality‑ and continuity‑preserving Mamba‑based model for image restoration, detailing its three‑stage architecture, novel scanning strategy, loss functions, experimental results on super‑resolution and denoising, and ablation studies, with links to the arXiv paper and source code.

Computer Vision · Denoising · Image Restoration
0 likes · 5 min read
AI Frontier Lectures
Jun 3, 2025 · Artificial Intelligence

How MaIR Advances Image Restoration with a Locality‑Preserving Mamba Architecture

The article presents MaIR, a Mamba‑based image restoration model that preserves locality and continuity, detailing its architecture, scanning strategies, loss functions, experimental results on super‑resolution and denoising, and an ablation study, while providing links to the arXiv paper and GitHub source code.

Computer Vision · Denoising · Image Restoration
0 likes · 5 min read
JD Tech
May 26, 2025 · Artificial Intelligence

Solving Technical Challenges at JD Retail: Multi‑Reward Models, LLM‑Based Query Expansion, Model Pruning, and Reinforcement Learning

This article details how JD Retail's young algorithm engineers tackled a series of AI engineering problems—including advertising image quality assessment with multi‑reward models, large‑language‑model‑driven query expansion, FFT‑and‑RDP‑based model pruning, and agent‑centric reinforcement learning—while sharing practical growth insights and code snippets.

AI · Computer Vision · Model Optimization
0 likes · 15 min read
JD Tech
May 20, 2025 · Artificial Intelligence

How Re‑parameterization and Adaptive Learning Boost Visual Deep Learning Efficiency

The award‑winning project from Tsinghua University and JD Retail introduces re‑parameterization model design, cross‑scene adaptive learning, and platform‑aware compression to overcome accuracy‑efficiency trade‑offs in visual deep learning, achieving over 20% accuracy gains and more than 50% inference speedup in real‑world e‑commerce deployments.

AI research · Computer Vision · adaptive models
0 likes · 6 min read
AIWalker
May 18, 2025 · Artificial Intelligence

YOLOE: Open‑Source Real‑Time Anything Detector Beats YOLO‑World v2

YOLOE unifies object detection and segmentation in a single efficient model that supports text, visual, and prompt‑free inference, introduces RepRTA, SAVPE, and LRPC strategies, and achieves higher AP with up to three‑fold lower training cost and 1.4× faster inference on GPUs and mobile devices, as demonstrated by extensive LVIS and COCO experiments.

Computer Vision · Prompt Engineering · Real-Time
0 likes · 29 min read
DaTaobao Tech
May 16, 2025 · Artificial Intelligence

JianYi: AI‑Powered Image Segmentation and Matting System for Taobao Home‑Decoration

The article introduces JianYi, a self‑developed image segmentation and matting system for Taobao's home‑decoration business that supports product, human, and panoramic segmentation with multi‑modal interaction, achieving high‑precision real‑time performance and powering AI tools such as "Jiazuo" and "Fang Wo Jia".

Computer Vision · Deep Learning · artificial intelligence
0 likes · 11 min read
Bilibili Tech
May 16, 2025 · Artificial Intelligence

How FineVQ Sets New Standards for Fine‑Grained UGC Video Quality Assessment

The article introduces FineVD, the first large‑scale multi‑dimensional UGC video quality dataset, and presents FineVQ, a unified model that predicts quality scores, attributes, and distortion types across six dimensions, achieving state‑of‑the‑art performance on multiple benchmarks and cross‑dataset evaluations.

Computer Vision · Dataset · Deep Learning
0 likes · 9 min read
AI Frontier Lectures
May 15, 2025 · Artificial Intelligence

OverLoCK: How a Bio‑Inspired Three‑Stage ConvNet Beats Transformers on Vision Tasks

OverLoCK introduces a bio‑inspired depth‑stage decomposition that splits a network into Base‑Net, Overview‑Net and Focus‑Net, and a novel Context‑Mix dynamic convolution, achieving state‑of‑the‑art accuracy on image classification, detection and segmentation while balancing speed and model size.

Computer Vision · ConvNet
0 likes · 11 min read
AI Frontier Lectures
May 15, 2025 · Artificial Intelligence

DefMamba: How Deformable Scanning Boosts Vision State‑Space Models

DefMamba introduces a deformable visual state‑space model that dynamically adjusts scanning paths and reference points, preserving spatial structure and improving feature capture, achieving state‑of‑the‑art results on ImageNet classification, COCO detection, and ADE20K segmentation while reducing computational cost.

Computer Vision · DefMamba · Deformable Scanning
0 likes · 23 min read
AIWalker
May 14, 2025 · Artificial Intelligence

How HGO‑YOLO Achieves 87.4% Accuracy at 56 FPS with Only 4.6 MB Parameters

This paper presents HGO‑YOLO, a lightweight real‑time anomaly‑behavior detector that integrates HGNetv2 and GhostConv into YOLOv8, achieving 87.4% mAP with just 4.6 MB of parameters and 56 FPS on CPU, and validates its performance across multiple datasets and hardware platforms.

Computer Vision · Lightweight Models · YOLO
0 likes · 25 min read
AIWalker
May 13, 2025 · Artificial Intelligence

PixelHacker: Diffusion‑Based Image Inpainting with Latent Class Guidance Beats SOTA

PixelHacker introduces a latent class guidance (LCG) paradigm that injects foreground and background embeddings into a diffusion model, training on 14 million image‑mask pairs and achieving state‑of‑the‑art structural and semantic consistency across Places2, CelebA‑HQ and FFHQ benchmarks.

Computer Vision · PixelHacker · SOTA
0 likes · 16 min read
Meituan Technology Team
Apr 24, 2025 · Artificial Intelligence

Meituan AI Recruitment: Join Our Advanced Technology Teams

Meituan's AI recruitment page showcases diverse opportunities across AI infrastructure, intelligent interaction, visual intelligence, and intelligent products, featuring roles from algorithm engineers to product managers working on cutting-edge technologies including large models, intelligent agents, and multimodal systems.

AI Recruitment · Computer Vision · Intelligent agents
0 likes · 5 min read
php Courses
Apr 23, 2025 · Artificial Intelligence

Real-Time Face Recognition with PHP and OpenCV

This article explains how to set up a PHP environment, control a camera, and use the OpenCV library to perform real‑time face detection and recognition with code examples, demonstrating a practical security solution for applications such as access control and surveillance.

Computer Vision · OpenCV · PHP
0 likes · 6 min read
Liangxu Linux
Apr 22, 2025 · Artificial Intelligence

Top 10 Open-Source OCR Projects on GitHub Ranked by Stars

This article compiles a ranked list of ten popular open-source OCR projects on GitHub, summarizing each tool’s key capabilities—such as multimodal text extraction, PDF linearization, layout analysis, and multilingual support—along with star counts and direct repository links for developers seeking ready-to-use OCR solutions.

Computer Vision · GitHub · OCR
0 likes · 9 min read
JD Cloud Developers
Apr 22, 2025 · Artificial Intelligence

How AI Turns 2D Videos into Immersive 3D Spatial Content at Scale

Leveraging 3D vision and AIGC, JD Retail’s R&D team converts abundant 2D video assets into high‑quality stereoscopic 3D space videos through a pipeline that includes monocular depth estimation, novel view synthesis, multi‑branch inpainting, and MV‑HEVC encoding, validated by ICME 2025 and a new StereoV1K dataset.

3D video · AIGC · Computer Vision
0 likes · 26 min read
JD Tech Talk
Apr 22, 2025 · Artificial Intelligence

End-to-End 3D Spatial Video Generation via Monocular Depth Estimation, Novel View Synthesis, and MV-HEVC Encoding

Leveraging AI-driven monocular depth estimation, novel view synthesis, and MV‑HEVC encoding, the JD Retail Content R&D team presents an end‑to‑end pipeline that converts 2D video assets into high‑quality immersive 3D spatial videos, introduces the large‑scale StereoV1K dataset, and demonstrates superior performance over existing methods.

3D video generation · AIGC · Computer Vision
0 likes · 22 min read
Amap Tech
Apr 21, 2025 · Artificial Intelligence

Lenna: Language‑Enhanced Reasoning Detection Assistant and a Chain‑of‑Thought Image Editing Framework Using Multimodal Large Language Models

At ICASSP 2025, Gaode’s two accepted papers present Lenna, a language‑enhanced reasoning detection assistant that adds a DET token to multimodal LLMs and achieves state‑of‑the‑art accuracy on RefCOCO benchmarks, and a chain‑of‑thought image‑editing framework that converts complex prompts into segmented masks and repair prompts for diffusion‑based inpainting, surpassing existing methods.

AI · Computer Vision · ICASSP
0 likes · 10 min read