Tagged articles

OCR

241 articles · Page 1 of 3
Machine Heart
Jun 23, 2026 · Artificial Intelligence

Unlimited OCR Achieves SOTA Long-Document Parsing in a Single Forward Pass

Unlimited OCR, Baidu's open‑source model built on DeepSeek OCR, uses a novel Reference Sliding Window Attention to compress visual tokens and keep KV cache size constant, enabling end‑to‑end parsing of whole books with 93.23% OmniDocBench v1.5 score and stable latency across dozens of pages.

DeepSeek · Large Language Model · Long Document
0 likes · 14 min read
Machine Learning Algorithms & Natural Language Processing
Jun 15, 2026 · Artificial Intelligence

Blurry Images Create a ‘Comfort Zone’ for Jailbreaking Multimodal LLMs

A new study from Westlake University shows that when harmful text is rendered as low‑resolution, blurry, or noisy images, multimodal large language models become significantly easier to jailbreak despite still recognizing the text, revealing a U‑shaped risk curve and a simple mitigation that decouples OCR from safety checks.

OCR · jailbreak · multimodal LLM
0 likes · 10 min read
Machine Heart
Jun 14, 2026 · Artificial Intelligence

When Blurry Images Create an Attack Comfort Zone for Multimodal LLMs

Westlake University's AGI Lab shows that when harmful text is rendered as low‑resolution, blurry or noisy images, multimodal large language models can still read the content but their safety filters fail, creating an 'attack comfort zone' that dramatically raises jailbreak success rates across several models.

OCR · jailbreak · multimodal LLM
0 likes · 9 min read
Python Crawling & Data Mining
Jun 10, 2026 · Artificial Intelligence

Automating Validation of 300,000 Records with Python + AI to Detect Errors and Dirty Data

Even with 99% accuracy, tens of thousands of errors remain in a 300k‑row dataset, so the author builds a Python‑AI pipeline that preprocesses images, performs high‑precision OCR, merges data, applies custom validation rules, and automatically generates an error report, dramatically reducing manual effort.

AI · Automation · Data Validation
0 likes · 6 min read
Old Zhang's AI Learning
Jun 2, 2026 · Fundamentals

Lightning‑Fast Open‑Source Local PDF Parser: LiteParse Processes 400‑Page PDFs in 1 Second

LiteParse, an open‑source Rust‑based local PDF parser from the LlamaIndex team, extracts text from a 400‑page PDF in about one second and offers multi‑language bindings, flexible OCR, bounding‑box output, and Agent Skill integration; its limitations include only basic table handling and limited support for complex layouts.

Agent Skill · LiteParse · Local processing
0 likes · 9 min read
Su San Talks Tech
May 20, 2026 · Artificial Intelligence

Why Convert Docs to Markdown for LLMs? Meet the Open‑Source MarkItDown Tool

The article explains that LLMs process Markdown more effectively than raw PDFs, introduces Microsoft’s open‑source MarkItDown utility that converts a wide range of file types—including PDFs, Word, Excel, HTML, images with OCR, and YouTube videos—into clean Markdown, and provides installation, usage examples, recent feature updates, and a brief critique of its scope.

Azure Document Intelligence · CLI · LLM preprocessing
0 likes · 6 min read
DataFunTalk
May 15, 2026 · Artificial Intelligence

Exploring Multimodal GraphRAG: Combining Document Intelligence, Knowledge Graphs, and Large Models

This article provides a comprehensive technical overview of multimodal GraphRAG, detailing document‑intelligence parsing pipelines, layout analysis, OCR‑pipeline vs OCR‑free approaches, knowledge‑graph integration for chunk relationships, multimodal indexing, retrieval‑generation workflows, and a comparative analysis of RAG, GraphRAG, and KG‑QA solutions.

GraphRAG · Knowledge Graph · Layout Analysis
0 likes · 23 min read
DataFunTalk
May 10, 2026 · Artificial Intelligence

Exploring Multimodal GraphRAG: Combining Document Intelligence, Knowledge Graphs, and Large Models

This article presents a detailed technical walkthrough of multimodal GraphRAG, covering document‑intelligence parsing pipelines, multimodal graph index construction, knowledge‑graph‑driven chunk linking, recent research progress, performance trade‑offs, and practical recommendations for deploying RAG solutions.

GraphRAG · Knowledge Graph · OCR
0 likes · 23 min read
AI Engineer Programming
May 9, 2026 · Artificial Intelligence

Why PDF Parsing Is Hard for RAG and Which Mainstream Solutions Work

The article examines the intrinsic challenges of extracting structured text from PDFs for Retrieval‑Augmented Generation—such as missing reading order, table reconstruction, font encoding, and scanned images—and compares lightweight libraries, AI‑enhanced frameworks, commercial APIs, and visual language models as practical solutions.

AI frameworks · OCR · PDF parsing
0 likes · 23 min read
SuanNi
Apr 30, 2026 · Artificial Intelligence

Deploy a 24/7 Document Recognition Toolbox with the PaddleOCR Image on the Cloud

This guide explains how to use Baidu's open‑source PaddleOCR engine—its full OCR and layout analysis pipeline, multi‑language support, and output formats—to set up a continuously running document recognition service on the 算网 GPU cloud platform, including environment preparation, model configuration, and inference execution.

Document processing · GPU · MagicMind
0 likes · 6 min read
Kuaishou Tech
Apr 29, 2026 · Operations

Boosting Oncall Interception from 15% to 55%: KOncall’s AI‑Driven Evolution at Kuaishou

Kuaishou’s R&D efficiency team built the KOncall intelligent on‑call platform, integrating LLM‑based retrieval‑augmented generation, Redis Pub/Sub streaming, OCR multimodal parsing, FAQ knowledge ops, and custom reranking, which raised automated query interception from 15% to 55% and processed over 116,000 requests, turning on‑call from a bottleneck into a capability starter.

AI Operations · Incident Management · Knowledge Management
0 likes · 26 min read
AI Architecture Path
Apr 29, 2026 · Artificial Intelligence

Fed up feeding AI with docs? Microsoft’s Open‑Source MarkItDown converts any format to Markdown in a few lines

MarkItDown, an open‑source Python tool from Microsoft’s AutoGen team, converts over 20 document and media formats—including Word, Excel, PDF, images, audio and YouTube links—into standardized Markdown, offering OCR, LLM integration, Docker deployment, Azure Document Intelligence support, and extensive command‑line examples for enterprise and research pipelines.

AutoGen · Azure Document Intelligence · Docker
0 likes · 13 min read
Java Architect Essentials
Apr 17, 2026 · Backend Development

How to Integrate Tess4J OCR into a Spring Boot Application

This article explains OCR fundamentals, introduces Tesseract and its Java wrapper Tess4J, guides you through downloading language data, shows step‑by‑step Spring Boot integration with Maven dependencies and configuration classes, and provides test code for Chinese, English, and mixed‑language image recognition.

Java · Language Data · OCR
0 likes · 9 min read
ShiZhen AI
Apr 12, 2026 · Artificial Intelligence

Convert Any File to Clean Markdown in One Click with Microsoft’s MarkItDown

MarkItDown, an open‑source tool from Microsoft’s AutoGen team, lets you feed PDFs, Office documents, web data, media, and even YouTube videos into large language models by converting them to clean Markdown in a single command, preserving structure for better AI understanding.

Azure Document Intelligence · LLM preprocessing · MarkItDown
0 likes · 6 min read
Java Architect Handbook
Apr 1, 2026 · Backend Development

Integrating Tess4j OCR into a Spring Boot 3 Project

This guide explains OCR fundamentals, introduces Tesseract and Tess4j, shows how to download the required language data files, and provides step‑by‑step instructions with Maven configuration, Spring Boot properties, Java code, and test examples for Chinese, English, and mixed‑language image recognition.

Java · OCR · Spring Boot
0 likes · 11 min read
AI Explorer
Mar 28, 2026 · Artificial Intelligence

How Chandra OCR 2 Accurately Parses Complex Tables and Handwritten Text

Chandra OCR 2, an open‑source model on GitHub, combines full‑layout understanding with multi‑format output to precisely digitize complex tables, handwritten notes, formulas and multilingual documents, outperforming other OCR solutions in benchmark tests and offering easy installation for developers.

Chandra OCR 2 · Layout Understanding · OCR
0 likes · 6 min read
Old Zhang's AI Learning
Mar 27, 2026 · Artificial Intelligence

Alibaba’s Logics-Parsing-v2 Sets New OCR Benchmark Records

Alibaba’s open‑source Logics-Parsing‑v2 achieves top scores on both LogicsDocBench (82.16) and OmniDocBench‑v1.5 (93.23), outperforms leading closed models, and introduces Parsing‑2.0 capabilities that handle flowcharts, music scores, code blocks, and chemical formulas with structured HTML output.

ABC notation · Logics-Parsing-v2 · Mermaid
0 likes · 9 min read
Architecture Digest
Mar 26, 2026 · Artificial Intelligence

How to Integrate Tess4j OCR into a Spring Boot 3 Application

This guide explains the fundamentals of OCR, introduces Tesseract and its Java wrapper Tess4j, shows how to download language data files, configure a Spring Boot 3 project with Maven dependencies and YAML settings, and provides comprehensive test code for Chinese, English, and mixed‑language image recognition.

Java · OCR · Spring Boot
0 likes · 9 min read
Data STUDIO
Mar 26, 2026 · Operations

10 Open‑Source Python Tools That Replace Paid SaaS Apps

The article presents ten Python libraries—pikepdf, Playwright, pdf2image + pytesseract, moviepy, pydub + ffmpeg, reportlab, yt‑dlp, watchdog, pyvirtualcam, and rich + textual—each with code samples, runtime requirements, complexity analysis, practical tips, and common pitfalls, showing how they can substitute costly commercial software while offering greater control, privacy, and customization.

Audio Processing · Automation · File Monitoring
0 likes · 19 min read
SpringMeng
Mar 25, 2026 · Backend Development

How to Perform OCR in SpringBoot Using Tess4j

This tutorial explains OCR fundamentals, introduces Tesseract and its Java wrapper Tess4j, shows how to download language data, integrate Tess4j into a SpringBoot 3 project with Maven configuration, and provides test code for Chinese, English, and mixed‑language image recognition while highlighting performance considerations.

Configuration · Java · OCR
0 likes · 9 min read
java1234
Mar 24, 2026 · Backend Development

How to Elegantly Perform OCR in Spring Boot 3 Using Tess4J

This tutorial explains OCR fundamentals, introduces the open‑source Tesseract engine and its Java wrapper Tess4J, shows how to download the required traineddata files, and provides step‑by‑step Spring Boot 3 integration, configuration, and test code for Chinese, English, and mixed‑language image recognition, plus important usage notes.

Java · OCR · Spring Boot
0 likes · 8 min read
Java Companion
Mar 22, 2026 · Backend Development

How to Seamlessly Integrate Tess4j OCR into a SpringBoot Application

This tutorial walks through the fundamentals of OCR, explains how to download the required Tesseract traineddata files, shows how to add Tess4j as a Maven dependency, configure SpringBoot with custom properties, and provides complete Java test code for Chinese, English, and mixed‑language image recognition, highlighting performance considerations and file‑naming requirements.

Java · OCR · backend
0 likes · 9 min read
Wu Shixiong's Large Model Academy
Mar 22, 2026 · Artificial Intelligence

How to Overcome MinerU’s Top 9 Limitations for Reliable Document Parsing

This article examines MinerU’s strengths and nine critical shortcomings—such as reading order errors, split tables, merged cells, OCR misrecognition, formula handling, heading hierarchy loss, output inconsistency, hardware limits, and licensing issues—and provides concrete improvement strategies and interview‑ready talking points for engineers.

Document Parsing · Interview Tips · MinerU
0 likes · 12 min read
Wu Shixiong's Large Model Academy
Mar 20, 2026 · Artificial Intelligence

Mastering MinerU: Overcoming Its Top 9 Limitations for Reliable Document Parsing

This article examines MinerU's strengths and nine critical shortcomings—such as layout order errors, cross‑page table splits, merged‑cell failures, OCR misrecognition, and licensing issues—and provides concrete improvement strategies, interview‑ready resume bullets, and practical response frameworks for engineers.

LLM · Layout Analysis · MinerU
0 likes · 13 min read
Old Zhang's AI Learning
Mar 10, 2026 · Artificial Intelligence

FireRed-OCR 2B: An Open‑Source VLM That Tackles Structural Hallucination

FireRed‑OCR‑2B, an open‑source 2‑billion‑parameter visual‑language model, addresses structural hallucination in document OCR through a geometry‑aware data factory and a three‑stage training pipeline, achieving a 92.94 OmniDocBench v1.5 score and leading end‑to‑end performance while remaining lightweight enough for consumer‑grade GPUs.

FireRed-OCR · OCR · OmniDocBench
0 likes · 11 min read
Huolala Tech
Mar 4, 2026 · Artificial Intelligence

How Lalamove Built an AI‑Powered Edge‑Cloud Review System for Global Driver Verification

Lalamove tackled the scalability and accuracy challenges of worldwide driver onboarding by designing a layered edge‑cloud AI architecture that combines lightweight mobile models, cloud‑based large‑language and computer‑vision models, OCR, and multimodal LLMs to filter low‑quality inputs, automate identity checks, and reduce manual effort while maintaining data compliance.

AI · Driver Verification · OCR
0 likes · 12 min read
SpringMeng
Mar 2, 2026 · Backend Development

Deep Dive into an Asynchronous Spring Boot + Tesseract OCR Pipeline for Invoice Recognition

This article presents a complete design and implementation of a high‑throughput, asynchronous OCR pipeline built with Spring Boot and Tesseract, covering distributed architecture, thread‑pool tuning, image‑preprocessing, multi‑engine recognition, data extraction strategies, Kubernetes deployment, security compliance, chaos testing, and future AI‑driven enhancements.

Asynchronous · GPU · Java
0 likes · 10 min read
Machine Learning Algorithms & Natural Language Processing
Feb 26, 2026 · Artificial Intelligence

Edit Banana Turns AI‑Generated Pixel Diagrams into Fully Editable PPT and Drawio Files

Edit Banana addresses the common pain of uneditable AI‑generated pixel diagrams by instantly converting them into fully editable Drawio (XML) or PPTX files, preserving text, shapes, and connections, and offering LaTeX extraction and a human‑in‑the‑loop mode for complex icons.

AIGC · Drawio · Edit Banana
0 likes · 6 min read
Old Zhang's AI Learning
Feb 8, 2026 · Artificial Intelligence

Choosing the Best OCR Large Model: DeepSeek‑OCR‑2, HunyuanOCR, PaddleOCR‑VL‑1.5, and GLM‑OCR Compared

This article provides a detailed technical comparison of four OCR large models—DeepSeek‑OCR‑2, HunyuanOCR, PaddleOCR‑VL‑1.5, and GLM‑OCR—covering their architectures, parameter sizes, release dates, licensing, core features, strengths, weaknesses, benchmark scores, multilingual support, deployment requirements, and recommended use‑cases, helping readers select the most suitable model for their needs.

DeepSeek-OCR 2 · GLM-OCR · HunyuanOCR
0 likes · 17 min read
Old Zhang's AI Learning
Feb 3, 2026 · Artificial Intelligence

Why GLM-OCR Leads OCR Benchmarks: 0.9B Model Tops OmniDocBench

GLM-OCR, a 0.9B‑parameter multimodal OCR model from Zhipu, achieves the highest score (94.62) on OmniDocBench V1.5, offers lightweight deployment via vLLM, Ollama, API and SDK, and outperforms larger rivals like DeepSeek‑OCR and PaddleOCR in speed and accuracy.

GLM-OCR · OCR · Ollama
0 likes · 10 min read
Old Zhang's AI Learning
Jan 31, 2026 · Artificial Intelligence

How a 0.1B‑Parameter OCR Model Beats Multi‑Billion‑Parameter Vision‑Language Models

UniRec‑0.1B, a lightweight OCR model with only 0.1B parameters, achieves accuracy comparable to or better than multi‑billion‑parameter visual‑language models across text, formula, and mixed‑content tasks, thanks to hierarchical supervision training, a semantic‑decoupled tokenizer, and a large 40M‑sample dataset, while delivering 2‑9× faster inference and full open‑source availability.

Hierarchical Supervision · OCR · Semantic Decoupled Tokenizer
0 likes · 12 min read
Old Zhang's AI Learning
Jan 28, 2026 · Artificial Intelligence

How to Deploy DeepSeek‑OCR‑2 Locally: A Hands‑On Walkthrough

The article details a step‑by‑step local deployment of DeepSeek‑OCR‑2, covering GPU memory requirements, accuracy on complex tables, long inference times, dependency hurdles like GCC, GLIBC and flash‑attn, and provides concrete solutions using conda environments and symlinks.

Conda · DeepSeek-OCR 2 · GPU
0 likes · 7 min read
PaperAgent
Jan 27, 2026 · Artificial Intelligence

How DeepSeek-OCR 2’s Dual-Flow Attention Redefines Document Understanding

DeepSeek-OCR 2 introduces a novel dual‑stream (bidirectional + causal) attention architecture that replaces fixed raster scanning, leverages a Qwen2‑0.5B encoder, and achieves state‑of‑the‑art accuracy on OmniDocBench while reducing token budget and improving reading‑order consistency.

DeepEncoder · DeepSeek · Dual-Stream Attention
0 likes · 8 min read
Old Zhang's AI Learning
Jan 27, 2026 · Artificial Intelligence

DeepSeek-OCR 2 Enables AI to Read Images with Human‑Like Logical Flow

DeepSeek-OCR 2 introduces Visual Causal Flow and an LLM‑based visual encoder, achieving 91.09% accuracy on OmniDocBench v1.5, while providing detailed installation, two inference modes (vLLM and Transformers), and an analysis of its strengths and limitations for complex document processing.

DeepEncoder V2 · DeepSeek-OCR 2 · LLM
0 likes · 9 min read
Alibaba Cloud Native
Jan 22, 2026 · Cloud Native

Building a Cloud‑Native AI Glass Traffic Enforcement Prototype with AgentRun and Serverless Functions

This article details a cloud‑native architecture that combines Meta Ray‑Ban AI glasses, a custom iOS app, and Alibaba Cloud Function Compute (FC) with AgentRun to perform OCR‑based traffic rule enforcement, showcasing a three‑layer "client‑brain‑tools" design, prompt‑driven logic, and cost‑effective serverless deployment.

AI · Alibaba Cloud · Cloud Native
0 likes · 14 min read
Wuming AI
Jan 3, 2026 · Artificial Intelligence

How to Remove Watermarks and Fix Chinese Text in NotebookLM‑Generated PPTs

This guide walks you through a two‑step process—first using SlideDeckCleaner to strip watermarks from NotebookLM‑generated PDF PPTs, then employing an AI‑powered PPT conversion service to resolve Chinese garbled text and improve image clarity, with detailed screenshots and tips for handling stubborn elements.

AI PPT conversion · NotebookLM · OCR
0 likes · 4 min read
Wuming AI
Dec 30, 2025 · Artificial Intelligence

Build an AI Agent that Turns arXiv Screenshot into Direct PDF Download

The article shows how to create a simple AI agent that receives a screenshot of an arXiv paper, automatically extracts the paper’s URL and PDF link using a custom prompt, and then lets users view the abstract, download the PDF, or save it to a knowledge base.

AI Agent · Knowledge Base · OCR
0 likes · 4 min read
Old Meng AI Explorer
Dec 26, 2025 · Artificial Intelligence

How PaddleOCR Boosts Text Extraction Efficiency 10×: A Hands‑On Review

PaddleOCR, Baidu’s open‑source OCR engine, delivers high‑accuracy multilingual text extraction from images, PDFs, and handwritten notes, offering offline operation, free commercial use, and specialized models for invoices, IDs, and tables, enabling users to automate document processing and increase productivity up to tenfold.

AI · Document Automation · OCR
0 likes · 9 min read
Su San Talks Tech
Dec 13, 2025 · Information Security

How to Use Apache Tika in Spring Boot for Sensitive Data Detection and DLP

This article explains Apache Tika's core features, architecture, and common use cases, then provides a step‑by‑step Spring Boot tutorial that integrates Tika to extract file content, detect personal identifiers with regex, and return results via a REST API for data‑loss‑prevention.

Apache Tika · DLP · File Parsing
0 likes · 24 min read
Sohu Tech Products
Dec 3, 2025 · Mobile Development

How to Build a Scalable Android Ad‑Monitoring System with Multi‑Device Automation

This article details the design and implementation of an Android ad‑monitoring platform that controls multiple devices concurrently, automates app interactions, uses OCR for ad detection, and provides real‑time status monitoring via a floating window, while covering architecture, core modules, communication strategies, and performance optimizations.

ADB · Ad Monitoring · Android
0 likes · 27 min read
AI Algorithm Path
Dec 1, 2025 · Artificial Intelligence

Getting Started with the Cutting‑Edge Vision‑Language Model Qwen3‑VL

This article introduces vision‑language models, explains why they outperform OCR‑plus‑LLM pipelines, and walks through practical OCR and information‑extraction tasks using Qwen3‑VL, complete with code snippets, example prompts, result analysis, and a discussion of the model's limitations and resource considerations.

OCR · Python · Qwen3-VL
0 likes · 13 min read
HyperAI Super Neural
Nov 28, 2025 · Artificial Intelligence

Weekly AI paper roundup: protein design, open‑source agent, HunyuanOCR, Olmo 3

This weekly roundup highlights five recent AI papers—including HumanSense for multimodal LLM evaluation, JAM‑2 for de novo antibody design, the open‑source Olmo 3 language models, the Lumine generalist 3D agent, and the lightweight HunyuanOCR vision‑language model—summarizing their core contributions, results, and links.

OCR · Protein design · generalist agents
0 likes · 6 min read
HyperAI Super Neural
Nov 11, 2025 · Artificial Intelligence

How DeepSeek-OCR Achieves SOTA Using Ultra‑Low Visual Token Counts

DeepSeek-OCR leverages a visual‑compression approach, combining DeepEncoder and the DeepSeek3B‑MoE‑A570M decoder, to represent document text with far fewer visual tokens, achieving up to 97% OCR accuracy and surpassing GOT‑OCR2.0 and MinerU2.0 on OmniDocBench, while the article offers a one‑click deployment tutorial.

DeepEncoder · DeepSeek-OCR · LLM
0 likes · 6 min read
Architect's Guide
Nov 10, 2025 · Artificial Intelligence

Build a Scalable, High‑Performance OCR Invoice Pipeline with Spring Boot & Tesseract

This article details a complete, production‑grade OCR invoice processing pipeline that combines a distributed Spring Boot microservice architecture, deep Tesseract optimizations, ML‑based data validation, GPU acceleration, Kubernetes deployment, and extensive performance and security strategies to achieve million‑scale daily throughput with high accuracy.

OCR · Performance Optimization · Spring Boot
0 likes · 16 min read
DataFunSummit
Oct 30, 2025 · Artificial Intelligence

How Multimodal Large Models Are Revolutionizing Document Processing and OCR

This article explores how the explosion of unstructured data exposes the limits of traditional OCR and shows how emerging multimodal large language models provide end‑to‑end document understanding, reduce pipeline complexity, cut training costs, enable hybrid retrieval‑augmented generation, and drive real‑world industry deployments.

AI · Document processing · Large Language Model
0 likes · 28 min read
Old Meng AI Explorer
Oct 30, 2025 · Artificial Intelligence

How PaddleOCR Turns Handwritten Notes and PDFs into Editable Text in Seconds

This article explains how PaddleOCR, an open‑source OCR engine from Baidu, achieves high‑accuracy text extraction from handwritten notes, scanned PDFs, invoices, IDs and multilingual documents, offering offline cross‑platform support, free commercial use, and step‑by‑step guidance for rapid deployment.

Automation · Document processing · OCR
0 likes · 10 min read
HyperAI Super Neural
Oct 27, 2025 · Artificial Intelligence

Weekly AI Paper Digest: New OCR Model, Multimodal LLM, Next‑Gen DNA Sequencing

This week’s AI roundup highlights five recent papers: DeepSeek‑OCR’s context‑compression model for large‑scale data generation, Rex‑Omni’s 3‑billion‑parameter multimodal LLM achieving state‑of‑the‑art object perception, Alpha‑Service’s proactive AI‑glass framework, a bias‑variance approach to narrowing cross‑lingual gaps, and GATK’s MapReduce‑based toolkit for next‑generation DNA sequencing.

AI Glasses · Cross-lingual NLP · DNA Sequencing
0 likes · 6 min read
Fun with Large Models
Oct 26, 2025 · Artificial Intelligence

From Deep Learning to Large‑Model OCR: Which Model Leads the Pack?

This article traces OCR's evolution from early CNN‑LSTM systems to modern multimodal VLMs, analyzes leading open‑source models such as DeepSeek‑OCR, PaddleOCR, and MonkeyOCR, and offers practical guidance for long‑document, academic, and edge‑computing scenarios.

DeepSeek-OCR · MonkeyOCR · Multimodal AI
0 likes · 15 min read
DataFunTalk
Oct 20, 2025 · Artificial Intelligence

How DeepSeek-OCR Achieves 10× Context Compression with Vision Tokens

DeepSeek-OCR, a newly open‑sourced 3B‑parameter OCR model, uses a novel DeepEncoder and a 3B MoE decoder to compress long‑text contexts into visual tokens, achieving up to 10× compression with 97% accuracy and demonstrating strong practical performance on benchmarks and multilingual documents.

DeepSeek · Multimodal AI · OCR
0 likes · 11 min read
HyperAI Super Neural
Oct 14, 2025 · Artificial Intelligence

NeurIPS 2025: OCRBench v2 Shows Gemini Leads Chinese OCR Ranking Yet Scores Only Pass

OCRBench v2, introduced at NeurIPS 2025, evaluates 58 multimodal models on 23 OCR‑related tasks in Chinese and English, revealing that even top models like Gemini‑2.5‑Pro barely exceed the passing threshold and that most models struggle with fine‑grained text localization and multilingual performance.

Evaluation · Gemini · NeurIPS 2025
0 likes · 8 min read
HyperAI Super Neural
Sep 26, 2025 · Artificial Intelligence

Redefining Next‑Gen OCR: IBM’s Open‑Source Granite‑Docling‑258M for Unified Structure and Content Understanding

IBM’s newly released open‑source model Granite‑Docling‑258M tackles the long‑standing challenge of converting diverse digital documents into machine‑readable, structured data by preserving layout, tables, formulas, and supporting multiple languages, while remaining lightweight at 258M parameters and outperforming its predecessor SmolDocling‑256M‑Preview.

Docling · IBM · OCR
0 likes · 5 min read
AndroidPub
Sep 26, 2025 · Mobile Development

How to Add On‑Device AI Scanning to Your Android App with ML Kit

This article walks through the practical steps of integrating Google ML Kit into an Android app, covering its privacy‑first, zero‑learning‑curve advantages and providing complete code examples for barcode scanning, OCR, error handling, CameraX setup, and performance tuning.

Android · Barcode Scanning · CameraX
0 likes · 14 min read
Code Ape Tech Column
Sep 23, 2025 · Backend Development

Integrate Tess4J OCR into Spring Boot: Step‑by‑Step Guide

This tutorial walks you through setting up a Spring Boot project with Tess4J, adding required dependencies, configuring language data, implementing an OCR service and REST controller, and testing both local file and remote URL image recognition, all with complete code examples.

Image processing · Java · OCR
0 likes · 6 min read
Sohu Tech Products
Sep 17, 2025 · Artificial Intelligence

Choosing the Right Python OCR Library: pytesseract, cnocr, or PaddleOCR?

This article compares three popular Python OCR frameworks—pytesseract, cnocr, and PaddleOCR—examining their installation ease, Chinese recognition ability, model size, accuracy, and unique features, and provides practical code examples to help developers pick the best tool for their needs.

Image processing · OCR · PaddleOCR
0 likes · 5 min read
DaTaobao Tech
Sep 17, 2025 · Artificial Intelligence

Boosting ID Card Photo Quality with Multimodal AI: A Practical Deployment Guide

This article details how a multimodal AI model was integrated to detect and improve ID card photo quality, covering common image issues, differences between OCR and multimodal extraction, deployment strategies, performance metrics, cost estimation, and the resulting business and technical benefits.

ID verification · Model Deployment · Multimodal AI
0 likes · 13 min read
Tencent Technical Engineering
Sep 12, 2025 · Artificial Intelligence

How POINTS-Reader Achieves State‑of‑the‑Art PDF Extraction Without Teacher Models

The POINTS-Reader paper, accepted at EMNLP 2025, introduces a two‑stage, fully automated data generation pipeline that enables a lightweight visual‑language model to extract text, tables, and LaTeX formulas from diverse PDF layouts with superior performance and high throughput, all without relying on costly teacher‑model distillation.

AI · Document Parsing · OCR
0 likes · 12 min read
Chen Tian Universe
Sep 8, 2025 · Operations

Unlocking the Power of Financial Shared Service Centers: A Complete Guide

This article explains the background, concept, suitable enterprises, involved departments, policies, processes, technical architecture, and common challenges of Financial Shared Service Centers (FSSC), offering a step‑by‑step roadmap for organizations seeking cost reduction, efficiency, and stronger financial control.

Enterprise Architecture · Financial Shared Services · OCR
0 likes · 17 min read
Architect
Aug 21, 2025 · Artificial Intelligence

Implement OCR in Java with Tess4j and SpringBoot in Just a Few Lines

This tutorial walks you through adding optical character recognition to a Java SpringBoot project using the Tess4j library, covering prerequisites, dependency setup, engine initialization, RESTful API creation, and tips for improving accuracy with custom training data or third‑party services.

Image processing · Java · OCR
0 likes · 8 min read
Architect
Aug 16, 2025 · Artificial Intelligence

Build a Scalable High‑Performance OCR Invoice Pipeline with Spring Boot & Tesseract

This article presents a comprehensive, high‑throughput OCR invoice processing solution that combines distributed system design, Spring Boot asynchronous execution, Tesseract deep optimization, multi‑engine fusion, structured data extraction, performance tuning, Kubernetes deployment, and security compliance.

AI · OCR · Spring Boot
0 likes · 16 min read
Programmer XiaoFu
Aug 12, 2025 · Backend Development

Deep Dive into an Asynchronous Spring Boot + Tesseract OCR Pipeline for Invoice Recognition

This article presents a comprehensive, step‑by‑step analysis of a high‑throughput, asynchronous OCR pipeline built with Spring Boot and Tesseract, covering system architecture, thread‑pool tuning, custom invoice‑specific model training, multi‑engine fusion, structured data extraction, performance optimizations, GPU acceleration, Kubernetes deployment, monitoring, security compliance, chaos testing, and future evolution plans.

Asynchronous · GPU · OCR
0 likes · 12 min read
Xiaohongshu Tech REDtech
Jul 31, 2025 · Artificial Intelligence

How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM

dots.ocr is a 1.7 billion-parameter multilingual document-parsing model that unifies layout detection and content recognition within a single visual-language model, delivering state-of-the-art performance across text, tables, formulas and reading order while remaining efficient and extensible for future multimodal AI research.

AI · Document Parsing · OCR
0 likes · 10 min read
Java Tech Enthusiast
Jul 13, 2025 · Artificial Intelligence

Build a Java SpringBoot 3.x License Plate Recognition System with OCR

This article walks through creating a server‑side license‑plate recognition solution using Java SpringBoot 3.x, Tesseract OCR, and OpenCV, covering project goals, Maven dependencies, image‑processing services, special‑plate handling, and a REST API for real‑time plate detection.

Java · OCR · license-plate-recognition
0 likes · 8 min read
Baidu Geek Talk
Jul 9, 2025 · Artificial Intelligence

PaddleOCR 3.1 Unveils Multilingual PP‑OCRv5, Document Translation, and MCP Server Integration

PaddleOCR 3.1 introduces three major upgrades: a multilingual PP‑OCRv5 model that supports 37 languages with an accuracy gain of over 30%, a PP‑DocTranslation pipeline for high‑quality multi‑language document translation, and MCP server support for flexible AI application integration. The release is accompanied by detailed CLI usage, demo scenarios, and open‑source resources.

AI · MCP · OCR
0 likes · 11 min read
Programmer XiaoFu
Jun 10, 2025 · Backend Development

Integrating Tess4j with SpringBoot: Low‑Cost OCR Image Recognition

This tutorial shows how to add OCR capabilities to a SpringBoot application using the Tess4j library, covering dependency setup, Tesseract engine initialization, RESTful endpoint implementation, training data choices, and practical tips for handling resources and deployment.

Java · OCR · restapi
0 likes · 7 min read
Selected Java Interview Questions
Jun 3, 2025 · Artificial Intelligence

Implementing OCR in Java with SpringBoot and Tess4j

This article demonstrates how to build a lightweight OCR service in Java using SpringBoot and the Tess4j library, covering dependency setup, Tesseract engine initialization, RESTful API creation, training data options, and deployment considerations.

Image processing · OCR · RESTful API
0 likes · 7 min read
Python Programming Learning Circle
May 6, 2025 · Artificial Intelligence

Automatic Math Equation Grading with Python: Data Generation, CNN Training, Image Segmentation, and Result Feedback

This tutorial explains how to build a Python-based automatic grading system for handwritten math equations by generating synthetic character images, training a convolutional neural network, segmenting input images using projection techniques, evaluating expressions with eval, and overlaying correctness indicators on the original image.

CNN · Image processing · Math Grading
0 likes · 28 min read
Liangxu Linux
Apr 22, 2025 · Artificial Intelligence

Top 10 Open-Source OCR Projects on GitHub Ranked by Stars

This article compiles a ranked list of ten popular open-source OCR projects on GitHub, summarizing each tool’s key capabilities—such as multimodal text extraction, PDF linearization, layout analysis, and multilingual support—along with star counts and direct repository links for developers seeking ready-to-use OCR solutions.

GitHub · Multimodal · OCR
0 likes · 9 min read
Python Programming Learning Circle
Apr 15, 2025 · Artificial Intelligence

Automatic Math Expression Grading with Python, CNN and Image Processing

This tutorial explains how to generate synthetic digit fonts, build a convolutional neural network to recognize handwritten arithmetic expressions, segment images using projection methods, evaluate the results with Python's eval function, and overlay feedback symbols on the original image, providing a complete end‑to‑end solution.

Automation · CNN · ImageProcessing
0 likes · 27 min read
58UXD
Mar 14, 2025 · Product Management

How 58租房 Accelerated Landlord Publishing with LBS, OCR, and AI Guidance

This case study details how 58租房 (58.com's rental listing service) tackled cumbersome landlord publishing by redesigning the workflow with smart location (LBS), AI‑driven shooting assistance, OCR‑based document recognition, and digital‑human guidance, achieving up to 90% faster operations, higher accuracy, and stronger privacy protection.

AI guidance · LBS · OCR
0 likes · 7 min read
AI Frontier Lectures
Mar 7, 2025 · Artificial Intelligence

Can Mistral’s New OCR Model Really Beat the Competition? A Deep Dive

Mistral AI’s newly launched OCR API claims to deliver world‑class document understanding with multilingual support, high speed, and self‑hosting options, and benchmark tests show it outperforms Azure OCR and Google Doc AI, yet independent evaluations reveal limitations on complex tables and legal forms, prompting a balanced assessment of its readiness for enterprise use.

AI model · Mistral AI · OCR
0 likes · 7 min read
Sohu Tech Products
Jan 8, 2025 · Artificial Intelligence

Multimodal RAG: Implementation Paths and Development Prospects

The talk outlines Multimodal RAG implementation routes, comparing OCR‑based object recognition, transformer encoder‑decoder encoding, and Visual Language Model processing, explains the ColPali late‑interaction method for multi‑dimensional vector matching, addresses scaling tensors with binarization and reranking, and recommends a hybrid long‑term strategy where VLM excels on abstract imagery while traditional OCR remains valuable.

ColPali · Document processing · Multimodal RAG
0 likes · 10 min read
Programmer DD
Dec 31, 2024 · Artificial Intelligence

Build an AI‑Powered Expense Tracker with GLM‑4V‑Flash and MaxKB

This article demonstrates how to create an AI‑driven personal expense‑tracking assistant by leveraging Zhipu's GLM‑4V‑Flash multimodal model for receipt OCR, generating SQL statements, and integrating them with MaxKB workflows and a MySQL database, complete with code snippets and deployment steps.

AI · GLM-4V-Flash · MaxKB
0 likes · 13 min read
Architecture Breakthrough
Dec 26, 2024 · Industry Insights

Understanding Chinese Invoices: Types, Lifecycle, and FinTech Applications

This article provides a comprehensive overview of Chinese invoices, covering legal definitions, paper and electronic forms, basic copies, content fields, lifecycle stages, classification of VAT and ordinary invoices, the distinction between full‑electronic and digital invoices, and their practical use in fintech solutions such as OCR and third‑party verification platforms.

China · OCR · VAT
0 likes · 18 min read
Test Development Learning Exchange
Dec 6, 2024 · Artificial Intelligence

Using pytesseract and Pillow for OCR: Installation, Configuration, and Accuracy Improvement Techniques

This guide explains how to install Tesseract OCR and the Python libraries pytesseract and Pillow, configure the engine path, perform image-to-text extraction with example code, and apply various preprocessing, detection, and post‑processing methods to significantly improve OCR accuracy.

OCR · Python · computer vision
0 likes · 8 min read
Huolala Tech
Nov 28, 2024 · Artificial Intelligence

How AI-Powered OCR Transforms Freight Document and Vehicle Verification

This article explains how AI-driven OCR combined with deep‑learning image classification streamlines ticket, document, and license‑plate verification in freight logistics, detailing system architecture, algorithmic components, and future prospects for unified large‑model OCR solutions.

OCR · artificial-intelligence · image classification
0 likes · 12 min read
Full-Stack Cultivation Path
Nov 25, 2024 · Artificial Intelligence

Get High-Quality OCR with Ollama-OCR in Just a Few Lines of Code

This guide shows how to set up the open‑source Ollama‑OCR tool, which leverages the Llama 3.2‑Vision multimodal model to perform high‑quality OCR, covering installation of Ollama, the vision model, the OCR package, and example code for plain‑text and Markdown outputs.

Llama 3.2-Vision · Node.js · OCR
0 likes · 6 min read
Bilibili Tech
Nov 8, 2024 · Artificial Intelligence

AI-Powered Game Recognition for League of Legends Live Streaming on Bilibili

Bilibili’s AI‑driven game‑recognition system extracts real‑time LoL events through OCR, hero detection and hot‑spot tagging, generating high‑energy timestamps and interactive overlays that let viewers jump to key moments and view detailed statistics, enhancing spectator engagement and analytical capabilities across major esports tournaments.

AI · Game Recognition · Multimodal
0 likes · 14 min read
Architect
Nov 2, 2024 · Frontend Development

How to Build Robust Dark Watermarks and Boost OCR Accuracy in Web Apps

This article walks through the evolution of watermark techniques, demonstrates how to harden a front‑end watermark against deletion, invisibility, and covering using MutationObserver and canvas, introduces a low‑visibility dark watermark with decode logic, and details OCR integration and optimization to improve recognition accuracy in screenshot‑search scenarios.

Canvas · Image processing · MutationObserver
0 likes · 21 min read