Get High-Quality OCR with Ollama-OCR in Just a Few Lines of Code
This guide shows how to set up the open-source Ollama-OCR tool, which leverages the Llama 3.2-Vision multimodal model to perform high-quality OCR. It covers installing Ollama, the vision model, and the OCR package, and walks through example code for plain-text and Markdown output.
Llama 3.2-Vision is a multimodal large language model available in 11B and 90B sizes, capable of processing text and image inputs and generating text output. It excels at visual recognition, image reasoning, captioning, and answering image‑related questions, outperforming many open‑source and closed models on industry benchmarks.
This article introduces the open‑source ollama-ocr tool, which by default uses a locally running Llama 3.2-Vision model to accurately recognize text in images while preserving the original format.
https://github.com/bytefer/ollama-ocr
Features of Ollama-OCR
High‑precision text recognition with Llama 3.2-Vision, retaining original layout.
Supports multiple image formats: JPG, JPEG, PNG.
Customizable recognition prompts and model selection (see the sketch after this list).
Optional Markdown output format.
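For example, here is a minimal sketch of passing a custom prompt, based on the ollamaOCR call shown later in this article; the image path and prompt text are just placeholders, and the option name for switching models should be checked in the package README:

import { ollamaOCR } from "ollama-ocr";

async function runCustomOCR() {
  // A custom system prompt instead of the package's default
  const text = await ollamaOCR({
    filePath: "./invoice.png", // placeholder image path
    systemPrompt: "Extract only the line items and totals from this receipt.",
  });
  console.log(text);
}

runCustomOCR();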
Llama 3.2-Vision Application Scenarios
Handwritten Text Recognition
OCR Recognition
Image Question Answering
Environment Setup
Install Ollama
Before using Llama 3.2-Vision, install Ollama, a platform for running multimodal models locally.
Download Ollama from the official website for your operating system.
Run the installer and follow the prompts.
Install Llama 3.2-Vision 11B
After Ollama is installed, run the following command:
ollama run llama3.2-vision
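The first time you run this command, Ollama downloads the model weights. Once that finishes, you can confirm the model is available locally with the standard list command:

ollama list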
Install Ollama-OCR

npm install ollama-ocr
# or using pnpm
pnpm add ollama-ocr

Using Ollama-OCR
OCR
import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  // Recognize the text in the image using the package's default OCR prompt
  const text = await ollamaOCR({
    filePath: "./handwriting.jpg",
    systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
  });
  console.log(text);
}

runOCR();

The test image is shown below:
The resulting output is:
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of instruction‑tuned image reasoning generative models in 118 and 908 sizes (text + images in / text out). The Llama 3.2-Vision instruction‑tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
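If you need to process several images, the same call can be reused in a loop. The following is just a sketch built on the ollamaOCR options shown above; the file names are placeholders:

import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

async function runBatchOCR(filePaths: string[]) {
  for (const filePath of filePaths) {
    // Process images one at a time so the local model handles a single request at once
    const text = await ollamaOCR({ filePath, systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT });
    console.log(`--- ${filePath} ---\n${text}`);
  }
}

runBatchOCR(["./handwriting.jpg", "./notes.png"]);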
Output Markdown
import { ollamaOCR, DEFAULT_MARKDOWN_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  // Use the Markdown prompt so structure such as headings and tables is preserved
  const text = await ollamaOCR({
    filePath: "./trader-joes-receipt.jpg",
    systemPrompt: DEFAULT_MARKDOWN_SYSTEM_PROMPT,
  });
  console.log(text);
}

runOCR();

Test image:
The resulting Markdown output is displayed in the following screenshot:
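If you want to keep the Markdown instead of only printing it, you can write it to a file with Node's fs module. A minimal sketch, assuming the output path is arbitrary:

import { writeFile } from "node:fs/promises";
import { ollamaOCR, DEFAULT_MARKDOWN_SYSTEM_PROMPT } from "ollama-ocr";

async function saveMarkdown() {
  const markdown = await ollamaOCR({
    filePath: "./trader-joes-receipt.jpg",
    systemPrompt: DEFAULT_MARKDOWN_SYSTEM_PROMPT,
  });
  // Persist the recognized Markdown next to the script
  await writeFile("./receipt.md", markdown, "utf8");
}

saveMarkdown();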
If you prefer to use a hosted Llama 3.2-Vision model instead of running one locally, you can try the llama-ocr library.
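Based on the llama-ocr README at the time of writing, a call looks roughly like the sketch below; it sends the image to a hosted Together AI endpoint, so an API key is required (check the repository for the current API):

import { ocr } from "llama-ocr";

async function runHostedOCR() {
  // llama-ocr forwards the image to a hosted Llama 3.2-Vision model
  const markdown = await ocr({
    filePath: "./trader-joes-receipt.jpg",
    apiKey: process.env.TOGETHER_API_KEY, // Together AI API key
  });
  console.log(markdown);
}

runHostedOCR();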
References
ollama-ocr: https://github.com/bytefer/ollama-ocr
Ollama: https://ollama.com/
Llama 3.2-Vision 11B: https://ollama.com/blog/llama3.2-vision
llama-ocr: https://github.com/Nutlope/llama-ocr