Get High-Quality OCR with Ollama-OCR in Just a Few Lines of Code
This guide shows how to set up the open-source Ollama-OCR tool, which leverages the Llama 3.2-Vision multimodal model to perform high-quality OCR. It covers installing Ollama, the vision model, and the OCR package, and walks through example code for plain-text and Markdown output.
Llama 3.2-Vision is a multimodal large language model available in 11B and 90B sizes, capable of processing text and image inputs and generating text output. It excels at visual recognition, image reasoning, captioning, and answering image‑related questions, outperforming many open‑source and closed models on industry benchmarks.
This article introduces the open‑source ollama-ocr tool, which by default uses a locally running Llama 3.2-Vision model to accurately recognize text in images while preserving the original format.
https://github.com/bytefer/ollama-ocr
Features of Ollama-OCR
High‑precision text recognition with Llama 3.2-Vision, retaining original layout.
Supports multiple image formats: JPG, JPEG, PNG.
Customizable recognition prompts and model selection (see the sketch after this list).
Optional Markdown output format.
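For example, here is a minimal sketch of passing a custom prompt, based on the ollamaOCR call shown later in this article; the image path and prompt text are just placeholders, and the option name for switching models should be checked in the package README:

import { ollamaOCR } from "ollama-ocr";

async function runCustomOCR() {
  // A custom system prompt instead of the package's default
  const text = await ollamaOCR({
    filePath: "./invoice.png", // placeholder image path
    systemPrompt: "Extract only the line items and totals from this receipt.",
  });
  console.log(text);
}

runCustomOCR();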
Llama 3.2-Vision Application Scenarios
Handwritten Text Recognition
OCR Recognition
Image Question Answering
Environment Setup
Install Ollama
Before using Llama 3.2-Vision, install Ollama, a platform for running multimodal models locally.
Download Ollama from the official website for your operating system.
Run the installer and follow the prompts.
Install Llama 3.2-Vision 11B
After Ollama is installed, run the following command:
ollama run llama3.2-vision
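The first time you run this command, Ollama downloads the model weights. Once that finishes, you can confirm the model is available locally with the standard list command:

ollama list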
Install Ollama-OCR

npm install ollama-ocr
# or using pnpm
pnpm add ollama-ocr

Using Ollama-OCR
OCR
import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  // Recognize the text in the image using the package's default OCR prompt
  const text = await ollamaOCR({
    filePath: "./handwriting.jpg",
    systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
  });
  console.log(text);
}

runOCR();

The test image is shown below:
The resulting output is:
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of instruction‑tuned image reasoning generative models in 118 and 908 sizes (text + images in / text out). The Llama 3.2-Vision instruction‑tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
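If you need to process several images, the same call can be reused in a loop. The following is just a sketch built on the ollamaOCR options shown above; the file names are placeholders:

import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

async function runBatchOCR(filePaths: string[]) {
  for (const filePath of filePaths) {
    // Process images one at a time so the local model handles a single request at once
    const text = await ollamaOCR({ filePath, systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT });
    console.log(`--- ${filePath} ---\n${text}`);
  }
}

runBatchOCR(["./handwriting.jpg", "./notes.png"]);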
Output Markdown
import { ollamaOCR, DEFAULT_MARKDOWN_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  // Use the Markdown prompt so structure such as headings and tables is preserved
  const text = await ollamaOCR({
    filePath: "./trader-joes-receipt.jpg",
    systemPrompt: DEFAULT_MARKDOWN_SYSTEM_PROMPT,
  });
  console.log(text);
}

runOCR();

Test image:
The resulting Markdown output is displayed in the following screenshot:
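If you want to keep the Markdown instead of only printing it, you can write it to a file with Node's fs module. A minimal sketch, assuming the output path is arbitrary:

import { writeFile } from "node:fs/promises";
import { ollamaOCR, DEFAULT_MARKDOWN_SYSTEM_PROMPT } from "ollama-ocr";

async function saveMarkdown() {
  const markdown = await ollamaOCR({
    filePath: "./trader-joes-receipt.jpg",
    systemPrompt: DEFAULT_MARKDOWN_SYSTEM_PROMPT,
  });
  // Persist the recognized Markdown next to the script
  await writeFile("./receipt.md", markdown, "utf8");
}

saveMarkdown();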
If you prefer to use a hosted Llama 3.2-Vision model instead of running one locally, you can try the llama-ocr library.
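Based on the llama-ocr README at the time of writing, a call looks roughly like the sketch below; it sends the image to a hosted Together AI endpoint, so an API key is required (check the repository for the current API):

import { ocr } from "llama-ocr";

async function runHostedOCR() {
  // llama-ocr forwards the image to a hosted Llama 3.2-Vision model
  const markdown = await ocr({
    filePath: "./trader-joes-receipt.jpg",
    apiKey: process.env.TOGETHER_API_KEY, // Together AI API key
  });
  console.log(markdown);
}

runHostedOCR();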
References
ollama-ocr: https://github.com/bytefer/ollama-ocr
Ollama: https://ollama.com/
Llama 3.2-Vision 11B: https://ollama.com/blog/llama3.2-vision
llama-ocr: https://github.com/Nutlope/llama-ocr