A Step-by-Step Guide to Using Mistral OCR

Extracting text from PDFs and images is easier than ever with Mistral OCR. This guide walks you through setting it up, processing documents, and handling real-world use cases like invoices, academic papers, and bulk uploads. With working code snippets in Python and TypeScript, you’ll have a functional OCR pipeline in no time. Let's dive in.

Mistral AI recently launched Mistral OCR, a framework designed to extract text and structure from documents. Whether you’re processing PDFs or images, it extracts the text while preserving the original layout, including headers, paragraphs, lists, and tables.

Presentation of the Framework

At its core, the Mistral OCR processor leverages the latest OCR model (mistral-ocr-latest) and is built to:

Preserve Document Structure: Extracts both raw text and metadata (headers, paragraphs, tables, etc.).

Process Complex Layouts: Handles multi-column text and mixed content.

Return Markdown Outputs: Facilitates easy parsing and rendering.

Scale with High Accuracy: Suitable for large-scale document processing tasks.

Benefits

Using Mistral OCR provides several advantages:

Accurate Extraction: Maintains the original document’s hierarchy and formatting.

Ease of Integration: Comes with client libraries for Python and TypeScript, and supports direct API calls via curl.

Versatile Document Support: Works with PDFs, images, and various uploaded document formats.

Quick Setup: Integrates seamlessly into your workflows and pipelines.

Getting Started: Installation and Setup

Prerequisites

Before you begin:

API Key: Obtain an API key from Mistral AI and set it as an environment variable (MISTRAL_API_KEY).

Development Environment: Set up your Python (or Node.js) environment.

Installation (Python Example)

Install the Mistral client library:

pip install mistralai

Code Snippet: First Run in Python

import os
from mistralai import Mistral

# Set your API key from environment variables
api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

# Process a document via URL
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)

print(ocr_response)

This example initializes the client, sends a document URL for OCR processing, and prints the response object, which holds the markdown output for each page along with document metadata.
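Printing the whole response is handy for a first look, but most pipelines only want the text. Here is a minimal sketch, assuming the response exposes a `pages` list whose items carry a `markdown` attribute, as the Python client appears to do. `pages_to_markdown` is a hypothetical helper name, and the stand-in objects exist only so the example runs without an API call:

```python
from types import SimpleNamespace


def pages_to_markdown(ocr_response, separator="\n\n"):
    """Join the markdown of every page into one string.

    Assumes `ocr_response.pages` is an iterable of objects with a
    `markdown` attribute; adjust if your client version differs.
    """
    return separator.join(page.markdown for page in ocr_response.pages)


# Stand-in for a real response, so the helper can be tried offline.
fake_response = SimpleNamespace(
    pages=[
        SimpleNamespace(index=0, markdown="# Title\n\nFirst page."),
        SimpleNamespace(index=1, markdown="Second page."),
    ]
)

full_text = pages_to_markdown(fake_response)
print(full_text)
```

With a real call, you would pass `ocr_response` from the snippet above in place of `fake_response`.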

Code Snippet: First Run in TypeScript

import { Mistral } from '@mistralai/mistralai';

const apiKey = process.env.MISTRAL_API_KEY;
const client = new Mistral({ apiKey: apiKey });

async function processDocument() {
    const ocrResponse = await client.ocr.process({
        model: "mistral-ocr-latest",
        document: {
            type: "document_url",
            documentUrl: "https://arxiv.org/pdf/2201.04234"
        },
        includeImageBase64: true
    });
    console.log(ocrResponse);
}

processDocument();

Example: Building a Simple OCR Agent

Below is a step-by-step example of creating a simple OCR agent in Python. This agent takes a document URL, processes it through Mistral OCR, and returns structured markdown content.

import os
from mistralai import Mistral

class SimpleOCRAgent:
    def __init__(self, api_key):
        self.client = Mistral(api_key=api_key)
    
    def process_document(self, document_url):
        response = self.client.ocr.process(
            model="mistral-ocr-latest",
            document={
                "type": "document_url",
                "document_url": document_url
            },
            include_image_base64=True
        )
        return response

if __name__ == "__main__":
    api_key = os.environ.get("MISTRAL_API_KEY")
    if not api_key:
        raise ValueError("Please set the MISTRAL_API_KEY environment variable.")
    
    agent = SimpleOCRAgent(api_key=api_key)
    document_url = "https://arxiv.org/pdf/2201.04234"  # Change as needed
    result = agent.process_document(document_url)
    print("OCR Result:")
    print(result)

Explanation:

Initialization: The agent initializes with the API key.

Processing: The process_document method sends the document URL to the Mistral OCR processor.

Output: It prints the structured OCR result (in markdown format) including text and metadata.
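Because the agent requests `include_image_base64=True`, the result can also carry embedded images as base64 strings. The sketch below decodes such a string whether or not it arrives wrapped in a data URI. `decode_image_payload` is a hypothetical helper, and the exact shape of the image payload is an assumption, so inspect a real response before relying on it:

```python
import base64


def decode_image_payload(payload):
    """Decode a base64 image string, with or without a data-URI prefix."""
    if payload.startswith("data:") and "," in payload:
        # Keep only the part after "data:<mime>;base64,"
        payload = payload.split(",", 1)[1]
    return base64.b64decode(payload)


# Both forms decode to the same bytes.
raw_from_uri = decode_image_payload("data:image/png;base64,iVBORw==")
raw_from_plain = decode_image_payload("iVBORw==")
print(raw_from_uri == raw_from_plain)
```

In practice you would loop over the images attached to each page of the result and write the decoded bytes to disk. The attribute names for those images depend on the client version, so treat them as something to check rather than a given.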

Error Handling and Improvements

In a production setting, you might want to:

• Add exception handling for network issues.

• Validate the document URL.

• Parse the returned markdown to render in a UI.
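The first two points can be sketched with the standard library alone. Both `is_valid_document_url` and `call_with_retries` are hypothetical helpers, not part of the Mistral client, and catching every exception is a simplification. In real code you would retry only on the network or rate-limit errors your client raises:

```python
import time
from urllib.parse import urlparse


def is_valid_document_url(url):
    """Return True for absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def call_with_retries(func, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call func, retrying on any exception with exponential backoff.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


print(is_valid_document_url("https://arxiv.org/pdf/2201.04234"))
print(is_valid_document_url("not a url"))
```

A call such as `call_with_retries(agent.process_document, document_url)` would then wrap the agent from the example above.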

Additional Use Cases with Implementation Details

1. Invoice Processing and Data Extraction

Mistral OCR can extract structured data from invoices, preserving tables and key fields like invoice numbers, dates, and totals. Once the OCR response is obtained, you can apply further parsing to extract the required information.

Python Code Snippet:

import re

def extract_invoice_details(markdown_text):
    # Use regular expressions to find key invoice details
    invoice_number = re.search(r"Invoice Number:\s*(\w+)", markdown_text)
    invoice_date = re.search(r"Invoice Date:\s*([\d/-]+)", markdown_text)
    total_amount = re.search(r"Total Amount:\s*\$?([\d,]+\.\d{2})", markdown_text)
    
    return {
        "invoice_number": invoice_number.group(1) if invoice_number else "Not Found",
        "invoice_date": invoice_date.group(1) if invoice_date else "Not Found",
        "total_amount": total_amount.group(1) if total_amount else "Not Found"
    }

# Assuming `ocr_response` is the client's response object, whose `pages` each expose a `markdown` string
ocr_markdown = "\n\n".join(page.markdown for page in ocr_response.pages)
invoice_details = extract_invoice_details(ocr_markdown)
print("Extracted Invoice Details:", invoice_details)

This snippet processes the OCR markdown to extract and print invoice details using regex matching.
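Since tables are preserved in the markdown, line items can be read from them as well. The following is a rough sketch, assuming the table arrives as a standard pipe table. `parse_markdown_table` is a hypothetical helper, the sample invoice is made up for illustration, and escaped pipes inside cells are not handled:

```python
def parse_markdown_table(markdown_text):
    """Return the first pipe table in markdown_text as a list of dicts."""
    rows = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            # Drop the outer pipes, then split the remaining cells.
            rows.append([cell.strip() for cell in stripped[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Skip the separator row made only of dashes and colons.
    body = [r for r in body if not all(c and set(c) <= set("-:") for c in r)]
    return [dict(zip(header, r)) for r in body]


sample_markdown = """Invoice Number: INV1001

| Item | Qty | Price |
| --- | --- | --- |
| Widget | 2 | 9.99 |
| Gadget | 1 | 24.50 |

Total Amount: $44.48
"""

line_items = parse_markdown_table(sample_markdown)
print(line_items)
```

All values come back as strings, so convert quantities and prices to numeric types before doing arithmetic on them.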

2. Academic Paper Analysis and Summarization

Researchers can use Mistral OCR to convert academic papers into markdown format, then apply natural language processing (NLP) for further analysis or summarization. For instance, you might extract sections like the abstract, introduction, and conclusion.

Python Code Snippet:

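import re

def extract_section(markdown_text, section_title):
    # Match a heading containing the title, then everything up to the next heading.
    # The quantifier braces are doubled because this is an f-string, and the
    # title is escaped so regex metacharacters in it are matched literally.
    pattern = rf"(#{{1,6}}\s*{re.escape(section_title)}.*?)(?=\n#|\Z)"
    match = re.search(pattern, markdown_text, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else "Section not found"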

# Extracting the Abstract and Conclusion
abstract = extract_section(ocr_markdown, "Abstract")
conclusion = extract_section(ocr_markdown, "Conclusion")
print("Abstract:\n", abstract)
print("\nConclusion:\n", conclusion)

This snippet demonstrates how to extract specific sections from the OCR markdown for further processing or summarization.
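When several sections are needed, it can be simpler to split the document once and look them up by title. This is a minimal sketch under the assumption that headings use `#` markers at the start of a line. `split_sections` is a hypothetical helper, duplicate titles overwrite each other, and `#` lines inside code blocks would be mistaken for headings:

```python
import re

# One to six '#' characters, at least one space or tab, then the title.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_sections(markdown_text):
    """Map each heading title to the text between it and the next heading."""
    sections = {}
    matches = list(HEADING_RE.finditer(markdown_text))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        sections[match.group(2)] = markdown_text[start:end].strip()
    return sections


sample_paper = (
    "# Paper Title\n\n"
    "## Abstract\n\nWe study X.\n\n"
    "## 1 Introduction\n\nIntro text.\n\n"
    "## Conclusion\n\nX works.\n"
)

sections = split_sections(sample_paper)
print(sections["Abstract"])
print(sections["Conclusion"])
```

Each value can then be handed to whatever summarization step you use.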

3. Bulk Document Processing

For large-scale document processing, you may want to process multiple documents in a batch. The following Python example loops over a list of document URLs, processes each with Mistral OCR, and stores the results.

Python Code Snippet:

document_urls = [
    "https://arxiv.org/pdf/2201.04234",
    "https://example.com/invoice1.pdf",
    "https://example.com/invoice2.pdf"
]

def process_documents(urls, agent):
    results = {}
    for url in urls:
        try:
            result = agent.process_document(url)
            results[url] = result
            print(f"Processed document: {url}")
        except Exception as e:
            results[url] = f"Error: {e}"
            print(f"Failed processing {url}: {e}")
    return results

bulk_results = process_documents(document_urls, agent)
print("Bulk Processing Results:", bulk_results)

This snippet shows how to handle multiple document URLs in a batch process with error handling.
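Because each request mostly waits on the network, a sequential loop can be slow for long lists. One option is a thread pool, sketched below with the standard library. `process_concurrently` and `fake_ocr` are hypothetical names, and the sketch assumes the client can safely be shared across threads and that your API rate limits allow parallel calls, neither of which is confirmed here:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_concurrently(urls, process_one, max_workers=4):
    """Run process_one(url) for each URL in a thread pool.

    Returns two dicts keyed by URL: successful results and error messages.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = str(exc)
    return results, errors


def fake_ocr(url):
    """Stand-in for a real OCR call, so the example runs offline."""
    if "bad" in url:
        raise ValueError("unreachable document")
    return f"markdown for {url}"


results, errors = process_concurrently(
    ["https://example.com/a.pdf", "https://example.com/bad.pdf"], fake_ocr
)
print(results)
print(errors)
```

Keeping errors in a separate dictionary avoids mixing response objects and error strings in one result set. With the agent from earlier, `process_one` would be `agent.process_document`.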

4. Image-Based Document Processing

Besides PDFs, Mistral OCR can process images directly. You can either use local image files or image URLs. Here’s an example processing an image file.

Python Code Snippet:

import base64

def process_local_image(image_path, agent):
    # Open and read the image file in binary mode
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
    
    # Convert binary data to a base64 encoded string
    encoded_image = base64.b64encode(image_data).decode('utf-8')
    
    response = agent.client.ocr.process(
        model="mistral-ocr-latest",
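        document={
            # Mistral's documentation sends base64 images as a data URI under "image_url";
            # change the MIME type (image/jpeg here) to match your file
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{encoded_image}"
        },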
        include_image_base64=True
    )
    return response

# Replace 'path/to/image.jpg' with the actual image file path
image_response = process_local_image("path/to/image.jpg", agent)
print("Image OCR Result:", image_response)

This snippet illustrates handling a local image file and processing it with Mistral OCR. Adjust the image encoding method as per your environment’s requirements.
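If you send the image as a data URI, which is one common way to pass base64 content, the MIME type in the URI should match the file. The sketch below infers it from the file name with the standard `mimetypes` module. `image_bytes_to_data_uri` is a hypothetical helper, and guessing from the extension is a simplification that does not inspect the file's actual contents:

```python
import base64
import mimetypes


def image_bytes_to_data_uri(image_bytes, filename):
    """Build a data URI, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Four bytes are enough to show the shape of the result.
uri = image_bytes_to_data_uri(b"\x89PNG", "scan.png")
print(uri)
```

Rejecting unknown extensions early gives a clearer error than a failed API call would.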

Final Thoughts

Mistral OCR significantly simplifies the extraction of text and structural data from diverse document types. Its markdown-formatted output makes it a good fit for automated document analysis, whether you’re processing invoices, summarizing academic papers, handling bulk document uploads, or working with images.

Until the next one,

Cohorte Team

March 13, 2025