A Step-by-Step Guide to Using Mistral OCR

Mistral AI recently launched Mistral OCR, an advanced framework designed to extract text and structure from documents. Whether you’re processing PDFs or images, it not only extracts text but also preserves the original layout, including headers, paragraphs, lists, and tables.
Presentation of the Framework
At its core, the Mistral OCR processor leverages the latest OCR model (mistral-ocr-latest) and is built to:
• Preserve Document Structure: Extracts both raw text and metadata (headers, paragraphs, tables, etc.).
• Process Complex Layouts: Handles multi-column text and mixed content.
• Return Markdown Outputs: Facilitates easy parsing and rendering.
• Scale with High Accuracy: Suitable for large-scale document processing tasks.
Benefits
Using Mistral OCR provides several advantages:
• Accurate Extraction: Maintains the original document’s hierarchy and formatting.
• Ease of Integration: Comes with client libraries for Python and TypeScript, and supports direct API calls via curl.
• Versatile Document Support: Works with PDFs, images, and various uploaded document formats.
• Quick Setup: Integrates seamlessly into your workflows and pipelines.
Getting Started: Installation and Setup
Prerequisites
Before you begin:
• API Key: Obtain an API key from Mistral AI and set it as an environment variable (MISTRAL_API_KEY).
• Development Environment: Set up your Python (or Node.js) environment.
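A minimal sketch of the API key check, assuming you want a clear error message when the variable is missing. The helper name `load_api_key` is hypothetical and not part of the Mistral client library; it only uses the Python standard library.

```python
import os


def load_api_key(env=None, var_name="MISTRAL_API_KEY"):
    """Return the API key from the environment, or raise a descriptive error.

    Illustrative helper only; it is not part of the mistralai package.
    """
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the examples.")
    return key
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.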
Installation (Python Example)
Install the Mistral client library:
pip install mistralai
Code Snippet: First Run in Python
import os
from mistralai import Mistral

# Read your API key from the environment
api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

# Process a document via URL
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)
print(ocr_response)
This example initializes the client, sends a document URL for OCR processing, and prints the response, which contains the markdown output for each page along with document metadata.
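To work with the text rather than the raw response object, you can join the per-page markdown into a single string. The sketch below assumes the response exposes a `pages` list whose items have a `markdown` attribute. The stand-in objects replace a real API call so the example runs offline, and `combine_markdown` is a hypothetical helper name.

```python
from types import SimpleNamespace


def combine_markdown(ocr_response, separator="\n\n"):
    """Join the markdown of every page into one string (hypothetical helper)."""
    return separator.join(page.markdown for page in ocr_response.pages)


# Stand-in for a real response so the sketch runs without an API key
fake_response = SimpleNamespace(
    pages=[
        SimpleNamespace(markdown="# Title\n\nFirst page text."),
        SimpleNamespace(markdown="## Section\n\nSecond page text."),
    ]
)
full_text = combine_markdown(fake_response)
print(full_text)
```

With a real response, you would pass the object returned by `client.ocr.process` instead of `fake_response`.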
Code Snippet: First Run in TypeScript
import { Mistral } from '@mistralai/mistralai';

const apiKey = process.env.MISTRAL_API_KEY;
const client = new Mistral({ apiKey: apiKey });

async function processDocument() {
  const ocrResponse = await client.ocr.process({
    model: "mistral-ocr-latest",
    document: {
      type: "document_url",
      documentUrl: "https://arxiv.org/pdf/2201.04234"
    },
    includeImageBase64: true
  });
  console.log(ocrResponse);
}

processDocument();
Example: Building a Simple OCR Agent
Below is a step-by-step example of creating a simple OCR agent in Python. This agent takes a document URL, processes it through Mistral OCR, and returns structured markdown content.
import os
from mistralai import Mistral

class SimpleOCRAgent:
    def __init__(self, api_key):
        self.client = Mistral(api_key=api_key)

    def process_document(self, document_url):
        response = self.client.ocr.process(
            model="mistral-ocr-latest",
            document={
                "type": "document_url",
                "document_url": document_url
            },
            include_image_base64=True
        )
        return response

if __name__ == "__main__":
    api_key = os.environ.get("MISTRAL_API_KEY")
    if not api_key:
        raise ValueError("Please set the MISTRAL_API_KEY environment variable.")
    agent = SimpleOCRAgent(api_key=api_key)
    document_url = "https://arxiv.org/pdf/2201.04234"  # Change as needed
    result = agent.process_document(document_url)
    print("OCR Result:")
    print(result)
Explanation:
• Initialization: The agent initializes with the API key.
• Processing: The process_document method sends the document URL to the Mistral OCR processor.
• Output: It prints the structured OCR result (in markdown format) including text and metadata.
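Because the agent requests `include_image_base64=True`, each page can also carry its extracted images. A minimal sketch of inlining them into the markdown follows. It assumes each page has an `images` list whose items expose `id` and `image_base64` attributes, and that the markdown references images as `![id](id)`. Verify both assumptions against your SDK version. The stand-in page object and the `inline_images` helper are illustrative.

```python
from types import SimpleNamespace


def inline_images(page):
    """Replace image placeholders in a page's markdown with their base64 data.

    Assumes `page.images` items have `id` and `image_base64` attributes and
    that the markdown references each image as ![id](id).
    """
    markdown = page.markdown
    for image in page.images:
        placeholder = f"![{image.id}]({image.id})"
        markdown = markdown.replace(placeholder, f"![{image.id}]({image.image_base64})")
    return markdown


# Stand-in page so the sketch runs without calling the API
fake_page = SimpleNamespace(
    markdown="Figure 1 below.\n\n![img-0.jpeg](img-0.jpeg)",
    images=[SimpleNamespace(id="img-0.jpeg", image_base64="data:image/jpeg;base64,AAAA")],
)
print(inline_images(fake_page))
```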
Error Handling and Improvements
In a production setting, you might want to:
• Add exception handling for network issues.
• Validate the document URL.
• Parse the returned markdown to render in a UI.
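The first two points can be sketched with the standard library alone. The helper names below are hypothetical, and the retry wrapper catches every exception for simplicity. In real code you would likely narrow that to the network and API errors your client raises.

```python
import time
from urllib.parse import urlparse


def is_valid_document_url(url):
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

With the agent from the previous section, a call might look like `call_with_retries(agent.process_document, url)` after checking `is_valid_document_url(url)`.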
Additional Use Cases with Implementation Details
1. Invoice Processing and Data Extraction
Mistral OCR can extract structured data from invoices, preserving tables and key fields like invoice numbers, dates, and totals. Once the OCR response is obtained, you can apply further parsing to extract the required information.
Python Code Snippet:
import re

def extract_invoice_details(markdown_text):
    # Use regular expressions to find key invoice details
    invoice_number = re.search(r"Invoice Number:\s*(\w+)", markdown_text)
    invoice_date = re.search(r"Invoice Date:\s*([\d/-]+)", markdown_text)
    total_amount = re.search(r"Total Amount:\s*\$?([\d,]+\.\d{2})", markdown_text)
    return {
        "invoice_number": invoice_number.group(1) if invoice_number else "Not Found",
        "invoice_date": invoice_date.group(1) if invoice_date else "Not Found",
        "total_amount": total_amount.group(1) if total_amount else "Not Found"
    }

# Assuming `ocr_response` is the object returned by client.ocr.process,
# with a `pages` list whose items expose a `markdown` attribute
ocr_markdown = "\n\n".join(page.markdown for page in ocr_response.pages)
invoice_details = extract_invoice_details(ocr_markdown)
print("Extracted Invoice Details:", invoice_details)
This snippet processes the OCR markdown to extract and print invoice details using regex matching.
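Line items usually sit in a table rather than in labelled fields. The sketch below parses the first pipe-delimited markdown table into a list of dictionaries. It assumes tables come back in standard markdown pipe syntax. The sample invoice text is invented for illustration, and `parse_markdown_table` is a hypothetical helper.

```python
def parse_markdown_table(markdown_text):
    """Parse the first pipe-delimited markdown table into a list of row dicts."""
    rows = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if len(line) >= 2 and line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    # Drop the separator row made only of dashes and colons, e.g. | --- | :-: |
    body = [r for r in body if not all(c and set(c) <= set("-: ") for c in r)]
    return [dict(zip(header, r)) for r in body]


# Invented sample text standing in for real OCR output
sample = """Invoice Number: INV123

| Item | Qty | Price |
| --- | --- | --- |
| Widget | 2 | 9.99 |
| Gadget | 1 | 24.50 |

Total Amount: $44.48"""
items = parse_markdown_table(sample)
print(items)
```

All values stay as strings. Converting quantities and prices to numbers is left to the caller, since invoice formats vary.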
2. Academic Paper Analysis and Summarization
Researchers can use Mistral OCR to convert academic papers into markdown format, then apply natural language processing (NLP) for further analysis or summarization. For instance, you might extract sections like the abstract, introduction, and conclusion.
Python Code Snippet:
import re

def extract_section(markdown_text, section_title):
    # Match a markdown heading containing the title, up to the next heading or end of text.
    # The doubled braces keep {1,6} literal inside the f-string.
    pattern = rf"(#{{1,6}}\s*{re.escape(section_title)}.*?)(?=\n#|\Z)"
    match = re.search(pattern, markdown_text, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else "Section not found"

# Extracting the Abstract and Conclusion
abstract = extract_section(ocr_markdown, "Abstract")
conclusion = extract_section(ocr_markdown, "Conclusion")
print("Abstract:\n", abstract)
print("\nConclusion:\n", conclusion)
This snippet demonstrates how to extract specific sections from the OCR markdown for further processing or summarization.
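When you need several sections at once, it can be simpler to split the whole document by heading in a single pass. This is a minimal sketch, assuming headings use the `#` markdown syntax. The sample text is invented and `split_sections` is a hypothetical helper.

```python
import re


def split_sections(markdown_text):
    """Map each heading title to the text that follows it, in document order."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(lines).strip() for title, lines in sections.items()}


# Invented sample standing in for OCR output of a paper
paper = (
    "# A Paper\n\n## Abstract\nWe study X.\n\n"
    "## 1 Introduction\nX matters.\n\n## Conclusion\nX works."
)
sections = split_sections(paper)
print(sections["Abstract"])
```

If two headings share the same title, the later one overwrites the earlier one in this sketch. Text that appears before the first heading is ignored.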
3. Bulk Document Processing
For large-scale document processing, you may want to process multiple documents in a batch. The following Python example loops over a list of document URLs, processes each with Mistral OCR, and stores the results.
Python Code Snippet:
document_urls = [
    "https://arxiv.org/pdf/2201.04234",
    "https://example.com/invoice1.pdf",
    "https://example.com/invoice2.pdf"
]

def process_documents(urls, agent):
    results = {}
    for url in urls:
        try:
            result = agent.process_document(url)
            results[url] = result
            print(f"Processed document: {url}")
        except Exception as e:
            results[url] = f"Error: {e}"
            print(f"Failed processing {url}: {e}")
    return results

bulk_results = process_documents(document_urls, agent)
print("Bulk Processing Results:", bulk_results)
This snippet shows how to handle multiple document URLs in a batch process with error handling.
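Since each request spends most of its time waiting on the network, a thread pool can shorten large runs. The sketch below is one possible approach, not something the Mistral client requires. `process_concurrently` and `fake_ocr` are hypothetical names, and `fake_ocr` stands in for a real call such as `agent.process_document` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_concurrently(urls, process_fn, max_workers=4):
    """Run process_fn(url) for each URL in a thread pool; collect results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = str(exc)
    return results, errors


def fake_ocr(url):
    """Stand-in for a real OCR call; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("unreachable document")
    return f"markdown for {url}"


results, errors = process_concurrently(
    ["https://example.com/a.pdf", "https://example.com/bad.pdf"], fake_ocr
)
print(results)
print(errors)
```

Keeping successes and failures in separate dictionaries makes it easy to retry only the failed URLs. Keep `max_workers` modest so you stay within whatever rate limits apply to your account.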
4. Image-Based Document Processing
Besides PDFs, Mistral OCR can process images directly. You can either use local image files or image URLs. Here’s an example processing an image file.
Python Code Snippet:
import base64

def process_local_image(image_path, agent):
    # Open and read the image file in binary mode
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
    # Convert binary data to a base64 encoded string
    encoded_image = base64.b64encode(image_data).decode('utf-8')
    # Send the image as a base64 data URL in an image_url document
    response = agent.client.ocr.process(
        model="mistral-ocr-latest",
        document={
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{encoded_image}"
        },
        include_image_base64=True
    )
    return response

# Replace 'path/to/image.jpg' with the actual image file path
image_response = process_local_image("path/to/image.jpg", agent)
print("Image OCR Result:", image_response)
This snippet illustrates handling a local image file and processing it with Mistral OCR. Adjust the MIME type in the data URL (image/jpeg here) to match your image format.
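A local image is typically passed to the API as a base64 data URL, and the MIME type in that URL should match the file. This is a minimal sketch of building one with the standard library. `to_data_url` is a hypothetical helper, and it guesses the type from the file extension rather than inspecting the bytes.

```python
import base64
import mimetypes


def to_data_url(filename, data):
    """Build a base64 data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Four bytes stand in for real image content
print(to_data_url("scan.png", b"\x89PNG"))
```

The returned string can be used as the `image_url` value of the document payload.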
Final Thoughts
Mistral OCR significantly simplifies the extraction of text and structural data from diverse document types. Its ability to return markdown-formatted output makes it an excellent tool for automated document analysis, whether you’re processing invoices, summarizing academic papers, handling bulk document uploads, or working with images.
Until the next one,
Cohorte Team
March 13, 2025