Multimodal RAG for Comprehensive PDF Document Processing

Generative AI has enabled powerful PDF interaction, but most applications only process text content. This article explores how to make AI systems analyze both text and graphics in PDFs, enhancing document comprehension capabilities.

The ability to efficiently extract and interpret information from PDF documents is crucial in the ever-expanding digital landscape. The Multimodal RAG (Retrieval-Augmented Generation) system integrates advanced Optical Character Recognition (OCR), Natural Language Processing (NLP), and machine learning technologies to offer a sophisticated platform for processing PDF documents. This blog post delves into the core functionalities and setup process of Multimodal RAG, with a special focus on its image extraction and conversion capabilities.

Introduction to Multimodal RAG

Multimodal RAG is designed to make both text and image data in PDF documents fully accessible and analyzable. It employs OCR to digitize text within images, NLP to understand and manipulate this text, and machine learning models to enhance data retrieval and analysis, ensuring a thorough understanding of document contents.

Core Features of Multimodal RAG

The platform's key features include:

  • PDF and Image Processing: Automatic extraction of text and images from PDF files.
  • Adaptive Image-to-Text Conversion: Options to use either OpenAI’s API or the LLaVA model for converting images to descriptive text.
  • Advanced Text and Image Indexing: Utilizes the Chroma vector database for efficient data indexing and retrieval.
  • Interactive Query Interface: A Streamlit-based interface facilitates easy exploration and querying of content.

Setup and Configuration

Requirements for the project

To get started with Multimodal RAG, the following packages need to be installed if you choose to use OpenAI for image-to-text conversion:

langchain==0.1.14
python-dotenv==1.0.1
unstructured[local-inference]==0.5.6
chromadb==0.4.24 # Vector storage
openai==1.16.2 # For embeddings
tiktoken==0.6.0  # For embeddings
streamlit==1.33.0
pypdf==4.1.0

If you wish to use LLaVA, please see the additional requirements below (these include requirements for GPU utilization as well):

transformers==4.37.2
bitsandbytes==0.41.3
accelerate==0.25.0

API Key

In the .env file, add OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
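
The key can then be loaded into the application with python-dotenv, which is already listed in the requirements. A minimal example:

import os
from dotenv import load_dotenv

load_dotenv()                          # reads the .env file from the working directory
api_key = os.getenv("OPENAI_API_KEY")  # ready to pass to the OpenAI calls below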

Process Documents and Populate the Database

The text extracted from the PDFs is split into chunks using:

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,        # characters per chunk
        chunk_overlap=100,     # overlap preserves context across chunk boundaries
        length_function=len,
        add_start_index=True,  # record each chunk's offset within its source document
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Preview one chunk to sanity-check the splitting.
    if len(chunks) > 10:
        document = chunks[10]
        print(document.page_content)
        print(document.metadata)

    return chunks
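
To finish populating the database, the resulting chunks are embedded and written to the Chroma vector store. The snippet below is a minimal sketch of that step, assuming OpenAI embeddings and a local persist directory named "chroma" (both placeholder choices here, not necessarily what the full app uses):

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def save_to_chroma(chunks, persist_directory="chroma"):
    # Embed every chunk with OpenAI embeddings and store it in a local Chroma collection.
    db = Chroma.from_documents(
        chunks,
        OpenAIEmbeddings(),
        persist_directory=persist_directory,
    )
    db.persist()  # write the index to disk so the query app can reuse it
    print(f"Saved {len(chunks)} chunks to {persist_directory}.")
    return db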

Extracting Images from PDFs and Converting Them to Text

One of Multimodal RAG's standout features is its ability to handle images embedded within PDF documents. Here's a breakdown of how images are extracted from PDFs and converted into text.

a) Image Extraction from PDFs

The process begins with the extraction of images from each page of the PDF. The code snippet below demonstrates how this is achieved using the pypdf library:

from pypdf import PdfReader

def extract_images(pdf_path):
    pdf_reader = PdfReader(pdf_path)
    images = []
    for page in pdf_reader.pages:
        # page.images yields ImageFile objects; .data holds the raw image bytes
        for image in page.images:
            images.append(image.data)
    return images

b) Converting Extracted Images to Text

After extracting the images, they are converted into text using either OpenAI's API or the LLaVA model. Below is a code snippet that showcases how the OpenAI API can be utilized for this purpose:

import base64
import requests

def encode_image_to_base64(image_data):
    return base64.b64encode(image_data).decode('utf-8')

def convert_image_to_text(image_data, api_key):
    base64_image = encode_image_to_base64(image_data)
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [{
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What’s in this image?"
            }, {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}"
                }
            }]
        }],
        "max_tokens": 300
    }
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    return response.json()["choices"][0]["message"]["content"]
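
If you prefer to run the conversion locally with LLaVA instead, the extra requirements listed earlier (transformers, bitsandbytes, accelerate) come into play. The following is a rough sketch, assuming the llava-hf/llava-1.5-7b-hf checkpoint from Hugging Face and its USER/ASSISTANT prompt format; the actual app may load the model differently:

import io
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    load_in_4bit=True,   # quantize with bitsandbytes to fit on a single GPU
    device_map="auto",   # let accelerate place the weights
)

def convert_image_to_text_llava(image_data):
    # image_data is the raw bytes returned by extract_images()
    image = Image.open(io.BytesIO(image_data)).convert("RGB")
    prompt = "USER: <image>\nWhat's in this image? ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=300)
    # The decoded string repeats the prompt; the answer follows the "ASSISTANT:" marker.
    return processor.decode(output[0], skip_special_tokens=True)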

These generated texts are then chunked and indexed exactly like the text detected directly in the PDF, just as in a classical RAG pipeline.
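
For illustration, the glue between the two steps could look like the sketch below, which reuses extract_images and convert_image_to_text from above together with split_text and the save_to_chroma helper sketched earlier (the metadata fields are our own choice):

from langchain.schema import Document

def index_images(pdf_path, api_key):
    # Describe every embedded image with the vision model...
    descriptions = [
        convert_image_to_text(image_data, api_key)
        for image_data in extract_images(pdf_path)
    ]
    # ...then wrap each description in a Document so it flows through the
    # same splitting and indexing steps as the regular PDF text.
    docs = [
        Document(page_content=text, metadata={"source": pdf_path, "type": "image"})
        for text in descriptions
    ]
    chunks = split_text(docs)      # same splitter as for the PDF text
    return save_to_chroma(chunks)  # same Chroma index as for the PDF text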

Start the Streamlit application

If you are using the Dockerized solution, don’t forget to add -p 8501:8501, since Streamlit listens on port 8501 by default.

streamlit run app.py

  • After sending your query, the app will search through the embeddings created from the texts and images in the PDF documents.
  • You will then receive a response along with the top 3 sources from which the information was extracted (a stripped-down sketch of such an app follows below).
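
The snippet below is a stripped-down sketch of what such an app.py could look like; it is not the full implementation from the repository, and it assumes the Chroma index built with the save_to_chroma sketch above:

import streamlit as st
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

st.title("Multimodal RAG")
query = st.text_input("Ask a question about your PDFs")

if query:
    # Reopen the persisted index (text chunks and image descriptions alike).
    db = Chroma(persist_directory="chroma", embedding_function=OpenAIEmbeddings())
    results = db.similarity_search(query, k=3)  # top 3 most relevant chunks
    context = "\n\n---\n\n".join(doc.page_content for doc in results)

    llm = ChatOpenAI()
    answer = llm.invoke(
        f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    )
    st.write(answer.content)

    st.subheader("Top 3 sources")
    for doc in results:
        st.write(doc.metadata)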

Demo Snippet and Video

As demonstrated in the video below, we addressed a question where the required information was contained within a graphic rather than text. The RAG system successfully provided the correct answer, complete with relevant sources!

Check out our Github page for more code snippets and a complete app implementation. Share comments and thoughts with us at hello@cohorte.co.

Conclusion

Multimodal RAG is a powerful tool for enhancing document processing workflows, with advanced capabilities such as extracting images from PDF documents and converting them to text. By following this guide, users can leverage OCR, NLP, and machine learning to extract and analyze data more efficiently and effectively.