RAG testing and diagnosis using Giskard

Building smarter AI means tackling the complexities of evaluating Retrieval-Augmented Generation (RAG) systems. Giskard’s RAG Evaluation Toolkit (RAGET) automates the process, identifying weaknesses in key components like retrievers and generators. With tailored diagnostics, it simplifies fine-tuning while enhancing performance and reliability. This post shows you how to streamline RAG evaluation and unlock better AI.

Retrieval-Augmented Generation (RAG) enhances large language models by integrating external knowledge, making their responses more accurate and contextually relevant. But evaluating and fine-tuning these systems is a challenge. Each component—generator, retriever, rewriter, router, and knowledge base—needs to function optimally, often requiring significant manual effort to assess and improve performance. Giskard’s RAG Evaluation Toolkit (RAGET) solves this problem. It automates the creation of evaluation datasets, generates diverse question types tailored to test each component, and systematically identifies weaknesses. Developers can pinpoint and address issues efficiently, streamlining the evaluation process and reducing manual workload. The result? Smarter, more reliable RAG systems capable of delivering accurate and context-aware answers. In this blog post, we’ll dive into how RAGET works, how it simplifies diagnostics, and how it helps build better AI systems. By the end, you’ll have a clear roadmap for improving your RAG-based applications.

Introduction to Giskard

Giskard is a powerful toolkit for building conversational AI agents that leverage recent advancements in natural language processing (NLP).

One of Giskard's key features is the RAG Evaluation Toolkit (RAGET), which automates the generation of evaluation datasets and assessment of RAG agents' performance. RAGET systematically identifies weaknesses within specific components—such as the generator, retriever, rewriter, router, and knowledge base—by generating diverse question types tailored to test each part. This targeted evaluation enables developers to pinpoint and address deficiencies efficiently.

Here's a step-by-step guide to implement this (full code link provided at the end of the post).

Importing Necessary Libraries

Before we dive into the code, let's ensure we have all the necessary libraries installed. We'll be using Streamlit for building the user interface, PyPDF2 for PDF processing, LangChain for text manipulation, and Giskard for RAG diagnostics.

import streamlit as st
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from dotenv import load_dotenv
import PyPDF2
import tempfile
import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset
from giskard.rag import evaluate, RAGReport
from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_context_precision
from giskard.llm.client.openai import OpenAIClient
import giskard

load_dotenv()

These libraries provide the foundation for our chatbot application, enabling us to handle PDF files, process text, and integrate Giskard for RAG evaluation.

Setting Up Language Model Client

Giskard relies on a language model to generate responses. In this case, we'll use OpenAI's GPT-3.5 model. We initialize the client and set it as the default for Giskard.

giskard.llm.set_llm_api("openai")
oc = OpenAIClient(model="gpt-3.5-turbo")
giskard.llm.set_default_client(oc)

This configuration ensures that Giskard interacts with the appropriate language model to produce responses.

Extracting Text from PDF

Our chatbot will operate on PDF documents. To enable this functionality, we define a function to extract text from PDF files.

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
    return text

This function enables us to convert PDF documents into text, making them accessible for our chatbot.

Creating a Question Answering Chain

To build our chatbot, we need to create a question-answering chain. This involves processing the uploaded PDF document, generating embeddings, and setting up a retrieval-based QA system.

def create_qa_chain(uploaded_file_path):
    if uploaded_file is not None:
        documents = [extract_text_from_pdf(uploaded_file_path)]
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
        texts = text_splitter.create_documents(documents)
        embeddings = OpenAIEmbeddings()
        db = Chroma.from_documents(texts, embeddings)
        retriever = db.as_retriever()
        qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0), chain_type='stuff', retriever=retriever)
        return qa, texts

This function sets up the components required for our question-answering chain, including text preprocessing, embedding creation, and retrieval-based QA model initialization.

Generating Response

Finally, we define a function to generate responses to user queries using our QA chain.

def generate_response(query_text):
    return qa.run(query_text)

This function leverages our QA chain to generate responses based on user queries.

Streamlit App: Integrating Giskard for RAG Diagnostics

Let's delve into the Streamlit application code and understand how it seamlessly integrates Giskard for RAG diagnostics, enabling interactive document analysis.

1. File Upload and Processing

# File upload and processing
uploaded_file = st.file_uploader('Upload an article')
if uploaded_file is not None:
    # Load and process the PDF
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(uploaded_file.getvalue())
        temp_file_path = temp_file.name
    # Define input and output paths
    filename = temp_file_path
    qa, texts = create_qa_chain(filename)

This section handles file upload, processing, and sets up the question-answering chain for the uploaded PDF document.

2. Query Input and Submission

# Query input and submission
query_text = st.text_input('Enter your question:', placeholder='Please provide a short summary.', disabled=not uploaded_file)

Here, users can input their queries, which will be used to generate responses from the document.

3. Knowledge Base Creation

# Knowledge base creation
if st.button("Start with KB"):
    knowledge_base_df = pd.DataFrame([t for t in texts[0].page_content.split("\\n")], columns=["text"])
    knowledge_base = KnowledgeBase(knowledge_base_df)
    st.dataframe(knowledge_base_df, use_container_width=True)

This button allows users to create a knowledge base from the text extracted from the PDF document.

4. Test Set Generation and Evaluation with Giskard

# Test set generation and evaluation with Giskard
if st.button("Generate Samples"):
    num_question = st.text_input("Please enter the number of samples you wish to create for the testset (it would take a long time to generate):", 10)

    with st.spinner('Creating Testset and then generating report...'):
        testset = generate_testset(knowledge_base,
                                   num_questions=int(num_question),
                                   agent_description="A chatbot answering questions about the document")
        st.write("Testset Samples:")
        st.dataframe(testset.to_pandas(), use_container_width=True)

        report = evaluate(generate_response,
                          testset=testset,
                          knowledge_base=knowledge_base,
                          metrics=[ragas_context_recall, ragas_context_precision])

        txt_path = report.to_html("file.html")
        report_html = open("file.html").read()
        st.components.v1.html(report_html, scrolling=True, height=500)

This section generates a test set based on the knowledge base and evaluates the chatbot's performance using Giskard, displaying the results in an HTML report.

5. Displaying Evaluation Report

# Displaying evaluation report
if st.button("Show Evaluation Report"):
    st.write("Report Analysis metrics:")
    st.write("Report correctness by topic metrics:")
    st.write(report.correctness_by_topic())

    st.write("Report correctness by question type metrics:")
    st.write(report.correctness_by_question_type())

    st.write("Report failures:")
    st.write(report.failures)

    st.write("Report failures by topic:")
    st.write(report.get_failures(topic="Topic from your knowledge base", question_type="simple"))

    st.write("Evaluation Report of RAG:")
    st.dataframe(report.to_pandas(), use_container_width=True)

This button allows users to view detailed evaluation metrics, including correctness by topic, correctness by question type, and any reported failures.

Conclusion

Integrating Giskard with Streamlit transforms document analysis and evaluation into an interactive, real-time experience. Users can explore documents, ask questions, and assess model performance effortlessly. Whether for research, education, or business, this approach streamlines document understanding with precision and efficiency.

Explore the full code on our GitHub.

Cohorte Team

December 19, 2024