
Evaluating RAG Systems in 2025: RAGAS Deep Dive, Giskard Showdown, and the Future of Context

RAG is everywhere, but evaluating it is still messy. This post dives into RAGAS and Giskard—two open-source frameworks helping teams measure trust, faithfulness, and performance in RAG pipelines. We compare their strengths, show how they work with real code, and explore what happens when context windows make RAG optional. For developers and AI leaders who need more than just vibes to trust their LLMs.

Tega Adeyemi

Large Language Models are amazing at generating text – sometimes too amazing, as they’ll gladly fill in knowledge gaps with creative fiction. Retrieval-Augmented Generation (RAG) architectures emerged as a clever fix for this tendency. In a RAG system, the model isn’t left to wing it on its own; instead, it’s given a stack of relevant documents (retrieved from an external knowledge base) as reference before answering a query. This way, an LLM acts as a natural language layer on top of a database, reducing the risk of hallucination by grounding its answers in actual data (arxiv.org). RAG has quickly become the go-to method for enterprise QA systems and chatbots, from customer support assistants to internal knowledge base search, by combining vector databases with LLMs.

In short, building a RAG proof-of-concept is easy – just hook up an LLM with a vector store – but making it production-ready and evaluating its performance is another story (medium.com). A RAG pipeline has moving parts (a retriever and a generator), so you need to assess: Did the retriever fetch good info? Did the LLM use that info correctly? And is the final answer actually helpful? This is where specialized evaluation frameworks come into play. Enter RAGAS, the Retrieval-Augmented Generation Assessment framework, and friends. In this post, we’ll deep dive into RAGAS: what it is, how it evaluates RAG systems (with code examples), and how it stacks up against other open-source evaluation frameworks like Giskard. We’ll also ponder the future – if LLMs get way bigger context windows, will we still need retrieval (and RAGAS)? Let’s find out.

RAGAS: What Is It and How Does It Evaluate RAG?

RAGAS (short for Retrieval-Augmented Generation Assessment) is an open-source framework (Apache-2.0 licensed, with a thriving GitHub repo) for evaluating RAG pipelines. It was introduced by Shahul Es et al. in late 2023 as a “reference-free” evaluation toolkit for RAG. Reference-free means you don’t necessarily need a human-written ground-truth answer for each question. Instead, RAGAS uses language models under the hood to judge the quality of responses. In practice, you provide RAGAS with the pieces of your RAG interaction: the user’s question, the retrieved context documents, the LLM’s answer, and (optionally) a ground-truth answer if you have one. With that, RAGAS scores performance along several dimensions, using LLMs to evaluate things like factuality and relevance automatically.

What makes RAGAS especially handy is that it evaluates a RAG system on a component level, aligning with the two main parts of the pipeline: the retrieval part and the generation part. When your RAG-powered assistant answers, you want to know: (1) Did the retriever get all the relevant info (no important document missed)? and (2) Did the generator (LLM) stick to that info and produce a correct, relevant answer? RAGAS tackles these with a suite of metrics for different aspects of the answer and context. Let’s look at a few key metrics RAGAS provides (with self-explanatory names, for the most part):

Metrics That Matter: Faithfulness, Context Recall, Answer Relevance, & More

Faithfulness (is the answer supported by the retrieved context?), Context Recall (did the retriever fetch the information needed to answer?), and Answer Relevance (does the answer actually address the question?) are the big three, but RAGAS actually offers a broader menu – e.g. Context Precision (did the retriever also pull in a bunch of irrelevant text?), Factual Correctness (overall correctness of the answer, similar to faithfulness but possibly comparing to a known ground truth), and even specialized metrics for multi-hop QA or SQL query answers. The framework is extensible: you can plug in other evaluation criteria or custom metrics if your use case demands it. But for most RAG QA systems, faithfulness, recall, and relevance (plus maybe a basic accuracy check) form a solid trio to evaluate how well the system is doing. In fact, RAGAS’s own documentation refers to these as “core metrics” that can be combined into an overall RAG score.
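To make the scores less abstract: faithfulness is reported as the fraction of claims in the answer that the retrieved context supports, and context recall as the fraction of reference-answer claims that can be attributed to the retrieved context. The sketch below shows only that arithmetic. The per-claim verdicts are hypothetical stand-ins for what an LLM judge would return, and ratio_score is an illustrative name, not part of the ragas API.

```python
from typing import Sequence


def ratio_score(verdicts: Sequence[bool]) -> float:
    """Fraction of True verdicts; an empty list scores 0.0 by convention here."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts: one boolean per claim extracted from the answer.
# True means the claim is supported by the retrieved context.
answer_claims_supported = [True, True, True, True, True, True, False]
faithfulness = ratio_score(answer_claims_supported)

# Hypothetical verdicts: one boolean per claim in the reference answer.
# True means the claim can be found in the retrieved context.
reference_claims_found = [True, True, True]
context_recall = ratio_score(reference_claims_found)

print(round(faithfulness, 4), context_recall)  # prints: 0.8571 1.0
```

One unsupported claim out of seven is enough to pull faithfulness down to about 0.86, which is why these scores are best read alongside the individual failing samples.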

Using RAGAS: Evaluating a RAG Pipeline (with Code)

Let’s get our hands dirty with a simple example. Suppose you have built a RAG system (perhaps using LangChain or your own retrieval+LLM setup) that answers user questions. Now you want to systematically evaluate it. RAGAS makes this straightforward by treating each Q&A interaction as an evaluation sample. Here’s how you might use it:

First, install RAGAS via pip (it’s a Python library):

pip install ragas

Now, assume we have collected some evaluation data for our system – say a list of dictionaries, each with a user_input (the question), the system’s response (answer), the retrieved_contexts (list of docs or passages the system retrieved for that question), and optionally a reference answer (ground-truth answer, if we have one for evaluation). We can load this into a RAGAS EvaluationDataset and run a suite of metrics:
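For illustration, here is roughly what such a list could look like, together with a small sanity check to run before evaluation. The questions, answers, and the check_sample helper are made up for this sketch; only the key names (user_input, response, retrieved_contexts, reference) come from the description above.

```python
REQUIRED_KEYS = {"user_input", "response", "retrieved_contexts"}


def check_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one evaluation record (empty if none)."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(sample))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    contexts = sample.get("retrieved_contexts")
    if contexts is not None:
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append("retrieved_contexts must be a list of strings")
        elif not contexts:
            problems.append("retrieved_contexts is empty")
    return problems


# Hypothetical evaluation records for a support chatbot.
dataset = [
    {
        "user_input": "What is the refund window?",
        "response": "Refunds are accepted within 30 days of purchase.",
        "retrieved_contexts": ["Our policy allows refunds within 30 days of purchase."],
        "reference": "Customers can request a refund within 30 days.",  # optional ground truth
    },
    {
        "user_input": "Do you ship internationally?",
        "response": "Yes, to most countries.",
        "retrieved_contexts": [],  # the retriever came back empty for this question
    },
]

for i, sample in enumerate(dataset):
    print(i, check_sample(sample))
```

Records without a reference can still be scored on metrics that only compare the answer with the retrieved context, while metrics that compare against ground truth need the reference field.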

from langchain_openai import ChatOpenAI  # assumes langchain-openai is installed and OPENAI_API_KEY is set
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness

# Suppose 'dataset' is a list of dicts with our evaluation Q&A data
evaluation_dataset = EvaluationDataset.from_list(dataset)

# Wrap an LLM to use as the evaluator (could be GPT-4 or a smaller model)
llm = ChatOpenAI(model="gpt-4")  # any LangChain chat model can be substituted here
evaluator_llm = LangchainLLMWrapper(llm)

# Define which metrics we want to compute
metrics = [LLMContextRecall(), Faithfulness(), FactualCorrectness()]

# Run evaluation
result = evaluate(dataset=evaluation_dataset, metrics=metrics, llm=evaluator_llm)
print(result)

When you run this, RAGAS will use the evaluator_llm (which could be GPT-4 or an open-source LLM) to perform the heavy lifting for metrics that require natural language judgment (like checking faithfulness). It returns a result object that prints as a dictionary of metric scores. For example, you might get output like:

{'context_recall': 1.0, 'faithfulness': 0.8571, 'factual_correctness': 0.7280}

This indicates our retriever got everything (100% recall), but the LLM’s answers were only ~85.7% consistent with the documents and ~72.8% factually correct on average. Such granular evaluation helps pinpoint weaknesses: maybe the model is mildly hallucinating or misinterpreting some context (faithfulness < 1), and there are some factual errors to address.
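One way to act on scores like these is to compare them with minimum thresholds, for example as a release gate. A minimal sketch, assuming the scores are available as a plain dict; the threshold values and the failing_metrics helper are illustrative assumptions, not RAGAS defaults.

```python
def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, float]:
    """Return the metrics whose score is below the required minimum.

    A metric that has a threshold but no score is treated as 0.0, so it fails.
    """
    return {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }


# Scores copied from the example output above.
scores = {"context_recall": 1.0, "faithfulness": 0.8571, "factual_correctness": 0.7280}

# Hypothetical minimums a team might agree on for its own pipeline.
thresholds = {"context_recall": 0.9, "faithfulness": 0.9, "factual_correctness": 0.7}

failing = failing_metrics(scores, thresholds)
print(failing)  # prints: {'faithfulness': 0.8571}
```

With these numbers only faithfulness falls short, which points at the generation step rather than the retriever.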

Under the hood, each metric is implemented either by prompting an LLM to judge something or by using a deterministic comparison. For instance, Faithfulness can be computed by a special classifier model (like Vectara’s open-source T5-based model for hallucination detection) or by an LLM prompting method. RAGAS allows you to choose – e.g., Faithfulness(llm=evaluator_llm) uses a prompting approach, while FaithfulnesswithHHEM() uses that T5 model for a potentially faster check. Similarly, ContextRecall can be LLM-based (checking each claim of a reference answer against retrieved docs) or non-LLM (string overlap between retrieved vs expected docs). As a developer, you have the flexibility to swap these out or even write your own metric if needed, but RAGAS’s built-ins cover most needs.
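To give a feel for the non-LLM route, here is a rough sketch of context recall computed from string similarity between expected and retrieved passages. It is not RAGAS’s implementation: it uses Python’s standard difflib, and the 0.9 similarity threshold, the sample passages, and the function name are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def string_overlap_recall(retrieved: list[str], expected: list[str], threshold: float = 0.9) -> float:
    """Fraction of expected passages that closely match at least one retrieved passage."""
    if not expected:
        return 0.0
    found = 0
    for expected_doc in expected:
        # Best similarity between this expected passage and any retrieved passage.
        best = max(
            (SequenceMatcher(None, expected_doc.lower(), doc.lower()).ratio() for doc in retrieved),
            default=0.0,
        )
        if best >= threshold:
            found += 1
    return found / len(expected)


# Hypothetical passages: the retriever found one of the two expected documents.
expected = [
    "Paris is the capital of France.",
    "Refund requests need an order number.",
]
retrieved = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
]

print(string_overlap_recall(retrieved, expected))  # prints: 0.5
```

A deterministic check like this is cheap and repeatable, but it only works when you know which passages should have been retrieved; the LLM-based variant trades that requirement for extra cost and some judge variance.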

In summary, RAGAS provides a neat toolkit to quantitatively evaluate a RAG system’s quality along multiple axes without requiring thousands of hand-labeled examples. It leverages LLMs to judge correctness, so you can catch hallucinations and irrelevance automatically. Now that we’ve seen what RAGAS is about, it’s time to compare it with another rising star in LLM evaluation: Giskard. (No, not the robot from Asimov’s novels – but like that Giskard, it’s very much focused on making sure AI follows the rules and doesn’t harm humans!).

RAGAS vs Giskard: Evaluating Trust and Explainability in RAG Systems

Both RAGAS and Giskard are open-source tools aiming to help us trust and verify our AI systems, but they come at the problem from different angles. RAGAS, as we saw, zeroes in on performance metrics for retrieval-augmented Q&A – things like accuracy, relevance, and factuality. Giskard, on the other hand, is more of an umbrella testing framework for AI models in general, with strong support for LLM applications (including RAG pipelines). Let’s break down the comparison in a few key areas:

To sum up the comparison in a fun analogy: RAGAS is like a meticulous grader for your open-book QA test (checking if the student used the book correctly and answered the question), while Giskard is like the strict proctor and school counselor combined – catching cheaters (prompt injection), flagging inappropriate behavior, and telling you which subject the student is weak in. 📝🔍

Other Noteworthy Open-Source Evaluation Tools

The ecosystem for LLM evaluation is expanding, and RAGAS and Giskard aren’t the only games in town. Depending on your specific requirements, you might also consider:

The list goes on – new tools and frameworks are popping up (LangSmith by LangChain for example, or NeuralTrust’s evaluators). The good news is the community recognizes that evaluation is critical for LLM applications, and we’re seeing rapid innovation to make it easier. Whether you prioritize deep metrics (RAGAS), broad safety checks (Giskard), or integration into CI (DeepEval) or MLOps platforms, there’s likely an open-source project out there to help. Don’t be afraid to mix and match these tools; many are complementary. For instance, you might use LlamaIndex’s built-in eval during development, RAGAS for a detailed analysis before a release, and Giskard in a deployment pipeline for continuous monitoring.

Now, let’s address the elephant (or rather, the giant context window) in the room: as LLMs get bigger memories, will RAG as a technique become obsolete? And if so, what happens to all these RAG evaluation frameworks?

Future Outlook: Do Giant Context Windows Threaten RAG (and Its Evaluation)?

It’s 2025, and we already have models with 100k-token context windows. The trend suggests we might eventually have language models that can ingest entire manuals or even databases in one go. One might ask: if an LLM can literally read all relevant content in its prompt, do we even need retrieval and RAG techniques? Could we just stuff the knowledge base into the prompt (or into the model’s parameters) and call it a day? 🤔

The honest answer: RAG is likely here to stay for the foreseeable future, though its role may evolve. Here’s why large context alone isn’t a slam dunk replacement and how that affects RAG evaluation:

On the flip side, larger context windows will likely reduce some pain points. For instance, today a bad retrieval can completely fail an answer (if the doc you need wasn’t retrieved, you’re out of luck). In a future scenario where most relevant info is somewhere in the giant context, the problem shifts to search within context (which the model does implicitly). We might rely a bit less on explicit retrieval algorithms and more on the model’s ability to find needles in the haystack it’s given. But we’d then evaluate the model’s answer for whether it indeed found the right needle. The goalposts move, but the game stays the same: verifying that the model’s output is grounded in truth.

In practical terms, as context windows grow, some metrics might get easier to max out – e.g., context recall could often be 1.0 because we just dump in all possibly relevant docs. That would make recall a less interesting number (everyone gets an A because the test is open-everything). The focus would shift to metrics like faithfulness and relevance even more. Hallucination detection (faithfulness) becomes paramount: if the model has all the info and still hallucinated, shame on it – and we need to catch that. So RAGAS or its successors might double down on improved faithfulness checks, maybe multi-step reasoning verification. Similarly, if the model has tons of context, we’d want to ensure it didn’t copy irrelevant bits verbatim or go off on tangents – a blend of relevance and maybe a new metric for “focus” could emerge.

From a strategic viewpoint, AI leaders should keep an eye on this: you may not always need a vector database for retrieval if your model can swallow a library, but you will always need evaluation. Whether it’s RAG-specific or general LLM QA evaluation, the goal is to maintain trust and quality in the answers your systems produce. The tools we discussed are already adapting to cover a broad range of evaluation scenarios (notice Giskard’s branding is not just RAG, but all AI model testing; RAGAS too could be seen as part of a bigger eval toolkit). So, larger contexts will change how we use these tools, but not eliminate the need for them. You’ll possibly configure them differently, but you’ll still run your model through an eval gauntlet to ensure it’s behaving.

Conclusion

In the fast-paced world of AI, Retrieval-Augmented Generation has proven to be a practical way to boost LLM performance by grounding it in real data. But with great power (an LLM that cites sources!) comes great responsibility – we must rigorously evaluate these systems to ensure they’re actually doing what we expect: finding the right info and using it correctly to answer questions. We explored RAGAS, a purpose-built framework that gives developers a leg up in this evaluation process by providing metrics for everything from retrieval quality (did we get the info?) to generation faithfulness (did we use it correctly?). With RAGAS, one can quantitatively track improvements and regressions in a RAG pipeline, getting numeric scores for things that used to require laborious manual QA (docs.ragas.io).

We also looked at Giskard, which broadens the scope to testing not just for accuracy but for trustworthiness – catching those nasty edge cases and AI behaviors that keep CEOs up at night. Giskard and similar frameworks add a layer of assurance by scanning for biases, hallucinations, security vulnerabilities, and more. For a developer, these tools can feel like having a safety net; for an AI leader, they provide insight and confidence that the system won’t turn into a loose cannon in production.

Crucially, we’ve realized that as models evolve (hello, 100k context windows!), evaluation frameworks will also evolve. If tomorrow’s model doesn’t need to “retrieve” because it can pay attention to your entire database at once, we’ll still want to measure if its answers are correct and grounded. The names of the metrics might change, but the mission remains the same: keep the AI honest and helpful. In fact, one could argue that evaluation will only grow in importance – when failures are rare but potentially very costly, catching the one in a million mistake is like finding a needle in a haystack. Automated eval tools will be our magnet to find that needle.

On a lighter note, think of your RAG system as a star employee (albeit an electronic one) – you gave it access to the company wiki so it can do its job better. RAGAS is like HR checking its references and work outputs for honesty and relevance, and Giskard is IT/security making sure it’s not breaking any rules or spilling secrets. Together, they help you trust this new “employee” with bigger responsibilities. And as this employee gets smarter (larger context, more training), you won’t stop evaluating them – you’ll just have better tools to do so, ensuring they continue to perform and behave.

TL;DR: RAG became popular to keep LLMs factual by letting them fetch documents; RAGAS arose to score how well that works (no more relying purely on “vibe checking” your model’s answers). RAGAS gives you metrics like how much of the needed info was retrieved, how faithful the answer was to that info, and how relevant the answer was to the question. It uses LLMs behind the scenes to automate a lot of this, fitting into fast development cycles (arxiv.org). Giskard, in comparison, is like the multi-tool for AI evaluation – it’s not limited to RAG and adds trust metrics (catching things like toxicity or data leaks), with a focus on explainability and customization for enterprise needs. Other tools like DeepEval, LlamaIndex’s evaluators, and more are also in the mix, each with their strengths. As LLMs get bigger brains (contexts), we may lean on retrieval a bit less, but we’ll always need to verify that our giant-brained models are using their brains correctly. So whether you stick to RAG or venture into hypertrophic context land, keep those evaluation checklists handy! In the end, a model that isn’t evaluated is a model you’re flying blind with – and nobody wants their AI project to be a random leap of faith. Happy evaluating, and may your RAG systems always faithfully answer with just the facts (and maybe a dash of humor when appropriate).


Tega Adeyemi, May 9, 2025