Evaluating RAG Systems in 2025: RAGAS Deep Dive, Giskard Showdown, and the Future of Context

Large Language Models are amazing at generating text – sometimes too amazing, as they’ll gladly fill in knowledge gaps with creative fiction. Retrieval-Augmented Generation (RAG) architectures emerged as a clever fix for this tendency. In a RAG system, the model isn’t left to wing it on its own; instead, it’s given a stack of relevant documents (retrieved from an external knowledge base) as reference before answering a query. This way, an LLM acts as a natural language layer on top of a database, reducing the risk of hallucination by grounding its answers in actual data (arxiv.org). RAG has quickly become the go-to method for enterprise QA systems and chatbots, from customer support assistants to internal knowledge base search, by combining vector databases with LLMs.

In short, building a RAG proof-of-concept is easy – just hook up an LLM with a vector store – but making it production-ready and evaluating its performance is another story (medium.com). A RAG pipeline has moving parts (a retriever and a generator), so you need to assess: Did the retriever fetch good info? Did the LLM use that info correctly? And is the final answer actually helpful? This is where specialized evaluation frameworks come into play. Enter RAGAS, the Retrieval-Augmented Generation Assessment framework, and friends. In this post, we’ll deep dive into RAGAS: what it is, how it evaluates RAG systems (with code examples), and how it stacks up against other open-source evaluation frameworks like Giskard. We’ll also ponder the future – if LLMs get way bigger context windows, will we still need retrieval (and RAGAS)? Let’s find out.

RAGAS: What Is It and How Does It Evaluate RAG?

RAGAS (short for Retrieval-Augmented Generation Assessment) is an open-source framework (Apache-2.0 licensed, with a thriving GitHub repo) for evaluating RAG pipelines. It was introduced by Shahul Es et al. in late 2023 as a “reference-free” evaluation toolkit for RAG. Reference-free means you don’t necessarily need a human-written ground-truth answer for each question. Instead, RAGAS leverages language models under the hood to judge the quality of responses. In practice, you provide RAGAS with the pieces of your RAG interaction – the user’s question, the retrieved context documents, the LLM’s answer, and (optionally) a ground-truth answer if you have one. With that, RAGAS can score the performance along several dimensions, using LLMs to evaluate things like factuality and relevance automatically.

What makes RAGAS especially handy is that it evaluates a RAG system on a component level, aligning with the two main parts of the pipeline: the retrieval part and the generation part. When your RAG-powered assistant answers, you want to know: (1) Did the retriever get all the relevant info (no important document missed)? and (2) Did the generator (LLM) stick to that info and produce a correct, relevant answer? RAGAS tackles these with a suite of metrics for different aspects of the answer and context. Let’s look at a few key metrics RAGAS provides (with self-explanatory names, for the most part):

Metrics That Matter: Faithfulness, Context Recall, Answer Relevance, & More

Faithfulness – Is the answer faithful to the retrieved documents? This metric checks if the model’s claims are supported by the provided context. It’s essentially a measure of factual consistency: the fraction of statements in the answer that can be confirmed by the retrieved docs. A perfectly faithful answer (score = 1.0) means every claim the LLM made is backed up by the supplied text; a low score means the LLM is hallucinating or straying from the source. For example, if the context says Einstein was born on 14 March 1879 and the answer says 20 March 1879, that answer would get dinged for faithfulness (only some claims match).
Context Recall – Did our retriever find enough of the relevant info? Context recall measures how many of the ground-truth relevant documents were retrieved. Higher recall means the retriever didn’t miss much. In short, it’s about not leaving out important material. This requires knowing what the truly relevant docs were – so this is the one core RAGAS metric that does need a reference answer or reference documents to compare against. RAGAS can compute recall in an LLM-based way (by comparing against a reference answer’s content) or via non-LLM string matching, but either way it gives a score from 0 to 1 indicating coverage. If your context_recall is low, your retrieval step might be missing key info (time to tweak that vector store or add data!).
Answer Relevance – Is the answer actually answering the user’s question? Also called Response Relevancy, this metric checks alignment with the user query. An answer that directly addresses the question (and doesn’t wander or leave things out) scores high; if the answer is off-topic, incomplete, or contains extra fluff, it scores lower. Under the hood, RAGAS computes this by a neat trick: it uses an LLM to generate a few hypothetical questions based on the answer, then measures how similar those are to the original question (via embeddings). The intuition is that if the answer is truly on-point, questions derived from it will resemble the original question. (Yes, RAGAS basically asks, “if this answer is the solution, what might the question have been?” Pretty cool!) The relevance score typically falls between 0 and 1 (higher is better), though since it’s based on cosine similarity, it’s not strictly bounded to that range. In practice, you can read it as a rough indicator of how closely the answer matches the query intent rather than as a literal percentage.
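To make the arithmetic behind these three metrics concrete, here is a minimal sketch. It assumes the hard part (an LLM judging each claim, an embedding model producing vectors) has already happened, so the inputs are plain booleans and toy vectors. The function names are made up for illustration and are not RAGAS APIs.

```python
import math
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of the answer's claims that the context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(reference_claim_found: Sequence[bool]) -> float:
    """Fraction of reference claims that can be found in the retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance_score(
    question_vec: Sequence[float],
    generated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between the original question and the
    questions an LLM generated back from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Hypothetical judgments: one of two answer claims is supported by the context.
print(faithfulness_score([True, False]))
# Hypothetical judgments: three of four reference claims appear in the context.
print(context_recall_score([True, True, False, True]))
# Toy 2-d "embeddings": one generated question matches, one is orthogonal.
print(answer_relevance_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

The point of the sketch is that each score is a simple ratio or average; the quality of the number depends almost entirely on how good the upstream judgments and embeddings are.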

Those are the big three metrics explicitly mentioned, but RAGAS actually offers a broader menu – e.g. Context Precision (did the retriever also pull in a bunch of irrelevant text?), Factual Correctness (overall correctness of the answer, similar to faithfulness but possibly comparing to a known ground truth), and even specialized metrics for multi-hop QA or SQL query answers. The framework is extensible: you can plug in other evaluation criteria or custom metrics if your use case demands it. But for most RAG QA systems, faithfulness, recall, and relevance (plus maybe a basic accuracy check) form a solid trio to evaluate how well the system is doing. In fact, RAGAS’s own documentation refers to these as “core metrics” that can be combined into an overall RAG score.
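Context Precision is worth a quick sketch because it is rank-aware. As I understand the RAGAS docs, it is an average-precision-style score: precision@k is accumulated at each rank that holds a relevant chunk, then divided by the number of relevant chunks. The version below assumes the per-chunk relevance verdicts are already available as booleans; the function name is illustrative, not the library's.

```python
from typing import Sequence


def context_precision(relevance_by_rank: Sequence[bool]) -> float:
    """Average-precision-style score over retrieved chunks, in rank order.

    relevance_by_rank[i] is True if the chunk at rank i+1 was judged relevant.
    """
    relevant_total = sum(relevance_by_rank)
    if relevant_total == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance_by_rank, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    return weighted_sum / relevant_total


# Same two relevant chunks, different ranking: relevant chunks ranked
# earlier give a higher score.
good_ranking = context_precision([True, True, False])
bad_ranking = context_precision([False, True, True])
print(good_ranking, bad_ranking)
```

A retriever that returns the right chunks but buries them under irrelevant ones is penalized here, which plain recall would not catch.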

Using RAGAS: Evaluating a RAG Pipeline (with Code)

Let’s get our hands dirty with a simple example. Suppose you have built a RAG system (perhaps using LangChain or your own retrieval+LLM setup) that answers user questions. Now you want to systematically evaluate it. RAGAS makes this straightforward by treating each Q&A interaction as an evaluation sample. Here’s how you might use it:

First, install RAGAS via pip (it’s a Python library):

pip install ragas

Now, assume we have collected some evaluation data for our system – say a list of dictionaries, each with a user_input (the question), the system’s response (answer), the retrieved_contexts (list of docs or passages the system retrieved for that question), and optionally a reference answer (ground-truth answer, if we have one for evaluation). We can load this into a RAGAS EvaluationDataset and run a suite of metrics:

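# Assumes the langchain-openai package is installed and OPENAI_API_KEY is set
from langchain_openai import ChatOpenAI
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness

# 'dataset' is a list of dicts with our evaluation Q&A data (one illustrative sample)
dataset = [
    {
        "user_input": "When was Albert Einstein born?",
        "retrieved_contexts": ["Albert Einstein was born on 14 March 1879."],
        "response": "Einstein was born on 14 March 1879.",
        "reference": "Albert Einstein was born on 14 March 1879.",
    },
]
evaluation_dataset = EvaluationDataset.from_list(dataset)

# Wrap an LLM to use as the evaluator (could be GPT-4 or a smaller model)
llm = ChatOpenAI(model="gpt-4")  # any LangChain chat model should work here
evaluator_llm = LangchainLLMWrapper(llm)

# Define which metrics we want to compute
metrics = [LLMContextRecall(), Faithfulness(), FactualCorrectness()]

# Run evaluation (this makes LLM calls, so it needs network access and credits)
result = evaluate(dataset=evaluation_dataset, metrics=metrics, llm=evaluator_llm)
print(result)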

When you run this, RAGAS will use the evaluator_llm (which could be GPT-4 or an open-source LLM) to perform the heavy lifting for metrics that require natural language judgment (like checking faithfulness). The result prints as a dictionary of metric scores. For example, you might get output like:

{'context_recall': 1.0, 'faithfulness': 0.8571, 'factual_correctness': 0.7280}

This indicates our retriever got everything (100% recall), but the LLM’s answers were only ~85.7% consistent with the documents and ~72.8% factually correct on average. Such granular evaluation helps pinpoint weaknesses: maybe the model is mildly hallucinating or misinterpreting some context (faithfulness < 1), and there are some factual errors to address.
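Averages only tell you that something is off; to pinpoint weaknesses you usually want to see which samples dragged the score down. Here is a small sketch that works on per-sample scores as plain dictionaries. It assumes you have exported one row per sample (for example from a DataFrame view of the result, if your RAGAS version provides one); the helper name, thresholds, and sample rows are all hypothetical.

```python
from typing import Dict, List


def flag_weak_samples(
    rows: List[Dict[str, object]],
    thresholds: Dict[str, float],
) -> List[Dict[str, object]]:
    """Return the samples whose score falls below the minimum for any metric."""
    flagged = []
    for index, row in enumerate(rows):
        failing = {
            metric: row[metric]
            for metric, minimum in thresholds.items()
            if metric in row and row[metric] < minimum
        }
        if failing:
            flagged.append(
                {
                    "index": index,
                    "user_input": row.get("user_input", ""),
                    "failing": failing,
                }
            )
    return flagged


# Hypothetical per-sample scores for two questions.
rows = [
    {"user_input": "q1", "faithfulness": 0.95, "factual_correctness": 0.90},
    {"user_input": "q2", "faithfulness": 0.50, "factual_correctness": 0.80},
]
weak = flag_weak_samples(rows, {"faithfulness": 0.8, "factual_correctness": 0.7})
for item in weak:
    print(item)
```

Reviewing the flagged questions by hand is often the fastest way to tell whether a low score reflects a real hallucination or a quirk of the evaluator.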

Under the hood, each metric is implemented either by prompting an LLM to judge something or by using a deterministic comparison. For instance, Faithfulness can be computed by a special classifier model (like Vectara’s open-source T5-based model for hallucination detection) or by an LLM prompting method. RAGAS allows you to choose – e.g., Faithfulness(llm=evaluator_llm) uses a prompting approach, while FaithfulnesswithHHEM() uses that T5 model for a potentially faster check. Similarly, ContextRecall can be LLM-based (checking each claim of a reference answer against retrieved docs) or non-LLM (string overlap between retrieved vs expected docs). As a developer, you have the flexibility to swap these out or even write your own metric if needed, but RAGAS’s built-ins cover most needs.
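For a feel of what a non-LLM recall check can look like, here is a rough sketch using only the standard library. It is not RAGAS’s implementation: it swaps in difflib’s similarity ratio as the string-overlap measure, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from typing import Sequence


def string_overlap_recall(
    retrieved: Sequence[str],
    reference: Sequence[str],
    threshold: float = 0.8,
) -> float:
    """Fraction of reference passages that closely match some retrieved passage."""
    if not reference:
        return 0.0
    found = 0
    for ref in reference:
        # Best similarity between this reference passage and any retrieved one.
        best = max(
            (SequenceMatcher(None, ref.lower(), doc.lower()).ratio() for doc in retrieved),
            default=0.0,
        )
        if best >= threshold:
            found += 1
    return found / len(reference)


reference_docs = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
]
retrieved_docs = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
]
print(string_overlap_recall(retrieved_docs, reference_docs))
```

A check like this is cheap and deterministic, but it only works when you know the expected passages and they appear nearly verbatim in what was retrieved.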

In summary, RAGAS provides a neat toolkit to quantitatively evaluate a RAG system’s quality along multiple axes without requiring thousands of hand-labeled examples. It leverages LLMs to judge correctness, so you can catch hallucinations and irrelevance automatically. Now that we’ve seen what RAGAS is about, it’s time to compare it with another rising star in LLM evaluation: Giskard. (No, not the robot from Asimov’s novels – but like that Giskard, it’s very much focused on making sure AI follows the rules and doesn’t harm humans!).

RAGAS vs Giskard: Evaluating Trust and Explainability in RAG Systems

Both RAGAS and Giskard are open-source tools aiming to help us trust and verify our AI systems, but they come at the problem from different angles. RAGAS, as we saw, zeroes in on performance metrics for retrieval-augmented Q&A – things like accuracy, relevance, and factuality. Giskard, on the other hand, is more of an umbrella testing framework for AI models in general, with strong support for LLM applications (including RAG pipelines). Let’s break down the comparison in a few key areas:

Trust & Safety Focus: RAGAS’s metrics mostly deal with correctness – it will tell you if your answer is wrong or unfaithful, but it doesn’t explicitly check for things like offensive content or data leakage. Giskard, by contrast, has an entire LLM vulnerability scanner. It automatically probes your model (or agent) for issues like hallucinations, harmful content generation, prompt injection, sensitive info disclosure, or bias (stereotypes and discrimination). In fact, Giskard maintains a list of “critical LLM vulnerabilities” (inspired by OWASP) and provides tests to detect them. This makes Giskard attractive for business use-cases where trust, safety, and ethical compliance are as important as accuracy. If your boss (or your regulator) asks “can we be sure the chatbot won’t say something crazy or leak personal data?”, a Giskard scan is meant to give that confidence.
Explainability & Component-Level Insights: With RAGAS, you get metric scores that quantify issues, but it’s up to you to interpret them. For example, a low faithfulness score tells you something’s off, but not which part of the answer was unsupported. Giskard emphasizes explainability in the sense of pinpointing failure modes. For RAG systems, Giskard offers the RAG Evaluation Toolkit (RAGET), which evaluates each component of your pipeline separately. It can autogenerate test questions from your knowledge base and then score the Retriever, Generator (LLM), Rewriter (if you have a query rewriter), Router, etc., by aggregating how each performs on those tests. The result is that you can identify “the retriever is the weakest link” or “the LLM is the one messing up even though retrieval is fine” easily. This kind of breakdown is extremely useful in debugging a complex system – it’s like a report card showing which subsystem needs improvement. RAGAS’s metrics do map onto components (context_recall is about the retriever, faithfulness about the generator), but it doesn’t roll them up into an explicit per-component report, so that inference is left to you. Giskard’s approach feels more actionable: it doesn’t just say what the quality is, but where to focus fixes.
Customization & Flexibility: Both frameworks are extensible, but in different ways. RAGAS allows you to plug in custom metrics and even custom evaluator models fairly easily in code. It’s very much library style – you integrate it into your Python code and you can tweak prompts or write new metric classes if needed. Giskard also offers customization but leans towards a higher-level user experience: for instance, its RAGET can generate evaluation datasets automatically from your knowledge base, sparing you the effort of writing evaluation queries yourself. It also integrates with existing ML platforms – notably, Giskard has a plugin for MLflow’s evaluation API, meaning you can slot it into your MLOps pipeline to run tests whenever you train or update a model. In terms of developing tests, Giskard supports both programmatic use (as a Python library) and a UI for creating and reviewing tests. This makes it quite friendly to a team setting: a QA engineer or an ML researcher could define custom test cases (like specific adversarial prompts) and add them to the suite. With RAGAS, creating specialized tests might mean writing a new metric or script – totally doable for a developer, but less point-and-click.
Usability for AI Leaders (and Busy Developers): If you’re an AI tech lead or VP overseeing a project, you probably care about quick feedback loops and clear dashboards. RAGAS is lightweight and developer-oriented – you run evaluations and you could plot the results over time, integrate in CI to catch regressions, etc. It’s very much your code, your responsibility to present the data in a useful way (though RAGAS docs do show examples of summarizing results). Giskard is positioning itself as a more comprehensive solution – they even call part of it an “LLM Evaluation Hub”, suggesting a web interface where results are aggregated. In practice, Giskard open-source gives you a nice API and possibly a local app to see reports. It’s built to integrate into business workflows: e.g., test results can be used as quality gates before deploying a model, and the variety of tests (bias, security, performance) aligns with risk checklists enterprises use. In short, RAGAS is a specialized tool in your toolbox, whereas Giskard is more of a full testing platform. Depending on your needs, you might even use them together – e.g. RAGAS for detailed QA metrics and Giskard for broader safety scans. Both are evolving rapidly in this nascent LLM evaluation space.
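The component-level “report card” idea can be illustrated with a few lines of plain Python. This is a rough sketch of the general approach (aggregate per-question correctness into per-component scores), not Giskard’s code; the question types and the mapping from question type to component are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical mapping: which pipeline components each question type stresses.
QUESTION_TYPE_TARGETS: Dict[str, List[str]] = {
    "simple": ["retriever", "generator"],
    "distracting": ["retriever", "generator", "rewriter"],
    "conversational": ["rewriter"],
}


def component_scores(results: Sequence[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (question_type, answered_correctly) pairs into a score per component."""
    totals = defaultdict(lambda: [0, 0])  # component -> [correct, attempted]
    for question_type, answered_correctly in results:
        for component in QUESTION_TYPE_TARGETS.get(question_type, []):
            totals[component][0] += int(answered_correctly)
            totals[component][1] += 1
    return {name: correct / attempted for name, (correct, attempted) in totals.items()}


# Hypothetical test run: four generated questions and whether the system got them right.
run = [
    ("simple", True),
    ("simple", True),
    ("distracting", False),
    ("conversational", True),
]
scores = component_scores(run)
weakest = min(scores, key=scores.get)
print(scores, "weakest:", weakest)
```

The value of this kind of roll-up is that a single list of pass/fail answers turns into a ranked list of where to spend debugging time.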

To sum up the comparison in a fun analogy: RAGAS is like a meticulous grader for your open-book QA test (checking if the student used the book correctly and answered the question), while Giskard is like the strict proctor and school counselor combined – catching cheaters (prompt injection), flagging inappropriate behavior, and telling you which subject the student is weak in. 📝🔍

Other Noteworthy Open-Source Evaluation Tools

The ecosystem for LLM evaluation is expanding, and RAGAS and Giskard aren’t the only games in town. Depending on your specific requirements, you might also consider:

DeepEval – An open-source framework that treats LLM evaluations like unit tests. It even integrates with PyTest, so you can write tests for your model outputs just as you would for your code. DeepEval provides a bunch of metrics (including hallucination detection, etc.) and can generate synthetic test cases from your knowledge base or load them from datasets (dev.to). For example, you can assert in code that the model’s answer should have a hallucination score below a threshold, and DeepEval will fail the test if not. This is great for developers who want to enforce quality via CI pipelines. DeepEval’s philosophy is similar to RAGAS (LLM-as-a-judge metrics), and indeed it overlaps on metrics; one difference noted by reviewers is that RAGAS’s numeric metrics, while thorough, sometimes aren’t self-explanatory in isolation, whereas DeepEval encourages explicit pass/fail criteria which can be easier to interpret in a test context.
MLflow LLM Evaluator – The folks at Databricks have extended MLflow with an mlflow.evaluate() API that supports LLMs. It comes with some built-in metrics and also allows custom plugins. In fact, Giskard’s integration with MLflow uses this plugin mechanism. If your organization already uses MLflow to track models, this is a natural way to evaluate LLMs (including RAG models) and log their performance over time. Think of it as adding evaluation results to your model registry, so you always know the “score” of a model before pushing it to production.
LlamaIndex Evaluation Module – If you’re using LlamaIndex (formerly GPT Index) for your RAG system, it has its own evaluation toolkit. For instance, LlamaIndex offers a Correctness Evaluator that compares a generated answer to a reference answer for a given query, using an LLM such as GPT-4 as the judge. Its other evaluators, such as those for faithfulness and relevancy, judge the response against the query and retrieved context, so they work without a reference answer. Essentially, they built something quite akin to RAGAS but integrated into the LlamaIndex workflow. This could be convenient if your RAG pipeline is built on LlamaIndex, as you won’t need an external dependency.
DeepChecks (LLM) – The DeepChecks library (known for ML testing) has introduced LLM evaluation features. It’s geared more towards evaluating the LLM itself (like its general capabilities or biases) rather than the whole RAG pipeline, but it’s worth mentioning. If you want to test your model on generic language tasks or adversarial inputs, DeepChecks provides a framework to do so, and you could incorporate those results into your RAG evaluation as additional signals (e.g., “our model tends to fail on arithmetic questions” – which might or might not be relevant to your RAG use-case).
Arize Observe / Phoenix – Arize (an ML observability company) released an open-source tool called Phoenix for troubleshooting LLMs in production. It’s not strictly an evaluation metric framework; it’s more of an observability and analysis toolkit. You feed in logs of your LLM interactions (prompts, responses, feedback) and it helps cluster and surface problem areas. For RAG, Phoenix could help identify patterns like “questions about topic X often have bad answers” by analyzing embeddings of responses and such. It complements metric-based evals by adding an exploratory, data-driven view (and even visualization) of where your system underperforms. An AI leader might use Phoenix to monitor a deployed RAG system, then use RAGAS or Giskard to dig into specific issues found.
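The “evaluation as unit test” idea is easy to sketch without committing to any one framework. The example below is plain Python in a pytest-friendly shape, with scores hard-coded where a real suite would call an evaluator. It deliberately does not use DeepEval’s API; the helper name, metric names, and thresholds are all hypothetical.

```python
from typing import Dict, List, Optional


def check_quality_gate(
    scores: Dict[str, float],
    minimums: Optional[Dict[str, float]] = None,
    maximums: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, floor in (minimums or {}).items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: missing score")
        elif value < floor:
            failures.append(f"{name}: {value:.3f} is below minimum {floor:.3f}")
    for name, ceiling in (maximums or {}).items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: missing score")
        elif value > ceiling:
            failures.append(f"{name}: {value:.3f} is above maximum {ceiling:.3f}")
    return failures


def test_release_candidate_meets_quality_bar():
    # In a real suite these would come from an evaluation run, not literals.
    scores = {"faithfulness": 0.8571, "hallucination": 0.12}
    failures = check_quality_gate(
        scores,
        minimums={"faithfulness": 0.8},
        maximums={"hallucination": 0.3},
    )
    assert not failures, "; ".join(failures)
```

Dropped into a CI job, a test like this turns a vague “the scores look fine” into an explicit pass/fail criterion that blocks a release when it regresses.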

The list goes on – new tools and frameworks are popping up (LangSmith by LangChain for example, or NeuralTrust’s evaluators). The good news is the community recognizes that evaluation is critical for LLM applications, and we’re seeing rapid innovation to make it easier. Whether you prioritize deep metrics (RAGAS), broad safety checks (Giskard), or integration into CI (DeepEval) or MLOps platforms, there’s likely an open-source project out there to help. Don’t be afraid to mix and match these tools; many are complementary. For instance, you might use LlamaIndex’s built-in eval during development, RAGAS for a detailed analysis before a release, and Giskard in a deployment pipeline for continuous monitoring.

Now, let’s address the elephant (or rather, the giant context window) in the room: as LLMs get bigger memories, will RAG as a technique become obsolete? And if so, what happens to all these RAG evaluation frameworks?

Future Outlook: Do Giant Context Windows Threaten RAG (and Its Evaluation)?

It’s 2025, and we already have models with 100k-token context windows. The trend suggests we might eventually have language models that can ingest entire manuals or even databases in one go. One might ask: if an LLM can literally read all relevant content in its prompt, do we even need retrieval and RAG techniques? Could we just stuff the knowledge base into the prompt (or into the model’s parameters) and call it a day? 🤔

The honest answer: RAG is likely here to stay for the foreseeable future, though its role may evolve. Here’s why large context alone isn’t a slam dunk replacement and how that affects RAG evaluation:

Efficiency vs. Capacity: Just because a model can take in 100k tokens doesn’t mean you always want to give it 100k tokens. Feeding huge contexts is expensive (in terms of compute and cost) and potentially slow. It’s like having an infinitely long cheat sheet – it’s great in theory, but flipping through all of it for every question might be overkill. Retrieval is an efficient strategy: find the most relevant 1% of the text and give that to the model. This is likely to remain valuable, because no matter how large the window, a targeted prompt can save time and reduce noise. From an evaluation perspective, as long as retrieval is used, we need to evaluate it (so metrics like context recall/precision stay relevant). If someday we truly stop doing retrieval, some RAG-specific metrics (like context recall) might become irrelevant – but we’ll still need to evaluate answer correctness and usage of provided info (which are essentially the same as faithfulness and factuality).
Human Attention is Finite (and so is Model Attention): An LLM with a massive context can include everything relevant and irrelevant. Models are prone to distraction; important facts can get “lost” in a sea of text. So even with a big context, it’s vital to check if the model focused on the right info. If we give GPT-6 a 500-page appendix and ask a question, we must verify it actually used the relevant bits from those 500 pages to form its answer. In other words, faithfulness metrics still matter: did the model’s answer align with the facts in the context? We might not call it “retrieval” anymore if we provided the documents directly, but frameworks like RAGAS could pivot to evaluate prompt-grounded faithfulness. In fact, RAGAS’s definition of faithfulness (claims supported by retrieved context) (docs.ragas.io) applies just as well if you replace “retrieved context” with “provided context”. The model could have the whole company handbook in context, but if it makes a claim not supported by any paragraph in that handbook, it’s unfaithful. So the need to catch hallucinations and ensure factual consistency won’t go away.
Dynamic Knowledge & Updates: Context window size doesn’t solve the problem of keeping models up-to-date. Companies will always have evolving data – new documents, changing facts. You wouldn’t want to finetune a giant model or regenerate a 100k prompt every time one article in your knowledge base changes. RAG as a pattern (retrieve latest relevant info, then generate) addresses this by design. So long as that pattern is common, tools like RAGAS are useful to validate that for each query, the retrieval+generation did the right thing. If we imagine a future where models have some retrieval ability internally (some people call this “internal knowledge utilization” where the model itself learns to navigate provided docs), we’ll still evaluate the same concepts, just maybe via different interfaces.
Evaluation Frameworks Will Adapt: Even if one day we say “our LLM reads the whole intranet on each question, no vector DB needed,” we would still build evaluation metrics around relevance of answer, correctness, completeness, and perhaps information utilization. RAGAS could adapt by dropping context recall (since if you feed everything, recall is moot – it’s 100% by design) and focusing more on things like answer quality and maybe conciseness or focus (did the model ignore irrelevant context?). In essence, the RAG evaluation frameworks might broaden into LLM evaluation frameworks. We already see RAGAS and others overlapping with general LLM metrics (e.g., RAGAS has “FactualCorrectness”, which could apply to any QA system, RAG or not; docs.ragas.io). So even if retrieval becomes less critical, the legacy of these tools – robust LLM evaluation – will carry on.
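The efficiency argument is easy to check with back-of-the-envelope arithmetic. The sketch below compares stuffing a full 100k-token context into every prompt against sending only the retrieved 1%. The price per million input tokens and the daily query volume are made-up placeholders, so plug in your own numbers.

```python
def prompt_cost_usd(prompt_tokens: int, price_per_million_tokens_usd: float) -> float:
    """Input-token cost of one prompt at a flat per-token price."""
    return prompt_tokens / 1_000_000 * price_per_million_tokens_usd


# Hypothetical numbers: adjust to your model's pricing and your traffic.
PRICE_PER_MILLION_USD = 3.0
QUERIES_PER_DAY = 10_000

full_context_tokens = 100_000               # dump everything into the prompt
retrieved_tokens = full_context_tokens // 100  # keep only the most relevant 1%

full_cost = prompt_cost_usd(full_context_tokens, PRICE_PER_MILLION_USD)
retrieved_cost = prompt_cost_usd(retrieved_tokens, PRICE_PER_MILLION_USD)

print(f"per query: full=${full_cost:.4f} retrieved=${retrieved_cost:.4f}")
print(f"per day:   full=${full_cost * QUERIES_PER_DAY:,.2f} "
      f"retrieved=${retrieved_cost * QUERIES_PER_DAY:,.2f}")
print(f"ratio: {full_cost / retrieved_cost:.0f}x")
```

Whatever the actual price, the ratio between the two approaches tracks the ratio of tokens sent, which is why retrieval stays attractive even when the window is large enough to hold everything.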

On the flip side, larger context windows will likely reduce some pain points. For instance, today a bad retrieval can completely fail an answer (if the doc you need wasn’t retrieved, you’re out of luck). In a future scenario where most relevant info is somewhere in the giant context, the problem shifts to search within context (which the model does implicitly). We might rely a bit less on explicit retrieval algorithms and more on the model’s ability to find needles in the haystack it’s given. But we’d then evaluate the model’s answer for whether it indeed found the right needle. The goalposts move, but the game stays the same: verify that the model’s output is grounded in truth.

In practical terms, as context windows grow, some metrics might get easier to max out – e.g., context recall could often be 1.0 because we just dump in all possibly relevant docs. That would make recall a less interesting number (everyone gets an A because the test is open-everything). The focus would shift to metrics like faithfulness and relevance even more. Hallucination detection (faithfulness) becomes paramount: if the model has all the info and still hallucinated, shame on it – and we need to catch that. So RAGAS or its successors might double down on improved faithfulness checks, maybe multi-step reasoning verification. Similarly, if the model has tons of context, we’d want to ensure it didn’t copy irrelevant bits verbatim or go off on tangents – a blend of relevance and maybe a new metric for “focus” could emerge.

From a strategic viewpoint, AI leaders should keep an eye on this: you may not always need a vector database for retrieval if your model can swallow a library, but you will always need evaluation. Whether it’s RAG-specific or general LLM QA evaluation, the goal is to maintain trust and quality in the answers your systems produce. The tools we discussed are already adapting to cover a broad range of evaluation scenarios (notice Giskard’s branding is not just RAG, but all AI model testing; RAGAS too could be seen as part of a bigger eval toolkit). So, larger contexts will change how we use these tools, but not eliminate the need for them. You’ll possibly configure them differently, but you’ll still run your model through an eval gauntlet to ensure it’s behaving.

Conclusion

In the fast-paced world of AI, Retrieval-Augmented Generation has proven to be a practical way to boost LLM performance by grounding it in real data. But with great power (an LLM that cites sources!) comes great responsibility – we must rigorously evaluate these systems to ensure they’re actually doing what we expect: finding the right info and using it correctly to answer questions. We explored RAGAS, a purpose-built framework that gives developers a leg up in this evaluation process by providing metrics for everything from retrieval quality (did we get the info?) to generation faithfulness (did we use it correctly?). With RAGAS, one can quantitatively track improvements and regressions in a RAG pipeline, getting numeric scores for things that used to require laborious manual QA (docs.ragas.io).

We also looked at Giskard, which broadens the scope to testing not just for accuracy but for trustworthiness – catching those nasty edge cases and AI behaviors that keep CEOs up at night. Giskard and similar frameworks add a layer of assurance by scanning for biases, hallucinations, security vulnerabilities, and more. For a developer, these tools can feel like having a safety net; for an AI leader, they provide insight and confidence that the system won’t turn into a loose cannon in production.

Crucially, we’ve realized that as models evolve (hello, 100k context windows!), evaluation frameworks will also evolve. If tomorrow’s model doesn’t need to “retrieve” because it can pay attention to your entire database at once, we’ll still want to measure if its answers are correct and grounded. The names of the metrics might change, but the mission remains the same: keep the AI honest and helpful. In fact, one could argue that evaluation will only grow in importance – when failures are rare but potentially very costly, catching the one in a million mistake is like finding a needle in a haystack. Automated eval tools will be our magnet to find that needle.

On a lighter note, think of your RAG system as a star employee (albeit an electronic one) – you gave it access to the company wiki so it can do its job better. RAGAS is like HR checking its references and work outputs for honesty and relevance, and Giskard is IT/security making sure it’s not breaking any rules or spilling secrets. Together, they help you trust this new “employee” with bigger responsibilities. And as this employee gets smarter (larger context, more training), you won’t stop evaluating them – you’ll just have better tools to do so, ensuring they continue to perform and behave.

TL;DR: RAG became popular to keep LLMs factual by letting them fetch documents; RAGAS arose to score how well that works (no more relying purely on “vibe checking” your model’s answers). RAGAS gives you metrics like how much of the needed info was retrieved, how faithful the answer was to that info, and how relevant the answer was to the question. It uses LLMs behind the scenes to automate a lot of this, fitting into fast development cycles (arxiv.org). Giskard, in comparison, is like the multi-tool for AI evaluation – it’s not limited to RAG and adds trust metrics (catching things like toxicity or data leaks), with a focus on explainability and customization for enterprise needs. Other tools like DeepEval, LlamaIndex’s evaluators, and more are also in the mix, each with their strengths. As LLMs get bigger brains (contexts), we may lean on retrieval a bit less, but we’ll always need to verify that our giant-brained models are using their brains correctly. So whether you stick to RAG or venture into hypertrophic context land, keep those evaluation checklists handy! In the end, a model that isn’t evaluated is a model you’re flying blind with – and nobody wants their AI project to be a random leap of faith. Happy evaluating, and may your RAG systems always faithfully answer with just the facts (and maybe a dash of humor when appropriate).

References to dive deeper:

Shahul Es et al., "Ragas: Automated Evaluation of Retrieval Augmented Generation" (2023) – arXiv preprint introducing the RAGAS framework (arxiv.org).
RAGAS Documentation – Official docs for RAGAS metrics and usage (docs.ragas.io).
Leonie Monigatti, "Evaluating RAG Applications with RAGAS" – Medium article with a great overview of RAGAS’s approach and how it uses LLMs for evaluation (medium.com).
Giskard GitHub README – Describes Giskard’s features for scanning LLMs (hallucinations, security, bias, etc.) (github.com).
Databricks Blog on Giskard & MLflow – Details how Giskard integrates with MLflow and the types of LLM vulnerabilities it detects (databricks.com).
Top 5 Open-Source LLM Evaluation Frameworks in 2025 – Dev.to article comparing frameworks like DeepEval, RAGAS, etc., and noting strengths/weaknesses (dev.to).
NeuralTrust Blog, "Benchmarking LLM Evaluation Models" – Discusses various eval frameworks (incl. RAGAS, Giskard, LlamaIndex) and their correctness evaluators (neuraltrust.ai).

Tega Adeyemi, May 9, 2025