Open-Source Evaluation Frameworks for Deep Research Agents with PyTest Integration

Before diving into the available frameworks, let me summarize the key findings from my research on standalone Python evaluation tools that can be integrated with PyTest and GitHub Actions for evaluating deep research agents built with Semantic Kernel.

Top Open-Source Evaluation Frameworks

DeepEval - Comprehensive LLM Evaluation Framework

DeepEval emerges as a standout option for evaluating deep research agents with the following key features:

  • Open-source Python framework with simple installation: pip install -U deepeval
  • Native PyTest integration with deepeval test run command
  • 14+ LLM-evaluated metrics with research backing
  • Supports both individual test cases and evaluation datasets
  • Specialized metrics for agents (Tool Correctness, Task Completion)
  • Runs locally with no required cloud platform (hosted reporting is optional)
  • Synthetic dataset generation capabilities for edge case testing

DeepEval follows a test case approach similar to PyTest, making it intuitive for developers:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import TaskCompletionMetric

def test_agent_completion():
    test_case = LLMTestCase(
        input="Plan a 3-day itinerary for Paris with cultural landmarks",
        actual_output="Day 1: Eiffel Tower...",
        tools_called=[...],  # List of tools your agent called
    )

    task_completion_metric = TaskCompletionMetric(threshold=0.7)
    assert_test(test_case, [task_completion_metric])

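Beyond single test cases, the same pattern extends to evaluation datasets and the agent-specific metrics listed above. Below is a minimal sketch (the test-case contents and tool names are placeholders, and the exact classes assume a recent DeepEval release) of parametrizing a dataset into PyTest and scoring tool usage with the Tool Correctness metric:

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

# Placeholder test cases; in practice these come from recorded agent runs
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(
        input="Compile a short literature overview of retrieval-augmented generation",
        actual_output="Retrieval-augmented generation combines ...",
        tools_called=[ToolCall(name="web_search")],    # tools the agent actually invoked
        expected_tools=[ToolCall(name="web_search")],  # tools it was expected to invoke
    ),
])

# Each test case becomes its own PyTest test via parametrization
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_research_agent(test_case: LLMTestCase):
    assert_test(test_case, [ToolCorrectnessMetric(threshold=0.7)])
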
DeepEval can be easily integrated into GitHub Actions workflows:

name: LLM Agent Evaluation
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - run: pip install deepeval
      - run: deepeval test run test_agent.py

RAGAS - Specialized RAG Evaluation

If your research agent heavily utilizes retrieval augmentation, RAGAS provides specialized metrics:

  • Purpose-built for evaluating RAG systems with comprehensive metrics
  • Explicit PyTest integration for CI/CD pipelines
  • No hosted platform required; evaluation runs in your own environment
  • Includes metrics like Faithfulness, Context Precision, and Response Relevancy

RAGAS integration with PyTest for GitHub Actions is straightforward:

import pytest
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

@pytest.mark.ragas_ci
def test_agent_responses():
    # Your test dataset with agent responses (see the sketch below)
    result = evaluate(
        dataset,
        metrics=[answer_relevancy, faithfulness],
        in_ci=True,
    )

    assert result["answer_relevancy"] >= 0.9

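The dataset object above is whatever evaluation data you collect from your agent. With the ragas 0.1-style API shown in this example, a common approach is to build a Hugging Face Dataset whose columns hold the question, the agent's answer, and the retrieved contexts; the sketch below uses placeholder content:

from datasets import Dataset  # Hugging Face datasets package

# Placeholder records; in practice, collect these from real agent runs
eval_records = {
    "question": [
        "What are the trade-offs between dense and sparse retrieval?",
    ],
    "answer": [
        "Dense retrieval captures semantic similarity, while sparse retrieval ...",
    ],
    "contexts": [
        [
            "Retrieved passage about dense retrieval ...",
            "Retrieved passage about BM25 and sparse retrieval ...",
        ],
    ],
}

dataset = Dataset.from_dict(eval_records)
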
MLflow LLM Evaluate - Versatile Evaluation Platform

MLflow offers a robust evaluation framework with these features:

  • Supports various model types including Python callables
  • Combines mathematical metrics with LLM-as-a-judge evaluations
  • Allows comparative analysis between models and prompts
  • Traces for deeper insights into evaluation results

Basic MLflow integration example:

import mlflow
from mlflow.metrics.genai import answer_relevance

def test_agent_evaluation():
    result = mlflow.evaluate(
        data=test_data,
        model=your_agent_function,
        extra_metrics=[answer_relevance()],
    )

    # Aggregated LLM-judged metrics are keyed as "<name>/<version>/<aggregation>"
    assert result.metrics["answer_relevance/v1/mean"] > 0.8

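In this example, your_agent_function and test_data are stand-ins for your own agent wrapper and evaluation data. One common pattern, sketched below with hypothetical names and assuming MLflow's convention that a callable model receives the evaluation DataFrame and that LLM-judged metrics read questions from an "inputs" column, looks like this:

import pandas as pd

def your_agent_function(inputs: pd.DataFrame) -> list[str]:
    # Hypothetical wrapper: run the Semantic Kernel agent once per question
    # and return one answer string per row of the evaluation DataFrame.
    return [run_research_agent(question) for question in inputs["inputs"]]

# Placeholder evaluation data; the "inputs" column holds the questions
test_data = pd.DataFrame({
    "inputs": [
        "Summarize the current approaches to evaluating research agents.",
    ],
})
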
Framework Comparison

Framework   Best For                         PyTest Integration   GitHub Actions   Cloud Requirements
DeepEval    General LLM & agent evaluation   Native support       Excellent        None (optional cloud)
RAGAS       RAG-specific evaluation          Built-in support     Good             None
MLflow      Comprehensive model evaluation   Adaptable            Good             None for basic metrics

Implementation Guide for GitHub Actions

To integrate your chosen evaluation framework into GitHub Actions alongside other unit tests:

name: Agent Testing
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest deepeval ragas
          pip install -r requirements.txt

      - name: Run unit tests
        run: pytest tests/ -v

      - name: Run agent evaluation
        run: deepeval test run agent_tests/

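One optional refinement to the workflow above is to separate fast unit tests from slower, LLM-judged evaluations with a custom PyTest marker, so the unit-test step stays quick and the evaluation step can be gated or scheduled independently. A minimal sketch, assuming a marker named "evaluation" registered in conftest.py:

# conftest.py
def pytest_configure(config):
    # Register the custom marker so it is documented and accepted
    # under --strict-markers.
    config.addinivalue_line(
        "markers", "evaluation: slow, LLM-judged agent evaluation tests"
    )

Tests decorated with @pytest.mark.evaluation can then be excluded from the unit-test step with pytest tests/ -v -m "not evaluation" and run separately with pytest -m evaluation. Note that LLM-judged metrics call a judge model, so the evaluation step will usually also need the relevant API key (for example OPENAI_API_KEY) exposed to the job as a repository secret.
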
Conclusion

For evaluating a deep research agent built with Semantic Kernel, DeepEval stands out as the most comprehensive solution with native PyTest integration. It provides specialized agent evaluation metrics and does not require a hosted evaluation platform, although its LLM-judged metrics still need access to a judge model (local or API-based).

If your agent heavily uses RAG capabilities, RAGAS offers specialized metrics that may better target those aspects of performance. For more general model evaluation with extensive customization, MLflow provides a mature framework with broad capabilities.

All three frameworks can be seamlessly integrated into PyTest and GitHub Actions workflows, allowing you to evaluate your agent automatically with each commit alongside your other unit tests.