Open-Source Evaluation Frameworks for Deep Research Agents with PyTest Integration
Before diving into the available frameworks, let me summarize the key findings from my research on standalone Python evaluation tools that can be integrated with PyTest and GitHub Actions for evaluating deep research agents built with Semantic Kernel.
Top Open-Source Evaluation Frameworks
DeepEval - Comprehensive LLM Evaluation Framework
DeepEval emerges as a standout option for evaluating deep research agents with the following key features:
- Open-source Python framework with simple installation: `pip install -U deepeval`
- Native PyTest integration via the `deepeval test run` command
- 14+ LLM-evaluated metrics with research backing
- Supports both individual test cases and evaluation datasets
- Specialized metrics for agents (Tool Correctness, Task Completion)
- Completely local evaluation without cloud dependencies
- Synthetic dataset generation capabilities for edge-case testing
DeepEval follows a test case approach similar to PyTest, making it intuitive for developers:
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import TaskCompletionMetric

def test_agent_completion():
    test_case = LLMTestCase(
        input="Plan a 3-day itinerary for Paris with cultural landmarks",
        actual_output="Day 1: Eiffel Tower...",
        tools_called=[...]  # List of tools your agent called
    )
    task_completion_metric = TaskCompletionMetric(threshold=0.7)
    assert_test(test_case, [task_completion_metric])
```
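Beyond single assertions, the evaluation-dataset support listed above lets you score many agent transcripts in one pass. The sketch below is a minimal, hedged example using DeepEval's `evaluate` helper and `AnswerRelevancyMetric`; `collect_agent_output` and the prompts are hypothetical stand-ins for however you invoke your Semantic Kernel agent.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def collect_agent_output(prompt: str) -> str:
    # Hypothetical helper: invoke your Semantic Kernel agent here and
    # return its final answer; hard-coded placeholder for illustration.
    return "Placeholder answer produced by the research agent."

prompts = [
    "Summarize recent developments in solid-state batteries",
    "Compare the economics of onshore and offshore wind power",
]

# Build one test case per prompt from the agent's actual responses.
test_cases = [
    LLMTestCase(input=p, actual_output=collect_agent_output(p))
    for p in prompts
]

# Scores every test case against every metric and prints a summary report.
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```

For larger suites, DeepEval also ships an `EvaluationDataset` abstraction for grouping and reusing test cases across runs.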
DeepEval can be easily integrated into GitHub Actions workflows:
```yaml
name: LLM Agent Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - run: pip install deepeval
      - run: deepeval test run test_agent.py
```
RAGAS - Specialized RAG Evaluation
If your research agent heavily utilizes retrieval augmentation, RAGAS provides specialized metrics:
- Purpose-built for evaluating RAG systems with comprehensive metrics
- Explicit PyTest integration for CI/CD pipelines
- Works entirely locally without cloud dependencies
- Includes metrics like Faithfulness, Context Precision, and Response Relevancy
RAGAS integration with PyTest for GitHub Actions is straightforward:
```python
import pytest
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

@pytest.mark.ragas_ci
def test_agent_responses():
    # Your test dataset with agent responses
    result = evaluate(
        dataset,
        metrics=[answer_relevancy, faithfulness],
        in_ci=True,
    )
    assert result["answer_relevancy"] >= 0.9
```
MLflow LLM Evaluate - Versatile Evaluation Platform
MLflow offers a robust evaluation framework with these features:
- Supports various model types including Python callables
- Combines mathematical metrics with LLM-as-a-judge evaluations
- Allows comparative analysis between models and prompts
- Provides traces for deeper insight into evaluation results
Basic MLflow integration example:
```python
import mlflow
from mlflow.metrics.genai import answer_relevance

def test_agent_evaluation():
    result = mlflow.evaluate(
        data=test_data,
        model=your_agent_function,
        extra_metrics=[answer_relevance()]
    )
    assert result.metrics["answer_relevance"] > 0.8
```
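The comparative-analysis point above maps naturally onto MLflow runs: evaluate each agent variant in its own run, then compare the logged metrics side by side in the tracking UI (`mlflow ui`). A rough sketch, assuming `agent_v1` and `agent_v2` are callables that accept a pandas DataFrame of inputs and return predictions, and `test_data` is the same evaluation DataFrame used above (all hypothetical names):

```python
import mlflow
from mlflow.metrics.genai import answer_relevance

# Hypothetical agent variants to compare, e.g. two prompt versions.
candidates = {
    "baseline-prompt": agent_v1,
    "revised-prompt": agent_v2,
}

for run_name, agent_fn in candidates.items():
    # Each variant gets its own MLflow run so the logged metrics can be
    # compared across runs afterwards.
    with mlflow.start_run(run_name=run_name):
        mlflow.evaluate(
            data=test_data,                  # DataFrame holding the agent's input prompts
            model=agent_fn,                  # callable: DataFrame in, predictions out
            extra_metrics=[answer_relevance()],
        )
```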
Framework Comparison
| Framework | Best For | PyTest Integration | GitHub Actions | Cloud Requirements |
| --- | --- | --- | --- | --- |
| DeepEval | General LLM & agent evaluation | Native support | Excellent | None (optional cloud) |
| RAGAS | RAG-specific evaluation | Built-in support | Good | None |
| MLflow | Comprehensive model evaluation | Adaptable | Good | None for basic metrics |
Implementation Guide for GitHub Actions
To integrate your chosen evaluation framework into GitHub Actions alongside other unit tests:
```yaml
name: Agent Testing
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest deepeval ragas
          pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/ -v
      - name: Run agent evaluation
        run: deepeval test run agent_tests/
```
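One operational detail: the LLM-judged metrics shown earlier (DeepEval's agent metrics, RAGAS faithfulness and relevancy, MLflow's answer_relevance) call out to a judge model. If that judge is a hosted model such as one from OpenAI, the evaluation step needs credentials, which should come from repository secrets rather than being committed; a hedged sketch assuming a secret named OPENAI_API_KEY (with a locally hosted judge, no secret is required):

```yaml
      - name: Run agent evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run agent_tests/
```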
Conclusion
For evaluating a deep research agent built with Semantic Kernel, DeepEval stands out as the most comprehensive solution with native PyTest integration. It provides specialized agent evaluation metrics while remaining completely independent of any cloud services.
If your agent heavily uses RAG capabilities, RAGAS offers specialized metrics that may better target those aspects of performance. For more general model evaluation with extensive customization, MLflow provides a mature framework with broad capabilities.
All three frameworks can be seamlessly integrated into PyTest and GitHub Actions workflows, allowing you to evaluate your agent automatically with each commit alongside your other unit tests.