
Building Reliable AI Agents - Implementing Advanced Evaluation with Azure AI SDK and Custom APIM Integration

Learn how to implement robust evaluation for AI agents using Azure AI Evaluation SDK when working with Azure API Management (APIM), overcoming authentication and integration challenges.

  • azure
  • openai
  • evaluation
  • apim
  • llm
  • testing

The evaluate() function returned 401 Unauthorized for the fourth time that afternoon, and I was starting to take it personally.

I had the right endpoint. The right API version. The environment variables were set. Every smoke test I ran against the Azure API Management gateway directly came back fine. But the moment I handed that same gateway URL to the Azure AI Evaluation SDK, it acted like I’d never authenticated at all.

What I wanted was simple: a battery of automated quality checks on our research agent — groundedness, relevance, faithfulness, fluency — the kind of eval harness any serious AI deployment needs. What I got was three hours of staring at HTTP 401s while the SDK silently stripped the headers APIM required.

Here’s what was actually happening, and how I eventually got it working.


The wall between the SDK and the gateway

The Azure AI Evaluation SDK ships with built-in evaluators that work beautifully when your model sits behind a plain Azure OpenAI endpoint. But most enterprise setups don’t look like that. They route through Azure API Management — a gateway layer that sits in front of your AI services and handles authentication, rate limiting (capping how many requests a caller can make in a window), IP filtering, and request transformation before anything reaches the model.

The SDK’s evaluators don’t know about gateways. They construct their own openai.AzureOpenAI client internally, pulling AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY from environment variables. There’s no hook to inject extra headers or swap in a bearer token — a short-lived credential that proves your identity to Azure. When APIM demands custom headers — a username claim, a subscription key, a tenant-routing field — those headers aren’t there. APIM rejects the request before the model ever sees it.

Our research agent needed evaluation across five dimensions: groundedness (are the claims supported by sources?), answer relevancy, contextual precision and recall, faithfulness to provided context, and fluency/coherence. Reasonable requirements. But none of them mattered if I couldn’t get the evaluators to talk to the model at all.

graph TD
    Client[Evaluation Client] -->|Standard Path| Direct[Direct Azure OpenAI]
    Client -->|Enterprise Path| APIM[Azure API Management]
    APIM -->|Custom Auth Required| Direct
    
    linkStyle 1 stroke:#f66,stroke-width:2px
    style APIM fill:#f66,stroke:#333,stroke-width:2px

The standard approach looked like this, and it never worked:

# Standard approach - fails with APIM
evaluators = {
    "groundedness": GroundednessEvaluator(),
    "relevance": RelevanceEvaluator()
}

results = evaluate(
    evaluators=evaluators,
    target_function=run_research_agent,
    test_cases=test_scenarios,
    # Even when providing correct APIM URL, it fails with auth errors
    azure_endpoint=os.getenv("AZURE_APIM_URI")
)

401 Unauthorized. 403 Forbidden. Every time.


Making the evaluators APIM-aware

The fix meant owning the HTTP client myself instead of letting the SDK build it under the hood.

I wrote a base class that each evaluator would inherit from. It obtains Azure AD tokens through DefaultAzureCredential, which walks Azure’s identity chain (environment variables → managed identity → CLI credentials, tried in order), and hands them to AzureOpenAI through azure_ad_token_provider, so every request carries a fresh Authorization: Bearer header. The required APIM headers go into default_headers, a constructor argument of the OpenAI Python SDK client that is applied to every request.

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI


class APIMEvaluatorBase:
    """Base class for custom APIM-compatible evaluators."""

    def __init__(self, metric_name, apim_config):
        self._metric_name = metric_name
        self.config = apim_config

        # Build an OpenAI client that authenticates correctly against APIM
        self.client = self._create_apim_client()

    def _create_apim_client(self):
        # Token provider: fetches (and refreshes) Azure AD tokens on demand
        token_provider = get_bearer_token_provider(
            DefaultAzureCredential(),
            "https://cognitiveservices.azure.com/.default",
        )

        # Headers our APIM policy requires on every request
        headers = {
            "mkl-User-name": self.config.username,
            "username": self.config.username,
        }

        # default_headers are sent with every request this client makes
        return AzureOpenAI(
            azure_endpoint=self.config.endpoint,
            api_version=self.config.api_version,
            azure_ad_token_provider=token_provider,
            default_headers=headers,
        )

    def __call__(self, response, context=None, query=None):
        """Evaluate with our APIM-aware client."""
        raise NotImplementedError()

I created subclasses for each metric — APIMGroundednessEvaluator, APIMRelevanceEvaluator, APIMContextualPrecisionEvaluator, and so on — all inheriting that same client factory.
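To make the subclass idea concrete, here is a minimal sketch of what one evaluator body might look like. Everything in it is an assumption for illustration: the prompt wording, the 1 to 5 scale, the `deployment` argument, and the class name `GroundednessSketch` are not from the original code. It takes an already-built client (for example the one returned by `_create_apim_client()`), and a fake client stands in for `AzureOpenAI` so the example runs without a gateway.

```python
import json
from types import SimpleNamespace

# Hypothetical judge prompt; the real evaluator prompts are not shown in this post.
GROUNDEDNESS_PROMPT = (
    "Rate from 1 to 5 how well the RESPONSE is supported by the CONTEXT. "
    'Reply with a JSON object: {"score": <number>, "reasoning": "<one sentence>"}.'
)


class GroundednessSketch:
    """Illustrative evaluator body: one chat call, one JSON verdict."""

    def __init__(self, client, deployment):
        self.client = client          # any object exposing chat.completions.create
        self.deployment = deployment  # deployment name (assumed to come from config)

    def __call__(self, response, context=None, query=None):
        completion = self.client.chat.completions.create(
            model=self.deployment,
            messages=[
                {"role": "system", "content": GROUNDEDNESS_PROMPT},
                {"role": "user", "content": f"CONTEXT:\n{context}\n\nRESPONSE:\n{response}"},
            ],
            temperature=0.0,
            response_format={"type": "json_object"},
        )
        verdict = json.loads(completion.choices[0].message.content)
        return {"score": float(verdict["score"]), "reasoning": str(verdict["reasoning"])}


def make_fake_client(canned_json):
    """Stand-in for AzureOpenAI that always returns the same reply."""
    def create(**kwargs):
        message = SimpleNamespace(content=canned_json)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
    return SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(create=create)))


fake = make_fake_client('{"score": 4, "reasoning": "All claims appear in the context."}')
evaluator = GroundednessSketch(fake, deployment="my-deployment")
result = evaluator(
    response="Spills harm marine life.",
    context="Oil spills damage marine ecosystems.",
)
print(result)
```

Passing the client in, rather than building it inside the evaluator, is a design choice of this sketch: it makes each metric testable offline and keeps the authentication logic in one place.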


Why the standard SDK fails at an APIM boundary

azure-ai-evaluation’s built-in evaluators construct their own openai.AzureOpenAI client internally. They read AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY from the environment and make direct calls — there is no hook to inject extra headers or swap in a bearer token mid-flight. When APIM sits in front and requires custom headers (e.g. a username claim, a subscription key, or a tenant-routing header), those requests arrive incomplete and APIM rejects them with 401/403 before the model is ever reached.

The fix: own the client, own the headers

azure-identity’s DefaultAzureCredential follows the standard Azure credential chain (env vars → managed identity → CLI → VS Code → …) and returns a short-lived bearer token scoped to https://cognitiveservices.azure.com/.default. Pass that token to AzureOpenAI as azure_ad_token rather than api_key: the client sends azure_ad_token as Authorization: Bearer <token>, whereas api_key travels in the api-key header. The default_headers constructor argument then attaches any APIM-required headers to every subsequent request without you touching the underlying httpx client.

Minimal reproducible setup. Confirm your credential chain and APIM headers work before wiring up any evaluator:

from azure.identity import DefaultAzureCredential
from openai import AzureOpenAI

credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_endpoint="https://<your-apim-gateway>.azure-api.net/openai",
    api_version="2024-02-01",
    azure_ad_token=token.token,  # sent as "Authorization: Bearer <token>"
    default_headers={
        "Ocp-Apim-Subscription-Key": "<your-subscription-key>",  # if required
        "x-custom-username": "<service-account-name>",
    },
)

# Smoke-test: if this returns a completion, your auth+headers are correct.
resp = client.chat.completions.create(
    model="<your-deployment-name>",
    messages=[{"role": "user", "content": "ping"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)

Once this smoke test passes, every evaluator subclass that inherits _create_apim_client() sends its requests through the same verified channel; the groundedness, relevance, and faithfulness prompts are just structured text. Setting response_format={"type": "json_object"} (JSON mode, supported by gpt-4o deployments among others, and only accepted when the prompt itself mentions JSON) guarantees that the reply is syntactically valid JSON; without it you’ll chase intermittent JSON-decode errors across long runs.
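JSON mode only promises syntactically valid JSON, not the fields you asked for, so it is worth validating the verdict before it reaches the aggregation step. The following is a small sketch under assumptions: the helper name `parse_verdict` is hypothetical, and the 1 to 5 score range is inferred from the example scores and thresholds in this post rather than stated anywhere.

```python
import json


def parse_verdict(raw, low=1.0, high=5.0):
    """Turn a judge reply into {"score", "reasoning"}, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or "score" not in data:
        raise ValueError("judge reply has no 'score' field")
    try:
        score = float(data["score"])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"score is not a number: {data['score']!r}") from exc
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    # A missing reasoning field is tolerated and becomes an empty string.
    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


print(parse_verdict('{"score": 4, "reasoning": "Mostly supported."}'))
# {'score': 4.0, 'reasoning': 'Mostly supported.'}
```

Raising a single exception type keeps the calling loop simple: one `except ValueError` can mark the metric as unscored instead of aborting the whole run.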


When the SDK still wouldn’t cooperate

The custom evaluators worked individually. I could call each one and get back a score with reasoning. But when I tried to feed them into the SDK’s evaluate() function — the one that aggregates results across all metrics and test cases — it refused to cooperate.

The framework’s inner aggregation logic assumes certain behaviors from evaluators that our APIM-aware classes couldn’t satisfy. I spent a full morning stepping through stack traces before accepting that the evaluate() path was a dead end for our setup.

So I built the aggregation loop myself.

def run_manual_evaluation(test_scenarios, apim_config):
    """Run evaluation manually with custom APIM-aware evaluators."""
    
    evaluators = {
        "groundedness": APIMGroundednessEvaluator(apim_config),
        "relevance": APIMRelevanceEvaluator(apim_config),
        "contextual_precision": APIMContextualPrecisionEvaluator(apim_config),
        "faithfulness": APIMFaithfulnessEvaluator(apim_config),
        "fluency": APIMFluencyEvaluator(apim_config)
    }
    
    thresholds = {
        "groundedness": 3.5,
        "relevance": 3.5,
        "contextual_precision": 3.0,
        "faithfulness": 3.5,
        "fluency": 3.0
    }
    
    all_results = []
    
    for scenario in test_scenarios:
        agent_response = run_research_agent(
            query=scenario["query"],
            additional_context=scenario.get("additional_context", "")
        )
        
        scenario_results = {
            "query": scenario["query"],
            "response": agent_response,
            "metrics": {}
        }
        
        for metric_name, evaluator in evaluators.items():
            if metric_name not in scenario.get("evaluation_metrics", list(evaluators.keys())):
                continue
                
            result = evaluator(
                response=agent_response,
                context=scenario.get("context", ""),
                query=scenario["query"]
            )
            
            scenario_results["metrics"][metric_name] = {
                "score": result["score"],
                "reasoning": result["reasoning"],
                "threshold": thresholds[metric_name],
                "pass": result["score"] >= thresholds[metric_name]
            }
        
        all_results.append(scenario_results)
    
    summary = calculate_evaluation_summary(all_results, thresholds)
    save_evaluation_results(all_results, summary)
    
    return all_results, summary

Giving up on the SDK’s orchestration layer was the right call. The manual loop gave me full visibility into every evaluation call and made debugging authentication failures trivial — I could print() the response from any single evaluator without the framework swallowing the error.
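The loop above calls calculate_evaluation_summary() without showing it. As a rough sketch of what such a helper could look like, assuming the per-scenario result shape built in the loop: the definition of overall_pass_rate here (the fraction of all individual metric checks that passed) is my assumption, since the post does not define it, and the rounding to two decimals is only for readability.

```python
def calculate_evaluation_summary(all_results, thresholds):
    """Aggregate per-scenario metric results into averages and pass rates."""
    summary = {}
    total_checks = 0
    total_passed = 0

    for metric, threshold in thresholds.items():
        # Collect this metric's entries from every scenario that evaluated it.
        entries = [r["metrics"][metric] for r in all_results if metric in r["metrics"]]
        if not entries:
            continue  # metric was never evaluated; leave it out of the summary

        passed = sum(1 for e in entries if e["pass"])
        summary[metric] = {
            "average_score": round(sum(e["score"] for e in entries) / len(entries), 2),
            "pass_rate": round(passed / len(entries), 2),
            "threshold": threshold,
        }
        total_checks += len(entries)
        total_passed += passed

    # Assumed definition: share of all metric checks that met their threshold.
    summary["overall_pass_rate"] = (
        round(total_passed / total_checks, 2) if total_checks else 0.0
    )
    return summary


demo_results = [
    {"query": "q1", "response": "r1", "metrics": {
        "groundedness": {"score": 5.0, "pass": True},
        "relevance": {"score": 2.0, "pass": False},
    }},
    {"query": "q2", "response": "r2", "metrics": {
        "groundedness": {"score": 4.0, "pass": True},
    }},
]
demo_thresholds = {"groundedness": 3.5, "relevance": 3.5, "fluency": 3.0}
demo_summary = calculate_evaluation_summary(demo_results, demo_thresholds)
print(demo_summary)
```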


What the metrics actually looked like

Here’s a groundedness evaluation in practice. The agent’s response scores a 5.0 — every claim maps back to the provided context:

scenario = {
    "query": "What are the main environmental risks for offshore drilling?",
    "context": (
        "Offshore drilling poses several environmental risks including: "
        "oil spills that can damage marine ecosystems, "
        "disruption of marine habitats during drilling, "
        "underwater noise pollution affecting marine mammals, "
        "and greenhouse gas emissions from flaring operations."
    ),
    "evaluation_metrics": ["groundedness", "relevance"]
}

response = (
    "Offshore drilling presents significant environmental concerns. "
    "The primary risks include catastrophic oil spills that damage marine ecosystems, "
    "physical disruption of seafloor habitats during drilling operations, "
    "noise pollution that can disorient and harm marine mammals like whales and dolphins, "
    "and substantial greenhouse gas emissions from gas flaring."
)

result = {
    "score": 5.0,
    "reasoning": (
        "The response accurately reflects all environmental risks mentioned "
        "in the context without adding unsupported claims."
    )
}

And here’s one that scored a 2.0 on relevance — the model answered about environmental permits when the question was about safety regulations:

scenario = {
    "query": "What regulations govern offshore drilling safety?",
    "context": (
        "Offshore drilling is regulated by several frameworks including "
        "the BSEE regulations in the US, which mandate safety equipment like blowout preventers. "
        "The SEMS rule requires safety management systems. "
        "International operations often follow IMO guidelines and regional regulations."
    )
}

response = (
    "Offshore drilling has strict regulations. Companies must follow "
    "environmental protection standards and obtain permits before drilling. "
    "Environmental impact assessments are required in many jurisdictions."
)

result = {
    "score": 2.0,
    "reasoning": (
        "The response discusses regulations but focuses on environmental permits "
        "rather than the safety regulations mentioned in the context. "
        "It doesn't address BSEE, SEMS, or IMO guidelines specifically asked about in the query."
    )
}

Three things that saved me

TypedDict for return types. The SDK expects specific shapes from evaluator __call__ methods. Declaring each result as a Python TypedDict kept the return signatures compatible and let a static type checker catch mismatches before runtime (TypedDict itself does no checking when the code runs):

from typing import TypedDict

class GroundednessResult(TypedDict):
    score: float
    reasoning: str

class APIMGroundednessEvaluator(APIMEvaluatorBase):
    def __call__(self, response, context=None, query=None) -> GroundednessResult:
        ...  # implementation goes here

Retry logic. APIM calls hit timeouts and transient errors — not often, but often enough to break a long eval run. An exponential backoff wrapper (wait 2^n seconds between retries, up to 3 attempts) turned intermittent failures into passing runs:

import time

from openai import APIConnectionError, APITimeoutError, InternalServerError

def safe_api_call(client, *args, max_retries=3, **kwargs):
    """Make API call with retry logic."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(*args, **kwargs)
        except (APITimeoutError, APIConnectionError, InternalServerError):
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s
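
To see the backoff schedule without waiting on a real gateway, here is a self-contained sketch of the same pattern with the sleep function injected. The names `call_with_backoff` and `flaky` are hypothetical, and the built-in `TimeoutError` stands in for the SDK's timeout exception. With three attempts the waits are 1 second and then 2 seconds; the third failure would be re-raised.

```python
import time


def call_with_backoff(fn, retriable=(TimeoutError,), max_retries=3, sleep=time.sleep):
    """Call fn(); on a retriable error wait 2**attempt seconds, then try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable:
            if attempt == max_retries - 1:
                raise  # last attempt failed: let the caller see the error
            sleep(2 ** attempt)


# Demo: fail twice, then succeed; record the waits instead of actually sleeping.
waits = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated gateway timeout")
    return "ok"


outcome = call_with_backoff(flaky, sleep=waits.append)
print(outcome, waits)  # ok [1, 2]
```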

Result caching. Running the same eval prompts against the same responses burns tokens for no reason. An MD5-keyed cache (hashing query + context + response) saved both cost and latency on re-runs:

import hashlib
import json
import os

class CachingEvaluator:
    """Wrapper for evaluators that caches results."""
    
    def __init__(self, evaluator, cache_file=None):
        self.evaluator = evaluator
        self.cache_file = cache_file or f"{type(evaluator).__name__}_cache.json"
        self.cache = self._load_cache()
    
    def _load_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        return {}
    
    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)
    
    def __call__(self, response, context=None, query=None):
        key = hashlib.md5(f"{query}|{context}|{response}".encode()).hexdigest()
        
        if key in self.cache:
            return self.cache[key]
        
        result = self.evaluator(response, context, query)
        self.cache[key] = result
        self._save_cache()
        
        return result
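
One caveat with joining the three fields using "|": two different inputs can collapse to the same string if a field itself contains that character. A possible variation, sketched here as an alternative and not what the cache above does, serialises the fields as a JSON array before hashing and swaps MD5 for SHA-256. The function name `cache_key` is hypothetical.

```python
import hashlib
import json


def cache_key(query, context, response):
    """Hash the three fields as a JSON array so field boundaries stay unambiguous."""
    payload = json.dumps([query, context, response], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# With a "|" join, both of these inputs would become the string "a|b|c|d".
k1 = cache_key("a|b", "c", "d")
k2 = cache_key("a", "b|c", "d")
print(k1 != k2)  # True
```

Existing cache files keyed with the MD5 scheme would not be readable under this key, so switching means starting from an empty cache.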

What we got

After running the full eval battery across our research agent’s test scenarios, the numbers looked like this:

{
  "summary": {
    "groundedness": {
      "average_score": 4.7,
      "pass_rate": 0.95,
      "threshold": 3.5
    },
    "relevance": {
      "average_score": 4.5,
      "pass_rate": 0.92,
      "threshold": 3.5
    },
    "contextual_precision": {
      "average_score": 4.2,
      "pass_rate": 0.89,
      "threshold": 3.0
    },
    "faithfulness": {
      "average_score": 4.6,
      "pass_rate": 0.94,
      "threshold": 3.5
    },
    "fluency": {
      "average_score": 4.8,
      "pass_rate": 0.98,
      "threshold": 3.0
    },
    "overall_pass_rate": 0.91
  }
}

91% overall pass rate. Contextual precision was the weakest link — the agent sometimes pulled in tangentially relevant information when the question demanded laser focus. That became the next thing to fix.


What I’d tell someone starting this tomorrow

Test one evaluator against your gateway before you try to orchestrate five. A single 401 is a five-minute fix. Five evaluators all throwing 401 inside a framework you don’t control is an afternoon.

Log authentication failures verbosely. The SDK’s default error messages won’t tell you which header is missing — you’ll need to inspect the raw HTTP response to see what APIM is actually rejecting.
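As a sketch of what verbose logging could look like: the helper below condenses a rejected response into one log line and redacts credential-bearing headers in case the gateway echoes them back. The helper name, logger name, and redaction list are assumptions, and the demo values are made up. With the v1 OpenAI Python SDK, the inputs would typically come from a caught `openai.APIStatusError`, via `err.status_code`, `err.response.headers`, and `err.response.text`.

```python
import logging

logger = logging.getLogger("apim_eval")  # hypothetical logger name

# Header names whose values should never reach a log file.
SENSITIVE = {"authorization", "api-key", "ocp-apim-subscription-key"}


def describe_rejection(status_code, headers, body, max_body=300):
    """Summarise a rejected gateway response in one line, with secrets redacted."""
    lowered = {str(k).lower(): str(v) for k, v in headers.items()}
    safe = {k: ("<redacted>" if k in SENSITIVE else v) for k, v in lowered.items()}
    challenge = safe.get("www-authenticate", "<absent>")
    return (
        f"status={status_code} www-authenticate={challenge!r} "
        f"headers={safe} body={body[:max_body]!r}"
    )


message = describe_rejection(
    401,
    {"WWW-Authenticate": 'Bearer error="invalid_token"', "api-key": "secret-value"},
    '{"statusCode": 401, "message": "Access denied"}',
)
logger.warning(message)
```

The WWW-Authenticate header and the response body are usually where a gateway says what it objected to, which is why the sketch pulls them to the front of the line.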

If the SDK’s orchestration doesn’t fit, walk away from it. The manual evaluation loop here is under a hundred lines. It does exactly what you need, it’s trivial to debug, and it doesn’t fight you on headers.

For the Azure AI Evaluation team, three things would help: documentation that explicitly covers APIM-protected endpoints, SDK support for Azure AD token auth with custom headers, and cleaner extension points for custom evaluators so developers don’t have to bypass the orchestration layer entirely.

The quality of an AI system isn’t just about what the model can do. It’s about whether you can reliably measure what it actually does. That measurement has to survive the real architecture your service runs behind — not just the demo path.