analyze_responses
ActiveTool of IA-QA — 130+ QA & Dev Tools for AI Agents
Semantically analyzes N already-produced model outputs for the same task (the MCP counterpart to the LLM Sandbox). Without a reference, it computes consensus: pairwise cosine agreement, the most representative output, and the outlier. With a `reference` (ground truth), it also ranks every output by closeness (a composite of token cosine and ROUGE-L) and names the closest. The analysis is deterministic and needs no LLM and no API key, so it can be used as a CI gate. You supply the outputs (at least 2). For a two-way head-to-head with a structural JSON diff, use compare_responses instead.
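The consensus and reference-ranking steps described above can be sketched roughly as follows. This is an illustrative approximation, not the tool's actual implementation. The tokenizer, the use of ROUGE-L F1, the unweighted average used as the composite, and the names `analyze`, `cosine`, and `rouge_l` are all assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    # Assumed tokenization: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between bag-of-words count vectors.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def lcs_len(a, b):
    # Length of the longest common subsequence (row-by-row dynamic programming).
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    # ROUGE-L as the F1 of LCS precision and recall (assumed variant).
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)


def analyze(responses, reference=None):
    toks = [tokens(r["text"]) for r in responses]
    labels = [r.get("label", f"#{i}") for i, r in enumerate(responses)]
    n = len(toks)
    # Mean pairwise agreement of each output with all the others.
    mean = [
        sum(cosine(toks[i], toks[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    result = {
        "representative": labels[max(range(n), key=mean.__getitem__)],
        "outlier": labels[min(range(n), key=mean.__getitem__)],
    }
    if reference is not None:
        ref = tokens(reference)
        # Assumed composite: unweighted mean of token cosine and ROUGE-L.
        score = [(cosine(t, ref) + rouge_l(t, ref)) / 2 for t in toks]
        order = sorted(range(n), key=score.__getitem__, reverse=True)
        result["ranking"] = [labels[i] for i in order]
        result["closest"] = labels[order[0]]
    return result


sample = [
    {"label": "a", "text": "Paris is the capital of France"},
    {"label": "b", "text": "The capital of France is Paris"},
    {"label": "c", "text": "Bananas are yellow"},
]
print(analyze(sample, reference="The capital of France is Paris"))
```

Because every step is plain arithmetic over token counts, the same inputs always give the same result, which is what makes this kind of check usable as a CI gate.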
Parameters schema
{
  "type": "object",
  "required": [
    "responses"
  ],
  "properties": {
    "reference": {
      "type": "string",
      "description": "Optional ground-truth answer. If set, each output is also ranked by closeness to it and the closest one is named."
    },
    "responses": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "text"
        ],
        "properties": {
          "text": {
            "type": "string",
            "description": "The produced output"
          },
          "label": {
            "type": "string",
            "description": "Human name for this candidate (e.g. model id)"
          }
        }
      },
      "minItems": 2,
      "description": "The outputs to analyze (same task, N models/prompts/versions). Each item is a plain string or { \"label\": \"GPT-4o\", \"text\": \"...\" }. At least 2 required."
    }
  }
}
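As a rough illustration of the schema above, the following sketch builds an arguments object in the object form (`text` required, `label` optional) and enforces the `minItems: 2` constraint before anything is sent. `build_arguments` is a hypothetical helper, not part of IA-QA, and how the arguments are delivered to the tool depends on your MCP client.

```python
import json


def build_arguments(responses, reference=None):
    # Hypothetical helper: shapes (label, text) pairs into the documented arguments object.
    items = []
    for label, text in responses:
        if not isinstance(text, str):
            raise TypeError("each response needs a string 'text'")
        item = {"text": text}
        if label is not None:
            item["label"] = label  # optional per the schema
        items.append(item)
    if len(items) < 2:
        raise ValueError("'responses' needs at least 2 items (minItems: 2)")
    args = {"responses": items}
    if reference is not None:
        args["reference"] = reference  # optional ground truth
    return args


args = build_arguments(
    [
        ("model-a", "Paris is the capital of France."),
        ("model-b", "The capital is Paris."),
    ],
    reference="Paris",
)
print(json.dumps(args, indent=2))
```

Leaving `reference` out gives the consensus-only mode; passing it adds the closeness ranking.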
Parent server
IA-QA — 130+ QA & Dev Tools for AI Agents
https://github.com/jcjamet/ia-qa