compare_responses
ActiveTool of IA-QA — 130+ QA & Dev Tools for AI Agents
Compare two ALREADY-PRODUCED outputs (e.g. model A vs model B on the same task) side by side. Returns deterministic metrics (token cosine, ROUGE-L, Jaccard, length/structure deltas, JSON diff) and a verdict. If a `reference` (ground truth) is given, scores each output against it and picks the closer one. If `model` + `api_key` are given, an LLM judge also picks a qualitative winner for the task. No re-execution — you bring the outputs.
Parameters schema
{
"type": "object",
"required": [
"response_a",
"response_b"
],
"properties": {
"task": {
"type": "string",
"description": "The task/prompt both outputs were answering — used by the LLM judge for context"
},
"model": {
"type": "string",
"description": "Optional judge model id (BYOK). When set with api_key, an LLM judge picks a qualitative winner."
},
"api_key": {
"type": "string",
"description": "Optional API key for the judge model (BYOK). Used only for the judge call; never stored."
},
"label_a": {
"type": "string",
"description": "Label for output A (e.g. \"GPT-4o\", \"v1.0\")"
},
"label_b": {
"type": "string",
"description": "Label for output B (e.g. \"GPT-5-nano\", \"v1.1\")"
},
"reference": {
"type": "string",
"description": "Optional ground-truth / expected answer. If set, each output is scored against it and the closer one wins (deterministic)."
},
"check_json": {
"type": "boolean",
"description": "Try to parse as JSON and compare structurally (keys, types, values)"
},
"response_a": {
"type": "string",
"description": "First output (e.g. model A's answer)"
},
"response_b": {
"type": "string",
"description": "Second output (e.g. model B's answer)"
}
}
}No endpoints wrapped at confidence ≥ 0.50.
Parent server
IA-QA — 130+ QA & Dev Tools for AI Agents
https://github.com/jcjamet/ia-qa
1/7 registries