analyze_responses

Active

Tool of IA-QA — 130+ QA & Dev Tools for AI Agents

declared in 1.0.0

Semantically analyzes N already-produced model outputs for the SAME task (the MCP counterpart to the LLM Sandbox). You supply the outputs (at least 2). Without a reference, it computes consensus: pairwise cosine agreement, the most representative output, and the outlier. With a `reference` (ground truth), it also ranks every output by closeness (a composite of token cosine and ROUGE-L) and names the closest. Deterministic, with no LLM and no API key, so it can be gated in CI. For a 2-way head-to-head with a structural JSON diff, use compare_responses instead.
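The two modes described above can be sketched roughly as follows. This is an illustration only, not the tool's implementation: it assumes bag-of-words cosine over lowercased word tokens, ROUGE-L computed as an F1 over the longest common subsequence, and an equal-weight composite. The listing confirms none of these details, and all function names here are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokens(text):
    # Lowercased word tokens; the tool's real tokenizer is an unknown (assumption).
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words count vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rouge_l(candidate, reference):
    # ROUGE-L F1 based on the longest common subsequence of tokens.
    c, r = tokens(candidate), tokens(reference)
    if not c or not r:
        return 0.0
    prev = [0] * (len(r) + 1)
    for tok in c:
        cur = [0]
        for j, ref_tok in enumerate(r, start=1):
            cur.append(prev[j - 1] + 1 if tok == ref_tok else max(prev[j], cur[j - 1]))
        prev = cur
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def consensus(texts):
    # Mean pairwise agreement of each output with all the others.
    n = len(texts)
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = cosine(texts[i], texts[j])
        totals[i] += s
        totals[j] += s
    means = [t / (n - 1) for t in totals]
    return {
        "mean_agreement": means,
        "most_representative": max(range(n), key=means.__getitem__),
        "outlier": min(range(n), key=means.__getitem__),
    }


def rank_by_reference(texts, reference, weight=0.5):
    # Equal weighting of cosine and ROUGE-L is an assumption, not documented.
    scores = [
        weight * cosine(t, reference) + (1 - weight) * rouge_l(t, reference)
        for t in texts
    ]
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
```

Because every step is plain arithmetic over token counts, the same inputs always give the same result, which is what makes a check like this usable as a CI gate.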

Parameters schema

{
  "type": "object",
  "required": [
    "responses"
  ],
  "properties": {
    "reference": {
      "type": "string",
      "description": "Optional ground-truth answer. If set, each output is also ranked by closeness to it and the closest one is named."
    },
    "responses": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "text"
        ],
        "properties": {
          "text": {
            "type": "string",
            "description": "The produced output"
          },
          "label": {
            "type": "string",
            "description": "Human name for this candidate (e.g. model id)"
          }
        }
      },
      "minItems": 2,
      "description": "The outputs to analyze (same task, N models/prompts/versions). Each item is a plain string or { \"label\": \"GPT-4o\", \"text\": \"...\" }. At least 2 required."
    }
  }
}
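A minimal sketch of building an arguments object that satisfies the schema above. The helper name `build_arguments` and the sample texts are hypothetical. Note that the `responses` description mentions plain strings while the schema declares each item as an object, so this sketch wraps bare strings as `{"text": ...}` to stay on the safe side.

```python
import json


def build_arguments(responses, reference=None):
    # Hypothetical helper: normalizes inputs into the object form the schema declares.
    items = []
    for r in responses:
        item = {"text": r} if isinstance(r, str) else dict(r)
        if not isinstance(item.get("text"), str):
            raise ValueError("each response needs a string 'text'")
        items.append(item)
    if len(items) < 2:
        # Mirrors "minItems": 2 in the schema.
        raise ValueError("at least 2 responses are required")
    args = {"responses": items}
    if reference is not None:
        # Optional ground truth; enables ranking by closeness.
        args["reference"] = reference
    return args


args = build_arguments(
    [
        {"label": "model-a", "text": "Paris is the capital of France."},
        "The capital of France is Paris.",
    ],
    reference="Paris",
)
print(json.dumps(args, indent=2))
```

The resulting object is what would be passed as the tool's arguments; how the call is sent depends on the MCP client in use and is not shown here.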

What this tool wraps · 0 endpoints

No endpoints wrapped at confidence ≥ 0.50.

Parent server

IA-QA — 130+ QA & Dev Tools for AI Agents

https://github.com/jcjamet/ia-qa

1/7 registries
