run_vlm_test_suite_batch

Active

Tool of IA-QA — 130+ QA & Dev Tools for AI Agents

declared in 1.0.0

Compare multiple VLMs on the same test suite in parallel: send one image (URL or base64) plus N test cases to all selected models simultaneously. Returns per-model PASS/FAIL verdicts, pass rates, latency stats, and a comparison table. Assertion types: contains, not_contains, json_format, min_length, max_length, semantic_contains. BYOK (bring your own key): requires an API key for each provider.
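The listing names the assertion types and a pass-rate threshold but does not define how they are evaluated. The Python sketch below shows one plausible reading, for illustration only. The function names check_assertion and verdict are hypothetical, and the case-insensitive matching, character-count lengths, and "pass rate >= threshold" rule are assumptions, not documented behavior. semantic_contains is left out because it would presumably need a model or embedding comparison.

```python
import json


def check_assertion(response: str, assertion_type: str, assertion_value: str = "") -> bool:
    """Hypothetical local reading of the assertion types; not the tool's own code."""
    if assertion_type == "contains":
        # Assumption: substring match, ignoring case.
        return assertion_value.lower() in response.lower()
    if assertion_type == "not_contains":
        return assertion_value.lower() not in response.lower()
    if assertion_type == "json_format":
        # Passes when the whole response parses as JSON; no assertion_value needed.
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    if assertion_type == "min_length":
        # Assumption: length is counted in characters.
        return len(response) >= int(assertion_value)
    if assertion_type == "max_length":
        return len(response) <= int(assertion_value)
    # semantic_contains is intentionally not sketched here.
    raise ValueError(f"unsupported assertion type: {assertion_type}")


def verdict(results: list[bool], threshold: float = 80.0) -> tuple[float, str]:
    """Turn per-case results into a pass rate (0-100) and an overall verdict."""
    pass_rate = 100.0 * sum(results) / len(results) if results else 0.0
    # Assumption: meeting the threshold exactly counts as a pass.
    return pass_rate, "PASS" if pass_rate >= threshold else "FAIL"
```

Under these assumptions, a model that passes 3 of 4 cases has a pass rate of 75 and fails the default threshold of 80.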

Parameters schema

{
  "type": "object",
  "required": [
    "test_cases",
    "models",
    "api_keys"
  ],
  "properties": {
    "models": {
      "type": "array",
      "items": {
        "enum": [
          "gpt-4o",
          "gpt-4o-mini",
          "claude-3-5-sonnet-20241022",
          "claude-3-5-haiku-20241022",
          "gemini-1.5-flash",
          "gemini-2.0-flash"
        ],
        "type": "string"
      },
      "maxItems": 6,
      "minItems": 1,
      "description": "Array of model IDs to compare (runs in parallel)."
    },
    "api_keys": {
      "type": "object",
      "description": "Map of model ID → API key. Example: { \"gpt-4o\": \"sk-...\", \"claude-3-5-sonnet-20241022\": \"sk-ant-...\" }",
      "additionalProperties": {
        "type": "string"
      }
    },
    "image_url": {
      "type": "string",
      "description": "Public URL of the image to evaluate (required unless image_base64 is provided)."
    },
    "threshold": {
      "type": "number",
      "description": "Pass rate threshold for overall verdict (default: 80, 0–100)."
    },
    "test_cases": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "question"
        ],
        "properties": {
          "id": {
            "type": "string",
            "description": "Optional identifier for this case."
          },
          "question": {
            "type": "string",
            "description": "Question to ask the VLM about the image."
          },
          "assertion_type": {
            "enum": [
              "contains",
              "not_contains",
              "json_format",
              "min_length",
              "max_length",
              "semantic_contains"
            ],
            "type": "string",
            "description": "Assertion to run on the VLM response."
          },
          "assertion_value": {
            "type": "string",
            "description": "Expected value for the assertion (not needed for json_format)."
          }
        }
      },
      "maxItems": 10,
      "description": "Array of test cases to run against every model."
    },
    "image_base64": {
      "type": "string",
      "description": "Base64-encoded image data (required unless image_url is provided)."
    },
    "system_prompt": {
      "type": "string",
      "description": "Optional system prompt sent to every VLM."
    },
    "image_mime_type": {
      "type": "string",
      "description": "MIME type of the image if using image_base64 (default: image/jpeg)."
    }
  }
}
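As a rough illustration of the schema above, the following Python sketch assembles an arguments object and checks the constraints the schema states (required fields, the model enum, 1 to 6 models, at most 10 test cases, and one of image_url or image_base64). The helper name build_arguments is hypothetical, the check that every selected model has an entry in api_keys is an assumption drawn from the description, and how the resulting object is sent to the server depends on the client in use.

```python
ALLOWED_MODELS = {
    "gpt-4o",
    "gpt-4o-mini",
    "claude-3-5-sonnet-20241022",
    "claude-3-5-haiku-20241022",
    "gemini-1.5-flash",
    "gemini-2.0-flash",
}


def build_arguments(models, api_keys, test_cases,
                    image_url=None, image_base64=None, threshold=None):
    """Hypothetical helper: build and sanity-check tool arguments."""
    if not 1 <= len(models) <= 6:
        raise ValueError("models must contain between 1 and 6 entries")
    unknown = [m for m in models if m not in ALLOWED_MODELS]
    if unknown:
        raise ValueError(f"unsupported models: {unknown}")
    # Assumption: api_keys needs one entry per selected model ID.
    missing = [m for m in models if m not in api_keys]
    if missing:
        raise ValueError(f"missing API keys for: {missing}")
    if len(test_cases) > 10:
        raise ValueError("at most 10 test cases are allowed")
    if any("question" not in case for case in test_cases):
        raise ValueError("every test case needs a question")
    if not image_url and not image_base64:
        raise ValueError("provide image_url or image_base64")

    args = {"models": list(models), "api_keys": dict(api_keys),
            "test_cases": list(test_cases)}
    if image_url:
        args["image_url"] = image_url
    if image_base64:
        args["image_base64"] = image_base64
    if threshold is not None:
        args["threshold"] = threshold
    return args


# Example with placeholder keys and a placeholder image URL.
example = build_arguments(
    models=["gpt-4o", "gemini-2.0-flash"],
    api_keys={"gpt-4o": "sk-...", "gemini-2.0-flash": "..."},
    test_cases=[
        {"id": "color", "question": "What color is the car?",
         "assertion_type": "contains", "assertion_value": "red"},
        {"id": "json", "question": "Describe the image as JSON.",
         "assertion_type": "json_format"},
    ],
    image_url="https://example.com/car.jpg",
    threshold=80,
)
```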

What this tool wraps: 0 endpoints

No endpoints wrapped at confidence ≥ 0.50.

Parent server

IA-QA — 130+ QA & Dev Tools for AI Agents

https://github.com/jcjamet/ia-qa

1/7 registries