jailbreak_attempt_detector

Active

declared in 0.2.0

Detects potential LLM jailbreak attempts by analyzing user input against NIST AI Risk Management Framework adversarial patterns. Designed for persona risk assessment, this tool evaluates text for common jailbreak techniques such as prompt injection, role-playing, or obfuscation. Inputs include the user message and optional context, returning a risk assessment with confidence scores and pattern matches. Ideal for real-time moderation in chat applications or API gateways.

Parameters schema

{
  "type": "object",
  "required": [
    "message"
  ],
  "properties": {
    "async": {
      "type": "boolean",
      "description": "If true, returns a job_id immediately (<200ms) instead of waiting for the result. Poll the result with job_result(job_id). Use for slow tools to avoid client timeouts."
    },
    "context": {
      "type": "string",
      "description": "Optional conversation context for better pattern matching"
    },
    "message": {
      "type": "string",
      "description": "User input text to analyze for jailbreak attempts"
    },
    "threshold": {
      "type": "number",
      "default": 0.7,
      "maximum": 1,
      "minimum": 0,
      "description": "Confidence threshold for flagging attempts"
    }
  }
}

What this tool wraps· 0 endpoints

min confidence0.70 0.50

No endpoints wrapped at confidence ≥ 0.50.

Parent server

gapup-mcp

https://github.com/getgapup/gapup-mcp-public

2/7 registries

View full server →