Commit 240330cf3b by Josh Knapp (2025-12-12): Refactor to use LiteLLM Responses API for automatic MCP tool execution
Major refactoring to properly integrate with LiteLLM's Responses API, which handles
MCP tool execution automatically instead of requiring manual tool call loops.

Key changes:
- Switched from chat.completions.create() to client.responses.create()
- Use "server_url": "litellm_proxy" to leverage LiteLLM as MCP gateway
- Set "require_approval": "never" for fully automatic tool execution
- Simplified get_available_mcp_tools() to get_available_mcp_servers()
- Removed manual OpenAI tool format conversion (LiteLLM handles this)
- Updated response extraction to use output[0].content[0].text format
- Convert system prompts to user role for Responses API compatibility

Technical improvements:
- LiteLLM now handles the complete tool calling loop automatically
- No more placeholder responses - actual MCP tools will execute
- Cleaner code with ~100 fewer lines
- Better separation between tools-enabled and tools-disabled paths
- Proper error handling for Responses API format

Responses API benefits:
- Single API call returns final response with tool results integrated
- Automatic tool discovery, execution, and result formatting
- No manual tracking of tool_call_ids or conversation state
- Native MCP support via server_label configuration

Documentation:
- Added comprehensive litellm-mcp-research.md with API examples
- Documented Responses API vs chat.completions differences
- Included Discord bot migration patterns
- Covered authentication, streaming, and tool restrictions

Next steps:
- Test with actual Discord interactions
- Verify GitHub MCP tools execute correctly
- Monitor response extraction for edge cases


LiteLLM Responses API with MCP tool integration

LiteLLM's /v1/responses endpoint enables automatic MCP tool execution through a single API call, eliminating the manual tool-calling loop required with chat.completions. When configured with "require_approval": "never", LiteLLM handles tool discovery, execution, and response integration automatically—making Discord bot migration straightforward. The key differences from chat.completions are the input parameter (replacing messages) and native MCP tool support via a "type": "mcp" tool specification.

Request and response format for /v1/responses

The Responses API (available in LiteLLM 1.63.8+) uses input instead of messages. The input parameter accepts either a simple string or an array of message objects:

# Simple string input
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input="What is the weather today?"
)

# Array format (for multi-turn conversations)
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "Tell me about Python"}
    ]
)

Response structure differs significantly from chat.completions. Instead of choices[0].message.content, responses use an output array:

{
    "id": "resp_abc123",
    "object": "response",
    "created_at": 1734366691,
    "status": "completed",
    "model": "claude-3-5-sonnet-latest",
    "output": [
        {
            "type": "message",
            "id": "msg_abc123",
            "status": "completed",
            "role": "assistant",
            "content": [
                {
                    "type": "output_text",
                    "text": "Here is the response text...",
                    "annotations": []
                }
            ]
        }
    ],
    "usage": {"input_tokens": 18, "output_tokens": 98, "total_tokens": 116}
}

To extract text: response.output[0].content[0].text
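
This index path works for plain text replies, but when MCP tools run, the output array can also contain non-message items (such as the mcp_approval_request entries shown later), so a defensive sketch scans for the first message item instead of assuming output[0]:

def extract_text(response) -> str:
    """Return the first output_text found, skipping non-message output items."""
    for item in response.output:
        if item.type == "message":
            for part in item.content:
                if part.type == "output_text":
                    return part.text
    return ""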

MCP tool specification format

MCP tools use "type": "mcp" with three critical parameters: server_label, server_url, and require_approval. The special value "server_url": "litellm_proxy" tells LiteLLM to act as an MCP gateway, handling all tool execution internally:

tools=[
    {
        "type": "mcp",
        "server_label": "my_mcp_server",      # Identifier for the MCP server
        "server_url": "litellm_proxy",         # LiteLLM handles MCP bridging
        "require_approval": "never",           # Automatic execution
        "allowed_tools": ["tool1", "tool2"]    # Optional: restrict available tools
    }
]

| Parameter | Purpose |
|---|---|
| server_label | Identifies which configured MCP server to use (must match config.yaml) |
| server_url | "litellm_proxy" for the LiteLLM gateway, or a direct URL such as "https://mcp.example.com/mcp" |
| require_approval | "never" for automatic execution; omit for the approval-based flow |
| allowed_tools | Optional whitelist of tool names to make available |

When server_url="litellm_proxy", LiteLLM performs a four-step automatic flow: (1) fetches MCP tools and converts to OpenAI format, (2) sends tools to the LLM with your input, (3) executes any tool calls against MCP servers, and (4) returns the final response with tool results integrated.
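
Putting the four steps together, one request suffices. A minimal sketch, assuming an MCP server registered in LiteLLM's config.yaml under the label github_mcp:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-your-litellm-key")

response = client.responses.create(
    model="gpt-4o",
    input="List the repositories I have access to",
    tools=[{
        "type": "mcp",
        "server_label": "github_mcp",   # must match a server in config.yaml
        "server_url": "litellm_proxy",  # LiteLLM executes tool calls itself
        "require_approval": "never"
    }]
)
print(response.output[0].content[0].text)  # final answer, tool results included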

Streaming versus non-streaming responses

For non-streaming, pass stream=False (default) and receive the complete response object:

response = client.responses.create(
    model="gpt-4o",
    input="Hello",
    stream=False
)
text = response.output[0].content[0].text

For streaming, set stream=True and iterate over events:

stream = client.responses.create(
    model="gpt-4o",
    input="Write a poem",
    stream=True
)

full_text = ""
for event in stream:
    if hasattr(event, 'type'):
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            full_text += event.delta
        elif event.type == "response.completed":
            print("\n--- Done ---")

Key streaming event types include response.created, response.output_text.delta (incremental text), response.output_text.done, and response.completed.
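
A compact dispatcher over these events might look like the sketch below; treat the exact payload attributes (event.response.id, event.delta) as assumptions to verify against your LiteLLM version:

def consume_stream(stream) -> str:
    """Accumulate streamed text while reacting to lifecycle events."""
    chunks = []
    for event in stream:
        event_type = getattr(event, "type", None)
        if event_type == "response.created":
            print(f"response started: {event.response.id}")
        elif event_type == "response.output_text.delta":
            chunks.append(event.delta)  # incremental text
        elif event_type == "response.output_text.done":
            pass  # one output item finished streaming
        elif event_type == "response.completed":
            print("response completed")
    return "".join(chunks)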

Python SDK differences between responses.create() and chat.completions.create()

| Aspect | responses.create() | chat.completions.create() |
|---|---|---|
| Input parameter | input (string or array) | messages (array required) |
| Response access | response.output[0].content[0].text | response.choices[0].message.content |
| Conversation history | Built-in via previous_response_id | Manual message array management |
| MCP tools | Native "type": "mcp" support | Standard function calling only |
| Endpoint | /v1/responses | /v1/chat/completions |

Client setup is identical for both APIs:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # Your LiteLLM proxy
    api_key="sk-your-litellm-key"
)

# Responses API
response = client.responses.create(model="gpt-4o", input="Hello")

# Chat Completions API (old way)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Conversation history with the input parameter

Unlike chat.completions where you manually pass the full message history each time, the Responses API offers two approaches:

Option 1: Use previous_response_id for automatic context (recommended):

# First message
response1 = client.responses.create(model="gpt-4o", input="My name is Alice")

# Follow-up with context preserved automatically
response2 = client.responses.create(
    model="gpt-4o",
    input="What's my name?",
    previous_response_id=response1.id  # LiteLLM maintains context
)

Option 2: Pass full history in input array (manual approach):

response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "user", "content": "My name is Alice"},
        {"role": "assistant", "content": "Nice to meet you, Alice!"},
        {"role": "user", "content": "What's my name?"}
    ]
)

The input array supports roles: user, assistant, developer (replaces system in newer models), and tool.
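
For example, system-style instructions can be carried as a developer message; if a provider rejects that role, downgrade the instructions to a user message, which is the fallback this refactor uses:

response = client.responses.create(
    model="gpt-4o",
    input=[
        # "developer" replaces "system" on newer models; use "user" as a
        # fallback if the provider rejects the role
        {"role": "developer", "content": "You are a concise Discord assistant."},
        {"role": "user", "content": "Explain MCP in one sentence."}
    ]
)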

The require_approval parameter and MCP options

require_approval: "never" enables fully automatic tool execution—LiteLLM returns the final response in a single API call:

response = client.responses.create(
    model="gpt-4o",
    input="Search for Python documentation",
    tools=[{
        "type": "mcp",
        "server_label": "search_server",
        "server_url": "litellm_proxy",
        "require_approval": "never"  # No approval needed
    }]
)
# Response includes tool results integrated into final answer

Without require_approval: "never", you get an approval flow requiring two API calls:

# Step 1: Get approval request
response = client.responses.create(
    model="gpt-4o",
    input="Search for docs",
    tools=[{"type": "mcp", "server_label": "search", "server_url": "litellm_proxy"}]
)

# Extract approval request ID from response.output
approval_id = None
for output in response.output:
    if output.type == "mcp_approval_request":
        approval_id = output.id
        break

# Step 2: Approve and get final response
final_response = client.responses.create(
    model="gpt-4o",
    input=[{"type": "mcp_approval_response", "approve": True, "approval_request_id": approval_id}],
    previous_response_id=response.id,
    tools=[{"type": "mcp", "server_label": "search", "server_url": "litellm_proxy"}]
)

Restricting tools with allowed_tools

You can control which MCP tools are available either per request or at the server configuration level:

Request-level restriction (per-call):

tools=[{
    "type": "mcp",
    "server_label": "github_mcp",
    "server_url": "litellm_proxy",
    "require_approval": "never",
    "allowed_tools": ["list_repos", "get_file_contents"]  # Only these tools available
}]

Server-level restriction (in config.yaml):

mcp_servers:
  github_mcp:
    url: "https://api.github.com/mcp"
    allowed_tools: ["list_repos", "get_file_contents"]   # Whitelist
    disallowed_tools: ["delete_repo", "force_push"]      # Blacklist

If both allowed_tools and disallowed_tools are specified, allowed_tools takes priority.

Authentication headers

LiteLLM supports multiple authentication header formats:

| Header | Use case |
|---|---|
| Authorization: Bearer sk-... | Standard; sent by the OpenAI SDK automatically |
| x-litellm-api-key: Bearer sk-... | MCP connections and custom scenarios |
| api-key: ... | Azure OpenAI compatibility |

For standard API calls (Discord bot), use the OpenAI SDK default:

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-your-key"  # Sent as "Authorization: Bearer sk-your-key"
)

For MCP tool headers (when calling external MCP servers), use the headers parameter:

tools=[{
    "type": "mcp",
    "server_label": "github",
    "server_url": "litellm_proxy",
    "require_approval": "never",
    "headers": {
        "x-litellm-api-key": "Bearer sk-your-litellm-key",
        "x-mcp-github-authorization": "Bearer ghp_your_github_token"
    }
}]

Complete Discord bot migration example

Here's a full implementation pattern for migrating from chat.completions to responses with MCP:

from openai import OpenAI
import os

class LiteLLMResponsesClient:
    """Client wrapper for Discord bot using LiteLLM Responses API with MCP."""
    
    def __init__(self, proxy_url: str, api_key: str):
        self.client = OpenAI(base_url=proxy_url, api_key=api_key)
        self.conversations = {}  # user_id -> response_id mapping
    
    def get_mcp_tools(self, server_label: str = "default") -> list:
        """Define MCP tools configuration."""
        return [{
            "type": "mcp",
            "server_label": server_label,
            "server_url": "litellm_proxy",
            "require_approval": "never",
            "allowed_tools": ["search", "fetch_data", "analyze"]  # Customize as needed
        }]
    
    def chat(
        self,
        user_id: str,
        message: str,
        model: str = "anthropic/claude-3-5-sonnet-latest",
        use_mcp_tools: bool = True,
        stream: bool = False
    ):
        """Send a message and get response, with optional MCP tools and streaming."""
        
        previous_id = self.conversations.get(user_id)
        
        kwargs = {
            "model": model,
            "input": message,
            "stream": stream
        }
        
        if previous_id:
            kwargs["previous_response_id"] = previous_id
        
        if use_mcp_tools:
            kwargs["tools"] = self.get_mcp_tools()
            kwargs["tool_choice"] = "auto"
        
        if stream:
            return self._handle_stream(user_id, **kwargs)
        else:
            response = self.client.responses.create(**kwargs)
            self.conversations[user_id] = response.id
            return self._extract_text(response)
    
    def _handle_stream(self, user_id: str, **kwargs):
        """Generator for streaming responses."""
        stream = self.client.responses.create(**kwargs)
        response_id = None
        
        for event in stream:
            if hasattr(event, 'type'):
                if event.type == "response.created":
                    response_id = event.response.id
                elif event.type == "response.output_text.delta":
                    yield event.delta
        
        if response_id:
            self.conversations[user_id] = response_id
    
    def _extract_text(self, response) -> str:
        """Extract text from Responses API response."""
        for output in response.output:
            if output.type == "message":
                for content in output.content:
                    if content.type == "output_text":
                        return content.text
        return ""
    
    def clear_history(self, user_id: str):
        """Clear conversation history for a user."""
        self.conversations.pop(user_id, None)


# Discord bot integration example (py-cord style discord.Bot)
import asyncio
import discord

intents = discord.Intents.default()
intents.message_content = True  # required for the bot to read message text

bot = discord.Bot(intents=intents)
llm_client = LiteLLMResponsesClient(
    proxy_url=os.environ["LITELLM_PROXY_URL"],
    api_key=os.environ["LITELLM_API_KEY"]
)

@bot.event
async def on_message(message):
    if message.author.bot:
        return
    
    if bot.user.mentioned_in(message):
        user_id = str(message.author.id)
        user_message = message.content.replace(f'<@{bot.user.id}>', '').strip()
        
        # Non-streaming response with MCP tools; the OpenAI SDK call is
        # blocking, so run it off the event loop to keep the bot responsive
        response_text = await asyncio.to_thread(
            llm_client.chat,
            user_id=user_id,
            message=user_message,
            use_mcp_tools=True
        )
        await message.reply(response_text)

# Run: bot.run(os.environ["DISCORD_TOKEN"])

Key migration considerations

The Responses API is marked as BETA in LiteLLM. Ensure you're running LiteLLM 1.63.8+ and using OpenAI SDK 1.66.1+ for full compatibility. Model names must include the provider prefix (e.g., openai/gpt-4o, anthropic/claude-3-5-sonnet-latest). Response IDs are encrypted per-user by default for security—users cannot access other users' conversation history unless you disable this with disable_responses_id_security: true in config.yaml.
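
A hypothetical config.yaml fragment tying these settings together; the placement of disable_responses_id_security under general_settings is an assumption to verify against the LiteLLM docs:

mcp_servers:
  github_mcp:
    url: "https://api.github.com/mcp"

general_settings:
  disable_responses_id_security: true  # assumed location; only if cross-user
                                       # access to response IDs is acceptable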

The primary advantage for Discord bots is the automatic MCP tool execution loop. With chat.completions, you must manually detect tool calls, execute them, and send results back. With Responses API and require_approval: "never", LiteLLM handles this entire flow internally, returning the final integrated response in a single call.
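
For contrast, here is the condensed shape of the manual loop that chat.completions requires; execute_mcp_tool and openai_format_tools are hypothetical stand-ins for the code this refactor deleted:

import json

messages = [{"role": "user", "content": "Search for Python docs"}]
response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=openai_format_tools
)
message = response.choices[0].message

while message.tool_calls:  # loop until the model stops requesting tools
    messages.append(message)
    for call in message.tool_calls:
        result = execute_mcp_tool(  # hypothetical executor
            call.function.name, json.loads(call.function.arguments)
        )
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result)
        })
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=openai_format_tools
    )
    message = response.choices[0].message

final_text = message.content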