# LiteLLM Responses API with MCP tool integration

LiteLLM's `/v1/responses` endpoint can execute MCP tools automatically within a single API call, eliminating the manual tool-calling loop required with chat.completions. When a tool is configured with `"require_approval": "never"`, LiteLLM handles tool discovery, execution, and response integration itself, which makes migrating a Discord bot straightforward. The key differences from chat.completions are the `input` parameter (replacing `messages`) and native MCP tool support via a `"type": "mcp"` tool specification.

## Request and response format for /v1/responses

The Responses API (available in LiteLLM **1.63.8+**) uses `input` instead of `messages`. The `input` parameter accepts either a simple string or an array of message objects:

```python
from openai import OpenAI

# Point the OpenAI SDK at your LiteLLM proxy
client = OpenAI(base_url="http://localhost:4000", api_key="sk-your-litellm-key")

# Simple string input
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input="What is the weather today?"
)

# Array format (for multi-turn conversations)
response = client.responses.create(
    model="anthropic/claude-3-5-sonnet-latest",
    input=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "Tell me about Python"}
    ]
)
```

**Response structure** differs significantly from chat.completions. Instead of `choices[0].message.content`, responses use an `output` array:

```json
{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1734366691,
  "status": "completed",
  "model": "claude-3-5-sonnet-latest",
  "output": [
    {
      "type": "message",
      "id": "msg_abc123",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Here is the response text...",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {"input_tokens": 18, "output_tokens": 98, "total_tokens": 116}
}
```

To extract text from a plain message response: `response.output[0].content[0].text`. When tools run, `output` can contain non-message items ahead of the message, so it is safer to select the item whose `type` is `"message"` than to rely on index 0. The OpenAI Python SDK also exposes `response.output_text`, which concatenates all `output_text` parts.

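The extraction rule above can be sketched as a small function over the raw JSON shape shown earlier. This is a minimal illustration, not part of LiteLLM or the OpenAI SDK: `extract_output_text` is a hypothetical helper name, and the `mcp_call` item in the sample is only an assumed example of a non-message output item.

```python
from typing import Any


def extract_output_text(response: dict[str, Any]) -> str:
    """Concatenate every output_text part found in message items of a response dict."""
    parts: list[str] = []
    for item in response.get("output", []):
        # Skip non-message items such as tool-call records
        if item.get("type") != "message":
            continue
        for content in item.get("content", []):
            if content.get("type") == "output_text":
                parts.append(content.get("text", ""))
    return "".join(parts)


sample = {
    "output": [
        {"type": "mcp_call", "id": "mcp_1"},  # assumed example of a non-message item
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "Here is the response text...", "annotations": []}
            ],
        },
    ]
}
print(extract_output_text(sample))
```

Because it filters by `type` instead of indexing, the same function works whether or not tool items precede the assistant message.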
## MCP tool specification format

MCP tools use `"type": "mcp"` with three critical parameters: `server_label`, `server_url`, and `require_approval`. The special value `"server_url": "litellm_proxy"` tells LiteLLM to act as an MCP gateway, handling all tool execution internally:

```python
tools=[
    {
        "type": "mcp",
        "server_label": "my_mcp_server",       # Identifier for the MCP server
        "server_url": "litellm_proxy",         # LiteLLM handles MCP bridging
        "require_approval": "never",           # Automatic execution
        "allowed_tools": ["tool1", "tool2"]    # Optional: restrict available tools
    }
]
```

| Parameter | Purpose |
|-----------|---------|
| `server_label` | Identifies which configured MCP server to use (must match config.yaml) |
| `server_url` | `"litellm_proxy"` for LiteLLM gateway, or direct URL like `"https://mcp.example.com/mcp"` |
| `require_approval` | `"never"` for automatic execution; omit for approval-based flow |
| `allowed_tools` | Whitelist of tool names to make available |

When `server_url="litellm_proxy"`, LiteLLM performs a **four-step automatic flow**: (1) fetches MCP tools and converts them to OpenAI format, (2) sends the tools to the LLM with your input, (3) executes any tool calls against the MCP servers, and (4) returns the final response with tool results integrated.

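To keep tool specifications consistent across calls, the parameters in the table above can be assembled by one function. The sketch below is a hypothetical helper (`build_mcp_tool` is not a LiteLLM API); it only builds the dictionary and assumes the defaults used throughout this guide.

```python
from typing import Optional


def build_mcp_tool(
    server_label: str,
    server_url: str = "litellm_proxy",
    require_approval: Optional[str] = "never",
    allowed_tools: Optional[list[str]] = None,
) -> dict:
    """Build one MCP tool entry for the `tools` list (hypothetical helper)."""
    if not server_label:
        raise ValueError("server_label must be a non-empty string")
    tool = {"type": "mcp", "server_label": server_label, "server_url": server_url}
    # Omitting require_approval selects the approval-based flow
    if require_approval is not None:
        tool["require_approval"] = require_approval
    # Only include allowed_tools when a restriction was requested
    if allowed_tools:
        tool["allowed_tools"] = list(allowed_tools)
    return tool


tools = [build_mcp_tool("my_mcp_server", allowed_tools=["tool1", "tool2"])]
print(tools)
```

The resulting list can be passed as the `tools` argument of `client.responses.create(...)`.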
## Streaming versus non-streaming responses

For **non-streaming**, pass `stream=False` (default) and receive the complete response object:

```python
response = client.responses.create(
    model="gpt-4o",
    input="Hello",
    stream=False
)
text = response.output[0].content[0].text
```

For **streaming**, set `stream=True` and iterate over events:

```python
stream = client.responses.create(
    model="gpt-4o",
    input="Write a poem",
    stream=True
)

full_text = ""
for event in stream:
    if hasattr(event, 'type'):
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            full_text += event.delta
        elif event.type == "response.completed":
            print("\n--- Done ---")
```

Key streaming event types include `response.created`, `response.output_text.delta` (incremental text), `response.output_text.done`, and `response.completed`.

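The event handling above can be exercised without a running proxy by feeding it stand-in events. The sketch below assumes only the event type names listed in this section; `collect_stream_text` and the `SimpleNamespace` fake events are illustrative, not SDK objects.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(events: Iterable) -> tuple[str, bool]:
    """Return (accumulated text, whether response.completed was seen)."""
    chunks = []
    completed = False
    for event in events:
        event_type = getattr(event, "type", None)
        if event_type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event_type == "response.completed":
            completed = True
    return "".join(chunks), completed


# Stand-in events that mimic the attributes used above
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Roses "),
    SimpleNamespace(type="response.output_text.delta", delta="are red"),
    SimpleNamespace(type="response.output_text.done"),
    SimpleNamespace(type="response.completed"),
]
text, done = collect_stream_text(fake_events)
print(text, done)
```

Collecting deltas in a list and joining once avoids repeated string concatenation on long responses.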
## Python SDK differences between responses.create() and chat.completions.create()

| Aspect | `responses.create()` | `chat.completions.create()` |
|--------|---------------------|---------------------------|
| Input parameter | `input` (string or array) | `messages` (array required) |
| Response access | `response.output[0].content[0].text` | `response.choices[0].message.content` |
| Conversation history | Built-in via `previous_response_id` | Manual message array management |
| MCP tools | Native `"type": "mcp"` support | Standard function calling only |
| Endpoint | `/v1/responses` | `/v1/chat/completions` |

**Client setup** is identical for both APIs:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # Your LiteLLM proxy
    api_key="sk-your-litellm-key"
)

# Responses API
response = client.responses.create(model="gpt-4o", input="Hello")

# Chat Completions API (old way)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
```

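When migrating existing call sites, the parameter rename in the table above can be applied mechanically. The following is a rough sketch under stated assumptions: `chat_kwargs_to_responses_kwargs` is a hypothetical helper, it handles only `messages` and `max_tokens` (the Responses API names the latter `max_output_tokens`), and every other key is passed through unchanged, which may not be valid for all parameters. Function-calling `tools` definitions in particular need separate review.

```python
def chat_kwargs_to_responses_kwargs(chat_kwargs: dict) -> dict:
    """Translate a small subset of chat.completions kwargs to responses kwargs."""
    # Only these two renames are handled; other keys pass through as-is
    renames = {"messages": "input", "max_tokens": "max_output_tokens"}
    return {renames.get(key, key): value for key, value in chat_kwargs.items()}


old_kwargs = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}
new_kwargs = chat_kwargs_to_responses_kwargs(old_kwargs)
print(new_kwargs)
```

The converted dictionary can then be passed as `client.responses.create(**new_kwargs)`.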
## Conversation history with the input parameter

Unlike chat.completions, where you manually pass the full message history each time, the Responses API offers two approaches:

**Option 1: Use `previous_response_id`** for automatic context (recommended):
```python
# First message
response1 = client.responses.create(model="gpt-4o", input="My name is Alice")

# Follow-up with context preserved automatically
response2 = client.responses.create(
    model="gpt-4o",
    input="What's my name?",
    previous_response_id=response1.id  # LiteLLM maintains context
)
```

**Option 2: Pass full history in input array** (manual approach):
```python
response = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "user", "content": "My name is Alice"},
        {"role": "assistant", "content": "Nice to meet you, Alice!"},
        {"role": "user", "content": "What's my name?"}
    ]
)
```

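The `input` array supports the message roles `user`, `assistant`, `system`, and `developer` (which replaces `system` in newer models). In the OpenAI Responses API schema, tool results are passed as typed items such as `function_call_output` rather than as messages with a `tool` role.
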
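With the manual approach (Option 2) the history grows on every turn, so a bot usually needs to cap it. The sketch below shows one possible policy, assuming message dictionaries with a `role` key as in the examples above; `trim_history` and the default of 20 messages are illustrative choices, not LiteLLM behavior.

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep all system/developer messages plus the most recent max_messages others."""
    pinned_roles = ("system", "developer")
    pinned = [m for m in history if m["role"] in pinned_roles]
    rest = [m for m in history if m["role"] not in pinned_roles]
    # Guard against max_messages == 0, where rest[-0:] would return everything
    recent = rest[-max_messages:] if max_messages > 0 else []
    return pinned + recent


history = [
    {"role": "developer", "content": "You are a helpful Discord bot."},
    {"role": "user", "content": "My name is Alice"},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "What's my name?"},
]
trimmed = trim_history(history, max_messages=2)
print(trimmed)
```

Counting messages is a crude proxy for context size; a token-based budget would be more precise but depends on the model's tokenizer.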
## The require_approval parameter and MCP options

**`require_approval: "never"`** enables fully automatic tool execution, so LiteLLM returns the final response in a single API call:

```python
response = client.responses.create(
    model="gpt-4o",
    input="Search for Python documentation",
    tools=[{
        "type": "mcp",
        "server_label": "search_server",
        "server_url": "litellm_proxy",
        "require_approval": "never"  # No approval needed
    }]
)
# Response includes tool results integrated into final answer
```

**Without `require_approval: "never"`**, you get an approval flow requiring two API calls:

```python
# Step 1: Get approval request
response = client.responses.create(
    model="gpt-4o",
    input="Search for docs",
    tools=[{"type": "mcp", "server_label": "search", "server_url": "litellm_proxy"}]
)

# Extract approval request ID from response.output
approval_id = None
for output in response.output:
    if output.type == "mcp_approval_request":
        approval_id = output.id
        break

# Step 2: Approve and get final response
# (if the model requested no tool call, the first response is already final)
final_response = response
if approval_id is not None:
    final_response = client.responses.create(
        model="gpt-4o",
        input=[{"type": "mcp_approval_response", "approve": True, "approval_request_id": approval_id}],
        previous_response_id=response.id,
        tools=[{"type": "mcp", "server_label": "search", "server_url": "litellm_proxy"}]
    )
```

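A response may contain more than one approval request, in which case the second call needs one approval item per request. The sketch below builds that `input` list from plain dictionaries shaped like the items above; `build_approval_responses` is a hypothetical helper, and the sample IDs are made up for illustration.

```python
def build_approval_responses(output_items: list[dict], approve: bool = True) -> list[dict]:
    """Create one mcp_approval_response item per mcp_approval_request in the output."""
    return [
        {
            "type": "mcp_approval_response",
            "approve": approve,
            "approval_request_id": item["id"],
        }
        for item in output_items
        if item.get("type") == "mcp_approval_request"
    ]


# Made-up output items for illustration
sample_output = [
    {"type": "mcp_approval_request", "id": "mcpr_1"},
    {"type": "message", "id": "msg_1"},
    {"type": "mcp_approval_request", "id": "mcpr_2"},
]
approvals = build_approval_responses(sample_output)
print(approvals)
```

An empty result means no approval was requested, so the first response can be treated as final.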
## Restricting tools with allowed_tools

Control which MCP tools are available at **request time** or at the **server configuration level**:

**Request-level restriction** (per-call):
```python
tools=[{
    "type": "mcp",
    "server_label": "github_mcp",
    "server_url": "litellm_proxy",
    "require_approval": "never",
    "allowed_tools": ["list_repos", "get_file_contents"]  # Only these tools available
}]
```

**Server-level restriction** (in config.yaml):
```yaml
mcp_servers:
  github_mcp:
    url: "https://api.github.com/mcp"
    allowed_tools: ["list_repos", "get_file_contents"]  # Whitelist
    disallowed_tools: ["delete_repo", "force_push"]     # Blacklist
```

If both `allowed_tools` and `disallowed_tools` are specified, `allowed_tools` takes priority.

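The priority rule above can be illustrated with a small filter. This is one reading of "takes priority" (when an allow list is present, the deny list is ignored), written as a hypothetical function; it is not LiteLLM's actual implementation, and the tool names are taken from the examples in this section.

```python
from typing import Optional


def effective_tools(
    available: list[str],
    allowed: Optional[list[str]] = None,
    disallowed: Optional[list[str]] = None,
) -> list[str]:
    """Apply allow/deny lists, with the allow list winning when both are given."""
    if allowed:
        return [tool for tool in available if tool in allowed]
    if disallowed:
        return [tool for tool in available if tool not in disallowed]
    return list(available)


available = ["list_repos", "get_file_contents", "delete_repo", "force_push"]
both = effective_tools(
    available,
    allowed=["list_repos", "get_file_contents"],
    disallowed=["delete_repo", "force_push"],
)
deny_only = effective_tools(available, disallowed=["delete_repo"])
print(both, deny_only)
```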
## Authentication headers

LiteLLM supports multiple authentication header formats:

| Header | Use Case |
|--------|----------|
| `Authorization: Bearer sk-...` | **Standard** - Used by OpenAI SDK automatically |
| `x-litellm-api-key: Bearer sk-...` | **MCP connections** and custom scenarios |
| `api-key: ...` | Azure OpenAI compatibility |

**For standard API calls** (Discord bot), use the OpenAI SDK default:
```python
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-your-key"  # Sent as "Authorization: Bearer sk-your-key"
)
```

**For MCP tool headers** (when calling external MCP servers), use the `headers` parameter:
```python
tools=[{
    "type": "mcp",
    "server_label": "github",
    "server_url": "litellm_proxy",
    "require_approval": "never",
    "headers": {
        "x-litellm-api-key": "Bearer sk-your-litellm-key",
        "x-mcp-github-authorization": "Bearer ghp_your_github_token"
    }
}]
```

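If several MCP servers each need their own token, the `headers` dictionary can be generated instead of written by hand. The sketch below assumes the header naming seen in the example above (`x-mcp-<server>-authorization`) generalizes to other server names; `build_mcp_headers` is a hypothetical helper and the token values are placeholders.

```python
def build_mcp_headers(litellm_key: str, server_tokens: dict[str, str]) -> dict[str, str]:
    """Build the `headers` dict for an MCP tool entry from per-server bearer tokens."""
    headers = {"x-litellm-api-key": f"Bearer {litellm_key}"}
    for server_name, token in server_tokens.items():
        # Assumed pattern, following the x-mcp-github-authorization example
        headers[f"x-mcp-{server_name}-authorization"] = f"Bearer {token}"
    return headers


headers = build_mcp_headers("sk-your-litellm-key", {"github": "ghp_your_github_token"})
print(headers)
```

In a real bot the key and tokens would come from environment variables or a secret store, not from literals in the source.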
## Complete Discord bot migration example

Here's a full implementation pattern for migrating from chat.completions to responses with MCP:

```python
import asyncio
import os

import discord
from openai import OpenAI


class LiteLLMResponsesClient:
    """Client wrapper for Discord bot using LiteLLM Responses API with MCP."""

    def __init__(self, proxy_url: str, api_key: str):
        self.client = OpenAI(base_url=proxy_url, api_key=api_key)
        self.conversations = {}  # user_id -> response_id mapping

    def get_mcp_tools(self, server_label: str = "default") -> list:
        """Define MCP tools configuration."""
        return [{
            "type": "mcp",
            "server_label": server_label,
            "server_url": "litellm_proxy",
            "require_approval": "never",
            "allowed_tools": ["search", "fetch_data", "analyze"]  # Customize as needed
        }]

    def chat(
        self,
        user_id: str,
        message: str,
        model: str = "anthropic/claude-3-5-sonnet-latest",
        use_mcp_tools: bool = True,
        stream: bool = False
    ):
        """Send a message and get response, with optional MCP tools and streaming.

        Returns a str when stream=False, or a generator of text deltas when stream=True.
        """
        previous_id = self.conversations.get(user_id)

        kwargs = {
            "model": model,
            "input": message,
            "stream": stream
        }

        if previous_id:
            kwargs["previous_response_id"] = previous_id

        if use_mcp_tools:
            kwargs["tools"] = self.get_mcp_tools()
            kwargs["tool_choice"] = "auto"

        if stream:
            return self._handle_stream(user_id, **kwargs)
        else:
            response = self.client.responses.create(**kwargs)
            self.conversations[user_id] = response.id
            return self._extract_text(response)

    def _handle_stream(self, user_id: str, **kwargs):
        """Generator for streaming responses."""
        stream = self.client.responses.create(**kwargs)
        response_id = None

        for event in stream:
            if hasattr(event, 'type'):
                if event.type == "response.created":
                    response_id = event.response.id
                elif event.type == "response.output_text.delta":
                    yield event.delta

        # Only reached once the caller has consumed the whole stream
        if response_id:
            self.conversations[user_id] = response_id

    def _extract_text(self, response) -> str:
        """Extract the first output_text part from a Responses API response."""
        for output in response.output:
            if output.type == "message":
                for content in output.content:
                    if content.type == "output_text":
                        return content.text
        return ""

    def clear_history(self, user_id: str):
        """Clear conversation history for a user."""
        self.conversations.pop(user_id, None)


# Discord bot integration example.
# discord.Client with explicit intents works in discord.py 2.x;
# the message_content intent is required to read message text.
intents = discord.Intents.default()
intents.message_content = True

bot = discord.Client(intents=intents)
llm_client = LiteLLMResponsesClient(
    proxy_url=os.environ["LITELLM_PROXY_URL"],
    api_key=os.environ["LITELLM_API_KEY"]
)

@bot.event
async def on_message(message):
    if message.author.bot:
        return

    if bot.user.mentioned_in(message):
        user_id = str(message.author.id)
        user_message = message.content.replace(f'<@{bot.user.id}>', '').strip()

        # Non-streaming response with MCP tools. The SDK call is blocking,
        # so run it in a worker thread to keep the event loop responsive.
        response_text = await asyncio.to_thread(
            llm_client.chat,
            user_id=user_id,
            message=user_message,
            use_mcp_tools=True
        )
        # Discord rejects empty messages, so fall back to a placeholder
        await message.reply(response_text or "(no text returned)")

# Run: bot.run(os.environ["DISCORD_TOKEN"])
```

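One practical detail the example above does not handle is that Discord limits a standard message to 2000 characters, while model responses can be longer. The sketch below shows one way to split a reply before sending; `split_for_discord` is a hypothetical helper, and preferring newline boundaries is a design choice, not a Discord requirement.

```python
def split_for_discord(text: str, limit: int = 2000) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring newline boundaries."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    remaining = text
    while len(remaining) > limit:
        # Look for the last newline that keeps the chunk within the limit
        cut = remaining.rfind("\n", 0, limit + 1)
        if cut <= 0:
            cut = limit  # No usable newline: fall back to a hard split
        chunks.append(remaining[:cut])
        # Drop the newline(s) at the boundary so chunks do not start blank
        remaining = remaining[cut:].lstrip("\n")
    if remaining:
        chunks.append(remaining)
    return chunks


parts = split_for_discord("a" * 4500)
print([len(part) for part in parts])
```

In the `on_message` handler, the first chunk could be sent with `message.reply(...)` and the rest with `message.channel.send(...)`.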
## Official documentation links

- **Responses API documentation**: https://docs.litellm.ai/docs/response_api
- **MCP overview**: https://docs.litellm.ai/docs/mcp
- **MCP usage guide**: https://docs.litellm.ai/docs/mcp_usage
- **MCP permission management**: https://docs.litellm.ai/docs/mcp_control
- **OpenAI provider Responses API**: https://docs.litellm.ai/docs/providers/openai/responses_api
- **Streaming documentation**: https://docs.litellm.ai/docs/completion/stream
- **Virtual keys and auth**: https://docs.litellm.ai/docs/proxy/virtual_keys

## Key migration considerations

The Responses API is marked as **BETA** in LiteLLM. Ensure you're running LiteLLM **1.63.8+** and OpenAI SDK **1.66.1+** for full compatibility. Model names must be ones LiteLLM can route: provider-prefixed names (e.g., `openai/gpt-4o`, `anthropic/claude-3-5-sonnet-latest`) or, on the proxy, the model names defined under `model_list` in config.yaml. Response IDs are encrypted per-user by default for security, so users cannot access other users' conversation history unless you disable this with `disable_responses_id_security: true` in config.yaml.

The primary advantage for Discord bots is the automatic MCP tool execution loop. With chat.completions, you must manually detect tool calls, execute them, and send the results back. With the Responses API and `require_approval: "never"`, LiteLLM handles this entire flow internally and returns the final integrated response in a single call.
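The minimum versions mentioned above can be checked at startup. The sketch below compares dotted version strings numerically (a plain string comparison would rank `1.63.14` below `1.63.8`); `parse_version` and `meets_minimum` are hypothetical helpers that ignore pre-release suffixes, so treat the result as a rough guard only.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse the leading numeric dotted part of a version string."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # Stop at the first non-numeric component, e.g. "post1"
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.63.14", "1.63.8"))
print(meets_minimum("1.9.0", "1.63.8"))
```

The locally installed SDK version can be read with `importlib.metadata.version("openai")`; the LiteLLM version that matters is the one running on the proxy, which may differ from any local install.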