---
title: "How I Got Open WebUI Talking to OpenAI Web Search"
date: 2025-12-29
draft: false
---

OpenAI promised native web search in GPT-5, but LiteLLM proxy deployments (and by extension Open WebUI) still choke on it; issue #13042 tracks the fallout. I needed grounded answers inside Open WebUI anyway, so I built a workaround: route GPT-5 traffic through the Responses API and mask every `web_search_call` before the UI ever sees it.

This post documents the final setup, the hotfix script that keeps LiteLLM honest, and the tests that prove Open WebUI now streams cited answers without trying to execute the tool itself.

## Why Open WebUI Broke

1. **Wrong API surface.** `/v1/chat/completions` still rejects `type: "web_search"` with `Invalid value: 'web_search'. Supported values are: 'function' and 'custom'.`
2. **LiteLLM tooling gap.** The OpenAI TypedDicts in `litellm/types/llms/openai.py` only allow `Literal["function"]`. Even if the backend call succeeded, streaming would crash when it saw a new tool type.
3. **Open WebUI assumptions.** The UI eagerly parses every tool delta, so when LiteLLM streamed the raw `web_search_call` chunk, the UI tried to execute it, failed to parse the arguments, and aborted the chat.

Fixing all three required touching both the proxy configuration and the LiteLLM transformation path.
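
To see the first failure mode in isolation, here's a minimal reproduction against the Chat Completions endpoint. This is a sketch: the key is a placeholder, the model name mirrors the backend model from the config in Step 1, and the exact error wording may vary by API version.

```python
# Minimal reproduction of failure mode 1: Chat Completions rejects `web_search`.
# Placeholder key; the printed error is the one I observed, but wording may vary.
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENAI_API_KEY>"},
    json={
        "model": "gpt-5.2",
        "messages": [{"role": "user", "content": "Find the sunset time in Tokyo today."}],
        "tools": [{"type": "web_search"}],
    },
    timeout=60,
)
print(resp.status_code)  # 400
print(resp.json()["error"]["message"])
# -> Invalid value: 'web_search'. Supported values are: 'function' and 'custom'.
```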

## Step 1: Route GPT-5 Through the Responses API

LiteLLM's Responses bridge activates whenever the backend model name starts with `openai/responses/`. I added a dedicated alias, `gpt-5.2-search`, that hardcodes the Responses API plus the web search metadata. Existing models (reasoning, embeddings, TTS) stay untouched.

```yaml
# proxy-config.yaml (sanitized)
model_list:
  - model_name: gpt-5.2-search
    litellm_params:
      model: openai/responses/openai/gpt-5.2
      api_key: <OPENAI_API_KEY>
      reasoning_effort: high
      merge_reasoning_content_in_choices: true
      tools:
        - type: web_search
          user_location:
            type: approximate
            country: US
```

Any client (Open WebUI included) can now request `model: "gpt-5.2-search"` over the standard `/v1/chat/completions` endpoint, and LiteLLM handles the Responses API hop transparently.
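
Any OpenAI-compatible SDK works against the alias. A minimal sketch with the official Python client, pointed at my LiteLLM ingress; the base URL and key are placeholders matching the smoke test later in this post.

```python
# Minimal client-side check: the alias behaves like any other chat model.
# Base URL and key are placeholders for the LiteLLM ingress and a virtual key.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ericxliu.me/v1",
    api_key="<LITELLM_VIRTUAL_KEY>",
)

resp = client.chat.completions.create(
    model="gpt-5.2-search",
    messages=[{"role": "user", "content": "Find the sunset time in Tokyo today."}],
)
print(resp.choices[0].message.content)
```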

## Step 2: Mask `web_search_call` Chunks Inside LiteLLM

Even with the right API, LiteLLM still needs to stream deltas that Open WebUI can digest. My `hotfix.py` script copies the LiteLLM source into `/tmp/patch/litellm`, then rewrites two files. The script runs as part of the Helm release's init hook, so I can inject fixes directly into the container filesystem at pod start. That saves me from rebuilding and pushing new images every time LiteLLM upstream changes (or refuses a patch), which matters while waiting for issue #13042 to land. I'll try to upstream the fix, but this is admittedly hacky, so timelines are uncertain.

1. `openai.py` TypedDicts: extend the tool chunk definitions to accept `Literal["web_search"]`.
2. `litellm_responses_transformation/transformation.py`: intercept every streaming item and short-circuit anything with `type == "web_search_call"`, returning an empty assistant delta instead of a tool call.

```python
# Excerpt from hotfix.py
tool_call_chunk_original = (
    'class ChatCompletionToolCallChunk(TypedDict):  # result of /chat/completions call\n'
    '    id: Optional[str]\n'
    '    type: Literal["function"]'
)
tool_call_chunk_patch = tool_call_chunk_original.replace(
    'Literal["function"]', 'Literal["function", "web_search"]'
)
...
if tool_call_chunk_original in content:
    content = content.replace(tool_call_chunk_original, tool_call_chunk_patch, 1)
added_block = """            elif output_item.get("type") == "web_search_call":
                # Mask the call: Open WebUI should never see tool metadata
                action_payload = output_item.get("action")
                verbose_logger.debug(
                    "Chat provider: masking web_search_call (added) call_id=%s action=%s",
                    output_item.get("call_id"),
                    action_payload,
                )
                return ModelResponseStream(
                    choices=[
                        StreamingChoices(
                            index=0,
                            delta=Delta(content=""),
                            finish_reason=None,
                        )
                    ]
                )
"""

These patches ensure LiteLLM never emits a `tool_calls` delta for `web_search`. Open WebUI only receives assistant text chunks, so it happily renders the model response and the inline citations the Responses API already provides.

## Step 3: Prove It with cURL (and Open WebUI)

I keep a simple smoke test (`litellm_smoke_test.sh`) that hits the public ingress with and without streaming. The secrets below are placeholders, but the structure is otherwise unchanged.

```bash
#!/usr/bin/env bash
set -euo pipefail

echo "Testing non-streaming..."
curl "https://api.ericxliu.me/v1/chat/completions" \
  -H "Authorization: Bearer <LITELLM_MASTER_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.2-search",
    "messages": [{"role": "user", "content": "Find the sunset time in Tokyo today."}]
  }'

echo -e "\n\nTesting streaming..."
curl "https://api.ericxliu.me/v1/chat/completions" \
  -H "Authorization: Bearer <LITELLM_MASTER_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.2-search",
    "stream": true,
    "messages": [{"role": "user", "content": "What is the weather in NYC right now?"}]
  }'
```

Each request now returns grounded answers with citations (`url_citation` annotations), and the same holds inside Open WebUI: the SSE feed never stalls because the UI isn't asked to interpret tool calls.
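
To inspect the citations programmatically, the non-streaming response carries them as message annotations. A sketch, assuming LiteLLM passes through the OpenAI-style `annotations` array on the message; field names follow the documented `url_citation` shape and may differ across versions, so verify against your deployment.

```python
# Sketch: pull url_citation annotations out of a non-streaming response.
# Assumes the OpenAI-style `annotations` list is forwarded on the message.
import requests

resp = requests.post(
    "https://api.ericxliu.me/v1/chat/completions",
    headers={"Authorization": "Bearer <LITELLM_MASTER_KEY>"},
    json={
        "model": "gpt-5.2-search",
        "messages": [{"role": "user", "content": "Find the sunset time in Tokyo today."}],
    },
    timeout=120,
)
message = resp.json()["choices"][0]["message"]
for annotation in message.get("annotations", []):
    if annotation.get("type") == "url_citation":
        citation = annotation.get("url_citation", {})
        print(citation.get("title"), citation.get("url"))
```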

## Lessons & Pitfalls

- **The Responses API is non-negotiable (and syntax-sensitive).** `/v1/chat/completions` still rejects `web_search`. Always test against `/v1/responses` directly before wiring LiteLLM into the loop. The syntax for reasoning also differs: Chat Completions uses the top-level `reasoning_effort` parameter, while the Responses API requires a nested object, `"reasoning": {"effort": "medium"}` (see the sketch after this list).
- **The Native Model Trap.** Models like `gpt-5-search-api` exist and support web search via standard Chat Completions, but they are often less flexible; for instance, they reject `reasoning_effort` entirely. Routing a standard model through LiteLLM's Responses bridge offers more control over formatting and fallbacks.
- **Magic strings control routing.** LiteLLM has hardcoded logic (deep in `main.py`) that only triggers the Responses-to-Chat bridge when the backend model name starts with `openai/responses/`. Without that specific prefix, LiteLLM bypasses its internal transformation layer entirely, leading to cryptic 404s or "model not found" errors.
- **Synthesized Sovereignty: The Call ID Crisis.** Open WebUI is a "well-behaved" OpenAI client, yet it often omits the `id` field in `tool_calls` when sending assistant messages back to the server. LiteLLM's Responses bridge initially exploded with `KeyError: 'id'` because it assumed an ID would always be present. The fix: synthesize predictable IDs like `auto_tool_call_N` on the fly to satisfy the server-side schema.
- **The Argument Delta Void.** In streaming mode, the Responses API sometimes skips sending `response.function_call_arguments.delta` entirely if the query is simple. If the proxy only waits for deltas, the client receives an empty `{}` for tool arguments. The solution is to fall back and synthesize the arguments string from the action payload (e.g., `output_item['action']['query']`) when deltas are missing.
- **Streaming State Machines are Fragile.** Open WebUI is highly sensitive to the exact state of a tool call. If it sees a `web_search_call` with `status: "in_progress"`, its internal parser chokes, assuming it's an uncompleted "function" call. These intermediate state chunks must be intercepted and handled before they reach the UI.
- **Defensive Masking is the Final Boss.** To stop Open WebUI from entering an infinite client-side loop (thinking it needs to execute a tool it doesn't have), LiteLLM must "mask" the `web_search_call` chunks. By emitting empty content deltas instead of tool chunks, we hide the server-side search mechanics from the UI, allowing it to stay focused on the final answer.
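
To illustrate the first pitfall, here's a direct hit against `/v1/responses` with the nested reasoning object and the `web_search` tool; a sketch with a placeholder key, mirroring the proxy config above.

```python
# Sketch: calling /v1/responses directly. Note the nested reasoning object,
# versus Chat Completions' top-level reasoning_effort. Placeholder key; the
# model and user_location mirror the proxy config above.
import requests

resp = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": "Bearer <OPENAI_API_KEY>"},
    json={
        "model": "gpt-5.2",
        "input": "What is the weather in NYC right now?",
        "reasoning": {"effort": "medium"},
        "tools": [
            {
                "type": "web_search",
                "user_location": {"type": "approximate", "country": "US"},
            }
        ],
    },
    timeout=120,
)
resp.raise_for_status()
# The output array interleaves web_search_call items with the final message.
for item in resp.json().get("output", []):
    print(item.get("type"))
```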

With those guardrails in place, GPT-5's native web search works end-to-end inside Open WebUI, complete with citations, without waiting for LiteLLM upstream fixes.

## References