# OpenAI Assistants API Deprecation: Migration Guide to Responses API
OpenAI officially deprecated the Assistants API in mid-2025, giving developers until mid-2026 to migrate all production workloads to the new Responses API. If you are still running Assistants-based integrations, the clock is ticking. This guide walks through the deprecation timeline, the architectural differences between the two APIs, a step-by-step migration path, and the pitfalls that have caught teams off guard during early migrations.
## Deprecation Timeline
OpenAI announced the deprecation alongside the launch of the Responses API in March 2025. The key dates every team needs to know:
- March 2025: Responses API launched. Assistants API marked as "legacy" in documentation.
- June 2025: New Assistants API feature development frozen. No new tools or model support added.
- September 2025: Assistants API endpoints began returning deprecation headers in every response.
- March 2026: Rate limits on Assistants API reduced by 50% for all tiers.
- Mid-2026 (estimated): Full shutdown. All Assistants API endpoints return 410 Gone.
## Why OpenAI Killed the Assistants API
The Assistants API was OpenAI's first attempt at a stateful, multi-turn agent framework. It introduced Threads, Runs, and server-side message storage. While powerful in concept, it created significant operational overhead for OpenAI and frustration for developers:
- Server-side state complexity: Threads stored messages on OpenAI's infrastructure, creating data residency concerns for enterprise customers and GDPR complications for European teams.
- Polling-based architecture: Developers had to poll the Runs endpoint to check completion status, adding latency and unnecessary API calls. Some production apps were making 5-10x more API calls than needed.
- Opaque tool execution: Code Interpreter and Retrieval ran server-side with limited visibility into what was happening, making debugging nearly impossible for complex workflows.
- Cost unpredictability: Because Threads persisted and Retrieval indexed files server-side, storage costs accumulated in ways that surprised many teams.
The Responses API addresses all of these by shifting to a stateless, single-request model with native streaming and client-side orchestration of tools.
## Key Architectural Differences
| Feature | Assistants API | Responses API |
|---|---|---|
| State Management | Server-side Threads | Stateless (client manages context) |
| Execution Model | Async Runs with polling | Synchronous or streaming |
| Tool Calling | Server-side execution | Client-side orchestration |
| File Search | Built-in Retrieval | File Search tool (improved) |
| Code Execution | Code Interpreter (opaque) | Code Interpreter (with output visibility) |
| Streaming | Server-Sent Events on Runs | Native response streaming |
| Multi-turn | Automatic via Threads | Chain via previous_response_id |
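The streaming row in the table above is worth seeing concretely. Below is a minimal sketch, assuming the `openai` Python package: the `collect_text` helper is illustrative, and the `response.output_text.delta` event-type string should be verified against your SDK version's streaming event names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TextDelta:
    """Stand-in for a Responses streaming event (illustrative shape)."""
    type: str
    delta: str = ""


def collect_text(events: Iterable) -> str:
    """Accumulate text deltas from a Responses-style event stream."""
    chunks = []
    for event in events:
        # Responses streaming emits typed events; text arrives as
        # "response.output_text.delta" events carrying a `delta` string.
        if getattr(event, "type", "") == "response.output_text.delta":
            chunks.append(event.delta)
    return "".join(chunks)


if __name__ == "__main__":
    # Live-API portion of the sketch: requires an API key.
    from openai import OpenAI

    client = OpenAI()
    stream = client.responses.create(
        model="gpt-4.1",
        input="Summarize Q4 revenue in one sentence.",
        stream=True,
    )
    print(collect_text(stream))
```

Contrast this with Assistants streaming, where the events hung off the Run object rather than the response itself.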
## Step-by-Step Migration
### Step 1: Audit Your Assistants Usage
Before writing any migration code, catalog every Assistant you have deployed. Use the List Assistants endpoint to pull your full inventory:
```python
import openai

client = openai.OpenAI()

# List all assistants before migration
assistants = client.beta.assistants.list(limit=100)
for a in assistants.data:
    print(f"{a.id} | {a.name} | Tools: {[t.type for t in a.tools]}")
```
Document which tools each assistant uses (Code Interpreter, Retrieval/File Search, Function Calling) because the migration path differs for each.
### Step 2: Replace Thread-Based Conversations
The biggest conceptual shift is moving from server-managed Threads to client-managed conversation state. In the Responses API, you maintain multi-turn context by passing previous_response_id:
```python
# Old: Assistants API (Thread-based)
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread.id, role="user", content="Analyze Q4 revenue")
run = client.beta.threads.runs.create(thread.id, assistant_id="asst_xxx")
# ... poll for completion ...

# New: Responses API (stateless with chaining)
response = client.responses.create(
    model="gpt-4.1",
    input="Analyze Q4 revenue",
    tools=[{"type": "file_search"}],  # note: file_search needs vector_store_ids in practice; see Step 4
)

# For follow-up turns, chain responses:
follow_up = client.responses.create(
    model="gpt-4.1",
    input="Break that down by region",
    previous_response_id=response.id,
)
```
### Step 3: Migrate Function Calling
Function calling translates almost directly. The schema format is identical, but execution flow changes from polling a Run to handling tool calls inline:
```python
import json

response = client.responses.create(
    model="gpt-4.1",
    input="What's the weather in Tokyo?",
    tools=[{
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
)

# Collect tool calls from the response; `arguments` arrives as a JSON string
tool_outputs = []
for item in response.output:
    if item.type == "function_call":
        result = call_your_function(item.name, json.loads(item.arguments))
        tool_outputs.append(
            {"type": "function_call_output", "call_id": item.call_id, "output": str(result)}
        )

# Submit results back by chaining on the previous response
if tool_outputs:
    response = client.responses.create(
        model="gpt-4.1",
        input=tool_outputs,
        previous_response_id=response.id,
    )
```
### Step 4: Migrate File Search and Code Interpreter
Both tools carry over to the Responses API but with improved interfaces. File Search now supports vector stores that you create and manage explicitly, giving you better control over what documents are indexed. Code Interpreter now returns full execution output including generated files, eliminating the black-box problem.
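The File Search flow above can be sketched end to end. This is a hedged sketch, assuming the `openai` Python package and a local file named `q4_revenue.pdf`; the client method names (`vector_stores.create`, `vector_stores.files.upload_and_poll`) follow recent openai-python releases and should be checked against your SDK version.

```python
def file_search_tool(vector_store_ids):
    """Build the file_search tool entry for a Responses API call."""
    return {"type": "file_search", "vector_store_ids": list(vector_store_ids)}


if __name__ == "__main__":
    # Live-API portion of the sketch: requires an API key.
    from openai import OpenAI

    client = OpenAI()

    # Explicitly create and populate a vector store — there is no implicit
    # server-side Retrieval index as in the Assistants API.
    store = client.vector_stores.create(name="q4-reports")
    with open("q4_revenue.pdf", "rb") as f:
        client.vector_stores.files.upload_and_poll(vector_store_id=store.id, file=f)

    response = client.responses.create(
        model="gpt-4.1",
        input="Summarize the Q4 revenue report.",
        tools=[file_search_tool([store.id])],
    )
    print(response.output_text)
```

Explicit vector stores are the main upgrade here: you decide exactly which documents are indexed and when they are deleted, instead of inheriting whatever Retrieval stored server-side.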
### Step 5: Update Error Handling and Retries
The Responses API uses standard HTTP error codes instead of Run status checks. Replace your Run polling logic with standard retry logic on 429 (rate limit) and 500 (server error) responses.
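A generic retry wrapper covers this. The sketch below is exception-type agnostic; with the `openai` package you would pass its `RateLimitError` and `InternalServerError` classes (shown in the guarded section), but verify those names against your SDK version.

```python
import random
import time


def with_retries(fn, *, retry_on, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on the given exception types with exponential
    backoff plus jitter. Re-raises the last error once attempts run out."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            # base_delay, 2x, 4x, ... with proportional jitter to spread
            # out retries from concurrent workers
            time.sleep(base_delay * (2 ** attempt + random.random()))


if __name__ == "__main__":
    # Live-API portion of the sketch: requires an API key.
    import openai

    client = openai.OpenAI()
    response = with_retries(
        lambda: client.responses.create(model="gpt-4.1", input="ping"),
        retry_on=(openai.RateLimitError, openai.InternalServerError),
    )
```

This replaces the old pattern of polling a Run and branching on `run.status` values like `failed` or `expired`.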
## Common Pitfalls
- Thread data loss: When Assistants shuts down, all Thread history is deleted. Export any conversation data you need before the deadline using the List Messages endpoint.
- Billing model change: Assistants charged per Run. Responses charges per input/output token. For long conversations with heavy context, costs may increase if you are not pruning conversation history.
- Streaming differences: Assistants used Server-Sent Events on the Run object. Responses API uses native streaming on the response itself. Your SSE parsing code will need updates.
- Model compatibility: Some older models available in Assistants (like early GPT-4 snapshots) are not available in Responses. Verify your model string is supported.
- Rate limit structure: Responses API rate limits are per-model, not per-assistant. If you were running multiple assistants to work around rate limits, that strategy no longer applies.
## Testing Your Migration
Run both APIs in parallel during migration. Send identical prompts to both and compare outputs for quality parity. Key metrics to track:
- Latency: Responses API should be faster (no polling overhead). Expect 30-60% latency reduction for tool-calling workflows.
- Cost per conversation: Compare token usage. Responses API makes token counts explicit in every response object.
- Tool call accuracy: Ensure function calling produces the same parameter extraction quality.
- Error rates: Monitor for any increase in malformed responses during cutover.
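For the cost-per-conversation metric above, a small pure helper is enough. The token counts come from each Responses object's `usage` field (`input_tokens` / `output_tokens` in recent SDK versions); the per-million-token prices here are placeholders, not real rates — check your own pricing page.

```python
def conversation_cost(usages, input_price_per_m, output_price_per_m):
    """Sum token cost across a conversation.

    `usages` is an iterable of (input_tokens, output_tokens) pairs, one per
    response in the chain; prices are USD per million tokens.
    """
    total_in = sum(i for i, _ in usages)
    total_out = sum(o for _, o in usages)
    return (total_in * input_price_per_m + total_out * output_price_per_m) / 1_000_000
```

Run the same helper over both the Assistants and Responses sides of your parallel test to see whether unpruned context is inflating the stateless side.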
The Responses API is objectively better designed than what it replaces. The migration is not trivial for complex multi-tool assistants, but the improved debugging, predictable billing, and reduced latency make it worth prioritizing now rather than scrambling at the deadline.