Last year, I spent three weeks building a custom streaming interface for ChatGPT in our app. Rate limiting, error recovery, chunk parsing, state management—it was a nightmare. Then Vercel released their AI SDK, and I rebuilt the entire feature in two days. Not only was it faster to implement, but it handled edge cases I hadn't even considered. The Vercel AI SDK isn't just a convenience wrapper—it's a production-grade framework for building AI features the right way.
In this guide, we'll go beyond basic chat implementations and explore advanced patterns that power real-world AI applications: streaming with React Server Components, tool calling for agent-like behavior, multi-model orchestration, and production-ready error handling.
Why Vercel AI SDK?
Before we dive deep, understand what makes this SDK special:
Key Advantages:
- Provider Agnostic: OpenAI, Anthropic, Google, Mistral, local models—same API
- Streaming First: Built-in support for token streaming with React hooks
- Type Safe: Full TypeScript support with inferred types
- Edge Compatible: Works on Vercel Edge, Cloudflare Workers, Node.js
- Framework Integration: First-class Next.js, React, Svelte, Vue support
- Tool Calling: Declarative function calling for agentic behavior
# Installation
npm install ai @ai-sdk/openai @ai-sdk/anthropic
Architecture Overview
The SDK has three main layers:
// 1. Provider Layer - Model adapters
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
// 2. Core Layer - Streaming & generation
import { streamText, generateText } from 'ai';
// 3. Framework Layer - React hooks & UI helpers
import { useChat, useCompletion } from 'ai/react';
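The payoff of the provider layer is that swapping models does not change the calling code. Here is a minimal sketch of the same generateText call running against either provider; the model IDs are examples, so use whichever your accounts have access to:
// lib/summarize.ts
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { generateText } from 'ai';

export async function summarize(text: string, useClaude = false) {
  const { text: summary } = await generateText({
    // Swapping providers means changing only this expression
    model: useClaude
      ? anthropic('claude-3-5-sonnet-20241022')
      : openai('gpt-4-turbo'),
    prompt: `Summarize the following in one paragraph:\n\n${text}`,
  });
  return summary;
}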
Pattern 1: Streaming Chat with Route Handlers
Let's build a production-ready streaming chat with a Next.js 15 Route Handler and the useChat hook:
// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
export const runtime = 'edge'; // Deploy to edge for low latency
export async function POST(req: Request) {
const { messages } = await req.json();
const result = await streamText({
model: openai('gpt-4-turbo'),
messages,
// System message for context
system: 'You are a helpful coding assistant specializing in React and Next.js.',
// Limit tokens for cost control
maxTokens: 2000,
// Control randomness
temperature: 0.7,
// Streaming configuration
onFinish: async ({ text, usage }) => {
// Log completion for analytics
console.log('Completion:', { text, usage });
// Save to database, trigger webhooks, etc.
},
});
return result.toAIStreamResponse();
}
// app/chat/page.tsx
'use client';
import { useChat } from 'ai/react';
import ReactMarkdown from 'react-markdown';
// LoadingDots is a small project-local spinner; the import path here is just an example
import { LoadingDots } from '@/components/loading-dots';
export default function ChatPage() {
const {
messages,
input,
handleInputChange,
handleSubmit,
isLoading,
error,
reload,
stop,
} = useChat({
api: '/api/chat',
// Optional: Handle errors
onError: (error) => {
console.error('Chat error:', error);
},
// Optional: Message sent callback
onFinish: (message) => {
console.log('Message received:', message);
},
});
return (
<div className="flex flex-col h-screen">
{/* Messages */}
<div className="flex-1 overflow-y-auto p-4 space-y-4">
{messages.map((message) => (
<div
key={message.id}
className={`flex ${
message.role === 'user' ? 'justify-end' : 'justify-start'
}`}
>
<div
className={`max-w-[70%] rounded-lg p-4 ${
message.role === 'user'
? 'bg-blue-600 text-white'
: 'bg-gray-100 text-gray-900'
}`}
>
<ReactMarkdown>{message.content}</ReactMarkdown>
</div>
</div>
))}
{/* Loading indicator */}
{isLoading && (
<div className="flex justify-start">
<div className="bg-gray-100 rounded-lg p-4">
<LoadingDots />
</div>
</div>
)}
{/* Error handling */}
{error && (
<div className="flex justify-center">
<div className="bg-red-50 border border-red-200 rounded-lg p-4">
<p className="text-red-800">{error.message}</p>
<button
onClick={() => reload()}
className="mt-2 text-red-600 underline"
>
Retry
</button>
</div>
</div>
)}
</div>
{/* Input */}
<form onSubmit={handleSubmit} className="border-t p-4">
<div className="flex gap-2">
<input
value={input}
onChange={handleInputChange}
placeholder="Type a message..."
className="flex-1 px-4 py-2 border rounded-lg"
disabled={isLoading}
/>
<button
type={isLoading ? 'button' : 'submit'}
onClick={isLoading ? () => stop() : undefined}
disabled={!isLoading && !input.trim()}
className="px-6 py-2 bg-blue-600 text-white rounded-lg disabled:opacity-50"
>
{isLoading ? 'Stop' : 'Send'}
</button>
</div>
</form>
</div>
);
}
Key Points:
- Edge runtime for <50ms cold starts globally
- useChat hook handles all state management
- Built-in error state with reload() for retries
- Markdown rendering for formatted responses
- Stop generation mid-stream
Pattern 2: Tool Calling (Function Calling)
Tool calling lets AI models execute functions and use results in responses—essential for building agents:
// app/api/chat-with-tools/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText, tool } from 'ai';
import { z } from 'zod';
export async function POST(req: Request) {
const { messages } = await req.json();
const result = await streamText({
model: openai('gpt-4-turbo'),
messages,
tools: {
// Weather tool
getWeather: tool({
description: 'Get the current weather for a location',
parameters: z.object({
location: z.string().describe('The city name'),
unit: z.enum(['celsius', 'fahrenheit']).optional(),
}),
execute: async ({ location, unit = 'celsius' }) => {
// Call real weather API
const response = await fetch(
`https://api.openweathermap.org/data/2.5/weather?q=${location}&units=${unit === 'celsius' ? 'metric' : 'imperial'}&appid=${process.env.OPENWEATHER_API_KEY}`
);
const data = await response.json();
return {
temperature: data.main.temp,
condition: data.weather[0].description,
location: data.name,
unit,
};
},
}),
// Database query tool
searchDocuments: tool({
description: 'Search documentation database',
parameters: z.object({
query: z.string().describe('Search query'),
limit: z.number().optional().default(5),
}),
execute: async ({ query, limit }) => {
// Vector search in your database
const results = await vectorSearch(query, limit);
return results;
},
}),
// Code execution tool
executeCode: tool({
description: 'Execute JavaScript code safely',
parameters: z.object({
code: z.string().describe('JavaScript code to execute'),
}),
execute: async ({ code }) => {
// Use VM2 or similar for sandboxed execution
try {
const result = await safelyExecuteCode(code);
return { success: true, output: result };
} catch (error) {
return { success: false, error: error instanceof Error ? error.message : String(error) };
}
},
}),
},
maxToolRoundtrips: 5, // Limit tool calls to prevent loops
});
return result.toAIStreamResponse();
}
Client-side handling:
'use client';
import { useChat } from 'ai/react';
export default function AgentChat() {
const { messages, input, handleInputChange, handleSubmit } = useChat({
api: '/api/chat-with-tools',
});
return (
<div className="space-y-4 p-4">
{messages.map((message) => (
<div key={message.id}>
<div className="font-semibold">
{message.role === 'user' ? 'You' : 'Assistant'}
</div>
<div className="mt-1">{message.content}</div>
{/* Show tool invocations */}
{message.toolInvocations?.map((invocation) => (
<div key={invocation.toolCallId} className="mt-2 p-2 bg-gray-100 rounded">
<div className="text-sm font-medium">
🔧 {invocation.toolName}
</div>
<div className="text-xs text-gray-600 mt-1">
{JSON.stringify(invocation.args, null, 2)}
</div>
{invocation.state === 'result' && (
<div className="text-xs text-green-600 mt-1">
✓ Result: {JSON.stringify(invocation.result, null, 2)}
</div>
)}
</div>
))}
</div>
))}
<form onSubmit={handleSubmit}>
<input
value={input}
onChange={handleInputChange}
placeholder="Ask me anything..."
className="w-full px-4 py-2 border rounded"
/>
</form>
</div>
);
}
Watch Out: Tool execution happens on your server. Always validate inputs and implement proper security controls. Never execute arbitrary code without sandboxing.
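As one example of that validation, here is a hardened version of the weather tool from above: it allow-lists the location string, URL-encodes it, and checks the HTTP status before trusting the response. The allow-list pattern and the error shape are illustrative choices, not SDK requirements.
// lib/tools/get-weather.ts
import { tool } from 'ai';
import { z } from 'zod';

// Rough allow-list for city names (an assumption for this sketch)
const LOCATION_PATTERN = /^[\p{L}\s.,'-]{1,80}$/u;

export const getWeather = tool({
  description: 'Get the current weather for a location',
  parameters: z.object({
    location: z.string().describe('The city name'),
    unit: z.enum(['celsius', 'fahrenheit']).optional(),
  }),
  execute: async ({ location, unit = 'celsius' }) => {
    // Reject inputs that don't look like a city name before they reach an external API
    if (!LOCATION_PATTERN.test(location)) {
      return { error: 'Invalid location' };
    }
    const response = await fetch(
      `https://api.openweathermap.org/data/2.5/weather?q=${encodeURIComponent(location)}&units=${unit === 'celsius' ? 'metric' : 'imperial'}&appid=${process.env.OPENWEATHER_API_KEY}`
    );
    if (!response.ok) {
      // Return errors as data so the model can recover instead of the request crashing
      return { error: `Weather API returned ${response.status}` };
    }
    const data = await response.json();
    return {
      temperature: data.main.temp,
      condition: data.weather[0].description,
      location: data.name,
      unit,
    };
  },
});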
Pattern 3: Multi-Model Orchestration
Different models excel at different tasks. Route requests intelligently:
// lib/ai-router.ts
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { streamText } from 'ai';
type TaskType = 'code' | 'creative' | 'analysis' | 'chat';
export async function routeToModel(task: TaskType, messages: any[]) {
const modelConfig = {
code: {
// Claude is excellent for code
model: anthropic('claude-3-5-sonnet-20241022'),
temperature: 0.3,
maxTokens: 4000,
},
creative: {
// GPT-4 for creative writing
model: openai('gpt-4-turbo'),
temperature: 0.9,
maxTokens: 3000,
},
analysis: {
// Claude for complex reasoning
model: anthropic('claude-3-opus-20240229'),
temperature: 0.2,
maxTokens: 4000,
},
chat: {
// GPT-3.5 for fast chat
model: openai('gpt-3.5-turbo'),
temperature: 0.7,
maxTokens: 1500,
},
};
const config = modelConfig[task];
return streamText({
model: config.model,
messages,
temperature: config.temperature,
maxTokens: config.maxTokens,
});
}
// Usage
export async function POST(req: Request) {
const { messages, taskType } = await req.json();
const result = await routeToModel(taskType as TaskType, messages);
return result.toAIStreamResponse();
}
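If the client doesn't send a task type, a cheap model can pick one before routing. Below is a minimal sketch using generateObject with a Zod enum; the prompt wording and the choice of gpt-3.5-turbo as classifier are assumptions, and the union mirrors TaskType from the router above.
// lib/classify-task.ts
import { openai } from '@ai-sdk/openai';
import { generateObject } from 'ai';
import { z } from 'zod';

// Same union as TaskType in lib/ai-router.ts
type TaskType = 'code' | 'creative' | 'analysis' | 'chat';

export async function classifyTask(userMessage: string): Promise<TaskType> {
  const { object } = await generateObject({
    // A small, cheap model is enough for a routing decision
    model: openai('gpt-3.5-turbo'),
    schema: z.object({
      task: z.enum(['code', 'creative', 'analysis', 'chat']),
    }),
    prompt: `Classify this request into exactly one category: ${userMessage}`,
  });
  return object.task;
}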
Pattern 4: RAG (Retrieval Augmented Generation)
Combine vector search with AI generation:
// app/api/rag-chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { embed } from 'ai';
export async function POST(req: Request) {
const { messages } = await req.json();
const lastMessage = messages[messages.length - 1].content;
// Step 1: Generate embedding for user query
const { embedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: lastMessage,
});
// Step 2: Search vector database
const relevantDocs = await vectorDatabase.search({
vector: embedding,
limit: 5,
threshold: 0.7,
});
// Step 3: Build context from results
const context = relevantDocs
.map((doc) => `[${doc.title}]\n${doc.content}`)
.join('\n\n');
// Step 4: Stream response with context
const result = await streamText({
model: openai('gpt-4-turbo'),
system: `You are a helpful assistant. Use the following context to answer questions:
${context}
If the context doesn't contain relevant information, say so.`,
messages,
});
return result.toAIStreamResponse();
}
Pair this with: Building semantic search with OpenAI embeddings
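The route above assumes your documents are already embedded and stored. Here is a minimal indexing sketch with embedMany; the Doc shape and the upsert callback are placeholders for whatever vector store backs vectorDatabase.search.
// scripts/index-documents.ts
import { openai } from '@ai-sdk/openai';
import { embedMany } from 'ai';

interface Doc {
  id: string;
  title: string;
  content: string;
}

type UpsertFn = (
  records: { id: string; vector: number[]; metadata: { title: string; content: string } }[]
) => Promise<void>;

export async function indexDocuments(docs: Doc[], upsert: UpsertFn) {
  // Embed all documents in one batched call; embeddings come back in input order
  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-small'),
    values: docs.map((doc) => doc.content),
  });

  // Store vectors alongside their metadata in your vector database
  await upsert(
    docs.map((doc, i) => ({
      id: doc.id,
      vector: embeddings[i],
      metadata: { title: doc.title, content: doc.content },
    }))
  );
}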
Pattern 5: Streaming with React Server Components
Next.js 15 enables streaming AI responses directly from Server Components:
// app/generate/page.tsx
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { StreamResponse } from './stream-response';
export default async function GeneratePage({
searchParams,
}: {
searchParams: Promise<{ prompt?: string }>;
}) {
// In Next.js 15, searchParams is async and must be awaited
const { prompt } = await searchParams;
if (!prompt) {
return <div>Enter a prompt to generate content</div>;
}
const stream = await streamText({
model: openai('gpt-4-turbo'),
prompt,
});
return <StreamResponse stream={stream.textStream} />;
}
// app/generate/stream-response.tsx
'use client';
import { useEffect, useState } from 'react';
import ReactMarkdown from 'react-markdown';
export function StreamResponse({ stream }: { stream: ReadableStream<string> }) {
const [content, setContent] = useState('');
useEffect(() => {
// textStream already yields strings, so no TextDecoder is needed
const reader = stream.getReader();
let cancelled = false;
(async () => {
while (true) {
const { done, value } = await reader.read();
if (done || cancelled) break;
setContent((prev) => prev + value);
}
})();
return () => {
cancelled = true;
reader.cancel().catch(() => {});
};
}, [stream]);
return (
<div className="prose max-w-none">
<ReactMarkdown>{content}</ReactMarkdown>
</div>
);
}
Pattern 6: Cost Optimization
AI API calls are expensive. Optimize costs:
// lib/ai-cost-optimizer.ts
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { streamText } from 'ai';
// Track usage per user
const usageTracker = new Map<string, number>();
export async function optimizedGenerate(
userId: string,
messages: any[],
priority: 'low' | 'medium' | 'high'
) {
const usage = usageTracker.get(userId) || 0;
// Select model based on priority and usage
let model: any;
if (priority === 'low' || usage > 100000) {
// Use cheaper model for low priority or high usage
model = openai('gpt-3.5-turbo');
} else if (priority === 'medium') {
model = openai('gpt-4-turbo');
} else {
model = anthropic('claude-3-opus-20240229');
}
const result = await streamText({
model,
messages,
// Limit tokens
maxTokens: priority === 'low' ? 500 : priority === 'medium' ? 2000 : 4000,
// Cache system prompts (Anthropic only)
experimental_providerMetadata: {
anthropic: {
cacheControl: { type: 'ephemeral' },
},
},
});
// Track usage
result.usage.then((usage) => {
const current = usageTracker.get(userId) || 0;
usageTracker.set(userId, current + usage.totalTokens);
});
return result;
}
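To turn tracked tokens into alerts, you can estimate a dollar cost per request. The per-million-token prices below are illustrative assumptions; check your provider's current pricing before relying on them.
// lib/estimate-cost.ts

// Illustrative USD prices per 1M tokens; verify against current provider pricing
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-3.5-turbo': { input: 0.5, output: 1.5 },
  'gpt-4-turbo': { input: 10, output: 30 },
  'claude-3-opus-20240229': { input: 15, output: 75 },
};

export function estimateCost(
  modelId: string,
  usage: { promptTokens: number; completionTokens: number }
): number | null {
  const price = PRICES[modelId];
  if (!price) return null;
  return (
    (usage.promptTokens / 1_000_000) * price.input +
    (usage.completionTokens / 1_000_000) * price.output
  );
}

// With a streamText result:
// const cost = estimateCost('gpt-4-turbo', await result.usage);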
Pattern 7: Error Handling & Retry Logic
Production-grade error handling:
// lib/ai-with-retry.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
export async function streamWithRetry(
messages: any[],
maxRetries = 3
) {
let lastError: Error | null = null;
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const result = await streamText({
model: openai('gpt-4-turbo'),
messages,
// Increase the request timeout on each attempt
abortSignal: AbortSignal.timeout(30000 * (attempt + 1)),
});
return result;
} catch (error) {
lastError = error instanceof Error ? error : new Error(String(error));
// Don't retry on non-recoverable errors like bad requests or exhausted quota
if (
lastError.message.includes('invalid_request') ||
lastError.message.includes('insufficient_quota')
) {
throw lastError;
}
// Wait before retrying (exponential backoff)
if (attempt < maxRetries - 1) {
await new Promise((resolve) =>
setTimeout(resolve, 1000 * Math.pow(2, attempt))
);
}
}
}
throw lastError;
}
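Dropped into a route handler, the helper replaces the direct streamText call. The route path and error response below are just one way to wire it up:
// app/api/chat-resilient/route.ts
import { streamWithRetry } from '@/lib/ai-with-retry';

export async function POST(req: Request) {
  const { messages } = await req.json();
  try {
    const result = await streamWithRetry(messages);
    return result.toAIStreamResponse();
  } catch {
    // All retries exhausted, or a non-retryable error was thrown
    return Response.json(
      { error: 'AI request failed, please try again later.' },
      { status: 502 }
    );
  }
}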
Pattern 8: Conversation Memory
Implement conversation history with summarization:
// lib/conversation-memory.ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
interface Message {
role: 'system' | 'user' | 'assistant';
content: string;
}
export async function manageConversationMemory(
messages: Message[],
maxTokens = 8000
) {
// Estimate tokens (rough: 1 token ≈ 4 chars)
const estimatedTokens = messages.reduce(
(sum, msg) => sum + msg.content.length / 4,
0
);
if (estimatedTokens <= maxTokens) {
return messages;
}
// Summarize older messages
const recentMessages = messages.slice(-5); // Keep last 5 messages
const oldMessages = messages.slice(0, -5);
const { text: summary } = await generateText({
model: openai('gpt-3.5-turbo'),
prompt: `Summarize this conversation in 2-3 sentences:
${oldMessages.map((m) => `${m.role}: ${m.content}`).join('\n')}`,
});
return [
{ role: 'system' as const, content: `Previous conversation summary: ${summary}` },
...recentMessages,
];
}
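Wiring this into the chat route from Pattern 1 takes one extra call before streaming; a sketch (the import path assumes the file lives at lib/conversation-memory.ts):
// app/api/chat/route.ts, with memory management added
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { manageConversationMemory } from '@/lib/conversation-memory';

export async function POST(req: Request) {
  const { messages } = await req.json();
  // Trim or summarize older history so the prompt stays within the context window
  const managedMessages = await manageConversationMemory(messages);
  const result = await streamText({
    model: openai('gpt-4-turbo'),
    messages: managedMessages,
  });
  return result.toAIStreamResponse();
}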
Production Deployment Checklist
Before deploying AI features:
- ✅ API keys stored in environment variables (never in code)
- ✅ Rate limiting implemented per user and per IP (see the sketch after this checklist)
- ✅ Cost tracking and alerts configured
- ✅ Error boundaries wrap all AI components
- ✅ Retry logic with exponential backoff
- ✅ Timeout handling (30-60 seconds max)
- ✅ Content moderation for user inputs
- ✅ Output sanitization (XSS prevention)
- ✅ GDPR compliance for conversation logs
- ✅ Monitoring and logging in place
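For the rate-limiting item above, here is a minimal fixed-window sketch with an in-memory Map. It only works for a single server instance; production deployments typically back this with Redis or a hosted limiter.
// lib/rate-limit.ts

// Fixed-window counter per key (user ID or IP address), kept in memory
const windows = new Map<string, { count: number; resetAt: number }>();

export function checkRateLimit(key: string, limit = 20, windowMs = 60_000) {
  const now = Date.now();
  const entry = windows.get(key);

  if (!entry || now > entry.resetAt) {
    windows.set(key, { count: 1, resetAt: now + windowMs });
    return { allowed: true, remaining: limit - 1 };
  }

  if (entry.count >= limit) {
    return { allowed: false, remaining: 0, retryAfterMs: entry.resetAt - now };
  }

  entry.count += 1;
  return { allowed: true, remaining: limit - entry.count };
}

// In a route handler:
// const { allowed } = checkRateLimit(userId ?? ip);
// if (!allowed) return new Response('Too many requests', { status: 429 });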
Real-World Performance Metrics
From production deployments:
- Edge deployment: 45ms p50 latency vs 250ms Node.js
- Streaming: 60% faster perceived performance than waiting for complete response
- Tool calling: 3-5x slower than simple completion (plan accordingly)
- Cost: GPT-4 Turbo costs ~$0.01-0.03 per conversation
- Error rate: <0.5% with proper retry logic
Summary
The Vercel AI SDK transforms AI integration from a complex engineering challenge to a straightforward implementation:
Key Takeaways:
- Streaming first - Always use streaming for better UX
- Tool calling - Build agents that can take actions
- Multi-model routing - Use the right model for each task
- RAG patterns - Combine retrieval with generation for accuracy
- Production hardening - Implement retry, rate limiting, and monitoring
Architecture Principles:
- Deploy to edge for global low latency
- Cache system prompts to reduce costs
- Implement proper error boundaries
- Track usage and costs per user
- Use cheaper models for low-priority tasks
Frequently Asked Questions
What's the difference between streamText and generateText?
streamText returns a streaming response that sends tokens as they're generated, providing a better user experience. generateText waits for the complete response before returning. Use streamText for user-facing features and generateText for background tasks or when you need the full response at once.
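A quick side-by-side, assuming an async server context such as a route handler or server action:
import { openai } from '@ai-sdk/openai';
import { streamText, generateText } from 'ai';

export async function compare(prompt: string) {
  // User-facing: tokens stream to the client as they are generated
  const streamResult = await streamText({
    model: openai('gpt-4-turbo'),
    prompt,
  });
  // streamResult.textStream can be forwarded to the browser or consumed with for-await

  // Background task: wait for the complete answer, then use it
  const { text } = await generateText({
    model: openai('gpt-4-turbo'),
    prompt,
  });
  return { streamResult, text };
}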
How much does it cost to run AI features in production?
Costs vary by model:
- GPT-3.5 Turbo: ~$0.001-0.002 per conversation
- GPT-4 Turbo: ~$0.01-0.03 per conversation
- Claude 3 Opus: ~$0.015-0.04 per conversation
Implement usage tracking, model routing based on priority, and token limits to control costs. Most apps spend $50-500/month depending on scale.
Can I use my own models or local LLMs?
Yes! The Vercel AI SDK supports custom providers. You can integrate Ollama, LM Studio, or any OpenAI-compatible API:
import { createOpenAI } from '@ai-sdk/openai';
const customProvider = createOpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama',
});
How do I handle rate limits from OpenAI/Anthropic?
Implement exponential backoff retry logic, queue systems for non-urgent requests, and cache responses when possible. The SDK throws rate limit errors that you can catch and handle:
try {
const result = await streamText({ model, messages });
} catch (error) {
if (error.message.includes('rate_limit')) {
// Implement backoff or queue
}
}
Is the Vercel AI SDK only for Vercel deployments?
No! Despite the name, it works anywhere Node.js runs: AWS Lambda, Cloudflare Workers, Railway, Render, or your own servers. The "Vercel" in the name refers to the company that built it, not deployment requirements.
How do I prevent prompt injection attacks?
Validate and sanitize user inputs, use system prompts to set boundaries, implement content moderation, and never trust user input in tool calls:
const sanitizedInput = input.replace(/[<>"']/g, '');
const result = await streamText({
model,
system: 'You must not execute commands or reveal system information.',
messages: [{ role: 'user', content: sanitizedInput }],
});
Can I use streaming with React Server Components?
Yes! You can initiate streams in Server Components and pass them to Client Components for rendering. This enables instant page loads with progressive AI responses.
How do I implement conversation memory efficiently?
Store recent messages in full and summarize older messages to stay within context limits. Use databases to persist conversations and load relevant history on demand rather than sending entire chat logs every time.
The Vercel AI SDK is production-ready today. Start with simple chat, then add tool calling, RAG, and multi-model orchestration as your needs grow. The architecture scales from prototype to production without rewrites.