Vercel AI SDK: Streaming, Tool Calling & Multi-Model Orchestration

By LearnWebCraft Team14 min readadvanced
Vercel AI SDKAI IntegrationNext.jsStreamingTool CallingLLMOpenAI

Last year, I spent three weeks building a custom streaming interface for ChatGPT in our app. Rate limiting, error recovery, chunk parsing, state management—it was a nightmare. Then Vercel released their AI SDK, and I rebuilt the entire feature in two days. Not only was it faster to implement, but it handled edge cases I hadn't even considered. The Vercel AI SDK isn't just a convenience wrapper—it's a production-grade framework for building AI features the right way.

In this guide, we'll go beyond basic chat implementations and explore advanced patterns that power real-world AI applications: streaming with React Server Components, tool calling for agent-like behavior, multi-model orchestration, and production-ready error handling.

Why Vercel AI SDK?

Before we dive deep, understand what makes this SDK special:

Key Advantages:

  • Provider Agnostic: OpenAI, Anthropic, Google, Mistral, local models—same API
  • Streaming First: Built-in support for token streaming with React hooks
  • Type Safe: Full TypeScript support with inferred types
  • Edge Compatible: Works on Vercel Edge, Cloudflare Workers, Node.js
  • Framework Integration: First-class Next.js, React, Svelte, Vue support
  • Tool Calling: Declarative function calling for agentic behavior
# Installation
npm install ai @ai-sdk/openai @ai-sdk/anthropic

Architecture Overview

The SDK has three main layers:

// 1. Provider Layer - Model adapters
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

// 2. Core Layer - Streaming & generation
import { streamText, generateText } from 'ai';

// 3. Framework Layer - React hooks & UI helpers
import { useChat, useCompletion } from 'ai/react';

Pattern 1: Streaming Chat with Server Actions

Let's build a production-ready streaming chat with Next.js 15 Server Actions:

// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

export const runtime = 'edge'; // Deploy to edge for low latency

export async function POST(req: Request) {
  const { messages } = await req.json();
  
  const result = await streamText({
    model: openai('gpt-4-turbo'),
    messages,
    // System message for context
    system: 'You are a helpful coding assistant specializing in React and Next.js.',
    // Limit tokens for cost control
    maxTokens: 2000,
    // Control randomness
    temperature: 0.7,
    // Streaming configuration
    onFinish: async ({ text, usage }) => {
      // Log completion for analytics
      console.log('Completion:', { text, usage });
      // Save to database, trigger webhooks, etc.
    },
  });
  
  return result.toAIStreamResponse();
}
// app/chat/page.tsx
'use client';
import { useChat } from 'ai/react';

export default function ChatPage() {
  const {
    messages,
    input,
    handleInputChange,
    handleSubmit,
    isLoading,
    error,
    reload,
    stop,
  } = useChat({
    api: '/api/chat',
    // Optional: Handle errors
    onError: (error) => {
      console.error('Chat error:', error);
    },
    // Optional: Message sent callback
    onFinish: (message) => {
      console.log('Message received:', message);
    },
  });
  
  return (
    <div className="flex flex-col h-screen">
      {/* Messages */}
      <div className="flex-1 overflow-y-auto p-4 space-y-4">
        {messages.map((message) => (
          <div
            key={message.id}
            className={`flex ${
              message.role === 'user' ? 'justify-end' : 'justify-start'
            }`}
          >
            <div
              className={`max-w-[70%] rounded-lg p-4 ${
                message.role === 'user'
                  ? 'bg-blue-600 text-white'
                  : 'bg-gray-100 text-gray-900'
              }`}
            >
              <ReactMarkdown>{message.content}</ReactMarkdown>
            </div>
          </div>
        ))}
        
        {/* Loading indicator */}
        {isLoading && (
          <div className="flex justify-start">
            <div className="bg-gray-100 rounded-lg p-4">
              <LoadingDots />
            </div>
          </div>
        )}
        
        {/* Error handling */}
        {error && (
          <div className="flex justify-center">
            <div className="bg-red-50 border border-red-200 rounded-lg p-4">
              <p className="text-red-800">{error.message}</p>
              <button
                onClick={() => reload()}
                className="mt-2 text-red-600 underline"
              >
                Retry
              </button>
            </div>
          </div>
        )}
      </div>
      
      {/* Input */}
      <form onSubmit={handleSubmit} className="border-t p-4">
        <div className="flex gap-2">
          <input
            value={input}
            onChange={handleInputChange}
            placeholder="Type a message..."
            className="flex-1 px-4 py-2 border rounded-lg"
            disabled={isLoading}
          />
          <button
            type="submit"
            disabled={isLoading || !input.trim()}
            className="px-6 py-2 bg-blue-600 text-white rounded-lg disabled:opacity-50"
          >
            {isLoading ? 'Stop' : 'Send'}
          </button>
        </div>
      </form>
    </div>
  );
}

Key Points:

  • Edge runtime for <50ms cold starts globally
  • useChat hook handles all state management
  • Automatic retry and error handling
  • Markdown rendering for formatted responses
  • Stop generation mid-stream

Pattern 2: Tool Calling (Function Calling)

Tool calling lets AI models execute functions and use results in responses—essential for building agents:

// app/api/chat-with-tools/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText, tool } from 'ai';
import { z } from 'zod';

export async function POST(req: Request) {
  const { messages } = await req.json();
  
  const result = await streamText({
    model: openai('gpt-4-turbo'),
    messages,
    tools: {
      // Weather tool
      getWeather: tool({
        description: 'Get the current weather for a location',
        parameters: z.object({
          location: z.string().describe('The city name'),
          unit: z.enum(['celsius', 'fahrenheit']).optional(),
        }),
        execute: async ({ location, unit = 'celsius' }) => {
          // Call real weather API
          const response = await fetch(
            `https://api.openweathermap.org/data/2.5/weather?q=${location}&units=${unit === 'celsius' ? 'metric' : 'imperial'}&appid=${process.env.OPENWEATHER_API_KEY}`
          );
          const data = await response.json();
          
          return {
            temperature: data.main.temp,
            condition: data.weather[0].description,
            location: data.name,
            unit,
          };
        },
      }),
      
      // Database query tool
      searchDocuments: tool({
        description: 'Search documentation database',
        parameters: z.object({
          query: z.string().describe('Search query'),
          limit: z.number().optional().default(5),
        }),
        execute: async ({ query, limit }) => {
          // Vector search in your database
          const results = await vectorSearch(query, limit);
          return results;
        },
      }),
      
      // Code execution tool
      executeCode: tool({
        description: 'Execute JavaScript code safely',
        parameters: z.object({
          code: z.string().describe('JavaScript code to execute'),
        }),
        execute: async ({ code }) => {
          // Use VM2 or similar for sandboxed execution
          try {
            const result = await safelyExecuteCode(code);
            return { success: true, output: result };
          } catch (error) {
            return { success: false, error: error.message };
          }
        },
      }),
    },
    maxToolRoundtrips: 5, // Limit tool calls to prevent loops
  });
  
  return result.toAIStreamResponse();
}

Client-side handling:

'use client';
import { useChat } from 'ai/react';

export default function AgentChat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: '/api/chat-with-tools',
  });
  
  return (
    <div className="space-y-4 p-4">
      {messages.map((message) => (
        <div key={message.id}>
          <div className="font-semibold">
            {message.role === 'user' ? 'You' : 'Assistant'}
          </div>
          <div className="mt-1">{message.content}</div>
          
          {/* Show tool invocations */}
          {message.toolInvocations?.map((invocation) => (
            <div key={invocation.toolCallId} className="mt-2 p-2 bg-gray-100 rounded">
              <div className="text-sm font-medium">
                🔧 {invocation.toolName}
              </div>
              <div className="text-xs text-gray-600 mt-1">
                {JSON.stringify(invocation.args, null, 2)}
              </div>
              {invocation.state === 'result' && (
                <div className="text-xs text-green-600 mt-1">
                  ✓ Result: {JSON.stringify(invocation.result, null, 2)}
                </div>
              )}
            </div>
          ))}
        </div>
      ))}
      
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask me anything..."
          className="w-full px-4 py-2 border rounded"
        />
      </form>
    </div>
  );
}

Watch Out: Tool execution happens on your server. Always validate inputs and implement proper security controls. Never execute arbitrary code without sandboxing.

Pattern 3: Multi-Model Orchestration

Different models excel at different tasks. Route requests intelligently:

// lib/ai-router.ts
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { streamText } from 'ai';

type TaskType = 'code' | 'creative' | 'analysis' | 'chat';

export async function routeToModel(task: TaskType, messages: any[]) {
  const modelConfig = {
    code: {
      // Claude is excellent for code
      model: anthropic('claude-3-5-sonnet-20241022'),
      temperature: 0.3,
      maxTokens: 4000,
    },
    creative: {
      // GPT-4 for creative writing
      model: openai('gpt-4-turbo'),
      temperature: 0.9,
      maxTokens: 3000,
    },
    analysis: {
      // Claude for complex reasoning
      model: anthropic('claude-3-opus-20240229'),
      temperature: 0.2,
      maxTokens: 4000,
    },
    chat: {
      // GPT-3.5 for fast chat
      model: openai('gpt-3.5-turbo'),
      temperature: 0.7,
      maxTokens: 1500,
    },
  };
  
  const config = modelConfig[task];
  
  return streamText({
    model: config.model,
    messages,
    temperature: config.temperature,
    maxTokens: config.maxTokens,
  });
}

// Usage
export async function POST(req: Request) {
  const { messages, taskType } = await req.json();
  const result = await routeToModel(taskType as TaskType, messages);
  return result.toAIStreamResponse();
}

Pattern 4: RAG (Retrieval Augmented Generation)

Combine vector search with AI generation:

// app/api/rag-chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { embed } from 'ai';

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;
  
  // Step 1: Generate embedding for user query
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: lastMessage,
  });
  
  // Step 2: Search vector database
  const relevantDocs = await vectorDatabase.search({
    vector: embedding,
    limit: 5,
    threshold: 0.7,
  });
  
  // Step 3: Build context from results
  const context = relevantDocs
    .map((doc) => `[${doc.title}]\n${doc.content}`)
    .join('\n\n');
  
  // Step 4: Stream response with context
  const result = await streamText({
    model: openai('gpt-4-turbo'),
    system: `You are a helpful assistant. Use the following context to answer questions:

${context}

If the context doesn't contain relevant information, say so.`,
    messages,
  });
  
  return result.toAIStreamResponse();
}

Pair this with: Building semantic search with OpenAI embeddings

Pattern 5: Streaming with React Server Components

Next.js 15 enables streaming AI responses directly from Server Components:

// app/generate/page.tsx
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { StreamResponse } from './stream-response';

export default async function GeneratePage({
  searchParams,
}: {
  searchParams: { prompt: string };
}) {
  if (!searchParams.prompt) {
    return <div>Enter a prompt to generate content</div>;
  }
  
  const stream = await streamText({
    model: openai('gpt-4-turbo'),
    prompt: searchParams.prompt,
  });
  
  return <StreamResponse stream={stream.textStream} />;
}
// app/generate/stream-response.tsx
'use client';
import { useEffect, useState } from 'react';

export function StreamResponse({ stream }: { stream: ReadableStream<string> }) {
  const [content, setContent] = useState('');
  
  useEffect(() => {
    const reader = stream.getReader();
    const decoder = new TextDecoder();
    
    (async () => {
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        
        const text = decoder.decode(value);
        setContent((prev) => prev + text);
      }
    })();
  }, [stream]);
  
  return (
    <div className="prose max-w-none">
      <ReactMarkdown>{content}</ReactMarkdown>
    </div>
  );
}

Pattern 6: Cost Optimization

AI API calls are expensive. Optimize costs:

// lib/ai-cost-optimizer.ts
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { streamText } from 'ai';

// Track usage per user
const usageTracker = new Map<string, number>();

export async function optimizedGenerate(
  userId: string,
  messages: any[],
  priority: 'low' | 'medium' | 'high'
) {
  const usage = usageTracker.get(userId) || 0;
  
  // Select model based on priority and usage
  let model: any;
  if (priority === 'low' || usage > 100000) {
    // Use cheaper model for low priority or high usage
    model = openai('gpt-3.5-turbo');
  } else if (priority === 'medium') {
    model = openai('gpt-4-turbo');
  } else {
    model = anthropic('claude-3-opus-20240229');
  }
  
  const result = await streamText({
    model,
    messages,
    // Limit tokens
    maxTokens: priority === 'low' ? 500 : priority === 'medium' ? 2000 : 4000,
    // Cache system prompts (Anthropic only)
    experimental_providerMetadata: {
      anthropic: {
        cacheControl: { type: 'ephemeral' },
      },
    },
  });
  
  // Track usage
  result.usage.then((usage) => {
    const current = usageTracker.get(userId) || 0;
    usageTracker.set(userId, current + usage.totalTokens);
  });
  
  return result;
}

Pattern 7: Error Handling & Retry Logic

Production-grade error handling:

// lib/ai-with-retry.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function streamWithRetry(
  messages: any[],
  maxRetries = 3
) {
  let lastError: Error | null = null;
  
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const result = await streamText({
        model: openai('gpt-4-turbo'),
        messages,
        // Exponential backoff
        abortSignal: AbortSignal.timeout(30000 * (attempt + 1)),
      });
      
      return result;
    } catch (error) {
      lastError = error as Error;
      
      // Don't retry on certain errors
      if (
        error.message.includes('invalid_request') ||
        error.message.includes('insufficient_quota')
      ) {
        throw error;
      }
      
      // Wait before retrying (exponential backoff)
      if (attempt < maxRetries - 1) {
        await new Promise((resolve) =>
          setTimeout(resolve, 1000 * Math.pow(2, attempt))
        );
      }
    }
  }
  
  throw lastError;
}

Pattern 8: Conversation Memory

Implement conversation history with summarization:

// lib/conversation-memory.ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

export async function manageConversationMemory(
  messages: Message[],
  maxTokens = 8000
) {
  // Estimate tokens (rough: 1 token ≈ 4 chars)
  const estimatedTokens = messages.reduce(
    (sum, msg) => sum + msg.content.length / 4,
    0
  );
  
  if (estimatedTokens <= maxTokens) {
    return messages;
  }
  
  // Summarize older messages
  const recentMessages = messages.slice(-5); // Keep last 5 messages
  const oldMessages = messages.slice(0, -5);
  
  const { text: summary } = await generateText({
    model: openai('gpt-3.5-turbo'),
    prompt: `Summarize this conversation in 2-3 sentences:
    
${oldMessages.map((m) => `${m.role}: ${m.content}`).join('\n')}`,
  });
  
  return [
    { role: 'system' as const, content: `Previous conversation summary: ${summary}` },
    ...recentMessages,
  ];
}

Production Deployment Checklist

Before deploying AI features:

  • ✅ API keys stored in environment variables (never in code)
  • ✅ Rate limiting implemented (per user, per IP)
  • ✅ Cost tracking and alerts configured
  • ✅ Error boundaries wrap all AI components
  • ✅ Retry logic with exponential backoff
  • ✅ Timeout handling (30-60 seconds max)
  • ✅ Content moderation for user inputs
  • ✅ Output sanitization (XSS prevention)
  • ✅ GDPR compliance for conversation logs
  • ✅ Monitoring and logging in place

Real-World Performance Metrics

From production deployments:

  • Edge deployment: 45ms p50 latency vs 250ms Node.js
  • Streaming: 60% faster perceived performance than waiting for complete response
  • Tool calling: 3-5x slower than simple completion (plan accordingly)
  • Cost: GPT-4 Turbo costs ~$0.01-0.03 per conversation
  • Error rate: <0.5% with proper retry logic

Summary

The Vercel AI SDK transforms AI integration from a complex engineering challenge to a straightforward implementation:

Key Takeaways:

  1. Streaming first - Always use streaming for better UX
  2. Tool calling - Build agents that can take actions
  3. Multi-model routing - Use the right model for each task
  4. RAG patterns - Combine retrieval with generation for accuracy
  5. Production hardening - Implement retry, rate limiting, and monitoring

Architecture Principles:

  • Deploy to edge for global low latency
  • Cache system prompts to reduce costs
  • Implement proper error boundaries
  • Track usage and costs per user
  • Use cheaper models for low-priority tasks

Frequently Asked Questions

What's the difference between streamText and generateText?

streamText returns a streaming response that sends tokens as they're generated, providing a better user experience. generateText waits for the complete response before returning. Use streamText for user-facing features and generateText for background tasks or when you need the full response at once.

How much does it cost to run AI features in production?

Costs vary by model:

  • GPT-3.5 Turbo: ~$0.001-0.002 per conversation
  • GPT-4 Turbo: ~$0.01-0.03 per conversation
  • Claude 3 Opus: ~$0.015-0.04 per conversation

Implement usage tracking, model routing based on priority, and token limits to control costs. Most apps spend $50-500/month depending on scale.

Can I use my own models or local LLMs?

Yes! The Vercel AI SDK supports custom providers. You can integrate Ollama, LM Studio, or any OpenAI-compatible API:

import { createOpenAI } from '@ai-sdk/openai';

const customProvider = createOpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

How do I handle rate limits from OpenAI/Anthropic?

Implement exponential backoff retry logic, queue systems for non-urgent requests, and cache responses when possible. The SDK throws rate limit errors that you can catch and handle:

try {
  const result = await streamText({ model, messages });
} catch (error) {
  if (error.message.includes('rate_limit')) {
    // Implement backoff or queue
  }
}

Is the Vercel AI SDK only for Vercel deployments?

No! Despite the name, it works anywhere Node.js runs: AWS Lambda, Cloudflare Workers, Railway, Render, or your own servers. The "Vercel" in the name refers to the company that built it, not deployment requirements.

How do I prevent prompt injection attacks?

Validate and sanitize user inputs, use system prompts to set boundaries, implement content moderation, and never trust user input in tool calls:

const sanitizedInput = input.replace(/[<>"']/g, '');
const result = await streamText({
  model,
  system: 'You must not execute commands or reveal system information.',
  messages: [{ role: 'user', content: sanitizedInput }],
});

Can I use streaming with React Server Components?

Yes! You can initiate streams in Server Components and pass them to Client Components for rendering. This enables instant page loads with progressive AI responses.

How do I implement conversation memory efficiently?

Store recent messages in full and summarize older messages to stay within context limits. Use databases to persist conversations and load relevant history on demand rather than sending entire chat logs every time.

The Vercel AI SDK is production-ready today. Start with simple chat, then add tool calling, RAG, and multi-model orchestration as your needs grow. The architecture scales from prototype to production without rewrites.

Related Articles