Running a high-performance AI model like Llama 3 or Phi-3 directly in a web browser used to sound like a pipe dream. We’re talking about no servers, no API keys, and—crucially—zero data leaving the user's device. Just a year ago, this was barely experimental. Today, it’s something you can actually ship.
In this guide, we’re going to get our hands dirty with WebLLM and React 19 to build a privacy-first AI chat application. We aren't just making API calls to OpenAI here; we are turning the user's browser into the inference engine itself.
Why bother? Because the future of AI isn't strictly in the cloud—it's moving to the edge. By handling intelligence on the client side, you eliminate network round-trips, cut inference costs to essentially zero, and offer a level of privacy that server-based solutions simply can't match.
We'll be using React 19 because its latest features are uniquely suited for the kind of heavy, asynchronous state updates needed for streaming AI responses. Whether you've been working with React for years or you're just starting to explore local AI, this tutorial will walk you through building something that feels like the future.
Let's build something cool.
Introduction: The Power of Local AI in the Browser
For the last decade or so, web development has leaned heavily on a "thin client" architecture. The browser was mostly just a display for logic that ran on powerful, remote servers. AI followed that same path—when you type into ChatGPT, your text travels to a massive data center, gets processed, and comes back.
But the pendulum is swinging back.
Thanks to WebGPU, modern browsers can now tap directly into the raw power of your computer's Graphics Processing Unit (GPU). This opens the door to running heavy machine learning workloads right inside Chrome, Edge, or Firefox.
Why go local?
- Privacy: This is the big one. If you are building for healthcare, finance, or even personal journaling, users are often terrified of sending data to the cloud. With local AI, the data stays on their machine. Period.
- Cost: LLM APIs get expensive quickly. By offloading the "thinking" to the user's device, your backend costs drop to near zero.
- Offline Capability: Once the model is cached, it works anywhere—on a plane, in a subway, or in a cabin without Wi-Fi.
- Latency: There are no network round-trips. Once the model is loaded, the interaction feels incredibly snappy.
When you pair this with React 19, which is optimized for smooth UI transitions, you get a toolkit that lets you build experiences that feel native, responsive, and surprisingly smart.
Understanding WebLLM: Local AI Models for the Web
Before we start coding, it helps to know what's happening under the hood. WebLLM is a high-performance inference engine for the browser. It’s built on top of WebGPU and uses the Apache TVM (Tensor Virtual Machine) compiler stack so that these models run efficiently from JavaScript.
How it works
WebLLM doesn't just "run" a Python model. It uses a workflow called MLC (Machine Learning Compilation).
- Compilation: Models like Llama-3 or Gemma are pre-compiled into a format that WebGPU can actually understand.
- Weights Fetching: The browser downloads the model weights (the AI's "brain"). These are big binary files, usually split into smaller chunks or "shards."
- Inference: The engine maps these weights to the GPU and handles the complex math needed to generate text.
The best part? WebLLM abstracts the messy stuff away. It gives us a clean, JavaScript-friendly API that looks a lot like the OpenAI SDK. If you’ve ever used chat.completions.create, you already know how to use WebLLM.
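To make that concrete, here's roughly what a one-off, non-streaming completion looks like (using the same Phi-3 model we'll load later; top-level await is shown purely for brevity):
// A minimal, non-streaming call with WebLLM's OpenAI-style chat API.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads and compiles the model on first run (cached afterwards)
const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
});

console.log(reply.choices[0].message.content);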
Note on Hardware: WebLLM is optimized, but it still relies on the user's hardware. Someone with a dedicated NVIDIA GPU will have a blazing fast experience. Someone on an older budget laptop might see slower speeds. We'll cover how to handle this gracefully in the UI section.
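In the meantime, a simple first line of defense is feature detection: check whether WebGPU is available at all before trying to load a model. A minimal sketch (installing @webgpu/types gives you proper navigator.gpu typings instead of the loose cast used here):
// Returns true only if the browser exposes WebGPU *and* can hand us an adapter.
export async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as unknown as { gpu?: { requestAdapter(): Promise<object | null> } }).gpu;
  if (!gpu) return false;
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}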
Why React 19? New Features for AI Development
You might be wondering, "Can't I just stick with React 18?" You absolutely can. But React 19 brings a few specific improvements that make building AI interfaces a lot less painful.
1. The React Compiler
React 19 introduces an automatic optimizing compiler. In the past, we had to manually wrestle with useMemo and useCallback to stop unnecessary re-renders—which is a huge deal when you are streaming text token-by-token at 50 times a second. The React Compiler handles this optimization for us, keeping the UI buttery smooth without the manual overhead.
2. The use API
Loading a multi-gigabyte AI model is a massive asynchronous operation. Historically, handling this in React involved a lot of boilerplate state management. The new use API lets us read the value of a Promise directly inside the render function. It pairs perfectly with Suspense, letting us declaratively show "Loading Model..." skeletons while the heavy lifting happens in the background.
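Here's an illustrative sketch of that pattern, assuming a module-level promise that kicks off the model download as soon as the file is imported (in this tutorial we'll actually use a useEffect-based hook, shown later):
// Sketch: reading a model-loading promise with React 19's `use` API.
import { Suspense, use } from "react";
import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// Created once at module scope so the promise stays stable across renders
const enginePromise: Promise<MLCEngine> = CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC");

function ModelStatus() {
  // Suspends this component until the model download/compile finishes
  use(enginePromise);
  return <p>Model loaded and ready to chat.</p>;
}

export default function App() {
  return (
    <Suspense fallback={<p>Loading Model...</p>}>
      <ModelStatus />
    </Suspense>
  );
}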
3. Actions and Optimistic Updates
React 19's Actions are great for managing form submissions—like sending a chat message. They give us built-in pending states, which simplifies the logic for disabling the "Send" button or showing a spinner while the AI is "thinking."
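As a rough sketch (sendToModel here is a hypothetical wrapper around engine.chat.completions.create, not a WebLLM API):
// Sketch: a React 19 Action that submits a chat prompt with a built-in pending state.
import { useActionState } from "react";

type Props = { sendToModel: (prompt: string) => Promise<void> };

export function ChatForm({ sendToModel }: Props) {
  const [error, submitAction, isPending] = useActionState(
    async (_previousError: string | null, formData: FormData) => {
      const prompt = String(formData.get("prompt") ?? "").trim();
      if (!prompt) return "Please type a message first.";
      try {
        await sendToModel(prompt);
        return null; // no error
      } catch {
        return "Generation failed. Check the console for details.";
      }
    },
    null as string | null
  );

  return (
    <form action={submitAction}>
      <input name="prompt" placeholder="Type a message..." disabled={isPending} />
      {/* isPending gives us the "AI is thinking" state for free */}
      <button type="submit" disabled={isPending}>
        {isPending ? "Thinking..." : "Send"}
      </button>
      {error && <p role="alert">{error}</p>}
    </form>
  );
}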
Setting Up Your React 19 Project for WebLLM
Enough theory. Let's get our environment set up. We'll use Vite to scaffold a fresh project.
Prerequisites
- Node.js (v18 or higher is best)
- A browser that supports WebGPU (Chrome 113+, Edge, or Firefox Nightly)
Step 1: Initialize the Project
Fire up your terminal and create a new Vite project. We'll stick with the React TypeScript template.
npm create vite@latest local-ai-chat -- --template react-ts
cd local-ai-chat
npm install
Step 2: Install Dependencies
We need the WebLLM library itself. I also highly recommend a few utility libraries to handle Markdown (since LLMs love outputting code blocks) and icons.
npm install @mlc-ai/web-llm markdown-to-jsx lucide-react
- @mlc-ai/web-llm: The engine that runs the models.
- markdown-to-jsx: Makes the AI's output look good (bold text, code blocks, etc.).
- lucide-react: For clean UI icons.
Step 3: Configure TypeScript (Optional but Recommended)
Since WebLLM pushes the boundaries of browser tech, make sure your tsconfig.json targets a modern ECMAScript version.
{
"compilerOptions": {
"target": "ESNext",
"lib": ["DOM", "DOM.Iterable", "ESNext"],
// ... other options
}
}
Integrating WebLLM: Loading Models and Running Inference
This is the core of our app. We need a way to load the model engine and keep track of its status.
Because loading a model is a heavy, one-time event, it makes sense to wrap this logic in a custom hook. Let's create useLLM.ts.
Creating the useLLM Hook
Create a file at src/hooks/useLLM.ts.
import { useState, useEffect, useRef, useCallback } from 'react';
import { CreateMLCEngine, MLCEngine, InitProgressReport } from "@mlc-ai/web-llm";
// We'll use a smaller model for broader compatibility
// You can swap this for "Llama-3-8B-Instruct-q4f32_1-MLC" if you have a beefy GPU
const SELECTED_MODEL = "Phi-3-mini-4k-instruct-q4f16_1-MLC";
export type Message = {
role: 'user' | 'assistant' | 'system';
content: string;
};
export const useLLM = () => {
const [engine, setEngine] = useState<MLCEngine | null>(null);
const [progress, setProgress] = useState<string>("");
const [isLoading, setIsLoading] = useState(false);
const [isModelLoaded, setIsModelLoaded] = useState(false);
// Initialize the engine
useEffect(() => {
const initEngine = async () => {
try {
setIsLoading(true);
// This callback updates us on the download progress
const initProgressCallback = (report: InitProgressReport) => {
setProgress(report.text);
};
const engineInstance = await CreateMLCEngine(
SELECTED_MODEL,
{ initProgressCallback }
);
setEngine(engineInstance);
setIsModelLoaded(true);
setIsLoading(false);
} catch (error) {
console.error("Failed to load model:", error);
setIsLoading(false);
setProgress("Error loading model. Check console for details.");
}
};
if (!engine && !isLoading) {
initEngine();
}
}, []); // Run once on mount
return {
engine,
progress,
isLoading,
isModelLoaded
};
};
Breaking it down
- CreateMLCEngine: This is the factory function from WebLLM. It fetches the config, downloads the binary weights, and compiles the WebGPU shaders.
- initProgressCallback: This is vital for good UX. Since we are downloading around 2GB of data, we need to tell the user exactly what's happening (e.g., "Fetching param shard 12/24").
- State Management: We track isModelLoaded so we can keep the chat interface disabled until the "brain" is fully ready.
Building the User Interface: A Local AI Chat Example
Now that the engine logic is handled, let's build the React components. We want a clean, modern chat interface.
The Chat Component
In src/App.tsx, we'll implement the chat logic. The tricky part here is handling the message history and the streaming response updates.
import React, { useState, useRef, useEffect } from 'react';
import Markdown from 'markdown-to-jsx';
import { Send, Loader2, Bot, User } from 'lucide-react';
import { useLLM, Message } from './hooks/useLLM';
function App() {
const { engine, progress, isModelLoaded } = useLLM();
const [messages, setMessages] = useState<Message[]>([
{ role: 'system', content: 'You are a helpful, privacy-focused AI assistant running locally in the browser.' }
]);
const [input, setInput] = useState('');
const [isGenerating, setIsGenerating] = useState(false);
// Ref for auto-scrolling
const messagesEndRef = useRef<HTMLDivElement>(null);
const scrollToBottom = () => {
messagesEndRef.current?.scrollIntoView({ behavior: "smooth" });
};
useEffect(() => {
scrollToBottom();
}, [messages]);
const handleSend = async () => {
if (!input.trim() || !engine || isGenerating) return;
const userMessage: Message = { role: 'user', content: input };
setMessages(prev => [...prev, userMessage]);
setInput('');
setIsGenerating(true);
try {
// Prepare the messages for the AI
// We create a temporary "assistant" message to hold the streaming response
const assistantMessage: Message = { role: 'assistant', content: '' };
setMessages(prev => [...prev, assistantMessage]);
const chunks = await engine.chat.completions.create({
messages: [...messages, userMessage],
stream: true, // Enable streaming!
temperature: 0.7,
});
let fullResponse = "";
for await (const chunk of chunks) {
const content = chunk.choices[0]?.delta?.content || "";
fullResponse += content;
// Update the last (assistant) message with the new token.
// Build a new message object rather than mutating the existing one in state.
setMessages(prev => {
const newMessages = [...prev];
newMessages[newMessages.length - 1] = { role: 'assistant', content: fullResponse };
return newMessages;
});
}
} catch (err) {
console.error("Generation failed", err);
} finally {
setIsGenerating(false);
}
};
// Render Logic
if (!isModelLoaded) {
return (
<div className="min-h-screen flex flex-col items-center justify-center bg-gray-900 text-white p-4">
<Loader2 className="w-12 h-12 animate-spin text-blue-500 mb-4" />
<h2 className="text-xl font-bold mb-2">Initializing Local AI...</h2>
<p className="text-gray-400 text-center max-w-md">{progress}</p>
<p className="text-xs text-gray-500 mt-4">This downloads ~2GB of data. It happens only once.</p>
</div>
);
}
return (
<div className="flex flex-col h-screen bg-gray-950 text-gray-100">
{/* Header */}
<header className="p-4 border-b border-gray-800 bg-gray-900 flex items-center gap-3">
<div className="w-3 h-3 bg-green-500 rounded-full animate-pulse"></div>
<h1 className="font-semibold text-lg">Local AI (Phi-3)</h1>
</header>
{/* Messages Area */}
<div className="flex-1 overflow-y-auto p-4 space-y-6">
{messages.filter(m => m.role !== 'system').map((msg, idx) => (
<div key={idx} className={`flex gap-4 ${msg.role === 'user' ? 'justify-end' : 'justify-start'}`}>
{msg.role === 'assistant' && (
<div className="w-8 h-8 rounded-full bg-blue-600 flex items-center justify-center shrink-0">
<Bot size={16} />
</div>
)}
<div className={`max-w-[80%] p-4 rounded-lg ${
msg.role === 'user'
? 'bg-blue-600 text-white'
: 'bg-gray-800 text-gray-100 border border-gray-700'
}`}>
<Markdown options={{ forceBlock: true }}>{msg.content}</Markdown>
</div>
{msg.role === 'user' && (
<div className="w-8 h-8 rounded-full bg-gray-600 flex items-center justify-center shrink-0">
<User size={16} />
</div>
)}
</div>
))}
{isGenerating && messages[messages.length - 1].role === 'user' && (
<div className="flex gap-4">
<div className="w-8 h-8 rounded-full bg-blue-600 flex items-center justify-center">
<Loader2 size={16} className="animate-spin" />
</div>
<div className="bg-gray-800 p-4 rounded-lg animate-pulse w-24 h-10"></div>
</div>
)}
<div ref={messagesEndRef} />
</div>
{/* Input Area */}
<div className="p-4 bg-gray-900 border-t border-gray-800">
<div className="max-w-3xl mx-auto relative">
<input
type="text"
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => e.key === 'Enter' && handleSend()}
placeholder="Type a message..."
disabled={isGenerating}
className="w-full bg-gray-800 border-gray-700 border rounded-xl pl-4 pr-12 py-4 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:border-transparent disabled:opacity-50"
/>
<button
onClick={handleSend}
disabled={isGenerating || !input.trim()}
className="absolute right-2 top-2 p-2 bg-blue-600 rounded-lg hover:bg-blue-700 disabled:bg-gray-700 disabled:cursor-not-allowed transition-colors"
>
<Send size={20} />
</button>
</div>
</div>
</div>
);
}
export default App;
Key Implementation Details
- Streaming Updates: Check out that for await loop. WebLLM returns an async iterable. As each "chunk" (token) arrives, we append it to our message string and trigger a state update. This creates that satisfying "typing" effect users expect.
- React 19 Rendering: Thanks to React 19's concurrent features, rapidly updating the state (dozens of times per second) as tokens fly in remains smooth. It doesn't block the browser's main thread nearly as much as it used to.
- Markdown Parsing: We wrap the output in <Markdown>. If the AI writes a Python script or a bulleted list, it renders beautifully instead of as raw text.
Optimizing Performance and User Experience
Getting the app running is step one. Making it feel "production-ready" is step two. Local AI is resource-intensive, so we need to be smart about how we handle it.
1. Web Workers: Offloading the Brain
Right now, our code runs on the main thread. If the GPU is working overtime, the UI might stutter. The pro move is to shift WebLLM into a Web Worker.
React 19 makes handling the async communication easier, but the concept remains the same: keep the UI thread free for the user.
Brief implementation sketch:
- Create llm.worker.ts.
- Move the CreateMLCEngine logic there.
- Use postMessage to send prompts from the UI to the Worker.
- The Worker sends back tokens via postMessage.
WebLLM actually provides a helper class called WebWorkerMLCEngine specifically for this scenario.
// In your worker file
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => {
handler.onmessage(msg);
};
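On the main thread, you then create the engine with CreateWebWorkerMLCEngine instead of CreateMLCEngine; it exposes the same chat.completions API, but the heavy lifting happens inside the worker. Roughly (the worker path depends on how you named your file; top-level await for brevity):
// Main-thread side: same API surface, but inference runs in the Web Worker.
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./llm.worker.ts", import.meta.url), { type: "module" }),
  "Phi-3-mini-4k-instruct-q4f16_1-MLC",
  { initProgressCallback: (report) => console.log(report.text) }
);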
2. Caching Strategies
The first load is the biggest hurdle. The model files are automatically cached in the browser's Cache Storage API by WebLLM. However, manage user expectations:
- "First load takes 2 minutes."
- "Subsequent loads are instant."
3. Memory Management
If a user navigates away or unmounts the component, you should call engine.unload(). This frees up the GPU VRAM. In React, you'd handle this in the cleanup function of your useEffect.
// Keep the engine instance in a ref so the cleanup closure can reach it.
// (With an empty dependency array, the `engine` state captured here would still be null.)
const engineRef = useRef<MLCEngine | null>(null);
useEffect(() => {
// init logic... (set engineRef.current once CreateMLCEngine resolves)
return () => {
// Cleanup when component unmounts
engineRef.current?.unload();
};
}, []);
Deployment Strategies for Browser-Based AI Apps
Deploying a local AI app is actually simpler than a server-side one because, at the end of the day, it's just a static site! However, there is one major "gotcha" involving headers.
Cross-Origin Isolation (COOP and COEP)
WebGPU requires a secure context (HTTPS), and SharedArrayBuffer (which high-performance, multithreaded WASM builds rely on) additionally requires cross-origin isolation, which you enable with two specific HTTP headers.
If you deploy to Vercel or Netlify, you must add these headers to your configuration (vercel.json or netlify.toml):
// vercel.json
{
"headers": [
{
"source": "/(.*)",
"headers": [
{ "key": "Cross-Origin-Opener-Policy", "value": "same-origin" },
{ "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" }
]
}
]
}
If you skip this, you’ll likely see a console error saying SharedArrayBuffer is not defined, and the app simply won't work.
CDN and Model Hosting
By default, WebLLM pulls models from the MLC AI HuggingFace repository. This works fine for demos. For production, you might want to host the model weights on your own CDN (like AWS CloudFront or Cloudflare R2) to avoid rate limits and ensure faster downloads for your specific region. You can configure appConfig in WebLLM to point to your custom URLs.
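A rough sketch of what that might look like: the URLs below are placeholders, and the field names follow the shape of WebLLM's prebuiltAppConfig at the time of writing, so verify them against the ModelRecord type in your installed version.
// Hypothetical self-hosted model config; cdn.example.com is a placeholder.
import { CreateMLCEngine, prebuiltAppConfig } from "@mlc-ai/web-llm";

const appConfig = {
  ...prebuiltAppConfig,
  model_list: [
    {
      model: "https://cdn.example.com/models/Phi-3-mini-4k-instruct-q4f16_1-MLC/",
      model_id: "Phi-3-mini-4k-instruct-q4f16_1-MLC",
      model_lib: "https://cdn.example.com/libs/Phi-3-mini-4k-instruct-q4f16_1-webgpu.wasm",
    },
  ],
};

const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", { appConfig });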
Future Prospects: The Evolution of Client-Side AI
We are still in the early innings of local web AI. Here is what I see coming down the pipeline:
- Smaller, Smarter Models: Models like Phi-3 and Gemma-2B are proving that you don't need 70 billion parameters to be useful. As distillation techniques improve, we will get GPT-4 level reasoning in 2GB packages.
- NPU Integration: Apple's Neural Engine and Intel's NPU are specialized chips for AI. Web standards are evolving to tap into these chips specifically, which will reduce battery drain significantly compared to running everything on the GPU.
- Hybrid AI: Apps will likely use local AI for privacy-sensitive or low-latency tasks (like autocomplete or grammar checking) and seamlessly fall back to the cloud for heavy reasoning tasks.
Conclusion: Building Intelligent, Private Web Applications
Integrating WebLLM with React 19 opens up a massive new frontier for web development. You aren't just building interfaces anymore; you are deploying intelligence.
By following this guide, you've built a chat application that respects user privacy, costs nothing to run per-token, and works completely offline. This architecture is perfect for:
- Legal/Medical Apps: Where data sensitivity is paramount.
- Educational Tools: That need to work in remote areas with spotty internet.
- Creative Writing Aids: That offer instant, zero-latency feedback.
The tools are here. The browser is ready. Now it's time to build.
Frequently Asked Questions
Can I run this on mobile browsers? Yes, provided the mobile browser supports WebGPU. Android (Chrome) has decent support on modern high-end devices. iOS support is currently experimental and often requires enabling specific flags in Safari settings, though Apple is moving fast on this.
How much RAM does the user need? It depends heavily on the model. A "quantized" 4-bit model like Llama-3-8B typically eats up about 4GB to 6GB of VRAM (GPU memory) or shared system memory. A smaller model like Phi-3-mini can run comfortably on a machine with 4GB of RAM total.
Is WebLLM faster than server-side APIs? In terms of latency (time to first token), it can be faster because there is no network request. However, the throughput (tokens per second) is usually lower than what you'd get from a massive server cluster. For a single user, it's usually "fast enough" (20-50 tokens/sec).
Does this work with Next.js? Absolutely. However, since WebGPU is a browser-only API, you have to make sure your components are Client Components ("use client") and that you aren't trying to initialize the engine during Server-Side Rendering (SSR). Wrapping the initialization in a useEffect usually solves this.
What happens if I close the tab? The model is unloaded from memory immediately. The downloaded weights stay in the browser cache, so reopening the tab won't require re-downloading the files, but the model will need to be re-initialized into RAM.