Local ML Experiment

Running AI models locally, entirely in your browser

This experiment explores running machine learning models entirely in the browser, using WebGPU to accelerate inference on your own hardware.

What I'm Exploring

  • Model Performance: Testing various open-source models (Llama, Mistral, Phi) on consumer hardware
  • Quantization: Exploring how model size affects performance and quality
  • Use Cases: Building practical applications with local AI
  • Privacy Benefits: Keeping all data processing local and secure
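On the quantization point: a rough rule of thumb (my own approximation, not an exact figure, since real model files add tokenizer and metadata overhead) is that download size is about parameter count × bits per weight:

```javascript
// Rough download-size estimate for a quantized model.
// params: number of parameters; bits: bits per weight after quantization.
// This ignores tokenizer files and per-layer overhead, so treat it as a floor.
function estimateModelBytes(params, bits) {
  return (params * bits) / 8;
}

// A 0.5B-parameter model at 4-bit precision:
const bytes = estimateModelBytes(0.5e9, 4);
console.log(`${(bytes / 1e6).toFixed(0)} MB`); // ≈ 250 MB before overhead
```

That floor is why a 4-bit 0.5B model can land in the few-hundred-megabyte range rather than gigabytes.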

What This Demo Is (and Isn't)

This is an educational experiment exploring the possibilities of running AI models locally in the browser. Here's what you should expect:

  • ✅ It is: A demonstration of browser-based AI using small, efficient models
  • ✅ It is: Completely private - all processing happens on your device
  • ✅ It is: A proof of concept for local AI capabilities

This isn't trying to be:

  • ❌ It's not: A replacement for ChatGPT or other large-scale AI services
  • ❌ It's not: Optimized for complex reasoning or professional tasks
  • ❌ It's not: As capable as models with billions of parameters, so don't be surprised by the occasionally goofy responses

The models used here are intentionally small (0.5-1.5B parameters) so they run efficiently within browser constraints. Think of this as a glimpse into a future where personal devices run AI locally, not as a complete solution for all AI needs.

Some of the use cases I'm thinking about are:

  • Autocomplete for search queries without complex client-side or server-side code
  • Personalized content generation with minimal code
  • Personalized recommendations with minimal code
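For the autocomplete idea, the client-side shape is small: debounce keystrokes, then ask whatever local model is loaded for a short completion. A minimal sketch, where `complete` is a placeholder standing in for a real model call and all names are mine:

```javascript
// Debounce wrapper: only fires after `ms` of keyboard silence,
// so we don't run the model on every single keystroke.
function debounce(fn, ms) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}

// Placeholder for a local model call; in a real version this would
// invoke a text-generation pipeline with a small max_new_tokens.
async function complete(query) {
  return `${query} tutorial`;
}

const suggest = debounce(async (query) => {
  const suggestion = await complete(query);
  console.log(suggestion); // render into the search box dropdown instead
}, 200);
```

The debounce interval is the main tuning knob: too short and the model thrashes, too long and suggestions feel laggy.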

Live Demo: Model Switcher

Try this interactive demo that lets you switch between different AI models right in your browser! Experience sentiment analysis and text generation running entirely locally.

This experiment requires a modern device with sufficient resources to run AI models locally.
Recommended requirements:

  • WebGPU support
  • 4GB+ RAM
  • 4+ CPU cores
  • Desktop/laptop (not mobile)
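The checks above can be sketched as one function over the browser's capability APIs. `navigator.gpu` is the WebGPU entry point, and `navigator.deviceMemory` (GB, Chromium-only) and `navigator.hardwareConcurrency` report memory and core count; the thresholds mirror the list above, though the exact checks the demo performs are my assumption:

```javascript
// Returns a list of human-readable issues; an empty list means the
// device meets the recommended requirements. Pass the browser's
// `navigator`; a plain object works for testing.
function checkDeviceRequirements(nav) {
  const issues = [];
  if (!nav.gpu) {
    issues.push('WebGPU not supported');
  }
  // deviceMemory is reported in GB and only exists in Chromium browsers,
  // so skip the check when it's undefined rather than failing.
  if (nav.deviceMemory !== undefined && nav.deviceMemory < 4) {
    issues.push('Low device memory (less than 4GB)');
  }
  if (nav.hardwareConcurrency !== undefined && nav.hardwareConcurrency < 4) {
    issues.push('Low CPU core count (less than 4 cores)');
  }
  return issues;
}
```

Note that a missing `navigator.gpu` only tells you WebGPU is unavailable, not why; Safari and Firefox ship it behind flags or in newer releases only.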


Current Models

Qwen2.5-0.5B-Instruct

Size: 380MB | Type: Instruction-tuned

A powerful model with strong reasoning and instruction-following capabilities. Best for complex tasks, detailed responses, and when you need higher quality outputs.

Best for: Chat, reasoning, instruction following

📖 View on Hugging Face

DistilGPT2

Size: 82MB | Type: Text generation

A lightweight, fast model perfect for quick demos and experimentation. While less capable than Qwen, it loads much faster and is great for testing the interface.

Best for: Quick demos, creative writing, testing

📖 View on Hugging Face

How It Works

This demo uses 🤗 Transformers.js to run LLMs directly in your browser using WebGPU. The models are quantized to 4-bit precision to keep them lightweight while maintaining good performance.

The Process:

  1. Model Selection: When you select a model, the browser checks if it's already cached locally
  2. Download (if needed): If not cached, the model files (80-400MB) are downloaded from Hugging Face and stored in your browser's cache
  3. Initialization: The model is loaded into your GPU memory using WebGPU for accelerated inference
  4. Inference: When you type a message, it's tokenized and fed through the neural network entirely on your device
  5. Response Generation: The model generates text token by token, which is displayed in real-time as it's created
  6. Caching: The model stays in memory for quick subsequent responses, and remains cached for future visits
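The steps above map fairly directly onto the Transformers.js API. Here's a hedged sketch of what the inference path might look like; the package name, model id, and options are assumptions based on Transformers.js v3 (check its docs for the current signatures), and `makeAccumulator` is my own helper mirroring the token-by-token display in step 5:

```javascript
// Tiny helper mirroring step 5: accumulate streamed tokens into the
// text shown in the chat window.
function makeAccumulator() {
  let text = '';
  return {
    push: (token) => { text += token; },
    value: () => text,
  };
}

// Steps 1-5 (runs only in a WebGPU-capable browser; the dynamic import
// keeps this sketch loadable in other environments).
async function chat(prompt) {
  const { pipeline, TextStreamer } = await import('@huggingface/transformers');

  // Steps 1-3: pipeline() checks the cache, downloads any missing files,
  // and loads the weights onto the GPU via WebGPU.
  const generator = await pipeline(
    'text-generation',
    'onnx-community/Qwen2.5-0.5B-Instruct', // illustrative model id
    { device: 'webgpu', dtype: 'q4' }       // 4-bit quantized weights
  );

  // Step 5: stream tokens to the UI as they are generated.
  const ui = makeAccumulator();
  const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    callback_function: (t) => ui.push(t),
  });

  // Step 4: tokenize the chat and run it through the network on-device.
  await generator(
    [{ role: 'user', content: prompt }],
    { max_new_tokens: 64, streamer }
  );
  return ui.value();
}
```

Step 6 comes for free: the pipeline instance keeps the model in memory, and the downloaded files stay in the browser cache across visits.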

Key Technologies:

  • WebGPU Acceleration: Uses your GPU for faster inference (📖 learn more)
  • On-the-Fly Loading: Models are downloaded and cached as needed
  • Privacy First: Everything runs locally in your browser

Example

Here's a variation of this experiment in CodeSandbox. (Note: it won't run inside the sandbox due to restrictions on service workers and WebGPU, but you can copy the code and run it locally.)

Last updated: March 20, 2026

← Back to Experiments