Run Full AI Models Directly in the Browser with Transformers.js v4
Transformers.js v4 rewrites its WebGPU runtime in C++ compiled to WebAssembly, delivering 10× faster builds, roughly 10% smaller bundles, and up to 4× speedups for BERT‑style models, while adding support for more than 20 new architectures such as Qwen3.5 and enabling offline, privacy‑preserving AI inference directly in the browser.
Transformers.js v4 has been officially released, featuring a complete rewrite of its WebGPU runtime in C++ and compilation to WebAssembly, which makes AI model inference in the browser faster and more stable.
Why browser AI was previously limited
Earlier attempts to run models in browsers suffered from slow speeds or could only handle very small models because browsers relied on CPU‑bound matrix operations, which are inefficient for deep‑learning workloads.
WebGPU, a newer browser standard designed for general‑purpose GPU compute, gives pages direct access to GPU resources, unlike WebGL, which is limited to graphics. This is what makes practical inference speeds possible.
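For orientation, selecting WebGPU in Transformers.js is a one‑line option on the familiar pipeline API. A minimal sketch (the model name is illustrative, and option names may evolve in v4):

```js
import { pipeline } from '@huggingface/transformers';

// Load a small text-classification model and run it on the GPU via WebGPU.
const classifier = await pipeline(
  'text-classification',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english', // illustrative model id
  { device: 'webgpu' }
);

const result = await classifier('Transformers.js makes browser AI practical.');
console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99 }]
```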
What changed in v4: the numbers
Build speed 10× faster: bundling time dropped from 2 seconds in v3 to 200 ms.
Bundle size reduced: overall size shrank by about 10%, with the browser‑specific transformers.web.js package cut by 53%.
BERT‑style models ~4× faster: optimized ONNX operators accelerate tasks such as text classification and sentiment analysis.
Support for 20+ new architectures, including Qwen3.5, DeepSeek‑v3, and models larger than 8B parameters. The team measured GPT‑OSS 20B running at ~60 tokens/second on an M4 Pro Max, respectable desktop‑class performance.
Production‑grade APIs were also added:
ModelRegistry: fine‑grained control over model files, including queries for which files have been downloaded or cached locally.
env.useWasmCache: when enabled, the model is cached after its first load, enabling offline usage.
Improved logging: configurable levels keep console output from getting noisy.
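The release notes are the authority on exact signatures; the article only names these APIs. A loosely hedged sketch of how they might be wired up, with everything beyond standard pipeline usage flagged as an assumption:

```js
import { env, pipeline } from '@huggingface/transformers';

// Assumption: useWasmCache is a boolean flag on env, per the description above.
env.useWasmCache = true; // cache model files after the first load for offline use

// Assumption: a log-level knob exists for the new configurable logging;
// the actual property name may differ.
// env.logLevel = 'warn';

const pipe = await pipeline('text-classification');

// Assumption: ModelRegistry exposes a query for downloaded/cached files,
// something along the lines of: const files = await ModelRegistry.list();
```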
Demo: Qwen3.5 in the browser
The WebML community released an online demo that runs Qwen3.5 entirely in the browser. The first visit downloads the model files (hundreds of megabytes), after which inference is fully local and works without a network connection, with no data sent to any server.
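A hedged sketch of what such a demo does under the hood: the first load downloads and caches the weights, and later loads run fully offline. The model id and dtype here are illustrative, not the demo's actual configuration:

```js
import { pipeline } from '@huggingface/transformers';

// First visit downloads the weights; subsequent loads hit the browser cache.
const generator = await pipeline(
  'text-generation',
  'onnx-community/Qwen2.5-0.5B-Instruct', // illustrative model id
  { device: 'webgpu', dtype: 'q4' }       // 4-bit quantization keeps the download small
);

const out = await generator('Explain WebGPU in one sentence.', { max_new_tokens: 64 });
console.log(out[0].generated_text);
```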
LiquidAI’s LFM2‑VL demo shows a visual‑language model running in the browser, interpreting images and answering questions, demonstrating that WebGPU provides enough compute power for heavier multimodal workloads.
Practical implications
For end users the difference may be subtle, but developers gain new possibilities:
Privacy‑sensitive scenarios – medical, legal, or personal‑diary applications can keep data on‑device.
Zero‑backend AI features – static sites can offer text analysis, speech transcription, or image description without GPU servers, reducing operational costs (see the transcription sketch after this list).
Offline capability – environments with poor connectivity or isolated intranets can still run AI locally.
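As one example of the zero‑backend pattern, a static page can transcribe audio entirely client‑side. A minimal sketch (model id and audio URL are illustrative):

```js
import { pipeline } from '@huggingface/transformers';

// Whisper-tiny running in the browser: no GPU server required.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en', // illustrative model id
  { device: 'webgpu' }
);

// Accepts an audio URL or a Float32Array of 16 kHz samples.
const { text } = await transcriber('https://example.com/clip.wav');
console.log(text);
```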
Model files are stored in the browser’s Cache Storage (visible via DevTools → Application → Cache Storage), using the same mechanism as PWA assets.
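You can inspect those caches yourself with the standard Cache Storage API; run this in the browser console (the cache name shown is only an example):

```js
// List the caches the page has created, then dump the cached model URLs.
const names = await caches.keys();
console.log(names); // e.g. ['transformers-cache']

const cache = await caches.open(names[0]);
const entries = await cache.keys();
entries.forEach((req) => console.log(req.url));
```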
Limitations
Downloading hundreds of megabytes to the browser poses a user‑experience challenge; smaller models reduce download size but also lower capability.
Upgrading from v3
v4 is a major version and not fully backward‑compatible. Migration requires reviewing the release notes. The repository structure switched to a pnpm monorepo; the former 8,000‑line models.js was split into multiple modules. The tokenizer code was extracted into the @huggingface/tokenizers package (gzip size 8.8 kB, zero dependencies). Example code has moved to a separate repository, so old import paths are invalid.
Browser‑side AI has progressed slowly over several years, but v4 represents a clear step forward. WebGPU is becoming a standard feature in modern browsers, and model quantization continues to improve, suggesting the ceiling for on‑device inference has not yet been reached.
Try it yourself
Qwen3.5 browser demo: https://huggingface.co/spaces/webml-community/Qwen3.5-WebGPU
LFM2 visual‑language model: https://huggingface.co/spaces/LiquidAI/LFM2-VL-WebGPU
Transformers.js v4 release notes: https://github.com/huggingface/transformers.js/releases/tag/4.0.0