Device‑Side Real‑Time Multimodal AI: Deep Dive into Two Open‑Source Projects

This article examines two open‑source projects—Parlor for on‑device multimodal inference and Gemma Tuner Multimodal for Apple Silicon fine‑tuning—detailing their architectures, privacy and cost benefits, performance on Apple M3 Pro, hands‑free VAD, streaming TTS, multilingual support, setup steps, and current limitations.

Geek Labs

Multimodal large models are moving from the cloud to edge devices. The following two projects illustrate current practice in local real‑time dialogue and on‑device multimodal fine‑tuning.

Parlor: Running Multimodal AI Locally

Stars: 1,323 | Forks: 131

One‑line description: On macOS or Linux, run real‑time speech + vision multimodal conversation without any network connection.

Problem addressed

Cloud‑based multimodal AI (speech + vision + text) works well but suffers from two drawbacks:

Privacy concerns: audio and video must be uploaded to servers.

Cost issues: continuous API calls incur noticeable expenses.

Parlor solves these by keeping the entire inference pipeline on the local machine, ensuring that microphone and camera data never leave the device.

Technical architecture

Browser (mic + camera)
│
│  WebSocket (PCM audio + JPEG frames)
▼
FastAPI server
├── Gemma 4 E2B via LiteRT‑LM (GPU inference) → speech & vision understanding
└── Kokoro TTS (MLX on Mac, ONNX on Linux) → speech synthesis
│
│  WebSocket (streaming audio chunks)
▼
Browser (playback + text transcript)
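To make the flow concrete, here is a minimal sketch of what such a WebSocket endpoint can look like in FastAPI. The message keys ("audio", "image", "end_of_speech") and the helper functions run_model and synthesize are illustrative assumptions, not Parlor's actual code.

# Sketch of a real‑time multimodal WebSocket loop (illustrative, not Parlor's implementation)
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def run_model(audio_pcm: bytes, jpeg_frame: bytes) -> str:
    """Placeholder for Gemma inference via LiteRT‑LM (assumed interface)."""
    return "Hello! I can see your camera feed."

async def synthesize(sentence: str) -> bytes:
    """Placeholder for Kokoro TTS returning a chunk of PCM audio (assumed interface)."""
    return b"\x00" * 3200

@app.websocket("/ws")
async def dialogue(ws: WebSocket):
    await ws.accept()
    audio_buffer = bytearray()
    latest_frame = b""
    try:
        while True:
            # Client sends JSON messages: {"type": "audio" | "image" | "end_of_speech", "data": hex}
            msg = await ws.receive_json()
            if msg["type"] == "audio":
                audio_buffer += bytes.fromhex(msg["data"])
            elif msg["type"] == "image":
                latest_frame = bytes.fromhex(msg["data"])
            elif msg["type"] == "end_of_speech":
                reply = await run_model(bytes(audio_buffer), latest_frame)
                audio_buffer.clear()
                # Stream the reply sentence by sentence so playback can start early.
                for sentence in reply.split(". "):
                    await ws.send_bytes(await synthesize(sentence))
                await ws.send_json({"type": "transcript", "text": reply})
    except WebSocketDisconnect:
        pass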

Core dependencies:

Gemma 4 E2B: Google’s latest compact multimodal model supporting speech and vision.

Kokoro‑82M: Lightweight TTS model (82 M parameters) with fast local inference.

Silero VAD: Browser‑side voice activity detection for hands‑free interaction.
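Parlor runs Silero VAD in the browser, so detection happens before any audio leaves the page. The same model is also published for PyTorch, and the following sketch shows how its speech‑timestamp utility behaves; the hub call and helper names follow the silero‑vad repository's documented API, but treat the details as an illustration rather than Parlor's code path.

# Sketch: detecting speech segments with Silero VAD in Python (Parlor does this client‑side)
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, *_) = utils

wav = read_audio("question.wav", sampling_rate=16000)          # 16 kHz mono expected
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech)   # e.g. [{'start': 4800, 'end': 52800}] sample offsets of detected speech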

Performance on Apple M3 Pro (measured)

Speech + vision understanding: ~1.8‑2.2 s

Reply generation (≈ 25 tokens): ~0.3 s

Speech synthesis (1‑3 sentences): ~0.3‑0.7 s

End‑to‑end latency: ~2.5‑3.0 s

The GPU decodes at roughly 83 tokens/s; combined with streaming TTS, users hear audio before the full textual reply finishes.
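The overlap works because synthesis is chunked at sentence boundaries while tokens are still being decoded. A rough sketch of that pattern follows; the synthesize and send_audio callables are hypothetical stand‑ins for the TTS and WebSocket layers.

# Sketch: stream TTS per sentence while the LLM is still generating (helper names are hypothetical)
import re

async def stream_reply(token_stream, synthesize, send_audio):
    """token_stream yields text tokens; synthesize() and send_audio() are assumed async callables."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        # As soon as a sentence ends, synthesize and ship it; keep the remainder buffered.
        match = re.search(r"[.!?]\s", buffer)
        if match:
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            await send_audio(await synthesize(sentence))
    if buffer.strip():                      # flush any trailing partial sentence
        await send_audio(await synthesize(buffer))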

Quick start

git clone https://github.com/fikrikarim/parlor.git
cd parlor/src
uv sync
uv run server.py

Open http://localhost:8000, grant camera and microphone permissions, and start speaking. The first run automatically downloads the ~2.6 GB model.

Key highlights

Hands‑free dialogue: Silero VAD detects when speech starts and automatically interrupts the AI (barge‑in); a minimal barge‑in sketch appears after this list.

Streaming TTS: Sentence‑level audio streams while the model is still generating the reply.

Multilingual support: Gemma 4 E2B natively handles many languages, making it suitable for language‑learning scenarios.
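Barge‑in can be implemented by cancelling the in‑flight reply task as soon as the client reports new speech. A minimal asyncio sketch, where the session class and callback names are assumptions rather than Parlor's actual structure:

# Sketch: barge‑in by cancelling the in‑flight reply when the user starts speaking again
import asyncio

class DialogueSession:
    def __init__(self):
        self.reply_task: asyncio.Task | None = None

    def start_reply(self, coro):
        """Launch reply generation + TTS as a cancellable task."""
        self.reply_task = asyncio.create_task(coro)

    def on_speech_start(self):
        """Called when browser‑side VAD signals that the user began talking."""
        if self.reply_task and not self.reply_task.done():
            self.reply_task.cancel()        # stop generating and streaming the old answer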

The GitHub page includes a full demo video showing the complete real‑time speech + vision conversation flow.

GitHub: https://github.com/fikrikarim/parlor

Gemma Tuner Multimodal: Fine‑Tuning on Apple Silicon

Stars: 1,159 | Forks: 71

One‑line description: Fine‑tune Gemma 4 and Gemma 3n multimodal models (audio, image, text) on Mac M‑series chips using PyTorch + Metal Performance Shaders to fully exploit GPU performance.

Problem addressed

Traditional large‑model fine‑tuning relies on CUDA + NVIDIA GPUs, but many developers now use Apple Silicon Macs. Gemma Tuner Multimodal is optimized for Apple chips, enabling on‑laptop multimodal model fine‑tuning.

Core features

Supported modality combinations:

Audio + text

Image + text

Pure text

Any custom combination

Training visualization: Real‑time loss curves, attention heatmaps, gradient signals, memory usage, and token predictions.

Command‑line wizard interface: system check → LoRA config → model selection → dataset config

Technical highlights

Apple Silicon native optimization: Uses Metal Performance Shaders (MPS) as the backend, so PyTorch drives the Apple GPU directly with no extra cross‑compilation.

LoRA efficient fine‑tuning: Low‑Rank Adaptation is enabled by default, drastically reducing memory usage and training time (a minimal sketch follows this list).

Out‑of‑the‑box CLI wizard: After launching, an interactive guide walks users through system detection, LoRA settings, model choice, and dataset configuration, making it beginner‑friendly.
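For orientation, this is roughly what MPS device selection plus a LoRA configuration looks like in plain PyTorch with the peft library. Gemma Tuner wraps these choices behind its wizard; the model id, target modules, and hyperparameters shown here are assumptions for illustration only.

# Sketch: selecting the Apple GPU (MPS) and attaching LoRA adapters with peft (illustrative values)
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("google/gemma-3n-E2B-it")   # model id is an assumption
lora = LoraConfig(
    r=16,                                   # low‑rank dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # which projections get adapters (assumed)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora).to(device)
model.print_trainable_parameters()          # only the LoRA weights are trainable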

Limitations

Currently supports only Gemma series models (4 and 3n).

Requires sufficient Apple Silicon memory (official recommendation: 16 GB+).

Audio‑multimodal training needs an additional audio dataset.

GitHub: https://github.com/mattmireles/gemma-tuner-multimodal

Conclusion

The two projects illustrate complementary directions for device‑side multimodal AI:

Parlor focuses on the inference side, bringing powerful multimodal models to local machines so ordinary users can enjoy free, privacy‑preserving real‑time AI conversations.

Gemma Tuner Multimodal targets the training side, enabling developers to fine‑tune their own multimodal models on consumer‑grade Macs.

If you are interested in deploying AI locally or experimenting with multimodal fine‑tuning on a Mac, both projects are worth bookmarking.

Tags: Multimodal AI, Apple Silicon, local inference, Gemma Tuner, Parlor, real-time dialogue