1.5B‑Parameter Model Enables Offline Real‑Time Speech Transcription
Liquid AI’s new 1.5 B‑parameter LFM2‑Audio model delivers high‑quality offline, real‑time speech‑to‑text, text‑to‑speech, and multimodal dialogue on local devices. It pairs a 1.2 B language backbone with a FastConformer encoder, supports two generation strategies, and posts benchmark scores that surpass those of some larger rivals.
Cloud‑based speech transcription is common, but fully offline real‑time transcription has only recently become viable. Liquid AI released its first end‑to‑end audio foundation model, LFM2‑Audio‑1.5B, demonstrating that a 1.5 B‑parameter model can handle high‑quality audio tasks locally.
Language model backbone: 1.2 B‑parameter LFM2 model
Audio encoder: FastConformer‑based 115 M‑parameter encoder
Audio tokenizer: Mimi from Kyutai, supporting eight codebooks
Context length: 32,768 tokens
Supported precision: bfloat16
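As a back-of-the-envelope check on why a model of this size fits on local devices, the sketch below estimates the memory taken by the weights alone. It assumes the nominal 1.5 B parameter count, 16 bits per weight for bfloat16, and 8.5 bits per weight for GGUF Q8_0 (34 bytes per block of 32 weights). The function name `weight_bytes` is illustrative, and real files will differ somewhat because not every tensor is necessarily quantized and because activations and caches are not counted.

```shell
#!/usr/bin/env bash
# Rough weight-memory estimate; ignores activations, caches, and file metadata.

# weight_bytes PARAMS BITS_X10 -> bytes
# BITS_X10 is bits per weight in tenths, so the math stays in integers.
weight_bytes() {
  echo $(( $1 * $2 / 80 ))
}

PARAMS=1500000000                    # nominal 1.5 B parameters (from the article)
BF16=$(weight_bytes "$PARAMS" 160)   # bfloat16: 16.0 bits per weight
Q8=$(weight_bytes "$PARAMS" 85)      # Q8_0: 8.5 bits per weight (assumed for all tensors)

echo "bfloat16 weights: $BF16 bytes"
echo "Q8_0 weights:     $Q8 bytes"
```

Under these assumptions the weights come to roughly 3.0 GB in bfloat16 and about 1.6 GB in Q8_0.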
Beyond its small size, the model is a unified multimodal system that does not require separate ASR and TTS components; it can perform speech‑to‑text, text‑to‑speech, and handle mixed multi‑turn dialogues.
The model supports two generation strategies:
Interleaved generation: Text and audio tokens alternate in a fixed pattern, which minimizes the delay before the first audio output and suits real‑time voice dialogue.
Sequential generation: A special token tells the model when to switch modalities, which fits ASR, TTS, and other non‑dialogue tasks.
This flexibility lets a single model adapt to different usage scenarios.
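The difference between the two strategies is easiest to see as token ordering. The toy sketch below only prints placeholder symbols (`T` for a text token, `A` for an audio token); the chunk sizes and the `<switch>` marker are hypothetical and are not taken from the model's actual tokenizer or interleaving pattern.

```shell
#!/usr/bin/env bash
# Toy illustration of token ordering only; not the model's real token stream.
# Chunk sizes must be positive. Functions run in subshells so variables stay local.

# interleaved N_TEXT N_AUDIO TEXT_CHUNK AUDIO_CHUNK
interleaved() (
  t=$1; a=$2; tc=$3; ac=$4; out=""
  while [ "$t" -gt 0 ] || [ "$a" -gt 0 ]; do
    i=0
    while [ "$i" -lt "$tc" ] && [ "$t" -gt 0 ]; do
      out="$out T"; t=$((t - 1)); i=$((i + 1))
    done
    i=0
    while [ "$i" -lt "$ac" ] && [ "$a" -gt 0 ]; do
      out="$out A"; a=$((a - 1)); i=$((i + 1))
    done
  done
  echo "${out# }"
)

# sequential N_TEXT N_AUDIO: all text, a switch marker, then all audio
sequential() (
  t=$1; a=$2; out=""
  i=0; while [ "$i" -lt "$t" ]; do out="$out T"; i=$((i + 1)); done
  out="$out <switch>"
  i=0; while [ "$i" -lt "$a" ]; do out="$out A"; i=$((i + 1)); done
  echo "${out# }"
)

interleaved 4 6 2 3   # T T A A A T T A A A
sequential 4 6        # T T T T <switch> A A A A A A
```

In this toy run the first audio symbol appears third in the interleaved stream but only after all text and the switch marker in the sequential one, which is the latency advantage described above.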
Typical usage examples (run with llama‑lfm2‑audio):
./llama-lfm2-audio \
-m $CKPT/LFM2-Audio-1.5B-Q8_0.gguf \
--mmproj $CKPT/mmproj-audioencoder-LFM2-Audio-1.5B-Q8_0.gguf \
-mv $CKPT/audiodecoder-LFM2-Audio-1.5B-Q8_0.gguf \
-sys "Perform ASR." \
--audio $INPUT_WAV
For text‑to‑speech:
./llama-lfm2-audio \
-m $CKPT/LFM2-Audio-1.5B-Q8_0.gguf \
--mmproj $CKPT/mmproj-audioencoder-LFM2-Audio-1.5B-Q8_0.gguf \
-mv $CKPT/audiodecoder-LFM2-Audio-1.5B-Q8_0.gguf \
-sys "Perform TTS." \
-p "My name is Pau Labarta Bajo and I love AI" \
--output $OUTPUT_WAV
And for TTS with a voice description in the system prompt:
./llama-lfm2-audio \
-m $CKPT/LFM2-Audio-1.5B-Q8_0.gguf \
--mmproj $CKPT/mmproj-audioencoder-LFM2-Audio-1.5B-Q8_0.gguf \
-mv $CKPT/audiodecoder-LFM2-Audio-1.5B-Q8_0.gguf \
-sys "Perform TTS.
Use the following voice: A male speaker delivers a very expressive and animated speech, with a low‑pitch voice and a slightly close‑sounding tone. The recording carries a slight background noise." \
-p "What is your name man?" \
--output $OUTPUT_WAV
Despite its modest parameter count, the model's performance rivals that of larger competitors. On the VoiceBench audio benchmark, LFM2‑Audio‑1.5B achieved a composite score of 56.78, far above the 7 B‑parameter Moshi model (29.51). On ASR, its average word error rate (WER, lower is better) is 7.24%, comparable to Whisper‑large‑V3’s 7.93%.
A notable comparison is with Qwen2.5‑Omni‑3B, which has more than three times the parameters yet scores similarly on most metrics, highlighting Liquid AI’s efficiency optimizations.
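To put the quoted figures in proportion, the short calculation below derives the VoiceBench score ratio against Moshi and the relative WER difference against Whisper‑large‑V3. It uses only the four numbers cited above; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Figures copied from the article; awk handles the floating-point math.
VB_LFM2=56.78      # VoiceBench composite score, LFM2-Audio-1.5B
VB_MOSHI=29.51     # VoiceBench composite score, Moshi
WER_LFM2=7.24      # average WER in percent, LFM2-Audio-1.5B
WER_WHISPER=7.93   # average WER in percent, Whisper-large-V3

VB_RATIO=$(awk -v a="$VB_LFM2" -v b="$VB_MOSHI" \
  'BEGIN { printf "%.2f", a / b }')
WER_REL=$(awk -v a="$WER_LFM2" -v b="$WER_WHISPER" \
  'BEGIN { printf "%.1f", (b - a) / b * 100 }')

echo "VoiceBench score ratio vs Moshi: ${VB_RATIO}x"
echo "Relative WER reduction vs Whisper-large-V3: ${WER_REL}%"
```

So the composite score is a little under twice Moshi's, while the WER gap to Whisper‑large‑V3 is under one tenth in relative terms, which is why the two are described as comparable.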
The main current limitation is that the model supports only English, which rules out multilingual use cases.
Conclusion: Local processing suits the many applications that value data privacy and independence from network connectivity, which opens up numerous scenarios for offline‑first solutions.
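One such offline‑first scenario is transcribing a folder of recordings without any network access. The sketch below wraps the ASR command shown earlier in a loop, reusing only the flags that appear in this article. It assumes that `llama-lfm2-audio` prints the transcript to standard output, which should be checked against your build; `transcribe_dir` and `LFM2_BIN` are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: run the article's ASR command once per WAV file in a directory.
# Assumption: the runner writes the transcript to stdout.

LFM2_BIN="${LFM2_BIN:-./llama-lfm2-audio}"   # path to the runner (overridable)

# transcribe_dir IN_DIR OUT_DIR: writes OUT_DIR/<name>.txt for each IN_DIR/<name>.wav
transcribe_dir() (
  in_dir=$1; out_dir=$2
  mkdir -p "$out_dir"
  for wav in "$in_dir"/*.wav; do
    [ -e "$wav" ] || continue                # glob matched nothing
    base=$(basename "$wav" .wav)
    "$LFM2_BIN" \
      -m "$CKPT/LFM2-Audio-1.5B-Q8_0.gguf" \
      --mmproj "$CKPT/mmproj-audioencoder-LFM2-Audio-1.5B-Q8_0.gguf" \
      -mv "$CKPT/audiodecoder-LFM2-Audio-1.5B-Q8_0.gguf" \
      -sys "Perform ASR." \
      --audio "$wav" > "$out_dir/$base.txt" \
      || echo "failed: $wav" >&2             # report and keep going
  done
)
```

With `CKPT` exported to the checkpoint directory, a call such as `transcribe_dir ./recordings ./transcripts` would leave one text file per recording.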
Repository: https://github.com/Liquid4All/liquid-audio
