Yann LeCun: Today's AI Still Below Dog Level – Inside Meta’s Voicebox, MusicGen & I‑JEPA
Meta’s chief AI scientist Yann LeCun warned that current large language models still fall short of human and even dog intelligence, citing their lack of real‑world understanding, while Meta unveiled three new generative AI models—Voicebox for speech, MusicGen for music, and I‑JEPA for image reasoning—showcasing both progress and remaining limitations.
Meta’s chief AI scientist Yann LeCun said at the Viva Tech conference in Paris that today’s AI systems, such as ChatGPT, have not reached human‑level intelligence and are even less capable than a dog.
He argued that large language models (LLMs) are not truly intelligent because they cannot understand or interact with reality; they merely generate text based on massive language training and miss the bulk of human experience that is not language‑based.
LeCun illustrated the gap with examples: an AI can pass the U.S. bar exam but cannot install a dishwasher, a skill a ten‑year‑old can learn in ten minutes. He also pointed to infant development: a five‑month‑old sees a floating object without questioning it, while a nine‑month‑old is surprised because it has learned that objects should not float, a capability current AI cannot replicate.
LeCun said Meta is currently training AI on video, and he envisions future machines that act as helpful assistants, more intelligent than their users, without being a threat.
He also dismissed the notion that robots will dominate the world, emphasizing that intelligence does not imply a desire to take over.
Meta Releases Voicebox: A Generative Speech Model
Voicebox is a non‑autoregressive flow‑matching model trained on more than 50,000 hours of raw, unfiltered speech. It can perform speech generation tasks such as editing, sampling, and style transfer directly from context, without task‑specific training.
The model supports zero‑shot text‑to‑speech synthesis in single or multiple languages, noise removal, content editing, style conversion, and diverse sample generation. On zero‑shot English text‑to‑speech it achieves a lower word error rate (1.9% vs. 5.9%) and higher audio similarity (0.681 vs. 0.580) than the state‑of‑the‑art English model VALL‑E, while running up to 20× faster.
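For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below is only an illustration of that standard definition: it assumes plain whitespace tokenization, the function name and example sentences are hypothetical, and it is not the evaluation pipeline Meta used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no further
    normalization (case, punctuation); real evaluations usually
    normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted word out of four.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER of 1.9% means roughly two word-level errors per hundred reference words.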
Meta Open‑Sources MusicGen: Text‑to‑Music Generation
MusicGen converts text prompts and optional melody inputs into full musical pieces. Built on the Transformer architecture introduced in 2017, it was trained on 20,000 hours of licensed music and relies on Meta’s EnCodec audio tokenizer, which splits audio into small units that can be processed in parallel, improving efficiency and speed.
The model can combine textual descriptions with existing melodies, for example generating “a light‑hearted track” blended with Beethoven’s “Ode to Joy.”
Meta Introduces I‑JEPA: A “Human‑Like” Vision Model
I‑JEPA learns abstract world representations through self‑supervised image learning, delivering more accurate analysis and completion of unfinished images than existing models.
Inspired by LeCun’s human‑like reasoning approach, I‑JEPA reduces common generative‑image errors such as extra fingers and offers higher computational efficiency compared with widely used vision models.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
