Multimodal AI Innovations from the ERNIE Hackathon: Accessibility, Elderly Assistance, Autism Intervention and More
The multimodal track of the ERNIE Open Innovation Hackathon produced a diverse set of award‑winning projects built on the ERNIE‑4.5‑VL model. The projects shorten accessible‑video production cycles, give seniors an audio‑only smartphone assistant, support personalized autism intervention at home, generate AI‑driven music for videos, and more, showing the practical impact of multimodal AI in real‑world scenarios.
First Prize – AI 无碍 – Video‑Accessibility Editing Platform
AI 无碍 implements a full AI‑assisted pipeline for creating accessible video scripts. ERNIE‑4.5‑VL with LoRA fine‑tuning handles core visual understanding, achieving 95% scene‑level accuracy and 81% accuracy on artistic‑shot classification (a 102.5% improvement over the baseline). Audio is transcribed and emotion‑tagged to enrich the narration. Structured templates and real‑time multi‑user collaboration reduce manual editing to 12% of the total workload. A multimodal preview simulates screen‑reader output, and end‑to‑end preprocessing of a 10‑minute video takes 3.5 minutes.
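The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes the "102.5% improvement" is relative to the baseline accuracy (the article does not say so explicitly); the variable names are illustrative only.

```python
# Assumption: "102.5% improvement" means relative improvement over the baseline.
finetuned_accuracy = 0.81
relative_improvement = 1.025
implied_baseline = finetuned_accuracy / (1 + relative_improvement)

# Preprocessing time compared with the length of the video itself.
video_minutes = 10.0
preprocess_minutes = 3.5
realtime_fraction = preprocess_minutes / video_minutes

print(f"implied baseline accuracy: {implied_baseline:.1%}")
print(f"preprocessing time / video length: {realtime_fraction:.2f}")
```

Under that reading, the baseline would sit at about 40% accuracy, and preprocessing takes roughly a third of the video's running time.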
Second Prize – Assistify – Audio‑Only Smart Companion for Elderly Smartphone Use
Assistify provides a pure‑audio interface that reads the screen aloud and guides users step by step. It captures screen frames at 1 FPS and processes them with ERNIE‑4.5‑VL, reaching 94.2% UI‑element classification accuracy. A RAG‑driven personalization loop stores each interaction as a vector; when a similar request arrives later, the system retrieves and adapts the earlier dialogue, which shortens the guidance. The request‑analysis‑instruction‑verification loop keeps latency under 1.8 seconds, and in beta testing, users aged 68–82 independently completed 100% of tasks such as photo sharing and secure payments.
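The retrieval side of such a personalization loop can be sketched in a few lines. This is a minimal illustration, not Assistify's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the `InteractionMemory` class, its `threshold` parameter, and the sample requests are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InteractionMemory:
    """Hypothetical store of past requests and the guidance that resolved them."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entries = []  # list of (vector, guidance_steps)

    def store(self, request, guidance_steps):
        self.entries.append((embed(request), list(guidance_steps)))

    def retrieve(self, request):
        """Return guidance from the most similar past request, or None."""
        query = embed(request)
        best_score, best_steps = 0.0, None
        for vector, steps in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_steps = score, steps
        return best_steps if best_score >= self.threshold else None

memory = InteractionMemory()
memory.store("share a photo with my daughter", ["Open Gallery", "Tap the photo", "Tap Share"])
print(memory.retrieve("share a photo with my son"))   # reuses the stored steps
print(memory.retrieve("pay the electricity bill"))    # no similar entry: None
```

A retrieved dialogue would then be adapted to the current screen rather than replayed verbatim; that adaptation step is outside this sketch.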
Third Prize – HeartBridge – Autism Family Intervention Platform
HeartBridge moves autism assessment from clinic to home by analysing short parent‑recorded videos. The multimodal model ERNIE‑4.5‑VL detects facial emotions (5‑class F1 = 0.88), eye‑gaze, and body motions (6‑class F1 = 0.85). Incremental learning and few‑shot fine‑tuning create a per‑child model that continuously adapts as new data arrive. Privacy is protected by edge‑side de‑identification, encrypted transmission, and a hybrid edge‑cloud architecture. The platform currently serves 128 active families with a weekly retention of 42%.
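For readers unfamiliar with multi-class F1, the sketch below computes a macro-averaged F1 score in plain Python. The article does not state which averaging HeartBridge uses, so macro averaging is an assumption here, and the emotion labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels, not HeartBridge data.
truth = ["happy", "sad", "happy", "neutral"]
preds = ["happy", "sad", "neutral", "neutral"]
print(round(macro_f1(truth, preds), 4))
```

Macro averaging treats rare and common classes alike, which matters when some emotions or motions appear only occasionally in home videos.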
Open‑Source Contribution – Rhythm AI – Intelligent Video‑Music Generation
Rhythm AI converts video content and style intent into synchronized background music via a three‑stage modular architecture:
Visual‑semantic embedding: extracts scene, motion, rhythm, and style features using ERNIE‑4.5‑VL.
Cross‑modal alignment: long‑short‑term temporal modeling aligns visual semantics with musical structure; a classifier‑free guidance (CFG) mechanism mitigates style drift.
Music generation: a decoder transforms the aligned features into audio that matches the video’s emotion and tempo.
All components are prepared for open‑source release.
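The classifier-free guidance step mentioned in the cross-modal alignment stage follows a simple rule in its standard form: the model's unconditional prediction is pushed toward its style-conditioned prediction by a guidance scale. The sketch below shows that standard formulation on plain lists; how Rhythm AI actually applies CFG is not described in the article, and the function name and sample values are illustrative.

```python
def cfg_combine(unconditional, conditional, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    if len(unconditional) != len(conditional):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(unconditional, conditional)]

uncond = [0.0, 1.0]   # prediction without the style condition
cond = [1.0, 3.0]     # prediction with the style condition

print(cfg_combine(uncond, cond, 0.0))  # scale 0 ignores the condition
print(cfg_combine(uncond, cond, 1.0))  # scale 1 reproduces the conditional prediction
print(cfg_combine(uncond, cond, 2.0))  # scale > 1 pushes further toward the condition
```

A scale above 1 strengthens adherence to the requested style, which is the lever a system would use against style drift, usually at some cost in diversity.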
Other Notable Projects
Yao – Zero‑API General‑Purpose Intelligent Agent: operates across desktop applications (e.g., WeChat, QQ, Word) by repeating a screenshot‑to‑action loop, without calling external APIs.
HearSight – Multimodal Video Knowledge Summarizer: uses ERNIE‑4.5‑VL to generate structured text‑image summaries, supports multi‑video linking, multilingual translation, and full data export.
VideoTalk – Browser‑Based Video Dialogue Tool: combines local OCR (PaddleOCR) with cloud‑based ERNIE‑4.5‑VL analysis to provide real‑time Q&A on video content, including virtual avatar interaction.
Ping Pong AI Coach (WeChat Mini‑Program): analyses match videos for scoring, produces technique radar charts, generates AI‑driven coaching videos, and creates personalized training plans using ERNIE‑4.5‑VL and PaddleSpeech.
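The screenshot-to-action loop described for Yao above can be sketched as a small control loop. Everything here is hypothetical: the `Action` type, the `run_agent` function, and the three callables stand in for screen capture, a vision-language model call, and UI automation, none of which are specified in the article.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    kind: str          # e.g. "click", "type", or "done"
    target: str = ""

def run_agent(goal: str,
              capture_screen: Callable[[], str],
              decide: Callable[[str, str, List[Action]], Action],
              execute: Callable[[Action], None],
              max_steps: int = 10) -> List[Action]:
    """Loop: screenshot -> model decision -> UI action, until "done" or the step limit."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()
        action = decide(goal, screenshot, history)
        if action.kind == "done":
            break
        execute(action)
        history.append(action)
    return history

# Fake components so the loop can run without a real desktop or model.
screens = iter(["chat list", "chat window", "message sent"])

def fake_decide(goal, screenshot, history):
    if screenshot == "chat list":
        return Action("click", "contact")
    if screenshot == "chat window":
        return Action("type", "hello")
    return Action("done")

performed = []
steps = run_agent("send hello", lambda: next(screens), fake_decide, performed.append)
print([a.kind for a in steps])
```

Because the loop only sees pixels and emits input events, it needs no application-specific API, at the cost of one model call per step; the `max_steps` cap guards against a loop that never reaches "done".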
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
