Om AI Unveils Three Edge AI Models for Continuous Perception to Action
Om AI announced the three‑model VLX suite (VLX‑Flow, VLX‑Seek and VLX‑Go), designed to process continuous video streams on the device itself. The suite uses incremental visual memory and linear attention to meet the low‑latency, resource‑constrained demands of real‑world cameras, drones and robots.
Overview
Over June 27‑29, Om AI announced three new multimodal models, VLX‑Flow, VLX‑Seek and VLX‑Go, collectively called the VLX "edge brain". The suite targets real‑time perception, localization and action on devices such as cameras, drones and robots, where video arrives continuously and responses must be immediate.
Challenges of Physical‑World Video Understanding
Traditional video‑language models assume an offline workflow: a complete video is recorded, frames are sampled and encoded, and a single inference pass is run. In real‑world scenarios the camera never stops capturing, the environment changes continuously, and queries can arrive at any moment. The two common workarounds both fall short: feeding in every frame makes computational cost explode as the history lengthens, while fixed‑interval sampling misses critical motion details that fall between samples. Neither can maintain a stable visual state.
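The trade‑off between the two approaches can be made concrete with a little arithmetic. The sketch below is illustrative only: the frame rate, stream duration, sampling interval and event timing are hypothetical numbers chosen for the example, not figures from Om AI.

```python
def full_history_frames(seconds: int, fps: int) -> int:
    """Frames that must be processed if the whole history is re-read at query time."""
    return seconds * fps


def sampled_timestamps(seconds: int, interval_s: float) -> list[float]:
    """Timestamps (in seconds) kept by fixed-interval sampling."""
    count = int(seconds / interval_s)
    return [i * interval_s for i in range(count)]


def event_is_seen(start_s: float, end_s: float, timestamps: list[float]) -> bool:
    """True if at least one sampled timestamp falls inside the event window."""
    return any(start_s <= t < end_s for t in timestamps)


# Hypothetical setup: a 30 fps camera that has been running for 10 minutes.
frames = full_history_frames(seconds=600, fps=30)

# Fixed-interval sampling: keep one frame every 2 seconds.
stamps = sampled_timestamps(seconds=600, interval_s=2.0)

# A half-second event starting at t = 11.2 s falls between the samples at 10 s and 12 s.
missed = not event_is_seen(11.2, 11.7, stamps)

print(frames, len(stamps), missed)
```

With these numbers, full‑frame input has to handle tens of thousands of frames per query, while the sampled stream is cheap but never sees the short event at all.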
VLX Model Architecture
VLX is organized into three layers:
VLX‑Flow handles streaming visual understanding and textual interaction. It continuously ingests video, builds incremental visual representations, and updates a cached semantic memory without recomputing the entire history.
VLX‑Seek anchors semantic targets to specific visual regions, providing fine‑grained perception.
VLX‑Go converts monocular video and commands into short‑term executable waypoints for navigation and obstacle avoidance.
Together, the three layers implement an "understand → locate → act" closed loop for edge devices.
Streaming Input and Incremental Memory (VLX‑Flow)
VLX‑Flow splits the incoming video into small, ordered clips. Each clip is encoded into a visual representation that is stored in a visual cache. The language model maintains a reusable memory buffer that holds the compressed state of past clips. When a new clip arrives, the model performs an incremental update, avoiding full recomputation of the entire history.
This two‑level memory, a visual cache of clip representations plus the language‑side memory buffer (the textual continuation layer), preserves both recent frame details and long‑range semantic context. The model can therefore answer questions such as "how many people are in the scene" without re‑processing the whole video.
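A minimal sketch of the incremental‑update idea follows. It is not VLX‑Flow's implementation: the class name StreamMemory, the mean‑pooling encoder and the exponential moving average used as the "compressed state" are all stand‑ins assumed for illustration. The point it shows is that each new clip touches only a bounded cache and a fixed‑size summary, so the per‑clip cost does not depend on how long the stream has been running.

```python
from collections import deque

import numpy as np


def encode_clip(clip_frames: np.ndarray) -> np.ndarray:
    """Stand-in encoder (assumption): mean-pool (num_frames, dim) features to one vector."""
    return clip_frames.mean(axis=0)


class StreamMemory:
    """Hypothetical two-level memory: recent clip embeddings plus a running summary."""

    def __init__(self, feature_dim: int, cache_size: int = 4, decay: float = 0.9):
        # Level 1: bounded cache that keeps the most recent clip embeddings verbatim.
        self.visual_cache = deque(maxlen=cache_size)
        # Level 2: fixed-size compressed state standing in for long-range context.
        self.summary = np.zeros(feature_dim)
        self.decay = decay
        self.clips_seen = 0

    def update(self, clip_frames: np.ndarray) -> None:
        """Fold one new clip into memory without revisiting earlier clips."""
        embedding = encode_clip(clip_frames)
        self.visual_cache.append(embedding)  # oldest entry is evicted automatically
        self.summary = self.decay * self.summary + (1.0 - self.decay) * embedding
        self.clips_seen += 1


# Feed ten synthetic clips of 8 frames each, with 16-dimensional frame features.
rng = np.random.default_rng(0)
memory = StreamMemory(feature_dim=16)
for _ in range(10):
    memory.update(rng.normal(size=(8, 16)))

print(memory.clips_seen, len(memory.visual_cache), memory.summary.shape)
```

A real system would replace the moving average with a learned state, but the shape of the loop (encode the new clip, update the cache, update the compressed state) is the same.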
Linear Attention for Low Latency
VLX‑Flow replaces standard self‑attention with a linear attention mechanism that supports a recursive state. Because each new clip updates a fixed‑size state instead of attending over the full history, time to first token (TTFT) is reduced and stays stable as the video grows longer. In the reported benchmarks, VLX‑Flow's TTFT remains low and flat, while full‑attention and sliding‑window baselines show latency rising as history accumulates.
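To show why a recursive state keeps per‑step cost constant, here is a sketch of the generic linear‑attention recurrence from the research literature. It is not VLX‑Flow's actual attention layer, whose details the article does not give; the elu(x) + 1 feature map and the class name LinearAttentionState are assumptions chosen for the example. The state consists of a matrix S and a vector z whose sizes never grow, and the streamed outputs are checked against a reference that re‑reads the whole history at every step.

```python
import numpy as np


def feature_map(x: np.ndarray) -> np.ndarray:
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


class LinearAttentionState:
    """Fixed-size recurrent state: S accumulates phi(k) v^T, z accumulates phi(k)."""

    def __init__(self, key_dim: int, value_dim: int):
        self.S = np.zeros((key_dim, value_dim))
        self.z = np.zeros(key_dim)

    def step(self, q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Absorb one (k, v) pair, then answer query q; cost is independent of history."""
        phi_k = feature_map(k)
        self.S += np.outer(phi_k, v)
        self.z += phi_k
        phi_q = feature_map(q)
        return (phi_q @ self.S) / (phi_q @ self.z)


def causal_attention_reference(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Reference with the same kernel that re-reads tokens 0..t at every step t."""
    out = np.zeros_like(V)
    for t in range(len(Q)):
        weights = feature_map(K[: t + 1]) @ feature_map(Q[t])
        out[t] = (weights @ V[: t + 1]) / weights.sum()
    return out


rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))

state = LinearAttentionState(key_dim=4, value_dim=3)
streamed = np.stack([state.step(q, k, v) for q, k, v in zip(Q, K, V)])
matches = np.allclose(streamed, causal_attention_reference(Q, K, V))
print(matches)
```

The reference does work proportional to the history length at every step, whereas step always performs the same amount of work, which is the property behind a flat latency curve.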
Stream‑Memory Training Paradigm
Training uses a specially constructed streaming dataset. A 16‑second video is divided into consecutive 2‑second clips (eight clips in total), each paired with a concise caption that describes exactly what is visible in that clip. The model first learns to ingest these captions and build a recursive memory state (stream‑memory). In a second stage, questions are posed after a delay (e.g., 10 seconds or 1 minute) to test whether the accumulated memory can support accurate inference without revisiting the raw video.
This separation of observation and question‑answer phases forces the model to compress visual history into a reusable memory.
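The data layout described above might be sketched as follows. The record types ClipSample and DelayedQuestion, the helper names and the caption text are hypothetical; the article gives only the clip lengths and example delays, and the sketch assumes the delay is measured from the end of the observed video.

```python
from dataclasses import dataclass


@dataclass
class ClipSample:
    start_s: float
    end_s: float
    caption: str  # concise description of what is visible in this clip


@dataclass
class DelayedQuestion:
    ask_at_s: float  # stream time at which the question is posed
    question: str
    answer: str


def split_into_clips(video_len_s: float, clip_len_s: float, captions: list[str]) -> list[ClipSample]:
    """Stage 1 data: consecutive fixed-length clips, each paired with one caption."""
    count = int(video_len_s // clip_len_s)
    if len(captions) != count:
        raise ValueError(f"expected {count} captions, got {len(captions)}")
    return [
        ClipSample(i * clip_len_s, (i + 1) * clip_len_s, captions[i])
        for i in range(count)
    ]


def make_delayed_question(video_len_s: float, delay_s: float, question: str, answer: str) -> DelayedQuestion:
    """Stage 2 data: a question asked delay_s seconds after observation ends (assumption)."""
    return DelayedQuestion(video_len_s + delay_s, question, answer)


# A 16-second video cut into 2-second clips, with placeholder captions.
clips = split_into_clips(16, 2, [f"placeholder caption {i}" for i in range(8)])

# The question arrives 10 seconds after the last clip, when only memory is available.
probe = make_delayed_question(16, 10, "How many people were in the scene?", "placeholder answer")

print(len(clips), clips[-1].end_s, probe.ask_at_s)
```

Keeping the question's timestamp outside every clip interval is what makes the test meaningful: by the time the question is asked, the only thing the model can consult is the state it built while observing.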
System Implications
By moving video processing to the device and maintaining state locally, VLX‑Flow eliminates repeated uploading and re‑encoding of long video histories. This makes the approach suitable for edge deployment where bandwidth and compute are limited. The continuous perception model provides a long‑running state that can trigger alerts, reminders, or navigation commands as soon as relevant events occur.
Future Work
The next article will examine VLX‑Seek, focusing on how the system precisely identifies which visual target is being referred to, completing the "understand → locate → act" pipeline.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
