A 1.3 MB SAM Model Runs Inside a Sensor Chip in 11 ms—No Raw Images Leave the Device

IBM Research open‑sources PicoSAM3, a 1.3 MB promptable segmentation model that fits inside Sony's IMX500 sensor, runs inference in 11.8 ms, and keeps raw images on‑chip, demonstrating ultra‑low‑latency, privacy‑preserving edge AI for smart glasses and IoT devices.

Problem Statement

Segmentation models such as the Segment Anything Model (SAM) are valuable for data filtering and pre‑annotation, but deploying them on latency‑sensitive, privacy‑critical edge devices (smart glasses, IoT cameras, drones) is difficult. Existing lightweight variants (TinySAM, EdgeSAM, MobileSAM, LiteSAM) still require the image to leave the sensor and be processed on a host CPU or edge‑box, incurring transmission latency and exposing raw pixels.

Why In‑Sensor Computing?

When the image sensor and the AI accelerator are vertically integrated, the image can be processed the instant it is captured, eliminating the data‑transfer hop. This reduces end‑to‑end latency to a few milliseconds and guarantees that the raw image never leaves the silicon, addressing privacy concerns. The Sony IMX500 exemplifies this architecture: it stacks a dedicated edge‑AI processor beneath a CMOS sensor with less than 8 MiB of SRAM, but its limited memory and operator set prevent direct deployment of existing promptable segmentation models.

PicoSAM3 Design for In‑Sensor Computing

Architecture: A symmetric U‑Net encoder‑decoder built exclusively from convolutional layers, with no Transformer blocks; total parameters ≈ 1.37 M.
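
To make the design concrete, here is a minimal PyTorch sketch of such a symmetric, all‑convolutional U‑Net. The channel widths, depth, and the TinyUNet name are illustrative assumptions, not the published PicoSAM3 configuration; the point is an encoder‑decoder built purely from convolutions and upsampling, with no attention blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with ReLU; the only operator types used.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, widths=(24, 48, 96, 192)):  # illustrative widths
        super().__init__()
        self.enc = nn.ModuleList()
        c_prev = 3
        for c in widths:
            self.enc.append(conv_block(c_prev, c))
            c_prev = c
        self.pool = nn.MaxPool2d(2)
        self.dec = nn.ModuleList()
        for c in reversed(widths[:-1]):
            self.dec.append(conv_block(c_prev + c, c))
            c_prev = c
        self.head = nn.Conv2d(c_prev, 1, 1)  # single-channel mask logits

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:  # all but the bottleneck feed a skip
                skips.append(x)
                x = self.pool(x)
        for block in self.dec:
            x = F.interpolate(x, scale_factor=2, mode="nearest")
            x = torch.cat([x, skips.pop()], dim=1)  # symmetric skip connection
            x = block(x)
        return self.head(x)

model = TinyUNet()
print(sum(p.numel() for p in model.parameters()))  # ~1.1 M with these widths
```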

Quantization: Post‑training INT8 quantization compresses the model to 1.31 MiB. The all‑convolutional feature distribution is near‑Gaussian, allowing uniform quantization with negligible accuracy loss.
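
The sketch below illustrates uniform post‑training INT8 quantization on a single weight tensor. The symmetric, per‑tensor scheme is an assumption for illustration; production flows typically calibrate activations and use per‑channel scales.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor scheme: map the max magnitude to 127.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(32, 16, 3, 3)  # a near-Gaussian conv weight tensor
q, s = quantize_int8(w)
err = (dequantize(q, s) - w).abs().mean()
print(f"mean abs quantization error: {err.item():.5f}")
# At 1 byte per weight, 1.37 M parameters come to ~1.37 MB ≈ 1.31 MiB,
# consistent with the reported model size.
```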

Training strategy: Knowledge distillation from a full‑size SAM3 teacher improves mean Intersection‑over‑Union (mIoU) by up to 14.5 % compared with pure supervised training on the same dataset.
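
A hedged sketch of what mask‑level distillation can look like: a supervised loss on ground‑truth masks combined with a soft‑target loss against the teacher's logits. The specific losses and weighting used for PicoSAM3 are not detailed here, so treat this as a standard recipe rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gt_mask, alpha=0.5):
    """student/teacher logits: (B, 1, H, W); gt_mask: binary float (B, 1, H, W)."""
    # Hard term: supervised BCE against the ground-truth mask.
    hard = F.binary_cross_entropy_with_logits(student_logits, gt_mask)
    # Soft term: match the teacher's per-pixel mask probabilities.
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits)
    )
    return alpha * hard + (1 - alpha) * soft
```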

Prompt encoding: The IMX500 accepts only an RGB frame; additional point, box, or mask channels are unavailable. During training, a target instance’s bounding box is expanded by 10 % and cropped to a square centered on the object. This cropped patch serves as the “prompt”: the model learns to map a centered crop to the object mask, eliminating the need for extra input channels.
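
A minimal sketch of this crop‑as‑prompt preparation, assuming axis‑aligned boxes; function and variable names are illustrative.

```python
import numpy as np

def crop_prompt(image: np.ndarray, box, expand=0.10):
    """image: (H, W, 3); box: (x0, y0, x1, y1) of the target instance."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Square side: the larger box dimension, expanded by 10%.
    side = max(x1 - x0, y1 - y0) * (1 + expand)
    half = side / 2
    # Clamp the window to the frame; near borders this may shift the center.
    xa, ya = int(max(0, cx - half)), int(max(0, cy - half))
    xb, yb = int(min(w, cx + half)), int(min(h, cy + half))
    return image[ya:yb, xa:xb]  # resized to the model input downstream
```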

Implementation on the Sony IMX500

The INT8 model weights are flashed to the sensor’s on‑chip storage; the runtime resides in the embedded AI accelerator.

Inference on a single 640 × 480 RGB frame completes in 11.82 ms, well within real‑time constraints for AR/VR pipelines.

The memory footprint stays below the 8 MiB SRAM limit, whereas TinySAM, EdgeSAM, MobileSAM‑v2, and LiteSAM either exceed the SRAM budget or rely on operators (e.g., multi‑head attention) that the IMX500 does not support.

Experimental Results

Model size: 1.31 MiB (INT8) vs. > 5 MiB for TinySAM and > 10 MiB for EdgeSAM.

Latency: 11.82 ms on the IMX500 vs. 30–70 ms for TinySAM on a typical mobile CPU.

Accuracy: Distillation from SAM3 raises mIoU from 62 % (supervised) to 71 % (distilled), a 9‑point absolute and 14.5 % relative gain.

Power: The on‑sensor AI block consumes ≈ 200 mW, an order of magnitude lower than a separate edge‑box CPU running comparable models.
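
For reference, the mIoU figures above average the intersection‑over‑union between predicted and ground‑truth binary masks across instances; a minimal sketch:

```python
import numpy as np

def miou(preds, gts, eps=1e-7):
    """preds, gts: iterables of binary (H, W) masks for matched instances."""
    scores = []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()  # |pred ∩ gt|
        union = np.logical_or(p, g).sum()   # |pred ∪ gt|
        scores.append(inter / (union + eps))
    return float(np.mean(scores))
```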

Implications

Embedding a promptable segmentation model inside the sensor demonstrates that high‑quality, privacy‑preserving perception is feasible on devices with sub‑megabyte memory budgets. It also shows that, under extreme resource constraints, pure CNN designs remain viable while Transformer‑based architectures become impractical due to their random‑access memory patterns and heavyweight operators.

Paper URL: https://arxiv.org/pdf/2603.11917
Tags: model compression, promptable segmentation, CNN vs Transformer, IMX500, PicoSAM3, sensor computing
Written by AIWalker. Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
