How EUPE’s Three‑Stage Distillation Lets an 86M Model Run Classification, Segmentation and VLM on iPhone in 62 ms (SOTA)

EUPE introduces a three‑stage “scale‑then‑shrink” distillation pipeline that first trains a large proxy model to absorb heterogeneous expert knowledge and then compresses it into an 86M encoder. The resulting model reaches state‑of‑the‑art performance on image classification, dense prediction and vision‑language tasks, and runs on an iPhone with only 62 ms latency.

AIWalker

Core Pain Point: Why Direct Multi‑Teacher Distillation Fails

Running AI on edge devices often requires image classification, precise segmentation, and vision‑language understanding at the same time. Deploying a separate specialist model for each task either exceeds memory limits or forces a performance trade‑off, while a single efficient encoder lacks the capacity to absorb heterogeneous expert knowledge directly.

EUPE Overview – “Scale First, Then Shrink”

Meta Reality Labs and FAIR propose EUPE, a three‑stage distillation pipeline that first builds a heavyweight proxy model to unify multiple expert teachers, then progressively distills this knowledge into a lightweight student model.

Stage 1: Build a Super‑Generalist Proxy Model

The authors select three complementary teachers:

PEcore‑G (1.9 B) – image‑understanding expert (zero‑shot classification, retrieval).

PElang‑G (1.7 B) – vision‑language expert (OCR, knowledge‑based VQA).

DINOv3‑H+ (840 M) – dense‑prediction expert (segmentation, depth, key‑point matching).

All teachers process the same unlabeled image and output a CLS token and patch tokens. A 1.9 B proxy model, which plays the student role in this stage, receives the same image, and for each teacher a trainable Adapter maps the proxy’s tokens to that teacher’s token space; teacher parameters remain frozen. Feature normalization is applied before loss computation to prevent any single teacher from dominating the gradient.
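The adapter and normalization steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the feature widths, the activation scales, the single linear layer per adapter, and the per-channel standardization are all assumptions, since the article does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

PROXY_DIM = 64  # hypothetical proxy feature width
# Hypothetical (feature width, typical activation scale) per frozen teacher.
TEACHERS = {"PEcore-G": (48, 1.0), "PElang-G": (40, 25.0), "DINOv3-H+": (32, 0.1)}
NUM_TOKENS = 1 + 16  # one CLS token plus a 4x4 grid of patch tokens


def make_adapter(in_dim, out_dim):
    """Trainable linear adapter (weight, bias); one layer is an assumption."""
    weight = rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim))
    return weight, np.zeros(out_dim)


def apply_adapter(tokens, adapter):
    """Project proxy tokens (num_tokens, in_dim) into one teacher's space."""
    weight, bias = adapter
    return tokens @ weight + bias


def normalize_features(tokens, eps=1e-6):
    """Standardize each channel over tokens (one possible normalization)."""
    mean = tokens.mean(axis=0, keepdims=True)
    std = tokens.std(axis=0, keepdims=True)
    return (tokens - mean) / (std + eps)


adapters = {name: make_adapter(PROXY_DIM, dim) for name, (dim, _) in TEACHERS.items()}
proxy_tokens = rng.normal(size=(NUM_TOKENS, PROXY_DIM))

predictions, targets = {}, {}
for name, (dim, scale) in TEACHERS.items():
    # Stand-in for frozen teacher outputs, deliberately on different scales.
    teacher_tokens = rng.normal(scale=scale, size=(NUM_TOKENS, dim))
    targets[name] = normalize_features(teacher_tokens)
    predictions[name] = apply_adapter(proxy_tokens, adapters[name])
    print(name, predictions[name].shape, float(targets[name].std()))
```

After normalization every teacher's targets sit on a comparable scale, so a teacher with large raw activations no longer contributes a disproportionately large loss.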

The distillation loss combines cosine similarity for CLS tokens with a cosine‑plus‑smooth‑L1 term for patch tokens, summed over all teachers.
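A minimal NumPy sketch of such a loss is shown below. The equal weighting of the three terms, the `beta` threshold of the smooth L1 term, and the function names are assumptions for illustration; the article only states which terms are combined.

```python
import numpy as np


def cosine_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity), computed over the last axis."""
    pred_unit = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_unit = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_unit * target_unit, axis=-1)))


def smooth_l1_loss(pred, target, beta=1.0):
    """Smooth L1: quadratic below beta, linear above it."""
    diff = np.abs(pred - target)
    per_element = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(np.mean(per_element))


def distillation_loss(student_out, teacher_out):
    """Sum over teachers of CLS cosine loss plus patch cosine + smooth L1.

    Both arguments map teacher name -> (cls_token, patch_tokens); the student
    entries are assumed to be adapter-projected into that teacher's space.
    """
    total = 0.0
    for name, (teacher_cls, teacher_patches) in teacher_out.items():
        student_cls, student_patches = student_out[name]
        total += cosine_loss(student_cls, teacher_cls)
        total += cosine_loss(student_patches, teacher_patches)
        total += smooth_l1_loss(student_patches, teacher_patches)
    return total


# Toy data: two teachers with different (hypothetical) feature widths.
rng = np.random.default_rng(0)
teacher_out = {
    "PEcore-G": (rng.normal(size=32), rng.normal(size=(16, 32))),
    "DINOv3-H+": (rng.normal(size=24), rng.normal(size=(16, 24))),
}
noisy_student = {
    name: (cls + rng.normal(size=cls.shape), patches + rng.normal(size=patches.shape))
    for name, (cls, patches) in teacher_out.items()
}
perfect = distillation_loss(teacher_out, teacher_out)
noisy = distillation_loss(noisy_student, teacher_out)
print(perfect, noisy)
```

A student that reproduces the teacher features exactly scores close to zero, and the loss grows as the student's tokens drift in direction (cosine terms) or magnitude (smooth L1 term).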

Stage 2: Fixed‑Resolution Distillation

After the proxy is trained, Stage 2 distills it into the lightweight student over a long schedule (390 K iterations) at a single fixed resolution. This avoids the computational burden of multi‑resolution training during the longest phase while still allowing the student to inherit high‑resolution knowledge.

Stage 3: Multi‑Resolution Fine‑Tuning

Stage 3 introduces a three‑scale image pyramid (256, 384, 512). Each iteration randomly selects a scale for both teacher and student, and a short fine‑tuning schedule (100 K iterations, LR = 1e‑5) adapts the student to varying resolutions. Adaptive bilinear interpolation aligns spatial tokens across scales.
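The per-iteration scale sampling and the token-grid alignment can be sketched as below. This is an illustrative sketch only: the patch size of 16 is assumed from the ViT‑B/16 student, and the helper names and the half-pixel-centre bilinear scheme are assumptions rather than details given in the article.

```python
import random

import numpy as np

SCALES = (256, 384, 512)  # image pyramid from the article
PATCH_SIZE = 16           # assumption: ViT-B/16 patching


def sample_scale(rng):
    """Pick one pyramid scale, shared by teacher and student this iteration."""
    return rng.choice(SCALES)


def grid_side(image_size, patch_size=PATCH_SIZE):
    """Number of patch tokens along one side of a square image."""
    return image_size // patch_size


def bilinear_resize(tokens, out_side):
    """Resize a square (side, side, dim) token grid to (out_side, out_side, dim).

    Uses half-pixel-centre sampling with clamping at the grid edges.
    """
    in_side = tokens.shape[0]
    # Source coordinate of each output cell centre, clamped to the valid range.
    coords = (np.arange(out_side) + 0.5) * in_side / out_side - 0.5
    coords = np.clip(coords, 0.0, in_side - 1.0)
    low = np.floor(coords).astype(int)
    high = np.minimum(low + 1, in_side - 1)
    frac = coords - low

    # Interpolate along rows first, then along columns.
    rows = (tokens[low] * (1.0 - frac)[:, None, None]
            + tokens[high] * frac[:, None, None])
    return (rows[:, low] * (1.0 - frac)[None, :, None]
            + rows[:, high] * frac[None, :, None])


rng = random.Random(0)
scale = sample_scale(rng)
side = grid_side(scale)
teacher_grid = np.random.default_rng(0).normal(size=(side, side, 8))
# Align the teacher's patch tokens to a (hypothetical) 24x24 target grid.
aligned = bilinear_resize(teacher_grid, 24)
print(scale, side, aligned.shape)
```

Resizing in token space lets the loss compare patch tokens position by position even when two models produce grids of different sizes for the same image.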

Experimental Validation

SOTA Comparison

Table 1 shows EUPE‑ViT‑B (86 M parameters) matching or surpassing domain‑specific experts on three task families:

Image classification: IN1k‑KNN 84.1 % (on par with the strongest expert, far above DINOv3‑ViT‑B’s 44.4 %).

Dense prediction: ADE20K segmentation 51.3 % (exceeds DINOv3‑ViT‑B’s 39.4 %).

Vision‑language: RealWorldQA 85.9 % and GQA 85.9 % (well above PEcore‑B and comparable to the best DUNE‑B).

Thus a single 86 M model delivers the combined capabilities of multiple specialists in one inference pass. Compared with keeping three similarly sized specialist encoders in memory, this cuts the encoder memory footprint by roughly two‑thirds.
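The memory comparison can be made concrete with a back-of-the-envelope calculation. The numbers below count weights only and assume fp16 storage and three specialists that are each ViT‑B sized; both are assumptions for illustration, not figures from the article.

```python
def weight_memory_mb(num_params, bytes_per_param=2):
    """Approximate weight storage in megabytes (fp16 by default)."""
    return num_params * bytes_per_param / 1e6


unified = weight_memory_mb(86e6)          # one shared 86 M encoder
specialists = 3 * weight_memory_mb(86e6)  # hypothetical: three ViT-B-sized experts
saving = 1.0 - unified / specialists

print(f"unified: {unified:.0f} MB, specialists: {specialists:.0f} MB, saving: {saving:.0%}")
# prints "unified: 172 MB, specialists: 516 MB, saving: 67%"
```

Activation memory and any task heads are ignored here, so the real saving on a device will differ from this weight-only estimate.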

Ablation of the Three Stages

Table 2 isolates each stage’s contribution. Using only Stage 2 yields poor VLM performance (TextVQA 60.7, RealWorldQA 76.9). Adding Stage 1 raises TextVQA to 65.2 and RealWorldQA to 83.8, confirming the proxy’s role in unifying knowledge. Skipping Stage 2 while keeping Stages 1 and 3 improves dense‑prediction metrics but harms VLM performance. The full pipeline achieves the best overall balance.

Teacher Combination Ablation

Table 3 explores different teacher sets. The complementary trio PEcore + DINOv3 + PElang delivers the highest scores (TextVQA 69.6, RealWorldQA 82.9). Adding a second CLIP‑style teacher (SigLIP2) degrades OCR‑heavy VLM performance, indicating feature incompatibility.

Proxy Model Performance

Table 4 demonstrates that multi‑teacher proxies consistently outperform single‑teacher proxies across VLM and dense tasks, validating the effectiveness of knowledge aggregation.

Feature Visualization

PCA visualizations (Fig 4) compare raw teacher features with EUPE’s fused features. CLIP‑style models produce noisy, spatially inconsistent patches, while DINOv3 yields sharp but less detailed semantics. EUPE combines DINOv3’s spatial sharpness with the richer semantics of the CLIP‑style teachers.

Fig 5 shows stage‑wise feature quality: Stage 2‑only features are noisy; adding Stage 1 produces orderly representations; the complete Stage 1 & 2 & 3 pipeline yields the cleanest, most coherent features.

Practical Considerations

Training Cost: The three‑stage pipeline requires training a 1.9 B proxy before two student‑model stages, increasing total compute compared with direct multi‑teacher distillation. However, the cost is incurred once; inference uses only the lightweight student.

Teacher Size Ceiling: Scaling the proxy to 7 B (ViT‑7B) yields diminishing or negative returns on TextVQA and RealWorldQA, indicating that student capacity limits the benefit of larger teachers.

Inference Efficiency: Table 11 reports latency on iPhone 15 Pro. ViT‑B/16 at 512 px runs in 62.1 ms, or roughly 16 frames per second, whereas the larger ViT‑L exceeds 400 ms. EUPE targets the <100 M‑parameter, <100 ms sweet spot for edge AI.
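The latency figures translate into throughput and a simple budget check as sketched below. The helper names are hypothetical, and the ViT‑L parameter count of about 300 M is an approximate, commonly quoted figure rather than a number from the article.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def fits_edge_budget(params_millions, latency_ms,
                     max_params_millions=100.0, max_latency_ms=100.0):
    """True when a model sits inside the <100 M-parameter, <100 ms target."""
    return params_millions < max_params_millions and latency_ms < max_latency_ms


print(round(frames_per_second(62.1), 1))  # 16.1
print(fits_edge_budget(86, 62.1))         # True: EUPE ViT-B/16 at 512 px
print(fits_edge_budget(300, 400))         # False: ViT-L (parameter count approximate)
```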

Key Takeaways

“Scale first, then shrink” is essential for building an efficient universal encoder; a small model cannot directly absorb heterogeneous multi‑teacher knowledge.

The three‑stage pipeline balances knowledge unification (Stage 1), efficient distillation (Stage 2), and scale‑generalization (Stage 3); removing any stage skews performance.

Teacher selection must prioritize complementarity; stacking similar CLIP‑style teachers causes interference, while pairing vision‑language and dense‑prediction experts yields synergistic gains.

The authors invite readers to contemplate which AI applications—edge multimodal assistants, AR scene understanding, or robotic vision—might be most transformed by EUPE.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
