How Apple’s OpenELM Redefines Efficient LLM Scaling with Layer‑Wise Design

Apple's OpenELM introduces a layer-wise-scaled Transformer family ranging from 270M to 3B parameters, provides a fully open-source training framework, and demonstrates stronger zero-shot and few-shot performance than existing open LLMs despite using less public data, while also analyzing inference bottlenecks and parameter-efficient fine-tuning (PEFT) results.


Overview

Apple released OpenELM, a family of four decoder‑only Transformer language models (270M, 450M, 1.1B, 3B parameters) trained on public datasets.

Key Architectural Innovations

OpenELM uses layer‑wise scaling: each Transformer layer has its own head count and FFN dimension, breaking the uniform‑layer assumption.

No learnable bias in linear layers.

RMSNorm pre‑normalization and RoPE positional encoding.

Grouped‑query attention (GQA) instead of multi‑head attention.

SwiGLU feed‑forward network.

Flash attention for fast, memory-efficient scaled dot-product attention.

Same tokenizer as LLaMA.
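To make the attention choice concrete, here is a minimal PyTorch sketch of grouped-query attention, in which several query heads share each key/value head. The module name and dimensions are illustrative assumptions and do not reflect OpenELM's actual configuration; RoPE is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_q_heads query heads share n_kv_heads key/value heads."""
    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        # No bias terms, matching the "no learnable bias in linear layers" choice above.
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Repeat each K/V head so every group of query heads attends to its shared K/V head.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        # scaled_dot_product_attention dispatches to a flash-attention kernel when available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```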

Layer‑wise Scaling Details

Two hyper‑parameters α and β scale the number of attention heads (n_h) and the FFN multiplier (m) per layer, allowing non‑uniform parameter allocation.

Figure: layer-wise scaling formula
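The original figure shows the scaling rule; the LaTeX below is a hedged reconstruction based on the paper's description of linearly interpolating α and β across the N layers, with model dimension d_model and head dimension d_h (check exact notation against the paper).

```latex
% Per-layer scaling factors, linearly interpolated over layers 0 <= i < N
\alpha^{(i)} = \alpha_{\min} + \frac{(\alpha_{\max} - \alpha_{\min})\, i}{N - 1}, \qquad
\beta^{(i)}  = \beta_{\min}  + \frac{(\beta_{\max}  - \beta_{\min})\, i}{N - 1}

% Per-layer attention head count and FFN multiplier
n_h^{(i)} = \frac{\alpha^{(i)}\, d_{\text{model}}}{d_h}, \qquad
m^{(i)} = \beta^{(i)}
```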

Training Data and Procedure

Pre-training data combines RefinedWeb, a deduplicated version of The Pile, and subsets of RedPajama and Dolma v1.6, totaling roughly 1.8 trillion tokens.

Figure: training data composition

Training used Apple's open-source CoreNet library (formerly CVNets) for 350k iterations, producing the four model sizes.

Evaluation

OpenELM was benchmarked on zero-shot and few-shot tasks against open LLMs such as Pythia, Cerebras-GPT, TinyLlama, OpenLM, MobiLlama, and OLMo.

Figure: zero-shot performance comparison

Results show that OpenELM 1.1B surpasses OLMo 1.2B by 1.28 to 2.36 percentage points in accuracy across the reported benchmarks, despite using less pre-training data.

Instruction Fine‑Tuning and PEFT

Instruction tuning improves average accuracy by 1-2%. Parameter-efficient fine-tuning (PEFT) experiments with LoRA and DoRA on a commonsense reasoning dataset show that the two methods perform comparably.
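For readers who want to reproduce a LoRA-style setup, here is a minimal sketch using Hugging Face's peft library. The Hub model ID, rank, alpha, and target module names are illustrative assumptions, not the values used in the paper; inspect `model.named_modules()` to find the actual projection names.

```python
# Minimal LoRA fine-tuning setup sketch (illustrative hyperparameters, not the paper's).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# trust_remote_code is needed because OpenELM ships custom modeling code on the Hub (assumed ID).
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-1_1B", trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                      # low-rank dimension (assumed value)
    lora_alpha=32,             # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj"],  # module names are an assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights remain trainable
```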

Figure: PEFT results

Performance Analysis

Throughput analysis attributes OpenELM's slower inference (relative to OLMo) to a naïve RMSNorm implementation that launches many small kernels. Replacing it with Apex's optimized RMSNorm improves throughput, yet OpenELM remains slower than OLMo because it contains more RMSNorm layers.
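To illustrate the bottleneck, here is a naïve RMSNorm in PyTorch of the kind the analysis describes: each elementwise operation below can launch its own small GPU kernel, whereas a fused implementation (such as Apex's) performs the whole normalization in one kernel. This is a sketch of the general issue, not OpenELM's actual code.

```python
import torch
from torch import nn

class NaiveRMSNorm(nn.Module):
    """Straightforward RMSNorm; each tensor op may launch a separate small GPU kernel."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pow, mean, add, rsqrt, and two multiplies: several small kernels per call.
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

# A fused alternative collapses these ops into a single kernel, e.g. (assuming Apex is installed):
#   from apex.normalization import FusedRMSNorm
#   norm = FusedRMSNorm(dim)
```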

Figure: performance bottleneck analysis

Resources

Paper: https://arxiv.org/pdf/2404.14619.pdf

Code repository: https://github.com/apple/corenet
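For a quick hands-on start, the sketch below loads a released checkpoint through Hugging Face Transformers. The Hub model ID and tokenizer ID are assumptions (OpenELM reuses the LLaMA tokenizer, so a LLaMA-2 tokenizer checkpoint is used here, which is gated on the Hub); consult the CoreNet repository for the authoritative checkpoint names.

```python
# Hedged sketch: loading an OpenELM checkpoint for generation (model and tokenizer IDs are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
# OpenELM reuses the LLaMA tokenizer, so a LLaMA tokenizer checkpoint is loaded here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Layer-wise scaling means", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```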
