How Apple’s OpenELM Redefines Efficient LLM Scaling with Layer‑Wise Design
Apple’s OpenELM is a family of layer‑wise‑scaled Transformer language models ranging from 270M to 3B parameters, released with a complete open‑source training and evaluation framework. Despite being pre‑trained on less public data, OpenELM outperforms comparable open LLMs on zero‑shot and few‑shot benchmarks; the paper also analyzes inference bottlenecks and reports parameter‑efficient fine‑tuning (PEFT) results.
Overview
Apple released OpenELM, a family of four decoder‑only Transformer language models (270M, 450M, 1.1B, 3B parameters) trained on public datasets.
Key Architectural Innovations
OpenELM uses layer‑wise scaling: each Transformer layer has its own head count and FFN dimension, breaking the uniform‑layer assumption.
No learnable bias in linear layers.
RMSNorm pre‑normalization and RoPE positional encoding.
Grouped‑query attention (GQA) instead of multi‑head attention.
SwiGLU feed‑forward network.
Flash attention for fast, memory‑efficient scaled dot‑product attention.
Same tokenizer as LLaMA.
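Of the components above, the SwiGLU feed‑forward block is easy to sketch: it gates a Swish‑activated projection with a second linear projection before projecting back down. A minimal NumPy sketch (all weight names and dimensions are illustrative, not OpenELM's actual configuration):

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: down-project the elementwise product of a
    # Swish-gated projection and a plain up-projection.
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ffn = 8, 16                       # toy sizes for illustration
x = rng.standard_normal((2, d_model))
w_gate = rng.standard_normal((d_model, d_ffn))
w_up = rng.standard_normal((d_model, d_ffn))
w_down = rng.standard_normal((d_ffn, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)      # shape (2, 8)
```

Note that SwiGLU needs three weight matrices rather than the two of a standard FFN, which is why the FFN multiplier matters for parameter budgeting.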
Layer‑wise Scaling Details
Two hyper‑parameters, α and β, scale the number of attention heads (n_h) and the FFN multiplier (m) per layer respectively, allowing parameters to be allocated non‑uniformly across depth.
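The idea can be sketched as linear interpolation of α and β across depth; the hyper‑parameter ranges and rounding below are illustrative assumptions, not OpenELM's published values:

```python
def layerwise_config(n_layers, d_model, d_head,
                     alpha_min=0.5, alpha_max=1.0,
                     beta_min=0.5, beta_max=4.0):
    """Per-layer head counts and FFN widths from linearly
    interpolated alpha (attention) and beta (FFN) scalers."""
    cfg = []
    for i in range(n_layers):
        t = i / (n_layers - 1)                      # 0 at first layer, 1 at last
        alpha = alpha_min + (alpha_max - alpha_min) * t
        beta = beta_min + (beta_max - beta_min) * t
        n_heads = max(1, round(alpha * d_model / d_head))
        d_ffn = round(beta * d_model)
        cfg.append({"layer": i, "n_heads": n_heads, "d_ffn": d_ffn})
    return cfg

# Early layers get fewer heads and narrower FFNs; later layers get more.
cfg = layerwise_config(n_layers=4, d_model=1024, d_head=64)
```

Compared with a uniform stack, this lets the model spend its parameter budget where it helps most, which is the core claim behind OpenELM's efficiency.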
Training Data and Procedure
Pre‑training data combines RefinedWeb, deduplicated PILE, subsets of RedPajama and Dolma v1.6, totaling ~1.8 trillion tokens.
Training used Apple’s open‑source CoreNet library (formerly CVNets) for 350k iterations, producing all four model sizes.
Evaluation
OpenELM was benchmarked on zero‑shot and few‑shot tasks against open LLMs such as Pythia, Cerebras‑GPT, TinyLlama, OpenLM, MobiLlama, and OLMo.
Results show OpenELM 1.1B surpasses OLMo (1.2B) by 1.28–2.36 percentage points in accuracy on several benchmarks, despite being pre‑trained on fewer tokens.
Instruction Fine‑Tuning and PEFT
Instruction tuning improves average accuracy by 1–2 percentage points. Parameter‑efficient fine‑tuning (PEFT) experiments with LoRA and DoRA on a commonsense reasoning dataset show the two methods perform comparably.
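LoRA's core mechanism is small enough to sketch: the pre‑trained weight is frozen, and a low‑rank update B·A (scaled by α/r) is learned on top of it. This NumPy sketch is a simplified illustration of the technique, not the paper's actual fine‑tuning setup:

```python
import numpy as np

def lora_forward(x, w, lora_a, lora_b, alpha=16, r=2):
    # Frozen base projection plus the scaled low-rank update.
    # Only lora_a and lora_b would be trained; w stays fixed.
    return x @ w + (x @ lora_a @ lora_b) * (alpha / r)

rng = np.random.default_rng(1)
d_in, d_out, r = 6, 4, 2
x = rng.standard_normal((3, d_in))
w = rng.standard_normal((d_in, d_out))        # frozen pre-trained weight
lora_a = rng.standard_normal((d_in, r)) * 0.01
lora_b = np.zeros((r, d_out))                 # zero-init: update starts as a no-op
y = lora_forward(x, w, lora_a, lora_b, r=r)
```

With B initialized to zeros, the adapted layer initially reproduces the frozen model exactly, so fine‑tuning starts from the pre‑trained behavior. DoRA extends this by decomposing the weight into magnitude and direction before applying the low‑rank update.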
Performance Analysis
Throughput analysis attributes slower inference to a naïve RMSNorm implementation that launches many small GPU kernels. Replacing it with Apex’s fused RMSNorm improves throughput, yet OpenELM still trails OLMo because it contains more RMSNorm layers.
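The math a fused RMSNorm kernel performs is just one reduction and one elementwise scale per row, as in this NumPy sketch (a reference for the computation, not the GPU kernel itself):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # Root-mean-square normalization: one reduction (mean of squares)
    # followed by one elementwise rescale. A fused kernel does both in
    # a single launch; a naive implementation launches several kernels.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.array([[3.0, 4.0]])
out = rmsnorm(x, weight=np.ones(2))   # each row ends up with unit RMS
```

Because every Transformer layer applies RMSNorm (twice, with pre‑normalization), kernel‑launch overhead scales with depth, which explains why a model with more normalization layers pays a larger penalty.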
Resources
Paper: https://arxiv.org/pdf/2404.14619.pdf
Code repository: https://github.com/apple/corenet