How Apple’s OpenELM Redefines Efficient LLM Scaling with Layer‑Wise Design

Apple's OpenELM introduces a layer-wise-scaled Transformer family ranging from 270M to 3B parameters, provides a fully open-source training framework, and demonstrates stronger zero-shot and few-shot performance than existing open LLMs despite using less public data, while also analyzing inference bottlenecks and parameter-efficient fine-tuning (PEFT) results.


Overview

Apple released OpenELM, a family of four decoder‑only Transformer language models (270M, 450M, 1.1B, 3B parameters) trained on public datasets.

Key Architectural Innovations

OpenELM uses layer‑wise scaling: each Transformer layer has its own head count and FFN dimension, breaking the uniform‑layer assumption.

No learnable bias in linear layers.

RMSNorm pre‑normalization and RoPE positional encoding.

Grouped‑query attention (GQA) instead of multi‑head attention.

SwiGLU feed‑forward network.

Flash attention for fast, memory-efficient scaled dot-product attention.

Same tokenizer as LLaMA.
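To make the attention choice concrete, here is a minimal PyTorch sketch of grouped-query attention, in which several query heads share each key/value head. The module name and dimensions are illustrative assumptions and do not reflect OpenELM's actual configuration; RoPE is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_q_heads query heads share n_kv_heads key/value heads."""
    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        # No bias terms, matching the "no learnable bias in linear layers" choice above.
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Repeat each K/V head so every group of query heads attends to its shared K/V head.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        # scaled_dot_product_attention dispatches to a flash-attention kernel when available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```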

Layer‑wise Scaling Details

Two hyper‑parameters α and β scale the number of attention heads (n_h) and the FFN multiplier (m) per layer, allowing non‑uniform parameter allocation.

Figure: layer-wise scaling formula
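The original figure shows the scaling rule; the LaTeX below is a hedged reconstruction based on the paper's description of linearly interpolating α and β across the N layers, with model dimension d_model and head dimension d_h (check exact notation against the paper).

```latex
% Per-layer scaling factors, linearly interpolated over layers 0 <= i < N
\alpha^{(i)} = \alpha_{\min} + \frac{(\alpha_{\max} - \alpha_{\min})\, i}{N - 1}, \qquad
\beta^{(i)}  = \beta_{\min}  + \frac{(\beta_{\max}  - \beta_{\min})\, i}{N - 1}

% Per-layer attention head count and FFN multiplier
n_h^{(i)} = \frac{\alpha^{(i)}\, d_{\text{model}}}{d_h}, \qquad
m^{(i)} = \beta^{(i)}
```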

Training Data and Procedure

Pre-training data combines RefinedWeb, a deduplicated version of The Pile, and subsets of RedPajama and Dolma v1.6, totaling roughly 1.8 trillion tokens.

Figure: training data composition

Training used Apple's open-source CoreNet library (formerly CVNets) for 350k iterations, producing the four model sizes.

Evaluation

OpenELM was benchmarked on zero-shot and few-shot tasks against open LLMs such as Pythia, Cerebras-GPT, TinyLlama, OpenLM, MobiLlama, and OLMo.

Figure: zero-shot performance comparison

Results show that OpenELM 1.1B surpasses OLMo 1.2B by 1.28 to 2.36 percentage points in accuracy across the reported benchmarks, despite using less pre-training data.

Instruction Fine‑Tuning and PEFT

Instruction tuning improves average accuracy by 1-2%. Parameter-efficient fine-tuning (PEFT) experiments with LoRA and DoRA on a commonsense reasoning dataset show that the two methods perform comparably.
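For readers who want to reproduce a LoRA-style setup, here is a minimal sketch using Hugging Face's peft library. The Hub model ID, rank, alpha, and target module names are illustrative assumptions, not the values used in the paper; inspect `model.named_modules()` to find the actual projection names.

```python
# Minimal LoRA fine-tuning setup sketch (illustrative hyperparameters, not the paper's).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# trust_remote_code is needed because OpenELM ships custom modeling code on the Hub (assumed ID).
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-1_1B", trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                      # low-rank dimension (assumed value)
    lora_alpha=32,             # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj"],  # module names are an assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights remain trainable
```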

Figure: PEFT results

Performance Analysis

Throughput analysis attributes OpenELM's slower inference (relative to OLMo) to a naïve RMSNorm implementation that launches many small kernels. Replacing it with Apex's optimized RMSNorm improves throughput, yet OpenELM remains slower than OLMo because it contains more RMSNorm layers.
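To illustrate the bottleneck, here is a naïve RMSNorm in PyTorch of the kind the analysis describes: each elementwise operation below can launch its own small GPU kernel, whereas a fused implementation (such as Apex's) performs the whole normalization in one kernel. This is a sketch of the general issue, not OpenELM's actual code.

```python
import torch
from torch import nn

class NaiveRMSNorm(nn.Module):
    """Straightforward RMSNorm; each tensor op may launch a separate small GPU kernel."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pow, mean, add, rsqrt, and two multiplies: several small kernels per call.
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x

# A fused alternative collapses these ops into a single kernel, e.g. (assuming Apex is installed):
#   from apex.normalization import FusedRMSNorm
#   norm = FusedRMSNorm(dim)
```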

Figure: performance bottleneck analysis

Resources

Paper: https://arxiv.org/pdf/2404.14619.pdf

Code repository: https://github.com/apple/corenet
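For a quick hands-on start, the sketch below loads a released checkpoint through Hugging Face Transformers. The Hub model ID and tokenizer ID are assumptions (OpenELM reuses the LLaMA tokenizer, so a LLaMA-2 tokenizer checkpoint is used here, which is gated on the Hub); consult the CoreNet repository for the authoritative checkpoint names.

```python
# Hedged sketch: loading an OpenELM checkpoint for generation (model and tokenizer IDs are assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
# OpenELM reuses the LLaMA tokenizer, so a LLaMA tokenizer checkpoint is loaded here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Layer-wise scaling means", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```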
