
Technical Report on the Index-1.9B Series: Model Variants, Pre‑training Optimizations, and Alignment Experiments

The report presents the open‑source Index‑1.9B family—base, pure, chat, and character variants—detailing benchmark results, pre‑training optimizations such as a normalized LM‑Head and deeper‑slim architectures, the importance of modest instruction data, alignment via SFT/DPO, role‑play enhancements with RAG, and acknowledges remaining safety and factual limitations.

Bilibili Tech

We introduce the lightweight versions of the Index series, the Index-1.9B family, which includes four model variants: Index-1.9B base (1.9 B non-embedding parameters, pretrained on 2.8 T Chinese-English tokens and leading on several benchmarks), Index-1.9B pure (same architecture as base but with all instruction-related data filtered out, to study the impact of instruction data), Index-1.9B chat (base further aligned with SFT and DPO, producing markedly more engaging conversation), and Index-1.9B character (base plus SFT/DPO plus RAG for few-shot role-play customization). The models are open-sourced on GitHub and HuggingFace.

The models' basic performance is illustrated with benchmark tables (Ceval, CMMLU, MMLU, ARC-C/E, Hellaswag) and visualized in several figures. Annotations indicate which scores are taken from published reports.

Pre-training optimization experiments focus on gradient distribution and LM-Head stability. We observe that the LM-Head layer dominates the overall gradient magnitude, making it a stability bottleneck. Introducing a normalized LM-Head (Norm-Head) improves training stability, and despite yielding higher gradient-norm values it achieves better downstream scores.
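The Norm-Head idea can be sketched as follows: before computing logits, each vocabulary row of the LM-Head weight matrix is rescaled to unit L2 norm, decoupling the logit scale from the raw weight magnitude. This is a minimal NumPy sketch; the function name and the eps value are illustrative, not taken from the report.

```python
import numpy as np

def norm_head_logits(hidden, head_weight, eps=1e-6):
    """Compute logits through an L2-normalized LM-Head (Norm-Head sketch).

    hidden:      (batch, d_model) final hidden states
    head_weight: (vocab, d_model) LM-Head weight matrix
    """
    # Rescale each vocabulary row to unit L2 norm, so the logit scale
    # no longer depends on the raw magnitude of the head weights.
    norms = np.linalg.norm(head_weight, axis=1, keepdims=True)
    return hidden @ (head_weight / (norms + eps)).T
```

Because every row is unit-length after normalization, each logit is just the projection of the hidden state onto a unit direction, which bounds how much any single vocabulary row can dominate the gradient.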

We also explore the effect of model shape. Keeping total FLOPs constant, deeper and slimmer (deep-slim) models outperform shallower and wider (short-fat) ones, though deeper models consume more activation memory. Experiments compare a 36-layer base (≈1.01 B non-embedding parameters) with a 9-layer counterpart.
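The parameter match between the two shapes can be checked with simple arithmetic: under the common approximation of 4·d² attention parameters and 8·d² FFN parameters per layer, a 36-layer model at hidden size 1536 and a 9-layer model at hidden size 3072 land on the same ≈1.0 B non-embedding count. The hidden sizes here are illustrative assumptions, not figures from the report.

```python
def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding count: 4*d^2 (attention) + 8*d^2 (FFN) per layer."""
    return n_layers * (4 * d_model**2 + 8 * d_model**2)

deep_slim = non_embedding_params(36, 1536)   # deeper, narrower
short_fat = non_embedding_params(9, 3072)    # shallower, wider
assert deep_slim == short_fat                # same budget, different shape
```

Quartering the depth while doubling the width leaves the 12·d² per-layer cost exactly balanced, which is why FLOPs-matched shape comparisons like this are possible at all.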

Learning-rate matters: simple changes to the maximum learning rate (e.g., 2e-4 vs. 5e-4) produce consistent performance gains. Different learning-rate schedules (cosine, linear, WSD) converge to similar validation loss and evaluation scores, indicating that as long as the final learning-rate scale is comparable, the choice of schedule is less critical.
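Of the three schedules, WSD (Warmup-Stable-Decay) is the least familiar; a minimal sketch follows. The warmup and decay fractions are illustrative defaults, not values from the report.

```python
def wsd_lr(step: int, total_steps: int, max_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to 0."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return max_lr * step / warmup_steps   # linear warmup
    if step < decay_start:
        return max_lr                         # stable plateau
    # linear decay to zero over the final fraction of training
    return max_lr * (total_steps - step) / max(total_steps - decay_start, 1)
```

Unlike cosine, WSD holds the learning rate at its maximum until the final stretch of training, which makes it straightforward to introduce new data only during the decay phase.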

We further investigate the interaction between learning‑rate schedules and data quality. Adding high‑quality data during the decay phase of WSD yields the best results, while cosine with added data performs slightly worse, likely due to overly low learning rates at the end of training.

Instruction-tuning impact: two ablation groups are compared, pure (no instruction data) and boost (7 % instruction data added during the decay phase). The boost group improves MMLU scores by ~7.0 points, confirming that a modest amount of instruction data can significantly raise benchmark performance.

Emergence during training: while still in the stable phase (before decay), the 1.9 B model shows a sudden performance jump around 1.0–1.2 T tokens, with Ceval and MMLU rising from ~27/26 to ~36/33, surpassing many 7 B models.

Alignment discussion: to better align with human preferences, we apply SFT and DPO on the base model. SFT uses >10 M high-quality bilingual instruction pairs (≈100 k after filtering) in a system-query-response format with a 1e-5 learning rate. DPO leverages >100 k high-quality chosen/rejected pairs, focusing on writing, instruction compliance, and safety, with a 1e-6 learning rate, cosine scheduling, and β = 0.1.
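The DPO objective used in this stage follows the standard formulation; a NumPy sketch with β = 0.1 as stated above (the function signature and per-sequence log-probability inputs are illustrative):

```python
import numpy as np

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)) for numerical stability
    return float(np.log1p(np.exp(-beta * margin)))
```

When the policy matches the reference model the margin is zero and the loss sits at log 2; widening the gap between chosen and rejected log-probabilities drives it toward zero.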

Role-play capability: we collect ~80 k high-quality dialogue lines from public scripts, filter them with a role-reward model, and augment prompts with retrieved past utterances via RAG. On the CharacterEval benchmark, the 1.9 B model ranks 9th among peers, demonstrating strong role-play consistency and attractiveness.
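The RAG step can be sketched as a simple retrieve-then-prompt loop. The word-overlap scorer and prompt layout below are hypothetical simplifications; the report does not specify its retriever.

```python
def retrieve_utterances(query: str, memory: list[str], k: int = 2) -> list[str]:
    """Rank stored character utterances by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(memory,
                  key=lambda u: len(query_words & set(u.lower().split())),
                  reverse=True)[:k]

def build_roleplay_prompt(system: str, query: str, memory: list[str]) -> str:
    """Prepend retrieved past utterances so the model stays in character."""
    context = "\n".join(f"[past] {u}"
                        for u in retrieve_utterances(query, memory))
    return f"{system}\n{context}\n[user] {query}\n[assistant]"
```

In practice the overlap scorer would be replaced by an embedding-based retriever, but the shape of the pipeline, retrieve in-character history and splice it ahead of the user turn, stays the same.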

Limitations: despite compliance checks, potential data-related legal or safety issues may remain. The model can still generate factual errors or misunderstand instructions, especially given its small parameter count. Future work will focus on further alignment and RAG techniques.

References include recent works on small LLMs, scaling laws, instruction tuning, and direct preference optimization.

Tags: model optimization, LLM, large language model, evaluation, alignment, pretraining, instruction tuning
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.