Scaling Laws, Synthetic Data, and New Model Architectures: What’s Next?
In a recent round‑table, experts debated the validity of scaling laws, the role of synthetic and semi‑synthetic data in overcoming data scarcity, alternatives to Transformers such as RNN‑based models and Mixture‑of‑Experts (MoE), and techniques for handling long‑context inference efficiently.
1. Do You Trust Scaling Laws?
Participants discussed the controversy around scaling laws. Some believe that continually increasing data and compute will eventually lead to AGI, while others argue performance will plateau even with unlimited resources. The original scaling law was proposed by OpenAI in 2020 and later refined by DeepMind’s 2022 Chinchilla scaling law, which ties the compute‑optimal dataset size proportionally to the number of model parameters (roughly 20 training tokens per parameter).
One view holds that scaling laws remain valid: more high‑quality data and larger models still improve performance. However, recent observations, such as the release of GPT‑4o and other strong models, suggest the growth curve may be flattening. The slowdown could stem from data scarcity: Llama 3 was already trained on roughly 15 trillion tokens, approaching the limits of the high‑quality text publicly available.
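To make the Chinchilla relationship concrete, here is a minimal Python sketch. It assumes the paper’s rough rules of thumb, about 20 training tokens per parameter and the common C ≈ 6·N·D estimate for training FLOPs; both constants are approximations, not exact values.

```python
# Minimal sketch of the Chinchilla compute-optimal rule of thumb.
# The 20-tokens-per-parameter ratio and C = 6*N*D FLOPs estimate are
# rough approximations from the Chinchilla paper, not exact constants.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a model."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute estimate: C = 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

if __name__ == "__main__":
    n = 70e9                                   # a 70B-parameter model
    d = chinchilla_optimal_tokens(n)           # ~1.4 trillion tokens
    print(f"optimal tokens: {d:.2e}")
    print(f"training FLOPs: {training_flops(n, d):.2e}")
```

For a 70B‑parameter model this yields about 1.4 trillion tokens, which matches the actual Chinchilla training run.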
2. Can Synthetic Data Solve the Data Bottleneck?
Synthetic data is increasingly used to alleviate data shortages. Models such as Llama 3 incorporate synthetic samples in training, and systems such as Sora rely on data generated in a semi‑synthetic manner. Two main pathways were identified:
Improve a model’s data efficiency so it learns more from the same amount of data—an attractive but currently unsolved direction.
When real data runs out, rely on synthetic data generation.
Successful examples include instruction‑tuning pipelines in which GPT‑4 generates answers to human‑crafted questions, producing high‑quality “question‑answer” pairs. Another approach, the semi‑synthetic data mentioned above, augments existing human‑annotated pairs (e.g., text‑image for DALL‑E 3 or text‑video for Sora) by expanding the textual description with AI, then uses the enriched pairs for training.
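As an illustration of the first pattern, here is a minimal sketch of generating the answer half of QA pairs with the OpenAI Python client; the model name, prompt wording, and questions are illustrative assumptions, not details from the round‑table.

```python
# Minimal sketch of building "question-answer" pairs from human-crafted
# questions. The model name and prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(question: str, model: str = "gpt-4o") -> dict:
    """Use a strong model to produce the answer half of a QA pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer clearly and concisely."},
            {"role": "user", "content": question},
        ],
    )
    return {"question": question, "answer": response.choices[0].message.content}

questions = [
    "Explain the Chinchilla scaling law in one paragraph.",
    "What is a KV-cache and why does it grow with context length?",
]
dataset = [answer_question(q) for q in questions]  # instruction-tuning pairs
```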
Fully synthetic data—generating entirely machine‑produced samples that match human distribution—is still an open research challenge.
3. Are There Model Architectures Better Than Transformers?
The discussion highlighted two broad strategies for addressing the Transformer’s long‑context inefficiencies:
RNN‑based designs offer low memory and compute costs during inference but suffer from poor training parallelism; the recurrence sketch after this list illustrates the inference‑side advantage.
Transformer‑centric modifications aim to reduce the quadratic cost of self‑attention, preserving training efficiency while cutting inference overhead.
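The snippet below implements an unnormalized linear‑attention recurrence as a sketch of the RNN‑style trade‑off: inference keeps a single fixed‑size state matrix instead of a growing KV‑cache, so per‑token cost is constant in sequence length. The shapes and the lack of normalization are simplifying assumptions, not any specific published architecture.

```python
# Why RNN-style models are cheap at inference: a (non-normalized)
# linear-attention recurrence keeps one fixed-size state matrix
# instead of a KV-cache that grows with every token.
import numpy as np

d = 8                                # head dimension (illustrative)
state = np.zeros((d, d))             # fixed-size recurrent state

def step(state, k, v, q):
    """One decoding step: update state with (k, v), read out with q."""
    state = state + np.outer(k, v)   # s_t = s_{t-1} + k_t v_t^T
    out = state.T @ q                # o_t = s_t^T q_t  (cost O(d^2), not O(T))
    return state, out

rng = np.random.default_rng(0)
for _ in range(1000):                # 1000 tokens; memory stays O(d^2)
    k, v, q = rng.normal(size=(3, d))
    state, out = step(state, k, v, q)
print(out.shape)                     # (8,) -- per-step cost independent of length
```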
Recent advances, such as Google Gemini supporting two‑million‑token contexts, indicate that extending context length is becoming feasible.
4. The Role of Mixture‑of‑Experts (MoE) Models
MoE models have become attractive as scaling laws push model sizes upward, because they keep training and inference costs manageable by activating only a subset of experts per token. In practice, however, implementations often fall short of the ideal in which each expert handles its own domain, behaving more like flexible ensembles.
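A minimal sketch of the top‑k routing behind those cost savings: only k of n experts run per token, so active compute stays flat while total parameter count grows. The gating below is a bare‑bones illustration with no load balancing, not any particular production MoE.

```python
# Minimal top-k expert routing for one token. Only the selected
# experts' parameters are touched, which is the source of MoE savings.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))            # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x through its top-k experts."""
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]                     # indices of top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                             # (16,)
```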
5. Tackling Long‑Context Challenges
Two primary technical directions were outlined:
Compress the KV‑cache directly, e.g., through quantization (a 3‑bit KV‑cache can enable a 1M‑token context on a single GPU) or low‑rank decomposition similar to LoRA; a minimal quantization sketch appears at the end of this section.
Compress the input context itself. Retrieval‑Augmented Generation (RAG) stores large knowledge bases externally and retrieves relevant chunks, reducing the prompt length. An in‑memory variant maintains a “memory” that chunks long inputs, akin to an internal RAG. Recent Google work on unlimited‑context models follows this idea.
Combining both KV‑cache reduction and input compression can further alleviate memory and compute pressures for long‑context inference.
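As an illustration of the first direction, here is a minimal per‑channel quantization sketch for a KV‑cache tensor. It is a generic asymmetric scheme for exposition, not the exact algorithm of any published 3‑bit KV‑cache method.

```python
# Minimal per-channel asymmetric quantization of a KV-cache tensor.
# Real systems pack eight 3-bit codes into 3 bytes; uint8 storage is
# used here purely for simplicity of illustration.
import numpy as np

def quantize(x: np.ndarray, n_bits: int = 3):
    """Quantize each channel (last axis) to n_bits levels."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / (2**n_bits - 1)
    q = np.round((x - lo) / scale).astype(np.uint8)   # codes in [0, 2^n_bits - 1]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate values at attention time."""
    return q * scale + lo

kv = np.random.default_rng(0).normal(size=(4096, 128))  # (tokens, head_dim)
q, scale, lo = quantize(kv, n_bits=3)
err = np.abs(dequantize(q, scale, lo) - kv).mean()
print(f"mean abs error at 3 bits: {err:.4f}")
```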
