Why Building LLMs Is Like Buying a Hardware Lottery Ticket – Lessons from a Startup
The article recounts Yi Tay’s experience founding Reka and building large language models from scratch, highlighting the unpredictable quality of GPU clusters, the challenges of multi‑cluster orchestration, code‑base choices, and how startups must rely on fast, intuition‑driven experimentation to succeed.
Yi Tay left Google after three years to co‑found Reka, a startup aiming to train large language models comparable to Gemini Pro or GPT‑3.5 within a year. In a candid blog post, he details the practical engineering obstacles he faced, from securing compute to dealing with unreliable hardware providers.
Hardware lottery
Access to compute is the obvious bottleneck, but the real surprise is how unstable cloud GPU providers are. Even when renting nominally identical H100 GPUs, overall cluster quality varies dramatically from cluster to cluster: some suffer frequent node failures, cabling problems, and I/O bottlenecks that can waste thousands of GPU hours. These inconsistencies hurt model FLOPs utilization (MFU) and checkpoint reliability, so renting a cluster feels like buying a lottery ticket.
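To make the MFU metric concrete, here is a minimal sketch of how it is commonly estimated. It assumes a dense decoder-only transformer and the widely used approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass, ignoring attention FLOPs. The function name, the throughput figure, and the cluster size are hypothetical; the peak value is the approximate dense BF16 figure from the vendor's H100 SXM spec sheet and should be replaced with the number for the hardware actually in use.

```python
def model_flops_utilization(n_params, tokens_per_second,
                            n_accelerators, peak_flops_per_accelerator):
    """Estimate MFU as achieved training FLOP/s divided by theoretical peak.

    Assumes ~6 * n_params FLOPs per token (forward + backward) for a dense
    transformer; attention FLOPs are ignored, so this is an approximation.
    """
    if n_accelerators <= 0 or peak_flops_per_accelerator <= 0:
        raise ValueError("accelerator count and peak FLOP/s must be positive")
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = n_accelerators * peak_flops_per_accelerator
    return achieved_flops / peak_flops


if __name__ == "__main__":
    # All inputs below are illustrative assumptions, not figures from the post.
    mfu = model_flops_utilization(
        n_params=7e9,                        # a 7B-parameter model
        tokens_per_second=2.4e6,             # hypothetical cluster-wide throughput
        n_accelerators=256,                  # hypothetical cluster size
        peak_flops_per_accelerator=989e12,   # approx. H100 SXM dense BF16 peak
    )
    print(f"Estimated MFU: {mfu:.1%}")
```

Tracking this number over time is one way to notice a degraded cluster: if the same job on nominally identical hardware reports a markedly lower MFU, the cause is more likely interconnect, I/O, or node health than the model code.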
GPU vs TPU
While Google’s internal TPU pods rarely fail, GPU clusters in the wild exhibit high failure rates. Yi attributes this not to the silicon itself but to the competence of the hardware‑support teams managing the accelerators. Robust hardware support is essential; otherwise, training can stall or crash within days.
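A common mitigation when training can crash at any moment is periodic checkpointing with automatic resume. The sketch below is not Reka's tooling; it is a toy, standard-library-only illustration with hypothetical names (`save_checkpoint`, `load_checkpoint`, `train`) and a JSON file standing in for real model and optimizer state. It shows two ideas: writing checkpoints atomically so that a failure mid-write cannot corrupt the latest good checkpoint, and resuming from the last saved step after a simulated node failure.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    `fail_at` simulates a node failure just before the given step runs.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real update
        if state["step"] % checkpoint_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        try:
            train(ckpt, total_steps=10, checkpoint_every=3, fail_at=7)
        except RuntimeError as err:
            print(f"crashed: {err}; last saved step = {load_checkpoint(ckpt)['step']}")
        final = train(ckpt, total_steps=10, checkpoint_every=3)
        print(f"resumed and finished at step {final['step']}")
```

The checkpoint interval is the main design choice: frequent checkpoints cost I/O time, while infrequent ones lose more work per failure, so less reliable clusters push the interval down.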
Multi‑cluster pain points
Startups often have to juggle several accelerator pools spread across different providers. Data movement at the terabyte scale, fragmented infrastructure, and the lack of a unified orchestration layer make scaling arduous. Building a custom orchestration layer is realistic for large AI labs but usually out of reach for early‑stage companies.
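A quick back-of-the-envelope calculation shows why terabyte-scale data movement between clusters is painful. The sketch below is illustrative only: the function name, dataset size, link speed, and efficiency factor are all assumptions, not figures from the post.

```python
def transfer_hours(size_terabytes, link_gbps, efficiency=0.7):
    """Estimate hours needed to copy a dataset over a network link.

    size_terabytes: dataset size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps:      nominal link speed in gigabits per second
    efficiency:     assumed fraction of nominal bandwidth actually achieved
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    total_bits = size_terabytes * 8e12
    effective_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / effective_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 50 TB of training data over a 10 Gbps link.
    hours = transfer_hours(size_terabytes=50, link_gbps=10)
    print(f"Estimated transfer time: {hours:.1f} hours")
```

Under assumptions like these, each move between clusters can take many hours, which is time a small team cannot spend training.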
Wild code choices
Yi’s team moved from Google‑centric libraries such as T5X and Mesh TensorFlow to the more widely supported PyTorch, citing better usability for non‑Google engineers. However, external codebases lack the stability and feature completeness of internal Google stacks, especially for large‑scale encoder‑decoder or prefix‑LM training, and often require manual model‑parallelism adapters.
Less principled, more YOLO
Instead of exhaustive systematic sweeps, the team adopted a rapid‑iteration approach: small‑scale, short‑duration runs (the “YOLO” mindset) to quickly identify promising configurations. This intuition‑driven method allowed them to produce a 21B “Reka Flash” model and a 7B edge model with far fewer experiments than traditional large‑scale labs.
Overall, the post underscores that building LLMs outside of a well‑resourced organization demands coping with hardware variability, crafting ad‑hoc tooling for monitoring and checkpointing, and embracing fast, experimental cycles.
