Foundation-scale models are no longer the domain of the top AI labs
Picture this: a vast and powerful AI model, but one not built in the cold, secretive labs of the tech giants.
Instead, it’s brought to life by millions of everyday people. Gamers, whose GPUs are usually busy rendering epic tournaments in Call of Duty, are now lending their computing power to something even grander—an open-source AI model owned by everyone around the globe.
In this future, foundation-scale models are no longer the exclusive domain of the top AI labs.
For now, though, the lion’s share of heavy AI model training remains anchored in centralised data centres.
Companies like OpenAI are scaling up their massive clusters, and they’re not alone. Elon Musk recently announced that xAI is nearing completion of a data centre boasting the equivalent of 200,000 H100 GPUs. It’s aptly named “Colossus”.
Why is AI training still so centralised? It’s all about efficiency. Training large AI models relies on several key techniques:
• Data parallelism: Replicating the model on every GPU and splitting the dataset between them, so the same training steps run in parallel and training speeds up.
• Model parallelism: Dividing the model itself across multiple GPUs to get around per-GPU memory constraints.
These methods require constant data exchange between GPUs, making interconnect speed critical.
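As a concrete illustration, here’s a minimal, hand-rolled sketch of the data-parallel case in PyTorch. Real systems use wrappers like DistributedDataParallel; the process-group setup is assumed to have been done elsewhere (e.g. via torchrun), and the function names are my own.

```python
# Minimal data-parallel training step with an explicit gradient all-reduce.
# Assumes torch.distributed has already been initialised (e.g. by torchrun).
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    # Each GPU computed gradients on its own slice of the data. Before the
    # weight update, those gradients are averaged across all workers -- and
    # this exchange happens at every single step, which is why interconnect
    # speed is so critical.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()
```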
Interconnect speed = rate at which data is transferred between computers in the network
When training runs cost $50M+, every ounce of efficiency matters.
Model FLOPs utilization (MFU), the ratio of a training run’s achieved throughput to the hardware’s theoretical peak, typically hovers around 35-40%. The metric was first introduced in Google’s PaLM paper back in 2022.
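For intuition, here’s a rough, back-of-the-envelope way to compute MFU, using the common approximation that one training token costs about 6 FLOPs per parameter. Every number below is an illustrative assumption, not a figure from a real run.

```python
# Illustrative MFU estimate; all numbers are assumptions chosen for the example.
peak_flops_per_gpu = 989e12      # ~989 TFLOPs, roughly an H100's dense BF16 peak
n_gpus = 1024
n_params = 70e9                  # 70B-parameter model (assumed)
tokens_per_second = 1.0e6        # observed cluster throughput (assumed)

# Common approximation: forward + backward pass costs ~6 FLOPs per parameter per token.
achieved_flops = 6 * n_params * tokens_per_second
mfu = achieved_flops / (peak_flops_per_gpu * n_gpus)
print(f"MFU = {mfu:.0%}")        # ~41% with these numbers
```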
Why so low? While GPU performance has skyrocketed, other essential components like networking, memory, and storage have only improved by 5-10x, creating bottlenecks. This leaves GPUs often sitting idle, waiting for data.
With their high-speed interconnects, centralised data centres can transfer data between GPUs far more efficiently, leading to significant cost savings.
So, one of the main challenges with decentralised AI training is low interconnect speed.
Since GPU clusters aren’t physically in the same location, transferring data between them takes longer, leading to a bottleneck. Training requires GPUs to exchange data and sync at every step.
The farther apart the clusters are geographically, the higher the latency, slowing down the process and raising costs.
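Some quick arithmetic shows how big the gap is. The model size, precision, and bandwidth figures below are assumptions chosen purely for illustration.

```python
# How long does one full gradient sync take? (illustrative assumptions only)
n_params = 70e9                    # 70B-parameter model (assumed)
bytes_per_param = 2                # bf16 gradients
sync_bytes = n_params * bytes_per_param        # ~140 GB shipped per sync

datacentre_bw = 900e9              # ~900 GB/s, roughly NVLink-class bandwidth
internet_bw = 1e9 / 8              # a 1 Gbps connection is ~125 MB/s

print(f"inside a data centre: ~{sync_bytes / datacentre_bw:.2f} s")       # ~0.16 s
print(f"over the internet:    ~{sync_bytes / internet_bw / 60:.0f} min")  # ~19 min
```

That gap is exactly what the bandwidth-reduction techniques below attack: if you can’t make the pipe faster, send data through it far less often.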
Distributed training methods that can train smaller models in low-interconnect-speed environments already exist, and frontier research is now extending them to larger models.
For example, Prime Intellect’s OpenDiLoCo paper demonstrates a practical approach where islands of GPUs perform 500 “local steps” before sharing and synchronising their results, cutting bandwidth requirements by up to 500x. They’ve scaled this approach to a 1.1B-parameter model.
Introducing OpenDiLoCo, an open-source implementation and scaling of DeepMind’s Distributed Low-Communication (DiLoCo) method, enabling globally distributed AI model training. — Prime Intellect (@PrimeIntellect)
4:55 PM • Jul 11, 2024
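To give a feel for how local steps cut bandwidth, here’s a rough sketch of the idea, loosely in the spirit of DiLoCo. This is not Prime Intellect’s actual code, and the optimizer handling below is a simplified assumption.

```python
# Rough sketch of local-step training: each island of GPUs trains on its own
# for H steps, then only the parameter deltas are synchronised.
import torch.distributed as dist

H = 500  # inner steps between synchronisations -- the source of the ~500x saving

def local_step_round(model, inner_opt, outer_opt, data_iter, loss_fn):
    # Snapshot the globally shared weights before local training begins.
    shared = [p.detach().clone() for p in model.parameters()]

    # 1) Inner loop: plain local training, zero cross-island communication.
    for _ in range(H):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

    # 2) Outer step: communicate once per H steps instead of once per step.
    #    The averaged parameter delta acts as a "pseudo-gradient".
    for p, s in zip(model.parameters(), shared):
        delta = s - p.detach()
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)
        delta /= dist.get_world_size()
        p.data.copy_(s)      # roll back to the shared weights...
        p.grad = delta       # ...and let the outer optimizer apply the averaged delta
    outer_opt.step()
    outer_opt.zero_grad()
```

Communication now happens once every H steps instead of every step, which is where the bandwidth saving comes from.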
Nous Research’s DisTrO offers another promising solution, reporting a 1,000x reduction in bandwidth requirements when training a 1.2B-parameter model.
What if you could use all the computing power in the world to train a shared, open source AI model? Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of… x.com/i/web/status/1… — Nous Research (@NousResearch) 5:25 PM • Aug 26, 2024
There is still quite a way to go, but so far, things look promising.
Another challenge is managing a diverse range of GPU hardware, including the consumer-grade GPUs with limited memory that are typical of decentralised networks. Techniques like model parallelism (splitting model layers across devices) can help make this feasible.
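As a minimal illustration of layer splitting, here’s a naive two-stage split, assuming two local CUDA devices for simplicity. In a decentralised network, the hop between stages would be an internet link between machines rather than a local bus, and the class below is purely hypothetical.

```python
# Naive model parallelism: the first half of the layers lives on one GPU,
# the second half on another, so no single card has to hold the whole model.
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self, d_model=1024, n_layers=8):
        super().__init__()
        half = n_layers // 2
        self.stage0 = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(half)]).to("cuda:0")
        self.stage1 = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(n_layers - half)]).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        # Activations cross the device boundary here; real pipeline-parallel
        # systems overlap this transfer with compute on micro-batches.
        x = self.stage1(x.to("cuda:1"))
        return x
```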
There are even bigger ideas on the horizon.
Pluralis suggests that decentralised networks could aggregate more GPU compute than centralised data centres ever could, since even the largest data centres are constrained by physical space, power, and cooling.
If that holds, the world’s most powerful AI models would eventually be trained in a decentralised fashion.
It’s an exciting thought, but I’m not fully sold on this vision just yet. I need to see better evidence that decentralised training of the largest AI models is technically and economically feasible.
There are already specific use cases for decentralised AI training, such as mid-sized models or certain model architectures that are particularly well suited to it.
Large global networks of nodes (computers) are essential to realise the vision of decentralised training and inference.
And the best way to build decentralised networks? Crypto, by far.
Once decentralised training becomes practically feasible, tokens will be the key to scaling these networks and rewarding contributors.
Tokens align the interests of participants, uniting them around a shared goal: growing the network and increasing the token’s value. A well-designed token economy creates a powerful incentive structure, helping overcome the early challenges that typically stall growing networks.
Just look at Bitcoin and Ethereum—they’ve already aggregated the largest supply of computing power globally, proving the power of token-driven networks.
I’d love to hear your thoughts—whether you agree or see things differently. Feel free to drop me an email.
Cheers,
Teng Yan