
The Compute Gold Rush

We have all seen or experienced the improvement in real-world AI performance as new models have been released over the last 12 to 18 months [1]. What we don’t often see are the compute resources that went into creating and operating these models. Those numbers have been growing exponentially.

It is always hard to wrap your head around exponential growth, so here are a few figures and anecdotes to understand this better.

By 2021, just a few models had broken the barrier of 10^23 FLOP of estimated training compute (GPT-3 being one of them). Fast forward three years, and over 77 models sit above that threshold [2]. These are very big numbers; only a handful of data centers in the world can deliver that much compute. Both US and EU regulators have established reporting requirements at similar compute thresholds to oversee the most consequential model developments [3][4].
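
To put 10^23 FLOP in context, a widely used rule of thumb for dense Transformer training estimates total compute as roughly 6 × parameters × training tokens. A minimal sketch, using GPT-3’s publicly reported figures (175B parameters, roughly 300B tokens), lands just past that threshold:

```python
# Back-of-envelope training-compute estimate using the common C ~ 6 * N * D
# rule of thumb (N = parameters, D = training tokens). GPT-3's figures come
# from its paper; treat the result as an order-of-magnitude estimate only.

def training_flop(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOP."""
    return 6 * params * tokens

gpt3_flop = training_flop(params=175e9, tokens=300e9)
print(f"GPT-3: ~{gpt3_flop:.2e} FLOP")  # ~3.15e+23, just past the 1e23 threshold
```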

How did we get here so quickly? As in many of these paradigm shifts, a combination of factors coincided into this swift consolidation of efforts and resources.

The seminal 2017 paper “Attention Is All You Need” [5] introduced the Transformer architecture, which showcased very compelling properties and scaling laws [6] that were perfectly suited to hardware accelerators like GPUs. For a given compute budget, larger models yield better results, and as the compute budget increases, performance improves with it. Great!
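
Concretely, the scaling laws in [6] fit test loss as a power law in training compute: more compute, lower loss, with no hard wall in sight. A minimal illustrative sketch of that relationship, using constants of roughly the magnitude reported in the paper (an illustration, not a re-derivation of the fit):

```python
# Illustrative sketch of the compute scaling law from Kaplan et al. [6]:
# test loss falls as a power law in training compute, L(C) = (C_c / C) ** alpha.
# The constants below are roughly the magnitudes reported in the paper
# (compute measured in petaflop/s-days); they are illustrative, not a refit.

ALPHA_C = 0.050   # approximate power-law exponent for compute-optimal training
C_C = 3.1e8       # approximate reference compute constant, in petaflop/s-days

def predicted_loss(compute_pf_days: float) -> float:
    """Predicted test loss at a given (compute-optimal) training budget."""
    return (C_C / compute_pf_days) ** ALPHA_C

for budget in (1e3, 1e4, 1e5, 1e6):
    # Loss keeps falling as the compute budget grows.
    print(f"{budget:.0e} PF-days -> loss ~ {predicted_loss(budget):.3f}")
```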

With falling compute costs [7], creating very large models and training jobs has also become economically viable for the first time. This factor is often understated; a recent survey among AI practitioners [8] ranked the falling cost of compute as one of the leading factors, if not the leading one, behind this shift towards large language models. Pair that with the most successful consumer product launch in history, ChatGPT reaching 100 million users in two months, and we are off to the races.

Let’s look at a specific example of this growth:

Microsoft Azure AI Datacenter Growth in the Last Four Years

Azure, the CSP behind OpenAI, deployed an AI datacenter in 2020 with a capacity of around 160 petaflops (1 petaflop = 10^15 FLOPS) to train GPT-3.5. Just three years later, Microsoft held the third spot among the largest AI datacenters in the world [9] with a 560-petaflop system. The 2024 figures have not been disclosed yet, but Azure CTO Mark Russinovich noted in a recent interview that their growth trajectory is only accelerating: “Today [...] we are deploying the equivalent of five supercomputers of that (560 petaflops) scale, every single month” [10]. Microsoft CTO Kevin Scott confirmed at Build that this isn’t a one-off: “We are nowhere near the point of diminishing returns on how powerful we can make AI models as we increase the scale of compute” [11].
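
Taking the quoted figures at face value, a quick back-of-envelope extrapolation (an illustration only, not a disclosed Microsoft number) shows how fast capacity compounds at that deployment pace:

```python
# Quick arithmetic on the Azure figures above: a 560-petaflop system in 2023
# and, per the Russinovich quote, deployments equivalent to five such systems
# every month in 2024. The yearly total is a simple extrapolation for
# illustration, not a disclosed Microsoft figure.

azure_2023_pf = 560                   # petaflops of the 2023 system [9]
monthly_added_pf = 5 * azure_2023_pf  # "five supercomputers of that scale, every single month"
yearly_added_pf = 12 * monthly_added_pf

print(f"Capacity added per month: {monthly_added_pf:,} petaflops")
print(f"Capacity added per year:  {yearly_added_pf:,} petaflops (~{yearly_added_pf / 1000:.1f} exaflops)")
# -> 2,800 petaflops per month, 33,600 petaflops (~33.6 exaflops) per year
```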

As in any gold rush, there has to be gold once we get there, and the estimates point to a very large shift in technology spend over the coming five years. Most developers and enterprises are considering how to integrate neural networks into the way they build their products, services, and experiences, as well as how they run their own companies more effectively. A recent a16z survey [12] among enterprises in their portfolio shows that companies expect at least a 2.5x increase in AI spend in 2024 alone, with use cases ranging from internal operations and customer service to value-added features in their products. The generative AI market, as measured by technology spend, is forecast to race toward $1.3 trillion by 2032, reaching over 10% of enterprises’ total technology spend [13].

This means that a significant share of our software experiences will be running on a neural network in just five years.

The Trials of the Compute Gold Rush

While the direction seems clear (more compute → better models; better and cheaper models → more demand), the path to get there isn’t as straightforward.

Design and production of chips and high-bandwidth memory are concentrated in a handful of companies. NVIDIA dominates design with its consumer and server offerings, while TSMC, the chip manufacturer, serves most chip designers in the world. Similarly, SK hynix, Samsung, and Micron produce nearly all high-bandwidth memory for consumer and server parts. Their production capacity is booked for months or even years.

As computers, tablets, phones, cars, and servers all race towards adding AI compute capabilities, supply is physically constrained by a handful of companies capable of satisfying it, making it very inelastic.

Access to the compute that is produced is also not distributed equally. Foundation model companies amass the majority of state-of-the-art training hardware (famously, Meta acquired 70% of the H2 2023 H100 production [14]), so only a minority is put back into circulation via CSPs for developers’ ML workloads. According to NVIDIA’s latest earnings, only about 45% of production (the tail end of the H100 line) ended up with CSPs for downstream consumption by developers in Q1, and that share would have been lower had some large companies not held off for the H200 later in the year [15].

The growth and deployment of large-scale datacenters like the one in the previous example is also bottlenecked by an obvious next hurdle: electricity. Annoyingly, physics is getting in the way of our software ambitions! The electric grid in many of the sites where data centers are currently deployed cannot support consumption growing at that pace. The AI gold rush, much like chip manufacturing, isn’t happening in isolation: many other industries and sectors are being electrified, competing for energy production and stressing the distribution infrastructure.

We need new ways of making compute available more broadly to accelerate this gold rush.

The Opportunities of the Compute Gold Rush

If we look at the total FLOP production of the last four years among high-end consumer and server-grade (below state-of-the-art) AI chipsets, the story looks far less bleak than painted above [16].

In fact, when looking at compute alone, the consumer gaming PCs and laptops sold in the last four years could amount to a staggering 13 exaflops (13 × 10^18 FLOPS) [17] if networked together in a “grid”; that is roughly 13 times the largest supercomputer in the world today [9].
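
The arithmetic behind that figure (per footnote [17]) is a simple back-of-envelope: roughly 160 million units sold, each assumed to deliver RTX 4090-class throughput, taken here as about 80 TFLOPS of FP32 (an assumption for illustration):

```python
# Reproducing the back-of-envelope from footnote [17]: roughly 160 million
# gaming PCs and laptops sold over four years [16], each assumed to deliver
# RTX 4090-class throughput, taken here as ~80 TFLOPS of FP32 (an assumption
# for illustration; real units vary widely).

units_sold = 160e6           # estimated gaming PCs/laptops sold over 4 years
flops_per_unit = 80e12       # assumed FP32 throughput per unit (RTX 4090 class)
supercomputer_flops = 1e18   # order of magnitude of today's largest system [9]

grid_flops = units_sold * flops_per_unit
print(f"Aggregate grid: {grid_flops:.1e} FLOPS (~{grid_flops / 1e18:.0f} exaflops)")
print(f"Versus a ~1 exaflop supercomputer: ~{grid_flops / supercomputer_flops:.0f}x")
```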

Just this month, Qualcomm announced the Snapdragon X family [18], an SoC that compares with Apple’s M series, and we are now looking at NPU energy efficiency that is incredibly competitive and can make this compute grid feasible and cost-effective.

The long tail of the supply is an incredible opportunity to scale the availability of compute and lower its cost to unleash further innovation. These chips are, or soon will be, ubiquitous; they come with their own thermal solutions and distributed energy access, are used only sporadically [19], and are all internet-connected, with bandwidth improving year over year.

The missing piece is a way to network supply over the internet effectively to perform ML jobs.

This is the challenge we took on with Kluster.ai: how do we make heterogeneous, unreliable, distributed compute nodes work as an aggregate (in a grid) to perform large distributed ML jobs at scale? In other words, how can we “feed” orders of magnitude more compute to our exponential “hunger” curve using a distributed protocol, instead of or alongside co-located data centers?

Decentralized physical infrastructure (DePIN) isn’t a new category, but with our take we aim to deliver competitive cost-to-performance and reliable service objectives that aren’t possible with today’s generic DePIN solutions, which simply focus on resource aggregation and redistribution.

We call the core technology we are developing to do this “Adaptive pipelines”. Through a series of post-training enhancements to open models, we can change the properties of these models so that they fit more naturally into a distributed compute paradigm.

A compute scheduling algorithm that is model-aware and environment-aware can then render a large enough group of unreliable nodes dependable: work is dispatched only when it accrues to the result’s quality and in response to the grid’s health and status.
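
To make the idea concrete, here is a deliberately simplified, hypothetical sketch of what a model- and environment-aware dispatch decision could look like. It illustrates the concept only; it is not Kluster.ai’s actual algorithm, and all node and task attributes below are made up:

```python
# A toy illustration of model- and environment-aware scheduling (hypothetical,
# not Kluster.ai's actual algorithm): a task is dispatched to a node only if
# the node's current health and the task's expected contribution to result
# quality clear configurable thresholds; otherwise it is held or reassigned.

from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    tflops: float           # advertised throughput
    uptime_score: float     # 0..1, observed reliability
    bandwidth_mbps: float   # current link quality

@dataclass
class Task:
    task_id: str
    min_tflops: float             # model-aware requirement (e.g. pipeline stage size)
    expected_quality_gain: float  # 0..1, does this work accrue to the result?

def should_dispatch(task: Task, node: Node,
                    min_uptime: float = 0.8,
                    min_bandwidth: float = 50.0,
                    min_gain: float = 0.05) -> bool:
    """Dispatch only if the node is healthy enough and the work is worth doing."""
    node_ok = (node.tflops >= task.min_tflops
               and node.uptime_score >= min_uptime
               and node.bandwidth_mbps >= min_bandwidth)
    return node_ok and task.expected_quality_gain >= min_gain

# Example: a flaky node is skipped even though it is fast enough.
node = Node("n-042", tflops=80, uptime_score=0.6, bandwidth_mbps=200)
task = Task("t-007", min_tflops=40, expected_quality_gain=0.2)
print(should_dispatch(task, node))  # False: uptime below the reliability threshold
```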

A lot more to share in the months to come, but in the meantime, learn more about it at kluster.ai

Sources

  1. Artificial Analysis. (n.d.). Quality, speed, price. Retrieved from https://artificialanalysis.ai/models
  2. Epoch AI. (2024, April 5). Tracking compute-intensive models. Retrieved from https://epochai.org/blog/tracking-compute-intensive-ai-models
  3. The White House. (2023, October 30). Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. Retrieved from https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
  4. European Commission. (2021). EU AI Act. Retrieved from https://ec.europa.eu/commission/presscorner/detail/en/qanda_21_1683
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Retrieved from: https://arxiv.org/abs/1706.03762
  6. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Retrieved from https://arxiv.org/pdf/2001.08361
  7. Epoch AI. (n.d.). Trends in machine learning hardware. Retrieved from https://epochai.org/blog/trends-in-machine-learning-hardware
  8. Grace, K., Stewart, H., Sandkühler, J. F., Thomas, S., Weinstein-Raun, B., & Brauner, J. (2024, January). Thousands of AI authors on the future of AI. Retrieved from https://arxiv.org/abs/2401.02843
  9. TOP500. (2024, June). June 2024 edition of TOP500. Retrieved from https://top500.org/lists/top500/2024/06/highs/
  10. Russinovich, M. (2024). What runs GPT-4o and Microsoft Copilot? Inside the 2024 AI supercomputer. Retrieved from https://youtu.be/DlX3QVFUtQI?si=z9UJf28jUKBEpvUs&t=63
  11. Microsoft Build. (2024). Retrieved from https://www.youtube.com/live/2bnayWpTpW8?si=LWhN9U3XdyZH1RMw&t=8065
  12. A16Z. (2023). A16Z survey. Retrieved from https://x.com/chiefaioffice/status/1771884750589841823
  13. Bloomberg. (2023). Generative AI spending: Generative AI races toward $1.3 trillion in revenue by 2032. Retrieved from https://www.bloomberg.com/professional/insights/data/generative-ai-races-toward-1-3-trillion-in-revenue-by-2032/
  14. PCMag. (2023). Meta buying 350k H100. Retrieved from https://www.pcmag.com/news/zuckerbergs-meta-is-spending-billions-to-buy-350000-nvidia-h100-gpus
  15. CNBC. (2024, May 22). NVIDIA Q1 earnings report. Retrieved from https://www.cnbc.com/2024/05/22/nvidia-nvda-earnings-report-q1-2025-.html
  16. Business Wire. (2021, March 29). Global gaming PC and monitor market hit new record high in 2020. Retrieved from https://www.businesswire.com/news/home/20210329005150/en/Global-Gaming-PC-and-Monitor-Market-Hit-New-Record-High-in-2020-According-to-IDC
  17. This extrapolates sales of gaming computers and laptops over the last four years [16] at around 160 million units and assumes RTX 4090 performance for simplicity.
  18. Qualcomm. (2023). Snapdragon X series. Retrieved from https://www.qualcomm.com/products/mobile/snapdragon/pcs-and-tablets/snapdragon-x-elite
  19. Gaming computer usage hours. “Taming the Energy Use of Gaming Computers”. Retrieved from https://sites.google.com/site/greeningthebeast/energy/taming-the-energy-use-of-gaming-computers