AI’s Seemingly Elusive Infrastructure – II

Undoubtedly, the infrastructure requirements for producing the 2020 175-billion-parameter GPT-3 model are hefty by most standards. But, of course, that just means that competitors now striving to generate their own creatures in the same genre must have access to similar infrastructure, not to mention talent. And we can count them on our fingers; sophisticated data analytics is unnecessary here. Regardless, such infrastructure is not limited to the computing power I described in the first part of this post.

Floating-point operations per second, or per watt, are essential, but they are variables of a much longer equation. After all, such computing power must reside somewhere on planet Earth. It is usually found in hyperscale data centers, living quite a hectic life. To give you an idea of their size, one of Google’s hyperscale data centers spans over 185,000 square meters and has a power capacity of over 100 megawatts. For comparison, a U.S. household’s average daily energy use is around 30 kilowatt-hours. Those beasts are also thirsty all the time and consume substantial water resources to keep their inhabitants comfortable and avoid death by exhaustion while running the endless digital marathon.
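To put those two figures on the same scale, here is a back-of-envelope comparison. Note the assumption: it treats the data center as drawing its full 100-megawatt capacity around the clock, which is an upper bound rather than typical operation.

```python
# Back-of-envelope comparison: a 100 MW hyperscale data center running
# at full capacity versus average U.S. household energy use (~30 kWh/day).

DATA_CENTER_MW = 100          # power capacity, megawatts (upper bound)
HOUSEHOLD_KWH_PER_DAY = 30    # average U.S. household daily energy use

# Energy the data center could draw in one day at full capacity:
# 100 MW * 1,000 kW/MW * 24 h = 2,400,000 kWh
dc_kwh_per_day = DATA_CENTER_MW * 1000 * 24

households_equivalent = dc_kwh_per_day / HOUSEHOLD_KWH_PER_DAY
print(f"{dc_kwh_per_day:,.0f} kWh/day ≈ {households_equivalent:,.0f} households")
# -> 2,400,000 kWh/day ≈ 80,000 households
```

In other words, one such facility at capacity can draw as much energy in a day as tens of thousands of homes.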

Indeed, location, land, water resources and electrical power are part of AI’s infrastructure requirements. That is why looking only at data center GHG emissions is not sufficient. Examining their ecological footprint in detail is also critical. The former has a global impact. The latter has a discerning local one that can directly affect local inhabitants. No wonder recent corporate data center literature warns investors about rising local populations, challenging new center deployment, or opposing planned expansions. Mitigation and adaptation must work hand in hand here.

However, ChatGPT’s production did not stop once the 2020 model was completed. Another turn of the screw was required to make it happen. Indeed, the 2022 paper mentioned in the first part reveals the path to the final product release in late November. Here, a completely different data set was used in the process. It comprised almost 3 million Reddit entries from 29 subreddits, though I’m unsure whether the selection was random. After cleaning and filtering, the production data set had fewer than 130k posts. Note also that the focus here was capturing post summaries (or TL;DRs, as the paper often repeats). Notably, summary posts with fewer than 24 or more than 48 tokens were also purged. Four models were pre-trained and trained with parameter counts ranging from 1.3 to 13 billion. The 6.7-billion-parameter model was chosen as the preferred flavor at the end of the process. Compared to the humongous 2020 model, this one is certainly much more modest, yet still significant by usual standards.
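The length-based purge described above can be sketched in a few lines. This is an illustration only: the paper’s actual pipeline uses its own tokenizer, so a naive whitespace split stands in here and token counts will differ.

```python
# Sketch of the summary-length filter: keep only posts whose summary
# falls between 24 and 48 tokens. A whitespace split stands in for the
# paper's actual tokenizer, so counts are only approximate.

MIN_TOKENS, MAX_TOKENS = 24, 48

def keep_summary(summary: str) -> bool:
    n_tokens = len(summary.split())  # stand-in tokenizer
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS

posts = [
    {"id": 1, "summary": "too short"},                  # 2 tokens: purged
    {"id": 2, "summary": " ".join(["word"] * 30)},      # 30 tokens: kept
    {"id": 3, "summary": " ".join(["word"] * 60)},      # 60 tokens: purged
]
filtered = [p for p in posts if keep_summary(p["summary"])]
print([p["id"] for p in filtered])  # -> [2]
```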

But the real innovation here does not stem from the above. Instead, it emerges from using human feedback to improve model learning and performance. To implement this new feature, external labelers were recruited to analyze existing data summaries and, in some cases, create their own. Reinforcement learning (RL) was used to handle the overall process. The labelers’ feedback helped refine the model and train an RL policy to guide the computational agent’s decisions, which showed substantial improvement. Indeed, this is yet another example of humans in the loop of AI. The paper tells us this set of workers, recruited from reasonably well-known task websites, were paid hourly and had various nationalities, although the information shared is far from complete. At least we learned firsthand that ghostwork was not part of the process. It also shows us the human side of infrastructure, usually considered a set of inanimate objects. After all, humans create infrastructure to serve and connect them in various ways.
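The core of the feedback signal is simple to illustrate: labelers pick the better of two candidate summaries, and those pairwise choices become the training data for a reward model, which the RL policy is then optimized against. The toy below is a deliberately minimal sketch, not OpenAI’s implementation: a win count stands in for the learned reward model, and all names are hypothetical.

```python
# Toy illustration of the human-feedback signal: pairwise labeler
# preferences become a reward signal used to rank candidate summaries.
# A simple win count stands in for a trained reward model.

from collections import defaultdict

# Pairwise preferences collected from labelers: (winner, loser)
comparisons = [
    ("summary_A", "summary_B"),
    ("summary_A", "summary_C"),
    ("summary_B", "summary_C"),
]

wins = defaultdict(int)
for winner, _loser in comparisons:
    wins[winner] += 1

def reward(candidate: str) -> int:
    """Stand-in reward: how often labelers preferred this summary."""
    return wins[candidate]

# In the real pipeline, the RL policy is optimized to produce
# high-reward outputs; here we simply rank the existing candidates.
candidates = ["summary_A", "summary_B", "summary_C"]
best = max(candidates, key=reward)
print(best)  # -> summary_A
```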

Regarding power usage, the fine-tuning of the 6.7-billion-parameter model using RL took 320 GPU-days, or 7,680 GPU-hours. With a single GPU, that would mean 320 calendar days, which was probably not the case; the actual number of days depends on how many GPUs were used in the computation, a number not reported. The paper also indicates the data collection and management phase was “also expensive compared to prior work — the training set took thousands of labeler hours and required significant researcher time to ensure quality” (pg. 9). Do not ask how many hours, though. It is undoubtedly quite curious that engineers, who must be very precise in their work and computations, are so imprecise regarding transparency.
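The GPU-days arithmetic is worth making explicit, since the reported figure is compute, not calendar time. A quick sketch, with illustrative GPU counts (the actual number is not reported):

```python
# 320 GPU-days is a measure of compute, not wall-clock time.
# Wall-clock time depends on how many GPUs ran in parallel.

GPU_DAYS = 320
gpu_hours = GPU_DAYS * 24  # 7,680 GPU-hours

for n_gpus in (1, 64, 512):  # illustrative parallelism levels
    wall_clock_days = GPU_DAYS / n_gpus
    print(f"{n_gpus:>4} GPUs -> {wall_clock_days:g} days of wall-clock time")
```

With 64 GPUs the run would take five days; with 512, about fifteen hours. Without the GPU count, the calendar duration cannot be recovered.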

Note that this paper, while toying with it, did not introduce the concept of Reinforcement Learning from Human Feedback (RLHF). That was done in a subsequent paper that, following the leads of previous research, presented InstructGPT. ChatGPT is thus the result of combining these approaches, RLHF being a critical differentiator from previous GPT-3 models pre-trained and trained by OpenAI. This reminds me of the old days when knowledge management was all the rage, and codifying tacit knowledge was the clarion call. I am not sure that really worked. Nowadays, we are trying to codify common knowledge so an allegedly intelligent computational agent can respond accordingly.

I decided to double-check the above summary with ChatGPT. The results were baffling. It was confused about InstructGPT, did not understand RLHF, and said the latter was not used for its training. It also did not disclose information readily available in the papers discussed above. I mean, is this guy drunk or what? It does not even know how it was produced. A transcript (copy-and-paste, I should say) of our little chat is here, typos included.

One strategy to gauge ChatGPT on an individual basis entails three components. First, always start by asking a question you know the answer to. Pick one of your fields of expertise, knowledge, or hobbies and do a drill-down. Second, always ask for sources for the text being generated by the computational agent and assess them. You probably know a few of them, but others might be seemingly randomly selected. Of course, linking the references to the text requires more analysis. And third, be aware that the agent is constantly being updated and, presumably, improving. Save some of your questions, repeat them when a new version comes out, and compare the results. You will then be able to assess whether it is improving, regardless of what the company and its wealthy sponsors say and the ongoing hype that might get much bigger soon.
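The third component, saving questions and comparing answers across releases, can be kept as simple as a small log. A minimal sketch, assuming a local JSON file; the file name, helper names, and version labels are all illustrative.

```python
# Minimal sketch: log each model version's answer to your benchmark
# questions in a JSON file, so releases can be compared side by side.
# File name, helper names, and version labels are illustrative.

import json
from pathlib import Path

LOG_FILE = Path("chatgpt_benchmark.json")

def record_answer(question: str, version: str, answer: str) -> None:
    """Store one answer under its question and model version."""
    log = json.loads(LOG_FILE.read_text()) if LOG_FILE.exists() else {}
    log.setdefault(question, {})[version] = answer
    LOG_FILE.write_text(json.dumps(log, indent=2))

def compare(question: str) -> dict:
    """All recorded answers to a question, keyed by model version."""
    log = json.loads(LOG_FILE.read_text()) if LOG_FILE.exists() else {}
    return log.get(question, {})

record_answer("Who wrote the InstructGPT paper?", "v1", "Answer one")
record_answer("Who wrote the InstructGPT paper?", "v2", "Answer two")
print(compare("Who wrote the InstructGPT paper?"))
```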

At a more macro level, the release of ChatGPT, “responsible” or not, has unleashed a new and intense competitive process. Big Tech members will not only attempt to develop their own, improved versions but also to examine more deeply other areas where AI appears more like AGI. That, in turn, could have a critical impact on the current infrastructure supporting such efforts, especially data centers, where we should see an increase in hyperscale deployments to meet the growing demand. Of course, AI is already being used in such deployments. However, that is not the same as repurposing data centers to cater to new AI production needs. As a result, more emissions are expected globally, along with an augmented ecological footprint locally.

Cheers, Raúl