Undoubtedly, the infrastructure required to produce the 2020 175-billion-parameter GPT-3 model is hefty by most standards. But, of course, that just means that competitors now striving to create their own creatures of the same genre must have access to similar infrastructure – not to mention talent. And we can count them on our fingers; no sophisticated data analytics is needed here. Regardless, such infrastructure is not limited to the computing power I described in the first part of this post.
FLOPS, in total or per watt, are essential metrics but are just variables in a much longer equation. After all, such computing power must reside somewhere on planet Earth. It usually resides in hyperscale data centers, living quite a hectic life. To give you an idea of their size, one of Google’s hyperscale centers occupies over 185 thousand square meters and has a power capacity of over 100 megawatts. For comparison, a U.S. household’s average daily energy use is around 30 kilowatt-hours. Those beasts are also thirsty all the time and consume substantial water resources to keep their inhabitants comfortable and avoid death by exhaustion while running the endless digital marathon.
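To put that in perspective, here is a quick back-of-envelope calculation using the figures above. The arithmetic is mine, not Google's, and it assumes the center runs flat out at its full power capacity:

```python
# Back-of-envelope sketch: a 100 MW hyperscale data center running at
# full capacity, compared with average U.S. household energy use.
DATACENTER_MW = 100
HOUSEHOLD_KWH_PER_DAY = 30  # rough U.S. average daily energy use

datacenter_kwh_per_day = DATACENTER_MW * 1000 * 24          # 2,400,000 kWh
equivalent_households = datacenter_kwh_per_day / HOUSEHOLD_KWH_PER_DAY

print(f"~{equivalent_households:,.0f} households per day")  # ~80,000
```

In other words, one such center can devour in a day what roughly eighty thousand homes do, and that is before counting the water.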
Indeed, location, land, water resources and electrical power are all part of AI’s infrastructure requirements. That is why looking only at data center GHG emissions is not sufficient. Examining their ecological footprint in detail is also critical. The former has a global impact. The latter has a distinctly local one that can directly affect nearby inhabitants. No wonder recent corporate data center literature warns investors about local populations increasingly challenging new center deployments or opposing planned expansions. Mitigation and adaptation must join hands here.
However, ChatGPT’s production did not stop once the 2020 model was completed. Another turn of the screw was required to make it happen. Indeed, the 2022 paper mentioned in the first part reveals the path to the final product released in late November. Here, a totally different data set was used in the process. It comprised almost 3 million Reddit entries selected from 29 subreddits, although I’m unsure whether the selection was random. After cleaning and filtering, the production data set had fewer than 130k posts. Note also that the focus here was capturing post summaries (or TL;DRs, as the paper often repeats). Notably, summaries shorter than 24 or longer than 48 tokens were also purged, a filter sketched below. Four models were pre-trained and then fine-tuned, with parameters ranging between 1.3 and 13 billion. The 6.7-billion-parameter model was chosen as the preferred flavor at the end of the process. Compared to the humongous 2020 model, this one is certainly much more modest, yet still significant by usual standards.
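For the curious, this is roughly what such a length-based filter looks like. It is my own illustration, not OpenAI's pipeline, and the choice of the GPT-2 encoding via the tiktoken package is an assumption:

```python
import tiktoken

# Minimal sketch of the kind of token-length filter described above:
# keep only posts whose TL;DR summary falls in the 24-48 token window.
enc = tiktoken.get_encoding("gpt2")  # assumed tokenizer, for illustration

def keep(post):
    n_tokens = len(enc.encode(post["summary"]))
    return 24 <= n_tokens <= 48

# posts = [{"body": ..., "summary": ...}, ...]  # cleaned Reddit entries
# filtered = [p for p in posts if keep(p)]
```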
But the real innovation here does not stem from the above. Instead, it emerges from using human feedback to improve model learning and performance. To implement this new feature, external labelers were recruited to analyze existing data summaries and, in some cases, write their own. Reinforcement learning (RL) was used to handle the overall process: the labelers’ judgments trained a reward model, which in turn guided the RL policy steering the computational agent’s decisions, and the resulting summaries showed substantial improvement. Indeed, this is yet another example of humans in the loop of AI. The paper tells us this set of workers, recruited from reasonably well-known task websites, were paid hourly and had various nationalities, although the information shared is far from complete. At least we learned firsthand that ghost work was not part of the process. It also shows us the human side of infrastructure (usually considered a set of inanimate objects). After all, humans create infrastructure to serve and connect themselves in various ways.
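To make the mechanics a bit more concrete, here is a heavily simplified sketch of the comparison-based reward model at the heart of this process. It is my own toy illustration in PyTorch, not OpenAI's code; the tiny mean-pooling encoder is a stand-in for the pretrained transformer actually used:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a tokenized summary to a scalar score."""
    def __init__(self, vocab_size=50257, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # stand-in encoder
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)  # mean-pool tokens
        return self.head(pooled).squeeze(-1)        # one reward per summary

def comparison_loss(r_preferred, r_rejected):
    # Push the labeler-preferred summary's reward above the rejected one's.
    return -torch.log(torch.sigmoid(r_preferred - r_rejected)).mean()

# Usage: score both summaries in each human comparison, then update.
model = RewardModel()
preferred = torch.randint(0, 50257, (4, 48))  # batch of tokenized summaries
rejected = torch.randint(0, 50257, (4, 48))
loss = comparison_loss(model(preferred), model(rejected))
loss.backward()
```

The trained reward model then serves as the objective the RL policy optimizes against, which is where the labelers' preferences actually enter the loop.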
Regarding power usage, the fine-tuning of the 6.7-billion-parameter model using RL took 320 GPU-days, or about 7,680 GPU-hours. The actual number of calendar days depends on how many GPUs were used in the computation, a number not reported. The paper also indicates the data collection and management phase was “also expensive compared to prior work — the training set took thousands of labeler hours and required significant researcher time to ensure quality” (pg. 9). Do not ask how many hours, though. It is undoubtedly quite curious that engineers, who must be very precise in their work and computations, are so imprecise regarding transparency.
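A quick back-of-envelope sketch shows how much that unreported GPU count matters. The cluster size and per-GPU power draw below are my guesses, not figures from the paper:

```python
# Back-of-envelope sketch for the reported 320 GPU-days of RL fine-tuning.
# GPU count and power draw are NOT reported; the figures are illustrative.
GPU_DAYS = 320
ASSUMED_GPUS = 64             # hypothetical cluster size
ASSUMED_WATTS_PER_GPU = 300   # rough draw for a data-center GPU of that era

wall_clock_days = GPU_DAYS / ASSUMED_GPUS                   # 5.0 days
energy_kwh = GPU_DAYS * 24 * ASSUMED_WATTS_PER_GPU / 1000   # 2,304 kWh

print(f"{wall_clock_days:.1f} days on {ASSUMED_GPUS} GPUs, ~{energy_kwh:,.0f} kWh")
```

Swap in a different cluster size and the calendar time changes accordingly, while the energy bill stays pinned to the GPU-days. That is precisely why the missing number matters.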
Note that this paper, while putting the technique to work, did not yet brand it as RL from Human Feedback (RLHF). That label came to prominence in a subsequent paper that, following the leads of previous research, presented InstructGPT. ChatGPT is thus the result of combining these approaches, RLHF being a critical differentiator from the previous GPT-3 models pre-trained and fine-tuned by OpenAI. This reminds me of the old days when knowledge management was all the rage and codifying tacit knowledge was the clarion call. I am not sure that really worked. Nowadays, we are trying to codify common knowledge so an allegedly intelligent computational agent can respond accordingly.
I decided to double-check the above summary with ChatGPT. The results were baffling. It was confused about InstructGPT, did not understand RLHF, and said the latter was not used for its training. It also did not disclose information readily available in the papers discussed above. I mean, is this guy drunk or what? It does not even know how it was produced? A transcript (copy and paste, I should say) of our little chat is here, typos included.
One strategy to gauge ChatGPT on an individual basis entails three components. First, always start by asking a question you already know the answer to. Pick one of your fields of expertise, knowledge, or hobby and do a drill-down. Second, always ask for the sources of the text being spat out by the computational agent and assess those sources. You will probably know a few of them, but others might seem randomly selected. Of course, linking the references to the text produced requires more analysis. And third, be aware that the agent is constantly being updated. Save some of your questions, repeat them when a new version comes out, and compare the results, as in the sketch below. You will then be able to assess whether it is improving, regardless of what the company and its wealthy sponsors are saying and the ongoing hype that might soon get much bigger.
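For step three, a harness as simple as the following would do. It assumes the openai Python package with an API key configured; the question list and the model name are mine, the latter standing in for whichever version you happen to be gauging:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Questions worth saving and re-asking whenever a new version ships.
saved_questions = [
    "Which paper introduced RLHF, and how was it used to build ChatGPT?",
]

for q in saved_questions:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # swap in the version under test
        messages=[{"role": "user", "content": q}],
    )
    print(q, "->", resp["choices"][0]["message"]["content"])
```

Keep the outputs, diff them across versions, and draw your own conclusions instead of taking the release notes at face value.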
At a more macro level, the release of ChatGPT, “responsible” or not, has unleashed a new and intense competitive process in which Big Tech members will attempt not only to come up with their own, improved versions but also to look more deeply at other areas where AI looks more like AGI. That, in turn, could have a critical impact on the current infrastructure supporting such efforts, especially data centers, where we should see an increase in the number of hyperscale implementations to cater to the growing demand. Of course, AI is already being used in such deployments. However, that is not the same as repurposing data centers to cater to new AI production needs. As a result, we should expect more emissions globally and a larger ecological footprint locally.
Cheers, Raúl