Garbage in, garbage out (GIGO) is one of the oldest commandments of computer programming. The term was coined in the 1950s, the same decade that gave us AI, suggesting the connection between the two goes back to their birth. GIGO is particularly relevant to programs that take data – text and graphics included – as primary input, run it through one or more algorithms, and generate expected (and often unexpected) outputs. An example will help elucidate the process.
Sorting is one of the fundamental algorithms, usually among the first taught in computer programming classes. The idea is simple. Suppose I have a list of ten thousand names and need to sort it by last name and then by first name. Piece of cake. I can choose one of the various sorting algorithms to get the desired output. Now suppose 10 percent of those names are misspelled. I am very familiar with that, as it has happened to me a zillion times. The most typical mistake is to change my first name to Paul. Even when people read my name from a printed document or a computer screen, they still see Paul. Go figure. I usually have to politely correct them. Never mind my last name. A while back, I received a snail mail letter addressed to a Mr. Tul Zuntano – wrong name, correct address. At least they got the Z right.
More elaborate and complex full names are out there; that much I know. Mine seems trivial, sort of. So a 10 percent error rate sounds reasonable. However, I have no simple way of identifying incorrect names in my dataset. Some might be the result of typos, but others could be actual names, like the infamous Mr. Zuntano. The sorting algorithm will still organize the data as expected, but 10 percent of it will remain incorrect. Thanks to my dear friend Tul, someone using the data to search for my name will be surprised to find that I am nowhere to be found. I have vanished into thin air!
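The scenario can be sketched in a few lines of Python. All names below are made up for illustration; the point is that the sort succeeds perfectly while the search fails.

```python
# A toy version of the sorting scenario above; all names are illustrative.

# The "correct" entry that should be in the list...
correct = ("Maria", "Alvarez")

# ...but a data-entry error put a misspelled version in instead.
roster = [
    ("Mara", "Alvares"),   # garbage in
    ("Ada", "Lovelace"),
    ("Alan", "Turing"),
]

# Sort by last name, then first name. The algorithm works flawlessly.
roster.sort(key=lambda name: (name[1], name[0]))
print(roster)

# Yet a search for the correct name finds nothing: garbage out.
print(correct in roster)  # False
```

The sort delivers with 100 percent accuracy; the garbage was already in the data before any algorithm touched it.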
Now, is there an “ethical” issue here? It depends. One thing is sure: the algorithms are not the culprit, nor are they acting in bad faith. On the contrary, they are delivering with 100 percent accuracy. I got my list of names, sans mine, in alphabetical order. Do not put the blame on them. How about the programmer(s)? Two options pop up here. First, the programmer is unaware of the 10 percent error, and error-checking is not part of the job. He could then share general albeit inaccurate conclusions about the output, totally unaware of the data errors. Here, ignorance is bliss. Second, the programmer is aware of the error rate. Two alternatives emerge. In one case, she tries to find obvious errors, corrects and documents them, and then discloses the process to the public when sharing results. Alternatively, she either deliberately ignores the errors or quietly rectifies the data, but does not publicly disclose her actions while reaching conclusions. The latter immediately raises eyebrows for “ethical” reasons.
A couple of decades ago, a programmer working at a large commercial bank started automatically shaving a penny or two off every bank account via a simple program he devised. His assumption, a correct one, was that clients did not check the digits after the decimal point when reading their monthly bank statements. The bank had over 10 million accounts, so he was probably pocketing at least $100,000 per month. He only got caught because he started showing off his new income via luxury expenditures. He got too greedy and cocky, I suppose. But he was also banking on the fact that his supervisors did not know such a thing was possible. His “unethical” behavior landed him in prison. The little program he wrote was unanimously declared innocent.
Enter modern AI, and the GIGO picture described above still holds. While not all AI systems demand massive amounts of data, those that do face a much more significant data error challenge, as the number of data points can run into the billions. In the case of supervised learning, labeling the data can help flag erroneous entries. But labeling is still done by humans, so mistakes might go unnoticed. Indeed, ghost work, available globally in seemingly unlimited supply, can be a countervailing force here, helping to purge data errors. Still, 100 percent accuracy can only be approached asymptotically, if ever.
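One common way crowdsourced labeling tries to purge errors is redundancy: have several annotators label each item and keep the majority vote. A minimal sketch, with hypothetical items and labels:

```python
from collections import Counter

# Hypothetical labels from three annotators per item.
annotations = {
    "img_1": ["cat", "cat", "dog"],  # one annotator slipped
    "img_2": ["dog", "dog", "dog"],
    "img_3": ["cat", "cat", "cat"],
    "img_4": ["dog", "cat", "dog"],  # another isolated slip
}

# Majority vote catches isolated mistakes...
labels = {item: Counter(votes).most_common(1)[0][0]
          for item, votes in annotations.items()}
print(labels)

# ...but if two of three annotators make the same mistake,
# the wrong label wins. Errors can still slip through.
```

Redundancy shrinks the error rate; it does not eliminate it, which is why the asymptote above never quite reaches 100 percent.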
Nevertheless, a new type of GIGO emerges once we have big data in our hands. I will call it qualitative GIGO, as it is not about data errors but about data representation. Statistically speaking, the question boils down to whether my data represents the entire population I am studying – humans, cats and many others included. Using my fictional dataset of ten thousand names to draw conclusions about the overall population would undoubtedly be a mistake – an “unethical” one if I do it purposely.
In the old days of tiny data, capturing and managing data was costly and complex. Obtaining a good random sample of the population, within some margin of error, was the best we could do. The core idea was to capture the most representative subset of a given total population (humans, cats, etc.). That enabled researchers to draw broadly relevant conclusions “ethically.” The emphasis was thus placed on developing adequate sampling methods, from simple random and stratified sampling to Monte Carlo techniques. Sample size also played a role, so balancing size with randomness was vital. Of course, statisticians and econometricians know all this by heart, as it is part of their core business.
Big data has removed many of the barriers to massive data collection. So now one can end up with samples containing millions of data points. Fair enough. But that does not solve the question of data representation, as I can increase sample size without necessarily increasing its randomness. I have already addressed these issues in previous posts (here and here, for example), so I will not repeat myself.
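A quick numerical sketch of that point, with an invented population: a small random sample tracks the population well, while a much larger but non-random sample stays badly off the mark.

```python
import random

random.seed(42)

# An invented population of 100,000 individuals with one numeric trait.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# A simple random sample of 1,000 lands close to the population mean...
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

# ...while a "convenience" sample of the 10,000 largest values is ten
# times bigger yet wildly unrepresentative: size does not buy randomness.
biased = sorted(population)[-10_000:]
biased_mean = sum(biased) / len(biased)

print(round(true_mean, 1), round(sample_mean, 1), round(biased_mean, 1))
```

The biased sample overshoots the true mean by a wide margin despite being ten times larger than the random one – exactly the size-versus-randomness trade-off the statisticians warned us about.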
Feeding such large but unrepresentative datasets to AI and machine learning algorithms will generate bias by default, as portions of the population under study are missing in action. Any learning or predictions they yield should therefore be taken with a grain or two of salt. Sweeping generalizations should be avoided. Instead, researchers should study and acknowledge the limitations of the data fed to state-of-the-art algorithms. Why is this not happening more frequently?
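To see “bias by default” concretely, here is a deliberately tiny sketch, all numbers invented: a cutoff “learned” only from one group of the population gives the same answer for every member of a group it never saw.

```python
# Training data covers only group A, whose trait clusters around 10.
group_a = [9.5, 10.1, 10.4, 9.8, 10.2]

# The "model" is just a learned cutoff: the mean of what it has seen.
threshold = sum(group_a) / len(group_a)

# Group B, clustering around 20, was never in the training data...
group_b = [19.7, 20.3, 20.1, 19.9]

# ...so the model flags every single group-B member the same way.
predictions = [x > threshold for x in group_b]
print(predictions)  # [True, True, True, True]
```

No malice, no bad faith – the arithmetic is flawless. The missing-in-action group simply never had a say in what the model learned.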
In any event, AI, as a set of technologies, has no ethics. There is no ethical AI. But there is the ethical use of AI by researchers and big corporations that have seemingly and conveniently forgotten, intentionally or not, what they learned in Statistics 101. Ethics is as human as free time. The only way one could talk about ethical AI is by assuming AI has some sort of humanity – just as our ancestors thought the sun, the moon and many other objects were somehow human-like, capable of overseeing human life. Yet this fetishistic line of thought does exist out there as we start the third decade of the new millennium.
Startling, to say the least.