Garbage in, garbage out (GIGO) is one of the oldest commandments of computer programming. It was coined in the 1950s, the same decade as AI, which suggests the connection between the two goes back to their birth. GIGO is particularly relevant to programs that take data – text and graphics included – as their main input, run it through one or more algorithms, and generate the expected (and often unexpected) outputs. An example will help elucidate the process.
Sorting is one of the most basic algorithms, usually taught first in computer programming classes. The idea is simple. Suppose I have a list of 10 thousand names and need to sort it by last name and then by first name. Piece of cake. I can choose one of the various sorting algorithms to get the desired output. Now suppose that 10 percent of those names are misspelled. Indeed, I am very familiar with that, as it has happened to me a zillion times. The most typical mistake is to change my first name to Paul. Even when people read my name from a printed document or a computer screen, they still see Paul. Go figure. I usually have to politely correct them. Never mind my last name. A while back, I received a snail-mail letter addressed to a Mr. Tul Zuntano: wrong name, right address. At least they got the Z right.
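The two-key sort is a one-liner in most languages. Here is a minimal Python sketch using a handful of hypothetical names (plus the text's Mr. Zuntano); note that the sort runs flawlessly whether or not the names themselves are correct:

```python
# A tiny, hypothetical name list -- the sort cannot tell a typo from a real name.
names = [
    ("Jane", "Smith"),
    ("Tul", "Zuntano"),   # wrong name, right address
    ("John", "Smith"),
    ("Ada", "Lovelace"),
]

# Sort by last name, then by first name, via a composite key.
sorted_names = sorted(names, key=lambda n: (n[1], n[0]))

for first, last in sorted_names:
    print(f"{last}, {first}")
```

The algorithm delivers with 100 percent accuracy; any garbage in the input comes out in exactly the same alphabetical order.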
More elaborate and complex full names are out there, that much I know. Mine seems trivial, sort of. So a 10 percent data error sounds reasonable. However, I have no simple way of identifying incorrect names in my dataset. Some might be the result of typos, but others could be real names, like the infamous Mr. Zuntano. The sorting algorithm will nevertheless organize the data as expected, but 10 percent of it will still be incorrect. Someone using the data to search for my name will be surprised to find that I am nowhere to be found, thanks to my dear friend Tul. I have vanished into thin air!
Now, is there an “ethical” issue here? That depends. One thing is certain: the algorithms are not the culprit, nor are they acting in bad faith. On the contrary, they are delivering with 100 percent accuracy. I got my list of names, sans mine, in alphabetical order. Do not put the blame on them. How about the programmer(s)? Two options pop up here. First, the programmer is not aware of the 10 percent error, nor is error-checking part of the job. He could then share general albeit inaccurate conclusions about the output, totally unaware of the data errors. Here, ignorance is bliss. Second, the programmer is aware of the error rate. Two alternatives then emerge. In one case, she tries to find obvious errors, corrects and documents them, and then discloses the process when sharing results publicly. Alternatively, the programmer deliberately chooses either to ignore or to correct the data but does not publicly disclose his actions when reaching conclusions. The latter immediately raises eyebrows for “ethical” reasons.
A couple of decades ago, a programmer working at a large commercial bank devised a simple program that automatically shaved one or two pennies off every bank account. His assumption, a correct one, was that clients did not really check the digits after the decimal point when reading their monthly statements. The bank had over 10 million accounts, so he was probably pocketing at least 100 thousand dollars per month. He got caught only because he started to show off his new income via luxury expenditures. He got too greedy and cocky, I suppose. But he was also banking on the fact that his supervisors did not know such a thing was possible. His “unethical” behavior landed him in prison. The little program he wrote was declared innocent, unanimously.
Enter modern AI, and the GIGO picture described above still holds. While not all AI areas demand massive amounts of data, those that do face a much larger data-error challenge, as the number of data points can run into the billions. In the case of supervised learning, labeling the data can help find faulty entries. But labeling is still done by humans, so mistakes might creep in unnoticed. Certainly, ghost work, available globally in seemingly unlimited supply, can be a countervailing force here, helping to purge data errors. One hundred percent accuracy, however, can only be reached asymptotically, if ever.
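Why asymptotically and never exactly? A back-of-the-envelope sketch makes the point. Assume, hypothetically, that each independent labeler mislabels a data point with probability p and that we take a majority vote over three of them; the vote fails only when at least two err:

```python
# Sketch: error rate of a 3-labeler majority vote, assuming each labeler
# errs independently with probability p (a simplifying assumption).
def majority_error(p: float) -> float:
    # P(exactly 2 of 3 wrong) + P(all 3 wrong)
    return 3 * p**2 * (1 - p) + p**3

print(majority_error(0.10))  # a 10% per-labeler error rate shrinks to 2.8%
print(majority_error(0.01))  # smaller, but still strictly positive
```

Redundant labeling shrinks the error rate quickly, but for any p above zero the result never reaches zero – the asymptote in action.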
Nevertheless, once we have Bigdata in our hands, a new type of GIGO emerges. I will call this one qualitative GIGO, as it is not about data errors but rather about data representation. Statistically speaking, that boils down to whether my data represents the real population I am studying – humans, cats and many others included. Using my fictional dataset of 10 thousand names to draw conclusions about the overall population would certainly be a mistake – an “unethical” one if I do it purposely.
In the old days of Tinydata, capturing and managing data was not only costly but also complex. Obtaining a good random sample of the population with some margin of error was the best we could do. Capturing the widest diversity possible within the real population (humans, cats, etc.) was the core idea. That enabled researchers to draw overall relevant conclusions “ethically.” The emphasis was thus placed on developing adequate sampling methods, Monte Carlo sampling among them. Sample size also played a role, so balancing the latter with sample randomness was key. Statisticians and econometricians know all this by heart, as it is part of their core business.
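The size-versus-randomness trade-off can be made concrete with the textbook margin-of-error formula for a simple random sample, MOE = z · sqrt(p(1 − p)/n). A minimal sketch, using my fictional 10-thousand-point sample and the usual 95 percent confidence level:

```python
import math

# Textbook margin of error for a *simple random* sample of size n,
# at 95% confidence (z ~ 1.96), worst case p = 0.5.
def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

# A good random sample of 10 thousand already gets within about +/-1%.
print(round(margin_of_error(10_000), 4))
```

Note what the formula does not contain: any guarantee of randomness. The precision it promises holds only if the sample was drawn properly in the first place.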
Bigdata has removed many of the barriers to massive data collection, so one can now end up with samples containing millions of data points. Fair enough. But that does not solve the question of data representation, as I can increase data size without necessarily increasing its randomness. I have already addressed these issues in previous posts (here and here, for example), so I will not repeat myself.
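The point can be made with a toy simulation. Suppose, hypothetically, that half of the real population has some trait, but our collection pipeline over-samples the trait-holders. Growing n shrinks the noise, not the bias – the estimate converges ever more confidently to the wrong number:

```python
import random

# Toy sketch: true population mean is 0.5, but a hypothetical biased
# pipeline sees trait-holders 80% of the time.
random.seed(0)

TRUE_MEAN = 0.5   # half the real population has the trait
SKEW = 0.8        # share of trait-holders the pipeline actually sees

def biased_estimate(n: int) -> float:
    sample = [1 if random.random() < SKEW else 0 for _ in range(n)]
    return sum(sample) / n

for n in (1_000, 100_000):
    # Both estimates hover near 0.8, not 0.5, no matter how large n gets.
    print(n, biased_estimate(n))
```

A hundred times more data buys a tighter estimate of the wrong quantity; only fixing the sampling mechanism fixes the bias.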
Feeding such large unrepresentative datasets to AI and Machine Learning algorithms will generate bias by default, as portions of the population under study are missing in action. Obviously, any learning or predictions they yield should be taken with a grain or two of salt, at least. Sweeping generalizations should thus be avoided. Instead, researchers should study and acknowledge the limitations of the data being fed to state-of-the-art algorithms. Why is this not happening more frequently?
In any event, AI, as a set of technologies, has no ethics. There is no ethical AI. But there is the ethical use of AI by researchers and big corporations who have seemingly and conveniently forgotten, intentionally or not, what they learned in Statistics 101. Ethics is as human as Free Time. The only way one could talk about ethical AI is by assuming AI has some sort of humanity – just as our ancestors thought the sun, the moon and many other objects were somehow more than human, capable of overseeing human life. This fetishistic line of thought does exist out there, however, as we start the third decade of the new Millennium.
Startling, to say the least.