A recent paper published under the auspices of Google Health makes a case for using deep learning algorithms to improve breast cancer detection. The research has been positively received by most and widely publicized as yet another victory of smart machines over weak, dumber humans. Only a few have been critical, and for good reasons. In this post, I will explore the methodology used in the research to highlight other critical issues. But before I take the dive, let me first set the scene.
Like education or justice, health is an information-rich sector, thus prone to rapid (not just digital) technology innovation and overall digitization. Unlike its peers, most if not all health-related services use a gamut of technologies, from simple thermometers and stethoscopes to noisy, giant MRI machines and jaw-dropping robotic surgeries. Recently, and thanks in part to mobile technologies, consumer-centered health technologies have popped up, allowing us to monitor our health, from heart rates and blood pressure to mobile electrocardiograms and in situ diabetes testing. In fact, health technology has carved its own territory and provides fertile ground for deploying new technologies.
Diagnosis and treatment are the two overall areas that shape health services. The former is closely associated with prevention and early detection. In this light, annual check-ups are a best practice, accompanied by a battery of tests that might end up including MRIs, computerized tomography scans (a.k.a. CAT scans) and mammograms. These and other similar technologies have been around since the 1970s and thus predate the digital era.
At that time, Computer-Aided Diagnosis (CAD), born in the 1950s, was transitioning to its second phase by using expert systems, an early form of AI, to enhance diagnostic processes. By the 1990s, and given the limitations of expert systems and their eventual failure, CAD switched to data mining techniques supported by AI algorithms. In any case, recent research shows that CAD has not helped improve the diagnostic process in general, nor breast cancer detection in particular.
Can Deep Learning (DL) make a difference? That is what the paper explores. Note that this is the third time AI is enlisted to enhance health diagnostic outcomes.
I have read plenty of news media coverage of the paper’s core findings. Maybe these reports are combining the results from other projects as I found discrepancies in terms of the population covered by the study in both the UK and the US, among others. An odd occurrence indeed as research data tables are available on the web – the paper is behind a paywall but can be read online as a PDF.
Table 1 summarizes the key findings. I have added the sample sizes for each of the results (n) and the size of the total population studied in each country (N). For the record, the UK’s overall sample size is almost nine times larger than that for the US.
False positives refer to women wrongly diagnosed with breast cancer. False negatives include women with breast cancer who were given a clean bill of health by mistake. The former causes tremendous stress and anxiety, while the latter could be lethal. Neither is really a good outcome. I am familiar with this as a few years ago, my wife was diagnosed with the disease. Fortunately, it was detected in its very early stages with no recurrence since. Nevertheless, she is always under close scrutiny. During her last checkup late last year, she was asked to return for a second test just to make sure everything was in order. It was. Can AI/DL help my wife and thus make a real difference here?
Going back to the table, each percentage represents the difference in performance between doctors and the deep learning algorithms. A positive number registers the rate of improvement made by DL. Cells in bold highlight findings that were statistically significant at the 95% level – a fact most media coverage seems to have missed or simply ignored. For example, the UK false negatives improvement of 2.70% is not statistically significant and should not be used to draw any conclusions about DL superiority. The paper suggests DL is not inferior but never says DL is beating humans in this rubric.
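To see why a positive difference can still fail the significance test, here is a minimal sketch in Python. It is not the paper's procedure: it uses a simple Wald interval and treats human and algorithm readings as independent groups, which is a simplification since in the study both read the same cases. All counts below are hypothetical and chosen only to illustrate the point.

```python
import math

def diff_ci(errors_a, n_a, errors_b, n_b, z=1.96):
    """Wald 95% interval for the difference in error rates, p_a - p_b.

    Assumes two independent groups (a simplification, see the text above).
    """
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se

def is_significant(ci_low, ci_high):
    """A difference is significant at this level only if the interval excludes zero."""
    return ci_low > 0 or ci_high < 0

# Hypothetical counts: readers miss 90 of 400 cancers, the algorithm misses 79 of 400.
small = diff_ci(errors_a=90, n_a=400, errors_b=79, n_b=400)

# Same error rates, ten times as many cases.
large = diff_ci(errors_a=900, n_a=4000, errors_b=790, n_b=4000)

for label, (diff, low, high) in (("small", small), ("large", large)):
    print(f"{label}: diff={diff:.4f}, CI=({low:.4f}, {high:.4f}), "
          f"significant={is_significant(low, high)}")
```

With these made-up numbers, the same 2.75-point gap is not significant in the small sample but is in the large one, which is why the sample sizes (n) behind each cell of the table matter as much as the percentages themselves.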
The difference in performance between UK and US outputs is striking. US DL diagnostic improvements are four times larger than those in the UK in the case of false positives – and slightly lower when applying a UK-trained model to the US data sample. So what is going on in the US then? Are US mammography readers less qualified than those in the UK?
I was not aware that in the UK, as in most of Europe, two independent readers are required to get the final results of any mammography. Moreover, if the two readers disagree, a consensus must be reached, sometimes involving a third opinion. That is certainly not the case in the US, where only one reader is required. Interestingly, the DL platform does not outperform the second reader or the consensus diagnosis in the UK. DL algorithms are actually underperforming the consensus reading by 3.35% in terms of false negatives. But this difference is not statistically significant and DL is thus not inferior, the paper tells us.
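The distinction between "better", "not inferior" and "worse" can be made concrete with a small sketch. The function below is my own illustration, not code from the paper: it takes the confidence interval of a difference (algorithm minus human, so a positive value means the algorithm did better, as in the table) and a non-inferiority margin. The 5-point margin and the example intervals are assumptions picked for illustration.

```python
def verdict(ci_low, ci_high, margin=0.05):
    """Classify a difference given its confidence interval.

    Convention: difference = algorithm minus human, positive = algorithm better.
    `margin` is the largest shortfall still considered acceptable (assumed value).
    """
    if ci_low > 0:
        return "superior"        # whole interval above zero
    if ci_low > -margin:
        return "non-inferior"    # cannot be worse than the margin allows
    if ci_high < -margin:
        return "inferior"        # whole interval below the margin
    return "inconclusive"        # interval straddles the margin

# Hypothetical intervals, for illustration only.
print(verdict(0.010, 0.040))    # superior
print(verdict(-0.045, 0.012))   # non-inferior
print(verdict(-0.080, 0.010))   # inconclusive
print(verdict(-0.090, -0.060))  # inferior
```

The second case is the one that matters here: a negative point estimate whose interval stays inside the margin is reported as "non-inferior", which is a much weaker claim than "better than humans".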
In any event, we can safely conclude that in the UK, DL can be most useful as a labor-saving technology, just like many older technologies, starting with the steam engine. Apparently, the UK is facing a shortage of qualified readers. DL could undoubtedly help fill this gap. But in other countries where this is not the case, DL could end up displacing readers if embraced uncritically.
The evidence also suggests DL could indeed be an excellent complementary tool for UK readers and thus beneficial to expedite consensus and hopefully improve overall diagnostics. Note, however, that UK readers in the sample studied have a 98.5% false negatives accuracy rate, as reported by the paper. Maybe this is why the DL platform could not generate any significant improvements here. The question, then, is how we get to 99% or higher.
Several media outlets and pundits have already flagged a few issues with this research. Not surprisingly, the code used cannot be shared as it is proprietary. Data sets for the US are not available either, while UK data is apparently available upon request. The paper briefly mentions ethical issues related to the use of personal data and claims all is cool. No details are provided, however. I am sure my wife would love to know if her data was going to be used in some random AI study. Most probably, she would agree, but neither I nor anyone else should decide for her. At any rate, all of the above criticisms are right on the money.
Nevertheless, I want to raise a few issues regarding the statistical procedures and significance of the paper findings. As mentioned above, the difference in DL’s performance across the two countries is massive, showing a prediction improvement of over 300% for both sensitivity and specificity in the US. The implication here is that UK first readers are beating their US counterparts by a relatively large margin. That does not sound quite right to me. The US health care system is indeed expensive but is also of high quality. We need to poke a bit more deeply at the way the data samples were designed and selected to make sure we are on the right track. Table 2 below presents a summary, compiled from one of the tables in the paper, of the sample design.
The reference column shows the data from the baseline populations selected by the study. Statistically, we could say these represent the real population. That includes all sites in the UK, whereas in the US, it is much smaller and limited to six centers. We can thus assume that the UK reference population is very close to the real population, as the slight differences in mammogram diagnostics in Table 2 indicate. This is not the case in the US. Here, the differences between sample and reference populations are enormous. We end up with two critical issues in the US. First, the sample is not representative of the reference population. And second, the reference population might not resemble the real population.
The second issue is the period covered by the overall data. Once again, the UK data is more accurate as both sample and reference cover almost the same period. In the US, the sample covers 17 years and is thus much broader than the reference population, which only includes seven. This adds more bias to the US sample. In principle, the sample should cover the same period as the reference population to allow for appropriate statistical inference.
Third, the US sample comes from only one site, which puts into question the randomness of the selection. The paper suggests that data availability and convenience were important drivers for selection. That is certainly understandable, but the researchers should also have flagged the potential for bias and the lack of randomness in the sample to avoid any sweeping generalizations – which is what most media ended up doing.
Lastly, the differences between the sample and reference populations for cancer diagnostics are extremely large in the US case. In the sample, almost 12 percent of women have cancer, which is nearly 20 times the share with the same diagnosis in the reference data. The same goes for normal diagnostic results – albeit the difference is not as vast. The paper estimates 95% confidence intervals for these numbers and, in all cases, the range of the mean of the sample set does not include the actual mean of the reference population. The data by age groups show a similar statistical bias. It thus seems we might have the wrong sample here.
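The check described above can be sketched in a few lines of Python. This is a rough illustration, not a reproduction of the paper's calculation: I use a Wilson interval, a hypothetical sample of 3,000 women, and a reference rate of 0.6% obtained by simply dividing the 12 percent figure by 20.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score 95% interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical sample size; 12% of it diagnosed with cancer.
n_sample = 3000
cases = 360

# Back-of-envelope reference rate: roughly 12% divided by 20 (an assumption).
reference_rate = 0.006

low, high = wilson_interval(cases, n_sample)
covers = low <= reference_rate <= high

print(f"sample interval: ({low:.4f}, {high:.4f})")
print(f"reference rate {reference_rate:.4f} inside interval: {covers}")
```

Under these assumptions, the interval sits around 11 to 13 percent, nowhere near the reference rate. A sample that was representative of the reference population would be expected to produce an interval that covers it most of the time.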
All of the above strongly suggests the US sample is neither random nor representative of the overall population of women facing uncomfortable mammograms, analog and digital, every year. Moreover, comparing the results of the US sample with those of the UK also introduces bias as the data covers different periods. In the end, we can conclude that while the UK results are statistically significant, the same cannot be said about the US. In principle, the researchers should have tried to obtain a random sample in the US in the same fashion as they did in the UK.
The UK results, however, are not that impressive as DL was unable to outperform the system of two or more readers prevalent in the UK and Europe. Maybe the US should consider having more than one reader for complicated mammogram results and not put all its eggs into the DL basket.
I am sure my wife would love that, as it could spare her from being recalled for a second shot every time she gets her annual check. However, she is not betting on DL algorithms to change her predicament – at least not for the time being.
* Updated on 02/03/2020