What makes good healthcare AI?

Lessons from diabetic retinopathy screening...


By 2045, the number of adults with diabetes worldwide is projected to reach 629 million, with one third expected to have diabetic retinopathy (DR). DR remains the leading cause of vision loss in the working-age population worldwide, with the exception of the UK, which has a long-established DR screening programme. It is now widely accepted that screening for sight-threatening DR is effective and, whilst DR classification systems and referral pathways may differ, at its core a successful DR screening programme must be accessible, cost-effective and accurate, and must provide pathways for referral and treatment.


One of the challenges faced by screening programmes is providing accurate, consistent grading of retinal photographs, and artificial intelligence (AI), specifically deep learning, has been touted as a way of automating the process of DR screening. The process of developing a deep learning algorithm (DLA) to perform DR grading has been described elsewhere but, in brief, it involves training a convolutional neural network on a large dataset of images labelled with the correct DR grade: the ‘ground truth’. The DLA assigns a DR grade to each image and the result is compared with this ground truth. After every comparison, the DLA modifies the neural network’s parameters in an attempt to improve its accuracy. This process is repeated until the DLA has learnt to assign the correct DR grade to the images in the training dataset. Once training is complete, the DLA’s performance is tested and validated against unseen fundus images.
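The training cycle described above can be illustrated with a deliberately tiny sketch. It is not a convolutional neural network and not any particular developer's method: a single-feature logistic model stands in for the DLA, and the 'images' are synthetic numbers invented for the example (all names, such as make_dataset and train, are hypothetical). What it shares with the real process is the loop: predict, compare with the ground truth, adjust the parameters, repeat, then test on unseen data.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp can never overflow.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def make_dataset(n, rng):
    """Synthetic stand-in for labelled images: one feature per 'image'
    (imagine a lesion score) plus a ground-truth label (1 = referable DR)."""
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        feature = rng.gauss(2.0 if label else -2.0, 1.0)
        data.append((feature, label))
    return data


def train(train_set, epochs=50, lr=0.1):
    w, b = 0.0, 0.0                      # the model's adjustable parameters
    for _ in range(epochs):
        for x, y in train_set:
            p = sigmoid(w * x + b)       # the model's current prediction
            error = p - y                # comparison with the ground truth
            w -= lr * error * x          # nudge parameters to reduce the error
            b -= lr * error
    return w, b


def accuracy(params, dataset):
    w, b = params
    correct = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in dataset)
    return correct / len(dataset)


rng = random.Random(0)
train_set = make_dataset(400, rng)
test_set = make_dataset(200, rng)        # never seen during training
params = train(train_set)
print(f"training accuracy: {accuracy(params, train_set):.3f}")
print(f"held-out accuracy: {accuracy(params, test_set):.3f}")
```

The figure that matters is the held-out accuracy: performance on the training set only shows that the model has fitted the examples it was given.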


DR screening, with its rich repository of labelled images, is at the forefront of clinical AI development. As challenging as it is to train an AI, arguably the critical step is its translation into clinical practice. To date, very few DLAs have successfully navigated this final, critical hurdle. However, there are a number of products emerging and, as clinicians may soon be required to review the utility of a DLA system, it is timely to ask the question, ‘what makes a good AI?’. Or, perhaps more importantly, how do we spot a bad one?


One of the key questions to ask when reviewing the utility of any AI is how generalisable is it? And, is it suited to the patients that I intend to use it on? In this respect, the way one evaluates an AI is no different to the way one should critically appraise any clinical trial. As such, the first consideration is to review who the DLA was trained on so you can understand the biases inherent in it. A good AI should therefore have a large dataset of relevant images, one which includes enough examples of each class (diseased, non-diseased, etc.). This can be challenging to achieve in medicine, where cases of rare diseases or outcomes are, by definition, rare. Whilst some biases may be obvious, others are more subtle, and human bias may therefore be inadvertently built into a DLA’s decision making. For example, the majority of studies using AI thus far have relied on private datasets and/or used datasets that are dominated by a single ethnicity for the training and validation of the DLA. The AI thus derived may deliver excellent health outcomes for those in the socioeconomic class or ethnic group that the AI was trained on, but perform less well on others. Adopting the wrong AI may therefore worsen, not improve, existing health inequalities. Uncovering areas of bias therefore requires the developers to fully disclose the demographics of those it is trained and validated on. To date, very few groups have published this information and this, in turn, makes it difficult for clinicians and patients alike to know whether the AI will work for them.
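To make the point about disclosure concrete, the sketch below shows the kind of subgroup audit a reviewer might ask for. It is illustrative only: the records are invented, the subgroup labels (group_a, group_b) are placeholders for whatever demographic categories are relevant locally, and the audit function is a hypothetical helper rather than part of any published tool.

```python
from collections import Counter, defaultdict

# Invented validation records: (subgroup, ground truth, DLA output),
# where 1 = referable DR and 0 = no referable DR.
records = (
    [("group_a", 1, 1)] * 90 + [("group_a", 1, 0)] * 10
    + [("group_a", 0, 0)] * 380 + [("group_a", 0, 1)] * 20
    + [("group_b", 1, 1)] * 14 + [("group_b", 1, 0)] * 6
    + [("group_b", 0, 0)] * 70 + [("group_b", 0, 1)] * 10
)


def audit(records):
    """Count cases and compute sensitivity and specificity per subgroup."""
    counts = defaultdict(Counter)
    for group, truth, predicted in records:
        counts[group][(truth, predicted)] += 1

    report = {}
    for group, c in counts.items():
        tp, fn = c[(1, 1)], c[(1, 0)]
        tn, fp = c[(0, 0)], c[(0, 1)]
        report[group] = {
            "n": tp + fn + tn + fp,
            "diseased": tp + fn,
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "specificity": tn / (tn + fp) if tn + fp else None,
        }
    return report


report = audit(records)
for group, stats in report.items():
    print(group, stats)
```

In this invented example the smaller subgroup contributes only 20 diseased cases, and sensitivity there is 70% against 90% in the larger subgroup. A single pooled figure would hide that gap, which is why the breakdown needs to be published.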


Until recently, it was considered sufficient to simply publish the results of your AI by way of a receiver operating characteristic (ROC) curve, with no explanation as to how the DLA derived this result. This is a critically important issue because all the AI is doing during training is making associations. It is therefore important to be able to assess whether the associations it is making are correct, or even relevant.
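It may help to see how little goes into an ROC curve. The sketch below builds one from nothing more than ground-truth labels and the scores a model produced; the ten labels and scores are invented, and the function names are hypothetical. Nothing about which image features drove those scores enters the calculation, which is why a good curve on its own says nothing about whether the underlying associations are sensible.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs,
    one for each distinct score used as a decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented example: 1 = disease present, scores are model outputs.
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.40, 0.30, 0.20, 0.10]

points = roc_points(labels, scores)
print(points)
print(f"AUC = {auc(points):.2f}")
```

For these invented numbers the area under the curve is 0.96. The same value would be obtained whether the scores came from detecting genuine pathology or from some irrelevant property of the images.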


For obvious reasons, the American military were early funders of AI research. One of the first experiments conducted was to train an AI to spot tanks in photographs. The results at first looked impressive, but the AI failed in field testing. Later, it was discovered that the AI had not learnt to look for tanks at all; all the images containing tanks were darker, and the AI was simply detecting the dark images.


The lack of transparency as to how an AI comes to its decisions is called the black box phenomenon. And, if a DLA cannot be understood, how can we assess its reliability and justify its results to patients? This issue is now being addressed by DLA developers who will often publish attention maps, which highlight the areas within an image the DLA is focusing on when making its decision (Fig 1).   






Figure 1. The process by which a DLA makes its decisions can be evaluated by generating heat maps (attention maps) of what the AI is looking at within the image when making its decision. In this case, an example of proliferative diabetic retinopathy, the original image (a) is rendered into a probability map (b), which can then be overlaid on top of the original image (c). As can be seen, the AI is concentrating on those areas of the image that one would expect it to. (Illustration taken from authors’ own work.)  
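One simple, model-agnostic way to produce a map of this kind is occlusion sensitivity: blank out one patch of the image at a time and record how much the model's output falls. The sketch below illustrates the idea only and is not necessarily how the map in Figure 1 was generated. The 'model' is a hypothetical stand-in that scores the brightness of one fixed region, and the image is a small invented grid rather than a fundus photograph.

```python
def toy_model(image):
    """Hypothetical stand-in for a trained DLA: the 'disease score' is the
    mean brightness of a fixed region (rows 2-3, columns 4-5)."""
    region = [image[r][c] for r in (2, 3) for c in (4, 5)]
    return sum(region) / len(region)


def occlusion_map(model, image, patch=2, fill=0.0):
    """Blank out each patch in turn and record the fall in the model's score.
    A large fall means the model relied on that patch."""
    height, width = len(image), len(image[0])
    baseline = model(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]   # copy, then blank the patch
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - model(occluded)
            for r in rows:
                for c in cols:
                    heat[r][c] = drop
    return heat


# Invented 6 x 8 'image': dim background with one bright 'lesion'.
image = [[0.1] * 8 for _ in range(6)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 0.9

heat = occlusion_map(toy_model, image)
for row in heat:
    print(" ".join(f"{value:.1f}" for value in row))
```

Only the patch covering the bright region produces a non-zero value, so the map shows where this toy model is 'looking'. Had the map lit up somewhere clinically irrelevant, as in the tank example, that would be a warning sign.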



Having decided that a particular AI has been trained on a dataset that is representative of the individuals that one wants to use it on, and having satisfied oneself that the engine of the AI is robust, there are a number of legal and ethical considerations that then need to be addressed. Concerns around intellectual property, such as algorithm and patient information ownership, are also relevant. Thus, before an AI is deployed in clinical practice, and with reference to the current legal and governance frameworks, attention should be paid to the following principles: transparency, trust, justice, fairness, equity, non-maleficence, beneficence, responsibility, accountability, respect for autonomy, sustainability, dignity and solidarity.


While many of these issues are generic to all technological innovations, in countries with indigenous populations, for example, clinicians implementing AI will need to take account of the specific issues and concerns of these communities. This may necessitate a more cautious approach that gives greater weight to issues of equity, dignity and social justice, as well as taking account of Indigenous Data Sovereignty. Laws such as the General Data Protection Regulation can also be viewed as promoting data sovereignty more generally, as part of their remit is to give people more control over their data.


The final issue for those looking at implementing AI is addressing where liability lies when a DLA makes an error resulting in misdiagnosis or a poor patient outcome. Arguably, this will first lie with the clinician who used the AI. This fact alone is a powerful argument for clinicians to be able to recognise bad AI. Developers do bear some responsibility, however; as the recent Boeing 737 Max experience shows, putting good people into an unsafe environment risks poor outcomes.


In the context of DR screening, arguably any AI that is deployed to screen for disease should have a ‘leave no one behind’ philosophy, meaning it should have very high sensitivity to detect disease, at the expense of specificity. In effect, one would accept a relatively high false positive rate to ensure the negative predictive value (the likelihood of a negative test result being correct) is very high (>99.5%). Clinically this makes good sense. Because >75% of individuals who are being screened routinely for sight-threatening DR have no or minimal disease, if all the AI does is confidently remove this group of patients from the workload of the grading team, this represents a big win for the programme as a whole.
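The trade-off can be worked through with some illustrative arithmetic. The sketch below assumes a screened population of 10,000 in which 25% have more than minimal disease (consistent with the >75% figure above), and a hypothetical AI with 99% sensitivity and 80% specificity. These operating figures are assumptions chosen for the example, not the performance of any real product, and the function name is invented.

```python
def screening_outcomes(n_screened, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts and predictive values for a screening test."""
    diseased = n_screened * prevalence
    healthy = n_screened - diseased
    true_pos = diseased * sensitivity
    false_neg = diseased - true_pos
    true_neg = healthy * specificity
    false_pos = healthy - true_neg
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "false_pos": false_pos,
        # Probability that a negative result is correct.
        "npv": true_neg / (true_neg + false_neg),
        # Probability that a positive result is correct.
        "ppv": true_pos / (true_pos + false_pos),
        # Share of all patients the AI reports as negative.
        "fraction_cleared": (true_neg + false_neg) / n_screened,
    }


# Assumed figures for illustration only.
result = screening_outcomes(
    n_screened=10_000, prevalence=0.25, sensitivity=0.99, specificity=0.80
)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions roughly 6,000 of the 6,025 negative results are correct, giving a negative predictive value of about 99.6%, and about 60% of patients would not need manual grading. The price is that only around 62% of positive results are true positives, which is the false positive burden the text accepts in exchange.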


In conclusion, a good AI is one that has been trained and validated on large datasets that represent the population in which it is intended to be deployed. It is one that reflects the cultural values of the jurisdiction it is used in, and it is one that will not further exacerbate existing health inequalities. At least for now, such systems should be viewed as clinical decision support tools that will aid clinicians and health providers to achieve the best health outcomes for their patients.




1. Schmidt-Erfurth U, Sadeghipour A, Gerendas BS, Waldstein SM, Bogunović H. Artificial intelligence in retina. Prog Retin Eye Res [Internet]. 2018 Nov; 67:1-29. Available from: https://doi.org/10.1016/j.preteyeres.2018.07.004 

2. Time to discuss consent in digital-data studies [editorial]. Nature [Internet]. 2019 Aug;527(5). Available from: https://doi.org/10.1038/d41586-019-02322-z

3. Chen PHC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nat Mater [Internet]. 2019 May;18(5):410. Available from: https://doi.org/10.1038/s41563-019-0345-0 

4. Char DS, Shah NH, Magnus D. Implementing machine learning in health care - addressing ethical challenges. N Engl J Med [Internet]. 2018 Mar 15;378(11):981-983. Available from: https://doi.org/10.1056/NEJMp1714229



Dr David Squirrell is lead ophthalmologist for the Auckland Diabetic Screening programme and an ophthalmology representative at Pharmac. He has a double fellowship as a retina specialist with over 10 years post-fellowship experience assessing and treating patients with retinal and macular disease.  


Dr Ehsan Vaghefi is a senior lecturer at Auckland University in ocular imaging and bioinstrumentation and has been leading research in bioengineering and medical device development for over 10 years. He co-founded Toku Eyes with Dr Squirrell (www.tokueyes.com). 


Aan Chu is a Part V BOptom student, who undertook research under the co-supervision of Drs Squirrell and Vaghefi. Her project led to a systematic review of AI usage in ophthalmology (in press), with emphasis on biases of AI.

