Artificial intelligence (AI), and computer vision in particular, promises to be a valuable aid for diagnosing diseases based on medical imaging. For humans, it takes years of academic and on-the-job training to perform, for example, medical diagnosis from X-ray images. As we will see, it is also quite a challenge for intelligent algorithms.
At this year's KIS-RIS-PACS and DICOM convention organized by the Department of Medicine at the University of Mainz, Germany, researchers from radiology and adjacent fields gathered to discuss the state of the art of AI in their field. Philipp Jackmuth from dida was the speaker of choice for this topic, and here we will discuss key points of his talk.
Classification - diagnosing X-ray images
One of the oldest medical imaging techniques, if not the oldest, is X-ray imaging. The patient is exposed to high-energy electromagnetic radiation, which is absorbed to different degrees by different biological tissues. The radiation that passes through the patient's body hits a screen which darkens on exposure. This allows the creation of black-and-white images that indicate different tissue densities in the patient's body and thus grants doctors a look "inside".
Interpreting these images and basing a diagnosis on them is quite a tricky task. In Baltruschat et al. 2019, the authors classify over 100,000 X-ray images from ca. 30,000 patients according to their respective diagnoses. There are a few caveats and discoveries worth pointing out.
The authors trained classifiers based on ResNet-38, ResNet-50 and ResNet-101 architectures and compared their performance. ResNet is a very successful deep neural network architecture by Microsoft (He et al. 2015), and the appended numbers denote the number of layers in the network. Furthermore, the authors compared transfer learning, with and without fine-tuning, against a model trained from scratch, and they also compared different input data formats. Transfer learning is the practice of reusing the lower layers of a neural network that was trained on a different dataset, under the assumption that the simple features learned by the lower layers are useful in both problem settings. The weights of these layers can be kept fixed or fine-tuned, i.e. additionally trained for the specific new task. Readers interested in these aspects are referred to the original paper; here we want to highlight a few key points of the study.
Two key contributions of Baltruschat et al. 2019 are the following. 1) The authors include non-image data in the classification process, i.e. they give age and gender of the patient, as well as view position (posterior-anterior vs. anterior-posterior, that is patient facing away from the radiation source vs. patient facing the radiation source), as additional input to the network. 2) The authors perform a Grad-CAM analysis (Selvaraju et al. 2016) to generate class activation maps for some of the images. Grad-CAM is a method to determine the relevance of parts of the image for a specific classification. After the forward pass of the image through the network, the gradient of the score of the class of interest is computed via backpropagation with respect to the highest-level feature maps, i.e. the activations of the last convolutional layer. An importance score is computed for each feature map in that layer by averaging its gradients over all spatial positions, and the feature maps are linearly combined, weighted by these importance scores. Setting all negative values to zero gives the desired class activation maps.
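The Grad-CAM steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the paper: it assumes the activations of the last convolutional layer and the gradients of the class score with respect to them have already been extracted from the network, and the toy arrays below are random stand-ins for those values.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a class activation map for a single image.

    feature_maps: array (K, H, W), activations of the last convolutional layer.
    gradients:    array (K, H, W), gradient of the class score with respect
                  to those activations (obtained via backpropagation).
    """
    # One importance score per feature map: the spatial average of its gradients.
    weights = gradients.mean(axis=(1, 2))               # shape (K,)
    # Linear combination of the feature maps, weighted by the importance scores.
    cam = np.tensordot(weights, feature_maps, axes=1)   # shape (H, W)
    # Set negative values to zero so only positive evidence for the class remains.
    return np.maximum(cam, 0.0)

# Toy data standing in for a real network (hypothetical values).
rng = np.random.default_rng(0)
activations = rng.random((4, 7, 7))
class_gradients = rng.standard_normal((4, 7, 7))
cam = grad_cam(activations, class_gradients)
print(cam.shape)
```

In practice the resulting low-resolution map is up-sampled to the size of the input image and overlaid on it as a heat map.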
Let us look at some results of the paper. Below we see examples of the training dataset used in Baltruschat et al. 2019.
In image (d) we see an acute pneumothorax. It can be identified by the thin white line in the lower third of the right chest (right chest is left in the image), which is the lower border of the lung, and the lowered right diaphragm. On the other hand, in image (c) we see a pneumothorax which has already been treated with a chest drain, which can be identified by the two parallel lines on the right chest. The fact that images of treated patients are included in the training data will be discussed below.
The best performing model reported by the authors uses the additional non-image data; however, the increase in performance stemming from the non-image features seems to be quite small compared to other factors. It seems that the additional information was already largely contained in the images. It is intuitively plausible that age, gender and view position can be derived from X-ray images, as the former two clearly influence the physiology, while the latter can be derived e.g. from the position of the heart in the images.
One take-away from this is that when feeding extra data into a model, one might first want to examine whether this information actually contributes constructively to the decision process. Here, the authors made sure that there is at least some valuable information in the additional data by first training a simple Multi-Layer Perceptron (MLP) on these three features to predict disease. While the performance of the MLP was quite low, it was still better than random. And even if it had not been, the additional data might still help solve the classification task in combination with the original images.
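One simple way to combine non-image data with image information is to append it to the pooled image feature vector before the final classification layer. The sketch below illustrates this idea only; the encoding (age scaled to [0, 1], gender and view position as binary flags), the function name `fuse_features` and the `max_age` parameter are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def fuse_features(image_features, age_years, is_male, is_pa_view, max_age=100.0):
    """Append three non-image features to an image feature vector.

    The encoding used here is an assumption for illustration purposes.
    """
    meta = np.array([
        min(age_years, max_age) / max_age,  # age scaled to [0, 1]
        1.0 if is_male else 0.0,            # gender as a binary flag
        1.0 if is_pa_view else 0.0,         # view position as a binary flag
    ])
    return np.concatenate([image_features, meta])

# A ResNet-50 produces a 2048-dimensional vector after global pooling;
# zeros are used here as a placeholder for real image features.
pooled = np.zeros(2048)
fused = fuse_features(pooled, age_years=50, is_male=True, is_pa_view=False)
print(fused.shape)
```

The fused vector then serves as input to the classification layer, which can weigh image and non-image features jointly.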
Let us have a look at what the model considers relevant for the classification task. In the following figure we can see the class activation maps from the Grad-CAM analysis of two different images labeled as "Pneumothorax".
The issue with these images is that some of the "Pneumothorax" images are X-rays of patients who have already been treated. To treat a pneumothorax, a chest drain is inserted into the patient's chest. The chest drain is clearly visible in the X-ray images, and it stands to reason that the model will use the drain as an indicator for a pneumothorax.
The top row in the figure shows the X-ray and class activation map for an untreated patient. The network seems to accurately identify the pneumothorax based on the acute finding. The bottom row, however, shows the images for a treated patient with a chest drain. The class activation map shows that the network pays exclusive attention to the chest drain, which implies that the network identifies the drain as a symptom of pneumothorax. This is a good example of how explanation methods help to critically engage with AI-made predictions.
TL;DR: Additional non-image information can aid image classification. Great care should be taken with the training data when AI is used for diagnosis, as a model can perform well for the wrong reasons.
Semantic Segmentation - localizing cardiac catheters in 3D ultrasound images
In Yang et al. 2019 the authors used a combination of U-Net-like fully-convolutional networks (FCNs) to segment a cardiac catheter in 3D ultrasound images. A U-Net is a neural network consisting of an encoder-decoder structure (Ronneberger et al. 2015). The encoder consists of a series of convolutional and pooling layers, while the decoder mirrors the structure of the encoder with deconvolution and up-sampling layers. Skip connections forward the information of each feature map before pooling in the encoder to the respective up-sampling layer in the decoder. This way the network can learn features on multiple scales if necessary and use them to compute the segmentation masks.
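To make the role of a skip connection concrete, the following NumPy sketch tracks the shapes through a single encoder-decoder level. It is a simplified illustration under stated assumptions: the learned convolutions are omitted, nearest-neighbour up-sampling stands in for a learned deconvolution, and the helper names are made up for this example.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour up-sampling of a (C, H, W) array by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Placeholder encoder feature maps: 8 channels at 64x64 resolution.
features = np.random.default_rng(0).random((8, 64, 64))

skip = features                   # stored before pooling (the skip connection)
bottom = max_pool2x2(features)    # encoder: half the resolution, (8, 32, 32)
up = upsample2x(bottom)           # decoder: back to full resolution, (8, 64, 64)

# The decoder sees both the coarse, up-sampled features and the fine details
# forwarded by the skip connection, concatenated along the channel axis.
merged = np.concatenate([skip, up], axis=0)
print(merged.shape)  # (16, 64, 64)
```

A full U-Net repeats this pattern over several resolution levels, with convolutions before each pooling step and after each concatenation.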
Heart catheterization describes the insertion of a tube into the patient's arteries or veins in order to measure pressure inside the heart, to locally inject contrast agents or to perform other examinations. During this procedure the physician needs to obtain multiple X-ray images to check the location of the catheter.
In general it is quite tricky to get real-time 3D images from inside the body, as sophisticated methods like MRI involve heavy machinery and strong magnetic fields. This limits the use of MRI as an aid to surgical procedures such as heart catheterization. Other methods like CT/X-ray expose the patient to substantial doses of radiation and should be used with care. Ultrasound, on the other hand, has basically no side effects and is incredibly gentle on the body. Thus it would be a good way to aid physicians performing heart catheterization.
The issue, however, is the low quality of ultrasound images in comparison to e.g. X-ray images, which makes them incredibly hard to use for this procedure. The authors of Yang et al. 2019 contribute towards making ultrasound images more useful by developing an algorithm to identify the location and size of the catheter in 3D ultrasound images.
As 3D convolutional neural networks require significant amounts of training data, the authors attempt to utilize 3D information by slicing the volume of interest into 2D slices. The neural network they deploy is an FCN consisting of the convolutional layers of the VGG-16 network (Simonyan and Zisserman 2014), another pre-trained image classification model. As input data they take a volume from the ultrasound scan and slice it along each spatial axis. In order to have 3D information represented in the input data, they take three adjacent slices along each axis and assign them to the three color channels. For example, consider a volume of 48x48x48 voxels. With padding this gives 48 3-channel images along each axis. Each of these 3-channel images is passed through the FCN and the resulting feature maps are recombined according to their location in the original volume. This pre-processed volume is then segmented using a 3D convolutional layer. Because the 2D feature maps are recombined according to their original spatial arrangement, the authors dubbed their method Direction Fused-FCN (DF-FCN).
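The slicing step for the 48x48x48 example can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: it assumes that each 3-channel image consists of a slice and its two direct neighbours and that the volume is zero-padded at both ends, and the function name is hypothetical.

```python
import numpy as np

def three_channel_slices(volume, axis):
    """Return one 3-channel image per slice position along `axis`.

    The middle channel is the slice itself, the outer channels are its
    neighbours. Zero padding at both ends is an assumption.
    """
    v = np.moveaxis(volume, axis, 0)               # put the slicing axis first
    padded = np.pad(v, ((1, 1), (0, 0), (0, 0)))   # zero-pad along that axis
    # Stack slices i-1, i, i+1 as channels -> shape (N, H, W, 3)
    return np.stack([padded[:-2], padded[1:-1], padded[2:]], axis=-1)

# Random placeholder for an ultrasound volume of interest.
volume = np.random.default_rng(0).random((48, 48, 48))
stacks = [three_channel_slices(volume, axis) for axis in range(3)]
print([s.shape for s in stacks])
# [(48, 48, 48, 3), (48, 48, 48, 3), (48, 48, 48, 3)]
```

Each of the three stacks contains 48 images that can be fed to a 2D network pre-trained on ordinary RGB images, which is what makes the reuse of VGG-16 possible.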
For comparison, the authors also perform segmentation without the direction fusion, i.e. the 2D 3-channel images are segmented individually and the segmentation masks are put back into their original place to obtain a 3D segmentation mask. So instead of generating feature maps, recombining them and performing segmentation on the fused maps, the 2D images are used for segmentation directly and the 2D predictions are recombined to get the final voxel classification. This is the FCN approach shown in the next figure. In the FCN approach one obtains three predictions for each voxel, one for each spatial axis. The three predictions are combined by randomly choosing one of them, i.e. in the final segmentation mask a voxel is assigned a class with a probability that reflects the agreement of the separate predictions along each spatial axis.
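The random recombination of the three axis-wise predictions can be illustrated with a short NumPy sketch. It assumes binary masks of identical shape and a uniform random choice per voxel; the function name and the toy masks are hypothetical.

```python
import numpy as np

def fuse_by_random_choice(pred_x, pred_y, pred_z, rng):
    """Pick, for every voxel, one of the three axis-wise predictions at random."""
    preds = np.stack([pred_x, pred_y, pred_z])         # shape (3, D, H, W)
    choice = rng.integers(0, 3, size=pred_x.shape)     # index of the chosen prediction
    return np.take_along_axis(preds, choice[None], axis=0)[0]

rng = np.random.default_rng(0)
shape = (20, 20, 20)
# Toy predictions: two axes vote "background" (0), one votes "catheter" (1).
fused = fuse_by_random_choice(
    np.zeros(shape, dtype=int), np.zeros(shape, dtype=int), np.ones(shape, dtype=int), rng
)
# Roughly one third of the voxels end up labelled as catheter.
print(fused.mean())
```

Where all three predictions agree, the result is deterministic; where they disagree, the probability of a label equals the fraction of predictions that voted for it.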
Furthermore, the authors compared their approach to a classifier with handcrafted features, i.e. the programmers hard-coded manually designed feature representations into the classifier. Unlike the deep learning approaches, this model does not learn the features automatically and thus requires a detailed understanding of the problem at hand, which is often not feasible. The last method for comparison is a LateFusion Convolutional Neural Network (CNN), where the volume is sliced along the spatial dimensions, as in the authors' approach, and each slice is passed through a CNN. The outputs are concatenated into one vector before the classification layer, and this gives a classification for the voxel where all three slices intersect.
In the figure below we see one example of an ultrasound image of a cardiac catheter. The authors used pig hearts for their tests. The image shows the original, the training label and the results of the different approaches to segmentation.
As can be seen in the last figure, the DF-FCN approach works much better than the pure FCN approach. The handcrafted feature method performs quite well, too. The authors also report objective metrics. They compare precision, recall, Dice score and two localization errors (skeleton error and endpoints error). The DF-FCN outperforms all other approaches in all of these metrics.
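For readers unfamiliar with the overlap metrics mentioned above, here is a minimal sketch of how precision, recall and the Dice coefficient are computed for binary segmentation masks. The function name and the toy masks are illustrative; the two localization errors are specific to the paper and are not covered here.

```python
import numpy as np

def segmentation_scores(pred, target):
    """Return (precision, recall, dice) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)     # voxels correctly labelled as catheter
    fp = np.sum(pred & ~target)    # voxels wrongly labelled as catheter
    fn = np.sum(~pred & target)    # catheter voxels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Dice = 2|P ∩ T| / (|P| + |T|); defined as 1 when both masks are empty.
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, dice

# Toy example: 2 true positives, 1 false positive, 0 false negatives.
precision, recall, dice = segmentation_scores([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
print(precision, recall, dice)
```

Precision penalizes voxels falsely marked as catheter, recall penalizes missed catheter voxels, and the Dice coefficient balances both in a single number.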
TL;DR: 3D-image segmentation is tricky and requires some trickery to get good results. However, many steps towards more efficient analysis of 3D-ultrasonic images have been taken already.
AI vs. M.D.
When engaging with the topic of AI in medicine, one question pops up almost instantly: how does AI perform relative to the average M.D.? This is a sensitive and complicated issue, and it would be unseemly to make predictions on how the field will develop. Reputable statements should always consider the context in which doctors make their predictions, while at the same time considering the many variables in the training data for the AI. That is to say, doctors tend to consider not merely a given image when diagnosing, but also the patient's history and complaints. On the other hand, one has to make sure the AI does not use illegitimate "extra" data hidden in the measurement devices, e.g. particular machine idiosyncrasies may correlate with certain diagnoses if the physician had more of these cases in his or her practice. The example of the "Pneumothorax" discussed above is another excellent illustration: the algorithm learned to do its job very well, but in some cases it did well for the wrong reasons. This makes a direct comparison of physicians with computer vision algorithms quite tricky.
TL;DR: AI is already performing quite well in purely data-based medical tasks. However, comparisons of algorithms and M.D.s should be taken with a grain of salt, as doctors rarely make their decisions based merely on a particular type of data.