The Big Future of Small Data

AI triumphs and revolutions fill the daily news; we are told that AI can now detect cancer, find lifesaving treatments, drive cars, and much more. So why do only a small proportion of organisations successfully implement AI-assisted solutions [1]?

Behind the flashy headlines lies an uncomfortable truth: to build AI models that generate useful insights, one needs data that has been painstakingly labelled by humans. For example, people must annotate images before those images can teach an AI model what is visible in them (e.g., how many cars are in the image and where are they located?), and a large database of clinically validated drugs is needed before an AI can be taught to predict new assets. This strategy worked well for applications such as autonomous driving, where labelled data is cheap and easy to come by (e.g., dashboard camera images) – after all, non-experts can correctly identify common objects in images. However, in expert domains such as the Life Sciences sector, labelling data quickly becomes difficult. For example, labelling a single pathology image to teach an AI to diagnose cancer takes an expert pathologist an average of 2 hours – a cost of $280/image [2][3]. The first AI-powered, FDA-approved prostate cancer detection tool required 12,160 such labelled images [4]. Hence, at these rates, labelling enough data to bring an AI tool into the clinic costs approximately $3.4M (12,160 images × $280). Unfortunately, labelling a single dataset is not enough because of Model/Data drift [5], i.e., the decrease in AI model performance caused by changes in the data over time. For example, a different imaging device for pathology images could yield data that the original model can no longer analyse correctly. Hence models must be retrained from time to time, increasing costs dramatically.

Even more challenging are cases in which collecting and labelling enough data is not just expensive but almost impossible. Consider building an AI model that detects defects on assembly lines. Industry experts estimate that there might be as few as 40 images containing defects in a dataset of millions of images, so a company may never collect enough data to deploy AI [6]. Amongst the most difficult cases are those in which data collection and labelling require complex and lengthy specialised human tasks. In gene-editing experiments, for example, it may take a month to generate a single labelled datapoint [7], e.g., to measure the effect of a gene deletion on a disease model such as human-derived stem cells. When trying to predict the clinical outcomes of drugs, we may never have access to enough data for AI: only a few drugs are eventually tested in humans after several years of development time [8], and unfortunately most of these treatments fail [9].

As of today, we have only scratched the surface of what is possible with AI. Despite our collective mission to optimise AI on Big Data, our models often fail in the real world when using small datasets. Additionally, Big Data struggles with the “Big Data Paradox”, i.e., inherent dataset biases are amplified with bigger datasets [11]. Leading industry experts estimate that for every Big Data application there are thousands of Small Data applications that currently remain unsolved (Figure 1) [12]. To truly solve the big challenges ahead, AI leaders estimate that we need to leverage small datasets and enable tens of thousands of lightweight Small Data AI models for niche applications [12].

Figure 1: To date, AI has been successful in Big Data applications. However, Big Data is just the tip of the iceberg and there is a massive unmet need for AI in Small Data applications.

How might AI work for Small Data? Carefully examining nature may help: toddlers do not need to see tens of thousands of animals and be told by their parents which ones are cats. Often a few examples suffice. Note that true Small Data problems are mostly impossible to solve: no toddler could learn to identify cats from one labelled example of a cat alone. However, by the time a parent points at a cat and says “cat”, the toddler has seen a lot of the world. They already have a general understanding of objects and shapes that they can leverage to learn what a cat is. In other words: before encountering the Small Data problem of naming cats, the toddler has already processed and potentially categorised massive amounts of raw unlabelled data about the world around them.

Small Data AI techniques emulate nature. Organisations typically have abundant unlabelled data but struggle to generate labelled data. For example, taking an X-ray of a patient is faster than having a radiologist label it in detail, and it is often cheap and easy to access raw, anonymised, unstructured imaging datasets from hospitals but expensive to hire medical experts to label them. Similarly, generating a library of potential drug compounds to test against a given disease is cheap, whereas running trials to test them is not. For this reason, Small Data AI algorithms typically either learn from big unlabelled or partially labelled datasets (Unsupervised/Semi-supervised Learning), or use AI models pretrained on big, labelled datasets and then refine them on a related Small Data problem (Transfer Learning, Synthetic Data). See Figure 2 & Table 1 for an overview of the different techniques with examples.
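
As a rough illustration of the semi-supervised idea described above, the sketch below uses self-training (pseudo-labelling) on toy 2D data: a nearest-centroid classifier is fitted on just two labelled points, labels the unlabelled points it is confident about, and is then refitted on the enlarged set. Everything here is an assumption made for illustration, including the toy data, the `min_margin` threshold, and all function names; it is one simple way to exploit unlabelled data, not a description of any particular production method.

```python
import math
import random

def fit_centroids(examples):
    """Nearest-centroid 'model': one mean vector per class."""
    sums, counts = {}, {}
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict_with_margin(centroids, x):
    """Return (label, margin); the margin is the distance gap between the two closest classes."""
    ranked = sorted((math.dist(c, x), y) for y, c in centroids.items())
    (d_best, label), (d_next, _) = ranked[0], ranked[1]
    return label, d_next - d_best

def self_train(labelled, unlabelled, min_margin=1.0, max_rounds=10):
    """Repeatedly pseudo-label confident points and refit on the enlarged training set."""
    train, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        centroids = fit_centroids(train)
        confident, rest = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:  # nothing the model is sure about: stop
            break
        train.extend(confident)
        pool = [x for x, _ in rest]
    return fit_centroids(train), len(train) - len(labelled)

# Toy data (hypothetical): two well-separated clusters, one labelled point each.
rng = random.Random(0)

def sample(center, n):
    return [(center[0] + rng.uniform(-1, 1), center[1] + rng.uniform(-1, 1))
            for _ in range(n)]

class0, class1 = sample((0, 0), 200), sample((5, 5), 200)
labelled = [(class0[0], 0), (class1[0], 1)]
unlabelled = class0[1:] + class1[1:]

model, n_pseudo = self_train(labelled, unlabelled)
truth = [(x, 0) for x in class0] + [(x, 1) for x in class1]
accuracy = sum(predict_with_margin(model, x)[0] == y for x, y in truth) / len(truth)
print(f"pseudo-labelled {n_pseudo} points, accuracy {accuracy:.2f}")
```

On clusters this well separated, every pseudo-label is correct. On real data, wrong pseudo-labels can reinforce themselves, so the confidence threshold needs care.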

Figure 2: Venn diagram comparing different Small Data AI techniques in the context of supervised/unsupervised learning.

| Technique | Description | Strengths / Weaknesses |
| --- | --- | --- |
| Synthetic Data | Synthetic labelled data is generated with generative models that were trained on a related dataset, and is subsequently used for supervised learning. Example: Using a generative model to learn what glass defects look like, then applying it to a small dataset of defect-free glass to generate data that contains defects. | Strengths: The only technique that may work if one has neither labelled nor unlabelled data on a given Small Data problem. Weaknesses: Generative models are notoriously difficult to train, so each new application requires a lot of work. |
| Auto-Encoder | Auto-Encoders are trained on unlabelled data to learn a low-dimensional representation of the data. Example: Reducing complex genetic sequences to a few numbers to separate different kinds of sequences. | Strengths: No labelled data required; easy to use. Weaknesses: Unsupervised models rarely learn human-relevant features. |
| Self-Supervised Learning | Similar to Auto-Encoders, but the training is augmented by exploiting the underlying structure of the unlabelled data. Example: Masking parts of microscopy images and making an AI model learn to reconstruct the full images. This way the network learns general features of the data and can be specialised on smaller datasets afterwards (Semi-Supervised). | Strengths: Very powerful and popular technique, especially if combined with labelled data (semi-supervised). Weaknesses: Finding applicable perturbations might require some trial and error for different data types. |
| Transfer & Meta Learning | A 2-step process in which an AI model is first trained on a big, labelled dataset that is similar to the small dataset for which an AI is required. The trained AI model is then fine-tuned on the small, labelled dataset. Adding Meta Learning to the first training step ensures that the resulting model can learn new tasks very efficiently. Note that the first training step could also be done using self-supervised learning, making this a good semi-supervised technique as well. Example: Train an image analysis AI on a big dataset containing many labelled cell microscopy images, then fine-tune the model on a small dataset from one specific cell type. | Strengths: Very powerful technique that can be used in many cases. Weaknesses: Training the model twice poses some practical challenges. |
| Active Learning | In active learning one starts with an AI model that can predict which datapoints need to be labelled first. After the first labelling step, the process is repeated so that humans only need to label the minimal number of datapoints for the AI to achieve the best performance. Example: After testing a few drugs in the laboratory, an AI model is built to predict outcomes for other drugs together with a prediction confidence score. Subsequently, the drugs that exhibit low confidence are tested in the laboratory to boost performance. | Strengths: Good at finding edge cases that require labelling; does not require unlabelled data. Weaknesses: May not reduce the amount of data required by much. |

Table 1: An overview of the most popular supervised and unsupervised Deep Learning-based Small Data AI techniques.
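
To make the Active Learning entry more concrete, here is a minimal sketch of its loop using uncertainty sampling on a deliberately simple problem. It assumes 1,001 candidates described by a single number, a hidden activity threshold, and a stand-in `run_assay` function playing the role of the laboratory. All names and numbers are hypothetical, and the "model" is just a decision boundary halfway between the closest known inactive and active candidates.

```python
def run_assay(x, hidden_threshold=637):
    """Stand-in for a slow, expensive lab experiment (hypothetical oracle)."""
    return 1 if x >= hidden_threshold else 0

def fit_boundary(labelled):
    """Toy model: boundary halfway between the closest known 0 and known 1."""
    lo = max(x for x, y in labelled.items() if y == 0)
    hi = min(x for x, y in labelled.items() if y == 1)
    return lo, hi, (lo + hi) / 2

def confidence(x, boundary):
    """Predictions far from the decision boundary count as more confident."""
    return abs(x - boundary)

def active_learning(pool, labelled, budget):
    """Label the least confident candidate, refit, and repeat."""
    labelled = dict(labelled)
    queries = []
    for _ in range(budget):
        lo, hi, boundary = fit_boundary(labelled)
        # Candidates outside (lo, hi) are already determined by this toy model.
        candidates = [x for x in pool if x not in labelled and lo < x < hi]
        if not candidates:
            break
        x = min(candidates, key=lambda c: confidence(c, boundary))
        labelled[x] = run_assay(x)  # the expensive labelling step
        queries.append(x)
    return fit_boundary(labelled), queries

# Start from two tested candidates: one inactive (0) and one active (1).
(lo, hi, boundary), queries = active_learning(range(1001), {0: 0, 1000: 1}, budget=20)
print(f"{len(queries)} assays run; activity starts between {lo} and {hi}")
```

In this toy setting, about ten well-chosen experiments pin down the boundary that would otherwise take hundreds of assays to find by testing candidates one after another. Real problems are noisier and higher-dimensional, so the savings are usually less dramatic.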

Life Sciences industries are amongst the most affected by the Small Data problem. Unlabelled data, e.g., microscopy images or potential target-binding compounds, are relatively accessible, but labelled data requires expert labelling work that may involve slow computations (e.g., computer programs that take days to compute an analysis), lengthy manual labelling by eye (e.g., of pathology images), and expensive laboratory experiments (e.g., compound binding assays). Life Sciences data is heterogeneous, messy, and small, making the industry ripe for disruption with lightweight, problem-focussed Small Data AI models [13]. DeepMirror is at the forefront of this disruption by providing Small Data AI as a service to organisations in the Life Sciences. To achieve this, we use a mixture of Transfer Learning and Self/Semi-Supervised Learning paired with specialist AI models for different Life Science data types (e.g., medical images, small molecules, and DNA/RNA sequences). Our technology enables tumour analysis in pathology images of patients using 100× less labelled data than conventionally required [14], and can predict new and better drugs against malaria from just 300 previously tested drugs [15]. Have a look at our blog for more information [16], and reach out to us if you are interested in what we do. Let’s work together to make breakthrough discoveries with Small Data AI!

Contact: Max Jakobs (max@deepmirror.ai)

Edited by: Andrea Dimitracopoulos, Eva Pillai, Craig Russell, Ala Alenazi, Ryan Greenhalgh, Amir Shirian, Sasha Eremina, Brieuc Lehmann

References

[1] [Online]. Available: https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/global-survey-the-state-of-ai-in-2020.
[2] [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6669998/.
[3] [Online]. Available: https://www.salary.com/research/salary/benchmark/physician-pathology-hourly-wages.
[4] [Online]. Available: https://arxiv.org/pdf/1805.06983.pdf.
[5] [Online]. Available: https://www.forbes.com/sites/forbestechcouncil/2021/09/23/model-drift-in-data-analytics-what-is-it-so-what-now-what/?sh=6ffef4cb4862.
[6] [Online]. Available: https://www.protocol.com/enterprise/landing-mariner-ai-manufacturing-defect.
[7] [Online]. Available: https://www.synthego.com/crispr-benchmark.
[8] [Online]. Available: https://www.nature.com/articles/nrd3078.
[9] [Online]. Available: https://theconversation.com/90-of-drugs-fail-clinical-trials-heres-one-way-researchers-can-select-better-drug-candidates-174152.
[10] [Online]. Available: https://www.forbes.com/sites/brentdykes/2015/10/14/analyzing-big-data-8-tips-for-finding-the-signals-within-the-noise/?sh=5969fd7e16f5.
[11] [Online]. Available: https://www.nature.com/articles/s41586-021-04198-4.
[12] [Online]. Available: https://www.protocol.com/newsletters/protocol-enterprise/manufacturing-ai-salesforce-shay-banon?rebelltitem=1#rebelltitem1.
[13] [Online]. Available: https://www.mercuryds.com/blog/5-data-challenges-healthcare-lifesciences.
[14] [Online]. Available: https://deepmirror.ai/2022-02-14-digitalpathology/.
[15] [Online]. Available: https://deepmirror.ai/2022-03-18-open-source-drug-discovery/.
[16] [Online]. Available: https://deepmirror.ai/blog/.