A primer on machine learning techniques for genomic applications

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Graphical abstract

[Graphical abstract figure: ga1.jpg]

Keywords: Machine learning, Deep learning, Genomics

Abstract

High throughput sequencing technologies have enabled the study of complex biological aspects at single nucleotide resolution, opening the big data era. The analysis of large volumes of heterogeneous “omic” data, however, requires novel and efficient computational algorithms based on the paradigm of Artificial Intelligence. In the present review, we introduce and describe the most common machine learning methodologies, and more recently deep learning, applied to a variety of genomics tasks, trying to emphasize their capabilities, strengths and limitations in simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real-life example, and underline how the described methods are relevant in all cases in which large amounts of multimodal genomic data are available.

1. Introduction

The aim of Artificial Intelligence (AI) is to simulate human intelligence in non-living agents, mimicking actions that the human brain performs daily, such as problem-solving and reasoning or pattern recognition and knowledge acquisition [1]. The development of AI has been largely driven by Machine Learning (ML), by which computers acquire the ability to learn and improve from experience with limited human intervention.

The most common type of ML algorithm is supervised learning, a class of AI methods that learn input-to-output mappings. It is called “supervised” learning because the output is known and the algorithm iteratively makes output predictions until an acceptable level of performance is reached. A family of ML algorithms called Neural Networks, loosely inspired by how neurons pass messages to each other in the human brain, has recently evolved into the so-called Deep Learning (DL) subfield. Compared with classical ML, DL methods are more flexible and can handle larger amounts of data. However, since their predictions strongly depend on the input training data, great care and caution should be taken in the interpretation of results, especially in the case of biological data. All ML and DL methods need to learn from input data, and most of them require a training set, generally consisting of a random subset of the available data. After training, another data set (usually a part of the original database not used for training) is used to validate and select the best-fit ML or DL model. Sometimes a further independent test set is used for performance evaluation.

In the last fifteen years, the genomics world has been revolutionised by the advent of high throughput sequencing technologies (HTS), definitively opening the era of big data or “omic” sciences [2], [3], [4]. HTS have indeed enabled the study of complex biological aspects at single nucleotide resolution and are now commonly applied to a variety of functional genomics problems including, for instance, the identification of genomic rearrangements and variants [5], [6], the investigation of epigenetic changes [7], or the study of transcriptional and post-transcriptional molecular dynamics [8], [9]. HTS are also emerging as key technologies for the discovery of biomarkers [10] and offer great promise to deliver personalized medicine [11], [12].

Nowadays, multiple “omic” applications are routinely applied to the same biological samples, raising the complex problem of integrated data analysis and interpretation [13]. In this context, ML and DL methods are indispensable to systematically analyze large volumes of heterogeneous data and better understand underlying biological processes neglected or undetectable by single “omic” approaches. A growing number of ML and DL based computational strategies are becoming available through dedicated platforms such as TensorFlow [14] or PyTorch [15], and/or fully equipped and documented R packages, such as Caret [16].

In the present review, we cover the main ML methodologies and some principles of the DL methods currently applied to genomic problems and omic data, defined here as all data generated by technologies (such as HTS) working at the genomic scale. Starting from the field literature of the last decade, we focus on the most used methods, providing technical descriptions as well as relevant examples, and try to emphasize their capabilities, strengths and limitations. In introducing basic ML and DL principles, we have tried to use intuitive language in order to be as accessible as possible to researchers approaching the fascinating world of AI for the first time (drawing on and simplifying concepts from [17]). Additionally, to demonstrate the power of ML methods in handling genomic data, we provide a real-life example in which we show how to predict age and biological sex with high accuracy from human gene expression experiments (generated by the RNAseq technology), taking into account a large number of deep transcriptome data available through the international Genotype-Tissue Expression (GTEx) project [19].

2. The learning problem

The primary goal of ML is to acquire skills or knowledge from experience in order to automate human tasks. As a consequence, at the heart of ML there is the learning problem, in which computers learn from real data and perform useful predictions. Depending on the type of available data and on the task to perform, a variety of ML methods have been developed [20]. Nowadays, many of these methods have been applied to genomic data for solving several complex biological problems, such as the prediction of specific sequence motifs for DNA or RNA binding proteins [21], of the genome methylation status [22], of the 3D organization of the chromatin [23], as well as of the pattern of post-transcriptional modifications [24] or of the most likely cell types from single cell RNAseq experiments [25]. Drawing on the field literature of the last ten years, we have collected the main ML algorithms and ranked them according to their popularity and versatility in genomic applications. In particular, we retrieved publications from the PubMed database using the query string ((“Next Generation Sequencing” OR “single cell sequencing” OR “gene expression” OR “transcriptomics”) AND (“machine learning” OR “deep learning”) AND (“human”)) AND ((“2009”[Date - Publication] : “2022”[Date - Publication])) and organized them in a local sqlite3 database (available on our GitHub page, https://github.com/claudiologiudice/ML-DL-REVIEW) for downstream analyses. We then grouped the set of algorithms used in these publications, which include linear and non-linear models for classification and regression as well as some regularization procedures, into two learning classes referred to as supervised and unsupervised, whose main characteristics are represented in Fig. 1 and summarized in Table 1, and will be discussed in the next sections.
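As a minimal sketch of this retrieval step (not the authors' actual pipeline), the PubMed query and the local sqlite3 storage could be reproduced in Python with Biopython's Entrez utilities and the standard sqlite3 module; the contact e-mail, database file name and table layout below are illustrative placeholders.

# Minimal sketch: run the review's PubMed query and store the resulting
# PubMed IDs in a local sqlite3 database (illustrative, not the authors' code).
import sqlite3
from Bio import Entrez  # Biopython, assumed to be installed

Entrez.email = "your.name@example.org"  # required by NCBI; placeholder

query = (
    '("Next Generation Sequencing" OR "single cell sequencing" OR '
    '"gene expression" OR "transcriptomics") AND '
    '("machine learning" OR "deep learning") AND ("human") AND '
    '("2009"[Date - Publication] : "2022"[Date - Publication])'
)

handle = Entrez.esearch(db="pubmed", term=query, retmax=10000)
record = Entrez.read(handle)
handle.close()
pmids = record["IdList"]

# Store the IDs in a local sqlite3 table for downstream analyses
con = sqlite3.connect("ml_dl_review.db")
con.execute("CREATE TABLE IF NOT EXISTS publications (pmid TEXT PRIMARY KEY)")
con.executemany("INSERT OR IGNORE INTO publications (pmid) VALUES (?)",
                [(pmid,) for pmid in pmids])
con.commit()
con.close()
print(f"Stored {len(pmids)} PubMed IDs")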

Fig. 1. Supervised versus unsupervised learning – a pictorial representation. Supervised learning involves a training phase in which a labelled dataset is used to train a model that can subsequently recognize unseen data. Unsupervised learning identifies latent factors in unlabelled data and groups observations based on similarity.

Table 1

Main differences between supervised and unsupervised learning.

Supervised learning | Unsupervised learning
Input data is labelled | Input data is unlabelled
There is a training phase | There is no training phase
Data is modelled based on the training dataset | Uses properties of the given data for classification
Divided into two types: classification and regression | Most popular types: clustering and dimensionality reduction
Known number of classes (for classification) | Unknown number of classes

Since the ML field is at the intersection of Statistics, Data Science and Engineering, some terms with the same meaning may be used interchangeably. To facilitate the non-specialist reader and avoid confusion, in Table 2 we provide a list of such terms with their description and highlight in italics the terminology that will be preferentially used in this review. On our GitHub page, in the Supplementary Material section, we provide further mathematical details of the described ML methods for the interested reader.

Table 2

List of alternative terminologies used in Machine Learning to represent the same concept (our preferred choice is highlighted in italics). X denotes the input, Y or G the quantitative or qualitative output, respectively.

Synonyms | Description
Quantitative or continuous variable | A variable assuming continuous values with explicit ordering
Qualitative, discrete, factor or categorical variable | A variable assuming discrete values with no explicit ordering
Observation or measurement | A realization of a statistical variable
Feature, predictor, attribute or independent variable j | The j-th column of the observed X: (x_ij), i = 1, …, N
Output, outcome, response or dependent variable | The observed Y or G
Output class or output label i | The i-th element of the observed G: g_i
Data imputation | Replacing missing or inconsistent data with plausible data

3. Supervised learning

Supervised learning is the most commonly used type of ML; it learns a mapping from an input X to an output Y for quantitative values, or an output G for qualitative values. The observed values of the variable X can be represented by an N×M matrix with elements x_ij, i = 1, …, N, j = 1, …, M, where N is the number of observations (e.g., individuals or samples) and M is the number of features (e.g., genetic factors or genomic variables); (y_i), i = 1, …, N, or (g_i), i = 1, …, N, is an N-dimensional vector of output variables assuming continuous or discrete values, respectively. When the output is quantitative (Y), i.e., corresponds to continuous measurements, the supervised learning problem is known as a regression problem (e.g., prediction of a quantitative phenotype such as age). When the outcome is qualitative (G) with no explicit ordering, the prediction problem is called classification, and the output class is specified by a label, i.e., a digit (e.g., 0, 1 as in a case-control study, or 1, …, K as in cancer type or disease trait classification) or a dummy variable (“case”, “control”, or “cancer A”, “cancer B”, “cancer C”). Although regression and classification problems have been separated here into two categories, both are tasks in function approximation, as both learn an input-to-output mapping. A third but less common variable type, defined as ordered categorical (e.g., “mild”, “medium”, “severe” as in symptom severity classification), will not be considered in this review (for further details, please refer to [17]). Additional material on weakly supervised learning, i.e., ML with noisy, limited, or imprecise labels, can be found in Zhou [18].

Two typical examples of input-to-output mappings will be presented in Section 7, where the input X will be the N×M gene expression matrix, M the number of genes and N the number of individuals. In the first example – biological sex classification (a classification problem) – the output G is discrete, and g_i can take one of two values for each subject. In the second example – age regression (a regression problem) – the output Y is continuous, and y_i represents the age of individual i, with i ∈ {1, …, N}. The implemented algorithms will predict the output (biological sex or age) from the input (gene expression).
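As a toy preview of these two tasks, the short Python sketch below builds a random N×M “expression” matrix and fits a classifier for biological sex and a regressor for age with scikit-learn. All data, model choices and parameters are purely illustrative and are not those used in Section 7.

# Toy sketch of the two Section 7 tasks on a random N x M expression matrix
# (random numbers stand in for real GTEx read counts; everything is illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N, M = 200, 500                                       # N individuals, M genes
X = rng.lognormal(mean=2.0, sigma=1.0, size=(N, M))   # toy expression matrix
g = rng.integers(0, 2, size=N)                        # discrete output: biological sex (0/1)
y = rng.uniform(20, 70, size=N)                       # continuous output: age in years

# Classification: predict biological sex from expression
X_tr, X_te, g_tr, g_te = train_test_split(X, g, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_tr, g_tr)
print("sex classification accuracy:", clf.score(X_te, g_te))

# Regression: predict age from expression
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("age regression R^2:", reg.score(X_te, y_te))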

3.1. Training, validation and test set

To build an accurate and robust predictive model, the initial dataset is typically split into training, validation and test sets. The training set is the data sample employed to fit the model (i.e., to find the parameters that can best describe the full dataset), and the performance of an ML algorithm significantly depends on it. If, for instance, the training set is too small, the algorithm may not gain enough experience and knowledge, which will lead to many prediction errors (an underfitting condition). On the other hand, if the model adheres too closely to the training data, capturing noise rather than the underlying signal, it may lose its ability to generalize to unseen data (a problem called overfitting). The validation set is the dataset used to evaluate the model fit on the training dataset and is employed to fine-tune the model hyperparameters. Finally, the test set is typically used to provide an unbiased estimate of the final model fit on the training dataset. The test set returns the actual performance of the model and is only used once the model has been fully trained.
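A minimal scikit-learn sketch of such a three-way split is shown below; the 60/20/20 proportions and the synthetic data are illustrative assumptions, not a prescription.

# Minimal sketch of a train/validation/test split (60/20/20) with scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))      # toy feature matrix
y = rng.integers(0, 2, size=1000)    # toy binary labels

# First split off the test set, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train)

print(len(y_train), len(y_val), len(y_test))  # 600 200 200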

To better understand the general problems of over- and underfitting, it is helpful to introduce the notions of model bias and variance. The bias is the difference between the average prediction of the model and the expected value we are trying to predict. A model with high bias makes wrong assumptions about the data. See, for instance, Fig. 2 on the left, where the model (Logistic Regression) tries to find a linear boundary to separate a dataset with a circular boundary. In this case, a model assuming a linear boundary will clearly lead to both high training and high testing errors. Variance, on the other hand, represents the variability of the model’s predictions and indicates how sensitive the model is to the randomness of the data in the training set. Consider, for instance, a model that makes hypotheses so general that they can fit almost any data. If we train this model on a given training set, we will find a set of optimal parameters (optimal for that training set); if we train the same model on a second training set, we will find a completely different set of optimal parameters. Such a model is very sensitive to the input data but will likely fail to perform well on unseen data. See, for instance, Fig. 2 on the right, where the algorithm (K-NN) predicts a boundary that strongly depends on the input data (the training set, represented with circles) and fails to generalize to the test set (represented with triangles), which was not used for training.

Fig. 2. Examples of under-, appropriate and over-fitting. The input dataset consists of two classes (blue and red points) distributed on an inner and an outer circle, respectively (with Gaussian noise), in a two-dimensional feature space. Three different models are used to fit the input training set, i.e., to find the boundary (in black) that best separates the two classes under the model hypotheses. The training set is represented by circles, the test set by triangles. The chosen methods – Logistic Regression, Support Vector Machine with Gaussian kernel, and K-NN – lead to typical under-, appropriate, and over-fitting scenarios, respectively. The code used to generate this figure is available on GitHub.

In supervised learning, the underfitting condition occurs when a model is unable to capture the underlying pattern of the data. Such models usually have high bias and low variance. The overfitting issue, instead, occurs when a model has low bias and high variance. The middle panel of Fig. 2 shows an example of appropriate fitting, where the model used (a Support Vector Machine with Gaussian kernel) has a good bias-variance tradeoff, as it makes an adequate assumption on the data distribution and is not too sensitive to the input data.
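The three scenarios of Fig. 2 can be reproduced in spirit with a few lines of scikit-learn code (the authors' own figure code is available on their GitHub page). The dataset parameters below and the choice of a single nearest neighbour to exaggerate overfitting are our illustrative assumptions.

# Sketch in the spirit of Fig. 2: three classifiers on a noisy two-circle dataset
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.15, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression (underfits: linear boundary)": LogisticRegression(),
    "SVM with Gaussian kernel (appropriate fit)": SVC(kernel="rbf", gamma="scale"),
    "1-NN (overfits: memorizes the training points)": KNeighborsClassifier(n_neighbors=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: train accuracy = {model.score(X_tr, y_tr):.2f}, "
          f"test accuracy = {model.score(X_te, y_te):.2f}")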

The performance of a predictive algorithm can be optimized via a cross-validation (CV) procedure. In the k-fold CV procedure, the training sample is divided into k mutually exclusive subsets of equal size. The algorithm is trained on k - 1 subsets and validated on the remaining one. This procedure is repeated for each of the k subsets, with the advantage that all observations are used for both training and validation, and each observation is used for validation exactly once. Cross-validation provides reasonable estimates of the expected error [17], and the average performance is reported along with its standard deviation and statistical significance.
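A minimal sketch of 5-fold CV with scikit-learn, on synthetic data and with an arbitrarily chosen classifier, is shown below; the mean accuracy and its standard deviation across folds are the quantities typically reported.

# Minimal sketch of 5-fold cross-validation: each observation is used for
# validation exactly once, and the mean and standard deviation of the fold
# scores are reported.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")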

3.2. Naive Bayes

The Naive Bayes (NB) algorithm is a classification algorithm belonging to the class of generative models, i.e., it builds a full statistical model for both input and output. Given this model, the output can be generated from the input (using Bayes’ rule). It is called naive because it makes a simple but strong assumption: all pairs of features (columns of X) are conditionally independent given the output labels, an assumption that is generally not true. Building the model is easy and requires no complicated iterative parameter estimation. For a discrete variable, it requires the construction of a frequency table for each feature against the output; the posterior probability (a product of probabilities, given the assumption of independence) is then computed, and the class with the highest posterior probability is returned as the predicted outcome. NB can model binary, unordered categorical, and continuous features, as well as features with unknown distribution, using a Bernoulli, multinomial, Gaussian, or kernel density model, respectively. NB classifiers have been adopted for their simplicity or as a baseline for comparison with more complex classifiers. Two of the many examples where NB is used as a baseline are [26], for the classification of treatment success/failure given a set of M = 2161 input features (variants of hepatitis C obtained through RNAseq) cross-validated on a total of N = 173 different subjects, and [27], where the input features are gene expression profiles from RNAseq, the output is a binary drug response, and the analysis is cross-validated on N = 455 individuals with cancer. For the taxonomic classification of microbiomes using metabarcoding, the RDP (Ribosomal Database Project) curators [28] developed the RDP-classifier, a Naive Bayes classifier relying on k-mer frequencies measured on prokaryotic genera [29]. The RDP-classifier, given an input sequence of bacterial 16S rRNA, predicts an output label, the genus. The algorithm was trained on the Bergey corpus, consisting of N = 5014 labelled sequences (the label is the genus and can have 988 different values, g ∈ {1, …, 988}), or on RDP sequences (N = 23,095, g ∈ {1, …, 1187}, where 1187 is the number of genera in the NCBI taxonomy database). The classifier uses a set of M features for each input sequence, the k-mers (with k = 8) that make up the sequence, and assigns to the sequence a genus based on the frequency of those k-mers in the labelled training set. More recently, the QIIME2 [30] team has introduced a Multinomial Naive Bayes classifier to achieve taxonomic classification of metabarcoding data [31].
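As a toy illustration of the k-mer idea behind such classifiers (far simpler than the RDP-classifier itself), the sketch below counts k-mers with scikit-learn's CountVectorizer and feeds them to a Multinomial NB model; the sequences, the two hypothetical genus labels and the choice k = 4 are invented for brevity.

# Toy k-mer based Naive Bayes classification (sequences and labels are invented)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_seqs = ["ACGTACGTGGCC", "ACGTACGAGGCC", "TTTTAACCGGTA", "TTTTAGCCGGTA"]
train_genus = ["GenusA", "GenusA", "GenusB", "GenusB"]   # hypothetical labels

# Represent each sequence by its k-mer counts (k = 4 here instead of the
# RDP-classifier's k = 8, to keep the toy feature space small)
kmer_counts = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
model = make_pipeline(kmer_counts, MultinomialNB())
model.fit(train_seqs, train_genus)

print(model.predict(["ACGTACGTGGCA"]))   # should print ['GenusA']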

Fig. 3 illustrates the behavior of the Gaussian NB algorithm on two synthetic datasets, a linearly separable dataset (i.e., with a linear boundary between classes, top row in the figure) and a circularly separable dataset (i.e., with a circular boundary between classes, bottom row), together with a few other classification algorithms. Despite its over-simplistic assumptions, the NB algorithm outperforms more sophisticated alternatives. Although the estimator may be biased, as it makes a priori assumptions, it has low variance, i.e., it is not sensitive to small fluctuations in the training set. Its use is typically recommended when the feature space is large (high M) and density estimation becomes unfeasible. In Oncofuse [32], a computational pipeline for the classification of fusion sequences with oncogenic potential, the NB algorithm was chosen for its robustness and because, like any generative model, it can natively handle missing data, which is essential when high throughput datasets from different sources need to be combined. The NB algorithm has also been applied to pharmacogenetic predictions. Boloc et al. [33], for instance, developed an NB-based predictive model of antipsychotic-induced extrapyramidal symptoms using functional SNPs belonging to four genes of the mTOR pathway.
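In the same spirit as Fig. 3, the sketch below cross-validates a Gaussian NB classifier on a linearly separable and a circularly separable synthetic dataset; the dataset parameters are illustrative and are not those used for the published figure.

# Gaussian Naive Bayes on a linearly and a circularly separable toy dataset
from sklearn.datasets import make_blobs, make_circles
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

datasets = {
    "linearly separable": make_blobs(n_samples=400, centers=2, random_state=0),
    "circularly separable": make_circles(n_samples=400, noise=0.1, factor=0.4,
                                         random_state=0),
}
for name, (X, y) in datasets.items():
    acc = cross_val_score(GaussianNB(), X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {acc.mean():.2f}")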