Genomic Prediction Using Machine Learning: A Comparison Of The Performance Of Regularized Regression

Published: 27 Dec, 2024

BMC Genomics

Genomic prediction using machine learning: a comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data

Abstract

Background

The accurate prediction of genomic breeding values is central to genomic selection in both plant and animal breeding studies. Genomic prediction involves the use of thousands of molecular markers spanning the entire genome and therefore requires methods able to efficiently handle high dimensional data. Not surprisingly, machine learning methods are becoming widely advocated for and used in genomic prediction studies. These methods encompass different groups of supervised and unsupervised learning methods. Although several studies have compared the predictive performances of individual methods, studies comparing the predictive performance of different groups of methods are rare. However, such studies are crucial for (i) identifying groups of methods with superior genomic predictive performance and (ii) assessing the merits and demerits of such groups of methods relative to each other and to the established classical methods. Here, we comparatively evaluate the genomic predictive performance and informally assess the computational cost of several groups of supervised machine learning methods, specifically, regularized regression methods, deep, ensemble and instance-based learning algorithms, using one simulated animal breeding dataset and three empirical maize breeding datasets obtained from a commercial breeding program.

Results

Our results show that the relative predictive performance and computational expense of the groups of machine learning methods depend upon both the data and target traits and that for classical regularized methods, increasing model complexity can incur huge computational costs but does not necessarily improve predictive accuracy. Thus, despite their greater complexity and computational burden, neither the adaptive nor the group regularized methods clearly improved upon the results of their simple regularized counterparts. This rules out selection of one procedure among machine learning methods for routine use in genomic prediction. The results also show that, because of their competitive predictive performance, computational efficiency, simplicity and therefore relatively few tuning parameters, the classical linear mixed model and regularized regression methods are likely to remain strong contenders for genomic prediction.

Conclusions

The dependence of predictive performance and computational burden on target datasets and traits calls for increasing investments in enhancing the computational efficiency of machine learning algorithms and computing resources.

Keywords

Genomic prediction, Genomic selection, Breeding value, Predictive accuracy, Predictive ability, High dimensional data, Supervised machine learning methods

Background

Rapid advances in genotyping and phenotyping technologies have enabled widespread and growing use of genomic prediction (GP). The very high dimensional nature of both genotypic and phenotypic data, however, is increasingly limiting the utility of the classical statistical methods. As a result, machine learning (ML) methods able to efficiently handle high dimensional data are becoming widely used in GP. This is especially so because, compared to many other methods used in GP, ML methods possess the significant advantage of being able to model nonlinear relationships between the response and the predictors and complex interactions among predictor variables. However, this often comes at the price of a very high computational burden. Computational cost is less likely to present serious challenges if the number of SNPs in a dataset is relatively modest, but it can become increasingly debilitating as the number of markers grows to millions or even tens of millions. Future advances in the computational efficiency of machine learning algorithms, or the use of high-performance or more efficient programming languages, may progressively ameliorate this limitation. Given their growing utility and popularity, it is important to establish the relative predictive performance of different groups of ML methods in GP. Even so, the formal comparative evaluation of the predictive performance of groups of ML methods has attracted relatively little attention. The rising importance of ML methods in plant and animal breeding research and practice increases both the urgency and importance of evaluating the predictive performance of groups of ML methods relative to each other and to classical methods. This can facilitate identification of groups of ML methods that balance high predictive accuracy with low computational cost for routine use with high dimensional phenotypic and genomic data, such as in GP.

ML is perhaps one of the most widely used branches of contemporary artificial intelligence. Using ML methods facilitates automation of model building, learning and efficient and accurate predictions. ML algorithms can be subdivided into two major classes: supervised and unsupervised learning algorithms. Supervised regression ML methods encompass regularized regression methods, deep, ensemble and instance-based learning algorithms. Supervised ML methods have been successfully used to predict genomic breeding values for unphenotyped genotypes, a crucial step in genome-enabled selection [1,2,3,4,5,6,7,8,9]. Furthermore, several studies have assessed the relative predictive performance of supervised ML methods in GP, including two ensemble methods and one instance-based method [5]; four regularized and two adaptive regularized methods [6]; three regularized and five regularized group methods [9] and several deep learning methods [1,2,3,4, 8].

However, no study has comprehensively evaluated the comparative predictive performance of all these groups of methods relative to each other or to the classical regularized regression methods. We therefore rigorously evaluate the comparative predictive performance as well as the computational complexity or cost of three groups of popular and state-of-the-art ML methods for GP using one simulated animal dataset and three empirical datasets obtained from a commercial maize breeding program. We additionally offer brief overviews of the mathematical properties of the methods with emphasis on their salient properties, strengths and weaknesses and relationships with each other and with the classical regularization methods. While we offer a somewhat comprehensive review of genomic prediction methods with a specific emphasis on ML, our contribution extends to showcasing novel findings derived from comparative assessments of ML techniques across both real and simulated datasets.

Besides ML methods, Bayesian methods are also becoming widely used for genomic prediction [3, 8, 10]. So, even though our goal is not to provide an exhaustive review of all genomic prediction methods, we offer two Bayesian methods for benchmarking the performance of the ML methods.

The rest of the paper is organized as follows. First, we present the synthetic and real datasets. Second, we detail the methods compared in this study. Next, the results from the comparative analyses of the data are presented. Finally, a discussion of the results and closing remarks follow.

Data

Simulated (animal) data

We consider one simulated dataset [9], an animal breeding outbred population simulated for the 16th QTLMAS Workshop 2012 (Additional file 1). The simulation models used to generate the data are described in detail in [11] and are therefore not reproduced here. The dataset consists of 4020 individuals genotyped for 9969 SNP markers. Out of these, 3000 individuals were phenotyped for three quantitative milk traits and the remaining 1020 were not phenotyped (see [9] for details). The goal of the analysis of the simulated dataset is to obtain predicted genomic breeding values (PGBVs) for the 1020 unphenotyped individuals using the available genomic information. The simulated dataset also provides true genomic breeding values (TGBVs) for the 1020 genotypes for all the traits, so the PGBVs can be compared directly against them.
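
The prediction task described above can be illustrated with a minimal sketch. It does not use the QTLMAS data or the authors' code: the genotypes, marker effects, toy dimensions, the 0/1/2 allele coding, the penalty value and the helper name `ridge_predict` are all assumptions made for illustration. A plain ridge regression (one of the simplest regularized methods) is fitted on the "phenotyped" individuals, PGBVs are computed for the "unphenotyped" ones, and accuracy is summarized as the Pearson correlation between PGBVs and TGBVs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed). The dataset described above has 3000 phenotyped
# and 1020 unphenotyped individuals genotyped for 9969 SNPs.
n_train, n_test, n_snps, n_qtl = 400, 200, 800, 20

# Synthetic genotypes coded as 0/1/2 allele copies (an assumed coding).
X = rng.integers(0, 3, size=(n_train + n_test, n_snps)).astype(float)

# Hypothetical sparse marker effects; the "true" breeding value is X @ beta.
beta = np.zeros(n_snps)
qtl = rng.choice(n_snps, size=n_qtl, replace=False)
beta[qtl] = rng.normal(0.0, 1.0, size=n_qtl)
tgbv = X @ beta

# Phenotype = breeding value + noise with equal spread (heritability near 0.5).
y = tgbv + rng.normal(0.0, tgbv.std(), size=tgbv.shape)

X_train, X_test = X[:n_train], X[n_train:]
y_train = y[:n_train]  # phenotypes of the test individuals are never used


def ridge_predict(X_train, y_train, X_test, lam):
    """Ridge regression via its dual form, which solves an n x n system.

    The dual form is convenient when markers outnumber individuals.
    """
    mu_x = X_train.mean(axis=0)
    mu_y = y_train.mean()
    Xc = X_train - mu_x
    # Marker effects: Xc^T (Xc Xc^T + lam * I)^{-1} (y - mean(y))
    K = Xc @ Xc.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train - mu_y)
    effects = Xc.T @ alpha
    return (X_test - mu_x) @ effects + mu_y


# The penalty is an arbitrary, untuned choice; in practice it would be
# selected by cross-validation on the phenotyped individuals.
pgbv = ridge_predict(X_train, y_train, X_test, lam=float(n_snps))

# Predictive accuracy: correlation between predicted and true breeding values.
accuracy = np.corrcoef(pgbv, tgbv[n_train:])[0, 1]
print(f"Correlation between PGBVs and TGBVs: {accuracy:.2f}")
```

Because true breeding values are known only for simulated data, this correlation with TGBVs is available only in the simulated setting; with empirical data, predictions would have to be compared against observed phenotypes instead.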
