Zuzanna Karwowska, Oliver Aasmets, Estonian Biobank research team, Tomasz Kosciolek, Elin Org
The effective classification of host phenotypes through microbiome data is essential for the progression of microbiome-centered treatments, where machine learning serves as a pivotal tool. The inherent complexity of the gut microbiome, coupled with issues like data sparsity, compositionality, and variability across populations, poses substantial challenges. Although transforming microbiome data can mitigate some of these difficulties, its application in machine learning endeavors remains largely underinvestigated.
In our study, we examined more than 8500 samples across 24 shotgun metagenomic datasets and found that healthy and diseased states can be distinguished from microbiome data largely regardless of the algorithm or data transformation used. Presence-absence transformations performed as well as abundance-based ones, and accurate classification was possible with only a limited set of predictive features. Although classification accuracy was similar across transformations, the most important features identified varied significantly, which underscores the need to reevaluate how biomarkers are detected through machine learning.
Our results demonstrate that while microbiome data transformations have a substantial impact on feature selection, they do not significantly alter classification accuracy. This indicates that although the classification process is stable across various transformations, careful consideration is necessary in the selection of features for biomarker discovery using machine learning. This study not only contributes valuable insights into the application of machine learning to microbiome data but also points to crucial areas for future research.
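To make the contrast between presence-absence and abundance-based transformations concrete, here is a minimal sketch applied to a single made-up sample. The three functions are common choices in microbiome analysis and are shown for illustration only; the abstract does not list the exact transformations evaluated, and the function names, the example counts, and the pseudocount of 0.5 are all assumptions.

```python
import math


def presence_absence(sample):
    """Keep only whether each taxon was detected (1) or not (0)."""
    return [1 if count > 0 else 0 for count in sample]


def relative_abundance(sample):
    """Scale counts so they sum to 1 (assumes a non-zero total)."""
    total = sum(sample)
    return [count / total for count in sample]


def clr(sample, pseudocount=0.5):
    """Centered log-ratio: log counts minus their mean log.

    The pseudocount (an assumed value) avoids taking log(0) for absent taxa.
    """
    logs = [math.log(count + pseudocount) for count in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical read counts for four taxa in one sample.
sample = [10, 0, 5, 85]

pa = presence_absence(sample)
ra = relative_abundance(sample)
clr_values = clr(sample)

print(pa)
print(ra)
print(clr_values)
```

Each transformation turns the same counts into a different feature vector. That is one plausible reason a classifier can reach similar accuracy while ranking different taxa as most important.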
DOI: 10.1186/s40168-024-01996-6