Alfredo Ibias, Varun Ravi Varma, Karol Capała, Luca Gherardini, Jose Sousa

In personalized health, small datasets with missing data are common. Current Machine Learning methods cannot process such datasets in a meaningful way because they require large volumes of data. To address this problem, we propose the Small and iNcomplete Dataset Analyser (SaNDA). Given the characteristics of these datasets and the criticality of the domain, an explainable method is mandatory so that decisions can be interpreted. SaNDA therefore prioritises explainability over efficiency by design. We evaluated our proposal against Random Forest, as a baseline for explainable methods, and against gcForest, as the state of the art for small datasets. We observed that our proposal outperforms Random Forest when the dataset has more missing data and/or fewer entries, while obtaining less favourable results on larger, well-curated datasets. It is also preferable to gcForest because of its explainability and privacy-protection capabilities. Given the difficulty of obtaining complete, reliable data in healthcare, we consider that our proposal could be useful for practitioners.