Avati, Valentina; Blaszkiewicz, Milosz; Bocchi, Enrico; Canali, Luca; Castro, Diogo; Cervantes, Javier; Grzanka, Leszek; Guiraud, Enrico; Kaspar, Jan; Kothuri, Prasanth; Lamanna, Massimo; Malawski, Maciej; Mnich, Aleksandra; Moscicki, Jakub; Murali, Shravan; Piparo, Danilo; Tejedor, Enric
The High-Energy Physics (HEP) community faces new data processing challenges caused by the expected growth of data resulting from the upgrade of the LHC accelerator. These challenges drive the demand for new approaches to data analysis. In this paper, we present a new declarative programming model that extends the popular ROOT data analysis framework, together with its distributed processing capability based on Apache Spark. The developed framework enables high-level operations on the data, familiar from other big data toolkits, while preserving compatibility with existing HEP data files and software. In our experiments with a real analysis of TOTEM experiment data, we evaluate the scalability of this approach and its prospects for interactive processing of such large data sets. Moreover, we show that analysis code developed with the new model is portable between a production cluster at CERN and an external cluster hosted in the Helix Nebula Science Cloud, thanks to the Science Box bundle of services.
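As an illustration of the declarative style described above, the following is a toy, pure-Python sketch. It uses neither ROOT nor Apache Spark. The class `LazyFrame`, its methods `filter`, `define` and `histogram`, and the sample data are hypothetical names introduced only for this example, and are not the interface of the framework presented in the paper. The sketch shows the general idea: the analysis is declared as a chain of high-level operations, nothing runs until a result is requested, and the result is obtained by processing each data partition independently (map) and merging the partial histograms (reduce), which is the pattern a distributed backend could parallelize.

```python
from functools import reduce


class LazyFrame:
    """Toy declarative dataset (hypothetical, not the ROOT or Spark API).

    Operations are only recorded; they run when a result is requested.
    """

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of partitions, each a list of dict rows
        self.ops = ops                # recorded (kind, argument) operations

    def filter(self, predicate):
        # Record a row selection; no data is touched here.
        return LazyFrame(self.partitions, self.ops + (("filter", predicate),))

    def define(self, name, func):
        # Record a new column computed from each row; no data is touched here.
        return LazyFrame(self.partitions, self.ops + (("define", (name, func)),))

    def _process(self, rows):
        # Apply the recorded operations, in order, to the rows of one partition.
        for row in rows:
            row = dict(row)  # copy so the input data is not modified
            keep = True
            for kind, arg in self.ops:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:
                    name, func = arg
                    row[name] = func(row)
            if keep:
                yield row

    def histogram(self, column, nbins, lo, hi):
        # Trigger the computation: one partial histogram per partition (map),
        # then a bin-wise sum of the partial histograms (reduce).
        width = (hi - lo) / nbins

        def mapper(rows):
            counts = [0] * nbins
            for row in self._process(rows):
                x = row[column]
                if lo <= x < hi:
                    counts[int((x - lo) / width)] += 1
            return counts

        def reducer(a, b):
            return [x + y for x, y in zip(a, b)]

        return reduce(reducer, map(mapper, self.partitions), [0] * nbins)


# Made-up sample data, split into two partitions.
partitions = [
    [{"x": 1.0, "y": 2.0}, {"x": -1.0, "y": 0.5}],
    [{"x": 3.0, "y": 1.0}, {"x": 0.5, "y": 0.5}],
]

# Declare the analysis; the event loop runs only inside histogram().
frame = LazyFrame(partitions).filter(lambda r: r["x"] > 0).define("s", lambda r: r["x"] + r["y"])
counts = frame.histogram("s", nbins=5, lo=0.0, hi=5.0)
print(counts)
```

Because each partition is processed independently and the merge step is a plain bin-wise sum, the same declaration gives the same histogram however the rows are split across partitions. In this toy version the map step runs sequentially in a single process.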