175. Simplifying bioinformatics data science with Pyrun

Pedro García López, Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Tarragona, Spain

Bioinformatics with Guardrails: What Python Can (and Can’t) Do

Abstract:

Bioinformatics pipelines are increasingly complex: they must process massive datasets, scale across heterogeneous infrastructures, and remain reproducible and maintainable. While Python has become the de facto language for bioinformatics and data science, turning Python code into scalable, production-ready workflows still requires deep expertise in resource provisioning, parallel execution, and data movement. These areas often distract scientists from the scientific questions they want to answer.

In this talk, we explore how guardrails can be added to bioinformatics workflows using Python-based tools such as Lithops, DataPlug, Data Cockpit, and Pyrun.cloud. These tools abstract away much of the complexity of parallel execution, data ingestion, and deployment, allowing data scientists to scale their workloads with minimal changes to their existing code. We show how Lithops enables the same execution model across environments—from public cloud to on-premise HPC systems via LithopsHPC—providing portability and consistency without forcing users to rewrite pipelines for each platform.
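The idea of keeping one execution model while swapping the environment underneath can be sketched with the standard library alone. The example below is only an illustration of the pattern, not the Lithops API: `gc_content`, `run_serial`, `run_threads`, `BACKENDS`, and `run_pipeline` are hypothetical names, and the thread pool merely stands in for a cloud or HPC backend.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty one)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_serial(func, items):
    """Baseline backend: a plain in-process loop."""
    return [func(item) for item in items]


def run_threads(func, items, workers=4):
    """Hypothetical stand-in for a parallel backend (local threads here)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with items.
        return list(pool.map(func, items))


# Assumption for illustration: backends are selected by name only.
BACKENDS = {"serial": run_serial, "threads": run_threads}


def run_pipeline(sequences, backend="serial"):
    """The scientific function stays the same; only the backend changes."""
    return BACKENDS[backend](gc_content, sequences)


reads = ["ATGC", "GGCC", "ATAT", ""]
serial_result = run_pipeline(reads, backend="serial")
threaded_result = run_pipeline(reads, backend="threads")
```

Both calls run the same `gc_content` function over the same reads and differ only in the `backend` argument, which is the kind of separation between scientific intent and infrastructure that the talk describes.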

However, guardrails are not a silver bullet. Not all performance, cost, or architectural challenges can be fully automated or hidden behind abstractions. We discuss the limits of these tools, highlighting scenarios where close collaboration between data scientists and engineers remains essential. The goal is not to eliminate complexity entirely, but to manage it—allowing bioinformatics practitioners to move faster, more safely, and with clearer boundaries between scientific intent and infrastructure concerns.

About the author:

Small Bio – Pedro García López pedrogarcialopez.es

Pedro García is a professor in the Computer Engineering and Mathematics Department at the Universitat Rovira i Virgili (Spain). He leads the “Cloud and Distributed Systems Lab” research group and coordinates large European research projects. In particular, he leads CloudStars (2023-2027), NearData (2023-2025), and CloudSkin (2023-2025), and he participates as a partner in EXTRACT (2023-2025). He also coordinated FP7 CloudSpaces (2013-2015), H2020 IOStack (2015-2017), and H2020 CloudButton (2019-2022).

During 2019-2020 he worked as a visiting scientist at IBM Watson Research in the Hybrid Clouds group, focusing on serverless technologies. His research topics are distributed systems, cloud computing, data analytics, software architectures, and middleware. He has published more than 100 papers in journals and at prestigious conferences (ACM Middleware, IEEE ICDCS, USENIX FAST, ICDE, IMC). He has served on the scientific committees of conferences such as Middleware, CCGRID, CloudCom, CIC, P2P, CLOSER, and WETICE, among others. He is currently co-organizing the International Workshop on Serverless Computing (WoSC).