103. A Federated Search Workflow Engine for Text Analytics Using Large-Scale BioMedical Literature

Filip Katulski – Clinical Data Science Team, Sano Centre for Computational Medicine, Krakow, PL

Abstract

Modern domain-specific scientific literature is often packaged in extremely large and complex data sets. For scientists searching these repositories for a specific term and the articles that mention it, standard methods can be impractical because of the complexity and computation time involved. The aim of this project is to give researchers a quick and easy way to search data gathered from the scientific literature. This work presents the design and development of such a solution: a federated full-text search architecture that supports research on biomedical literature collections such as PubMed, which contains close to 40 million individual records. The core of the solution is a set of containerized OpenSearch engine instances, created and maintained within a federated system chosen for its flexibility and its ability to adapt quickly to different datasets and infrastructure architectures. Following this principle, users can define their own computing infrastructure according to their needs and capabilities, which could greatly reduce the time and resources spent on research. The project continues to evolve, with ongoing work on its features and use cases. Future work will test the proposed solution on different computing infrastructures and software settings to identify a well-optimized configuration for a drug repurposing knowledge graph use case.

In this seminar, I will introduce the basic concepts and techniques of Federated Search, explain how we might benefit from such systems and how to design them, and present the results of experiments using the Federated Search Engine on sample datasets.
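
As a rough illustration of the fan-out-and-merge idea behind federated search, the sketch below queries several independent nodes in parallel and merges their hits into one ranked list. It is not the project's implementation: `InMemoryNode`, `Hit`, and `federated_search` are hypothetical names, each node is a plain in-memory stand-in for a search engine instance, and the score is a simple term count chosen so that results from different nodes are directly comparable.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Hit:
    node: str    # name of the node that returned the hit
    doc_id: str  # identifier of the matching document
    score: int   # number of times the term occurs in the document


class InMemoryNode:
    """Hypothetical stand-in for one search engine instance in the federation."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # maps doc_id -> document text

    def search(self, term):
        """Return a Hit for every document containing the term as a whole word."""
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score > 0:
                hits.append(Hit(self.name, doc_id, score))
        return hits


def federated_search(nodes, term, top_k=3):
    """Send the query to every node concurrently and merge the results by score."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        per_node = list(pool.map(lambda node: node.search(term), nodes))
    merged = [hit for hits in per_node for hit in hits]
    return heapq.nlargest(top_k, merged, key=lambda hit: hit.score)


# Toy data, invented for illustration only.
nodes = [
    InMemoryNode("node-a", {
        "a1": "aspirin reduces fever",
        "a2": "aspirin and aspirin analogues",
    }),
    InMemoryNode("node-b", {
        "b1": "metformin in diabetes",
        "b2": "low-dose aspirin trial",
    }),
]

results = federated_search(nodes, "aspirin", top_k=2)
for hit in results:
    print(hit.node, hit.doc_id, hit.score)
```

In a real deployment each node would be a remote search service, and relevance scores computed by independent indices are generally not comparable without some normalization, so the merge step would need more care than this sketch suggests.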

About the author

Filip is a student of Computer Science at the Faculty of Computer Science, Electronics and Telecommunications at AGH University of Science and Technology. At the same university he obtained two engineering diplomas, in Power Engineering and in Electronics. His engineering thesis in Electronics focused on the development of a MOX-type gas sensor control system for the Biomarker Analysis Laboratory. His Master’s thesis is being carried out in Prof Hamed’s team in collaboration with Prof Malawski. The topic is the design of a federated text search engine for biomedical text analysis. Filip interned as a Technical Student at CERN in the Data Acquisition team, responsible for the computing infrastructure of the ATLAS project, one of the four major experiments at the Large Hadron Collider. Professionally, he is interested in topics such as Cloud Computing, Large Scale Computing and Natural Language Processing. In his free time, he enjoys travelling, snowboarding, and going to the opera, theatre, and cinema.