107. Exploring Information Retrieval: From Sparse to Dense Vector Representations

Aleksander Smywiński-Pohl, PhD – Computer Science Institute, AGH University of Krakow, Poland

Abstract

This presentation surveys approaches to information retrieval, comparing classical sparse text representations with modern dense text representations.

In the first part, we cover classical information retrieval methods as implemented in software tools such as Elasticsearch and Solr. We examine the BM25 model in detail and highlight the challenges that arise with inflectional languages.
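As a rough illustration of the sparse approach, the sketch below scores a toy corpus with the BM25 formula. It is a minimal example under stated assumptions, not the Elasticsearch or Solr implementation: the function name bm25_scores, the parameter values k1=1.5 and b=0.75, the whitespace tokenization, and the three Polish sentences are all chosen here for illustration only.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # exact-match only: unseen surface forms add nothing
            # Smoothed IDF variant (1 + ... inside the log) that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus: "kot" (cat, nominative) vs. "kota" (genitive).
docs = [
    "kot śpi na kanapie".split(),
    "nie ma kota w domu".split(),
    "pies biega po ogrodzie".split(),
]
scores = bm25_scores(["kot"], docs)
print(scores)
```

The second document is also about a cat, but it receives a score of zero because the inflected form "kota" does not match the query token "kot". This is the kind of mismatch that stemming or lemmatization is typically used to address in sparse retrieval.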

The second part covers contemporary advances in information retrieval. We explore text representations based on the Transformer architecture and walk through a complete search pipeline comprising a dense retriever, a re-ranker, and a question-answering reader. We also present models that generate dense representations, such as DPR and E5.
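The retrieval step of such a pipeline can be sketched as a nearest-neighbour search over embedding vectors. The example below is only a schematic under stated assumptions: the three-dimensional vectors are made up by hand, whereas in practice they would come from an encoder such as DPR or E5, and the names cosine, retrieve, doc_vecs, and query_vec are hypothetical. The re-ranker and reader stages are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made stand-ins for encoder outputs (assumption, not real embeddings).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

top = retrieve(query_vec, doc_vecs, k=2)
print(top)
```

In a full pipeline, the candidates returned by retrieve would be passed to a re-ranker and then to a reader. A brute-force scan like this one is only practical for small collections; larger ones usually rely on an approximate nearest-neighbour index.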

In closing, we compare the performance of sparse and dense retrievers and offer concluding remarks on both families of methods.

About the author

Dr. Aleksander Smywiński-Pohl is a researcher in natural language processing. He received his Ph.D. in 2015 from AGH University of Science and Technology in Krakow for a thesis entitled "Automatic extraction of semantic relations from Polish texts". His primary research interests concentrate on applying modern NLP techniques to a broad range of practical problems. In 2017, he started a research project funded by the Polish National Center for Research and Development (NCBR) devoted to the construction of an intelligent legal information system called "Lemkin". He has also participated in other projects aimed at building a Polish language model for Automatic Speech Recognition, sentiment analysis of user-generated content, monitoring the content of public media, and cyberbullying and self-harm detection. He currently works with Tomer Libal on a grant sponsored by FNR and NCBR on automatic question answering for court cases.