Information Retrieval Tools Information Retrieval MSc course NWI-I00041
Introduction
Many different types of retrieval technologies are freely available online. I list a few here, but keep in mind that this is not meant to be an exhaustive list, and you are free to choose which solutions to mix and match to achieve the goals of your practical assignment project.
Open source search engines
Elastic and Solr are widely used open source systems that provide information retrieval functionality, and a lot more: in practice, both are often used as an easy-to-deploy and scalable JSON document store rather than as a retrieval system. Both are layers on top of Lucene, which provides the indexing and retrieval functionality.
If you want to use Lucene in your work, I recommend taking a look at the LIARR workshop material (the repository from the Lucene for Information Access and Retrieval Research workshop held at SIGIR 2017).
University of Waterloo’s Anserini (for Lucene) and Queensland University of Technology’s Elastic4IR seem good starting points.
Other IR/NLP tools
Many research systems have been developed over the years, and a mature project like Indri is known to run in production at many different organisations. A good starting point is to investigate the systems that took part in the IR Reproducibility workshop, as these are likely to work correctly and be reasonably easy to install.
REL is an entity linking toolkit that can be used for annotating documents, queries, and conversations. MMEAD provides the entity annotations of MS MARCO obtained using REL. Nordlys specifically targets entity retrieval tasks, but has not been tested as thoroughly as the alternatives mentioned before. The team behind Nordlys recommends checking out the following resources:
- elastic.py: this part of the code talks to Elastic, performs indexing and search, and gets all raw statistics from the index;
- scorer.py: computes statistics for various retrieval models (LM, MLM, and PRMS);
- Nordlys documentation related to the implementation of the baselines.
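To give an idea of the kind of computation a scorer for the LM retrieval model performs, here is a minimal sketch of query likelihood scoring with Dirichlet smoothing. It is not taken from Nordlys: the function name `lm_dirichlet_score`, the toy documents, and the choice to skip query terms that never occur in the collection are all assumptions made for illustration.

```python
import math
from collections import Counter


def lm_dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of a document under a Dirichlet-smoothed unigram model."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability of the term in the whole collection.
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumption: terms unseen in the collection are skipped.
            continue
        # Smoothed probability of the term given the document.
        p_doc = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_doc)
    return score


# Toy collection (hypothetical data, already tokenized).
docs = {
    "d1": "lucene is a search library".split(),
    "d2": "indri is a research search engine".split(),
}
collection_tf = Counter(term for terms in docs.values() for term in terms)
collection_len = sum(collection_tf.values())

query = ["lucene", "search"]
scores = {
    doc_id: lm_dirichlet_score(query, terms, collection_tf, collection_len)
    for doc_id, terms in docs.items()
}
# Document ids ordered from best to worst match.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In a real system the term and collection statistics would come from the index rather than from in-memory lists, which is the role the Elastic-facing code plays in a setup like the one described above.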
OSIRRC 2019
The Open-Source IR Replicability Challenge (OSIRRC 2019), organized as a workshop at SIGIR 2019, has created an easy-to-(re-)use experimental IR research framework.
The OSIRRC 2019 Image Library repository catalogs the images that were submitted to OSIRRC 2019. The time invested in learning to use this catalog can pay off very well, as it gives you immediate access to a wide variety of systems that have all been demonstrated to be capable of running standard IR test collections.
Roll your own
An alternative is to develop the tools you need to process the data yourself. The advantage is complete control; a disadvantage is that it may turn out a little harder than expected. Either way, the learning experience will be hard to beat!
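As a rough indication of what rolling your own involves, the sketch below builds a tiny in-memory inverted index and ranks documents with BM25. Everything in it is an assumption made for illustration: the whitespace tokenizer, the function names `build_index` and `bm25_search`, the parameter defaults, and the toy documents. A real project would also need proper tokenization, on-disk storage, and evaluation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Return (postings, doc_lengths); postings maps term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        terms = tokenize(text)
        doc_lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, doc_lengths


def bm25_search(query, postings, doc_lengths, k1=1.2, b=0.75):
    """Rank documents for a query with BM25; returns (doc_id, score) pairs, best first."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        plist = postings.get(term, {})
        if not plist:
            continue  # Query term does not occur in any document.
        df = len(plist)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in plist.items():
            # Term frequency saturation with document length normalisation.
            norm = tf + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy collection (hypothetical data).
docs = {
    "d1": "open source search engines",
    "d2": "entity linking toolkit",
    "d3": "search engines index documents for search",
}
postings, doc_lengths = build_index(docs)
results = bm25_search("search index", postings, doc_lengths)
```

Here `results` is a list of `(doc_id, score)` pairs that contains only documents matching at least one query term.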