Information Retrieval Tools
Information Retrieval MSc course NWI-I00041-2017-KW1-V
Many different types of retrieval technologies are freely available online. I list a few here, but this is not meant to be an exhaustive list, and you are free to mix and match whichever solutions help you achieve the goals of your practical assignment project.
Open source IR tools
Elastic and SOLR are widely used open source systems that provide information retrieval functionality (and a lot more: in practice, the two are often used as an easy-to-deploy and scalable JSON document store rather than as a retrieval system). Both are layers on top of Lucene, which provides the indexing and retrieval functionality.
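To give an impression of the "JSON document store plus retrieval" usage, here is a minimal sketch of how one might talk to Elastic over its REST interface from Python. It only builds the requests and does not send them. The host `http://localhost:9200`, the index name `course-docs` and the helper names are assumptions made for illustration, and the `_doc` URL layout follows recent Elasticsearch versions (older releases used a type name in the path), so check the documentation of the version you install.

```python
import json
import urllib.request

# Assumptions: a local Elasticsearch node and a hypothetical index name.
ES_URL = "http://localhost:9200"
INDEX = "course-docs"


def index_request(doc_id, document):
    """Build (but do not send) a request that stores one JSON document."""
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(field, text, size=10):
    """Build (but do not send) a full-text match query on a single field."""
    body = {"query": {"match": {field: text}}, "size": size}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a node running, passing either request to `urllib.request.urlopen` would execute it; the response is a JSON document that can be parsed with `json.loads`.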
If you want to use Lucene in your work, I recommend taking a look at the LIARR workshop material (the repository from the Lucene for Information Access and Retrieval Research workshop held at SIGIR 2017).
Many research systems have been developed over the years; a mature project like Indri is known to run in production at many different organisations. A good starting point is to investigate the systems that took part in the IR Reproducibility workshop, as these are likely to work correctly and be reasonably easy to install.
A new kid on the block is Nordlys, which specifically targets entity retrieval tasks but has not been tested as thoroughly as the alternatives mentioned above. The team behind Nordlys recommends checking out the following resources:
- elastic.py: this part of the code talks to Elastic, performs indexing and search, and gets all raw statistics from the index;
- scorer.py: computes statistics for various retrieval models (LM, MLM, and PRMS);
- Nordlys documentation related to the implementation of the baselines.
They also provide a REST API that you might consider using instead of building an index yourself.
For example, request http://api.nordlys.cc/er?q=keith+van+rijsbergen&1st_num_docs=100&model=lm to find entities related to “Keith van Rijsbergen” using language modelling.
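The same request can be assembled programmatically. The sketch below rebuilds the example URL from its parts with Python's standard library; the endpoint and the parameter names `q`, `1st_num_docs` and `model` are taken from the URL above, while the function names are made up for illustration. That the service answers with JSON is an assumption, and the API may not be reachable at all times.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the example URL in the text.
NORDLYS_ER_ENDPOINT = "http://api.nordlys.cc/er"


def entity_retrieval_url(query, num_docs=100, model="lm"):
    """Build the entity retrieval URL for a free-text query."""
    params = {"q": query, "1st_num_docs": num_docs, "model": model}
    # urlencode turns spaces into '+' and joins the pairs with '&'.
    return NORDLYS_ER_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_entities(query, **kwargs):
    """Send the request and parse the reply.

    Assumption: the response body is JSON; inspect a raw reply first.
    """
    url = entity_retrieval_url(query, **kwargs)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `entity_retrieval_url("keith van rijsbergen")` reproduces the example URL; `fetch_entities` needs network access and a running service.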
The Open-Source IR Replicability Challenge (OSIRRC 2019), organized as a workshop at SIGIR 2019, has created an easy-to-(re-)use experimental IR research framework.
The OSIRRC 2019 Image Library repository catalogs the images that were submitted to OSIRRC 2019. The time invested in learning to use this catalog can pay off very well, as it gives you immediate access to a wide variety of systems that have all been demonstrated to be capable of running standard IR test collections.
Roll your own
An alternative is to develop the tools you need to process the data yourself. The advantage is complete control; the disadvantage is that it may turn out a little harder than expected. Either way, the learning experience will be hard to beat!
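To get a feeling for the amount of work involved, here is a minimal sketch of a home-made retrieval system: an in-memory inverted index with query-likelihood scoring under Dirichlet smoothing. The class name `TinyIndex`, the tokenizer and the default `mu=2000` are illustrative choices, not part of any of the tools mentioned on this page, and a real solution would also need persistence, proper text normalisation and scalability.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """In-memory inverted index (illustrative, not optimised)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens
        self.coll_tf = Counter()           # term -> frequency in the collection

    def add(self, doc_id, text):
        counts = Counter(tokenize(text))
        self.doc_len[doc_id] = sum(counts.values())
        for term, tf in counts.items():
            self.postings[term][doc_id] = tf
            self.coll_tf[term] += tf

    def search(self, query, mu=2000.0, k=10):
        """Rank documents by Dirichlet-smoothed query log-likelihood."""
        # Query terms that never occur in the collection are ignored.
        terms = [t for t in tokenize(query) if t in self.postings]
        coll_len = sum(self.doc_len.values())
        # Only documents containing at least one query term are scored.
        candidates = set()
        for term in terms:
            candidates.update(self.postings[term])
        scored = []
        for doc_id in candidates:
            score = 0.0
            for term in terms:
                tf = self.postings[term].get(doc_id, 0)
                p_coll = self.coll_tf[term] / coll_len
                score += math.log((tf + mu * p_coll) / (self.doc_len[doc_id] + mu))
            scored.append((score, doc_id))
        scored.sort(reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]
```

Usage is a matter of calling `add(doc_id, text)` for every document and then `search(query)`, which returns `(doc_id, score)` pairs with the best match first.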
Go to the DBPedia-Entity V2 test collection web resources.
Return to the Practical Assignment page.