Internships at WizeNoze for NLP, ML and IR students

WizeNoze is looking for talented research interns to join their team of developers and scientists. They offer a number of interesting research internships in Natural Language Processing, Machine Learning and Information Retrieval. Below is a list of suggested projects, and expectations regarding potential interns. Contact Thijs Westerveld to apply.

Most of the suggested projects are especially suitable for a research internship, but MSc thesis projects can be formulated with the suggested use case as a starting point. After acceptance at WizeNoze, contact Arjen P. de Vries to arrange the internal supervision at Radboud University.

About WizeNoze

WizeNoze aims to make online information accessible and meaningful for all. We build state-of-the-art text classification, text analysis and search technology to create and access content that is readable and reliable. With the WizeNoze content collection, children and young people have access to a wide variety of reliable content at their own reading level. The WizeNoze editing technology, on the other hand, helps authors, companies and institutions simplify their texts and write for these lower literacy levels.

WizeNoze’s team consists of a wide variety of specialists in NLP, ML, IR, child computer interaction, education and business. As an intern you will be part of this team, collaborating with us to improve our technology. WizeNoze is based in downtown Amsterdam.

Requirements

  • MSc student in computer science, artificial intelligence or related field
  • Knowledge of natural language processing, information retrieval and/or machine learning (depending on internship topic)
  • Programming experience; preferably familiar with Java and Groovy
  • Ability to work independently
  • Outspoken; willing to participate in discussions

ML: Single sentence classification

Currently our readability classifier is trained to predict the reading level of a complete document. In many scenarios we would also like to know the reading level of individual sentences within the document. This project is about building a classifier for individual sentences. Two of the big challenges are how to make use of our training corpus, which consists of labelled documents rather than labelled sentences, and whether the features we currently use can be applied directly or need to be adapted.
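
To give a flavour of a possible starting point (not our actual classifier), here is a minimal Python sketch that simply propagates each document's label to its sentences and trains on those noisy sentence labels; the toy corpus, the regex sentence splitter and the scikit-learn pipeline are illustrative assumptions only.

```python
# Minimal sketch: propagate each document's reading level to its sentences and
# train a sentence-level classifier on the resulting (noisy) labels.
# The tiny corpus, sentence splitter and features are illustrative only.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def sentence_training_data(documents):
    sentences, labels = [], []
    for text, level in documents:
        for sent in re.split(r"(?<=[.!?])\s+", text):
            if sent.strip():
                sentences.append(sent)
                labels.append(level)  # document label used as a weak sentence label
    return sentences, labels

documents = [
    ("The cat sat on the mat. It was warm.", "level-1"),
    ("Photosynthesis converts light energy into chemical energy. Glucose stores it.", "level-4"),
]
X, y = sentence_training_data(documents)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict(["The dog ran to the mat."]))
```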

ML/NLP: Entity familiarity classification

Our editing tools identify difficult terms and concepts. Named entities are identified, but have so far been excluded from the difficulty analysis. This project aims to compute a fame or familiarity score for named entities based on statistics from our content collection.
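
A very simple baseline for such a score could be corpus frequency. The sketch below assumes an English spaCy model and a toy corpus standing in for the WizeNoze content collection; it illustrates the idea, it is not our implementation.

```python
# Minimal sketch: estimate how "familiar" a named entity is from how often it
# occurs in a (placeholder) content collection. Assumes the small English
# spaCy model is installed: python -m spacy download en_core_web_sm
import math
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_counts(texts):
    counts = Counter()
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            counts[ent.text.lower()] += 1
    return counts

def familiarity(entity, counts, total):
    # log-scaled relative frequency; 0.0 means the entity never occurs
    c = counts.get(entity.lower(), 0)
    return math.log1p(c) / math.log1p(total) if total else 0.0

corpus = ["Barack Obama visited Amsterdam.", "Amsterdam is the capital of the Netherlands."]
counts = entity_counts(corpus)
total = sum(counts.values())
print(familiarity("Amsterdam", counts, total))
```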

ML/IR/NLP: Knowledge graph

Many modern search engines show semi-structured, semantic information from a knowledge graph for specific queries. We would like to add similar functionality to the WizeNoze search engines. In this project, you will investigate how we can combine existing knowledge bases (e.g. Wikipedia, Wikidata) with the child-specific information from our collections.
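
As an impression of where such a project could start, the sketch below queries Wikidata's public SPARQL endpoint for an entity label; the query and result handling are illustrative only, and combining this kind of open data with our own collections is the actual research question.

```python
# Minimal sketch: fetch candidate entities and descriptions from Wikidata's
# public SPARQL endpoint for a given label. The query shape is illustrative.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def describe(entity_label, lang="en", limit=5):
    query = """
    SELECT ?item ?itemDescription WHERE {
      ?item rdfs:label "%s"@%s .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "%s". }
    } LIMIT %d
    """ % (entity_label, lang, lang, limit)
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "wizenoze-intern-sketch/0.1"})
    resp.raise_for_status()
    return [(b["item"]["value"], b.get("itemDescription", {}).get("value", ""))
            for b in resp.json()["results"]["bindings"]]

print(describe("Amsterdam"))
```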

IR: Integrated search / Wikify

Automatically identify concepts in any text that could be linked to WizeNoze search results. The task is similar to the automatic addition of cross-links in Wikipedia pages, but we need to take into account reading level and relevance for children. This could result in a nice browser plugin and/or additional functionality for API users that want to offer integrated search.
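
A naive version of this idea could look like the sketch below: extract candidate concepts as noun chunks and look each one up in a search backend. The function search_wizenoze is a hypothetical placeholder, not the real WizeNoze API.

```python
# Minimal sketch of the "wikify" idea: extract candidate concepts from a text
# with spaCy noun chunks, then look each one up in a search backend.
import spacy

nlp = spacy.load("en_core_web_sm")

def search_wizenoze(query, reading_level):
    # Hypothetical placeholder: would call the WizeNoze search API and
    # return results filtered on reading level.
    return [{"title": f"About {query}", "url": f"https://example.org/{query}"}]

def wikify(text, reading_level="B1", max_links=5):
    doc = nlp(text)
    links = {}
    for chunk in doc.noun_chunks:
        candidate = chunk.text.strip()
        results = search_wizenoze(candidate, reading_level)
        if results:
            links[candidate] = results[0]["url"]
        if len(links) >= max_links:
            break
    return links

print(wikify("The water cycle describes how water moves between the ocean and the atmosphere."))
```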

IR: Regional queries

WizeNoze focuses on informational queries, so classic local search (find a restaurant or petrol station in this area) is not of interest to us. Still, some informational queries have clear local or regional results, e.g. searching for prime minister or elections should return different results for Dutch, Flemish or British users of the search technology. Can we automatically identify these types of queries, and how can we balance regional preferences in our ranking with the other ranking components like relevance, readability and recency?
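
One (deliberately simplistic) way to think about the balancing question is a weighted combination of ranking components, as in the sketch below; the weights and per-document scores are made up for illustration.

```python
# Minimal sketch of balancing ranking components: a weighted linear combination
# of relevance, readability, recency and a regional boost. Illustrative values only.
def score(doc, user_region, weights=None):
    w = weights or {"relevance": 0.6, "readability": 0.2, "recency": 0.1, "regional": 0.1}
    regional = 1.0 if doc.get("region") == user_region else 0.0
    return (w["relevance"] * doc["relevance"]
            + w["readability"] * doc["readability"]
            + w["recency"] * doc["recency"]
            + w["regional"] * regional)

docs = [
    {"title": "Dutch prime minister", "region": "NL", "relevance": 0.9, "readability": 0.8, "recency": 0.7},
    {"title": "British prime minister", "region": "UK", "relevance": 0.9, "readability": 0.8, "recency": 0.9},
]
ranked = sorted(docs, key=lambda d: score(d, user_region="NL"), reverse=True)
print([d["title"] for d in ranked])
```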

IR: Mixed result pages

Provide WizeNoze users with a mix of image, video, news and general web results in a single search result page. How do we determine when we need to mix in these other result types in the main “pages” results tab? Federated search techniques will likely play a role in this project.
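
As a strawman for result merging, the sketch below min-max normalises scores per vertical and interleaves by normalised score; the example results are placeholders, and a real merger would need calibrated scores and vertical selection.

```python
# Minimal sketch of merging verticals (pages, images, video, news) into one
# result list. Note that a single-result vertical normalises to 0 here; a real
# merger needs calibrated, comparable scores.
def normalise(results):
    scores = [r["score"] for r in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [{**r, "score": (r["score"] - lo) / span} for r in results]

def merge(verticals, k=10):
    pool = []
    for name, results in verticals.items():
        for r in normalise(results):
            pool.append({**r, "vertical": name})
    return sorted(pool, key=lambda r: r["score"], reverse=True)[:k]

verticals = {
    "pages": [{"title": "Volcano facts", "score": 12.3}, {"title": "What is lava?", "score": 9.1}],
    "video": [{"title": "Volcano eruption video", "score": 0.83}],
    "images": [{"title": "Volcano diagram", "score": 0.91}],
}
for r in merge(verticals):
    print(r["vertical"], r["title"], round(r["score"], 2))
```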

NLP: Automatically rewrite passive voice

Sentences written in the passive voice can be hard to understand. Our tools recognize them, but the next step is to provide suggestions for rewriting. This project involves syntactic parsing, and hand-crafting and/or machine learning of rewriting rules. An extension of the project could be to decide when to rewrite: some idiomatic passive phrases don’t make much sense in active voice.
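
Detection is the easier half; the sketch below uses spaCy dependency labels (nsubjpass, auxpass, agent) to pull out the pieces a rewriter would need. It assumes an English model and ignores tense, agreement and articles.

```python
# Minimal sketch: find a passive clause and extract (patient, verb lemma, agent),
# the raw material a rewriting component could work from.
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_parts(sentence):
    doc = nlp(sentence)
    for token in doc:
        if token.dep_ == "nsubjpass":              # passive subject, e.g. "ball"
            verb = token.head
            agent = None
            for child in verb.children:
                if child.dep_ == "agent":          # the "by" phrase
                    agent = next((t for t in child.children if t.dep_ == "pobj"), None)
            return token.text, verb.lemma_, agent.text if agent else None
    return None

print(passive_parts("The ball was thrown by the boy."))
# -> ('ball', 'throw', 'boy'); a rewriting step could turn this into "The boy threw the ball."
```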

NLP: Compound and multi-word difficulty prediction

Our text analysis tools identify difficult terms in a text, but what if a term is a compound or an idiomatic multi-word expression? For compounds, we first need to identify the parts of the term. For both compounds and multi-word expressions, we need to predict the difficulty. If all parts are common terms for a given reading level, does that mean the compound is easy as well? Probably not in all cases.
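
For the compound-splitting step, a simple greedy lookup against a vocabulary is a reasonable baseline, as in the sketch below; the tiny Dutch lexicon and the handling of the linking -s- are illustrative only.

```python
# Minimal sketch: split a (Dutch-style) compound into known parts with a
# recursive lookup against a vocabulary. A compound could then be called
# "easy" only if every part is known at the target reading level.
def split_compound(word, vocab, min_len=3):
    word = word.lower()
    if word in vocab:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab:
            rest = split_compound(tail, vocab, min_len)
            if rest:
                return [head] + rest
        if head.endswith("s") and head[:-1] in vocab:   # linking -s-
            rest = split_compound(tail, vocab, min_len)
            if rest:
                return [head[:-1]] + rest
    return None

vocab = {"fiets", "winkel", "verkoop"}           # tiny illustrative lexicon
print(split_compound("fietswinkel", vocab))      # ['fiets', 'winkel']
print(split_compound("verkoopscijfers", vocab))  # None: 'cijfers' is unknown
```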

NLP: Abstract language identification

Especially on the lower reading levels, a text needs to be as concrete and specific as possible. But what makes a text concrete or abstract? This project is about the identification of abstract language. Step one is to identify abstract sentences or phrases, but perhaps we could use the WizeNoze content collection to suggest concrete examples for explaining/rewriting the abstract language.
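
One possible operationalisation of "abstract" is low average word concreteness, for instance using the Brysbaert et al. concreteness norms (a 1-5 scale); the sketch below uses a tiny inline lexicon and an arbitrary threshold purely as placeholders.

```python
# Minimal sketch: flag "abstract" sentences by averaging per-word concreteness
# ratings. The inline lexicon and threshold stand in for a real resource.
CONCRETENESS = {"dog": 4.9, "table": 4.9, "freedom": 1.9, "justice": 1.5, "idea": 1.6}

def concreteness_score(sentence):
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    rated = [CONCRETENESS[w] for w in words if w in CONCRETENESS]
    return sum(rated) / len(rated) if rated else None

def is_abstract(sentence, threshold=3.0):
    score = concreteness_score(sentence)
    return score is not None and score < threshold

print(is_abstract("Justice is an important idea."))   # True
print(is_abstract("The dog sat under the table."))    # False
```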

ML/NLP/UI: Office 365 add-in for WizeNote

Integrate the WizeNote classification and text analytics technology into Office 365, allowing users to directly analyse and simplify their texts from MS Word or PowerPoint.

ML/NLP/UI: Batch WizeScan / Quickscan app

With WizeScan, users can analyse the readability of a pasted text or a URL. We would like to extend this functionality to batch analysis of multiple documents. The task is to integrate existing crawling, text extraction and text analysis technology into a WizeScan pipeline that analyses sets of documents, supporting different input formats, ranging from zipped archives with PDF, DOCX or HTML files to lists of URLs and starting points for a web crawl. The output should be a nicely formatted report with (average) readability information for the batch.
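
A skeleton of such a pipeline might look like the sketch below; extract_text is deliberately naive, and analyse_readability is a hypothetical stand-in for the existing WizeNoze analysis service.

```python
# Minimal sketch of a batch pipeline: take a mix of URLs and local files,
# extract plain text, run a readability analysis on each document, and print
# an aggregate report.
import re
from pathlib import Path
from statistics import mean
import requests

def extract_text(source):
    if source.startswith(("http://", "https://")):
        html = requests.get(source, timeout=10).text
        return re.sub(r"<[^>]+>", " ", html)          # naive tag stripping
    return Path(source).read_text(encoding="utf-8", errors="ignore")

def analyse_readability(text):
    # Hypothetical placeholder: average sentence length as a crude proxy
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return mean(len(s.split()) for s in sentences) if sentences else 0.0

def batch_report(sources):
    scores = {src: analyse_readability(extract_text(src)) for src in sources}
    print(f"{'source':60} avg sentence length")
    for src, score in scores.items():
        print(f"{src:60} {score:.1f}")
    print(f"{'BATCH AVERAGE':60} {mean(scores.values()):.1f}")

batch_report(["https://example.org"])   # local paths and archives would be added here
```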