Semantic Mapping in Video Retrieval

December 18, 2017

This SIKS symposium aims to bring together researchers from disciplines including data science, linguistics and computer vision to discuss approaches to multimedia information retrieval.


Radboud University, GR 1.109, Grotius building, Comeniuslaan 4, 6525 HP Nijmegen (more information).

Participation is free but please register here to help us plan room capacity and catering.

Preliminary program

9.30 - 10.00 Coffee and welcome
10.00 - 10.30 Zero-example multimedia event search (Chong-Wah Ngo, City University of Hong Kong)
10.35 - 11.05 Activity localization without spatiotemporal supervision (Cees Snoek, University of Amsterdam)
11.15 - 11.45 Automatic Captioning of Video: Evaluation within the TRECVid Framework (Alan Smeaton, Dublin City University)
12.30 - 13.30 PhD defense of Maaike de Boer (thesis in .pdf) (Aula, Comeniuslaan 2)

Afternoon guest lecture

Later that day, 15.45 - 16.45, Sien Moens gives a guest lecture on Text and Multimedia Mining in the Context of E-Commerce to the combined group of students of the IR and Text and Multimedia Mining MSc courses.

The lecture takes place in the Huygens building (HG 00.307). The room has space for a few extra attendees, so let us know if you are interested in attending this lecture as well.

More information

Zero-example multimedia event search

Chong-Wah Ngo (City University of Hong Kong).


Unlike in supervised learning, the queries of zero-example search come with no visual training examples. The queries are described in text, with a few keywords or a paragraph. Zero-example search depends heavily on the scale and accuracy of concept classifiers in interpreting the semantic content of videos. The general idea is to annotate and index videos with concepts during offline processing, and then to retrieve videos whose concepts match the query description. Zero-example search has been studied since the very beginning of TRECVid in 2003. Since then, search systems have grown from indexing around twenty concepts (high-level features) to today's tens of thousands of classifiers. The queries have also evolved from finding a specific thing (e.g., find shots of an airplane taking off) to detecting a complex and generic event (e.g., wedding shower), while dataset size has expanded from less than 200 hours to more than 5,000 hours of video.

This talk presents a brief overview of zero-example search for multimedia events. Interesting problems include (i) how to determine the number of concepts for searching multimedia events, (ii) how to identify query-relevant video fragments for feature pooling, (iii) whether the result of zero-example search can complement supervised learning, and (iv) how to leverage the result for video recounting.


Chong-Wah Ngo is a professor in the Dept. of Computer Science at the City University of Hong Kong. He received his PhD in Computer Science from the Hong Kong University of Science & Technology, and his MSc and BSc, both in Computer Engineering, from Nanyang Technological University, Singapore. Before joining City University of Hong Kong, he was a postdoctoral scholar at the Beckman Institute at the University of Illinois at Urbana-Champaign. His main research interests include large-scale multimedia information retrieval, video computing, multimedia mining and visualization. He is the founding leader of the video retrieval group (VIREO) at City University, a research team that releases open-source software, tools and datasets widely used in the multimedia community. He was an associate editor of IEEE Trans. on Multimedia, and has served as guest editor of IEEE MultiMedia, Multimedia Systems, and Multimedia Tools and Applications. Chong-Wah has organized and served as program committee member of numerous international conferences in the area of multimedia. He is on the steering committee of TRECVid and ICMR (Int. Conf. on Multimedia Retrieval). He was conference co-chair of ICMR 2015 and program co-chair of ACM MM 2019, MMM 2018, ICMR 2012, MMM 2012 and PCM 2013. He also served as the chairman of ACM (Hong Kong Chapter) during 2008-2009. He was named an ACM Distinguished Scientist in 2016.

Activity localization without spatiotemporal supervision

Cees Snoek (University of Amsterdam).


Understanding what activity is happening where and when in video content is crucial for video computing, communication and intelligence. In the literature, the common tactic for activity localization is to learn a deep classifier on hard-to-obtain spatiotemporal annotations and to apply it at test time on an exhaustive set of spatiotemporal candidate locations. Annotating the spatiotemporal extent of an activity in training video is not only cumbersome, tedious, and error-prone, it also does not scale beyond a handful of activity categories. In this presentation, I will highlight recent work from my team at the University of Amsterdam in addressing the challenging problem of activity localization in video without the need for spatiotemporal supervision. We consider three possible solution paths: 1) the first relies on intuitive user interaction with points, 2) the second infers the relevant spatiotemporal location from an activity class label, and finally, 3) the third derives a spatiotemporal activity location from off-the-shelf object detectors and text corpora only. I will discuss the benefits and drawbacks of these three solutions on common activity localization datasets, compare them with alternatives that depend on spatiotemporal supervision, and highlight the potential for future work.


Cees Snoek received the M.Sc. degree in business information systems in 2000 and the Ph.D. degree in computer science in 2005, both from the University of Amsterdam, The Netherlands. He is currently a director of the QUVA Lab, the joint research lab of Qualcomm and the University of Amsterdam, on deep learning and computer vision. He is also a principal engineer/manager at Qualcomm Research Netherlands and an associate professor at the University of Amsterdam. His research interests focus on video and image recognition. He is a recipient of a Veni Talent Award, a Fulbright Junior Scholarship, a Vidi Talent Award, and The Netherlands Prize for Computer Science Research.

Automatic Captioning of Video: Evaluation within the TRECVid Framework

Alan Smeaton (Dublin City University).


Automatic captioning of video, without using any metadata or contextual information, has many applications, from improving search and retrieval to collaborative exploration and the development of new forms of assistive technology. Captioning builds upon the work achieved in image tagging and video tagging, which has seen huge progress over the last 5 years. In 2017 TRECVid continued a pilot programme on video captioning using a collection of 2,000 short video clips, and 14 research groups worldwide participated in the activity. Apart from the year-on-year improvement in video tagging itself, the emergence of different approaches, the emphasis on spatial and temporal salience and the variety of training data used, one of the notable developments was the emergence of a new kind of evaluation metric based on direct assessment, and the use of semantic similarity among captions to identify videos which are either difficult or easy to caption. This presentation gives a précis of this work, which includes our own contributions, as well as pointers to future directions.


Alan Smeaton is Professor of Computing at Dublin City University (appointed 1997) where he has previously been Head of School and Executive Dean of Faculty. He was a Founding Director of the Insight Centre for Data Analytics at DCU and is a member of the Board of the Irish Research Council with a Ministerial government appointment from 2012 to 2018. He is also a member of the COST Scientific Committee which oversees the disbursement of COST’s budget of €300m during the lifetime of Horizon 2020.

His research work focuses on the development of theories and technologies to support all aspects of information discovery and human memory, allowing people to find the right information at the right time and in the right form. He is internationally recognised for his work on information retrieval — particularly multimedia retrieval — and on automatic video analysis. He is founding coordinator of TRECVid, the international benchmarking evaluation campaign on information retrieval from digital video, run by the National Institute of Standards and Technology in the US annually since 2001. Over the last 17 years TRECVid has involved contributions from over 2,000 research scientists across 5 continents and almost 200 Universities, Research Institutes and companies.

Alan Smeaton has more than 600 peer-reviewed publications on Google Scholar with over 14,000 citations and an h-index of 60. He is an elected member of the Royal Irish Academy, the highest academic distinction that can be awarded in Ireland. Within the Academy he is chair of the Engineering and Computer Sciences Committee. In 2016 he was awarded the Academy’s Gold Medal for Engineering Sciences, which is given once every 3 years, for his “world-leading research reputation in the field of multimedia information retrieval”. In 2017 he was elevated to Fellow of the Institute of Electrical and Electronics Engineers (IEEE) for his “outstanding contributions to multimedia indexing and retrieval”.

Text and Multimedia Mining in the Context of E-Commerce

Sien Moens (KU Leuven).

Note: guest lecture in the afternoon, 15.45 - 16.45, in the Huygens building (HG 00.307).


The lecture focuses on representation learning of e-commerce content and is illustrated with several use cases: opinion mining, bridging the language of consumers and product vendors, and bridging language and visual data for cross-modal and multimodal fashion search.

We start with a short introduction to probabilistic graphical models and deep learning models. A first use case of opinion mining illustrates deep learning models. In the second use case, we focus on linking content (textual descriptions of pins on Pinterest) to webshops. We explain the problem of linking information between different usages of the same language, e.g., colloquial and formal “idioms” or the language of consumers versus the language of sellers. For bridging these languages, we have trained a multi-idiomatic latent Dirichlet allocation model (MiLDA) on product descriptions aligned with their reviews. In the third use case, we propose two architectures to link visual with textual content in the fashion domain. The first architecture uses a bimodal latent Dirichlet allocation topic model to bridge between these two modalities. As a second architecture, we have developed a neural network which learns intermodal representations for fashion attributes. Both resulting models learn from organic e-commerce data, which is characterized by clean image material, but noisy and incomplete product descriptions. We show the results of cross-modal retrieval: retrieving images given a textual description and retrieving descriptive text given an image. In a fourth use case, we present our most recent work on multimodal retrieval. Given an image and a textual description of what needs to be changed in the image, we retrieve the e-commerce fashion items that are relevant for this multimodal query.


Marie-Francine (Sien) Moens is a full professor at the Department of Computer Science at KU Leuven, Belgium, where she is the director of the Language Intelligence and Information Retrieval (LIIR) research lab and head of the Informatics section. She holds an M.Sc. and a Ph.D. degree in Computer Science from this university. Her main research topic is automated content recognition in text and multimedia using statistical machine learning. She has a special interest in learning with limited supervision, in probabilistic graphical models, in deep learning and in the application of these methods in language understanding and in the translation of language to other languages or to the visual medium. Currently, she is the scientific manager of the EU COST action iV&L Net (The European Network on Integrating Vision and Language).


The symposium is organised by Maaike de Boer, Arjen P. de Vries, and Wessel Kraaij.