How hard is it to find a specific picture in your smartphone gallery when you have thousands of photos, none of them captioned, and you don't even remember the date to narrow the search? It would take a long time, I presume. But what if you could describe what you are looking for in a few words and use that description as your search query?
Search query: “a dog yawning in a mountain”
Using text to retrieve images narrows the search and helps you find the picture you are looking for in far less time!
Did you know that with cross-modal retrieval solutions it is possible to conduct this type of search?
With the rise of social media and the growing consumption of multimedia content (e.g. text, video, images), the way people search for information needs to change, both to improve the user experience and to minimise the time it takes to find relevant content. Cross-modal retrieval is the task of searching for data across different data modalities (e.g. image-text, video-text, audio-text). It promises to change the way modern search engines work by giving users the flexibility to search for images, videos and audio using natural language, and vice versa.
Most cross-modal retrieval solutions use representation learning techniques.
Representation learning is a set of techniques used to learn representations of the data that make it easier to extract useful information when building machine learning models.
During the past few years, multiple algorithms have been developed to solve cross-modal retrieval problems, and the most popular solutions use representation learning techniques. Based on the survey conducted in , these solutions can be divided into two groups: real-valued representation learning and binary-valued representation learning.
Different data modalities that relate to the same subject or event are expected to share a common representation space, in which the most correlated data points lie close to each other. This means that the text description “a dog yawning in a mountain” will be closer to the picture in Fig. 1 than the text description “a cat climbing a tree”.
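To make this concrete, here is a minimal sketch of how closeness in a shared space is typically measured with cosine similarity. The embedding vectors below are made-up toy values; in a real system they would be produced by trained image and text encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings in a shared 4-dimensional space (hypothetical values;
# real encoders would map the image and the captions here).
image_dog_yawning = np.array([0.9, 0.1, 0.8, 0.2])
text_dog_yawning  = np.array([0.8, 0.2, 0.7, 0.1])  # "a dog yawning in a mountain"
text_cat_climbing = np.array([0.1, 0.9, 0.2, 0.8])  # "a cat climbing a tree"

sim_match    = cosine_similarity(image_dog_yawning, text_dog_yawning)
sim_mismatch = cosine_similarity(image_dog_yawning, text_cat_climbing)

# The matching caption scores higher, so it ranks first in retrieval.
assert sim_match > sim_mismatch
```

Retrieval then reduces to ranking all candidate items by their similarity to the query embedding.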
Real-valued representation learning methods aim to learn a real-valued common representation space in which the similarity between different modalities of data can be measured directly. These methods can be further divided into four subgroups.
Binary-valued representation learning transforms different modalities of data into a common Hamming space, in which cross-modal similarity search is fast. Because the data is encoded into binary codes, some information may be lost, so retrieval accuracy might decrease slightly. These methods can be further subdivided as well.
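The speed of search in a Hamming space comes from comparing short binary codes with the Hamming distance (the number of differing bits), which is very cheap to compute. Below is a toy sketch with hand-picked 8-bit codes; a real cross-modal hashing model would learn codes so that related items across modalities end up only a few bits apart.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions at which two binary codes differ."""
    return int(np.count_nonzero(a != b))

# Hypothetical 8-bit hash codes (a trained hashing model would
# produce these from an image and from two text queries).
img_code        = np.array([1, 0, 1, 1, 0, 0, 1, 0])
text_code_match = np.array([1, 0, 1, 0, 0, 0, 1, 0])  # related text: 1 bit differs
text_code_other = np.array([0, 1, 0, 1, 1, 1, 0, 1])  # unrelated text: 7 bits differ

print(hamming_distance(img_code, text_code_match))  # 1
print(hamming_distance(img_code, text_code_other))  # 7
```

The lower distance identifies the related text, and because the codes are plain bit vectors, millions of comparisons can be done with fast bitwise operations.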
Originally, the most popular algorithm for handling cross-modal retrieval problems was the subspace learning technique Canonical Correlation Analysis (CCA). Recent studies use the power of deep learning to handle this task [3, 4, 5, 6, 7, 8].
There are several open-source tools for cross-modal retrieval involving different data modalities. Below is a list of some interesting tools available on GitHub:
Research on cross-modal retrieval has become more popular over the last couple of years. Different methodologies have been explored, but the most promising ones nowadays leverage deep learning. Most of the existing solutions use the image and text modalities, so there are many opportunities to explore solutions for other modality types.
Author: Leticia Fernandes, Deeper Insights