AI Learnings: A simple guide to Cross-Modal Retrieval
How hard is it to find a specific picture in your smartphone gallery when you have thousands of them, they don't have captions, and you don't even remember the date to narrow the search? It would probably take a long time. But what if you could describe what you are looking for in a few words and use that description as your search query?
Search query: “a dog yawning on a mountain”
Result: Fig. 1 - a photo of a dog yawning on a mountain
Using text to retrieve images narrows the search and helps you find the picture you are looking for in far less time!
Did you know that with cross-modal retrieval solutions it is possible to conduct this type of search?
What is cross-modal retrieval?
Social media platforms keep growing, and so does the consumption of multimedia content (e.g. text, video, images). There is a need to change the way people interact with and search for information, to improve the user experience and minimise the time it takes to find relevant content. Cross-modal retrieval is the task of searching data across different data modalities (e.g. image-text, video-text, audio-text). It has the potential to change the way modern search engines work by giving users the flexibility to search for images, videos and audio using natural language, and vice versa.
Most cross-modal retrieval solutions use representation learning techniques.
What is representation learning?
Representation learning is a set of techniques used to learn representations of data that make it easier to extract useful information when building machine learning models [1].
Representation learning in cross-modal retrieval
During the past few years, multiple algorithms have been developed to solve cross-modal retrieval problems. The most popular solutions use representation learning techniques. Based on the survey conducted in [2], these solutions can be divided into two groups - Real-valued representation learning and Binary-valued representation learning.
Real-valued representation learning
Different data modalities that relate to the same subject or event are expected to share a common representation space, in which the most correlated data lie close to each other. This means that the text description “a dog yawning on a mountain” will be closer to the picture in Fig. 1 than the text description “a cat climbing a tree”.
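To make the idea concrete, here is a tiny sketch of how similarity in such a shared space could be measured with cosine similarity. The four-dimensional embeddings are invented toy numbers, not the output of a real model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors in the shared embedding space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, invented purely for illustration.
image_dog_yawning = np.array([0.9, 0.1, 0.8, 0.2])   # embedding of the photo in Fig. 1
text_dog_yawning  = np.array([0.8, 0.2, 0.7, 0.1])   # "a dog yawning on a mountain"
text_cat_climbing = np.array([0.1, 0.9, 0.2, 0.8])   # "a cat climbing a tree"

print(cosine_similarity(image_dog_yawning, text_dog_yawning))   # high similarity
print(cosine_similarity(image_dog_yawning, text_cat_climbing))  # low similarity
```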
Real-valued representation learning methods aim to learn a real-valued common representation space in which the similarity between different modalities of data can be measured directly. This category includes the following four methods:
- Subspace Learning: learns a common subspace shared by different modalities of data, in which the similarity between different modalities of data can be measured.
- Topic Model Learning: learns shared latent topics that capture the correlations between different modalities of data.
- Deep Learning: uses deep networks to learn features, joint representations and similarities across multiple modalities (a hypothetical sketch follows this list).
- Shallow Learning: includes pairwise-based methods, where a meaningful metric distance between different modalities of data is learned, and rank-based methods, where common representations of different modalities of data are learned by utilising rank lists.
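As promised above, here is a minimal, hypothetical sketch of the deep learning approach: a "two-tower" model that projects pre-extracted image and text features into a shared space and is trained with a contrastive loss. The dimensions, architecture and loss are illustrative assumptions, not the exact setup of any cited paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Illustrative two-tower network: one projection per modality
    into a shared embedding space (dimensions are arbitrary choices)."""
    def __init__(self, image_dim=2048, text_dim=768, shared_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix:
    matching pairs (the diagonal) are pulled together, others pushed apart."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of 8 pre-extracted image and text features (random, for illustration only).
model = TwoTowerModel()
img_emb, txt_emb = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = contrastive_loss(img_emb, txt_emb)
```

After training, retrieval simply ranks the items of one modality by their similarity to a query from the other modality in the shared space.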
Binary-valued representation learning (also called cross-modal hashing)
These methods transform different modalities of data into a common Hamming space, in which cross-modal similarity search is fast. Because the data is encoded as binary codes, some information may be lost and retrieval accuracy might decrease slightly (a small sketch of the Hamming-space search follows the list below). Cross-modal hashing can be subdivided into:
- Linear modelling: learn linear functions to obtain hash codes
- Nonlinear modelling: learn nonlinear functions to obtain hash codes
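As a rough illustration of the Hamming-space idea, the sketch below binarises real-valued embeddings with a naive sign threshold and ranks items by Hamming distance. In a real cross-modal hashing method the hash functions would be learned, so the thresholding here is only a stand-in:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Sign-threshold real-valued embeddings into binary hash codes
    (a deliberately naive stand-in for a learned hashing function)."""
    return (embeddings > 0).astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of bit positions where the two codes differ."""
    return int(np.count_nonzero(a != b))

# Toy example: a text query code and a small database of image codes.
rng = np.random.default_rng(0)
text_code = binarize(rng.standard_normal(32))
image_codes = binarize(rng.standard_normal((5, 32)))

distances = [hamming_distance(text_code, code) for code in image_codes]
best_match = int(np.argmin(distances))  # closest image in Hamming space
```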
Originally, the most popular algorithm for handling cross-modal retrieval problems was the subspace learning technique Canonical Correlation Analysis (CCA). Recent studies use the power of deep learning to handle this task [3, 4, 5, 6, 7, 8].
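As a hedged illustration of the subspace learning approach, here is a small sketch of CCA with scikit-learn on randomly generated paired features; the feature dimensions and data are invented purely for demonstration:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy paired features: 100 image-text pairs with invented dimensionalities.
rng = np.random.default_rng(42)
image_features = rng.standard_normal((100, 50))
text_features = rng.standard_normal((100, 30))

# Learn a 10-dimensional common subspace that maximises the correlation
# between the projected image and text views.
cca = CCA(n_components=10)
cca.fit(image_features, text_features)
image_subspace, text_subspace = cca.transform(image_features, text_features)

# Retrieval can then be done with a simple distance in the common subspace.
query = text_subspace[0]
dists = np.linalg.norm(image_subspace - query, axis=1)
ranked_images = np.argsort(dists)  # closest images first
```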
Open-source tools for cross-modal retrieval
There are several open-source tools for cross-modal retrieval involving different data modalities. Below is a list of some interesting tools available on GitHub:
- X-VLM learns multi-grained alignments: it locates visual concepts in an image based on the associated texts and, at the same time, aligns the texts with those visual concepts at multiple levels of granularity [9].
- X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual common sense reasoning, and cross-modal retrieval) [10].
- Image Search uses CLIP (Contrastive Language-Image Pre-Training), a neural network trained on a wide variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image (a small usage sketch follows this list).
- VisualBert consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention [11].
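As an example of how such a text-to-image search could look in code, here is a minimal sketch using the Hugging Face transformers implementation of CLIP. Note that this is not the API of the "Image Search" repository itself, and the photo file names are hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model from the Hugging Face hub.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical gallery of local photos; replace with your own file paths.
image_paths = ["photo_001.jpg", "photo_002.jpg", "photo_003.jpg"]
images = [Image.open(p) for p in image_paths]

query = "a dog yawning on a mountain"
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds the similarity between the query and every image.
scores = outputs.logits_per_text[0]
best = scores.argmax().item()
print(f"Best match for '{query}': {image_paths[best]}")
```

For a large gallery, the image embeddings would typically be pre-computed once and stored, so that each new text query only requires encoding the text and ranking the stored vectors.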
Research on cross-modal retrieval has grown in popularity over the last few years. Different methodologies have been explored, but the most promising ones today rely on deep learning. Most existing solutions use the image and text modalities, so there are plenty of opportunities to explore solutions for other modality types.
Let us solve your impossible problem
Speak to one of our industry specialists about how Artificial Intelligence can help solve your impossible problem