Saturday, May 25, 2013

CLEF2013 - Our approach

Introduction


In the CLEF2013 annotation subtask the challenge is to annotate a series of images with a predetermined set of labels. In this post we present the multiple approaches we took to tackle this challenge with some of the multimodal learning techniques proposed in our group. The final results are not what we expected, but they give us some insight into the advantages and disadvantages of our proposed method.

Problem


Given an image and a set of words, assign to each word a rank that represents the relevance of that word to the image, and then decide whether or not to annotate the image with that word (based on the ranking or otherwise).

The same process must be repeated for a series of images, using the exact same set of words.

Training data


We are given a set of 250 000 samples (annotated images), each with:

  1. An HTML page with one image of interest (it may contain more than one image, but we are interested in just one) and the text that "annotates" that image.
  2. A filtered textual representation of each of the 250 000 images. The ImageCLEF organizers seem to have processed the HTML of each image to extract the most relevant text, based on some closeness measure over the DOM tree or something like that. For each image they provide a set of related words and, furthermore, a weight for each of these words. A quick look at the HTML pages and the images' locations in those pages reveals that their parsing and weighting are more or less consistent with the importance of the terms for the images, as some of the most relevant words (relatively speaking) have the highest weights (or scores, as they call them).
  3. Various visual representations of the image: C-SIFT, RGB-SIFT.


Validation data


We are given a set of 1 000 images with their corresponding visual representations (the same ones given for the training data) and their corresponding ground-truth annotations, i.e. a set of words that are actually associated with each image. Finally, we are provided with the set of keywords (or concepts) with which we need to annotate those 1 000 images.

Additional material


We are also provided with scripts for:
  1. Measuring the performance (in terms of F-measure and MAP) of an annotation algorithm.
  2. Creating two baseline results:
    1. Assigning random annotations to each image.
    2. Assigning annotations based on the 32 nearest training-image neighbors (in a 1000-feature C-SIFT space, using the L1 norm) and a word co-occurrence matrix derived from the training data.
In addition, for each word used to label images, we have:
  • The synset offset, the sense ID and the word type in WordNet. This information is enough to search for synonyms and antonyms in WordNet (e.g. using the WordNet command-line interface [2]), e.g. word: bottle, synset offset: 02876657, type: noun, sense: 1.
  • A link to the corresponding Wikipedia page.

Proposed solution


Our proposed solution fits into what are called label embedding strategies [1], in which both the labeled object and the labels are embedded into the same vector space and then related using a ranking function (which can be a distance measure, e.g. the cosine distance); i.e. the "importance" of a label for an image is computed by applying this ranking function to the embedded label and the embedded image. This provides, for each image, an "importance" ranking of the labeling words, and from this ranking we must select which labels to actually use to annotate each image, i.e. we must set a threshold or select the N most "important" labels for each image.


The overall strategy can be summarized in a diagram:


We first use the training data (text and images) to train the visual and textual models. Later we use the trained visual model to produce a latent representation of each image to be annotated. Next we use the trained textual model to produce a latent representation of each word that will be used to annotate the images. Having this, we feed every possible pair of image and word representations to the ranking model to find out which words should be used to annotate which images.
Conspicuous note! As we'll see later, "a word representation" in this case was just a vector in a BoW vector space with a single element set to one, the one corresponding to the word being evaluated. However, we are still evaluating richer representations of a word, which can (1) extract information from Wikipedia and online dictionaries, and (2) extract synonyms from WordNet.

Therefore we need to come up with four different models (a minimal interface sketch follows the list below):

  1. A label embedding model or textual model. $f_T : R^t \rightarrow R^l $, with $t$ the number of terms in the text collection and $l$ the dimensionality of the embedding space.
  2. An image embedding model or visual model.  $f_V : R^v \rightarrow R^l $, with $v$ the dimensionality of the space in which the image is represented and $l$ the dimensionality of the embedding space.
  3. A ranking model.  $f_R : R^l \times R^l \rightarrow R$, here the first argument is the representation of a word or a document in the embedding space and the second argument of the function is the representation of the image in the embedding space. The output of this function is the "importance" of the label or document for the image.
  4. A decision model. It receives a set of ranked words (a set of tuples, each composed of a word and a real number) and decides which ones are relevant.
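
The following minimal sketch (the names, signatures and types are ours and purely illustrative, not taken from our actual implementation) expresses these four pieces as plain Java interfaces:

// Hypothetical interfaces for the four models; double[] stands in for the
// vector spaces R^t, R^v and R^l described above.
interface TextualModel {                      // f_T : R^t -> R^l
    double[] embed(double[] termVector);
}

interface VisualModel {                       // f_V : R^v -> R^l
    double[] embed(double[] visualDescriptor);
}

interface RankingModel {                      // f_R : R^l x R^l -> R
    double score(double[] embeddedWord, double[] embeddedImage);
}

interface DecisionModel {
    // Receives each candidate word paired with its ranking score and
    // returns the labels selected to annotate the image.
    java.util.List<String> select(java.util.Map<String, Double> rankedWords);
}

The rest of the proposal is a matter of choosing concrete implementations for each of these interfaces.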

It's important to keep in mind that the first two models must be able to cope with large-scale data sets (on the order of hundreds of thousands of samples, as in our training data set). This is one of the main challenges of ImageCLEF and the one we tried to address directly. Therefore our proposal starts from a specific visual model (computationally, the most expensive model) and from there imposes some restrictions that help define the rest of the models.


Conspicuous note! Do $f_V$ or $f_T$ need to be functions (or just relations)? In other words, may a given image have several embedded representations, or must it have just one?

The visual model

We tried various scalable models here. These models should be able to produce, for any new image, its representation in the embedding space while coping with the size of the training data set.

ONSE

ONSE stands for Online Non-negative Semantic Embedding and is an online algorithm (it uses SGD) devised by Jorge Vanegas. It basically solves an unconstrained optimization problem over the training data to find a matrix $S$ such that:

$$ V = ST $$

Where $V \in R^{v\times n}$, $T \in R^{l\times n}$ and $S \in R^{v \times l}$, with $n$ being the number of training samples (each column of $V$ represents a training image, and each column of $T$ contains the corresponding embedded text representation of that image).

Conspicuous note! Should we assume that the mapping from visual to textual representation is just an affine projection, or should we give more degrees of freedom to the model by assuming a projective transformation (adding a 1 at the end of each column of $V$ and considering the resulting "equal up to scale" factor)?

Conspicuous note! What if during training we consider the fact that an image is not only related to one vector $x_t \in R^{l}$ but to many vectors in $R^l$, each one representing one of the non-zero elements of the initial $x_t$ vector?

The $S$ matrix later allows us to find the textual representation of a given image using the Moore-Penrose pseudoinverse:

$$ x_t = f_V(x_v) = S^{+}x_v $$

Where $x_t$ is unknown and $x_v$ is a known representation of the image in the same vector space as the training images, i.e. $R^v$.
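
As a rough, self-contained illustration of these two steps (our own sketch, not Jorge Vanegas' actual implementation; in particular it ignores how the non-negativity in the name is handled), the training update and the projection could look as follows, with the pseudoinverse of $S$ assumed to be precomputed offline (e.g. via an SVD):

public class OnseSketch {

    // One SGD step on a single training pair: visual vector v (in R^v) and
    // embedded text vector t (in R^l). The gradient of 1/2 ||v - S t||^2
    // w.r.t. S is -(v - S t) t^T, hence S <- S + eta * (v - S t) t^T.
    static void sgdStep(double[][] S, double[] v, double[] t, double eta) {
        double[] residual = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            double st = 0.0;
            for (int j = 0; j < t.length; j++) st += S[i][j] * t[j];
            residual[i] = v[i] - st;
        }
        for (int i = 0; i < v.length; i++)
            for (int j = 0; j < t.length; j++)
                S[i][j] += eta * residual[i] * t[j];
    }

    // Projection of a new image x_v through a precomputed Moore-Penrose
    // pseudoinverse of S (an l x v matrix): x_t = S^+ x_v.
    static double[] project(double[][] pinvS, double[] xv) {
        double[] xt = new double[pinvS.length];
        for (int i = 0; i < pinvS.length; i++)
            for (int j = 0; j < xv.length; j++)
                xt[i] += pinvS[i][j] * xv[j];
        return xt;
    }
}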

M3F

In this model we try to construct a latent representation (or embedding) such that there are two affine transformations that are able to go (with minimum loss) from the latent representation to the textual representation on the one side and to the visual representation on the other [3]:

$$ min_{P,Q,H} \frac{1}{2}(\| V-PH \|^2_F + \| T-QH \|^2_F) + \frac{\lambda}{2}(\| P \|^2_F+\| Q \|^2_F+\| H \|^2_F)$$

This unconstrained optimization problem is then solved using SGD; the learned matrices are later used to find the latent representation of a new image:

$$ x_t = f_V(x_v) = (\lambda I + P^TP + Q^TQ)^{-1} (P^Tx_v) $$
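
To make the projection above concrete, here is a minimal sketch (our own illustration, not the code from [3]) that builds the small square system $(\lambda I + P^TP + Q^TQ)x = P^Tx_v$ and solves it with Gaussian elimination, assuming $P$ and $Q$ have already been learned (each with one column per latent dimension):

public class M3fProjectionSketch {

    // x_t = (lambda*I + P^T P + Q^T Q)^{-1} (P^T x_v)
    static double[] project(double[][] P, double[][] Q, double[] xv, double lambda) {
        int l = P[0].length;                       // latent dimensionality
        double[][] M = new double[l][l];
        double[] b = new double[l];
        for (int i = 0; i < l; i++) {
            M[i][i] = lambda;                      // lambda * I
            for (int j = 0; j < l; j++) {
                for (double[] row : P) M[i][j] += row[i] * row[j];   // P^T P
                for (double[] row : Q) M[i][j] += row[i] * row[j];   // Q^T Q
            }
        }
        for (int i = 0; i < l; i++)                // b = P^T x_v
            for (int r = 0; r < P.length; r++) b[i] += P[r][i] * xv[r];
        return solve(M, b);
    }

    // Gaussian elimination with partial pivoting; fine for a small l x l system.
    static double[] solve(double[][] A, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] tmpRow = A[col]; A[col] = A[piv]; A[piv] = tmpRow;
            double tmpVal = b[col];   b[col] = b[piv]; b[piv] = tmpVal;
            for (int r = col + 1; r < n; r++) {
                double f = A[r][col] / A[col][col];
                for (int c = col; c < n; c++) A[r][c] -= f * A[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= A[r][c] * x[c];
            x[r] = s / A[r][r];
        }
        return x;
    }
}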

K-NN

This is the least scalable model we used. It consists of taking the K nearest neighbors of an image in the training data set, using histogram intersection (although we could also have evaluated L1, L2, Tanimoto or other distance measures), and then averaging the textual representations of those nearest neighbors to produce the textual representation of the new image:

$$ x_t = f_V(x_v) = \frac{1}{\mid KNN(x_v) \mid} \sum_{x \in KNN(x_v)} T(x)$$

Where $T(x)$ is the textual representation of the training image $x$ in the training data set.
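
A minimal sketch of this projection (our illustration, assuming the training visual descriptors and their textual representations fit in memory as plain arrays):

import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

public class KnnProjectionSketch {

    // Histogram intersection between two non-negative descriptors.
    static double intersection(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += Math.min(a[i], b[i]);
        return s;
    }

    // x_t = average of the textual representations T(x) of the k training
    // images most similar to x_v under histogram intersection.
    static double[] project(double[][] trainVisual, double[][] trainText,
                            double[] xv, int k) {
        Integer[] idx = IntStream.range(0, trainVisual.length)
                                 .boxed().toArray(Integer[]::new);
        Arrays.sort(idx, Comparator.comparingDouble(
                i -> -intersection(trainVisual[i], xv)));   // most similar first
        double[] xt = new double[trainText[0].length];
        for (int n = 0; n < k; n++)
            for (int j = 0; j < xt.length; j++)
                xt[j] += trainText[idx[n]][j] / k;
        return xt;
    }
}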


Conspicuous note! What should $T(x)$ be in the last equation: the LDA representation of the training image $x$ or its original textual representation (a vector in a vector space model)?


The textual model



Here we used just one model: LDA. LDA can be used as an embedding algorithm since it represents a text document in terms of its probability of belonging to each of $l$ latent topics. However, to use LDA we need a Bag of Words (BoW) representation of the annotations of the training images, i.e. a BoW representation of the HTML pages. This poses a scale challenge, as it would be very costly to pre-process the HTML pages (removing HTML tags, CSS, JS and other scripts) to then extract the actual text that can be said to be annotating each image.

Luckily, this BoW representation of the training HTML pages is partly provided by the ImageCLEF organizers in the form of the set of weighted words associated with each training image, for example:


  • apple 500
  • fruta 300
  • de 50
  • la 20
  • by 30
  • the 10
  • search 20
  • contact 20
  • searching 40
  • bäker 10

The weights of the words of an image are normalized to 1000, and so we can build a sort of term frequency for each weighted word, measured as:

$$ \frac{w}{1000} \cdot n_t $$

With $w$ the weight of the word and $n_t$ the number of terms in the corresponding document.
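
As a small sketch (assuming the organizer-provided scores come as a word-to-weight map, and taking $n_t$ to be the number of listed terms, which is an assumption on our part):

import java.util.HashMap;
import java.util.Map;

public class WeightedTfSketch {

    // Builds tf(word) = (w / 1000) * n_t from the provided per-word scores.
    static Map<String, Double> toTermFrequencies(Map<String, Integer> scores) {
        int nt = scores.size();            // number of terms in the document (assumed)
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : scores.entrySet())
            tf.put(e.getKey(), e.getValue() / 1000.0 * nt);
        return tf;
    }
}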

This, however, is just one of the many problems we needed to solve before feeding this information to the LDA algorithm. Other problems we faced were:

  1. The words "annotating" each training image where in more than one language. What we did this time was just to use only images that were annotated with english words. This is obviously not the best approach as we could have translated all the words to english (using for example wikipedia) or use a cross-language textual embedding model. After filtering the training set to have only english annotated images we ended up with 212850 samples.
  2. The words are not stemmed. We used an English stemmer, which left us with roughly 390 150 unique words.
  3. There are a lot of stop-words. In fact, of the 390K unique words left after stemming, only 30K are neither stop-words nor words with a frequency below 5 or above 200K (see the filtering sketch below).
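
The pruning in item 3 can be sketched as follows (assuming word frequencies have already been counted; the stop-word list and the thresholds, 5 and 200K in our setup, are parameters):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DictionaryFilterSketch {

    // Keeps stemmed words that are not stop-words and whose frequency
    // lies within [minFreq, maxFreq].
    static Set<String> prune(Map<String, Integer> frequencies,
                             Set<String> stopwords, int minFreq, int maxFreq) {
        Set<String> kept = new HashSet<>();
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            int freq = e.getValue();
            if (!stopwords.contains(e.getKey()) && freq >= minFreq && freq <= maxFreq)
                kept.add(e.getKey());
        }
        return kept;
    }
}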

So finally we have a text training data set composed of 212 850 samples, with a dictionary of roughly 30 thousand different (stemmed) words. This is fed into an LDA model trainer that finds $l$ latent topics in the document collection. The model then outputs an $R^l$ vector representing the text of each of the images in the training data set.

Ranking model

Here we used three different functions: dot product (or cosine) distance, histogram intersection and Tanimoto distance.
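
A sketch of the three functions, written here as similarities (a higher value means a more relevant label; the sign can be flipped if a proper distance is preferred):

public class RankingFunctionsSketch {

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Cosine similarity: the dot product normalized by the vector norms.
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Histogram intersection (for non-negative vectors such as LDA topic distributions).
    static double intersection(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += Math.min(a[i], b[i]);
        return s;
    }

    // Tanimoto (extended Jaccard) similarity.
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return ab / (dot(a, a) + dot(b, b) - ab);
    }
}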

Decision model

We just used a threshold to decide which labels to assign to a given image using the ranking obtained from the previous model. 
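
A minimal sketch of this decision step (assuming the ranking scores come as a word-to-score map):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ThresholdDecisionSketch {

    // Keeps every label whose ranking score is at or above the threshold.
    static List<String> decide(Map<String, Double> scores, double threshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet())
            if (e.getValue() >= threshold) selected.add(e.getKey());
        return selected;
    }
}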

Implementation

All the code for this challenge was written in Java and is available on request through a private Bitbucket repository. We do, however, provide the jar file, which can be used to reproduce any of our results with the same or other datasets.

The implementation of the visual models was made in-house.

The LDA implementation is taken from MALLET. If you ever need to generate and use an LDA representation of a text data set, you can do so by building a textual TF matrix and feeding it to the appropriate commands of the provided jar file. This implementation also performs stemming (using Snowball) and stop-word removal, and allows you to reuse an already trained LDA model as many times as necessary to represent new textual data.
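
For reference, here is a minimal, self-contained example of training and applying an LDA model through the MALLET API (this is not our clef.jar code, it omits the stemming and stop-word removal steps, and the class and method names are those of MALLET 2.0 as we understand them):

import java.util.ArrayList;
import java.util.regex.Pattern;
import cc.mallet.pipe.*;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class MalletLdaSketch {

    public static void main(String[] args) throws Exception {
        // Tokenize, lowercase and map tokens to feature indices.
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList training = new InstanceList(new SerialPipes(pipes));
        // In the real pipeline each document is the weighted word list of one
        // training image; toy documents stand in for them here.
        training.addThruPipe(new Instance("apple fruit red tree garden", null, "doc-0", null));
        training.addThruPipe(new Instance("bottle glass wine cellar", null, "doc-1", null));

        int numTopics = 10;                         // l, the embedding dimensionality
        ParallelTopicModel lda = new ParallelTopicModel(numTopics);
        lda.addInstances(training);
        lda.setNumThreads(2);
        lda.setNumIterations(1000);
        lda.estimate();

        // Latent (R^l) representation of a training document.
        double[] trainTopics = lda.getTopicProbabilities(0);

        // Latent representation of a new, unseen document.
        InstanceList testing = new InstanceList(training.getPipe());
        testing.addThruPipe(new Instance("apple tree orchard", null, "new-doc", null));
        TopicInferencer inferencer = lda.getInferencer();
        double[] newTopics = inferencer.getSampledDistribution(testing.get(0), 100, 10, 50);
    }
}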

To get help on how to use any of the commands, just call:

java -jar clef.jar

Evaluation

We tried to evaluate the textual and visual models as isolated from one another as possible. This allowed us to uncover several implementation issues as well as pitfalls of our approach.
This exhaustive evaluation was done because the final results were really poor in terms of both F-measure and MAP, and so we needed to find the source of the problem.

Text model

To evaluate the textual model we built an LDA model using the training data and then generated the textual representations (vectors in $R^l$) of both the training data and the validation data. We can do this because we have the ground-truth text for the validation data set, and so we can construct the expected textual representation (the one that the visual model should output) of the validation data. Then we use our ranking and decision models to see how well this textual model works.

In other words, what we are doing here is supposing that the visual model is perfect, i.e. that it can reconstruct from an image its exact latent representation. Under this assumption we would like to know whether measuring the distance between an image's latent representation and a single word's latent representation is a good way to find the words that correctly annotate that image.
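
For completeness, here is a sketch of how the average precision for a single image could be computed in this setting (our illustration, not the organizers' evaluation script); MAP is then just the mean of this value over all validation images:

import java.util.List;
import java.util.Set;

public class AveragePrecisionSketch {

    // Average precision of one ranked list of concept words against the
    // ground-truth annotations of a single image.
    static double averagePrecision(List<String> rankedWords, Set<String> groundTruth) {
        double hits = 0.0, sumPrecision = 0.0;
        for (int i = 0; i < rankedWords.size(); i++) {
            if (groundTruth.contains(rankedWords.get(i))) {
                hits += 1.0;
                sumPrecision += hits / (i + 1);    // precision at this rank
            }
        }
        return groundTruth.isEmpty() ? 0.0 : sumPrecision / groundTruth.size();
    }
}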

The results from this evaluation are encouraging, as we get a MAP of 0.5 for the best parameter selection. This MAP is well above the baselines proposed by the ImageCLEF organizers. However, this is not the end of the problem, as we still need a visual model that is good enough to correctly reconstruct the latent representation of new images.

Visual model

It was while evaluating the visual models that we ran into the most problems. This evaluation revealed several issues, both in our implementations and in our assumptions about the problem. It is still in progress, and the results will be published here as soon as we have them.

The results of the whole model are quite poor, and we are still struggling to find the reason: it could be an issue in our theoretical proposal or in our code implementation.

Notes and references


[1] Park, S., & Choi, S. (2012). Max-margin embedding for multi-label learning. Pattern Recognition Letters.

[2] If you try to compile WordNet 3.0 you might get the error "error: 'Tcl_Interp' has no member named 'result'" in the src/stubs.c file. To resolve this issue, edit the file src/Makefile and add "-DUSE_INTERP_ERRORLINE -DUSE_INTERP_RESULT" to the CFLAGS (line 86).

[3] Caicedo, J., & González, F. (2012). Online Matrix Factorization for Multimodal Image Retrieval. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 340–347.