For instance, consider the problem of image matching using SIFT features. The procedure goes like this:
- Extract SIFT features from image A and B.
- Build a list of descriptors for each image.
- For each descriptor in image A, compute the distance to all other descriptors in image B.
- Identify the minimum distance. If it is less than a threshold, count one match.
- Repeat.
At the end of the procedure, we have the total number of matches between both images. As you can imagine, this is sort of expensive if we are supposed to compute the number of matches for a large set of images (not just between two images). There is when the bag-of-features (BoF) appears and introduces an intermediate layer to save some computations: the dictionary of visual patterns. Actually, instead of saving those computations, the BoF moves the effort to an off-line stage, when we can wait (the dictionary construction).
Later, when we are indexing images with the BoF, we pre-compute the number of matches between the descriptors of one image and a reference dictionary. That's the histogram of occurrences between two images. Afterwards, for a query image, we can estimate the total number of matches with respect to a previously indexed image, just by computing the histogram intersection between both histograms.
In other words, if image A shows the pattern "x" 5 times and image B shows the same pattern 3 times, guess what would be the number of common matches if we look directly from image to image without using a dictionary: 3 (because at least 3 of the patterns in A would match with those patterns in B, approximately, i.e., the minimum between 5 and 3). Sounds familiar? Of course, this is an approximation to the matching process described above using the histogram intersection metric. However, it sort of mimics what was previously done.
Notice that one important parameter in the direct matching process is the threshold to accept or reject a match. This is sort of relaxed in the BoF approach by using k-means to group similar patterns. I think it's relaxed because there is no rejection when using BoF (unless it would be sparse enough). So, here is the trick: the larger the number of patterns in the dictionary, the smaller the number of matches, and possibly the better the approximation.
I would argue that when the number of patterns goes to infinite, it is equivalent to have a matching process where the threshold is set to zero (a match requires exact descriptors). This sort of justifies the use of large dictionaries when using BoF. And it becomes even more important when we would like to use it in large scale setups, just because when we introduce more images, it is likely that more unseen visual patterns appear in the collection.
Very interesting post! Juanca is right the BOF histogram intersection is an approximation to a complete match. In fact the codebook is a quantization of the space, and the threshold in the complete match is related to the size of the codebook, and Juanca is right when argues that an infinite codebook corresponds to a 0 threshold.
ReplyDeleteIn previous discussions we have discussed about e possibility of using a soft clustering for building the codebook, instead of the hard clustering produced by k-means. In this case, a visual pattern will contribute to more of one histogram bin depending on the degree of belonging to the curresponding cluster/codeword. I wonder whether this modification could have a positive impact in a particular task (retrieval or annotation). My conjecture is that we can obtain equivalent results, but with smaller codebooks. It would be intresting to give it a try.