I think the bag of features is a good enough representation for image matching. Of course, it is not perfect and still leaves plenty of room for improvement, but it captures many of the ideas the computer vision community was working on in the years before it became popular.
For instance, consider the problem of image matching using SIFT features. The procedure goes like this (a minimal code sketch follows the list):
- Extract SIFT features from images A and B.
- Build a list of descriptors for each image.
- For each descriptor in image A, compute the distance to every descriptor in image B.
- Identify the minimum distance. If it is less than a threshold, count one match.
- Repeat.
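Here is a minimal sketch of that loop, assuming OpenCV's SIFT implementation and plain Euclidean distances; the threshold value is a hypothetical placeholder you would tune for your data:

```python
import cv2
import numpy as np

def count_matches(path_a, path_b, threshold=250.0):
    """Count one-directional SIFT matches between two images."""
    sift = cv2.SIFT_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, desc_a = sift.detectAndCompute(img_a, None)
    _, desc_b = sift.detectAndCompute(img_b, None)

    matches = 0
    for d in desc_a:
        # distance from this descriptor in A to every descriptor in B
        dists = np.linalg.norm(desc_b - d, axis=1)
        # accept the closest candidate only if it is near enough
        if dists.min() < threshold:  # hypothetical threshold
            matches += 1
    return matches
```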
At the end of the procedure, we have the total number of matches between both images. As you can imagine, this gets expensive if we are supposed to compute the number of matches over a large set of images (not just between two images). This is where the bag-of-features (BoF) appears and introduces an intermediate layer to save some computations: the dictionary of visual patterns. Actually, instead of saving those computations, the BoF moves the effort to an off-line stage, where we can afford to wait (the dictionary construction).
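As a sketch of that off-line stage, assuming scikit-learn's k-means and a hypothetical `all_descriptors` array stacking SIFT descriptors sampled from a training collection (the dictionary size `k` is also a made-up choice):

```python
from sklearn.cluster import KMeans

def build_dictionary(all_descriptors, k=1000):
    """Cluster training descriptors into k visual patterns (the dictionary)."""
    # Each cluster centroid becomes one entry of the dictionary.
    kmeans = KMeans(n_clusters=k, n_init=10).fit(all_descriptors)
    return kmeans.cluster_centers_
```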
Later, when we are indexing images with the BoF, we pre-compute, for each image, the matches between its descriptors and the reference dictionary. That gives us the histogram of occurrences of visual patterns for that image. Afterwards, for a query image, we can estimate the total number of matches with respect to a previously indexed image just by computing the histogram intersection between the two histograms.
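A sketch of those two steps, under the same assumptions as above (each descriptor is quantized to its nearest dictionary pattern, and intersection approximates the match count):

```python
import numpy as np

def bof_histogram(descriptors, dictionary):
    """Histogram of occurrences of each visual pattern in one image."""
    hist = np.zeros(len(dictionary), dtype=np.int64)
    for d in descriptors:
        # nearest visual pattern; note there is no rejection step here
        nearest = np.argmin(np.linalg.norm(dictionary - d, axis=1))
        hist[nearest] += 1
    return hist

def histogram_intersection(hist_a, hist_b):
    """Approximate number of descriptor matches between two images."""
    return np.minimum(hist_a, hist_b).sum()
```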
In other words, if image A shows the pattern "x" 5 times and image B shows the same pattern 3 times, guess what the number of common matches would be if we looked directly from image to image without using a dictionary: 3, because approximately 3 of the patterns in A would match those in B, i.e., the minimum between 5 and 3. Sound familiar? Of course, this is an approximation of the matching process described above using the histogram intersection metric. However, it mimics reasonably well what was previously done.
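As a toy check of that arithmetic (bin 0 is the pattern "x" from the paragraph above; the other bin counts are made up just to complete the example):

```python
import numpy as np

hist_a = np.array([5, 0, 2])  # image A shows "x" 5 times
hist_b = np.array([3, 1, 4])  # image B shows "x" 3 times
print(np.minimum(hist_a, hist_b).sum())  # 3 + 0 + 2 = 5; "x" contributes min(5, 3) = 3
```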
Notice that one important parameter in the direct matching process is the threshold to accept or reject a match. This is somewhat relaxed in the BoF approach by using k-means to group similar patterns. I say relaxed because there is no rejection when using BoF: every descriptor is assigned to its nearest pattern, no matter how far away it is (unless the representation were sparse enough). So, here is the trick: the larger the number of patterns in the dictionary, the smaller the number of matches, and possibly the better the approximation.
I would argue that when the number of patterns goes to infinity, it becomes equivalent to a matching process where the threshold is set to zero (a match requires identical descriptors). This sort of justifies the use of large dictionaries with BoF. And it becomes even more important when we want to use it in large-scale setups, simply because as we introduce more images, it is likely that more unseen visual patterns appear in the collection.