Optimizing Image Retrieval

By Andy Zhang

This is the third and final part of my series on my summer research experience. You can find part one here and part two here.

Now that we had a working robot and localization algorithm, the problem was runtime. We were running these scripts on relatively powerful personal computers, but creating the map still took an inordinately long time. For three locations, each with 24 possible orientations, we had a total of 72 particles to index, which meant extracting hundreds of features per image and saving them all in an index. As we increased the resolution of our images and extended the map, runtime blew up, and we were soon running scripts overnight just to generate a map. And if our surroundings changed, we would need to redo the entire process…

In theory, we could have kept our implementation the way it was, but I like to believe that computer scientists are lazy in the best way: they want to find the most efficient way to do things (most of the time).

Further research led me down a rabbit hole of optimization techniques, but most sources pointed toward a method called Bag-of-Words (BoW). Taking cues from natural language processing and machine learning, individual image features could be clustered together into visual “words.” Image matching would then be a question of matching the words in the two images, rather than the features themselves.
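
As a toy illustration of that idea (not our actual pipeline, and the helper names here are made up for this post), each image's descriptors get assigned to their nearest visual word, the assignments are tallied into histograms, and two images are compared by the similarity of their histograms:

import numpy as np
from scipy.cluster.vq import vq

def word_histogram(descriptors, vocabulary):
    # Assign each SIFT descriptor to its nearest cluster center ("word") and count occurrences
    words, _ = vq(descriptors, vocabulary)
    hist, _ = np.histogram(words, bins=np.arange(vocabulary.shape[0] + 1))
    return hist.astype("float32")

def bow_similarity(des_a, des_b, vocabulary):
    # Compare two images by the cosine similarity of their word histograms,
    # instead of matching individual descriptors against each other
    a = word_histogram(des_a, vocabulary)
    b = word_histogram(des_b, vocabulary)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

Comparing two fixed-length histograms is dramatically cheaper than matching hundreds of 128-dimensional SIFT descriptors against each other.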

Most of the time, BoW is used for image classification rather than retrieval. For instance, a program is trained on a set of images of dogs, zebras, and cars, and the features extracted from a new image are used to classify it into one of those categories.

In our case, we cluster features into landmarks and perform image matching using a dictionary of landmarks. This approach led to a speed-up of over 95% in most cases! Natural language processing is powerful, indeed.

If you’re curious, the code looks something like this. It isn’t too complicated, and if you want to see the rest of it, head over to the GitHub repo or the project webpage for more.

# Imports needed at module level
import glob

import cv2
import joblib
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn import preprocessing

def createIndex(self, trainingPath):
    # SIFT lives under cv2.xfeatures2d in older OpenCV builds (cv2.SIFT_create() in 4.4+)
    desc = cv2.xfeatures2d.SIFT_create()
    train_des_list = []
    numWords = self.numWords
    image_paths = glob.glob(trainingPath + '/*' + '.png')

    # Extract the descriptors from the map images and store them
    for imagePath in image_paths:
        print(imagePath)
        image = cv2.imread(imagePath)
        kp, des = desc.detectAndCompute(image, None)
        train_des_list.append((imagePath, des))

    # Stack them all in a single numpy array
    train_descriptors = train_des_list[0][1]
    for image_path, descriptor in train_des_list[1:]:
        train_descriptors = np.vstack((train_descriptors, descriptor))

    # Perform K-means clustering to build the visual vocabulary
    voc, variance = kmeans(train_descriptors, numWords, 1)

    # Calculate the histogram of visual words for each training image
    im_features = np.zeros((len(image_paths), numWords), "float32")
    for i in range(len(image_paths)):
        words, distance = vq(train_des_list[i][1], voc)
        for w in words:
            im_features[i][w] += 1

    # Perform tf-idf weighting for the training set
    nbr_occurences = np.sum((im_features > 0) * 1, axis=0)
    idf = np.array(np.log((1.0 * len(image_paths) + 1) / (1.0 * nbr_occurences + 1)), 'float32')

    # Perform L2 normalization
    im_features = im_features * idf
    im_features = preprocessing.normalize(im_features, norm='l2')

    # Persist the index so the map doesn't have to be rebuilt every run
    joblib.dump((im_features, image_paths, idf, numWords, voc), trainingPath + ".pkl", compress=3)
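
The query side isn’t shown above, but a simplified sketch of it looks roughly like this (searchIndex and its signature are shorthand for this post, not the exact code in the repo): load the pickled index, build the query image’s visual-word histogram against the same vocabulary, apply the stored idf weights and L2 normalization, and rank the map images by a dot product.

import cv2
import joblib
import numpy as np
from scipy.cluster.vq import vq
from sklearn import preprocessing

def searchIndex(queryPath, indexPath):
    # Load everything createIndex() saved
    im_features, image_paths, idf, numWords, voc = joblib.load(indexPath)

    # Extract SIFT descriptors from the query image
    desc = cv2.xfeatures2d.SIFT_create()
    image = cv2.imread(queryPath)
    kp, des = desc.detectAndCompute(image, None)

    # Build the query's visual-word histogram
    query_features = np.zeros((1, numWords), "float32")
    words, distance = vq(des, voc)
    for w in words:
        query_features[0][w] += 1

    # Apply the same idf weighting and L2 normalization used for the index
    query_features = preprocessing.normalize(query_features * idf, norm='l2')

    # Rank map images by similarity to the query
    scores = np.dot(query_features, im_features.T)[0]
    ranking = np.argsort(-scores)
    return [(image_paths[i], scores[i]) for i in ranking]

Because the vectors are L2-normalized, that dot product is exactly cosine similarity, so scoring a query against the whole map collapses into one small matrix multiplication instead of a round of pairwise feature matching.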