The goal of this project is scene recognition. We classify scenes into one of 15 categories by training and testing on the 15 scene database. We use two types of features to represent image information (tiny images and SIFT) and two types of classifiers (nearest neighbor and SVM), and we discuss the results for each combination below.
To build a tiny image feature, we simply resize the original image to a very small square resolution (16x16), ignoring its aspect ratio (alternatively, one could first crop the center square portion out of each image). We then normalize each tiny image to zero mean and unit standard deviation, which makes the representation slightly more robust to brightness changes.
image_feats = zeros(size(image_paths,1), 256);   % 16*16 = 256 features per image
for i = 1:size(image_paths,1)
    im = im2double(imread(image_paths{i}));
    dis = imresize(im, [16,16]);                 % shrink to a 16x16 tiny image
    dis = dis - mean2(dis);                      % zero mean
    dis = dis / std2(dis);                       % unit standard deviation
    image_feats(i,:) = reshape(dis, [1,256]);    % flatten to a 256-D row vector
end
For the nearest neighbor classifier, we find each test case's nearest training example in feature space and assign the test case the label of that training example.
% D(m,n) is the distance between training example m and test example n
D = vl_alldist2(train_image_feats', test_image_feats');
[~, I] = min(D);                                 % nearest training example per test case
predicted_categories = cell(size(test_image_feats,1), 1);
for i = 1:size(test_image_feats,1)
    predicted_categories{i} = train_labels{I(i)};
end
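A natural variant (not used for the numbers reported here) is k-nearest neighbors, which votes among the k closest training examples instead of trusting only the single nearest one. A minimal sketch reusing the distance matrix D above, with the value of k a hypothetical choice:

k = 5;                                           % hypothetical number of neighbors
[~, I_sorted] = sort(D, 1);                      % training examples sorted by distance, per test case
for i = 1:size(test_image_feats,1)
    votes = train_labels(I_sorted(1:k, i));      % labels of the k nearest neighbors
    [unique_labels, ~, label_idx] = unique(votes);
    predicted_categories{i} = unique_labels{mode(label_idx)};   % majority vote
end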
Confusion Matrix for tiny images + NN
We get an accuracy of 22.4%. This is not a particularly good representation, because it discards all of the high frequency image content and is not especially invariant to spatial or brightness shifts.
First we need to build a vocabulary of visual words. To do this, we load images from the training set and extract dense SIFT features from each. Once we have tens of thousands of SIFT features from many training images, we cluster them with k-means; the resulting centroids are our visual word vocabulary.
features1 = cell(1, size(image_paths,1));        % one cell of descriptors per training image
step = 3;                                        % dense SIFT sampling step (pixels)
for i = 1:size(image_paths,1)
    [~, features] = vl_dsift(single(imread(image_paths{i})), 'norm', 'step', step);
    features1{1,i} = features;                   % 128 x N descriptors for image i
end
f = cell2mat(features1);                         % pool descriptors from all images
single_f = single(f);                            % vl_kmeans expects single precision here
vocab = vl_kmeans(single_f, vocab_size);         % cluster into vocab_size visual words
vocab = vocab';                                  % vocab_size x 128, one word per row
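Since building the vocabulary is by far the slowest step, it is typically computed once and cached rather than rebuilt on every run; a minimal sketch, assuming a vocab.mat cache file (the file name is an assumption):

save('vocab.mat', 'vocab');                      % cache after build_vocabulary.m finishes
% ...later, at the top of get_bags_of_sifts.m:
load('vocab.mat', 'vocab');                      % vocab_size x 128 matrix of visual words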
Next, we need to write get_bags_of_sifts.m. We construct SIFT features here in the same way we did in build_vocabulary.m (except for the sampling rate), assign each local feature to its nearest cluster center, and build a histogram counting how many times each cluster was used. We also normalize the histogram; otherwise a larger image with more SIFT features would look very different from a smaller version of the same image.
step = 4;
image_feats = zeros(size(image_paths,1), size(vocab,1));
for i = 1:size(image_paths,1)
    [~, features] = vl_dsift(single(imread(image_paths{i})), 'norm', 'step', step);
    dis = vl_alldist2(single(features), vocab'); % N x vocab_size distances to each word
    feat = zeros(1, size(vocab,1));
    for j = 1:size(features,2)
        [~, index] = min(dis(j,:));              % nearest visual word for descriptor j
        feat(index) = feat(index) + 1;
    end
    image_feats(i,:) = feat / sum(feat);         % L1-normalized histogram
end
Confusion Matrix for bags of SIFT + NN
We get an accuracy of 50.9%. The values of the free parameters are: step size = 4 for both build_vocabulary.m and get_bags_of_sifts.m, vocab size = 200. The runtime is ~30 minutes, which is quite long because the step size is small.
Confusion Matrix for bags of SIFT + NN
We get an accuracy of 47.9%. The values of the free parameters are: step size = 4 and 10 for build_vocabulary.m and get_bags_of_sifts.m respectively, vocab size = 200. The runtime is ~17 mins.
The task is to train 1-vs-all linear SVMs to operate in the bag-of-SIFT feature space. The feature space is partitioned by a learned hyperplane and test cases are categorized based on which side of that hyperplane they fall on. To decide which of the 15 categories a test case belongs to, we train 15 binary, 1-vs-all SVMs: each classifier is trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers are evaluated on each test case and the classifier which is most confidently positive "wins". The code for svm_classify.m is as follows:
num_categories = length(categories);
W = zeros(size(train_image_feats,2), num_categories);  % one weight vector per category
B = zeros(1, num_categories);                           % one bias per category
for i = 1:num_categories
    matching_indices = strcmp(categories{i}, train_labels);
    labels = double(matching_indices);                  % +1 for this category...
    labels(~matching_indices) = -1;                     % ...-1 for everything else
    [W(:,i), B(1,i)] = vl_svmtrain(train_image_feats', labels, LAMBDA);
end
predicted_categories = cell(size(test_image_feats,1), 1);
dis = zeros(1, num_categories);
for i = 1:size(test_image_feats,1)
    for j = 1:num_categories
        dis(1,j) = W(:,j)' * test_image_feats(i,:)' + B(1,j);  % decision value w'x + b
    end
    [~, index] = max(dis);                              % most confidently positive wins
    predicted_categories{i} = categories{index};
end
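For reference, vl_svmtrain minimizes the standard L2-regularized hinge loss, λ/2·‖w‖² + (1/n)·Σᵢ max(0, 1 − yᵢ(wᵀxᵢ + b)), so the very small LAMBDA reported below corresponds to weak regularization.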
Confusion Matrix for bags of SIFT + linear SVM
We get an accuracy of 63.1%. The values of the free parameters are: step size = 4 and 10 for build_vocabulary.m and get_bags_of_sifts.m respectively, vocab size = 200, LAMBDA = 0.000001. The runtime is ~14 mins.
Vocabulary size experiment, using bags of SIFT and the linear SVM classifier. The step size is 4 and 10 for build_vocabulary.m and get_bags_of_sifts.m respectively; LAMBDA = 0.000001.
Vocabulary Size | Accuracy
---|---
10 | 35.4%
20 | 46.5%
50 | 54.5%
100 | 59.4%
200 | 63.1%
400 | 61.8%
1000 | 65.5%
Implemented a non-linear SVM with an RBF kernel, again using bag-of-SIFT features. The code for rbf_svm_classify.m is as follows:
svm_model = cell(1, num_categories);
for i = 1:num_categories
    % +1 / -1 labels for the 1-vs-all problem
    svm_train_labels = double(strcmp(categories{i}, train_labels));
    svm_train_labels(svm_train_labels == 0) = -1;
    svm_model{i} = fitcsvm(double(train_image_feats), svm_train_labels, 'KernelFunction', 'rbf');
end
confidence = zeros(num_categories, size(test_image_feats,1));
for i = 1:num_categories
    [~, scores] = predict(svm_model{i}, test_image_feats);
    confidence(i,:) = scores(:,2)';              % score for the positive class
end
[~, index] = max(confidence);                    % winning classifier per test image
predicted_categories = cell(size(test_image_feats,1), 1);
for i = 1:size(test_image_feats,1)
    predicted_categories{i,1} = categories{index(i)};
end
Confusion Matrix for bags of SIFT + non-linear SVM
We get an accuracy of 57.4%. The values of the free parameters are: step size = 4 and 10 for build_vocabulary.m and get_bags_of_sifts.m respectively, vocab size = 200. The runtime is ~16 mins.
Implemented GIST features using the LMgist.m and imresizecrop.m functions from Oliva et al., and used them with the linear SVM classifier. The code for get_gist_feats.m is as follows:
% GIST parameters (from Oliva et al.'s code)
param.imageSize = [256 256];                     % images are resized/cropped to 256x256
param.orientationsPerScale = [8 8 8 8];          % 32 Gabor filters over 4 scales
param.numberBlocks = 4;                          % 4x4 grid -> 32*16 = 512-D descriptor
param.fc_prefilt = 4;
for i = 1:size(image_paths,1)
    im = single(imread(image_paths{i}));
    image_feats(i,:) = LMgist(im, '', param);
    image_feats(i,:) = image_feats(i,:) / sum(image_feats(i,:));  % L1 normalize
end
Confusion Matrix for GIST + SIFT + linear SVM
We get an accuracy of 60.3%. LAMBDA = 0.000001. The runtime is ~35 mins.
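The confusion matrix above is labeled GIST + SIFT; assuming the two feature sets were simply concatenated per image before training the linear SVM (the combination code is not shown above), a minimal sketch:

% Hypothetical combination step (assumed, not part of the original listings):
gist_feats = get_gist_feats(image_paths);        % 512-D GIST descriptor per image
sift_feats = get_bags_of_sifts(image_paths);     % vocab_size-D histogram per image
image_feats = [gist_feats, sift_feats];          % concatenated features for the SVM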
Implemented a more sophisticated feature encoding scheme: Fisher encoding. The code for get_fisher_feats.m is as follows:
% Stage 1: fit a GMM to densely-sampled SIFT descriptors (run once and cached)
features = [];
for i = 1:size(image_paths,1)
    [~, features1] = vl_dsift(single(imread(image_paths{i})), 'norm', 'step', 8);
    features = [features, features1];            % pool descriptors across all images
end
[means, covariances, priors] = vl_gmm(double(features), clusters);  % clusters = number of GMM components
% Stage 2: load the cached GMM and Fisher-encode each image
load('vocab_fisher_20.mat', 'means', 'covariances', 'priors')
for i = 1:size(image_paths,1)
    [~, features] = vl_dsift(single(imread(image_paths{i})), 'norm', 'step', 8);
    image_feats(i,:) = vl_fisher(double(features), means, covariances, priors, 'normalized');
end
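For reference, vl_fisher returns a vector of dimension 2DK for D-dimensional descriptors and a K-component GMM, so with 128-D SIFT and the 20-cluster model loaded above each image feature is 2 × 128 × 20 = 5120 dimensional.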
Number of Clusters | Accuracy | Runtime
---|---|---
10 | 49.3% | ~12 minutes
20 | 53.8% | ~13 minutes
50 | 50.3% | ~13 minutes
100 | 52.7% | ~14 minutes
Results visualization for best performing recognition pipeline.
Accuracy (mean of diagonal of confusion matrix) is 65.5% using SIFT + SVM with vocab size = 1000.
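Since the confusion matrix rows are normalized so that each row sums to 1, this number is simply the mean of the diagonal (variable name hypothetical):

accuracy = mean(diag(confusion_matrix));         % mean per-category accuracy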
(The sample training image and true positive thumbnails from the original page are omitted; the labels below are the captions of the false positive and false negative thumbnails.)

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.470 | Store, Bedroom | Office, InsideCity
Store | 0.490 | LivingRoom, Kitchen | TallBuilding, Industrial
Bedroom | 0.420 | Coast, Store | LivingRoom, Kitchen
LivingRoom | 0.340 | Bedroom, InsideCity | Office, Kitchen
Office | 0.880 | LivingRoom, Store | Kitchen, Bedroom
Industrial | 0.580 | Kitchen, Store | InsideCity, Bedroom
Suburb | 0.980 | Industrial, OpenCountry | LivingRoom, Store
InsideCity | 0.510 | Store, LivingRoom | Coast, TallBuilding
TallBuilding | 0.720 | Street, Bedroom | Mountain, Industrial
Street | 0.630 | Bedroom, Store | TallBuilding, Industrial
Highway | 0.810 | Street, LivingRoom | Bedroom, LivingRoom
OpenCountry | 0.480 | Coast, Mountain | Mountain, Coast
Coast | 0.770 | OpenCountry, OpenCountry | Highway, Mountain
Mountain | 0.830 | Forest, Store | OpenCountry, Suburb
Forest | 0.920 | OpenCountry, Store | Store, Mountain