Project 5: Face Detection with a Sliding Window

The goal of this project is to implement a sliding window face detector. The sliding window model is conceptually simple: independently classify every image patch as object or non-object. Sliding window classification is the dominant paradigm in object detection, and for one object category in particular -- faces -- it is one of the most noticeable successes of computer vision. It involves the following steps:

  1. Load cropped positive training examples
  2. Sample random negative examples
  3. Train a classifier from these examples
  4. Run the classifier on the test set

Part 1: Load cropped positive training examples

This function returns all positive training examples (faces) from the 36x36 images in 'train_path_pos'. Each face is converted into a HOG template according to 'feature_params'.


		% One HOG descriptor per cropped face; D = (template_size/hog_cell_size)^2 * 31
		features_pos = zeros(num_images, D, 'single');
		for i = 1:num_images
			img = im2single(imread(fullfile(train_path_pos,image_files(i).name)));
			features_pos(i,:) = reshape(vl_hog(img,feature_params.hog_cell_size),[1,D]);
		end
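
For reference, the snippets in this report assume a feature_params struct like the following sketch (the 36x36 template and 6-pixel cells come from the project defaults; hog_cell_size is lowered to 3 in the later experiments):

		% feature_params as assumed throughout this report
		feature_params = struct('template_size', 36, ...   % pixels per face crop
		                        'hog_cell_size', 6);       % pixels per HOG cell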
		

Part 2: Sample random negative examples

This function returns negative training examples (non-faces) from any images in 'non_face_scn_path'. Images are converted to grayscale because the positive training data is only available in grayscale.


		for i = 1:num_images
			img = imread(fullfile(non_face_scn_path,image_files(i).name));
			if size(img,3) > 1   % some scene images may already be grayscale
				img = rgb2gray(img);
			end
			img = im2single(img);
			% Valid top-left corners for a template-sized patch
			x = size(img,2) - feature_params.template_size;
			y = size(img,1) - feature_params.template_size;
			% Draw up to n patches per image, without replacement
			sample_num = min([n,x,y]);
			sample_x = randsample(x,sample_num);
			sample_y = randsample(y,sample_num);
			for j = 1:sample_num
				patch = img(sample_y(j):sample_y(j)+feature_params.template_size-1, ...
				            sample_x(j):sample_x(j)+feature_params.template_size-1);
				features_neg((i-1)*n+j,:) = reshape(vl_hog(patch,feature_params.hog_cell_size),[1,D]);
			end
		end
		

Part 3: Train a classifier from these examples

This function calls vl_svmtrain on the combined positive and negative features. I set lambda to 0.0001.


		% Stack features and +/-1 labels; vl_svmtrain expects one column per example
		X = cat(1,features_pos,features_neg);
		Y = cat(1,ones(size(features_pos,1),1),-1*ones(size(features_neg,1),1));
		lambda = 0.0001;
		[w,b] = vl_svmtrain(X',Y',lambda);
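
As a quick sanity check (an addition for this report, not part of the required pipeline), the learned w and b can score the training set itself; X and Y below are the same variables as in the snippet above:

		% Sanity check: the training examples should be almost perfectly separated
		confidences = X*w + b;                          % one score per example
		fprintf('train accuracy: %.3f\n', mean(sign(confidences) == Y));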
		

Part 4: Run the classifier on the test set

This function returns detections for all of the images in a given path. I convert each test image to HOG feature space with a _single_ call to vl_hog per scale. I then step over the HOG cells, taking groups of cells that are the same size as the learned template and classifying them. If the classification score is above a confidence threshold, I keep the detection and then pass all detections for an image to non-maximum suppression (see the sketch after the code below).


		% scales is assumed to be a decreasing geometric series, e.g. 0.9.^(0:20)
		for s = 1:length(scales)
			scale_img = imresize(img,scales(s));
			hog_feat = vl_hog(scale_img,feature_params.hog_cell_size);
			% Slide an n-by-n block of HOG cells over the feature map
			% (+1 so the last valid window position is included)
			for j = 1:size(hog_feat,1)-n+1
				for k = 1:size(hog_feat,2)-n+1
					temp_hog_feat = reshape(hog_feat(j:j+n-1,k:k+n-1,:),[1,D]);
					score = temp_hog_feat*w + b;
					if score > threshold
						% Box in the rescaled image...
						y_min = (j-1)*feature_params.hog_cell_size;
						x_min = (k-1)*feature_params.hog_cell_size;
						y_max = y_min + feature_params.template_size - 1;
						x_max = x_min + feature_params.template_size - 1;
						% ...mapped back to original image coordinates
						y_min = floor(y_min/scales(s)) + 1;
						x_min = floor(x_min/scales(s)) + 1;
						y_max = floor(y_max/scales(s)) + 1;
						x_max = floor(x_max/scales(s)) + 1;
						cur_x_min = [cur_x_min;x_min];
						cur_y_min = [cur_y_min;y_min];
						cur_x_max = [cur_x_max;x_max];
						cur_y_max = [cur_y_max;y_max];
						cur_confidences = [cur_confidences;score];
					end
				end
			end
		end
		cur_bboxes = [cur_x_min,cur_y_min,cur_x_max,cur_y_max];
		cur_image_ids(1:size(cur_bboxes,1),1) = {test_scenes(i).name};
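
The last step referenced above is non-maximum suppression. A minimal sketch, assuming the starter code's non_max_supr_bbox helper (which returns a logical mask of detections to keep) and the per-image accumulators from the loop above:

		% Suppress overlapping detections, then append the survivors to the
		% global output lists
		is_maximum = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
		cur_bboxes      = cur_bboxes(is_maximum,:);
		cur_confidences = cur_confidences(is_maximum,:);
		cur_image_ids   = cur_image_ids(is_maximum,:);
		bboxes      = [bboxes;      cur_bboxes];
		confidences = [confidences; cur_confidences];
		image_ids   = [image_ids;   cur_image_ids];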
		

Results


Curves for HOG cell size = 6

We get an average precision of 82.5%. The values of the free parameters are: hog_cell_size = 6, threshold = 0.2 and lambda = 0.0001. The runtime is ~10 mins.


Curves for HOG cell size = 3

We get an average precision of 88.9%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001.

Results on class images

Results on other images

Extra Credit: Hard Mining

This function mines hard negatives: it returns negative features whose confidence, computed with the weight and bias of the trained SVM, exceeds a threshold. These hard negatives are then appended to the original negative feature list and the weight and bias are re-calculated (see the retraining sketch after the code below).


		% Cap the number of mined hard negatives at 5000
		for i = 1:num_images
			if index > 5000
				break;
			end
			img = im2single(rgb2gray(imread(fullfile(non_face_scn_path,image_files(i).name))));
			for s = 1:length(scales)
				scale_img = imresize(img,scales(s));
				hog_feat = vl_hog(scale_img,feature_params.hog_cell_size);
				for j = 1:size(hog_feat,1)-n+1
					for k = 1:size(hog_feat,2)-n+1
						temp_hog_feat = reshape(hog_feat(j:j+n-1,k:k+n-1,:),[1,D]);
						score = temp_hog_feat*w + b;
						% A non-face patch scoring above threshold is a false
						% positive: append it as a hard negative
						if score > threshold && index <= 5000
							features_neg(index,:) = temp_hog_feat;
							index = index + 1;
						end
					end
				end
			end
		end
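
After mining, the weight and bias are re-calculated as described above. A minimal retraining sketch, assuming features_neg now holds both the random negatives and the mined hard negatives:

		% Retrain the SVM on the augmented negative set (same lambda as before)
		X = cat(1,features_pos,features_neg);
		Y = cat(1,ones(size(features_pos,1),1),-1*ones(size(features_neg,1),1));
		[w,b] = vl_svmtrain(X',Y',lambda);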
		

Results

Curves for HOG cell size = 3

We get an average precision of 65.9%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001. Learning the hard negatives helps us exclude false positives, but at the same time it also rejects some true positives. Since the accuracy computation provided in the template code doesn't penalize false positives, the average precision tends to be lower. Hard negative mining would probably be more important if we had a strict budget of negative training examples or a more expressive, non-linear classifier that could benefit from more training data.

Results on some images

Extra Credit: Alternative positive training data

I modified get_positive_features() so that it includes the features of each image and of its horizontally flipped copy.


		% Each face contributes two examples: the original and its mirror image
		for i = 1:num_images
			img = im2single(imread(fullfile(train_path_pos,image_files(i).name)));
			features_pos(2*i-1,:) = reshape(vl_hog(img,feature_params.hog_cell_size),[1,D]);
			flip_img = fliplr(img);   % horizontal flip
			features_pos(2*i,:) = reshape(vl_hog(flip_img,feature_params.hog_cell_size),[1,D]);
		end
		

Results

Curves for HOG cell size = 3

We get an average precision of 89.4%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001. We observe that augmenting the positive examples with horizontally flipped copies of the training faces increases the average precision.

Results on some images