The goal of this project is to implement a sliding window face detector. The sliding window model is conceptually simple: independently classify every image patch as object or non-object. Sliding window classification is the dominant paradigm in object detection, and for one object category in particular -- faces -- it is one of the most notable successes of computer vision. It involves the following steps:
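The core of the paradigm can be sketched as follows (an illustrative Python version, since the project code itself is MATLAB; `classify` is a hypothetical per-patch scoring callback, not part of the project):

```python
import numpy as np

def sliding_window_detect(img, classify, win=36, step=6, threshold=0.0):
    """Score every win x win patch of img; keep patches scoring above threshold.

    classify is a hypothetical callback mapping a patch to a confidence score.
    Returns a list of (x_min, y_min, x_max, y_max, score) tuples."""
    detections = []
    for y in range(0, img.shape[0] - win + 1, step):
        for x in range(0, img.shape[1] - win + 1, step):
            score = classify(img[y:y + win, x:x + win])
            if score > threshold:
                detections.append((x, y, x + win - 1, y + win - 1, score))
    return detections
```

In the real detector the patch is represented by its HoG descriptor and the scoring function is a learned linear SVM, as described below.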
This function returns all positive training examples (faces) from the 36x36 images in 'train_path_pos'. Each face is converted into a HoG template according to 'feature_params'; with a cell size of 6, each 36x36 face yields a (36/6)^2 * 31 = 1116-dimensional descriptor (vl_hog's default UoCTTI variant has 31 channels per cell).
% Descriptor dimensionality: (template_size/cell_size)^2 cells x 31 HoG channels
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_pos = zeros(num_images, D);   % preallocate one row per face
for i = 1:num_images
    img = im2single(imread(fullfile(train_path_pos,image_files(i).name)));
    features_pos(i,:) = reshape(vl_hog(img,feature_params.hog_cell_size),[1,D]);
end
This function returns negative training examples (non-faces) from any images in 'non_face_scn_path'. Images are converted to grayscale because the positive training data is only available in grayscale.
% n: target number of random patches per image (set by the caller)
features_neg = zeros(num_images*n, D);  % preallocate; trimmed below
idx = 1;   % running row index (avoids zero-filled gaps when sample_num < n)
for i = 1:num_images
    img = im2single(rgb2gray(imread(fullfile(non_face_scn_path,image_files(i).name))));
    % valid top-left corners for a template-sized patch
    x = size(img,2) - feature_params.template_size;
    y = size(img,1) - feature_params.template_size;
    sample_num = min([n,x,y]);
    sample_x = randsample(x,sample_num);
    sample_y = randsample(y,sample_num);
    for j = 1:sample_num
        patch = img(sample_y(j):sample_y(j)+feature_params.template_size-1, ...
                    sample_x(j):sample_x(j)+feature_params.template_size-1);
        features_neg(idx,:) = reshape(vl_hog(patch,feature_params.hog_cell_size),[1,D]);
        idx = idx + 1;
    end
end
features_neg = features_neg(1:idx-1,:); % drop unused preallocated rows
This function trains a linear classifier by calling vl_svmtrain on the features returned above. I set lambda to 0.0001.
X = cat(1,features_pos,features_neg);   % stack positives and negatives row-wise
Y = cat(1,ones(size(features_pos,1),1),-1*ones(size(features_neg,1),1));  % labels: +1 face, -1 non-face
lambda = 0.0001;
[w,b] = vl_svmtrain(X',Y',lambda);      % vl_svmtrain expects D x N features
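For intuition, the objective vl_svmtrain minimizes (an L2-regularized hinge loss) can be reproduced with a plain subgradient-descent sketch in Python. This is illustrative only: vl_svmtrain uses a much faster solver, and the function and parameter names here are my own.

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, lr=0.1, epochs=200):
    """Minimize lam/2*||w||^2 + mean(max(0, 1 - y_i*(w.x_i + b))).

    X: (n, d) feature rows; y: labels in {+1, -1}. A toy sketch of the
    objective, not of vl_svmtrain's actual algorithm."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # margin violators
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

The small lambda (0.0001) keeps regularization weak, which worked well here since the HoG features are already fairly low-dimensional relative to the number of training examples.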
This function returns detections on all of the images in a given path. I convert each test image to HoG feature space with a _single_ call to vl_hog per scale. I then step over the HoG cells, taking groups of cells that match the size of the learned template and classifying them. If the classification score is above some confidence threshold, I keep the detection and then pass all detections for an image to non-maximum suppression.
n = feature_params.template_size / feature_params.hog_cell_size;  % template width in cells
cur_x_min = []; cur_y_min = []; cur_x_max = []; cur_y_max = []; cur_confidences = [];
for s = 1:length(scales)
    scale_img = imresize(img,scales(s));
    hog_feat = vl_hog(scale_img,feature_params.hog_cell_size);
    for j = 1:size(hog_feat,1)-n+1       % +1 so the last window is included
        for k = 1:size(hog_feat,2)-n+1
            temp_hog_feat = reshape(hog_feat(j:j+n-1,k:k+n-1,:),[1,D]);
            score = temp_hog_feat*w + b; % linear SVM decision value
            if score > threshold
                % box in scaled-image pixel coordinates
                y_min = (j-1)*feature_params.hog_cell_size;
                x_min = (k-1)*feature_params.hog_cell_size;
                y_max = y_min + feature_params.template_size - 1;
                x_max = x_min + feature_params.template_size - 1;
                % map back to original image coordinates
                y_min = floor(y_min/scales(s)) + 1;
                x_min = floor(x_min/scales(s)) + 1;
                y_max = floor(y_max/scales(s)) + 1;
                x_max = floor(x_max/scales(s)) + 1;
                cur_x_min = [cur_x_min;x_min];
                cur_y_min = [cur_y_min;y_min];
                cur_x_max = [cur_x_max;x_max];
                cur_y_max = [cur_y_max;y_max];
                cur_confidences = [cur_confidences;score];
            end
        end
    end
end
cur_bboxes = [cur_x_min,cur_y_min,cur_x_max,cur_y_max];
cur_image_ids(1:size(cur_bboxes,1),1) = {test_scenes(i).name};
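The non-maximum suppression step these detections are passed to can be sketched in Python as a greedy IoU-based filter. This is an illustrative version, not the starter code's actual NMS routine; the function name and IoU threshold here are my own.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """boxes: (N, 4) rows of [x_min, y_min, x_max, y_max]; returns kept indices.

    Greedily keeps the highest-scoring box and suppresses any remaining box
    whose intersection-over-union with it exceeds iou_thresh."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1 + 1) * np.maximum(0, yy2 - yy1 + 1)
        area_i = (boxes[i, 2] - boxes[i, 0] + 1) * (boxes[i, 3] - boxes[i, 1] + 1)
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0] + 1) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1] + 1)
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop suppressed boxes
    return keep
```

Because the sliding window fires at many nearby positions and scales around each true face, this step is what collapses those overlapping responses into a single detection.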
Curves for HOG cell size = 6
We get an average precision of 82.5%. The values of the free parameters are: hog_cell_size = 6, threshold = 0.2 and lambda = 0.0001. The runtime is ~10 mins.
Curves for HOG cell size = 3
We get an average precision of 88.9%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001.
This function mines hard negatives: negative windows whose score under the current SVM (w, b) exceeds a threshold. These hard negatives are then appended to the original negative feature list, and the weights and bias are recomputed.
index = 1;   % up to 5000 hard negatives are collected, then appended to the random negatives
for i = 1:num_images
    img = im2single(rgb2gray(imread(fullfile(non_face_scn_path,image_files(i).name))));
    if index > 5000                      % stop once the cap is reached
        break;
    end
    for s = 1:length(scales)
        scale_img = imresize(img,scales(s));
        hog_feat = vl_hog(scale_img,feature_params.hog_cell_size);
        for j = 1:size(hog_feat,1)-n+1   % +1 so the last window is included
            for k = 1:size(hog_feat,2)-n+1
                temp_hog_feat = reshape(hog_feat(j:j+n-1,k:k+n-1,:),[1,D]);
                score = temp_hog_feat*w + b;
                if score > threshold && index <= 5000   % false positive under current model
                    features_neg(index,:) = temp_hog_feat;
                    index = index + 1;
                end
            end
        end
    end
end
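Stripped of the sliding-window bookkeeping, the mining rule itself is simple: keep any window known to contain no face that the current model nevertheless scores above the threshold. A minimal Python sketch (illustrative; the names are my own):

```python
import numpy as np

def mine_hard_negatives(feats, w, b, threshold, cap=5000):
    """feats: (N, D) descriptors of windows known to contain no face.

    Returns up to `cap` rows that the current linear SVM (w, b) wrongly
    scores above threshold; these false positives are appended to the
    negative set before retraining."""
    scores = feats @ w + b
    hard = feats[scores > threshold]
    return hard[:cap]
```

Retraining on these mined examples focuses the classifier on exactly the non-face patterns it currently confuses with faces.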
Curves for HOG cell size = 3
We get an average precision of 65.9%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001. Learning the hard negatives helps exclude false positives, but it also rejects some true positives. Since the average precision computation provided in the starter code doesn't penalize false positives, excluding them brings no benefit, while the lost true positives lower the score. Hard negative mining would probably matter more if we had a strict budget of negative training examples or a more expressive, non-linear classifier that could benefit from more training data.
I modified get_positive_features() so that it includes the features of each image and of its horizontally flipped copy.
features_pos = zeros(2*num_images, D);   % two descriptors per face: original + mirror
for i = 1:num_images
    img = im2single(imread(fullfile(train_path_pos,image_files(i).name)));
    features_pos(2*i-1,:) = reshape(vl_hog(img,feature_params.hog_cell_size),[1,D]);
    flip_img = fliplr(img);              % a horizontal mirror is still a valid face
    features_pos(2*i,:) = reshape(vl_hog(flip_img,feature_params.hog_cell_size),[1,D]);
end
Curves for HOG cell size = 3
We get an average precision of 89.4%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001. Augmenting the positive set with horizontally flipped copies of the training faces improves the average precision (89.4% vs. 88.9% without flipping).