The goal of this project is to implement a sliding window face detector. The sliding window model is conceptually simple: independently classify every image patch as object or non-object. Sliding window classification is the dominant paradigm in object detection, and for one object category in particular -- faces -- it is one of the most notable successes of computer vision. It involves the following steps:
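The core of the paradigm can be sketched as follows (an illustrative Python version, since the project code itself is MATLAB; `classify` is a hypothetical per-patch scoring callback, not part of the project):

```python
import numpy as np

def sliding_window_detect(img, classify, win=36, step=6, threshold=0.0):
    """Score every win x win patch of img; keep patches scoring above threshold.

    classify is a hypothetical callback mapping a patch to a confidence score.
    Returns a list of (x_min, y_min, x_max, y_max, score) tuples."""
    detections = []
    for y in range(0, img.shape[0] - win + 1, step):
        for x in range(0, img.shape[1] - win + 1, step):
            score = classify(img[y:y + win, x:x + win])
            if score > threshold:
                detections.append((x, y, x + win - 1, y + win - 1, score))
    return detections
```

In the real detector the patch is represented by its HoG descriptor and the scoring function is a learned linear SVM, as described below.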
This function returns all positive training examples (faces) from the 36x36 images in 'train_path_pos'. Each face is converted into a HoG template according to 'feature_params'; with a cell size of 6, each 36x36 face yields a (36/6)^2 * 31 = 1116-dimensional descriptor (vl_hog's default UoCTTI variant has 31 channels per cell).
% Descriptor dimensionality: (template_size/cell_size)^2 cells x 31 HoG channels
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_pos = zeros(num_images, D);   % preallocate one row per face
for i = 1:num_images
    img = im2single(imread(fullfile(train_path_pos,image_files(i).name)));
    features_pos(i,:) = reshape(vl_hog(img,feature_params.hog_cell_size),[1,D]);
end
This function returns negative training examples (non-faces) from any images in 'non_face_scn_path'. Images are converted to grayscale because the positive training data is only available in grayscale.
% n: target number of random patches per image (set by the caller)
features_neg = zeros(num_images*n, D);  % preallocate; trimmed below
idx = 1;   % running row index (avoids zero-filled gaps when sample_num < n)
for i = 1:num_images
    img = im2single(rgb2gray(imread(fullfile(non_face_scn_path,image_files(i).name))));
    % valid top-left corners for a template-sized patch
    x = size(img,2) - feature_params.template_size;
    y = size(img,1) - feature_params.template_size;
    sample_num = min([n,x,y]);
    sample_x = randsample(x,sample_num);
    sample_y = randsample(y,sample_num);
    for j = 1:sample_num
        patch = img(sample_y(j):sample_y(j)+feature_params.template_size-1, ...
                    sample_x(j):sample_x(j)+feature_params.template_size-1);
        features_neg(idx,:) = reshape(vl_hog(patch,feature_params.hog_cell_size),[1,D]);
        idx = idx + 1;
    end
end
features_neg = features_neg(1:idx-1,:); % drop unused preallocated rows
This function trains a linear classifier by calling vl_svmtrain on the features returned above. I set lambda to 0.0001.
X = cat(1,features_pos,features_neg);   % stack positives and negatives row-wise
Y = cat(1,ones(size(features_pos,1),1),-1*ones(size(features_neg,1),1));  % labels: +1 face, -1 non-face
lambda = 0.0001;
[w,b] = vl_svmtrain(X',Y',lambda);      % vl_svmtrain expects D x N features
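For intuition, the objective vl_svmtrain minimizes (an L2-regularized hinge loss) can be reproduced with a plain subgradient-descent sketch in Python. This is illustrative only: vl_svmtrain uses a much faster solver, and the function and parameter names here are my own.

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, lr=0.1, epochs=200):
    """Minimize lam/2*||w||^2 + mean(max(0, 1 - y_i*(w.x_i + b))).

    X: (n, d) feature rows; y: labels in {+1, -1}. A toy sketch of the
    objective, not of vl_svmtrain's actual algorithm."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # margin violators
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

The small lambda (0.0001) keeps regularization weak, which worked well here since the HoG features are already fairly low-dimensional relative to the number of training examples.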
This function returns detections on all of the images in a given path. I convert each test image to HoG feature space with a _single_ call to vl_hog per scale. I then step over the HoG cells, taking groups of cells that match the size of the learned template and classifying them. If the classification score is above some confidence threshold, I keep the detection and then pass all detections for an image to non-maximum suppression.
n = feature_params.template_size / feature_params.hog_cell_size;  % template width in cells
cur_x_min = []; cur_y_min = []; cur_x_max = []; cur_y_max = []; cur_confidences = [];
for s = 1:length(scales)
    scale_img = imresize(img,scales(s));
    hog_feat = vl_hog(scale_img,feature_params.hog_cell_size);
    for j = 1:size(hog_feat,1)-n+1       % +1 so the last window is included
        for k = 1:size(hog_feat,2)-n+1
            temp_hog_feat = reshape(hog_feat(j:j+n-1,k:k+n-1,:),[1,D]);
            score = temp_hog_feat*w + b; % linear SVM decision value
            if score > threshold
                % box in scaled-image pixel coordinates
                y_min = (j-1)*feature_params.hog_cell_size;
                x_min = (k-1)*feature_params.hog_cell_size;
                y_max = y_min + feature_params.template_size - 1;
                x_max = x_min + feature_params.template_size - 1;
                % map back to original image coordinates
                y_min = floor(y_min/scales(s)) + 1;
                x_min = floor(x_min/scales(s)) + 1;
                y_max = floor(y_max/scales(s)) + 1;
                x_max = floor(x_max/scales(s)) + 1;
                cur_x_min = [cur_x_min;x_min];
                cur_y_min = [cur_y_min;y_min];
                cur_x_max = [cur_x_max;x_max];
                cur_y_max = [cur_y_max;y_max];
                cur_confidences = [cur_confidences;score];
            end
        end
    end
end
cur_bboxes = [cur_x_min,cur_y_min,cur_x_max,cur_y_max];
cur_image_ids(1:size(cur_bboxes,1),1) = {test_scenes(i).name};
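The non-maximum suppression step these detections are passed to can be sketched in Python as a greedy IoU-based filter. This is an illustrative version, not the starter code's actual NMS routine; the function name and IoU threshold here are my own.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """boxes: (N, 4) rows of [x_min, y_min, x_max, y_max]; returns kept indices.

    Greedily keeps the highest-scoring box and suppresses any remaining box
    whose intersection-over-union with it exceeds iou_thresh."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1 + 1) * np.maximum(0, yy2 - yy1 + 1)
        area_i = (boxes[i, 2] - boxes[i, 0] + 1) * (boxes[i, 3] - boxes[i, 1] + 1)
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0] + 1) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1] + 1)
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop suppressed boxes
    return keep
```

Because the sliding window fires at many nearby positions and scales around each true face, this step is what collapses those overlapping responses into a single detection.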
Curves for HOG cell size = 6
We get an average precision of 82.5%. The values of the free parameters are: hog_cell_size = 6, threshold = 0.2 and lambda = 0.0001. The runtime is ~10 mins.
Curves for HOG cell size = 3
We get an average precision of 88.9%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001.
This function mines hard negatives: negative windows whose score under the current SVM (w, b) exceeds a threshold. These hard negatives are then appended to the original negative feature list, and the weights and bias are recomputed.
index = 1;   % up to 5000 hard negatives are collected, then appended to the random negatives
for i = 1:num_images
    img = im2single(rgb2gray(imread(fullfile(non_face_scn_path,image_files(i).name))));
    if index > 5000                      % stop once the cap is reached
        break;
    end
    for s = 1:length(scales)
        scale_img = imresize(img,scales(s));
        hog_feat = vl_hog(scale_img,feature_params.hog_cell_size);
        for j = 1:size(hog_feat,1)-n+1   % +1 so the last window is included
            for k = 1:size(hog_feat,2)-n+1
                temp_hog_feat = reshape(hog_feat(j:j+n-1,k:k+n-1,:),[1,D]);
                score = temp_hog_feat*w + b;
                if score > threshold && index <= 5000   % false positive under current model
                    features_neg(index,:) = temp_hog_feat;
                    index = index + 1;
                end
            end
        end
    end
end
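Stripped of the sliding-window bookkeeping, the mining rule itself is simple: keep any window known to contain no face that the current model nevertheless scores above the threshold. A minimal Python sketch (illustrative; the names are my own):

```python
import numpy as np

def mine_hard_negatives(feats, w, b, threshold, cap=5000):
    """feats: (N, D) descriptors of windows known to contain no face.

    Returns up to `cap` rows that the current linear SVM (w, b) wrongly
    scores above threshold; these false positives are appended to the
    negative set before retraining."""
    scores = feats @ w + b
    hard = feats[scores > threshold]
    return hard[:cap]
```

Retraining on these mined examples focuses the classifier on exactly the non-face patterns it currently confuses with faces.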
Curves for HOG cell size = 3
We get an average precision of 65.9%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001. Learning the hard negatives helps exclude false positives, but it also rejects some true positives. Since the average precision computation provided in the starter code doesn't penalize false positives, excluding them brings no benefit, while the lost true positives lower the score. Hard negative mining would probably matter more if we had a strict budget of negative training examples or a more expressive, non-linear classifier that could benefit from more training data.
I modified get_positive_features() so that it includes the features of each image and of its horizontally flipped copy.
features_pos = zeros(2*num_images, D);   % two descriptors per face: original + mirror
for i = 1:num_images
    img = im2single(imread(fullfile(train_path_pos,image_files(i).name)));
    features_pos(2*i-1,:) = reshape(vl_hog(img,feature_params.hog_cell_size),[1,D]);
    flip_img = fliplr(img);              % a horizontal mirror is still a valid face
    features_pos(2*i,:) = reshape(vl_hog(flip_img,feature_params.hog_cell_size),[1,D]);
end
Curves for HOG cell size = 3
We get an average precision of 89.4%. The values of the free parameters are: hog_cell_size = 3, threshold = 0.2 and lambda = 0.0001. Augmenting the positive set with horizontally flipped copies of the training faces improves the average precision (89.4% vs. 88.9% without flipping).