This project was commissioned by Bluusun LLC, a company serving the Cupid Media network. The customer wanted a rating system that could rate each user's face photo by internal algorithms (without involving other users' input). At that moment they were starting a pilot project aiming to introduce a new matching system and use it with the existing databases of their network of *cupid websites. The new project is supposed to appear at mightydating.com. So I decided to develop a human facial beauty estimator based on one of the existing facial feature extraction libraries.
The first attempt was to use the cmusatyalab OpenFace library for face feature extraction. The library is aimed at face recognition tasks, extracting features related to face geometry. Apparently this means losing some information useful for a beauty estimator, like skin texture and hairstyle. However, I gave it a try, given that face geometry makes the greatest impact on overall facial beauty perception. The library uses dlib for face extraction/alignment and Torch as a deep learning framework.
Initially I used the database of the thaicupid.com website, which was provided in JSON format. I wrote a Python script to scrape about 4000 images from this dating website. I decided to start with a binary classifier which would be able to separate the top 10% of images from all others. The images were manually labeled by me with a 1/0 rating.
OpenFace extracts 128 floats from an image containing a human face. I ran it on my database and got a resulting CSV file with 129 columns (including the image id).
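Loading that file into a design matrix is straightforward; here is a sketch (the helper name `load_embeddings` and the exact CSV layout, id first with no header row, are my assumptions):

```python
import numpy as np
import pandas as pd

def load_embeddings(csv_path):
    """Load the 129-column CSV: the first column is the image id,
    the remaining 128 columns are the OpenFace embedding."""
    df = pd.read_csv(csv_path, header=None)
    ids = df.iloc[:, 0].values                     # image ids
    X = df.iloc[:, 1:].values.astype(np.float64)   # 128-d feature vectors
    return ids, X
```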
OpenFace is distributed as a Docker container, so I had to push my code into the container each time to run it. The following script runs a Python script (provided as an input argument) in the openface directory inside the container. It also mounts the current directory so that the script can read the needed inputs.
```sh
INARGS="$@"
docker run -t -i --rm -v "$(pwd)":/root/openface/openface-rate \
    -w /root/openface/openface-rate openface sh -c "python $INARGS"
```

(Note that `INARGS` has to be assigned on its own line: as a prefix assignment it would not be set yet when the invoking shell expands `"python $INARGS"`.)
I then used the `sklearn` framework to train an SVM classifier. I actually tested all of the provided classifiers, but the others performed worse than SVM. First of all, the data was scaled.
One has to use the `class_weight="balanced"` parameter of the SVM classifier for skewed data. This gives more weight to the less frequent labels ("1", or "beautiful", in this case); otherwise they would have almost no effect compared to the "0" samples, given the summation in the SVM loss formula.
```python
clf = GridSearchCV(svm.SVC(class_weight="balanced"), tuned_params, cv=cv, scoring="f1", verbose=True)
```
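For reference, the "balanced" heuristic assigns each class a weight of `n_samples / (n_classes * count_of_class)`, so with a 10%/90% split the positive class is weighted nine times heavier. A quick sketch reproducing the formula (the helper name is mine):

```python
import numpy as np

def balanced_weights(y):
    """Reproduce sklearn's class_weight="balanced" heuristic:
    weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes, weights))
```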
Then I used `GridSearchCV` to perform a hyperparameter search over C and gamma for different SVM kernels. Since the data is skewed, I used F1 scoring. The `matplotlib` library was used to visualize the heatmap (many more values were actually searched):
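Such a heatmap can be produced from the fitted `GridSearchCV` object; a minimal sketch, assuming the grid covers only `C` and `gamma` (sklearn's parameter grid iterates keys alphabetically, so `C` varies slowest and the scores reshape cleanly):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def plot_f1_heatmap(grid_search, Cs, gammas, out_path="heatmap.png"):
    """Reshape mean CV F1 scores into a C x gamma grid and render it."""
    scores = np.array(grid_search.cv_results_["mean_test_score"])
    scores = scores.reshape(len(Cs), len(gammas))
    plt.imshow(scores, interpolation="nearest", cmap=plt.cm.hot)
    plt.xlabel("gamma"); plt.ylabel("C")
    plt.xticks(range(len(gammas)), gammas, rotation=45)
    plt.yticks(range(len(Cs)), Cs)
    plt.colorbar(label="mean F1")
    plt.savefig(out_path)
    plt.close()
    return scores
```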
The F1 score was obviously exceptionally low. In order to diagnose a bias/variance problem and determine the next steps, I plotted the learning curves:
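The curves can be produced with `sklearn.model_selection.learning_curve`; a minimal sketch (the train-size steps and fold count are my assumptions):

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import learning_curve

def get_learning_curves(X, y):
    """Compute train/CV F1 scores for growing training-set sizes.
    Converging low curves suggest high bias; a persistent gap
    between them suggests high variance."""
    sizes, train_scores, valid_scores = learning_curve(
        svm.SVC(class_weight="balanced"), X, y,
        train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="f1")
    return sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)
```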
Later I discovered the model was suffering from a high-bias problem. But at that moment, due to lack of experience and some ambiguity in the curves, I diagnosed a high-variance problem. So I decided to perform some feature selection based on finding correlations in density histograms, but to no avail (some transformations were actually applied to the graphs, but I omit them here).
The above model didn't prove effective, which I suppose is because of the feature extraction engine, which is well suited for face recognition. It looks like geometric traits alone are not sufficient for beauty estimation.
Then I decided to try more feature extraction algorithms/models. After some unsuccessful attempts I found the following working solution. In short, I use the VGG-Very-Deep-16 convolutional neural network to extract face features. On top of those features, a Support Vector Regression with a linear kernel is trained.
First the face is found and extracted from the image, reusing OpenFace code. I modified the scaling part, which uses OpenCV. Now it not only extracts the face from an arbitrary image but also scales and rotates it so that the eyes land at a predefined position. The described estimator is trained on the SCUT-FBP dataset [1]. So the code places a face from an arbitrary image at the same position in the resulting 224x224 snippet.
```python
# Find landmarks on the face image in source image coordinates
landmarks = self.findLandmarks(rgbImg, bb)
landmarks_np = np.float32(landmarks)

# Choose eyes and mouth landmarks from template landmarks
landmark_indices_np = np.array(chosen_landmarks_)

# For chosen landmarks: get template landmark positions in 0..1
# template coordinates and multiply by image height/width (the target
# image is quadratic). Results in the position of template landmarks
# in target image coordinates
tgt_landmarks = image_side * MINMAX_TEMPLATE[landmark_indices_np]

# Shrink eyes and mouth positions by 0.6 and move downwards by 18%.
# This transforms template landmark positions into the approximate
# position inferred from the train database images
tgt_landmarks = scale_transform(np.array(tgt_landmarks),
                                np.float32([image_side/2, image_side/2]), 0.6)
tgt_landmarks = move_transform(tgt_landmarks,
                               np.float32([0, image_side*0.18]))

# Create a transformation so that the chosen landmark positions on the
# source image (taken from landmarks_np) are mapped to the
# corresponding scaled points on the target image
H = cv2.getAffineTransform(landmarks_np[landmark_indices_np], tgt_landmarks)

# Apply the transformation. We get the extracted face with the
# following properties:
#  - square form with width 224;
#  - the image is rotated and scaled so that the face is placed
#    exactly at the position the target estimator expects it to be.
# A white background is added where image borders are exceeded.
result_img = cv2.warpAffine(rgbImg, H, (image_side, image_side),
                            borderValue=(255, 255, 255))
return result_img
```
Image affine transformation to align with the train dataset images
Face features are then extracted by the VGG-Very-Deep-16 convolutional neural network. FC layers 6 and 7 output 2622 floats each. I use the FC layer 7 output, so in my case the network extracts 2622 features. In the original work [2], this output is further passed to a Rectified Linear Unit, dropout, and one more FC layer, finally reaching the softmax classifier. Since we need only feature extraction, not classification, the last 4 layers are not used. Below is the overall structure of the CNN being used.
It has the following peculiarities:

- 13 convolution layers, which makes it a "very deep" network; similar results can also be achieved with a 5-convolution-layer CNN.
- 3 fully connected layers; these are actually the same as convolution layers, but each filter size matches the size of the input data and the number of filters is the desired output size. The last FC layer performs classification according to the number of persons being recognized (not used in this project).
- to add regularization, dropout takes place after the relu6 and relu7 layers (not shown in the table).
The features were extracted from the SCUT-FBP dataset, resulting in a 500x2622 CSV file.
Examples of SCUT-FBP samples
These are seemingly too many features, so I apply dimensionality reduction with `sklearn.decomposition.PCA`. Choosing a 99% variance threshold, I have to keep the first 114 components:
Explained variance vs. number of components
Then I use `pickle` to save the Ureduce matrix, and `pandas` to export the resulting dataset to a `.csv` file with the reduced dimensionality.
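The reduction and export steps can be sketched as follows (the function and file names are hypothetical; `PCA` accepts a fraction as `n_components` and keeps the smallest number of components reaching that variance threshold):

```python
import pickle
import pandas as pd
from sklearn.decomposition import PCA

def reduce_and_save(X, csv_path="features_reduced.csv", pca_path="ureduce.pkl"):
    """Keep the smallest number of components explaining >=99% of the
    variance, save the projection matrix, and export the reduced data."""
    pca = PCA(n_components=0.99)          # fraction => variance threshold
    X_red = pca.fit_transform(X)
    with open(pca_path, "wb") as f:
        pickle.dump(pca.components_, f)   # the Ureduce matrix
    pd.DataFrame(X_red).to_csv(csv_path, index=False)
    return X_red
```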
Now we have a ready dataset, produced by our CNN feature extraction, with dimensionality reduced to somewhat moderate values (still high-dimensional, though). The dataset is labeled, and the labels are stored in an `.xlsx` spreadsheet. Since we need continuous rating output values in the range 1.0-5.0, I use Support Vector Regression as the estimator. I perform an extensive hyperparameter search and get heatmaps similar to the one shown for the previous model. Finally I come up with the following values:
```python
self._clf = svm.SVR(kernel="linear", C=0.00025)
```
With the image extraction procedure, the trained SVR estimator, and the PCA Ureduce matrix saved, the actual usage is the following:
```python
# Extract image (repeated for a set of images actually)
im = extract_image(img_path)
...
im = np.asarray(im, dtype='float64') / 256

# Prepare input image
MEAN_RGB = np.array([129.1863, 104.7624, 93.5940])
# Change axes so that dimensions correspond to those of the CNN input
# and subtract MEAN_RGB. Results in 1 x channels x height x width
im = prepare_image(im)

# Get fully-connected layer 7 output: apply the Rectified Linear Unit
# to layer 6 output and combine the outputs
out = net_caffe.forward(data=floatX(image_list), end='fc7')

# Multiply the resulting vector by the pre-calculated Ureduce matrix
img_features = out['fc7'].reshape((out['fc7'].shape[0],
                                   np.prod(out['fc7'].shape[1:])))
...
# Final design matrix for a set of images
feat_mx = np.concatenate((feat_mx, img_features.copy()), axis=0)
...
# Make predictions for a set of images
pred = clf.predict(feat_mx)
```
Pearson's correlation coefficient has been chosen as the metric to measure beauty ranking accuracy. The SVR estimator has been trained with images from the SCUT-FBP dataset. When assessed on the same dataset, cross-validation with random shuffling is used. Predictably, when assessed on the special Test 200 dataset (described in the next section), the scores are lower. But only the last case can be considered meaningful, because it is obtained on real data. The pool5+fc6 layer outputs were tested as well, as in the referenced work. The resulting accuracies are shown in the table below. Only an excerpt is shown.
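The metric itself is a single call with `scipy`; a sketch of the evaluation, assuming arrays of predicted and human ratings (the helper name is mine):

```python
from scipy.stats import pearsonr

def ranking_accuracy(predicted, human):
    """Pearson correlation between predicted and human beauty ratings:
    1.0 means a perfect linear relationship, 0.0 no correlation."""
    r, p_value = pearsonr(predicted, human)
    return r
```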
```octave
function [J, grad] = computeCost(s, id_list, pairs, lambda)
  J = 0;
  % compute cost J
  for p = pairs
    larger_id = p(1);
    s_i_plus = s * (id_list == larger_id);
    smaller_id = p(2);
    s_i_minus = s * (id_list == smaller_id);
    J += exp(-(s_i_plus - s_i_minus));
  end
  J += lambda * s * s';

  p_first_row = pairs(1, :);
  p_second_row = pairs(2, :);
  N = length(s);
  grad = zeros(N, 1);
  % compute gradient w.r.t. each s
  for i = 1:length(s)
    id = id_list(i);
    % id pairs, where id(i) (corresponding to s_i) compares +
    s_more_pairs = pairs(:, find(p_first_row == id));
    % ids, where id(i) compares + to this id
    s_more_ids = unique(s_more_pairs(2, :));
    % s_x corresponding to those ids
    s_more = s(:, find(ismember(id_list, s_more_ids)));
    grad(i) += exp(-s(i)) * sum(exp(s_more));
    % same but comparing -
    s_less_pairs = pairs(:, find(p_second_row == id));
    s_less_ids = unique(s_less_pairs(1, :));
    s_less = s(:, find(ismember(id_list, s_less_ids)));
    grad(i) += exp(s(i)) * sum(exp(-s_less));
    grad(i) += 2 * lambda * s(i);
  end
end
```
With the above cost/gradient function the minimization problem is solved in one call:
```octave
[s, cost] = fminunc(@(s_var)(computeCost(s_var, id_list, pairs, lambda)), s);
```
The resulting `s` contains absolute ratings according to the input pairwise comparisons. It may require some scaling to fit the scale used by the estimator.
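The same minimization can be reproduced in Python with `scipy.optimize.minimize`; a compact sketch of the cost from the Octave code above, with the gradient left to numerical differentiation for brevity (the function name and pair encoding are my assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_ratings(n_items, pairs, lam=0.1):
    """Recover absolute ratings s from pairwise comparisons.
    `pairs` is a list of (winner, loser) index tuples. The cost mirrors
    the Octave computeCost: exp(-(s_winner - s_loser)) summed over the
    comparisons, plus L2 regularization."""
    def cost(s):
        J = lam * np.dot(s, s)
        for winner, loser in pairs:
            J += np.exp(-(s[winner] - s[loser]))
        return J
    return minimize(cost, np.zeros(n_items)).x
```

The cost is convex, so the default quasi-Newton solver finds the unique minimum; items that win their comparisons end up with higher ratings.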
The developed estimator enables facial beauty estimation for an arbitrary image (URL). The underlying convolutional NN is the very deep network developed by VGG. The scikit-learn library was used on top of its outputs. The Test 200 labeled dataset was composed from pairwise ratings obtained with a Python utility and an Octave script developed for this purpose. Different CNN architectures/hyperparameters/preprocessing options were explored for maximum correlation between the estimator and the human-produced rating. The project was successfully shipped and deployed on the customer site.
[1] Duorui Xie, Lingyu Liang, Lianwen Jin, Jie Xu, and Mengru Li. "SCUT-FBP: A Benchmark Dataset for Facial Beauty Perception." IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2015.
[2] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. "Deep Face Recognition." 2015.
[3] Douglas Gray, Kai Yu, Wei Xu, and Yihong Gong. "Predicting Facial Beauty without Landmarks." 2010.