Telling Something from Your Face: Age, Beauty, and Anti-Spoofing

Figure 1. Demonstration of facial age, beauty, and pose estimation.

In the last few decades, facial analysis has been explored extensively because of its huge commercial potential. For example, learning the semantic attributes of a human face can help us target potential customers more easily and more efficiently. While face recognition is a well-established technology for surveillance and access control, it is equally crucial to know whether a captured image comes from a real person or from a spoofing device. In this blog post, we discuss what we do at Taiwan AI Labs and introduce our demo system.

Predicting a person's age or beauty from appearance is a highly subjective task. People can look younger than their real age, and beauty is not even a well-defined, quantifiable value. Nevertheless, an approximate apparent age and beauty score is still helpful for merchandise recommendation. For instance, recommending soft drinks to an elderly customer may not be a good idea, nor is placing a massage chair in a children's play zone.

There are several ways to estimate these two values with proper supervision. With classification, independent bins each represent a range of ages or beauty scores; the drawback is that the ranges are set manually, which introduces quantization error. Regression, on the other hand, naturally predicts continuous values, but it can overfit without constraints. In addition, face pose and image resolution have a large impact on prediction performance. These challenges make age and beauty estimation all the more difficult.


Method:

Age and Beauty estimation

To resolve the above challenges at very low cost, the demo system is built upon FSA-Net [1], published in CVPR 2019. It adopts the Soft Stagewise Regression (SSR) scheme [2], which eliminates the quantization error while keeping memory overhead low.

For training, we crop faces to a resolution of 128×128×3 and add an auxiliary loss on the quantized (binned) prediction as extra supervision.
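To make this concrete, here is a minimal sketch of soft, binned age prediction with an auxiliary quantized loss. It assumes a PyTorch-style backbone that outputs per-bin logits; the bin layout, loss weight, and names are illustrative rather than the exact configuration of our system.

```python
import torch
import torch.nn.functional as F

def soft_binned_age(logits, bin_centers):
    """Continuous age as the probability-weighted mean of the bin centers
    (the idea behind the soft stagewise regression scheme)."""
    probs = F.softmax(logits, dim=1)            # (batch, num_bins)
    return (probs * bin_centers).sum(dim=1)     # (batch,)

def age_loss(logits, bin_centers, true_age, aux_weight=0.1):
    """L1 loss on the continuous estimate plus an auxiliary cross-entropy
    loss on the quantized (binned) label."""
    pred_age = soft_binned_age(logits, bin_centers)
    reg_loss = F.l1_loss(pred_age, true_age)
    # Quantized supervision: index of the bin center nearest the true age.
    bin_idx = (true_age.unsqueeze(1) - bin_centers).abs().argmin(dim=1)
    aux_loss = F.cross_entropy(logits, bin_idx)
    return reg_loss + aux_weight * aux_loss

# Example: 11 bins spanning 0-100 years, a batch of 4 faces.
bin_centers = torch.linspace(0, 100, 11)
logits = torch.randn(4, 11)                     # backbone output for 4 faces
true_age = torch.tensor([23.0, 45.0, 8.0, 67.0])
print(soft_binned_age(logits, bin_centers))
print(age_loss(logits, bin_centers, true_age))
```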

A drawback of this scheme is that the prediction can be unstable across different input frames. We therefore use a sequential frame-selection pipeline to stabilize the final prediction.

Figure 2. Sequential frame-selection pipeline.

Finally, only when the detected face has sufficient resolution and a very small pose angle is it considered a valid face and passed on to the estimation pipeline. The replacement policy can be altered, as long as the selected face image is of high quality and sufficient resolution.
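A minimal sketch of such a frame-selection buffer is shown below. The thresholds and buffer size are illustrative assumptions, not the exact values used in the demo system.

```python
from collections import deque

# Illustrative thresholds; the actual values in the demo may differ.
MIN_FACE_SIZE = 96       # pixels on the shorter side of the face crop
MAX_POSE_ANGLE = 15.0    # degrees of yaw/pitch/roll from the pose estimator

class FaceBuffer:
    """Keep the last few valid faces and average their predictions, so a
    single noisy frame does not flip the displayed age/beauty."""

    def __init__(self, size=5):
        self.preds = deque(maxlen=size)

    @staticmethod
    def is_valid(face_size, yaw, pitch, roll):
        return (face_size >= MIN_FACE_SIZE and
                max(abs(yaw), abs(pitch), abs(roll)) <= MAX_POSE_ANGLE)

    def update(self, face_size, yaw, pitch, roll, age, beauty):
        """Add the current frame if it is valid; return smoothed estimates."""
        if self.is_valid(face_size, yaw, pitch, roll):
            self.preds.append((age, beauty))
        if not self.preds:
            return None
        ages, beauties = zip(*self.preds)
        return sum(ages) / len(ages), sum(beauties) / len(beauties)
```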


Face anti-spoofing

To achieve anti-spoofing with pure RGB images, we divide the process into two tasks: cell-phone detection and denoising-based anti-spoofing estimation.

Figure 3. Face anti-spoofing pipeline.

We adopt the well-known YOLOv3 [3] as the detector for cell phones, laptops, and monitors. In task 1, a face detected inside a phone or screen region is considered a spoof, while task 2 uses the quality and noise characteristics of the image to determine whether the face is real or fake.
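As an illustration of task 1, the sketch below flags a face as a spoof when its bounding box lies almost entirely inside a detected device box. The boxes would come from the face detector and YOLOv3; the containment threshold here is an assumption for illustration.

```python
def coverage(face_box, device_box):
    """Fraction of the face box covered by a detected device box
    (phone/laptop/monitor); boxes are (x1, y1, x2, y2) in pixels."""
    fx1, fy1, fx2, fy2 = face_box
    dx1, dy1, dx2, dy2 = device_box
    ix1, iy1 = max(fx1, dx1), max(fy1, dy1)
    ix2, iy2 = min(fx2, dx2), min(fy2, dy2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    face_area = max(1, (fx2 - fx1) * (fy2 - fy1))
    return inter / face_area

def task1_is_spoof(face_box, device_boxes, tol=0.9):
    """Task 1: a face lying (almost) entirely inside a phone, laptop, or
    monitor detection is treated as a spoof."""
    return any(coverage(face_box, d) >= tol for d in device_boxes)

# Example: a face detected inside a phone-screen region.
print(task1_is_spoof((100, 100, 160, 160), [(80, 60, 220, 300)]))  # True
```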


Demo images:

The following demo shows that our system can predict age and beauty with decent accuracy.

 Figure 4. Demo images for age and beauty estimation.


Figure 5. Demo image for face anti-spoofing. (Red: Fake. Blue: Real.)

Summary:

Face attribute estimation such as age and beauty is subjectively determined by the labeled data, but it is still useful for commercial analysis and recommendation. For face recognition, anti-spoofing is also very important for the security and robustness of the whole identity-verification pipeline. We built these prototypes to show that there is much more potential in these topics, and that AI can truly help people make better decisions with these estimations.


Reference:

[1] Yang, Tsun-Yi, et al. “FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation from a Single Image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[2] Yang, Tsun-Yi, et al. “SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation.” IJCAI. Vol. 5. No. 6. 2018.

[3] Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” arXiv preprint arXiv:1804.02767 (2018).


Expanding Computer Vision Multi-View Stereo Capabilities: Automatic Generation of 3-dimensional Models via 360 Camera Footage

As the world we live in is three-dimensional, a 3D model is the most iconic representation of it. 3D modeling allows people to see what they cannot see in 2D: it gives them the ability to see how much space an object occupies from every perspective. Rather than building such models from scratch, we can also generate 3D models from video automatically.

Taiwan AI Labs has built a virtual aerial website, droneye.tw, which hosts a wealth of 360 video footage. This triggered an idea: can we use 360 videos to generate high-quality 3D models within only a few flights, minimizing the cost of obtaining the model? Indeed, it is feasible, whether with a multi-view stereo computer vision algorithm or with a deep learning approach. The concept of 3D reconstruction is illustrated in Figure 1: the 3D coordinates of an object point can be solved from observations of the same point in at least two images.

Fig 1. Concept of 3D reconstruction (modified from Tokkari et al., 2017)
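A minimal sketch of this triangulation idea, using OpenCV with made-up camera matrices and pixel coordinates:

```python
import numpy as np
import cv2

# Toy setup: two cameras with the same intrinsics, one metre apart along x.
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # camera 1 at origin
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])  # camera 2 at x = 1 m

# The same scene point observed in both images (pixel coordinates, 2xN).
pts1 = np.array([[320.], [240.]])
pts2 = np.array([[160.], [240.]])

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # homogeneous 4xN result
X = (X_h[:3] / X_h[3]).ravel()
print(X)   # 3D coordinates in the camera-1 frame, here (0, 0, 5)
```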


360-degree cameras capture the whole scene around the photographer in a single shot and are becoming a new paradigm for photogrammetry: the camera can be pointed in any direction, and the large field of view reduces the number of photographs needed [1]. The camera we used on the earlier flights is the Virb 360, which records video in an equirectangular (panoramic) format, so the footage cannot be used to reconstruct a 3D model directly. We first have to project the equirectangular frames onto perspective images with any desired field of view (Fig 2.) or onto a cubemap, and then use those reprojected images to build our 3D models. This also supports reconstructing 3D models inside buildings: we can choose whichever viewing angles we want, or simply choose the cubemap format, and the reprojected images are used automatically to generate the model.

Fig 2. Different angles of view from the drone
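A minimal sketch of the equirectangular-to-perspective projection, using NumPy and OpenCV; the field of view, viewing angles, and output size are arbitrary illustration values:

```python
import numpy as np
import cv2

def equirect_to_perspective(equi, fov_deg, yaw_deg, pitch_deg, out_w, out_h):
    """Reproject an equirectangular frame onto a pinhole (perspective) view
    looking in the (yaw, pitch) direction with the given horizontal FOV."""
    h, w = equi.shape[:2]
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)   # focal length in pixels

    # Ray direction for every output pixel (camera looks along +z).
    x = np.arange(out_w) - out_w / 2
    y = np.arange(out_h) - out_h / 2
    xv, yv = np.meshgrid(x, y)
    dirs = np.stack([xv, yv, np.full_like(xv, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Combined rotation: pitch about the x-axis, then yaw about the y-axis.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    dirs = dirs @ (Ry @ Rx).T

    # Ray direction -> longitude/latitude -> source pixel in the 360 frame.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])          # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))         # [-pi/2, pi/2]
    map_x = ((lon / np.pi + 1) / 2 * w).astype(np.float32)
    map_y = ((lat / (np.pi / 2) + 1) / 2 * h).astype(np.float32)
    return cv2.remap(equi, map_x, map_y, cv2.INTER_LINEAR)

# e.g. frame = cv2.imread("equirect_frame.jpg")   # placeholder file name
# view = equirect_to_perspective(frame, 90, yaw_deg=30, pitch_deg=0,
#                                out_w=1280, out_h=960)
```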


First of all, the Tanks and Temples website, which presents a benchmark for image-based 3D reconstruction, indicates that 3D models generated by deep learning still do not surpass those from multi-view stereo computer vision algorithms. We implemented both kinds of algorithm and compared the results. The test input is 16 high-resolution photos of the NTU campus taken from a drone. The deep learning algorithm we use is R-MVSNet, an end-to-end deep learning architecture for depth-map inference from multi-view images [9]. The computer vision algorithm is Structure from Motion with semi-global matching. The results show that state-of-the-art deep learning algorithms for 3D reconstruction still have a way to go; nonetheless, they undoubtedly hold great promise for the future.

Fig. The architecture of R-MVSNet (Yao et al., 2019)


Table. Point cloud results of conventional computer vision (semi-global matching) vs. deep learning (R-MVSNet).

Based on the above, we decided to apply Structure from Motion and a patch-matching algorithm to reconstruct our 3D model. The main steps are sparse reconstruction and dense reconstruction. In sparse reconstruction, we use SIFT to match features, which gives us corresponding points across images; we then use bundle adjustment to refine the camera extrinsics and intrinsics by least squares [4]. These sub-steps yield more accurate camera parameters for the dense reconstruction. In dense reconstruction, we first use the camera poses to compute a depth map for each image by semi-global matching [2], and then fuse neighboring depth maps to generate a 3D point cloud. However, the point cloud is rather large, so we simplify the points into a Delaunay triangulation, or mesh, which turns the points into surfaces [5]. Finally, we texture the mesh with the corresponding images [7].

Fig. 3D reconstruction pipeline (Bianco et al., 2018)
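As a minimal sketch of the sparse step for a single image pair (a full pipeline handles many views, bundle adjustment, and dense matching), the following OpenCV snippet matches SIFT features and recovers the relative camera pose, assuming the intrinsic matrix K is known:

```python
import numpy as np
import cv2

def relative_pose(img1, img2, K):
    """Sparse step for one image pair: SIFT matching, essential-matrix
    estimation with RANSAC, and recovery of the relative rotation/translation."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Ratio-test matching of SIFT descriptors.
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    keep = inliers.ravel() == 1
    return R, t, pts1[keep], pts2[keep]
```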


Table. Result of NTU Green House (real picture vs. 3D model).

Table. Result of Tainan Qingping Park (real 360 picture vs. 3D model).

Although 360 video has the advantage of capturing multiple views in a single frame, it has significant drawbacks: the camera intrinsics are unstable, and the resolution drops drastically when projecting to perspective images. We therefore adopt a super-resolution algorithm, ESRGAN, to overcome the low image quality. This strategy not only increases the detail of the 3D models, especially when texturing the mesh, but also densifies the point cloud. To obtain better results, we can train our own model on Taiwan landscape imagery to prevent the bias of an unsuitable pre-trained model and to meet the special needs of drone data in Taiwan.
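Below is a minimal inference sketch of the ×4 upscaling step. It assumes the RRDBNet_arch module and the pre-trained RRDB_ESRGAN_x4.pth checkpoint from the official ESRGAN repository; the image file names are placeholders.

```python
import numpy as np
import torch
import cv2
import RRDBNet_arch as arch   # architecture file from the official ESRGAN repo

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Pre-trained x4 ESRGAN generator (RRDB backbone).
model = arch.RRDBNet(3, 3, 64, 23, gc=32)
model.load_state_dict(torch.load('RRDB_ESRGAN_x4.pth'), strict=True)
model.eval().to(device)

def upscale_x4(bgr):
    """Run one projected perspective frame through the x4 generator."""
    img = torch.from_numpy(bgr[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model(img.unsqueeze(0).to(device)).squeeze(0).clamp(0, 1).cpu()
    out = out.permute(1, 2, 0).numpy()[:, :, ::-1]      # back to BGR
    return (out * 255.0).round().astype(np.uint8)

frame = cv2.imread('perspective_view.jpg')              # placeholder file name
cv2.imwrite('perspective_view_x4.jpg', upscale_x4(frame))
```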

Fig. Bilinear ×4 upsampling vs. ×4 super-resolution (SR).

Nonetheless, ESRGAN does not restore the original information; it infers plausible high-frequency detail in the images [8]. Because of this, it can hurt the quality of Structure from Motion. If we want the benefit of super-resolution while maintaining the quality of Structure from Motion, we can run Structure from Motion on the original frames and use the SR images only as input to the dense-matching step. To sum up, by using state-of-the-art deep learning algorithms such as super-resolution (ESRGAN), we may be able to reduce some drawbacks of 360 video and generate the desired 3D models.
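One practical detail of that strategy, sketched below with a simple pinhole model: when ×4 SR images replace the originals at the dense-matching step, the intrinsics estimated by Structure from Motion must be scaled by the same factor, while the camera poses stay unchanged.

```python
import numpy as np

def scale_intrinsics(K, scale=4.0):
    """Adjust a pinhole intrinsic matrix when the images fed to dense matching
    are the x4 super-resolved frames, while the camera poses estimated by
    Structure from Motion on the original frames remain unchanged."""
    S = np.diag([scale, scale, 1.0])
    return S @ K

# Example intrinsics estimated on the original frames.
K = np.array([[800., 0., 640.],
              [0., 800., 360.],
              [0., 0., 1.]])
print(scale_intrinsics(K))   # fx, fy, cx, cy are all multiplied by 4
```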


Fig. Chimei Museum


Fig. Lin Mo-Niang Park


Fig. Tzu Chi Senior High School


Reference

  1. Barazzetti, L., Previtali, M., & Roncoroni, F. (2018). Can we use low-cost 360 degree cameras to create accurate 3D models? International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 42(2).
  2. Barnes, C., Shechtman, E., Finkelstein, A., & Goldman, D. B. (2009). PatchMatch: A randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG) (Vol. 28, No. 3, p. 24).
  3. Bianco, S., Ciocca, G., & Marelli, D. (2018). Evaluating the performance of structure from motion pipelines. Journal of Imaging, 4(8), 98.
  4. Fraundorfer, F., Scaramuzza, D., & Pollefeys, M. (2010). A constricted bundle adjustment parameterization for relative scale estimation in visual odometry. In 2010 IEEE International Conference on Robotics and Automation (pp. 1899-1904). IEEE.
  5. Jancosek, M., & Pajdla, T. (2014). Exploiting visibility information in surface reconstruction to preserve weakly supported surfaces. International scholarly research notices.
  6. Shen, S. (2013). Accurate multiple view 3d reconstruction using patch-based stereo for large-scale scenes. IEEE transactions on image processing, 22(5), 1901-1914.
  7. Waechter, M., Moehrle, N., & Goesele, M. (2014). Let there be color! Large-scale texturing of 3D reconstructions. In European Conference on Computer Vision (pp. 836-850).
  8. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., & Change Loy, C. (2018). ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  9. Yao, Y., Luo, Z., Li, S., Fang, T., & Quan, L. (2018). MVSNet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 767-783).