Title page for etd-0708122-165811 (detailed record)
Title
基於特徵點檢測與注意力模組之表情保留人臉風格轉換
Expression-Preserving Facial Style Transfer Based on Landmark Detection and Attention Module
Department
Year, semester
Language
Degree
Number of pages
59
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2022-07-27
Date of Submission
2022-08-08
Keywords
Image-to-Image Translation, Style Transfer, Facial Expression Preservation, Attention Module, Facial Landmark Detection
Statistics
This thesis/dissertation has been viewed 282 times and downloaded 91 times.
Chinese Abstract
With the rapid development of Generative Adversarial Networks (GANs), many studies have adopted this architecture in the field of image-to-image translation. Existing work has achieved excellent results in style transfer, facial attribute editing, and facial expression manipulation. However, despite these successful precedents, most architectures cannot effectively preserve the expression of the original portrait when performing facial style transfer. Moreover, no existing translation model can simultaneously handle facial style transfer and control of expressions. In this thesis, we therefore propose a GAN-based image-to-image translation model that preserves the original facial expression while transferring facial style. We extract the facial landmarks of the original face as additional expression labels and compute a landmark loss to learn information about facial expressions. In addition, we adopt an attention module to help learn the contour differences between the two style domains. Quantitative and qualitative evaluations, as well as the results of a user study, show that our model preserves the original expression more completely while transferring facial style. Furthermore, we analyze how different weights of the landmark loss affect the translated images to provide a basis for hyperparameter selection, and we investigate the feasibility of replacing standard convolutions with depthwise separable convolutions to speed up model training.
Abstract
Recent advances in Generative Adversarial Networks (GANs) have driven tremendous strides in human face image-to-image translation (I2I). Previous studies mainly focus on style transformation between faces or on controlling facial attributes. However, existing human face I2I methods often fail to preserve the original expressions while performing style transformation. In this thesis, we propose a GAN-based image-to-image translation model that preserves the facial expressions of the original faces while transferring face images between two style domains. We extract the facial landmarks of the original faces as extra expression labels and compute a landmark loss to learn information about facial expressions. Additionally, we adopt an attention mechanism to help learn the contour differences between the two style domains. Quantitative and qualitative evaluations, as well as the results of a user study, show that our proposed method preserves facial expressions better while transferring the style of face images. In addition, we provide ablation analyses on the relative contribution of the landmark loss to overall model performance and on the use of depthwise separable convolutions as an alternative to standard convolutions for speeding up training.
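To make the two technical ingredients mentioned above more concrete, the PyTorch-style sketch below illustrates (a) a landmark-consistency loss computed with a frozen landmark detector and (b) a depthwise separable convolution block of the kind compared against standard convolutions in Section 4.7. This is a minimal illustration under assumed names: `landmark_net`, the `(B, K, 2)` landmark format, and `DepthwiseSeparableConv` are hypothetical, and the thesis's actual loss definitions and network layers (Chapter 3) may differ.

```python
# Minimal illustrative sketch only, NOT the thesis's actual code. All names
# below (landmark_net, DepthwiseSeparableConv, the (B, K, 2) landmark format)
# are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


def landmark_loss(landmark_net: nn.Module,
                  real: torch.Tensor,
                  fake: torch.Tensor) -> torch.Tensor:
    """L1 distance between landmarks detected on the source face and on the
    style-transferred face; one way to penalize expression drift."""
    with torch.no_grad():               # source landmarks serve as fixed targets
        lm_real = landmark_net(real)    # (B, K, 2) landmark coordinates
    lm_fake = landmark_net(fake)        # gradients flow back into the generator
    return F.l1_loss(lm_fake, lm_real)


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution; a
    lighter stand-in for a standard 3x3 convolution."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```

In a setup like this, the landmark loss would typically be scaled by a weight and added to the adversarial and reconstruction terms of the generator objective; varying that weight is the kind of analysis reported in the parameter sensitivity study of Section 4.6.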
Table of Contents
Thesis Approval Sheet
Acknowledgments
Chinese Abstract
Abstract
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Generative Adversarial Networks
2.1.1 PatchGAN
2.2 Image-to-Image Translation (I2I)
2.2.1 Cycle-Consistent Generative Adversarial Networks
2.3 Facial Image-to-Image Translation
2.3.1 Facial Expression Manipulation
2.3.2 Landmark-Driven Image-to-Image Translation
2.4 Facial Landmark Detection
2.5 Class Activation Map Module
Chapter 3 Method
3.1 Problem Formulation
3.2 Network Architecture
3.2.1 Generator
3.2.2 Multi-head Discriminator
3.3 Loss Functions
Chapter 4 Results
4.1 Dataset
4.2 Training Details
4.3 Quantitative Results
4.3.1 Overall Image Quality
4.3.2 Expression Preservation
4.4 Qualitative Results
4.5 User Studies
4.5.1 Style Transformation
4.5.2 Expression Preservation
4.5.3 Visual Attractiveness
4.6 Parameter Sensitivity Analysis
4.7 Standard Convolution vs. Depthwise Separable Convolutions in Image Translation
Chapter 5 Conclusions
References
Appendix A Images for User Study
Fulltext
This electronic full text is authorized for users to search, read, and print for personal, non-profit academic research purposes only. Please comply with the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without permission.
Thesis access permission: unrestricted (fully open both on campus and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Information on the public availability of printed copies is relatively complete for academic year 102 (ROC calendar, 2013-2014) and later. To check the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Library and Information Office. We apologize for any inconvenience.
Available: available
