Title page for etd-0708122-165811 (detailed record)
Title
基於特徵點檢測與注意力模組之表情保留人臉風格轉換
Expression-Preserving Facial Style Transfer Based on Landmark Detection and Attention Module
Department
Year, semester
Language
Degree
Number of pages
59
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2022-07-27
Date of Submission
2022-08-08
Keywords
Image-to-Image Translation, Style Transfer, Facial Expression Preservation, Attention Module, Facial Landmark Detection
Statistics
This thesis/dissertation has been viewed 282 times and downloaded 91 times.
Chinese Abstract
With the rapid development of Generative Adversarial Networks (GANs), many studies have adopted this architecture in the field of image-to-image translation. Existing work has achieved excellent results in style transfer, facial attribute editing, and facial expression manipulation. However, despite these successful precedents, most architectures cannot effectively preserve the expression of the original portrait when performing facial style transfer. Moreover, no existing translation model can simultaneously handle facial style transfer and control of expressions. In this thesis, we therefore propose a GAN-based image-to-image translation model that preserves the original facial expression while transferring facial style. We extract the facial landmarks of the original face as additional expression labels and compute a landmark loss to learn information about facial expressions. In addition, we adopt an attention module to help learn the contour differences between the two style domains. Quantitative and qualitative evaluations, as well as the results of a user study, show that our model preserves the original expression more completely while transferring facial style. Furthermore, we analyze how different weights of the landmark loss affect the translated images to provide a basis for hyperparameter selection, and we investigate the feasibility of replacing standard convolutions with depthwise separable convolutions to speed up model training.
Abstract
Recent advances in Generative Adversarial Networks (GANs) have driven tremendous strides in human face image-to-image translation (I2I). Previous studies mainly focus on style transformation between faces or on controlling facial attributes. However, existing human face I2I methods often fail to preserve the original expressions while performing style transformation. In this thesis, we propose a GAN-based image-to-image translation model that preserves the facial expressions of the original faces while transferring face images between two style domains. We extract the facial landmarks of the original faces as extra expression labels and compute a landmark loss to learn information about facial expressions. Additionally, we adopt an attention mechanism to help learn the contour differences between the two style domains. Quantitative and qualitative evaluations, as well as the results of a user study, show that our proposed method preserves facial expressions better while transferring the style of face images. In addition, we provide ablation analyses on the relative contribution of the landmark loss to overall model performance and on the use of depthwise separable convolutions as an alternative to standard convolutions for speeding up training.
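To make the two technical ingredients mentioned above more concrete, the PyTorch-style sketch below illustrates (a) a landmark-consistency loss computed with a frozen landmark detector and (b) a depthwise separable convolution block of the kind compared against standard convolutions in Section 4.7. This is a minimal illustration under assumed names: `landmark_net`, the `(B, K, 2)` landmark format, and `DepthwiseSeparableConv` are hypothetical, and the thesis's actual loss definitions and network layers (Chapter 3) may differ.

```python
# Minimal illustrative sketch only, NOT the thesis's actual code. All names
# below (landmark_net, DepthwiseSeparableConv, the (B, K, 2) landmark format)
# are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


def landmark_loss(landmark_net: nn.Module,
                  real: torch.Tensor,
                  fake: torch.Tensor) -> torch.Tensor:
    """L1 distance between landmarks detected on the source face and on the
    style-transferred face; one way to penalize expression drift."""
    with torch.no_grad():               # source landmarks serve as fixed targets
        lm_real = landmark_net(real)    # (B, K, 2) landmark coordinates
    lm_fake = landmark_net(fake)        # gradients flow back into the generator
    return F.l1_loss(lm_fake, lm_real)


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution; a
    lighter stand-in for a standard 3x3 convolution."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```

In a setup like this, the landmark loss would typically be scaled by a weight and added to the adversarial and reconstruction terms of the generator objective; varying that weight is the kind of analysis reported in the parameter sensitivity study of Section 4.6.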
Table of Contents
Thesis Approval Sheet
Acknowledgments
Chinese Abstract
Abstract
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Generative Adversarial Networks
2.1.1 PatchGAN
2.2 Image-to-Image Translation (I2I)
2.2.1 Cycle-Consistent Generative Adversarial Networks
2.3 Facial Image-to-Image Translation
2.3.1 Facial Expression Manipulation
2.3.2 Landmark-Driven Image-to-Image Translation
2.4 Facial Landmark Detection
2.5 Class Activation Map Module
Chapter 3 Method
3.1 Problem Formulation
3.2 Network Architecture
3.2.1 Generator
3.2.2 Multi-head Discriminator
3.3 Loss Functions
Chapter 4 Results
4.1 Dataset
4.2 Training Details
4.3 Quantitative Results
4.3.1 Overall Image Quality
4.3.2 Expression Preservation
4.4 Qualitative Results
4.5 User Studies
4.5.1 Style Transformation
4.5.2 Expression Preservation
4.5.3 Visual Attractiveness
4.6 Parameter Sensitivity Analysis
4.7 Standard Convolution vs. Depthwise Separable Convolutions in Image Translation
Chapter 5 Conclusions
References
Appendix A Images for User Study
Fulltext
This electronic full text is authorized for users to search, read, and print for personal, non-profit academic research purposes only. Please comply with the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without permission.
Thesis access permission: unrestricted (fully open both on campus and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Information on the public availability of printed copies is relatively complete for academic year 102 (ROC calendar, 2013-2014) and later. To check the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Library and Information Office. We apologize for any inconvenience.
Available: available
