Title page for etd-0726121-155816
Title
以多輪方式進行多模態影片檢索
Multimodal Video Retrieval with Multi-turn Query
Department
Year, semester
Language
Degree
Number of pages
83
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-08-18
Date of Submission
2021-08-26
Keywords
影片檢索、多輪搜尋、關聯式回饋、多模態、分群
Video Retrieval, Multi-turn Query, Relevance Feedback, Multi-Modal, Clustering
Statistics
The thesis/dissertation has been viewed 316 times and downloaded 0 times.
Abstract
In recent years, advances in communication technology and the growth of the Internet have led to large volumes of multimedia data, such as images, music, text, and video, being uploaded to the web every day, and quickly retrieving or storing information across the many kinds of multimedia databases and websites has become an important problem. Multimedia retrieval emerged to help users efficiently find the resources they are interested in across different kinds of query content. In video retrieval in particular, users look for different clips and phrase their queries in different words, so a query's wording can diverge semantically from the labels the model was trained on and yield incorrect retrieval results; in such cases the user cannot reach the desired clip with a single-turn query.
To address this, we propose a multi-turn video retrieval method that extends the predictions of a video retrieval model. In each turn, a feedback mechanism combines the currently predicted clips with the clip selected in the previous turn, and a clustering algorithm then selects, from the combined set, the clips most similar to the answer. This ensures that every turn searches for the most similar clips while gradually approaching the answer, so the results of each turn move closer to the correct clip and the retrieval effectiveness improves.
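The abstract and Section 3.2.1 name Affinity Propagation as the clustering algorithm, but this record contains no code, so the following is a minimal sketch of a single retrieval turn under stated assumptions: clips are rows of a feature matrix, `predicted` stands in for the retrieval model's output, cosine similarity stands in for the thesis's similarity measures, and the hypothetical helper `one_turn` keeps the cluster whose exemplar is most similar to the clip chosen in the previous turn.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def one_turn(predicted, selected_prev):
    """One feedback turn (hypothetical simplification).

    predicted:     (n, d) features of the clips predicted this turn
    selected_prev: (d,) feature of the clip the user chose last turn
    Returns indices into `predicted` for the cluster nearest the feedback.
    """
    # Feedback mechanism: combine current predictions with the
    # previously selected clip before clustering.
    combined = np.vstack([predicted, selected_prev[None, :]])

    # Affinity Propagation picks the number of clusters and their
    # exemplars by itself (Frey & Dueck, 2007).
    ap = AffinityPropagation(random_state=0).fit(combined)
    exemplars = combined[ap.cluster_centers_indices_]

    # Keep the cluster whose exemplar best matches the user's feedback.
    best = int(np.argmax([cosine(e, selected_prev) for e in exemplars]))
    members = np.flatnonzero(ap.labels_ == best)
    return members[members < len(predicted)]  # drop the feedback point
```

In the full method, the kept clips would seed the next turn's search and the loop would repeat until the user accepts a clip; how many related clips to keep per cluster is what the four allocation schemes mentioned below vary.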
We use three similarity measures and Recall as evaluation metrics and conduct experiments on three datasets. We also consider four feedback strategies and four allocations of the number of related clips selected per cluster, and combine them into different experimental settings based on the process above. According to the results, all three similarity measures show an overall improving trend, both in the per-turn average similarity and in the turn-by-turn comparison, and the number of clips found and the Recall show that the multi-turn feedback mechanism also helps improve retrieval Recall. In addition, a human evaluation with hands-on tests and a questionnaire indicates that the proposed method is helpful for video retrieval.
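The record does not spell out the Recall formulation; a common choice for moment retrieval (used, for example, by the TVR benchmark that the datasets in Chapter 4 derive from) is Recall@K with a temporal IoU threshold. The helpers `temporal_iou` and `recall_at_k` below are hypothetical names illustrating that assumption.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k=5, iou_thr=0.5):
    """Fraction of queries with at least one top-k prediction whose
    temporal IoU with the ground-truth clip reaches `iou_thr`."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thr for p in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```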
Table of Contents
Thesis Approval Form i
Thesis Publication Authorization ii
Acknowledgements iii
Abstract (Chinese) iv
Abstract v
Table of Contents vi
List of Figures viii
List of Tables ix
Chapter 1 Introduction 1
1.1. Research Background 1
1.2. Research Motivation 2
1.3. Research Objectives 2
Chapter 2 Literature Review 3
2.1. Video Retrieval 3
2.1.1. Query by Example 3
2.1.2. Query by Objects 4
2.1.3. Query by Keywords 5
2.1.4. Query by Natural Language 6
2.2. Interactive Video Retrieval 6
2.2.1. Video Browser Showdown (VBS) 7
2.2.2. Relevance Feedback 7
2.2.3. Dialog-Based Search 8
2.3. Multimodal Fusion 9
2.3.1. Early Fusion 9
2.3.2. Late Fusion 10
2.4. TVRetrieval 10
2.4.1. Cross-modal Moment Localization (XML) 11
2.4.2. Convolutional Start-End Detector (ConvSE) 11
2.4.3. Video Retrieval 12
Chapter 3 Research Methods 13
3.1 Feedback Mechanism 13
3.1.1 Feedback Threshold and Results 14
3.1.2 Generating New Feature Points for Clustering 15
3.2 Cluster-Based Filtering 21
3.2.1 Affinity Propagation (AP) 22
3.2.2 Clustering Procedure 24
3.2.3 Allocation of Related Clip Selection 27
3.2.4 Overall Workflow 28
Chapter 4 Experiments 30
4.1 Datasets 30
4.1.1 TVRetrieval 30
4.1.2 Concat data 31
4.1.3 Simplified data 32
4.1.4 TVCaption 32
4.2 Evaluation Methods 33
4.2.1 Recall 33
4.2.2 Similarity Measures 34
4.2.3 Human Evaluation 34
Chapter 5 Experimental Results 36
5.1 Experimental Environment and Setup 36
5.2 Results on TVCaption 36
5.2.1 MF Results 36
5.2.2 NFS Results 40
5.2.3 NFM Results 44
5.2.4 NFL Results 48
5.2.5 Discussion 52
5.3 Results on Concat data 53
5.3.1 NFM and NFL Results 53
5.3.2 Conclusion 56
5.4 Results on Simplified data 56
5.4.1 NFM and NFL Results 56
5.4.2 Discussion 59
5.5 Human Evaluation Results 60
5.5.1 In-Person Test Results 60
5.5.2 Questionnaire Results 61
5.5.3 Discussion 62
Chapter 6 Conclusions and Suggestions 63
6.1 Conclusions 63
6.2 Future Work 63
References 65
Appendix 69
Fulltext
This electronic full text is licensed only for personal, non-commercial retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined release time
Available:
Campus: publicly available
Off-campus: publicly available


Printed copies
Availability information for printed copies is relatively complete for academic year 102 (2013-2014) and later. For availability information on printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: publicly available
