博碩士論文 etd-0810110-175653 詳細資訊

[回到前頁查詢結果 | 重新搜尋]

姓名 胡書嫻(Su-Hsien Hu) 電子郵件信箱 E-mail 資料不公開
畢業系所 電機工程學系研究所(Electrical Engineering)
畢業學位 碩士(Master) 畢業時期 98學年第2學期
論文名稱(中) 應用雲端運算於排序支援向量機之研究   
論文名稱(英) Application of MapReduce to Ranking SVM for Large-Scale Datasets
  • etd-0810110-175653.pdf
  • 本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。


    論文語文/頁數 中文/87
    統計 本論文已被瀏覽 5671 次,被下載 44 次
    摘要(中) 近年來,搜索引擎藉由機器學習技術與訓練查詢(training queries)來建立評等模型(ranking model),然後透過評等模型來進行網頁排序,其中訓練查詢來自過去使用者的查詢與點擊次數。目前,基於機器學習的排序學習方法非常多,其中以Raniking SVM最受矚目,然而此方法會因為大量的訓練資料而造成運算量過大。為了解決此問題,我們將評等支援向量機演算法修改成分散式的架構,並以雲端運算環境中的MapReduce來實現分散式演算法。MapReduce是由Google所提出,其隱藏平行處理的繁複細節、高容錯、資料分布、負載平衡,讓使用者可以簡單地使用大量的電腦來處理巨量資料集(large-scale datasets)。Ranking SVM主要分成訓練與測試兩個階段,在訓練階段,我們需要求解Ranking SVM的對偶子問題,而測試階段,則是藉由訓練後的評等模型對測試查詢作評分排名。因此,我們設計不同的MapReduce架構來滿足不同階段的需求。實驗結果證明,此分散式評等支援向量機可以有效地減少運算所耗費的時間。
    摘要(英) Nowadays, search engines are more relying on machine learning techniques to construct a model, using past user queries and clicks as training data, for ranking web pages. There are several learning to rank methods for information retrieval, and among them ranking support vector machine (SVM) attracts a lot of attention in the information retrieval community. One difficulty with Ranking SVM is that the computation cost is very high for constructing a ranking model due to the huge number of training data pairs when the size of training dataset is large. We adopt the MapReduce programming model to solve this difficulty. MapReduce is a distributed computing framework introduced by Google and is commonly adopted in cloud computing centers. It can deal easily with large-scale datasets using a large number of computers. Moreover, it hides the messy details of parallelization, fault-tolerance, data distribution, and load balancing from the programmer and allows him/her to focus on only the underlying problem to be solved. In this paper, we apply MapReduce to Ranking SVM for processing large-scale datasets. We specify the Map function to solve the dual sub problems involved in Ranking SVM and the Reduce function to aggregate all the outputs having the same intermediate key from Map functions of distributed machines. Experimental results show efficiency improvement on ranking SVM by our proposed approach.
  • 搜索引擎
  • 雲端運算
  • MapReduce
  • Ranking SVM
  • 關鍵字(英)
  • search engines
  • cloud computing
  • MapReduce
  • Ranking SVM
  • 論文目次 摘要 iv
    Abstract v
    圖目錄 viii
    表目錄 ix
    第一章簡介 10
    1.1 研究背景 10
    1.2 問題定義 12
    1.3 研究目的 14
    1.4 論文架構 15
    第二章 文獻探討 16
    2.1 雲端運算 16
    2.2 Hadoop 17
    2.2.1 Hadoop Distributed File System (HDFS) 18
    2.2.2 MapReduce 20
    2.3 虛擬主機 22
    2.4 排序學習 23
    2.4.1 逐點式方法(Pointwise Approach) 25
    2.4.2 成對式方法(Pairwise Approach) 26
    2.4.3 序列式方法(Listwise Approach) 28
    2.5 Ranking SVM 28
    2.6 評估工具 30
    2.6.1 正規化遞減累積效益(NDCG) 30
    第三章 研究方法 33
    3.1 研究動機 33
    3.2 方法概述 34
    3.3 方法架構 36
    3.3.1 架構一 36
    3.3.2 架構二 39
    3.3.3 架構三 42
    3.3.4 架構四 45
    第四章 實驗結果與分析 50
    4.1 實驗資料 50
    4.2 特徵選取 54
    4.3 實驗環境架設 59
    4.4 實驗結果與分析 59
    4.4.1 實驗一 60
    4.4.2 實驗二 63
    4.4.3 實驗三 65
    4.4.4 實驗四 66
    4.4.5 實驗五 68
    4.4.6 實驗六 72
    4.4.7 實驗七 76
    第五章 結果與未來展望 78
    5.1 結論 78
    5.2 未來研究方向 78
    參考文獻 80
    附錄1 82
    Hadoop 0.20.2 叢集安裝 82
    參考文獻 [1]. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addision Wesley, 1999.
    [2]. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to Rank Using Gradient Descant,” in Proceedings of the 22nd International Conference on Machine Learning, pp. 89-96, 2005.
    [3]. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, H. Li, “Learning to Rank: From Pairwise Approach to Listwise Approach,” in Proceedings of the 24th International Conference on MachineLearning , pp. 129-136, 2007.
    [4]. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, pp.10-10, 2004.
    [5]. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An Efficient Boosting Algorithm for Combining Preferences,” Journal of Machine Learning Research, Vol. 4, pp. 933-969, 2003.
    [6]. X.-B. Geng, T.-Y. Liu, T. Qin, H. Li, and H.-Y. Shum, “Query-Dependent Ranking Using K-Nearest Neighbor,” in Proceedings of the 31s Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115-122, 2008.
    [7]. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
    [8]. T. Joachims, “Optimizing Search Engines Using Clickthrough Data,” ACM Conference on Knowledge Discovery and Data Mining, pp. 133-142, 2002.
    [9]. J. Kleinbreg, “Authoritative Sources in a Hyperlinked Environment,” in Proceeding of the 9th ACM–SIAM SODA, 1998.
    [10]. P. Li, C.J.C. Burges, and Q.Wu, “McRank: Learning to Rank Using Multiple Classification and Gradient Boosting,” Neural Information Processing Systems(NIPS), pp. 897-904, 2007.
    [11]. T.-Y. Liu, T. Qin, J. Xu, W. Xiong and H. Li, “LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval,” in Proceedings of the Learning to Rank workshop in the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
    [12]. T.Y. Liu, “Learning to Rank for Information Retrieval,” Tutorial at 17th International World-Wide Web Conference (WWW), 2008.
    [13]. R. Nallapati, “Discriminant Models for Information Retrieval,” in Proceedings of the 27th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 64-71, 2004.
    [14]. L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” Technical Report, Stanford University, 1998.
    [15]. V. C. Raykar, R. Duraiswami, and B. Krishnapuram, “A Fast Algorithm for Learning a Ranking Function from Large-Scale Data Sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1158-1170, July 2008.
    [16]. S. E. Robertson, “Overview of the Okapi Projects,” Journal of Documentation, Vol. 53, No. 1, pp. 3-7, 1997.
    [17]. J.E. Smith and R. Nair, "The Architecture of Virtual Machines," IEEE Computer Society, pp. 32-38, 2005.
    [18]. M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, W.-Y. Ma, “FRank: A Ranking Method with Fidelity Loss,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 383-390, 2007.
    [19]. T. White, “Hadoop: The Definitive Guide, “America: O’Reilly Media, 2009.
    [20]. J. Xu and H.Li, “AdaRank: A Boosting Algorithm for Information Retrieval,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391-398, 2007.
    [21]. Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A Support Vector Method for Optimizing Average Precision,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 271-278, 2007.
    [22]. C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval,” in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 334-342, 2001.
    [23]. http://hadoop.apache.org/hdfs/
    [24]. http://research.microsoft.com/en-us/um/beijing/projects/letor/yahoodata.aspx
    [25]. 陳瀅﹝2010﹞:雲端策略:雲端運算與虛擬化技術。台北市:天下雜誌。
    [26]. 王鵬﹝2010﹞:雲端運算的關鍵技術與應用實例。台北市:加魁資訊。
  • 吳志宏 - 召集委員
  • 歐陽振森 - 委員
  • 蔡賢亮 - 委員
  • 賴智錦 - 委員
  • 李錫智 - 指導教授
  • 口試日期 2010-07-21 繳交日期 2010-08-10

    [回到前頁查詢結果 | 重新搜尋]