Thesis/Dissertation etd-0805118-135637: Detailed Information



Name 張家揚 (Chia-Yang Chang) E-mail Not disclosed
Department Department of Electrical Engineering (graduate program)
Degree Master Graduation term Academic year 106 (2017–2018), 2nd semester
Title (Chinese) 基於文字語意分群之文章抄襲偵測
Title (English) Plagiarism detection based on word semantic clustering
Files
  • etd-0805118-135637.pdf
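  • This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
    Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid breaking the law.
    Thesis access permissions

    Print thesis: public after 5 years (public on 2023-09-06)

    Electronic thesis: user-defined permissions: public on campus after 5 years, public off campus after 5 years

    Language/pages Chinese/44
    Statistics This thesis has been viewed 5638 times and downloaded 0 times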
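    Abstract (Chinese) Plagiarism has become an increasingly common problem in recent years. As the Internet and technology advance, other people's works are readily available online. When your work uses content from someone else's work without clearly citing it, it may well constitute plagiarism. Plagiarism infringes on others' intellectual property rights and is occurring more and more often, so plagiarism detection has become a very important issue. Most current plagiarism research resembles near-duplicate detection, for example the vector space model and the bag-of-words model, and can mostly detect only passages with very high similarity. If the plagiarized passage is lightly modified, for example by substituting some words or rewriting sentences, the effectiveness of these methods drops sharply. We therefore analyze the semantics of words and use them to judge whether an article is suspected of plagiarism. Word2vec is a word embedding model proposed by a team at Google. It is trained on a large number of articles through machine learning and ultimately uses vectors to represent the meanings of words. We obtain word vectors through Word2vec. Because the amount of semantic information is very large, we apply principal component analysis (PCA) for dimensionality reduction, shrinking the dimensionality by ignoring the dimensions that carry less information. We then use clustering to group the words into many different semantic concepts. By comparing how much the semantic concepts overlap between articles, we can identify more complex plagiarism. Finally, we compare our method with other methods experimentally and test many different experimental parameters, showing that using word semantics can indeed identify more complex plagiarism more accurately.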
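    The dimensionality-reduction step described in this abstract, which discards the dimensions that carry little information, can be illustrated with a minimal sketch. It assumes NumPy is available and that components are kept until a chosen fraction of the total variance (the "cumulative energy" named in the table of contents) is reached. The function name `pca_reduce` and the 0.9 threshold are hypothetical choices for illustration, not values taken from the thesis.

```python
import numpy as np

def pca_reduce(vectors, energy=0.9):
    """Project row vectors onto the fewest principal components whose
    cumulative variance ("energy") reaches the requested fraction.

    Assumes the data are not all identical (total variance > 0).
    """
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centered data: the squared singular values are
    # proportional to the variance along each principal axis.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # Smallest k such that the first k components reach the threshold.
    k = int(np.searchsorted(cumulative, energy)) + 1
    k = min(k, len(s))
    return centered @ vt[:k].T, k

if __name__ == "__main__":
    toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.1], [0.0, -0.1]]
    reduced, k = pca_reduce(toy, energy=0.9)
    print(k, reduced.shape)
```

    In practice the input rows would be Word2vec vectors, and the threshold would be one of the experimental parameters to tune.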
    Abstract (English) Plagiarism has become a common problem in recent years. With the advance of the Internet, it is increasingly easy to obtain other people's writings. When someone uses such content without citation, it may constitute plagiarism. Plagiarism infringes intellectual property rights, so plagiarism detection is an important problem today. Current plagiarism detection methods are similar to near-duplicate detection methods, such as the vector space model (VSM) or bag-of-words. These methods cannot handle complex plagiarism techniques well, e.g., word substitution and sentence rewriting. Therefore, we focus on the semantics of words. In this paper, we propose a new method for plagiarism detection that analyzes the semantics of words. Word2vec is a word embedding model proposed by a group at Google that represents each word as a vector. We use Word2vec to obtain word vectors and use PCA for dimensionality reduction. After that, we use spherical K-means to cluster the words into concepts. By using Word2vec, we can take the semantics of words into account and cluster the words into concepts in order to deal with complex plagiarism techniques. Finally, we present our experimental results and compare our method with other methods. The experimental results show that our method performs well.
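    The clustering and comparison steps in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis implementation: the two-dimensional vectors stand in for Word2vec embeddings, the function names `spherical_kmeans` and `concept_overlap` are hypothetical, and the Jaccard ratio is an assumed stand-in for whatever overlap measure the thesis actually uses.

```python
import numpy as np

def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors by cosine similarity (spherical K-means).

    Rows are scaled to unit length, each point joins the centroid with
    the highest dot product, and centroids are re-normalized means.
    """
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        labels = np.argmax(X @ centroids.T, axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                total = members.sum(axis=0)
                norm = np.linalg.norm(total)
                if norm > 0:
                    centroids[c] = total / norm
    return np.argmax(X @ centroids.T, axis=1)

def concept_overlap(doc_a, doc_b, word_to_concept):
    """Jaccard overlap of the concept sets of two tokenized documents
    (an assumed measure, chosen only for illustration)."""
    a = {word_to_concept[w] for w in doc_a if w in word_to_concept}
    b = {word_to_concept[w] for w in doc_b if w in word_to_concept}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    words = ["car", "automobile", "banana", "apple"]
    toy_vectors = [[1.0, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
    labels = spherical_kmeans(toy_vectors, k=2)
    word_to_concept = {w: int(c) for w, c in zip(words, labels)}
    # The second document swaps every word for a near-synonym.
    print(concept_overlap(["car", "banana"], ["automobile", "apple"], word_to_concept))
```

    Because both documents map to the same set of concepts, substituting words does not lower the overlap, whereas an exact-word comparison of the same two documents would find nothing in common.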
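    Keywords (Chinese)
  • near-duplicate (近似複本)
  • vector space model (向量空間模型)
  • bag-of-words model (詞袋模型)
  • Word2vec
  • word embedding (詞嵌入)
  • principal component analysis (主成分分析)
  • plagiarism detection (抄襲偵測)
  • Keywords (English)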
  • Word2vec
  • word embedding
  • PCA
  • VSM
  • near-duplicate
  • bag-of-words
  • plagiarism detection
  • Table of contents Thesis Approval Form
    Acknowledgments
    Abstract (Chinese)
    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
    1.1. Research Background and Objectives
    1.2. Research Motivation
    1.3. Thesis Organization
    Chapter 2 Literature Review
    2.1. Document Models
    2.2. Plagiarism Detection
    2.3. Word2vec
    2.4. Spherical K-means
    Chapter 3 Methodology
    3.1. Method Overview
    3.2. First-Stage Procedure
    3.3. Second-Stage Procedure
    Chapter 4 Analysis of Experimental Results
    4.1. Data set
    4.2. Evaluation Criteria
    4.3. Comparison of K-means and Spherical K-means
    4.4. Comparison of Number of Clusters and PCA Cumulative Energy
    4.5. Comparison with the MLM Method
    4.6. Comparison of Plagiarism Detection with Word Substitution
    Chapter 5 Conclusions and Future Work
    5.1. Conclusions
    5.2. Future Research Directions
    References
    Oral defense committee
  • 吳志宏 - Convener
  • 侯俊良 - Member
  • 劉志峰 - Member
  • 歐陽振森 - Member
  • 李錫智 - Advisor
  • Oral defense date 2018-07-26 Submission date 2018-09-06
