Research Article

Kavramlar Arası WordNet Tabanlı Anlamsal Benzerlik Değerlerinin Farklı Metriklerle Değerlendirilmesi

Year 2020, Ejosat Special Issue 2020 (ICCEES), 473 - 479, 05.10.2020
https://doi.org/10.31590/ejosat.819599

Abstract

Word sense disambiguation is an important step toward achieving high accuracy in fields such as text mining, information retrieval, and natural language processing. Dictionary-based approaches, labeled and unlabeled corpora used in supervised and unsupervised learning, and newer approaches such as word embeddings are frequently used to determine the correct sense of a word in its context. In this study, RSS news items in the economy, technology, and sport categories were collected from news providers. The words in the RSS news feeds were weighted per category using term frequency-inverse document frequency (tf-idf). To determine the semantic similarities between words, approaches based on WordNet, a hand-labeled, hierarchical, graph-based lexicon, were used. In the first step, the words selected according to their tf-idf weights were re-ranked using the WordNet-based Wu-Palmer, Lin, and Jiang-Conrath semantic similarity approaches. The Categorical Semantic Relationship Value (CSRV; KAİD in Turkish) of the fifty words with the highest tf-idf values in each category was computed to determine the words' semantic relationship values for their categories. The 3, 5, 10, and 20 words with the highest CSRV were extracted for every category. The resulting words were compared with hand-labeled words and with words ranked by tf-idf weights. According to the comparison, the words whose semantic relationships were extracted through two-tier filtering show a high rate of similarity with the words chosen by humans. The words obtained and ranked with the WordNet-based methods were also compared with the words obtained and ranked by tf-idf weighting. According to the results, the overlap rate for the words ranked by weighting was lower than for the words obtained through human judgment.
The semantic relationship values of the words produced by the two-tier evaluation were visualized in category space to assess how well these values perform. In future work, the words obtained with the two-tier evaluation are intended to be used in information retrieval, text summarization, and text classification.
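The per-category tf-idf weighting step can be illustrated with a minimal sketch. The abstract does not state which tf-idf variant was used, so this version makes its own assumptions: each category is pooled into a single document, term frequency is normalized by category length, and the inverse document frequency is the plain logarithm of the ratio. The `feeds` data and the helper names `tfidf_by_category` and `top_k` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: each category maps to a list of tokenized news items.
feeds = {
    "economy": [["bank", "interest", "rate"], ["bank", "inflation", "market"]],
    "technology": [["phone", "software", "market"], ["software", "chip"]],
    "sport": [["match", "goal", "team"], ["team", "coach", "goal"]],
}

def tfidf_by_category(feeds):
    """Pool each category into one document and return {category: {term: weight}}."""
    # Term counts per category (all items of a category are pooled together).
    counts = {cat: Counter(tok for item in items for tok in item)
              for cat, items in feeds.items()}
    n_docs = len(counts)
    # Document frequency: the number of categories in which each term occurs.
    df = Counter(term for c in counts.values() for term in c)
    weights = {}
    for cat, c in counts.items():
        total = sum(c.values())
        weights[cat] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in c.items()
        }
    return weights

def top_k(weights, category, k):
    """Return the k highest-weighted terms of a category (ties broken alphabetically)."""
    ranked = sorted(weights[category].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

weights = tfidf_by_category(feeds)
for category in feeds:
    print(category, top_k(weights, category, 3))
```

In the study the same kind of ranking would be cut off at the top fifty words per category before the WordNet-based re-ranking; the toy data here is far too small for that, so the sketch takes the top three.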

References

  • Chen, J., & Palmer, M. (2005). Towards robust high performance word sense disambiguation of English verbs using rich linguistic features. In International Conference on Natural Language Processing (pp. 933-944). Springer, Berlin, Heidelberg.
  • Dang, H. T., Chia, C. Y., Palmer, M., & Chiou, F. D. (2002). Simple features for Chinese word sense disambiguation. In COLING 2002: The 19th International Conference on Computational Linguistics.
  • Mihalcea, R. (2007, April). Using Wikipedia for automatic word sense disambiguation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference (pp. 196-203).
  • Seo, H. C., Chung, H., Rim, H. C., Myaeng, S. H., & Kim, S. H. (2004). Unsupervised word sense disambiguation using WordNet relatives. Computer Speech & Language, 18(3), 253-273.
  • Pham, T. P., Ng, H. T., & Lee, W. S. (2005). Word sense disambiguation with semi-supervised learning. In Proceedings of the National Conference on Artificial Intelligence (Vol. 20, No. 3, p. 1093). AAAI Press.
  • Simov, K., Koprinkova-Hristova, P., Popov, A., & Osenova, P. (2020). A Reservoir Computing Approach to Word Sense Disambiguation. Cognitive Computation, 1-10.
  • Miller, G. A. (1998). WordNet: An electronic lexical database. MIT Press.
  • Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13-47.
  • Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. arXiv preprint cmp-lg/9406033.
  • Lin, D. (1998). An information-theoretic definition of similarity. In ICML (Vol. 98, pp. 296-304).
  • Leacock, C., & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In WordNet: An electronic lexical database (pp. 265-283). MIT Press.
  • Oliver, A. (2020). Aligning Wikipedia with WordNet: a Review and Evaluation of Different Techniques. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 4851-4858).
  • Kolajo, T., Daramola, O., Adebiyi, A., & Seth, A. (2020). A framework for pre-processing of social media feeds based on integrated local knowledge base. Information Processing & Management, 57(6), 102348.
  • Iqbal, F., Fung, B. C., Debbabi, M., Batool, R., & Marrington, A. (2019). Wordnet-based criminal networks mining for cybercrime investigation. IEEE Access, 7, 22740-22755.
  • Hasan, A. M., Noor, N. M., Rassem, T. H., Noah, S. A. M., & Hasan, A. M. (2020). A proposed method using the semantic similarity of WordNet 3.1 to handle the ambiguity to apply in social media text. In Information Science and Applications (pp. 471-483). Springer, Singapore.
  • Zhu, X., Xu, Q., Chen, Y., & Wu, T. (2019). An Improved Class-Center Method for Text Classification Using Dependencies and WordNet. In CCF International Conference on Natural Language Processing and Chinese Computing (pp. 3-15). Springer, Cham.
  • Jain, A., Vij, S., & Tayal, D. K. (2019). Text Summarization Using WordNet Graph Based Sentence Ranking. In Proceedings of 2nd International Conference on Communication, Computing and Networking (pp. 711-715). Springer, Singapore.
  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations (pp. 55-60).
  • Loper, E., & Bird, S. (2002). NLTK: the natural language toolkit. arXiv preprint cs/0205028.

Evaluation of WordNet Based Semantic Similarity Values Between Concepts with Different Metrics


Abstract

Word sense disambiguation is an important step toward more accurate results in text mining, information retrieval, and natural language processing. Dictionary- and knowledge-based, supervised, unsupervised, and word embedding methods are used to discover the correct sense of a word in its context. In this study, we retrieved RSS feeds in the economy, technology, and sport categories. After data retrieval, we applied text mining preprocessing steps and used term frequency-inverse document frequency (tf-idf) for term weighting. WordNet is a large lexical database in which word senses are organized in a hierarchical network. In the first step, the words selected according to their tf-idf weights were ranked with the WordNet-based semantic similarity measures Wu-Palmer, Lin, and Jiang-Conrath. We used the fifty top-ranked words by tf-idf score to calculate the Categorical Semantic Relationship Value (CSRV) of each word for each category, and then determined the top 3, 5, 10, and 20 words by CSRV for each category. The semantically ordered words were compared with words ranked by tf-idf weighting and with hand-labeled words that humans selected according to semantic relationship. The similarity rate between the words produced by the two-tier semantic structure and the human-labeled words is high. The similarity rate between the words produced by the two-tier semantic structure and the words ordered by tf-idf values is lower. We also visualize the semantic similarity values in the class dimension space to evaluate the success of the system. In future work, we intend to use the two-tier semantic structure in information retrieval, text summarization, and text classification projects.
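To make the three similarity measures concrete, the following sketch computes them on a tiny hand-made taxonomy instead of the real WordNet. The taxonomy in `PARENT`, the counts in `LEAF_FREQ`, and all function names are hypothetical. The formulas follow the commonly used definitions: Wu-Palmer as 2·depth(lcs) / (depth(c1) + depth(c2)), Lin as 2·IC(lcs) / (IC(c1) + IC(c2)), and Jiang-Conrath as the inverse of the distance IC(c1) + IC(c2) − 2·IC(lcs), where lcs is the lowest common subsumer and IC is information content. Whether the study used exactly these conventions is an assumption.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet's hypernym tree.
PARENT = {
    "entity": None,
    "organization": "entity",
    "bank": "organization",
    "team": "organization",
    "artifact": "entity",
    "phone": "artifact",
}
# Hypothetical corpus counts for the leaf concepts, used to estimate information content.
LEAF_FREQ = {"bank": 4, "team": 2, "phone": 2}

def path_to_root(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))

def lcs(c1, c2):
    """Lowest common subsumer: the deepest ancestor shared by both concepts."""
    ancestors = set(path_to_root(c1))
    return next(node for node in path_to_root(c2) if node in ancestors)

def ic(concept):
    """Information content -log p(concept); assumes the concept covers at least one counted leaf."""
    total = sum(LEAF_FREQ.values())
    covered = sum(n for leaf, n in LEAF_FREQ.items() if concept in path_to_root(leaf))
    return math.log(total / covered)

def wu_palmer(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

def lin(c1, c2):
    denom = ic(c1) + ic(c2)
    return 2 * ic(lcs(c1, c2)) / denom if denom > 0 else 0.0

def jiang_conrath(c1, c2):
    distance = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    return math.inf if distance == 0 else 1 / distance

for pair in [("bank", "team"), ("bank", "phone")]:
    print(pair, round(wu_palmer(*pair), 3), round(lin(*pair), 3), round(jiang_conrath(*pair), 3))
```

On the real WordNet, NLTK exposes the same three measures on synsets as `wup_similarity`, `lin_similarity`, and `jcn_similarity` (the latter two take an information content dictionary); the abstract does not say which implementation the study relied on.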

There are 19 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Articles
Authors

Mustafa Özgür Cingiz 0000-0003-4469-1440

Publication Date October 5, 2020
Published in Issue Year 2020 Ejosat Special Issue 2020 (ICCEES)

Cite

APA Cingiz, M. Ö. (2020). Kavramlar Arası WordNet Tabanlı Anlamsal Benzerlik Değerlerinin Farklı Metriklerle Değerlendirilmesi. Avrupa Bilim Ve Teknoloji Dergisi, 473-479. https://doi.org/10.31590/ejosat.819599