Towards Corpus-Based Stemming for Arabic Texts

Yasser Muhammad Naguib Sabtan

Full text: English, 11 pages

Abstract

Stemming, the process of reducing words to their stems, is an essential processing step in a number of natural language processing (NLP) applications such as information extraction, text analysis and machine translation. This paper presents a light stemmer for Arabic that uses a corpus-based approach. The stemmer groups morphological variants of words in an Arabic corpus based on shared characters, then strips off their affixes (prefixes and suffixes) to produce their common stem. Experimental results show that 86% of the words in the test set were correctly grouped under a common reduced form (i.e. the possible stem). In some cases, however, the reduced form is not the legitimate stem: the evaluation shows that 72.2% of the words in the test set were reduced to their legitimate stem. The stemmer was developed with the future aim of investigating how effective word stems are for extracting bilingual equivalents from an Arabic-English parallel corpus.
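The general idea of light stemming (stripping affixes so that morphological variants fall under one reduced form) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the affix lists, the minimum stem length, and the function names `light_stem` and `group_variants` are all assumptions made for illustration. The sketch also simplifies the order of operations by stripping affixes first and grouping on the result, whereas the abstract describes grouping by shared characters before stripping.

```python
from collections import defaultdict

# Hypothetical affix inventories, longest first so that compound prefixes
# are tried before their shorter parts. These are illustrative only and
# are not the lists used in the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ة", "ي"]

# Assumed lower bound on what may remain after stripping an affix.
MIN_STEM_LEN = 3


def light_stem(word):
    """Strip at most one prefix and one suffix from an Arabic word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def group_variants(words):
    """Group words that share the same reduced form."""
    groups = defaultdict(list)
    for word in words:
        groups[light_stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    corpus_words = ["الكتاب", "والكتاب", "كتاب", "كتابات"]
    for reduced_form, variants in group_variants(corpus_words).items():
        print(reduced_form, variants)
```

With these assumed lists, the four sample words are all grouped under the reduced form كتاب. A reduced form produced this way is only a candidate stem: as the abstract notes, words can be grouped correctly even when the shared reduced form is not the legitimate stem.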

 

Journal

International Journal of Linguistics, Literature and Translation

Funder

Al-Kindi Center for Research and Development
