Web proceedings papers

Authors

Bojan Ilijoski and Zaneta Popeska

Abstract

The first step of text mining is finding similar words or words with the same meaning. This is important for extracting the primary meaning of a text. There are several approaches to this task: some use dictionaries, others use stemming algorithms, and some are statistically based. Each method has its advantages and disadvantages, and the choice of method generally depends on the problem to be solved. Statistical methods are usually used for languages in which producing a dictionary or stemming rules is difficult. These methods are also robust to typing errors and to words that do not exist in the language, such as names or dialect forms. One of the best-known statistical methods is n-gram similarity. In this paper we propose a new measure of word similarity based on n-gram similarities. We explain in which cases this approach to finding word similarity is appropriate, what its advantages are, and how the new measure improves on it. We compare the new measure with existing ones and describe the improvement it provides.

Keywords

Syntactical similarity · Similarity measures · Word similarity · n-gram · Text mining.