Web proceedings papers

Authors

Jelena Graovac and Gordana Pavlovic-Lazetic

Abstract

We present a novel language-independent technique for determining the polarity, positive or negative, of opinions expressed by different individuals. The technique is based on a byte-level n-gram frequency statistics method for document representation and a variant of the k nearest neighbors (kNN) machine learning algorithm, with k = 1, for the categorization process. The main advantages of the technique are its simplicity and its full language and topic independence. For our experiments we used corpora of movie reviews: the Cornell polarity dataset in English and MuchoCine in Spanish. Experimental results (85.6% accuracy for the English corpus and 82.49% for the Spanish corpus) confirm that the presented technique is comparable with the best-ranked previously published techniques when applied to movie review datasets, even though it uses no additional linguistic information or external resources.
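
As a rough illustration of the general idea, the following minimal sketch builds byte-level n-gram frequency profiles and assigns a label by nearest neighbor with k = 1. It is not the authors' implementation: the function names, the default n = 3, the dissimilarity measure (a sum of squared normalized frequency differences), and the toy training texts are all assumptions made for this example. The abstract does not specify these details, nor whether a profile stands for a single document or for a whole class.

```python
from collections import Counter


def byte_ngram_profile(text, n=3):
    """Relative frequencies of byte-level n-grams of the UTF-8 encoded text."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def dissimilarity(p, q):
    """Assumed measure: sum of squared normalized frequency differences."""
    score = 0.0
    for gram in p.keys() | q.keys():
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # A gram in the union has a > 0 or b > 0, so the mean is never zero.
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score


def classify_1nn(text, labeled_profiles, n=3):
    """Return the label of the profile least dissimilar to the text (k = 1)."""
    query = byte_ngram_profile(text, n)
    label, _ = min(labeled_profiles, key=lambda item: dissimilarity(query, item[1]))
    return label


# Toy training data (hypothetical, far smaller than a real review corpus).
train = [
    ("positive", byte_ngram_profile("good good good")),
    ("negative", byte_ngram_profile("bad bad bad")),
]
predicted = classify_1nn("good", train)
```

Because the profiles are computed on raw bytes, the sketch needs no tokenizer, stemmer, or lexicon, which is what makes this kind of approach independent of language and topic.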

Keywords

Sentiment Analysis, Byte n-Grams, kNN, Movie Review.