Web proceedings papers

Authors

Filip Andonovski and Ivan Chorbev

Abstract

Social networks are extremely popular, and the amount of data they generate daily is staggering. Users are overwhelmed by data streams and struggle to filter out and extract the information that interests them. Twitter, one of the most popular social networks, has over 100 million daily active users who generate over 500 million posts per day. This paper describes the algorithm implemented in a system that identifies the most popular Twitter topics for athletes. Twitter messages are clustered into their corresponding topics. The algorithm uses Lucene to determine the most commonly used words in the data set and the posts containing those words. The system includes a Daily/Weekly/Monthly digest feature that clusters tweets for the selected period. Experiments were performed to validate the system’s performance using metrics commonly used in the field of information retrieval. The system’s performance is analyzed, and the generalization of the system and possible future improvements are discussed.
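
The core idea of grouping posts under the most commonly used words can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses Lucene, whereas this example uses only the Python standard library, and the stop-word list, the tokenizer, the function names `tokenize` and `cluster_by_top_terms`, and the `top_n` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "at", "with"}


def tokenize(text):
    """Lowercase the text and return its word tokens without stop words."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def cluster_by_top_terms(tweets, top_n=3):
    """Group tweets under the top_n terms that occur in the most tweets.

    A tweet that contains several top terms is placed in several clusters.
    """
    doc_freq = Counter()
    tokens_per_tweet = []
    for tweet in tweets:
        tokens = set(tokenize(tweet))  # count each term at most once per tweet
        tokens_per_tweet.append(tokens)
        doc_freq.update(tokens)

    top_terms = [term for term, _ in doc_freq.most_common(top_n)]
    clusters = {term: [] for term in top_terms}
    for tweet, tokens in zip(tweets, tokens_per_tweet):
        for term in top_terms:
            if term in tokens:
                clusters[term].append(tweet)
    return clusters


if __name__ == "__main__":
    sample = [
        "Great match today",
        "What a match by the champion",
        "Training hard for the final",
        "The final match is tomorrow",
    ]
    for term, posts in cluster_by_top_terms(sample, top_n=2).items():
        print(term, len(posts))
```

A digest for a given period could, under the same assumptions, be produced by passing only the tweets whose timestamps fall within that day, week, or month.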
