Towards Expanding the South Slavic ConceptNet: Leveraging MULTEXT-East’s Corpora for Concept Extraction

Web proceedings papers

Authors

Abstract

Conceptual graphs provide much-needed contextual knowledge for natural language processing. Of the several common knowledge bases available today, ConceptNet is the largest and most comprehensive. Its scope includes words and common phrases, interconnected by their lexical and semantic relations. Although it offers information for 70+ languages, ConceptNet falls short for the South Slavic languages: the number of concepts is unbalanced not only relative to the West European languages but also among the South Slavic languages themselves. This paper describes an approach to expanding ConceptNet’s corpus for these languages using balanced corpora with the same word base, namely MULTEXT-East's multilingual parallel corpora of George Orwell's "1984". In the first phase of the project, the Macedonian, Serbian and Slovenian translations of the text are used to extract noun pairs and derive new ConceptNet connections. The results show a balanced ratio of new concepts and relations across all of the languages used, without any of the redundancies found in the current ConceptNet entries.

Keywords

ConceptNet, MULTEXT-East, natural language processing, machine translation, conceptual graphs, common knowledge bases
