-Motivation:
On social media platforms, people tend to communicate informally, writing posts and comments in their local dialects. In Africa, more than 1500 dialects and languages exist. Despite being so diverse and rich, the Arabic language, and Arabic dialects in particular, are still underrepresented and not yet fully exploited.
Arabizi is a term describing a system of writing Arabic using English characters. The term comes from two words: “arabi” (Arabic) and “Engliszi” (English). Arabizi represents Arabic sounds with Latin letters, using numbers to stand in for sounds that have no Latin equivalent. In Tunisia in particular, this way of writing was introduced as “Tunizi”.
Tunizi example comments and their Modern Standard Arabic (MSA) and English translations.
-About this Dataset:
This is a large Common Crawl-based Tunisian Arabizi dialectal dataset dedicated to sentiment analysis. The dataset consists of a total of 100k comments (about movies, politics, sports, etc.) manually annotated by native Tunisian speakers as Positive, Negative, or Neutral.
-Value of this Data:
The authors introduced this large Tunizi dataset, built for the sentiment analysis task, to help Tunisian and other researchers interested in the Natural Language Processing (NLP) field. The dataset can also be used for other NLP subtasks such as dialect identification, named entity recognition, etc.
-How the data were acquired:
According to the article's authors, because Tunisian dialect data (books, Wikipedia, etc.) is scarce, they used a Common Crawl-based dataset extracted from social media, collected from comments on various social networks. The chosen posts cover sports, politics, comedy, TV shows, TV series, arts and Tunisian music videos, so that the dataset is representative and reflects different ages, backgrounds, writing styles, etc. The data does not include any confidential information, since it is collected from comments on public social media posts. However, negative comments may include offensive or insulting content. The dataset relates directly to people from different regions, of different ages and genders. A filter was applied to ensure that only Latin-script comments are included. The extracted data was preprocessed by removing links, emoji symbols, and punctuation.
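The filtering and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function names (`is_latin_based`, `clean_comment`, `preprocess`), the URL pattern, and the rule used to decide that a comment is "Latin based" are all assumptions made for the example. Digits are kept, since Arabizi uses numbers to represent some Arabic sounds.

```python
import re

# Assumed patterns; the source does not say how links or scripts were detected.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
ARABIC_RE = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")
LATIN_RE = re.compile(r"[A-Za-z]")


def is_latin_based(comment: str) -> bool:
    """Assumed rule: at least one ASCII letter and no Arabic-script characters."""
    return bool(LATIN_RE.search(comment)) and not ARABIC_RE.search(comment)


def clean_comment(comment: str) -> str:
    """Remove links, emoji symbols, and punctuation from one comment."""
    text = URL_RE.sub(" ", comment)
    # Keep letters, digits, and whitespace; everything else (emoji,
    # punctuation, other symbols) is replaced by a space.
    text = "".join(ch if (ch.isalnum() or ch.isspace()) else " " for ch in text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())


def preprocess(comments):
    """Keep only Latin-based comments, then clean each one."""
    return [clean_comment(c) for c in comments if is_latin_based(c)]


if __name__ == "__main__":
    # Hypothetical sample comments, not taken from the dataset.
    sample = [
        "3aslema!!! 😂 https://example.com/x bravo, 7abibi.",
        "مرحبا",
    ]
    print(preprocess(sample))
```

Filtering on character classes rather than on a fixed punctuation list keeps accented Latin letters intact, which matters if comments mix Tunizi with French.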
Header & Thumbnail Image : Credits @VectorStock