Italian Negation Constructions - Tweets
Exploring Language Variation Across 10 Cities
By [source]
About this dataset
This dataset, the Twitter Italian Negation (TIN) Corpus, offers a glimpse into language change in Romance languages through the emergence of non-standard uses of negation. The collection contains 10,000 tweets from ten cities (Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City), each collected in August 2019. The data includes tokenized text and frequency measures for each tweet, as well as a city column so users can explore regional differences. With this resource, users can explore how the language of these cities is changing over time, or how language usage differs between neighboring countries or states. Get ready to dive deep into the fascinating shifts that occur between spoken and written languages!
How to use the dataset
This dataset contains 10,000 tweets in Italian gathered from ten different cities between August and December 2019. The collection offers insight into language change in Romance languages, specifically with regard to non-standard uses of negation.
The dataset is composed of five columns: token, absolute frequency, relative frequency, variation, and the city from which the tweet originated. Each row represents a single token in a particular tweet; a tweet can contain more than one token.
By using this dataset you can analyze and compare patterns of usage across different cities or even within a specific city. You can also compare variations within tokens between different cities to understand how certain constructions are used differently across regions or dialects.
Additionally, you could use this data to examine trends in literary works such as poetry by looking at the most commonly used words and phrases over time.
To use the data effectively, it is important first to understand what each column represents:
- Tok (Tokenized text): The text of a tweet broken down into individual tokens, covering every word in the tweet as well as punctuation marks such as commas or exclamation points;
- Abs (Absolute Frequency): The total number of times a particular token appears across all tweets;
- Rel (Relative Frequency): How often a particular token appears relative to other tokens;
- Var (Variation): Indicates whether the token departs from standard usage, such as “has” being replaced with “haz”;
- City: The city associated with each tweet, which supports analysis of usage differences among locales, for example “Milan” or “Genoa”, as well as broader geographic areas such as “Italy” versus other countries like the “United States”.
Combining these numeric values with thematic exploration helps reveal not only how constructions are used but also how usage trends differ across geographic populations, both locally and globally, as reflected in Twitter users' language, especially non-standard dialectal constructs throughout Italy.
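As a rough illustration of working with the columns described above, the sketch below sums the abs column for each combination of city and var using only the Python standard library. The sample rows, the var labels ("standard" and "non-standard"), the comma delimiter, and the function name `abs_by_city_and_var` are all assumptions made for the example; the actual files may encode these differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the documented column layout (tok, abs, rel, var, city).
# The values and the var labels are invented for illustration only;
# they are not taken from the corpus.
SAMPLE_CSV = """tok,abs,rel,var,city
non mi interessa,12,0.40,standard,Milan
mi interessa niente,6,0.20,non-standard,Milan
non mi interessa,9,0.30,standard,Naples
mi interessa niente,3,0.10,non-standard,Naples
"""


def abs_by_city_and_var(handle):
    """Sum the abs column for every (city, var) pair found in a CSV handle."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[(row["city"], row["var"])] += int(row["abs"])
    return dict(totals)


totals = abs_by_city_and_var(io.StringIO(SAMPLE_CSV))
for (city, var), count in sorted(totals.items()):
    print(f"{city:8} {var:14} {count}")
```

To try this on one of the real files, the same function could be given a handle opened with `open("interessa+word1.csv", newline="", encoding="utf-8")`, assuming the file is comma-separated and UTF-8 encoded.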
Research Ideas
- Studying the regional variation of Italian negation constructions by comparing the frequency and variation between cities.
- Investigating language change over time by tracking changes in relative and absolute frequencies of negation constructions across tweets.
- Exploring how different socio-economic contexts or trends, such as news, fashion, and sports, have affected the evolution of language use in tweets in each city.
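The first research idea, comparing variation between cities, could be approached along the following lines. This is a minimal sketch, not the original authors' method: it computes, for each city, the share of the summed abs counts that belongs to one chosen var value. The function name `variant_share_by_city`, the var labels "a" and "b", and the sample counts are hypothetical.

```python
from collections import defaultdict


def variant_share_by_city(rows, variant):
    """Return, per city, the fraction of summed abs counts whose var equals `variant`."""
    total = defaultdict(int)
    matching = defaultdict(int)
    for row in rows:
        count = int(row["abs"])
        total[row["city"]] += count
        if row["var"] == variant:
            matching[row["city"]] += count
    # Cities with a zero total are skipped to avoid dividing by zero.
    return {city: matching[city] / total[city] for city in total if total[city] > 0}


# Invented rows for illustration; "a" and "b" stand in for whatever
# labels the var column actually contains.
rows = [
    {"city": "Milan", "var": "a", "abs": "30"},
    {"city": "Milan", "var": "b", "abs": "10"},
    {"city": "Rome", "var": "a", "abs": "5"},
    {"city": "Rome", "var": "b", "abs": "15"},
]

shares = variant_share_by_city(rows, "b")
for city, share in sorted(shares.items()):
    print(f"{city}: {share:.2f}")
```

Rows read with `csv.DictReader` have the same dictionary shape, so they can be passed to this function directly.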
Acknowledgements
If you use this dataset in your research, please credit the original authors.
Data Source
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Columns
File: interessa+word1.csv
| Column name | Description |
| --- | --- |
| tok | Tokenized text of the tweet. (String) |
| abs | Absolute frequency of a token in the tweet. (Integer) |
| rel | Relative frequency of a token in the tweet. (Float) |
| var | Variation of a token in the tweet. (String) |
| city | City from which the tweet originated. (String) |
File: frega+word1.csv
| Column name | Description |
| --- | --- |
| tok | Tokenized text of the tweet. (String) |
| abs | Absolute frequency of a token in the tweet. (Integer) |
| rel | Relative frequency of a token in the tweet. (Float) |
| var | Variation of a token in the tweet. (String) |
| city | City from which the tweet originated. (String) |
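Since both files share the same five columns and documented types, one way to load them is to convert each row into a typed record. The sketch below assumes comma-separated files with a header row matching the column names listed above; the `TokenRecord` class, the `read_records` helper, and the sample row are hypothetical and only illustrate the String, Integer, and Float types.

```python
import csv
import io
from dataclasses import dataclass

EXPECTED_COLUMNS = ["tok", "abs", "rel", "var", "city"]


@dataclass
class TokenRecord:
    tok: str    # tokenized text (String)
    abs: int    # absolute frequency (Integer)
    rel: float  # relative frequency (Float)
    var: str    # variation label (String)
    city: str   # city of origin (String)


def read_records(handle):
    """Parse a CSV handle into TokenRecord objects, checking the header first."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [
        TokenRecord(
            tok=row["tok"],
            abs=int(row["abs"]),
            rel=float(row["rel"]),
            var=row["var"],
            city=row["city"],
        )
        for row in reader
    ]


# Invented single-row sample in the documented layout.
sample = io.StringIO("tok,abs,rel,var,city\nnon,3,0.5,x,Rome\n")
records = read_records(sample)
print(records[0])
```

Checking the header up front surfaces a mismatched delimiter or renamed column as a clear error instead of a `KeyError` partway through the file.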