Name: Italian Negation Constructions - Tweets
Creator: Kaggle
Published: 2025-02-13T08:24:54.017Z
License: https://creativecommons.org/publicdomain/zero/1.0/

Exploring Language Variation Across 10 Cities

Italian Negation Constructions - Tweets

About this dataset

This dataset, the Twitter Italian Negation (TIN) Corpus, offers a glimpse into language change in Romance languages through the emergence of non-standard uses of negation. The collection contains 10,000 tweets from ten cities (Milan, Rome, Naples, Palermo, Bologna, Turin, Florence, Cagliari, Genoa, and New York City), each collected in August 2019. The data includes tokenized text and frequency measures for each tweet, as well as a city column so that users can explore regional differences. With this resource, users can investigate how language use in these cities is changing over time or how usage differs between regions. The corpus also sheds light on the shifts that occur between spoken and written language.

How to use the dataset

This dataset contains 10,000 tweets in Italian gathered from ten different cities between August and December 2019. This collection of tweets provides an interesting insight into the language change phenomena in Romance languages, specifically with regard to non-standard uses of negations.

The dataset is composed of five columns: token, absolute frequency, relative frequency, variation, and the city from which the tweet originated. Each row represents a single token in a particular tweet; a tweet can contain more than one token.

By using this dataset, you can analyze and compare usage patterns across different cities or within a single city. You can also compare token variations between cities to understand how certain constructions are used differently across regions or dialects.
Additionally, you could use this data to examine trends such as the most commonly used words and phrases over time.

To use the data effectively, it is important first to understand what each column represents:

Tok (Tokenized text): The text of a tweet broken down into individual tokens, covering all of its words as well as punctuation marks such as commas or exclamation points;

Abs (Absolute Frequency): This is the total number of times that a particular token appears within all tweets;

Rel (Relative Frequency): This is how often a particular token appears relative to other tokens;

Var (Variation): This indicates whether there have been any alterations made compared to standard usage such as “has” being replaced with “haz”;

City: The city from which each tweet originated, for example “Milan” or “Genoa”. This supports analysis of usage differences among locales, and cities can also be grouped into larger geographic areas such as “Italy” versus other countries like the “United States”.
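The relationship between absolute and relative frequency can be illustrated with a minimal sketch. It assumes that relative frequency is a token's count divided by the total count of all tokens, which the description above suggests but does not state precisely. The tokens are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical tokens, invented for illustration (not from the dataset).
tokens = ["non", "mi", "interessa", "non", "frega", "non"]

# Absolute frequency: how many times each token occurs.
abs_freq = Counter(tokens)

# Relative frequency (assumed definition): count divided by total token count.
total = sum(abs_freq.values())
rel_freq = {tok: count / total for tok, count in abs_freq.items()}

for tok in abs_freq:
    print(tok, abs_freq[tok], round(rel_freq[tok], 3))
```

Under this assumed definition, the relative frequencies of all tokens sum to 1, which may be a useful sanity check against the Rel column.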

Combining these numeric values with thematic exploration helps reveal not only how constructions are used but also how usage trends differ across geographic populations, both locally and globally, as reflected by Twitter users, especially for non-standard dialectal constructions throughout Italy.

Research Ideas

Studying the regional variation of Italian negation constructions by comparing the frequency and variation between cities.

Investigating language change over time by tracking changes in relative and absolute frequencies of negation constructions across tweets.

Exploring how different socio-economic contexts or trends, such as news, fashion, and sports, have influenced the evolution of language use in tweets from each city.
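As a starting point for the first idea, the following sketch compares the share of one variant between cities. It uses only the Python standard library and the documented column names (tok, abs, rel, var, city). The rows, the comma delimiter, and the labels "standard" and "non-standard" in the var column are assumptions made for illustration; the actual values in the files may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the documented column layout (tok, abs, rel, var, city).
# The values are invented for illustration and are not real dataset entries.
sample = """tok,abs,rel,var,city
interessa,12,0.40,standard,Milan
interessa,18,0.60,non-standard,Milan
interessa,25,0.50,standard,Naples
interessa,25,0.50,non-standard,Naples
"""

# Sum absolute frequencies per (city, variation) pair.
counts = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    counts[(row["city"], row["var"])] += int(row["abs"])

# Share of the (assumed) non-standard variant within each city.
cities = sorted({city for city, _ in counts})
share = {}
for city in cities:
    city_total = sum(n for (c, _), n in counts.items() if c == city)
    share[city] = counts[(city, "non-standard")] / city_total

for city in cities:
    print(f"{city}: {share[city]:.2f}")
```

Normalizing within each city, rather than comparing raw counts, keeps cities with more tweets from dominating the comparison.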

Acknowledgements

If you use this dataset in your research, please credit the original authors.
Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: interessa+word1.csv

Column name	Description
tok	Tokenized text of the tweet. (String)
abs	Absolute frequency of a token in the tweet. (Integer)
rel	Relative frequency of a token in the tweet. (Float)
var	Variation of a token in the tweet. (String)
city	City from which the tweet originated. (String)

File: frega+word1.csv

Column name	Description
tok	Tokenized text of the tweet. (String)
abs	Absolute frequency of a token in the tweet. (Integer)
rel	Relative frequency of a token in the tweet. (Float)
var	Variation of a token in the tweet. (String)
city	City from which the tweet originated. (String)
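A minimal sketch of reading one of these files with the column types listed above might look as follows. It assumes the files are comma-delimited, UTF-8 encoded, and start with a header row using the documented column names; none of this is confirmed by the description. The helper name read_rows and the demo row are hypothetical.

```python
import csv
import io

def read_rows(handle):
    """Parse rows, converting abs to int and rel to float per the column types."""
    rows = []
    for row in csv.DictReader(handle):
        row["abs"] = int(row["abs"])
        row["rel"] = float(row["rel"])
        rows.append(row)
    return rows

# Stand-in for a real file; the row below is invented for illustration.
# With the actual data one would instead use something like:
#   with open("frega+word1.csv", newline="", encoding="utf-8") as f:
#       rows = read_rows(f)
demo = io.StringIO("tok,abs,rel,var,city\nfrega,3,0.25,example,Rome\n")
rows = read_rows(demo)
print(rows[0])
```

If the real files use a different delimiter, the delimiter argument of csv.DictReader can be adjusted accordingly.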
