GE Soccer Clubs News

News about soccer clubs from the largest Brazilian sports site (pt-br)

@kaggle.lgmoneda_ge_soccer_clubs_news

About this Dataset

Context

This dataset is intended to provide a real-world data sample covering a reasonable time frame. It contains a column with the club name, which can be treated as a class label.

Content

The news articles were extracted from the GE website. For each team, all available articles were considered, but in random order, so the sample covers the whole 2015-2020 period. The goal is to eventually extract every available article for the clubs already in the dataset and to add new clubs to it.

Known issue: there is almost no news between 2015-12 and 2016-07. Articles from that period are missing from the scraped list. I'm looking for a workaround.
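
To see how this gap affects a given sample, one option is to count articles per month and list the months that fall below a threshold. The sketch below is only an illustration: it assumes each row carries an ISO-formatted date string (for example from a hypothetical "date" column, which is not confirmed by this description), and the function names are made up for the example.

```python
from collections import Counter


def articles_per_month(dates):
    """Count articles per 'YYYY-MM' month from ISO-formatted date strings."""
    return Counter(d[:7] for d in dates)


def sparse_months(counts, start, end, threshold=1):
    """List months in [start, end] ('YYYY-MM', inclusive) with fewer than
    `threshold` articles."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    sparse = []
    while (year, month) <= (end_year, end_month):
        key = f"{year:04d}-{month:02d}"
        if counts.get(key, 0) < threshold:
            sparse.append(key)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return sparse


# Toy dates, not taken from the dataset.
toy_dates = ["2015-11-03", "2015-11-20", "2016-08-01"]
toy_counts = articles_per_month(toy_dates)
toy_sparse = sparse_months(toy_counts, "2015-11", "2016-02")
```

Raising `threshold` flags months that are thin rather than completely empty, which may be closer to the "almost no news" situation described above.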

Notice:

  • The same article can be associated with n teams; it is then represented by n different rows, each with a different value in the "club" column
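
Because of this, counting rows overstates the number of distinct articles. A minimal sketch of how rows could be grouped back into articles is shown below. Only the "club" column is confirmed by this description; the "link" key used to identify an article is a hypothetical column name chosen for the example.

```python
from collections import defaultdict


def clubs_per_article(rows, key="link"):
    """Map each article identifier to the sorted list of clubs it appears under.

    `key` is the column assumed to identify an article (hypothetical here).
    """
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[key]].add(row["club"])
    return {article: sorted(clubs) for article, clubs in grouped.items()}


# Toy rows, not taken from the dataset.
toy_rows = [
    {"link": "a1", "club": "flamengo"},
    {"link": "a1", "club": "vasco"},
    {"link": "a2", "club": "santos"},
]
toy_grouped = clubs_per_article(toy_rows)
```

For a single-label classification experiment, it may make sense to set aside the articles that map to more than one club, or to treat the task as multi-label.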

Acknowledgements

All the article contents are owned by GE and this dataset is solely an effort to put everything together to be easily available for data science experiments and research.

Inspiration

This dataset spans several years, which makes it well suited to studying how things evolve over time. It also supports sentiment analysis and classification with the "club" column as the target; in that case it may be worth removing the club name from the article text.
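
One simple way to remove the club name before training a classifier is to replace it with a placeholder. The sketch below is an assumption about how that could be done, not part of the dataset's tooling: the function name and the placeholder token are made up for the example, and it only matches the literal club name, so nicknames and abbreviations would still leak the label.

```python
import re


def mask_club(text, club, placeholder="[CLUB]"):
    """Replace whole-word, case-insensitive occurrences of `club` in `text`."""
    pattern = re.compile(r"\b" + re.escape(club) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(placeholder, text)


# Toy sentence, not taken from the dataset.
masked = mask_club("O Flamengo venceu o Vasco.", "flamengo")
```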

Updating it

Use the updating script if you need more recent data. Reach out if you are interested in maintaining it.

Citation

Remember to update the retrieval date and version when citing it.

@unpublished{Moneda2020genews,
  title={Globo Esporte News dataset},
  author={Moneda, Luis},
  year={2020},
  note={Version 11. Retrieved March 31, 2021 from https://www.kaggle.com/lgmoneda/ge-soccer-clubs-news}
}
