Context
The Olympic Games are more than sporting events: social and economic factors affect every nation's performance. Measuring that performance starts with collecting data. There is a comprehensive dataset by @heesoo37 which covers the Olympic Games from 1896 to 2016. I tried in vain to find a similar dataset for the 2020 Summer Olympics, so I decided to build one from the data available on the official Olympics website, www.olympics.com. The rvest, jsonlite and tidyverse packages for R were used to scrape the data.
Content
This dataset lists every event in which an athlete participated, together with the athlete's age, nationality, rank and medal. There are two clear differences between this dataset and similar ones. First, in addition to medals, ranks are included for every event an athlete took part in. Second, each event is labeled so that one can easily infer whether it is a team or an individual event. I will explain my motivation for this in a separate notebook. In a nutshell, measuring performance just by counting medals, and treating each team medal as an individual medal, is not accurate, so a new Key Performance Index needs to be defined.
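To illustrate why the team/individual label matters, here is a minimal sketch in Python (standard library only). The rows, field order and names are hypothetical and are not taken from the dataset. The second function is only one possible way to avoid inflating team medals, not the Key Performance Index mentioned above.

```python
from collections import Counter

# Hypothetical rows: (athlete, country, event, is_team_event, medal).
# All values are made up for illustration; the real dataset's columns may differ.
rows = [
    ("Athlete A", "AAA", "Relay", True, "Gold"),
    ("Athlete B", "AAA", "Relay", True, "Gold"),
    ("Athlete C", "AAA", "Relay", True, "Gold"),
    ("Athlete D", "AAA", "Relay", True, "Gold"),
    ("Athlete E", "BBB", "Sprint", False, "Gold"),
    ("Athlete F", "BBB", "Jump", False, None),
]

def naive_medal_count(rows):
    """Count one medal per athlete row, so a team medal is counted once per team member."""
    return Counter(country for _, country, _, _, medal in rows if medal)

def event_medal_count(rows):
    """Count a team medal once per (country, event, medal); individual medals once per athlete."""
    counts = Counter()
    seen_team_medals = set()
    for _, country, event, is_team, medal in rows:
        if not medal:
            continue  # no medal won in this event
        if is_team:
            key = (country, event, medal)
            if key in seen_team_medals:
                continue  # this team medal was already counted
            seen_team_medals.add(key)
        counts[country] += 1
    return counts

print(naive_medal_count(rows))
print(event_medal_count(rows))
```

With these toy rows, the naive count credits country AAA with four medals for a single relay, while the event-level count credits it with one. The sketch assumes a country wins a given medal at most once per team event.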
Although the data offered by www.olympics.com is not perfect, this website is the most comprehensive reference for the 2020 Summer Olympics. www.olympedia.com is another good resource for historical data on past Olympic Games; it is maintained by a number of Olympic historians and statisticians. The main reference for building this dataset was www.olympics.com. Some entries looked dubious; these were corrected or omitted after checking them against www.olympedia.com and www.wikipedia.com.
Inspiration
This dataset can be used to understand which countries performed better in the 2020 Summer Olympics and what factors affected their success.