A dataset of Stack Overflow programming questions. For each question, it includes:
- Question ID
- Creation date
- Closed date, if applicable
- Score
- Owner user ID
- Number of answers
- Tags
This dataset is well suited to answering questions such as:
- How has the number of questions in each tag increased or decreased over time?
- Which tags tend to appear together on the same questions?
- Which tags tend to get higher or lower scores?
- Which tags tend to be asked about on weekends vs. weekdays?
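As a rough illustration of the weekend vs. weekday question, the sketch below computes, for each tag, the fraction of its questions created on a Saturday or Sunday. It assumes each record has been reduced to a (creation date, tag) pair and that dates are ISO 8601 strings; both are assumptions made for the example, not a description of the actual file layout, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime


def weekend_share_by_tag(rows):
    """Return {tag: fraction of that tag's questions created on Sat/Sun}.

    `rows` is an iterable of (creation_date, tag) pairs, where creation_date
    is an ISO 8601 string such as "2016-10-08T14:03:00Z" (assumed format).
    """
    totals = defaultdict(int)
    weekend = defaultdict(int)
    for creation_date, tag in rows:
        # Only the date part is needed to determine the day of the week.
        day = datetime.strptime(creation_date[:10], "%Y-%m-%d")
        totals[tag] += 1
        if day.weekday() >= 5:  # Monday is 0, so 5 and 6 are Sat and Sun
            weekend[tag] += 1
    return {tag: weekend[tag] / totals[tag] for tag in totals}


# Made-up sample rows, for illustration only.
sample = [
    ("2016-10-08T14:03:00Z", "python"),  # a Saturday
    ("2016-10-10T09:15:00Z", "python"),  # a Monday
    ("2016-10-09T20:41:00Z", "r"),       # a Sunday
]
shares = weekend_share_by_tag(sample)
```

The same counting pattern (one dictionary of totals, one of matches) carries over to the other questions, for example by grouping on month instead of day of week.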
This dataset was extracted from the Stack Overflow database at 2016-10-13 18:09:48 UTC and contains questions up to 2016-10-12. It includes 12,583,347 non-deleted questions and 3,654,954 deleted ones.
This is all public data from the Stack Exchange Data Dump, which is much more comprehensive (it includes question and answer text) but also takes far more computation to download and process. This dataset is designed to be easy to read in and start analyzing. The same data can also be examined in the Stack Exchange Data Explorer, but this dataset lets analysts work with it locally using the tools of their choice.
Note that, for space reasons, only non-deleted questions are included in the SQLite dataset, while the csv.gz files include deleted questions as well (with an additional DeletionDate field).
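To work only with non-deleted questions from a csv.gz file, the deleted rows can be filtered out while reading. The following is a minimal sketch using only the Python standard library. It assumes that `DeletionDate` is a column in the file and that a missing value is written as an empty string or `NA`; the other column names and the sample rows are hypothetical and exist only to make the example self-contained.

```python
import csv
import gzip
import io


def read_non_deleted(binary_file, deletion_column="DeletionDate"):
    """Yield rows (as dicts) whose deletion column is empty or "NA".

    `binary_file` is a path or a binary file object holding gzipped CSV.
    The "NA" convention for missing values is an assumption.
    """
    with gzip.open(binary_file, mode="rt", newline="", encoding="utf-8") as text:
        for row in csv.DictReader(text):
            if row.get(deletion_column, "") in ("", "NA"):
                yield row


# Build a tiny gzipped CSV in memory (made-up rows, hypothetical columns).
raw = (
    "Id,CreationDate,DeletionDate,Score\n"
    "1,2016-10-08T14:03:00Z,NA,3\n"
    "2,2016-10-09T20:41:00Z,2016-10-10T01:00:00Z,-1\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
kept = list(read_non_deleted(buffer))
```

Because the function is a generator, rows are processed one at a time, so a large file does not have to fit in memory. For a file on disk, pass its path in place of the in-memory buffer.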
See the GitHub repo for more.