Baselight
Login
Login

2022 Bikeshare Data -Reduced File Size -All Months

Google Capstone Data Too Big for Posit Cloud? Try this grouped & summarized set

@kaggle.kendallmarie_2022_bikeshare_data_all_months_combined

About this Dataset

2022 Bikeshare Data -Reduced File Size -All Months

This is a condensed version of the raw data obtained through the Google Data Analytics Course, made available by Lyft and the City of Chicago under this license (https://ride.divvybikes.com/data-license-agreement).

I originally did my study in another platform, and the original files were too large to upload to Posit Cloud in full. Each of the 12 monthly files contained anywhere from 100k to 800k rows. Therefore, I decided to reduce the number of rows drastically by performing grouping, summaries, and thoughtful omissions in Excel for each csv file. What I have uploaded here is the result of that process.

Data is grouped by: month, day, rider_type, bike_type, and time_of_day. total_rides represent the sum of the data in each grouping as well as the total number of rows that were combined to make the new summarized row, avg_ride_length is the calculated average of all data in each grouping.

Be sure that you use weighted averages if you want to calculate the mean of avg_ride_length for different subgroups as the values in this file are already averages of the summarized groups. You can include the total_rides value in your weighted average calculation to weigh properly.

9 Columns:

date - year, month, and day in date format - includes all days in 2022
day_of_week - Actual day of week as character. Set up a new sort order if needed.
rider_type - values are either 'casual', those who pay per ride, or 'member', for riders who have annual memberships.
bike_type - Values are 'classic' (non-electric, traditional bikes), or 'electric' (e-bikes).
time_of_day - this divides the day into 6 equal time frames, 4 hours each, starting at 12AM. Each individual ride was placed into one of these time frames using the time they STARTED their rides, even if the ride was long enough to end in a later time frame. This column was added to help summarize the original dataset.
total_rides - Count of all individual rides in each grouping (row). This column was added to help summarize the original dataset.
avg_ride_length - The calculated average of all rides in each grouping (row). Look to total_rides to know how many original rides length values were included in this average. This column was added to help summarize the original dataset.
min_ride_length - Minimum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.
max_ride_length - Maximum ride length of all rides in each grouping (row). This column was added to help summarize the original dataset.

Please note: the time_of_day column has inconsistent spacing. Use mutate(time_of_day = gsub(" ", "", time_of _day)) to remove all spaces.

Revisions

Below is the list of revisions I made in Excel before uploading the final csv files to the R environment:

  • Deleted station location columns and lat/long as much of this data was already missing.

  • Deleted ride id column since each observation was unique and I would not be joining with another table on this variable.

  • Deleted rows pertaining to "docked bikes" since there were no member entries for this type and I could not compare member vs casual rider data. I also received no information in the project details about what constitutes a "docked" bike.

  • Used ride start time and end time to calculate a new column called ride_length (by subtracting), and deleted all rows with 0 and 1 minute results, which were explained in the project outline as being related to staff tasks rather than users. An example would be taking a bike out of rotation for maintenance.

  • Placed start time into a range of times (time_of_day) in order to group more observations while maintaining general time data. time_of_day now represents a time frame when the bike ride BEGAN. I created six 4-hour time frames, beginning at 12AM.

  • Added a Day of Week column, with Sunday = 1 and Saturday = 7, then changed from numbers to the actual day names.

  • Used pivot tables to group total_rides, avg_ride_length, min_ride_length, and max_ride_length by date, rider_type, bike_type, and time_of_day.

  • Combined into one csv file with all months, containing less than 9,000 rows (instead of several million)

Share link

Anyone who has the link will be able to view this.