Cyclisitic Trip Data 2019 (Google) by Kaggle | Transportation and Logistics

About this Dataset

Cyclisitic Trip Data 2019 (Google)

Intro

Cleaning this data took some time due to many NULL values, typos, and unorganized collection. My first step was to put the dataset into R and work my magic there. After analyzing and cleaning the data, I moved the data to Tableau to create easily understandable and helpful graphs. This step was a learning curve because there are so many potential options inside Tableau. Finding the correct graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.

RStudio

Firstly I needed to combine the 4 datasets into 1, I did this using the rbind() function.

Step two was to remove typos or poorly named columns.
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"'
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"

Next step was to remove all NULL and over exaggerated numbers. Such as trip durations more than 10 hours long.

library(dplyr) Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>% filter(across(where(is.character), ~ . != "NULL")) %>% type.convert(as.is = TRUE)

Once removing the NULL data, it was time to remove potential typos and poorly collected data. I could only identify exaggerated data under the "trip_duration" column. Finding that there were multiple cases of 2,000,000 + second trips. To find these large values, I used the count() function.

Cyclistic_Clean_v2 %>% count(Cyclistic_Clean_v2, trip_duration > "30000")

After finding multiple instances of this, I ran into a hard spot, the trip_duration column was categorized as a character when it needed to be numeric to be further cleaned. it took me quite a while to find out that this was an issue, and then I remembered the class() function. With this, I was easily able to identify that the classification was wrong

class(Cyclistic_Clean_v2$trip_duration)

Once identifying the classification, I still had some work to do before converting it to an integer as it contained quotations, periods, and a trailing 0. To remove these I used the gsub() function.

Cyclistic_Clean_v2$trip_duration <- gsub(".0", "", Cyclistic_Clean_v2$trip_duration)
Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)

Now that unwanted characters are gone, we can convert the column into numeric.

Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)

Doing this allows Tableau and R to read the data properly to create graphs without error.

Next I created a backup dataset incase there was any issue while exporting.

Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
write.csv(Cyclistic_Clean_v2,"Folder.Path\Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)

After exporting I came to the conclusion that I should have put together a more accurate change log rather than brief notes. That is one major learning lesson I will take away from this project.

All around, I had a lot of fun using R to transform and analyze the data. I learned many of different ways to efficiently clean data.

Tableau

Now onto the fun part! Tableau is a very good tool to learn. There are so many different ways to bring your data to life and show your creativity inside your work. After a few guides and errors, I could finally start building graphs to bring the stakeholders' tasks to fruition.

Charts

Please note this are all made in tableau and meant to be interactive.

Here you can find the relation between male and female riders.

Male vs Female tripduration with usertype

Busiest stations filtered by months. (This is meant to be interactive.)

Most popular starting stations.

Most popular ending stations.

Conclusion

My main goal was to help find out how Cyclistic can convert casual riders into subscribers. Here is my findings.

Casual riders ride much longer than subscribers duration wise.
Although there are many more male riders, females tend to ride longer than males.
Stations #562 & #568 are the most busy by a high margin.
Summer months, July and August are the most popular months for riders.

What can Cyclistic do to increase revenue & convert casual riders?

Increase marketing budget in summer months.
Build in a rewards program. Ex. the more you ride the more rewards you get. (Only available to subscribers.)
When marketing, focus on targeting the male audience in their late 20's to mid 30's.
Boost spending budget in areas surrounding stations #562 & #568.

Thank you for looking at my first data analysis project. I had a great time learning hands-on, and can't wait to see what the future holds for me in my data analysis career!

Tables

Divvy Clean 2019 Final

@kaggle.shainepepper_divvy_2019_trip_data_clean.divvy_clean_2019_final

28.52 MB
1005523 rows
12 columns


CREATE TABLE divvy_clean_2019_final (
  "trip_id" DOUBLE,
  "start_time" TIMESTAMP,
  "end_time" TIMESTAMP,
  "bike_id" DOUBLE,
  "from_station_id" DOUBLE,
  "from_station_name" VARCHAR,
  "to_station_id" DOUBLE,
  "to_station_name" VARCHAR,
  "user_type" VARCHAR,
  "gender" VARCHAR,
  "birth_year" DOUBLE,
  "trip_duration" DOUBLE
);