Cyclisitic Trip Data 2019 (Google)
Take a look at my first case study! I had a lot of fun making this.
@kaggle.shainepepper_divvy_2019_trip_data_clean
Take a look at my first case study! I had a lot of fun making this.
@kaggle.shainepepper_divvy_2019_trip_data_clean
Cleaning this data took some time due to many NULL values, typos, and unorganized collection. My first step was to put the dataset into R and work my magic there. After analyzing and cleaning the data, I moved the data to Tableau to create easily understandable and helpful graphs. This step was a learning curve because there are so many potential options inside Tableau. Finding the correct graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.
Firstly I needed to combine the 4 datasets into 1, I did this using the rbind() function.
Step two was to remove typos or poorly named columns.
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"'
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"
Next step was to remove all NULL and over exaggerated numbers. Such as trip durations more than 10 hours long.
library(dplyr) Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>% filter(across(where(is.character), ~ . != "NULL")) %>% type.convert(as.is = TRUE)
Once removing the NULL data, it was time to remove potential typos and poorly collected data. I could only identify exaggerated data under the "trip_duration" column. Finding that there were multiple cases of 2,000,000 + second trips. To find these large values, I used the count() function.
Cyclistic_Clean_v2 %>% count(Cyclistic_Clean_v2, trip_duration > "30000")
After finding multiple instances of this, I ran into a hard spot, the trip_duration column was categorized as a character when it needed to be numeric to be further cleaned. it took me quite a while to find out that this was an issue, and then I remembered the class() function. With this, I was easily able to identify that the classification was wrong
class(Cyclistic_Clean_v2$trip_duration)
Once identifying the classification, I still had some work to do before converting it to an integer as it contained quotations, periods, and a trailing 0. To remove these I used the gsub() function.
Cyclistic_Clean_v2$trip_duration <- gsub(".0", "", Cyclistic_Clean_v2$trip_duration)
Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)
Now that unwanted characters are gone, we can convert the column into numeric.
Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)
Doing this allows Tableau and R to read the data properly to create graphs without error.
Next I created a backup dataset incase there was any issue while exporting.
Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
write.csv(Cyclistic_Clean_v2,"Folder.Path\Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)
After exporting I came to the conclusion that I should have put together a more accurate change log rather than brief notes. That is one major learning lesson I will take away from this project.
All around, I had a lot of fun using R to transform and analyze the data. I learned many of different ways to efficiently clean data.
Now onto the fun part! Tableau is a very good tool to learn. There are so many different ways to bring your data to life and show your creativity inside your work. After a few guides and errors, I could finally start building graphs to bring the stakeholders' tasks to fruition.
Please note this are all made in tableau and meant to be interactive.
Here you can find the relation between male and female riders.
Male vs Female tripduration with usertype
Busiest stations filtered by months. (This is meant to be interactive.)
Most popular starting stations.
Most popular ending stations.
My main goal was to help find out how Cyclistic can convert casual riders into subscribers. Here is my findings.
What can Cyclistic do to increase revenue & convert casual riders?
Thank you for looking at my first data analysis project. I had a great time learning hands-on, and can't wait to see what the future holds for me in my data analysis career!
Anyone who has the link will be able to view this.