Airplane Crashes and Fatalities Since 1908 (Full history of airplane crashes throughout the world, from 1908-present)
When this dataset was created on Kaggle (2016-09-09), the original version was hosted by Open Data by Socrata at: https://opendata.socrata.com/Government/Airplane-Crashes-and-Fatalities-Since-1908/q2te-8cvq. Unfortunately, that version is no longer available. The dataset covers airplane accidents involving civil, commercial and military transport worldwide from 1908-09-17 to 2009-06-08.
While applying for a data scientist job opportunity, I was asked the following questions on this dataset:
- For each year, how many planes crashed? How many people were on board? How many survived? How many died?
- Highest number of crashes by operator and by type of aircraft.
- The ‘Summary’ field has the details about the crashes. Find the reasons for the crashes and categorize them into clusters, e.g. fire, shot down, weather (for blanks in the data, the category can be UNKNOWN). You are free to choose the clusters, but there should be no more than 7.
- Find the number of crashed aircraft and the number of deaths for each category from the previous step.
- Find any interesting trends/behaviors that you encounter when you analyze the dataset.
My solution was:
The following bar charts answer point 1 of the assignment, in particular:
- planes crashed per year
- people aboard crashed planes per year
- people killed in crashes per year
- people who survived crashes per year
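As a rough illustration of how these four yearly series could be computed, here is a minimal Python sketch that uses only the standard library. It is not the code behind the charts. The column names `Date`, `Aboard` and `Fatalities`, the MM/DD/YYYY date format, and the choice to count blank values as 0 are all assumptions, and the sample rows are made up.

```python
from collections import defaultdict


def to_int(value):
    """Parse a count field; blank or malformed values count as 0 (an assumption)."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return 0


def yearly_totals(rows):
    """Aggregate crashes, people aboard, fatalities and survivors per year.

    `rows` is an iterable of dicts, e.g. from csv.DictReader. The keys
    "Date" (MM/DD/YYYY), "Aboard" and "Fatalities" are assumed names.
    """
    totals = defaultdict(
        lambda: {"crashes": 0, "aboard": 0, "fatalities": 0, "survivors": 0}
    )
    for row in rows:
        year = int(row["Date"].split("/")[-1])
        aboard = to_int(row.get("Aboard"))
        dead = to_int(row.get("Fatalities"))
        entry = totals[year]
        entry["crashes"] += 1
        entry["aboard"] += aboard
        entry["fatalities"] += dead
        # Survivors are derived; clamp at 0 when "Aboard" is missing.
        entry["survivors"] += max(aboard - dead, 0)
    return dict(sorted(totals.items()))


# Made-up rows, for illustration only.
sample_rows = [
    {"Date": "01/15/2001", "Aboard": "10", "Fatalities": "4"},
    {"Date": "06/30/2001", "Aboard": "3", "Fatalities": "3"},
    {"Date": "02/02/2002", "Aboard": "", "Fatalities": "2"},
]

for year, entry in yearly_totals(sample_rows).items():
    print(year, entry)
```

With the real file, the rows could come from `csv.DictReader`, and each of the four fields would then feed one bar chart.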
The following answers address point 2 of the assignment:
- Highest number of crashes by operator: Aeroflot with 179 crashes
- By Type of aircraft: Douglas DC-3 with 334 crashes
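A frequency count is enough to get figures like these. The sketch below shows one possible way to do it with `collections.Counter`. The column names `Operator` and `Type` are assumptions, `top_value` is a hypothetical helper, and the sample rows are made up.

```python
from collections import Counter


def top_value(rows, column):
    """Return (value, count) for the most frequent non-blank value in `column`.

    Returns None if the column has no non-blank values.
    """
    counts = Counter()
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:  # skip blank cells
            counts[value] += 1
    return counts.most_common(1)[0] if counts else None


# Made-up rows, for illustration only.
sample_rows = [
    {"Operator": "Alpha Air", "Type": "Model X"},
    {"Operator": "Alpha Air", "Type": "Model Y"},
    {"Operator": "Beta Air", "Type": "Model Y"},
    {"Operator": "", "Type": "Model Y"},
]

print(top_value(sample_rows, "Operator"))
print(top_value(sample_rows, "Type"))
```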
I identified 7 clusters by applying k-means clustering to a matrix derived from a text corpus, which was prepared with standard text-analysis steps (convert to plain text, remove punctuation, convert to lower case, etc.).
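To make the approach concrete, here is a small self-contained sketch of the same idea: clean the text, build a term-count matrix, and run k-means (Lloyd's algorithm) on it. It is a plain standard-library illustration, not the original implementation. The stopword list, the function names and the example summaries are all assumptions made for the sketch.

```python
import random
import re

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "at", "was"}


def tokenize(text):
    """Lowercase, replace punctuation and digits with spaces, drop stopwords."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [word for word in cleaned.split() if word not in STOPWORDS]


def term_matrix(docs):
    """Return (vocabulary, rows); each row holds the term counts of one document."""
    tokens = [tokenize(doc) for doc in docs]
    vocab = sorted({word for doc in tokens for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in tokens:
        row = [0.0] * len(vocab)
        for word in doc:
            row[index[word]] += 1.0
        rows.append(row)
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up summaries, for illustration only.
docs = [
    "Engine failure on takeoff.",
    "Engine failure during takeoff",
    "Crashed in heavy fog",
    "Crashed in fog",
]
vocab, matrix = term_matrix(docs)
print(kmeans(matrix, k=2))
```

For the full dataset the call would use `k=7`. Blank summaries become all-zero rows here, so they could alternatively be set aside as an UNKNOWN category before clustering.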
The following table summarizes, for each cluster, the number of crashes and deaths.
- Cluster 1: 258 crashes, 6368 deaths
- Cluster 2: 500 crashes, 9408 deaths
- Cluster 3: 211 crashes, 3513 deaths
- Cluster 4: 1014 crashes, 14790 deaths
- Cluster 5: 2749 crashes, 58826 deaths
- Cluster 6: 195 crashes, 4439 deaths
- Cluster 7: 341 crashes, 8135 deaths
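Given one cluster label and one fatality count per crash, a table like the one above takes a single pass over the data. The sketch below is an illustration under that assumption. `cluster_totals` is a hypothetical helper, and the `reported` dictionary simply copies the figures listed above so that their totals can be checked.

```python
from collections import defaultdict


def cluster_totals(labels, fatalities):
    """Return {label: (crashes, deaths)} from parallel per-crash sequences."""
    totals = defaultdict(lambda: [0, 0])
    for label, dead in zip(labels, fatalities):
        totals[label][0] += 1      # one more crash in this cluster
        totals[label][1] += dead   # add its fatalities
    return {label: tuple(pair) for label, pair in sorted(totals.items())}


# Figures copied from the table above: cluster -> (crashes, deaths).
reported = {
    1: (258, 6368),
    2: (500, 9408),
    3: (211, 3513),
    4: (1014, 14790),
    5: (2749, 58826),
    6: (195, 4439),
    7: (341, 8135),
}
total_crashes = sum(crashes for crashes, _ in reported.values())  # 5268
total_deaths = sum(deaths for _, deaths in reported.values())     # 105479
print(total_crashes, total_deaths)
```

Comparing `total_crashes` with the number of rows in the dataset is a quick way to confirm that every crash was assigned to exactly one cluster.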
The following picture shows clusters using the first 2 principal components:
For each cluster, I summarize the most used words and try to identify the cause of the crashes.
Cluster 1 (258):
aircraft, crashed, plane, shortly, taking
Not much information about this cluster can be deduced using text analysis
Cluster 2 (500):
aircraft, airport, altitude, crashed, crew, due, engine, failed, failure, fire, flight, landing, lost, pilot, plane, runway, takeoff, taking
Engine failure on the runway after landing or takeoff
Cluster 3 (211):
aircraft, crashed, fog
Crash caused by fog
Cluster 4 (1014):
aircraft, airport, attempting, cargo, crashed, fire, land, landing, miles, pilot, plane, route, runway, struck, takeoff
Struck a cargo during landing or takeoff
Cluster 5 (2749):
accident, aircraft, airport, altitude, approach, attempting, cargo, conditions, control, crashed, crew, due, engine, failed, failure, feet, fire, flight, flying, fog, ground, killed, land, landing, lost, low, miles, mountain, pilot, plane, poor, route, runway, short, shortly, struck, takeoff, taking, weather
Struck a cargo due to engine failure or bad weather conditions, mainly fog
Cluster 6 (195):
aircraft, crashed, engine, failure, fire, flight, left, pilot, plane, runway
Engine failure on the runway
Cluster 7 (341):
accident, aircraft, altitude, cargo, control, crashed, crew, due, engine, failure, flight, landing, loss, lost, pilot, plane, takeoff
Engine failure during landing or takeoff
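Word lists like the ones above can be obtained by counting words within each cluster. The following sketch shows one possible way, assuming one summary string and one cluster label per crash. The function name, the stopword list and the sample data are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "at", "was"}


def top_words(summaries, labels, n=5):
    """Return {label: [up to n most frequent words in that cluster]}."""
    counts = defaultdict(Counter)
    for text, label in zip(summaries, labels):
        words = re.findall(r"[a-z]+", text.lower())
        counts[label].update(w for w in words if w not in STOPWORDS)
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in sorted(counts.items())
    }


# Made-up summaries and labels, for illustration only.
summaries = [
    "Engine failure on takeoff",
    "Engine fire",
    "Crashed in fog",
    "Dense fog",
]
labels = [0, 0, 1, 1]
print(top_words(summaries, labels, n=1))
```

Words such as "aircraft", "crashed" and "plane" appear in almost every cluster, so filtering them out as well might make the clusters easier to tell apart.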
Better solutions are welcome.
Thanks,
Sauro