Data Stories Of US Airlines, 1987-2008
Flight arrival and departure details for all commercial flights
@kaggle.prajitdatta_data_stories_of_us_airlines
The data used in this project is real and was collected over more than 20 years. The dataset
contains roughly 120 million records and is approximately 12GB in size. It consists of flight
arrival and departure details for all commercial flights within the USA, from October 1987 to
April 2008, with around 29 attributes per record.
How to get the data?
The data originally comes from http://stat-computing.org/dataexpo/2009/the-data.html
You can download the data for each year by clicking the appropriate link on that
website (remember that the total size is going to be more than 12GB).
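Because the full dataset may not fit comfortably in memory, one option is to stream each yearly file row by row instead of loading it whole. The following is a minimal sketch, assuming a yearly file has already been downloaded and decompressed to a CSV with a header row that includes the `UniqueCarrier` column named in the problem statements. The function name `count_flights_by_carrier` and the path in the usage note are hypothetical.

```python
import csv
from collections import Counter


def count_flights_by_carrier(csv_path):
    """Stream a yearly CSV file and count flights per carrier code.

    Assumption: the file has a header row with a UniqueCarrier column.
    Rows are read one at a time, so memory use stays small even for
    very large files.
    """
    counts = Counter()
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row["UniqueCarrier"]] += 1
    return counts
```

A call such as `count_flights_by_carrier("2008.csv")` (hypothetical path) would return a `Counter` keyed by carrier code, which can also serve as the denominator for the probability questions.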
(i) Problem Statement
(a) Check the skewness of Distance travelled by airlines.
(b) Calculate the mean, median and quantiles of the distance travelled by US Airlines (US).
(c) Check the standard deviation of distance travelled by American Airlines (AA).
(d) Draw a boxplot of Distance grouped by UniqueCarrier.
(e) Show the direction of the relationship between ArrDelay and DepDelay by drawing a scatterplot.
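The summary statistics in (a) to (c) can be sketched with the Python standard library. This is an illustration under stated assumptions, not a reference solution: the records below are hypothetical toy values rather than rows from the dataset, `skewness` is a helper defined here using the moment-based (population) definition, and other tools may use a slightly different skewness or quantile convention. The plots in (d) and (e) are left out because they need a plotting library.

```python
import statistics


def skewness(values):
    """Moment-based (population) skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical toy records: (UniqueCarrier, Distance in miles).
flights = [
    ("US", 200), ("US", 400), ("US", 600), ("US", 800), ("US", 3000),
    ("AA", 500), ("AA", 1000), ("AA", 1500),
]

all_distances = [dist for _, dist in flights]
us_distances = [dist for carrier, dist in flights if carrier == "US"]
aa_distances = [dist for carrier, dist in flights if carrier == "AA"]

# (a) Skewness of Distance over all flights (positive means a long right tail).
print("skewness:", skewness(all_distances))

# (b) Mean, median and quartiles of Distance for carrier US.
print("US mean:", statistics.mean(us_distances))      # 1000
print("US median:", statistics.median(us_distances))  # 600
print("US quartiles:", statistics.quantiles(us_distances, n=4))

# (c) Sample standard deviation of Distance for carrier AA.
print("AA sd:", statistics.stdev(aa_distances))       # 500.0
```

In the toy data the single 3000-mile flight pulls the mean (1000) well above the median (600), which is the pattern a positive skewness value summarizes.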
(ii) Problem Statement
(a) What is the probability that a flight which is landing/taking off is a “WN” Airlines flight? (marginal probability)
(b) What is the probability that a flight which is landing/taking off is either a “WN” or an “AA” Airlines flight? (disjoint events)
(c) What is the joint probability that a flight is both “WN” and travels less than 600 miles? (joint probability)
(d) What is the conditional probability that the flight travels less than 2500 miles, given that the flight is “AA” Airlines? (conditional probability)
(e) What is the probability that a flight is cancelled and was scheduled to travel less than 2500 miles, given that the flight is “AA” Airlines? (joint + conditional probability)
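Each of the probabilities above can be estimated as a relative frequency: count the flights that satisfy the event and divide by the number of flights in the relevant pool. The sketch below uses hypothetical toy records, not rows from the dataset, and the helper names `prob`, `is_wn` and `is_aa` are introduced here for illustration only.

```python
# Hypothetical toy records: (UniqueCarrier, Distance in miles, cancelled flag).
flights = [
    ("WN", 300, 0), ("WN", 550, 0), ("WN", 900, 0), ("WN", 450, 1),
    ("AA", 1200, 0), ("AA", 2600, 0), ("AA", 700, 1), ("AA", 2000, 0),
    ("US", 400, 0), ("US", 1500, 0),
]


def prob(event, given=lambda flight: True):
    """Relative frequency of `event` among the records satisfying `given`."""
    pool = [flight for flight in flights if given(flight)]
    return sum(1 for flight in pool if event(flight)) / len(pool)


def is_wn(flight):
    return flight[0] == "WN"


def is_aa(flight):
    return flight[0] == "AA"


# (a) Marginal: share of all flights operated by WN.
p_wn = prob(is_wn)

# (b) Disjoint events: a flight cannot be both WN and AA, so probabilities add.
p_wn_or_aa = prob(is_wn) + prob(is_aa)

# (c) Joint: WN and shorter than 600 miles, over all flights.
p_wn_and_short = prob(lambda f: is_wn(f) and f[1] < 600)

# (d) Conditional: shorter than 2500 miles, among AA flights only.
p_lt2500_given_aa = prob(lambda f: f[1] < 2500, given=is_aa)

# (e) Joint + conditional: cancelled and shorter than 2500 miles, among AA flights.
p_cancel_lt2500_given_aa = prob(lambda f: f[2] == 1 and f[1] < 2500, given=is_aa)

print(p_wn, p_wn_or_aa, p_wn_and_short, p_lt2500_given_aa, p_cancel_lt2500_given_aa)
```

The only difference between the joint and conditional cases is the denominator: joint probabilities divide by all flights, while conditional ones divide by the flights that satisfy the condition.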
(iii) Problem Statement
(a) Suppose arrival delays of flights belonging to “AA” are normally distributed with mean 15 minutes and standard deviation 3 minutes. If “AA” plans to announce a scheme where it will give 50% cash back if its flights are delayed by 20 minutes or more, on what percentage of trips is “AA” expected to lose this money? (Hint: pnorm)
(b) Assume that 65% of flights are diverted due to bad weather through the Weather System. What is the probability that in a random sample of 10 flights, exactly 6 are diverted through the Weather System? (Hint: dbinom)
(c) Do linear regression between the Arrival Delay and Departure Delay of the flights.
(d) Find out the confidence interval of the fitted linear regression line.
(e) Perform a multiple linear regression of the Arrival Delay on the Departure Delay and the Distance travelled by flights.
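Parts (a) to (c) can be worked through with the Python standard library. This is a sketch rather than a reference solution: the hints point to R's `pnorm` and `dbinom`, and `statistics.NormalDist(...).cdf` and the binomial formula with `math.comb` are swapped in here as equivalents. The delay values used for the regression are hypothetical toy numbers, `fit_line` is a helper defined for illustration, and taking ArrDelay as the response is an assumption. Parts (d) and (e) are left out because they are more naturally done with a statistics library.

```python
from math import comb
from statistics import NormalDist, fmean

# (a) P(delay > 20) for delays ~ Normal(mean=15, sd=3).
#     R equivalent: 1 - pnorm(20, mean = 15, sd = 3)
p_cashback = 1 - NormalDist(mu=15, sigma=3).cdf(20)
print(f"(a) {p_cashback:.4f}")  # 0.0478, i.e. about 4.8% of trips

# (b) P(exactly 6 of 10 flights diverted) with success probability 0.65.
#     R equivalent: dbinom(6, size = 10, prob = 0.65)
n, k, p = 10, 6, 0.65
p_six_diverted = comb(n, k) * p**k * (1 - p) ** (n - k)
print(f"(b) {p_six_diverted:.4f}")  # 0.2377


# (c) Simple linear regression by ordinary least squares.
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy delays in minutes, not taken from the dataset.
dep_delay = [0, 10, 20, 30]
arr_delay = [2, 11, 24, 31]
intercept, slope = fit_line(dep_delay, arr_delay)
print(f"(c) ArrDelay = {intercept:.1f} + {slope:.1f} * DepDelay")
```

For (a), the cut-off of 20 minutes is 5/3 standard deviations above the mean, so the answer is the upper-tail area beyond z ≈ 1.67.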