---
title: "Final Project - 36-462: Data Mining"
author: "Sivan Mehta, Mary St. John, and Graceanne Wong"
date: "5/12/2017"
output:
  pdf_document:
    fig_height: 4
    fig_caption: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# import data sets and any libraries we might need
source("./analysis/clean-data.R")
source("./analysis/timeseries.R")
library(ggplot2)
library(dplyr)
library(knitr)
# add features
source("./features/add-features.R")
train <- add.features(train)
vis <- add.features(vis)
data.2016 <- add.features(data.2016)
```
# Introduction
The dataset at hand concerns flights to and from the Pittsburgh International Airport (PIT). Each observation is a single flight either entering or leaving Pittsburgh. While the predictive focus will be on departing flights, we will also look at flights overall to get a general structural overview of the dataset.
The 57 variables provided can be split into a few primary groups. Time Period information concerns the date of the flight. Airline information pertains to the specific airline and carrier of the flight. Origin information concerns origin airport details, and Destination information contains similar details. Departure and Arrival performance information contains metrics on the delays, taxiing, etc. Finally, there are general delay metrics, detailing the breakdown of what contributed to the overall delay, if any.
To predict delays, we will be working with three datasets of flights at the Pittsburgh airport. Our training dataset contains all arrivals and departures from 2015. Our test data contains arrivals and departures up to a random time point in the day; after that time point, the flights are unlabeled and form our guess dataset.
In this report, we will focus on the subset of *departing* flights and predict whether or not each flight will be delayed. We will first take a general unsupervised learning approach, attempting to ascertain any underlying structure in the dataset. Then we will move to supervised analysis, attempting to actually predict whether or not a flight will be delayed.
# EDA and unsupervised analysis
First, we'll examine delays on each day. In order to reduce *some* of the noise, we'll also plot moving averages of the daily delay counts.
```{r, warning = FALSE}
# count flights and delayed departures/arrivals per day
by_date <- group_by(train, FL_DATE)
delays_per_day <- summarize(by_date,
                            N_FLIGHTS = n(),
                            DEP_DELAY_TOT = sum(DEP_DEL15, na.rm = TRUE),
                            ARR_DELAY_TOT = sum(ARR_DEL15, na.rm = TRUE))
delays_per_day$FL_DATE <- as.Date(delays_per_day$FL_DATE)
# 7-day and 30-day moving averages of departure delays
delays_per_day$DEP_AVG_7 <- get_moving_averages(delays_per_day$DEP_DELAY_TOT, 7)
delays_per_day$DEP_AVG_30 <- get_moving_averages(delays_per_day$DEP_DELAY_TOT, 30)
# daily delay counts with the weekly and monthly moving averages overlaid
ggplot(delays_per_day) +
  scale_x_date() +
  geom_line(aes(x = FL_DATE, y = DEP_DELAY_TOT), alpha = .1) +
  geom_line(aes(x = FL_DATE, y = DEP_AVG_7), color = "red", alpha = .5) +
  geom_line(aes(x = FL_DATE, y = DEP_AVG_30), color = "blue", size = 1) +
  labs(x = "Date of Flight", y = "Number of Delays",
       title = "Delays per day, with weekly average in red, monthly average in blue")
```
Here we can see clear seasonal effects in delays when looking at the moving averages. While there is *a lot* of volatility, as seen in the red weekly average, much of it is smoothed out by the blue monthly average. This tells us we can gain some insight from date information. Next, we can try clustering to see if the flights break down into meaningful groups, which could give us some insight into the general types of flights we may see.
```{r, warning = FALSE}
library(protoclust)
# cluster a 10% sample of the visible data
sample.data <- sample_n(vis, round(nrow(vis) * .1))
flights.dist <- dist(sample.data)
tree.com <- hclust(flights.dist, method = "complete")
tree.proto <- protoclust(flights.dist)
par(mfrow = c(1, 2))
plot(tree.com, main = "Complete Linkage")
plot(tree.proto, main = "Minimax Clustering")
# cut each tree into two clusters and treat the 0/1 labels as delay predictions
assignments.com <- cutree(tree.com, 2)
assignments.proto <- protocut(tree.proto, 2)
predictions.com <- assignments.com - 1
right.com <- length(which(predictions.com == sample.data$DEP_DEL15)) / nrow(sample.data)
misclass.com <- 1 - right.com
predictions.proto <- assignments.proto$cl - 1
right.proto <- length(which(predictions.proto == sample.data$DEP_DEL15)) / nrow(sample.data)
misclass.proto <- 1 - right.proto
# base rate of delayed departures in the training data
late <- train[which(train$DEP_DEL15 > 0), ]
on.time <- train[which(train$DEP_DEL15 == 0), ]
base.rate <- nrow(late) / (nrow(late) + nrow(on.time))
```
Generally, these methods do pretty poorly: using the assigned clusters as a classification, at almost any level of the tree, yields misclassification rates of nearly 50%, **much** worse than the base rate of `r round(base.rate*100, 2)`%. The rarity of delayed flights makes hierarchical clustering problematic, since it struggles to isolate small groups of points. We also tried single linkage clustering, since it can accommodate this behavior, but it performed much worse than either complete or minimax clustering.
Because flight delays are relatively rare, we hypothesized that the distribution of flights that were not delayed differs from the distribution of flights that were delayed. We used latent class regression, as implemented in the `poLCA` package, to estimate the probability of class membership, using indicator variables for weather delay and NAS delay as predictors. In this method, we were able to cluster the data into two separate groups and evaluate a density on each group.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
source("./analysis/mixmodels.R")
```
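As a rough illustration, here is a minimal sketch of what the latent class fit in `mixmodels.R` might look like; the 0/1 indicator column names (`WEATHER_DELAY_IND`, `NAS_DELAY_IND`) and the coding are assumptions, not the actual script:
```{r, eval = FALSE, echo = TRUE}
library(poLCA)
# poLCA expects manifest variables coded as positive integers, so shift 0/1 to 1/2
lca.data <- data.frame(weather = train$WEATHER_DELAY_IND + 1,
                       nas     = train$NAS_DELAY_IND + 1)
lca.fit <- poLCA(cbind(weather, nas) ~ 1, data = lca.data, nclass = 2, verbose = FALSE)
class_probs  <- lca.fit$P              # estimated class membership probabilities
cond_weather <- lca.fit$probs$weather  # P(weather delay outcome | class)
cond_NAS     <- lca.fit$probs$nas      # P(NAS delay outcome | class)
```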
From our latent class regression, we found that a flight had a probability of `r round(class_probs[1], 4)` of being in class 1 and a probability of `r round(class_probs[2], 4)` of being in class 2. These clearly correspond to non-delayed flights (class 1) and delayed flights (class 2).
When conditioned on class membership, data in class 1 had almost zero probability of having either type of delay, while data in class 2 had a much less extreme split in probability.
`r kable(cond_weather, digits = 4, caption = "Weather delay class conditional probabilities")`
`r kable(cond_NAS, digits = 4, caption = "NAS delay class conditional probabilities")`
In our evaluation, the overall probability for each data point was calculated by multiplying together the conditional probabilities of each predictor and weighting by the class probabilities:
$$
\sum_{c \in \{1, 2\}} P(\text{class} = c)
\prod_{d \in \{\text{weather delay}, \text{NAS delay}\}} P(d = 1 \mid \text{class} = c)^{X_d} \cdot P(d = 0 \mid \text{class} = c)^{1 - X_d}
$$
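For concreteness, a hypothetical sketch of evaluating this density for a single flight, assuming the `poLCA` convention that each conditional-probability matrix has one row per class and columns for the 0/1 outcomes:
```{r, eval = FALSE, echo = TRUE}
# x_weather and x_nas are the flight's 0/1 delay indicators
flight_prob <- function(x_weather, x_nas) {
  sum(sapply(1:2, function(c) {
    class_probs[c] *
      cond_weather[c, x_weather + 1] *  # P(weather indicator = x | class c)
      cond_NAS[c, x_nas + 1]            # P(NAS indicator = x | class c)
  }))
}
flight_prob(1, 0)  # e.g. a flight with a weather delay but no NAS delay
```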
# Supervised analysis
Before we began our supervised analysis, we created new features. After looking at the raw data available for the flights we guess on, we determined that there was not much information there. We therefore built features based on previous events in the day, mostly focusing on the percentage of previous flights affected by delays that day. For each day, we calculated the percent of flights that were delayed before a given plane was scheduled to leave, and we did the same for weather delays, national air system (NAS) delays, and delays on arriving flights. We also added a variable for the time an outgoing flight was scheduled to leave Pittsburgh, or the time an incoming flight was scheduled to arrive in Pittsburgh.
Because the percent of flights delayed up to a given point does not capture how quickly delays are accumulating at that time, we also created a feature based on the derivative of these ratios as a function of time, as sketched below. In calculating this, we kept the denominator of the ratio fixed and equal to the total number of flights scheduled for that day.
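A minimal sketch of how these running-ratio features might be computed with `dplyr` (the actual implementation lives in `./features/add-features.R`; the new column names and the use of `CRS_DEP_TIME` as the scheduled-time variable are assumptions):
```{r, eval = FALSE, echo = TRUE}
add.features.sketch <- function(flights) {
  flights %>%
    group_by(FL_DATE) %>%
    arrange(CRS_DEP_TIME, .by_group = TRUE) %>%
    mutate(
      n_sched = n(),  # total flights scheduled that day (fixed denominator)
      # share of the day's flights already delayed before this one leaves
      pct_delayed_so_far = lag(cumsum(coalesce(DEP_DEL15, 0)), default = 0) / n_sched,
      # discrete derivative of that ratio with respect to scheduled time
      delay_rate_deriv = (pct_delayed_so_far - lag(pct_delayed_so_far, default = 0)) /
        pmax(CRS_DEP_TIME - lag(CRS_DEP_TIME, default = first(CRS_DEP_TIME)), 1)
    ) %>%
    ungroup()
}
```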
## Mixture Models
We started by making predictions with a mixture model. The mixture model was trained on indicator variables derived from the observed delay causes, which are not available in the guess dataset. When evaluated on our 2015 training data and our 2016 testing data, we had a training error of `r round(lcm_train_err * 100, 2)`% and a testing error of `r round(lcm_vis_err * 100, 2)`%, both of which beat our base rate.
However, when we evaluated on our prediction set, we were unable to generate the indicator variables for weather delay and NAS delay directly, so we had to rely on the data in the test set to generate these features. This resulted in predicting that no flights were delayed.
## Unused, but evaluated models
We tried many other approaches to building a prediction model. We began with a random forest on all predictors and used a variable importance plot to determine the most important ones. We then fed these predictors into mixture models, linear models, complete linkage clustering, prototype clustering, support vector machines, and random forests. These models proved to be at or near the base rate when classifying delayed and non-delayed flights.
```{r, warning = FALSE, message=FALSE}
source("./analysis/model-table.R")
knitr::kable(model.table, caption = "Unused model summary")
```
For each model, we first generated a training error. If it was close to or better than the base rate, we then generated a test error. If the test error was also better than the base rate, we then checked how many flights were guessed to be delayed in the submission dataset.
## Resolution: Random Forests
We chose a random forest because we are working in a high-dimensional space. Random forests perform variable selection by choosing important predictors to split on, and they can also account for interactions between predictors. We used the ratio features we created, along with time and distance variables. As can be seen in the variable importance plot, the most important predictors are the day, the percent of flights delayed so far that day, and Pittsburgh time. The last two predictors are the same for departing flights, so they are correlated; since random forests automatically accommodate interactions, this is not an issue.
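A hypothetical sketch of such a fit with the `randomForest` package (the predictor names other than `DEP_DEL15` are stand-ins for the features described above, not the exact variables used):
```{r, eval = FALSE, echo = TRUE}
library(randomForest)
rf.fit <- randomForest(factor(DEP_DEL15) ~ pct_delayed_so_far + delay_rate_deriv +
                         CRS_DEP_TIME + DISTANCE,
                       data = train, ntree = 500, importance = TRUE)
varImpPlot(rf.fit, main = "Variable importance")
# predict delay status for the visible test flights
rf.pred <- predict(rf.fit, newdata = vis)
```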
The training and test error for the random forest are both slightly greater than the base rate, but it at least makes a prediction that *some* flights are delayed. This model is better than the mixture model because it was not trained on variables that were created from the truth.
# Conclusion
For our final model, a random forest, we guess that 14 flights will be delayed out of the 871 provided in the guessing dataset. We chose the random forest for these guesses because it was the only model that both provided useful predictions close to the base rate and did not guess the *same thing* for every single flight.
If we wanted to improve the dataset provided, there are several approaches we could have pursued. One would have been to expand the dataset beyond Pittsburgh, perhaps picking up on trends at other airports that could have indirectly affected Pittsburgh. We would then have more insight into the underlying causes of flight delays. With the expanded data, we could also more easily track individual planes and look for a relationship between specific planes and delays.
Additionally, we could have expanded the breadth of the dataset by including Twitter data. A previous attempt at predicting delays$^1$ used the volume of negative-sentiment tweets to predict delays at airports. While it did not predict delays perfectly, it is an interesting approach that is worth a look.
$^1$http://ddowey.github.io/cs109-Final-Project/#!index.md