Predicting Delayed Flights. The file
FlightDelays.csv contains information on all
commercial flights departing the
Washington, DC area and arriving at New
York during January 2004. For each flight,
there is information on the departure and
arrival airports, the distance of the route,
the scheduled time and date of the flight,
and so on. The variable that we are trying to
predict is whether or not a flight is delayed.
A delay is defined as an arrival that is at
least 15 minutes later than scheduled.
Data Preprocessing. Transform the day-of-week
variable (DAY_WEEK) into a categorical
variable. Bin the scheduled departure time
into eight bins (in R, use the function cut()). Use
these and all other columns as predictors
(excluding DAY_OF_MONTH). Partition the
data into training and validation sets.
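
The preprocessing steps can be sketched outside R as well. The following is a minimal Python illustration using only the standard library, not the exercise's required solution. It assumes that the scheduled departure time is stored as an HHMM integer in a column named CRS_DEP_TIME, that DAY_WEEK is coded 1-7 with 1 = Monday, and that a 60/40 training/validation split is wanted; none of these are stated in the exercise. The helper names are hypothetical. Unlike R's cut, which closes intervals on the right by default, the bins here are closed on the left.

```python
import random

# Assumption: DAY_WEEK is coded 1-7 with 1 = Monday.
DAY_NAMES = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu",
             5: "Fri", 6: "Sat", 7: "Sun"}


def bin_departure_time(hhmm, n_bins=8):
    """Return a bin index 0..n_bins-1 for a time given as an HHMM integer."""
    hours = hhmm // 100 + (hhmm % 100) / 60.0  # e.g. 1455 -> 14.9167
    width = 24.0 / n_bins                      # 3-hour bins when n_bins=8
    return min(int(hours // width), n_bins - 1)


def preprocess(record):
    """Recode DAY_WEEK as a label, add DEP_BIN, and drop DAY_OF_MONTH."""
    out = dict(record)
    out["DAY_WEEK"] = DAY_NAMES[record["DAY_WEEK"]]
    # CRS_DEP_TIME is an assumed column name for the scheduled departure time.
    out["DEP_BIN"] = bin_departure_time(record["CRS_DEP_TIME"])
    out.pop("DAY_OF_MONTH", None)
    return out


def partition(records, train_fraction=0.6, seed=1):
    """Shuffle a copy of the records and split into (training, validation)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy records made up for illustration; they are not rows from the data file.
flights = [
    {"DAY_WEEK": 1, "DAY_OF_MONTH": 5, "CRS_DEP_TIME": 700},
    {"DAY_WEEK": 3, "DAY_OF_MONTH": 7, "CRS_DEP_TIME": 1455},
    {"DAY_WEEK": 5, "DAY_OF_MONTH": 9, "CRS_DEP_TIME": 2130},
    {"DAY_WEEK": 6, "DAY_OF_MONTH": 10, "CRS_DEP_TIME": 1200},
    {"DAY_WEEK": 7, "DAY_OF_MONTH": 11, "CRS_DEP_TIME": 600},
]

train, valid = partition([preprocess(f) for f in flights])
```

Fixing the seed makes the partition reproducible, so the fitted trees can be compared across runs.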
a. Fit a classification tree to the flight delay
variable using all the relevant predictors. Do
not include DEP_TIME (actual departure
time) in the model because it is unknown at
the time of prediction (unless we are
generating our predictions of delays after
the plane takes off, which is unlikely). Use a
pruned tree with a maximum of 8 levels,
setting cp = 0.001. Express the resulting
tree as a set of rules.
b. If you needed to fly between DCA and
EWR on a Monday at 7:00 AM, would you be
able to use this tree? What other
information would you need? Is it available
in practice? What information is redundant?
c. Fit the same tree as in (a), this time
excluding the Weather predictor. Display
both the pruned and unpruned tree. You will
find that the pruned tree contains a single
terminal node.
i. How is the pruned tree used for
classification? (What is the rule for
classifying?)
ii. To what is this rule equivalent?
iii. Examine the unpruned tree. What are the
top three predictors according to this tree?
iv. Why, technically, does the pruned tree
result in a single node?
v. What is the disadvantage of using the top
levels of the unpruned tree as opposed to
the pruned tree?
vi. Compare this general result to that from
logistic regression in the example in