R Business Problem

General Information for Candidates

This project has 7 tasks numbered 1 through 7. The points for each task are indicated at the beginning of the task.

Each task pertains to the business problem and the related data files and data dictionary. An .Rmd file with some initial data work is also provided. Unless otherwise specified, each task builds upon the work and conclusions from prior tasks. Due to the nature of predictive modeling, work on later tasks may influence responses on earlier tasks.

The responses to each specific task should be written after the task response header in this Word document. Where code, tables, or graphs from your own work are required, they should be copied and pasted into this Word document.

You may use resources such as textbooks and the internet. You may use any analytics software you wish to perform the analysis directed by the tasks. You may not consult with other individuals about the specific business problem, data, and tasks.

Each task will be graded on the quality of your thought process (as documented in your submission), conclusions, and quality of the presentation. The answer should be confined to the prompt as set and written for the audience specified in the prompt. In tasks 1-6, various portions of a technical report are written but these do not comprise an entire report, e.g., a statement of the business problem is not asked for in these tasks. Only write the sections requested.

Business Problem

The following business problem, while using actual data and referring to actual entities, is entirely fictional.

You have recently started a consulting firm specializing in predictive analytics in the rural western state of Idaho, USA. Your firm consists of you and an assistant you are mentoring. After reading about a dispute between the airport authority in Boise, the state capital and largest city with about 225,000 residents, and airlines servicing the city, you decide to offer your services to the head of the airport authority despite having almost no knowledge of the aviation industry.

The dispute centers on a planned increase in airport passenger capacity. Boise, the largest city in a multi-state region, would like to increase its attractiveness for business and pleasure by upgrading its airport, including adding additional passenger gates, but the mountainous location of the airport makes adding runway capacity prohibitively expensive. Two large and dominant freight carriers, FedEx and UPS, are threatening to severely reduce flights to and from Boise, opting to drive freight to and from Salt Lake City, Utah, several hours of driving distance away, because their profit margin for flying to and from Boise is already thin and they believe additional passenger traffic will increase congestion on the runways. The increase in ground time, during which the airplane jet engines are running, will increase their expenses to the point where rerouting will be less costly. If the freight carriers reduce their flights substantially, airport revenue will decrease to a point where adding the passenger gates may not be viable.

The head of the airport authority accepts your call and is willing to speak with a supportive voice. "That the airlines have made this dispute public has been particularly stressful," the head explains, "as their real aim is not to change their usage of the airport but to negotiate lower airport usage fees. Retail freight service has been especially valued in this rural area since the COVID-19 pandemic began in March 2020, and so the airlines figured public pressure would help force a steeper compromise. But I do not believe their story about increased ground time from the additional traffic we are proposing makes any sense---ground time is more about weather and delays at other airports than how much traffic we have here. I just downloaded some public FAA data to try to prove this out, but I don't really have time to do this analysis with press briefings already adding to my day job of running a large airport!"

Sensing an opportunity, you explain how predictive analytics, your specialty, would help verify what the underlying factors behind ground time really are and ask the head to send over the data so you can consider further. They send the Federal Aviation Administration (FAA) data, and it is in worse shape than you had hoped. The data, which covers all U.S. domestic flights from 2016 to 2021, does not include ground time directly, only ramp-to-ramp time and air time. Rather than being provided by flight, the data is aggregated into monthly totals by route, airline, and other flight plan characteristics. You find the data dictionary for what the head sent you to confirm what the data contains and start searching for better or more helpful data, finding a slightly helpful decoder for airport codes along the way. Suddenly, you receive a call from the head of the airport authority.

"Actually, I could really use your help immediately. One of the freight airlines called and demanded a meeting a week from today to finalize negotiations of their fees for the next five years. I cannot stall them any longer and could use any leverage I could get from that predictive analytics thing you were talking about. Can you do analysis on the data I sent you and send me a report in four days?"

You realize that there wouldn't be time to find better data and that you would have to work with the current data to meet this deadline. You explain that you have no experience with the aviation industry and would be relying on data that is not fully fit for the purpose of the analysis, and you ask whether the head would be comfortable with a report given those circumstances. "I can help determine what makes sense once I have the report, but I cannot do the analysis you are talking about. I agree to use the report for the stated purpose recognizing your limitations."

Having no other clients, you also agree to this work and direct your assistant to start working on the data while you draft a consulting agreement for the work. After sending this to the head of the airport authority, you follow up with a phone call to confirm receipt and ask a couple of questions about the data. The head confirms that ground time is the difference between ramp-to-ramp time and air time but then says, "I appreciate you getting in touch with me today and helping with this analysis. I'll send back the signed agreement today, but I need to ask now that you do the best you can with the information you have. BOI are the call letters for our airport. I won't be available for more information about the data or airline industry until after you finish your report. We can discuss the matter further then."

Just as that call ends, your assistant contacts you, sounding less than well. "I'm so sorry...I appear to have caught a nasty illness. I'm sending you the data work I've done so far, but my mind wasn't working very clearly and I thought I better stop. I filtered the data and dealt with the aggregation but didn't get to joining the data. I hope you can manage from here without me for the next few days."

You wish the assistant well and then size up the situation. With less than ideal (or clean) data, you have less than ideal time to analyze and report on a problem where you have less than ideal background knowledge and no one you can talk to for help. However, you feel you've represented your situation fairly and expect that those reading your report will try to be sympathetic to your position.

File List

Six .csv files labeled T_T100D_SEGMENT_US_CARRIER_ONLY_####, where #### is the year: the FAA data provided with flight statistics aggregated by month, flight path, carrier, and other fields.
One .xls file called DataDictionary: a data dictionary including descriptions of which fields are totaled by month and several tabs that translate various FAA codes.
One .csv file called airport: a separate data file with more information on airports.
Two .Rmd files called FlightsPrep and FlightsPrep_python: the assistant's work (in R or Python) to prepare the data before becoming ill.

Task 1 (10 points)

Perform the following data preparation tasks:

Review your assistant's work on the FAA data and modify it to better address the business problem.
Retain the definition of ground time as the difference between ramp-to-ramp time and air time---its average will be the target variable for later modeling.
Retain some calculation that unitizes the aggregate monthly data to representative data per flight.
Develop a new feature to indicate, for each record, how busy the airport was that month.
Modify the filtering and other choices made by the assistant as needed.
Carry out additional data cleaning and validation, including a) filtering observations and b) removing, transforming, and adding fields to improve the modeling.
Join the following tables to the revised FAA data:
The airports data, by IATA code, to add at least airport type, including other fields as desirable.
At least one of the tables in the DataDictionary file.
Remove some observations to reduce the number of unique carriers to between 8 and 20, resulting in a manageable but still informative number of levels to later investigate differences in ground time among unique carriers. Each remaining carrier should appear at least once in both 2016-2020 and 2021.
Explore the target variable and no more than two relationships between it and other types of predictors after your preparation work to prepare for modeling steps.
Further in-depth data exploration typical of a predictive analytics project is not required.

Then, write the technical data preparation section of your report below. Because none of your code or the transformed data itself will be available to the reader, all evidence of the data preparation tasks will be contained in your report. This may include written descriptions as well as charts, however the work may be most effectively conveyed. Be sure that evidence of the joins is included in your write-up.
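As a purely illustrative sketch of the unitizing and busyness steps listed above, the following Python fragment works on made-up records. The field names (`ORIGIN`, `DEST`, `DEPARTURES_PERFORMED`, `RAMP_TO_RAMP`, `AIR_TIME`, `UNIQUE_CARRIER`) are assumptions that should be checked against the data dictionary, and counting departures plus arrivals is only one possible measure of how busy an airport was in a month.

```python
from collections import defaultdict

# Hypothetical monthly records; field names are assumed, not taken from the data dictionary.
records = [
    {"YEAR": 2019, "MONTH": 1, "UNIQUE_CARRIER": "C1", "ORIGIN": "BOI", "DEST": "SLC",
     "DEPARTURES_PERFORMED": 10, "RAMP_TO_RAMP": 800.0, "AIR_TIME": 500.0},
    {"YEAR": 2019, "MONTH": 1, "UNIQUE_CARRIER": "C2", "ORIGIN": "SLC", "DEST": "BOI",
     "DEPARTURES_PERFORMED": 20, "RAMP_TO_RAMP": 1500.0, "AIR_TIME": 1000.0},
    {"YEAR": 2019, "MONTH": 1, "UNIQUE_CARRIER": "C1", "ORIGIN": "BOI", "DEST": "SEA",
     "DEPARTURES_PERFORMED": 0, "RAMP_TO_RAMP": 0.0, "AIR_TIME": 0.0},
]

def prepare(rows):
    # Drop records with no performed departures: a per-flight average is undefined for them.
    kept = [dict(r) for r in rows if r["DEPARTURES_PERFORMED"] > 0]

    # Busyness (assumed measure): departures plus arrivals at each airport in each month.
    traffic = defaultdict(int)
    for r in kept:
        for airport in (r["ORIGIN"], r["DEST"]):
            traffic[(airport, r["YEAR"], r["MONTH"])] += r["DEPARTURES_PERFORMED"]

    for r in kept:
        # Ground time is ramp-to-ramp time minus air time, unitized to a single flight.
        total_ground = r["RAMP_TO_RAMP"] - r["AIR_TIME"]
        r["GROUND_TIME_PER_FLIGHT"] = total_ground / r["DEPARTURES_PERFORMED"]
        r["ORIGIN_TRAFFIC"] = traffic[(r["ORIGIN"], r["YEAR"], r["MONTH"])]
        r["DEST_TRAFFIC"] = traffic[(r["DEST"], r["YEAR"], r["MONTH"])]
    return kept

prepared = prepare(records)
```

If a busyness feature of this kind is used, it would need to be computed before any filtering to specific routes or carriers, since otherwise the monthly totals would understate traffic at each airport.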

Task 1 Response

Task 2 (3 points)

Write a separate section of your technical report discussing ethical concerns on the use of this data for this business problem, considering selection, measurement, and omitted variable biases.

Task 2 Response

Task 3 (6 points)

Using representative ground time as the target variable, fit the following models to perform well on unseen data using unique carrier and other variables you select as predictors:

GLM, where unique carrier and other variables have a fixed effect
GLMM, where at least unique carrier and one other predictor have random effects and other variables have fixed or random effects as appropriate

The GLMM should not remove any predictors from the GLM but may transform them and add new predictors. Each model fitting should try to isolate the effect of unique carrier from related predictors in the data as much as possible. The variable selection for this and future models should use 2016-2020 data as training and 2021 data as validation.

Write a technical report section discussing the impact, all else equal, of unique carrier based on these two fitted and validated models. Be sure to give particular attention to the two freight airlines discussed in the business problem. Include detailed results from your model fitting and variable selection to provide sufficient support for your findings.
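The year-based split described above (2016-2020 for training, 2021 for validation) might be set up along the following lines. This is only a sketch of the split and of a validation RMSE calculation: the rows are invented, and the per-carrier training mean is a deliberately simple stand-in for a fitted model, not a GLM or GLMM.

```python
import math
from collections import defaultdict

# Hypothetical prepared rows: (year, carrier code, average ground minutes per flight).
rows = [
    (2018, "C1", 30.0), (2019, "C1", 34.0), (2020, "C2", 20.0),
    (2019, "C2", 24.0), (2021, "C1", 33.0), (2021, "C2", 20.0),
]

# Time-based split: earlier years for fitting, the final year held out.
train = [r for r in rows if 2016 <= r[0] <= 2020]
valid = [r for r in rows if r[0] == 2021]

# Stand-in "model": mean training ground time per carrier (NOT a GLM or GLMM).
sums = defaultdict(float)
counts = defaultdict(int)
for _, carrier, y in train:
    sums[carrier] += y
    counts[carrier] += 1
carrier_mean = {c: sums[c] / counts[c] for c in sums}

# Carriers seen in validation but never in training cannot be scored by this stand-in.
missing = {c for _, c, _ in valid} - set(carrier_mean)

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

valid_rmse = rmse([(y, carrier_mean[c]) for _, c, y in valid if c in carrier_mean])
```

The `missing` check reflects the Task 1 requirement that each retained carrier appear in both periods; the same validation RMSE function could then be applied to predictions from each candidate model.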

Task 3 Response

Task 4 (5 points)

Using the same model form as the GLM in the previous task, fit a Bayesian model on the same training data (2016-2020), adjusting the parameters of the fitting function as needed to manage runtime while still achieving convergence in the fit.

Write a technical report section on the uncertainty of predicted ground time, relating it to uncertainty around the impacts of individual predictors and model parameters, backing your discussion with evidence from the fitted Bayesian model, with particular focus on unique carrier.
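One way to frame the relationship between predictive uncertainty and parameter uncertainty is the law of total variance applied to the posterior predictive distribution. The notation here is an assumption introduced for illustration: $\tilde{y}$ is a new ground time, $x$ its predictors, $\theta$ the model parameters (including the unique carrier coefficients), and $D$ the training data.

```latex
\operatorname{Var}(\tilde{y} \mid x, D)
  = \underbrace{\mathbb{E}_{\theta \mid D}\!\left[\operatorname{Var}(\tilde{y} \mid x, \theta)\right]}_{\text{variability remaining when parameters are known}}
  + \underbrace{\operatorname{Var}_{\theta \mid D}\!\left(\mathbb{E}[\tilde{y} \mid x, \theta]\right)}_{\text{uncertainty about the parameters}}
```

With posterior draws of $\theta$, the second term can be approximated by the variance across draws of the predicted mean, and the first by the average across draws of the conditional variance.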

Task 4 Response

Task 5 (6 points)

Fit a random forest to the same target variable as the previous models using the same training and test data but independently select the significant variables that improve the predictive error on the test data. Then, use predictions from the fitted GLM from Task 3 and the random forest from this task as inputs to a stacked model, optimizing the form of the stage-1 model using the same test data as previously specified.

Continue the modeling section of your technical report by documenting the above modeling work in Task 5 and showing improvement in the performance of the model compared to the models in Task 3. Also, discuss strengths and weaknesses of the overall modeling process and its impact on addressing the business problem.
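The task leaves the form of the stacked model open. As one simple possibility, assumed here purely for illustration, the two sets of predictions can be combined with a single blending weight chosen by least squares. All numbers below are invented.

```python
# Hypothetical values: actual ground time and each model's predictions on the same rows.
actual = [29.0, 23.0, 27.0, 38.0]
glm_pred = [28.0, 24.0, 26.0, 36.0]
rf_pred = [32.0, 20.0, 26.0, 44.0]

def blend_weight(y, a, b):
    """Least-squares weight w for the blend w*a + (1-w)*b."""
    num = sum((yi - bi) * (ai - bi) for yi, ai, bi in zip(y, a, b))
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # If both models predict identically everywhere, any weight gives the same blend.
    return num / den if den else 0.5

def mse(y, p):
    """Mean squared error between actual and predicted values."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

w = blend_weight(actual, glm_pred, rf_pred)
stacked = [w * a + (1 - w) * b for a, b in zip(glm_pred, rf_pred)]

glm_mse = mse(actual, glm_pred)
rf_mse = mse(actual, rf_pred)
stacked_mse = mse(actual, stacked)
```

Because the weight is chosen on the same rows used to measure error, the stacked error in a sketch like this is optimistic; that limitation is worth weighing when discussing weaknesses of the modeling process.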

Task 5 Response

Task 6 (4 points)

Using the random forest from Task 5 trained on all 2016-2020 data, apply partial dependence plots to explain the # most important predictors of ground time, based on variable importance, for flights departing from or arriving at the Boise airport in 2016-2020. Where plots are difficult to read, turn plots off to get a data frame of underlying values.

Write a section of the technical report to explain the most important predictors of the random forest model using the partial dependence plots. Include a comparison of the partial dependence plot results with the coefficients of the GLM from Task 3 where applicable and discuss the differences, with particular attention on unique carrier. Do not interpret findings in this task---just focus on model explanations.
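For reference, the values behind a partial dependence plot can be computed by hand: force one predictor to each value of interest for every row, predict, and average. The sketch below assumes a generic prediction function; `toy_predict`, the feature names, and the carrier codes are hypothetical stand-ins rather than the fitted random forest.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    result = []
    for value in grid:
        # Copy each row, overriding only the feature being examined.
        modified = [{**row, feature: value} for row in rows]
        preds = [predict(r) for r in modified]
        result.append((value, sum(preds) / len(preds)))
    return result

# Hypothetical stand-in for a fitted model's prediction function.
def toy_predict(row):
    base = 20.0 + row["traffic"] / 100.0
    return base + (5.0 if row["carrier"] == "C2" else 0.0)

rows = [{"carrier": "C2", "traffic": 1000}, {"carrier": "C1", "traffic": 3000}]

# Table of (carrier level, average predicted ground time), as a plot would display.
pd_carrier = partial_dependence(toy_predict, rows, "carrier", ["C1", "C2"])
```

The returned list of pairs plays the role of the data frame of underlying values, which is easier to read than a plot when a predictor such as unique carrier has many levels.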

Task 6 Response