Consider the dataset “D1.2 Credit card defaults.csv” (described in C1.2). This dataset contains information about credit card consumers, in particular, their default behavior. Correspondingly, the key variable in the dataset is “defaultpaymentnextmonth” (call this variable “y”), a dichotomous variable that indicates whether a customer defaulted on his/her debt. There are 23 other variables that can be used to predict this outcome. For simplicity, we will refer to the set containing all these variables as “X”.

Using this data, perform the following tasks:

1. [3 points] Generate a random training/validation index that implements a 70/30 split.
   • Use a random seed of your choice.

2. [7 points] Estimate two logistic specifications that allow you to generate out-of-sample predictions of y. Take the following points into account:
   • You choose the variables X that enter each model specification. These variables X can be continuous or categorical. Make sure continuous and categorical variables are entered appropriately into the models.
   • Specify model 1 as the simpler of the two. This model must include at least 5 explanatory variables.
   • Specify model 2 as the richer/more flexible of the two. Control flexibility through the set of X variables used. Include at least one variable interaction. [An interaction of two variables, x1 and x2, would be x3 = x1*x2.]

3. [5 points] Do any of your models exhibit signs of overfitting? Explain.
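
As a rough illustration of the workflow in tasks 1 to 3, the sketch below runs on simulated data rather than on “D1.2 Credit card defaults.csv”. Every predictor name (limit_bal, age, sex, education, pay_delay, bill_amt) is a hypothetical placeholder, the variable choices are arbitrary, and log loss is only one of several reasonable ways to compare in-sample and out-of-sample fit. It is not a model answer.

```r
# Sketch only: simulated stand-in for the real dataset.
# All predictor names below are hypothetical placeholders.
set.seed(2024)  # any seed works; fixing it makes the split reproducible
n <- 2000
dat <- data.frame(
  limit_bal = round(runif(n, 10, 500)),        # hypothetical credit limit
  age       = sample(21:70, n, replace = TRUE),
  sex       = factor(sample(c("female", "male"), n, replace = TRUE)),
  education = factor(sample(c("grad", "univ", "hs"), n, replace = TRUE)),
  pay_delay = sample(0:6, n, replace = TRUE),  # hypothetical months of delay
  bill_amt  = round(runif(n, 0, 200))          # hypothetical bill amount
)
# Simulated binary outcome standing in for defaultpaymentnextmonth
lin   <- -2 + 0.5 * dat$pay_delay - 0.004 * dat$limit_bal
dat$y <- rbinom(n, 1, plogis(lin))

# Task 1: random 70/30 training/validation index
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

# Task 2: two logistic specifications fitted on the training rows only.
# Categorical variables are stored as factors, so glm() expands them
# into dummy variables; continuous variables enter as numeric columns.
m1 <- glm(y ~ limit_bal + age + sex + education + pay_delay,
          data = train, family = binomial)
# Model 2 is more flexible: one extra variable plus two interactions
m2 <- glm(y ~ limit_bal + age + sex + education + pay_delay + bill_amt +
            pay_delay:limit_bal + age:education,
          data = train, family = binomial)

# Task 3: compare in-sample and out-of-sample fit (lower is better)
log_loss <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  # guard against log(0)
  -mean(data$y * log(p) + (1 - data$y) * log(1 - p))
}
results <- data.frame(
  model = c("model 1", "model 2"),
  train = c(log_loss(m1, train), log_loss(m2, train)),
  valid = c(log_loss(m1, valid), log_loss(m2, valid))
)
print(results)
```

Because model 2 nests model 1, its training log loss cannot be higher. If its validation log loss is nonetheless worse than model 1's, or if the gap between the training and validation columns is much wider for model 2, that pattern is a sign of overfitting.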

SUBMISSION DETAILS

Submit two files (one submission per individual):

1. Slide Deck (MS PowerPoint or PDF)
   ▪ In the slide deck, I expect you to present results in an executive style. You need to clearly describe:
      • what the goal is (question/problem at hand)
      • what you did to achieve the goal (analysis procedures)
      • why you did it (rationales behind key steps)
      • what you obtained (results)
   ▪ Use as many slides as you need.
   ▪ The title page must include your name.
   ▪ If you have worked/discussed with someone else, please also include their name(s) on a separate line.

2. R script file containing the code that you used for your analysis.
   ▪ Include comments in the script to help the TA follow your procedures.
   ▪ The script file should be understood as a companion: you are encouraged to include screenshots of the command lines (with command line #) in your slide deck to demonstrate your key steps. This way, TAs can easily go back and double-check that your answers in the slide deck are well supported.