Chapter 5 Interpreting Regression Results

Chapter description

A case study, on determining an individual’s characteristics that influence its health expenditures, illustrates the regression modeling process from start to finish. Subsequently, the chapter summarizes what we learn from the modeling process, underscoring the importance of variable selection.

5.1 Case study: MEPS health expenditures

5.1.1 Video

Video Overhead Details

Hide

A Details. MEPS health expenditures

This exercise considers data from the Medical Expenditure Panel Survey (MEPS), conducted by the U.S. Agency of Health Research and Quality. MEPS is a probability survey that provides nationally representative estimates of health care use, expenditures, sources of payment, and insurance coverage for the U.S. civilian population. This survey collects detailed information on individuals of each medical care episode by type of services including physician office visits, hospital emergency room visits, hospital outpatient visits, hospital inpatient stays, all other medical provider visits, and use of prescribed medicines. This detailed information allows one to develop models of health care utilization to predict future expenditures. You can learn more about MEPS at http://www.meps.ahrq.gov/mepsweb/.

We consider MEPS data from the panels 7 and 8 of 2003 that consists of 18,735 individuals between ages 18 and 65. From this sample, we took a random sample of 2,000 individuals that appear in the file HealthExpend. From this sample, there are 1,352 that had positive outpatient expenditures.

Our dependent variable is the amount of expenditures for outpatient visits, expendop. For MEPS, outpatient events include hospital outpatient department visits, office-based provider visits and emergency room visits excluding dental services. (Dental services, compared to other types of health care services, are more predictable and occur in a more regular basis.) Hospital stays with the same date of admission and discharge, known as “zero-night stays,” were included in outpatient counts and expenditures. (Payments associated with emergency room visits that immediately preceded an inpatient stay were included in the inpatient expenditures. Prescribed medicines that can be linked to hospital admissions were included in inpatient expenditures, not in outpatient utilization.)

Hide

B Details. Overhead MEPS health expenditures

Data from the Medical Expenditure Panel Survey (MEPS), conducted by the U.S. Agency of Health Research and Quality (AHRQ).

  • A probability survey that provides nationally representative estimates of health care use, expenditures, sources of payment, and insurance coverage for the U.S. civilian population.
  • Collects detailed information on individuals of each medical care episode by type of services including
    • physician office visits,
    • hospital emergency room visits,
    • hospital outpatient visits,
    • hospital inpatient stays,
    • all other medical provider visits, and
    • use of prescribed medicines.
  • This detailed information allows one to develop models of health care utilization to predict future expenditures.
  • We consider MEPS data from the first panel of 2003 and take a random sample of n = 2, 000 individuals between ages 18 and 65.

Hide

C Details. Outcome variable

Our dependent variable is expenditures for outpatient admissions.

  • For MEPS, inpatient admissions include persons who were admitted to a hospital and stayed overnight.
  • In contrast, outpatient events include hospital outpatient department visits, office-based provider visits and emergency room visits excluding dental services.
    • Hospital stays with the same date of admission and discharge, known as “zero-night stays,” were included in outpatient counts and expenditures.
    • Payments associated with emergency room visits that immediately preceded an inpatient stay were included in the inpatient expenditures.
    • Prescribed medicines that can be linked to hospital admissions were included in inpatient expenditures, not in outpatient utilization.

Hide

D Details. Explanatory variables

9 variables in the database. Here 13 most relevant.

\[ {\small \begin{array}{ll} expendop & \text{Amounts of expenditures for outpatient visits} \\ gender & \text{Indicate gender of patient (=1 if female, =0 if male)} \\ age & \text{Age in years between 18 and 65 }\\ race & \text{Race of patient described as Asian, Black, Native, White and other} \\ region & \text{Region of patient described as WEST, NORTHEAST, MIDWEST and SOUTH} \\ educ & \text{Level of education received described by words (LHIGHSC, HIGHSCH and COLLEGE)} \\ phstat & \text{Self-rated physical health status described as EXCE, VGOO, GOOD, FAIR and POOR} \\ mpoor & \text{Self-rated mental health (=1 if poor or fair, =0 if good to excellent mental health)} \\ anylimit & \text{Any activity limitation (=1 if any functional/activity limitation, =0 if otherwise)} \\ income & \text{Income compared to poverty line described as POOR, NPOOR, LINCOME, MINCOME and HINCOME} \\ insure & \text{Insurance coverage (=1 if covered by public/private health insurance in any month of 1996, =0 otherwise)} \\ usc & \text{1 if dissatisfied with one's usual source of care} \\ unemploy & \text{Employment status of patients} \\ managedcare & \text{1 if enrolled in an HMO or gatekeeper plan} \\ \end{array}} \]

Hide

E Details. Case study outline

The next series of exercises leads you through an analysis of the steps for understanding a complex data set. Because of the complexity of the data, in each step only a sample of procedures will be executed.

The outline consists of:

  • Summary statistics
  • Splitting the data into training and testing portions with initial model fits
  • Selecting variables to be included in the model

5.1.2 Exercise. Summarizing data

Assignment Text

With a complex dataset, you will probably want to take a look at the structure of the data. You are already familiar with taking a [summary()] of a dataframe which provides summary statistics for many variables. You will see that several variables in this dataframe are categorical, or factor, variables. We can use the table() function to summarize them.

After getting a sense of the distributions of explanatory variables, we want to take a deeper dive into the distribution of the outcome variable, expendop. We will do this by comparing the histograms of the variable to that of its logarithmic version.

To examine relationships of the outcome variable visually, we look to scatterplots for continuous variables (such as age) and boxplots for categorical variables (such as phstat).

Instructions

  • Examine the structure of the meps dataframe using the str() function. Also, get a [summary()] of the dataframe.
  • Examine the distribution of the race variable using the table() function.
  • Compare the expenditures distribution to its logarithmic version visually via histograms plotted next to another. par(mfrow = c(1, 2)) is used to organize the plots you create.
  • Examine the distribution of logarithmic expenditures in terms of levels of phstat visually using the boxplot() function.
  • Examine the relationship of age versus logarithmic expenditures using a scatter plot. Superimpose a local fitting line using the lines() and lowess() functions.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiNtZXBzIDwtIHJlYWQuY3N2KFwiQ1NWRGF0YVxcXFxIZWFsdGhNZXBzLmNzdlwiLCBoZWFkZXIgPSBUUlVFKVxubWVwcyA8LSByZWFkLmNzdihcImh0dHBzOi8vYXNzZXRzLmRhdGFjYW1wLmNvbS9wcm9kdWN0aW9uL3JlcG9zaXRvcmllcy8yNjEwL2RhdGFzZXRzLzdiN2RhYjZkMGM1MjhlNGNkMmY4ZDBlMGZjNzgyNGEyNTQ0MjliZjgvSGVhbHRoTWVwcy5jc3ZcIiwgaGVhZGVyID0gVFJVRSkiLCJzYW1wbGUiOiIjIEV4YW1pbmUgdGhlIHN0cnVjdHVyZSBhbmQgZ2V0IGEgc3VtbWFyeSBvZiB0aGUgYG1lcHNgIGRhdGFmcmFtZSBcbnN0cihfX18pXG5zdW1tYXJ5KF9fXylcblxuIyBFeGFtaW5lIHRoZSBkaXN0cmlidXRpb24gb2YgdGhlIGByYWNlYCB2YXJpYWJsZSBcbnRhYmxlKF9fXylcblxuIyBDb21wYXJlIHRoZSBleHBlbmRpdHVyZXMgZGlzdHJpYnV0aW9uIHRvIGl0cyBsb2dhcml0aG1pYyB2ZXJzaW9uIHZpc3VhbGx5XG5wYXIobWZyb3cgPSBjKDEsIDIpKVxuaGlzdChfX18sIG1haW4gPSBcIlwiLCB4bGFiID0gXCJvdXRwYXRpZW50IGV4cGVuZGl0dXJlc1wiKVxuaGlzdChsb2coX19fKSwgbWFpbiA9IFwiXCIsIHhsYWIgPSBcImxvZyBleHBlbmRpdHVyZXNcIilcblxuIyBFeGFtaW5lIHRoZSBkaXN0cmlidXRpb24gb2YgbG9nYXJpdGhtaWMgZXhwZW5kaXR1cmVzIGluIHRlcm1zIG9mIGxldmVscyBvZiBgcGhzdGF0YCBcbnBhcihtZnJvdyA9IGMoMSwgMSkpXG5tZXBzJGxvZ2V4cGVuZCA8LSBsb2cobWVwcyRleHBlbmRvcClcbmJveHBsb3QobG9nZXhwZW5kIH4gX19fLCBkYXRhID0gbWVwcywgbWFpbiA9IFwiYm94cGxvdCBvZiBsb2cgZXhwZW5kXCIpXG5cbiMgRXhhbWluZSB0aGUgcmVsYXRpb25zaGlwIG9mIGFnZSB2ZXJzdXMgbG9nYXJpdGhtaWMgZXhwZW5kaXR1cmVzLiBTdXBlcmltcG9zZSBhIGxvY2FsIGZpdHRpbmcgbGluZS5cbnBsb3QoX19fLF9fXywgeGxhYiA9IFwiYWdlXCIsIHlsYWIgPSBcImxvZyBleHBlbmRcIilcbmxpbmVzKGxvd2VzcyhfX18sIF9fXyksIGNvbD1cInJlZFwiKSIsInNvbHV0aW9uIjoic3RyKG1lcHMpXG5zdW1tYXJ5KG1lcHMpXG50YWJsZShtZXBzJHJhY2UpXG5wYXIobWZyb3cgPSBjKDEsIDIpKVxuaGlzdChtZXBzJGV4cGVuZG9wLCBtYWluID0gXCJcIiwgeGxhYiA9IFwib3V0cGF0aWVudCBleHBlbmRpdHVyZXNcIilcbmhpc3QobG9nKG1lcHMkZXhwZW5kb3ApLCBtYWluID0gXCJcIiwgeGxhYiA9IFwibG9nIGV4cGVuZGl0dXJlc1wiKVxucGFyKG1mcm93ID0gYygxLCAxKSlcbm1lcHMkbG9nZXhwZW5kIDwtIGxvZyhtZXBzJGV4cGVuZG9wKVxuYm94cGxvdChsb2dleHBlbmQgfiBwaHN0YXQsIGRhdGEgPSBtZXBzLCBtYWluID0gXCJib3hwbG90IG9mIGxvZyBleHBlbmRcIilcbnBsb3QobWVwcyRhZ2UsbWVwcyRsb2dleHBlbmQsIHhsYWIgPSBcImFnZVwiLCB5bGFiID0gXCJsb2cgZXhwZW5kXCIpXG5saW5lcyhsb3dlc3MobWVwcyRhZ2UsIG1lcHMkbG9nZXhwZW5kKSwgY29sPVwicmVkXCIpIiwic2N0IjoiIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInN0clwiLCBub3RfY2FsbGVkX21zZz1cIlVzZSB0aGUgYHN0cmAgY29tbWFuZCB0byB2aWV3IHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGRhdGEuIFwiKSAlPiUgY2hlY2tfYXJnKC4sIFwib2JqZWN0XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGRhdGEgYG1lcHNgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInN1bW1hcnlcIiwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBzdW1tYXJ5YCBmdW5jdGlvbiB0byB2aWV3IGEgc3VtbWFyeSBvZiB0aGUgZGF0YS4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJvYmplY3RcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gc3BlY2lmeSB0aGF0IHdlIHdvdWxkIGxpa2UgdG8gdmlldyBhIHN1bW1hcnkgb2YgYG1lcHNgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInRhYmxlXCIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgdGFibGVgIGZ1bmN0aW9uIHRvIGZpbmQgdGhlIG51bWJlciBvZiBvYnNlcnZhdGlvbnMgb2YgZWFjaCByYWNlLiBcIikgJT4lIGNoZWNrX3Jlc3VsdCgpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIHRvIGNyZWF0ZSBhIHRhYmxlIGZyb20gYG1lcHNgIHRoYXQgc2VwYXJhdGVzIGJhc2VkIG9uIGByYWNlYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJwYXJcIixpbmRleD0xLG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgcGFyYCBmdW5jdGlvbiB0byBhbHRlciB0aGUgcGxvdHRpbmcgZGV2aWNlIHRvIGhhdmUgMiBzaWRlLWJ5LXNpZGUgZ3JhcGhzLiBcIikgJT4lIGNoZWNrX2FyZyguLCBcIm1mcm93XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiVG8gY3JlYXRlIHR3byBzaWRlLWJ5LXNpZGUgZ3JhcGhzLCBzZXQgYG1mcm93YCBlcXVhbCB0byBgYygyLDEpYC4gXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiaGlzdFwiLGluZGV4PTEsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgaGlzdGAgZnVuY3Rpb24gdG8gY3JlYXRlIGEgaGlzdG9ncmFtIG9mIGBleHBlbmRvcGAuXCIpICU+JSBjaGVja19hcmcoLiwgXCJ4XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3YW50IHRoZSBmaXJzdCBoaXN0b2dyYW0gdG8gYmUgbWFkZSB1c2luZyB0aGUgZGF0YSBmb3VuZCBpbiBgZXhwZW5kb3BgLCB3aGljaCBpcyBhIGNvbHVtbiBpbiBgbWVwc2AuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiaGlzdFwiLGluZGV4PTIgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBoaXN0YCBmdW5jdGlvbiB0byBjcmVhdGUgYSBoaXN0b2dyYW0gb2YgdGhlIGxvZyBvZiBgZXhwZW5kb3BgLlwiKSAlPiUgY2hlY2tfYXJnKC4sIFwieFwiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIk1ha2Ugc3VyZSB0byBzcGVjaWZ5IHRoYXQgd2Ugd2FudCB0aGUgc2Vjb25kIGhpc3RvZ3JhbSB0byBiZSBtYWRlIHVzaW5nIHRoZSBsb2cgb2YgdGhlIGRhdGEgZm91bmQgaW4gYGV4cGVuZG9wYCwgd2hpY2ggaXMgYSBjb2x1bW4gaW4gYG1lcHNgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInBhclwiLGluZGV4PTIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIGBwYXJgIHRvIHJlc2V0IHRoZSBwbG90dGluZyBkZXZpY2UgdG8gaG9sZCBhIHNpbmdsZSBncmFwaC4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJtZnJvd1wiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cInRvIHJlc2V0IHRoZSBwbG90dGluZyBkZXZpY2UsIHNldCBgbWZyb3dgIHRvIGJlIGVxdWFsIHRvIGBjKDEsMSkuIFwiKVxuIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJtZXBzXCIpICU+JSBjaGVja19jb2x1bW4oXCJsb2dleHBlbmRcIiwgY29sX21pc3NpbmdfbXNnPVwiTWFrZSBzdXJlIHRvIGNyZWF0ZSBhIG5ldyBjb2x1bW4gaW4gYG1lcHNgIG5hbWVkIGBsb2dleHBlbmRgLlwiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIlRoZSB2YWx1ZXMgaW4gYGxvZ2V4cGVuZGAgc2hvdWxkIGJlIGVxdWFsIHRvIHRoZSBsb2cgb2YgdGhlIHZhbHVlcyBpbiBgbWVwcyRleHBlbmRvcGAuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiYm94cGxvdFwiLCBub3RfY2FsbGVkX21zZz1cIlVzZSB0aGUgYGJveHBsb3RgIGZ1bmN0aW9uIHRvIGNyZWF0ZSBhIGJveHBsb3Qgc2hvd2luZyBgbG9nZXhwZW5kYCBiYXNlZCBvbiBgcGhzdGF0YC5cIikgJT4lIHtcbiMgICBjaGVja19hcmcoLiwgXCJmb3JtdWxhXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3YW50IGJveHBsb3RzIHRoYXQgc2hvdyBgbG9nZXhwZW5kYCBmb3IgZWFjaCB2YWx1ZSBvZiBgcGhzdGF0YC5cIilcbiMgICBjaGVja19hcmcoLiwgXCJkYXRhXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCBvdXIgZGF0YSBjb21lcyBmcm9tIHRoZSBkYXRhZnJhbWUgYG1lcHNgLlwiKVxuIyB9XG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwicGxvdFwiLCBub3RfY2FsbGVkX21zZz1cIlVzZSBgcGxvdGAgdG8gY3JlYXRlIGEgcGxvdCB0aGF0IHNob3dzIGBsb2dleHBlbmRgIGJhc2VkIG9uIGBhZ2VgLlwiKSAlPiUge1xuIyAgIGNoZWNrX2FyZyguLCBcInhcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgaW5kZXBlbmRlbnQgdmFyaWFibGUgc2hvdWxkIGJlIGBhZ2VgLlwiKVxuIyAgIGNoZWNrX2FyZyguLCBcInlcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgZGVwZW5kZW50IHZhcmlhYmxlIHNob3VsZCBiZSBgbG9nZXhwZW5kYC5cIilcbiMgfVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcImxpbmVzXCIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgbGluZXNgIGZ1bmN0aW9uIHRvIGFkZCBhIGxpbmUgdG8gb3VyIHBsb3QuIFwiKSBcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJsb3dlc3NcIiwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBsb3dlc3NgIGZ1bmN0aW9uIHRvIGFkZCBhIHNtb290aGluZyBlZmZlY3QgdG8gb3VyIGxpbmUuIFwiKSAlPiUge1xuIyAgIGNoZWNrX2FyZyguLCBcInhcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgZmlyc3QgYXJndW1lbnQgaW4gYGxvd2Vzc2Agc2hvdWxkIGJlIG91ciBgeGAgdmFyaWFibGUsIHdoaWNoIGlzIGBhZ2VgLlwiKVxuIyAgIGNoZWNrX2FyZyguLCBcInlcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgc2Vjb25kIGFyZ3VtZW50IGluIGBsb3dlc3NgIHNob3VsZCBiZSBvdXIgYHlgIHZhcmlhYmxlLCB3aGljaCBpcyBgbG9nZXhwZW5kYC5cIilcbiMgfVxuc3VjY2Vzc19tc2coXCJFeGNlbGxlbnQhIFN1bW1hcml6aW5nIGRhdGEsIHdpdGhvdXQgcmVmZXJlbmNlIHRvIGEgbW9kZWwsIGlzIHByb2JhYmx5IHRoZSBtb3N0IHRpbWUtY29uc3VtaW5nIHBhcnQgb2YgYW55IHByZWRpY3RpdmUgbW9kZWxpbmcgZXhlcmNpc2UuIFN1bW1hcnkgc3RhdGlzdGljcyBhcmUgYWxzbyBhIGtleSBwYXJ0IG9mIGFueSByZXBvcnQgYXMgdGhleSBpbGx1c3RyYXRlIGZlYXR1cmVzIG9mIHRoZSBkYXRhIHRoYXQgYXJlIGFjY2Vzc2libGUgdG8gYSBicm9hZCBhdWRpZW5jZS5cIikiLCJoaW50IjoiT25lIG9mIHRoZSBtb3N0IGltcG9ydGFudCBzdGVwcyBpbiBsaW5lYXIgcmVncmVzc2lvbiBpcyB1bmRlcnN0YW5kaW5nIHlvdXIgZGF0YS4gVXNpbmcgYSBjb21iaW5hdGlvbiBvZiA8Y29kZT5zdHIoKTwvY29kZT4gYW5kICA8Y29kZT5zdW1tYXJ5KCk8L2NvZGU+IGxldHMgeW91IHZpZXcgcXVhbnRpdGF0aXZlIHN0YXRpc3RpY3MgYWJvdXQgeW91ciBkYXRhLCB3aGlsZSA8Y29kZT5ib3hwbG90KCk8L2NvZGU+IGFuZCA8Y29kZT5oaXN0KCk8L2NvZGU+IGdpdmUgeW91IGEgbW9yZSB2aXN1YWwgcmVwcmVzZW50YXRpb24gb2YgeW91ciBkYXRhLiJ9

5.1.3 Exercise. Fit a benchmark multiple linear regression model

Assignment Text

As part of the pre-processing for the model fitting, we will split the data into training and test subsamples. For this exercise, we use a 75/25 split although other choices are certainly suitable. Some analysts prefer to do this splitting before looking at the data. Another approach, adopted here, is that the final report typically contains summary statistcs of the entire data set and so it makes sense to do so when examining summary statistics.

We start by fitting a benchmark model. It is common to use all available explanatory variables with the outcome on the original scale and so we use this as our benchmark model. This exercise shows that when you plot() a fitted linear regression model in R, the result provides four graphs that you have seen before. These can be useful for identifying an appropriate model.

Instructions

  • Randomly split the data into a training and a testing data sets. Use 75% for the training, 25% for the testing.
  • Fit a full model using expendop as the outcome and all explanatory variables. Summarize the results of this model fitting.
  • You can plot() the fitted model to view several diagnostic plots. These plots provide evidence that expenditures may not be the best scale for linear regression.
  • Fit a full model using logexpend as the outcome and all explanatory variables and summarize the fit. Use the plot() function for evidence that this variable is more suited for linear regression methods than expenditures on the original scale.

Hint. A plot of a regression object such as plot(mlr) provides four diagnostic plots. These can be organized as a 2 by 2 array using par(mfrow = c(2, 2)).

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiNtZXBzIDwtIHJlYWQuY3N2KFwiQ1NWRGF0YVxcXFxIZWFsdGhNZXBzLmNzdlwiLCBoZWFkZXIgPSBUUlVFKVxubWVwcyA8LSByZWFkLmNzdihcImh0dHBzOi8vYXNzZXRzLmRhdGFjYW1wLmNvbS9wcm9kdWN0aW9uL3JlcG9zaXRvcmllcy8yNjEwL2RhdGFzZXRzLzdiN2RhYjZkMGM1MjhlNGNkMmY4ZDBlMGZjNzgyNGEyNTQ0MjliZjgvSGVhbHRoTWVwcy5jc3ZcIiwgaGVhZGVyID0gVFJVRSlcbm1lcHMkbG9nZXhwZW5kIDwtIGxvZyhtZXBzJGV4cGVuZG9wKSIsInNhbXBsZSI6IiMgUmFuZG9tbHkgc3BsaXQgdGhlIGRhdGEgaW50byBhIHRyYWluaW5nIGFuZCBhIHRlc3RpbmcgZGF0YSBzZXRzLiBVc2UgNzVcXCUgZm9yIHRoZSB0cmFpbmluZywgMjVcXCUgZm9yIHRoZSB0ZXN0aW5nLlxubiA8LSBucm93KG1lcHMpXG5zZXQuc2VlZCgxMjM0NylcbnNodWZmbGVkX21lcHMgPC0gbWVwc1tzYW1wbGUobiksIF1cbnRyYWluX2luZGljZXMgPC0gMTpyb3VuZCgwLjc1ICogbilcbnRyYWluX21lcHMgICAgPC0gc2h1ZmZsZWRfbWVwc1t0cmFpbl9pbmRpY2VzLCBdXG50ZXN0X2luZGljZXMgIDwtIChyb3VuZCgwLjI1ICogbikgKyAxKTpuXG50ZXN0X21lcHMgICAgIDwtIHNodWZmbGVkX21lcHNbdGVzdF9pbmRpY2VzLCBdXG5cbiMgRml0IGEgZnVsbCBtb2RlbCB1c2luZyBgZXhwZW5kb3BgIGFzIHRoZSBvdXRjb21lIGFuZCBhbGwgZXhwbGFuYXRvcnkgdmFyaWFibGVzLiBTdW1tYXJpemUgdGhlIHJlc3VsdHMgb2YgdGhpcyBtb2RlbCBmaXR0aW5nLlxubWVwc19tbHIxIDwtIGxtKF9fXyB+IGdlbmRlciArIGFnZSArIHJhY2UgKyByZWdpb24gKyBlZHVjICsgcGhzdGF0ICsgbXBvb3IgKyBhbnlsaW1pdCArIGluY29tZSArIGluc3VyZSArIHVzYyArIHVuZW1wbG95ICsgbWFuYWdlZGNhcmUsIGRhdGEgPSBfX18pXG5zdW1tYXJ5KG1lcHNfbWxyMSlcblxuIyBQcm92aWRlIGRpYWdub3N0aWMgcGxvdHMgb2YgdGhlIGZpdHRlZCBtb2RlbC4gXG5wYXIobWZyb3cgPSBjKDIsIDIpKVxucGxvdChfX18pXG5cbiMgRml0IGEgZnVsbCBtb2RlbCB1c2luZyBgbG9nZXhwZW5kYCBhcyB0aGUgb3V0Y29tZSBhbmQgYWxsIGV4cGxhbmF0b3J5IHZhcmlhYmxlcy4gU3VtbWFyaXplIHRoZSBmaXQgYW5kIGV4YW1pbmUgZGlhZ25vc3RpYyBwbG90cyBvZiB0aGUgZml0dGVkIG1vZGVsLiBcbm1lcHNfbWxyMiA8LSBsbShfX18gfiBnZW5kZXIgKyBhZ2UgKyByYWNlICsgcmVnaW9uICsgZWR1YyArIHBoc3RhdCArIG1wb29yICsgYW55bGltaXQgKyBpbmNvbWUgKyBpbnN1cmUgKyB1c2MgKyB1bmVtcGxveSArIG1hbmFnZWRjYXJlLCBkYXRhID0gX19fKVxuc3VtbWFyeShtZXBzX21scjIpXG5wbG90KG1lcHNfbWxyMikiLCJzb2x1dGlvbiI6IiMgU3BsaXQgdGhlIHNhbXBsZSBpbnRvIGEgYHRyYWluaW5nYCBhbmQgYHRlc3RgIGRhdGFcbm4gPC0gbnJvdyhtZXBzKVxuc2V0LnNlZWQoMTIzNDcpXG5zaHVmZmxlZF9tZXBzIDwtIG1lcHNbc2FtcGxlKG4pLCBdXG50cmFpbl9pbmRpY2VzIDwtIDE6cm91bmQoMC43NSAqIG4pXG50cmFpbl9tZXBzICAgIDwtIHNodWZmbGVkX21lcHNbdHJhaW5faW5kaWNlcywgXVxudGVzdF9pbmRpY2VzICA8LSAocm91bmQoMC4yNSAqIG4pICsgMSk6blxudGVzdF9tZXBzICAgICA8LSBzaHVmZmxlZF9tZXBzW3Rlc3RfaW5kaWNlcywgXVxuXG5tZXBzX21scjEgPC0gbG0oZXhwZW5kb3AgfiBnZW5kZXIgKyBhZ2UgKyByYWNlICsgcmVnaW9uICsgZWR1YyArIHBoc3RhdCArIG1wb29yICsgYW55bGltaXQgKyBpbmNvbWUgKyBpbnN1cmUgKyB1c2MgKyB1bmVtcGxveSArIG1hbmFnZWRjYXJlLCBkYXRhID0gdHJhaW5fbWVwcylcbnN1bW1hcnkobWVwc19tbHIxKVxucGFyKG1mcm93ID0gYygyLCAyKSlcbnBsb3QoeD1tZXBzX21scjEpXG5cbm1lcHNfbWxyMiA8LSBsbShsb2dleHBlbmQgfiBnZW5kZXIgKyBhZ2UgKyByYWNlICsgcmVnaW9uICsgZWR1YyArIHBoc3RhdCArIG1wb29yICsgYW55bGltaXQgKyBpbmNvbWUgKyBpbnN1cmUgKyB1c2MgKyB1bmVtcGxveSArIG1hbmFnZWRjYXJlLCBkYXRhID0gdHJhaW5fbWVwcylcbnN1bW1hcnkobWVwc19tbHIyKVxucGxvdChtZXBzX21scjIpIiwic2N0IjoiIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJtZXBzX21scjFcIiwgdW5kZWZpbmVkX21zZz1cIlVzZSBgbG1gIHRvIGNyZWF0ZSBhIGxpbmVhciBtb2RlbCBuYW1lZCBgbWVwc19tbHIxYFwiLikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gY3JlYXRlIHRoaXMgbW9kZWwgc28gdGhhdCBgZXhwZW5kb3BgIGlzIGJhc2VkIG9uIGBnZW5kZXJgLCBgYWdlYCwgYHJhY2VgLCBgcmVnaW9uYCwgYGVkdWNgLCBgcGhzdGF0YCwgYG1wb29yYCwgYGFueWxpbWl0YCwgYGluY29tZWAsIGBpbnN1cmVgLCBgdXNjYCwgYHVuZW1wbG95YCwgYW5kIGBtYW5hZ2VkY2FyZWAgZnJvbSB0aGUgZGF0YWZyYW1lIGBtZXBzYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJzdW1tYXJ5XCIsaW5kZXg9MSxub3RfY2FsbGVkX21zZz1cIlVzZSB0aGUgYHN1bW1hcnlgIGZ1bmN0aW9uIHRvIHZpZXcgYSBzdW1tYXJ5IG9mIG91ciBsaW5lYXIgbW9kZWwuIFwiKSAlPiUgY2hlY2tfYXJnKC4sIFwib2JqZWN0XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIGEgc3VtbWFyeSBvZiBgbWVwc19tbHIxYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJwYXJcIiwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBwYXJgIGZ1bmN0aW9uIHRvIGFsdGVyIHRoZSBwbG90dGluZyBkZXZpY2UgdG8gc2hvdyBhIDIgYnkgMiBncmlkIG9mIGdyYXBocy4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJtZnJvd1wiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIlRvIGNyZWF0ZSBhIDIgYnkgMiBncmlkIG9mIGdyYXBocywgc2V0IGBtZnJvd2AgZXF1YWwgdG8gYGMoMiwyKWAuIFwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInBsb3RcIixpbmRleD0xLCBub3RfY2FsbGVkX21zZz1cIlVzZSBgcGxvdGAgdG8gY3JlYXRlIHRoZSA0IHBsb3RzIGFzc29jaWF0ZWQgd2l0aCBhIGxpbmVhciBtb2RlbC4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJ4XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiSWYgeW91IHNldCB0aGUgcGFyYW1ldGVyIGB4YCB0byBiZSBlcXVhbCB0byBhIGxpbmVhciBtb2RlbCwgaW4gdGhpcyBjYXNlIGBtZXBzX21scjFgLCB5b3Ugd2lsbCBnZXQgNCB1c2VmdWwgZ3JhcGhzIVwiKVxuIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJtZXBzX21scjJcIiwgdW5kZWZpbmVkX21zZz1cIlVzZSBgbG1gIHRvIGNyZWF0ZSBhIGxpbmVhciBtb2RlbCBuYW1lZCBgbWVwc19tbHIyYC5cIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gY3JlYXRlIHRoaXMgbW9kZWwgc28gdGhhdCBgbG9nZXhwZW5kYCBpcyBiYXNlZCBvbiBgZ2VuZGVyYCwgYGFnZWAsIGByYWNlYCwgYHJlZ2lvbmAsIGBlZHVjYCwgYHBoc3RhdGAsIGBtcG9vcmAsIGBhbnlsaW1pdGAsIGBpbmNvbWVgLCBgaW5zdXJlYCwgYHVzY2AsIGB1bmVtcGxveWAsIGFuZCBgbWFuYWdlZGNhcmVgIGZyb20gdGhlIGRhdGFmcmFtZSBgbWVwc2AuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwic3VtbWFyeVwiLGluZGV4PTIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgc3VtbWFyeWAgZnVuY3Rpb24gdG8gdmlldyBhIHN1bW1hcnkgb2Ygb3VyIG5ldyBsaW5lYXIgbW9kZWwsIGBtZXBzX21scjJgLlwiKSAlPiUgY2hlY2tfYXJnKC4sIFwib2JqZWN0XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIGEgc3VtbWFyeSBvZiBgbWVwc19tbHIyYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJwbG90XCIsaW5kZXg9Miwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBwbG90YCBmdW5jdGlvbiB0byBjcmVhdGUgdGhlIDQgcGxvdHMgYXNzb2NpYXRlZCB3aXRoIGEgbGluZWFyIG1vZGVsLiBcIikgJT4lIGNoZWNrX2FyZyguLCBcInhcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJJZiB5b3Ugc2V0IHRoZSBwYXJhbWV0ZXIgYHhgIHRvIGJlIGVxdWFsIHRvIGEgbGluZWFyIG1vZGVsLCBpbiB0aGlzIGNhc2UgYG1lcHNfbWxyMmAsIHlvdSB3aWxsIGdldCA0IHVzZWZ1bCBncmFwaHMhXCIpXG5zdWNjZXNzX21zZyhcIkV4Y2VsbGVudCEgWW91IG1heSBoYXZlIGNvbXBhcmVkIHRoZSBmb3VyIGRpYWdub3N0aWMgZ3JhcGhzIGZyb20gdGhlIE1MUiBtb2RlbCBmaXQgb2YgJ2V4cGVuZCcgdG8gdGhvc2UgY3JlYXRlZCB1c2luZyB0aGUgc2FtZSBwcm9jZWR1cmUgYnV0IHdpdGggbG9nYXJpdGhtaWMgZXhwZW5kaXR1cmVzIGFzIHRoZSBvdXRjb21lLiBUaGlzIHByb3ZpZGVzIGFub3RoZXIgcGllY2Ugb2YgZXZpZGVuY2UgdGhhdCBsb2cgZXhwZW5kaXR1cmVzIGFyZSBtb3JlIHN1aXRhYmxlIGZvciByZWdyZXNzaW9uIG1vZGVsaW5nLiBVc2luZyBsb2dhcml0aG1pYyBvdXRjb21lcyBpcyBhIGNvbW1vbiBmZWF0dXJlIG9mIGFjdHVhcmlhbCBhcHBsaWNhdGlvbnMgYnV0IGNhbiBiZSBkaWZmaWN1bHQgdG8gZGlhZ25vc2UgYW5kIGludGVycHJldCB3aXRob3V0IHByYWN0aWNlLlwiKSIsImhpbnQiOiJBIDxjb2RlPnBsb3Q8L2NvZGU+IG9mIGEgcmVncmVzc2lvbiBvYmplY3Qgc3VjaCBhcyBwbG90KG1scikgcHJvdmlkZXMgZm91ciBkaWFnbm9zdGljIHBsb3RzLiBUaGVzZSBjYW4gYmUgb3JnYW5pemVkIGFzIGEgMiBieSAyIGFycmF5IHVzaW5nIDxjb2RlPnBhcihtZnJvdyA9IGMoMiwgMikpPC9jb2RlPi4ifQ==

5.1.4 Exercise. Variable selection

Assignment Text

Modeling building can be approached using a “ground-up” strategy, where the analyst introduces a variable, examines residuls from a regression fit, and then seeks to understand the relationship between these residuals and other available variables so that these variables might be added to the model.

Another approach is a “top-down” strategy where all available variables are entered into a model and unnecessary variables are pruned from the model. Both approaches are helpful when using data to specify models. This exercise illustrates the latter approach, using the [step()] function to help narrow our search for the best fitting model.

Instructions

From our prior work, the training dataframe train_meps has already been loaded in. A multiple linear regression model fit object meps_mlr2 is available that summarizes a fit of logexpend as the outcome variable using all 13 explanatory variables.

  • Use the step() function function to drop unnecessary variables from the full fitted model summarized in the object meps_mlr2 and summarize this recommended model.
  • As an alternative, use the explanatory variables in the recommended model and add the varibles phstat. Summarize the fit and note that statistical significance of the new variable.
  • You have been reminded by your boss that use of the variable gender is unsuitable for actuarial pricing purposes. As an another alternative, drop gender from the recommended model (still keeping phstat). Note the statistical significance of the variable uscwith this fitted model.
{"language":"r","pre_exercise_code":"#meps <- read.csv(\"CSVData\\\\HealthMeps.csv\", header = TRUE)\nmeps <- read.csv(\"https://assets.datacamp.com/production/repositories/2610/datasets/7b7dab6d0c528e4cd2f8d0e0fc7824a254429bf8/HealthMeps.csv\", header = TRUE)\nmeps$logexpend <- log(meps$expendop)\n# Split the sample into a `training` and `test` data\nn <- nrow(meps)\nset.seed(12347)\nshuffled_meps <- meps[sample(n), ]\ntrain_indices <- 1:round(0.75 * n)\ntrain_meps    <- shuffled_meps[train_indices, ]\ntest_indices  <- (round(0.25 * n) + 1):n\ntest_meps     <- shuffled_meps[test_indices, ]","sample":"meps_mlr2 <- lm(logexpend ~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\n# Use the step() to drop unnecessary variables from the full fitted model summarized in the object `meps_mlr2` and summarize this recommended model.\nmodel_stepwise <- step(___, direction= \"both\", k = log(nrow(train_meps)), trace = 0)\nsummary(model_stepwise)\n\n# As an alternative, use the explanatory variables in the recommended model and add the varibles `mpoor`. Summarize the fit  and note that statistical significance of the new variable.\nmeps_mlr4 <- lm(___ ~ gender + age + phstat + anylimit + insure  + ___, data = train_meps)\nsummary(meps_mlr4)\n\n# You have been reminded by your boss that use of the variable `gender` is unsuitable for actuarial pricing purposes. As an another alternative, drop `gender` from the recommended model (still keeping `mpoor`). Note the statistical significance of the variable `usc`with this fitted model.\nmeps_mlr5 <- lm(logexpend ~ age + phstat + anylimit + insure  + ___, data = train_meps)\nsummary(___)","solution":"meps_mlr2 <- lm(logexpend ~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\n#library(Rcmdr)\n#temp <- stepwise(meps_mlr2, direction = 'backward/forward')\nmodel_stepwise <- step(meps_mlr2, direction= \"both\", k = log(nrow(train_meps)), trace = 0) \nsummary(model_stepwise)\nmeps_mlr3 <- lm(logexpend ~ gender + age + phstat + anylimit + insure , data = train_meps)\nsummary(meps_mlr3)\nmeps_mlr4 <- lm(logexpend ~ gender + age + phstat + anylimit + insure  + mpoor, data = train_meps)\nsummary(meps_mlr4)\nmeps_mlr5 <- lm(logexpend ~ age + phstat + anylimit + insure  + mpoor, data = train_meps)\nsummary(meps_mlr5)\n\n# par(mfrow = c(2, 2))\n# plot(meps_mlr3)\n# \n# meps_mlr4 <- lm(logexpend ~ gender + age + mpoor + anylimit + insure + usc  + phstat, data = train_meps)\n# summary(meps_mlr4)\n# \n# \n# meps_mlr5 <- lm(logexpend ~ age  + anylimit + mpoor + insure  + usc  + phstat, data = train_meps)\n# summary(meps_mlr5)\n# anova(meps_mlr4, meps_mlr5)\n# \n# #boxplot(train_meps$logexpend ~ train_meps$phstat*train_meps$usc)","sct":"# ex() %>% check_object(\"meps_mlr2\", undefined_msg=\"Use `lm` to create a linear model named `meps_mlr2`.\") %>% check_equal(incorrect_msg=\"Make sure to create the linear model so that `logexpend` is based on `gender`, `age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `insure`, `usc`, `unemploy`, and `managedcare` from the data set `train_meps`.\")\n# ex() %>% check_object(\"model_stepwise\", undefined_msg=\"Utilize the `step` function to create a simplified model named `model_stepwise`.\") %>% check_equal(incorrect_msg=\"Make sure to run `step` on `meps_mlr2` in both directions, and with k set to be the BIC estimate. \")\n# ex() %>% check_function(\"summary\",index=1, not_called_msg=\"Use `summary` to create and view a summary of `model_stepwise`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `model_stepwise`.\")\n# ex() %>% check_object(\"meps_mlr3\", undefined_msg=\"Create a new linear model off a small number of explanatory variables named `meps_mlr3`.\") %>% check_equal(incorrect_msg=\"When you create your linear model, make sure to model `logexpend` based on `gender`, `age`, `phstat`, `anylimit`, and `insure` from the dataframe `train_meps`.\")\n# ex() %>% check_function(\"summary\",index=2, not_called_msg=\"Use the `summary` function to create and view a summary of `meps_mlr3`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `meps_mlr3`.\")\n# ex() %>% check_object(\"meps_mlr4\", undefined_msg=\"Create a new linear model off a small number of explanatory variables named `meps_mlr4`.\") %>% check_equal(incorrect_msg=\"When you create your linear model, make sure to model `logexpend` based on `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor` from the dataframe `train_meps`.\")\n# ex() %>% check_function(\"summary\",index=3, not_called_msg=\"Use the `summary` function to create and view a summary of `meps_mlr4`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `meps_mlr4`.\")\n# ex() %>% check_object(\"meps_mlr5\", undefined_msg=\"Create a new linear model off a small number of explanatory variables named `meps_mlr5`.\") %>% check_equal(incorrect_msg=\"When you create your linear model, make sure to model `logexpend` based on `age`, `phstat`, `anylimit`, and `insure` from the dataframe `train_meps`.\")\n# ex() %>% check_function(\"summary\",index=4, not_called_msg=\"Use the `summary` function to create and view a summary of `meps_mlr5`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `meps_mlr5`.\")\nsuccess_msg(\"Excellent! Sometimes variables may have good predictive power but are unacceptable for policy purposes - in insurance, ethnicity and sometimes sex are good examples. This implies that model interpretation can be just as important as the ability to predict.\")","hint":"Starting with a model that consists of all your explanatory variables and using <code>step()</code> is a good way to create a model that fits the data well. On the other hand, this is simply statistics, which is unable to capture &ldquo;real&rdquo; explanations, or follow regulatory rules."}

5.1.5 Exercise. Model comparisons using cross-validation

Assignment Text

To compare alternative models, you decide to utilize cross-validation. For this exercise, you split the training sample into six subsamples of approximately equal size.

In the sample code, the cross-validation procedure has been summarized into a function that you can call. The input to the function is a list of variables that you select as your model explanatory variables. With this function, you can readily test several candidate models.

Instructions

  • Run the cross validation (crossvalfct) function using the explanatory variables suggested by the stepwise function.
  • Run the function again but adding the mpoor variable
  • Run the function again but omitting the gender variable

Note which model is suggested by the cross validation function.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiNtZXBzIDwtIHJlYWQuY3N2KFwiQ1NWRGF0YVxcXFxIZWFsdGhNZXBzLmNzdlwiLCBoZWFkZXIgPSBUUlVFKVxubWVwcyA8LSByZWFkLmNzdihcImh0dHBzOi8vYXNzZXRzLmRhdGFjYW1wLmNvbS9wcm9kdWN0aW9uL3JlcG9zaXRvcmllcy8yNjEwL2RhdGFzZXRzLzdiN2RhYjZkMGM1MjhlNGNkMmY4ZDBlMGZjNzgyNGEyNTQ0MjliZjgvSGVhbHRoTWVwcy5jc3ZcIiwgaGVhZGVyID0gVFJVRSlcbm1lcHMkbG9nZXhwZW5kIDwtIGxvZyhtZXBzJGV4cGVuZG9wKVxuIyBTcGxpdCB0aGUgc2FtcGxlIGludG8gYSBgdHJhaW5pbmdgIGFuZCBgdGVzdGAgZGF0YVxubiA8LSBucm93KG1lcHMpXG5zZXQuc2VlZCgxMjM0NylcbnNodWZmbGVkX21lcHMgPC0gbWVwc1tzYW1wbGUobiksIF1cbnRyYWluX2luZGljZXMgPC0gMTpyb3VuZCgwLjc1ICogbilcbnRyYWluX21lcHMgICAgPC0gc2h1ZmZsZWRfbWVwc1t0cmFpbl9pbmRpY2VzLCBdXG50ZXN0X2luZGljZXMgIDwtIChyb3VuZCgwLjI1ICogbikgKyAxKTpuXG50ZXN0X21lcHMgICAgIDwtIHNodWZmbGVkX21lcHNbdGVzdF9pbmRpY2VzLCBdXG5cbiMjIENyb3NzIC0gVmFsaWRhdGlvblxuXG5jcm9zc3ZhbGZjdCA8LSBmdW5jdGlvbihleHBsdmFycyl7XG4gIGN2ZGF0YSAgIDwtIHRyYWluX21lcHNbLCBjKFwibG9nZXhwZW5kXCIsIGV4cGx2YXJzKV1cbiAgY3Jvc3N2YWwgPC0gMFxuICBmb3IgKGkgaW4gMTo2KSB7XG4gICAgaW5kaWNlcyA8LSAoKChpLTEpICogcm91bmQoKDEvNikqbnJvdyhjdmRhdGEpKSkgKyAxKTooKGkqcm91bmQoKDEvNikgKiBucm93KGN2ZGF0YSkpKSlcbiAgICAjIEV4Y2x1ZGUgdGhlbSBmcm9tIHRoZSB0cmFpbiBzZXRcbiAgICB0cmFpbl9tbHIgPC0gbG0obG9nZXhwZW5kIH4gLiwgZGF0YSA9IGN2ZGF0YVstaW5kaWNlcyxdKVxuICAgICMgSW5jbHVkZSB0aGVtIGluIHRoZSB0ZXN0IHNldFxuICAgIHRlc3QgIDwtIGRhdGEuZnJhbWUoY3ZkYXRhW2luZGljZXMsIGV4cGx2YXJzXSlcbiAgICBuYW1lcyh0ZXN0KSAgPC0gZXhwbHZhcnNcbiAgICBwcmVkaWN0X3Rlc3QgPC0gZXhwKHByZWRpY3QodHJhaW5fbWxyLCB0ZXN0KSlcbiAgICAjIENvbXBhcmUgcHJlZGljdGVkIHRvIGhlbGQtb3V0IGFuZCBzdW1tYXJpemVcbiAgICBwcmVkaWN0X2VyciAgPC0gZXhwKGN2ZGF0YVtpbmRpY2VzLCBcImxvZ2V4cGVuZFwiXSkgLSBwcmVkaWN0X3Rlc3RcbiAgICBjcm9zc3ZhbCA8LSBjcm9zc3ZhbCArIHN1bShhYnMocHJlZGljdF9lcnIpKVxuICB9XG4gIGNyb3NzdmFsLzEwMDAwMDBcbn0iLCJzYW1wbGUiOiIjIFJ1biB0aGUgY3Jvc3MgdmFsaWRhdGlvbiAoYGNyb3NzdmFsZmN0YCkgZnVuY3Rpb24gdXNpbmcgdGhlIGV4cGxhbmF0b3J5IHZhcmlhYmxlcyBzdWdnZXN0ZWQgYnkgdGhlIHN0ZXB3aXNlIGZ1bmN0aW9uLlxuZXhwbHZhcnMuMSA8LSBjKFwiZ2VuZGVyXCIsIFwiYWdlXCIsIFwicGhzdGF0XCIsIFwiYW55bGltaXRcIiwgXCJpbnN1cmVcIilcbmNyb3NzdmFsZmN0KGV4cGx2YXJzKVxuXG4jIFJ1biB0aGUgZnVuY3Rpb24gYWdhaW4gYnV0IGFkZGluZyB0aGUgYG1wb29yYCB2YXJpYWJsZVxuZXhwbHZhcnMuMiA8LSBjKF9fXylcbmNyb3NzdmFsZmN0KGV4cGx2YXJzLjIpXG5cbiMgUnVuIHRoZSBmdW5jdGlvbiBhZ2FpbiBidXQgb21pdHRpbmcgdGhlIGBnZW5kZXJgIHZhcmlhYmxlXG5leHBsdmFycy4zIDwtIGMoIF9fXylcbmNyb3NzdmFsZmN0KGV4cGx2YXJzLjMpIiwic29sdXRpb24iOiJleHBsdmFycy4xIDwtIGMoXCJnZW5kZXJcIiwgXCJhZ2VcIiwgXCJwaHN0YXRcIiwgXCJhbnlsaW1pdFwiLCBcImluc3VyZVwiKVxuY3Jvc3N2YWxmY3QoZXhwbHZhcnMuMSlcbmV4cGx2YXJzLjIgPC0gYyhcImdlbmRlclwiLCBcImFnZVwiLCBcInBoc3RhdFwiLCBcImFueWxpbWl0XCIsIFwiaW5zdXJlXCIsIFwibXBvb3JcIilcbmNyb3NzdmFsZmN0KGV4cGx2YXJzLjIpXG5leHBsdmFycy4zIDwtIGMoIFwiYWdlXCIsIFwicGhzdGF0XCIsIFwiYW55bGltaXRcIiwgXCJpbnN1cmVcIiwgXCJtcG9vclwiKVxuY3Jvc3N2YWxmY3QoZXhwbHZhcnMuMykiLCJzY3QiOiIjIGV4KCkgJT4lIGNoZWNrX29iamVjdChcImV4cGx2YXJzLjFcIiwgdW5kZWZpbmVkX21zZz1cIkNyZWF0ZSBhIHZhcmlhYmxlIG5hbWVkIGBleHBsdmFycy4xYCB0aGF0IGNvbnRhaW5zIHRoZSBmb2xsb3dpbmcgYXMgY2hhcmFjdGVyIHN0cmluZ3M6IGBnZW5kZXJgLCBgYWdlYCwgYHBoc3RhdGAsIGBhbnlsaW1pdGAsIGFuZCBgaW5zdXJlYC5cIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gc3BlY2lmeSB0aGF0IHRoZSBmb2xsb3dpbmcgYXJlIGV4cHJlc3NlZCBhcyBjaGFyYWN0ZXIgc3RyaW5ncywgYW5kIG5vdCBhcyBudW1iZXJzOiA6IGBnZW5kZXJgLCBgYWdlYCwgYHBoc3RhdGAsIGBhbnlsaW1pdGAsIGFuZCBgaW5zdXJlYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJjcm9zc3ZhbGZjdFwiLGluZGV4PTEsIG5vdF9jYWxsZWRfbXNnPVwiVXRpbGl6ZSB0aGUgY3VzdG9tIGJ1aWx0IGZ1bmN0aW9uIGBjcm9zc3ZhbGZjdGAgdG8gZmluZCB0aGUgY3Jvc3MtdmFsaWRhdGlvbiBzdGF0aXN0aWMgZm9yIGBleHBsdmFycy4xYC5cIikgJT4lIGNoZWNrX2FyZyguLCBcImV4cGx2YXJzXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3YW50IHRoZSBjcm9zcy12YWxpZGF0aW9uIHN0YXRpc3RpYyBmb3IgYGV4cGx2YXNyLjFgLlwiKVxuIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJleHBsdmFycy4yXCIsIHVuZGVmaW5lZF9tc2c9XCJDcmVhdGUgYSB2YXJpYWJsZSBuYW1lZCBgZXhwbHZhcnMuMmAgdGhhdCBjb250YWlucyB0aGUgZm9sbG93aW5nIGFzIGNoYXJhY3RlciBzdHJpbmdzOiBgZ2VuZGVyYCwgYGFnZWAsIGBwaHN0YXRgLCBgYW55bGltaXRgLCBgaW5zdXJlYCwgYW5kIGBtcG9vcmAuXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB0aGUgZm9sbG93aW5nIGFyZSBleHByZXNzZWQgYXMgY2hhcmFjdGVyIHN0cmluZ3MsIGFuZCBub3QgYXMgbnVtYmVyczogOiBgZ2VuZGVyYCwgYGFnZWAsIGBwaHN0YXRgLCBgYW55bGltaXRgLCBgaW5zdXJlYCwgYW5kIGBtcG9vcmAuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiY3Jvc3N2YWxmY3RcIixpbmRleD0yLCBub3RfY2FsbGVkX21zZz1cIlV0aWxpemUgdGhlIGN1c3RvbSBidWlsdCBmdW5jdGlvbiBgY3Jvc3N2YWxmY3RgIHRvIGZpbmQgdGhlIGNyb3NzLXZhbGlkYXRpb24gc3RhdGlzdGljIGZvciBgZXhwbHZhcnMuMmAuXCIpICU+JSBjaGVja19hcmcoLiwgXCJleHBsdmFyc1wiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIk1ha2Ugc3VyZSB0byBzcGVjaWZ5IHRoYXQgd2Ugd2FudCB0aGUgY3Jvc3MtdmFsaWRhdGlvbiBzdGF0aXN0aWMgZm9yIGBleHBsdmFzci4yYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfb2JqZWN0KFwiZXhwbHZhcnMuM1wiLCB1bmRlZmluZWRfbXNnPVwiQ3JlYXRlIGEgdmFyaWFibGUgbmFtZWQgYGV4cGx2YXJzLjNgIHRoYXQgY29udGFpbnMgdGhlIGZvbGxvd2luZyBhcyBjaGFyYWN0ZXIgc3RyaW5nczogYGFnZWAsIGBwaHN0YXRgLCBgYW55bGltaXRgLCBgaW5zdXJlYCwgYW5kIGBtcG9vcmAuXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB0aGUgZm9sbG93aW5nIGFyZSBleHByZXNzZWQgYXMgY2hhcmFjdGVyIHN0cmluZ3MsIGFuZCBub3QgYXMgbnVtYmVyczogOmBhZ2VgLCBgcGhzdGF0YCwgYGFueWxpbWl0YCwgYGluc3VyZWAsIGFuZCBgbXBvb3JgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcImNyb3NzdmFsZmN0XCIsaW5kZXg9Mywgbm90X2NhbGxlZF9tc2c9XCJVdGlsaXplIHRoZSBjdXN0b20gYnVpbHQgZnVuY3Rpb24gYGNyb3NzdmFsZmN0YCB0byBmaW5kIHRoZSBjcm9zcy12YWxpZGF0aW9uIHN0YXRpc3RpYyBmb3IgYGV4cGx2YXJzLjNgLlwiKSAlPiUgY2hlY2tfYXJnKC4sIFwiZXhwbHZhcnNcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gc3BlY2lmeSB0aGF0IHdlIHdhbnQgdGhlIGNyb3NzLXZhbGlkYXRpb24gc3RhdGlzdGljIGZvciBgZXhwbHZhc3IuM2AuXCIpXG5zdWNjZXNzX21zZyhcIkV4Y2VsbGVudCEgQ3Jvc3MtdmFsaWRhdGlvbiBoYXMgYmVjb21lIGFuIGVzc2VudGlhbCBwaWVjZSBvZiB0aGUgZGF0YSBhbmFseXN0cyB0b29sa2l0LiBHb29kIHRoYXQgeW91IG5vdyBoYXZlIGFkZGl0aW9uYWwgZXhwZXJpZW5jZSB3aXRoIGl0LlwiKSIsImhpbnQiOiJQZXJmb3JtaW5nIGNyb3NzLXZhbGlkYXRpb24gbWVhc3VyZXMgb24geW91ciBtb2RlbCBhbGxvd3MgeW91IHRvIHNlZSBob3cgd2VsbCBpdCBkb2VzIHdoZW4gaXQgY29tZXMgdG8gcHJlZGljdGluZyAmbGRxdW87b3V0IG9mIHNhbXBsZSZyZHF1bzsgZGF0YSBwb2ludHMuIn0=

5.1.6 Exercise. Out of sample validation

Assignment Text

From our prior work, the training train_meps and test test_meps dataframes have already been loaded in. We think our best model is based on logarithmic expenditures as the outcome and the following explanatory variables:

explvars3 <- c("gender", "age", "phstat", "anylimit", "insure", "mpoor")

We will compare this to a benchmark model that is based on expenditures as the outcome and all 13 explanatory variables

explvars4 <- c(explvars3, "race", "income", "region", "educ", "unemploy", "managedcare", "usc")

The comparisons will be based on expenditures in dollars using the held-out validation sample.

Instructions

  • Use the training sample to fit a linear model with logexpend and explanatory variables listed in explvars3
  • Predict expenditures (not logged) for the test data and summarize the fit using the sum of absolute prediction errors.
  • Use the training sample to fit a benchmark linear model with expendop and explanatory variables listed in explvars4
  • Predict expenditures for the test data and summarize the fit for the benchmark model using the sum of absolute prediction errors.
  • Compare the predictions of the models graphically.
{"language":"r","pre_exercise_code":"#meps <- read.csv(\"CSVData\\\\HealthMeps.csv\", header = TRUE)\nmeps <- read.csv(\"https://assets.datacamp.com/production/repositories/2610/datasets/7b7dab6d0c528e4cd2f8d0e0fc7824a254429bf8/HealthMeps.csv\", header = TRUE)\nmeps$logexpend <- log(meps$expendop)\n# Split the sample into a `training` and `test` data\nn <- nrow(meps)\nset.seed(12347)\nshuffled_meps <- meps[sample(n), ]\ntrain_indices <- 1:round(0.75 * n)\ntrain_meps    <- shuffled_meps[train_indices, ]\ntest_indices  <- (round(0.25 * n) + 1):n\ntest_meps     <- shuffled_meps[test_indices, ]\nexplvars3 <- c(\"gender\", \"age\", \"race\", \"mpoor\", \"anylimit\", \"income\", \"insure\", \"usc\")\nexplvars4 <- c(explvars3, \"region\", \"educ\", \"phstat\", \"unemploy\", \"managedcare\")","sample":"# Regress `logexpend` on the explanatory variables listed in `explvars3`\nmeps_mlr3 <- lm(logexpend ~ gender + age + phstat + anylimit  + insure + mpoor, data = train_meps)\n\n# Predict expenditures (not logged) and summarize using the sum of absolute prediction errors.\nexplvars3 <- c(\"gender\", \"age\", \"phstat\", \"anylimit\", \"insure\", \"mpoor\")\npredict_meps3 <- test_meps[,explvars3]\npredict_mlr3  <- exp(predict(meps_mlr3, predict_meps3))\npredict_err_mlr3 <- test_meps$expendop - predict_mlr3\nsape3     <- sum(abs(predict_err_mlr3))/1000\n\n# Regress `expendop` on all 13 explanatory variables\nmeps_mlr4 <- lm(___~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\n\n# Predict expenditures and summarize using the sum of absolute prediction errors.\nexplvars4 <- c(\"gender\",\"age\",\"race\",\"region\",\"educ\",\"phstat\",\"mpoor\",\"anylimit\",\"income\",\"insure\",\"usc\",\"unemploy\",\"managedcare\")\npredict_meps4 <- test_meps[,explvars4]\npredict_mlr4  <- predict(meps_mlr4, predict_meps4)\npredict_err_mlr4 <- test_meps$expendop - predict_mlr4\nsape4     <- sum(abs(predict_err_mlr4))/1000\nsape3;sape4\n\n# Compare the predictions of the models graphically.\npar(mfrow = c(1, 2))\nplot(predict_err_mlr4, predict_err_mlr3, xlab = \"Benchmark Predict Error\", ylab = \"MLR Predict Error\")\nplot(predict_mlr3, test_meps$expendop, xlab = \"MLR Predicts\", ylab = \"Held Out Expends\")","solution":"meps_mlr3 <- lm(logexpend ~ gender + age + phstat + anylimit  + insure + mpoor, data = train_meps)\nexplvars3 <- c(\"gender\", \"age\", \"phstat\", \"anylimit\", \"insure\", \"mpoor\")\npredict_meps3 <- test_meps[,explvars3]\npredict_mlr3  <- exp(predict(meps_mlr3, predict_meps3))\npredict_err_mlr3 <- test_meps$expendop - predict_mlr3\nsape3     <- sum(abs(predict_err_mlr3))/1000\n\nmeps_mlr4 <- lm(expendop ~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\nexplvars4 <- c(\"gender\",\"age\",\"race\",\"region\",\"educ\",\"phstat\",\"mpoor\",\"anylimit\",\"income\",\"insure\",\"usc\",\"unemploy\",\"managedcare\")\npredict_meps4 <- test_meps[,explvars4]\npredict_mlr4  <- predict(meps_mlr4, predict_meps4)\npredict_err_mlr4 <- test_meps$expendop - predict_mlr4\nsape4     <- sum(abs(predict_err_mlr4))/1000\n\nsape3;sape4\n\npar(mfrow = c(1, 2))\nplot(predict_err_mlr4, predict_err_mlr3, xlab = \"Benchmark Predict Error\", ylab = \"MLR Predict Error\")\nplot(predict_mlr3, test_meps$expendop, xlab = \"MLR Predicts\", ylab = \"Held Out Expends\")","sct":"# ex() %>% check_object(\"meps_mlr3\", undefined_msg=\"Use `lm` to create a linear model named `meps_mlr3`.\") %>% check_equal(incorrect_msg=\"Make sure to create your model so that `logexpend` is based on `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor` from the data in `train_meps`.\")\n# ex() %>% check_object(\"explvars3\", undefined_msg=\"Create a variable named `explvars3` that has `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor` as character strings. \") %>% check_equal(incorrect_msg=\"Make sure to have the following as a character string instead of the values contained in each: `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor`.\")\n# ex() %>% check_object(\"predict_meps3\", undefined_msg=\"Make a new dataframe named `predict_meps3` that a subset of the data not used to create the model. \") %>% check_equal(incorrect_msg=\"The subset of data not used is stored in `test_meps`, and the columns we want are now stored in `explvars3`.\")\n# ex() %>% check_function(\"exp\",index=1,not_called_msg=\"Make sure to take the exponential of the predicted values to remove the log effect. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to run `exp` on the values you get from your `predict` call. \")\n# ex() %>% check_function(\"predict\",index=1, not_called_msg=\"Run `predict` on the data you did not use to create the model. \") %>% {\n#   check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that the model we are using to make predictions is `meps_mlr3`.\")\n#   check_arg(., \"newdata\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like predictions for the values found in `predict_meps3`.\")\n# }\n# ex() %>% check_object(\"predict_mlr3\", undefined_msg=\"Make sure to save the exponential of the predicted values to `predict_mlr3`.\") %>% check_equal(incorrect_msg=\"Make sure to take the exponential of the values found using `predict`.\")\n# ex() %>% check_object(\"predict_err_mlr3\", undefined_msg=\"Save the prediction errors to a variable named `predict_err_mlr3`) %>% check_equal(incorrect_msg=\"To find the prediction errors, subtract the predicted values in `predict_mlr3` from the actual values found in `test_meps$expendop`.\")\n# ex() %>% check_function(\"abs\",index=1, undefined_msg=\"Make sure to take the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to take the absolute value of the prediction errors, which can be found in `predict_err_mlr3`.\")\n# ex() %>% check_function(\"sum\",index=1, not_called_msg=\"Take the sum of the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like the sum of the absolute value of the prediction errors. \")\n# ex() %>% check_object(\"sape3\", undefined_msg=\"Make sure to set the sum of the absolute value of the prediction errors divided by 1000 equal to `sape3`.\") %>% check_equal(incorrect_msg=\"Make sure that you have taken the sum of the absolute value of the prediction errors, and divided that number by 1000. \")\n# ex() %>% check_object(\"meps_mlr4\", undefined_msg=\"Use `lm` to create a linear model named `meps_mlr4`.\") %>% check_equal(incorrect_msg=\"Make sure to create your model so that `expendop` is based on `gender`,` age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `usc`, `unemploy`, and `managedcare` from the data in `train_meps`.\")\n# ex() %>% check_object(\"explvars4\", undefined_msg=\"Create a variable named `explvars4` that has `gender`,` age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `usc`, `unemploy`, and `managedcare` as character strings. \") %>% check_equal(incorrect_msg=\"Make sure to have the following as a character string instead of the values contained in each: `gender`,` age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `usc`, `unemploy`, and `managedcare`.\")\n# ex() %>% check_object(\"predict_meps4\", undefined_msg=\"Make a new dataframe named `predict_meps4` that a subset of the data not used to create the model. \") %>% check_equal(incorrect_msg=\"The subset of data not used is stored in `test_meps`, and the columns we want are now stored in `explvars4`.\")\n# ex() %>% check_function(\"exp\",index=2,not_called_msg=\"Make sure to take the exponential of the predicted values to remove the log effect. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to run `exp` on the values you get from your `predict` call. \")\n# ex() %>% check_function(\"predict\",index=2, not_called_msg=\"Run `predict` on the data you did not use to create the model. \") %>% {\n#   check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that the model we are using to make predictions is `meps_mlr4`.\")\n#   check_arg(., \"newdata\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like predictions for the values found in `predict_meps4`.\")\n# }\n# ex() %>% check_object(\"predict_mlr4\", undefined_msg=\"Make sure to save the exponential of the predicted values to `predict_mlr4`.\") %>% check_equal(incorrect_msg=\"Make sure to take the exponential of the values found using `predict`.\")\n# ex() %>% check_object(\"predict_err_mlr4\", undefined_msg=\"Save the prediction errors to a variable named `predict_err_mlr4`) %>% check_equal(incorrect_msg=\"To find the prediction errors, subtract the predicted values in `predict_mlr4` from the actual values found in `test_meps$expendop`.\")\n# ex() %>% check_function(\"abs\",index=2, undefined_msg=\"Make sure to take the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to take the absolute value of the prediction errors, which can be found in `predict_err_mlr4`.\")\n# ex() %>% check_function(\"sum\",index=2, not_called_msg=\"Take the sum of the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like the sum of the absolute value of the prediction errors. \")\n# ex() %>% check_object(\"sape4\", undefined_msg=\"Make sure to set the sum of the absolute value of the prediction errors divided by 1000 equal to `sape4`.\") %>% check_equal(incorrect_msg=\"Make sure that you have taken the sum of the absolute value of the prediction errors, and divided that number by 1000. \")\n# ex() %>% check_function(\"par\",not_called_msg=\"Use `par` to alter the graphing device to have 2 side by side graphs. \") %>% check_arg(., \"mfrow\") %>% check_equal(incorrect_msg=\"To do this, set `mfrow` equal to `c(2,1)`. \")\n# ex() %>% check_function(\"plot\",index=1, not_called_msg=\"Create a plot of the two prediction errors as the first graph. \") %>% {\n#   check_arg(., \"x\") %>% check_equal(incorrect_msg=\"The x axis should have the prediction errors from `predict_err_mlr4`.\")\n#   check_arg(., \"y\") %>% check_equal(incorrect_msg=\"The y axis should have the prediction errors from `predict_err_mlr3`.\")\n# }\n# ex() %>% check_function(\"plot\",index=2, not_called_msg=\"The second graph should plot the predictions from mlr3 against the actual values of expenditures. \") %>% {\n#   check_arg(., \"x\") %>% check_equal(incorrect_msg=\"The X axis should have the predictions, found in `predict_mlr3`.\")\n#   check_arg(., \"y\") %>% check_equal(incorrect_msg=\"The Y axis should have the actual expenditure values, found in `test_meps$expendop`.\")\n# }\nsuccess_msg(\"Excellent! We found that the model of log expenditures outperforms the benchmark that models expenditures, even when the out of sample criterion was in the original 'dollar' units. It is comoforting to know that a search for a good model does well when using different out of sample criteria.\")","hint":"There is nothing in this code to change. simply examine the code, play around with it, and figure out what each individual line does, and how they work together."}

5.2 What the modeling procedure tells us


In this section, you learn how to:

  • Interpret individual effects, based on their substantive and statistical significance
  • Describe other purposes of regression modeling, including regression function for pricing, benchmarking studies, and predicting future observations.

5.2.1 Video

Video Overhead Details

Hide

A Details. Interpreting individual effects

  • Substantive Effect
    • Does a 1 unit change in \(x\) imply an economically meaningful change in \(y\)?
    • Example: Looking at urban and rural claims experience, is there a big enough difference to warrant differentiating prices by location?
  • Statistical Significance
    • We have standards for deciding whether or not a variable is statistically significant.
    • A “statistically significant effect” is the result of a regression coefficient that is large relative to its standard error.
  • Statistical significance is driven by
    • precision of \(s\),
    • collinearity (\(VIF\)) and
    • sample size
  • Causal Effects
    • If we change \(x\), would \(y\) change?

Hide

B Details. Other Interpretations

  • Regression function and pricing
    • The regression function is \(\mathrm{E~}y = \beta_0 + \beta_1 x_1 + \cdots +\beta _k x_k\).
    • Think about expected claims as our baseline price for short-term insurance coverages.
  • Benchmarking studies
    • In studies of CEO’s salaries, who is making a lot (or a little), controlled for industry, years of experience and so forth?
    • In studies of medical claims, who are the high-cost patients?
  • Prediction
    • A new patient comes in with a given set of characteristics, what can I say about his or her future medical claims?


MC Exercise. Which of the following are not important when interpreting the effects of individual variables?
Substantive significance
Statistical significance
The amount of effort that it took to gather the data and do the analysis
Role of causality



MC Exercise. Which of the following is not a potential explanation for the lack of statistical significance of an explanatory variable?
Large variation of the disturbance term
High collinearity, so that the variable may be confounded with other variables
The coefficient of determination, \(R^2\), is not sufficiently large



MC Exercise. Which of the following is not an important purpose of regression modeling?
Pricing of risks such as insurance contracts
Benchmarking studies, to compare an observation to others
Prediction
Keeping a computer occupied with work

5.3 The importance of variable selection


In this section, you learn how to:

  • Describe the bias that can occur when omitting important variables
  • Describe the principle of parsimony and reasons for adopting this approach

5.3.1 Video

Video Overhead Details

<button class=“tablinks” onclick=“openTab(event, ‘Over5.3C’)”C Details. Principle of parsimony

Hide

A Details. The importance of variable selection

  • With too many or too few variables, \(s\) is too large an estimate of \(\sigma\).
    • Prediction intervals are too large
    • Standard errors for the partial slopes are too large
  • With too few or incorrect variables, we produce biased estimates of the slopes \(\beta\). Thus, our predictions are biased and hence inaccurate.

Hide

B Details. Example. Regression using one explanatory variable

  • Too Many Variables
    • The “true” model is \(y_i = \beta_0+ \varepsilon_i\)
    • We mistakenly use \(y_i = \beta_0+ \beta_1 x_i^* + \varepsilon_i\)
    • The prediction at a generic level \(x\) is \(b_0^* + b_1^* x\).
    • It is not to hard to confirm that \(Bias = \mathrm{E} (b_0^* + b_1^* x) - \mathrm{E } y= 0\).
  • Too Few Variables
    • The “true” model is \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\).
    • We mistakenly use \(y_i = \beta_0^* \varepsilon_i\).
    • Under the true model, \(\overline{y} = \beta_0 + \beta_1 \overline{x} + \overline{\varepsilon}\)
    • Thus, the bias is

\[ Bias = \text{E }\bar{y} - \text{E }(\beta_0 + \beta_1 x + \varepsilon) \\ = \text{E }(\beta_0 + \beta_1 \bar{x}+\bar{\varepsilon})-(\beta_0+\beta_1 x)=\beta_1 (\bar{x}-x). \]

There is a persistent, long-term error in omitting the explanatory variable \(x\).

Hide

C Details. Principle of parsimony

  • The principle of parsimony, also known as Occam’s Razor, states that when there are several possible explanations for a phenomenon, use the simplest.
    • A simpler explanation is easier to interpret.
    • Simpler models, also known as ``more parsimonious’’ models, often do well on fitting out-of-sample data
    • Extraneous variables can cause problems of collinearity, leading to difficulty in interpreting individual coefficients.
  • In contrast, in a quote often attributed to Albert Einstein, we should use “the simplest model possible, but no simpler.”
    • Omitting important variables can lead to biased results, a potentially serious error.
    • Including extraneous variables decreases the degrees of freedom and increases the estimate of variability, typically of less concern in actuarial applications.


MC Exercise. Which of the following is true about under- and over-fitting a model?
When we over-fit a model, estimates of regression coefficients are over-biased as is \(s^2\), the estimate of model variance \(\sigma^2\).
When we over-fit a model, estimates of regression coefficients remain unbiased whereas \(s^2\), the estimate of model variance \(\sigma^2\), is over-biased.
When we over-fit a model, estimates of regression coefficients remain under-biased as is \(s^2\), the estimate of model variance \(\sigma^2\).
When we under-fit a model, estimates of regression coefficients remain unbiased whereas \(s^2\), the estimate of model variance \(\sigma^2\), is over-biased.



MC Exercise. Which of the following is not true of Occam’s Razor?
When there are several possible explanations for a phenomenon, use the simplest one.
Simpler models are easier to interpret.
Variables can be statistically significant but practically unimportant.
Simpler models often do better for predicting out-of-sample data