A case study, on determining an individual’s characteristics that influence its health expenditures, illustrates the regression modeling process from start to finish. Subsequently, the chapter summarizes what we learn from the modeling process, underscoring the importance of variable selection.
Case study: MEPS health expenditures
Video
Video Overhead Details
A Details. MEPS health expenditures
B Details. Overhead MEPS health expenditures
C Details. Outcome variable
D Details. Explanatory variables
E Details. Case study outline
Hide
A Details. MEPS health expenditures
This exercise considers data from the Medical Expenditure Panel Survey (MEPS), conducted by the U.S. Agency of Health Research and Quality. MEPS is a probability survey that provides nationally representative estimates of health care use, expenditures, sources of payment, and insurance coverage for the U.S. civilian population. This survey collects detailed information on individuals of each medical care episode by type of services including physician office visits, hospital emergency room visits, hospital outpatient visits, hospital inpatient stays, all other medical provider visits, and use of prescribed medicines. This detailed information allows one to develop models of health care utilization to predict future expenditures. You can learn more about MEPS at http://www.meps.ahrq.gov/mepsweb/ .
We consider MEPS data from the panels 7 and 8 of 2003 that consists of 18,735 individuals between ages 18 and 65. From this sample, we took a random sample of 2,000 individuals that appear in the file HealthExpend
. From this sample, there are 1,352 that had positive outpatient expenditures.
Our dependent variable is the amount of expenditures for outpatient visits, expendop
. For MEPS, outpatient events include hospital outpatient department visits, office-based provider visits and emergency room visits excluding dental services. (Dental services, compared to other types of health care services, are more predictable and occur in a more regular basis.) Hospital stays with the same date of admission and discharge, known as “zero-night stays,” were included in outpatient counts and expenditures. (Payments associated with emergency room visits that immediately preceded an inpatient stay were included in the inpatient expenditures. Prescribed medicines that can be linked to hospital admissions were included in inpatient expenditures, not in outpatient utilization.)
Hide
B Details. Overhead MEPS health expenditures
Data from the Medical Expenditure Panel Survey (MEPS), conducted by the U.S. Agency of Health Research and Quality (AHRQ).
A probability survey that provides nationally representative estimates of health care use, expenditures, sources of payment, and insurance coverage for the U.S. civilian population.
Collects detailed information on individuals of each medical care episode by type of services including
physician office visits,
hospital emergency room visits,
hospital outpatient visits,
hospital inpatient stays,
all other medical provider visits, and
use of prescribed medicines.
This detailed information allows one to develop models of health care utilization to predict future expenditures.
We consider MEPS data from the first panel of 2003 and take a random sample of n = 2, 000 individuals between ages 18 and 65.
Hide
C Details. Outcome variable
Our dependent variable is expenditures for outpatient admissions.
For MEPS, inpatient admissions include persons who were admitted to a hospital and stayed overnight.
In contrast, outpatient events include hospital outpatient department visits, office-based provider visits and emergency room visits excluding dental services.
Hospital stays with the same date of admission and discharge, known as “zero-night stays,” were included in outpatient counts and expenditures.
Payments associated with emergency room visits that immediately preceded an inpatient stay were included in the inpatient expenditures.
Prescribed medicines that can be linked to hospital admissions were included in inpatient expenditures, not in outpatient utilization.
Hide
D Details. Explanatory variables
9 variables in the database. Here 13 most relevant.
\[
{\small \begin{array}{ll}
expendop & \text{Amounts of expenditures for outpatient visits} \\
gender & \text{Indicate gender of patient (=1 if female, =0 if male)} \\
age & \text{Age in years between 18 and 65 }\\
race & \text{Race of patient described as Asian, Black, Native, White and other} \\
region & \text{Region of patient described as WEST, NORTHEAST, MIDWEST and SOUTH} \\
educ & \text{Level of education received described by words (LHIGHSC, HIGHSCH and COLLEGE)} \\
phstat & \text{Self-rated physical health status described as EXCE, VGOO, GOOD, FAIR and POOR} \\
mpoor & \text{Self-rated mental health (=1 if poor or fair, =0 if good to excellent mental health)} \\
anylimit & \text{Any activity limitation (=1 if any functional/activity limitation, =0 if otherwise)} \\
income & \text{Income compared to poverty line described as POOR, NPOOR, LINCOME, MINCOME and HINCOME} \\
insure & \text{Insurance coverage (=1 if covered by public/private health insurance in any month of 1996, =0 otherwise)} \\
usc & \text{1 if dissatisfied with one's usual source of care} \\
unemploy & \text{Employment status of patients} \\
managedcare & \text{1 if enrolled in an HMO or gatekeeper plan} \\
\end{array}}
\]
Hide
E Details. Case study outline
The next series of exercises leads you through an analysis of the steps for understanding a complex data set. Because of the complexity of the data, in each step only a sample of procedures will be executed.
The outline consists of:
Summary statistics
Splitting the data into training and testing portions with initial model fits
Selecting variables to be included in the model
Exercise. Summarizing data
Assignment Text
With a complex dataset, you will probably want to take a look at the structure of the data. You are already familiar with taking a [summary()] of a dataframe which provides summary statistics for many variables. You will see that several variables in this dataframe are categorical, or factor, variables. We can use the table() function to summarize them.
After getting a sense of the distributions of explanatory variables, we want to take a deeper dive into the distribution of the outcome variable, expendop
. We will do this by comparing the histograms of the variable to that of its logarithmic version.
To examine relationships of the outcome variable visually, we look to scatterplots for continuous variables (such as age
) and boxplots for categorical variables (such as phstat
).
Instructions
Examine the structure of the meps
dataframe using the str() function. Also, get a [summary()] of the dataframe.
Examine the distribution of the race
variable using the table() function.
Compare the expenditures distribution to its logarithmic version visually via histograms plotted next to another. par(mfrow = c(1, 2))
is used to organize the plots you create.
Examine the distribution of logarithmic expenditures in terms of levels of phstat
visually using the
boxplot() function.
Examine the relationship of age versus logarithmic expenditures using a scatter plot. Superimpose a local fitting line using the lines() and
lowess() functions.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiNtZXBzIDwtIHJlYWQuY3N2KFwiQ1NWRGF0YVxcXFxIZWFsdGhNZXBzLmNzdlwiLCBoZWFkZXIgPSBUUlVFKVxubWVwcyA8LSByZWFkLmNzdihcImh0dHBzOi8vYXNzZXRzLmRhdGFjYW1wLmNvbS9wcm9kdWN0aW9uL3JlcG9zaXRvcmllcy8yNjEwL2RhdGFzZXRzLzdiN2RhYjZkMGM1MjhlNGNkMmY4ZDBlMGZjNzgyNGEyNTQ0MjliZjgvSGVhbHRoTWVwcy5jc3ZcIiwgaGVhZGVyID0gVFJVRSkiLCJzYW1wbGUiOiIjIEV4YW1pbmUgdGhlIHN0cnVjdHVyZSBhbmQgZ2V0IGEgc3VtbWFyeSBvZiB0aGUgYG1lcHNgIGRhdGFmcmFtZSBcbnN0cihfX18pXG5zdW1tYXJ5KF9fXylcblxuIyBFeGFtaW5lIHRoZSBkaXN0cmlidXRpb24gb2YgdGhlIGByYWNlYCB2YXJpYWJsZSBcbnRhYmxlKF9fXylcblxuIyBDb21wYXJlIHRoZSBleHBlbmRpdHVyZXMgZGlzdHJpYnV0aW9uIHRvIGl0cyBsb2dhcml0aG1pYyB2ZXJzaW9uIHZpc3VhbGx5XG5wYXIobWZyb3cgPSBjKDEsIDIpKVxuaGlzdChfX18sIG1haW4gPSBcIlwiLCB4bGFiID0gXCJvdXRwYXRpZW50IGV4cGVuZGl0dXJlc1wiKVxuaGlzdChsb2coX19fKSwgbWFpbiA9IFwiXCIsIHhsYWIgPSBcImxvZyBleHBlbmRpdHVyZXNcIilcblxuIyBFeGFtaW5lIHRoZSBkaXN0cmlidXRpb24gb2YgbG9nYXJpdGhtaWMgZXhwZW5kaXR1cmVzIGluIHRlcm1zIG9mIGxldmVscyBvZiBgcGhzdGF0YCBcbnBhcihtZnJvdyA9IGMoMSwgMSkpXG5tZXBzJGxvZ2V4cGVuZCA8LSBsb2cobWVwcyRleHBlbmRvcClcbmJveHBsb3QobG9nZXhwZW5kIH4gX19fLCBkYXRhID0gbWVwcywgbWFpbiA9IFwiYm94cGxvdCBvZiBsb2cgZXhwZW5kXCIpXG5cbiMgRXhhbWluZSB0aGUgcmVsYXRpb25zaGlwIG9mIGFnZSB2ZXJzdXMgbG9nYXJpdGhtaWMgZXhwZW5kaXR1cmVzLiBTdXBlcmltcG9zZSBhIGxvY2FsIGZpdHRpbmcgbGluZS5cbnBsb3QoX19fLF9fXywgeGxhYiA9IFwiYWdlXCIsIHlsYWIgPSBcImxvZyBleHBlbmRcIilcbmxpbmVzKGxvd2VzcyhfX18sIF9fXyksIGNvbD1cInJlZFwiKSIsInNvbHV0aW9uIjoic3RyKG1lcHMpXG5zdW1tYXJ5KG1lcHMpXG50YWJsZShtZXBzJHJhY2UpXG5wYXIobWZyb3cgPSBjKDEsIDIpKVxuaGlzdChtZXBzJGV4cGVuZG9wLCBtYWluID0gXCJcIiwgeGxhYiA9IFwib3V0cGF0aWVudCBleHBlbmRpdHVyZXNcIilcbmhpc3QobG9nKG1lcHMkZXhwZW5kb3ApLCBtYWluID0gXCJcIiwgeGxhYiA9IFwibG9nIGV4cGVuZGl0dXJlc1wiKVxucGFyKG1mcm93ID0gYygxLCAxKSlcbm1lcHMkbG9nZXhwZW5kIDwtIGxvZyhtZXBzJGV4cGVuZG9wKVxuYm94cGxvdChsb2dleHBlbmQgfiBwaHN0YXQsIGRhdGEgPSBtZXBzLCBtYWluID0gXCJib3hwbG90IG9mIGxvZyBleHBlbmRcIilcbnBsb3QobWVwcyRhZ2UsbWVwcyRsb2dleHBlbmQsIHhsYWIgPSBcImFnZVwiLCB5bGFiID0gXCJsb2cgZXhwZW5kXCIpXG5saW5lcyhsb3dlc3MobWVwcyRhZ2UsIG1lcHMkbG9nZXhwZW5kKSwgY29sPVwicmVkXCIpIiwic2N0IjoiIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInN0clwiLCBub3RfY2FsbGVkX21zZz1cIlVzZSB0aGUgYHN0cmAgY29tbWFuZCB0byB2aWV3IHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGRhdGEuIFwiKSAlPiUgY2hlY2tfYXJnKC4sIFwib2JqZWN0XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGRhdGEgYG1lcHNgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInN1bW1hcnlcIiwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBzdW1tYXJ5YCBmdW5jdGlvbiB0byB2aWV3IGEgc3VtbWFyeSBvZiB0aGUgZGF0YS4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJvYmplY3RcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gc3BlY2lmeSB0aGF0IHdlIHdvdWxkIGxpa2UgdG8gdmlldyBhIHN1bW1hcnkgb2YgYG1lcHNgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInRhYmxlXCIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgdGFibGVgIGZ1bmN0aW9uIHRvIGZpbmQgdGhlIG51bWJlciBvZiBvYnNlcnZhdGlvbnMgb2YgZWFjaCByYWNlLiBcIikgJT4lIGNoZWNrX3Jlc3VsdCgpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIHRvIGNyZWF0ZSBhIHRhYmxlIGZyb20gYG1lcHNgIHRoYXQgc2VwYXJhdGVzIGJhc2VkIG9uIGByYWNlYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJwYXJcIixpbmRleD0xLG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgcGFyYCBmdW5jdGlvbiB0byBhbHRlciB0aGUgcGxvdHRpbmcgZGV2aWNlIHRvIGhhdmUgMiBzaWRlLWJ5LXNpZGUgZ3JhcGhzLiBcIikgJT4lIGNoZWNrX2FyZyguLCBcIm1mcm93XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiVG8gY3JlYXRlIHR3byBzaWRlLWJ5LXNpZGUgZ3JhcGhzLCBzZXQgYG1mcm93YCBlcXVhbCB0byBgYygyLDEpYC4gXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiaGlzdFwiLGluZGV4PTEsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgaGlzdGAgZnVuY3Rpb24gdG8gY3JlYXRlIGEgaGlzdG9ncmFtIG9mIGBleHBlbmRvcGAuXCIpICU+JSBjaGVja19hcmcoLiwgXCJ4XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3YW50IHRoZSBmaXJzdCBoaXN0b2dyYW0gdG8gYmUgbWFkZSB1c2luZyB0aGUgZGF0YSBmb3VuZCBpbiBgZXhwZW5kb3BgLCB3aGljaCBpcyBhIGNvbHVtbiBpbiBgbWVwc2AuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiaGlzdFwiLGluZGV4PTIgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBoaXN0YCBmdW5jdGlvbiB0byBjcmVhdGUgYSBoaXN0b2dyYW0gb2YgdGhlIGxvZyBvZiBgZXhwZW5kb3BgLlwiKSAlPiUgY2hlY2tfYXJnKC4sIFwieFwiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIk1ha2Ugc3VyZSB0byBzcGVjaWZ5IHRoYXQgd2Ugd2FudCB0aGUgc2Vjb25kIGhpc3RvZ3JhbSB0byBiZSBtYWRlIHVzaW5nIHRoZSBsb2cgb2YgdGhlIGRhdGEgZm91bmQgaW4gYGV4cGVuZG9wYCwgd2hpY2ggaXMgYSBjb2x1bW4gaW4gYG1lcHNgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInBhclwiLGluZGV4PTIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIGBwYXJgIHRvIHJlc2V0IHRoZSBwbG90dGluZyBkZXZpY2UgdG8gaG9sZCBhIHNpbmdsZSBncmFwaC4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJtZnJvd1wiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cInRvIHJlc2V0IHRoZSBwbG90dGluZyBkZXZpY2UsIHNldCBgbWZyb3dgIHRvIGJlIGVxdWFsIHRvIGBjKDEsMSkuIFwiKVxuIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJtZXBzXCIpICU+JSBjaGVja19jb2x1bW4oXCJsb2dleHBlbmRcIiwgY29sX21pc3NpbmdfbXNnPVwiTWFrZSBzdXJlIHRvIGNyZWF0ZSBhIG5ldyBjb2x1bW4gaW4gYG1lcHNgIG5hbWVkIGBsb2dleHBlbmRgLlwiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIlRoZSB2YWx1ZXMgaW4gYGxvZ2V4cGVuZGAgc2hvdWxkIGJlIGVxdWFsIHRvIHRoZSBsb2cgb2YgdGhlIHZhbHVlcyBpbiBgbWVwcyRleHBlbmRvcGAuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiYm94cGxvdFwiLCBub3RfY2FsbGVkX21zZz1cIlVzZSB0aGUgYGJveHBsb3RgIGZ1bmN0aW9uIHRvIGNyZWF0ZSBhIGJveHBsb3Qgc2hvd2luZyBgbG9nZXhwZW5kYCBiYXNlZCBvbiBgcGhzdGF0YC5cIikgJT4lIHtcbiMgICBjaGVja19hcmcoLiwgXCJmb3JtdWxhXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3YW50IGJveHBsb3RzIHRoYXQgc2hvdyBgbG9nZXhwZW5kYCBmb3IgZWFjaCB2YWx1ZSBvZiBgcGhzdGF0YC5cIilcbiMgICBjaGVja19hcmcoLiwgXCJkYXRhXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCBvdXIgZGF0YSBjb21lcyBmcm9tIHRoZSBkYXRhZnJhbWUgYG1lcHNgLlwiKVxuIyB9XG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwicGxvdFwiLCBub3RfY2FsbGVkX21zZz1cIlVzZSBgcGxvdGAgdG8gY3JlYXRlIGEgcGxvdCB0aGF0IHNob3dzIGBsb2dleHBlbmRgIGJhc2VkIG9uIGBhZ2VgLlwiKSAlPiUge1xuIyAgIGNoZWNrX2FyZyguLCBcInhcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgaW5kZXBlbmRlbnQgdmFyaWFibGUgc2hvdWxkIGJlIGBhZ2VgLlwiKVxuIyAgIGNoZWNrX2FyZyguLCBcInlcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgZGVwZW5kZW50IHZhcmlhYmxlIHNob3VsZCBiZSBgbG9nZXhwZW5kYC5cIilcbiMgfVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcImxpbmVzXCIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgbGluZXNgIGZ1bmN0aW9uIHRvIGFkZCBhIGxpbmUgdG8gb3VyIHBsb3QuIFwiKSBcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJsb3dlc3NcIiwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBsb3dlc3NgIGZ1bmN0aW9uIHRvIGFkZCBhIHNtb290aGluZyBlZmZlY3QgdG8gb3VyIGxpbmUuIFwiKSAlPiUge1xuIyAgIGNoZWNrX2FyZyguLCBcInhcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgZmlyc3QgYXJndW1lbnQgaW4gYGxvd2Vzc2Agc2hvdWxkIGJlIG91ciBgeGAgdmFyaWFibGUsIHdoaWNoIGlzIGBhZ2VgLlwiKVxuIyAgIGNoZWNrX2FyZyguLCBcInlcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJUaGUgc2Vjb25kIGFyZ3VtZW50IGluIGBsb3dlc3NgIHNob3VsZCBiZSBvdXIgYHlgIHZhcmlhYmxlLCB3aGljaCBpcyBgbG9nZXhwZW5kYC5cIilcbiMgfVxuc3VjY2Vzc19tc2coXCJFeGNlbGxlbnQhIFN1bW1hcml6aW5nIGRhdGEsIHdpdGhvdXQgcmVmZXJlbmNlIHRvIGEgbW9kZWwsIGlzIHByb2JhYmx5IHRoZSBtb3N0IHRpbWUtY29uc3VtaW5nIHBhcnQgb2YgYW55IHByZWRpY3RpdmUgbW9kZWxpbmcgZXhlcmNpc2UuIFN1bW1hcnkgc3RhdGlzdGljcyBhcmUgYWxzbyBhIGtleSBwYXJ0IG9mIGFueSByZXBvcnQgYXMgdGhleSBpbGx1c3RyYXRlIGZlYXR1cmVzIG9mIHRoZSBkYXRhIHRoYXQgYXJlIGFjY2Vzc2libGUgdG8gYSBicm9hZCBhdWRpZW5jZS5cIikiLCJoaW50IjoiT25lIG9mIHRoZSBtb3N0IGltcG9ydGFudCBzdGVwcyBpbiBsaW5lYXIgcmVncmVzc2lvbiBpcyB1bmRlcnN0YW5kaW5nIHlvdXIgZGF0YS4gVXNpbmcgYSBjb21iaW5hdGlvbiBvZiA8Y29kZT5zdHIoKTwvY29kZT4gYW5kICA8Y29kZT5zdW1tYXJ5KCk8L2NvZGU+IGxldHMgeW91IHZpZXcgcXVhbnRpdGF0aXZlIHN0YXRpc3RpY3MgYWJvdXQgeW91ciBkYXRhLCB3aGlsZSA8Y29kZT5ib3hwbG90KCk8L2NvZGU+IGFuZCA8Y29kZT5oaXN0KCk8L2NvZGU+IGdpdmUgeW91IGEgbW9yZSB2aXN1YWwgcmVwcmVzZW50YXRpb24gb2YgeW91ciBkYXRhLiJ9
Exercise. Fit a benchmark multiple linear regression model
Assignment Text
As part of the pre-processing for the model fitting, we will split the data into training and test subsamples. For this exercise, we use a 75/25 split although other choices are certainly suitable. Some analysts prefer to do this splitting before looking at the data. Another approach, adopted here, is that the final report typically contains summary statistcs of the entire data set and so it makes sense to do so when examining summary statistics.
We start by fitting a benchmark model. It is common to use all available explanatory variables with the outcome on the original scale and so we use this as our benchmark model. This exercise shows that when you plot() a fitted linear regression model in R
, the result provides four graphs that you have seen before. These can be useful for identifying an appropriate model.
Instructions
Randomly split the data into a training and a testing data sets. Use 75% for the training, 25% for the testing.
Fit a full model using expendop
as the outcome and all explanatory variables. Summarize the results of this model fitting.
You can plot() the fitted model to view several diagnostic plots. These plots provide evidence that expenditures may not be the best scale for linear regression.
Fit a full model using logexpend
as the outcome and all explanatory variables and summarize the fit. Use the plot() function for evidence that this variable is more suited for linear regression methods than expenditures on the original scale.
Hint. A plot
of a regression object such as plot(mlr) provides four diagnostic plots. These can be organized as a 2 by 2 array using par(mfrow = c(2, 2))
.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiNtZXBzIDwtIHJlYWQuY3N2KFwiQ1NWRGF0YVxcXFxIZWFsdGhNZXBzLmNzdlwiLCBoZWFkZXIgPSBUUlVFKVxubWVwcyA8LSByZWFkLmNzdihcImh0dHBzOi8vYXNzZXRzLmRhdGFjYW1wLmNvbS9wcm9kdWN0aW9uL3JlcG9zaXRvcmllcy8yNjEwL2RhdGFzZXRzLzdiN2RhYjZkMGM1MjhlNGNkMmY4ZDBlMGZjNzgyNGEyNTQ0MjliZjgvSGVhbHRoTWVwcy5jc3ZcIiwgaGVhZGVyID0gVFJVRSlcbm1lcHMkbG9nZXhwZW5kIDwtIGxvZyhtZXBzJGV4cGVuZG9wKSIsInNhbXBsZSI6IiMgUmFuZG9tbHkgc3BsaXQgdGhlIGRhdGEgaW50byBhIHRyYWluaW5nIGFuZCBhIHRlc3RpbmcgZGF0YSBzZXRzLiBVc2UgNzVcXCUgZm9yIHRoZSB0cmFpbmluZywgMjVcXCUgZm9yIHRoZSB0ZXN0aW5nLlxubiA8LSBucm93KG1lcHMpXG5zZXQuc2VlZCgxMjM0NylcbnNodWZmbGVkX21lcHMgPC0gbWVwc1tzYW1wbGUobiksIF1cbnRyYWluX2luZGljZXMgPC0gMTpyb3VuZCgwLjc1ICogbilcbnRyYWluX21lcHMgICAgPC0gc2h1ZmZsZWRfbWVwc1t0cmFpbl9pbmRpY2VzLCBdXG50ZXN0X2luZGljZXMgIDwtIChyb3VuZCgwLjI1ICogbikgKyAxKTpuXG50ZXN0X21lcHMgICAgIDwtIHNodWZmbGVkX21lcHNbdGVzdF9pbmRpY2VzLCBdXG5cbiMgRml0IGEgZnVsbCBtb2RlbCB1c2luZyBgZXhwZW5kb3BgIGFzIHRoZSBvdXRjb21lIGFuZCBhbGwgZXhwbGFuYXRvcnkgdmFyaWFibGVzLiBTdW1tYXJpemUgdGhlIHJlc3VsdHMgb2YgdGhpcyBtb2RlbCBmaXR0aW5nLlxubWVwc19tbHIxIDwtIGxtKF9fXyB+IGdlbmRlciArIGFnZSArIHJhY2UgKyByZWdpb24gKyBlZHVjICsgcGhzdGF0ICsgbXBvb3IgKyBhbnlsaW1pdCArIGluY29tZSArIGluc3VyZSArIHVzYyArIHVuZW1wbG95ICsgbWFuYWdlZGNhcmUsIGRhdGEgPSBfX18pXG5zdW1tYXJ5KG1lcHNfbWxyMSlcblxuIyBQcm92aWRlIGRpYWdub3N0aWMgcGxvdHMgb2YgdGhlIGZpdHRlZCBtb2RlbC4gXG5wYXIobWZyb3cgPSBjKDIsIDIpKVxucGxvdChfX18pXG5cbiMgRml0IGEgZnVsbCBtb2RlbCB1c2luZyBgbG9nZXhwZW5kYCBhcyB0aGUgb3V0Y29tZSBhbmQgYWxsIGV4cGxhbmF0b3J5IHZhcmlhYmxlcy4gU3VtbWFyaXplIHRoZSBmaXQgYW5kIGV4YW1pbmUgZGlhZ25vc3RpYyBwbG90cyBvZiB0aGUgZml0dGVkIG1vZGVsLiBcbm1lcHNfbWxyMiA8LSBsbShfX18gfiBnZW5kZXIgKyBhZ2UgKyByYWNlICsgcmVnaW9uICsgZWR1YyArIHBoc3RhdCArIG1wb29yICsgYW55bGltaXQgKyBpbmNvbWUgKyBpbnN1cmUgKyB1c2MgKyB1bmVtcGxveSArIG1hbmFnZWRjYXJlLCBkYXRhID0gX19fKVxuc3VtbWFyeShtZXBzX21scjIpXG5wbG90KG1lcHNfbWxyMikiLCJzb2x1dGlvbiI6IiMgU3BsaXQgdGhlIHNhbXBsZSBpbnRvIGEgYHRyYWluaW5nYCBhbmQgYHRlc3RgIGRhdGFcbm4gPC0gbnJvdyhtZXBzKVxuc2V0LnNlZWQoMTIzNDcpXG5zaHVmZmxlZF9tZXBzIDwtIG1lcHNbc2FtcGxlKG4pLCBdXG50cmFpbl9pbmRpY2VzIDwtIDE6cm91bmQoMC43NSAqIG4pXG50cmFpbl9tZXBzICAgIDwtIHNodWZmbGVkX21lcHNbdHJhaW5faW5kaWNlcywgXVxudGVzdF9pbmRpY2VzICA8LSAocm91bmQoMC4yNSAqIG4pICsgMSk6blxudGVzdF9tZXBzICAgICA8LSBzaHVmZmxlZF9tZXBzW3Rlc3RfaW5kaWNlcywgXVxuXG5tZXBzX21scjEgPC0gbG0oZXhwZW5kb3AgfiBnZW5kZXIgKyBhZ2UgKyByYWNlICsgcmVnaW9uICsgZWR1YyArIHBoc3RhdCArIG1wb29yICsgYW55bGltaXQgKyBpbmNvbWUgKyBpbnN1cmUgKyB1c2MgKyB1bmVtcGxveSArIG1hbmFnZWRjYXJlLCBkYXRhID0gdHJhaW5fbWVwcylcbnN1bW1hcnkobWVwc19tbHIxKVxucGFyKG1mcm93ID0gYygyLCAyKSlcbnBsb3QoeD1tZXBzX21scjEpXG5cbm1lcHNfbWxyMiA8LSBsbShsb2dleHBlbmQgfiBnZW5kZXIgKyBhZ2UgKyByYWNlICsgcmVnaW9uICsgZWR1YyArIHBoc3RhdCArIG1wb29yICsgYW55bGltaXQgKyBpbmNvbWUgKyBpbnN1cmUgKyB1c2MgKyB1bmVtcGxveSArIG1hbmFnZWRjYXJlLCBkYXRhID0gdHJhaW5fbWVwcylcbnN1bW1hcnkobWVwc19tbHIyKVxucGxvdChtZXBzX21scjIpIiwic2N0IjoiIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJtZXBzX21scjFcIiwgdW5kZWZpbmVkX21zZz1cIlVzZSBgbG1gIHRvIGNyZWF0ZSBhIGxpbmVhciBtb2RlbCBuYW1lZCBgbWVwc19tbHIxYFwiLikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gY3JlYXRlIHRoaXMgbW9kZWwgc28gdGhhdCBgZXhwZW5kb3BgIGlzIGJhc2VkIG9uIGBnZW5kZXJgLCBgYWdlYCwgYHJhY2VgLCBgcmVnaW9uYCwgYGVkdWNgLCBgcGhzdGF0YCwgYG1wb29yYCwgYGFueWxpbWl0YCwgYGluY29tZWAsIGBpbnN1cmVgLCBgdXNjYCwgYHVuZW1wbG95YCwgYW5kIGBtYW5hZ2VkY2FyZWAgZnJvbSB0aGUgZGF0YWZyYW1lIGBtZXBzYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJzdW1tYXJ5XCIsaW5kZXg9MSxub3RfY2FsbGVkX21zZz1cIlVzZSB0aGUgYHN1bW1hcnlgIGZ1bmN0aW9uIHRvIHZpZXcgYSBzdW1tYXJ5IG9mIG91ciBsaW5lYXIgbW9kZWwuIFwiKSAlPiUgY2hlY2tfYXJnKC4sIFwib2JqZWN0XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIGEgc3VtbWFyeSBvZiBgbWVwc19tbHIxYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJwYXJcIiwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBwYXJgIGZ1bmN0aW9uIHRvIGFsdGVyIHRoZSBwbG90dGluZyBkZXZpY2UgdG8gc2hvdyBhIDIgYnkgMiBncmlkIG9mIGdyYXBocy4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJtZnJvd1wiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIlRvIGNyZWF0ZSBhIDIgYnkgMiBncmlkIG9mIGdyYXBocywgc2V0IGBtZnJvd2AgZXF1YWwgdG8gYGMoMiwyKWAuIFwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcInBsb3RcIixpbmRleD0xLCBub3RfY2FsbGVkX21zZz1cIlVzZSBgcGxvdGAgdG8gY3JlYXRlIHRoZSA0IHBsb3RzIGFzc29jaWF0ZWQgd2l0aCBhIGxpbmVhciBtb2RlbC4gXCIpICU+JSBjaGVja19hcmcoLiwgXCJ4XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiSWYgeW91IHNldCB0aGUgcGFyYW1ldGVyIGB4YCB0byBiZSBlcXVhbCB0byBhIGxpbmVhciBtb2RlbCwgaW4gdGhpcyBjYXNlIGBtZXBzX21scjFgLCB5b3Ugd2lsbCBnZXQgNCB1c2VmdWwgZ3JhcGhzIVwiKVxuIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJtZXBzX21scjJcIiwgdW5kZWZpbmVkX21zZz1cIlVzZSBgbG1gIHRvIGNyZWF0ZSBhIGxpbmVhciBtb2RlbCBuYW1lZCBgbWVwc19tbHIyYC5cIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gY3JlYXRlIHRoaXMgbW9kZWwgc28gdGhhdCBgbG9nZXhwZW5kYCBpcyBiYXNlZCBvbiBgZ2VuZGVyYCwgYGFnZWAsIGByYWNlYCwgYHJlZ2lvbmAsIGBlZHVjYCwgYHBoc3RhdGAsIGBtcG9vcmAsIGBhbnlsaW1pdGAsIGBpbmNvbWVgLCBgaW5zdXJlYCwgYHVzY2AsIGB1bmVtcGxveWAsIGFuZCBgbWFuYWdlZGNhcmVgIGZyb20gdGhlIGRhdGFmcmFtZSBgbWVwc2AuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwic3VtbWFyeVwiLGluZGV4PTIsIG5vdF9jYWxsZWRfbXNnPVwiVXNlIHRoZSBgc3VtbWFyeWAgZnVuY3Rpb24gdG8gdmlldyBhIHN1bW1hcnkgb2Ygb3VyIG5ldyBsaW5lYXIgbW9kZWwsIGBtZXBzX21scjJgLlwiKSAlPiUgY2hlY2tfYXJnKC4sIFwib2JqZWN0XCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3b3VsZCBsaWtlIGEgc3VtbWFyeSBvZiBgbWVwc19tbHIyYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJwbG90XCIsaW5kZXg9Miwgbm90X2NhbGxlZF9tc2c9XCJVc2UgdGhlIGBwbG90YCBmdW5jdGlvbiB0byBjcmVhdGUgdGhlIDQgcGxvdHMgYXNzb2NpYXRlZCB3aXRoIGEgbGluZWFyIG1vZGVsLiBcIikgJT4lIGNoZWNrX2FyZyguLCBcInhcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJJZiB5b3Ugc2V0IHRoZSBwYXJhbWV0ZXIgYHhgIHRvIGJlIGVxdWFsIHRvIGEgbGluZWFyIG1vZGVsLCBpbiB0aGlzIGNhc2UgYG1lcHNfbWxyMmAsIHlvdSB3aWxsIGdldCA0IHVzZWZ1bCBncmFwaHMhXCIpXG5zdWNjZXNzX21zZyhcIkV4Y2VsbGVudCEgWW91IG1heSBoYXZlIGNvbXBhcmVkIHRoZSBmb3VyIGRpYWdub3N0aWMgZ3JhcGhzIGZyb20gdGhlIE1MUiBtb2RlbCBmaXQgb2YgJ2V4cGVuZCcgdG8gdGhvc2UgY3JlYXRlZCB1c2luZyB0aGUgc2FtZSBwcm9jZWR1cmUgYnV0IHdpdGggbG9nYXJpdGhtaWMgZXhwZW5kaXR1cmVzIGFzIHRoZSBvdXRjb21lLiBUaGlzIHByb3ZpZGVzIGFub3RoZXIgcGllY2Ugb2YgZXZpZGVuY2UgdGhhdCBsb2cgZXhwZW5kaXR1cmVzIGFyZSBtb3JlIHN1aXRhYmxlIGZvciByZWdyZXNzaW9uIG1vZGVsaW5nLiBVc2luZyBsb2dhcml0aG1pYyBvdXRjb21lcyBpcyBhIGNvbW1vbiBmZWF0dXJlIG9mIGFjdHVhcmlhbCBhcHBsaWNhdGlvbnMgYnV0IGNhbiBiZSBkaWZmaWN1bHQgdG8gZGlhZ25vc2UgYW5kIGludGVycHJldCB3aXRob3V0IHByYWN0aWNlLlwiKSIsImhpbnQiOiJBIDxjb2RlPnBsb3Q8L2NvZGU+IG9mIGEgcmVncmVzc2lvbiBvYmplY3Qgc3VjaCBhcyBwbG90KG1scikgcHJvdmlkZXMgZm91ciBkaWFnbm9zdGljIHBsb3RzLiBUaGVzZSBjYW4gYmUgb3JnYW5pemVkIGFzIGEgMiBieSAyIGFycmF5IHVzaW5nIDxjb2RlPnBhcihtZnJvdyA9IGMoMiwgMikpPC9jb2RlPi4ifQ==
Exercise. Variable selection
Assignment Text
Modeling building can be approached using a “ground-up” strategy, where the analyst introduces a variable, examines residuls from a regression fit, and then seeks to understand the relationship between these residuals and other available variables so that these variables might be added to the model.
Another approach is a “top-down” strategy where all available variables are entered into a model and unnecessary variables are pruned from the model. Both approaches are helpful when using data to specify models. This exercise illustrates the latter approach, using the [step()] function to help narrow our search for the best fitting model.
Instructions
From our prior work, the training dataframe train_meps
has already been loaded in. A multiple linear regression model fit object meps_mlr2
is available that summarizes a fit of logexpend
as the outcome variable using all 13 explanatory variables.
Use the step() function function to drop unnecessary variables from the full fitted model summarized in the object meps_mlr2
and summarize this recommended model.
As an alternative, use the explanatory variables in the recommended model and add the varibles phstat
. Summarize the fit and note that statistical significance of the new variable.
You have been reminded by your boss that use of the variable gender
is unsuitable for actuarial pricing purposes. As an another alternative, drop gender
from the recommended model (still keeping phstat
). Note the statistical significance of the variable usc
with this fitted model.
{"language":"r","pre_exercise_code":"#meps <- read.csv(\"CSVData\\\\HealthMeps.csv\", header = TRUE)\nmeps <- read.csv(\"https://assets.datacamp.com/production/repositories/2610/datasets/7b7dab6d0c528e4cd2f8d0e0fc7824a254429bf8/HealthMeps.csv\", header = TRUE)\nmeps$logexpend <- log(meps$expendop)\n# Split the sample into a `training` and `test` data\nn <- nrow(meps)\nset.seed(12347)\nshuffled_meps <- meps[sample(n), ]\ntrain_indices <- 1:round(0.75 * n)\ntrain_meps    <- shuffled_meps[train_indices, ]\ntest_indices  <- (round(0.25 * n) + 1):n\ntest_meps     <- shuffled_meps[test_indices, ]","sample":"meps_mlr2 <- lm(logexpend ~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\n# Use the step() to drop unnecessary variables from the full fitted model summarized in the object `meps_mlr2` and summarize this recommended model.\nmodel_stepwise <- step(___, direction= \"both\", k = log(nrow(train_meps)), trace = 0)\nsummary(model_stepwise)\n\n# As an alternative, use the explanatory variables in the recommended model and add the varibles `mpoor`. Summarize the fit  and note that statistical significance of the new variable.\nmeps_mlr4 <- lm(___ ~ gender + age + phstat + anylimit + insure  + ___, data = train_meps)\nsummary(meps_mlr4)\n\n# You have been reminded by your boss that use of the variable `gender` is unsuitable for actuarial pricing purposes. As an another alternative, drop `gender` from the recommended model (still keeping `mpoor`). Note the statistical significance of the variable `usc`with this fitted model.\nmeps_mlr5 <- lm(logexpend ~ age + phstat + anylimit + insure  + ___, data = train_meps)\nsummary(___)","solution":"meps_mlr2 <- lm(logexpend ~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\n#library(Rcmdr)\n#temp <- stepwise(meps_mlr2, direction = 'backward/forward')\nmodel_stepwise <- step(meps_mlr2, direction= \"both\", k = log(nrow(train_meps)), trace = 0) \nsummary(model_stepwise)\nmeps_mlr3 <- lm(logexpend ~ gender + age + phstat + anylimit + insure , data = train_meps)\nsummary(meps_mlr3)\nmeps_mlr4 <- lm(logexpend ~ gender + age + phstat + anylimit + insure  + mpoor, data = train_meps)\nsummary(meps_mlr4)\nmeps_mlr5 <- lm(logexpend ~ age + phstat + anylimit + insure  + mpoor, data = train_meps)\nsummary(meps_mlr5)\n\n# par(mfrow = c(2, 2))\n# plot(meps_mlr3)\n# \n# meps_mlr4 <- lm(logexpend ~ gender + age + mpoor + anylimit + insure + usc  + phstat, data = train_meps)\n# summary(meps_mlr4)\n# \n# \n# meps_mlr5 <- lm(logexpend ~ age  + anylimit + mpoor + insure  + usc  + phstat, data = train_meps)\n# summary(meps_mlr5)\n# anova(meps_mlr4, meps_mlr5)\n# \n# #boxplot(train_meps$logexpend ~ train_meps$phstat*train_meps$usc)","sct":"# ex() %>% check_object(\"meps_mlr2\", undefined_msg=\"Use `lm` to create a linear model named `meps_mlr2`.\") %>% check_equal(incorrect_msg=\"Make sure to create the linear model so that `logexpend` is based on `gender`, `age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `insure`, `usc`, `unemploy`, and `managedcare` from the data set `train_meps`.\")\n# ex() %>% check_object(\"model_stepwise\", undefined_msg=\"Utilize the `step` function to create a simplified model named `model_stepwise`.\") %>% check_equal(incorrect_msg=\"Make sure to run `step` on `meps_mlr2` in both directions, and with k set to be the BIC estimate. \")\n# ex() %>% check_function(\"summary\",index=1, not_called_msg=\"Use `summary` to create and view a summary of `model_stepwise`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `model_stepwise`.\")\n# ex() %>% check_object(\"meps_mlr3\", undefined_msg=\"Create a new linear model off a small number of explanatory variables named `meps_mlr3`.\") %>% check_equal(incorrect_msg=\"When you create your linear model, make sure to model `logexpend` based on `gender`, `age`, `phstat`, `anylimit`, and `insure` from the dataframe `train_meps`.\")\n# ex() %>% check_function(\"summary\",index=2, not_called_msg=\"Use the `summary` function to create and view a summary of `meps_mlr3`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `meps_mlr3`.\")\n# ex() %>% check_object(\"meps_mlr4\", undefined_msg=\"Create a new linear model off a small number of explanatory variables named `meps_mlr4`.\") %>% check_equal(incorrect_msg=\"When you create your linear model, make sure to model `logexpend` based on `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor` from the dataframe `train_meps`.\")\n# ex() %>% check_function(\"summary\",index=3, not_called_msg=\"Use the `summary` function to create and view a summary of `meps_mlr4`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `meps_mlr4`.\")\n# ex() %>% check_object(\"meps_mlr5\", undefined_msg=\"Create a new linear model off a small number of explanatory variables named `meps_mlr5`.\") %>% check_equal(incorrect_msg=\"When you create your linear model, make sure to model `logexpend` based on `age`, `phstat`, `anylimit`, and `insure` from the dataframe `train_meps`.\")\n# ex() %>% check_function(\"summary\",index=4, not_called_msg=\"Use the `summary` function to create and view a summary of `meps_mlr5`.\") %>% check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like a summary of `meps_mlr5`.\")\nsuccess_msg(\"Excellent! Sometimes variables may have good predictive power but are unacceptable for policy purposes - in insurance, ethnicity and sometimes sex are good examples. This implies that model interpretation can be just as important as the ability to predict.\")","hint":"Starting with a model that consists of all your explanatory variables and using <code>step()</code> is a good way to create a model that fits the data well. On the other hand, this is simply statistics, which is unable to capture real explanations, or follow regulatory rules."}
Exercise. Model comparisons using cross-validation
Assignment Text
To compare alternative models, you decide to utilize cross-validation. For this exercise, you split the training sample into six subsamples of approximately equal size.
In the sample code, the cross-validation procedure has been summarized into a function that you can call. The input to the function is a list of variables that you select as your model explanatory variables. With this function, you can readily test several candidate models.
Instructions
Run the cross validation (crossvalfct
) function using the explanatory variables suggested by the stepwise function.
Run the function again but adding the mpoor
variable
Run the function again but omitting the gender
variable
Note which model is suggested by the cross validation function.
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiNtZXBzIDwtIHJlYWQuY3N2KFwiQ1NWRGF0YVxcXFxIZWFsdGhNZXBzLmNzdlwiLCBoZWFkZXIgPSBUUlVFKVxubWVwcyA8LSByZWFkLmNzdihcImh0dHBzOi8vYXNzZXRzLmRhdGFjYW1wLmNvbS9wcm9kdWN0aW9uL3JlcG9zaXRvcmllcy8yNjEwL2RhdGFzZXRzLzdiN2RhYjZkMGM1MjhlNGNkMmY4ZDBlMGZjNzgyNGEyNTQ0MjliZjgvSGVhbHRoTWVwcy5jc3ZcIiwgaGVhZGVyID0gVFJVRSlcbm1lcHMkbG9nZXhwZW5kIDwtIGxvZyhtZXBzJGV4cGVuZG9wKVxuIyBTcGxpdCB0aGUgc2FtcGxlIGludG8gYSBgdHJhaW5pbmdgIGFuZCBgdGVzdGAgZGF0YVxubiA8LSBucm93KG1lcHMpXG5zZXQuc2VlZCgxMjM0NylcbnNodWZmbGVkX21lcHMgPC0gbWVwc1tzYW1wbGUobiksIF1cbnRyYWluX2luZGljZXMgPC0gMTpyb3VuZCgwLjc1ICogbilcbnRyYWluX21lcHMgICAgPC0gc2h1ZmZsZWRfbWVwc1t0cmFpbl9pbmRpY2VzLCBdXG50ZXN0X2luZGljZXMgIDwtIChyb3VuZCgwLjI1ICogbikgKyAxKTpuXG50ZXN0X21lcHMgICAgIDwtIHNodWZmbGVkX21lcHNbdGVzdF9pbmRpY2VzLCBdXG5cbiMjIENyb3NzIC0gVmFsaWRhdGlvblxuXG5jcm9zc3ZhbGZjdCA8LSBmdW5jdGlvbihleHBsdmFycyl7XG4gIGN2ZGF0YSAgIDwtIHRyYWluX21lcHNbLCBjKFwibG9nZXhwZW5kXCIsIGV4cGx2YXJzKV1cbiAgY3Jvc3N2YWwgPC0gMFxuICBmb3IgKGkgaW4gMTo2KSB7XG4gICAgaW5kaWNlcyA8LSAoKChpLTEpICogcm91bmQoKDEvNikqbnJvdyhjdmRhdGEpKSkgKyAxKTooKGkqcm91bmQoKDEvNikgKiBucm93KGN2ZGF0YSkpKSlcbiAgICAjIEV4Y2x1ZGUgdGhlbSBmcm9tIHRoZSB0cmFpbiBzZXRcbiAgICB0cmFpbl9tbHIgPC0gbG0obG9nZXhwZW5kIH4gLiwgZGF0YSA9IGN2ZGF0YVstaW5kaWNlcyxdKVxuICAgICMgSW5jbHVkZSB0aGVtIGluIHRoZSB0ZXN0IHNldFxuICAgIHRlc3QgIDwtIGRhdGEuZnJhbWUoY3ZkYXRhW2luZGljZXMsIGV4cGx2YXJzXSlcbiAgICBuYW1lcyh0ZXN0KSAgPC0gZXhwbHZhcnNcbiAgICBwcmVkaWN0X3Rlc3QgPC0gZXhwKHByZWRpY3QodHJhaW5fbWxyLCB0ZXN0KSlcbiAgICAjIENvbXBhcmUgcHJlZGljdGVkIHRvIGhlbGQtb3V0IGFuZCBzdW1tYXJpemVcbiAgICBwcmVkaWN0X2VyciAgPC0gZXhwKGN2ZGF0YVtpbmRpY2VzLCBcImxvZ2V4cGVuZFwiXSkgLSBwcmVkaWN0X3Rlc3RcbiAgICBjcm9zc3ZhbCA8LSBjcm9zc3ZhbCArIHN1bShhYnMocHJlZGljdF9lcnIpKVxuICB9XG4gIGNyb3NzdmFsLzEwMDAwMDBcbn0iLCJzYW1wbGUiOiIjIFJ1biB0aGUgY3Jvc3MgdmFsaWRhdGlvbiAoYGNyb3NzdmFsZmN0YCkgZnVuY3Rpb24gdXNpbmcgdGhlIGV4cGxhbmF0b3J5IHZhcmlhYmxlcyBzdWdnZXN0ZWQgYnkgdGhlIHN0ZXB3aXNlIGZ1bmN0aW9uLlxuZXhwbHZhcnMuMSA8LSBjKFwiZ2VuZGVyXCIsIFwiYWdlXCIsIFwicGhzdGF0XCIsIFwiYW55bGltaXRcIiwgXCJpbnN1cmVcIilcbmNyb3NzdmFsZmN0KGV4cGx2YXJzKVxuXG4jIFJ1biB0aGUgZnVuY3Rpb24gYWdhaW4gYnV0IGFkZGluZyB0aGUgYG1wb29yYCB2YXJpYWJsZVxuZXhwbHZhcnMuMiA8LSBjKF9fXylcbmNyb3NzdmFsZmN0KGV4cGx2YXJzLjIpXG5cbiMgUnVuIHRoZSBmdW5jdGlvbiBhZ2FpbiBidXQgb21pdHRpbmcgdGhlIGBnZW5kZXJgIHZhcmlhYmxlXG5leHBsdmFycy4zIDwtIGMoIF9fXylcbmNyb3NzdmFsZmN0KGV4cGx2YXJzLjMpIiwic29sdXRpb24iOiJleHBsdmFycy4xIDwtIGMoXCJnZW5kZXJcIiwgXCJhZ2VcIiwgXCJwaHN0YXRcIiwgXCJhbnlsaW1pdFwiLCBcImluc3VyZVwiKVxuY3Jvc3N2YWxmY3QoZXhwbHZhcnMuMSlcbmV4cGx2YXJzLjIgPC0gYyhcImdlbmRlclwiLCBcImFnZVwiLCBcInBoc3RhdFwiLCBcImFueWxpbWl0XCIsIFwiaW5zdXJlXCIsIFwibXBvb3JcIilcbmNyb3NzdmFsZmN0KGV4cGx2YXJzLjIpXG5leHBsdmFycy4zIDwtIGMoIFwiYWdlXCIsIFwicGhzdGF0XCIsIFwiYW55bGltaXRcIiwgXCJpbnN1cmVcIiwgXCJtcG9vclwiKVxuY3Jvc3N2YWxmY3QoZXhwbHZhcnMuMykiLCJzY3QiOiIjIGV4KCkgJT4lIGNoZWNrX29iamVjdChcImV4cGx2YXJzLjFcIiwgdW5kZWZpbmVkX21zZz1cIkNyZWF0ZSBhIHZhcmlhYmxlIG5hbWVkIGBleHBsdmFycy4xYCB0aGF0IGNvbnRhaW5zIHRoZSBmb2xsb3dpbmcgYXMgY2hhcmFjdGVyIHN0cmluZ3M6IGBnZW5kZXJgLCBgYWdlYCwgYHBoc3RhdGAsIGBhbnlsaW1pdGAsIGFuZCBgaW5zdXJlYC5cIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gc3BlY2lmeSB0aGF0IHRoZSBmb2xsb3dpbmcgYXJlIGV4cHJlc3NlZCBhcyBjaGFyYWN0ZXIgc3RyaW5ncywgYW5kIG5vdCBhcyBudW1iZXJzOiA6IGBnZW5kZXJgLCBgYWdlYCwgYHBoc3RhdGAsIGBhbnlsaW1pdGAsIGFuZCBgaW5zdXJlYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfZnVuY3Rpb24oXCJjcm9zc3ZhbGZjdFwiLGluZGV4PTEsIG5vdF9jYWxsZWRfbXNnPVwiVXRpbGl6ZSB0aGUgY3VzdG9tIGJ1aWx0IGZ1bmN0aW9uIGBjcm9zc3ZhbGZjdGAgdG8gZmluZCB0aGUgY3Jvc3MtdmFsaWRhdGlvbiBzdGF0aXN0aWMgZm9yIGBleHBsdmFycy4xYC5cIikgJT4lIGNoZWNrX2FyZyguLCBcImV4cGx2YXJzXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB3ZSB3YW50IHRoZSBjcm9zcy12YWxpZGF0aW9uIHN0YXRpc3RpYyBmb3IgYGV4cGx2YXNyLjFgLlwiKVxuIyBleCgpICU+JSBjaGVja19vYmplY3QoXCJleHBsdmFycy4yXCIsIHVuZGVmaW5lZF9tc2c9XCJDcmVhdGUgYSB2YXJpYWJsZSBuYW1lZCBgZXhwbHZhcnMuMmAgdGhhdCBjb250YWlucyB0aGUgZm9sbG93aW5nIGFzIGNoYXJhY3RlciBzdHJpbmdzOiBgZ2VuZGVyYCwgYGFnZWAsIGBwaHN0YXRgLCBgYW55bGltaXRgLCBgaW5zdXJlYCwgYW5kIGBtcG9vcmAuXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB0aGUgZm9sbG93aW5nIGFyZSBleHByZXNzZWQgYXMgY2hhcmFjdGVyIHN0cmluZ3MsIGFuZCBub3QgYXMgbnVtYmVyczogOiBgZ2VuZGVyYCwgYGFnZWAsIGBwaHN0YXRgLCBgYW55bGltaXRgLCBgaW5zdXJlYCwgYW5kIGBtcG9vcmAuXCIpXG4jIGV4KCkgJT4lIGNoZWNrX2Z1bmN0aW9uKFwiY3Jvc3N2YWxmY3RcIixpbmRleD0yLCBub3RfY2FsbGVkX21zZz1cIlV0aWxpemUgdGhlIGN1c3RvbSBidWlsdCBmdW5jdGlvbiBgY3Jvc3N2YWxmY3RgIHRvIGZpbmQgdGhlIGNyb3NzLXZhbGlkYXRpb24gc3RhdGlzdGljIGZvciBgZXhwbHZhcnMuMmAuXCIpICU+JSBjaGVja19hcmcoLiwgXCJleHBsdmFyc1wiKSAlPiUgY2hlY2tfZXF1YWwoaW5jb3JyZWN0X21zZz1cIk1ha2Ugc3VyZSB0byBzcGVjaWZ5IHRoYXQgd2Ugd2FudCB0aGUgY3Jvc3MtdmFsaWRhdGlvbiBzdGF0aXN0aWMgZm9yIGBleHBsdmFzci4yYC5cIilcbiMgZXgoKSAlPiUgY2hlY2tfb2JqZWN0KFwiZXhwbHZhcnMuM1wiLCB1bmRlZmluZWRfbXNnPVwiQ3JlYXRlIGEgdmFyaWFibGUgbmFtZWQgYGV4cGx2YXJzLjNgIHRoYXQgY29udGFpbnMgdGhlIGZvbGxvd2luZyBhcyBjaGFyYWN0ZXIgc3RyaW5nczogYGFnZWAsIGBwaHN0YXRgLCBgYW55bGltaXRgLCBgaW5zdXJlYCwgYW5kIGBtcG9vcmAuXCIpICU+JSBjaGVja19lcXVhbChpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHNwZWNpZnkgdGhhdCB0aGUgZm9sbG93aW5nIGFyZSBleHByZXNzZWQgYXMgY2hhcmFjdGVyIHN0cmluZ3MsIGFuZCBub3QgYXMgbnVtYmVyczogOmBhZ2VgLCBgcGhzdGF0YCwgYGFueWxpbWl0YCwgYGluc3VyZWAsIGFuZCBgbXBvb3JgLlwiKVxuIyBleCgpICU+JSBjaGVja19mdW5jdGlvbihcImNyb3NzdmFsZmN0XCIsaW5kZXg9Mywgbm90X2NhbGxlZF9tc2c9XCJVdGlsaXplIHRoZSBjdXN0b20gYnVpbHQgZnVuY3Rpb24gYGNyb3NzdmFsZmN0YCB0byBmaW5kIHRoZSBjcm9zcy12YWxpZGF0aW9uIHN0YXRpc3RpYyBmb3IgYGV4cGx2YXJzLjNgLlwiKSAlPiUgY2hlY2tfYXJnKC4sIFwiZXhwbHZhcnNcIikgJT4lIGNoZWNrX2VxdWFsKGluY29ycmVjdF9tc2c9XCJNYWtlIHN1cmUgdG8gc3BlY2lmeSB0aGF0IHdlIHdhbnQgdGhlIGNyb3NzLXZhbGlkYXRpb24gc3RhdGlzdGljIGZvciBgZXhwbHZhc3IuM2AuXCIpXG5zdWNjZXNzX21zZyhcIkV4Y2VsbGVudCEgQ3Jvc3MtdmFsaWRhdGlvbiBoYXMgYmVjb21lIGFuIGVzc2VudGlhbCBwaWVjZSBvZiB0aGUgZGF0YSBhbmFseXN0cyB0b29sa2l0LiBHb29kIHRoYXQgeW91IG5vdyBoYXZlIGFkZGl0aW9uYWwgZXhwZXJpZW5jZSB3aXRoIGl0LlwiKSIsImhpbnQiOiJQZXJmb3JtaW5nIGNyb3NzLXZhbGlkYXRpb24gbWVhc3VyZXMgb24geW91ciBtb2RlbCBhbGxvd3MgeW91IHRvIHNlZSBob3cgd2VsbCBpdCBkb2VzIHdoZW4gaXQgY29tZXMgdG8gcHJlZGljdGluZyBvdXQgb2Ygc2FtcGxlIGRhdGEgcG9pbnRzLiJ9
Exercise. Out of sample validation
Assignment Text
From our prior work, the training train_meps
and test test_meps
dataframes have already been loaded in. We think our best model is based on logarithmic expenditures as the outcome and the following explanatory variables:
explvars3 <- c("gender", "age", "phstat", "anylimit", "insure", "mpoor")
We will compare this to a benchmark model that is based on expenditures as the outcome and all 13 explanatory variables
explvars4 <- c(explvars3, "race", "income", "region", "educ", "unemploy", "managedcare", "usc")
The comparisons will be based on expenditures in dollars using the held-out validation sample.
Instructions
Use the training sample to fit a linear model with logexpend
and explanatory variables listed in explvars3
Predict expenditures (not logged) for the test data and summarize the fit using the sum of absolute prediction errors.
Use the training sample to fit a benchmark linear model with expendop
and explanatory variables listed in explvars4
Predict expenditures for the test data and summarize the fit for the benchmark model using the sum of absolute prediction errors.
Compare the predictions of the models graphically.
{"language":"r","pre_exercise_code":"#meps <- read.csv(\"CSVData\\\\HealthMeps.csv\", header = TRUE)\nmeps <- read.csv(\"https://assets.datacamp.com/production/repositories/2610/datasets/7b7dab6d0c528e4cd2f8d0e0fc7824a254429bf8/HealthMeps.csv\", header = TRUE)\nmeps$logexpend <- log(meps$expendop)\n# Split the sample into a `training` and `test` data\nn <- nrow(meps)\nset.seed(12347)\nshuffled_meps <- meps[sample(n), ]\ntrain_indices <- 1:round(0.75 * n)\ntrain_meps    <- shuffled_meps[train_indices, ]\ntest_indices  <- (round(0.25 * n) + 1):n\ntest_meps     <- shuffled_meps[test_indices, ]\nexplvars3 <- c(\"gender\", \"age\", \"race\", \"mpoor\", \"anylimit\", \"income\", \"insure\", \"usc\")\nexplvars4 <- c(explvars3, \"region\", \"educ\", \"phstat\", \"unemploy\", \"managedcare\")","sample":"# Regress `logexpend` on the explanatory variables listed in `explvars3`\nmeps_mlr3 <- lm(logexpend ~ gender + age + phstat + anylimit  + insure + mpoor, data = train_meps)\n\n# Predict expenditures (not logged) and summarize using the sum of absolute prediction errors.\nexplvars3 <- c(\"gender\", \"age\", \"phstat\", \"anylimit\", \"insure\", \"mpoor\")\npredict_meps3 <- test_meps[,explvars3]\npredict_mlr3  <- exp(predict(meps_mlr3, predict_meps3))\npredict_err_mlr3 <- test_meps$expendop - predict_mlr3\nsape3     <- sum(abs(predict_err_mlr3))/1000\n\n# Regress `expendop` on all 13 explanatory variables\nmeps_mlr4 <- lm(___~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\n\n# Predict expenditures and summarize using the sum of absolute prediction errors.\nexplvars4 <- c(\"gender\",\"age\",\"race\",\"region\",\"educ\",\"phstat\",\"mpoor\",\"anylimit\",\"income\",\"insure\",\"usc\",\"unemploy\",\"managedcare\")\npredict_meps4 <- test_meps[,explvars4]\npredict_mlr4  <- predict(meps_mlr4, predict_meps4)\npredict_err_mlr4 <- test_meps$expendop - predict_mlr4\nsape4     <- sum(abs(predict_err_mlr4))/1000\nsape3;sape4\n\n# Compare the predictions of the models graphically.\npar(mfrow = c(1, 2))\nplot(predict_err_mlr4, predict_err_mlr3, xlab = \"Benchmark Predict Error\", ylab = \"MLR Predict Error\")\nplot(predict_mlr3, test_meps$expendop, xlab = \"MLR Predicts\", ylab = \"Held Out Expends\")","solution":"meps_mlr3 <- lm(logexpend ~ gender + age + phstat + anylimit  + insure + mpoor, data = train_meps)\nexplvars3 <- c(\"gender\", \"age\", \"phstat\", \"anylimit\", \"insure\", \"mpoor\")\npredict_meps3 <- test_meps[,explvars3]\npredict_mlr3  <- exp(predict(meps_mlr3, predict_meps3))\npredict_err_mlr3 <- test_meps$expendop - predict_mlr3\nsape3     <- sum(abs(predict_err_mlr3))/1000\n\nmeps_mlr4 <- lm(expendop ~ gender + age + race + region + educ + phstat + mpoor + anylimit + income + insure + usc + unemploy + managedcare, data = train_meps)\nexplvars4 <- c(\"gender\",\"age\",\"race\",\"region\",\"educ\",\"phstat\",\"mpoor\",\"anylimit\",\"income\",\"insure\",\"usc\",\"unemploy\",\"managedcare\")\npredict_meps4 <- test_meps[,explvars4]\npredict_mlr4  <- predict(meps_mlr4, predict_meps4)\npredict_err_mlr4 <- test_meps$expendop - predict_mlr4\nsape4     <- sum(abs(predict_err_mlr4))/1000\n\nsape3;sape4\n\npar(mfrow = c(1, 2))\nplot(predict_err_mlr4, predict_err_mlr3, xlab = \"Benchmark Predict Error\", ylab = \"MLR Predict Error\")\nplot(predict_mlr3, test_meps$expendop, xlab = \"MLR Predicts\", ylab = \"Held Out Expends\")","sct":"# ex() %>% check_object(\"meps_mlr3\", undefined_msg=\"Use `lm` to create a linear model named `meps_mlr3`.\") %>% check_equal(incorrect_msg=\"Make sure to create your model so that `logexpend` is based on `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor` from the data in `train_meps`.\")\n# ex() %>% check_object(\"explvars3\", undefined_msg=\"Create a variable named `explvars3` that has `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor` as character strings. \") %>% check_equal(incorrect_msg=\"Make sure to have the following as a character string instead of the values contained in each: `gender`, `age`, `phstat`, `anylimit`, `insure`, and `mpoor`.\")\n# ex() %>% check_object(\"predict_meps3\", undefined_msg=\"Make a new dataframe named `predict_meps3` that a subset of the data not used to create the model. \") %>% check_equal(incorrect_msg=\"The subset of data not used is stored in `test_meps`, and the columns we want are now stored in `explvars3`.\")\n# ex() %>% check_function(\"exp\",index=1,not_called_msg=\"Make sure to take the exponential of the predicted values to remove the log effect. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to run `exp` on the values you get from your `predict` call. \")\n# ex() %>% check_function(\"predict\",index=1, not_called_msg=\"Run `predict` on the data you did not use to create the model. \") %>% {\n#   check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that the model we are using to make predictions is `meps_mlr3`.\")\n#   check_arg(., \"newdata\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like predictions for the values found in `predict_meps3`.\")\n# }\n# ex() %>% check_object(\"predict_mlr3\", undefined_msg=\"Make sure to save the exponential of the predicted values to `predict_mlr3`.\") %>% check_equal(incorrect_msg=\"Make sure to take the exponential of the values found using `predict`.\")\n# ex() %>% check_object(\"predict_err_mlr3\", undefined_msg=\"Save the prediction errors to a variable named `predict_err_mlr3`) %>% check_equal(incorrect_msg=\"To find the prediction errors, subtract the predicted values in `predict_mlr3` from the actual values found in `test_meps$expendop`.\")\n# ex() %>% check_function(\"abs\",index=1, undefined_msg=\"Make sure to take the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to take the absolute value of the prediction errors, which can be found in `predict_err_mlr3`.\")\n# ex() %>% check_function(\"sum\",index=1, not_called_msg=\"Take the sum of the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like the sum of the absolute value of the prediction errors. \")\n# ex() %>% check_object(\"sape3\", undefined_msg=\"Make sure to set the sum of the absolute value of the prediction errors divided by 1000 equal to `sape3`.\") %>% check_equal(incorrect_msg=\"Make sure that you have taken the sum of the absolute value of the prediction errors, and divided that number by 1000. \")\n# ex() %>% check_object(\"meps_mlr4\", undefined_msg=\"Use `lm` to create a linear model named `meps_mlr4`.\") %>% check_equal(incorrect_msg=\"Make sure to create your model so that `expendop` is based on `gender`,` age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `usc`, `unemploy`, and `managedcare` from the data in `train_meps`.\")\n# ex() %>% check_object(\"explvars4\", undefined_msg=\"Create a variable named `explvars4` that has `gender`,` age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `usc`, `unemploy`, and `managedcare` as character strings. \") %>% check_equal(incorrect_msg=\"Make sure to have the following as a character string instead of the values contained in each: `gender`,` age`, `race`, `region`, `educ`, `phstat`, `mpoor`, `anylimit`, `income`, `usc`, `unemploy`, and `managedcare`.\")\n# ex() %>% check_object(\"predict_meps4\", undefined_msg=\"Make a new dataframe named `predict_meps4` that a subset of the data not used to create the model. \") %>% check_equal(incorrect_msg=\"The subset of data not used is stored in `test_meps`, and the columns we want are now stored in `explvars4`.\")\n# ex() %>% check_function(\"exp\",index=2,not_called_msg=\"Make sure to take the exponential of the predicted values to remove the log effect. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to run `exp` on the values you get from your `predict` call. \")\n# ex() %>% check_function(\"predict\",index=2, not_called_msg=\"Run `predict` on the data you did not use to create the model. \") %>% {\n#   check_arg(., \"object\") %>% check_equal(incorrect_msg=\"Make sure to specify that the model we are using to make predictions is `meps_mlr4`.\")\n#   check_arg(., \"newdata\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like predictions for the values found in `predict_meps4`.\")\n# }\n# ex() %>% check_object(\"predict_mlr4\", undefined_msg=\"Make sure to save the exponential of the predicted values to `predict_mlr4`.\") %>% check_equal(incorrect_msg=\"Make sure to take the exponential of the values found using `predict`.\")\n# ex() %>% check_object(\"predict_err_mlr4\", undefined_msg=\"Save the prediction errors to a variable named `predict_err_mlr4`) %>% check_equal(incorrect_msg=\"To find the prediction errors, subtract the predicted values in `predict_mlr4` from the actual values found in `test_meps$expendop`.\")\n# ex() %>% check_function(\"abs\",index=2, undefined_msg=\"Make sure to take the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to take the absolute value of the prediction errors, which can be found in `predict_err_mlr4`.\")\n# ex() %>% check_function(\"sum\",index=2, not_called_msg=\"Take the sum of the absolute value of the prediction errors. \") %>% check_arg(., \"x\") %>% check_equal(incorrect_msg=\"Make sure to specify that we would like the sum of the absolute value of the prediction errors. \")\n# ex() %>% check_object(\"sape4\", undefined_msg=\"Make sure to set the sum of the absolute value of the prediction errors divided by 1000 equal to `sape4`.\") %>% check_equal(incorrect_msg=\"Make sure that you have taken the sum of the absolute value of the prediction errors, and divided that number by 1000. \")\n# ex() %>% check_function(\"par\",not_called_msg=\"Use `par` to alter the graphing device to have 2 side by side graphs. \") %>% check_arg(., \"mfrow\") %>% check_equal(incorrect_msg=\"To do this, set `mfrow` equal to `c(2,1)`. \")\n# ex() %>% check_function(\"plot\",index=1, not_called_msg=\"Create a plot of the two prediction errors as the first graph. \") %>% {\n#   check_arg(., \"x\") %>% check_equal(incorrect_msg=\"The x axis should have the prediction errors from `predict_err_mlr4`.\")\n#   check_arg(., \"y\") %>% check_equal(incorrect_msg=\"The y axis should have the prediction errors from `predict_err_mlr3`.\")\n# }\n# ex() %>% check_function(\"plot\",index=2, not_called_msg=\"The second graph should plot the predictions from mlr3 against the actual values of expenditures. \") %>% {\n#   check_arg(., \"x\") %>% check_equal(incorrect_msg=\"The X axis should have the predictions, found in `predict_mlr3`.\")\n#   check_arg(., \"y\") %>% check_equal(incorrect_msg=\"The Y axis should have the actual expenditure values, found in `test_meps$expendop`.\")\n# }\nsuccess_msg(\"Excellent! We found that the model of log expenditures outperforms the benchmark that models expenditures, even when the out of sample criterion was in the original 'dollar' units. It is comoforting to know that a search for a good model does well when using different out of sample criteria.\")","hint":"There is nothing in this code to change. simply examine the code, play around with it, and figure out what each individual line does, and how they work together."}