Preface

Date: 16 March 2021

About Regression Modeling

Statistical techniques can be used to address new situations, which matters in a rapidly evolving risk management world. Analysts with a strong analytical background understand that a large data set can be a treasure trove of information to be mined and can yield a strong competitive advantage. This book and online tutorial provide budding analysts with a foundation in multiple regression. Readers and viewers will learn about these statistical techniques using data on the demand for insurance, healthcare expenditures, and other applications. Although no specific knowledge of actuarial science or risk management is presumed, the approach introduces applications in which statistical techniques can be used to analyze real data of interest.

Resources

Tutorial Description

  • This online tutorial is designed to guide you through the foundations of regression with applications in actuarial science.
  • Anticipated completion time is approximately six hours.
  • The tutorial assumes that you are familiar with the basics of the statistical software R, such as those covered in Datacamp’s Introduction to R.

General Layout. This tutorial has five chapters that summarize the foundations of multiple linear regression. Each chapter is subdivided into several sections. Each section begins with a short video, typically 4-8 minutes long, that summarizes the section's key learning outcomes. Following the video, you can see more details about the underlying R code for the analysis presented in the video.

Role of Exercises. Following each video, there are one or two exercises that let you practice the skills and check that you fully grasp the learning outcomes. The exercises are implemented using an online learning platform provided by Datacamp, so you need not install R. Feedback is programmed into the exercises so that you will learn a lot by making mistakes! You set your own pace, so always feel free to reveal the answers by hitting the Solution tab. Remember, going through quickly is not equivalent to learning deeply. Use this tool to enhance your understanding of one of the foundations of data science, regression analysis.

Welcome to the Tutorial Video


In this video, you learn how to:

  • Describe regression briefly, i.e., in a nutshell
  • Explain Galton’s height example as a regression application

Video Overhead

A. Galton’s 1885 Regression Data

\[ \small{\begin{array}{l|ccccccccccc|c} \hline \text{Height of }& & & & & & & & & & & & \\ \text{adult child }& & & & & & & & & & & & \\ \text{in inches }& <64.0 & 64.5 & 65.5 & 66.5 & 67.5 & 68.5 & 69.5 & 70.5 & 71.5 & 72.5 & >73.0 & \text{Totals} \\ \hline >73.7 & - & - & - & - & - & - & 5 & 3 & 2 & 4 & - & 14 \\ 73.2 & - & - & - & - & - & 3 & 4 & 3 & 2 & 2 & 3 & 17 \\ 72.2 & - & - & 1 & - & 4 & 4 & 11 & 4 & 9 & 7 & 1 & 41 \\ 71.2 & - & - & 2 & - & 11 & 18 & 20 & 7 & 4 & 2 & - & 64 \\ 70.2 & - & - & 5 & 4 & 19 & 21 & 25 & 14 & 10 & 1 & - & 99 \\ 69.2 & 1 & 2 & 7 & 13 & 38 & 48 & 33 & 18 & 5 & 2 & - & 167 \\ 68.2 & 1 & - & 7 & 14 & 28 & 34 & 20 & 12 & 3 & 1 & - & 120 \\ 67.2 & 2 & 5 & 11 & 17 & 38 & 31 & 27 & 3 & 4 & - & - & 138 \\ 66.2 & 2 & 5 & 11 & 17 & 36 & 25 & 17 & 1 & 3 & - & - & 117 \\ 65.2 & 1 & 1 & 7 & 2 & 15 & 16 & 4 & 1 & 1 & - & - & 48 \\ 64.2 & 4 & 4 & 5 & 5 & 14 & 11 & 16 & - & - & - & - & 59 \\ 63.2 & 2 & 4 & 9 & 3 & 5 & 7 & 1 & 1 & - & - & - & 32 \\ 62.2 & - & 1 & - & 3 & 3 & - & - & - & - & - & - & 7 \\ <61.2 & 1 & 1 & 1 & - & - & 1 & - & 1 & - & - & - & 5 \\ \hline \text{Totals }& 14 & 23 & 66 & 78 & 211 & 219 & 183 & 68 & 43 & 19 & 4 & 928 \\ \hline \end{array}} \]

B. Supporting R Code

# Read Galton's height data (the commented line is the local-file alternative)
#heights <- read.csv("CSVData\\galton_height.csv",header = TRUE)
heights <- read.csv("https://assets.datacamp.com/production/repositories/2610/datasets/c85ede6c205d22049e766bd08956b225c576255b/galton_height.csv", header = TRUE)
str(heights)
head(heights)

# Reformat Data Set
# If the file carries the raw column names CHILDC and PARENTC, copy them to the
# names used below; otherwise child_ht and parent_ht are assumed to exist already
if (all(c("CHILDC", "PARENTC") %in% names(heights))) {
  heights$child_ht <- heights$CHILDC
  heights$parent_ht <- heights$PARENTC
}
heights2 <- heights[c("child_ht","parent_ht")]

# Scatter plot of child height against parent height; jitter() separates
# points that share the same recorded value
plot(jitter(heights$parent_ht),jitter(heights$child_ht), ylim = c(60,80), xlim = c(60,80),
     ylab = "height of child", xlab = "height of parents")
# Solid line: fitted least-squares regression line
abline(lm(heights$child_ht~heights$parent_ht))
# Dashed red line: 45-degree reference line (child height equals parent height)
abline(0,1,col = "red", lty=2)

summary(lm(heights$child_ht~heights$parent_ht))

Call:
lm(formula = heights$child_ht ~ heights$parent_ht)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2577 -1.4280  0.1323  1.5720  5.7918 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       25.84856    2.69009   9.609   <2e-16 ***
heights$parent_ht  0.60992    0.03882  15.710   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.26 on 926 degrees of freedom
Multiple R-squared:  0.2104,    Adjusted R-squared:  0.2096 
F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16
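
As a small illustration of how the output above might be read, the sketch below plugs the estimated intercept and slope into the fitted line. The coefficients are copied from the printed summary; the helper `predict_child_ht` and the three parent heights are hypothetical names and values chosen only for illustration, not part of the original analysis.

```r
# Coefficients copied from the summary() output above
b0 <- 25.84856   # estimated intercept
b1 <- 0.60992    # estimated slope on parent height

# Hypothetical helper: fitted child height (inches) for a given parent height
predict_child_ht <- function(parent_ht) b0 + b1 * parent_ht

# Illustrative parent heights (assumed values, not read from the data file)
parent_ht <- c(64, 68, 72)
fitted_ht <- predict_child_ht(parent_ht)
print(data.frame(parent_ht, fitted_child_ht = round(fitted_ht, 2)))

# Height at which the fitted line crosses the 45-degree line (child = parent)
crossing <- b0 / (1 - b1)
print(round(crossing, 2))

# The reported t value is the estimate divided by its standard error
t_slope <- b1 / 0.03882
print(round(t_slope, 2))
```

Because the slope is below one, the fitted height for parents of 72 inches is roughly 69.8 inches, while for parents of 64 inches it is roughly 64.9 inches. Fitted child heights are pulled toward the middle relative to parent heights, which is the pattern that the dashed 45-degree line in the plot is meant to highlight.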