Chapter 3 Models with Random Effects

3.1 Import Data

#  "\t"  INDICATES SEPARATED BY TABLES  ;
taxprep  = read.table("TXTData/TaxPrep.txt", sep ="\t", quote = "",header=TRUE)

# taxprep=read.table(choose.files(), header=TRUE, sep="\t")

Data for this study are from the Statistics of Income (SOI) Panel of Individual Returns, a part of the Ernst and Young/University of Michigan Tax Research Database. The SOI Panel represents a simple random sample of unaudited individual income tax returns filed for tax years 1979-1990. The data are compiled from a stratified probability sample of unaudited individual income tax returns, Forms 1040, 1040A and 1040EZ, filed by U.S. taxpayers. The estimates that are obtained from these data are intended to represent all returns filed for the income tax years under review. All returns processed are subjected to sampling except tentative and amended returns.

Variable	Description
MS	is an indicator variable of the taxpayer’s marital status. It is coded one if the taxpayer is married and zero otherwise.
HH	is an indicator variable, one if the taxpayer is a head of household and zero otherwise.
DEPEND	is the number of dependents claimed by the taxpayer.
AGE	is the presence of an indicator for age 65 or over.
F1040A	is an indicator variable of the taxpayer’s filing type. It is coded one if the taxpayer uses Form 1040A and zero otherwise.
F1040EZ	is an indicator variable of the taxpayer’s filing type. It is coded one if the taxpayer uses Form 1040EZ and zero otherwise.
TPI	is the sum of all positive income line items on the return. is a marginal tax rate.
TXRT	is a marginal tax rate。 It is computed on TPI less exemptions and the standard deduction.
MR	is an exogenous marginal tax rate. It is computed on TPI less exemptions and the standard deduction.
EMP	is an indicator variable, one if Schedule C or F is present and zero otherwise. Self-employed taxpayers have greater need for professional assistance to reduce the reporting risks of doing business.
PREP	is a variable indicating the presence of a paid preparer.
TAX	is the tax liability on the return.
SUBJECT	Subject identifier, 1- 258.
TIME	Time identifier, 1-5.
LNTAX	is the natural logarithm of the tax liability on the return.
LNTPI	is the natural logarithm of the sum of all positive income line items on the return.

3.2 Example 3.2: Income Tax Payments (Page 81)

In this section, we study the effects that an individual’s economic and demographic characteristics have on the amount of income tax paid. Specifically, the response of interest is LNTAX, defined as the natural logarithm of the liability on the tax return.

3.2.1 Table 3.2. Averages of binary variables

The binary variables in Table 3.2 indicate that over half the sample is married (MS) and approximately half the sample uses a paid preparer (PREP).

library(nlme)
gsummary(taxprep[, c("MS", "HH", "AGE", "EMP", "PREP")], groups=taxprep$TIME, FUN=mean)

##          MS         HH        AGE       EMP      PREP
## 1 0.5968992 0.08139535 0.08527132 0.1395349 0.4496124
## 2 0.5968992 0.09302326 0.10465116 0.1589147 0.4418605
## 3 0.6240310 0.08527132 0.11240310 0.1550388 0.4844961
## 4 0.6472868 0.08139535 0.13178295 0.1472868 0.5077519
## 5 0.6472868 0.09302326 0.14728682 0.1472868 0.5155039

3.2.2 TABLE 3.3 - Summary statistics for continuous variables

Tables 3.2 and 3.3 describe the basic taxpayer characteristics used in our analysis. The summary statistics for the other nonbinary variables are in Table 3.3.

summary(taxprep[, c("DEPEND", "LNTPI", "MR", "LNTAX")]) #summary does not provid standard deviation

##      DEPEND          LNTPI               MR            LNTAX       
##  Min.   :0.000   Min.   :-0.1275   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.: 9.4467   1st Qu.:15.00   1st Qu.: 6.645  
##  Median :2.000   Median :10.0506   Median :22.00   Median : 7.701  
##  Mean   :2.419   Mean   : 9.8886   Mean   :23.52   Mean   : 6.880  
##  3rd Qu.:3.000   3rd Qu.:10.5320   3rd Qu.:33.00   3rd Qu.: 8.420  
##  Max.   :6.000   Max.   :13.2220   Max.   :50.00   Max.   :11.860

Standard deviation of some variables.

#Standard Deviation
var<-var(taxprep[, c("DEPEND", "LNTPI", "MR", "LNTAX")])
sqrt(diag(var))

##    DEPEND     LNTPI        MR     LNTAX 
##  1.337562  1.164625 11.453800  2.694961

3.2.3 TABLE 3.4 - Averages by level of binary explanatory variable

To explore the relationship between each indicator variable and logarithmic tax, Table 3.4 presents the average logarithmic tax liability by level of indicator variable. This table shows that married filers pay greater tax, head-of-household filers pay less tax, taxpayers 65 or over pay less, taxpayers with self-employed income pay less, and taxpayers who use a professional tax preparer pay more.

library(Hmisc)
summarize(taxprep$LNTAX, taxprep$MS, mean)

##   taxprep$MS taxprep$LNTAX
## 1          0      5.973412
## 2          1      7.429948

summarize(taxprep$LNTAX, taxprep$HH, mean)

##   taxprep$HH taxprep$LNTAX
## 1          0      7.013197
## 2          1      5.479947

summarize(taxprep$LNTAX, taxprep$AGE, mean)

##   taxprep$AGE taxprep$LNTAX
## 1           0      6.939184
## 2           1      6.430867

summarize(taxprep$LNTAX, taxprep$EMP, mean)

##   taxprep$EMP taxprep$LNTAX
## 1           0      6.982682
## 2           1      6.296879

summarize(taxprep$LNTAX, taxprep$PREP, mean)

##   taxprep$PREP taxprep$LNTAX
## 1            0      6.623648
## 2            1      7.158049

# TABLE counts of BINARY EXPLANATORY VARIABLE
# CREATE CATEGORICAL VARIABLE
taxprep$MSF=taxprep$MS
taxprep$HHF=taxprep$HH
taxprep$AGEF=taxprep$AGE
taxprep$EMPF=taxprep$EMP
taxprep$PREPF=taxprep$PREP
table(taxprep$MSF)

## 
##   0   1 
## 487 803

table(taxprep$HHF)

## 
##    0    1 
## 1178  112

table(taxprep$AGEF)

## 
##    0    1 
## 1140  150

table(taxprep$EMPF)

## 
##    0    1 
## 1097  193

table(taxprep$PREPF)

## 
##   0   1 
## 671 619

3.2.4 TABLE 3.5 - Correlation for continous variables

Table 3.5 summarizes basic relations among logarithmic tax and the other nonbinary explanatory variables. Both LNTPI and MR are strongly correlated with logarithmic tax whereas the relationship between DEPEND and logarithmic tax is positive, yet weaker. Table 3.5 also shows that LNTPI and MR are strongly positively correlated.

cor(taxprep[,c("LNTAX", "DEPEND", "LNTPI", "MR")])

##             LNTAX     DEPEND     LNTPI        MR
## LNTAX  1.00000000 0.08519899 0.7176476 0.7466574
## DEPEND 0.08519899 1.00000000 0.2777381 0.1275044
## LNTPI  0.71764760 0.27773808 1.0000000 0.7958007
## MR     0.74665744 0.12750438 0.7958007 1.0000000

3.2.5 FIGURE 3.2: Basic added variable plot (y vs. x)

Moreover, both the mean and median marginal tax rates (MR) are decreasing, although mean and median tax liabilities (LNTAX) are stable (see Figure 3.2). These results are consistent with congressional efforts to reduce rates and expand the tax base through broadening the definition of income and eliminating deductions.

#CREATE CATEGORICAL VARIABLE
taxprep$SUBJECT1=factor(taxprep$SUBJECT)
lntax.lm = lm(LNTAX ~ SUBJECT1, data=taxprep)
lntpi.lm = lm(LNTPI ~ SUBJECT1, data=taxprep)
taxprep$Resid1=residuals(lntax.lm)
taxprep$Resid2=residuals(lntpi.lm)
plot(Resid1 ~ Resid2, data=taxprep, xaxt="n", yaxt="n", ylab="", xlab="")
axis(2, at=seq(-8, 7, by=2), las=1, font=10, cex=0.005, tck=0.01)
axis(2, at=seq(-8, 8, by=0.2), lab=F, tck=0.005)
axis(1, at=seq(-8,4, by=2), font=10, cex=0.005, tck=0.01)
axis(1, at=seq(-8, 4, by=0.2), lab=F, tck=0.005)
mtext("Residuals from LNTAX", side=2, line=-7, at=7.5, font=10, cex=1, las=1)
mtext("Residuals from LNTPI", side=1, line=3, at=-2, font=10, cex=1)

3.2.6 Five ways of doing the pooling model

#1
summary(lm(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR, data=taxprep))

## 
## Call:
## lm(formula = LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + 
##     MR, data = taxprep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6371 -0.4145  0.3875  0.9473 10.2893 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4303158  0.5647315  -6.074 1.64e-09 ***
## MS           0.0417333  0.1546948   0.270 0.787374    
## HH          -0.6969164  0.1924733  -3.621 0.000305 ***
## AGE          0.0121770  0.1562682   0.078 0.937901    
## EMP         -0.6216132  0.1351459  -4.600 4.65e-06 ***
## PREP        -0.0004085  0.0976086  -0.004 0.996661    
## LNTPI        0.8332396  0.0715145  11.651  < 2e-16 ***
## DEPEND      -0.1395622  0.0496814  -2.809 0.005043 ** 
## MR           0.1077579  0.0069511  15.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.676 on 1281 degrees of freedom
## Multiple R-squared:  0.6155, Adjusted R-squared:  0.6131 
## F-statistic: 256.3 on 8 and 1281 DF,  p-value: < 2.2e-16

#2
summary(glm(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR,data=taxprep, 
              family=gaussian(link=identity)) )

## 
## Call:
## glm(formula = LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + 
##     MR, family = gaussian(link = identity), data = taxprep)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.6371  -0.4145   0.3875   0.9473  10.2893  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4303158  0.5647315  -6.074 1.64e-09 ***
## MS           0.0417333  0.1546948   0.270 0.787374    
## HH          -0.6969164  0.1924733  -3.621 0.000305 ***
## AGE          0.0121770  0.1562682   0.078 0.937901    
## EMP         -0.6216132  0.1351459  -4.600 4.65e-06 ***
## PREP        -0.0004085  0.0976086  -0.004 0.996661    
## LNTPI        0.8332396  0.0715145  11.651  < 2e-16 ***
## DEPEND      -0.1395622  0.0496814  -2.809 0.005043 ** 
## MR           0.1077579  0.0069511  15.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.810093)
## 
##     Null deviance: 9361.8  on 1289  degrees of freedom
## Residual deviance: 3599.7  on 1281  degrees of freedom
## AIC: 5004.7
## 
## Number of Fisher Scoring iterations: 2

#3
library(plm)
summary(plm(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR, data=taxprep, 
       index=c("SUBJECT","TIME"),model="pooling"))

## Pooling Model
## 
## Call:
## plm(formula = LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + 
##     MR, data = taxprep, model = "pooling", index = c("SUBJECT", 
##     "TIME"))
## 
## Balanced Panel: n = 258, T = 5, N = 1290
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -9.63705 -0.41454  0.38751  0.94730 10.28935 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept) -3.43031582  0.56473154 -6.0742 1.640e-09 ***
## MS           0.04173327  0.15469480  0.2698 0.7873744    
## HH          -0.69691637  0.19247327 -3.6208 0.0003051 ***
## AGE          0.01217701  0.15626823  0.0779 0.9379009    
## EMP         -0.62161319  0.13514593 -4.5996 4.653e-06 ***
## PREP        -0.00040854  0.09760862 -0.0042 0.9966612    
## LNTPI        0.83323956  0.07151445 11.6513 < 2.2e-16 ***
## DEPEND      -0.13956224  0.04968138 -2.8091 0.0050428 ** 
## MR           0.10775788  0.00695106 15.5024 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    9361.8
## Residual Sum of Squares: 3599.7
## R-Squared:      0.61549
## Adj. R-Squared: 0.61308
## F-statistic: 256.31 on 8 and 1281 DF, p-value: < 2.22e-16

#4
library(nlme)
summary(gls(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR,data=taxprep))

## Generalized least squares fit by REML
##   Model: LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + MR 
##   Data: taxprep 
##        AIC      BIC    logLik
##   5037.135 5088.688 -2508.567
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) -3.430316 0.5647315 -6.074242  0.0000
## MS           0.041733 0.1546948  0.269778  0.7874
## HH          -0.696916 0.1924733 -3.620848  0.0003
## AGE          0.012177 0.1562682  0.077924  0.9379
## EMP         -0.621613 0.1351459 -4.599570  0.0000
## PREP        -0.000409 0.0976086 -0.004185  0.9967
## LNTPI        0.833240 0.0715145 11.651345  0.0000
## DEPEND      -0.139562 0.0496814 -2.809146  0.0050
## MR           0.107758 0.0069511 15.502370  0.0000
## 
##  Correlation: 
##        (Intr) MS     HH     AGE    EMP    PREP   LNTPI  DEPEND
## MS      0.225                                                 
## HH      0.061  0.485                                          
## AGE    -0.011 -0.218 -0.045                                   
## EMP    -0.119 -0.093  0.025 -0.038                            
## PREP   -0.024 -0.024  0.013 -0.158 -0.141                     
## LNTPI  -0.969 -0.208 -0.100 -0.062  0.107 -0.008              
## DEPEND -0.015 -0.644 -0.316  0.292 -0.037 -0.064 -0.100       
## MR      0.654 -0.002  0.066  0.162 -0.049 -0.076 -0.778  0.151
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -5.7488891 -0.2472916  0.2311642  0.5651018  6.1380086 
## 
## Residual standard error: 1.676333 
## Degrees of freedom: 1290 total; 1281 residual

#install.packages("gee")
#5
library(gee)
summary(gee(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR, id=SUBJECT, data=taxprep, 
              family=gaussian(link=identity), corstr="independence") )

## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27

## running glm to get initial regression estimate

##   (Intercept)            MS            HH           AGE           EMP 
## -3.4303158248  0.0417332677 -0.6969163659  0.0121770081 -0.6216131914 
##          PREP         LNTPI        DEPEND            MR 
## -0.0004085366  0.8332395616 -0.1395622388  0.1077578820

## 
##  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
##  gee S-function, version 4.13 modified 98/01/27 (1998) 
## 
## Model:
##  Link:                      Identity 
##  Variance to Mean Relation: Gaussian 
##  Correlation Structure:     Independent 
## 
## Call:
## gee(formula = LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + 
##     MR, id = SUBJECT, data = taxprep, family = gaussian(link = identity), 
##     corstr = "independence")
## 
## Summary of Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.6370540 -0.4145432  0.3875082  0.9472989 10.2893480 
## 
## 
## Coefficients:
##                  Estimate  Naive S.E.      Naive z Robust S.E.
## (Intercept) -3.4303158248 0.564731536 -6.074241663  1.54222163
## MS           0.0417332677 0.154694799  0.269778091  0.14854310
## HH          -0.6969163659 0.192473270 -3.620847529  0.23465427
## AGE          0.0121770081 0.156268232  0.077923759  0.18216326
## EMP         -0.6216131914 0.135145927 -4.599570312  0.17738436
## PREP        -0.0004085366 0.097608623 -0.004185456  0.09723974
## LNTPI        0.8332395616 0.071514453 11.651345042  0.19444654
## DEPEND      -0.1395622388 0.049681383 -2.809145616  0.05180461
## MR           0.1077578820 0.006951058 15.502370457  0.01417988
##                 Robust z
## (Intercept) -2.224269044
## MS           0.280950564
## HH          -2.969970915
## AGE          0.066846673
## EMP         -3.504329265
## PREP        -0.004201333
## LNTPI        4.285185792
## DEPEND      -2.694012121
## MR           7.599350059
## 
## Estimated Scale Parameter:  2.810093
## Number of Iterations:  1
## 
## Working Correlation
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

3.2.7 DISPLAY 3.1 - Error components model

The estimated model appears in Display 3.1, from a fit using the statistical package SAS. Display 3.1 shows that HH, EMP, LNTPI, and MR are statistically significant variables that affect LNTAX. Somewhat surprisingly, the PREP variable was not statistically significant.

random<-lme(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR, data=taxprep, random=~1|SUBJECT, method="ML")
## NOTE* THE DEFAULT METHOD IN lme IS "REML"
summary(random)

## Linear mixed-effects model fit by maximum likelihood
##  Data: taxprep 
##        AIC      BIC    logLik
##   4813.255 4870.041 -2395.627
## 
## Random effects:
##  Formula: ~1 | SUBJECT
##         (Intercept) Residual
## StdDev:   0.9602161 1.368896
## 
## Fixed effects: LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + MR 
##                  Value Std.Error   DF   t-value p-value
## (Intercept) -2.9603371 0.5705536 1024 -5.188534  0.0000
## MS           0.0373000 0.1824839 1024  0.204402  0.8381
## HH          -0.6889876 0.2320057 1024 -2.969702  0.0031
## AGE          0.0207431 0.2000035 1024  0.103713  0.9174
## EMP         -0.5048035 0.1679848 1024 -3.005054  0.0027
## PREP        -0.0217036 0.1175229 1024 -0.184675  0.8535
## LNTPI        0.7604058 0.0699692 1024 10.867728  0.0000
## DEPEND      -0.1127475 0.0592818 1024 -1.901891  0.0575
## MR           0.1153752 0.0073142 1024 15.774213  0.0000
##  Correlation: 
##        (Intr) MS     HH     AGE    EMP    PREP   LNTPI  DEPEND
## MS      0.176                                                 
## HH      0.030  0.419                                          
## AGE    -0.043 -0.167 -0.023                                   
## EMP    -0.116 -0.069  0.024 -0.030                            
## PREP   -0.035 -0.045  0.004 -0.115 -0.112                     
## LNTPI  -0.948 -0.180 -0.081 -0.043  0.099 -0.016              
## DEPEND -0.074 -0.604 -0.269  0.224 -0.038 -0.039 -0.068       
## MR      0.522 -0.020  0.055  0.149 -0.041 -0.051 -0.698  0.102
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -5.83483692 -0.21263981  0.09677632  0.39814646  5.79731648 
## 
## Number of Observations: 1290
## Number of Groups: 258

3.3 Section 3.3 - Random coefficients model

If we include all random coefficients of eight covariates (MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR), then the number of cofficients exceeds the number of observations and it leads to warnings. If we only include a few of covariates, then lmer() can be used to estimate the random coefficients model. Here is an example of including (1+MS+HH|SUBJECT) as the random part.

#randomcoeff<-lme(LNTAX~MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR, data=taxprep, random=~1+MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR|SUBJECT, method="ML") 

# NOTE*:It takes forever to run the estimation, in the end a warning messaged was given. 
# No estimation result was produced. 
# The reason is due to the fact that in SAS, the method of mivque0 allows estimation for this model, in R this method is not readily available to be coded.
library(lme4)
#randomcoeff<-lmer(LNTAX ~ MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR+(1+MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR|SUBJECT),data=taxprep)

randomcoeff<-lmer(LNTAX ~ MS+HH+AGE+EMP+PREP+LNTPI+DEPEND+MR+(1+MS+HH|SUBJECT),data=taxprep)

## boundary (singular) fit: see ?isSingular

summary(randomcoeff)

## Linear mixed model fit by REML ['lmerMod']
## Formula: LNTAX ~ MS + HH + AGE + EMP + PREP + LNTPI + DEPEND + MR + (1 +  
##     MS + HH | SUBJECT)
##    Data: taxprep
## 
## REML criterion at convergence: 4796
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.9447 -0.2058  0.0971  0.3869  5.7321 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr       
##  SUBJECT  (Intercept) 0.94837  0.9738              
##           MS          0.01595  0.1263   -1.00      
##           HH          1.89191  1.3755    0.13 -0.13
##  Residual             1.84028  1.3566              
## Number of obs: 1290, groups:  SUBJECT, 258
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) -2.751324   0.564033  -4.878
## MS           0.092114   0.178701   0.515
## HH          -0.615376   0.340976  -1.805
## AGE         -0.008254   0.192330  -0.043
## EMP         -0.545085   0.161306  -3.379
## PREP        -0.056847   0.114553  -0.496
## LNTPI        0.747633   0.069049  10.828
## DEPEND      -0.122449   0.056746  -2.158
## MR           0.112551   0.007152  15.737
## 
## Correlation of Fixed Effects:
##        (Intr) MS     HH     AGE    EMP    PREP   LNTPI  DEPEND
## MS      0.169                                                 
## HH      0.023  0.278                                          
## AGE    -0.043 -0.170 -0.016                                   
## EMP    -0.119 -0.073  0.009 -0.032                            
## PREP   -0.037 -0.047  0.002 -0.111 -0.111                     
## LNTPI  -0.947 -0.185 -0.057 -0.044  0.101 -0.015              
## DEPEND -0.072 -0.595 -0.168  0.235 -0.035 -0.036 -0.067       
## MR      0.525 -0.018  0.038  0.154 -0.039 -0.051 -0.700  0.107
## convergence code: 0
## boundary (singular) fit: see ?isSingular