Chapter 8 Binary Dependent Variables

8.1 Import Data

taxprep=read.table("TXTData/TaxPrep.txt", sep ="\t", quote = "",header=TRUE)

#taxprep=read.table(choose.files(), header=TRUE, sep="\t")

Data for this study are from the Statistics of Income (SOI) Panel of Individual Returns, a part of the Ernst and Young/University of Michigan Tax Research Database. The SOI Panel represents a simple random sample of unaudited individual income tax returns filed for tax years 1979-1990. The data are compiled from a stratified probability sample of unaudited individual income tax returns, Forms 1040, 1040A and 1040EZ, filed by U.S. taxpayers. The estimates that are obtained from these data are intended to represent all returns filed for the income tax years under review. All returns processed are subjected to sampling except tentative and amended returns.

Variable	Description
MS	is an indicator variable of the taxpayer’s marital status. It is coded one if the taxpayer is married and zero otherwise.
HH	is an indicator variable, one if the taxpayer is a head of household and zero otherwise.
DEPEND	is the number of dependents claimed by the taxpayer.
AGE	is the presence of an indicator for age 65 or over.
F1040A	is an indicator variable of the taxpayer’s filing type. It is coded one if the taxpayer uses Form 1040A and zero otherwise.
F1040EZ	is an indicator variable of the taxpayer’s filing type. It is coded one if the taxpayer uses Form 1040EZ and zero otherwise.
TPI	is the sum of all positive income line items on the return.
TXRT	is a marginal tax rate. It is computed on TPI less exemptions and the standard deduction.
MR	is an exogenous marginal tax rate. It is computed on TPI less exemptions and the standard deduction.
EMP	is an indicator variable, one if Schedule C or F is present and zero otherwise. Self-employed taxpayers have greater need for professional assistance to reduce the reporting risks of doing business.
PREP	is a variable indicating the presence of a paid preparer.
TAX	is the tax liability on the return.
SUBJECT	Subject identifier, 1-258.
TIME	Time identifier, 1-5.
LNTAX	is the natural logarithm of the tax liability on the return.
LNTPI	is the natural logarithm of the sum of all positive income line items on the return.

8.2 Example: Income Tax Payments and Tax Preparers (page 326)

8.2.1 TABLE 9.2. Means for binary variables

library(Hmisc)
summarize(taxprep$MS, taxprep$PREP, mean)

  taxprep$PREP taxprep$MS
1            0  0.5424739
2            1  0.7092084

summarize(taxprep$HH, taxprep$PREP, mean)

  taxprep$PREP taxprep$HH
1            0 0.10581222
2            1 0.06623586

summarize(taxprep$AGE, taxprep$PREP, mean)

  taxprep$PREP taxprep$AGE
1            0  0.07153502
2            1  0.16478191

summarize(taxprep$EMP, taxprep$PREP, mean)

  taxprep$PREP taxprep$EMP
1            0   0.0923994
2            1   0.2116317

Table 9.2 shows that those taxpayers using a professional tax preparer (PREP = 1) were more likely to be married, not the head of a household, age 65 and over, and self-employed.

8.2.2 TABLE 9.3. Summary stats for other variables

library(nlme)
gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=mean)

    DEPEND    LNTPI       MR
0 2.266766  9.73151 21.98733
1 2.584814 10.05881 25.18821

gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=min)

  DEPEND       LNTPI MR
0      0 -0.12751332  0
1      0 -0.09166719  0

gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=max)

  DEPEND    LNTPI MR
0      6 12.04322 50
1      6 13.22203 50

gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=sd)

    DEPEND    LNTPI       MR
0 1.300545 1.088713 11.16809
1 1.358360 1.219911 11.53564

Table 9.3 shows that those taxpayers using a professional tax preparer had more dependents, had a larger income, and were in a higher tax bracket.

8.2.3 TABLE 9.4. Frequency tables for some of the binary variables

xtabs(~taxprep$PREP+taxprep$EMP, data=taxprep)

            taxprep$EMP
taxprep$PREP   0   1
           0 609  62
           1 488 131

Table 9.4 provides additional information about the relation between EMP and PREP.

8.2.4 DISPLAY 9.1 Fit the logistic distribution function using maximum likelihood

library(Hmisc)
library(rms)
# `rms` is an R package that is a replacement for the `Design` package.
preplogit<-lrm(PREP~LNTPI+MR+EMP, data=taxprep) 
preplogit

Logistic Regression Model
 
 lrm(formula = PREP ~ LNTPI + MR + EMP, data = taxprep)
 
                       Model Likelihood     Discrimination    Rank Discrim.    
                          Ratio Test           Indexes           Indexes       
 Obs          1290    LR chi2      67.24    R2       0.068    C       0.642    
  0            671    d.f.             3    g        0.512    Dxy     0.283    
  1            619    Pr(> chi2) <0.0001    gr       1.668    gamma   0.283    
 max |deriv| 2e-10                          gp       0.121    tau-a   0.141    
                                            Brier    0.236                     
 
           Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept -2.3447 0.7754 -3.02  0.0025  
 LNTPI      0.1881 0.0940  2.00  0.0455  
 MR         0.0108 0.0088  1.22  0.2212  
 EMP        1.0091 0.1693  5.96  <0.0001

# ALTERNATIVE - FIT A GENERALIZED LINEAR MODEL;
prepglm<-glm(PREP~LNTPI+MR+EMP, binomial(link=logit), data=taxprep)
prepglm


Call:  glm(formula = PREP ~ LNTPI + MR + EMP, family = binomial(link = logit), 
    data = taxprep)

Coefficients:
(Intercept)        LNTPI           MR          EMP  
   -2.34471      0.18811      0.01081      1.00906  

Degrees of Freedom: 1289 Total (i.e. Null);  1286 Residual
Null Deviance:      1786 
Residual Deviance: 1719     AIC: 1727

Display 9.1 shows a fitted logistic regression model, using LNTPI, MR, and EMP as explanatory variables. The calculations were done using SAS PROC LOGISTIC.

8.3 SECTION 9.2 Random effects nonlinear mixed effects model

library(glmmML) 
# nlme can not be used to fit a mixed effects model with responses as binomially distributed 
# In R nlme can be used to estimate a mechanistic model of the relationship between response and covariates
# install library glmmML: menu - packages - install package(s) from CRAN - glmmML
# glmmML estimates generalized linear model with random intercepts using Maximum Likelihood 
# and numerical integration via Gauss-Hermite quadrature.
prepglmml<-glmmML(PREP~LNTPI+MR+EMP, binomial(link=logit), data=taxprep, cluster=taxprep$SUBJECT)
prepglmml


Call:  glmmML(formula = PREP ~ LNTPI + MR + EMP, family = binomial(link = logit),      data = taxprep, cluster = taxprep$SUBJECT) 

                coef se(coef)       z Pr(>|z|)
(Intercept) -3.11544  1.43807 -2.1664  0.03030
LNTPI        0.22805  0.16531  1.3795  0.16800
MR           0.01394  0.02116  0.6591  0.51000
EMP          1.79380  0.56817  3.1572  0.00159

Scale parameter in mixing distribution:  4.454 gaussian 
Std. Error:                              0.1963 

        LR p-value for H_0: sigma = 0:  7.788e-145 

Residual deviance: 1064 on 1285 degrees of freedom  AIC: 1074

8.3.1 Generalized linear mixed effects model

# FIT GLMM with multivariate normal random effects, using Penalized Quasi-Likelihood
library(lme4)
prepGLMM<-glmer(PREP~LNTPI+MR+EMP+ (1|SUBJECT), family=binomial(link=logit), data=taxprep)

8.4 SECTION 9.3 Fixed effect model

taxprep$facsub<-factor(taxprep$SUBJECT)
# The fixed - effects model did not converge under maximum likelihood method, because of the `facsub`
# prepfxlogit<-lrm(PREP~LNTPI+MR+EMP+facsub,data=taxprep)
# I assume we can use glm() to fit the model.
prepfxlogit<-glm(PREP~LNTPI+MR+EMP+facsub,family=binomial(link=logit),data=taxprep)

8.5 SECTION 9.4 Marginal model and generalized equation estimation

library(gee)
prepgee1<-gee(PREP ~ LNTPI+MR+EMP, id=SUBJECT, data=taxprep, family=binomial(link=logit), corstr="exchangeable")

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27

running glm to get initial regression estimate

(Intercept)       LNTPI          MR         EMP 
-2.34471453  0.18810526  0.01081409  1.00906337

#gee Results match with SAS results
summary(prepgee1)


 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Exchangeable 

Call:
gee(formula = PREP ~ LNTPI + MR + EMP, id = SUBJECT, data = taxprep, 
    family = binomial(link = logit), corstr = "exchangeable")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-0.8131251 -0.4480400 -0.2898825  0.5079648  0.9138800 


Coefficients:
               Estimate  Naive S.E.   Naive z Robust S.E.   Robust z
(Intercept) -2.34471453 0.779479227 -3.008053  1.13139184 -2.0724160
LNTPI        0.18810526 0.094523103  1.990045  0.13685915  1.3744442
MR           0.01081409 0.008886585  1.216901  0.01122493  0.9633996
EMP          1.00906337 0.170162931  5.929983  0.17813257  5.6646764

Estimated Scale Parameter:  1.010469
Number of Iterations:  1

Working Correlation
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0

prepgee2<-gee(PREP ~ LNTPI+MR+EMP, id=SUBJECT, data=taxprep, family=binomial(link=logit), corstr="unstructured") #Results match with SAS results

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate

(Intercept)       LNTPI          MR         EMP 
-2.34471453  0.18810526  0.01081409  1.00906337

summary(prepgee2)


 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Unstructured 

Call:
gee(formula = PREP ~ LNTPI + MR + EMP, id = SUBJECT, data = taxprep, 
    family = binomial(link = logit), corstr = "unstructured")

Summary of Residuals:
       Min         1Q     Median         3Q        Max 
-0.8131251 -0.4480400 -0.2898825  0.5079648  0.9138800 


Coefficients:
               Estimate  Naive S.E.   Naive z Robust S.E.   Robust z
(Intercept) -2.34471453 0.779479227 -3.008053  1.13139184 -2.0724160
LNTPI        0.18810526 0.094523103  1.990045  0.13685915  1.3744442
MR           0.01081409 0.008886585  1.216901  0.01122493  0.9633996
EMP          1.00906337 0.170162931  5.929983  0.17813257  5.6646764

Estimated Scale Parameter:  1.010469
Number of Iterations:  1

Working Correlation
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0