Chapter 8 Binary Dependent Variables
8.1 Import Data
taxprep=read.table("TXTData/TaxPrep.txt", sep ="\t", quote = "",header=TRUE)
#taxprep=read.table(choose.files(), header=TRUE, sep="\t")
Data for this study are from the Statistics of Income (SOI) Panel of Individual Returns, a part of the Ernst and Young/University of Michigan Tax Research Database. The SOI Panel represents a simple random sample of unaudited individual income tax returns filed for tax years 1979-1990. The data are compiled from a stratified probability sample of unaudited individual income tax returns, Forms 1040, 1040A and 1040EZ, filed by U.S. taxpayers. The estimates that are obtained from these data are intended to represent all returns filed for the income tax years under review. All returns processed are subjected to sampling except tentative and amended returns.
Variable | Description |
---|---|
MS | is an indicator variable of the taxpayer’s marital status. It is coded one if the taxpayer is married and zero otherwise. |
HH | is an indicator variable, one if the taxpayer is a head of household and zero otherwise. |
DEPEND | is the number of dependents claimed by the taxpayer. |
AGE | is the presence of an indicator for age 65 or over. |
F1040A | is an indicator variable of the taxpayer’s filing type. It is coded one if the taxpayer uses Form 1040A and zero otherwise. |
F1040EZ | is an indicator variable of the taxpayer’s filing type. It is coded one if the taxpayer uses Form 1040EZ and zero otherwise. |
TPI | is the sum of all positive income line items on the return. |
TXRT | is a marginal tax rate. It is computed on TPI less exemptions and the standard deduction. |
MR | is an exogenous marginal tax rate. It is computed on TPI less exemptions and the standard deduction. |
EMP | is an indicator variable, one if Schedule C or F is present and zero otherwise. Self-employed taxpayers have greater need for professional assistance to reduce the reporting risks of doing business. |
PREP | is a variable indicating the presence of a paid preparer. |
TAX | is the tax liability on the return. |
SUBJECT | Subject identifier, 1-258. |
TIME | Time identifier, 1-5. |
LNTAX | is the natural logarithm of the tax liability on the return. |
LNTPI | is the natural logarithm of the sum of all positive income line items on the return. |
8.2 Example: Income Tax Payments and Tax Preparers (page 326)
8.2.1 TABLE 9.2. Means for binary variables
library(Hmisc)
summarize(taxprep$MS, taxprep$PREP, mean)
taxprep$PREP taxprep$MS
1 0 0.5424739
2 1 0.7092084
summarize(taxprep$HH, taxprep$PREP, mean)
taxprep$PREP taxprep$HH
1 0 0.10581222
2 1 0.06623586
summarize(taxprep$AGE, taxprep$PREP, mean)
taxprep$PREP taxprep$AGE
1 0 0.07153502
2 1 0.16478191
summarize(taxprep$EMP, taxprep$PREP, mean)
taxprep$PREP taxprep$EMP
1 0 0.0923994
2 1 0.2116317
Table 9.2 shows that those taxpayers using a professional tax preparer (PREP
= 1) were more likely to be married, not the head of a household, age 65 and over, and self-employed.
8.2.2 TABLE 9.3. Summary stats for other variables
library(nlme)
gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=mean)
DEPEND LNTPI MR
0 2.266766 9.73151 21.98733
1 2.584814 10.05881 25.18821
gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=min)
DEPEND LNTPI MR
0 0 -0.12751332 0
1 0 -0.09166719 0
gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=max)
DEPEND LNTPI MR
0 6 12.04322 50
1 6 13.22203 50
gsummary(taxprep[, c("DEPEND", "LNTPI", "MR")], groups=taxprep$PREP, FUN=sd)
DEPEND LNTPI MR
0 1.300545 1.088713 11.16809
1 1.358360 1.219911 11.53564
Table 9.3 shows that those taxpayers using a professional tax preparer had more dependents, had a larger income, and were in a higher tax bracket.
8.2.3 TABLE 9.4. Frequency tables for some of the binary variables
xtabs(~taxprep$PREP+taxprep$EMP, data=taxprep)
taxprep$EMP
taxprep$PREP 0 1
0 609 62
1 488 131
Table 9.4 provides additional information about the relation between EMP
and PREP
.
8.2.4 DISPLAY 9.1 Fit the logistic distribution function using maximum likelihood
library(Hmisc)
library(rms)
# `rms` is an R package that is a replacement for the `Design` package.
preplogit<-lrm(PREP~LNTPI+MR+EMP, data=taxprep)
preplogit
Logistic Regression Model
lrm(formula = PREP ~ LNTPI + MR + EMP, data = taxprep)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 1290 LR chi2 67.24 R2 0.068 C 0.642
0 671 d.f. 3 g 0.512 Dxy 0.283
1 619 Pr(> chi2) <0.0001 gr 1.668 gamma 0.283
max |deriv| 2e-10 gp 0.121 tau-a 0.141
Brier 0.236
Coef S.E. Wald Z Pr(>|Z|)
Intercept -2.3447 0.7754 -3.02 0.0025
LNTPI 0.1881 0.0940 2.00 0.0455
MR 0.0108 0.0088 1.22 0.2212
EMP 1.0091 0.1693 5.96 <0.0001
# ALTERNATIVE - FIT A GENERALIZED LINEAR MODEL;
prepglm<-glm(PREP~LNTPI+MR+EMP, binomial(link=logit), data=taxprep)
prepglm
Call: glm(formula = PREP ~ LNTPI + MR + EMP, family = binomial(link = logit),
data = taxprep)
Coefficients:
(Intercept) LNTPI MR EMP
-2.34471 0.18811 0.01081 1.00906
Degrees of Freedom: 1289 Total (i.e. Null); 1286 Residual
Null Deviance: 1786
Residual Deviance: 1719 AIC: 1727
Display 9.1 shows a fitted logistic regression model, using LNTPI
, MR
, and EMP
as explanatory variables. The calculations were done using SAS PROC LOGISTIC.
8.3 SECTION 9.2 Random effects nonlinear mixed effects model
library(glmmML)
# nlme can not be used to fit a mixed effects model with responses as binomially distributed
# In R nlme can be used to estimate a mechanistic model of the relationship between response and covariates
# install library glmmML: menu - packages - install package(s) from CRAN - glmmML
# glmmML estimates generalized linear model with random intercepts using Maximum Likelihood
# and numerical integration via Gauss-Hermite quadrature.
prepglmml<-glmmML(PREP~LNTPI+MR+EMP, binomial(link=logit), data=taxprep, cluster=taxprep$SUBJECT)
prepglmml
Call: glmmML(formula = PREP ~ LNTPI + MR + EMP, family = binomial(link = logit), data = taxprep, cluster = taxprep$SUBJECT)
coef se(coef) z Pr(>|z|)
(Intercept) -3.11544 1.43807 -2.1664 0.03030
LNTPI 0.22805 0.16531 1.3795 0.16800
MR 0.01394 0.02116 0.6591 0.51000
EMP 1.79380 0.56817 3.1572 0.00159
Scale parameter in mixing distribution: 4.454 gaussian
Std. Error: 0.1963
LR p-value for H_0: sigma = 0: 7.788e-145
Residual deviance: 1064 on 1285 degrees of freedom AIC: 1074
8.3.1 Generalized linear mixed effects model
# FIT GLMM with multivariate normal random effects, using Penalized Quasi-Likelihood
library(lme4)
prepGLMM<-glmer(PREP~LNTPI+MR+EMP+ (1|SUBJECT), family=binomial(link=logit), data=taxprep)
8.4 SECTION 9.3 Fixed effect model
taxprep$facsub<-factor(taxprep$SUBJECT)
# The fixed - effects model did not converge under maximum likelihood method, because of the `facsub`
# prepfxlogit<-lrm(PREP~LNTPI+MR+EMP+facsub,data=taxprep)
# I assume we can use glm() to fit the model.
prepfxlogit<-glm(PREP~LNTPI+MR+EMP+facsub,family=binomial(link=logit),data=taxprep)
8.5 SECTION 9.4 Marginal model and generalized equation estimation
library(gee)
prepgee1<-gee(PREP ~ LNTPI+MR+EMP, id=SUBJECT, data=taxprep, family=binomial(link=logit), corstr="exchangeable")
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) LNTPI MR EMP
-2.34471453 0.18810526 0.01081409 1.00906337
#gee Results match with SAS results
summary(prepgee1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Exchangeable
Call:
gee(formula = PREP ~ LNTPI + MR + EMP, id = SUBJECT, data = taxprep,
family = binomial(link = logit), corstr = "exchangeable")
Summary of Residuals:
Min 1Q Median 3Q Max
-0.8131251 -0.4480400 -0.2898825 0.5079648 0.9138800
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -2.34471453 0.779479227 -3.008053 1.13139184 -2.0724160
LNTPI 0.18810526 0.094523103 1.990045 0.13685915 1.3744442
MR 0.01081409 0.008886585 1.216901 0.01122493 0.9633996
EMP 1.00906337 0.170162931 5.929983 0.17813257 5.6646764
Estimated Scale Parameter: 1.010469
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 0 0 0 0 0
prepgee2<-gee(PREP ~ LNTPI+MR+EMP, id=SUBJECT, data=taxprep, family=binomial(link=logit), corstr="unstructured") #Results match with SAS results
Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate
(Intercept) LNTPI MR EMP
-2.34471453 0.18810526 0.01081409 1.00906337
summary(prepgee2)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logit
Variance to Mean Relation: Binomial
Correlation Structure: Unstructured
Call:
gee(formula = PREP ~ LNTPI + MR + EMP, id = SUBJECT, data = taxprep,
family = binomial(link = logit), corstr = "unstructured")
Summary of Residuals:
Min 1Q Median 3Q Max
-0.8131251 -0.4480400 -0.2898825 0.5079648 0.9138800
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) -2.34471453 0.779479227 -3.008053 1.13139184 -2.0724160
LNTPI 0.18810526 0.094523103 1.990045 0.13685915 1.3744442
MR 0.01081409 0.008886585 1.216901 0.01122493 0.9633996
EMP 1.00906337 0.170162931 5.929983 0.17813257 5.6646764
Estimated Scale Parameter: 1.010469
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 0 0 0 0 0