Qualitative Information
POLSCI 630: Probability and Basic Regression
March 18, 2025
Representation
We can represent qualitative information using numeric codes that have no quantitative meaning
Binary variables, e.g., gender (male or female), coded 0 or 1
Multicategory discrete variables, e.g., US region (South, West, Northeast, Midwest), coded 1, 2, 3, or 4
Dummy variables
It is typical in regression applications to represent a categorical variable using a series of \(J-1\) “dummy” variables, where \(J\) is the number of categories; each dummy indicates membership (or not) in a particular category of the variable
For US census region (\(J = 4\)), we could create 3 dummies, leaving West as the omitted reference category:
South: 1 if South, 0 otherwise
Northeast: 1 if Northeast, 0 otherwise
Midwest: 1 if Midwest, 0 otherwise
0/1 coding is technically arbitrary, but easy to interpret
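As a sketch, such dummies can be built by hand or left to `lm()`; the data frame `df` and its codes below are illustrative assumptions, not from the slides:

```r
# illustrative data: region coded 1=South, 2=West, 3=Northeast, 4=Midwest
df <- data.frame(region = c(1, 2, 3, 4, 1, 3))

# by hand: J - 1 = 3 dummies, with West (2) as the omitted reference
df$south     <- as.numeric(df$region == 1)
df$northeast <- as.numeric(df$region == 3)
df$midwest   <- as.numeric(df$region == 4)

# equivalently, lm() builds the dummies automatically from a factor:
# lm(y ~ as.factor(region), data = df)
```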
Example, gender
Call:
lm(formula = econ_mean ~ male, data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.38302 -0.13302 -0.00802 0.12028 0.66194
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.338056 0.003831 88.234 < 2e-16 ***
male 0.044960 0.005629 7.987 1.73e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1925 on 4702 degrees of freedom
(44 observations deleted due to missingness)
Multiple R-squared: 0.01338, Adjusted R-squared: 0.01317
F-statistic: 63.79 on 1 and 4702 DF, p-value: 1.729e-15
What’s the intercept here?
Example, gender
Call:
lm(formula = econ_mean ~ age_scale + I(age_scale^2) + male +
educ_scale + income_scale, data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.43693 -0.13101 -0.00162 0.11975 0.69088
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.322067 0.011080 29.068 < 2e-16 ***
age_scale -0.158514 0.049534 -3.200 0.00138 **
I(age_scale^2) 0.119425 0.056702 2.106 0.03524 *
male 0.033077 0.005582 5.926 3.33e-09 ***
educ_scale 0.021022 0.010864 1.935 0.05304 .
income_scale 0.120404 0.010602 11.356 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1882 on 4684 degrees of freedom
(58 observations deleted due to missingness)
Multiple R-squared: 0.05823, Adjusted R-squared: 0.05723
F-statistic: 57.93 on 5 and 4684 DF, p-value: < 2.2e-16
Example, multiarm experiment
In this experiment, there are 2 factors:
cues: 0=No party cues, 1=Unpolarized party cues, 2=Polarized party cues
decision: 0=respondent disagrees w/ SCOTUS decision, 1=respondent agrees w/ SCOTUS decision
                 Decision direction
Cues condition       0      1
             0     242    111
             1     248    112
             2     257    106
The outcome, narcurb01, represents attitudes toward reforming the Supreme Court.
Example, multiarm experiment
# estimate model
m2a <- lm(narcurb01 ~ as.factor(cues), data = yougov)
summary(m2a)
Call:
lm(formula = narcurb01 ~ as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.42403 -0.22403 0.01218 0.11218 0.61218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.38782 0.01479 26.220 <2e-16 ***
as.factor(cues)1 0.01829 0.02082 0.879 0.3797
as.factor(cues)2 0.03621 0.02079 1.742 0.0818 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2779 on 1072 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.002823, Adjusted R-squared: 0.0009629
F-statistic: 1.518 on 2 and 1072 DF, p-value: 0.2197
What if we remove the intercept?
# estimate model
m2b <- lm(narcurb01 ~ 0 + as.factor(cues),
          data = yougov)
summary(m2b)
Call:
lm(formula = narcurb01 ~ 0 + as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.42403 -0.22403 0.01218 0.11218 0.61218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
as.factor(cues)0 0.38782 0.01479 26.22 <2e-16 ***
as.factor(cues)1 0.40611 0.01465 27.73 <2e-16 ***
as.factor(cues)2 0.42403 0.01461 29.03 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2779 on 1072 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.682, Adjusted R-squared: 0.6811
F-statistic: 766.4 on 3 and 1072 DF, p-value: < 2.2e-16
aggregate(yougov$narcurb01,
          by = list(yougov$cues),
          FUN = mean, na.rm = TRUE)
Group.1 x
1 0 0.3878187
2 1 0.4061111
3 2 0.4240332
What does \(p \approx 0\) mean in this context?
Multiple factors
# estimate model
m2c <- lm(narcurb01 ~ decision + as.factor(cues), data = yougov)
summary(m2c)
Call:
lm(formula = narcurb01 ~ decision + as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.46417 -0.21169 0.03583 0.16908 0.70615
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.43092 0.01547 27.851 < 2e-16 ***
decision -0.13706 0.01792 -7.648 4.52e-14 ***
as.factor(cues)1 0.01784 0.02028 0.879 0.379
as.factor(cues)2 0.03325 0.02025 1.642 0.101
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2707 on 1071 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.05447, Adjusted R-squared: 0.05182
F-statistic: 20.56 on 3 and 1071 DF, p-value: 5.851e-13
Multiple factors w/ interaction
# estimate model
m2d <- lm(narcurb01 ~ decision * as.factor(cues), data = yougov)
summary(m2d)
Call:
lm(formula = narcurb01 ~ decision * as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.47695 -0.19910 0.02305 0.16429 0.70377
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.428512 0.017393 24.638 < 2e-16 ***
decision -0.129413 0.031016 -4.172 3.26e-05 ***
as.factor(cues)1 0.009391 0.024448 0.384 0.7010
as.factor(cues)2 0.048441 0.024258 1.997 0.0461 *
decision:as.factor(cues)1 0.027224 0.043713 0.623 0.5335
decision:as.factor(cues)2 -0.051313 0.044029 -1.165 0.2441
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2706 on 1069 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.05737, Adjusted R-squared: 0.05296
F-statistic: 13.01 on 5 and 1069 DF, p-value: 2.587e-12
With covariates
Call:
lm(formula = narcurb01 ~ decision + as.factor(cues) + age01 +
female + black + hisp + educ + incomei01 + know01 + relig01,
data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.55868 -0.20735 -0.00413 0.15581 0.74082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.500751 0.030324 16.513 < 2e-16 ***
decision -0.130590 0.017907 -7.293 5.92e-13 ***
as.factor(cues)1 0.020782 0.020137 1.032 0.30228
as.factor(cues)2 0.036530 0.020118 1.816 0.06968 .
age01 -0.012661 0.039309 -0.322 0.74745
female -0.050737 0.017579 -2.886 0.00398 **
black -0.019779 0.028814 -0.686 0.49259
hisp 0.004528 0.029028 0.156 0.87608
educ -0.009146 0.008920 -1.025 0.30540
incomei01 -0.065093 0.072755 -0.895 0.37116
know01 -0.083533 0.029410 -2.840 0.00459 **
relig01 0.069245 0.024184 2.863 0.00428 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2678 on 1063 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.08153, Adjusted R-squared: 0.07202
F-statistic: 8.578 on 11 and 1063 DF, p-value: 1.173e-14
Ordinal variables
Educational attainment in the previous example is ordinal, but we treated it as interval
educ:    0    1    2    3    4
n:     121  934  767  446  232
Partisan identity is another common example of this
1 = Strong Democrat \(\rightarrow\) 7 = Strong Republican
Ordinal w/ no constraints
It is technically more precise to treat each level as its own category, but this is costly in degrees of freedom
# estimate model
m2f <- lm(narcurb01 ~ decision + as.factor(cues)
          + age01 + female + black + hisp + as.factor(educ)
          + incomei01 + know01 + relig01, data = yougov)
summary(m2f)
Call:
lm(formula = narcurb01 ~ decision + as.factor(cues) + age01 +
female + black + hisp + as.factor(educ) + incomei01 + know01 +
relig01, data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.55907 -0.20508 -0.00192 0.15537 0.72213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.533900 0.046581 11.462 < 2e-16 ***
decision -0.130640 0.017932 -7.285 6.25e-13 ***
as.factor(cues)1 0.020233 0.020158 1.004 0.31572
as.factor(cues)2 0.035888 0.020158 1.780 0.07530 .
age01 -0.008711 0.039806 -0.219 0.82682
female -0.050638 0.017590 -2.879 0.00407 **
black -0.019516 0.028896 -0.675 0.49958
hisp 0.004254 0.029067 0.146 0.88368
as.factor(educ)1 -0.050380 0.041430 -1.216 0.22424
as.factor(educ)2 -0.052061 0.042466 -1.226 0.22050
as.factor(educ)3 -0.047296 0.045103 -1.049 0.29459
as.factor(educ)4 -0.082008 0.050104 -1.637 0.10198
incomei01 -0.062780 0.073051 -0.859 0.39031
know01 -0.086011 0.029502 -2.915 0.00363 **
relig01 0.069634 0.024258 2.871 0.00418 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.268 on 1060 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.08303, Adjusted R-squared: 0.07092
F-statistic: 6.856 on 14 and 1060 DF, p-value: 1.337e-13
Example
Trexler (2025)
F test
We can use an F test to see whether the unconstrained (categorical) model for educ fits better than the interval model
The interval model imposes the restriction that each category coefficient is the corresponding multiple of a single slope:
\(\gamma_j = j \cdot \beta_{\text{educ}}, \quad j = 1, \dots, 4\)
Analysis of Variance Table
Model 1: narcurb01 ~ decision + as.factor(cues) + age01 + female + black +
hisp + educ + incomei01 + know01 + relig01
Model 2: narcurb01 ~ decision + as.factor(cues) + age01 + female + black +
hisp + as.factor(educ) + incomei01 + know01 + relig01
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1063 76.251
2 1060 76.127 3 0.12462 0.5784 0.6292
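For reference, a table like the one above comes from passing the nested models to `anova()`; the name `m2e` for the interval-coded model is an assumption, since that model is not named in the slides:

```r
# F test of the interval (restricted) model against the
# categorical (unrestricted) model for educ
anova(m2e, m2f)  # m2e assumed to hold the interval-coded fit from earlier
```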
Interactions, continuous x categorical
Interacting a continuous variable with a categorical variable implies a distinct slope for each categorical group
Can also estimate separate models for each group
This effectively interacts every variable with the group
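A minimal sketch of the two approaches, using a subset of the variables from this example; the fully interacted model and the pair of group-specific models imply the same group-specific slopes:

```r
# one model, every term interacted with the group indicator
m_int <- lm(narcurb01 ~ decision * (know01 + age01 + relig01),
            data = yougov)

# equivalently, separate models within each group
m_g0 <- lm(narcurb01 ~ know01 + age01 + relig01,
           data = subset(yougov, decision == 0))
m_g1 <- lm(narcurb01 ~ know01 + age01 + relig01,
           data = subset(yougov, decision == 1))
```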
Example
m3 <- lm(narcurb01 ~ decision * know01 +
           age01 + black + hisp + educ + incomei01 + relig01,
         data = yougov)
summary(m3)
Call:
lm(formula = narcurb01 ~ decision * know01 + age01 + black +
hisp + educ + incomei01 + relig01, data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.53933 -0.18459 0.00242 0.14452 0.77772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.464448 0.017978 25.834 < 2e-16 ***
decision -0.038675 0.020411 -1.895 0.05826 .
know01 -0.062982 0.022306 -2.824 0.00479 **
age01 0.014945 0.026925 0.555 0.57889
black -0.011883 0.019026 -0.625 0.53234
hisp -0.005703 0.019694 -0.290 0.77218
educ -0.011778 0.005964 -1.975 0.04841 *
incomei01 -0.094627 0.050737 -1.865 0.06231 .
relig01 0.082886 0.016352 5.069 4.35e-07 ***
decision:know01 -0.073063 0.036279 -2.014 0.04414 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2578 on 2136 degrees of freedom
(354 observations deleted due to missingness)
Multiple R-squared: 0.05953, Adjusted R-squared: 0.05557
F-statistic: 15.02 on 9 and 2136 DF, p-value: < 2.2e-16
Example, using sim
Predicted values w/ confidence bounds, across know01, for each value of decision, holding other vars at their central tendencies
library(arm)
# get sims
m3_sims <- sim(m3, n.sims = 10000)
# new data for predictions
nd_3 <- data.frame(decision = c(rep(0, 11), rep(1, 11)),
                   know01 = seq(0, 1, 0.1),
                   age01 = mean(yougov$age01, na.rm = TRUE),
                   black = 0, hisp = 0,
                   educ = median(yougov$educ, na.rm = TRUE),
                   incomei01 = mean(yougov$incomei01, na.rm = TRUE),
                   relig01 = mean(yougov$relig01, na.rm = TRUE))
# add intercept column and the decision x know01 interaction term
nd_3 <- cbind(1, nd_3, nd_3$decision * nd_3$know01)
# calculate predicted values for each sim
know_sims <- t(sapply(1:10000, function(x) as.matrix(nd_3) %*% m3_sims@coef[x, ]))
Example
Example, using marginaleffects
library(marginaleffects)
# effect of decision at values of know01
slopes(m3,
       variables = c("decision"),
       newdata = datagrid(know01 = seq(0, 1, 0.1)))
know01 Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0.0 -0.0387 0.0204 -1.89 0.05812 4.1 -0.0787 0.00133
0.1 -0.0460 0.0176 -2.61 0.00911 6.8 -0.0805 -0.01142
0.2 -0.0533 0.0152 -3.50 < 0.001 11.1 -0.0831 -0.02347
0.3 -0.0606 0.0134 -4.54 < 0.001 17.4 -0.0868 -0.03442
0.4 -0.0679 0.0123 -5.51 < 0.001 24.8 -0.0920 -0.04377
0.5 -0.0752 0.0123 -6.12 < 0.001 30.0 -0.0993 -0.05111
0.6 -0.0825 0.0133 -6.20 < 0.001 30.7 -0.1086 -0.05644
0.7 -0.0898 0.0151 -5.93 < 0.001 28.3 -0.1195 -0.06015
0.8 -0.0971 0.0175 -5.54 < 0.001 25.0 -0.1315 -0.06274
0.9 -0.1044 0.0203 -5.14 < 0.001 21.8 -0.1442 -0.06462
1.0 -0.1117 0.0233 -4.79 < 0.001 19.2 -0.1574 -0.06603
Term: decision
Type: response
Comparison: 1 - 0
Chow test
Reasonable to wonder whether this model reflects the DGP for everyone, or whether the DGP varies by group.
A special form of F test, called a Chow test, can be used to test the null hypothesis that a linear model is identical across groups
Chow test
# model for all
m4 <- lm(narcurb01 ~ know01 +
           age01 + black + hisp + educ + incomei01 + relig01,
         data = yougov)
# model for disliked decisions
m4_d <- lm(narcurb01 ~ know01 +
             age01 + black + hisp + educ + incomei01 + relig01,
           data = subset(yougov, decision == 0))
# model for liked decisions
m4_l <- lm(narcurb01 ~ know01 +
             age01 + black + hisp + educ + incomei01 + relig01,
           data = subset(yougov, decision == 1))
Chow test
The formula requires estimating the restricted (pooled) model and both unrestricted (group-specific) models
\[
F = \frac{SSR_R - (SSR_{UR1} + SSR_{UR2})}{SSR_{UR1} + SSR_{UR2}} \cdot \frac{N - 2(K + 1)}{K + 1}
\]
Chow test
SSR_R <- sum(m4$residuals^2)
SSR_D <- sum(m4_d$residuals^2)
SSR_L <- sum(m4_l$residuals^2)
Fstat <- ((SSR_R - (SSR_D + SSR_L)) / (SSR_D + SSR_L)) *
  (nrow(m4$model) - 2 * length(coef(m4))) / length(coef(m4))
p <- 1 - pf(Fstat, length(coef(m4)), nrow(m4$model) - 2 * length(coef(m4)))
cbind(Fstat, p)
Fstat p
[1,] 6.564468 1.719644e-08
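An equivalent way to obtain the same F statistic, as a sketch, is to compare the pooled model to one in which the group indicator is interacted with every regressor (including the intercept), assuming both fits use the same estimation sample:

```r
# fully interacted model: allows every coefficient to differ by decision group
m4_full <- lm(narcurb01 ~ decision * (know01 + age01 + black + hisp +
                                        educ + incomei01 + relig01),
              data = yougov)
# F test of pooled vs. fully interacted model reproduces the Chow test
anova(m4, m4_full)
```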