Qualitative Information
POLSCI 630: Probability and Basic Regression
March 18, 2025
Representation
We can represent qualitative information using numeric codes that have no quantitative meaning
Binary variables, e.g., gender (male or female), coded 0 or 1
Multicategory discrete variables, e.g., US region (South, West, Northeast, Midwest), coded 1, 2, 3, or 4
Dummy variables
It is typical in regression applications to represent a categorical variable using a series of \(J-1\) “dummy” variables, where \(J\) is the number of categories; each dummy indicates membership (or not) in a particular category of the variable
For US census region (\(J = 4\)), we could create 3 dummies, leaving West as the omitted reference category:
South: 1 if South, 0 otherwise
Northeast: 1 if Northeast, 0 otherwise
Midwest: 1 if Midwest, 0 otherwise
0/1 coding is technically arbitrary, but easy to interpret
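As a sketch, such dummies can be built by hand or left to `lm()`; the data frame `df` and its codes below are illustrative assumptions, not from the slides:

```r
# illustrative data: region coded 1=South, 2=West, 3=Northeast, 4=Midwest
df <- data.frame(region = c(1, 2, 3, 4, 1, 3))

# by hand: J - 1 = 3 dummies, with West (2) as the omitted reference
df$south     <- as.numeric(df$region == 1)
df$northeast <- as.numeric(df$region == 3)
df$midwest   <- as.numeric(df$region == 4)

# equivalently, lm() builds the dummies automatically from a factor:
# lm(y ~ as.factor(region), data = df)
```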
Example, gender
Call:
lm(formula = econ_mean ~ male, data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.38302 -0.13302 -0.00802 0.12028 0.66194
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.338056 0.003831 88.234 < 2e-16 ***
male 0.044960 0.005629 7.987 1.73e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1925 on 4702 degrees of freedom
(44 observations deleted due to missingness)
Multiple R-squared: 0.01338, Adjusted R-squared: 0.01317
F-statistic: 63.79 on 1 and 4702 DF, p-value: 1.729e-15
What’s the intercept here?
Example, gender
Call:
lm(formula = econ_mean ~ age_scale + I(age_scale^2) + male +
educ_scale + income_scale, data = lucid)
Residuals:
Min 1Q Median 3Q Max
-0.43693 -0.13101 -0.00162 0.11975 0.69088
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.322067 0.011080 29.068 < 2e-16 ***
age_scale -0.158514 0.049534 -3.200 0.00138 **
I(age_scale^2) 0.119425 0.056702 2.106 0.03524 *
male 0.033077 0.005582 5.926 3.33e-09 ***
educ_scale 0.021022 0.010864 1.935 0.05304 .
income_scale 0.120404 0.010602 11.356 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1882 on 4684 degrees of freedom
(58 observations deleted due to missingness)
Multiple R-squared: 0.05823, Adjusted R-squared: 0.05723
F-statistic: 57.93 on 5 and 4684 DF, p-value: < 2.2e-16
Example, multiarm experiment
In this experiment, there are 2 factors:
cues: 0=No party cues, 1=Unpolarized party cues, 2=Polarized party cues
decision: 0=respondent disagrees w/ SCOTUS decision, 1=respondent agrees w/ SCOTUS decision
                 Decision direction
Cues condition       0      1
             0     242    111
             1     248    112
             2     257    106
The outcome, narcurb01, represents attitudes toward reforming the Supreme Court.
Example, multiarm experiment
# estimate model
m2a <- lm(narcurb01 ~ as.factor(cues), data = yougov)
summary(m2a)
Call:
lm(formula = narcurb01 ~ as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.42403 -0.22403 0.01218 0.11218 0.61218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.38782 0.01479 26.220 <2e-16 ***
as.factor(cues)1 0.01829 0.02082 0.879 0.3797
as.factor(cues)2 0.03621 0.02079 1.742 0.0818 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2779 on 1072 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.002823, Adjusted R-squared: 0.0009629
F-statistic: 1.518 on 2 and 1072 DF, p-value: 0.2197
What if we remove the intercept?
# estimate model
m2b <- lm(narcurb01 ~ 0 + as.factor(cues),
          data = yougov)
summary(m2b)
Call:
lm(formula = narcurb01 ~ 0 + as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.42403 -0.22403 0.01218 0.11218 0.61218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
as.factor(cues)0 0.38782 0.01479 26.22 <2e-16 ***
as.factor(cues)1 0.40611 0.01465 27.73 <2e-16 ***
as.factor(cues)2 0.42403 0.01461 29.03 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2779 on 1072 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.682, Adjusted R-squared: 0.6811
F-statistic: 766.4 on 3 and 1072 DF, p-value: < 2.2e-16
aggregate(yougov$narcurb01,
          by = list(yougov$cues),
          FUN = mean, na.rm = TRUE)
Group.1 x
1 0 0.3878187
2 1 0.4061111
3 2 0.4240332
What does \(p \approx 0\) mean in this context?
Multiple factors
# estimate model
m2c <- lm(narcurb01 ~ decision + as.factor(cues), data = yougov)
summary(m2c)
Call:
lm(formula = narcurb01 ~ decision + as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.46417 -0.21169 0.03583 0.16908 0.70615
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.43092 0.01547 27.851 < 2e-16 ***
decision -0.13706 0.01792 -7.648 4.52e-14 ***
as.factor(cues)1 0.01784 0.02028 0.879 0.379
as.factor(cues)2 0.03325 0.02025 1.642 0.101
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2707 on 1071 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.05447, Adjusted R-squared: 0.05182
F-statistic: 20.56 on 3 and 1071 DF, p-value: 5.851e-13
Multiple factors w/ interaction
# estimate model
m2d <- lm(narcurb01 ~ decision * as.factor(cues), data = yougov)
summary(m2d)
Call:
lm(formula = narcurb01 ~ decision * as.factor(cues), data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.47695 -0.19910 0.02305 0.16429 0.70377
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.428512 0.017393 24.638 < 2e-16 ***
decision -0.129413 0.031016 -4.172 3.26e-05 ***
as.factor(cues)1 0.009391 0.024448 0.384 0.7010
as.factor(cues)2 0.048441 0.024258 1.997 0.0461 *
decision:as.factor(cues)1 0.027224 0.043713 0.623 0.5335
decision:as.factor(cues)2 -0.051313 0.044029 -1.165 0.2441
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2706 on 1069 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.05737, Adjusted R-squared: 0.05296
F-statistic: 13.01 on 5 and 1069 DF, p-value: 2.587e-12
With covariates
Call:
lm(formula = narcurb01 ~ decision + as.factor(cues) + age01 +
female + black + hisp + educ + incomei01 + know01 + relig01,
data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.55868 -0.20735 -0.00413 0.15581 0.74082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.500751 0.030324 16.513 < 2e-16 ***
decision -0.130590 0.017907 -7.293 5.92e-13 ***
as.factor(cues)1 0.020782 0.020137 1.032 0.30228
as.factor(cues)2 0.036530 0.020118 1.816 0.06968 .
age01 -0.012661 0.039309 -0.322 0.74745
female -0.050737 0.017579 -2.886 0.00398 **
black -0.019779 0.028814 -0.686 0.49259
hisp 0.004528 0.029028 0.156 0.87608
educ -0.009146 0.008920 -1.025 0.30540
incomei01 -0.065093 0.072755 -0.895 0.37116
know01 -0.083533 0.029410 -2.840 0.00459 **
relig01 0.069245 0.024184 2.863 0.00428 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2678 on 1063 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.08153, Adjusted R-squared: 0.07202
F-statistic: 8.578 on 11 and 1063 DF, p-value: 1.173e-14
Ordinal variables
Educational attainment in the previous example is ordinal, but we treated it as interval
educ:    0    1    2    3    4
n:     121  934  767  446  232
Partisan identity is another common example of this
1 = Strong Democrat \(\rightarrow\) 7 = Strong Republican
Ordinal w/ no constraints
It is technically more precise to treat each level as its own category, but this is costly in degrees of freedom
# estimate model
m2f <- lm(narcurb01 ~ decision + as.factor(cues)
          + age01 + female + black + hisp + as.factor(educ)
          + incomei01 + know01 + relig01, data = yougov)
summary(m2f)
Call:
lm(formula = narcurb01 ~ decision + as.factor(cues) + age01 +
female + black + hisp + as.factor(educ) + incomei01 + know01 +
relig01, data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.55907 -0.20508 -0.00192 0.15537 0.72213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.533900 0.046581 11.462 < 2e-16 ***
decision -0.130640 0.017932 -7.285 6.25e-13 ***
as.factor(cues)1 0.020233 0.020158 1.004 0.31572
as.factor(cues)2 0.035888 0.020158 1.780 0.07530 .
age01 -0.008711 0.039806 -0.219 0.82682
female -0.050638 0.017590 -2.879 0.00407 **
black -0.019516 0.028896 -0.675 0.49958
hisp 0.004254 0.029067 0.146 0.88368
as.factor(educ)1 -0.050380 0.041430 -1.216 0.22424
as.factor(educ)2 -0.052061 0.042466 -1.226 0.22050
as.factor(educ)3 -0.047296 0.045103 -1.049 0.29459
as.factor(educ)4 -0.082008 0.050104 -1.637 0.10198
incomei01 -0.062780 0.073051 -0.859 0.39031
know01 -0.086011 0.029502 -2.915 0.00363 **
relig01 0.069634 0.024258 2.871 0.00418 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.268 on 1060 degrees of freedom
(1425 observations deleted due to missingness)
Multiple R-squared: 0.08303, Adjusted R-squared: 0.07092
F-statistic: 6.856 on 14 and 1060 DF, p-value: 1.337e-13
Example
Trexler (2025)
F test
We can use an F test to see whether the unconstrained (categorical) model for educ fits better than the interval model
The interval model imposes the restriction that each category coefficient is the corresponding multiple of a single slope:
\(\gamma_j = j \cdot \beta_{\text{educ}}, \quad j = 1, \dots, 4\)
Analysis of Variance Table
Model 1: narcurb01 ~ decision + as.factor(cues) + age01 + female + black +
hisp + educ + incomei01 + know01 + relig01
Model 2: narcurb01 ~ decision + as.factor(cues) + age01 + female + black +
hisp + as.factor(educ) + incomei01 + know01 + relig01
Res.Df RSS Df Sum of Sq F Pr(>F)
1 1063 76.251
2 1060 76.127 3 0.12462 0.5784 0.6292
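For reference, a table like the one above comes from passing the nested models to `anova()`; the name `m2e` for the interval-coded model is an assumption, since that model is not named in the slides:

```r
# F test of the interval (restricted) model against the
# categorical (unrestricted) model for educ
anova(m2e, m2f)  # m2e assumed to hold the interval-coded fit from earlier
```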
Interactions, continuous x categorical
Interacting a continuous variable with a categorical variable implies a distinct slope for each categorical group
Can also estimate separate models for each group
This effectively interacts every variable with the group
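A minimal sketch of the two approaches, using a subset of the variables from this example; the fully interacted model and the pair of group-specific models imply the same group-specific slopes:

```r
# one model, every term interacted with the group indicator
m_int <- lm(narcurb01 ~ decision * (know01 + age01 + relig01),
            data = yougov)

# equivalently, separate models within each group
m_g0 <- lm(narcurb01 ~ know01 + age01 + relig01,
           data = subset(yougov, decision == 0))
m_g1 <- lm(narcurb01 ~ know01 + age01 + relig01,
           data = subset(yougov, decision == 1))
```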
Example
m3 <- lm(narcurb01 ~ decision * know01 +
           age01 + black + hisp + educ + incomei01 + relig01,
         data = yougov)
summary(m3)
Call:
lm(formula = narcurb01 ~ decision * know01 + age01 + black +
hisp + educ + incomei01 + relig01, data = yougov)
Residuals:
Min 1Q Median 3Q Max
-0.53933 -0.18459 0.00242 0.14452 0.77772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.464448 0.017978 25.834 < 2e-16 ***
decision -0.038675 0.020411 -1.895 0.05826 .
know01 -0.062982 0.022306 -2.824 0.00479 **
age01 0.014945 0.026925 0.555 0.57889
black -0.011883 0.019026 -0.625 0.53234
hisp -0.005703 0.019694 -0.290 0.77218
educ -0.011778 0.005964 -1.975 0.04841 *
incomei01 -0.094627 0.050737 -1.865 0.06231 .
relig01 0.082886 0.016352 5.069 4.35e-07 ***
decision:know01 -0.073063 0.036279 -2.014 0.04414 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2578 on 2136 degrees of freedom
(354 observations deleted due to missingness)
Multiple R-squared: 0.05953, Adjusted R-squared: 0.05557
F-statistic: 15.02 on 9 and 2136 DF, p-value: < 2.2e-16
Example, using sim
Predicted values w/ confidence bounds, across know01, for each value of decision, holding other vars at their central tendencies
library(arm)
# get sims
m3_sims <- sim(m3, n.sims = 10000)
# new data for predictions
nd_3 <- data.frame(decision = c(rep(0, 11), rep(1, 11)),
                   know01 = seq(0, 1, 0.1),
                   age01 = mean(yougov$age01, na.rm = TRUE),
                   black = 0, hisp = 0,
                   educ = median(yougov$educ, na.rm = TRUE),
                   incomei01 = mean(yougov$incomei01, na.rm = TRUE),
                   relig01 = mean(yougov$relig01, na.rm = TRUE))
# add intercept column and the decision x know01 interaction term
nd_3 <- cbind(1, nd_3, nd_3$decision * nd_3$know01)
# calculate predicted values for each sim
know_sims <- t(sapply(1:10000, function(x) as.matrix(nd_3) %*% m3_sims@coef[x, ]))
Example
Example, using marginaleffects
library(marginaleffects)
# effect of decision at values of know01
slopes(m3,
       variables = c("decision"),
       newdata = datagrid(know01 = seq(0, 1, 0.1)))
know01 Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0.0 -0.0387 0.0204 -1.89 0.05812 4.1 -0.0787 0.00133
0.1 -0.0460 0.0176 -2.61 0.00911 6.8 -0.0805 -0.01142
0.2 -0.0533 0.0152 -3.50 < 0.001 11.1 -0.0831 -0.02347
0.3 -0.0606 0.0134 -4.54 < 0.001 17.4 -0.0868 -0.03442
0.4 -0.0679 0.0123 -5.51 < 0.001 24.8 -0.0920 -0.04377
0.5 -0.0752 0.0123 -6.12 < 0.001 30.0 -0.0993 -0.05111
0.6 -0.0825 0.0133 -6.20 < 0.001 30.7 -0.1086 -0.05644
0.7 -0.0898 0.0151 -5.93 < 0.001 28.3 -0.1195 -0.06015
0.8 -0.0971 0.0175 -5.54 < 0.001 25.0 -0.1315 -0.06274
0.9 -0.1044 0.0203 -5.14 < 0.001 21.8 -0.1442 -0.06462
1.0 -0.1117 0.0233 -4.79 < 0.001 19.2 -0.1574 -0.06603
Term: decision
Type: response
Comparison: 1 - 0
Chow test
Reasonable to wonder whether this model reflects the DGP for everyone, or whether the DGP varies by group.
A special form of F test, called a Chow test, can be used to test the null hypothesis that a linear model is identical across groups
Chow test
# model for all
m4 <- lm(narcurb01 ~ know01 +
           age01 + black + hisp + educ + incomei01 + relig01,
         data = yougov)
# model for disliked decisions
m4_d <- lm(narcurb01 ~ know01 +
             age01 + black + hisp + educ + incomei01 + relig01,
           data = subset(yougov, decision == 0))
# model for liked decisions
m4_l <- lm(narcurb01 ~ know01 +
             age01 + black + hisp + educ + incomei01 + relig01,
           data = subset(yougov, decision == 1))
Chow test
The formula requires estimating the restricted (pooled) model and both unrestricted (group-specific) models
\[
F = \frac{SSR_R - (SSR_{UR1} + SSR_{UR2})}{SSR_{UR1} + SSR_{UR2}} \cdot \frac{N - 2(K + 1)}{K + 1}
\]
Chow test
SSR_R <- sum(m4$residuals^2)
SSR_D <- sum(m4_d$residuals^2)
SSR_L <- sum(m4_l$residuals^2)
Fstat <- ((SSR_R - (SSR_D + SSR_L)) / (SSR_D + SSR_L)) *
  (nrow(m4$model) - 2 * length(coef(m4))) / length(coef(m4))
p <- 1 - pf(Fstat, length(coef(m4)), nrow(m4$model) - 2 * length(coef(m4)))
cbind(Fstat, p)
Fstat p
[1,] 6.564468 1.719644e-08
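An equivalent way to obtain the same F statistic, as a sketch, is to compare the pooled model to one in which the group indicator is interacted with every regressor (including the intercept), assuming both fits use the same estimation sample:

```r
# fully interacted model: allows every coefficient to differ by decision group
m4_full <- lm(narcurb01 ~ decision * (know01 + age01 + black + hisp +
                                        educ + incomei01 + relig01),
              data = yougov)
# F test of pooled vs. fully interacted model reproduces the Chow test
anova(m4, m4_full)
```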