Qualitative Information

POLSCI 630: Probability and Basic Regression

March 18, 2025

Qualitative information

Information about differences in kind rather than quantity

  • Gender identification

  • Religious identification

  • Marital status

  • Region

Representation

We can represent qualitative information using numeric codes that have no quantitative meaning

  • Binary variables, e.g., gender (male or female), coded 0 or 1

  • Multicategory discrete variables, e.g., US region (South, West, Northeast, Midwest), coded 1, 2, 3, or 4

Dummy variables

In regression applications, a categorical variable with \(J\) categories is typically represented by a series of \(J-1\) “dummy” variables, each of which indicates membership (or not) in a particular category; the omitted category serves as the baseline

  • For US census region, we could create 3 dummies (West is the omitted baseline):
    • 1=South
    • 1=Northeast
    • 1=Midwest
  • 0/1 coding is technically arbitrary, but easy to interpret
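
A minimal sketch of the dummy coding described above, using a small made-up region vector (the data and object names are illustrative, not from the course data). It assumes West is the baseline, which matches the three dummies listed.

```r
# Toy data: a four-category region variable (hypothetical values)
region <- factor(c("South", "West", "Northeast", "Midwest", "South", "West"))

# Make West the omitted (baseline) category
region <- relevel(region, ref = "West")

# model.matrix() expands the factor into an intercept plus J - 1 dummies
X <- model.matrix(~ region)
dummies <- X[, -1]   # drop the intercept column

colnames(dummies)
rowSums(dummies)     # 0 for West rows, 1 for every other row
```

lm() does this expansion automatically whenever a predictor is a factor, which is why the later examples wrap variables in as.factor().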

Example, gender


Call:
lm(formula = econ_mean ~ male, data = lucid)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38302 -0.13302 -0.00802  0.12028  0.66194 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.338056   0.003831  88.234  < 2e-16 ***
male        0.044960   0.005629   7.987 1.73e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1925 on 4702 degrees of freedom
  (44 observations deleted due to missingness)
Multiple R-squared:  0.01338,   Adjusted R-squared:  0.01317 
F-statistic: 63.79 on 1 and 4702 DF,  p-value: 1.729e-15

What’s the intercept here?
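
One way to check the answer, sketched with simulated data (the variables `male` and `y` below are hypothetical stand-ins, not the lucid data): with a single 0/1 regressor, the intercept is the mean of the outcome in the group coded 0, and the slope is the difference in group means.

```r
set.seed(630)
male <- rbinom(500, 1, 0.5)
y <- 0.34 + 0.045 * male + rnorm(500, sd = 0.19)

fit <- lm(y ~ male)

mean0 <- mean(y[male == 0])
mean1 <- mean(y[male == 1])

# Intercept equals the mean for male == 0; slope equals the difference in means
c(intercept = coef(fit)[["(Intercept)"]], mean0 = mean0)
c(slope = coef(fit)[["male"]], diff = mean1 - mean0)
```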

Example, gender


Call:
lm(formula = econ_mean ~ age_scale + I(age_scale^2) + male + 
    educ_scale + income_scale, data = lucid)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43693 -0.13101 -0.00162  0.11975  0.69088 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.322067   0.011080  29.068  < 2e-16 ***
age_scale      -0.158514   0.049534  -3.200  0.00138 ** 
I(age_scale^2)  0.119425   0.056702   2.106  0.03524 *  
male            0.033077   0.005582   5.926 3.33e-09 ***
educ_scale      0.021022   0.010864   1.935  0.05304 .  
income_scale    0.120404   0.010602  11.356  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1882 on 4684 degrees of freedom
  (58 observations deleted due to missingness)
Multiple R-squared:  0.05823,   Adjusted R-squared:  0.05723 
F-statistic: 57.93 on 5 and 4684 DF,  p-value: < 2.2e-16

Example, multiarm experiment

In this experiment, there are 2 factors:

  1. cues: 0=No party cues, 1=Unpolarized party cues, 2=Polarized party cues

  2. decision: 0=respondent disagrees with SCOTUS decision, 1=respondent agrees with SCOTUS decision

              Decision direction
Cues condition   0   1
             0 242 111
             1 248 112
             2 257 106

The outcome, narcurb01, represents attitudes toward reforming the Supreme Court.

Example, multiarm experiment

# estimate model
m2a <- lm(narcurb01 ~ as.factor(cues), data = yougov)
summary(m2a)

Call:
lm(formula = narcurb01 ~ as.factor(cues), data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42403 -0.22403  0.01218  0.11218  0.61218 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.38782    0.01479  26.220   <2e-16 ***
as.factor(cues)1  0.01829    0.02082   0.879   0.3797    
as.factor(cues)2  0.03621    0.02079   1.742   0.0818 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2779 on 1072 degrees of freedom
  (1425 observations deleted due to missingness)
Multiple R-squared:  0.002823,  Adjusted R-squared:  0.0009629 
F-statistic: 1.518 on 2 and 1072 DF,  p-value: 0.2197

What if we remove intercept?

# estimate model
m2b <- lm(narcurb01 ~ 0 + as.factor(cues), 
          data = yougov)
summary(m2b)

Call:
lm(formula = narcurb01 ~ 0 + as.factor(cues), data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42403 -0.22403  0.01218  0.11218  0.61218 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
as.factor(cues)0  0.38782    0.01479   26.22   <2e-16 ***
as.factor(cues)1  0.40611    0.01465   27.73   <2e-16 ***
as.factor(cues)2  0.42403    0.01461   29.03   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2779 on 1072 degrees of freedom
  (1425 observations deleted due to missingness)
Multiple R-squared:  0.682, Adjusted R-squared:  0.6811 
F-statistic: 766.4 on 3 and 1072 DF,  p-value: < 2.2e-16

# compare to raw group means
aggregate(yougov$narcurb01, 
          by = list(yougov$cues), 
          FUN = mean, na.rm = TRUE)
  Group.1         x
1       0 0.3878187
2       1 0.4061111
3       2 0.4240332

What does \(p \approx 0\) mean in this context?
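
A sketch with simulated data (hypothetical names `g` and `y`) of what dropping the intercept does: each coefficient becomes a group mean, so each t test has the null that the group mean is zero rather than that the group differs from a baseline.

```r
set.seed(1)
g <- factor(sample(0:2, 300, replace = TRUE))
y <- 0.4 + 0.02 * as.integer(as.character(g)) + rnorm(300, sd = 0.28)

fit_int   <- lm(y ~ g)        # baseline mean + differences from baseline
fit_noint <- lm(y ~ 0 + g)    # one mean per group

group_means <- tapply(y, g, mean)

# No-intercept coefficients reproduce the group means
cbind(coef = coef(fit_noint), mean = group_means)

# Both parameterizations give identical fitted values
all.equal(fitted(fit_int), fitted(fit_noint))
```

The fit is the same in both models. The reported R-squared differs because, without an intercept, summary() measures variation around zero instead of around the mean of the outcome.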

Multiple factors

# estimate model
m2c <- lm(narcurb01 ~ decision + as.factor(cues), data = yougov)
summary(m2c)

Call:
lm(formula = narcurb01 ~ decision + as.factor(cues), data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46417 -0.21169  0.03583  0.16908  0.70615 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.43092    0.01547  27.851  < 2e-16 ***
decision         -0.13706    0.01792  -7.648 4.52e-14 ***
as.factor(cues)1  0.01784    0.02028   0.879    0.379    
as.factor(cues)2  0.03325    0.02025   1.642    0.101    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2707 on 1071 degrees of freedom
  (1425 observations deleted due to missingness)
Multiple R-squared:  0.05447,   Adjusted R-squared:  0.05182 
F-statistic: 20.56 on 3 and 1071 DF,  p-value: 5.851e-13

Multiple factors w/ interaction

# estimate model
m2d <- lm(narcurb01 ~ decision*as.factor(cues), data = yougov)
summary(m2d)

Call:
lm(formula = narcurb01 ~ decision * as.factor(cues), data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47695 -0.19910  0.02305  0.16429  0.70377 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                0.428512   0.017393  24.638  < 2e-16 ***
decision                  -0.129413   0.031016  -4.172 3.26e-05 ***
as.factor(cues)1           0.009391   0.024448   0.384   0.7010    
as.factor(cues)2           0.048441   0.024258   1.997   0.0461 *  
decision:as.factor(cues)1  0.027224   0.043713   0.623   0.5335    
decision:as.factor(cues)2 -0.051313   0.044029  -1.165   0.2441    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2706 on 1069 degrees of freedom
  (1425 observations deleted due to missingness)
Multiple R-squared:  0.05737,   Adjusted R-squared:  0.05296 
F-statistic: 13.01 on 5 and 1069 DF,  p-value: 2.587e-12
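
To read the interaction model, it can help to rebuild the six cell means from the coefficients. The sketch below copies the estimates from the summary(m2d) output above; the object names (`b`, `cells`, `effect`) are illustrative.

```r
# Coefficients copied from the summary(m2d) output
b <- c(int = 0.428512, dec = -0.129413,
       cues1 = 0.009391, cues2 = 0.048441,
       dec_cues1 = 0.027224, dec_cues2 = -0.051313)

# Predicted mean for each cues x decision cell
cells <- expand.grid(decision = 0:1, cues = 0:2)
cells$mean <- with(cells,
  b[["int"]] + b[["dec"]] * decision +
  b[["cues1"]] * (cues == 1) + b[["cues2"]] * (cues == 2) +
  b[["dec_cues1"]] * decision * (cues == 1) +
  b[["dec_cues2"]] * decision * (cues == 2))

# Effect of agreeing with the decision within each cues condition
effect <- with(cells, mean[decision == 1] - mean[decision == 0])
names(effect) <- paste0("cues", 0:2)

cells
effect
```

The coefficient on decision is the effect of agreement only in the baseline condition (cues = 0); in the other conditions the matching interaction term is added to it.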

With covariates


Call:
lm(formula = narcurb01 ~ decision + as.factor(cues) + age01 + 
    female + black + hisp + educ + incomei01 + know01 + relig01, 
    data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55868 -0.20735 -0.00413  0.15581  0.74082 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.500751   0.030324  16.513  < 2e-16 ***
decision         -0.130590   0.017907  -7.293 5.92e-13 ***
as.factor(cues)1  0.020782   0.020137   1.032  0.30228    
as.factor(cues)2  0.036530   0.020118   1.816  0.06968 .  
age01            -0.012661   0.039309  -0.322  0.74745    
female           -0.050737   0.017579  -2.886  0.00398 ** 
black            -0.019779   0.028814  -0.686  0.49259    
hisp              0.004528   0.029028   0.156  0.87608    
educ             -0.009146   0.008920  -1.025  0.30540    
incomei01        -0.065093   0.072755  -0.895  0.37116    
know01           -0.083533   0.029410  -2.840  0.00459 ** 
relig01           0.069245   0.024184   2.863  0.00428 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2678 on 1063 degrees of freedom
  (1425 observations deleted due to missingness)
Multiple R-squared:  0.08153,   Adjusted R-squared:  0.07202 
F-statistic: 8.578 on 11 and 1063 DF,  p-value: 1.173e-14

Ordinal variables

Educational attainment in the previous example is ordinal, but we treated it as interval


  0   1   2   3   4 
121 934 767 446 232 

Partisan identity is another common example of this

  • 1 = Strong Democrat \(\rightarrow\) 7 = Strong Republican

Ordinal w/ no constraints

Technically more precise to treat it as categorical, but costly in degrees of freedom (one coefficient per non-baseline category instead of a single slope)

# estimate model
m2f <- lm(narcurb01 ~ decision + as.factor(cues) 
          + age01 + female + black + hisp + as.factor(educ) 
          + incomei01 + know01 + relig01, data = yougov)
summary(m2f)

Call:
lm(formula = narcurb01 ~ decision + as.factor(cues) + age01 + 
    female + black + hisp + as.factor(educ) + incomei01 + know01 + 
    relig01, data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55907 -0.20508 -0.00192  0.15537  0.72213 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.533900   0.046581  11.462  < 2e-16 ***
decision         -0.130640   0.017932  -7.285 6.25e-13 ***
as.factor(cues)1  0.020233   0.020158   1.004  0.31572    
as.factor(cues)2  0.035888   0.020158   1.780  0.07530 .  
age01            -0.008711   0.039806  -0.219  0.82682    
female           -0.050638   0.017590  -2.879  0.00407 ** 
black            -0.019516   0.028896  -0.675  0.49958    
hisp              0.004254   0.029067   0.146  0.88368    
as.factor(educ)1 -0.050380   0.041430  -1.216  0.22424    
as.factor(educ)2 -0.052061   0.042466  -1.226  0.22050    
as.factor(educ)3 -0.047296   0.045103  -1.049  0.29459    
as.factor(educ)4 -0.082008   0.050104  -1.637  0.10198    
incomei01        -0.062780   0.073051  -0.859  0.39031    
know01           -0.086011   0.029502  -2.915  0.00363 ** 
relig01           0.069634   0.024258   2.871  0.00418 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.268 on 1060 degrees of freedom
  (1425 observations deleted due to missingness)
Multiple R-squared:  0.08303,   Adjusted R-squared:  0.07092 
F-statistic: 6.856 on 14 and 1060 DF,  p-value: 1.337e-13

Example

Trexler (2025)

F test

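We can use an F test to see whether the unconstrained (dummy) model for educ fits better than the interval model

The interval model is the restricted model: it forces the coefficient on the dummy for category \(j\) to equal \(j\beta_1\)

\(\beta_1\text{educ} = \beta_1\text{educ}_1 + 2\beta_1\text{educ}_2 + 3\beta_1\text{educ}_3 + \dots\)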


# m2e is the interval-educ model (the "With covariates" model, see Model 1 below)
anova(m2e, m2f)
Analysis of Variance Table

Model 1: narcurb01 ~ decision + as.factor(cues) + age01 + female + black + 
    hisp + educ + incomei01 + know01 + relig01
Model 2: narcurb01 ~ decision + as.factor(cues) + age01 + female + black + 
    hisp + as.factor(educ) + incomei01 + know01 + relig01
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   1063 76.251                           
2   1060 76.127  3   0.12462 0.5784 0.6292
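
The F statistic in this table can be reproduced by hand. The sketch below copies the values from the anova output above; the object names are illustrative.

```r
# Values copied from the anova(m2e, m2f) table
ss_diff <- 0.12462   # RSS(restricted) - RSS(unrestricted), "Sum of Sq"
rss_ur  <- 76.127    # RSS of the unrestricted (dummy) model
df_ur   <- 1060      # residual df of the unrestricted model
q       <- 3         # restrictions: 4 dummy coefficients vs. 1 slope

F_stat <- (ss_diff / q) / (rss_ur / df_ur)
p_val  <- pf(F_stat, q, df_ur, lower.tail = FALSE)

round(c(F = F_stat, p = p_val), 4)
```

A large p-value here means we cannot reject the restrictions, so the data give little reason to prefer the dummy specification over the simpler interval coding.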

Interactions, continuous x categorical

Interacting a continuous with a categorical variable implies a distinct slope for each categorical group

  • Number of unique slopes = number of groups

  • With more than 2 categories, the continuous variable must be interacted with each category dummy

Can also estimate separate models for each group

  • This effectively interacts every variable with the group
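
A sketch with simulated data (all names hypothetical) showing that an interaction model and separate group models recover the same group-specific slopes.

```r
set.seed(2)
n <- 400
g <- rbinom(n, 1, 0.5)           # two groups
x <- runif(n)                    # continuous predictor
y <- 0.4 - 0.05 * x - 0.04 * g - 0.07 * x * g + rnorm(n, sd = 0.25)
d <- data.frame(y, x, g)

# One model with an interaction
fit_int <- lm(y ~ x * g, data = d)

# Separate models for each group
fit_g0 <- lm(y ~ x, data = subset(d, g == 0))
fit_g1 <- lm(y ~ x, data = subset(d, g == 1))

# Group-specific slopes implied by the interaction model
slope_g0 <- coef(fit_int)[["x"]]
slope_g1 <- coef(fit_int)[["x"]] + coef(fit_int)[["x:g"]]

c(interaction = slope_g0, separate = coef(fit_g0)[["x"]])
c(interaction = slope_g1, separate = coef(fit_g1)[["x"]])
```

The point estimates match, but the standard errors generally do not: the single interaction model assumes one residual variance for both groups, while the separate models estimate one per group.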

Example

m3 <- lm(narcurb01 ~ decision*know01 +
                      age01 + black + hisp + educ + incomei01 + relig01, 
            data = yougov)
summary(m3)

Call:
lm(formula = narcurb01 ~ decision * know01 + age01 + black + 
    hisp + educ + incomei01 + relig01, data = yougov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53933 -0.18459  0.00242  0.14452  0.77772 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.464448   0.017978  25.834  < 2e-16 ***
decision        -0.038675   0.020411  -1.895  0.05826 .  
know01          -0.062982   0.022306  -2.824  0.00479 ** 
age01            0.014945   0.026925   0.555  0.57889    
black           -0.011883   0.019026  -0.625  0.53234    
hisp            -0.005703   0.019694  -0.290  0.77218    
educ            -0.011778   0.005964  -1.975  0.04841 *  
incomei01       -0.094627   0.050737  -1.865  0.06231 .  
relig01          0.082886   0.016352   5.069 4.35e-07 ***
decision:know01 -0.073063   0.036279  -2.014  0.04414 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2578 on 2136 degrees of freedom
  (354 observations deleted due to missingness)
Multiple R-squared:  0.05953,   Adjusted R-squared:  0.05557 
F-statistic: 15.02 on 9 and 2136 DF,  p-value: < 2.2e-16

Example, using sim

Predicted values with confidence bounds, across know01, for each value of decision, holding other variables at their central tendencies

library(arm)

# get sims
m3_sims <- sim(m3, n.sims = 10000)

# new data for predictions: know01 from 0 to 1 at each value of decision
# (columns must follow the order of coef(m3))
nd_3 <- data.frame(decision = c(rep(0,11),rep(1,11)), 
                   know01 = seq(0,1,0.1), 
                   age01 = mean(yougov$age01, na.rm = TRUE), 
                   black = 0, hisp = 0, educ = median(yougov$educ, na.rm = TRUE), 
                   incomei01 = mean(yougov$incomei01, na.rm = TRUE), 
                   relig01 = mean(yougov$relig01, na.rm = TRUE))
# add the intercept column and the decision:know01 interaction
nd_3 <- cbind(1, nd_3, nd_3$decision*nd_3$know01)

# calculate predicted values for each sim (rows = sims, columns = rows of nd_3)
know_sims <- t(sapply(1:10000, function(x) as.matrix(nd_3) %*% m3_sims@coef[x, ]))

Example

[Figure: predicted values across know01 for each value of decision, with confidence bounds]

Example, using marginaleffects

library(marginaleffects)

# effect of decision at values of know01
slopes(m3, 
       variables = c("decision"), 
       newdata = datagrid(know01 = seq(0, 1, 0.1)))

 know01 Estimate Std. Error     z Pr(>|z|)    S   2.5 %   97.5 %
    0.0  -0.0387     0.0204 -1.89  0.05812  4.1 -0.0787  0.00133
    0.1  -0.0460     0.0176 -2.61  0.00911  6.8 -0.0805 -0.01142
    0.2  -0.0533     0.0152 -3.50  < 0.001 11.1 -0.0831 -0.02347
    0.3  -0.0606     0.0134 -4.54  < 0.001 17.4 -0.0868 -0.03442
    0.4  -0.0679     0.0123 -5.51  < 0.001 24.8 -0.0920 -0.04377
    0.5  -0.0752     0.0123 -6.12  < 0.001 30.0 -0.0993 -0.05111
    0.6  -0.0825     0.0133 -6.20  < 0.001 30.7 -0.1086 -0.05644
    0.7  -0.0898     0.0151 -5.93  < 0.001 28.3 -0.1195 -0.06015
    0.8  -0.0971     0.0175 -5.54  < 0.001 25.0 -0.1315 -0.06274
    0.9  -0.1044     0.0203 -5.14  < 0.001 21.8 -0.1442 -0.06462
    1.0  -0.1117     0.0233 -4.79  < 0.001 19.2 -0.1574 -0.06603

Term: decision
Type:  response 
Comparison: 1 - 0
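
The Estimate column can be reproduced by hand from the m3 coefficients: the effect of decision at a given value of know01 is the decision coefficient plus the interaction coefficient times know01. A sketch using the estimates copied from the summary(m3) output (object names are illustrative):

```r
# Coefficients copied from the summary(m3) output
b_decision    <- -0.038675
b_interaction <- -0.073063   # decision:know01

know01 <- seq(0, 1, 0.1)
effect <- b_decision + b_interaction * know01

data.frame(know01, effect = round(effect, 4))
```

The standard errors cannot be rebuilt from the coefficient table alone, because they also depend on the covariance between the two estimates, which is available from vcov(m3).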

Chow test

It is reasonable to wonder whether this model reflects the data-generating process (DGP) for everyone, or whether the DGP varies by group.

A special form of F test, called a Chow test, can be used to test the null hypothesis that the coefficients of a linear model are identical across groups

Chow test

# model for all
m4 <- lm(narcurb01 ~ know01 +
                      age01 + black + hisp + educ + incomei01 + relig01, 
            data = yougov)

# model for disliked decisions
m4_d <- lm(narcurb01 ~ know01 +
                      age01 + black + hisp + educ + incomei01 + relig01, 
            data = subset(yougov, decision == 0))

# model for liked decisions
m4_l <- lm(narcurb01 ~ know01 +
                      age01 + black + hisp + educ + incomei01 + relig01, 
            data = subset(yougov, decision == 1))

Chow test

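The formula requires estimating the restricted (pooled) model and both unrestricted (group-specific) models

\[ F = \frac{SSR_R - (SSR_{UR1} + SSR_{UR2})}{SSR_{UR1} + SSR_{UR2}} \cdot \frac{N - 2(K + 1)}{K + 1} \]

where \(N\) is the total number of observations and \(K\) is the number of predictors, so each model has \(K + 1\) coefficients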

Chow test

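# sums of squared residuals: restricted (pooled) and unrestricted (by group)
SSR_R <- sum(m4$residuals^2)
SSR_D <- sum(m4_d$residuals^2)
SSR_L <- sum(m4_l$residuals^2)

# K + 1 coefficients per model; N observations in the pooled model
# (assumes the pooled model uses the same observations as the two subsets)
k1 <- length(coef(m4))
N <- nrow(m4$model)

Fstat <- ( (SSR_R  - (SSR_D + SSR_L)) / (SSR_D + SSR_L) ) * 
         (N - 2*k1) / k1

p <- 1 - pf(Fstat, k1, N - 2*k1)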

cbind(Fstat, p)
        Fstat            p
[1,] 6.564468 1.719644e-08
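
The same test can also be run with anova() by comparing the pooled model to a model that interacts every predictor with the group. The sketch below uses simulated data (all names hypothetical, not the yougov data) to show that the two routes give the same F statistic.

```r
set.seed(3)
n <- 600
g <- rbinom(n, 1, 0.5)
x1 <- runif(n)
x2 <- rnorm(n)
y <- 0.4 - 0.06 * x1 + 0.02 * x2 + g * (-0.05 - 0.07 * x1) + rnorm(n, sd = 0.25)
d <- data.frame(y, x1, x2, g)

# Restricted (pooled) and unrestricted (group-specific) models
pooled <- lm(y ~ x1 + x2, data = d)
fit_g0 <- lm(y ~ x1 + x2, data = subset(d, g == 0))
fit_g1 <- lm(y ~ x1 + x2, data = subset(d, g == 1))

ssr_r  <- sum(resid(pooled)^2)
ssr_ur <- sum(resid(fit_g0)^2) + sum(resid(fit_g1)^2)
k1     <- length(coef(pooled))      # K + 1
df2    <- nrow(d) - 2 * k1

# Chow F statistic from the formula
F_hand <- ((ssr_r - ssr_ur) / ssr_ur) * (df2 / k1)

# Same test via a fully interacted model
full    <- lm(y ~ (x1 + x2) * g, data = d)
F_anova <- anova(pooled, full)$F[2]

c(by_hand = F_hand, anova = F_anova)
```

The fully interacted model has the same residual sum of squares as the two group models combined, which is why the results agree.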