PS1
Instructions
Create a new markdown document where you will answer all questions. Submit (1) your code as an .Rmd
file, and (2) a rendered html file including only what you wish to present, to the Problem Set 1
assignment under the Assignments tab in Canvas. The problem set is due Friday, 24 January, by 10:00 EST. Name your files lastname_ps01 (.html and .Rmd).
All submitted work must be your own.
Data
In this problem set, we will work with the 2020 American National Election Studies (ANES) time series data, anes_timeseries_2020_csv_20220210.csv.
We are interested in three variables: (1) respondents’ feelings toward conservatives relative to feelings toward liberals, (2) their age, and (3) their family income.
We begin with the raw data downloaded from the ANES website. You may need to use the codebook to clean the data for analysis and to interpret your results.
We will use the following variables:
- V202164: feeling thermometer toward conservatives (name: con_feel)
- V202161: feeling thermometer toward liberals (name: lib_feel)
- V201507x: respondent age (name: age)
- V202468x: family income (name: income)
First, do the following:
- Store the data in an object called anes.
- Create one new variable in anes for each of the four variables above; name each new variable using the names provided above.
- Recode each new variable so that all negative values, and any responses of 998 or 999, are coded as missing (NA).
- Create a new variable called libtocon, which is equal to con_feel minus lib_feel.
- Create a new data frame, anes_clean, which includes only these variables and which removes all observations with missing values on any of the variables above.
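The recoding steps above follow a common pattern, sketched here on a tiny invented data frame rather than the real ANES file. The object names toy and toy_clean, the helper to_na, and all values are hypothetical and exist only for illustration; the choice of which columns to keep is also an assumption about what "these variables" covers.

```r
# Toy stand-in for the raw ANES file (all values invented for illustration)
toy <- data.frame(
  V202164  = c(70, -9, 998, 50),
  V202161  = c(30, 40, 60, 999),
  V201507x = c(45, 62, -9, 33),
  V202468x = c(10, 5, 12, -5)
)

# Hypothetical helper: negative codes and 998/999 become NA
to_na <- function(x) {
  x[x < 0 | x %in% c(998, 999)] <- NA
  x
}

# One new, renamed variable per raw variable
toy$con_feel <- to_na(toy$V202164)
toy$lib_feel <- to_na(toy$V202161)
toy$age      <- to_na(toy$V201507x)
toy$income   <- to_na(toy$V202468x)

# Relative feeling toward conservatives
toy$libtocon <- toy$con_feel - toy$lib_feel

# Keep only the analysis variables, then drop rows with any NA
keep <- c("libtocon", "con_feel", "lib_feel", "age", "income")
toy_clean <- na.omit(toy[, keep])
```

With the real data, the first step would instead read the CSV (for example with read.csv) into anes; check the codebook for any additional missing-value codes before relying on this pattern.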
Problem 1
\[\text{libtocon}_i = \beta_0 + \beta_1 \text{age}_{i} + u_i\]
Find ordinary least squares estimates of \(\beta_0\) and \(\beta_1\) “by hand”; that is, without using canned functions like lm (you can of course use basic math functions like solve). Show your code for this one (echo=T).
What is the estimated slope and what is the estimated intercept? Interpret each substantively; that is, explain in words what each of these estimates means in the context of these specific variables. Is the value of the intercept substantively meaningful as a prediction about the dependent variable here? Why or why not?
Fill in the blanks (do it “by hand”, do not use predict, show your work in the code):
- A 50-year-old respondent’s predicted value of the dependent variable is ______.
- A 27-year-old respondent’s predicted value of the dependent variable is ______.
- A 12-year increase in age is associated with a change of _________ in the dependent variable.
This model assumes a linear relationship between libtocon and age. Explain what this means. Do you think it is a reasonable assumption? Why or why not?
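As a reference for the by-hand estimation in this problem, the standard closed-form OLS expressions are given below. The notation is an assumption, not something fixed by the problem set: \(y_i\) stands for \(\text{libtocon}_i\), \(X\) is the \(n \times 2\) matrix whose first column is all ones and whose second column is age, and bars denote sample means.

\[\hat{\beta} = (X'X)^{-1} X'y\]

For the single-regressor case this is equivalent to

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (\text{age}_i - \overline{\text{age}})(y_i - \bar{y})}{\sum_{i=1}^{n} (\text{age}_i - \overline{\text{age}})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \, \overline{\text{age}}\]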
Problem 2
Create a scatter plot depicting the relationship between libtocon and age (with age on the x-axis). Make sure the plot looks nice and has proper labeling.
Add the estimated regression line to the scatter plot (HINT: abline (base), geom_abline (ggplot)).
Estimate the predicted value of libtocon when age is 65, plot the relevant point on the regression line (make sure we can see it!), and add horizontal and vertical dashed lines associated with this point to the figure.
Problem 3
\[\text{libtocon}_i = \beta_0 + \beta_1 \text{age}_{i} + \beta_2 \text{income}_{i} + u_i\]
Estimate the new model above, using OLS, in two ways: (1) “by hand” using linear algebra, and (2) using the lm function. Store the lm output in an object named m_3 (you can then use m_3 for the remaining tasks below). Show your code for this one (echo=T).
Report the coefficient estimates from m_3, along with their standard errors, t-statistics, and two-tailed p-values, in a professional-looking table. Interpret each of the estimated coefficients substantively (remember that there are now two independent variables in the model!).
Create a figure that plots the predicted value of libtocon as a function of income at three different values for age: the 5th percentile, the 50th percentile, and the 95th percentile. Use the predict function to do it. You should have one plot with three lines. Make sure your lines can be distinguished and add a legend so the reader knows which is which. Explain why the lines in your figure are parallel.
Standardize all three variables (give them means of 0 and standard deviations of 1; you can use the scale function for this). Estimate the same regression model using the standardized variables. Interpret all three coefficient estimates in words.
Re-estimate the model with only age and the model with both age and income using lm and the data frame anes (instead of anes_clean). Now compare your \(N\) (number of data points) for the two models. You have lost respondents! Why? And why might this be a problem for your estimates?
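For the "by hand" versus lm comparison in this problem, a minimal sketch of the linear-algebra route is shown below on the built-in mtcars data. The variables mpg, wt, and hp are stand-ins for libtocon, age, and income, and the object names are illustrative only.

```r
# Design matrix: a column of ones for the intercept, then the two regressors
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) b = X'y for b
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same model via lm, for comparison
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The two sets of estimates should agree up to numerical tolerance
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```

Passing the right-hand side to solve directly avoids forming the inverse explicitly; solve(t(X) %*% X) %*% t(X) %*% y gives the same estimates.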
Problem 4
Calculate, “by hand”, the following measures of fit for the model you estimated in Problem 3 (stored in m_3). For each quantity, explain what it means, statistically. Finish with a summary statement of model fit, taking into account the three quantities. Show your code for this one (echo=T).
- \(R^2\)
- mean squared error
- adjusted \(R^2\)
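For reference, one common set of definitions for these quantities is given below. The notation is an assumption: \(y_i\) is \(\text{libtocon}_i\), \(\hat{u}_i\) is the residual for respondent \(i\), \(n\) is the number of observations, and \(k = 2\) is the number of independent variables in the Problem 3 model.

\[\text{SSR} = \sum_{i=1}^{n} \hat{u}_i^2, \qquad \text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2\]

\[R^2 = 1 - \frac{\text{SSR}}{\text{SST}}, \qquad \text{MSE} = \frac{\text{SSR}}{n}, \qquad \bar{R}^2 = 1 - \frac{\text{SSR}/(n - k - 1)}{\text{SST}/(n - 1)}\]

Some textbooks divide by \(n - k - 1\) rather than \(n\) when defining the mean squared error, so check which convention the course uses before reporting it.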