R doesn’t have a built-in method for calculating heteroskedastically-robust and clustered standard errors. Since this is a common need for us, I’ve put together a little library to provide these functions.

Download the library.

The file contains three functions:

`get.coef.robust`

: Calculate heteroskedastically-robust standard errors`get.coef.clust`

: Calculate clustered standard errors`get.conf.clust`

: Calculate confidence intervals for clustered standard errors

The arguments for the functions are below, and then an example is at the end.

**get.coef.robust(mod, var=c(), estimator=”HC”)**

*Calculate heteroskedastically-robust standard errors*

mod: the result of a call to lm() e.g., lm(satell ~ width + factor(color)) var: a list of indexes to extract only certain coefficients (default: all) e.g., 1:2 [to drop FE from above] estimator: an estimator type passed to vcovHC (default: White's estimator) Returns an object of type coeftest, with estimate, std. err, t and p values

**get.coef.clust(mod, cluster, var=c())**

*Calculate clustered standard errors*

mod: the result of a call to lm() e.g., lm(satell ~ width + factor(color)) cluster: a cluster variable, with a length equal to the observation count e.g., color var: a list of indexes to extract only certain coefficients (default: all) e.g., 1:2 [to drop FE from above] Returns an object of type coeftest, with estimate, std. err, t and p values

**get.conf.clust(mod, cluster, xx, alpha, var=c())**

*Calculate confidence intervals for clustered standard errors*

mod: the result of a call to lm() e.g., lm(satell ~ width + factor(color)) cluster: a cluster variable, with a length equal to the observation count e.g., color xx: new values for each of the variables (only needs to be as long as 'var') e.g., seq(0, 50, length.out=100) alpha: the level of confidence, as an error rate e.g., .05 [for 95% confidence intervals] var: a list of indexes to extract only certain coefficients (default: all) e.g., 1:2 [to drop FE from above] Returns a data.frame of yhat, lo, and hi

### Example in Stata and R

The following example compare the results from Stata and R for an example from section 3.3.2 of “An Introduction to Categorical Data Analysis” by Alan Agresti. The data describes 173 female horseshoe crabs, and the number of male “satellites” that live near them.

Download the data as a CSV file. The columns are the crabs color (2-5), spine condition (1-3), carapace width (cm), number of satellite crabs, and weight (kg).

The code below calculates a simple model with crab color fixed-effects, relating width (which is thought to be an explanatory variable) and number of satellites. The first model is a straight OLS regression; then we add robust errors, and finally crab color clustered errors.

The analysis in **Stata**.

First, let’s do the analysis in Stata, to make sure that we can reproduce it in R.

import delim data.csv

xi: reg satell width i.color

xi: reg satell width i.color, vce(hc2)

xi: reg satell width i.color, vce(cluster color)

Here are excerpts of the results.

The OLS model:

------------------------------------------------------------------------------ satell | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- width | .4633951 .1117381 4.15 0.000 .2428033 .6839868 _cons | -8.412887 3.133074 -2.69 0.008 -14.59816 -2.227618 ------------------------------------------------------------------------------

The robust errors:

------------------------------------------------------------------------------ | Robust HC2 satell | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- width | .4633951 .1028225 4.51 0.000 .2604045 .6663856 _cons | -8.412887 2.939015 -2.86 0.005 -14.21505 -2.610726 ------------------------------------------------------------------------------

Note that we’ve used HC2 robust errors, to provide a direct comparison with the R results.

The clustered errors.

------------------------------------------------------------------------------ | Robust satell | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- width | .4633951 .0466564 9.93 0.002 .3149135 .6118767 _cons | -8.412887 1.258169 -6.69 0.007 -12.41694 -4.408833 ------------------------------------------------------------------------------

The robust and clustered errors are a little tighter for this example.

The analysis in **R**.

Here’s the equivalent code in R.

source("tools_reg.R") tbl <- read.csv("data.csv") mod <- lm(satell ~ width + factor(color), data=tbl) summary(mod) get.coef.robust(mod, estimator="HC2") get.coef.clust(mod, tbl$color)

And excerpts of the results. The OLS model:

Estimate Std. Error t value Pr(>|t|) (Intercept) -8.4129 3.1331 -2.685 0.00798 ** width 0.4634 0.1117 4.147 5.33e-05 ***

The coefficient estimates will be the same for all three models. Stata gives us `-8.412887 + .4633951 w`

and R gives us `-8.4129 + 0.4634 w`

. These are identical, but with different reporting precision, as are the standard errors.

The robust errors:

(Intercept) -8.41289 2.93902 -2.8625 0.004739 ** width 0.46340 0.10282 4.5067 1.229e-05 ***

Again, we use HC2 robust errors. Stata provides 2.939015 and .1028225 for the errors, matching these exactly.

The clustered errors:

Estimate Std. Error t value Pr(>|t|) (Intercept) -8.412887 1.258169 -6.6866 3.235e-10 *** width 0.463395 0.046656 9.9321 < 2.2e-16 ***

Stata gives the exact same values as R.