Tag Archives: Pro

Google Scholar Alerts to RSS: A punctuated equilibrium

If you’re like me, you have a pile of Google Scholar Alerts that you never manage to read. It’s a reflection of a more general problem: how do you find good articles, when there are so many articles to sift through?

I’ve recently started using Sux0r, a Bayesian filtering RSS feed reader. However, Google Scholar sends alerts to one’s email, and we’ll want to extract each paper as a separate RSS item.

alertemail

Here’s my process, and the steps for doing it yourself:

Google Scholar Alerts → IFTTT → Blogger → Perl → DreamHost → RSS → Bayesian Reader

  1. Create a Blogger blog that you will just use for Google Scholar Alerts: Go to the Blogger Home Page and follow the steps under “New Blog”.
  2. Sign up for IFTTT (if you don’t already have an account), and create a new recipe to post emails from scholaralerts-noreply@google.com to your new blog. The channel for the trigger is your email system (Gmail for me); the trigger is “New email in inbox from…”; the channel for the action is Blogger; and the title and labels can be whatever you want as along as the body is “{{BodyPlain}}” (which includes HTML).

    ifttttrigger

  3. Modify the Perl code below, pointing it to the front page of your new Blogger blog. It will return an RSS feed when called at the command line (perl scholar.pl).

    rssfeed

  4. Upload the Perl script to your favorite server (mine, http://existencia.org/, is powered by DreamHost.
  5. Point your favorite RSS reader to the URL of the Perl script as an RSS feed, and wait as the Google Alerts come streaming in!

Here is the code for the Alert-Blogger-to-RSS Perl script. All you need to do is fill in the $url line below.

#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);

use XML::RSS; # Library for RSS generation
use LWP::Simple; # Library for web access

# Download the first page from the blog
my $url = "http://mygooglealerts.blogspot.com/"; ### <-- FILL IN HERE!
my $input = get($url);
my @lines = split /\n/, $input;

# Set up the RSS feed we will fill
my $rss = new XML::RSS(version => '2.0');
$rss->channel(title => "Google Scholar Alerts");

# Iterate through the lines of HTML
my $ii = 0;
while ($ii < $#lines) {
    my $line = $lines[$ii];
    # Look for a <h3> starting the entry
    if ($line !~ /^<h3 style="font-weight:normal/) {
        $ii = ++$ii;
        next;
    }

    # Extract the title and link
    $line =~ /<a href="([^"]+)"><font .*?>(.+)<\/font>/;
    my $title = $2;
    my $link = $1;

    # Extract the authors and publication information
    my $line2 = $lines[$ii+1];
    $line2 =~ /<div><font .+?>([^<]+?) - (.*?, )?(\d{4})/;
    my $authors = $1;
    my $journal = (defined $2) ? $2 : '';
    my $year = $3;

    # Extract the snippets
    my $line3 = $lines[$ii+2];
    $line3 =~ /<div><font .+?>(.+?)<br \/>/;
    my $content = $1;
    for ($ii = $ii + 3; $ii < @lines; $ii++) {
        my $linen = $lines[$ii];
        # Are we done, or is there another line of snippets?
        if ($linen =~ /^(.+?)<\/font><\/div>/) {
            $content = $content . '<br />' . $1;
            last;
        } else {
            $linen =~ /^(.+?)<br \/>/;
            $content = $content . '<br />' . $1;
        }
    }
    $ii = ++$ii;

    # Use the title and publication for the RSS entry title
    my $longtitle = "$title ($authors, $journal $year)";

    # Add it to the RSS feed
    $rss->add_item(title => $longtitle,
                   link => $link,
                   description => $content);
        
    $ii = ++$ii;
}

# Write out the RSS feed
print header('application/xml+rss');
print $rss->as_string;

In Sux0r, here are a couple of items form the final result:

sux0rfeed

Scripts for Twitter Data

Twitter data– the endless stream of tweets, the user network, and the rise and fall of hashtags– offers a flood of insight into the minute-by-minute state of the society. Or at least one self-selecting part of it. A lot of people want to use it for research, and it turns out to be pretty easy to do so.

You can either purchase twitter data, or collect it in real-time. If you purchase twitter data, it’s all organized for you and available historically, but it basically isn’t anything that you can’t get yourself by monitoring twitter in real-time. I’ve used GNIP, where the going rate was about $500 per million tweets in 2013.

There are two main ways to collect data directly from twitter: “queries” and the “stream”. Queries let you get up to 1000 tweets at any point in time– whichever the most recent tweets that match your search criteria. The stream gives you a fraction of a percent of tweets continuously, which very quickly adds up, based on filtering criteria.

Scripts for doing these two options are below, but you need to decide on the search/streaming criteria. Typically, these are search terms and geographical constraints. See Twitter’s API documentation to decide on your search options.

Twitter uses an athentication system to identify both the individual collecting the data, and what tool is helping them do it. It is easy to register a new tool, whereby you pretend that you’re a startup with a great new app. Here are the steps:

  1. Install python’s twitter package, using “easy_install twitter” or “pip install twitter”.
  2. Create an app at http://ift.tt/1oHSTpv. Leave the callback URL blank, but fill in the rest.
  3. Set the CONSUMER_KEY and CONSUMER_SECRET in the code below to the values you get on the keys and access tokens tab of your app.
  4. Fill in the name of the application.
  5. Fill in any search terms or structured searches you like.
  6. If you’re using the downloaded scripts, which output data to a CSV file, change where the file is written, to some directory (where it says “twitter/us_”).
  7. Run the script from your computer’s terminal (i.e., python search.py)
  8. The script will pop up a browser for you to log into twitter and accept permissions from your app.
  9. Get data.

Here is what a simple script looks like:

import os, twitter

APP_NAME = "Your app name"
CONSUMER_KEY = 'Your consumer key'
CONSUMER_SECRET = 'Your consumer token'

# Do we already have a token saved?
MY_TWITTER_CREDS = os.path.expanduser('~/.class_credentials')
if not os.path.exists(MY_TWITTER_CREDS):
    # This will ask you to accept the permissions and save the token
    twitter.oauth_dance(APP_NAME, CONSUMER_KEY, CONSUMER_SECRET,
                        MY_TWITTER_CREDS)

# Read the token
oauth_token, oauth_secret = twitter.read_token_file(MY_TWITTER_CREDS)

# Open up an API object, with the OAuth token
api = twitter.Twitter(api_version="1.1", auth=twitter.OAuth(oauth_token, oauth_secret, CONSUMER_KEY, CONSUMER_SECRET))

# Perform our query
tweets = api.search.tweets(q="risky business")

# Print the results
for tweet in tweets['statuses']:
    if not 'text' in tweet:
        continue

    print tweet
    break

For automating twitter collection, I’ve put together scripts for queries (search.py), streaming (filter.py), and bash scripts that run them repeatedly (repsearch.sh and repfilter.sh). Download the scripts.

To use the repetition scripts, make the repetition scripts executable by running “chmod a+x repsearch.sh repfilter.sh“. Then run them, by typing ./repfilter.sh or ./repsearch.sh. Note that these will create many many files over time, which you’ll have to merge together.

US Water Network

The America’s Water project, coordinated at Columbia’s Water Center by Upmanu Lall, is trying to understand the US water system as an integrated whole, and understand how that system will evolve over the next decades. Doing so will require a comprehensive model, incorporating agriculture, energy, cities, policy, and more.

We are just beginning to lay the foundation for that model. A first step is to create a network of links between station gauges around the US, representing upstream and downstream flows and counties served. The ultimate form of that model will rely on physical flow data, but I created a first pass using simple rules:

  1. Every gauge can only be connected to one downstream gauge (but not visa versa).
  2. Upstream gauges must be at a higher elevation than downstream gauges.
  3. Upstream gauges must be fed by a smaller drainage basin than downstream gauges.
  4. Of the gauges that satisfy the first two constraints, the chosen downstream gauge is the one with the shortest distance and the most “plausible” streamflow.

The full description is available on Overleaf. I’ve applied the algorithm to the GAGES II database from USGSU, which includes all station gauges with at least 20 years of data.

network3

Every red dot is a gauge, black lines are upstream-downstream connections between gauges, and the blue and green lines connect counties with each of the gauges by similar rules to the ones above (green edges if the link is forced to be longer than 100 km).

This kind of network opens the door for a lot of interesting analyses. For example, if agricultural withdrawals increase in the midwest, how much less water will be available downstream? We’re working now to construct a full optimization model that accounts for upstream dependencies.

Another simple question is, how much of the demand in each county is satisfied by flows available to it? Here are the results, and many cities show up in sharp red, showing that their demands exceed the surface water by 10 times or more.

supplydemand-surface

Negative result: country-wide growing degree-days

I put this here as a warning. While growing degree-days (GDD) are well-known as an effective model to predict yields, they don’t perform so hot at the country-scale.

I used mean temperature GDDs, between 8 and 24 degrees C, estimated at many locations from station data, and then using the weighted average by production within each country. Here are the results:

Statistical models
Barley Maize Millet Rice Sorghum Wheat
GDD / 1000 -0.03 0.01 -0.07** 0.08 0.04* -0.08***
(0.01) (0.01) (0.03) (0.06) (0.02) (0.02)
Precip. (m) 0.09 0.11*** 0.12* 0.02 0.14*** -0.04
(0.05) (0.03) (0.05) (0.03) (0.04) (0.04)
Country Cubic Y Y Y Y Y Y
R2 0.95 0.97 0.91 0.97 0.92 0.96
Adj. R2 0.94 0.96 0.90 0.97 0.91 0.95
Num. obs. 1639 3595 1516 1721 2300 1791
***p < 0.001, **p < 0.01, *p < 0.05

As you can see, for most crops, these GDDs aren’t even significant, and as frequently negative as positive. This defies a century of agricultural research, but the same data at a fine spatial scale seems to work just fine.

Growing Degree-Day Calculations

Schlenker and Roberts (2009) use daily minimum and maximum temperatures to calculate growing degrees, rather than daily mean temperatures. This is important when the effect of extreme temperatures is an issue, since these often will not show up in mean temperatures.

Growing degree days form a useful model of crop productivity. DMAS has examples of these for maize, soybeans, and cotton.

To do this, they use a sinusoidal approximation, integrating the area of a curve through the minimum and maximum temperatures:
thresholds
(adapted from here– but don’t use their calculations!)

The calculations aren’t very difficult, but require some careful math. I had a need to write them in python and translate them to R, so I’m providing them here for anyone’s benefit.

import numpy as np
import warnings

warnings.simplefilter("ignore", RuntimeWarning)

def above_threshold(mins, maxs, threshold):
    """Use a sinusoidal approximation to estimate the number of Growing
Degree-Days above a given threshold, using daily minimum and maximum
temperatures.

mins and maxs are numpy arrays; threshold is in the same units."""

    # Determine crossing points, as a fraction of the day
    plus_over_2 = (mins + maxs)/2
    minus_over_2 = (maxs - mins)/2
    two_pi = 2*np.pi
    # d0s is the times of crossing above; d1s is when cross below
    d0s = np.arcsin((threshold - plus_over_2) / minus_over_2) / two_pi
    d1s = .5 - d0s

    # If always above or below threshold, set crossings accordingly
    aboves = mins >= threshold
    belows = maxs <= threshold

    d0s[aboves] = 0
    d1s[aboves] = 1
    d0s[belows] = 0
    d1s[belows] = 0

    # Calculate integral
    F1s = -minus_over_2 * np.cos(2*np.pi*d1s) / two_pi + plus_over_2 * d1s
    F0s = -minus_over_2 * np.cos(2*np.pi*d0s) / two_pi + plus_over_2 * d0s
    return np.sum(F1s - F0s - threshold * (d1s - d0s))

def get_gddkdd(mins, maxs, gdd_start, kdd_start):
    """Get the Growing Degree-Days, as degree-days between gdd_start and
kdd_start, and Killing Degree-Days, as the degree-days above
kdd_start.

mins and maxs are numpy arrays; threshold is in the same units."""

    dd_lowup = above_threshold(mins, maxs, gdd_start)
    dd_above = above_threshold(mins, maxs, kdd_start)
    dd_lower = dd_lowup - dd_above

    return (dd_lower, dd_above)

Download the code for R or python.

Integrated Food System Speed Talk

Below is my Prezi presentation, from a 4 minute “speed talk” on my research a week ago:

The main points of the narrative are below:

  1. Frame 1: My pre-thesis research:
    • I’m going to give a brief sense of where my research has gone, and my hopes for where it’s going.
    • My research has evolved– or wandered– but I’ve realized that it centers around food.
    • Food represents some of the most important ways humanity both relies upon and impacts the planet.
    • We need better models to understand this relationship– models that are sensitive to changing environmental conditions (including climate), and changing management (including adaptation).
  2. Frame 2: My thesis:
    • Perhaps more importantly, we need models powerful enough to capture feedbacks.
    • Harvests feedback and affect the harvesters: Fluctuations in a harvest affect producers in immediate and long-term ways.
    • Producers aren’t mechanical or linear: they respond dynamically to conditions.
    • Producers are part of the system, and I’m interested in understanding these systems as wholes.
  3. Frame 3: The greater whole:
    • These feedback loops are part of a larger integrated system of how food works.
    • Example: When farmers use excessive inputs in the course of their management, it impacts the environment, including that of fishers.
    • Example: 1/3 of fish caught is sold as fishmeal, which goes to feed livestock. These are all connected.
  4. Frame 4: What we eat becomes a driving force throughout this integrated system.
  5. Frame 5: A call for help:
    • Consumption patterns feed back on the production system, which feeds back on large environmental processes like climate change.
    • We need integrated models with all these parts to understand the whole and find leverage points where policies can make a difference.
    • I’d like to invite people from many fields to help out: biologists, computer scientists, economists, and foodies too.
    • Help me better represent the vast opportunities the food system holds for a sustainable future.

The Distribution of Possible Climates

There’s a lot we don’t know about future climates. Despite continuous improvements, our models still do notoriously poorly with some fairly fundamental dynamics: ocean energy transport, the Indian monsoon, and just reproducing past climate. Worse, we have the computing resources to run each model a hand-full of times. We haven’t estimated the distribution of possible temperatures that even one model predicts.

Instead, we treat the range of models as a kind of distribution over beliefs. This is useful, but a poor approximation to an actual distribution. For one, most of the models are incestuously related. Climate is chaotic, no matter how you model it, so changing the inputs within measurement error can produce a whole range of future temperatures.

This is all motivation for a big innovation from the Risky Business report, on the part of Bob Kopp and DJ Rasmussen.

First, they collected the full range of 44 CMIP5 models (downscaled and adjusted to predict every US county). They extracted each one’s global 2080-2099 temperature prediction.

Then they ran an *hemisphere* scale model (MAGICC), for which its possible to compute enough runs to produce a whole distribution of future temperatures.

Here’s the result: a matching between models and the temperatures distribution for end-of-century global temperatures.

magicc-cmip5

[From American Climate Prospectus, Technical Appendix I: Physical Climate Projections]

The opaque maps are each the output of a different GCM. The ghosted maps aren’t– at least, not alone. In fact, every ghosted map represents a portion of the probability distribution which is currently missing from our models.

To fix this, they made “surrogate” models, by decomposing existing GCM outputs into “forced” seasonal patterns which scale with temperature, and unforced variability. Then they scaled up the forced pattern, added back in the variability, and bingo! A new surrogate model.

Surrogates can’t tell us the true horrors of a high-temperature world, but without them we’d be navigating without a tail.

Tweets and Emotions

This animation is dedicated to my advisor Upmanu Lall, who’s been so supportive of all my many projects, including this one! Prabhat Barnwal and I have been studying the effects of Hurricane Sandy on the New York area through the lens of twitter emotions, ever since we propitiously began collecting tweets a week before the storm emerged (see the working paper).

The animation shows every geolocated tweet in between October 20 and November 5, across much of the Northeastern seaboard– about 2 million tweets in total.

Each tweet is colored based on our term-usage analysis of its emotional content. The hue reflects happiness, varying from green (happy) to red (sad), greyer colors reflect a lack of authority, and darker colors reflect a lack of engagement. The hurricane, as a red circle, jutters through the bottom-left of the screen the night of Oct. 29.

The first thing to notice is how much more tweeting activity there is near the hurricane and the election. Look at the difference between seconds 22 (midnight on Oct. 25) and 40 (midnight on Oct. 30).

The night of the hurricane, the tweets edge toward pastel as people get excited. The green glow near NYC that each night finishes with becomes brighter, but the next morning the whole region turns red as people survey the disaster.

Make your own Espresso Buzzer

My girlfriend uses a “Moka Express”-style stovetop Bialetti espresso-maker most mornings for her cappuccino. These devices are wonderful, reliable, and simple, but they have a nearly fatal flaw. A basin collects the espresso as it condenses, and for the five minutes before the steam brewing happens, there will be no indication of anything happening. Then espresso will start to quietly gurgle up, and if you don’t stop the process in the 20 seconds after it starts, the drink will be ruined.

I built a simple device to solve this problem. It has two wires that sit in the basin, and a loud buzzer that sounds as soon as coffee touches them.

setup1

How it works

Here is the circuit diagram for the coffee buzzer, designed to be powered by a 9V battery:

detector

The core of the coffee buzzer is a simple voltage divider. Normally, when the device is not in coffee, the resistance through the air between the two leads on the left (labeled “LOAD”) is very high. As a result, the entire voltage from 9V battery is applied to that gap, so that the voltage across the 500 KΩ resistor is 0.

The IRF1104 is a simple MOSFET, which acts like a voltage-controlled switch. With no voltage across the resistor, the MOSFET is off, so the buzzer doesn’t sound.

To turn the MOSFET on, the voltage across the 500 KΩ resistor needs to be about 2 V. As a result, anything between the two LOAD leads with a resistance of less than about 2000 KΩ will cause the buzzer to turn on.

resistance

Coffee resistance seems to vary quite a bit, and the 137 KΩ shown here is on the low end. For this, you need a resistor of at least 40 KΩ. I suggest using something higher, so the detector will be more sensitive.

What you need

Tools

You will need wire-strippers, a multimeter (to check your circuit), and a soldering iron (to put everything together).

strippers multimeter solder

Parts

wire2

These wires will be the main detector of the coffee buzzer.

resistor

Here I use a 1500 KΩ resistor, so the buzzer will sound for resistances less than 5000 KΩ will be detected. Make sure that you use a resistor of more than 300 KΩ.

mosfet

The MOSFET is a voltage controlled switch, and it’s what allows the buzzer to get plenty of current even though there’s a lot of resistance for current moving through the coffee.

connector buzzers battery

You can get a 9V battery connector (top) and a buzzer (middle, rated for between 4 and 8 V) at RadioShack or from DigiKey. And, of course, the battery itself (not included).

Optional (but highly recommended!)

switch

A simple switch (which may look very different from this one) is a great way to build an integrity tester into the device itself.

breadboard

A breadboard will let you put the whole circuit together and test it before you solder it.

wires

Wires like these make plugging everything into a breadboard easier.

tape

I suggest taping up the whole device after you solder it, both for protection and to keep everything in one tight package.

How to make an Espresso Buzzer

Start by preparing the detector wires (the black and white wires above). Take at least 30 cm of wire, so the device can sit on the counter away from the flame. Use the wire stripper to strip one end of the wires for use in the circuit. You may want to strip the “detection” end, or not: if you leave the detector end unstripped, the buzzer won’t go off prematurely when if both wires touch the bottom of the basin, but you’ll have to wipe off the end of the wires to get them to stop buzzing once they have started.

Now connect all of the pieces on the breadboard. Here’s one arrangement:

circuit-annotated

If you aren’t sure how to use a breadboard or how to read a circuit diagram, you can still make the buzzer by skipping this step and soldering the wires together as specified below.

Once everything is connected on the breadboard, you should be able to test the coffee buzzer. If you use a switch, pressing it should make the buzzer sound. Then make a little experimental coffee, and dip the leads into the coffee to check that it works.

Next, solder it all together. As you can see in the circuit diagram, you want to solder together the following groups of wires:

  • Ground: The black wire from the battery connector; one wire from the resistor; and the source lead on the MOSFET (furthest right).
  • Power: The red wire from the battery connector; one wire from the buzzer; the (stripped) circuit end of one of the detector wires; and optionally one end of the switch.
  • Gate: The (stripped) circuit end of the other detector wire; the other end of the resistor; the gate lead on the MOSFET (furthest left); and optionally the other end of the switch.
  • Buzzer: The gate lead on the MOSFET (middle); and the remaining wire to the buzzer.

soldered

After you have soldered everything together, test it and then wrap tape around it all.

complete

The results

That’s it! Now, to use it, place the detector leads at the bottom of the coffee basin. I suggest putting a bend into the wires a couple inches from the end, so they can sit easily in the basin. If you’ve stripped the ends, the buzzer may buzz when the leads touch the base, but if you just let them go at this point, one wire will sit slightly above the other and just millimeters away from the bottom for perfect detection.

setup2

As soon as the coffee reaches a level where both leads are submerged, the detector will start buzzing!

final1

Enjoy!

final2

Simple Robust Standard Errors Library for R

R doesn’t have a built-in method for calculating heteroskedastically-robust and clustered standard errors. Since this is a common need for us, I’ve put together a little library to provide these functions.

Download the library.

The file contains three functions:

  • get.coef.robust: Calculate heteroskedastically-robust standard errors
  • get.coef.clust: Calculate clustered standard errors
  • get.conf.clust: Calculate confidence intervals for clustered standard errors

The arguments for the functions are below, and then an example is at the end.

get.coef.robust(mod, var=c(), estimator=”HC”)
Calculate heteroskedastically-robust standard errors

mod: the result of a call to lm()
     e.g., lm(satell ~ width + factor(color))
var: a list of indexes to extract only certain coefficients (default: all)
     e.g., 1:2 [to drop FE from above]
estimator: an estimator type passed to vcovHC (default: White's estimator)
Returns an object of type coeftest, with estimate, std. err, t and p values

get.coef.clust(mod, cluster, var=c())
Calculate clustered standard errors

mod: the result of a call to lm()
     e.g., lm(satell ~ width + factor(color))
cluster: a cluster variable, with a length equal to the observation count
     e.g., color
var: a list of indexes to extract only certain coefficients (default: all)
     e.g., 1:2 [to drop FE from above]
Returns an object of type coeftest, with estimate, std. err, t and p values

get.conf.clust(mod, cluster, xx, alpha, var=c())
Calculate confidence intervals for clustered standard errors

mod: the result of a call to lm()
     e.g., lm(satell ~ width + factor(color))
cluster: a cluster variable, with a length equal to the observation count
     e.g., color
xx: new values for each of the variables (only needs to be as long as 'var')
     e.g., seq(0, 50, length.out=100)
alpha: the level of confidence, as an error rate
     e.g., .05 [for 95% confidence intervals]
var: a list of indexes to extract only certain coefficients (default: all)
     e.g., 1:2 [to drop FE from above]
Returns a data.frame of yhat, lo, and hi

Example in Stata and R

The following example compare the results from Stata and R for an example from section 3.3.2 of “An Introduction to Categorical Data Analysis” by Alan Agresti. The data describes 173 female horseshoe crabs, and the number of male “satellites” that live near them.

Download the data as a CSV file. The columns are the crabs color (2-5), spine condition (1-3), carapace width (cm), number of satellite crabs, and weight (kg).

The code below calculates a simple model with crab color fixed-effects, relating width (which is thought to be an explanatory variable) and number of satellites. The first model is a straight OLS regression; then we add robust errors, and finally crab color clustered errors.

The analysis in Stata.

First, let’s do the analysis in Stata, to make sure that we can reproduce it in R.

import delim data.csv

xi: reg satell width i.color
xi: reg satell width i.color, vce(hc2)
xi: reg satell width i.color, vce(cluster color)

Here are excerpts of the results.

The OLS model:

------------------------------------------------------------------------------
      satell |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       width |   .4633951   .1117381     4.15   0.000     .2428033    .6839868
       _cons |  -8.412887   3.133074    -2.69   0.008    -14.59816   -2.227618
------------------------------------------------------------------------------

The robust errors:

------------------------------------------------------------------------------
             |             Robust HC2
      satell |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       width |   .4633951   .1028225     4.51   0.000     .2604045    .6663856
       _cons |  -8.412887   2.939015    -2.86   0.005    -14.21505   -2.610726
------------------------------------------------------------------------------

Note that we’ve used HC2 robust errors, to provide a direct comparison with the R results.

The clustered errors.

------------------------------------------------------------------------------
             |               Robust
      satell |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       width |   .4633951   .0466564     9.93   0.002     .3149135    .6118767
       _cons |  -8.412887   1.258169    -6.69   0.007    -12.41694   -4.408833
------------------------------------------------------------------------------

The robust and clustered errors are a little tighter for this example.

The analysis in R.

Here’s the equivalent code in R.

source("tools_reg.R")
tbl <- read.csv("data.csv")

mod <- lm(satell ~ width + factor(color), data=tbl)

summary(mod)
get.coef.robust(mod, estimator="HC2")
get.coef.clust(mod, tbl$color)

And excerpts of the results. The OLS model:

               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -8.4129     3.1331  -2.685  0.00798 ** 
width            0.4634     0.1117   4.147 5.33e-05 ***

The coefficient estimates will be the same for all three models. Stata gives us -8.412887 + .4633951 w and R gives us -8.4129 + 0.4634 w. These are identical, but with different reporting precision, as are the standard errors.

The robust errors:

(Intercept)    -8.41289    2.93902 -2.8625  0.004739 ** 
width           0.46340    0.10282  4.5067 1.229e-05 ***

Again, we use HC2 robust errors. Stata provides 2.939015 and .1028225 for the errors, matching these exactly.

The clustered errors:

                Estimate Std. Error  t value  Pr(>|t|)    
(Intercept)    -8.412887   1.258169  -6.6866 3.235e-10 ***
width           0.463395   0.046656   9.9321 < 2.2e-16 ***

Stata gives the exact same values as R.