Propensity Score Matching in Python

Update 8/11/2017: I’ve been working on turning this code into a package people can download and contribute to. Please use the package, linked here, instead of the code I shared in a Jupyter notebook previously.

Update 4/3/2016: I’ve since included a more comprehensive notebook here.  Use this version instead of the code posted below.

I need to run some simulations using propensity score matching, but I haven’t been able to find a Python module that does it.  So, I’ve taken it upon myself to implement it.  This is a serious work in progress, progress being the keyword.  I’m definitely the tortoise in this race, slow and steady progress.

Propensity score matching is a method to match case-control pairs in observational studies (or treated-control pairs in quasi-experimental studies) in order to better estimate the effect of the treatment or exposure on the outcome of interest.  We first estimate the “propensity” of getting assigned to the treatment group given the other covariates measured, then match pairs with similar propensities.  The idea is that the individuals in these pairs have similar levels of other confounding variables, so in a sense we can see the effect of the treatment with the other things held constant.

The tricky thing about propensity score matching is that there’s no one good way to do it.  If your estimated propensities are wrong, then you’re screwed right off the bat.  But assuming they’re alright, then how do you pick the “best” case-control pairs?  You could try every possible pairing and minimize the within-pair differences in propensity, but that’s computationally intensive.  What’s typically done is greedy matching, but even then there are a number of factors to decide: in what order do we match cases to controls?  Do we match with or without replacement, allowing one control to be matched to one or more cases?  Do we use a caliper to set a maximum difference in propensities, and if so how do we pick the caliper?

I thought I’d share my IPython notebook for this code so far because I’m really enjoying using it.  Before, I was just using a basic text editor to write code and copying and pasting it into the terminal, but this is much more elegant.  I’ve posted the bit of code that I’ve written so far to GitHub.  I set it up to do greedy matching with no replacement and randomized the order of matching the treatment group to controls, with the default caliper set to 0.05 arbitrarily.  My IPython notebook uses the Dehejia-Wahba sample data from “Evaluating the Econometric Evaluations of Training Programs,” available freely here.


Learning a new language

I’m embarking on a new research project that involves a substantial amount of coding.  I’ve primarily used R in the past but Python seems to be gaining users in the statistics community lately.  For this project, I’m making the switch to Python. I know the fundamentals of Python, but I’ve never used the numpy, scipy, and pandas packages that will be crucial for my work.  As I go along, I’m going to use this blog to document some of the things I learn that I think are cool, so that 1) I practice coding and making plots, and more importantly, 2) to make these tools available to others who aren’t familiar with what Python can do.

For reference, I’m going to be using Python for Data Analysis, by Wes McKinney. You can find a pdf version online without searching too hard.  I will put my code (and data, depending on the source) in a Github repository for the sake of sharing.

I played with one of the introductory examples in the book, the GroupLens movie ratings data and put the code I wrote in a Github repository here for anyone to download.  I downloaded the MovieLens 1M file and loaded it into IPython (you could do this in any Python interface you’d like, though) and imported the pandas package.  The first few rows of the data look like this:1

I merged the tables into one and created a new column indicating a single genre for each movie.  I chose to look at Comedy, Documentary, and Thriller (I assumed there would be little to no overlap in the categories) and lumped everything else into Other, using the np.where function (very handy).  Then I made a pivot table to average all the ratings for each genre, grouped by males and females.  There was not much a difference by gender or between genres.  Lesson #1: there’s not always something interesting in the data.


Next I looked at ratings over time.  The year that the movies came out was not supplied as a column but as a part of the movie title column, so I used regular expressions from the re module to extract them.


They show up as a list, so I used the str.join method to convert them to strings so we can use them as an index for a pivot table.  This result was a bit more interesting, so I made a time series with matlibplot.  From the graph, we see that movie ratings were volatile at the beginning of the twentieth century, presumably because there were fewer movies coming out at the time.  And sadly, movie ratings are getting worse as time goes on…either peoples’ artistic tastes are becoming more refined or more bad movies are coming out.

6 7

Anyway, this was a toy example, but I’m excited to see what Python can do for me!  What computing tools are you most thankful to have at your disposal?