Propensity score matching in Python, revisited

Update 8/11/2017: I’ve been working on turning this code into a package people can download and contribute to. Please use the package, linked here, instead of the code I shared in a Jupyter notebook previously.

I can’t believe how many people from all around the world visit my previous blog post on propensity score matching in Python every day. It feels great to know that my code is out there and people are actually using it. However, I realized that the notebook I linked to previously doesn’t contain much and that I wrote heaps more code after posting it. Hence, I’m sharing a more complete notebook with code for different variations on propensity score matching, functions to compute average treatment effects and their standard errors, and checks for balance between matched groups.

Learning a new language

I’m embarking on a new research project that involves a substantial amount of coding. I’ve primarily used R in the past, but Python seems to be gaining users in the statistics community lately. For this project, I’m making the switch to Python. I know the fundamentals of Python, but I’ve never used the numpy, scipy, and pandas packages that will be crucial for my work. As I go along, I’m going to use this blog to document some of the things I learn that I think are cool, so that 1) I practice coding and making plots, and more importantly, 2) I make these tools available to others who aren’t familiar with what Python can do.

For reference, I’m going to be using Python for Data Analysis, by Wes McKinney. You can find a PDF version online without searching too hard. I will put my code (and data, depending on the source) in a GitHub repository for the sake of sharing.

I played with one of the introductory examples in the book, the GroupLens movie ratings data, and put the code I wrote in a GitHub repository here for anyone to download. I downloaded the MovieLens 1M file, loaded it into IPython (you could do this in any Python interface you’d like, though), and imported the pandas package. The first few rows of the data look like this:
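The original screenshot isn’t reproduced here, so here is a minimal sketch of how the loading step might look. It assumes the `::`-delimited layout of the MovieLens 1M `.dat` files; the `read_dat` helper, the column names, and the two sample rows are illustrative choices of mine, not the exact code from the repository.

```python
import io

import pandas as pd


def read_dat(source, names):
    """Read a '::'-delimited table with no header row (hypothetical helper).

    The multi-character separator needs the slower Python parsing engine.
    """
    return pd.read_csv(source, sep="::", header=None, names=names, engine="python")


# Two made-up rows in the users.dat layout, held in memory so the
# sketch runs without downloading anything.
sample_users = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

# Column names are assumptions chosen for readability.
users = read_dat(sample_users, ["user_id", "gender", "age", "occupation", "zip"])
print(users.head())
```

With the real download you would pass a file path instead of the in-memory sample, and repeat the call for the ratings and movies files.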

I merged the tables into one and created a new column indicating a single genre for each movie. I chose to look at Comedy, Documentary, and Thriller (I assumed there would be little to no overlap in the categories) and lumped everything else into Other, using the np.where function (very handy). Then I made a pivot table to average all the ratings for each genre, grouped by males and females. There was not much of a difference by gender or between genres. Lesson #1: there’s not always something interesting in the data.
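To make the merge, the np.where step, and the pivot table concrete, here is a small sketch on made-up data. The three tiny tables, the column names, and the order in which the genres are checked are all assumptions for illustration; a movie tagged with two of the chosen genres would simply get the first one that matches.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-ins for the three MovieLens tables.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Movie A (1995)", "Movie B (1960)", "Movie C (1980)"],
    "genres": ["Animation|Comedy", "Thriller", "Drama"],
})

# Merge everything into one table, one row per rating.
data = ratings.merge(users, on="user_id").merge(movies, on="movie_id")

# Nested np.where: the first matching genre wins, everything else is "Other".
genres = data["genres"]
data["genre"] = np.where(
    genres.str.contains("Comedy"), "Comedy",
    np.where(
        genres.str.contains("Documentary"), "Documentary",
        np.where(genres.str.contains("Thriller"), "Thriller", "Other"),
    ),
)

# Mean rating for each genre, split by gender.
mean_ratings = data.pivot_table(
    values="rating", index="genre", columns="gender", aggfunc="mean"
)
print(mean_ratings)
```

Cells with no ratings for a genre and gender combination show up as NaN in the pivot table.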


Next I looked at ratings over time. The year each movie came out was not supplied as its own column but as part of the movie title column, so I used regular expressions from the re module to extract the years.
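A minimal sketch of that extraction, assuming the year appears as four digits in parentheses at the end of the title. The pattern, the `extract_year` helper, and the sample titles are my own illustration rather than the exact code from the repository.

```python
import re

# Four digits inside parentheses at the very end of the title,
# e.g. "Toy Story (1995)". Anchoring at the end avoids picking up
# numbers that are part of the title itself.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the list of year strings that re.findall produces for a title."""
    return YEAR_PATTERN.findall(title)


titles = ["Toy Story (1995)", "2001: A Space Odyssey (1968)", "Untitled"]
years = [extract_year(title) for title in titles]
print(years)
```

Note that findall hands back a list for every title, and an empty list when nothing matches.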


The matches come back as a list, so I used the str.join method to convert them to strings that can serve as the index of a pivot table. This result was a bit more interesting, so I made a time series plot with matplotlib. From the graph, we see that movie ratings were volatile at the beginning of the twentieth century, presumably because there were fewer movies coming out at the time. And sadly, movie ratings are getting worse as time goes on…either people’s artistic tastes are becoming more refined or more bad movies are coming out.
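Here is a rough sketch of the join, pivot, and plot steps on made-up ratings. The sample data, the column names, and the `plot_ratings_by_year` helper are assumptions for illustration; the plotting function is defined but not called, so the rest runs even without matplotlib installed.

```python
import re

import pandas as pd

# Made-up ratings; titles carry the year in parentheses.
data = pd.DataFrame({
    "title": ["Toy Story (1995)", "Toy Story (1995)", "Psycho (1960)", "Psycho (1960)"],
    "rating": [5, 4, 4, 3],
})

# re.findall returns a list per title; "".join collapses it into a plain
# string (an empty string when no year is found).
data["year"] = data["title"].apply(
    lambda title: "".join(re.findall(r"\((\d{4})\)\s*$", title))
)

# Mean rating per release year, with the year strings as the index.
mean_by_year = data.pivot_table(values="rating", index="year", aggfunc="mean")["rating"]
print(mean_by_year)


def plot_ratings_by_year(series):
    """Draw yearly mean ratings as a line chart (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it

    ordered = series.copy()
    ordered.index = ordered.index.astype(int)  # numeric x-axis
    ax = ordered.sort_index().plot()
    ax.set_xlabel("Release year")
    ax.set_ylabel("Mean rating")
    plt.show()
    return ax
```

Calling plot_ratings_by_year(mean_by_year) would then draw the time series.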


Anyway, this was a toy example, but I’m excited to see what Python can do for me!  What computing tools are you most thankful to have at your disposal?