Applied statistician’s lab notebook

I’ve been working on the same project on and off for a bit more than a year now.  From the get-go I knew I’d need to document my steps, so I started using a little green spiral notebook to keep track of what I did each day.  Fifteen months later, it’s time to write up the project, and I’m shocked by how sparse and unhelpful the notes I’ve kept turn out to be.  It’s not so hard to find the code you need when you wrote it several weeks ago, but how about the code you wrote six months ago?  And when you find it, how do you use it?  What inputs do you need to supply and what outputs does it spit out?

Unfortunately nobody teaches you how to do research efficiently; I’ve been learning as I go along.  Since starting this project, I’ve learned what doesn’t work for me: naming files by date.  This is a convention I started using when I saw a mentor of mine doing it a while back.  Frankly, I don’t know how he made it work for him.  The problem is pretty obvious: you don’t know what’s in each file until you open it.  I suppose it’s a good practice for version control, if every time you modify a file you save a new copy with the date.  But even then, when you go back and try to find the right code, how do you know which one to choose?  It also results in a lot of duplicated code taking up disk space.  I’ve only found this file naming convention useful when I also summarize the file contents in my spiral notebook.  Unfortunately, I didn’t have enough self-discipline to do that consistently.

What has worked for me so far is keeping a “Methods” subdirectory in my main project directory.  Maybe “Methods” is a misnomer, as the folder also includes presentations for meetings and intermediate results.  In there, it makes sense to date files so there is a chronological workflow.  Again, I wasn’t consistent about keeping the files in this folder up to date, but the notes I did make as I went along have been immensely helpful.

Where to go from here?  I’ve learned a few things along the way:

  • Automate as much as possible.  When you write a script, test it out as is, but then once you’re convinced it works properly, wrap it in a function.  You will inevitably have to rerun your script, maybe on many different datasets, and it’s useful to have it in a function where you only have to change the inputs once.  Along the same lines, try to avoid one-time scripts or running things at the command line.  These moves may be faster at the moment, but they’re not reproducible and will give you a headache later on.
  • Write in your lab notebook consistently.  Self-explanatory.  I wish I’d read this earlier: Noble suggests keeping an electronic “lab notebook” in the same directory as your project.  I like this idea because then you can include plots, bits of code, output, etc. and it is easy to share with others if need be.
  • Comment code profusely.  In Python, it’s good practice to include a “docstring” at the beginning of every function, enclosed in triple quotes (""").  Do the same in any language: describe what your function does, the function’s inputs and outputs, and any packages that are required for it to run.
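For example, a docstring following this convention might look like the toy function below (my own illustration, not code from the project):

```python
def standardize(values):
    """Center and scale a list of numbers.

    Inputs:
        values: a list of numeric values (at least two, not all equal).

    Outputs:
        A list of the same length, shifted and scaled to have mean 0
        and sample standard deviation 1.

    Requires:
        Only the Python standard library (statistics module).
    """
    from statistics import mean, stdev
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```

Six months from now, `help(standardize)` will tell you everything you need to rerun it.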

I think this quote from the linked article sums it up:

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.

Right now, that unfamiliar someone is me.  May the next project go better!

Propensity Score Matching in Python

Update 4/3/2016: I’ve since included a more comprehensive notebook here.  Use this version instead of the code posted below.

I need to run some simulations using propensity score matching, but I haven’t been able to find a Python module that does it.  So, I’ve taken it upon myself to implement it.  This is a serious work in progress, progress being the keyword: I’m definitely the tortoise in this race, slow and steady.

Propensity score matching is a method to match case-control pairs in observational studies (or treated-control pairs in quasi-experimental studies) in order to better estimate the effect of the treatment or exposure on the outcome of interest.  We first estimate the “propensity” of getting assigned to the treatment group given the other covariates measured, then match pairs with similar propensities.  The idea is that the individuals in these pairs have similar levels of other confounding variables, so in a sense we can see the effect of the treatment with the other things held constant.
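Concretely, the propensity model is usually a logistic regression of treatment assignment on the covariates.  Here’s a minimal sketch using scikit-learn; the function name and the toy data are my own illustration, not part of any particular study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensity(X, treatment):
    """Estimate propensity scores P(treated | covariates).

    X: (n, p) array of covariates; treatment: length-n 0/1 array.
    Returns a length-n array of estimated propensities.
    """
    model = LogisticRegression()
    model.fit(X, treatment)
    # probability of the "treated" class for each individual
    return model.predict_proba(X)[:, 1]

# Toy data: one covariate that raises the probability of treatment.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
t = (rng.random(200) < 1 / (1 + np.exp(-x[:, 0]))).astype(int)
scores = estimate_propensity(x, t)
```

Any well-calibrated classifier could stand in for the logistic regression here; it’s just the conventional choice.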

The tricky thing about propensity score matching is that there’s no one good way to do it.  If your estimated propensities are wrong, then you’re screwed right off the bat.  But assuming they’re alright, then how do you pick the “best” case-control pairs?  You could try every possible pairing and minimize the within-pair differences in propensity, but that’s computationally intensive.  What’s typically done is greedy matching, but even then there are a number of factors to decide: in what order do we match cases to controls?  Do we match with or without replacement, where matching with replacement allows one control to be paired with more than one case?  Do we use a caliper to set a maximum allowable difference in propensities, and if so, how do we pick the caliper?

I thought I’d share my IPython notebook for this code so far because I’m really enjoying using it.  Before, I was just using a basic text editor to write code and copying and pasting it into the terminal, but this is much more elegant.  I’ve posted the bit of code that I’ve written so far to GitHub.  I set it up to do greedy matching with no replacement and randomized the order of matching the treatment group to controls, with the default caliper set to 0.05 arbitrarily.  My IPython notebook uses the Dehejia-Wahba sample data from “Evaluating the Econometric Evaluations of Training Programs,” available freely here.
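For a sense of what that setup looks like, here is a condensed sketch of a greedy matcher along those lines — it’s my own simplification, not the notebook’s actual code:

```python
import numpy as np

def greedy_match(treated_scores, control_scores, caliper=0.05, seed=None):
    """Greedily match each treated unit to its nearest unmatched control.

    Treated units are visited in random order; a pair is kept only if the
    absolute difference in propensity scores is within the caliper.
    Matching is without replacement: each control is used at most once.
    Returns a dict mapping treated index -> matched control index.
    """
    rng = np.random.default_rng(seed)
    controls = dict(enumerate(control_scores))  # unmatched controls
    matches = {}
    for i in rng.permutation(len(treated_scores)):
        if not controls:
            break  # ran out of controls
        # nearest remaining control by propensity score
        j = min(controls, key=lambda k: abs(controls[k] - treated_scores[i]))
        if abs(controls[j] - treated_scores[i]) <= caliper:
            matches[i] = j
            del controls[j]  # without replacement: control is used up
    return matches
```

To match with replacement instead, you would drop the `del controls[j]` line; and the 0.05 caliper is just as arbitrary here as it is in my notebook.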