Cognitive dissonance about research reform

An interesting editorial on research practices came out in PLOS Medicine yesterday.  It’s good to hear about reproducibility and the reforms we need to see in science from a fellow statistician, John Ioannidis over at Stanford.  Each discipline has its own quirks and accepted practices, but statistics is a common factor in every study.  I believe we statisticians have a unique perspective on the problem: we get to play the role of data advisor on other people’s studies and of PI on our own.

Ioannidis cites examples of practices that work in several fields, including reproducibility practices, data registration, and stricter statistical analyses.  Then he proposes a new “structure of scientific careers” that doesn’t just favor old men with fancy titles and big grants.  In this framework,

 Resources and power are seen as opportunities, and researchers need to match their output to the opportunities that they have been offered—the more opportunities, the more the expected (replicated and, hopefully, even translated) output. Academic ranks have no value in this model and may even be eliminated: researchers simply have to maintain a non-negative balance of output versus opportunities. In this deliberately provocative scenario, investigators would be loath to obtain grants or become powerful (in the current sense), because this would be seen as a burden.

I got to this part of the article and thought, “Wait, this sounds crazy?”  It almost seems like there would be no incentive to work hard: any award would come with negative consequences, and you’d be punished if your work didn’t produce results.  Isn’t that exactly the kind of perverse incentive research reforms are trying to eliminate?  Maybe a greater emphasis on sharing negative results would get around this problem, but I digress.

After reading this the first time and feeling my knee-jerk disagreement, I took a step back and realized that my negative response was precisely due to my being immersed in the current culture of “publish or perish” and academic hierarchies.  I’m so entrenched in this way of thinking that it’s hard to see other models for scientific careers.  Still, I’m on Ioannidis’s side, and I believe we need to seriously rethink the way research is done in order to get more high-quality results.

Frankly my commentary on the subject is pretty useless because it’s a hard question and I’m no expert.  You should just go read the article here.

Tips for efficiency at the command line

The terminal can be a powerful tool in statistical computing.  I’ve used UNIX a bit before, mostly just to run jobs on remote servers and handle giant files.  I know how to copy files, move them between folders, edit text files in Emacs, and use simple commands like grep and cut along with pipes and filters to sift through files.  I wouldn’t say I’m savvy by any means, but I know enough to get by.  I’m always looking for tips to maximize my speed while minimizing the work I have to do.  This week I learned a few things that I thought were useful for getting more out of the terminal as a tool for research.  In particular, some of the things I learned facilitate reproducibility, which is becoming increasingly important and rare.

Aliasing: There are probably things you type repeatedly at the command line, like your username when logging into remote machines, or a cd command to a frequently used directory.  You can save yourself from typing these things over and over by creating an “alias” for them in the shell.  An alias is essentially a shortcut: you type the short name, and the shell runs the full command in its place.  For instance, I might write something like

alias jobs="top -u kellie"

Then, when I type jobs, I will see the CPU usage, memory usage, and run time for any jobs I have running, without the processes that belong to other users.  (One caveat: jobs is also the name of a shell built-in that lists your background jobs, so a non-clashing name like myjobs may be safer.)  It’s a clever trick that might save me 1 second each time I use it.

Now, that’s nice to have in one session, but what if you want to keep using the same shortcuts every time you open the terminal?  If you just created an alias once and closed the terminal, then it would get discarded.  Every time you open the terminal, the bash shell scans through a file called .bashrc and imports all the variables, aliases and functions within the file.  You can save shortcuts you create by opening the .bashrc file in any text editor and pasting them in.
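
For example, you can append an alias straight from the command line and then reload the file so it takes effect in the current session (the alias and directory here are hypothetical):

echo 'alias proj="cd ~/projects/current"' >> ~/.bashrc
source ~/.bashrc    # re-read the file so the new alias works right away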

Note: I use OS X Mavericks, and by default there is no .bashrc file.  I have had luck doing the same thing with the .bash_profile file instead, since the Terminal app opens login shells, which read .bash_profile rather than .bashrc.  To see if you have a .bashrc file in your home directory, run

cd                        # with no arguments, cd takes you to your home directory
ls -a | grep '^\.bash'    # escape the dot so it isn't treated as a regex wildcard

Saving shell code: This one’s probably a no-brainer for some of you, but I didn’t know you could do it!  You can write your shell commands in a text file and save it with the extension .sh.  You should include #!/bin/bash as the first line of this file.  Then, after making the file executable once with chmod +x myfile.sh, you can run it at the command line by calling ./myfile.sh, where you replace “myfile” with whatever your file is called.
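
As a minimal sketch (the file name and contents here are made up), such a script just wraps a few commands you would otherwise type by hand:

#!/bin/bash
# count_rows.sh: report the number of lines in each csv file
# in the current directory
for f in *.csv
do
    wc -l "$f"
done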

Command history: Related to the previous point.  You can always press the up arrow to bring recently run lines back to the prompt.  What if you ran a command much earlier in the day and want to run it again, but don’t remember the exact syntax?  Your home directory has a file with your command history in it.  By default, mine contains the last 500 lines I’ve run.  You can access that history by typing either history to print it all to the screen or less ~/.bash_history to view it page by page.
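
If you remember a fragment of the command but not the whole thing, you can also filter the history (the search term “cut” here is just an example):

history | grep cut    # list past commands containing "cut"

Pressing Ctrl-R at the prompt starts an incremental search through the same history, which can be even faster.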

Variables in the command line:  While there might be more clever ways to pass arguments to commands (like xargs), it’s easy and intuitive to do it if you store your argument as a variable.  You can store the output of a command by enclosing it in backticks.  Suppose I have a bunch of csv files of the same format and I want to put the first column of each file into one master file.  I would first get the names of all the files I want and store them in a variable called files

files=`ls | grep '\.csv$'`    # escape the dot and anchor the match so only names ending in .csv are kept

Then you can loop over the files and use >> to append them to the master file.  Compared to a long string of pipes, this syntax feels intuitive to me because it mimics the scripting languages I’m more familiar with.

for file in $files
do
    cut -d',' -f1 $file >> masterfile.csv
done

And of course, you can completely automate this by putting it in a .sh file and running it from the command line!  This is really powerful for facilitating reproducibility.  Often, we download files from the web or do easy text processing and file cleanup like this by hand, but then nobody else can retrace our steps.  Sometimes I can’t even reproduce what I did myself.  This is a clean way to keep track of what has been done (not to mention much faster, once you get things running).
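
Putting the pieces together, here’s a sketch of what that .sh file might look like (the file name is made up, and the extra grep -v just keeps the master file itself from being swept up on reruns):

#!/bin/bash
# first_columns.sh: stack the first column of every csv file
# in this directory into one master file
files=`ls | grep '\.csv$' | grep -v masterfile`

for file in $files
do
    cut -d',' -f1 $file >> masterfile.csv
done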

Applied statistician’s lab notebook

I’ve been working on the same project on and off for a bit more than a year now.  From the get-go I knew I’d need to document my steps, so I started using a little green spiral notebook to keep track of what I did each day.  Fifteen months later, it’s time to write up the project, and I’m shocked by the notes I’ve kept (sparse and not helpful).  It’s not so hard to find the code you need when you wrote it several weeks ago, but how about the code you wrote 6 months ago?  And when you find it, how do you use it?  What inputs do you need to supply and what outputs does it spit out?

Unfortunately, nobody teaches you how to do research efficiently; I’ve been learning as I go.  Since starting this project, I’ve learned what doesn’t work for me: naming files by date.  This is a convention I started using when I saw a mentor of mine doing it a while back.  Frankly, I don’t know how he made it work for him.  The problem is pretty obvious: you don’t know what’s in each file until you open it.  I suppose it’s a good practice for version control, if every time you modify a file you save a new copy with the date.  But even then, when you go back and try to find the right code, how do you know which copy to choose?  It also leaves a lot of duplicated code taking up disk space.  I’ve only found this file naming convention useful when I also summarize the file contents in my spiral notebook.  Unfortunately, I didn’t have enough self-discipline to do that consistently.

What has worked for me so far is keeping a “Methods” subdirectory in my main project directory.  Maybe “Methods” is a misnomer, as the folder also includes presentations for meetings and intermediate results.  In there, it makes sense to date files so there is a chronological workflow.  Again, I wasn’t consistent about keeping the files in this folder up to date, but the notes I did make as I went along have been immensely helpful.
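
To give a concrete (entirely made-up) picture, writing dates year-first means an alphabetical listing of the folder is also a chronological one:

Methods/
    2014-03-12_committee-meeting-slides.pdf
    2014-04-02_simulation-results-summary.txt
    2014-05-20_model-comparison-notes.txt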

Where to go from here?  I’ve learned a few things along the way:

  • Automate as much as possible.  When you write a script, test it out as is, but then once you’re convinced it works properly, wrap it in a function.  You will inevitably have to rerun your script, maybe on many different datasets, and it’s useful to have it in a function where you only have to change the inputs once.  Along the same lines, try to avoid one-time scripts or running things at the command line.  These moves may be faster at the moment, but they’re not reproducible and will give you a headache later on.
  • Write in your lab notebook consistently.  Self-explanatory.  I wish I’d read this earlier: Noble suggests keeping an electronic “lab notebook” in the same directory as your project.  I like this idea because then you can include plots, bits of code, output, etc. and it is easy to share with others if need be.
  • Comment code profusely.  In Python, it’s good practice to include a “docstring” at the beginning of every function, enclosed in triple quotes (""").  Do the same in any language: describe what your function does, the function’s inputs and outputs, and any packages that are required for it to run (see the sketch after this list).
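
Here’s what that might look like as a shell function; the function and file names are made up, but the comment header mimics a docstring:

# drop_header: remove the header row from a csv file
#   input:  $1, the path to a csv file
#   output: the file's contents minus the first line, printed to stdout
#   requires: only the standard UNIX tool tail
drop_header() {
    tail -n +2 "$1"
}

Then drop_header mydata.csv > mydata_noheader.csv gives you a headerless copy, and six months from now the header comment tells you exactly what the function expects.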

I think this quote from the linked article sums it up:

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.

Right now, that unfamiliar someone is me.  May the next project go better!