The terminal can be a powerful tool in statistical computing. I’ve used UNIX a bit before, mostly just to run jobs on remote servers and handle giant files. I know how to copy files, move them between folders, edit text files in Emacs, and use simple commands like
cut along with pipes and filters to sift through files. I wouldn’t say I’m savvy by any means, but I know enough to get by. I’m always looking for tips to maximize my speed while minimizing the work I have to do. This week I learned a few things that I thought were useful for getting more out of the terminal as a tool for research. In particular, some of the things I learned facilitate reproducibility, which is becoming increasingly important and rare.
– Aliasing: There are probably things you type repeatedly at the command line, like your username when logging into remote machines or
cd to a certain frequently used directory. You can save time typing these things over and over again by creating an “alias” for them in the terminal. This essentially means you create a shortcut to call this command, instead of writing out the whole thing. For instance, I might write something like
alias jobs="top -u kellie"
Then, when I type
jobs, I will see the CPU usage, memory usage, and run time for any jobs I have running, ignoring the processes that belong to other users. (One caveat: jobs is also the name of a bash builtin, which this alias shadows.) It’s a clever trick that might save me 1 second each time I use it.
Now, that’s nice to have in one session, but what if you want to keep using the same shortcuts every time you open the terminal? If you just created an alias once and closed the terminal, it would be discarded when the session ended. Every time you open a new terminal, the bash shell reads a file called .bashrc and imports all the variables, aliases, and functions within it. You can save the shortcuts you create by opening the .bashrc file in any text editor and pasting them in.
Note: I use OS X Mavericks, and by default there is no .bashrc file. I have had luck doing the same thing with the .bash_profile file instead. To see if you have a .bashrc file in your home directory, run
ls -a ~ | grep '^\.bash'
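To make this concrete, here are a couple of lines one might paste into .bashrc; the alias names and commands are just examples of mine, not anything standard:

```shell
# Example aliases kept in ~/.bashrc -- these load in every new session.
alias ll="ls -lh"     # long listing with human-readable file sizes
alias rmi="rm -i"     # ask for confirmation before deleting anything
```

After editing .bashrc, you can load the changes into your current session with source ~/.bashrc rather than opening a new terminal.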
– Saving shell code: This one’s probably a no-brainer for some of you, but I didn’t know you could do it! You can write your shell scripts in a text file and save it with the extension .sh. You should include
#!/bin/bash in the first line of this file. Then, you can run it at the command line by calling
./myfile.sh, where you replace “myfile” with whatever your file is called. (You may first need to make the file executable by running chmod +x myfile.sh.)
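As a minimal sketch, here is what such a file might look like; the name csv_count.sh and its contents are an invented example:

```shell
#!/bin/bash
# csv_count.sh (a made-up example): report how many .csv files
# are sitting in the current directory.
n=$(ls 2>/dev/null | grep -c '\.csv$')
echo "Found $n csv files here"
```

Make it executable once with chmod +x csv_count.sh, and from then on ./csv_count.sh runs it.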
– Command history: Related to the previous point. You can always press the up arrow to bring recently run lines back to the prompt. But what if you ran a command much earlier in the day and want to run it again, but don’t remember the exact syntax? Your home directory has a file with your command history in it. By default, mine contains the last 500 lines I’ve run. You can access it by typing either
history to print it all to the screen or
less ~/.bash_history to view page by page.
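To find that one half-remembered command, I usually pipe either of those through grep; a sketch (the search term cut is arbitrary, and the || true just keeps the command quiet if the file is missing or nothing matches):

```shell
# Search the saved history file for lines mentioning "cut":
grep 'cut' ~/.bash_history 2>/dev/null || true

# Or search the current session's history the same way:
history | grep 'cut' || true
```

In bash you can also press Ctrl-R at the prompt to search your history interactively as you type.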
– Variables in the command line: While there might be more clever ways to pass arguments to functions (like
xargs), it’s easy and intuitive to do it if you store your argument as a variable. You can store the output of a command by enclosing it in backticks. Suppose I have a bunch of csv files of the same format and I want to put the first column of each file into one master file. I would first get the names of all the files I want and store them in a variable called files:
files=`ls | grep '\.csv$'`
Then you can do something like loop over the files and use
>> to append them to the master file. Compared to something like a string of pipes, this syntax feels intuitive to me because it mimics scripting languages I’m more familiar with.
for file in $files
do
    cut -d',' -f1 $file >> masterfile.csv
done
And of course, you can completely automate this by putting it in a .sh file and running it from the command line! This is really powerful for facilitating reproducibility. Often, we download files from the web or do easy text processing and file cleanup like this by hand, but then nobody else can retrace our steps. Sometimes I can’t even reproduce what I did myself. This is a clean way to keep track of what has been done (not to mention much faster, once you get things running).
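Putting those pieces together, the whole task could be saved as one small script. This is only a sketch: the file names are invented, and the first few lines just create throwaway demo data so the script is self-contained.

```shell
#!/bin/bash
# combine_first_columns.sh (hypothetical): collect the first column of
# every .csv file in the current directory into one master file.

# --- Demo setup, not part of the real task: work in a scratch
# --- directory with two tiny csv files to operate on.
cd "$(mktemp -d)"
printf 'name,age\nada,36\n'   > one.csv
printf 'name,age\ngrace,45\n' > two.csv

# Store the file names in a variable, then loop over them.
files=`ls | grep '\.csv$'`

> masterfile.csv                     # start from an empty master file
for file in $files
do
    cut -d',' -f1 $file >> masterfile.csv
done

cat masterfile.csv
```

One caveat: because $files is split on whitespace, this assumes none of the file names contain spaces.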