Simple random sampling: Not so simple

This was originally posted on the BIDS blog, here.

Simple random sampling is drawing k objects from a group of n in such a way that all possible subsets are equally likely. In practice, it is difficult to draw truly random samples. Instead, people tend to draw samples using

  1. A pseudorandom number generator (PRNG) that produces sequences of bits, plus
  2. A sampling algorithm that maps a sequence of pseudorandom numbers into a subset of the population.
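
As a concrete illustration (my own sketch using Python's standard library, not anything specific to this work): the `random` module supplies step 1 via a Mersenne Twister, and `random.Random.sample` supplies step 2:

```python
import random

# 1. PRNG: a Mersenne Twister, seeded with a fixed value
rng = random.Random(2017)

# 2. Sampling algorithm: map the PRNG's output to a subset of size k = 5
#    drawn from a population of n = 100, without replacement
sample = rng.sample(range(100), 5)
```

Seeding makes the draw reproducible: the same seed always yields the same "random" sample, which is exactly the determinism at the heart of the problem.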

Most people take for granted that this procedure is a good approximation to simple random sampling. If it isn’t, then many statistical results may be called into question: anything that relies on random sampling, including permutation tests, the bootstrap, and Monte Carlo simulation, may give biased results.

This blog post is a preview of what I plan to talk about at the UC Berkeley Risk Management Seminar on Tuesday, February 7. This is joint work with Philip B. Stark and Ronald Rivest.


Finite state space

A PRNG is a deterministic function with several components:

  • A user-supplied seed value used to set the internal state
  • A function that maps the internal state to pseudorandom bits
  • A function that updates the internal state

[Diagram: the components of a PRNG (seed, internal state, update and output functions).]

The internal state of a PRNG is usually stored as an integer or matrix with fixed size. As such, it can only take on finitely many values. PRNGs are periodic: if we generate enough pseudorandom numbers, we will update the internal state so many times that the PRNG will return to its starting state.

This periodicity is a problem. PRNGs are deterministic, so for each value of the internal state, our sampling algorithm of choice will give us exactly one random sample. If the number of samples of size k from a population of n is greater than the size of the PRNG’s state space, then the PRNG cannot possibly generate all samples.

This is a problem for most PRNGs when n and k grow large, even for widely trusted ones like the Mersenne Twister, the default PRNG in most common software packages.
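
To get a feel for the scale of the problem, here is a back-of-the-envelope computation (my own sketch, not the authors' analysis): the smallest population size n at which the number of half-size samples already exceeds the Mersenne Twister's roughly 2^19937 reachable states.

```python
from math import comb

MT_STATE_BITS = 19937          # the Mersenne Twister has ~2**19937 reachable states
limit = 2 ** MT_STATE_BITS

# C(n, n // 2) increases monotonically with n, so binary search for the
# smallest n whose count of half-size subsets exceeds the state space.
lo, hi = 1, 40_000
while lo < hi:
    mid = (lo + hi) // 2
    if comb(mid, mid // 2) > limit:
        hi = mid
    else:
        lo = mid + 1

print(lo)  # a population of just under 20,000 items already suffices
```

In other words, for a population of a few tens of thousands of items, no PRNG with 19937 bits of state can possibly reach every half-size subset, let alone produce them all with equal probability.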


Cryptographically secure PRNGs

One solution is to use PRNGs that have an infinite state space. Cryptographers have worked extensively on this problem, but cryptographically secure PRNGs haven’t yet become mainstream in other fields. They’re a bit slower than the PRNGs in wide use, so they’re typically reserved for applications where security is important. For sampling, though, the bulk of the computational time is spent in the sampling algorithm rather than in the PRNG, so the slowdown matters less.

Hash functions take in a message of arbitrary length and return a hashed value of fixed length (e.g. 256 bits). A cryptographic hash function has additional properties: it is computationally infeasible to invert; it’s difficult to find two inputs that hash to the same output; small changes to the input message produce large, unpredictable changes to the output; and the output bits are uniformly distributed. These properties make hash functions well suited to generating pseudorandom numbers. The diagram gives a cartoon of how a hash function operates on a message x to output a hashed value h(x).

[Diagram: a hash function maps a message x of arbitrary length to a fixed-length output h(x).]

We are developing plug-in PRNGs based on the SHA256 hash function for R and Python. The Python package is in development on GitHub. The code is currently only prototyped in Python, but watch our repository for a sped-up C implementation.
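
The basic construction can be sketched in a few lines. This is a minimal, hypothetical illustration of a counter-based hash PRNG, not the interface of our actual package: hash the seed together with a counter, and increment the counter for each new batch of bits.

```python
import hashlib

class HashPRNG:
    """Minimal sketch of a counter-based PRNG built on SHA-256.

    Hypothetical illustration only (not our package's actual API).
    """

    def __init__(self, seed):
        self.seed = str(seed).encode()
        self.counter = 0  # incrementing the counter plays the role of the state update

    def next_256_bits(self):
        """Hash (seed, counter), bump the counter, and return 256 bits as an int."""
        digest = hashlib.sha256(self.seed + b"," + str(self.counter).encode()).digest()
        self.counter += 1
        return int.from_bytes(digest, "big")
```

Because the counter can grow without bound, the state is not confined to a fixed number of bits, which is precisely the property we want for sampling.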


Tests for pseudorandomness

Traditional PRNGs cannot generate all possible samples when n and k grow large, but how do they perform for small or moderate n and k? I would argue that if we’re using PRNGs for statistical methods, we should judge their performance by how well they generate simple random samples. We are actively working on testing PRNGs against this criterion and hope to have a paper out later this year. Stay tuned!


Applied statistician’s lab notebook

I’ve been working on the same project on and off for a bit more than a year now.  From the get-go I knew I’d need to document my steps, so I started using a little green spiral notebook to keep track of what I did each day.  15 months later, it’s time to write up the project and I’m shocked by the notes I’ve kept: sparse and unhelpful.  It’s not so hard to find the code you need when you wrote it several weeks ago, but how about the code you wrote 6 months ago?  And when you find it, how do you use it?  What inputs do you need to supply, and what outputs does it spit out?

Unfortunately nobody teaches you how to research efficiently; I’ve been learning as I go along.  Since starting this project, I’ve learned what doesn’t work for me: naming files by date.  This is a convention I started using when I saw a mentor of mine doing it a while back.  Frankly, I don’t know how he made it work for him.  The problem is pretty obvious: you don’t know what’s in each file until you open it.  I suppose it’s a reasonable practice for version control, if every time you modify a file you save a new copy with the date.  But even then, when you go back and try to find the right code, how do you know which one to choose?  It also results in a lot of duplicated code taking up space on your computer.  I’ve only found this file naming convention useful when I also summarize the file contents in my spiral notebook.  Unfortunately, I didn’t have enough self-discipline to do that consistently.

What has worked for me so far is keeping a “Methods” subdirectory in my main directory for the project.  Maybe “Methods” is an inappropriate name, as my folder includes presentations for meetings and intermediate results.  In there, it makes sense to date files so there is a chronological work flow.  Again, I wasn’t consistent about keeping the files in this folder up to date, but the notes I did make as I went along have been immensely helpful.

Where to go from here?  I’ve learned a few things along the way:

  • Automate as much as possible.  When you write a script, test it out as is, but then once you’re convinced it works properly, wrap it in a function.  You will inevitably have to rerun your script, maybe on many different datasets, and it’s useful to have it in a function where you only have to change the inputs once.  Along the same lines, try to avoid one-time scripts or running things at the command line.  These moves may be faster at the moment, but they’re not reproducible and will give you a headache later on.
  • Write in your lab notebook consistently.  Self-explanatory.  I wish I’d read this earlier: Noble suggests keeping an electronic “lab notebook” in the same directory as your project.  I like this idea because then you can include plots, bits of code, output, etc. and it is easy to share with others if need be.
  • Comment code profusely.  In Python, it’s good practice to include a “docstring” at the beginning of every function, enclosed in triple quotes (""").  Do the same in any language: describe what your function does, the function’s inputs and outputs, and any packages that are required for it to run.
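
For instance, a docstring along these lines (a made-up example function, not code from my project) covers all three points:

```python
import statistics


def trimmed_mean(values, trim_fraction=0.1):
    """Compute the mean of `values` after trimming extremes from both ends.

    Inputs:
        values: a list of numbers.
        trim_fraction: fraction (between 0 and 0.5) to drop from each end.

    Output:
        The mean of the values that remain after trimming.

    Requires: the standard-library `statistics` module.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)
    kept = ordered[cut:len(ordered) - cut] if cut > 0 else ordered
    return statistics.mean(kept)
```

Six months from now, `help(trimmed_mean)` answers the "how do I use it?" question without any digging through old code.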

I think this quote from the linked article sums it up:

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.

Right now, that unfamiliar someone is me.  May the next project go better!