Simple random sampling: Not so simple

This was originally posted on the BIDS blog, here.

Simple random sampling is drawing k objects from a group of n in such a way that all possible subsets are equally likely. In practice, it is difficult to draw truly random samples. Instead, people tend to draw samples using

  1. A pseudorandom number generator (PRNG) that produces sequences of bits, plus
  2. A sampling algorithm that maps a sequence of pseudorandom numbers into a subset of the population.

Most people take for granted that this procedure is a sufficient approximation to simple random sampling. If it isn’t, then many statistical results may be called into question: anything that relies on sampling, including permutations, bootstrapping, and Monte Carlo simulations, may give biased results.

This blog post is a preview of what I plan to talk about at the UC Berkeley Risk Management Seminar on Tuesday, February 7. This is joint work with Philip B. Stark and Ronald Rivest.


Finite state space

A PRNG is a deterministic function with several components:

  • A user-supplied seed value used to set the internal state
  • A function that maps the internal state to pseudorandom bits
  • A function that updates the internal state

image1.pngThe internal state of a PRNG is usually stored as an integer or matrix with fixed size. As such, it can only take on finitely many values. PRNGs are periodic: if we generate enough pseudorandom numbers, we will update the internal state so many times that the PRNG will return to its starting state.

This periodicity is a problem. PRNGs are deterministic, so for each value of the internal state, our sampling algorithm of choice will give us exactly one random sample. If the number of samples of size k from a population of n is greater than the size of the PRNG’s state space, then the PRNG cannot possibly generate all samples.

This will certainly be a problem for most PRNGs when n and k grow large, even for those like the Mersenne Twister, which is widely accepted and used as the default PRNG in most common software packages.


Cryptographically secure PRNGs

One solution is to use PRNGs that have an infinite state space. Cryptographers have worked extensively on this problem, but cryptographically secure PRNGs haven’t yet become mainstream in other fields. They’re a bit slower than the PRNGs in wide use, so they’re typically reserved for applications where security is important. For the purpose of sampling, the bulk of the computational time will be spent in the sampling algorithm and not in the PRNG, so we are less concerned.

Hash functions take in a message of arbitrary length and return a hashed value of fixed length (e.g. 256 bits). A cryptographic hash function has the additional properties that it is computationally infeasible to invert in polynomial tim; it’s difficult to find two inputs that hash to the same output; small changes to the input message produce large, unpredictable changes to the output; and the output bits are uniformly distributed. These properties are amenable to generating pseudorandom numbers. The diagram gives a cartoon of how a hash function operates on a message x to output a hashed value h(x).

image2.pngWe are developing plug-in PRNGs based on the SHA256 hash function for R and Python. The Python package is in development on GitHub. The code is currently only prototyped in Python, but watch our repository for a sped up C implementation.


Tests for pseudorandomness

Generating pseudorandom numbers with traditional PRNGs is a problem when n and k grow large, but how do they perform for small or moderate n and k? I would argue that if we’re using PRNGs for statistical methods, we should judge their performance by how well they can generate simple random samples. We are actively working on testing PRNGs for this goal and hope to have a paper out later this year. Stay tuned!


Propensity score matching in Python, revisited

Update 8/11/2017: I’ve been working on turning this code into a package people can download and contribute to. Please use the package, linked here, instead of the code I shared in a Jupyter notebook previously.

I can’t believe how many people from all around the world visit my previous blog post on propensity score matching in Python every day.  It feels great to know that my code is out there and people are actually using it.  However, I realized that the notebook I link to previously doesn’t contain much and that I wrote heaps more code after posting it.  Hence, I’m sharing a more complete notebook with code for different variations on propensity score matching, functions to compute average treatment effects and get standard errors, and check for balance between matched groups.

Things I’ve learned from my students

This summer I’m the TA for an undergraduate intro stats class for non-majors.  It’s been an enlightening process.  I’m learning a lot about teaching obviously, as it’s my first time actually running discussion sections and being fully in charge.  I’m glad to be doing an intro course because I’m really nailing down the basic things I need to know as a statistician, and I don’t need to worry about what I’m presenting so much as how I’m presenting it.  Interacting with the students is the most rewarding part of the experience, and observing them has been an exposition of the best and worst habits that people have.  A few things I’ve noted:

 Top students put in the time.  Even the naturally smart students who do well on tests don’t score as high as the ones who come to class consistently, every day. Those who put in the time get the most out of the class. The takeaway is that even when things seem easy, you can’t get cocky and stop paying attention; there is always more to learn.

How to hack the brainstem.  I love this idea.  I stole it from my advisor, Philip, who once said that the best way to convey statistical ideas to non-statisticians is to “hack the brainstem using metaphors.”  I have to teach things that I’ve internalized and taken for granted for so many years.  It’s really tough to boil things down to their essential parts when I see how topics connect and relate to each other, but the students just do not.  The metaphor I like relates chance variability in a random sample to measuring the concentration of a chemical solution: if your solution is well-mixed, it doesn’t matter if the drop you sample comes from a test tube or a gallon jug.  It should always have the same concentration.  Similarly, if you take a random sample of people, it doesn’t matter whether the population they came from is 1,000 or 1,000,000 people; your estimate of the average/percentage of some characteristic based on the sample will have the same accuracy.

Another version I like is hacking the brainstem using hyperboles.  To illustrate confusing concepts, take them to their limit: what happens if you flip a coin 1,000 times, then 10,000 times, then 100,000 times? What happens with non-response bias if all the unhealthy people in your sample die before the survey?

– Kindness makes a difference.  There are a handful of students who say hello, goodbye, and thank you to me every day.  I’m sure they do it without even thinking twice, but I definitely notice.  It’s nice to feel appreciated, and it’s a reminder to show others how much I appreciate their help and support regularly.

– My generation is spoiled by technology.  I look around when the students are working on exercises and I see them all on their phones.  Some have headphones on, some are clearly texting, while the rest are actually using tiny touch screen calculators to solve the problems.  Yes, it’s convenient, but wouldn’t they benefit more from actually talking to each other about the material instead of isolating themselves on their phones?  I’m guilty of this myself.  Sometimes, you just need to shut it off.

I’ve noticed that students overuse technology in other ways.  They think it’s okay to email pictures of their problem sets instead of turning in a hard copy, without even writing a message of explanation.  As if messy math scribbles on paper weren’t hard enough to grade, now the end product is one step removed and only visible on a screen, and it comes without acknowledgement that the student is bending the class rules.  It’s as though the availability of email makes people forget basic politeness.  I hope not to be like that; I try to be cordial, straightforward, and clear in my electronic communications.

There’s only so much I can do.  And that’s okay.  No matter how much time and effort I put into the class, some students are still going to fail.  It’s not a reflection of me as a TA if the student doesn’t hold up their end of the deal.  Are you emailing me asking for extra help?  You better come to office hours.  Are you asking me to return your graded homework?  Then you better show up to class and pick it up.  The failures of others aren’t a reflection of my work, and I shouldn’t take any of it personally.