data analysis – Stat Moments

The food world is buzzing about the study that came out last week claiming that organic foods contain higher levels of antioxidants and lower pesticide residues than non-organic foods. It has gotten so much media attention that I just can’t not comment. I don’t necessarily think it’s nonsense, but I am skeptical of its conclusions.

The study is a meta-analysis of previous papers on the topic. The authors read through 300+ studies of organic crops and aggregated data from them. The idea is that with a bigger sample size and more data, we should have more power to detect a difference between the compounds in organic and non-organic foods. Frankly, I didn’t read the paper in great detail; I’m generally mistrustful of meta-analyses. In their essay “Statistical Assumptions as Empirical Commitments”, Berk and Freedman criticize meta-analyses, first on the grounds that it doesn’t necessarily make sense to assume treatment (organic farming practices, in this case) should have the same effect across all studies:

“If we seek to combine studies with different kinds of outcome measures (earnings, weeks worked, time to first job), standardization seems helpful. And yet, why are standardized effects constant across these different measures? Is there really one underlying construct being measured, constant across studies, except for scale? We find no satisfactory answers to these critical questions.”

The studies used in the meta-analysis were done in countries all across Europe. Certainly there are regulations about what can be considered organic, but there’s no telling how differently individual farms handled their crops, or how differently the outcomes were measured from study to study.

Furthermore, a successful meta-analysis relies on the assumptions of random sampling and statistical independence. Since the “units of analysis” are research studies, these assumptions hardly make sense. The studies clearly are not sampled randomly; the authors carefully read through hundreds of papers to find data and chose the ones that met certain requirements. The assumption of statistical independence is even less justified. Berk and Freedman raise an interesting point about the human side of research, which explains why the studies simply cannot be independent:

“The assumed independence of studies is worth a little more attention. Investigators are trained in similar ways, read the same papers, talk to one another, write proposals for funding to the same agencies, and publish the findings after peer review. Earlier studies beget later studies, just as each generation of Ph.D. students trains the next. After the first few million dollars are committed, granting agencies develop agendas of their own, which investigators learn to accommodate. Meta-analytic summaries of past work further channel the effort. There is, in short, a web of social dependence inherent in all scientific research. Does social dependence compromise statistical independence? Only if you think that investigators’ expectations, attitudes, preferences, and motivations affect the written word – and never forget those peer reviewers.”
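To make the independence point concrete, here is a toy simulation. It is only a sketch and comes from neither the essay nor the organic-food paper. The number of studies `k`, the correlation `rho`, and the unit variance of each study estimate are all made-up assumptions. The idea is to compare the standard error you would report if you treated the studies as independent with how much the pooled average actually varies when the studies share a common component.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 20        # hypothetical number of studies being pooled
rho = 0.3     # assumed correlation between any two study estimates
reps = 20000  # number of simulated meta-analyses

# Each study estimate = shared component + study-specific noise.
# This gives every estimate variance 1 and pairwise correlation rho.
shared = rng.standard_normal((reps, 1))
own = rng.standard_normal((reps, k))
estimates = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own

# Pool by simple averaging (the true effect is zero here).
pooled = estimates.mean(axis=1)

naive_se = 1 / np.sqrt(k)                      # what independence would imply
actual_sd = pooled.std(ddof=1)                 # what the simulation shows
theory_sd = np.sqrt((1 + (k - 1) * rho) / k)   # equicorrelated-case formula

print(f"naive SE:  {naive_se:.3f}")
print(f"actual SD: {actual_sd:.3f}")
print(f"theory SD: {theory_sd:.3f}")
```

Under these assumptions the pooled estimate is far noisier than the naive standard error suggests, so a confidence interval built on independence would be much too narrow.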

And here’s the kicker: the study was funded by an organization that funds research in support of organic farming practices. They state at the end of the paper that the “design and management” weren’t influenced by the funding organization, but it’s not difficult to imagine biases in how the proposal and research questions were formulated from the get-go.

It’s going to take more than a meta-analysis to get me to go organic.

I’m embarking on a new research project that involves a substantial amount of coding. I’ve primarily used R in the past, but Python seems to be gaining users in the statistics community lately. For this project, I’m making the switch to Python. I know the fundamentals of Python, but I’ve never used the numpy, scipy, and pandas packages that will be crucial for my work. As I go along, I’m going to use this blog to document some of the things I learn that I think are cool, so that 1) I practice coding and making plots, and more importantly, 2) I make these tools available to others who aren’t familiar with what Python can do.

For reference, I’m going to be using Python for Data Analysis, by Wes McKinney. You can find a pdf version online without searching too hard. I will put my code (and data, depending on the source) in a Github repository for the sake of sharing.

I played with one of the introductory examples in the book, the GroupLens movie ratings data, and put the code I wrote in a GitHub repository here for anyone to download. I downloaded the MovieLens 1M file, opened IPython (you could do this in any Python interface you’d like, though), imported the pandas package, and loaded the data. The first few rows of the data look like this:
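A minimal sketch of the loading step, assuming the `::`-separated layout of the MovieLens 1M files and the column names used in the book's example. The rows below are hand-typed stand-ins so the snippet runs on its own; they are not the real data, and `read_dat` is just a hypothetical helper name.

```python
import io
import pandas as pd

# Hand-typed stand-ins for users.dat, ratings.dat, and movies.dat.
users_txt = "1::F::1::10::48067\n2::M::56::16::70072\n"
ratings_txt = (
    "1::1::5::978824268\n"
    "1::2::3::978302268\n"
    "2::1::4::978299026\n"
    "2::3::2::978298709\n"
)
movies_txt = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
    "3::Heat (1995)::Action|Crime|Thriller\n"
)


def read_dat(source, names):
    """Read a '::'-separated file with no header row (hypothetical helper)."""
    # A multi-character separator requires the slower Python parsing engine.
    return pd.read_csv(source, sep="::", header=None, names=names, engine="python")


users = read_dat(io.StringIO(users_txt),
                 ["user_id", "gender", "age", "occupation", "zip"])
ratings = read_dat(io.StringIO(ratings_txt),
                   ["user_id", "movie_id", "rating", "timestamp"])
movies = read_dat(io.StringIO(movies_txt),
                  ["movie_id", "title", "genres"])

print(movies.head())
```

With the real download, the `io.StringIO(...)` arguments would be replaced by the paths to the three `.dat` files.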

I merged the tables into one and created a new column indicating a single genre for each movie. I chose to look at Comedy, Documentary, and Thriller (I assumed there would be little to no overlap in the categories) and lumped everything else into Other, using the np.where function (very handy). Then I made a pivot table of the average rating for each genre, split by gender. There was not much of a difference by gender or between genres. Lesson #1: there’s not always something interesting in the data.
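Here is roughly what the genre column and pivot table could look like. This is a sketch on a tiny made-up table, not the code from the repository: the column names (`genres`, `gender`, `rating`) and the nesting order of the `np.where` calls are assumptions, and when a movie lists more than one of the three genres, the first match wins.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the merged ratings table.
data = pd.DataFrame({
    "genres": [
        "Animation|Children's|Comedy",
        "Animation|Children's|Comedy",
        "Action|Crime|Thriller",
        "Action|Crime|Thriller",
        "Documentary",
        "Drama",
        "Comedy",
    ],
    "gender": ["F", "M", "F", "M", "F", "M", "F"],
    "rating": [5, 4, 3, 4, 4, 2, 3],
})

# Nested np.where calls act like if / elif / else over the whole column.
is_comedy = data["genres"].str.contains("Comedy")
is_doc = data["genres"].str.contains("Documentary")
is_thriller = data["genres"].str.contains("Thriller")
data["genre"] = np.where(
    is_comedy, "Comedy",
    np.where(is_doc, "Documentary",
             np.where(is_thriller, "Thriller", "Other")),
)

# Average rating for each genre, with one column per gender.
mean_ratings = data.pivot_table(
    values="rating", index="genre", columns="gender", aggfunc="mean"
)
print(mean_ratings)
```

Genre and gender combinations with no ratings show up as `NaN` in the pivot table.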

Next I looked at ratings over time. The year that the movies came out was not supplied as a column but as a part of the movie title column, so I used regular expressions from the re module to extract them.
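A sketch of the extraction, assuming titles end with a four-digit year in parentheses. The pattern is my guess at a reasonable regular expression, not necessarily the one in the repository, and the titles are typed in by hand.

```python
import re
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Heat (1995)",
    "Metropolis (1927)",
    "2001: A Space Odyssey (1968)",
])

# Capture exactly four digits wrapped in literal parentheses.
year_pattern = re.compile(r"\((\d{4})\)")

# findall returns a list of every captured group it finds in the title.
year_lists = titles.apply(lambda title: year_pattern.findall(title))
print(year_lists)
```

Requiring the parentheses keeps a title such as "2001: A Space Odyssey (1968)" from matching on the digits in its name.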

They show up as a list, so I used the str.join method to convert them to strings that can serve as the index for a pivot table. This result was a bit more interesting, so I made a time series plot with matplotlib. From the graph, we see that movie ratings were volatile at the beginning of the twentieth century, presumably because there were fewer movies coming out at the time. And sadly, movie ratings are getting worse as time goes on…either people’s artistic tastes are becoming more refined or more bad movies are coming out.
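The join-then-pivot step might look something like this. Again it is a sketch on made-up numbers, with `year_list` and `rating` as assumed column names.

```python
import pandas as pd

# Made-up ratings, each with the one-element list a regex findall would return.
data = pd.DataFrame({
    "year_list": [["1927"], ["1968"], ["1995"], ["1995"]],
    "rating": [4, 5, 3, 4],
})

# "".join collapses each one-element list into a plain string.
data["year"] = data["year_list"].apply(lambda years: "".join(years))

# Average rating per release year; the year strings become the index.
by_year = data.pivot_table(values="rating", index="year", aggfunc="mean")
print(by_year)
```

From there, `by_year["rating"].plot()` draws the time series, provided matplotlib is installed.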

Anyway, this was a toy example, but I’m excited to see what Python can do for me! What computing tools are you most thankful to have at your disposal?


Does social dependence compromise statistical independence? Why I still eat conventionally grown foods

Learning a new language
