Libraries are for Use

Demonstrating the value of librarianship

Problems analyzing LIS data

The librarianship profession has had a very uneasy relationship with statistical analysis. While there is clearly a need to learn and understand the basics (and more), most librarians and students preparing to become practitioners have an intense aversion to the very idea of statistics. Most programs require a research methods course for completion of a standard MLS degree, but fewer require graduate-level statistics. Amy Van Epps provides a very nice overview of the kinds of statistical analyses needed in librarianship (Van Epps, 2012). She also discusses the research that has been conducted on the use of statistical analyses in the LIS literature, as well as the unmet need for statistical training in MLS programs.

While I have taken several basic and intermediate statistics courses, and have done my share of analysis for both clinical trials and librarianship, something about my education has bothered me. The courses covered the analysis of continuous and categorical data fairly well. They covered the normal (Gaussian) distribution very well, as well as the t-test, the p-value, and the chi-square test.

But there was precious little on analyzing count data. And in librarianship, a lot of data are simply counts: reference questions, book use, journal use, citations, attendees at workshops, Twitter posts, students who graduate, stay in school, or drop out. These data are not normally distributed. They are severely skewed, usually to the right (think “long tail”). And the variance (the square of the standard deviation) is much greater than the mean, because of the long tail: all those very high counts. How the heck are you supposed to analyze that?
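To make the "variance much greater than the mean" point concrete, here is a minimal sketch in Python. The circulation counts are invented for illustration and do not come from any real collection; the function name `describe_counts` is likewise just a placeholder. It compares the mean with the median (a rough signal of right skew) and divides the variance by the mean (a ratio well above 1 suggests the long tail described above).

```python
import statistics

# Hypothetical circulation counts for 20 titles (invented for illustration).
# Most titles are never or rarely used; a few are used heavily.
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 8, 13, 27, 64]


def describe_counts(values):
    """Return simple summary statistics for a list of counts."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    variance = statistics.variance(values)  # sample variance (n - 1 denominator)
    return {
        "mean": mean,
        "median": median,
        "variance": variance,
        # Ratio of variance to mean; values far above 1 indicate
        # much more spread than the mean alone would suggest.
        "dispersion_ratio": variance / mean,
    }


summary = describe_counts(counts)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With data shaped like this, the mean sits well above the median and the variance dwarfs the mean, which is the pattern that makes tests built around the bell curve a poor fit.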

Then I found this article on probability distributions in LIS by Stephen J. Bensman, Original Cataloger at Louisiana State University (Bensman, 2000). Bensman struggled with the same “extremely difficult problem”, and wrote this article to “connect the information science laws with the probability distributions, on which statistics are based, in some easily understandable manner, as an aid to persons conducting statistical investigations of the problems afflicting libraries.” The article also provides some very good historiography of the development of modern statistical analysis, including the identification and analysis of these skewed distributions.

Inferential statistical analysis (which Van Epps does a very good job of describing) is essentially a comparison of the distribution of the data you collected with a theoretical distribution, that is, what the data are expected to look like. The expected distribution is essentially your null hypothesis. Your primary hypothesis is how you think your data are actually distributed, based on some difference that you either suspect is there or that you planned to be there (as in an experiment). An example of the former is that you believe that print books are not circulating as much as they used to. So you compare the distribution of book circulations from the past year with that of 5 years ago (the expected). An example of the latter is that you want to see the effects of implementing your discovery system on database usage. So you compare a measure of usage (that’s a whole ‘nother story) post-implementation with that of pre-implementation (the expected).

The problem with this method is that the statistical tests most of us learn are based on certain assumptions:

  1. The theoretical distribution is Gaussian, the “normal” bell curve.
  2. The mean and the variance are independent of each other.  This means that there is no relationship between the average of the theoretical distribution and how much it varies.
  3. The sources of variation (the reasons why usage varies) are additive: they don’t compound or build on each other, which would effectively multiply their effects.
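The third assumption can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of real library usage: each simulated title is affected by eight hypothetical influences, each drawn at random between 0.5 and 1.5, and the function name, the number of influences, and the seed are all arbitrary choices. When the influences are added together the result is roughly symmetric; when they are multiplied together, a long right tail appears.

```python
import math
import random
import statistics


def simulate(n_titles=5000, n_factors=8, seed=42):
    """Combine the same random influences additively and multiplicatively."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    additive, multiplicative = [], []
    for _ in range(n_titles):
        # Hypothetical influences on one title's usage.
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        additive.append(sum(factors))              # influences add up
        multiplicative.append(math.prod(factors))  # influences compound
    return additive, multiplicative


additive, multiplicative = simulate()

for label, values in (("additive", additive), ("multiplicative", multiplicative)):
    print(
        f"{label}: mean={statistics.mean(values):.3f} "
        f"median={statistics.median(values):.3f} "
        f"max={max(values):.3f}"
    )
```

In the additive case the mean and median land close together, as a bell curve would predict. In the multiplicative case the mean is pulled well above the median by a handful of very large values.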

The problem is, of course, that the theoretical distribution of most count data flies in the face of these assumptions. And this is why I believe that statistical analysis courses should be revised to spend much less time on the normal distribution and the analysis of continuous data, and more time on the analysis of categorical data and counts. There are precious few studies in LIS (indeed, in much of the social sciences) that use continuous data whose theoretical distribution is normal. It has taken me years to understand this, and I very much appreciate Bensman’s article, which enlightened me on the efforts to address this “extremely difficult problem.”

BTW: Bensman recommends dealing with this problem by transforming (or converting) the data to logarithms, which often results in a normal distribution of the logarithms (the original data are then called “lognormal”). The problem with this solution, however, is that you can’t calculate the logarithm of zero (0). Now, if you have just a few zeros, you can simply add a very small value (e.g. 0.01) to all of your data. But in the usage data that I analyze, titles with zero usage dominate the data set. Most titles have not been used. Now, how do you deal with that? I’ve signed up for a commercial online course in modeling count data. It is costing me and the UNT Libraries $500 of precious travel money, but I hope to have some more answers after I finish that course.
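The zero problem can be shown in a few lines of Python. This is only a sketch of the difficulty, not a solution: the usage counts are invented, the function names are placeholders, and the offset of 0.01 is the example value mentioned above. It confirms that the logarithm of zero fails, and that after adding the offset every unused title collapses onto the same transformed value, so the pile-up at zero is still there.

```python
import math
from collections import Counter

# Hypothetical usage counts: 14 of 20 titles were never used (invented data).
usage = [0] * 14 + [1, 1, 2, 3, 7, 25]


def log_of_zero_fails():
    """Return True if taking the logarithm of zero raises an error."""
    try:
        math.log(0)
    except ValueError:
        return True
    return False


def log_with_offset(values, offset=0.01):
    """Add a small constant to every count, then take the natural log."""
    return [math.log(v + offset) for v in values]


transformed = log_with_offset(usage)

# Find the most frequent transformed value and the share of titles that have it.
floor_value, floor_count = Counter(transformed).most_common(1)[0]
share_at_floor = floor_count / len(usage)

print("log(0) raises an error:", log_of_zero_fails())
print(f"share of titles stuck at log(0.01): {share_at_floor:.0%}")
```

When zeros are rare, that spike is negligible. When zeros dominate, as in the data set described above, the transformed values cannot resemble a bell curve no matter which small offset is chosen.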


Bensman, S. J. (2000). Probability distributions in library and information science: A historical and practitioner viewpoint. Journal of the American Society for Information Science, 51(9), 816-833. doi:10.1002/(SICI)1097-4571(2000)51:9<816::AID-ASI50>3.0.CO;2-6.
Van Epps, A. S. (2012). Librarians and statistics: Thoughts on a tentative relationship. Practical Academic Librarianship: The International Journal of the SLA, 2(1), 25-25.



This entry was posted on April 6, 2014 by in Assessment, LIS Data, LIS Education, LIS Research, Statistical Analysis.
