After nearly a year, the Library Statistics System -- an inventive name that I thought up all by myself -- has finally reached "1.0" status. This means that the first major iteration of the system is complete, with all planned features having been implemented and all known bugs fixed. The system has been ingesting statistics since last summer, and as of this writing, exactly 4,113 data points have been entered into it.
This blog post is intended to provide some background on the system: where it came from, what motivated it, and how it works.
Before the system
Before the system, library statistics were collected manually and saved in a fiscal-yearly Excel spreadsheet. In general, each spreadsheet collected more detail than the last. Statistics were stored in a way that was not really amenable to automated analysis. With new requirements for more fine-grained data that had never been collected before, there was a need for an alternative to the "old" workflow.
Work on the system started last May and entailed a from-scratch attempt at designing a new way to store library statistics -- specifically our statistics, which are a unique creature. The design of the system carries with it certain tradeoffs, as all designs do. Overall, it was informed by these primary goals:
- Be able to store all current and past statistics without a lot of kludging.
- Be easy to use for data entry.
- Be able to generate useful chart-based reports.
- Be able to generate useful table-based reports.
- Be able to provide basic audit trails.
- Support data integrity.
- Be able to provide contextual information.
- Be able to accommodate future schema changes without breaking any of the above.
Below, I will talk about each goal in more detail.
Be able to store all past statistics
The Uniform Statistics spreadsheets contain most of our historical statistical information. The problem is that they are actually not very uniform, because their schema -- the tables that appear and the particular rows and columns in the tables -- differs quite a lot from year to year. That makes comparisons across time difficult. It is the same reason why, for example, the Bureau of Labor Statistics still reports stuff like "non-farm payrolls" even though the distinction doesn't make very much sense in a post-agrarian society; it's done to retain comparability with past stats.
All statistics in the system reside at the intersection of a "department," a "category," and a point in time. A "department" refers to some conceptual entity or subdivision within the libraries. Circulation is a department; "Online" is a "department" that doesn't fit anywhere else. The concept can be a little rough in places, but it basically works.
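To make the "intersection" idea concrete, here is a minimal Python sketch. The names (`DataPoint`, `total`), the "Checkouts" category, and all of the values are hypothetical; the real system is backed by a database and its actual schema is not shown in this post.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class DataPoint:
    """One statistic: a value at a department/category/time intersection."""
    department: str
    category: str
    period: date  # assumption: the first day of the month being reported
    value: int


def total(points: Iterable[DataPoint],
          department: Optional[str] = None,
          category: Optional[str] = None,
          period: Optional[date] = None) -> int:
    """Sum the values of all points matching the given coordinates.

    A coordinate left as None matches anything.
    """
    return sum(
        p.value for p in points
        if (department is None or p.department == department)
        and (category is None or p.category == category)
        and (period is None or p.period == period)
    )


# Made-up sample data, for illustration only.
points = [
    DataPoint("Circulation", "Checkouts", date(2011, 7, 1), 120),
    DataPoint("Circulation", "Checkouts", date(2011, 8, 1), 95),
    DataPoint("Online", "Checkouts", date(2011, 7, 1), 40),
]
```

Fixing one coordinate and leaving the others open, as in `total(points, department="Circulation")`, is the kind of query that the charts and tables described later are built on.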
The department and category lists are ad hoc taxonomies. There is one main department -- "University Libraries" -- of which all other departments are "children," or subdepartments. Some have children of their own, and some don't. All except "University Libraries" have at least one parent. (I call the departments and categories "nodes," although a librarian might think of them as "terms.") The taxonomy provides a few really important features:
- Conceptual grouping: If the department taxonomy were just a flat list of terms, it would be difficult to understand the relationships between the departments, and even more difficult to enter and query the statistics.
- Cascadable operability: This is another term that I thought up all by myself. It means that, for example, we can easily get the sum of all of a node's child nodes. We can enter data on as fine-grained a basis as we have, yet still see that data added into the results for a broader query.
- Adaptability: I will talk about this more in "Be able to accommodate future schema changes" below.
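The "cascadable" part of the taxonomy can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each node has a single parent, values are keyed by node name, and the subdepartments "Main Desk" and "Self-Check" are invented for the example. None of the names below come from the actual system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A term in a taxonomy (a department or a category)."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def add(self, name: str) -> "Node":
        """Create a child node under this one and return it."""
        child = Node(name)
        self.children.append(child)
        return child

    def descendants(self) -> List["Node"]:
        """Return this node followed by every node beneath it."""
        found = [self]
        for child in self.children:
            found.extend(child.descendants())
        return found


def cascaded_total(node: Node, values: Dict[str, int]) -> int:
    """Sum the values entered for a node and for all of its descendants."""
    return sum(values.get(n.name, 0) for n in node.descendants())


root = Node("University Libraries")
circulation = root.add("Circulation")
desk = circulation.add("Main Desk")          # hypothetical subdepartment
self_check = circulation.add("Self-Check")   # hypothetical subdepartment
root.add("Online")

# Values entered at the finest level available (made-up numbers).
entered = {"Main Desk": 70, "Self-Check": 30, "Online": 25}
```

Data entered against "Main Desk" and "Self-Check" still shows up in `cascaded_total(circulation, entered)` and again in the total for "University Libraries", which is the behavior described above.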
Why departments and categories? What is the significance? The explanation requires a brief diversion. Albert Einstein was the first to realize that space and time are related, and actually part of a whole, which scientists who were much smarter than I am but no more clever about naming things called "spacetime." We live in a four-dimensional spacetime with three spatial dimensions and one time dimension, as far as we know. The three spatial dimensions are easy to understand: one is a line; two are a plane; three are a volume. The time dimension is likewise easy to understand. Without a time dimension, everything would be frozen, ageless, and not even aware of it, like an OPAC vendor.
But what do dimensions have to do with statistics? It makes sense when you think about the project goals. Let's say we want to collect statistics. We want to collect them for different points in time, so we need a time dimension for sure. We need charts and tables, and both of those are two-dimensional, so we know that we need at least two spatial dimensions. It turns out that in the system, we have names for our two spatial dimensions: we call them "departments" and "categories."
Why just two spatial dimensions and not three? Wouldn't it have been cool to do 3D charts? It's true that adding more dimensions would have allowed for richer data, but the thing about adding more dimensions is that it would make data entry, querying, visualization, and general conceptual understanding a lot harder. So we limited it to two.
Be easy to use for data entry
Each data point needs a department, category, and time period associated with it. Ultimately, this needs to be keyed or clicked in by a human. The "add statistics" form is probably not perfect in terms of efficiency, but it's at least OK for now. Recent enhancements to the privilege system, guided by user feedback, have made it easier for enterers to know where to enter their stats by graying out unavailable departments and categories and restricting certain past time periods.
Be able to generate useful charts
Many different types of charts are possible: chronological, comparative, proportional, and so on. Instead of developing a bunch of individual custom charts, our goal is to provide a general-purpose tool that lets users build the charts they need themselves. The current chart tool provides chronological charts to support trend analysis.
Earlier iterations of the system had a simple charting tool based on Google Charts. We scrapped that and recently switched to a different graphing library which gives us more control and better results.
(As of this writing, the charts are "locked" and only available to a few people, but soon they will be opened up to everyone.)
Be able to generate useful tables
Charts are useful, but in certain cases, it's nice to have access to the raw data. Another recent addition to the system has been the Table Builder tool, which allows dynamic construction of table-based reports. In the Table Builder tables, departments show up as rows and categories as columns. Generated tables can be viewed and printed as-is, or exported into Excel for further manipulation.
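The row/column layout can be illustrated with a short Python sketch. It is not the Table Builder's actual code: the function names, the sample records, the choice to show missing cells as 0, and the use of CSV as an Excel-friendly export format are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, int]  # (department, category, value)


def build_table(records: Iterable[Record]) -> List[List[object]]:
    """Pivot records into a table: departments as rows, categories as columns."""
    cells: Dict[Tuple[str, str], int] = defaultdict(int)
    departments: List[str] = []
    categories: List[str] = []
    for department, category, value in records:
        cells[(department, category)] += value
        if department not in departments:
            departments.append(department)
        if category not in categories:
            categories.append(category)
    header: List[object] = ["Department"] + categories
    rows = [[d] + [cells.get((d, c), 0) for c in categories]
            for d in departments]
    return [header] + rows


def to_csv(table: List[List[object]]) -> str:
    """Serialize a table as CSV text, which Excel can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# Made-up sample records, for illustration only.
records = [
    ("Circulation", "Checkouts", 120),
    ("Circulation", "Checkouts", 95),
    ("Circulation", "Renewals", 12),
    ("Online", "Checkouts", 40),
]
table = build_table(records)
```

Rows and columns appear in the order they are first seen; a real report would more likely follow the order of the department and category taxonomies.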
(Like the charts, the tables will remain restricted for a little while yet.)
Be able to provide basic audit trails
Every data point is associated with the date and time it was entered as well as the staff member who entered it. This means that if you screw up a little too often, we can deduct the mistakes out of your paycheck. Seriously, it means that we know that mistakes are an inevitable part of data entry, and this makes them easier to find. You can also "flag" your statistic, which is an easy way of highlighting it to make it stand out. Overall, audit trails just make it easier to find and fix problems of all kinds.
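As a rough illustration, an audit trail of this kind needs only a couple of extra fields on each entry. The field and function names below are hypothetical, as are the staff names and values; the sketch just shows how "who entered what, and when" makes suspicious entries easy to pull up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class Entry:
    """A statistic plus the audit information recorded with it."""
    value: int
    entered_by: str
    entered_at: datetime
    flagged: bool = False


def flagged_entries(entries: Iterable[Entry]) -> List[Entry]:
    """Return only the entries that someone has flagged."""
    return [e for e in entries if e.flagged]


def entries_by(entries: Iterable[Entry], staff_member: str) -> List[Entry]:
    """Return one staff member's entries, newest first."""
    mine = [e for e in entries if e.entered_by == staff_member]
    return sorted(mine, key=lambda e: e.entered_at, reverse=True)


# Made-up sample entries, for illustration only.
entries = [
    Entry(120, "alice", datetime(2011, 7, 5, 9, 30)),
    Entry(9500, "bob", datetime(2011, 7, 6, 14, 0), flagged=True),
    Entry(40, "alice", datetime(2011, 7, 7, 11, 15)),
]
```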
Support data integrity
The system is backed by a database which, with no code written to prevent it, would be perfectly happy to allow anyone to edit or delete anything at all. With dozens of people using the system, this is not a good ability to have. The built-in privilege system allows statistics administrators to restrict different users' abilities to add, edit, and delete statistics in certain departments, categories, and time periods. This serves two main purposes:
- It makes statistics easier to enter;
- It prevents Department A from changing Department B's statistics, whether accidentally (as is virtually always the case) or intentionally.
Periodically, the Libraries administration will use the statistics in the system to compile reports. It's important that the data used to compile these reports remains the same over time. This is accomplished by "locking" past time periods. When an administrator locks a past month, nobody can add, edit, or delete any statistics within it.
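The interaction between privileges and locked months can be sketched as follows. This is a simplified, hypothetical model: the class and method names, the user "alice", and the "Checkouts" category are invented, and the real privilege system also restricts time periods per user, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Set, Tuple

ACTIONS = {"add", "edit", "delete"}


@dataclass
class PrivilegeTable:
    """Who may do what, plus the months that have been locked."""
    # (user, action, department, category) combinations that are permitted
    grants: Set[Tuple[str, str, str, str]] = field(default_factory=set)
    # (year, month) pairs locked by an administrator
    locked: Set[Tuple[int, int]] = field(default_factory=set)

    def grant(self, user: str, action: str,
              department: str, category: str) -> None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.grants.add((user, action, department, category))

    def lock_month(self, year: int, month: int) -> None:
        self.locked.add((year, month))

    def allowed(self, user: str, action: str, department: str,
                category: str, period: date) -> bool:
        """A locked month overrides every grant; otherwise check the grants."""
        if (period.year, period.month) in self.locked:
            return False
        return (user, action, department, category) in self.grants


privileges = PrivilegeTable()
privileges.grant("alice", "add", "Circulation", "Checkouts")
privileges.lock_month(2011, 7)
```

Checking the lock first means that even a user with full privileges cannot change a month that reports have already been compiled from.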
Physically, the statistics in the system reside on the Libraries' production database server, which gets backed up daily by the folks in Library Systems.
Be able to provide contextual information
Quite often, a statistic needs to be qualified with additional information -- for example, "Beginning in 2005, we stopped collecting such and such this way and started doing it another way and this explains the different totals." It turns out that this kind of thing happens all the time in our statistics, so we need some way to record it. We do this with notes. We can associate a note with any department/category/time period combination and it will appear alongside the results of any relevant query.
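One way to picture how notes attach to queries is the Python sketch below. The names are hypothetical, and treating an omitted department, category, or date bound as "applies to everything" is an assumption made for the example, not a description of how the system stores notes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Note:
    """Context attached to a department/category/time period combination."""
    text: str
    department: Optional[str] = None  # None: applies to every department
    category: Optional[str] = None    # None: applies to every category
    start: Optional[date] = None      # None: no lower bound on the period
    end: Optional[date] = None        # None: no upper bound on the period


def relevant_notes(notes: Iterable[Note], department: str,
                   category: str, period: date) -> List[Note]:
    """Return the notes that apply to one queried intersection."""
    return [
        n for n in notes
        if (n.department is None or n.department == department)
        and (n.category is None or n.category == category)
        and (n.start is None or n.start <= period)
        and (n.end is None or period <= n.end)
    ]


# Made-up sample notes, for illustration only.
notes = [
    Note("Collection method changed beginning in 2005.",
         department="Circulation", category="Checkouts",
         start=date(2005, 1, 1)),
    Note("Counts come from web server logs.", department="Online"),
]
```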
Be able to accommodate future schema changes
Over the past decade, the schema in which the stats have been recorded has undergone many changes in favor of complexity, not simplicity. We don't want to force future square-peg statistics into a round-hole system; we want them to integrate cleanly, and we do that by being able to add to, and rearrange, the department and category taxonomies.
It would have been interesting to have been able to explore other approaches to the problem. Development began in May of last year and the first release was due before July, in order to allow for monthly data entry beginning with this fiscal year. The tight schedule as well as finite developer resources imposed constraints on the amount of research that was possible before implementation.
In hindsight, there was only one "extremely bad" design decision: the Uniform Statistics report, which was my fault and has since been removed. It was intended to be quick to develop and to allow for an easy transition from the Excel Uniform Statistics spreadsheets to the new Table Builder tool. Instead, it ended up being a nightmare that had to be scrapped and replaced by the Table Builder itself, which was the original idea but which there was not enough time to build initially.
Due to its fundamental design, the system is always going to have difficulty dealing with arbitrarily complex statistical classifications. If we ever want to have an automated way of finding the ratio of the number of students with brown hair who have checked out a book in the last 2 months compared to the standard deviation of the number of checkouts over a 9-month period, we will have a hard time. Then again, this is likely to also be the case for any other solution we could have deployed. To some degree, the choice of system must guide the choice of statistics that will be collected, and in this case, we need to:
- Take care not to get too carried away with details (not that we have been); and
- Strive for methodological consistency in our statistic-gathering process.
I hope this background was informative. If you have any questions or suggestions, don't hesitate to contact someone on the project.