One of the interesting things we have found during the first phase of our project is that digital collections workflows will need to evolve to accommodate linked data. Many of the tasks are similar to managing traditional metadata records, but the consequences of data quality are much more severe. In the past, if we did not do quality control on one of our collections, there were no indicators of how that was impacting users. The systems would (for the most part) still intake messy or inaccurate data, metadata fields were not strictly regulated, and data quality was something we talked about and contemplated but were not forced to act upon.
The world of linked data is very different. To prepare data to link to other data sets, it needs to be good-quality data that is ready for the transformation process. In this process we use tools like OpenRefine, which offer powerful automated functions such as reconciliation services that check our data in batches against authority files like LCSH. If bad data goes in, we lose the power of automated reconciliation. No longer can we simply ignore messy data. The recent webinar, "How to Pick the Low Hanging Fruits of Linked Data," presented by Seth van Hooland and Ruben Verborgh, was a fantastic overview of this topic, explaining actions we can take now to begin overcoming messy data and adding value in the process.
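To make the idea concrete, here is a minimal sketch in Python of what batch reconciliation does: compare a batch of field values against a controlled list and flag anything that does not match. The sample terms and the tiny authority set are hypothetical stand-ins; real services like those OpenRefine connects to work against full authority files such as LCSH.

```python
# Minimal sketch of batch reconciliation: check a batch of subject
# headings against an authority list and flag values needing cleanup.
# The authority set and sample batch below are hypothetical examples,
# not real LCSH data.

AUTHORITY = {
    "Libraries",
    "Archives",
    "Linked data",
}

def reconcile(values):
    """Return (matched, unmatched) lists for a batch of values."""
    matched, unmatched = [], []
    for value in values:
        # Trim stray whitespace before comparing against the authority list.
        (matched if value.strip() in AUTHORITY else unmatched).append(value)
    return matched, unmatched

batch = ["Linked data", "Libaries", "Archives "]
ok, needs_review = reconcile(batch)
print(needs_review)  # → ['Libaries']
```

The point of the sketch is the "bad data in" problem: the misspelled "Libaries" silently falls out of the matched set, which is exactly the value automated reconciliation cannot rescue for us.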
Some of the key points from the webinar include:
- The presenters suggest an approach that includes the following steps: clean, reconcile, enrich, and publish
- We can all clean up our data, even if we are not actively publishing linked data
- But if you are working on reconciling and enriching your data, you should seek to publish it. We need more data sets published as linked data
- Data quality is fast becoming an important area of research and is shaping the future of our professional roles in libraries and archives; one suggested publication was Data Quality: The Accuracy Dimension. We will each need to develop local guidelines for data quality management, but familiarizing ourselves with the larger picture can help us establish our own best practices
- The presenters discussed the challenges of exposing data (via websites and APIs) and proposed REST guidelines as the most sustainable architectural model
- The presenters' handbook, "Linked Data for Libraries, Archives and Museums," will be published by Facet Publishing in June 2014.
OpenRefine is a great place to start. Even if you just need to review, sort, analyze, or examine your existing data, it is "Excel on steroids," with many easy-to-use functions you can put to work right away. We will post some of the common functions we use to separate values in fields, move values into their own fields, and clean up our controlled-vocabulary fields using authority files and reconciliation services. It all feels a little like magic at first!
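As a taste of the kind of cleanup those functions perform, here is a rough Python analogue of two common steps: splitting a multi-valued field on a delimiter (like OpenRefine's split on separator) and normalizing whitespace and case so values line up with a controlled vocabulary. The field name and sample record are hypothetical, and real OpenRefine recipes would use its own GREL expressions rather than Python.

```python
# A rough analogue of common cleanup steps: split a multi-valued field
# on a delimiter, then trim and normalize each value. The "subjects"
# field and sample record are hypothetical examples.

def split_values(field, delimiter=";"):
    """Split a multi-valued field into trimmed individual values."""
    return [v.strip() for v in field.split(delimiter) if v.strip()]

def normalize(value):
    """Collapse internal whitespace and apply sentence case."""
    cleaned = " ".join(value.split())
    return cleaned[:1].upper() + cleaned[1:].lower() if cleaned else cleaned

record = {"subjects": "linked  data; LIBRARIES ;archives"}
subjects = [normalize(v) for v in split_values(record["subjects"])]
print(subjects)  # → ['Linked data', 'Libraries', 'Archives']
```

Once values are separated and normalized like this, they are in a state where a reconciliation service can actually match them against an authority file.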
So if you think messy data is something you can live with, we encourage you to think again!