This is part two of a two-part article on Collections as Data research at the UNLV Libraries.
Select Las Vegas City Commission Records are now available as a dataset that can be analyzed using computational research tools. Read on for more information, or go directly to the dataset at https://github.com/UNLV-Libraries/UNLV-Collections-as-Data/tree/master/Las-Vegas-Commission.
Cory Lampert, Head of Digital Collections, and Emily Lapworth, Digital Special Collections Librarian, provide us with an overview of how Digital Collections is working on this important research initiative.
The first test case to which this strategy was applied is the Las Vegas City Commission Records Dataset (1911-1960). The dataset comprises bound materials from the original Las Vegas City Commission. Twelve of the bound volumes are minutes that served as the official record of the proceedings of all Las Vegas City Commission meetings from 1911 to 1960. There are also three volumes of City of Las Vegas ordinances dating from 1911 to 1958, and one volume of legal documents from 1944 to 1945. Together they provide a valuable historical record of a wide variety of business and community activities in Las Vegas in the first fifty years of its incorporation. The Las Vegas City Commission Records Dataset was derived from digitized materials from the Las Vegas City Commission Records archival collection (MS-00237). A guide to MS-00237 can be found online at http://n2t.net/ark:/62930/f1n034.
The dataset consists of plain text files of the 16 volumes described above. The Las Vegas Centennial Commission provided funding for Digital Collections staff and student assistants (including Kathleen Marx, Kayla Ott, Tierre Cabbell, Elizabeth Villasenor, Natale Muro, and Kelsey Mazmanyan) to manually review and correct computer-generated transcription from high-resolution digital images, resulting in highly accurate text in this dataset. The dataset, along with additional information about its contents and how it was created, is available for download via the UNLV Libraries GitHub page (https://github.com/UNLV-Libraries/UNLV-Collections-as-Data/tree/master/Las-Vegas-Commission).
The UNLV SCA Collections as Data (CAD) project team focused on the Las Vegas City Commission (LVC) materials as a candidate dataset based upon the following factors:
- The optical character recognition text transcripts were of a high quality due to the manual correction completed through grant funding (most OCR texts created in digital collections are uncorrected and presented “as-is” to users).
- Creating a dataset from the corrected transcriptions fit the criteria of a small, quickly attainable goal.
- This collection is popular with local researchers, and the physical materials are often requested by on-site patrons. The original volumes are large, heavy to handle, and fragile, so a digital format is more practical to use.
- While limited to a small geographic space, the collection content is highly relevant for those seeking to gain an understanding of the city’s development over time. Ordinance names, subjects, proper names, and dates have a high degree of accuracy and can be analyzed across the dataset.
- As part of creating the dataset, a ReadMe file was generated to document the characteristics of the published data. Future datasets will also include a ReadMe file and this project piloted a selection of core elements for future datasets. This practice of documentation has been discussed and adopted by the team as a best practice.
- The dataset is published along with its documentation, and the team is open to gathering and incorporating user feedback.
What you could do with the LVC dataset
Each researcher may come to the dataset with their own unique research interests. Computational research methods often require a multi-step process:
- Developing a research question
- Identifying available data
- Matching the data and research question with an appropriate method or instrument to analyze the data
- “Cleaning” or preparing the data
- Running the analysis
- Analyzing the results
Before performing analysis, researchers should expect to perform custom data preparation to create an appropriate workset that meets their unique needs and that will produce fruitful results from selected computer operations.
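As one illustration of this kind of preparation, the Python sketch below builds a simple workset from a folder of plain text files. It is a minimal example rather than the team's own workflow: the function names are hypothetical, the folder is assumed to be a local copy of the downloaded dataset, and the cleaning rules (rejoining words split across line breaks, collapsing whitespace, lower-casing) are assumptions that each researcher would adjust to their own question.

```python
import re
from pathlib import Path


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace.

    Note: this also joins genuine hyphenated compounds that happen to
    break at the end of a line, which may or may not be desirable.
    """
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def tokenize(text: str) -> list[str]:
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def load_workset(folder: str) -> dict[str, list[str]]:
    """Map each .txt file name in `folder` to its list of cleaned tokens.

    `folder` is assumed to point at a local download of the text files.
    """
    workset = {}
    for path in sorted(Path(folder).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        workset[path.name] = tokenize(clean_text(raw))
    return workset
```

Calling `load_workset` on the local folder returns one token list per file, which can then be filtered (for example, by removing common words) before any analysis is run. Lower-casing suits word counts but discards the capitalization that name recognition depends on, so a different workset may be needed for that method.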
The LVC dataset consists of plain text files that could be “mined” for patterns (a profile of text mining as a research method is available on the Collections as Data project website). Some of the specific methods that the UNLV SCA CAD project team experimented with using this dataset include topic modeling and named entity recognition. The Stanford Named Entity Recognizer was used to identify and tag entities such as the names of people and places. Researchers could explore which names appear most often in the records, or how the frequency of certain names changes over time. Place names could be used in conjunction with a geocoder (such as gpsvisualizer.com/geocoder) to visually map the places mentioned in the records. The team also tried using the Topic Modeling Tool to identify topics in the LVC meeting minutes. The Topic Modeling Tool was easy to use, but it did not immediately identify meaningful topics. The team hypothesized that either the tool or the dataset needed further tweaking, or that topic modeling may not be the best method to use with meeting minutes.
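To make the name-frequency idea concrete, here is a small Python sketch. It does not use the Stanford Named Entity Recognizer: as a labeled stand-in, it treats any run of two or more capitalized words as a candidate name, which is far cruder than a trained tagger. The function names are hypothetical, and the sample sentences in the usage lines are invented for illustration, not taken from the dataset.

```python
import re
from collections import Counter

# Heuristic stand-in for a real NER tagger: two or more capitalized words
# in a row. It will also glue titles onto names (e.g. a title followed by
# a surname) and will miss single-word names entirely.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text: str) -> Counter:
    """Count runs of two or more capitalized words as rough name candidates."""
    return Counter(" ".join(m.split()) for m in CANDIDATE.findall(text))


def mentions_by_label(texts: dict[str, str], name: str) -> dict[str, int]:
    """Count case-sensitive whole-phrase mentions of `name` under each label.

    `texts` maps a label (for example a year or volume name chosen by the
    researcher) to the text associated with that label.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return {label: len(pattern.findall(text)) for label, text in texts.items()}


# Invented sample text, for illustration only.
sample = ("the motion was made by John Smith and seconded by Mary Jones. "
          "John Smith voted aye.")
top_names = candidate_names(sample).most_common(2)
per_label = mentions_by_label(
    {"1911": "John Smith moved. John Smith voted.", "1912": "Mary Jones moved."},
    "John Smith",
)
```

Here `top_names` ranks the candidates by how often they occur, and `per_label` shows how mentions of one name vary across labeled groups of text. Grouping the text by year or by volume is left to the researcher.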
The team is continuing to pursue CAD projects for other unique library special collections. To help us prioritize this work and make it more useful for researchers, we welcome feedback and comments on the project.
- Have you used the LVC data for your research? Could it be improved in some way?
- Is there a collection you would like to see transformed into a dataset?
- What tools and methods do you commonly use in your research?
Contact the team members directly with your feedback or if you would like more information or an in-person demonstration of Collections as Data concepts, datasets, or methods for your classes or community.
Emily Lapworth, Digital Special Collections Librarian (firstname.lastname@example.org)
Cory Lampert, Head of Digital Collections (email@example.com)
Halle Burns, Data Librarian and Instructor (firstname.lastname@example.org)
Thomas Padilla, Interim Head of Knowledge Production (email@example.com)