Cross-posted at the MSU Libraries DSC Sandbox
Digital humanists inside and outside of libraries make use of cultural heritage collection metadata in digital projects. Examples can be seen in the classroom, in research into scholarly production in the field of History, and in library efforts to understand aspects of their own collections, such as manuscript provenance and the interdisciplinary character of monograph holdings. At MSU Libraries we are moving toward supporting this type of inquiry by preparing and making available data corresponding to the library catalog as a whole, along with subsets of that data corresponding to holdings in Special Collections. It is worth noting at the outset that this work is made possible by a wide cast of key players across the library who contribute programming, cataloging, metadata, and subject-area expertise.
The rationale for creating subsets of the larger dataset is that metadata corresponding to a Special Collection (what we might call a 'collection of distinction' around these parts) describes a unique body of materials. It follows that insight derived from this metadata at scale can support, and even extend, the range of research questions that might otherwise be asked of these materials on an item-by-item basis. In effect, it allows a macroscopic as well as a microscopic lens to be applied to library collections.
Some basic information about the collection records:
- Total records: 4366
- Records with a publisher: 3078
- Records with a country: 4366
- Records with a city + state/country: 3121
- Records with a state alone: 1245
- Records with a publication date: 2770
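Counts like the ones above are straightforward to tally once the records are extracted. This is a hypothetical sketch — the dict-of-fields record structure and the `field_coverage` helper are stand-ins for whatever the real extraction script produces, not the actual MSU code:

```python
# Sketch: tally field coverage across a set of extracted records.
# Each record is a dict with optional keys; missing or empty values
# count as "not present" for that field.

def field_coverage(records, fields):
    """Count how many records contain a non-empty value for each field."""
    counts = {f: 0 for f in fields}
    for rec in records:
        for f in fields:
            if rec.get(f):
                counts[f] += 1
    return counts

records = [
    {"publisher": "R. Smith & Co.", "country": "xxu", "date": "1911"},
    {"country": "xxu", "city": "Lansing", "state": "MI"},
]
print(field_coverage(records, ["publisher", "country", "date"]))
# {'publisher': 1, 'country': 2, 'date': 1}
```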
After getting the catalog records (thanks, Autumn Faulkner), I used a bit of Python (thanks, Devin Higgins, for the help there) to extract a number of data elements from the records.
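To give a flavor of the extraction step: publication data in MARC records lives in coded subfields (place in `$a`, publisher in `$b`, date in `$c` of field 260). The actual extraction ran against full catalog records; this illustrative stand-in just parses a flat `$`-delimited field string, and the `parse_subfields` helper is my own naming, not the real script:

```python
# Sketch: pull subfields out of a MARC-style publication field (260).
# A flat '$'-delimited string stands in for a parsed MARC record here.

def parse_subfields(field):
    """Map subfield codes to values, e.g. '$aLansing :$bSmith' -> {'a': ..., 'b': ...}."""
    out = {}
    for chunk in field.split("$")[1:]:
        # First character is the subfield code; trim ISBD punctuation.
        code, value = chunk[0], chunk[1:].strip(" :,.")
        out[code] = value
    return out

pub = parse_subfields("$aLansing, Mich. :$bR. Smith & Co.,$c1911.")
print(pub["b"], pub["c"])  # publisher and date
```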
Since then I’ve done a whole lot of work in OpenRefine and Excel for data cleaning and normalization, which basically entails wrangling data into a consistent format. For example, geographic information in catalog records is highly variable: a lot of 'Lans. mi', 'Lansing Michigan', 'Lansing, Michig.', and 'Lansing Mich.' rather than a consistent 'Lansing, MI'. Across thousands of records, a not-insignificant amount of love and care goes into getting data like this into a state where it can support, say, a question predicated on mapping a special collection across time and space with a tool like Palladio.
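The normalization itself happened interactively in OpenRefine, but the core idea — collapsing variant spellings to one canonical form — can be sketched programmatically. The pattern and canonical target below are illustrative assumptions covering just the Lansing variants mentioned above:

```python
import re

# Sketch: collapse variant place-name spellings to one canonical form.
# Unrecognized values pass through untouched for manual review.

CANONICAL = {
    # Matches "Lans. mi", "Lansing Michigan", "Lansing, Michig.", "Lansing Mich.", etc.
    r"lans(ing)?[.,]?\s*(mi|mich(ig|\.)?(an)?)[.,]?": "Lansing, MI",
}

def normalize_place(raw):
    text = raw.strip().lower()
    for pattern, canon in CANONICAL.items():
        if re.fullmatch(pattern, text):
            return canon
    return raw  # leave unrecognized values as-is

for v in ["Lans. mi", "Lansing Michigan", "Lansing, Michig.", "Lansing Mich."]:
    print(normalize_place(v))  # each prints "Lansing, MI"
```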
Cleaning and normalization of the Community Cookbook data is ongoing. Initial steps have been completed: publication dates and publication locations have been normalized, and the locations have been augmented with latitude and longitude data via geocoder.
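The lat/long augmentation step can be sketched like so. The real pipeline used the geocoder package (along the lines of `geocoder.osm(place).latlng`); here a geocode function is injected instead, so the logic runs without network access — `augment_with_latlng` and the row structure are my illustrative assumptions:

```python
# Sketch: attach latitude/longitude to rows with a normalized place name.
# `geocode` is any callable mapping a place string to (lat, lng) or None,
# standing in for a call to the geocoder package.

def augment_with_latlng(rows, geocode):
    """Attach a 'latlng' pair to each row; None for rows without a place."""
    for row in rows:
        row["latlng"] = geocode(row["place"]) if row.get("place") else None
    return rows

# Stub standing in for a live geocoding service.
fake_geocode = {"Lansing, MI": (42.73, -84.55)}.get
rows = augment_with_latlng([{"place": "Lansing, MI"}, {"place": ""}], fake_geocode)
print(rows[0]["latlng"])  # (42.73, -84.55)
```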
Community Cookbooks, mapped across time and space
Distribution of Michigan Community Cookbooks
Distribution of Kansas Cookbooks
A fair amount of work remains to be done, but the hope is that these initial results spark some interest!