As the semester draws to a close, my topic modeling for digital collections appraisal project is almost complete. I have been exploring whether topic modeling can make the appraisal of personal digital collections faster and more efficient. The results so far are interesting enough, I think, to merit rethinking how we appraise large collections of heterogeneous files. The data I'm working with is drawn from the personal laptop of a preeminent scholar. I still need a bit more time to interpret the data, and a couple of permissions for sharing it need to be worked out.
Until then, I want to highlight some tools and resources that made this project significantly easier. Note that many of these resources are drawn from the most recent issue of the Journal of Digital Humanities. Shout out to Scott Weingart and Elijah Meeks for a job well done.
Forensic Toolkit Imager: http://www.forensicswiki.org/wiki/FTK_Imager
Topic Modeling Tool: https://code.google.com/p/topic-modeling-tool/
David Blei, Topic Modeling and Digital Humanities
David Blei, Probabilistic Topic Models [PDF]
Miles Efron, Peter Organisciak, Katrina Fenlon, Building Topic Models in a Federated Digital Library Through Selective Document Exclusion [PDF]
Andrew Goldstone and Ted Underwood, What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?
David Mimno, The Details: Training and Validating Big Models on Big Data
Miriam Posner and Andy Wallace, Very basic strategies for interpreting results from the Topic Modeling Tool
Lisa Rhody, Topic Modeling and Figurative Language
Ben Schmidt, When you have a MALLET, everything looks like a nail
Ted Underwood, What kinds of “topics” does topic modeling actually produce?
Ted Underwood, Topic modeling made just simple enough.