This Bookworm is a National Endowment for the Humanities (NEH)-funded exploration of the HathiTrust Research Center (HTRC), a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library. HTRC’s goal is to help researchers meet the technical challenges of working with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge.
Bookworm is a tool that visualizes language usage trends in repositories of digitized texts in a simple and powerful way. It supports culturomic exploration by charting chronological trends for words and phrases in large digitized collections of textual documents, filtered by metadata facets.
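To make the idea of a chronological trend filtered by metadata facets concrete, here is a minimal Python sketch. It is not Bookworm's actual implementation or API: the toy corpus, the field names ("year", "language") and the yearly_trend function are all assumptions made up for illustration. It computes, for each year, how often a word occurs relative to all words in the volumes that match the chosen facets.

```python
from collections import Counter

# Hypothetical toy corpus (not HathiTrust data): each volume carries
# a publication year, one metadata facet, and its text.
VOLUMES = [
    {"year": 1850, "language": "eng", "text": "the whale and the sea"},
    {"year": 1850, "language": "ger", "text": "der wal und das meer"},
    {"year": 1900, "language": "eng", "text": "the sea the sea the ship"},
]


def yearly_trend(volumes, word, **facets):
    """Return {year: relative frequency of `word`} for volumes matching all facets."""
    hits = Counter()    # occurrences of the word per year
    totals = Counter()  # total tokens per year
    for vol in volumes:
        # Skip volumes whose metadata does not match every requested facet.
        if any(vol.get(key) != value for key, value in facets.items()):
            continue
        tokens = vol["text"].lower().split()  # naive whitespace tokenization
        totals[vol["year"]] += len(tokens)
        hits[vol["year"]] += tokens.count(word.lower())
    # Normalize by yearly totals so years with more text are comparable.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


trend = yearly_trend(VOLUMES, "sea", language="eng")
```

Normalizing by the yearly token total, rather than reporting raw counts, keeps years with many digitized volumes from dominating the curve. A real system would also need proper tokenization and precomputed indexes to work at corpus scale.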
The world's great research libraries have, over time, carefully assembled a rich body of metadata pertaining to the books in their collections. Since the HTRC has access to volume-level metadata as well as volume-level content, we have constructed a Bookworm of the HathiTrust non-Google-digitized public domain (NGPD) corpus. We felt that setting up Bookworm with a HathiTrust corpus would provide scholarly researchers with the means of exploring trends.
This tool enables scholars to discover new patterns of textual usage across the entire corpus. In the future, we plan to ingest the entire HathiTrust corpus and continue to identify appropriate metadata to use for the faceted browsing. The tool will then be particularly useful to scholars interested in books that are still under copyright, which is the case for most books published after 1923. Although these books will not be available for reading or downloading online, working with individual words and phrases and tracking their occurrences through time will be useful to academic researchers, especially historians, sociologists and literary scholars.
John Unsworth has noted that a fundamental goal of the humanities is appreciation: "by paying attention to an object of interest, we can explore it, find new dimensions within it, notice things about it that have never been noticed before, and increase its value" (2004). Shifting from traditional close reading to a large-scale view of text causes profound discomfort for humanities scholars, because it is difficult to retain the same sensitivity to what is actually contained in the works being studied. HTRC-Bookworm will function as a link between quantitative analysis (distant reading) and close reading. According to Frederick Gibbs and Daniel Cohen, "any robust digital research methodology must allow the scholar to move easily between distant and close reading, between the bird's eye view and the ground level of the texts themselves" (2011). This is what HTRC-Bookworm intends to accomplish (within the limitations of applicable copyright laws).
Current limitations and future improvements:
The current instantiation (the alpha version) of the HTRC-Bookworm is set up to work with only 250,000 out-of-copyright volumes from the HathiTrust NGPD corpus. Overall, the HathiTrust holds more than 11,000,000 volumes in its full corpus, consisting of 420 billion pages.
As a demonstration we have selected some metadata for use with this smaller corpus, but we are open to feedback on improvements to this metadata.
Our plan is to ingest the larger HathiTrust corpus and to allow facets to be selected based on HTRC worksets.
References:
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., ... & Aiden, E. L. "Quantitative Analysis of Culture Using Millions of Digitized Books." Science 331.6014 (2011): pp. 176-182.
Unsworth, John. "Forms of Attention: Digital Humanities Beyond Representation," delivered at "The Face of Text: Computer-Assisted Text Analysis in the Humanities," The Third Conference of the Canadian Symposium on Text Analysis (CaSTA), McMaster University, November 19-21, 2004.
Gibbs, Frederick W., and Daniel J. Cohen. "A Conversation with Data: Prospecting Victorian Words and Ideas." Victorian Studies 54.1 (2011): pp. 69-77.