C19 Reprint Discovery Engine

The reprint discovery engine for nineteenth-century periodicals archives would be a tool not unlike the Google Ngram Viewer, but focused on textual reprint and reference. This project would likely start by investigating a database like the Library of Congress’ “Chronicling America” collection, which is open and includes “an extensive application programming interface (API) which you can use to explore all of our data in many ways.”

I imagine the reprint discovery tool developing in two stages:

1.) In its first stage, the tool would likely require a base text for each inquiry. Users would enter, say, the text of Poe’s “Purloined Letter,” and the tool would automatically break the short story into n-grams (sequences of words or letters). The tool would then query a periodical archive for each n-gram sequence. Why so many queries? As I found with the Hawthorne project, simple title searches are insufficient, because newspaper and magazine editors often left reprints untitled or retitled them. In addition, title searches won’t return quotations from or references to the base text in other kinds of articles, such as the sermons or religious articles I found that quoted just a line or two from “The Celestial Railroad.” The tool should allow readers to tweak the length of the n-gram sequences on the fly (in my OS X-bound imagination, I see a slider) so that an inquiry could be broadened or narrowed based on the results returned. Such a tool would allow users to discover not only reprints of their chosen text, but also the paratexts essential to understanding the reception history of the story or poem.
2.) In the tool’s second stage, I would hope to automate the first part of the reprint discovery process: the discovery of base texts. The problem with the tool I’ve outlined in stage 1 is that it would likely be used only for texts scholars already find interesting: stories or poems that scholars suspect are worth searching periodicals archives for, because they have some sense of an existing history of widespread reprinting and/or reference. If the tool itself could dig into the archive in search of base texts, however, then we might discover texts that were widely reprinted and referenced but have since fallen out of our cultural memory. Such a tool could generate significant new scholarship, as important texts and authors resurfaced and demanded further study. How might this work technically? I’m not certain. Perhaps the tool would crawl through the entire archive database, breaking the archive itself into n-grams and then looking for n-grams that recur across different pages. I’ll need a programmer to tell me whether that’s in the realm of possibility, or whether another approach would be more fruitful.
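
To make the two stages a little more concrete, here is a minimal sketch in Python of how they might work. It is only an illustration under several assumptions, not a design: the archive is stood in for by a small in-memory dictionary of invented page texts rather than a real periodicals API, the function names (word_ngrams, find_reprint_candidates, find_shared_passages) are hypothetical, and n-grams are taken to be sequences of words matched exactly, which ignores the OCR errors and editorial changes a real tool would have to tolerate.

```python
import re
from collections import defaultdict


def word_ngrams(text, n):
    """Return the overlapping n-word sequences in text, lowercased, punctuation dropped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def find_reprint_candidates(base_text, archive, n=5):
    """Stage 1: for each archive page, count how many of the base text's n-grams it contains."""
    base = set(word_ngrams(base_text, n))
    hits = {}
    for page_id, page_text in archive.items():
        shared = base & set(word_ngrams(page_text, n))
        if shared:
            hits[page_id] = len(shared)
    return hits


def find_shared_passages(archive, n=5):
    """Stage 2: index every n-gram in the archive and keep those found on more than one page."""
    index = defaultdict(set)
    for page_id, page_text in archive.items():
        for gram in word_ngrams(page_text, n):
            index[gram].add(page_id)
    return {gram: pages for gram, pages in index.items() if len(pages) > 1}


# Invented toy data standing in for archive pages (an assumption, not real archive content).
archive = {
    "page-a": "The editor notes that the quick brown fox jumps over the lazy dog today.",
    "page-b": "A sermon: the quick brown fox jumps over the lazy dog, as the saying goes.",
    "page-c": "Market prices for wheat and corn were steady this week.",
}
base_text = "The quick brown fox jumps over the lazy dog."

# n plays the role of the slider: smaller values broaden the inquiry, larger values narrow it.
candidates = find_reprint_candidates(base_text, archive, n=5)
shared = find_shared_passages(archive, n=5)
print(candidates)
print(sorted(shared))
```

In this toy run, the first function needs a base text, while the second finds the recurring passage without being told what to look for. Holding an index of every n-gram in memory would not scale to a full newspaper archive, so whether stage 2 is practical at that size remains the open question raised above.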



Help type
Kinds of collaborators
Individual/small group
Help description
I need one or more programmers interested in working with the APIs of periodicals archives to develop such a tool. I would also be interested in working with other scholars of the period or of periodicals to help shape the questions such a tool can and should be able to answer.
Help needed