The documents used in this study were located in the open collections of Harvard University Library and in the Evans Early American Imprint Collection at the University of Michigan. The Harvard Contagion collection includes high-resolution images of all the documents, many of which have been converted to plain text using Optical Character Recognition (OCR). However, the quality of this OCR is not high enough to reliably quantify word frequencies.

The low quality of the OCR is not due to outdated technology or low-resolution photos, but rather to the nature of 18th century printing. OCR works by scanning an image of a document, comparing the shapes it finds to a dictionary of recognized characters, and converting the result into a file of regular text. Although huge improvements have been made since the concept of OCR was first conceived, the technology still has limits. The documents used in this study often have dark spots or rough edges that obscure the text. Italic type further complicates the reading. Most notably, in the type used in 18th century printing, the letter “s” is often printed as a long s (ſ), which closely resembles an “f”. Even if a photograph is perfectly clear and readable to the human eye, the text is nearly impossible for a computer to interpret without error.
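To illustrate why such misreadings matter for word frequencies, here is a minimal Python sketch. The two sample strings are invented for illustration and are not taken from the corpus; the function name `word_counts` is likewise hypothetical. It shows how a single “s” read as an “f” splits one word's count across two spellings.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase alphabetic tokens in a string."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Invented example strings (assumption: not drawn from the study's documents).
clean_text = "the small pox spreads; small pox is feared"
ocr_text = "the fmall pox fpreads; small pox is feared"  # two "s" read as "f"

clean = word_counts(clean_text)
ocr = word_counts(ocr_text)

# In the clean text "small" is counted twice; in the OCR text the same
# word is split between "small" and the misreading "fmall".
print(clean["small"], ocr["small"], ocr["fmall"])
```

Any frequency table built from the uncorrected text would undercount affected words and add spurious entries, which is the kind of distortion that makes the raw OCR unsuitable for quantitative comparison.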

One solution to this problem is to manipulate the image in Photoshop to make the text stand out more clearly (see image below). This does produce cleaner OCR output, but the text is still not perfect, especially in italic passages and in words containing an “s”. The only way to correct these flaws is to check and edit the text carefully by hand. Applying this process to approximately 200 pages was simply not feasible within the scope of this project.

[Image: OCRProcessing]

For this reason, I turned to the Text Creation Partnership (TCP). Rather than use OCR, this organization has transcribed hundreds of documents by hand and made them publicly available. On their website they state, “Because of the irregularity and difficulty of early printing, as well as the variable quality of the microfilm-based images from which we are working, optical character recognition cannot reliably “read” the EEBO images to produce an accurate electronic text. The review and correction of the text produced would be so expensive and labor-intensive that it is more efficient to simply key the work from scratch.” The transcriptions provided by the TCP were of high quality. The only further processing required was to remove the symbols indicating line breaks, which would otherwise cause Voyant to interpret a single word as two separate words. Spelling was also normalized so that it was consistent across all texts.
For example, a sentence that read, “Inoculation might be sus|pended from being carried into the Country Towns, be|fore any Method or Contrivance was endeavour’d, to make it more easy to the Patient and safe to the Neigh|bourhood,” became “Inoculation might be suspended from being carried into the Country Towns, before any Method or Contrivance was endeavoured, to make it more easy to the Patient and safe to the Neighbourhood.”1 Not all the documents contained in the Harvard Contagion collection were available through the TCP, but those that were available formed a representative sample from which some conclusions could reasonably be drawn.
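The cleanup described above (removing line-break markers and normalizing spelling) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the procedure actually used in the project: the function name `clean_transcription` and the one-entry `SPELLING_MAP` are hypothetical, and the only normalization shown is the one visible in the example sentence.

```python
import re

# Hypothetical normalization table. The source does not list the spellings
# that were normalized; this single entry comes from the example sentence.
SPELLING_MAP = {
    "endeavour'd": "endeavoured",
}


def clean_transcription(text: str) -> str:
    """Remove in-word line-break markers and apply spelling normalizations."""
    # Use a plain apostrophe so map keys match regardless of typography.
    text = text.replace("\u2019", "'")
    # Rejoin words split by a "|" line-break marker, e.g. "sus|pended".
    text = re.sub(r"(?<=\w)\|(?=\w)", "", text)
    # Replace each listed variant spelling as a whole word.
    for variant, normalized in SPELLING_MAP.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", normalized, text)
    return text


sample = (
    "Inoculation might be sus|pended from being carried into the Country "
    "Towns, be|fore any Method or Contrivance was endeavour\u2019d, to make "
    "it more easy to the Patient and safe to the Neigh|bourhood"
)
print(clean_transcription(sample))
```

Removing the marker before the text reaches a tool such as Voyant keeps “sus|pended” from being counted as “sus” and “pended”. A real normalization table would need one entry for each variant spelling found in the corpus.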

In conclusion, the choice of documents for this project was not ideal for OCR and quantitative analysis, but it did reveal important limits in this method of digital humanities research. A larger number of works could be processed in much less time if they were from a later time period. Text analysis through tools like Voyant has typically been used to analyze a large corpus that could not feasibly be read by one individual. However, using a smaller corpus had the advantage that the texts could also be read qualitatively. This gave context to the data produced in the quantitative analysis and provided a frame of reference for evaluating the accuracy of the conclusions drawn from it.

