Although quantitative analysis brings valuable new perspectives on large collections of text, it does have certain disadvantages and limitations that should be considered.
The ten words chosen for each category (religious and secular) certainly fit their categories, but some authors may have used other relevant words that were left off the list because they appear so rarely overall. For example, “temptation” certainly carries a religious connotation, but it was used only 4 times in 50,000 words and so was not included in the list. Edmund Massey was the only author who used this word, and it can be argued that its omission did not skew the results because the word “sin,” which was included, was also used significantly more by Massey. Still, this points to an unanswered question. If a word is used only a few times, is it still significant? Does a threshold exist? As our qualitative analysis shows, a single use of religious language can reveal a deeply held belief.
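To make the threshold question concrete, raw counts can be normalized to a rate per 10,000 words. The short Python sketch below uses the figures given above for “temptation”; the function name `per_10k` and the cutoff `MIN_RATE` are assumptions made for illustration and are not values used in this study.

```python
def per_10k(count: int, total_words: int) -> float:
    """Return occurrences per 10,000 words of running text."""
    return count / total_words * 10_000


# Figures from the text: "temptation" appears 4 times in roughly 50,000 words.
rate = per_10k(4, 50_000)
print(f"{rate:.1f} per 10,000 words")

# Hypothetical cutoff for inclusion in a word list (an assumption, not the
# study's actual criterion).
MIN_RATE = 1.0
print("meets cutoff:", rate >= MIN_RATE)
```

Any such cutoff is a judgment call: a rate this low falls under most plausible thresholds, yet the qualitative reading may still treat the word as meaningful.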
Problems also arise when different authors use different words to describe the same thing. For example, “pock” and “pustules” mean the same thing, but Mather uses “pustules” significantly more while Boylston favors “pock.” “Pock” was chosen because it was used more frequently and its use was more evenly distributed across the text, indicating that it was the more common way to refer to the symptom, but this decision did introduce bias and led to Mather having a lower score for total secular language use. This type of problem is especially apparent with a small sample size, since these subtle distinctions are hard to discern. With a larger collection of texts, Mather’s use of “pustules” may have appeared as even more of an outlier, and he may also have used “pock” more often in another of his writings.
Limitations of Voyant
Voyant provides a huge amount of valuable data, but it’s not a flawless system. One of the biggest disadvantages is the inability to search for variations of words in combination. For example, combining “sin,” “sins,” and “sinned” into one data point and comparing it to other terms across the text is more accurate than choosing the single form that is used most frequently or seems to be the best representation. Voyant does have an option to collapse terms into a single data point, but this could lead to disproportionately large frequencies for that word in comparison to a word with fewer variations.
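One way to explore this trade-off outside Voyant is to count each form separately and then sum them, so the combined figure and its components are both visible. The Python sketch below is only an illustration: the set `SIN_FORMS`, the function `variant_counts`, and the sample sentence are hypothetical and are not taken from the corpus or from Voyant itself.

```python
import re
from collections import Counter

# Hypothetical group of word forms to treat as one data point.
SIN_FORMS = {"sin", "sins", "sinned"}


def variant_counts(text: str, forms: set[str]) -> Counter:
    """Count each listed form separately, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in forms)


# Invented sample sentence, used only to demonstrate the counting.
sample = "He sinned, and his sins were many; one sin remained."
per_form = variant_counts(sample, SIN_FORMS)
combined = sum(per_form.values())

print(dict(per_form))  # count for each individual form
print(combined)        # combined count for the whole group
```

Reporting the per-form counts alongside the combined total shows how much of a large frequency comes simply from a word having more variations.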
The collate clusters tool also has some flaws. One of the largest is a lack of documentation on exactly how the tool determines which words are in association. As a test, the 20 words used for this analysis were entered in alphabetical order and then again in reverse alphabetical order. Although the graphics were nearly identical, there were slight variations in how the words were linked. Without knowing how the tool works, it is difficult to account for this variation. In addition, this tool uses colors to indicate the text in which a word is used most frequently, but only five different colors are available. When using more than five texts, the limit on colors minimizes the amount of information that can be determined from the graphic in a static image.
Although quantitative data can reduce subjectivity, this method is not entirely bias-free. In their paper, “Meaning and mining: the impact of implicit assumptions in data mining for the humanities,” D. Sculley and Bradley Pasanek note, “The worst dangers may lie in the humanists’ ability to interpret nearly any result, projecting his or her own biases into the outcome of an experiment — perhaps all the more unwittingly due to the superficial objectivity of computational methods.” Decisions in the methodology and interpretation can easily be biased even though the word-frequency data themselves cannot. When choosing the words used in the analysis, words with religious connotations were partially identified by peaks in the texts written by ministers. This was a good indicator for accuracy in finding words with religious connotations, but using those same words to show a lack of religious language in the doctors’ texts could be considered circular reasoning.
Words in Context
Single words carry meaning and connotations, but do not tell the full story. A quantitative text analysis is most effective when paired with qualitative analysis to determine the context in which words are used. For example, “blood” was used in the medical sense in several of the texts. However, it was removed from the final list of secular words because it was used in other contexts. Boylston refers to “Royal blood” and Williams says “What a fountain of blood the promoters are guilty of” (using it in a sense of death and guilt rather than medical observations).
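A simple keyword-in-context listing is one way to do this kind of check by hand. The Python sketch below is a minimal illustration rather than the method used in this study; the function `kwic` and its `width` parameter are assumptions, and the sample is the Williams phrase quoted above.

```python
import re


def kwic(text: str, keyword: str, width: int = 4) -> list[str]:
    """List each occurrence of keyword with up to `width` words on each side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Brackets mark the keyword so the surrounding words are easy to scan.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Phrase from Williams, as quoted in the text.
sample = "What a fountain of blood the promoters are guilty of"
for line in kwic(sample, "blood"):
    print(line)
```

Reading each occurrence with its neighboring words makes it easier to separate medical uses of a term from figurative ones before deciding whether the word belongs on a list.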
The nature of the texts could also lead to misrepresentation. In a debate, an author often states an opponent’s objection or argument before refuting it. Thus, an author may use a word only because they are quoting another party, not because they would normally use that type of language. It would require a much more nuanced algorithm to make this distinction in a text analysis.
Finally, the use of religious and secular language must also be considered along with the genre of the document. An increased use of secular language in an observational account of smallpox may not necessarily be an indication of the author’s true beliefs, but rather a reflection of the nature of the document. Similarly, it is logical that a sermon will contain more religious language than a dissertation, regardless of who is writing it. These factors do not necessarily discount all outcomes from the quantitative analysis, but keeping them in mind while analyzing the texts can only lead to stronger conclusions.