A professor of mine tells me that the trick to academic writing is translating associative connections into logical, causal connections, a fact of the academy that I find showmanship-y yet absolutely true. With topic modeling, the relationship among scholar, association, and logic strains yet further under the computational weight of algorithmically generated association.
Ben Schmidt, in his “When you have a MALLET, everything looks like a nail” post, notes one of the methodological “saves” of working with two-dimensional graphical data plotted on a familiar plane rather than with language (bags of words): one can intuitively recognize error. He points to the whaling map in which the LDA algorithm grouped eastern-seaboard shipping and some Pacific whaling into a single “topic.” Schmidt writes,
This is a case where I’m really being saved by the restrictive feature space of data. If I were interpreting these MALLET results as text, I might notice it, for example, but start to tell a just-so story about how transatlantic shipping and Pacific whaling really are connected. (Which they are; but so is everything else.) The absurdity of doing that with geographic data like this is pretty clear; but interpretive leaps are extraordinarily easy to make with texts.
The question becomes: what is the threshold for a reasonable connection? Indeed, Schmidt’s interpretation seems particularly unliterary. It seems to me that the “just-so” story about the connection between these two seemingly unrelated patterns is not merely something a literary academic might accidentally expand on; it is precisely the bit of information that he or she would be most likely to seize on, turn into a conference presentation, and tote about the conference circuit as a lively report on some unexpected associations (hence, I suppose, Schmidt’s warning).
Ryan Heuser and Long Le-Khac’s “Learning to Read Data” offers a counterbalance to the impulse to avoid spurious associations (or associations above the spuriousness threshold, which, as Schmidt implies, we must place somewhere). The problem at the other end of the spectrum is throwing away data that does not already confirm what we believe, that is, eliminating data that fails to support the conceptual associations and categories we have already built.
A troubling corollary to this is a tendency to throw away data that does not fit our established concepts. When Cohen discards a striking correlation between “belief,” “atheism,” and “Aristotle” as an accident of the data, he does just this. Whether or not the correlation is accidental should be decided by statistical analysis rather than the feeling that it doesn’t make sense. If we required all data to make sense—that is, fit our established concepts—quantitative methods would never produce new knowledge. If the digital humanities are to be more than simply an efficient tool for confirming what we already know, then we need to check this tendency to seek validation.
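That point suggests a concrete alternative to intuition: treat a surprising correlation as a hypothesis and test it. A permutation test asks how often randomly shuffled data would produce a correlation as strong as the one observed. The sketch below is purely illustrative; the word-frequency series are invented for the example and are not Cohen’s (or anyone’s) actual corpus counts.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p(xs, ys, trials=10_000, seed=0):
    """Fraction of random shuffles whose |r| matches or beats the observed |r|.

    A small value means the observed correlation is unlikely to be an
    accident of the data's arrangement.
    """
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)  # copy so the caller's list is untouched
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical per-decade relative frequencies for two words
belief  = [12, 15, 11, 18, 22, 25, 24, 30, 28, 33]
atheism = [ 2,  3,  2,  5,  6,  7,  6,  9,  8, 10]

r = pearson(belief, atheism)
p = permutation_p(belief, atheism)
print(f"r = {r:.2f}, permutation p = {p:.4f}")
```

If the permutation p-value is small, the correlation survives at least one basic statistical check and deserves interpretation; if it is large, the “striking” pattern is plausibly an accident, and no feeling about whether it makes sense need enter into the decision.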
It seems as though Schmidt may be on the verge of doing just this (or, at least, of encouraging literary people who thrive on association to do it): throwing away data that does not fit a pre-established topic. What is the happy mean here? Heuser and Le-Khac advocate follow-up statistical modeling to check the validity of these inchoate associations (when it rains algorithms…). This is also, however, where a more traditional literary scholarship could take over. Perhaps after getting a whiff of some new associative logic, it is time to set off into one’s text(s) and attempt a demonstration on the grounds of compelling and satisfying persuasive writing. Or do we want to see our field move further forward than this?