Topic modeling is a probabilistic, statistical method that can uncover themes and categories in amounts of text so large that they cannot be read by any individual human being. […] Topic modeling allows us to step back even further from analyzing representative articles in these topics to interpreting all of them, to supplement close readings of individual items with distant readings of tens of thousands of them.
(From: Robert K. Nelson, ‘Of Monsters, Men – and Topic Modeling’, The New York Times Opinionator Blog)
Topic modeling uses statistical techniques to categorize individual texts and, perhaps more importantly, to discover categories, topics, and patterns that we might not be aware of in those texts. A topic modeling program—here the impressive MALLET application developed by Andrew McCallum and others at the University of Massachusetts, Amherst—generates a specified number of topics from a group of documents. The specific topics are not predetermined by the researcher but instead emerge from the patterns uncovered by the statistical algorithm. All that is provided by the researcher is the number of topics.
(From: Robert K. Nelson, ‘Mining the Dispatch’)
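To make Nelson's description a little more concrete for fellow amateurs, here is a minimal sketch of the idea in Python. It is not MALLET's code: it is a toy collapsed Gibbs sampler for LDA, one common way to fit a topic model. The function name `fit_topics`, the six tiny example documents, and the values of `alpha`, `beta`, and `iterations` are all my own assumptions for illustration. The point it tries to show is the one Nelson makes: the only thing about the topics that I supply is their number (`num_topics`), and the word groupings come out of the co-occurrence patterns.

```python
import random
from collections import Counter

def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01,
               top_n=5, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not MALLET)."""
    rng = random.Random(seed)
    tokens = [doc.lower().split() for doc in docs]
    vocab_size = len({w for doc in tokens for w in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in tokens]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by giving every word occurrence a random topic.
    assignments = []
    for d, doc in enumerate(tokens):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each occurrence's topic given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (how much this document likes topic k)
                #        * (how much topic k likes this word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Report the most frequent words currently assigned to each topic.
    topics = [
        [w for w, c in topic_word[k].most_common() if c > 0][:top_n]
        for k in range(num_topics)
    ]
    return topics, doc_topic

# Hypothetical mini-corpus; a real one would have thousands of documents.
docs = [
    "army battle general troops battle",
    "troops march army general victory",
    "cotton price market sale cotton",
    "market sale price tobacco auction",
    "general army victory battle troops",
    "auction cotton tobacco price market",
]
topics, doc_topic = fit_topics(docs, num_topics=2)
for k, words in enumerate(topics):
    print(f"topic {k}: {' '.join(words)}")
```

Nothing in the code names the topics or tells the sampler which words belong together. Labeling a list of words as "war" or "trade" remains the reader's interpretive work, and on a corpus this small the result should not be taken too seriously.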
Princeton historian Ben Schmidt recently published an interesting blog post on the pitfalls of topic modeling: ‘Keeping the Words in Topic Models’. It is not always easy for the uninitiated to follow, but in passing he makes a few remarks that make the purpose of topic modeling understandable even for amateurs like me:
Bookworm/Ngrams-type graphs and these topic-model graphs promote pretty much the same type of reflection, and share many of the same pitfalls. But one of the reasons I like the Ngrams-style approach better is that it wears its weaknesses on its sleeves. Weaknesses like: vocabulary changes, individual words don’t necessarily capture the full breadth of something like “Western Marxism,” any word can have multiple meanings, an individual word is much rarer.
Topic modeling seems like an appealing way to fix just these problems, by producing statistical aggregates that map the history of ideas better than any word could. Instead of dividing texts into 200,000 (or so) words, it divides them into 200-or-so topics that should be nearly as easy to cognize, but that will be much more heavily populated; the topics should map onto concepts better than words; and they avoid the ambiguity of a word like “bank” (riverbank? Bank of England?) by splitting it into different bins based on context.
That alone makes his blog worth reading. The same goes for his other contributions on digital humanities, in which he makes a number of sharp observations about writing history in the digital age:
To do humanistic readings of digital data, we cannot rely on either traditional humanistic competency or technical expertise from the sciences. This presents a challenge for the execution of research projects on digital sources: research-center driven models for digital humanistic resource, which are not uncommon, presume that traditional humanists can bring their interpretive skills to bear on sources presented by others.