From keyword searching to concept mining

Historical newspapers have traditionally been popular sources to study public mentalities and collective cultures within historical scholarship. At the same time, they have been known as notoriously time-consuming and complex to analyze. The recent digitization of newspapers and the use of computers to gain access to the growing mass of digital corpora of historical news media are altering the historian’s heuristic process in fundamental ways.

The large digitization project currently run by the Dutch National Library (Koninklijke Bibliotheek, KB) illustrates this. To date, the KB has made over 80 million historical newspaper articles from the last four centuries publicly available. Researchers (as well as the wider public) can run full-text searches across the entire repository of articles through the KB’s own online search interface, Delpher. Instead of requiring users to skim manually through a selected number of editions or volumes, this functionality lets them search for particular keywords (or strings of keywords) within the entire corpus. As basic as it may seem, full-text searching completely overturns the way in which historians are used to approaching newspapers. Instead of the successive top-down selections historians traditionally made in order to gradually isolate potentially interesting material, keyword searching treats the corpus as a single bag of words and therefore enables researchers to dive immediately into the texts that meet their search criteria.
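To make the "single bag of words" idea concrete, here is a minimal sketch of keyword searching over a tiny corpus. It is not how Delpher is implemented; the article identifiers, the sample texts, and the function names `tokenize`, `build_index`, and `search` are all invented for illustration. The sketch assumes that a query matches an article when every query word occurs somewhere in it, regardless of position.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(articles):
    """Map every word to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for token in tokenize(text):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing all words of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Hypothetical sample data: ids and texts are made up for this example.
articles = {
    "1885-03-02/a1": "The harbour strike continued in Rotterdam.",
    "1901-07-14/a7": "New tram line opened in Amsterdam.",
    "1923-11-30/a3": "Strike at the Amsterdam harbour ends.",
}

index = build_index(articles)
print(sorted(search(index, "harbour strike")))
```

Note that the search ignores which newspaper, year, or section an article belongs to. That is exactly what separates it from the top-down selection described above: the hits come from anywhere in the corpus at once.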


Topic Modeling: huh?

Topic modeling is a probabilistic, statistical method that can uncover themes and categories in amounts of text so large that they cannot be read by any individual human being. […] Topic modeling allows us to step back even further from analyzing representative articles in these topics to interpreting all of them, to supplement close readings of individual items with distant readings of tens of thousands of them.

(From: Robert K. Nelson, ‘Of Monsters, Men – and Topic Modeling’, The New York Times Opinionator Blog)

Topic modeling uses statistical techniques to categorize individual texts and, perhaps more importantly, to discover categories, topics, and patterns that we might not be aware of in those texts. A topic modeling program—here the impressive MALLET application developed by Andrew McCallum and others at the University of Massachusetts, Amherst—generates a specified number of topics from a group of documents. The specific topics are not predetermined by the researcher but instead emerge from the patterns uncovered by the statistical algorithm. All that is provided by the researcher is the number of topics.

(From: Robert K. Nelson, ‘Mining the Dispatch’)
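The point that the researcher supplies only the number of topics can be illustrated with a small sketch. The code below is not MALLET and makes no claim about its internals; it is a bare-bones collapsed Gibbs sampler for latent Dirichlet allocation (LDA), one widely used topic model, written in plain Python. The function name `lda_gibbs`, the smoothing values `alpha` and `beta`, the iteration count, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Assign every token to a topic using collapsed Gibbs sampling.

    docs is a list of token lists. Returns per-document topic counts
    and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical toy corpus, already tokenized; the words are invented.
docs = [
    "army battle general troops battle".split(),
    "troops army march general".split(),
    "cotton price market trade cotton".split(),
    "market trade price goods".split(),
]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = [w for w, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top_words}")
```

Nothing in the input labels a document as military or economic. Whatever grouping appears in the printed word lists comes from co-occurrence patterns alone, and it is up to the researcher to interpret and name the topics afterwards.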
