From keyword searching to concept mining

Historical newspapers have traditionally been popular sources for studying public mentalities and collective cultures in historical scholarship. At the same time, they are notoriously time-consuming and complex to analyze. The recent digitization of newspapers, and the use of computers to access the growing mass of digital corpora of historical news media, are altering the historian’s heuristic process in fundamental ways.

The large digitization project currently run by the Dutch National Library (Koninklijke Bibliotheek, KB) illustrates this. So far, the KB has made over 80 million historical newspaper articles from the last four centuries publicly available. Researchers (as well as the wider public) can run full-text searches across the entire repository of articles through the KB’s own online search interface, Delpher. Instead of manually skimming through a selected number of editions or volumes, this functionality allows for searching particular (strings of) keywords within the entire corpus. As basic as it may seem, full-text searching completely overturns the way in which historians are used to approaching newspapers. Instead of the successive top-down selections historians traditionally made in order to gradually isolate potentially interesting material, keyword searching treats the corpus as a single bag of words and therefore enables researchers to dive immediately into the texts that meet their search criteria.
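To make the contrast concrete, here is a minimal sketch of keyword searching over a flat collection of articles. It is an illustration only: the `articles` list, its fields, and the `search` function are hypothetical stand-ins, and nothing here reflects how Delpher itself is implemented or queried.

```python
import re

# Hypothetical miniature corpus; a real one would hold millions of articles.
articles = [
    {"id": "a1", "date": "1903-04-07", "text": "The dock workers' strike continued."},
    {"id": "a2", "date": "1925-11-02", "text": "A striking new theatre opens."},
    {"id": "a3", "date": "1946-09-14", "text": "Strike ends after talks."},
]


def search(corpus, phrase):
    """Return every article containing the phrase as whole words, ignoring case.

    No newspaper title, year, or section is selected beforehand: the search
    runs over all texts at once, which is the 'bag of words' view.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [article for article in corpus if pattern.search(article["text"])]


hits = search(articles, "strike")
print([article["id"] for article in hits])
```

The point of the sketch is the missing selection step: the query is the only filter, so hits from 1903 and 1946 surface side by side, without the researcher having chosen a period or a title first.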

OCR’ing and analysing comic books – a workflow report

I like experimenting with text analysis tools like Voyant. However, most tools for corpus linguistics don’t account for historical change, which is what I, as a historian, am mostly interested in. Historians working with tools like these have to think of ways to add a time scale to their analyses themselves. The most straightforward way to do so is to arrange a corpus in chronological order. It is highly interesting, for example, to study linguistic changes in successive volumes of newspapers or periodicals, as I do in my academic research.
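As a rough sketch of what adding a time scale can look like, the snippet below groups texts by year and computes how often a term occurs relative to all words in that year. The `documents` list and the function name are assumptions made up for illustration; this is not how Voyant or any particular tool works.

```python
import re
from collections import Counter


def relative_frequency_by_year(documents, term):
    """Map each year (in chronological order) to the share of tokens equal to `term`."""
    hits = Counter()
    totals = Counter()
    for year, text in documents:
        tokens = re.findall(r"\w+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    # Sorting the years is what turns a flat corpus into a time series.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical (year, text) pairs, deliberately out of order.
documents = [
    (1963, "Spider sense tingling again"),
    (1962, "The spider bites Peter"),
    (1963, "Peter fights crime"),
]

trend = relative_frequency_by_year(documents, "peter")
for year, share in trend.items():
    print(year, round(share, 3))
```

Relative rather than absolute counts are used so that years with more surviving text do not automatically look like peaks.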

Really just as an experiment, I’ve also tried analyzing volumes of comic books. I happen to possess some digital comic book archives; they come in a nice chronological order, and they are quite under-studied as historical sources for changes in (popular) culture. As turning digital comic books (in cbr format) into analyzable text files took more effort than I expected, what follows is the workflow I constructed. I don’t know whether it’s the optimal way of doing so, but as someone who is new to bash commands, OCR’ing, and the combination of both, I had a lot of fun figuring this out. No, it’s probably a long way from being optimal. More like quick-and-dirty, although it isn’t quick either (depending on the volume of your dataset). But it requires hardly any manual action, so for all its downsides it really is a fun way of experimenting with the (historical) text analysis of some original data.

KB research: The language of Taylorism

From February through July 2015, I am a Researcher-in-Residence (Onderzoeker te Gast) at the research department of the Koninklijke Bibliotheek in The Hague. I plan to use my time here to work on a common problem in digital historical research: translating a historical research problem into the words that search engines and other tools understand. I do so by way of a case from my own research, ‘the language of Taylorism’.

The Humanities and Technology in Utrecht

Together with Ilja Nieuwland (Huygens-ING, The Hague), Arjan van Hessen (CLARIAH) and my Utrecht colleague Melvin Wevers, I am organizing a THATCamp in Utrecht in January 2015. The announcement text follows below:

There will be another THATCamp, this time organized in Utrecht. It is a two-day meeting that offers all humanities scholars in the Netherlands the opportunity to share experiences and/or questions about the use of digital tools in research and/or teaching with data providers, IT specialists, and each other. In keeping with the ‘rules’ of THATCamp, the two-day programme on 28 and 29 January 2015 is largely set by the participants themselves.

Noise in big data

A sensible step by ING last week: the bank is letting its trial balloon deflate early and will not, for the time being, try to monetize the ‘big data’ that its customer records constitute. That is very sensible of the bank, and it offers the opportunity to first pause and consider the (not insignificant) consequences of the big-data thinking that is taking hold all around us. And not only in the loyalty-card systems of the business world. Big data is also the millions of phone calls that the Dutch intelligence services MIVD and AIVD intercept every month, legally or otherwise.

What is big data? Suppose someone with the flu goes to their GP. If that person had influenza, that fact has ended up, together with all other flu cases in Europe, at the World Health Organization. The WHO has been tracking the spread of the influenza virus in Europe in this way for decades. Turning that data into graphs and maps reveals patterns that give valuable insight into the recurring risk periods and risk areas for flu. The perspective of the flu-stricken patient versus that of the WHO is the difference between ‘small data’ and ‘big data’.


Digital Humanities like The Secret of Monkey Island™


In their excellent chapter on the use of digital data in historical research, Frederick W. Gibbs and Trevor J. Owens distinguish between two DH approaches to data. ‘Data’, they argue, ‘does not always have to be used as evidence. It can also help with discovering and framing research questions’. On the one hand, you have ‘complex statistical methods’ and ‘rigorous mathematics’ (or ‘mathematical rigor’) to ‘support epistemological claims’. Gibbs and Owens equate this type of DH research with the wave of quantitative history in the 1960s and 1970s, using data ‘for quantifying, computing and creating knowledge’.

On the other hand, there is a ‘fundamentally different’ form of using data: a form that is exploratory instead of analytic, and deliberately without the mathematical complexity that is needed to derive evidence from quantitative analyses. Above all, it’s a form of data manipulation that can be playful (although the authors removed the adjective at one of the places it appeared in their text). Gibbs and Owens state that ‘playing with data – in all its formats and forms – is more important than ever’.

Digital Newspapers as a source for (digital) history

Last week, I gave a talk at the Europeana Newspapers Information Day at the Staatsbibliothek Berlin on the use of digitised historical newspapers in our Translantis project. I gave an impression of the tools and functionalities we are experimenting with, and of the challenges, in terms of source criticism and interpretation, that come with this fairly new type of historical research.

These are exciting times for historians. Both the ever-growing pace at which historical source material is being digitized and the development of tools and techniques for grasping this data will have an irreversible impact on the way historical research is done. All the more essential is the realization that digital methods are there to assist the historian, not to replace them. They can never make up for the need for ‘old-fashioned’ historical analysis and narrative.
