Applying distributional semantics to trace conceptual change

Here is the abstract of a talk I gave in January 2017 at the AIUCD conference in Rome.

What we talk about when we talk about concepts – Applying distributional semantics on Dutch historical newspapers to trace conceptual change

Word embeddings – vector representations that embed words in a so-called semantic space, where the vectors of semantically similar words lie close together – are increasingly used for semantic searches in large text corpora. Distances between word vectors can be used to build semantic networks of words, which closely resemble the notion of semantic fields that humanities scholars are familiar with.
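The core mechanic is simple enough to sketch. The snippet below uses hand-made toy vectors (in practice they would come from a trained model such as word2vec; the words and numbers here are purely illustrative) to show how cosine similarity between vectors yields the nearest neighbours of a word, the seed of a semantic network:

```python
import numpy as np

# Toy semantic space: hand-made 4-dimensional vectors for a few words.
# Real embeddings from word2vec would have hundreds of dimensions.
vectors = {
    "newspaper": np.array([0.9, 0.1, 0.0, 0.2]),
    "journal":   np.array([0.8, 0.2, 0.1, 0.1]),
    "bicycle":   np.array([0.0, 0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word, topn=2):
    """Rank all other words by similarity to `word`."""
    sims = [(other, cosine_similarity(vectors[word], v))
            for other, v in vectors.items() if other != word]
    return sorted(sims, key=lambda p: p[1], reverse=True)[:topn]

print(nearest("newspaper"))  # "journal" ranks far above "bicycle"
```

Linking each word to its top-n neighbours in this way produces exactly the kind of network that can be read as a computational stand-in for a semantic field.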

We have previously shown how word embeddings, as produced by the popular implementation word2vec, can be used to trace concepts through time without depending on particular keywords (Kenter 2014). However, two main challenges come with the use of word embeddings to represent concepts and conceptual change for the study of history. Firstly, commensurability: the use of computational techniques like word2vec demands choices of a practical or technical nature. How do we legitimize these choices in terms of conceptual theory? Secondly, dependency on data: do the results of word embedding techniques provide insights into real conceptual change, or do they merely reflect arbitrary biases in the underlying data? Read more


From keyword searching to concept mining

Historical newspapers have traditionally been popular sources for studying public mentalities and collective cultures within historical scholarship. At the same time, they are notoriously time-consuming and complex to analyze. The recent digitization of newspapers, and the use of computers to access the growing mass of digital corpora of historical news media, are altering the historian’s heuristic process in fundamental ways.

The large digitization project currently run by the Dutch National Library (KB) illustrates this. To date, the KB has made publicly available over 80 million historical newspaper articles from the last four centuries. Researchers (as well as the wider public) can run full-text searches over the entire repository of articles through the KB’s own online search interface, Delpher. Instead of manually skimming through a selected number of editions or volumes, this functionality allows searching for particular (strings of) keywords within the entire corpus. As basic as it may seem, full-text searching completely overturns the way historians are used to approaching newspapers. Instead of the successive top-down selections historians traditionally made in order to gradually isolate potentially interesting material, keyword searching treats the corpus as a single bag of words and thereby enables researchers to dive straight into the texts that meet their search criteria. Read more
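The methodological shift can be made concrete in a few lines. The sketch below (with made-up articles; Delpher’s actual index is of course far more sophisticated) shows what it means to treat a corpus as a bag of searchable texts rather than a sequence of volumes to browse:

```python
# A minimal illustration: instead of browsing volume by volume,
# full-text search returns every article matching a query string.
# The articles here are invented placeholders, not real Delpher records.
articles = [
    {"date": "1919-05-03", "text": "De eerste vlucht over de oceaan"},
    {"date": "1933-11-12", "text": "Taylorisme in de fabriek"},
    {"date": "1951-02-28", "text": "Nieuwe fabriek geopend in Utrecht"},
]

def search(keyword):
    """Return all articles whose text contains the keyword (case-insensitive)."""
    return [a for a in articles if keyword.lower() in a["text"].lower()]

print([a["date"] for a in search("fabriek")])
```

The hits arrive regardless of where in the chronology they sit, which is precisely why the historian’s traditional top-down selection funnel no longer applies.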

OCR’ing and analysing comic books – a workflow report

I like experimenting with text analysis tools like Voyant. However, most tools for corpus linguistics don’t account for historical change – which, as a historian, I am mostly interested in. Historians working with tools like these have to devise their own ways to add a time scale to their analyses. The most straightforward way to do so is by arranging a corpus in chronological order. It is highly interesting, for example, to study linguistic changes in successive volumes of newspapers or periodicals, as I do in my academic research.

Really just as an experiment, I’ve also tried analyzing volumes of comic books. I happen to possess some digital comic book archives; they have a nice chronological order, and they are quite under-studied as historical sources for changes in (popular) culture. As turning digital comic books (in cbr format) into analyzable text files took more effort than I realized, what follows is the workflow I constructed. I don’t know whether it’s the optimal way of doing so, but as someone who is new to bash commands, OCR’ing, and the combination of both, I had a lot of fun figuring this out. No, it’s probably a long way from being optimal. More like quick-and-dirty, although it isn’t quick either (depending on the volume of your dataset). But it requires hardly any manual action, so for all its downsides it really is a fun way of experimenting with the (historical) text analysis of some original data. Read more

KB research: The language of Taylorism

From February through July 2015 I am a Researcher in Residence at the research department of the Koninklijke Bibliotheek in The Hague. I plan to use my time here to work on a common problem in digital historical research: translating a historiographical question into the words that search engines and other tools understand. I will do so using a case from my own research, ‘the language of Taylorism’. Read more

The Humanities and Technology in Utrecht

Together with Ilja Nieuwland (Huygens-ING, The Hague), Arjan van Hessen (CLARIAH), and my Utrecht colleague Melvin Wevers, I am organizing a THATCamp in Utrecht in January 2015. The announcement text is below:

Another THATCamp is coming up, this time organized in Utrecht. It is a two-day meeting that offers all humanities scholars in the Netherlands the opportunity to share experiences and questions with data providers, IT specialists, and each other about the use of digital tools in research and teaching. In keeping with the ‘rules’ of a THATCamp, the two-day program on 28 and 29 January 2015 will largely be determined by the participants themselves. Read more

Noise in big data

A sensible step by ING last week: the bank deflated its trial balloon early and will, for the time being, not try to monetize the ‘big data’ that its customer records constitute. That is very sensible of the bank, and it offers an opportunity to first reflect on the considerable consequences of the big-data thinking that is taking hold all around us. And not only in the loyalty-card systems of the business world. Big data also includes the millions of phone calls that the Dutch intelligence services MIVD and AIVD intercept every month, legally or otherwise.

What is big data? Suppose someone with the flu goes to their general practitioner. If that person had influenza, that fact ends up, together with all other flu cases in Europe, at the World Health Organization, which has been tracking the spread of the influenza virus in Europe for decades. By turning that data into graphs and maps, patterns become visible that yield valuable insights into the recurring risk periods and risk areas for flu. The perspective of the flu-stricken patient versus that of the WHO is the difference between ‘small data’ and ‘big data’.

Read more

Digital Humanities like The Secret of Monkey Island™

In their excellent chapter on the use of digital data in historical research, Frederick W. Gibbs and Trevor J. Owens distinguish between two DH approaches to data. ‘Data’, they argue, ‘does not always have to be used as evidence. It can also help with discovering and framing research questions’. On the one hand, you have ‘complex statistical methods’ and ‘rigorous mathematics’ (or ‘mathematical rigor’) to ‘support epistemological claims’. Gibbs and Owens equate this type of DH research with the wave of quantitative history in the 1960s and 1970s, which used data ‘for quantifying, computing and creating knowledge’.

On the other hand, there is a ‘fundamentally different’ form of using data – a form that is exploratory instead of analytic, and deliberately without the mathematical complexity needed to derive evidence from quantitative analyses. Above all, it’s a form of data manipulation that can be playful (although the authors removed the adjective in one of the places it appeared in their text). Gibbs and Owens state that ‘playing with data – in all its formats and forms – is more important than ever’. Read more