Applying distributional semantics to trace conceptual change

Here is the abstract of a talk I gave at the AIUCD conference in Rome in January 2017.

What we talk about when we talk about concepts – Applying distributional semantics on Dutch historical newspapers to trace conceptual change

Word embeddings – vector representations of words that embed words in a so-called semantic space where the vectors of semantically similar words lie close together – are increasingly used for semantic searches in large text corpora. Word vector distances can be used to build semantic networks of words. This closely resembles the notion of semantic fields that humanities scholars are familiar with.
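To make the idea of vector distance a bit more concrete, here is a minimal sketch in Python. The vectors are made up for illustration and do not come from word2vec or from any real corpus; the names `TOY_VECTORS`, `cosine_similarity` and `semantic_network` are hypothetical. It shows one common way of measuring closeness (cosine similarity) and a simple way of turning pairwise similarities into network edges.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Real word embeddings usually have hundreds of dimensions.
TOY_VECTORS = {
    "efficiency": [0.9, 0.1, 0.2],
    "productivity": [0.8, 0.2, 0.3],
    "factory": [0.6, 0.5, 0.1],
    "poetry": [0.1, 0.9, 0.7],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_network(vectors, threshold):
    """Return (word_a, word_b, similarity) edges for pairs at or above threshold."""
    words = sorted(vectors)
    edges = []
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            similarity = cosine_similarity(vectors[first], vectors[second])
            if similarity >= threshold:
                edges.append((first, second, round(similarity, 3)))
    return edges


# The threshold is an arbitrary choice; it decides how dense the network gets.
edges = semantic_network(TOY_VECTORS, threshold=0.9)
for edge in edges:
    print(edge)
```

With these toy values, only the pair "efficiency" and "productivity" passes the 0.9 threshold. Lowering the threshold adds more edges, which is exactly the kind of practical choice that needs to be justified when such networks are read as semantic fields.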

We have previously shown how word embeddings, as produced by the popular word2vec implementation, can be used to trace concepts through time without depending on particular keywords (Kenter 2014). However, two main challenges come with the use of word embeddings to represent concepts and conceptual change for the study of history. Firstly: commensurability. The use of computational techniques like word2vec demands choices of a practical or technical nature. How do we legitimize these choices in terms of conceptual theory? Secondly: dependency on data. Do the results of word embedding techniques provide insights into real conceptual change, or do they merely reflect arbitrary biases in the underlying data?

From keyword searching to concept mining

Historical newspapers have traditionally been popular sources to study public mentalities and collective cultures within historical scholarship. At the same time, they have been known as notoriously time-consuming and complex to analyze. The recent digitization of newspapers and the use of computers to gain access to the growing mass of digital corpora of historical news media are altering the historian’s heuristic process in fundamental ways.

The large digitization project that the Dutch National Library (Koninklijke Bibliotheek, KB) currently runs illustrates this. So far, the KB has made over 80 million historical newspaper articles from the last four centuries publicly available. Researchers (as well as the wider public) are able to do full-text searches in the entire repository of articles through the KB’s own online search interface, Delpher. Instead of manually skimming through a selected number of editions or volumes, this functionality allows for searching particular keywords (or strings of keywords) within the entire corpus. As basic as it may seem, full-text searching completely overturns the way in which historians are used to approaching newspapers. Instead of the successive top-down selections historians traditionally made in order to gradually isolate potentially interesting material, keyword searching treats the corpus as a single bag of words and therefore enables researchers to dive immediately into the texts that meet their search criteria.

OCR’ing and analysing comic books – a workflow report

I like experimenting with text analysis tools like Voyant. However, most tools for corpus linguistics don’t account for historical change, which is what I, as a historian, am mostly interested in. Historians working with tools like these have to think of ways themselves to add a time scale to their analyses. The most straightforward way to do so is by arranging a corpus in chronological order. It is highly interesting, for example, to study linguistic changes in successive volumes of newspapers or periodicals, as I do in my academic research.

Really just as an experiment, I’ve also tried analyzing volumes of comic books. I happen to possess some digital comic book archives, they have a nice chronological order, and they are quite under-studied as historical sources for changes in (popular) culture. As turning digital comic books (in cbr format) into analyzable text files took more effort than I expected, what follows is the workflow I constructed. I don’t know whether it’s the optimal way of doing so, but as someone who is new to bash commands, OCR’ing, and the combination of both, I had a lot of fun figuring this out. No, it’s probably a long way from being optimal. More like quick-and-dirty, although it isn’t quick either (depending on the volume of your dataset). But it does require hardly any manual action, so for all its downsides it really is a fun way of experimenting with the (historical) text analysis of some original data.

KB research: The language of Taylorism

From February through July 2015 I am a Researcher-in-Residence (Onderzoeker te Gast) at the research department of the Koninklijke Bibliotheek in The Hague. I plan to use my time here to work on a common problem in digital historical research: translating a historical research problem into the words that search engines and other tools understand. I will do so by means of a case from my own research, ‘the language of Taylorism’.

The Humanities and Technology in Utrecht

Together with Ilja Nieuwland (Huygens-ING, The Hague), Arjan van Hessen (CLARIAH) and my Utrecht colleague Melvin Wevers, I am organizing a THATCamp in Utrecht in January 2015. The announcement text follows below:

There will be another THATCamp, this time organized in Utrecht. It is a two-day meeting that offers all humanities scholars in the Netherlands the opportunity to share experiences and/or questions about the use of digital tools in research and/or teaching with data providers, IT specialists and each other. In keeping with the ‘rules’ of THATCamp, the two-day programme on 28 and 29 January 2015 is largely set by the participants themselves.

Noise in big data

A sensible step by ING last week: the bank is letting its trial balloon deflate early and will, for the time being, not try to monetize the ‘big data’ formed by its customer records. That is very sensible of the bank, and it offers the opportunity to first pause and reflect on the (not insignificant) consequences of the big-data thinking that is taking hold all around us. And not only in the loyalty-card systems of the business world. Big data are also the millions of phone calls that the Dutch intelligence services MIVD and AIVD intercept every month, legally or otherwise.

What is big data? Suppose someone with the flu goes to their GP. If that person had influenza, that fact has ended up at the World Health Organization, together with all other flu cases in Europe. The WHO has been tracking the spread of the influenza virus in Europe this way for decades. Turning that data into graphs and maps reveals patterns that provide valuable insights into the recurring risk periods and risk areas for flu. The perspective of the flu-stricken patient versus that of the WHO is the difference between ‘small data’ and ‘big data’.


Digital Humanities like The Secret of Monkey Island™

In their excellent chapter on the use of digital data in historical research, Frederick W. Gibbs and Trevor J. Owens distinguish between two DH approaches to data. ‘Data’, they argue, ‘does not always have to be used as evidence. It can also help with discovering and framing research questions’. On the one hand, you have ‘complex statistical methods’ and ‘rigorous mathematics’ (or ‘mathematical rigor’) to ‘support epistemological claims’. Gibbs and Owens equate this type of DH research with the wave of quantitative history in the 1960s and 1970s, using data ‘for quantifying, computing and creating knowledge’.

On the other hand, there is a ‘fundamentally different’ form of using data: a form that is exploratory instead of analytic and deliberately without the mathematical complexity that is needed to derive evidence from quantitative analyses. Above all, it’s a form of data manipulation that can be playful (although the authors removed the adjective at one of the places it appeared in their text). Gibbs and Owens state that ‘playing with data – in all its formats and forms – is more important than ever’.

Digital Newspapers as a source for (digital) history

Last week, I gave a talk at the Europeana Newspapers Information Day at the Staatsbibliothek Berlin on the use of digitised historical newspapers in our Translantis project. I gave an impression of the tools and functionalities we are experimenting with and the challenges – in terms of source criticism and interpretation – that come along with this fairly new type of historical research.

These are exciting times for historians. Both the quantity of historical source material being digitized at an ever-growing pace and the development of tools and techniques for grasping this data will have an irreversible impact on the way historical research is done. All the more essential is the realization that digital methods are there to assist and not to replace the historian. They can never make up for the need for ‘old-fashioned’ historical analysis and narrative.


Before you do digital history…

This blog post is the adapted conclusion from the paper ‘A Digital Humanities Approach to the History of Science. Eugenics revisited in hidden debates by means of semantic text mining’ I wrote in collaboration with Fons Laan, Maarten de Rijke and Toine Pieters. The article was based on the research I did within the historical text mining project BILAND, as well as its predecessor WAHSP. The article is in press as part of the Proceedings of the 1st International Workshop on Histoinformatics.

In a recent blog post called ‘The Deceptions of Data’, Andrew Prescott has criticized the jubilation over the ‘digital revolution’. He states that “One of the problems confronting data enthusiasts in the humanities is that we feel a need to convince our more old fashioned colleagues about what can be done. But our role as advocates of digitized data shouldn’t mean that we lose our critical sense as scholars. [...] [T]here is a risk that we look more carefully at the technical components of the datasets than the historical context of the information that they represent.”

Omzien in bewondering (Looking back in admiration): Anne Kox bids farewell

‘Omzien in bewondering’ (‘Looking back in admiration’), a variation on the title of Annie Romein-Verschoor’s famous memoirs, was the title of the farewell lecture that Anne Kox delivered on 12 September in the Aula of the University of Amsterdam. With this lecture Kox took his leave as professor of the history of physics at this university.

The admiration in Kox’s title was directed, and this was the point he made, at the scientists who throughout history managed to think outside the existing paradigms and drive science forward. He gave several examples of this in his lecture, concentrating on his area of specialization, the revolutionary development of physics in the first half of the twentieth century. Kox is a great authority on many of the leading figures of that period, in particular Albert Einstein and the Dutch professor of theoretical physics Hendrik Antoon Lorentz. He is the editor of Lorentz’s scientific correspondence (volume 1 was published by Springer in New York in 2008; volume 2 is expected soon) and has since 1985 been an editor at the prestigious Einstein Papers Project at Caltech in Pasadena, which is working to make Einstein’s complete archive accessible in thick volumes. That will take a while yet, so Kox is simply keeping his American job.