Hacking the News: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining

Event date
Mon 5 to Tue 6.3.2018
12:30 to 17:00
National Library of Finland’s Network Services, Kaikukatu 4, Helsinki

Call for Participation: DHN 2018 Pre-Conference Workshop

Libraries have been digitising historical newspapers since the early 2000s. However, to what extent are these digitised newspaper archives being used in digital humanities research? Web-archiving began in 1996 with the Internet Archive initiative and its well-known digital archive ‘The Wayback Machine’. Since then, a multitude of web-archiving initiatives have been established to continue these efforts. However, the true potential of digital newspaper corpora and web-archives is as yet under-exploited. Hacking the news: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining is intended to help redress this balance.

Hacking the news is a 1.5-day workshop held prior to DHN 2018. It is primarily intended for, but not limited to, early career researchers. The aim of the workshop is to introduce a range of topics to consider when undertaking digital analysis of newspaper corpora and analysing web-archives for research. A draft programme for the workshop is available.

Day one of the workshop will focus on setting the context, addressing questions such as: How are digital newspaper corpora created? What is Optical Character Recognition? How does that differ from Optical Layout Recognition? How does news on the archived web differ from digitised newspapers? What data formats are used for the archived web and how do we analyse web-archive datasets?

Day two of the workshop will give participants the opportunity to get their (digital) hands dirty by working with digital newspapers and web-archives. As far as possible, corpora will be provided in a number of languages, based on the needs of the workshop participants. As well as corpus preparation, where issues such as data cleaning will be explored, there will be an opportunity to test a range of text and data mining tools for analysing digital corpora.

Call for Participation: To participate in the workshop (ca. 25-30 participants), please complete this form by Wednesday 31 January 2018 with details of your research interests, preferred languages for the digital corpora, level of technical experience and motivation for participating in the workshop, plus a short biographical note.

Venue: Hacking the News is hosted by the National Library of Finland’s Network Services.

Organisers: This workshop is co-organised by the Ghent Centre for Digital Humanities (GhentCDH), the International Internet Preservation Consortium (IIPC) and the National Library of Finland, in collaboration with the Helsinki Centre for Digital Humanities (HELDIG); the Digital Humanities Lab (DIGHUMLAB), Denmark; the DH Lab of École polytechnique fédérale de Lausanne (EPFL); the Luxembourg Centre for Contemporary and Digital History (C2DH); the Alan Turing Institute; Platform DH, University of Antwerp; and the Leuven Centre for Digital Humanities. The digitised newspaper collections and web-archives will be provided by a number of national and university libraries. The workshop is supported by DARIAH (Digital Research Infrastructure for the Arts and Humanities), CLARIN (European Research Infrastructure for Language Resources and Technology) and the IMPACT Centre of Competence.


Contact information


Sally Chambers, Ghent Centre for Digital Humanities

Olga Holownia, International Internet Preservation Consortium

Lassi Lager, National Library of Finland