Periodic Reporting for period 2 - NewsEye (NewsEye: A Digital Investigator for Historical Newspapers)
Reporting period: 2019-05-01 to 2022-01-31
In recent decades, tens of millions of newspaper pages from European libraries have been digitized and made available online, and national libraries will intensify their digitization efforts in the coming years. Demand for access to historical newspapers is high: at any given moment, probably thousands of European citizens are reading digitized historical newspapers through digital library services. While the general public shows broad interest in this historical and cultural resource, it is of crucial importance for many humanities scholars.
The NewsEye project involves national libraries, humanities and social science research groups and computer science research groups. It addressed a number of challenges, which resulted in significant scientific advances, in several directions:
- in text recognition, text analysis, natural language processing, computational creativity and natural language generation, with regard to historical newspapers but also more universally,
- in digital newspaper research, addressing a number of editorial issues like OCR and article separation,
- in digital humanities, with respect to handling huge amounts of text material, the availability of useful tools and the possibilities for searching and browsing,
- in history, in terms of analyzing historical assets with new methods across different language corpora.
NewsEye has met its communication and dissemination aims. Concrete exploitation leads were established early on and pursued throughout the project. The NewsEye events (conferences, workshops, trainings, hackathons, etc.) attracted numerous participants from various user groups. NewsEye has paved the way for future research to be undertaken in the European Commission’s Horizon Europe and Digital Europe programs, bridging the gap between computer science, cultural heritage and digital humanities (and their funding streams). The project has demonstrated the value and necessity of unlocking historical newspaper data through a concerted effort that combines expertise in digital cultural heritage, digital humanities and computer science.
- Text Recognition and Article Separation, extracting the layout of newspapers (e.g. articles and graphical regions) from digitized newspapers and transforming the content to textual format, providing full articles through automatic layout analysis, text recognition and article separation.
- Semantic Text Enrichment, enhancing the utility of the newspaper collections by enriching the texts with higher-level semantic annotation using named-entity recognition. Extracted named entities were linked to external references (such as Wikipedia) across languages, in order to support multilingual analysis. This layer also provided event detection, as support for pattern discovery from textual contents.
- Dynamic Text Analysis, providing tools to exploit the enriched data for more elaborate analysis of user-selected newspaper content, supporting interactive queries to discover different viewpoints, sub-topics or trends concerning the selected topic, named entity, newspaper, timeframe or other category, so as to provide contextualized and comparative insights into the newspaper collection.
- Intelligent analysis and reporting (“Personal Research Assistant”), providing an alternative, “intelligent” interface to the other tools and the data, carrying out iterative cycles of analysis and reporting to the user in natural language. The user can authorize the Personal Research Assistant to investigate a given topic (or time window, newspaper, etc.) on the user’s behalf. The Assistant then reports back on findings that it assesses as potentially interesting for the user, in natural language and in a transparent manner, so that the findings can be understood and verified by the user. Given the European context, we were able not only to analyze newspapers written in multiple languages but also to report on the findings in multiple languages; to this end, the Assistant used multilingual natural language generation (NLG) to produce textual descriptions of the results obtained by the Investigator.
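To make the interplay between the analysis and reporting layers more concrete, the following minimal Python sketch shows one way a finding from a simple corpus query could be turned into a sentence in more than one language using templates. It is an illustration only and does not reproduce the NewsEye tools: the toy corpus, the names `ARTICLES`, `TEMPLATES`, `articles_per_year` and `report`, and the choice of template-based generation are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus of (year, language, text) tuples; not NewsEye data.
ARTICLES = [
    (1914, "en", "War declared as armies mobilise"),
    (1914, "en", "The war dominates every front page"),
    (1915, "en", "Harvest news and market prices"),
]

# Hypothetical report templates, one per output language.
TEMPLATES = {
    "en": "The term '{term}' appears most often in {year} ({count} articles).",
    "fr": "Le terme « {term} » apparaît le plus souvent en {year} ({count} articles).",
}


def articles_per_year(term, articles):
    """Count, per year, the articles containing the term as a whole word."""
    counts = Counter()
    for year, _lang, text in articles:
        # Case-insensitive match on whitespace-separated tokens.
        if term.lower() in text.lower().split():
            counts[year] += 1
    return counts


def report(term, articles, lang="en"):
    """Describe the peak year for a term in the requested language."""
    counts = articles_per_year(term, articles)
    if not counts:
        return None  # nothing to report for this term
    year, count = counts.most_common(1)[0]
    return TEMPLATES[lang].format(term=term, year=year, count=count)


if __name__ == "__main__":
    print(report("war", ARTICLES, lang="en"))
    print(report("war", ARTICLES, lang="fr"))
```

Keeping the analysis step (`articles_per_year`) separate from the wording step (`report`) means the same finding can be reported in any language for which a template exists, and the user can check the reported counts against the underlying data.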
The NewsEye consortium further involved experts whose role was to ensure (i) additional technical expertise in the above-mentioned aspects, (ii) access to and enrichment of digitized newspapers, (iii) insight and experience in using historical newspapers as a rich cultural heritage resource for understanding developments in society, economy and politics, (iv) use cases that address important research desiderata in the humanities and provide experience and feedback to guide the iterative development of the NewsEye demonstrator, and (v) strong dissemination and viable paths towards wider adoption and sustainability of the developed tools.
All the results and outputs of the project are available on the project website, notably with data sets, publications and source code inventoried under its "Open Science" tab.