CORDIS - EU research results

NewsEye: A Digital Investigator for Historical Newspapers

Deliverables

Automatic Text Recognition (final)

Reports on software tools and modules, including documentation, for Automatic Text Recognition. Technical reports on further development and innovative adaptation of algorithms and methods for Automatic Text Recognition.

Dissemination, communication and exploitation of results (e) (final)

The PEDR will be delivered at M3, and the project will follow through by maintaining a rolling plan of activities to disseminate and exploit project results, including reports or publications for each event on a particular topic. This deliverable includes rapid dissemination channels in the form of blog posts, tweets and other online media, as well as more traditional dissemination outputs (conference papers, scholarly articles). At M12, M24 and M36 we will provide yearly reports on the execution of the PEDR as well as on all dissemination and communication events organized during the project. Main dissemination and communication events are planned at M3, M14, M24, M25, M26 and M30 but will be reported on yearly together with smaller-scale events. After M36, this deliverable, under the lead of WP7 by BNF, will provide details on the dissemination, communication and exploitation of results during the project extension.

Layout Analysis (final)

Reports on software tools and modules, including documentation, for Layout Analysis. Technical reports on further development and innovative adaptation of algorithms and methods for Layout Analysis.

Usability/Fit for research purpose test of tools and user interfaces (c) (final)

The deliverables will report on testing the methods, tools and interfaces to the core. They are the result of collaboration on the mockups and prototypes and of workshop/hackathon participation with the computer science groups and the libraries, as indicated in Task T74, providing extensive feedback on tools and methods. UIBKICH will supervise the production of reports in preparation for and as a follow-up to the tools, prototypes, beta versions and publishable tools, and along the timeline of WP7. The final version is due at M34, with a possible update at M45.

Contextualized Case Studies for academic use (d) (final)

The deliverables will report on the four digital humanities case studies prepared by using already existing methods and tools as well as the ones to be developed in this project, showing progress and improvement of search and research outcomes. UIBKICH will be responsible for the case studies on migration, UHDH for the case study on nationalisms and revolutions, UNIVIE for the case study on media and journalism, and UPVM for the case study on gender. The members of the DH group will furthermore compare and contrast the results of the case studies in order to show how newspapers work both as a space for change and as a space for stability, while addressing the relationship between press, politics and society in different regions and languages across Europe, thus showing the transformation of our societies. The deliverables will (a) include thorough literature and background research for each of the case studies; (b) work with the semantically enriched text as well as application/utilization of the developed dynamic text analysis features in different languages in order to improve the quality of the case studies; (c) show how the developed tools contribute to change and continuity discussions for European societies. Draft reports will be delivered at M6 and complete reports at M12, while final reports, to be submitted for publication in renowned humanities and digital humanities journals, will be completed at M24 and M36.

Personal Research Assistant: Explainer (b) (final)

This deliverable describes the Explainer component. The first version (M24) will be able to produce initial descriptions of strategies, goals and decisions of the Investigator, while the second version (M36) describes the final version. The final version is due at M36, with a possible update at M45.

Article separation (c) (final)

Reports on software tools and modules, including documentation, for Article Separation. Technical reports on further development and innovative adaptation of algorithms and methods for Article Separation; journal and research paper submissions on new, preferably machine-learning-based (neural) algorithms and technologies for Article Separation, along with the inherently used Layout Analysis, Text Line Detection and Automatic Text Recognition. The final version is due at M36, with a possible update at M45.

Event detection (final)

Report on the level of completion of the event detection tool. The version at M24 presents the state of the art in event detection, relying on the detection of events based solely on the document content, using string-based multilingual approaches based on rhetoric and specificities of the news genre, as previously developed at ULR. The second version at M36 will integrate contrastive knowledge from other documents. The final version is due at M36, with a possible update at M45.

Personal Research Assistant: Reporter (c) (final)

This deliverable describes the Reporter component and how it is used. The first version (M12) will be capable of some simple natural language generation, using relatively rigid document structures and mechanisms for talking about the results of tools produced in WP34 during year one. The second version (M24) will have more elaborate document structuring and will be able to report more flexibly on a wider range of analysis results. The second version will also have a first version of summarization of textual contents. The third version of the deliverable (M36) will describe the final version with full functionality. The final version is due at M36, with a possible update at M45.

Use of project results for the general public (b) (final)

The deliverables will report on the texts, podcasts and social media activities by the digital humanities group. UNIVIE will be supervising the podcast production, UPVM the linking with Wikipedia, and UHDH the social media activities.

NewsEye Demonstrator (c) (final)

Reports and software on the development of the NewsEye Demonstrator, a web-based user interface for tools developed in WP3 and 4 and for the Personal Research Assistant (WP5). Tools for the user interface of WP3 will be provided at M12, while the complete minimum viable product (MVP) will be delivered at M24 and the final version at M36. The final version is due at M36, with a possible update at M45.

Sustainability plan (c) (final)

The project will conceptualize a sustainability strategy for long-term access to the tools and data generated by the project, to be planned in full detail at M26, under implementation at M36 and fully implemented at M45.

Stance detection (final)

Reports on the level of completion of the software tool for stance detection (M12). The first version at M12 will rely on standards of the state of the art, and the second version at M24 contains our principal research contribution, robust to noise and language-independent.

Showcase case studies for the user interface (b) (final)

The deliverables will consist of texts, videos, statistics, search paths, how-tos, etc. on the user interface and on the project homepage. All partners of the digital humanities group will contribute to the deliverable.

Personal Research Assistant: Investigator (c) (final)

The deliverable describes the Investigator tool. In the first iteration (M12), the Investigator will be capable of planning, forming and running some queries using analysis tools developed in parallel in WP34, and of interacting with the user in simple ways to continue the investigation. In the second iteration (M24), the Investigator will also be able to create strategies for investigation, to analyze the results obtained, and to adjust its strategy accordingly. The third iteration (M36) describes the final version with full functionality. The final version is due at M36, with a possible update at M45.

Advanced tool to query the enriched data sets (final)

Report on the software to query the data sets (M6). The first version is delivered early on, at M6, to allow querying the data set as soon as possible, without the semantic enrichment produced in other deliverables of WP3. The second version at M12, reporting on the software to analyze the data and the enriched data sets, is delivered as soon as possible and allows querying the data set and the enriched data set, including the semantic text enrichment to be produced in the rest of WP3 (D3.1-D3.3).

Data models (d) (final)

Regular reports providing a detailed description of the data models, formats and specifications used in the project, including publicly available example data.

Data collection and preservation (d) (final)

Report and data collection

Comparative analysis of data between contexts (b) (final)

Reports on the developed methods and tools for dynamic comparative analysis of data between given contexts. The first version at M24 describes the methods to extract sets of characteristics to describe similarities or contrasts between document groups, and the second version at M36 describes the final methods to extract contrasting characteristics from groups of documents, integrated with work on intelligible descriptions. The final version is due at M36, with a possible update at M45.

Educational material for teachers, pupils and lay historians (b) (final)

The deliverables consist of prototypes of the educational material in M24 and the online published material in M36. While all partners of the digital humanities group will contribute to the production of the material, UHDH will supervise the production of material for teachers, UPVM for pupils and students, and UIBKICH for lay historians in different languages. A report on educational material prototypes will be delivered at M24; the final report will be delivered at M36.

Analysis of data in a given context (c) (final)

Reports on the level of completion of the software tool for dynamic analysis of data in a given context. The first version at M12 will be tools for building multilingual topic models, topic hierarchies and dynamic topic models, and using them to analyze articles in the initial dataset; the second version at M24 contains document analysis methods for article similarity and link discovery to suggest related articles, combining multilingual, hierarchical, dynamic topic models; and the third version at M36 contains document analysis methods refined on the basis of feedback from their use in the Personal Research Assistant and evaluation of their integration with intelligible descriptions. The final version is due at M36, with a possible update at M45.

NE recognition and linking (final)

Reports on the level of completion of the software tool to recognize and link named entities (NEs). The first version at M12 will rely on standards of the state of the art, and the second version at M24 contains our principal research contribution, robust to noise and language-independent.

Intelligible representation of statistical analysis (b) (final)

Reports on the methods and tools for outputting human-intelligible representations based on the outputs from statistical models developed in T41 and T42. The first version at M24 describes the methods that provide intelligible names/descriptions of topics and extracted characteristics for use in the Personal Research Assistant, and the second version at M36 describes the final methods to provide intelligible descriptions, refined after integration in the Personal Research Assistant. The final version is due at M36, with a possible update at M45.

Project website (to be continuously updated)

The project will maintain a website that will act as a portal for the communications activities. In M1 a web page will be published to advertise and announce the project. By M8 the full website structure will be in place, integrating social media (such as Twitter) channels. The website will be maintained throughout the duration of the project and content will be contributed by all project partners.

Data management plan

The NewsEye project will contribute to the open research data pilot. According to the guidelines for Research Data Management of Horizon 2020 (http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf), a Data Management Plan will be written during the first six months, explaining what data will be generated, collected, shared and curated during the project as well as after the project’s end. It will consider the different kinds of research outcomes (WP6) and data (WP2-5) resulting from the project. One important goal of NewsEye is to make its data findable, accessible, interoperable and reusable (FAIR).

Publications

Exploring Entities in Event Detection as Question Answering

Author(s): Boros, Emanuela; Moreno, Jose G.; Doucet, Antoine
Published in: Proceedings of the 44th European Conference on Information Retrieval (ECIR), 2022
Publisher: Springer
DOI: 10.5281/zenodo.5779941

L3i at SemEval-2022 Task 11: Straightforward Additional Context for Multilingual Named Entity Recognition

Author(s): Emanuela Boros, Carlos-Emiliano Gonzalez-Gallardo, Jose G. Moreno, Antoine Doucet
Published in: International Workshop on Semantic Evaluation (SemEval), Issue Task 11, 2022
Publisher: ACL
DOI: 10.5281/zenodo.6369947

A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers

Author(s): Ahmed Hamdi; Elvys Linhares Pontes; Emanuela Boros; Thi Tuyet Hai Nguyen; Günter Hackl; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, Page(s) 2328–2334
Publisher: ACM
DOI: 10.1145/3404835.3463255

Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition

Author(s): Ahmed Hamdi; Axel Jean-Caurant; Nicolas Sidere; Mickaël Coustaty; Antoine Doucet
Published in: Proceedings of the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, Issue 12246, 2020, Page(s) 87–101
Publisher: Springer
DOI: 10.1007/978-3-030-54956-5_7

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Author(s): Emanuela Boros; Ahmed Hamdi; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Jose G. Moreno; Nicolas Sidere; Antoine Doucet
Published in: Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL), 2020, Page(s) 431–441
Publisher: ACL
DOI: 10.18653/v1/2020.conll-1.35

Exploring Entities in Event Detection as Question Answering

Author(s): Boros, Emanuela; Moreno, Jose G.; Doucet, Antoine
Published in: European Conference on Information Retrieval (ECIR 2022), 2022, Page(s) 65-79, ISBN 978-3-030-99735-9
Publisher: Springer
DOI: 10.1007/978-3-030-99736-6_5

Grammatical Profiling for Semantic Change Detection

Author(s): Giulianelli, Mario; Kutuzov, Andrey; Pivovarova, Lidia
Published in: Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), 2021
Publisher: ACL
DOI: 10.18653/v1/2021.conll-1.33

Multilingual Epidemic Event Extraction

Author(s): Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine; Lejeune, Gaël; Jatowt, Adam; Odeo, Moses
Published in: Proceedings of the 23rd International Conference on Asian Digital Libraries (ICADL), Issue 13133, 2021, Page(s) 139–156
Publisher: Springer
DOI: 10.5281/zenodo.5779966

Transformer-based Methods for Recognizing Ultra Fine-grained Entities (RUFES)

Author(s): Boros, Emanuela; Doucet, Antoine
Published in: Thirteenth Text Analysis Conference (TAC 2020), 2021
Publisher: NIST
DOI: 10.5281/zenodo.4555778

Information Extraction from Invoices

Author(s): Ahmed Hamdi; Elodie Carel; Aurelie Joseph; Mickael Coustaty; Antoine Doucet
Published in: International Conference on Document Analysis and Recognition ICDAR 2021, Issue 12822, 2021, Page(s) 699–714
Publisher: Springer
DOI: 10.1007/978-3-030-86331-9_45

Event Detection with Entity Markers

Author(s): Emanuela Boros; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Issue 12657, 2021, Page(s) 233–240
Publisher: Springer
DOI: 10.1007/978-3-030-72240-1_20

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Author(s): Quan Duong; Mika K Hämäläinen; Simon Hengchen
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 2020, Page(s) 240–248
Publisher: ACL
DOI: 10.5281/zenodo.4242890

Dataset for Temporal Analysis of English-French Cognates

Author(s): Frossard, Esteban; Coustaty, Mickael; Doucet, Antoine; Jatowt, Adam; Hengchen, Simon
Published in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, Page(s) 855–859
Publisher: European Language Resources Association
DOI: 10.5281/zenodo.3693650

NewsEye: A digital investigator for historical newspapers

Author(s): Doucet, Antoine; Gasteiner, Martin; Granroth-Wilding, Mark; Kaiser, Max; Kaukonen, Minna; Labahn, Roger; Moreux, Jean-Philippe; Muehlberger, Guenter; Pfanzelter, Eva; Therenty, Marie-Eve; Toivonen, Hannu; Tolonen, Mikko
Published in: 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, 2020
Publisher: ADHO
DOI: 10.5281/zenodo.3895269

Robust Named Entity Recognition and Linking on Historical Multilingual Documents

Author(s): Emanuela Boros; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Ahmed Hamdi; José Moreno; Nicolas Sidère; Antoine Doucet
Published in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Issue 2696, 2020, Page(s) 1-17
Publisher: CEUR
DOI: 10.5281/zenodo.4068074

Using a Frustratingly Easy Domain and Tagset Adaptation for Creating Slavic Named Entity Recognition Systems

Author(s): Cabrera-Diego, Luis Adrián; Moreno, Jose G.; Doucet, Antoine
Published in: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (BSNLP at ACL), 2021, Page(s) 98–104
Publisher: ACL
DOI: 10.5281/zenodo.4730477

SpaceWars: A Web Interface for Exploring the Spatio-temporal Dimensions of WWI Newspaper Reporting

Author(s): Gutehrlé, Nicolas; Harlamov, Oleg; Karimi, Farimah; Wei, Haoyu; Jean-Caurant, Axel; Pivovarova, Lidia
Published in: Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), 2021
Publisher: CEUR
DOI: 10.5281/zenodo.5566463

Disappearing Discourses: Avoiding anachronisms and teleology with data-driven methods in studying digital newspaper collections

Author(s): Zosa, Elaine; Hengchen, Simon; Marjanen, Jani; Pivovarova, Lidia; Tolonen, Mikko
Published in: Digital Humanities in the Nordic countries (DHN 2020), 2020
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3631613

Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques

Author(s): Boros, Emanuela; Hamdi, Ahmed; Linhares Pontes, Elvys; Cabrera-Diego, Luis Adrián; Moreno, José G.; Sidere, Nicolas; Doucet, Antoine
Published in: Conférence en Recherche d’Informations et Applications - CORIA 2021, French Information Retrieval Conference, 2021
Publisher: ARIA
DOI: 10.24348/coria.2021.mini_24

Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

Author(s): Thi Tuyet Hai Nguyen; Adam Jatowt; Nhu-Van Nguyen; Mickael Coustaty; Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 2020, Page(s) 333–336
Publisher: ACM
DOI: 10.1145/3383583.3398605

Post-OCR Error Detection by Generating Plausible Candidates

Author(s): Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, Antoine Doucet
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, Page(s) 876-881, ISBN 978-1-7281-3014-9
Publisher: IEEE
DOI: 10.1109/ICDAR.2019.00145

Elastic Embedded Background Linking for News Articles with Keywords, Entities and Events.

Author(s): Luis Adrián Cabrera-Diego, Emanuela Boros, Antoine Doucet
Published in: Text REtrieval Conference (TREC) 2021, Issue News Track, 2022
Publisher: NIST
DOI: 10.5281/zenodo.6334523

Opening Digitized Newspapers for Different User Groups - Successes and Challenges

Author(s): Juha Rautiainen
Published in: IFLA World Library and Information Congress 2019, 2019
Publisher: IFLA
DOI: 10.5281/zenodo.3403158

A Baseline Document Planning Method for Automated Journalism

Author(s): Leo Leppänen; Hannu Toivonen
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 2021, Page(s) 101–111
Publisher: ACL
DOI: 10.5281/zenodo.4694492

Personal Research Assistant for Online Exploration of Historical News

Author(s): Lidia Pivovarova; Axel Jean-Caurant; Jari Avikainen; Khalid Alnajjar; Mark Granroth-Wilding; Leo Leppänen; Elaine Zosa; Hannu Toivonen
Published in: Proceedings of the 42nd European Conference on IR Research, Issue 12036, 2020, Page(s) 481–485, ISBN 9783030454418
Publisher: Springer
DOI: 10.1007/978-3-030-45442-5_62

Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages

Author(s): Piskorski, Jakub; Babych, Bogdan; Kancheva, Zara; Kanishcheva, Olga; Lebedeva, Maria; Marcinczuk, Michał; Nakov, Preslav; Osenova, Petya; Pivovarova, Lidia; Pollak, Senja; Přibáň, Pavel; Radev, Ivaylo; Robnik-Šikonja, Marko; Starko, Vasyl; Steinberger, Josef; Yangarber, Roman
Published in: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, 2021, Page(s) 122–133
Publisher: ACL
DOI: 10.5281/zenodo.4635585

When to Use OCR Post-correction for Named Entity Recognition?

Author(s): Vinh-Nam Huynh; Ahmed Hamdi; Antoine Doucet
Published in: Proceedings of the 14th International Conference on Data Analytics in Logistics (ICDAL 2020), Issue 12504, 2020, Page(s) 33–42, ISBN 9783030644512
Publisher: Springer
DOI: 10.1007/978-3-030-64452-9_3

A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval

Author(s): Elaine Zosa; Mark Granroth-Wilding; Lidia Pivovarova
Published in: Proceedings of the Workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020), 2020, Page(s) 32-37
Publisher: ACL
DOI: 10.5281/zenodo.3751036

"Transformer-based Methods with #Entities for Detecting Emergency Events on Social Media"

Author(s): Emanuela Boros, Nhu Khoa Nguyen, Gaël Lejeune, Mickaël Coustaty, Antoine Doucet
Published in: Text REtrieval Conference (TREC) 2021, Issue Incident Streams Track, 2022
Publisher: NIST
DOI: 10.5281/zenodo.6334513

Simple ways to improve NER in every language using markup

Author(s): Luis Adrián Cabrera-Diego; Moreno, J. G.; Doucet, A.
Published in: Proceedings of the 2nd International Workshop on Cross-Lingual Event-Centric Open Analytics Co-Located with the 30th The Web Conference (WWW 2021), 2021, ISSN 1613-0073
Publisher: CEUR-WS
DOI: 10.5281/zenodo.4680998

Digging Deeper into the Finnish Parliamentary Protocols – Using a Lexical Semantic Tagger for Studying Meaning Change of Everyman's Rights (allemansrätten)

Author(s): Kettunen, Kimmo; La Mela, Matti
Published in: Proceedings of the Digital Humanities in the Nordic Countries (5th Conference), 2020, Page(s) 63–80
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3676371

Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents

Author(s): Ehrmann, Maud; Romanello, Matteo; Doucet, Antoine; Clematide, Simon
Published in: European Conference on Information Retrieval (ECIR 2022), 2022, Page(s) 347–354, ISBN 978-3-030-99739-7
Publisher: Springer
DOI: 10.1007/978-3-030-99739-7_44

Event Related Document Retrieval with Multilingual Real World Event Representation

Author(s): Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
Published in: Proceedings of the 20th International Semantic Web Conference (ISWC), 2021
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5900742

Three-part diachronic semantic change dataset for Russian

Author(s): Andrey Kutuzov; Lidia Pivovarova
Published in: Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, 2021, Page(s) 7-13
Publisher: ACL
DOI: 10.18653/v1/2021.lchange-1.2

ICDAR 2019 Competition on Post-OCR Text Correction

Author(s): Christophe Rigaud; Antoine Doucet; Mickaël Coustaty; Jean-Philippe Moreux
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, ISBN 978-1-7281-3015-6
Publisher: IEEE
DOI: 10.1109/icdar.2019.00255

Multilingual Dynamic Topic Model

Author(s): Zosa, Elaine; Granroth-Wilding, Mark
Published in: Proceedings - Natural Language Processing in a Deep Learning World (RANLP), 2019, Page(s) 1388–1396
Publisher: RANLP
DOI: 10.26615/978-954-452-056-4_159

Visual Topic Modelling for NewsImage Task at MediaEval 2021

Author(s): Lidia Pivovarova, Elaine Zosa
Published in: Working Notes Proceedings of the MediaEval 2021 Workshop, 2021
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5900719

Linking Named Entities across Languages using Multilingual Word Embeddings

Author(s): Elvys Linhares Pontes; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2020, Page(s) 329–332
Publisher: ACM
DOI: 10.1145/3383583.3398597

Can Umlauts Ruin Your Research in Digitized Newspaper Collections? A NewsEye Case Study on 'The Dark Sides of War' (1914–1918)

Author(s): Klaus, Barbara
Published in: Proceedings of the Digital Humanities in the Nordic Countries (5th Conference), Issue 2612, 2020, Page(s) 267–274
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.4686731

Large Scale Analysis of Semantic and Temporal Aspects in Cultural Heritage Collection's Search

Author(s): Sumikawa, Yasunobu; Jatowt, Adam; Doucet, Antoine; Moreux, Jean-Phillippe
Published in: 2019 JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 77-86, ISBN 978-1-7281-1547-4
Publisher: IEEE computer society
DOI: 10.1109/jcdl.2019.00021

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Author(s): Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Coustaty, Mickael; Nguyen, Nhu-Van; Doucet, Antoine
Published in: 2019 JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 29-38, ISBN 978-1-7281-1547-4
Publisher: IEEE computer society
DOI: 10.1109/jcdl.2019.00015

Towards Data-Driven Generation of Visualizations for Automatically Generated News Articles

Author(s): Rola Alhalaseh, Myriam Munezero, Miika Leinonen, Leo Leppänen, Jari Avikainen, Hannu Toivonen
Published in: Proceedings of the 22nd International Academic Mindtrek Conference on - Mindtrek '18, Issue yearly, 2018, Page(s) 100-109, ISBN 9781-450365895
Publisher: ACM Press
DOI: 10.1145/3275116.3275131

An Analysis of the Performance of Named Entity Recognition over OCRed Documents

Author(s): Hamdi, Ahmed; Jean-Caurant, Axel; Sidere, Nicolas; Coustaty, Mickael; Doucet, Antoine
Published in: 2019 JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 333-334, ISBN 978-1-7281-1547-4
Publisher: IEEE computer society
DOI: 10.1109/jcdl.2019.00057

Impact Analysis of Document Digitization on Event Extraction

Author(s): Nhu Khoa Nguyen; Emanuela Boroş; Gaël Lejeune; Antoine Doucet
Published in: Proceedings of the 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020), Issue 2735, 2020, Page(s) 17–28
Publisher: CEUR-WS
DOI: 10.5281/zenodo.4734267

Scalable and Interpretable Semantic Change Detection

Author(s): Syrielle Montariol; Matej Martinc; Lidia Pivovarova
Published in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021, Page(s) 4642–4652
Publisher: ACL
DOI: 10.18653/v1/2021.naacl-main.369

Word Clustering for Historical Newspapers Analysis

Author(s): Lidia Pivovarova; Jani Marjanen; Elaine Zosa
Published in: Proceedings of the Workshop on Language Technology for Digital Historical Archives, 2019, Page(s) 3-10
Publisher: ACL Bulgaria
DOI: 10.26615/978-954-452-059-5_002

Multilingual Epidemiological Text Classification: A Comparative Study

Author(s): Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Adam Jatowt; Gaël Lejeune; Moses Odeo
Published in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, Page(s) 6172–6183
Publisher: ACL
DOI: 10.18653/v1/2020.coling-main.543

Impact of OCR Quality on Named Entity Linking

Author(s): Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet
Published in: International Conference on Asia-Pacific Digital Libraries 2019, 2019, Page(s) 102–115, ISBN 978-3-030-34058-2
Publisher: Springer
DOI: 10.1007/978-3-030-34058-2_11

Entity Linking for Historical Documents: Challenges and Solutions

Author(s): Pontes, Elvys Linhares; Cabrera-Diego, Luis Adrián; Moreno, José G.; Boros, Emanuela; Hamdi, Ahmed; Sidère, Nicolas; Coustaty, Mickaël; Doucet, Antoine
Published in: Proceedings of the 22nd International Conference on Asia-Pacific Digital Libraries (ICADL 2020), Issue 12504, 2020, Page(s) 215–231, ISBN 9783030644512
Publisher: Springer
DOI: 10.1007/978-3-030-64452-9_19

Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings

Author(s): Jani Pekka Marjanen; Lidia Pivovarova; Elaine Zosa; Jussi Kurunmäki
Published in: HistoInformatics 2019: International Workshop on Computational History 2019, part of TPDL 2019, 2019
Publisher: Springer
DOI: 10.5281/zenodo.3689466

Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Author(s): Elaine Zosa, Stephen Mutuvi, Mark Granroth-Wilding, Antoine Doucet
Published in: International Conference on Asian Digital Libraries (ICADL), 2021, ISBN 978-3-030-91668-8
Publisher: Springer
DOI: 10.1007/978-3-030-91669-5_30

Topic Modelling Discourse Dynamics in Historical Newspapers

Author(s): Marjanen, Jani; Zosa, Elaine; Hengchen, Simon; Pivovarova, Lidia; Tolonen, Mikko
Published in: Proceedings of the 5th Conference Digital Humanities in the Nordic Countries (DHN 2020), 2020, Page(s) 63-77
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5648114

Benchmarks for Unsupervised Discourse Change Detection

Author(s): Duong, Quan; Pivovarova, Lidia; Zosa, Elaine
Published in: Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), Issue 2981, 2021
Publisher: Springer
DOI: 10.5281/zenodo.5780033

Capturing Evolution in Word Usage: Just Add More Clusters?

Author(s): Matej Martinc; Syrielle Montariol; Elaine Zosa; Lidia Pivovarova
Published in: WWW '20: Companion Proceedings of the Web Conference 2020, 2020, Page(s) 343-349
Publisher: ACM
DOI: 10.1145/3366424.3382186

A Dataset for Multi-lingual Epidemiological Event Extraction

Author(s): Mutuvi, Stephen; Doucet, Antoine; Lejeune, Gael; Odeo, Moses
Published in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, Page(s) 4139–4144
Publisher: European Language Resources Association
DOI: 10.5281/zenodo.3709626

Not All Comments are Equal: Insights into Comment Moderation from a Topic-Aware Model

Author(s): Elaine Zosa; Ravi Shekhar; Mladen Karan; Matthew Purver
Published in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021, Page(s) 1652–1662
Publisher: RANLP
DOI: 10.5281/zenodo.5648098

EMBEDDIA at SemEval-2022 Task 8: Investigating Sentence, Image, and Knowledge Graph Representations for Multilingual News Article Similarity

Author(s): Elaine Zosa, Emanuela Boros, Boshko Koloski, Lidia Pivovarova
Published in: Proceedings of SemEval-2022 Workshop Task 8, 2022
Publisher: ACL
DOI: 10.5281/zenodo.6369944

Token-Level Multilingual Epidemic Dataset for Event Extraction

Author(s): Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Gaël Lejeune; Adam Jatowt; Moses Odeo
Published in: Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries (TPDL), Issue 12866, 2021, Page(s) 55–59
Publisher: Springer
DOI: 10.5281/zenodo.5780019

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

Author(s): Johannes Michael, Roger Labahn, Tobias Grüning, Jochen Zöllner
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, Page(s) 1286-1293, ISBN 978-1-7281-3014-9
Publisher: IEEE
DOI: 10.1109/icdar.2019.00208

L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers

Author(s): Nhu Khoa Nguyen; Emanuela Boros; Gaël Lejeune; Antoine Doucet; Thierry Delahaut
Published in: Companion Proceedings of the Web Conference, 2020, Page(s) 302–306
Publisher: ACM
DOI: 10.5281/zenodo.4734321

The Helsinki Digital Humanities Hackathon: Two Perspectives on Multidisciplinary Historical Newspapers Research in a Hackathon Context

Author(s): Ros, Ruben; Oberbichler, Sarah
Published in: Proceedings of the Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020, 2020, Page(s) 66–74
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3689228

Multilingual Topic Labelling of News Topics using Ontological Mapping

Author(s): Elaine Zosa, Lidia Pivovarova, Michele Boggia, Sardana Ivanova
Published in: European Conference on Information Retrieval (ECIR), 2022
Publisher: Springer
DOI: 10.5281/zenodo.6334491

Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie

Author(s): Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine; Lejeune, Gaël; Jatowt, Adam; Odeo, Moses
Published in: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, 2021
Publisher: ARIA
DOI: 10.5281/zenodo.4734471

A Comprehensive Extraction of Relevant Real-World-Event Qualifiers for Semantic Search Engines

Author(s): Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
Published in: International Conference on Theory and Practice of Digital Libraries (TPDL), 2021, Page(s) 153-164, ISBN 978-3-030-86323-4
Publisher: Springer
DOI: 10.1007/978-3-030-86324-1_19

A Method for Wavelet-Based Time Series Analysis of Historical Newspapers

Author(s): Avikainen, Jari
Published in: 2019
Publisher: University of Helsinki
DOI: 10.5281/zenodo.3628262

"""Wir dürfen wieder Österreicher sein!"" Die Rolle der Tagespresse in österreichischen Nation-Building-Prozessen 1945–1948 – eine quantitative Analyse ausgewählter digitaler Zeitungskorpora samt Vorschlägen zur didaktischen Umsetzung"

Author(s): Stefan Patrick Hechl
Published in: 2021
Publisher: Universität Innsbruck
DOI: 10.5281/zenodo.4468295

Wortvektoren

Author(s): Laasch, Bastian Marc
Published in: 2018
Publisher: University of Rostock
DOI: 10.18453/rosdok_id00002309

Embeddings built on 19th century newspapers from Finland

Author(s): Lidia Pivovarova, Elaine Zosa, Jani Marjanen
Published in: 2019
Publisher: Zenodo
DOI: 10.5281/zenodo.3557480

Doing historical research with digital newspapers – perspectives of DH scholars

Author(s): Sarah Oberbichler, Eva Pfanzelter, Stefan Hechl, Jani Marjanen
Published in: Europeana Tech, Issue 16: Newspapers, 2021
Publisher: Europeana

Using LDA and Jensen-Shannon Distance (JSD) to group similar newspaper articles

Author(s): Sarah Oberbichler
Published in: 2020
Publisher: Zenodo
DOI: 10.5281/zenodo.3887193

The Book of Abstracts for What’s Past is Prologue: The NewsEye International Conference.

Author(s): Antti Kanner, Eetu Mäkelä, Jani Marjanen, Mikko Tolonen, Sarah Oberbichler, Quan Duong, Lidia Pivovarova, Dilawar Ali, Steven Verstockt, Étienne Ollion, Rubing Shen, Matthias Arnold, David Brown, Raven Adam, Saranya Balasubramanian, Vera Maria Charvat, Manfred Füllsack, Jörn Kleinert, Hanna Misera, Nenad Pantelic, Jakob Sonnberger, Georg Vogelor, Alessandra De Mulder, Heikki K
Published in: 2021
Publisher: Zenodo
DOI: 10.5281/zenodo.5167375

Covid-19 et grippe espagnole: Quand la presse du XXe siècle rappelle celle de 2020

Author(s): Nejma Omari, Antoine Doucet
Published in: 2020
Publisher: The Conversation

Annotation Guidelines for Named Entity Recognition, Entity Linking and Stance Detection (v3.1)

Author(s): Ahmed Hamdi, Elvys Linhares Pontes, Antoine Doucet
Published in: 2021
Publisher: Zenodo
DOI: 10.5281/zenodo.4574199

NewsEye Policy Brief

Author(s): NewsEye consortium
Published in: 2020
Publisher: Zenodo
DOI: 10.5281/zenodo.4291895

Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents

Author(s): Emanuela Boros, Nhu Khoa Nguyen, Gaël Lejeune, Antoine Doucet
Published in: International Journal on Digital Libraries, Issue 14325012, 2022, ISSN 1432-5012
Publisher: Springer Verlag
DOI: 10.1007/s00799-022-00325-2

The expansion of isms, 1820-1917: Data-driven analysis of political language in digitized newspaper collections

Author(s): Jani Marjanen; Jussi Antero Kurunmäki; Lidia Pivovarova; Elaine Zosa
Published in: Journal of Data Mining & Digital Humanities, HistoInformatics, Issue 6159, 2020, ISSN 2416-5999
Publisher: EPIsciences
DOI: 10.5281/zenodo.4447025

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Author(s): Linhares Pontes, Elvys; Huet, Stéphane; Torres Moreno, Juan Manuel; Gouveia da Silva, Thiago; Carneiro Linhares, Andréa
Published in: Computación y Sistemas, Issue 24 (2), 2020, ISSN 2007-9737
Publisher: IPN
DOI: 10.13053/cys-24-2-3335

Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians

Author(s): Sarah Oberbichler; Emanuela Boros; Antoine Doucet; Jani Marjanen; Eva Pfanzelter; Juha Rautiainen; Hannu Toivonen; Mikko Tolonen
Published in: Journal of the Association for Information Science and Technology, Issue 73 (2), 2022, Page(s) 225–239, ISSN 2330-1643
Publisher: John Wiley and Sons Ltd
DOI: 10.1002/asi.24565

In Depth Analysis of the Impact of OCR Errors on Named Entity Recognition and Linking

Author(s): Ahmed Hamdi, Elvys Linhares Pontes, Nicolas Sidère, Mickaël Coustaty, Antoine Doucet
Published in: Natural Language Engineering, 2022, Page(s) 1-24, ISSN 1351-3249
Publisher: Cambridge University Press
DOI: 10.1017/s1351324922000110

Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Author(s): Eva Pfanzelter; Sarah Oberbichler; Jani Marjanen; Pierre-Carl Langlais; Stefan Hechl
Published in: Journal of Data Mining and Digital Humanities, Volume on HistoInformatics, Issue 6121, 2021, ISSN 2416-5999
Publisher: EPIsciences
DOI: 10.5281/zenodo.4446818

Als eine andere Epidemie die Welt in Atem hielt: Die Spanische Grippe 1918/19 in der österreichischen Presse

Author(s): Sarah Oberbichler, Stefan Hechl, Eva Pfanzelter
Published in: Tiroler Chronist - Fachblatt von und für Chronisten in Nord-, Süd- und Osttirol, Issue 154, 2020, Page(s) 15-22, ISSN 1990-9799
Publisher: Tiroler Bildungsforum

A data-driven approach to studying changing vocabularies in historical newspaper collections

Author(s): Hengchen, Simon; Ros, Ruben; Marjanen, Jani; Tolonen, Mikko
Published in: Digital Scholarship in the Humanities, Issue 36, 2021, Page(s) 109–126, ISSN 2055-7671
Publisher: Oxford University Press
DOI: 10.5281/zenodo.5783070

Survey of Post-OCR Processing Approaches

Author(s): Thi Tuyet Hai Nguyen; Adam Jatowt; Mickaël Coustaty; Antoine Doucet
Published in: ACM Computing Surveys, Issue 54(6), 2022, Page(s) 1–37, ISSN 0360-0300
Publisher: Association for Computing Machinery, Inc.
DOI: 10.1145/3453476

A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917

Author(s): Jani Marjanen; Ville Vaara; Antti Kanner; Hege Roivainen; Eetu Mäkelä; Leo Lahti; Mikko Tolonen
Published in: Journal of European Periodical Studies, Issue 4 (1), 2019, Page(s) 55–78, ISSN 2506-6587
Publisher: ESPRit (European Society for Periodical Research)
DOI: 10.21825/jeps.v4i1.10483

MELHISSA: a multilingual entity linking architecture for historical press articles

Author(s): Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Jose G. Moreno; Emanuela Boros; Ahmed Hamdi; Antoine Doucet; Nicolas Sidere; Mickaël Coustaty
Published in: International Journal on Digital Libraries, 2021, ISSN 1432-5012
Publisher: Springer Verlag
DOI: 10.1007/s00799-021-00319-6

Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods

Author(s): Sarah Oberbichler, Eva Pfanzelter
Published in: Journal of Digital History, 2021
Publisher: De Gruyter

Tracing Discourses in Digital Newspaper Collections: A Contribution to Digital Hermeneutics while Investigating 'Return Migration' in Historical Press Coverage

Author(s): Sarah Oberbichler, Eva Pfanzelter
Published in: Digitised Newspapers – A New Eldorado for Historians?, 2022, ISBN 9783110729214
Publisher: De Gruyter Oldenbourg

Crossing or Intersecting the Emperor’s Desk with digitized Newspaper Data: Entity-source-networks in the late Habsburg Empire

Author(s): Martin Gasteiner, Andreas Enderlin
Published in: Digitised Newspapers – A New Eldorado for Historians?, 2022, ISBN 9783110729214
Publisher: De Gruyter Oldenbourg

ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset

Author(s): Johannes Michael; Max Weidemann; Bastian Laasch; Roger Labahn
Published in: Proceedings of ICPR International Workshops and Challenges (2020), Issue 12668, 2021, Page(s) 405–418
Publisher: Springer
DOI: 10.1007/978-3-030-68793-9_30

International: From Legal to Civic Discourse and Beyond in the Nineteenth Century

Author(s): Jani Marjanen, Ruben Ros
Published in: Nationalism and Internationalism Intertwined - A European History of Concepts Beyond the Nation State, 2022, Page(s) 60-85, ISBN 978-1-80073-314-5
Publisher: Berghahn

Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

Author(s): Thi-Tuyet-Hai Nguyen, Mickaël Coustaty, Antoine Doucet, Adam Jatowt, Nhu-Van Nguyen
Published in: Maturity and Innovation in Digital Libraries - 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, Issue 11279, 2018, Page(s) 278-289, ISBN 978-3-030-04256-1
Publisher: Springer International Publishing
DOI: 10.1007/978-3-030-04257-8_29

Evaluating the Impact of OCR Errors on Topic Modeling

Author(s): Stephen Mutuvi, Antoine Doucet, Moses Odeo, Adam Jatowt
Published in: Maturity and Innovation in Digital Libraries - 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, Issue 11279, 2018, Page(s) 3-14, ISBN 978-3-030-04256-1
Publisher: Springer International Publishing
DOI: 10.1007/978-3-030-04257-8_1

National Sentiment: Nation Building and Emotional Language in Nineteenth-Century Finland

Author(s): Jani Marjanen
Published in: Lived Nation as the History of Experiences and Emotions in Finland, 1800-2000, 2021, Page(s) 61–83, ISBN 978-3-030-69881-2
Publisher: Springer
DOI: 10.1007/978-3-030-69882-9_3
