CORDIS - EU research results

NewsEye: A Digital Investigator for Historical Newspapers

Deliverables

Automatic Text Recognition (final)

Reports on software tools and modules, including documentation, for Automatic Text Recognition. Technical reports on further development and innovative adaptation of algorithms and methods for Automatic Text Recognition.

Dissemination, communication and exploitation of results (e) (final)

The PEDR will be delivered at M3, and the project will follow through by maintaining a rolling plan of activities to disseminate and exploit project results, including reports or publications for each event on a particular topic. This deliverable includes rapid dissemination channels in the form of blog posts, tweets and other online media, as well as more traditional dissemination outputs (conference papers, scholarly articles). At M12, M24 and M36 we will provide yearly reports on the execution of the PEDR as well as on all dissemination and communication events organized during the project. Main dissemination and communication events are planned at M3, M14, M24, M25, M26 and M30, but will be reported on yearly together with smaller-scale events. After M36, this deliverable, under the lead of WP7 by BNF, will provide details on the dissemination, communication and exploitation of results during the project extension.

Layout Analysis (final)

Reports on software tools and modules, including documentation, for Layout Analysis. Technical reports on further development and innovative adaptation of algorithms and methods for Layout Analysis.

Usability/Fit for research purpose test of tools and user interfaces (c) (final)

The deliverables will report on testing the methods, tools and interfaces to the core. They are the result of collaboration on the mockups and prototypes and of workshop/hackathon participation with the computer science groups and the libraries, as indicated in Task T7.4, providing extensive feedback on tools and methods. UIBK-ICH will supervise the production of reports in preparation for, and as a follow-up to, the tools, prototypes, beta versions and publishable tools, along the timeline of WP7. The final version is due at M34, with a possible update at M45.

Contextualized Case Studies for academic use (d) (final)

The deliverables will report on the four digital humanities case studies prepared using already existing methods and tools as well as the ones to be developed in this project, showing progress and improvement of search and research outcomes. UIBK-ICH will be responsible for the case studies on migration, UH-DH for the case study on nationalisms and revolutions, UNIVIE for the case study on media and journalism, and UPVM for the case study on gender. The members of the DH group will furthermore compare and contrast the results of the case studies in order to show how newspapers work both as a space for change and as a space for stability, while addressing the relationship between press, politics and society in different regions and languages across Europe, thus showing the transformation of our societies. The deliverables will (a) include thorough literature and background research for each of the case studies; (b) work with the semantically enriched text and apply the developed dynamic text analysis features in different languages in order to improve the quality of the case studies; and (c) show how the developed tools contribute to change and continuity discussions for European societies. Draft reports will be delivered at M6 and complete reports at M12, while final reports, to be submitted for publication in renowned humanities and digital humanities journals, will be completed at M24 and M36.

Personal Research Assistant: Explainer (b) (final)

This deliverable describes the Explainer component. The first version (M24) will be able to produce initial descriptions of strategies, goals and decisions of the Investigator, while the second version (M36) describes the final version. The final version is due at M36, with a possible update at M45.

Article separation (c) (final)

Reports on software tools and modules, including documentation, for Article Separation. Technical reports on further development and innovative adaptation of algorithms and methods for Article Separation. Journal and research paper submissions on new, preferably machine-learning-based (neural), algorithms and technologies for Article Separation, along with the inherently used Layout Analysis, Text Line Detection and Automatic Text Recognition. The final version is due at M36, with a possible update at M45.

Event detection (final)

Report on the level of completion of the event detection tool. The first version, at M24, will present the state of the art in event detection, relying on the detection of events based solely on the document content, using string-based multilingual approaches based on rhetoric and the specificities of the news genre, as previously developed at ULR. The second version, at M36, will integrate contrastive knowledge from other documents. The final version is due at M36, with a possible update at M45.

Personal Research Assistant: Reporter (c) (final)

This deliverable describes the Reporter component and how it is used. The first version (M12) will be capable of some simple natural language generation, using relatively rigid document structures and mechanisms for talking about the results of tools produced in WP3 and WP4 during year one. The second version (M24) will have more elaborate document structuring and will be able to report more flexibly on a wider range of analysis results. The second version will also have a first version of summarization of textual contents. The third version of the deliverable (M36) will describe the final version with full functionality. The final version is due at M36, with a possible update at M45.

Use of project results for the general public (b) (final)

The deliverables will report on the texts, podcasts and social media activities by the digital humanities group. UNIVIE will supervise the podcast production, UPVM the linking with Wikipedia, and UH-DH the social media activities.

NewsEye Demonstrator (c) (final)

Reports and software on the development of the NewsEye Demonstrator, a web-based user interface for tools developed in WP3 and WP4 and for the Personal Research Assistant (WP5). Tools for the user interface of WP3 will be provided at M12, while the complete minimum viable product (MVP) will be delivered at M24 and the final version at M36. The final version is due at M36, with a possible update at M45.

Sustainability plan (c) (final)

The project will conceptualize a sustainability strategy for long-term access to the tools and data generated by the project. The strategy is to be planned in full detail at M26, with implementation under way at M36 and completed at M45.

Stance detection (final)

Reports on the level of completion of the software tool for stance detection (M12). The first version, at M12, will rely on standards of the state of the art, and the second version, at M24, contains our principal research contribution, which is robust to noise and language-independent.

Showcase case studies for the user interface (b) (final)

The deliverables will consist of texts, videos, statistics, search paths, how-tos, etc. on the user interface and on the project homepage. All partners of the digital humanities group will contribute to the deliverable.

Personal Research Assistant: Investigator (c) (final)

The deliverable describes the Investigator tool. In the first iteration (M12), the Investigator will be capable of planning, forming and running some queries using analysis tools developed in parallel in WP3 and WP4, and of interacting with the user in simple ways to continue the investigation. In the second iteration (M24), the Investigator will also be able to create strategies for investigation, to analyze the results obtained and to adjust its strategy accordingly. The third iteration (M36) describes the final version with full functionality. The final version is due at M36, with a possible update at M45.

Advanced tool to query the enriched data sets (final)

Report on the software to query the data sets (M6). The first version is delivered early, at M6, to allow querying the data set as soon as possible, without the semantic enrichment produced in other deliverables of WP3. The second version, at M12, reports on the software to analyze the data and the enriched data sets; it allows querying both the data set and the enriched data set, including the semantic text enrichment to be produced in the rest of WP3 (D3.1-D3.3).

Data models (d) (final)

Regular reports providing a detailed description of the data models, formats and specifications used in the project, including publicly available example data.

Data collection and preservation (d) (final)

Report and data collection

Comparative analysis of data between contexts (b) (final)

Reports on the developed methods and tools for dynamic comparative analysis of data between given contexts. The first version, at M24, describes the methods to extract sets of characteristics that describe similarities or contrasts between document groups, and the second version, at M36, describes the final methods to extract contrasting characteristics from groups of documents, integrated with work on intelligible descriptions. The final version is due at M36, with a possible update at M45.

Educational material for teachers, pupils and lay historians (b) (final)

The deliverables consist of prototypes of the educational material at M24 and the online published material at M36. While all partners of the digital humanities group will contribute to the production of the material, UH-DH will supervise the production of material for teachers, UPVM for pupils and students, and UIBK-ICH for lay historians, in different languages. A report on educational material prototypes will be delivered at M24; the final report will be delivered at M36.

Analysis of data in a given context (c) (final)

Reports on the level of completion of the software tool for dynamic analysis of data in a given context. The first version, at M12, will consist of tools for building multilingual topic models, topic hierarchies and dynamic topic models, and for using them to analyze articles in the initial dataset. The second version, at M24, contains document analysis methods for article similarity and link discovery to suggest related articles, combining multilingual, hierarchical and dynamic topic models. The third version, at M36, contains document analysis methods refined on the basis of feedback from their use in the Personal Research Assistant and of an evaluation of their integration with intelligible descriptions. The final version is due at M36, with a possible update at M45.

NE recognition and linking (final)

Reports on the level of completion of the software tool to recognize and link named entities (NEs). The first version, at M12, will rely on standards of the state of the art, and the second version, at M24, contains our principal research contribution, which is robust to noise and language-independent.

Intelligible representation of statistical analysis (b) (final)

Reports on the methods and tools for outputting human-intelligible representations based on the outputs from statistical models developed in T4.1 and T4.2. The first version, at M24, describes the methods that provide intelligible names and descriptions of topics and extracted characteristics for use in the Personal Research Assistant, and the second version, at M36, describes the final methods to provide intelligible descriptions, refined after integration in the Personal Research Assistant. The final version is due at M36, with a possible update at M45.

Project website (to be continuously updated)

The project will maintain a website that will act as a portal for the communications activities. In M1 a web page will be published to advertise and announce the project. By M8 the full website structure will be in place, integrating social media (such as Twitter) channels. The website will be maintained throughout the duration of the project and content will be contributed by all project partners.

Data management plan

The NewsEye project will contribute to the open research data pilot. According to the guidelines for Research Data Management of Horizon 2020 (http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf), a Data Management Plan will be written during the first six months, explaining what data will be generated, collected, shared and curated during the project as well as after the project's end. It will consider the different kinds of research outcomes (WP6) and data (WP2-5) resulting from the project. One important goal of NewsEye is to make its data findable, accessible, interoperable and reusable (FAIR).

Publications

Exploring Entities in Event Detection as Question Answering

Authors: Boros, Emanuela; Moreno, Jose G.; Doucet, Antoine
Published in: Proceedings of the 44th European Conference on Information Retrieval (ECIR), 2022
Publisher: Springer
DOI: 10.5281/zenodo.5779941

L3i at SemEval-2022 Task 11: Straightforward Additional Context for Multilingual Named Entity Recognition

Authors: Emanuela Boros, Carlos-Emiliano Gonzalez-Gallardo, Jose G. Moreno, Antoine Doucet
Published in: International Workshop on Semantic Evaluation (SemEval), Issue Task 11, 2022
Publisher: ACL
DOI: 10.5281/zenodo.6369947

A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers

Authors: Ahmed Hamdi; Elvys Linhares Pontes; Emanuela Boros; Thi Tuyet Hai Nguyen; Günter Hackl; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, Page(s) 2328–2334
Publisher: ACM
DOI: 10.1145/3404835.3463255

Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition

Authors: Ahmed Hamdi; Axel Jean-Caurant; Nicolas Sidere; Mickaël Coustaty; Antoine Doucet
Published in: Proceedings of the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, Issue 12246, 2020, Page(s) 87–101
Publisher: Springer
DOI: 10.1007/978-3-030-54956-5_7

Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Authors: Emanuela Boros; Ahmed Hamdi; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Jose G. Moreno; Nicolas Sidere; Antoine Doucet
Published in: Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL), 2020, Page(s) 431–441
Publisher: ACL
DOI: 10.18653/v1/2020.conll-1.35

Exploring Entities in Event Detection as Question Answering

Authors: Boros, Emanuela; Moreno, Jose G.; Doucet, Antoine
Published in: European Conference on Information Retrieval (ECIR 2022), 2022, Page(s) 65-79, ISBN 978-3-030-99735-9
Publisher: Springer
DOI: 10.1007/978-3-030-99736-6_5

Grammatical Profiling for Semantic Change Detection

Authors: Giulianelli, Mario; Kutuzov, Andrey; Pivovarova, Lidia
Published in: Proceedings of the 25th Conference on Computational Natural Language Learning (CoNLL 2021), 2021
Publisher: ACL
DOI: 10.18653/v1/2021.conll-1.33

Multilingual Epidemic Event Extraction

Authors: Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine; Lejeune, Gaël; Jatowt, Adam; Odeo, Moses
Published in: Proceedings of the 23rd International Conference on Asian Digital Libraries (ICADL), Issue 13133, 2021, Page(s) 139–156
Publisher: Springer
DOI: 10.5281/zenodo.5779966

Transformer-based Methods for Recognizing Ultra Fine-grained Entities (RUFES)

Authors: Boros, Emanuela; Doucet, Antoine
Published in: Thirteenth Text Analysis Conference (TAC 2020), 2021
Publisher: NIST
DOI: 10.5281/zenodo.4555778

Information Extraction from Invoices

Authors: Ahmed Hamdi; Elodie Carel; Aurelie Joseph; Mickael Coustaty; Antoine Doucet
Published in: International Conference on Document Analysis and Recognition ICDAR 2021, Issue 12822, 2021, Page(s) 699–714
Publisher: Springer
DOI: 10.1007/978-3-030-86331-9_45

Event Detection with Entity Markers

Authors: Emanuela Boros; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Issue 12657, 2021, Page(s) 233–240
Publisher: Springer
DOI: 10.1007/978-3-030-72240-1_20

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Authors: Quan Duong; Mika K Hämäläinen; Simon Hengchen
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 2020, Page(s) 240–248
Publisher: ACL
DOI: 10.5281/zenodo.4242890

Dataset for Temporal Analysis of English-French Cognates

Authors: Frossard, Esteban; Coustaty, Mickael; Doucet, Antoine; Jatowt, Adam; Hengchen, Simon
Published in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, Page(s) 855–859
Publisher: European Language Resources Association
DOI: 10.5281/zenodo.3693650

NewsEye: A digital investigator for historical newspapers

Authors: Doucet, Antoine; Gasteiner, Martin; Granroth-Wilding, Mark; Kaiser, Max; Kaukonen, Minna; Labahn, Roger; Moreux, Jean-Philippe; Muehlberger, Guenter; Pfanzelter, Eva; Therenty, Marie-Eve; Toivonen, Hannu; Tolonen, Mikko
Published in: 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, 2020
Publisher: ADHO
DOI: 10.5281/zenodo.3895269

Robust Named Entity Recognition and Linking on Historical Multilingual Documents

Authors: Emanuela Boros; Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Ahmed Hamdi; José Moreno; Nicolas Sidère; Antoine Doucet
Published in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Issue 2696, 2020, Page(s) 1-17
Publisher: CEUR
DOI: 10.5281/zenodo.4068074

Using a Frustratingly Easy Domain and Tagset Adaptation for Creating Slavic Named Entity Recognition Systems

Authors: Cabrera-Diego, Luis Adrián; Moreno, Jose G.; Doucet, Antoine
Published in: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (BSNLP at ACL), 2021, Page(s) 98–104
Publisher: ACL
DOI: 10.5281/zenodo.4730477

SpaceWars: A Web Interface for Exploring the Spatio-temporal Dimensions of WWI Newspaper Reporting

Authors: Gutehrlé, Nicolas; Harlamov, Oleg; Karimi, Farimah; Wei, Haoyu; Jean-Caurant, Axel; Pivovarova, Lidia
Published in: Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), 2021
Publisher: CEUR
DOI: 10.5281/zenodo.5566463

Disappearing Discourses: Avoiding anachronisms and teleology with data-driven methods in studying digital newspaper collections

Authors: Zosa, Elaine; Hengchen, Simon; Marjanen, Jani; Pivovarova, Lidia; Tolonen, Mikko
Published in: Digital Humanities in the Nordic countries (DHN 2020), 2020
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3631613

Atténuer les erreurs de numérisation dans la reconnaissance d'entités nommées pour les documents historiques

Authors: Boros, Emanuela; Hamdi, Ahmed; Linhares Pontes, Elvys; Cabrera-Diego, Luis Adrián; Moreno, José G.; Sidere, Nicolas; Doucet, Antoine
Published in: Conférence en Recherche d’Informations et Applications - CORIA 2021, French Information Retrieval Conference, 2021
Publisher: ARIA
DOI: 10.24348/coria.2021.mini_24

Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

Authors: Thi Tuyet Hai Nguyen; Adam Jatowt; Nhu-Van Nguyen; Mickael Coustaty; Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 2020, Page(s) 333–336
Publisher: ACM
DOI: 10.1145/3383583.3398605

Post-OCR Error Detection by Generating Plausible Candidates

Authors: Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, Antoine Doucet
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, Page(s) 876-881, ISBN 978-1-7281-3014-9
Publisher: IEEE
DOI: 10.1109/ICDAR.2019.00145

Elastic Embedded Background Linking for News Articles with Keywords, Entities and Events

Authors: Luis Adrián Cabrera-Diego, Emanuela Boros, Antoine Doucet
Published in: Text REtrieval Conference (TREC) 2021, Issue News Track, 2022
Publisher: NIST
DOI: 10.5281/zenodo.6334523

Opening Digitized Newspapers for Different User Groups - Successes and Challenges

Authors: Juha Rautiainen
Published in: IFLA World Library and Information Congress 2019, 2019
Publisher: IFLA
DOI: 10.5281/zenodo.3403158

A Baseline Document Planning Method for Automated Journalism

Authors: Leo Leppänen; Hannu Toivonen
Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 2021, Page(s) 101–111
Publisher: ACL
DOI: 10.5281/zenodo.4694492

Personal Research Assistant for Online Exploration of Historical News

Authors: Lidia Pivovarova; Axel Jean-Caurant; Jari Avikainen; Khalid Alnajjar; Mark Granroth-Wilding; Leo Leppänen; Elaine Zosa; Hannu Toivonen
Published in: Proceedings of the 42nd European Conference on IR Research, Issue 12036, 2020, Page(s) 481–485, ISBN 9783030454418
Publisher: Springer
DOI: 10.1007/978-3-030-45442-5_62

Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages

Authors: Piskorski, Jakub; Babych, Bogdan; Kancheva, Zara; Kanishcheva, Olga; Lebedeva, Maria; Marcinczuk, Michał; Nakov, Preslav; Osenova, Petya; Pivovarova, Lidia; Pollak, Senja; Přibáň, Pavel; Radev, Ivaylo; Robnik-Šikonja, Marko; Starko, Vasyl; Steinberger, Josef; Yangarber, Roman
Published in: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, 2021, Page(s) 122–133
Publisher: ACL
DOI: 10.5281/zenodo.4635585

When to Use OCR Post-correction for Named Entity Recognition?

Authors: Vinh-Nam Huynh; Ahmed Hamdi; Antoine Doucet
Published in: Proceedings of the 14th International Conference on Data Analytics in Logistics (ICDAL 2020), Issue 12504, 2020, Page(s) 33–42, ISBN 9783030644512
Publisher: Springer
DOI: 10.1007/978-3-030-64452-9_3

A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval

Authors: Elaine Zosa; Mark Granroth-Wilding; Lidia Pivovarova
Published in: Proceedings of the Workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020), 2020, Page(s) 32-37
Publisher: ACL
DOI: 10.5281/zenodo.3751036

"Transformer-based Methods with #Entities for Detecting Emergency Events on Social Media"

Authors: Emanuela Boros, Nhu Khoa Nguyen, Gaël Lejeune, Mickaël Coustaty, Antoine Doucet
Published in: Text REtrieval Conference (TREC) 2021, Issue Incident Streams Track, 2022
Publisher: NIST
DOI: 10.5281/zenodo.6334513

Simple ways to improve NER in every language using markup

Authors: Luis Adrián Cabrera-Diego; Moreno, J. G.; Doucet, A.
Published in: Proceedings of the 2nd International Workshop on Cross-Lingual Event-Centric Open Analytics Co-Located with the 30th The Web Conference (WWW 2021), 2021, ISSN 1613-0073
Publisher: CEUR-WS
DOI: 10.5281/zenodo.4680998

Digging Deeper into the Finnish Parliamentary Protocols – Using a Lexical Semantic Tagger for Studying Meaning Change of Everyman's Rights (allemansrätten)

Authors: Kettunen, Kimmo; La Mela, Matti
Published in: Proceedings of the Digital Humanities in the Nordic Countries (5th Conference), 2020, Page(s) 63–80
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3676371

Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents

Authors: Ehrmann, Maud; Romanello, Matteo; Doucet, Antoine; Clematide, Simon
Published in: European Conference on Information Retrieval (ECIR 2022), 2022, Page(s) 347–354, ISBN 978-3-030-99739-7
Publisher: Springer
DOI: 10.1007/978-3-030-99739-7_44

Event Related Document Retrieval with Multilingual Real World Event Representation

Authors: Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
Published in: Proceedings of the 20th International Semantic Web Conference (ISWC), 2021
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5900742

Three-part diachronic semantic change dataset for Russian

Authors: Andrey Kutuzov; Lidia Pivovarova
Published in: Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, 2021, Page(s) 7-13
Publisher: ACL
DOI: 10.18653/v1/2021.lchange-1.2

ICDAR 2019 Competition on Post-OCR Text Correction

Authors: Christophe Rigaud; Antoine Doucet; Mickaël Coustaty; Jean-Philippe Moreux
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, ISBN 978-1-7281-3015-6
Publisher: IEEE
DOI: 10.1109/icdar.2019.00255

Multilingual Dynamic Topic Model

Authors: Zosa, Elaine; Granroth-Wilding, Mark; Department of Computer Science, University of Helsinki, Finland
Published in: Proceedings - Natural Language Processing in a Deep Learning World (RANLP), 2019, Page(s) 1388–1396
Publisher: RANLP
DOI: 10.26615/978-954-452-056-4_159

Visual Topic Modelling for NewsImage Task at MediaEval 2021

Authors: Lidia Pivovarova, Elaine Zosa
Published in: Working Notes Proceedings of the MediaEval 2021 Workshop, 2021
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5900719

Linking Named Entities across Languages using Multilingual Word Embeddings

Authors: Elvys Linhares Pontes; Jose G. Moreno; Antoine Doucet
Published in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2020, Page(s) 329–332
Publisher: ACM
DOI: 10.1145/3383583.3398597

Can Umlauts Ruin Your Research in Digitized Newspaper Collections? A NewsEye Case Study on 'The Dark Sides of War' (1914–1918)

Authors: Klaus, Barbara
Published in: Proceedings of the Digital Humanities in the Nordic Countries (5th Conference), Issue 2612, 2020, Page(s) 267–274
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.4686731

Large Scale Analysis of Semantic and Temporal Aspects in Cultural Heritage Collection's Search

Authors: Sumikawa, Yasunobu; Jatowt, Adam; Doucet, Antoine; Moreux, Jean-Phillippe
Published in: 2019 Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 77-86, ISBN 978-1-7281-1547-4
Publisher: IEEE Computer Society
DOI: 10.1109/jcdl.2019.00021

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Authors: Nguyen, Thi-Tuyet-Hai; Jatowt, Adam; Coustaty, Mickael; Nguyen, Nhu-Van; Doucet, Antoine
Published in: 2019 Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 29-38, ISBN 978-1-7281-1547-4
Publisher: IEEE Computer Society
DOI: 10.1109/jcdl.2019.00015

Towards Data-Driven Generation of Visualizations for Automatically Generated News Articles

Authors: Rola Alhalaseh, Myriam Munezero, Miika Leinonen, Leo Leppänen, Jari Avikainen, Hannu Toivonen
Published in: Proceedings of the 22nd International Academic Mindtrek Conference on - Mindtrek '18, Issue yearly, 2018, Page(s) 100-109, ISBN 9781-450365895
Publisher: ACM Press
DOI: 10.1145/3275116.3275131

An Analysis of the Performance of Named Entity Recognition over OCRed Documents

Authors: Hamdi, Ahmed; Jean-Caurant, Axel; Sidere, Nicolas; Coustaty, Mickael; Doucet, Antoine
Published in: 2019 Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, June 2-6, 2019, Issue yearly, 2019, Page(s) 333-334, ISBN 978-1-7281-1547-4
Publisher: IEEE Computer Society
DOI: 10.1109/jcdl.2019.00057

Impact Analysis of Document Digitization on Event Extraction

Authors: Nhu Khoa Nguyen; Emanuela Boroş; Gaël Lejeune; Antoine Doucet
Published in: Proceedings of the 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020), Issue 2735, 2020, Page(s) 17–28
Publisher: CEUR-WS
DOI: 10.5281/zenodo.4734267

Scalable and Interpretable Semantic Change Detection

Authors: Syrielle Montariol; Matej Martinc; Lidia Pivovarova
Published in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021, Page(s) 4642–4652
Publisher: ACL
DOI: 10.18653/v1/2021.naacl-main.369

Word Clustering for Historical Newspapers Analysis

Authors: Lidia Pivovarova; Jani Marjanen; Elaine Zosa
Published in: Proceedings of the Workshop on Language Technology for Digital Historical Archives, 2019, Page(s) 3-10
Publisher: ACL Bulgaria
DOI: 10.26615/978-954-452-059-5_002

Multilingual Epidemiological Text Classification: A Comparative Study

Authors: Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Adam Jatowt; Gaël Lejeune; Moses Odeo
Published in: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020, Page(s) 6172–6183
Publisher: ACL
DOI: 10.18653/v1/2020.coling-main.543

Impact of OCR Quality on Named Entity Linking

Authors: Elvys Linhares Pontes; Ahmed Hamdi; Nicolas Sidere; Antoine Doucet
Published in: International Conference on Asia-Pacific Digital Libraries 2019, 2019, Page(s) 102–115, ISBN 978-3-030-34058-2
Publisher: Springer
DOI: 10.1007/978-3-030-34058-2_11

Entity Linking for Historical Documents: Challenges and Solutions

Authors: Pontes, Elvys Linhares; Cabrera-Diego, Luis Adrián; Moreno, José G.; Boros, Emanuela; Hamdi, Ahmed; Sidère, Nicolas; Coustaty, Mickaël; Doucet, Antoine
Published in: Proceedings of the 22nd International Conference on Asia-Pacific Digital Libraries (ICADL 2020), Issue 12504, 2020, Page(s) 215–231, ISBN 9783030644512
Publisher: Springer
DOI: 10.1007/978-3-030-64452-9_19

Clustering Ideological Terms in Historical Newspaper Data with Diachronic Word Embeddings

Authors: Jani Pekka Marjanen; Lidia Pivovarova; Elaine Zosa; Jussi Kurunmäki
Published in: HistoInformatics 2019: International Workshop on Computational History 2019, part of TPDL 2019, 2019
Publisher: Springer
DOI: 10.5281/zenodo.3689466

Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Authors: Elaine Zosa, Stephen Mutuvi, Mark Granroth-Wilding, Antoine Doucet
Published in: International Conference on Asian Digital Libraries (ICADL), 2021, ISBN 978-3-030-91668-8
Publisher: Springer
DOI: 10.1007/978-3-030-91669-5_30

Topic Modelling Discourse Dynamics in Historical Newspapers

Authors: Marjanen, Jani; Zosa, Elaine; Hengchen, Simon; Pivovarova, Lidia; Tolonen, Mikko
Published in: Proceedings of the 5th Conference Digital Humanities in the Nordic Countries (DHN 2020), 2020, Page(s) 63-77
Publisher: CEUR-WS
DOI: 10.5281/zenodo.5648114

Benchmarks for Unsupervised Discourse Change Detection

Authors: Duong, Quan; Pivovarova, Lidia; Zosa, Elaine
Published in: Proceedings of the 6th International Workshop on Computational History (HistoInformatics 2021), Issue 2981, 2021
Publisher: Springer
DOI: 10.5281/zenodo.5780033

Capturing Evolution in Word Usage: Just Add More Clusters?

Authors: Matej Martinc; Syrielle Montariol; Elaine Zosa; Lidia Pivovarova
Published in: WWW '20: Companion Proceedings of the Web Conference 2020, 2020, Page(s) 343-349
Publisher: ACM
DOI: 10.1145/3366424.3382186

A Dataset for Multi-lingual Epidemiological Event Extraction

Authors: Mutuvi, Stephen; Doucet, Antoine; Lejeune, Gael; Odeo, Moses
Published in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, Page(s) 4139–4144
Publisher: European Language Resources Association
DOI: 10.5281/zenodo.3709626

Not All Comments are Equal: Insights into Comment Moderation from a Topic-Aware Model

Authors: Elaine Zosa; Ravi Shekhar; Mladen Karan; Matthew Purver
Published in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021, Page(s) 1652–1662
Publisher: RANLP
DOI: 10.5281/zenodo.5648098

EMBEDDIA at SemEval-2022 Task 8: Investigating Sentence, Image, and Knowledge Graph Representations for Multilingual News Article Similarity

Authors: Elaine Zosa, Emanuela Boros, Boshko Koloski, Lidia Pivovarova
Published in: Proceedings of SemEval-2022 Workshop Task 8, 2022
Publisher: ACL
DOI: 10.5281/zenodo.6369944

Token-Level Multilingual Epidemic Dataset for Event Extraction

Authors: Stephen Mutuvi; Emanuela Boros; Antoine Doucet; Gaël Lejeune; Adam Jatowt; Moses Odeo
Published in: Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries (TPDL), Issue 12866, 2021, Page(s) 55–59
Publisher: Springer
DOI: 10.5281/zenodo.5780019

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

Authors: Johannes Michael, Roger Labahn, Tobias Grüning, Jochen Zöllner
Published in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, Page(s) 1286-1293, ISBN 978-1-7281-3014-9
Publisher: IEEE
DOI: 10.1109/icdar.2019.00208

L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers

Authors: Nhu Khoa Nguyen; Emanuela Boros; Gaël Lejeune; Antoine Doucet; Thierry Delahaut
Published in: Companion Proceedings of the Web Conference, 2020, Page(s) 302–306
Publisher: ACM
DOI: 10.5281/zenodo.4734321

The Helsinki Digital Humanities Hackathon: Two Perspectives on Multidisciplinary Historical Newspapers Research in a Hackathon Context

Authors: Ros, Ruben; Oberbichler, Sarah
Published in: Proceedings of the Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020, 2020, Page(s) 66–74
Publisher: Institute of Literature, Folklore and Art
DOI: 10.5281/zenodo.3689228

Multilingual Topic Labelling of News Topics using Ontological Mapping

Authors: Elaine Zosa, Lidia Pivovarova, Michele Boggia, Sardana Ivanova
Published in: European Conference on Information Retrieval (ECIR), 2022
Publisher: Springer
DOI: 10.5281/zenodo.6334491

Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie

Authors: Mutuvi, Stephen; Boros, Emanuela; Doucet, Antoine; Lejeune, Gaël; Jatowt, Adam; Odeo, Moses
Published in: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, 2021
Publisher: ARIA
DOI: 10.5281/zenodo.4734471

A Comprehensive Extraction of Relevant Real-World-Event Qualifiers for Semantic Search Engines

Authors: Guillaume Bernard, Cyrille Suire, Cyril Faucher, Antoine Doucet
Published in: International Conference on Theory and Practice of Digital Libraries (TPDL), 2021, Page(s) 153-164, ISBN 978-3-030-86323-4
Publisher: Springer
DOI: 10.1007/978-3-030-86324-1_19

A Method for Wavelet-Based Time Series Analysis of Historical Newspapers

Authors: Avikainen, Jari
Published in: 2019
Publisher: University of Helsinki
DOI: 10.5281/zenodo.3628262

"Wir dürfen wieder Österreicher sein!" Die Rolle der Tagespresse in österreichischen Nation-Building-Prozessen 1945–1948 – eine quantitative Analyse ausgewählter digitaler Zeitungskorpora samt Vorschlägen zur didaktischen Umsetzung

Authors: Stefan Patrick Hechl
Published in: 2021
Publisher: Universität Innsbruck
DOI: 10.5281/zenodo.4468295

Wortvektoren

Authors: Laasch, Bastian Marc
Published in: 2018
Publisher: University of Rostock
DOI: 10.18453/rosdok_id00002309

Embeddings built on 19th century newspapers from Finland

Authors: Lidia Pivovarova, Elaine Zosa, Jani Marjanen
Published in: 2019
Publisher: Zenodo
DOI: 10.5281/zenodo.3557480

Doing historical research with digital newspapers – perspectives of DH scholars

Authors: Sarah Oberbichler, Eva Pfanzelter, Stefan Hechl, Jani Marjanen
Published in: Europeana Tech, Issue 16: Newspapers, 2021
Publisher: Europeana

Using LDA and Jensen-Shannon Distance (JSD) to group similar newspaper articles

Authors: Sarah Oberbichler
Published in: 2020
Publisher: Zenodo
DOI: 10.5281/zenodo.3887193

The Book of Abstracts for What’s Past is Prologue: The NewsEye International Conference.

Authors: Antti Kanner, Eetu Mäkelä, Jani Marjanen, Mikko Tolonen, Sarah Oberbichler, Quan Duong, Lidia Pivovarova, Dilawar Ali, Steven Verstockt, Étienne Ollion, Rubing Shen, Matthias Arnold, David Brown, Raven Adam, Saranya Balasubramanian, Vera Maria Charvat, Manfred Füllsack, Jörn Kleinert, Hanna Misera, Nenad Pantelic, Jakob Sonnberger, Georg Vogelor, Alessandra De Mulder, Heikki K
Published in: 2021
Publisher: Zenodo
DOI: 10.5281/zenodo.5167375

Covid-19 et grippe espagnole: Quand la presse du XXe siècle rappelle celle de 2020

Authors: Nejma Omari, Antoine Doucet
Published in: 2020
Publisher: The Conversation

Annotation Guidelines for Named Entity Recognition, Entity Linking and Stance Detection (v3.1)

Authors: Ahmed Hamdi, Elvys Linhares Pontes, Antoine Doucet
Published in: 2021
Publisher: Zenodo
DOI: 10.5281/zenodo.4574199

NewsEye Policy Brief

Authors: NewsEye consortium
Published in: 2020
Publisher: Zenodo
DOI: 10.5281/zenodo.4291895

Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents

Authors: Emanuela Boros, Nhu Khoa Nguyen, Gaël Lejeune, Antoine Doucet
Published in: International Journal on Digital Libraries, Issue 14325012, 2022, ISSN 1432-5012
Publisher: Springer Verlag
DOI: 10.1007/s00799-022-00325-2

The expansion of isms, 1820-1917: Data-driven analysis of political language in digitized newspaper collections

Authors: Jani Marjanen; Jussi Antero Kurunmäki; Lidia Pivovarova; Elaine Zosa
Published in: Journal of Data Mining & Digital Humanities, HistoInformatics, Issue 6159, 2020, ISSN 2416-5999
Publisher: EPIsciences
DOI: 10.5281/zenodo.4447025

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Authors: Linhares Pontes, Elvys; Huet, Stéphane; Torres Moreno, Juan Manuel; Gouveia da Silva, Thiago; Carneiro Linhares, Andréa
Published in: Computación y Sistemas, Issue 24 (2), 2020, ISSN 2007-9737
Publisher: IPN
DOI: 10.13053/cys-24-2-3335

Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians

Authors: Sarah Oberbichler; Emanuela Boros; Antoine Doucet; Jani Marjanen; Eva Pfanzelter; Juha Rautiainen; Hannu Toivonen; Mikko Tolonen
Published in: Journal of the Association for Information Science and Technology, Issue 73 (2), 2022, Page(s) 225–239, ISSN 2330-1643
Publisher: John Wiley and Sons Ltd
DOI: 10.1002/asi.24565

In Depth Analysis of the Impact of OCR Errors on Named Entity Recognition and Linking

Authors: Ahmed Hamdi, Elvys Linhares Pontes, Nicolas Sidère, Mickaël Coustaty, Antoine Doucet
Published in: Natural Language Engineering, 2022, Page(s) 1-24, ISSN 1351-3249
Publisher: Cambridge University Press
DOI: 10.1017/s1351324922000110

Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Authors: Eva Pfanzelter; Sarah Oberbichler; Jani Marjanen; Pierre-Carl Langlais; Stefan Hechl
Published in: Journal of Data Mining and Digital Humanities, Volume on HistoInformatics, Issue 6121, 2021, ISSN 2416-5999
Publisher: EPIsciences
DOI: 10.5281/zenodo.4446818

Als eine andere Epidemie die Welt in Atem hielt: Die Spanische Grippe 1918/19 in der österreichischen Presse

Authors: Sarah Oberbichler, Stefan Hechl, Eva Pfanzelter
Published in: Tiroler Chronist - Fachblatt von und für Chronisten in Nord-, Süd- und Osttirol, Issue 154, 2020, Page(s) 15-22, ISSN 1990-9799
Publisher: Tiroler Bildungsforum

A data-driven approach to studying changing vocabularies in historical newspaper collections

Authors: Hengchen, Simon; Ros, Ruben; Marjanen, Jani; Tolonen, Mikko
Published in: Digital Scholarship in the Humanities, Issue 36, 2021, Page(s) 109–126, ISSN 2055-7671
Publisher: Oxford University Press
DOI: 10.5281/zenodo.5783070

Survey of Post-OCR Processing Approaches

Authors: Thi Tuyet Hai Nguyen; Adam Jatowt; Mickaël Coustaty; Antoine Doucet
Published in: ACM Computing Surveys, Issue 54(6), 2022, Page(s) 1–37, ISSN 0360-0300
Publisher: Association for Computing Machinery, Inc.
DOI: 10.1145/3453476

A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917

Authors: Jani Marjanen; Ville Vaara; Antti Kanner; Hege Roivainen; Eetu Mäkelä; Leo Lahti; Mikko Tolonen
Published in: Journal of European Periodical Studies, Issue 4 (1), 2019, Page(s) 55–78, ISSN 2506-6587
Publisher: ESPRit (European Society for Periodical Research)
DOI: 10.21825/jeps.v4i1.10483

MELHISSA: a multilingual entity linking architecture for historical press articles

Authors: Elvys Linhares Pontes; Luis Adrián Cabrera-Diego; Jose G. Moreno; Emanuela Boros; Ahmed Hamdi; Antoine Doucet; Nicolas Sidere; Mickaël Coustaty
Published in: International Journal on Digital Libraries, 2021, ISSN 1432-5012
Publisher: Springer Verlag
DOI: 10.1007/s00799-021-00319-6

Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods

Authors: Sarah Oberbichler, Eva Pfanzelter
Published in: Journal of Digital History, 2021
Publisher: De Gruyter

Tracing Discourses in Digital Newspaper Collections: A Contribution to Digital Hermeneutics while Investigating 'Return Migration' in Historical Press Coverage

Authors: Sarah Oberbichler, Eva Pfanzelter
Published in: Digitised Newspapers – A New Eldorado for Historians?, 2022, ISBN 9783110729214
Publisher: De Gruyter Oldenbourg

Crossing or Intersecting the Emperor’s Desk with digitized Newspaper Data: Entity-source-networks in the late Habsburg Empire

Authors: Martin Gasteiner, Andreas Enderlin
Published in: Digitised Newspapers – A New Eldorado for Historians?, 2022, ISBN 9783110729214
Publisher: De Gruyter Oldenbourg

ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset

Authors: Johannes Michael; Max Weidemann; Bastian Laasch; Roger Labahn
Published in: Proceedings of ICPR International Workshops and Challenges (2020), Issue 12668, 2021, Page(s) 405–418
Publisher: Springer
DOI: 10.1007/978-3-030-68793-9_30

International: From Legal to Civic Discourse and Beyond in the Nineteenth Century

Authors: Jani Marjanen, Ruben Ros
Published in: Nationalism and Internationalism Intertwined - A European History of Concepts Beyond the Nation State, 2022, Page(s) 60-85, ISBN 978-1-80073-314-5
Publisher: Berghahn

Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

Authors: Thi-Tuyet-Hai Nguyen, Mickaël Coustaty, Antoine Doucet, Adam Jatowt, Nhu-Van Nguyen
Published in: Maturity and Innovation in Digital Libraries - 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, Issue 11279, 2018, Page(s) 278-289, ISBN 978-3-030-04256-1
Publisher: Springer International Publishing
DOI: 10.1007/978-3-030-04257-8_29

Evaluating the Impact of OCR Errors on Topic Modeling

Authors: Stephen Mutuvi, Antoine Doucet, Moses Odeo, Adam Jatowt
Published in: Maturity and Innovation in Digital Libraries - 20th International Conference on Asia-Pacific Digital Libraries, ICADL 2018, Hamilton, New Zealand, November 19-22, 2018, Proceedings, Issue 11279, 2018, Page(s) 3-14, ISBN 978-3-030-04256-1
Publisher: Springer International Publishing
DOI: 10.1007/978-3-030-04257-8_1

National Sentiment: Nation Building and Emotional Language in Nineteenth-Century Finland

Authors: Jani Marjanen
Published in: Lived Nation as the History of Experiences and Emotions in Finland, 1800-2000, 2021, Page(s) 61–83, ISBN 978-3-030-69881-2
Publisher: Springer
DOI: 10.1007/978-3-030-69882-9_3
