Recognition and Enrichment of Archival Documents

The history of Europe is preserved in its archives. Thousands of shelf-kilometres containing billions of documents provide a true picture of the everyday life (and struggles) of European citizens from the Middle Ages until today. However, these treasures are hard to access: even after a complete archive had been digitised, searching millions of pages for specific words or phrases was not possible. This situation has changed dramatically. The technology developed in the H2020 project READ (Recognition and Enrichment of Archival Documents) has revolutionized access to historical collections from archives and libraries. The main input comes from cutting-edge research in Pattern Recognition, Computer Vision, Natural Language Processing and Digital Humanities. In particular, Handwritten Text Recognition and Keyword Spotting are key technologies in which European universities are at the forefront of research. These technologies are made available via the service platform “Transkribus”. It offers the world’s first implementation of a freely available Handwritten Text Recognition engine, which can be trained on medieval handwriting found in codices in the same way as on the individual handwriting of famous persons of the 20th century. The main European scripts can be trained and recognised, as well as Hebrew, Arabic and Bangla.
The Virtual Research Environment “Transkribus” aims to provide benefits for all user groups involved in the “eco-system” of historical documents: archives and libraries, as content holders, get the chance to enrich their documents on a large scale with full-text transcription and search; (digital) humanities scholars can work intensively with historical documents in a sheltered and highly specialized environment; computer scientists are supported with large-scale datasets and reference data; and finally the public benefits from access to digital archives. More than 25,000 users are registered on the Transkribus platform today, contributing their documents, knowledge and engagement to its further development. On 1 July the READ project turned into a European Cooperative Society (SCE) with limited liability. READ-COOP SCE will be the legal entity that maintains and further develops the Transkribus platform.

Our work focused on the following main areas:
First of all, we carried out a range of activities to make the project and the technology known to our four target groups. This started with a three-day conference combining the public kick-off meeting of the project with a convention meeting of the co:op project. More than 150 people from over 20 countries took part in the conference. Videos of the presentations are online and are an important resource for dissemination activities. Reactions to the conference were highly positive and opened the door to many archives and research groups. Dissemination activities continued on several channels, e.g. more than 20 workshops were organized by several groups in the project and held in a number of countries (Austria, France, Germany, Netherlands, Finland, Denmark, Norway, Italy, Switzerland, United Kingdom, Spain). Hundreds of people took part in these workshops and became familiar with the expert tool of the Transkribus platform.
Based on the overwhelming interest of archives and research groups in the project, we were able to conclude more than 70 Memoranda of Understanding (MoUs) with institutions. These MoUs provide an excellent framework for cooperation. Among the partners are the Hessian State Archive (Germany), the Archivio Storico Ricordi (Italy), the Huygens Institute for the History of the Netherlands (Netherlands), the Alfred Escher Foundation (Switzerland) and The Linnean Society (United Kingdom), to mention just a few. The success of this measure can be clearly seen in the number of users registered on the platform: after Y1 about 5,000 users were registered in Transkribus, after Y2 nearly 9,000, and now, at the end of the project, more than 25,000 users are registered, representing archivists, librarians, researchers, scholars and public users (family historians) from all over Europe and abroad.
Our second focus was the implementation of the Transkribus platform, integrating a number of tools developed by the research groups in the project. Special attention was given to defining interfaces and data exchange formats, to setting up application servers for easy deployment of the individual tools (which run on different operating systems and are written in different programming languages), and to tackling the challenge of storing and processing millions of image files. As a highlight of Y1, the award-winning Handwritten Text Recognition engine from the CITlab team of the University of Rostock was implemented in the Transkribus platform. In Y2 major progress was made, so that today Transkribus offers the complete workflow for a text recognition project, including the training of neural networks as well as keyword spotting. In Y3 and Y4 of the project a breakthrough in Handwritten Text Recognition and Layout Analysis was achieved by the teams from the Technical University of Valencia and the University of Rostock. Error rates clearly below 5% can now be achieved for historical handwritten documents with reasonable effort. Transkribus users have trained more than 2,000 neural networks for their own specific documents. The data used for this training has a monetary value of EUR 2-3 million.
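To make the error-rate figure concrete: recognition quality for handwritten text is commonly reported as a character error rate, i.e. the number of character insertions, deletions and substitutions needed to turn the recognised text into the reference transcription, divided by the length of the reference. The report does not state which metric or tooling was used, so the following Python sketch is only an illustration under that assumption; the function names and the sample strings are hypothetical and are not part of Transkribus.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one wrong character in an 18-character line.
    reference = "archival documents"
    recognised = "archival docunents"
    print(f"CER: {character_error_rate(reference, recognised):.2%}")
```

Under this definition, an error rate below 5% means that fewer than one character in twenty has to be corrected to match the reference transcription.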
Y3 and Y4 were also used to prepare the foundation of a legal entity to run and further develop the Transkribus service platform. The decision was taken to set up a European Cooperative Society, since this governance model best enables us to provide services for our target groups in a collaborative manner.

We achieved several breakthroughs during the project.
(1) A significant drop in error rates for handwritten text recognition. Compared to baseline results from the start of the project, an improvement of 50-80% was achieved by the leading teams in this domain, such as PRHLT from the Technical University of Valencia and the CITlab team from the University of Rostock.
(2) A real breakthrough took place in the layout analysis domain. Here it turned out that creating a large set of training data for a scientific competition and incorporating machine learning methods led to a dramatic improvement of this base technology. The best results from the ICDAR 2017 cBAD competition have even been exceeded by the teams of the READ project. Based on a very similar technology, a breakthrough was also achieved in the structural tagging of documents with the software package dhSegment from EPFL.
(3) Major progress was made in making the technology available to archives, libraries, humanities scholars and family historians via the Transkribus platform. The whole workflow of a text recognition project is covered and can be used by any registered user.
(4) A breakthrough was also achieved in creating probabilistic indexes for Keyword Spotting. With this technology, extremely fast searches can be performed on large datasets.
(5) Significant progress was made in terms of innovation. A specific device (ScanTent) and an app (DocScan) were created as prototype applications. Interest from users is extremely high, so we will be marketing the device after the end of the project.

Periodic Reporting for period 3 - READ (Recognition and Enrichment of Archival Documents)
