Periodic Reporting for period 1 - MorphIRe (Morphologically-informed representations for natural language processing)
Reporting period: 2019-04-01 to 2021-03-31
Today's NLP models are mostly built with artificial intelligence and machine learning techniques. These techniques require large amounts of training data (e.g. sets of questions with their correct answers), which are fed into an algorithm that "learns" to perform the task. Importantly, the techniques that are widely used today are indifferent to which language is being processed: whether the task is performed on English or on Basque, the algorithms work exactly the same. In particular, they do not take into account the word-internal (i.e. "morphological") structure of the language. Whereas English tends to use separate words to express different grammatical and semantic concepts, morphologically richer languages like Basque can express these concepts within a single word form (compare English "because of the rain" with Basque "euriagatik").
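The contrast between the two phrases can be made concrete with a minimal sketch. It assumes a naive whitespace tokenizer as a stand-in for a language-agnostic pipeline, and the split of "euriagatik" into morphemes is a hand-written illustrative gloss, not output from the project's segmentation algorithm or any other tool.

```python
# Illustrative sketch only. The morpheme split below is a hand-written
# gloss (an assumption), not output of the MorphIRe algorithm or any tool.

english = "because of the rain"
basque = "euriagatik"


def whitespace_tokens(text):
    """Split on whitespace, as a simple language-agnostic pipeline might."""
    return text.split()


# Assumed gloss: euri ("rain") + a (definite article) + gatik ("because of")
basque_morphemes = ["euri", "a", "gatik"]

english_tokens = whitespace_tokens(english)
basque_tokens = whitespace_tokens(basque)

# English spreads the meaning over several words; Basque packs it into one.
print(len(english_tokens), english_tokens)
print(len(basque_tokens), basque_tokens)

# The morphemes concatenate back to the single Basque word form.
print("".join(basque_morphemes) == basque)
```

A tokenizer that only looks at whitespace sees a single opaque unit in the Basque case, even though that unit carries roughly the same information as four English words.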
The MorphIRe project provides direct evidence that the morphological structure of languages should not be ignored when building NLP models, because ignoring it contributes to errors that current state-of-the-art NLP models make. The project also proposes a new algorithm for word segmentation that corresponds better to morphological structure. By highlighting these problems in today's NLP models and working towards concrete solutions that can be integrated into them, the MorphIRe project makes an important contribution towards improving NLP technology for a wider range of languages.
The project has also produced a meta-study on how the scientific community engages with older literature compared with the most recent literature, for example work that motivates the need for linguistically informed approaches versus the very latest advances in artificial intelligence, which mostly do not make use of such approaches.
Results have been disseminated at high-profile international conferences, such as the Annual Meeting of the Association for Computational Linguistics (ACL), and all papers, code and datasets produced by this project are openly accessible and re-usable by the scientific community. The next step in exploiting the project's results is to apply the developed algorithms to a wide range of NLP tasks and languages and to improve on the state of the art for them, which the project's researcher continues to work on.