Less common language varieties also have a place in the era of AI, as demonstrated by two INESC TEC papers presented at a top conference

It’s hard to think of current technologies or innovations that do not resort to Language Models (LM) or Natural Language Processing (NLP). Their presence in various domains of society – some with significant relevance, like the legal or healthcare sectors – raises issues (and concerns) that often come down to the same question: are LM-based technologies reaching all communities? Recently, two scientific papers featuring INESC TEC – both accepted at AAAI, an A* conference – sought to address some of the challenges of this new era that directly affect the Portuguese language.

In “Tradutor: Building a Variety Specific Translation Model”, Hugo Sousa, Ricardo Campos and Alípio Jorge, INESC TEC researchers, focused on the role of language varieties in the training of LMs, as well as in the evaluation and implementation phases that follow – issues that tend to be overlooked. Portuguese is a clear example, with the Brazilian population representing 70% of speakers. Given this imbalance, many tools and systems exclude the diversity and cultural nuances of countries like Portugal, Mozambique, Angola, etc. As LMs are adopted in relevant contexts, often associated with decision-making processes, errors caused by grammatical or lexical flaws may pose too high a risk.

One possible solution would be to create a dedicated LM for a specific language variety. However, there are multiple challenges associated with this process, e.g., the large corpus required. An alternative would be the creation of machine translation models dedicated exclusively to a certain variety. In the case of language varieties with few resources, a robust model of this kind could be the first step towards inclusion; said model could also be used to translate training and evaluation resources, since the majority are in English.

The proposal presented by INESC TEC researchers, however, focuses on a third option, based on a new methodology that aims to develop a neural machine translation model. The starting point? The compilation of several texts in language varieties associated with communities with fewer resources, which were translated to the closest variety with more associated resources. According to the paper, also signed by Satya Almasian (University of Heidelberg), this parallel corpus was later used to fine-tune a pre-trained language model, thus leading to Tradutor. PTradutor is the largest database of English – European Portuguese translations ever developed (composed of 1,719,002 documents), and it is now available to all users.

According to the researchers, the results “bring open-source systems closer to industrial-grade translation systems, with minimal resources and limited computing.”

The paper “Enhancing Portuguese Variety Identification with Cross-Domain Approaches”, accepted at the AAAI Conference on Artificial Intelligence, focuses on recent advances in natural language processing. Although significant, these advances can create unrealistic expectations about the models’ ability to produce coherent text in the different language varieties. To address the gaps left by this less comprehensive coverage and – in the case of the Portuguese language – promote the creation of resources in European Portuguese, INESC TEC researchers developed a cross-domain language variety identifier capable of distinguishing between European and Brazilian Portuguese.

The distinction between two varieties is an important process in NLP, especially with the emergence of language models that cover numerous varieties. Regardless of the stage in which it occurs – pre-training, refinement or evaluation – a system capable of distinguishing between two varieties will allow for less human supervision. However, the development of said systems also has associated challenges: for example, the identification of relevant linguistic traits – without any bias – that are later carried over to their application. As with LMs, texts with inaccuracies cause constraints when they are applied, which underlines the importance of systems that identify language varieties effectively.

Throughout the paper, Hugo Sousa, Rúben Almeida (funded by the seed Project PT-PUMP-UP), Ricardo Campos and Alípio Jorge, INESC TEC researchers, described the creation of a multi-domain identifier (with the results of the literature review to be compiled in the PtVId corpus, a multi-domain database), as well as the study of the effectiveness of transformer-based language variety identification (LVI) classifiers for multi-domain scenarios. The paper also featured Purificação Silvano and Inês Cantante from the Centre of Linguistics of the University of Porto.

The researchers mentioned in this news piece are associated with INESC TEC, Faculty of Sciences of the University of Porto, UBI and Ci2 – Smart Cities Research Center.
