INESC TEC technology used to create a database of keywords obtained from more than 100 million scientific articles

YAKE!, a keyword extraction tool developed by researchers at INESC TEC, was used to create the General Index, a project that catalogued the words and phrases of 107 million scientific articles to make information easier to find. Launched in October, the new database is available on the Internet Archive, the world’s largest digital content preservation archive, and indexes over 19 billion keywords extracted using YAKE!.

According to Ricardo Campos and Alípio Jorge, co-creators of the technology, “YAKE!’s adaptability to different scenarios, its plug-and-play nature, and its effectiveness, namely when compared to different solutions”, as well as “its speed of execution”, are the traits that led to its use in the creation of the General Index. “The fact that they used YAKE! in this process is a clear example of its applicability in big data contexts”, emphasised the researchers.

The software adapts to different domains of activity, languages, and document sizes without resorting to external data sources, high volumes of data, or computationally demanding training processes. It is based on a set of statistical measures, a group of heuristics, that translate into a mathematical formula capable of determining the relevance of a word.
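To make the idea of heuristic, training-free scoring more concrete, here is a minimal sketch in Python. It is not YAKE!'s actual formula: the two features (term frequency and position of first occurrence), the way they are combined, the tiny stopword list, and the function names `score_words` and `top_keywords` are all assumptions chosen for illustration. It only shows how simple statistics computed from a single document can rank words without any external data.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use per-language lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "that", "by", "are", "from", "it", "as"}


def score_words(text):
    """Return {word: score} for one document; a LOWER score means more relevant.

    Toy heuristic, not YAKE!'s formula: frequent words that first appear
    early in the text receive the lowest scores.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    if not candidates:
        return {}

    freq = Counter(candidates)

    # Index of the first occurrence of each candidate word in the token stream.
    first_pos = {}
    for i, tok in enumerate(tokens):
        if tok in freq and tok not in first_pos:
            first_pos[tok] = i

    max_freq = max(freq.values())
    scores = {}
    for word, count in freq.items():
        tf = count / max_freq                      # normalised frequency in (0, 1]
        position = math.log(2 + first_pos[word])   # grows when the word appears late
        scores[word] = position / (1 + tf)         # combine the two statistics
    return scores


def top_keywords(text, k=3):
    """Return the k best-ranked words, ties broken alphabetically."""
    scores = score_words(text)
    return sorted(scores, key=lambda w: (scores[w], w))[:k]


if __name__ == "__main__":
    sample = ("Keyword extraction finds keyword candidates. "
              "Extraction uses statistics of the text.")
    print(top_keywords(sample))
```

Because every quantity comes from the document itself, a scorer of this kind needs no training corpus, and switching language mainly means swapping the stopword list. This is the general property the paragraph above attributes to YAKE!.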

“The algorithm and the reasons to determine if a given word is relevant or not are easily operationalised in YAKE!, as opposed to systems based on neural networks, which are typically heavier, since they require a larger amount of data to train. This allows YAKE! to be directly applicable to a large set of languages with little software engineering work. Moreover, the algorithm is easily understandable, which makes the results easier to explain”, stated Ricardo Campos and Alípio Jorge, adding that this technology has contributed to the automation of the keyword extraction process, with special relevance “at a time when the data volume grows at high rates”.

Open-source and cross-cutting system

Newly integrated into John Snow Labs’ portfolio of open-source solutions, the most widely used natural language processing and text mining library in the business field, YAKE! is also used by the National Library of Finland, by Chartbeat Labs (textacy), and within the scope of the INESC TEC Conta-me Histórias project, included in the Portuguese web archive, arquivo.pt.

In addition to an online demo, where users can extract keywords by entering text or a URL, an open-source software package is available that can be incorporated into projects with different needs. “This is a solution that covers different fields of application. It can be used, for example, by journalists, in the process of annotating news articles, or integrated in different pipelines. There are several examples of scientific articles that refer to and use YAKE! in different case studies, from summarisation processes to text mining processes”, the researchers explained.

Developed by Ricardo Campos (INESC TEC and Polytechnic Institute of Tomar), Vítor Mangaravite (Federal University of Minas Gerais), Arian Pasquali (INESC TEC), Alípio Jorge (INESC TEC and Faculty of Sciences of the University of Porto), Célia Nunes (University of Beira Interior) and Adam Jatowt (University of Innsbruck), the software is currently cited or used in more than 270 articles, has more than 860 stars and 141 forks on GitHub, and accounts for more than 1,000 installations on Android. In 2018, it received the “Best Short Paper” award at ECIR, the most important European conference on information retrieval.

The INESC TEC researchers mentioned in this news piece are associated with INESC TEC, IPT and UP-FCUP.
