In 1982, a girl named Norma became the image of the Catalan autonomous government's campaign for the normalization of the Catalan language at the end of 40 years of repression under Franco. Now, almost another 40 years later, Catalan continues to struggle to defend itself, but in increasingly complex terrain and with more sophisticated weapons, including artificial intelligence. So that Norma can continue to be heard, even in dialogue with computers and virtual assistants such as Siri or Alexa, the Catalan administration is promoting the AINA project, for the digital normalization of Catalan.
Created by the government's Digital Policies department, the AINA project aims to generate a corpus and computer models of the Catalan language to provide the resources necessary so that companies can create applications based on artificial intelligence such as voice assistants, conversational agents or machine translators.
Digital extinction
"The aim of the government is for the public to be able to interact with the digital world in Catalan", explained the digital policies minister, Jordi Puigneró, in the press conference held to present the project.
Puigneró recalled that a study carried out in 2011 by the European Network of Excellence META-NET warned that more than 20 European languages, including Catalan, face digital extinction if they do not receive more technological support in areas such as simultaneous translation, voice interaction, textual analysis and the availability of language resources. The minister warned that Catalan will survive only if it can also be used normally in the new digital context, as "a useful and competitive language".
Supercomputing Centre
The project has a budget of 13.5 million euros for 2020 to 2024, which is expected to be funded by a European NextGenerationEU grant. It starts with an initial contribution of 250,000 euros from the Catalan administration's digital policies area, which has been assigned to the Barcelona Supercomputing Centre (BSC).
The BSC already has a first textual corpus of Catalan containing 1,770 million words in 95 million sentences, built by downloading texts from different digital sources, such as the Catalan Government website and the 500 Catalan-language websites with the most traffic.
The MareNostrum supercomputer has dedicated 2,000 hours of processing to reviewing this data and eliminating duplicates, as well as all the material that was not actually in Catalan. AINA will incorporate the dialectal varieties of Catalan, the different linguistic registers, and voice and image archives. This latter category will be boosted with the inclusion of the entire programme archive of the Catalan public broadcasting corporation.
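To give a rough idea of what this kind of corpus cleaning involves, here is a minimal Python sketch of the two steps mentioned above: dropping duplicate sentences and discarding text that does not look like Catalan. It is not the BSC's actual pipeline, which the article does not describe. The function names, the small word list and the threshold are all hypothetical, and a real system would use a proper language-identification model rather than this crude word-count heuristic.

```python
import hashlib
import re

# Assumption: a tiny, hand-picked list of common Catalan words that are
# rare in Spanish or English. A real pipeline would use a trained
# language-identification model instead.
CATALAN_HINTS = {
    "amb", "els", "és", "per", "aquest", "aquesta", "això", "són",
    "també", "però", "més", "dels", "als", "seva", "molt",
}


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def looks_catalan(sentence: str, threshold: float = 0.15) -> bool:
    """Crude check: share of tokens found in the hint list (hypothetical threshold)."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in CATALAN_HINTS)
    return hits / len(tokens) >= threshold


def clean_corpus(sentences):
    """Keep the first copy of each sentence, and only if it looks Catalan."""
    seen = set()
    kept = []
    for sentence in sentences:
        # Hash the normalized text so exact duplicates are caught cheaply.
        key = hashlib.sha1(normalize(sentence).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        if looks_catalan(sentence):
            kept.append(sentence)
    return kept
```

Hashing the normalized text only catches exact repeats. Near-duplicates, such as the same news item republished with small edits, would need a fuzzier technique, which this sketch does not attempt.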
Artificial intelligence
All of this is to be used to develop applications based on artificial intelligence, such as voice assistants, chatbots, automatic summary applications, smart searches, applications for sentiment analysis, or automatic translation and subtitling engines, among others. To make this possible, neural networks will be created to learn Catalan and generate language, speech and translation models.
All the models created in the Supercomputing Centre will be available to all those companies or entities that want to use them because, according to the department, they will be published openly and with permissive licenses.
The name AINA
Puigneró commented that the Catalan language does not have a state behind it to protect it and that, thanks to AINA, Catalan is more likely to be heard from the lips of Alexa, Amazon's digital assistant, than in the Congress of Deputies or the Spanish Supreme Court.
The AINA project continues the thread that began with Norma in 1982. That campaign was created by the then Director General of Linguistic Policy, Aina Moll. Hence the name, which also contains the acronym AI, Artificial Intelligence.