AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Bibliographic details
Title: AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Authors: Wang, Jiayi; Adelani, David; Agrawal, Sweta; Masiak, Marek; Rei, Ricardo; Briakou, Eleftheria; Carpuat, Marine; He, Xuanli; Bourhim, Sofia; Bukula, Andiswa; Mohamed, Muhidin; Olatoye, Temitayo; Adewumi, Tosin; Mokayed, Hamam; Mwase, Christine; Kimotho, Wangui; Yuehgoh, Foutse; Aremu, Anuoluwapo; Ojo, Jessica; Muhammad, Shamsuddeen; Osei, Salomey; Omotayo, Abdul-Hakeem; Chukwuneke, Chiamaka; Ogayo, Perez; Hourrane, Oumaima; El Anigri, Salma; Ndolela, Lolwethu; Mangwana, Thabiso; Mohamed, Shafie; Ayinde, Hassan; Awoyomi, Oluwabusayo; Alkhaled, Lama; Al-Azzawi, Sana; Etori, Naome; Ochieng, Millicent; Siro, Clemencia; Kiragu, Njoroge; Muchiri, Eric; Kimotho, Wangari; Sakayo, Toadoum Sari; Wamba, Lyse Naomi; Abolade, Daud; Ajao, Simbiat; Shode, Iyanuoluwa; Macharm, Ricky; Iro, Ruqayya; Abdullahi, Saheed; Moore, Stephen; Opoku, Bernard; Akinjobi, Zainab; Afolabi, Abeeb; Obiefuna, Nnaemeka; Ogbu, Onyekachi; Ochieng’, Sam; Otiende, Verrah; Mbonu, Chinedu; Lu, Yao; Stenetorp, Pontus
Contributors: University College of London London (UCL), Masakhane NLP, University of Maryland Baltimore, Unbabel, Instituto Superior Técnico (IST / Técnico Lisboa), Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (INESC-ID), Instituto Superior Técnico (IST / Técnico Lisboa)-Instituto de Engenharia de Sistemas e Computadores (INESC), Ecole Nationale Supérieure d'Informatique et d'Analyses des Systèmes (ENSIAS), Université Mohammed V de Rabat Agdal (UM5), South African Centre for Digital Language Resources (SADiLaR), Aston University Birmingham, University of Eastern Finland, Luleå University of Technology = Luleå Tekniska Universitet (LUT), Fudan University Shanghai, CEDRIC. Données complexes, apprentissage et représentations (CEDRIC - VERTIGO), Centre d'études et de recherche en informatique et communications (CEDRIC), Ecole Nationale Supérieure d'Informatique pour l'Industrie et l'Entreprise (ENSIIE)-Conservatoire National des Arts et Métiers CNAM (CNAM)-Ecole Nationale Supérieure d'Informatique pour l'Industrie et l'Entreprise (ENSIIE)-Conservatoire National des Arts et Métiers CNAM (CNAM), Lelapa AI, Imperial College London, Hausa natural language processing (HausaNLP), Bayero University Kano (BUK), Universidad de Deusto (DEUSTO), University of California (UC), Lancaster University, Jamhuriya University Of Science and Technology (JUST), Ladoke Akintola University of Technology (LAUTECH), The College of Saint Rose, University of Minnesota Twin Cities (UMN), University of Minnesota System (UMN), Microsoft Research, University of Amsterdam Amsterdam = Universiteit van Amsterdam (UvA), Technical University of Kenya (TUK), African Institute for Mathematical Sciences (AIMS), Catholic University of Leuven = Katholieke Universiteit Leuven (KU Leuven), Shenzhen Institute of Advanced Technology Shenzhen (SIAT), Chinese Academy of Sciences Beijing (CAS), Kaduna State University (KASU), University of Cape Coast Ghana, Ghana Natural Language Processing (Ghana NLP), Kwame Nkrumah University of Science and Technology (KNUST), New Mexico State University, New Mexico Consortium (NMC), United States International University-Africa (USIU-Africa), Nnamdi Azikiwe University (NAU-UNIZIK), Association for Computational Linguistics, Kevin Duh, Helena Gomez, Steven Bethard
Source: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) ; 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies ; https://cnam.hal.science/hal-04676542 ; 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Association for Computational Linguistics, Jun 2024, Mexico City, Mexico. pp.5997-6023, ⟨10.18653/v1/2024.naacl-long.334⟩
Publication details: CCSD
Association for Computational Linguistics
Publication year: 2024
Subject terms: Computational linguistics, Computer aided language translation, Machine translation, Petroleum reservoir evaluation, Quality control, [INFO]Computer Science [cs], [SHS.LANGUE]Humanities and Social Sciences/Linguistics
Subject geography: Mexico City, Mexico
Description: International audience ; Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed with n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441). © 2024 Association for Computational Linguistics
... English-Egyptian Arabic (eng-arz), English-French (eng-fra; a control LP), English-Hausa (eng-hau), English-Igbo (eng-ibo), English-Kikuyu (eng-kik), English-Luo (eng-luo), English-Somali (eng-som), English-Swahili (eng-swh), English-Twi (eng-twi), English-isiXhosa (eng-xho), English-Yoruba (eng-yor), and Yoruba-English (yor-eng). Moreover, we extend our annotation collection to include domain-specific texts from the News, TED talks, Movies, and IT domains for English-Yoruba translations, which were established in prior research by Adelani et al. (2021) and Shode et al. (2022), ensuring a comprehensive and domain-varied evaluation.
We provide the language family groups of our targeted African languages in Table 4 of Appendix ...
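The description reports metric quality as a Spearman-rank correlation with human judgments. As a rough illustration of what that statistic measures, the sketch below computes it with the Python standard library only. The function names and the sample scores are hypothetical and are not taken from the paper or its data; the paper's own evaluation pipeline may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical segment-level scores, invented for illustration only.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_da_scores = [78, 55, 90, 60, 52]
print(f"Spearman rho = {spearman(metric_scores, human_da_scores):.3f}")
```

A value near 1 means the metric orders translations almost as the human raters do; because only ranks are compared, the metric and the human scores do not need to share a scale.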
Document type: conference object
Language: English
Relation: info:eu-repo/semantics/altIdentifier/arxiv/2311.09828; ARXIV: 2311.09828
DOI: 10.18653/v1/2024.naacl-long.334
Availability: https://cnam.hal.science/hal-04676542
https://doi.org/10.18653/v1/2024.naacl-long.334
Accession number: edsbas.7DB15C8C
Database: BASE