How lemmatisation and derivational annotation affect productivity measures: The case of deverbal agent nouns in the Joint Corpus of Lithuanian

Authors

  • Jurgis Pakerys Institute for the Languages and Cultures of the Baltic, Department of Baltic Studies, Vilnius University https://orcid.org/0000-0002-9944-8598
  • Virginijus Dadurkevičius Institute of Digital Resources and Interdisciplinary Research, Vytautas Magnus University https://orcid.org/0000-0001-8602-6591
  • Agnė Navickaitė-Klišauskienė Institute for the Languages and Cultures of the Baltic, Department of Baltic Studies, Vilnius University

DOI:

https://doi.org/10.22364/vnf.15.09

Keywords:

word formation, derivational productivity, agent nouns, Lithuanian

Abstract

We discuss the automatic and manual stages of the lemmatisation and annotation of the Joint Corpus of Lithuanian (1.3 billion words) used to measure derivational productivity. As a case study, we present data on three productive deverbal agent noun suffixes in Lithuanian, -toj-, -ėj-, and -ik-, and measure their realized, expanding, and potential productivity. We show that additional semi-automatic lemmatisation and manual derivational annotation significantly increase type and hapax counts. We also note that lemmatisation artificially inflates the number of lemmas because homographic forms remain unresolved by the lemmatiser. After the manual disambiguation of hapaxes, the counts of feminine formations in -toj-(a) and -ėj-(a) were reduced the most.
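The three measures named above can be illustrated with a minimal sketch, assuming the standard definitions in Baayen (2009): realized productivity is the number of types of a category, expanding productivity is the category's hapaxes divided by all hapaxes in the corpus, and potential productivity is the category's hapaxes divided by the category's tokens. The function name, the toy lemma list, and the corpus-wide hapax count are hypothetical and are not taken from the Joint Corpus of Lithuanian.

```python
from collections import Counter


def productivity_measures(category_lemmas, corpus_hapax_count):
    """Compute three productivity measures for one derivational category.

    category_lemmas: one lemma per corpus token of the category
                     (hypothetical input format).
    corpus_hapax_count: number of hapaxes in the whole corpus
                        (hypothetical value supplied by the caller).
    """
    freq = Counter(category_lemmas)
    tokens = sum(freq.values())                        # N(C): tokens of the category
    types = len(freq)                                  # V(C): distinct lemmas
    hapaxes = sum(1 for n in freq.values() if n == 1)  # lemmas attested once

    return {
        "realized": types,
        "expanding": hapaxes / corpus_hapax_count if corpus_hapax_count else 0.0,
        "potential": hapaxes / tokens if tokens else 0.0,
    }


# Toy data for illustration only, not corpus counts.
toy_lemmas = ["mokytojas"] * 3 + ["skaitytojas"] * 2 + ["rašytojas"]
result = productivity_measures(toy_lemmas, corpus_hapax_count=4)
print(result)
```

Because all three measures depend on type and hapax counts, any lemmatisation error that splits one lemma into several, or merges homographs, changes the results directly; this is the sensitivity the abstract describes.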

References

Dadurkevičius, Virginijus. 2020a. Wordlist of lemmas from the Joint Corpus of Lithuanian. CLARIN-LT digital library in the Republic of Lithuania. Available at: https://clarin.vdu.lt/xmlui/handle/20.500.11821/41

Dadurkevičius, Virginijus. 2020b. Assessment data of the Dictionary of Modern Lithuanian versus Joint Corpora. CLARIN-LT digital library in the Republic of Lithuania. Available at: https://clarin.vdu.lt/xmlui/handle/20.500.11821/36 DOI: https://doi.org/10.7220/20.500.12259/240250

Aleksaitė, Agnė. 2022. Lietuvių kalbos naujažodžių daryba (2011–2019 m. Naujažodžių duomenyno pagrindu). Daktaro disertacija. Vilnius: Lietuvių kalbos institutas. Available at: https://talpykla.elaba.lt/elaba-fedora/objects/elaba:132642831/datastreams/MAIN/content

Ambrazas, Vytautas (ed.). 1994. Dabartinės lietuvių kalbos gramatika. Vilnius: Mokslo ir enciklopedijų leidykla.

Baayen, Harald Rolf. 2009. Corpus linguistics in morphology: Morphological productivity. Corpus Linguistics: An International Handbook. 2. Lüdeling, Anke, Kytö, Merja (eds.). Berlin, New York: De Gruyter Mouton, 899–919. DOI: https://doi.org/10.1515/9783110213881.2.899

Dadurkevičius, Virginijus. 2017. Lietuvių kalbos morfologija atvirojo kodo “Hunspell” platformoje. Bendrinė kalba. 90, 1–17. Available at: https://journals.lki.lt/bendrinekalba/article/view/156

Dadurkevičius, Virginijus, Petrauskaitė, Rūta. 2020. Corpus-based methods for assessment of traditional dictionaries. Human Language Technologies–The Baltic Perspective. Frontiers in Artificial Intelligence and Applications. 328. Utka, Andrius, Vaičenonienė, Jurgita, Kovalevskaitė, Jolanta, Kalinauskaitė, Danguolė (eds.). Amsterdam: IOS Press, 123–126. DOI: https://doi.org/10.3233/FAIA200613

Dal, Georgette et al. 2008. Quelques préalables au calcul de la productivité des règles constructionnelles et premiers résultats. Actes du premier Congrès mondial de linguistique française, Paris, 9–12 juillet 2008. Durand, Jacques, Habert, Benoît, Laks, Bernard (eds.). Paris: Institut de Linguistique Française, 1587–1599. DOI: https://doi.org/10.1051/cmlf08184

Dal, Georgette, Namer, Fiammetta. 2016. Productivity. The Cambridge Handbook of Morphology. Hippisley, Andrew, Stump, Gregory (eds.). Cambridge: Cambridge University Press, 70–90. DOI: https://doi.org/10.1017/9781139814720.004

Evert, Stefan, Lüdeling, Anke. 2001. Measuring morphological productivity: Is automatic preprocessing sufficient? Proceedings of the Corpus Linguistics 2001 Conference. Rayson, Paul, Wilson, Andrew, McEnery, Tony, Hardie, Andrew, Khoja, Shereen (eds.). Lancaster: Lancaster University, 167–175.

Fraenkel, Ernst. 1962. Litauisches etymologisches Wörterbuch. Heidelberg: Carl Winter.

Gaeta, Livio, Ricca, Davide. 2006. Productivity in Italian word formation: a variable-corpus approach. Linguistics. 44(1), 57–89. DOI: https://doi.org/10.1515/LING.2006.003

Gaeta, Livio, Ricca, Davide. 2015. Productivity. Word-Formation: An International Handbook of the Languages of Europe. 2. Müller, Peter O., Ohnheiser, Ingeborg, Olsen, Susan, Rainer, Franz (eds.). Berlin/Boston: De Gruyter Mouton, 842–858. DOI: https://doi.org/10.1515/9783110246278-003

Murmulaitytė, Daiva. 2016. Naujieji asmenų pavadinimai darybos ir semantiniu aspektu. Lietuvių kalba. 10, 1–22. DOI: https://doi.org/10.15388/LK.2016.22591

Murmulaitytė, Daiva. 2021. Naujažodžių darybos ir morfemikos tyrimų perspektyvos (Lietuvių kalbos naujažodžių duomenyno atvejis). Vilnius: Lietuvių kalbos institutas. DOI: https://doi.org/10.35321/e-pub.16.naujadaros-tyrimu-perspektyvos

Ulvydas, Kazys (ed.). 1965. Lietuvių kalbos gramatika. 1. Vilnius: Mintis.

Ulvydas, Kazys (ed.). 1971. Lietuvių kalbos gramatika. 2. Vilnius: Mintis.

Van Marle, Jaap. 1992. The relationship between morphological productivity and frequency: a comment on Baayen’s performance-oriented conception of morphological productivity. Yearbook of Morphology 1991. Booij, Geert, Van Marle, Jaap (eds.). Dordrecht: Kluwer, 151–163. DOI: https://doi.org/10.1007/978-94-011-2516-1_9

Vaskelienė, Jolanta. 2017. Lietuvių rašytojų naujadarų darybos ir semantikos ypatumai. Bendrinė kalba. 90, 1–30. Available at: http://www.bendrinekalba.lt/Straipsniai/90/Vaskeliene_BK_90_straipsnis.pdf

Zeldes, Amir. 2012. Productivity in Argument Selection: From Morphology to Syntax. Berlin/Boston: De Gruyter Mouton. DOI: https://doi.org/10.1515/9783110303919

Published

2024-12-16

How to Cite

Pakerys, J., Dadurkevičius, V., & Navickaitė-Klišauskienė, A. (2024). How lemmatisation and derivational annotation affect productivity measures: The case of deverbal agent nouns in the Joint Corpus of Lithuanian. Valoda: Nozīme Un Forma | Language: Meaning and Form, 15, 138-151. https://doi.org/10.22364/vnf.15.09