Abstracts
Abstract
Ever since the publication of Laviosa’s (1998a; 1998b) pioneering work, the study of lexico-syntactic simplification has held centre stage in corpus translation research concerned with the typical features of translated texts. The simplification hypothesis states that translated texts are simpler than non-translated texts. The convergence hypothesis, also discussed by Laviosa (1998a; 1998b) but less often in follow-up studies, states that translated texts are more homogeneous than original texts, that is, they display less variance. To date, simplification has mostly been operationalised in corpus-based translation studies (CBTS) as type-token ratio, lexical density, core vocabulary coverage, list head coverage and average sentence length. Relying on these parameters, previous research has produced mixed results, with simplification varying across translation modalities, language pairs and registers. The present article sets out to revisit the simplification and convergence hypotheses through the lens of NLP-informed readability research. In particular, we rely on a larger set of simplification indicators and make use of multivariate statistical techniques. We present a simplification study of Europarl corpus data in French translated from English and in non-translated French. The results show that translated French is simpler than original French, both lexically and syntactically. We also find evidence of convergence, which suggests that translators smooth out cross-speaker lexical heterogeneity in translated parliamentary proceedings.
Keywords:
- corpus-based translation studies,
- simplification,
- simplicity,
- readability,
- Europarl
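The abstract names the traditional simplification indicators without defining them. As a rough illustration only, the Python sketch below computes two of them, type-token ratio and average sentence length, and uses the variance of type-token ratio across texts as one possible proxy for convergence. The regex tokeniser, the sentence splitter and all function names are assumptions made for this example; they are not the indicators or the pipeline used in the study, and raw type-token ratio is known to be sensitive to text length.

```python
import re
from statistics import mean, pvariance


def tokenize(text):
    """Lowercased word tokens; a crude regex stand-in for a real tokeniser."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive split on whitespace that follows ., ! or ? (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_sentence_length(text):
    """Mean number of tokens per non-empty sentence."""
    lengths = [len(tokenize(s)) for s in split_sentences(text)]
    lengths = [n for n in lengths if n > 0]
    return mean(lengths) if lengths else 0.0


def ttr_variance(texts):
    """Population variance of type-token ratio across a set of texts.

    Assumption for this sketch: a lower value in the translated set than in
    the non-translated set would be read as a sign of convergence.
    """
    return pvariance([type_token_ratio(tokenize(t)) for t in texts])


if __name__ == "__main__":
    sample = "The cat sat. The cat ran away."
    print(type_token_ratio(tokenize(sample)))
    print(average_sentence_length(sample))
    print(ttr_variance(["a a", "a b"]), ttr_variance(["a b", "c d"]))
```

Under the hypotheses stated in the abstract, a translated sub-corpus would be expected to show a lower type-token ratio, shorter sentences and a smaller cross-text variance than a comparable non-translated sub-corpus, provided the texts compared are of similar length.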
Abstract (translated from French)
Since the work of Laviosa (1998a; 1998b), corpus studies have often examined the phenomenon of lexico-syntactic simplification in order to determine whether translated texts are simpler than non-translated texts. Laviosa (1998a; 1998b) also addresses the convergence hypothesis, according to which translated texts are more homogeneous than non-translated texts. So far, however, this question has attracted less interest than simplification. In corpus-based translation studies, simplification has been operationalised by means of five main parameters. On the basis of these parameters, studies have shown that simplification varies with the translation modalities, language pairs and registers analysed. Our article aims to revisit this type of research through the prism of readability studies. In particular, we use a richer set of simplification parameters and rely on multivariate statistics. Our analyses bear on French data drawn from the Europarl corpus (French translated from English and non-translated French) and show that translated French is simpler, from a lexical and syntactic point of view, than non-translated French. A tendency towards lexical convergence is also brought to light in translated French, which seems to indicate that translators smooth out lexical differences between source-language speakers.
Keywords:
- corpus-based translation studies,
- simplification,
- simplicity,
- readability,
- Europarl
Abstract (translated from Spanish)
Ever since Laviosa published her pioneering work on lexico-syntactic simplification in translation (1998a; 1998b), the topic has occupied a prominent place in Translation Studies. According to this hypothesis, translated texts are simpler than non-translated ones. The convergence hypothesis, also developed by Laviosa but less pursued in subsequent research in the field, posits that translated texts are more homogeneous than original texts. To date, simplification has been approached in corpus-based translation studies as type-token ratio, lexical density, core vocabulary coverage, list head coverage and average sentence length. Taking these parameters into account, previous research has yielded mixed results, showing that simplification varies according to translation modalities, language pairs or registers. The aim of this article is to revisit the simplification and convergence hypotheses in the light of readability research informed by natural language processing. To this end, we draw on a broader set of simplification indicators and make use of multivariate statistical techniques. We analyse simplification in data of French translated from English and of original French, extracted from Europarl. The results show that translated French is simpler than non-translated French, at both the lexical and the syntactic level. We also observe cases of convergence, whereby translators appear to minimise lexical heterogeneity between speakers in the translation of parliamentary proceedings.
Keywords:
- corpus-based translation studies,
- simplification,
- simplicity,
- readability,
- Europarl
Appendices
Bibliography
- Ädel, Annelie (2010): How to use corpus linguistics in the study of political discourse. In: Anne O’Keeffe and Michael McCarthy, eds. The Routledge Handbook of Corpus Linguistics. London: Routledge, 591-604.
- Azpiazu, Ion Madrazo and Pera, Maria Soledad (2019): Multiattentive recurrent neural network architecture for multilingual readability assessment. Transactions of the Association for Computational Linguistics. 7:421-436.
- Baker, Mona (1993): Corpus linguistics and translation studies: Implications and applications. In: Mona Baker, Gill Francis and Elena Tognini-Bonelli, eds. Text and technology: In honour of John Sinclair. Philadelphia/Amsterdam: John Benjamins, 233-250.
- Baker, Mona (1995): Corpora in translation studies: An overview and some suggestions for future research. Target. 7(2):223-243.
- Baker, Mona (1996): Corpus-based translation studies: The challenges that lie ahead. In: Harold Somers, ed. Terminology, LSP and Translation: Studies in Language Engineering. In Honour of Juan C. Sager. Amsterdam: John Benjamins, 175-186.
- Baroni, Marco, Bernardini, Silvia, Ferraresi, Adriano, et al. (2009): The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation. 43(3):209-226.
- Benjamin, Rebekah George (2012): Reconstructing readability: Recent developments and recommendations in the analysis of text difficulty. Educational Psychology Review. 24(1):63-88.
- Bernardini, Silvia, Ferraresi, Adriano and Miličević, Maja (2016): From EPIC to EPTIC—Exploring simplification in interpreting and translation from an intermodal perspective. Target. 28:61-86.
- Cartoni, Bruno and Meyer, Thomas (2012): Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies. In: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, et al., eds. Proceedings of the 8th International Conference on Language Resources and Evaluation. Istanbul: European Language Resources Association (ELRA), 2132-2137.
- Cha, Miriam, Gwon, Youngjune and Kung, H. T. (2017): Language modeling by clustering with word embeddings for text readability assessment. In: Ee-Peng Lim and Marianne Winslett, eds. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. New York: Association for Computing Machinery, 2003-2006.
- Ciobanu, Lina Maria, Dinu, Liviu P. and Pepelea, Flaviu Ioan (2015): Readability Assessment of Translated Texts. In: Ruslan Mitkov, Galia Angelova and Kalina Bontcheva, eds. Proceedings of Recent Advances in Natural Language Processing. Hissar: INCOMA, 97-103.
- Collins-Thompson, Kevyn (2014): Computational assessment of text readability: A survey of current and future research. ITL-International Journal of Applied Linguistics. 165(2):97-135.
- Collins-Thompson, Kevyn and Callan, Jamie (2005): Predicting reading difficulty with statistical language models. Journal of the American Society for Information Science and Technology. 56(13):1448-1462.
- Corpas Pastor, Gloria, Mitkov, Ruslan, Naveed, Afzal, et al. (2008): Translation universals: do they exist? A corpus-based NLP study of convergence and simplification. In: MT at work: Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas. Stroudsburg: Association for Machine Translation in the Americas, 75-81.
- Dale, Edgar and Chall, Jane (1948): A formula for predicting readability: Instructions. Educational Research Bulletin. 27(2):37-54.
- Daoust, François, Laroche, Léo and Ouellet, Lise (1996): SATO-CALIBRAGE: Présentation d’un outil d’assistance au choix et à la rédaction de textes pour l’enseignement. Revue Québécoise de Linguistique. 25(1):205-234.
- De Sutter, Gert and Lefer, Marie-Aude (2020): On the need for a new research agenda for corpus-based translation studies: A multi-methodological, multifactorial and interdisciplinary approach. Perspectives. 28(1):1-23.
- Dumont, Amandine (2018): Fluency and disfluency: a corpus study of non-native and native speaker (dis)fluency profiles. Doctoral dissertation, unpublished. Louvain-la-Neuve: Université catholique de Louvain.
- Ferraresi, Adriano, Bernardini, Silvia, Miličević Petrović, Maja, et al. (2018): Simplified or not simplified? The different guises of mediated English at the European Parliament. Meta. 63(3):717-737.
- Field, Andy (2014): Discovering statistics using IBM SPSS statistics. 4th ed. London: Sage.
- Flesch, Rudolf (1948): A new readability yardstick. Journal of Applied Psychology. 32(3):221-233.
- François, Thomas (2011): Les apports du traitement automatique du langage à la lisibilité du français langue étrangère. Doctoral dissertation, unpublished. Louvain-la-Neuve: Université catholique de Louvain.
- François, Thomas (2015): When readability meets computational linguistics: a new paradigm in readability. Revue française de linguistique appliquée. 20(2):79-97.
- François, Thomas and Fairon, Cédrick (2012): An “AI readability” formula for French as a foreign language. In: Jun’ichi Tsujii, James Henderson and Marius Paşca, eds. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Stroudsburg: Association for Computational Linguistics, 466-477.
- François, Thomas and Miltsakaki, Eleni (2012): Do NLP and machine learning improve traditional readability formulas? In: Sandra Williams, Advaith Siddharthan and Ani Nenkova, eds. NAACL-HLT 2012 Workshop on Predicting and Improving Text Readability for target reader populations (PITR 2012). Stroudsburg: Association for Computational Linguistics, 49-57.
- Grabowski, Łukasz (2013): Interfacing corpus linguistics and computational stylistics. Translation universals in translational literary Polish. International Journal of Corpus Linguistics. 18(2):254-280.
- Gunning, Robert (1952): The Technique of Clear Writing. New York: McGraw-Hill.
- Harris, Albert and Jacobson, Milton (1974): Revised Harris-Jacobson readability formulas. In: Proceedings of the 18th annual meeting of the College Reading Association, Bethesda, Maryland. Oct. 31 - Nov. 2. Bethesda: College Reading Association.
- Ilisei, Iustina, Inkpen, Diana, Corpas Pastor, Gloria, et al. (2010): Identification of translationese: A machine learning approach. In: Alexander Gelbukh, ed. Computational Linguistics and Intelligent Text Processing. Heidelberg: Springer, 503-511.
- Jarque, Carlos and Bera, Anil (1987): A Test for Normality of Observations and Regression Residuals. International Statistical Review. 55(2):163-172.
- Jiang, Zhiwei, Gu, Qing, Yin, Yafeng, et al. (2018): Enriching word embeddings with domain knowledge for readability assessment. In: Emily M. Bender, Leon Derczynski and Pierre Isabelle, eds. Proceedings of the 27th International Conference on Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 366-378.
- Jakubíček, Miloš, Kilgarriff, Adam, Kovář, Vojtěch, et al. (2013): The TenTen corpus family. In: Andrew Hardie and Robbie Love, eds. 7th International Corpus Linguistics Conference. Lancaster: University Centre for Computer Corpus Research on Language (UCREL), 125-127.
- Kajzer-Wietrzny, Marta (2015): Simplification in interpreting and translation. Across Languages and Cultures. 16(2):233-255.
- Kajzer-Wietrzny, Marta, Ferraresi, Adriano, Bernardini, Silvia, et al., eds. (forthcoming): Empirical investigations into the forms of mediated discourse at the European Parliament. Berlin: Language Science Press.
- Kajzer-Wietrzny, Marta and Ferraresi, Adriano (2019): Guidelines for EPTIC collaborators. Bologna: DIT Lab, University of Bologna.
- Kintsch, Walter and Vipond, Douglas (1979): Reading comprehension and readability in educational practice and psychological theory. In: Lars-Göran Nilsson, ed. Perspectives on Memory Research. Hillsdale: Lawrence Erlbaum, 329-365.
- Kintsch, Walter, Kozminsky, Ely, Streby, William J., et al. (1975): Comprehension and recall of text as a function of content variables. Journal of Verbal Learning and Verbal Behavior. 14(2):196-214.
- Kruger, Haidee and Van Rooy, Bertus (2012): Register and the features of translated language. Across Languages and Cultures. 13(1):33-65.
- Koehn, Philipp (2005): Europarl: A parallel corpus for statistical machine translation. In: Makoto Nagao, ed. MT Summit X. Tokyo: Asia-Pacific Association for Machine Translation, 79-86.
- Kotze, Haidee (2019): Converging what and how to find out why: An outlook on empirical translation studies. In: Lore Vandevoorde, Joke Daems and Bart Defrancq, eds. New Empirical Perspectives on Translation and Interpreting. London: Routledge, 333-371.
- Laviosa, Sara (1998a): Core patterns of lexical use in a comparable corpus of English narrative prose. Meta. 43(4):557-570.
- Laviosa, Sara (1998b): The English Comparable Corpus: a Resource and a Methodology. In: Lynne Bowker, Michael Cronin, Dorothy Kenny, et al., eds. Unity in Diversity? Current Trends in Translation Studies. Manchester: St. Jerome.
- Le, Dieu-Thu, Nguyen, Cam-Tu and Wang, Xiaoliang (2018): Joint learning of frequency and word embeddings for multilingual readability assessment. In: Yuen-Hsien Tseng, Hsin-Hsi Chen, Vincent Ng, et al., eds. Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications. Stroudsburg: Association for Computational Linguistics, 103-107.
- Lee, Hyeran, Gambette, Philippe, Maillé, Elsa, et al. (2010): Densidées: calcul automatique de la densité des idées dans un corpus oral. In: Alexandre Patry, Philippe Langlais and Aurélien Max, eds. Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Rencontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues. Montréal: ATALA, 11-20.
- Levene, Howard (1960): Robust Tests for Equality of Variances. In: Ingram Olkin, ed. Contributions to Probability and Statistics. Palo Alto: Stanford University Press, 364-367.
- Lv, Qianxi and Liang, Junying (2019): Is consecutive interpreting easier than simultaneous interpreting? – a corpus-based study of lexical simplification in interpretation. Perspectives. 27(1):91-106.
- Meyer, Bonnie (1982): Reading research and the composition teacher: The importance of plans. College Composition and Communication. 33(1):37-49.
- Miller, George A. (1995): WordNet: a lexical database for English. Communications of the ACM. 38(11):39-41.
- Pitler, Emily and Nenkova, Ani (2008): Revisiting readability: A unified framework for predicting text quality. In: Mirella Lapata and Hwee Tou Ng, eds. Proceedings of the 2008 conference on empirical methods in natural language processing. Stroudsburg: Association for Computational Linguistics, 186-195.
- Redelinghuys, Karien and Kruger, Haidee (2015): Using the features of translated language to investigate translation expertise: A corpus-based study. International Journal of Corpus Linguistics. 20(3):293-325.
- Revelle, William (2019): Psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston: Northwestern University. R package version 1.9.12. https://CRAN.R-project.org/package=psych.
- Shapiro, Samuel Sanford and Wilk, Martin (1965): An analysis of variance test for normality (complete samples). Biometrika. 52:591-611.
- Shlens, Jonathon (2014): A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.
- Schmid, Helmut (1995): Improvements in Part-of-Speech Tagging with an Application to German. In: Proceedings of the EACL-95 SIGDAT-Workshop: From Text to Tags, 47-50.
- Schwarm, Sarah and Ostendorf, Mari (2005): Reading level assessment using support vector machines and statistical language models. In: Kevin Knight, Hwee Tou Ng and Kemal Oflazer, eds. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). New Brunswick: Association for Computational Linguistics, 523-530.
- Vajjala, Sowmya (2021): Trends, limitations and open challenges in automatic readability assessment research. arXiv preprint arXiv:2105.00973.
- Vajjala, Sowmya and Meurers, Detmar (2012): On improving the accuracy of readability classification using insights from second language acquisition. In: Joel Tetreault, Jill Burstein and Claudia Leacock, eds. Proceedings of the seventh workshop on building educational applications using NLP. Stroudsburg: Association for Computational Linguistics, 163-173.
- Vanderauwera, Ria (1985): Dutch Novels Translated into English: The Transformation of a “Minority” Literature. Amsterdam: Rodopi.
- Volansky, Vered, Ordan, Noam and Wintner, Shuly (2015): On the features of translationese. Digital Scholarship in the Humanities. 30(1):98-118.
- Wilkens, Rodrigo and Todirascu, Amalia (2020): Simplifying Coreference Chains for Dyslexic Children. In: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, et al., eds. Proceedings of the 12th Language Resources and Evaluation Conference. Paris: European Language Resources Association (ELRA), 1142-1151.
- Williams, Donna (2005): Recurrent Features of Translation in Canada. Doctoral dissertation, unpublished. Ottawa: University of Ottawa.
- Zakaluk, Beverley and Samuels, Jay (1996): Issues related to text comprehensibility: The future of readability. Revue québécoise de linguistique. 25(1):41-59.