Abstracts
Abstract
In recent years, the exploitation of large text corpora to solve various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of the difficulties involved and the amount of work required to overcome them.
The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The resulting corpus is open; that is, more material can be added as the need arises.
One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in that language, together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
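The parallel concordancing lookup described above can be illustrated with a minimal Python sketch. It assumes the corpus is already sentence-aligned and held in memory as a list of pairs; the names `AlignedPair` and `concordance` and the three sample sentence pairs are hypothetical and chosen only for illustration. They are not taken from the project's software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-aligned unit of the parallel corpus (hypothetical structure)."""
    english: str
    persian: str


def concordance(corpus, query, side="english"):
    """Return every aligned pair whose sentence on `side` contains `query`.

    Matching is a plain case-insensitive substring test; a real concordancer
    would more likely match on tokens or lemmas.
    """
    if side not in ("english", "persian"):
        raise ValueError("side must be 'english' or 'persian'")
    needle = query.casefold()
    return [pair for pair in corpus if needle in getattr(pair, side).casefold()]


# Illustrative data only, not drawn from the actual corpus.
corpus = [
    AlignedPair("The book is on the table.", "کتاب روی میز است."),
    AlignedPair("I read a book yesterday.", "دیروز یک کتاب خواندم."),
    AlignedPair("The weather is cold.", "هوا سرد است."),
]

# Search on the English side and show each citation with its Persian counterpart.
hits = concordance(corpus, "book")
for pair in hits:
    print(pair.english, "|", pair.persian)
```

Because each hit carries both sides of the alignment, the same function serves searches in either direction by switching the `side` argument.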
Keywords:
- alignment,
- concordancing,
- parallel corpus,
- translation memory
Résumé
Over recent years, the use of large text corpora to solve linguistic problems, particularly translation problems, has become common practice. Until recently, no large-scale bilingual English-Persian corpus had been built, because of the difficulties such an undertaking involves.
This article presents a project carried out to compile corpora of varied digital texts, such as Internet documents, with as little noise as possible. Using the Internet can be regarded as a valuable aid because earlier translations are often already published on the Web. The task consists of finding parallel pages in English and Persian, evaluating the quality of their translation, downloading them, and aligning them. The resulting corpus is an open corpus, that is, one to which new data can be added as needed.
One of the main outcomes of building such a corpus is the development of parallel concordancing software, in which the user could enter a character string in one language and display all the citations for that string in the search language, along with the corresponding sentences in the target language. The next step would be to use this parallel corpus to build general translation software.
The aligned bilingual corpus proves useful in many other cases, including machine translation, word sense disambiguation, cross-language information retrieval, lexicography, and language learning.