Abstracts
Abstract
Chinese, unlike English, is written without marked word boundaries, and Chinese word segmentation is often described as the bottleneck for Chinese-English machine translation. Current word-segmentation systems in machine translation are either linguistically oriented or statistically oriented. Chinese, however, is a pragmatically oriented language, which explains why existing segmentation systems in machine translation have not dealt with it successfully. We conducted a language investigation consisting of two surveys and eight interviews to find out how Chinese readers segment a Chinese sentence into words. Based on its findings, we have developed a new word-segmentation model that aims to address the word-segmentation problem in machine translation from a cognitive perspective.
Keywords:
- Chinese word segmentation
- machine translation
- pragmatically-oriented language
- contextual information
- cognitive model
Résumé
À la différence de l’anglais, la langue chinoise ne marque pas la délimitation entre les mots. C’est pourquoi la segmentation du chinois constitue l’obstacle principal de la traduction automatique vers l’anglais. Actuellement, les méthodes de segmentation en traduction automatique sont soumises à des règles linguistiques ou font appel à des analyses statistiques. Le chinois, toutefois, présente des caractéristiques pragmatiques très fortes, ce qui explique l’échec des stratégies actuelles. Nous avons réalisé une étude constituée de deux enquêtes et de huit entrevues visant à déterminer comment les Chinois segmentent une phrase dans leur langue en situation de lecture. Sur la base des résultats obtenus, nous avons mis au point un nouveau modèle de segmentation lexicale visant à résoudre la question de la segmentation en traduction automatique sous un angle cognitif.
Mots-clés :
- segmentation des mots en chinois
- traduction automatique
- caractère pragmatique de la langue
- information contextuelle
- modèle cognitif