Abstract
Many studies have attempted to quantify differences in severity between raters working on the same assessments. Their results show that interindividual differences in severity are often substantial, regardless of the assessment context. However, few studies have modeled the intra-individual evolution of severity over time, and even fewer have compared, over a given period, the intra-individual range of severity to the interindividual one. This paper addresses that gap by comparing the ratios between the intra- and interindividual severity ranges of six raters who, from September 2011 to April 2014, scored the oral expression test of the Test d’évaluation du français adapté au Québec (TEFAQ). These six raters assessed the performance of 4,083 candidates, and their severity was estimated using the many-facet Rasch model. Five rater dyads were followed over five distinct periods, totaling 11 to 38 time points, with severity estimated one to four times per month. For each period, this made it possible to compute an intra-individual and an interindividual severity range; these ranges were then compared to obtain a ratio indicating whether a given rater’s severity fluctuates as much over time as it differs from a peer’s. Results show that, overall, intra-individual differences are as large as interindividual ones, with a median ratio of 0.97, despite the small number of raters included in each model. Finally, the practical implications of these results are discussed, along with the methodological and conceptual limitations of the study.
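The range-ratio computation summarized above can be sketched concretely. The Python snippet below is a minimal illustration under stated assumptions, not the study's actual procedure: it assumes severities are many-facet Rasch estimates in logits, takes the intra-individual range as the spread (max minus min) of one rater's estimates across a period's time points, pools both dyad members' estimates to form the interindividual range, and averages the two intra-individual ranges before dividing. The abstract does not specify the exact operationalization, and the data shown are invented.

```python
# Illustrative sketch (hypothetical data): one way to compute an
# intra-/interindividual severity-range ratio for a rater dyad over
# one period. Severities are assumed to be MFRM estimates in logits.

def severity_range(series):
    """Spread (max minus min) of a severity time series."""
    return max(series) - min(series)

# Hypothetical monthly severity estimates (logits) for two raters.
rater_a = [0.12, 0.35, -0.05, 0.28, 0.10]
rater_b = [-0.20, -0.02, 0.15, -0.10, 0.05]

# Intra-individual range: how much each rater drifts over the period.
intra_a = severity_range(rater_a)
intra_b = severity_range(rater_b)

# One possible interindividual range: the spread of the two raters'
# estimates pooled over the whole period.
inter = severity_range(rater_a + rater_b)

# A ratio near 1 would mean a rater fluctuates over time roughly as
# much as the two raters differ from each other.
ratio = ((intra_a + intra_b) / 2) / inter
print(f"intra A = {intra_a:.2f}, intra B = {intra_b:.2f}, "
      f"inter = {inter:.2f}, ratio = {ratio:.2f}")
```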
Keywords:
- rater severity,
- temporal drift in severity,
- French as a second language (L2),
- rater effects