Examining Gender Bias in Multiple Choice Item Formats Violating Item-Writing Guidelines

Year: 2019, Volume 11, Issue 1
Pages: 214-229

Abstract

Multiple-choice items (MCIs) are commonly used in high-stakes testing and classroom assessment because they yield reliable assessment results. However, recent literature has revealed that item-writing guidelines are repeatedly violated when MCIs are written, which can threaten both reliability and validity. Validity is also threatened when items favor certain groups even though the groups' underlying ability is the same; this is called differential item functioning (DIF). This empirical study compares item parameters for MCIs with negatively worded stems and for complex MCIs, two commonly used formats that violate item-writing guidelines, and investigates gender-related DIF associated with these item formats. The results showed that DIF detection methods flagged two complex MCIs favoring male students, which is attributed to the item format and to male students' greater tendency to take risks when answering MCIs.
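
For readers who want to see how items are flagged for DIF in practice, the sketch below illustrates one standard detection approach, logistic regression DIF: nested logistic models test whether group membership (and its interaction with ability) still predicts item responses after conditioning on total score. This is only a minimal Python illustration under assumed inputs, not the analysis pipeline used in this study; the response matrix `responses`, the `gender` indicator, and the 0.05 threshold are hypothetical placeholders.

```python
# Minimal logistic-regression DIF sketch (hypothetical data, not the study's).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_examinees, n_items = 500, 20
responses = rng.integers(0, 2, size=(n_examinees, n_items))  # placeholder 0/1 item scores
gender = rng.integers(0, 2, size=n_examinees)                # placeholder group indicator
total = responses.sum(axis=1)                                # matching criterion (total score)

def dif_test(item, alpha=0.05):
    """Likelihood-ratio tests for uniform and non-uniform DIF on a single item."""
    y = responses[:, item]
    m_base = sm.Logit(y, sm.add_constant(total)).fit(disp=0)            # ability only
    m_group = sm.Logit(y, sm.add_constant(
        np.column_stack([total, gender]))).fit(disp=0)                  # + group
    m_inter = sm.Logit(y, sm.add_constant(
        np.column_stack([total, gender, total * gender]))).fit(disp=0)  # + interaction
    p_uniform = chi2.sf(2 * (m_group.llf - m_base.llf), df=1)      # uniform DIF test
    p_nonuniform = chi2.sf(2 * (m_inter.llf - m_group.llf), df=1)  # non-uniform DIF test
    return p_uniform < alpha, p_nonuniform < alpha

for i in range(n_items):
    uniform, nonuniform = dif_test(i)
    if uniform or nonuniform:
        print(f"Item {i}: uniform DIF={uniform}, non-uniform DIF={nonuniform}")
```

In applied work the matching score is usually purified (recomputed without the studied item) and an effect-size measure is reported alongside the significance tests before an item is flagged as showing DIF.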

Keywords

