TY - CONF T1 - Is CAT Suitable for Automated Speaking Test? T2 - IACAT 2017 Conference Y1 - 2017 A1 - Shingo Imai KW - Automated Speaking Test KW - CAT KW - language testing AB -

We have developed an automated scoring system for Japanese speaking proficiency, SJ-CAT (Speaking Japanese Computerized Adaptive Test), which has been operational for the last few months. One of the unique features of the test is that it is adaptive, based on polytomous IRT.

SJ-CAT consists of two sections: Section 1 has sentence reading-aloud tasks and multiple-choice reading tasks, and Section 2 has sentence generation tasks and open-answer tasks. In a reading-aloud task, a test taker reads a phoneme-balanced sentence on the screen after listening to a model reading. In a multiple-choice reading task, a test taker sees a picture and reads aloud the one sentence, among three on the screen, that describes the scene most appropriately. In a sentence generation task, a test taker sees a picture or watches a video clip and describes the scene in his/her own words for about ten seconds. In an open-answer task, the test taker expresses support for or opposition to an issue, e.g., nuclear power generation, giving reasons, for about 30 seconds.

In the course of developing the test, we found many unexpected and unique characteristics of speaking CAT that are not found in usual multiple-choice CATs. In this presentation, we will discuss some of these factors, which went unnoticed in our previous project of developing the dichotomous J-CAT (Japanese Computerized Adaptive Test), which consists of vocabulary, grammar, reading, and listening sections. Firstly, we will claim that the distribution of item difficulty parameters depends on the types of items. For an item pool with unrestricted item types such as open questions, it is difficult to achieve an ideal distribution, whether normal or uniform. Secondly, contrary to our expectations, open questions are not necessarily more difficult to operate in an automated scoring system than more restricted questions such as sentence reading, as long as one can set up a suitable algorithm for open-question scoring. Thirdly, we will show that the standard deviation of the posterior distribution, or the standard error of the theta parameter, converges faster in the polytomous IRT used for SJ-CAT than in the dichotomous IRT used in J-CAT. Fourthly, we will discuss problems in equating items in SJ-CAT, and suggest introducing deep learning with reinforcement learning instead of equating. And finally, we will discuss issues in operating SJ-CAT on the web, including speed of scoring, operation costs, and security, among others.


JF - IACAT 2017 Conference PB - Niigata Seiryo University CY - Niigata, Japan ER - TY - JOUR T1 - A Comparison of Four Methods for Obtaining Information Functions for Scores From Computerized Adaptive Tests With Normally Distributed Item Difficulties and Discriminations JF - Journal of Computerized Adaptive Testing Y1 - 2013 A1 - Ito, K. A1 - Segall, D.O. VL - 1 IS - 5 ER - TY - CHAP T1 - An evaluation of a new procedure for computing information functions for Bayesian scores from computerized adaptive tests Y1 - 2009 A1 - Ito, K. A1 - Pommerich, M A1 - Segall, D. CY - D. J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. N1 - {PDF file, 571 KB} ER - TY - CHAP T1 - Features of J-CAT (Japanese Computerized Adaptive Test) Y1 - 2009 A1 - Imai, S. A1 - Ito, S. A1 - Nakamura, Y. A1 - Kikuchi, K. A1 - Akagi, Y. A1 - Nakasono, H. A1 - Honda, A. A1 - Hiramura, T. CY - D. J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. N1 - {PDF File, 655KB} ER - TY - JOUR T1 - A Strategy for Controlling Item Exposure in Multidimensional Computerized Adaptive Testing JF - Educational and Psychological Measurement Y1 - 2008 A1 - Lee, Yi-Hsuan A1 - Ip, Edward H. A1 - Fuh, Cheng-Der AB -

Although computerized adaptive tests have enjoyed tremendous growth, solutions for important problems remain unavailable. One problem is the control of item exposure rate. Because adaptive algorithms are designed to select optimal items, they choose items with high discriminating power. Thus, these items are selected more often than others, leading to both overexposure and underutilization of some parts of the item pool. Overused items are often compromised, creating a security problem that could threaten the validity of a test. Building on a previously proposed stratification scheme to control the exposure rate for one-dimensional tests, the authors extend their method to multidimensional tests. A strategy is proposed based on stratification in accordance with a functional of the vector of the discrimination parameter, which can be implemented with minimal computational overhead. Both theoretical and empirical validation studies are provided. Empirical results indicate significant improvement over the commonly used method of controlling exposure rate that requires only a reasonable sacrifice in efficiency.

VL - 68 UR - http://epm.sagepub.com/content/68/2/215.abstract ER - TY - JOUR T1 - Using computerized adaptive testing to reduce the burden of mental health assessment JF - Psychiatric Services Y1 - 2008 A1 - Gibbons, R. D. A1 - Weiss, D. J. A1 - Kupfer, D. J. A1 - Frank, E. A1 - Fagiolini, A. A1 - Grochocinski, V. J. A1 - Bhaumik, D. K. A1 - Stover, A. A1 - Bock, R. D. A1 - Immekus, J. C. KW - *Diagnosis, Computer-Assisted KW - *Questionnaires KW - Adolescent KW - Adult KW - Aged KW - Agoraphobia/diagnosis KW - Anxiety Disorders/diagnosis KW - Bipolar Disorder/diagnosis KW - Female KW - Humans KW - Male KW - Mental Disorders/*diagnosis KW - Middle Aged KW - Mood Disorders/diagnosis KW - Obsessive-Compulsive Disorder/diagnosis KW - Panic Disorder/diagnosis KW - Phobic Disorders/diagnosis KW - Reproducibility of Results KW - Time Factors AB - OBJECTIVE: This study investigated the combination of item response theory and computerized adaptive testing (CAT) for psychiatric measurement as a means of reducing the burden of research and clinical assessments. METHODS: Data were from 800 participants in outpatient treatment for a mood or anxiety disorder; they completed 616 items of the 626-item Mood and Anxiety Spectrum Scales (MASS) at two times. The first administration was used to design and evaluate a CAT version of the MASS by using post hoc simulation. The second confirmed the functioning of CAT in live testing. RESULTS: Tests of competing models based on item response theory supported the scale's bifactor structure, consisting of a primary dimension and four group factors (mood, panic-agoraphobia, obsessive-compulsive, and social phobia). Both simulated and live CAT showed a 95% average reduction (585 items) in items administered (24 and 30 items, respectively) compared with administration of the full MASS. The correlation between scores on the full MASS and the CAT version was .93. 
For the mood disorder subscale, differences in scores between two groups of depressed patients--one with bipolar disorder and one without--on the full scale and on the CAT showed effect sizes of .63 (p<.003) and 1.19 (p<.001) standard deviation units, respectively, indicating better discriminant validity for CAT. CONCLUSIONS: Instead of using small fixed-length tests, clinicians can create item banks with a large item pool, and a small set of the items most relevant for a given individual can be administered with no loss of information, yielding a dramatic reduction in administration time and patient and clinician burden. VL - 59 SN - 1075-2730 (Print) N1 - Gibbons, Robert D; Weiss, David J; Kupfer, David J; Frank, Ellen; Fagiolini, Andrea; Grochocinski, Victoria J; Bhaumik, Dulal K; Stover, Angela; Bock, R Darrell; Immekus, Jason C. R01-MH-30915/MH/United States NIMH; R01-MH-66302/MH/United States NIMH. Research Support, N.I.H., Extramural. United States. Psychiatric services (Washington, D.C.). Psychiatr Serv. 2008 Apr;59(4):361-8. ER - TY - CHAP T1 - Patient-reported outcomes measurement and computerized adaptive testing: An application of post-hoc simulation to a diagnostic screening instrument Y1 - 2007 A1 - Immekus, J. C. A1 - Gibbons, R. D. A1 - Rush, J. A. CY - D. J. Weiss (Ed.). Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing. N1 - {PDF file, 203 KB} ER - TY - JOUR T1 - Toward efficient and comprehensive measurement of the alcohol problems continuum in college students: The Brief Young Adult Alcohol Consequences Questionnaire JF - Alcoholism: Clinical & Experimental Research Y1 - 2005 A1 - Kahler, C. W. A1 - Strong, D. R. A1 - Read, J. P. A1 - De Boeck, P. A1 - Wilson, M. A1 - Acton, G. S. A1 - Palfai, T. P. A1 - Wood, M. D. A1 - Mehta, P. D. A1 - Neale, M. C. A1 - Flay, B. R. A1 - Conklin, C. A. A1 - Clayton, R. R. A1 - Tiffany, S. T. A1 - Shiffman, S. A1 - Krueger, R. F. A1 - Nichol, P. E. A1 - Hicks, B. M. A1 - Markon, K. E. A1 - Patrick, C. J.
A1 - Iacono, William G. A1 - McGue, Matt A1 - Langenbucher, J. W. A1 - Labouvie, E. A1 - Martin, C. S. A1 - Sanjuan, P. M. A1 - Bavly, L. A1 - Kirisci, L. A1 - Chung, T. A1 - Vanyukov, M. A1 - Dunn, M. A1 - Tarter, R. A1 - Handel, R. W. A1 - Ben-Porath, Y. S. A1 - Watt, M. KW - Psychometrics KW - Substance-Related Disorders AB - Background: Although a number of measures of alcohol problems in college students have been studied, the psychometric development and validation of these scales have been limited, for the most part, to methods based on classical test theory. In this study, we conducted analyses based on item response theory to select a set of items for measuring the alcohol problem severity continuum in college students that balances comprehensiveness and efficiency and is free from significant gender bias. Method: We conducted Rasch model analyses of responses to the 48-item Young Adult Alcohol Consequences Questionnaire by 164 male and 176 female college students who drank on at least a weekly basis. An iterative process using item fit statistics, item severities, item discrimination parameters, model residuals, and analysis of differential item functioning by gender was used to pare the items down to those that best fit a Rasch model and that were most efficient in discriminating among levels of alcohol problems in the sample. Results: The process of iterative Rasch model analyses resulted in a final 24-item scale with the data fitting the unidimensional Rasch model very well.
The scale showed excellent distributional properties, had items adequately matched to the severity of alcohol problems in the sample, covered a full range of problem severity, and appeared highly efficient in retaining all of the meaningful variance captured by the original set of 48 items. Conclusions: The use of Rasch model analyses to inform item selection produced a final scale that, in both its comprehensiveness and its efficiency, should be a useful tool for researchers studying alcohol problems in college students. To aid interpretation of raw scores, examples of the types of alcohol problems that are likely to be experienced across a range of selected scores are provided. (C) 2005 Research Society on Alcoholism. An important, sometimes controversial feature of all psychological phenomena is whether they are categorical or dimensional. A conceptual and psychometric framework is described for distinguishing whether the latent structure behind manifest categories (e.g., psychiatric diagnoses, attitude groups, or stages of development) is category-like or dimension-like. Being dimension-like requires (a) within-category heterogeneity and (b) between-category quantitative differences. Being category-like requires (a) within-category homogeneity and (b) between-category qualitative differences. The relation between this classification and abrupt versus smooth differences is discussed. Hybrid structures are possible. Being category-like is itself a matter of degree; the authors offer a formalized framework to determine this degree. Empirical applications to personality disorders, attitudes toward capital punishment, and stages of cognitive development illustrate the approach. (C) 2005 by the American Psychological Association. The authors conducted Rasch model (G. Rasch, 1960) analyses of items from the Young Adult Alcohol Problems Screening Test (YAAPST; S. C. Hurlbut & K. J.
Sher, 1992) to examine the relative severity and ordering of alcohol problems in 806 college students. Items appeared to measure a single dimension of alcohol problem severity, covering a broad range of the latent continuum. Items fit the Rasch model well, with less severe symptoms reliably preceding more severe symptoms in a potential progression toward increasing levels of problem severity. However, certain items did not index problem severity consistently across demographic subgroups. A shortened, alternative version of the YAAPST is proposed, and a norm table is provided that allows for a linking of total YAAPST scores to expected symptom expression. (C) 2004 by the American Psychological Association. A didactic on latent growth curve modeling for ordinal outcomes is presented. The conceptual aspects of modeling growth with ordinal variables and the notion of threshold invariance are illustrated graphically using a hypothetical example. The ordinal growth model is described in terms of 3 nested models: (a) multivariate normality of the underlying continuous latent variables (yt) and its relationship with the observed ordinal response pattern (Yt), (b) threshold invariance over time, and (c) growth model for the continuous latent variable on a common scale. Algebraic implications of the model restrictions are derived, and practical aspects of fitting ordinal growth models are discussed with the help of an empirical example and Mx script (M. C. Neale, S. M. Boker, G. Xie, & H. H. Maes, 1999). The necessary conditions for the identification of growth models with ordinal data and the methodological implications of the model of threshold invariance are discussed. (C) 2004 by the American Psychological Association. Recent research points toward the viability of conceptualizing alcohol problems as arrayed along a continuum.
Nevertheless, modern statistical techniques designed to scale multiple problems along a continuum (latent trait modeling; LTM) have rarely been applied to alcohol problems. This study applies LTM methods to data on 110 problems reported during in-person interviews of 1,348 middle-aged men (mean age = 43) from the general population. The results revealed a continuum of severity linking the 110 problems, ranging from heavy and abusive drinking, through tolerance and withdrawal, to serious complications of alcoholism. These results indicate that alcohol problems can be arrayed along a dimension of severity and emphasize the relevance of LTM to informing the conceptualization and assessment of alcohol problems. (C) 2004 by the American Psychological Association. Item response theory (IRT) is supplanting classical test theory as the basis for measures development. This study demonstrated the utility of IRT for evaluating DSM-IV diagnostic criteria. Data on alcohol, cannabis, and cocaine symptoms from 372 adult clinical participants interviewed with the Composite International Diagnostic Interview-Expanded Substance Abuse Module (CIDI-SAM) were analyzed with Mplus (B. Muthen & L. Muthen, 1998) and MULTILOG (D. Thissen, 1991) software. Tolerance and legal problems criteria were dropped because of poor fit with a unidimensional model. Item response curves, test information curves, and testing of variously constrained models suggested that DSM-IV criteria in the CIDI-SAM discriminate between only impaired and less impaired cases and may not be useful to scale case severity. IRT can be used to study the construct validity of DSM-IV diagnoses and to identify diagnostic criteria with poor performance. (C) 2004 by the American Psychological Association. This study examined the psychometric characteristics of an index of substance use involvement using item response theory.
The sample consisted of 292 men and 140 women who qualified for a Diagnostic and Statistical Manual of Mental Disorders (3rd ed., rev.; American Psychiatric Association, 1987) substance use disorder (SUD) diagnosis and 293 men and 445 women who did not qualify for a SUD diagnosis. The results indicated that men had a higher probability of endorsing substance use compared with women. The index significantly predicted health, psychiatric, and psychosocial disturbances as well as level of substance use behavior and severity of SUD after a 2-year follow-up. Finally, this index is a reliable and useful prognostic indicator of the risk for SUD and the medical and psychosocial sequelae of drug consumption. (C) 2002 by the American Psychological Association. Comparability, validity, and impact of loss of information of a computerized adaptive administration of the Minnesota Multiphasic Personality Inventory-2 (MMPI-2) were assessed in a sample of 140 Veterans Affairs hospital patients. The countdown method (Butcher, Keller, & Bacon, 1985) was used to adaptively administer Scales L (Lie) and F (Frequency), the 10 clinical scales, and the 15 content scales. Participants completed the MMPI-2 twice, in 1 of 2 conditions: computerized conventional test-retest, or computerized conventional-computerized adaptive. Mean profiles and test-retest correlations across modalities were comparable. Correlations between MMPI-2 scales and criterion measures supported the validity of the countdown method, although some attenuation of validity was suggested for certain health-related items. Loss of information incurred with this mode of adaptive testing has minimal impact on test validity. Item and time savings were substantial. (C) 1999 by the American Psychological Association VL - 29 N1 - Miscellaneous Article ER - TY - JOUR T1 - Can examinees use judgments of item difficulty to improve proficiency estimates on computerized adaptive vocabulary tests?
JF - Journal of Educational Measurement Y1 - 2002 A1 - Vispoel, W. P. A1 - Clough, S. J. A1 - Bleiler, T. A1 - Hendrickson, A. B. A1 - Ihrig, D. VL - 39 ER - TY - ABST T1 - A strategy for controlling item exposure in multidimensional computerized adaptive testing Y1 - 2002 A1 - Lee, Y. H. A1 - Ip, E.H. A1 - Fuh, C.D. CY - Available from http://www3. tat.sinica.edu.tw/library/c_tec_rep/c-2002-11.pdf ER - TY - CONF T1 - Can examinees use judgments of item difficulty to improve proficiency estimates on computerized adaptive vocabulary tests? T2 - Paper presented at the annual meeting of the National Council on Measurement in Education Y1 - 2001 A1 - Vispoel, W. P. A1 - Clough, S. J. A1 - Bleiler, T. A1 - Hendrickson, A. B. A1 - Ihrig, D. JF - Paper presented at the annual meeting of the National Council on Measurement in Education CY - Seattle WA N1 - #VI01-01 ER - TY - CONF T1 - Limiting answer review and change on computerized adaptive vocabulary tests: Psychometric and attitudinal results T2 - Paper presented at the annual meeting of the National Council on Measurement in Education Y1 - 1999 A1 - Vispoel, W. P. A1 - Hendrickson, A. A1 - Bleiler, T. A1 - Widiatmo, H. A1 - Shrairi, S. A1 - Ihrig, D. JF - Paper presented at the annual meeting of the National Council on Measurement in Education CY - Montreal, Canada N1 - #VI99-01 ER - TY - CONF T1 - Study of methods to detect aberrant response patterns in computerized testing T2 - Paper presented at the annual meeting of the National Council on Measurement in Education Y1 - 1999 A1 - Iwamoto, C. K. A1 - Nungester, R. J. A1 - Luecht, RM JF - Paper presented at the annual meeting of the National Council on Measurement in Education CY - Montreal, Canada ER - TY - CONF T1 - Estimation of item difficulty from restricted CAT calibration samples T2 - Paper presented at the annual conference of the National Council on Measurement in Education in San Francisco. Y1 - 1995 A1 - Sykes, R. A1 - Ito, K.
JF - Paper presented at the annual conference of the National Council on Measurement in Education in San Francisco. ER - TY - CONF T1 - The effect of restricting ability distributions in the estimation of item difficulties: Implications for a CAT implementation T2 - Paper presented at the annual meeting of the National Council on Measurement in Education Y1 - 1994 A1 - Ito, K. A1 - Sykes, R.C. JF - Paper presented at the annual meeting of the National Council on Measurement in Education CY - New Orleans ER - TY - ABST T1 - The four generations of computerized educational measurement (Research Report 98-35) Y1 - 1988 A1 - Bunderson, C. V. A1 - Inouye, D. K. A1 - Olsen, J. B. CY - Princeton NJ: Educational Testing Service. ER - TY - CHAP T1 - The four generations of computerized educational measurement Y1 - 1986 A1 - Bunderson, C. V. A1 - Inouye, D. K. A1 - Olsen, J. B. CY - In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 367-407). New York: Macmillan. ER - TY - BOOK T1 - An application of the Rasch one-parameter logistic model to individual intelligence testing in a tailored testing environment Y1 - 1977 A1 - Ireland, C. M. CY - Dissertation Abstracts International, 37 (9-A), 5766 ER -