%0 Journal Article
%J Journal of Educational Measurement
%D 2020
%T Item Calibration Methods With Multiple Subscale Multistage Testing
%A Wang, Chun
%A Chen, Ping
%A Jiang, Shengyu
%K EM
%K marginal maximum likelihood
%K missing data
%K multistage testing
%X Many large-scale educational surveys have moved from linear form designs to multistage testing (MST) designs. One advantage of MST is that it can provide more accurate latent trait (θ) estimates using fewer items than a linear test requires. However, MST generates incomplete response data by design; hence, questions remain as to how to calibrate items using the incomplete data from an MST design. Further complications arise when there are multiple correlated subscales per test and when items from different subscales need to be calibrated on their respective score-reporting metrics. The current calibration-per-subscale method produces biased item parameters, and no available method resolves this challenge. Drawing on missing data principles, we show that when all items are calibrated together, Rubin's ignorability assumption is satisfied, so traditional single-group calibration is sufficient. When items are calibrated per subscale, we propose a simple modification to the current calibration-per-subscale method that helps reinstate the missing-at-random assumption and therefore corrects the estimation bias that otherwise arises. Three mainstream calibration methods are discussed in the context of MST: marginal maximum likelihood estimation, the expectation-maximization (EM) method, and fixed-parameter calibration. An extensive simulation study is conducted, and a real data example from NAEP is analyzed to provide convincing empirical evidence.
%B Journal of Educational Measurement
%V 57
%P 3-28
%U https://onlinelibrary.wiley.com/doi/abs/10.1111/jedm.12241
%R 10.1111/jedm.12241

%0 Journal Article
%J Journal of Educational Measurement
%D 2014
%T An Enhanced Approach to Combine Item Response Theory With Cognitive Diagnosis in Adaptive Testing
%A Wang, Chun
%A Zheng, Chanjin
%A Chang, Hua-Hua
%X

Computerized adaptive testing offers the possibility of gaining information on both the overall ability and the cognitive profile in a single assessment administration. Several algorithms aimed at these dual purposes have been proposed, including the shadow test approach, the dual information method (DIM), and the constraint weighted method. The current study proposed two new methods, the aggregate ranked information index (ARI) and the aggregate standardized information index (ASI), which address the noncompatibility issue inherent in the original DIM. More flexible weighting schemes that put different emphases on information about general ability (i.e., θ in item response theory) and information about the cognitive profile (i.e., α in cognitive diagnostic modeling) were also explored. Two simulation studies were carried out to investigate the effectiveness of the new methods and weighting schemes. Results showed that the new methods with the flexible weighting schemes could produce more accurate estimation of both overall ability and cognitive profile than the original DIM. Among them, the ASI with both empirical and theoretical weights is recommended, and an attribute-level weighting scheme is preferred if some attributes are considered more important from a substantive perspective.
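
The abstract does not give the ARI and ASI formulas; the sketch below is only a rough illustration of the rank- and z-score-aggregation idea under assumed inputs (per-item information about θ and about α supplied by the caller), with function names and the weight w invented for this example rather than taken from the article.

    import numpy as np

    def aggregate_standardized_index(theta_info, alpha_info, w=0.5):
        """ASI-style illustration: z-standardize each information measure over
        the candidate pool so the two parts are on a comparable scale, then
        combine them with weight w on the theta (overall ability) part."""
        theta_info = np.asarray(theta_info, dtype=float)
        alpha_info = np.asarray(alpha_info, dtype=float)
        z_theta = (theta_info - theta_info.mean()) / theta_info.std()
        z_alpha = (alpha_info - alpha_info.mean()) / alpha_info.std()
        return w * z_theta + (1.0 - w) * z_alpha

    def aggregate_ranked_index(theta_info, alpha_info, w=0.5):
        """ARI-style illustration: replace raw values by their within-pool
        ranks before combining, which removes scale differences entirely."""
        rank = lambda x: np.argsort(np.argsort(x)) + 1
        return w * rank(theta_info) + (1.0 - w) * rank(alpha_info)

    # Hypothetical usage: score the remaining items and administer the best one.
    # next_item = int(np.argmax(aggregate_standardized_index(theta_info, alpha_info, w=0.6)))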

%B Journal of Educational Measurement
%V 51
%P 358–380
%U http://dx.doi.org/10.1111/jedm.12057
%R 10.1111/jedm.12057

%0 Journal Article
%J Journal of Educational and Behavioral Statistics
%D 2014
%T Improving Measurement Precision of Hierarchical Latent Traits Using Adaptive Testing
%A Wang, Chun
%X

Many latent traits in the social sciences display a hierarchical structure, such as intelligence, cognitive ability, or personality. Usually a second-order factor is linearly related to a group of first-order factors (also called domain abilities in cognitive ability measures), and the first-order factors directly govern the actual item responses. Because only a subtest of items is used to measure each domain, the lack of sufficient reliability is the primary impediment to generating and reporting domain abilities. In recent years, several item response theory (IRT) models have been proposed to account for hierarchical factor structures, and these models have also been shown to alleviate the low reliability issue by using in-test collateral information to improve measurement precision. This article advocates using adaptive item selection together with a higher-order IRT model to further increase the reliability of hierarchical latent trait estimation. Two item selection algorithms are proposed: the constrained D-optimal method and the sequencing domain method. Both are shown to yield improved measurement precision compared to unidimensional item selection (treating each dimension separately). The improvement is more prominent when the test length is short and when the correlation between dimensions is high (e.g., higher than .64). Moreover, two reliability indices for hierarchical latent traits are discussed, and their use for quantifying the reliability of hierarchical traits measured by adaptive testing is demonstrated.
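
The abstract describes the higher-order structure only in words; in generic notation (assumed here for illustration, not quoted from the article), the linear second-order relation it refers to can be written as

    \theta_{jd} = \lambda_d \, \theta_j^{(2)} + \varepsilon_{jd}, \qquad \varepsilon_{jd} \sim N(0,\, 1 - \lambda_d^2),

where \theta_j^{(2)} is examinee j's second-order (general) trait, \theta_{jd} is the first-order trait for domain d that directly governs the item responses, and \lambda_d is the loading of domain d on the general factor. Under this parameterization, a between-domain correlation higher than .64 corresponds to loadings above .8, since the implied correlation between domains d and d' is \lambda_d \lambda_{d'}.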

%B Journal of Educational and Behavioral Statistics
%V 39
%P 452-477
%U http://jeb.sagepub.com/cgi/content/abstract/39/6/452
%R 10.3102/1076998614559419

%0 Journal Article
%J Applied Psychological Measurement
%D 2013
%T Deriving Stopping Rules for Multidimensional Computerized Adaptive Testing
%A Wang, Chun
%A Chang, Hua-Hua
%A Boughton, Keith A.
%X

Multidimensional computerized adaptive testing (MCAT) is able to provide a vector of ability estimates for each examinee, which could be used to construct a more informative profile of an examinee’s performance. The current literature on MCAT focuses on fixed-length tests, which can yield less accurate results for examinees whose abilities differ markedly from the average difficulty of the item bank, especially when the bank contains only a limited number of items. Therefore, instead of stopping the test at a predetermined fixed length, the authors use a more informative stopping criterion that is directly related to measurement accuracy. Specifically, this research derives four stopping rules that either quantify the measurement precision of the ability vector (i.e., the minimum determinant rule [D-rule], minimum eigenvalue rule [E-rule], and maximum trace rule [T-rule]) or quantify the amount of available information carried by each item (i.e., the maximum Kullback–Leibler divergence rule [K-rule]). The simulation results showed that all four stopping rules successfully terminated the test once the mean squared error of ability estimation fell within a desired range, regardless of examinees’ true abilities. It was found that when using the D-, E-, or T-rule, examinees with extreme abilities tended to receive tests about twice as long as those received by examinees with moderate abilities. However, the test length difference with the K-rule is not as dramatic, indicating that the K-rule may not be very sensitive to measurement precision. In all cases, the cutoff value for each stopping rule needs to be adjusted on a case-by-case basis to find an optimal solution.
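
The abstract names the precision-based rules but not their cutoffs; a minimal sketch of how D-, E-, and T-type criteria could be checked against the Fisher information matrix accumulated over the administered items follows. The threshold values, function name, and the assumption that the caller supplies the accumulated information matrix are illustrative, not taken from the paper.

    import numpy as np

    def precision_stop(info_matrix, det_min=None, eig_min=None, trace_min=None):
        """Check a precision-based stopping criterion on the Fisher information
        matrix accumulated over the items administered so far.
        Returns True when the chosen criterion is satisfied."""
        info_matrix = np.asarray(info_matrix, dtype=float)
        if det_min is not None:       # D-rule: determinant has reached the required value
            return np.linalg.det(info_matrix) >= det_min
        if eig_min is not None:       # E-rule: smallest eigenvalue has reached the required value
            return np.linalg.eigvalsh(info_matrix).min() >= eig_min
        if trace_min is not None:     # T-rule: trace has reached the required value
            return np.trace(info_matrix) >= trace_min
        return False

    # Hypothetical usage inside an MCAT loop, with an illustrative cutoff of 50:
    # if precision_stop(total_info, det_min=50.0): break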

%B Applied Psychological Measurement
%V 37
%P 99-122
%U http://apm.sagepub.com/content/37/2/99.abstract
%R 10.1177/0146621612463422

%0 Journal Article
%J Educational and Psychological Measurement
%D 2013
%T Mutual Information Item Selection Method in Cognitive Diagnostic Computerized Adaptive Testing With Short Test Length
%A Wang, Chun
%X

Cognitive diagnostic computerized adaptive testing (CD-CAT) purports to combine the strengths of both CAT and cognitive diagnosis. Cognitive diagnosis models aim at classifying examinees into the correct mastery profile group so as to pinpoint the strengths and weaknesses of each examinee, whereas CAT algorithms choose items to determine those strengths and weaknesses as efficiently as possible. Most existing CD-CAT item selection algorithms are evaluated when test length is relatively long, whereas several applications of CD-CAT, such as interim assessment, require an item selection algorithm that can accurately recover examinees’ mastery profiles with a short test length. In this article, we introduce the mutual information item selection method in the context of CD-CAT and then provide a computationally easier formula that makes the method feasible in real time. Mutual information is then evaluated against common item selection methods, such as Kullback–Leibler information, posterior-weighted Kullback–Leibler information, and Shannon entropy. Based on our simulations, mutual information results in nearly the highest attribute and pattern recovery rates in more than half of the conditions. We conclude by discussing how the number of attributes, Q-matrix structure, correlations among the attributes, and item quality affect estimation accuracy.
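
The computationally easier formula from the article is not reproduced in the abstract; the sketch below simply computes the mutual information between a candidate item's dichotomous response and the attribute pattern from its definition. The interface (a posterior over attribute patterns and the item's correct-response probabilities per pattern) is an assumption for illustration, not the paper's code.

    import numpy as np

    def mutual_information(posterior, p_correct, eps=1e-12):
        """Mutual information I(X; alpha) between a dichotomous item response X
        and the attribute pattern alpha.
        posterior: current posterior probabilities over the attribute patterns.
        p_correct: P(X = 1 | alpha) for the candidate item, one entry per pattern."""
        posterior = np.asarray(posterior, dtype=float)
        p_correct = np.asarray(p_correct, dtype=float)
        mi = 0.0
        for x_prob in (p_correct, 1.0 - p_correct):   # response X = 1, then X = 0
            marginal = np.sum(posterior * x_prob)     # P(X = x)
            mi += np.sum(posterior * x_prob * np.log((x_prob + eps) / (marginal + eps)))
        return mi

    # Hypothetical usage: score every remaining item and administer the best one.
    # scores = [mutual_information(posterior, item_p[j]) for j in remaining_items]
    # next_item = remaining_items[int(np.argmax(scores))]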

%B Educational and Psychological Measurement
%V 73
%P 1017-1035
%U http://epm.sagepub.com/content/73/6/1017.abstract
%R 10.1177/0013164413498256

%0 Journal Article
%J Journal of Educational and Behavioral Statistics
%D 2013
%T A Semiparametric Model for Jointly Analyzing Response Times and Accuracy in Computerized Testing
%A Wang, Chun
%A Fan, Zhewen
%A Chang, Hua-Hua
%A Douglas, Jeffrey A.
%X

The item response times (RTs) collected from computerized testing represent an underutilized type of information about items and examinees. In addition to knowing the examinees’ responses to each item, we can investigate the amount of time examinees spend on each item. Current models for RTs are mainly parametric; they have the advantage of conciseness but may lack the flexibility to fit real data. We propose a semiparametric approach, specifically a Cox proportional hazards model with a latent speed covariate for the RTs, embedded within the hierarchical framework proposed by van der Linden to model RTs and response accuracy simultaneously. This semiparametric approach combines the flexibility of nonparametric modeling with the brevity and interpretability of parametric modeling. A Markov chain Monte Carlo method for parameter estimation is given and may be used with sparse data obtained by computerized adaptive testing. Both simulation studies and real data analysis are carried out to demonstrate the applicability of the new model.
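
The abstract states the model only verbally; one plausible way to write the proportional hazards component with a latent speed covariate (notation and sign convention are assumptions, not quoted from the article) is

    h_{ij}(t) = h_{0i}(t)\,\exp(\tau_j),

where h_{0i}(t) is the unspecified (nonparametric) baseline hazard of item i and \tau_j is examinee j's latent speed, so a larger \tau_j raises the hazard and shortens the expected response time. In van der Linden's hierarchical framework, the speed parameter \tau_j and the accuracy parameter \theta_j are then linked through a joint (e.g., bivariate normal) person-level distribution.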

%B Journal of Educational and Behavioral Statistics
%V 38
%P 381-417
%U http://jeb.sagepub.com/cgi/content/abstract/38/4/381
%R 10.3102/1076998612461831

%0 Journal Article
%J Journal of Educational Measurement
%D 2011
%T Restrictive Stochastic Item Selection Methods in Cognitive Diagnostic Computerized Adaptive Testing
%A Wang, Chun
%A Chang, Hua-Hua
%A Huebner, Alan
%X

This paper proposes two new item selection methods for cognitive diagnostic computerized adaptive testing: the restrictive progressive method and the restrictive threshold method. They are built upon the posterior weighted Kullback-Leibler (KL) information index but include additional stochastic components either in the item selection index or in the item selection procedure. Simulation studies show that both methods are successful at simultaneously suppressing overexposed items and increasing the usage of underexposed items. Compared to item selection based upon (1) pure KL information and (2) the Sympson-Hetter method, the two new methods strike a better balance between item exposure control and measurement accuracy. The two new methods are also compared with Barrada et al.'s (2008) progressive method and proportional method.
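
The exact restrictive indices are not given in the abstract; as a rough sketch of the progressive-style idea they build on, the code below blends a random component with the posterior-weighted KL information and screens out overexposed items. The weighting scheme, the exposure cap r_max, and the function name are illustrative assumptions, not the authors' formulas.

    import numpy as np

    def progressive_index(kl_info, items_given, test_length, exposure_rate,
                          r_max=0.2, rng=None):
        """Illustrative progressive-style selection index with an exposure restriction.
        kl_info: posterior-weighted KL information for each candidate item.
        items_given: number of items already administered in this test.
        exposure_rate: current exposure rate of each candidate item."""
        rng = np.random.default_rng() if rng is None else rng
        kl_info = np.asarray(kl_info, dtype=float)
        s = items_given / test_length                   # weight shifts from random to KL
        random_part = rng.uniform(0.0, kl_info.max(), size=kl_info.shape)
        index = (1.0 - s) * random_part + s * kl_info
        index[np.asarray(exposure_rate) >= r_max] = -np.inf   # exclude overexposed items
        return index

    # Hypothetical usage: administer the candidate item with the largest index.
    # next_item = int(np.argmax(progressive_index(kl_info, k, L, exposure)))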

%B Journal of Educational Measurement
%V 48
%P 255–273
%U http://dx.doi.org/10.1111/j.1745-3984.2011.00145.x
%R 10.1111/j.1745-3984.2011.00145.x