morse et al 2012

25
Article Applied Psychological Measurement 36(2) 122–146 Ó The Author(s) 2012 Reprints and permission: sagepub.com/journalsPermissions.nav DOI: 10.1177/0146621612438725 http://apm.sagepub.com Using the Graded Response Model to Control Spurious Interactions in Moderated Multiple Regression Brendan J. Morse 1 , George A. Johanson 2 , and Rodger W. Griffeth 2 Abstract Recent simulation research has demonstrated that using simple raw score to operationalize a latent construct can result in inflated Type I error rates for the interaction term of a mod- erated statistical model when the interaction (or lack thereof) is proposed at the latent vari- able level. Rescaling the scores using an appropriate item response theory (IRT) model can mitigate this effect under similar conditions. However, this work has thus far been limited to dichotomous data. The purpose of this study was to extend this investigation to multicate- gory (polytomous) data using the graded response model (GRM). Consistent with previous studies, inflated Type I error rates were observed under some conditions when polytomous number-correct scores were used, and were mitigated when the data were rescaled with the GRM. These results support the proposition that IRT-derived scores are more robust to spurious interaction effects in moderated statistical models than simple raw scores under certain conditions. Keywords graded response model, item response theory, polytomous models, simulation Operationalizing a latent construct such as an attitude or ability is a common practice in psy- chological research. Stine (1989) described this process as the creation of a mathematical structure (scores) that represents the empirical structure (construct) of interest. Typically, researchers will use simple raw scores (e.g., either as a sum or a mean) from a scale or test as the mathematical structure for a latent construct. However, much debate regarding the properties of such scores has ensued since S. S. Stevens’s classic publication of the nominal, ordinal, interval, and ratio scales of measurement (Stevens, 1946). Although it is beyond the scope of this article to enter the scale of measurement foray, an often agreed-on position is that simple raw scores for latent constructs do not exceed an ordinal scale of measurement. This scale imbues such scores with limited mathematical properties and permissible 1 Bridgewater State University, MA, USA 2 Ohio University, Athens, USA Corresponding author: Brendan J. Morse, Department of Psychology, Bridgewater State University, 90 Burrill Avenue, 340 Hart Hall, Bridgewater, MA 02325, USA Email: [email protected]

Upload: brendan-morse

Post on 19-Jan-2017

64 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Morse et al 2012

Article

Applied Psychological Measurement36(2) 122–146

� The Author(s) 2012Reprints and permission:

sagepub.com/journalsPermissions.navDOI: 10.1177/0146621612438725

http://apm.sagepub.com

Using the Graded ResponseModel to Control SpuriousInteractions in ModeratedMultiple Regression

Brendan J. Morse1, George A. Johanson2, and Rodger W. Griffeth2

Abstract

Recent simulation research has demonstrated that using simple raw score to operationalizea latent construct can result in inflated Type I error rates for the interaction term of a mod-erated statistical model when the interaction (or lack thereof) is proposed at the latent vari-able level. Rescaling the scores using an appropriate item response theory (IRT) model canmitigate this effect under similar conditions. However, this work has thus far been limited todichotomous data. The purpose of this study was to extend this investigation to multicate-gory (polytomous) data using the graded response model (GRM). Consistent with previousstudies, inflated Type I error rates were observed under some conditions when polytomousnumber-correct scores were used, and were mitigated when the data were rescaled with theGRM. These results support the proposition that IRT-derived scores are more robust tospurious interaction effects in moderated statistical models than simple raw scores undercertain conditions.

Keywords

graded response model, item response theory, polytomous models, simulation

Operationalizing a latent construct such as an attitude or ability is a common practice in psy-

chological research. Stine (1989) described this process as the creation of a mathematical

structure (scores) that represents the empirical structure (construct) of interest. Typically,

researchers will use simple raw scores (e.g., either as a sum or a mean) from a scale or test

as the mathematical structure for a latent construct. However, much debate regarding the

properties of such scores has ensued since S. S. Stevens’s classic publication of the nominal,

ordinal, interval, and ratio scales of measurement (Stevens, 1946). Although it is beyond the

scope of this article to enter the scale of measurement foray, an often agreed-on position is

that simple raw scores for latent constructs do not exceed an ordinal scale of measurement.

This scale imbues such scores with limited mathematical properties and permissible

1Bridgewater State University, MA, USA2Ohio University, Athens, USA

Corresponding author:

Brendan J. Morse, Department of Psychology, Bridgewater State University, 90 Burrill Avenue, 340 Hart Hall,

Bridgewater, MA 02325, USA

Email: [email protected]

Page 2: Morse et al 2012

transformations that are necessary for the appropriate application of parametric statistical

models. Nonparametric, or distribution-free, statistics have been proposed as a solution for

the scale of measurement problem. However, many researchers are reluctant to use nonpara-

metric techniques because they are often associated with a loss of information pertaining to

the nature of the variables (Gardner, 1975). McNemar (1969) articulated this point by say-

ing, ‘‘Consequently, in using a non-parametric method as a short-cut, we are throwing away

dollars in order to save pennies’’ (p. 432).

Assuming that simple raw scores are limited to the ordinal scale of measurement and

researchers typically prefer parametric models to their nonparametric analogues, the empiri-

cal question regarding the robustness of various parametric statistical models to scale viola-

tions arises. Davison and Sharma (1988) and Maxwell and Delaney (1985) demonstrated

through mathematical derivations that there is little cause for concern when comparing mean

group differences in the independent samples t test when the assumptions of normality and

homogeneity of variance are met. However, Davison and Sharma (1990) subsequently

demonstrated that scaling-induced spurious interaction effects could occur with ordinal-level

observed scores in multiple regression analyses. These findings suggest that scaling may

become a problem when a multiplicative interaction term is introduced into a parametric sta-

tistical model.

Scaling and Item Response Theory (IRT)

An alternative solution to the scale of measurement issue for parametric statistics is to rescale

the raw data itself into an interval-level metric, and a variety of methods for this rescaling have

been proposed (see Embretson, 2006; Granberg-Rademacker, 2010; Harwell & Gatti, 2001). A

potential method for producing scores with near interval-level scaling properties is the applica-

tion of IRT models to operationalize number-correct scores into estimated theta scores—the

IRT-derived estimate of an individual’s ability or latent construct standing. Conceptually, the

attractiveness of this method rests with the invariance property in IRT scaling, and such scores

may provide a more appropriate metric for use in parametric statistical analyses.1 Reise,

Ainsworth, and Haviland (2005) stated that

Trait-level estimates in IRT are superior to raw total scores because (a) they are optimal scalings of

individual differences (i.e., no scaling can be more precise or reliable) and (b) latent-trait scales have

relatively better (i.e., closer to interval) scaling properties. (p. 98, italics in original)

In addition, Reise and Haviland (2005) gave an elegant treatment of this condition by demon-

strating that the log-odds of endorsing an item and the theta scale form a linearly increasing rela-

tionship. Specifically, the rate of change on the theta scale is preserved (for all levels of theta) in

relation to the log-odds of item endorsement.

Empirical Evidence of IRT Scaling

In a simulation testing the effect of scaling and test difficulty on interaction effects in factor-

ial analysis of variance (ANOVA), Embretson (1996) demonstrated that Type I and Type II

errors for the interaction term could be exacerbated when simple raw scores are used under

nonoptimal psychometric conditions. Such errors occurred primarily due to the ordinal-level

scaling limitations of simple raw scores, and the ceiling and floor effects imposed when an

assessment is either too easy or too difficult for a group of individuals—a condition known

Morse et al. 123

Page 3: Morse et al 2012

as assessment inappropriateness (see Figure 1). Embretson fitted the one-parameter logistic

(Rasch) model to the data and was able to mitigate the null hypothesis errors using the esti-

mated theta scores rather than the simple raw scores. These results illuminated the

Assessment Appropriateness

Theta−4 −2 0 2 4

Trait ScoresTest Information

Assessment Inappropriateness

Theta−4 −2 0 2 4

Figure 1. A representation of the latent construct distribution and test information (reliability)distributions for appropriate assessments (top) and inappropriate assessments (bottom)

124 Applied Psychological Measurement 36(2)

Page 4: Morse et al 2012

usefulness of IRT scaling for dependent variables in factorial models, especially under sub-

optimal psychometric conditions. Embretson argued that researchers are often unaware when

these conditions are present and can benefit from using appropriately fitted IRT models to

generate scores that are more appropriate for use with parametric analyses.

An important question that now arises is whether these characteristics extend to more com-

plex IRT models such as the two- and three-parameter logistic models (dichotomous models

with a discrimination and guessing parameter, respectively) and polytomous models. Although

the Rasch model demonstrates desirable measurement characteristics (i.e., true parameter invar-

iance; Embretson & Reise, 2000; Fischer, 1995; Perline, Wright, & Wainer, 1979), it is some-

times too restrictive to use in practical contexts. However, the consensus regarding the

likelihood that non-Rasch models could achieve interval-level scaling properties is ‘‘yes’’

(Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Harwell & Gatti, 2001;

Reise et al., 2005). Investigations into the scaling properties of these more complex IRT models

are thus necessary.

In one extension of this sort, Kang and Waller (2005) simulated the scaling properties of

simple raw scores, estimated theta scores derived from a two-parameter logistic IRT model,

and assessment appropriateness with the interaction term in a moderated multiple regression

(MMR) analysis. Similar to the findings of Embretson (1996), Kang and Waller discovered

that using simple raw scores to operationalize a latent construct resulted in substantial infla-

tions of the Type I error rate (.50% or p . .50) for the interaction term in MMR under con-

ditions of assessment inappropriateness. However, the IRT-derived theta score estimates

were found to mitigate the Type I error rate to acceptable levels (\10% or p \ .10) under

the same conditions. This extension demonstrated that the estimated theta scores from a

non-Rasch IRT model could be used to better fit the assumptions of parametric statistical

models involving an interaction term. Finally, Harwell and Gatti (2001) investigated the

congruence of estimated (theta) and actual construct scores using a popular polytomous IRT

model, the graded response model (GRM; Samejima, 1969, 1996). The authors posited that

if the estimated construct (theta) scores were sufficiently similar to the actual construct

(theta) scores, which can be defined (albeit arbitrarily) as a metric with interval-level scaling

properties, then the GRM results in scores that are sufficiently interval level. The results of

their study supported this relationship. However, a concrete endorsement of the scaling prop-

erties should be made with caution due to the inherently arbitrary metric of the theta scale in

most (if not all) IRT models.

The preceding theoretical and simulation evidence suggests that scale of measurement viola-

tions accompanied with suboptimal psychometric conditions (i.e., assessment inappropriateness)

may have nonnegligible effects on the accuracy of common parametric analyses. However, this

evidence still has limited generalizability to the majority of psychological research due to the

nature of the simulated data. Specifically, Embretson (1996) and Kang and Waller (2005) simu-

lated dichotomous data and fit logistic IRT models appropriate for such data. However, the

majority of psychological research uses polytomous data, or data with multiple response

options, such as Likert-type scales. Therefore, the purpose of the current study is to extend the

understanding of scaling and assessment appropriateness on parametric statistical analyses by

simulating polytomous data and fitting an appropriate polytomous IRT model. The authors’ pri-

mary null hypothesis against which the performance of number-correct scores and estimated

theta scores are being compared is that there is no significant interaction on the actual theta

scale. Thus, any significant interaction identified in the simulation results represents a spurious

observed effect (Type I error).

Morse et al. 125

Page 5: Morse et al 2012

Method

This study used a Monte Carlo simulation to identify the psychometric conditions that lead to

an elevated risk of Type I errors for interaction effects in MMR when the theta scale is consid-

ered to be the true metric. This simulation was similar to the simulation conducted by Kang and

Waller (2005) and extends this work into polytomous scales indicative of those commonly used

in applied psychological research.

The GRM

The GRM (Samejima, 1969, 1996) is an IRT model suitable for modeling data with ordered

categories such as Likert-type scales, and is an extension of the two-parameter logistic model.

The GRM is considered a difference family model and was developed specifically to model

polytomous data that represent the psychological processes underlying multicategory decision

making (Ostini & Nering, 2006). In addition, theta estimates derived using the GRM may show

evidence of interval-level scaling properties (Harwell & Gatti, 2001).

Using the GRM, an individual’s likelihood of responding in a particular response category is

derived using a two-step process. First, category boundary functions (CBRFs) are calculated to

determine boundary decision probabilities of j 2 1 response categories for each item. The

CBRFs in the GRM can be derived with Equation 1 (adapted from Embretson & Reise, 2000).

Pix � ðuÞ=e ai u�bijð Þ½ �

1 + e ai u�bijð Þ½ � ; ð1Þ

In Equation 1, Pix � (u) is the probability that an individual with a trait (construct) level u will

respond positively at the boundary of category j for item i where x = j = 1 . . . mi. Theta (u) rep-

resents the individual’s trait (construct) level, ai represents the item discrimination or slope, and

bij represents the category location or difficulty parameter with respect to the trait continuum.

Importantly, the values of bij should be successive integers reflecting increased difficulty in pro-

gressing through the response options in well-functioning items.

In the second step of the GRM, the probability of responding in a particular category is deter-

mined using category response functions (CRFs), which are derived by subtracting Pix � (u)

from the following category. This process is illustrated in Equation 2 (adapted from Embretson

& Reise, 2000).

PixðuÞ= Pix � ðuÞ � Piðx + 1Þ � ðuÞ; ð2Þ

Determining the first category is done by simply subtracting Pi1 � (u) from 1.0, and the last

category is simply Pim � (u). The GRM was used as a model for the number-correct score algo-

rithm as well as to estimate theta scores in this study.

Independent Variables

The independent variables in this study were respondent sample size (n: two levels), scale

length (k: two levels), item discrimination (ai: two levels), item difficulty (bi,1 . . . j21: three lev-

els), scale bandwidth (fidelity: two levels), and the regression coefficients (b1 and b2: two lev-

els). The structure of this study was therefore a 2 3 2 3 2 3 3 3 2 3 2 design comprising 96

conditions.

Sample size (n). Two respondent sample sizes were simulated according to recent evidence

of the stability of parameter estimates in polytomous IRT, and actual sample sizes in MMR

126 Applied Psychological Measurement 36(2)

Page 6: Morse et al 2012

studies in applied psychology. Ostini and Nering (2006) reported that stable estimates for polyt-

omous IRT models could be obtained with as few as 250 individuals, but that samples between

500 and 1,000 are still considered to be desirable. In addition, Aguinis Beaty, Boik, and Pierce

(2005) indicated that the average sample size for MMR studies in applied psychological

research is �xn = 272 with an average standard deviation of sn = 434. These results indicate that

the simulation outcomes for the n = 250 sample size will be the most relevant for the majority

of applied psychological research; however, some studies do achieve sample sizes upward of

n = 1,000 (see, for example, Witt, 1998; n = 979). Therefore, sample sizes included two levels

of n = 250 and n = 750 respondents to maximize the generalizability for the majority of empiri-

cal MMR studies in applied psychology as well as for typical IRT studies.

Scale length (k). Two scale lengths of k = 15 and k = 30 items were simulated in this study to

model typical scales used in applied psychological research. In a review of validated scales used

in applied psychological research, Fields (2002) indicated a modal scale length of 15 items with

a mean of 15.43 and a standard deviation of 10.43 for validated scales in applied psychology.

The distribution related to these values is also slightly positively skewed, indicating the exis-

tence of several very long scales.

Discrimination (ai). To derive the highest level of generalizability from this study, item para-

meter values were randomly selected from specified distributions as opposed to using constant

values. Following the structure of Kang and Waller (2005), item discrimination values were

selected from a uniform distribution between the values of 0.31 to 0.58 for moderate discrimi-

nation and 0.58 to 1.13 for high discrimination. Estimating discrimination values from a uni-

form distribution has been demonstrated to appropriately represent empirically determined item

discrimination values (Reise & Waller, 2003), and the particular cutoff values of 0.31, 0.58,

and 1.13 were demonstrated to appropriately represent low, moderate, and high factor loadings

for items (Kang & Waller, 2005; Takane & De Leeuw, 1987). Because the GRM is a polyto-

mous extension of the two-parameter logistic model, these values can be deemed appropriate

for use in this study. Furthermore, the decision to retain the values from the Kang and Waller

(2005) study was made to maintain a basis of comparison for the extension to polytomous data.

Item difficulty (bi,1 . . . j21)/assessment appropriateness. Three item difficulty conditions were

simulated to represent a ‘‘difficult,’’‘‘moderate,’’ and ‘‘easy’’ scale with respect to the simu-

lated distribution of construct scores (see Figure 1). The item difficulty conditions are also ana-

logous to the assessment appropriateness conditions with the ‘‘difficult’’ and ‘‘easy’’ conditions

representing assessment inappropriateness and the ‘‘moderate’’ condition representing assess-

ment appropriateness. Item difficulty values were randomly selected from a N(21.5, 1.0) distri-

bution for the easy (inappropriate) conditions, a N(0.0, 1.0) distribution for the moderate

(appropriate) conditions, and a N(1.5, 1.0) distribution for the difficult (inappropriate)

conditions.

Four item difficulty parameters were randomly selected from the appropriate distribution for

each item to represent the j – 1 CBRFs specified in Equation 1 for the GRM. An important

aspect of the difficulty parameters in polytomous IRT models is that the difficulty parameter for

each CBRF is sequentially ordered. Therefore, the difficulty parameters were modeled with the

sequential ordering restriction imposed similar to that implemented by Meade, Lautenschlager,

and Johnson (2007).

Scale fidelity. An assessment’s fidelity is measured as the inverse of variability (i.e., band-

width) in the difficulty of the items (Stocking, 1987). Fidelity contributes to assessment appro-

priateness by either restricting (high fidelity) or expanding (low fidelity) the width of the item

difficulty distribution. The high-fidelity conditions were simulated in this study by generating a

second set of item difficulty values from more restricted normal distributions with a mean and

standard deviation of N(21.50, 0.50) for easy scales, N(0.00, 0.50) for moderate scales, and

Morse et al. 127

Page 7: Morse et al 2012

N(1.50, 0.50) for difficult scales. These restricted distributions will create the high-fidelity and

low-bandwidth situation in which Kang and Waller (2005) observed the highest prevalence of

Type I errors. As in the previous difficulty parameter selection, there were four (j 2 1) diffi-

culty values for each item sampled from within the specified distribution with the sequential

ordering restriction imposed.

Regression weights. In accordance with Kang and Waller (2005), regression weights were set

at a value of 0.30 or 0.50 for both b1 and b2. An intercept of 0 is used and therefore omitted

from the regression models. It should be noted that these regression weights are fixed only for

the purposes of simulating the dependent variables.

Fixed EffectsItem response categories (j). Five item response categories were used to simulate a five-

category Likert-type response scale. Fields (2002) identified 134 validated construct assess-

ments that are used in applied psychological research of which five-category Likert-type

response scales were the most common (n = 57).

Regression models. The purpose of this study was to observe the prevalence of Type I errors

in MMR in three different pairs of models. In the first regression model pair, actual latent trait

scores u will be analyzed (see Equations 3a and 3b). In the second regression model pair,

number-correct scores (X) will be analyzed (see Equations 4a and 4b). In the third regression

model pair, estimated theta scores u will be analyzed (see Equations 5a and 5b). These three

model pairs were expressed in accordance with Kang and Waller (2005) as follows:

u3 = b1u1 + b2u2 + e; ð3aÞ

u3 = b1u1 + b2u2 + b3u1u2 + e; ð3bÞ

X3 = b1X1 + b2X2 + e; ð4aÞ

X3 = b1X1 + b2X2 + b3X1X2 + e; ð4bÞ

u3 = b1u1 + b2u2 + e; ð5aÞ

u3 = b1u1 + b2u2 + b3u1u2 + e: ð5bÞ

The first model of each pair is the additive model, and the second model in each pair con-

tains a multiplicative (interaction) term. Each model pair was structured as a hierarchical regres-

sion analysis where the interaction term is entered at the second step (Aiken & West, 1991;

Cohen, Cohen, West, & Aiken, 2003). A significant change in variance accounted for (DR2)

between the first and second model indicated the existence of a spurious interaction effect based

on the null hypothesis that the data were created with no significant interaction on the actual

theta scale.

Regression Main Effects

Two continuous predictor variables were simulated for each regression model specified in

Equations 3 through 5b. Predictor variables u1 and u2 were randomly selected for the number

of observations (n) from normal distributions with a mean and standard deviation equal to

N(0.00, 1.00). These variables served as the main effect scores in the regression models. It is

important to note that u1 and u2 were sampled from identical but independent distributions;

thus, no multicollinearity was modeled.

128 Applied Psychological Measurement 36(2)

Page 8: Morse et al 2012

Regression Criterion Variables

One continuous criterion variable was calculated for each regression model specified in

Equations 3 through 5b. In accordance with Kang and Waller (2005), the general form of the

criterion variables is given by the following equation, which represents a multiple regression

model with two significant main effects and no interaction on the actual theta scale.

u3 = b1u1 + b2u2 +

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1� ðb2

1 + b22Þ

q3 e; ð6Þ

In Equation 6, b1 and b2 are the simulated regression weights and e is an error term. Note

that the intercept term, b0, was set to equal 0 and thus omitted from the model. The termffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1� b2

1 + b22

� �qwas included to represent an appropriate error variance component for each

level of b. See the Appendix for a derivation of the error term.

Number-correct scores

To generate the number-correct scores, X1, X2, and X3, the values of the previously defined con-

struct scores u1, u2, and u3 were entered into the GRM algorithm (Equations 1 and 2) for each

simulated participant.

A matrix of response scores was generated by reporting the number-correct score (1, 2, 3, 4,

or 5) corresponding to the highest-category response likelihood for each simulated participant

on each item. These values were derived using an algorithm written by the first author in the

R language based on response probabilities calculated in Equations 1 and 2. Actual number-

correct score responses were generated by comparing a randomly selected value from a uniform

distribution, U(0.0, 1.0), with the relative response probabilities that are generated for each level

of theta (or individual) and each item. This process can be thought of as determining the relative

likelihood of a category response given the item and person parameters with a realistic level of

decision-making error (Kang & Waller, 2005; Stone, 1992). This integration of response error is

important so as to not assume perfect responding by simulated individuals. A mean score for X1,

X2, and X3 for each simulated individual was calculated from the number-correct score response

matrices for analysis in the regression models.

Estimated Theta Scores

Finally, theta scores u1, u2, and u3 were estimated from the simulated raw data, using

PARSCALE 4.1 (Muraki & Bock, 2003). PARSCALE was set to derive the person (latent con-

struct scores) and item parameters using the expected a posteriori (EAP) method. This method

calculates u1, u2, and u3 as the modal value of the posterior distribution, which is the most

likely value of theta for the observed response pattern (Baker & Kim, 2004) and is a preferred

estimation method for assessments that are moderate to short in item length (Mislevy &

Stocking, 1989).

Iterations

For the purposes of estimating Type I error rates in Monte Carlo studies, Robey and

Barcikowski (1992) specify that approximately 1,000 iterations will achieve a power equal

to .90 when approximating an alpha level of a = .05 and using the interval of a 6 1=2a as a

robustness interval. Therefore, 1,000 iterations per condition were conducted. This allowed

Morse et al. 129

Page 9: Morse et al 2012

for adequate reduction in sampling variance for the IRT parameter estimates (Harwell,

Stone, Hsu, & Kirisci, 1996), achieves a power of .90 around the interval :025 � a � :075

(Robey & Barcikowski, 1992), and doubles the number of iterations used by Kang and

Waller (2005).

Simulation-Dependent Variables

Type I errors. The primary dependent variable for this study was the empirical Type I error

rate (p) that is observed for the interaction term of the MMR models. The specific value of p

was identified in a three-step process. In each iteration of the simulation, the variance in u3

accounted for by u1 and u2 was recorded as the R2 value for the additive and multiplicative

regression models specified in Equations 3 through 5b. Second, the significance of the change

in variance accounted for, DR2, between the respective additive and multiplicative models was

tested at an alpha level of p � :05, and recorded as 1 for a significant result and 0 for a non-

significant result. Finally, the empirical alpha level p was recorded as the proportion ð x1;000Þ of

iterations resulting in a significant DR2 for the actual latent trait scores u3, the number-correct

scores X3, and the estimated theta scores u3.

Procedure

The simulation for the current study was conducted in the R environment (version 2.9.0; Ihaka

& Gentleman, 1996; R Development Core Team, 2008) using a series of functions written by

the authors, contributed code from Kang and Waller (2005), and PARSCALE 4.1 (Muraki &

Bock, 2003). For ease of interpretation, four separate simulations were conducted. The four

simulations were separated based on sample size (n = 250, 750) and scale fidelity (normal,

high). In each simulation, the independent variables of scale length, regression weights, dis-

crimination, and difficulty will be systematically varied. Therefore, the summary statistics for

each simulation are included in four tables, each with 24 rows.

Each simulation was run using the following process. First, using the pseudorandom number

generator in R, theta vectors were sampled from a standard normal distribution N(0.0, 1.0) for

u1 and u2. Next, corresponding vectors for u3 were calculated using Equation 6. These vectors

were saved as the actual latent construct scores. To calculate the number-correct score matrices,

X1, X2, and X3, each of these three score vectors were evaluated in an algorithm written by the

first author that implements Equation 1 and Equation 2 to determine the probability of a cate-

gory response. Final number-correct score values were determined by the comparison of a ran-

domly selected value from a uniform distribution as previously described. Finally, the estimated

theta scores u1, u2, and u3 were derived using PARSCALE 4.1 (Muraki & Bock, 2003). To

accomplish this task, the number-correct score matrices were ‘‘batched’’ out to PARSCALE

with an accompanying syntax file following the structure identified by Gagne, Furlow, and

Ross (2009). The estimated theta scores from PARSCALE were then returned to R as the vec-

tors u1, u2, and u3.

Finally, the nine score vectors to be entered into the corresponding additive and multiplica-

tive regression models specified in Equations 3 through 5b and the change in variance accounted

for between the two corresponding models was recorded. The final summary statistics and tables

were generated using portions of code provided by Niels Waller and used in the Kang and

Waller (2005) study.

130 Applied Psychological Measurement 36(2)

Page 10: Morse et al 2012

Results

Using the a 6 1=2a criterion, the results indicated meaningfully inflated Type I error rates in 53

of 96 conditions (55%) when number-correct scores were used to operationalize the latent con-

structs, and in 33 of 96 conditions (34%) when estimated theta scores were used to operationa-

lize the latent constructs (see conditions marked with an asterisk under the columns labeled pX

and pu, respectively, in Tables 1-4). In addition, a binomial test was conducted as a measure of

statistically significant departures from the nominal (a = .05) alpha level for each scoring

method. The results of the binomial test were slightly more conservative and indicated signifi-

cantly inflated Type I error rates in 63 of 96 conditions (66%) when number-correct scores

were used to operationalize the latent constructs and in 44 of 96 conditions (46%) when esti-

mated theta scores were used to operationalize the latent constructs (see conditions marked with

dagger under the columns labeled pX and pu, respectively, in Tables 1-4).

In addition, Figure 2 represents the frequency of the empirical Type I error rates for the

number-correct scores and estimated theta scores. For the number-correct scores, these data

indicate a positively skewed distribution (skew = 2.04) ranging from 3.1% to 84.9%, with a

mean empirical Type I error rate of 17.5%, median of 8.7%, and a standard deviation of 20%.

Restricting the summary to only those occurrences outside of the a 6 1=2a interval, the mean

empirical Type I error rate was 27.6% with values ranging from 7.8% to 84.9%. The distribu-

tion for the estimated theta scores was also positively skewed (skew = 2.97) and ranged from

3.7% to 43.6% with a mean empirical Type I error rate of 9.0%, median of 5.9%, and a standard

deviation of 8.0%. Restricting the summary to only those occurrences outside of the a 6 1=2ainterval, the mean empirical Type I error rate was 15.9%, with values ranging from 7.6% to

43.6%.

These results indicate that there were instances of spurious interactions regardless of the

scoring method. However, it is clear that the number-correct scores performed much worse than

the estimated theta scores in comparable conditions. Finally, an important finding to highlight

is that, of the conditions with meaningfully inflated Type I error rates for the estimated theta

scores, none were unique with regard to the number-correct scores (see Tables 1-4). In other

words, no meaningful inflations existed for the estimated theta scores that did not also exist for

the number-correct scores.

Assessment Appropriateness and Type I Errors

The results of this simulation also clearly indicate the anticipated effect of scoring method and

assessment appropriateness on the occurrence of Type I errors for the interaction term of a

MMR analysis. Figure 3 represents the mean and maximum Type I error rates for each scoring

method collapsed across the 32 assessment appropriateness conditions. Under these conditions,

there is no significant departure from the nominal Type I error rate, regardless of whether one

uses simple raw scores or estimated theta scores in the MMR analysis. These results are consis-

tent with previous findings related to scaling effects on Type I error rates for moderated statisti-

cal models (Davison & Sharma, 1990; Embretson, 1996; Kang & Waller, 2005).

However, striking differences in the empirical Type I error rate can be observed for each

scoring method when the assessment is inappropriate for the individuals. Figure 4 represents the

mean and maximum Type I error rates for each scoring method collapsed across the 64 assess-

ment inappropriateness (easy/difficult) conditions. Number-correct scores resulted in empirical

Type I error rates that were above the acceptable interval in 53 of the 64 (83%) inappropriate

assessment conditions. At the iteration level, a direct logistic regression analysis indicated that

the likelihood of committing a Type I error was 8.13 times greater when number-correct scores

(Text continues on p. 137.)

Morse et al. 131

Page 11: Morse et al 2012

Tab

le1.

Res

ults

ofSi

mula

tion

1(N

orm

alFi

del

ity,

Dis

trib

ution

ofLa

tent

Const

ruct

Score

s=

Stan

dar

dN

orm

alN

(0,1

))

cn

b i,j

21

a ib

kp

uD

R2 u

px

DR

2 xp

uD

R2 u

aR

MSE

SWu

SWx

SWu

skx

sku

1250

Eas

yLo

w.3

15

0.0

62

0.0

18

0.0

55

0.0

21

0.0

47

0.0

22

.66

0.8

60.9

60.0

60.7

80.5

40.2

32

250

Eas

yLo

w.3

30

0.0

62

0.0

18

0.0

68

a0.0

20

0.0

52

0.0

23

.80

0.8

40.9

60.0

60.7

70.5

60.2

23

250

Eas

yLo

w.5

15

0.0

62

0.0

11

0.0

88

a,b

0.0

18

0.0

58

0.0

18

.66

0.8

60.9

60.2

30.8

40.5

40.2

44

250

Eas

yLo

w.5

30

0.0

62

0.0

11

0.1

13

a,b

0.0

17

0.0

61

0.0

17

.80

0.8

30.9

60.3

00.8

20.5

70.2

45

250

Eas

yH

igh

.315

0.0

62

0.0

18

0.0

55

0.0

21

0.0

47

0.0

22

.66

1.4

40.9

60.0

60.7

80.5

40.2

36

250

Eas

yH

igh

.330

0.0

61

0.0

18

0.1

05

a,b

0.0

22

0.0

82a,

b0.0

16

.92

1.4

40.9

60.0

00.4

11.0

10.3

47

250

Eas

yH

igh

.515

0.0

61

0.0

11

0.2

96

a,b

0.0

20

0.0

89a,

b0.0

12

.86

1.4

40.9

60.0

10.5

70.9

90.3

48

250

Eas

yH

igh

.530

0.0

62

0.0

11

0.3

72

a,b

0.0

19

0.0

98a,

b0.0

14

.92

1.4

40.9

60.0

20.6

11.0

10.3

49

250

Moder

ate

Low

.315

0.0

62

0.0

18

0.0

46

0.0

20

0.0

56

0.0

20

.69

0.7

00.9

60.8

30.8

60.0

20.0

110

250

Moder

ate

Low

.330

0.0

62

0.0

18

0.0

42

0.0

21

0.0

48

0.0

20

.82

0.7

00.9

60.8

60.8

60.0

20.0

211

250

Moder

ate

Low

.515

0.0

62

0.0

11

0.0

41

0.0

16

0.0

52

0.0

17

.69

0.7

00.9

60.9

10.8

60.0

20.0

212

250

Moder

ate

Low

.530

0.0

62

0.0

11

0.0

37

0.0

15

0.0

37

0.0

18

.82

0.7

00.9

60.9

30.8

60.0

20.0

313

250

Moder

ate

Hig

h.3

15

0.0

62

0.0

18

0.0

51

0.0

19

0.0

47

0.0

20

.88

0.5

30.9

60.4

30.8

80.0

40.0

114

250

Moder

ate

Hig

h.3

30

0.0

62

0.0

18

0.0

48

0.0

19

0.0

59

0.0

19

.94

0.5

20.9

60.5

00.8

80.0

30.0

015

250

Moder

ate

Hig

h.5

15

0.0

62

0.0

11

0.0

41

0.0

14

0.0

50

0.0

14

.88

0.5

30.9

60.8

60.9

10.0

40.0

016

250

Moder

ate

Hig

h.5

30

0.0

61

0.0

11

0.0

39

0.0

13

0.0

55

0.0

14

.94

0.5

20.9

60.8

90.9

10.0

30.0

017

250

Diff

icult

Low

.315

0.0

61

0.0

18

0.0

64

a0.0

20

0.0

53

0.0

21

.65

0.8

90.9

60.0

30.7

20.5

80.2

118

250

Diff

icult

Low

.330

0.0

61

0.0

18

0.0

67

a0.0

21

0.0

52

0.0

22

.79

0.8

80.9

60.0

40.7

00.6

00.2

219

250

Diff

icult

Low

.515

0.0

61

0.0

11

0.1

06

a,b

0.0

19

0.0

75a

0.0

18

.65

0.8

90.9

60.1

50.7

90.5

80.2

120

250

Diff

icult

Low

.530

0.0

61

0.0

11

0.1

38

a,b

0.0

17

0.0

55

0.0

19

.79

0.8

80.9

60.2

30.7

80.6

00.2

021

250

Diff

icult

Hig

h.3

15

0.0

61

0.0

18

0.1

12

a,b

0.0

23

0.0

98a,

b0.0

11

.85

1.5

80.9

60.0

00.3

01.0

60.3

722

250

Diff

icult

Hig

h.3

30

0.0

61

0.0

18

0.1

23

a,b

0.0

22

0.0

70a

0.0

14

.92

1.5

90.9

60.0

00.2

61.0

80.3

823

250

Diff

icult

Hig

h.5

15

0.0

61

0.0

11

0.3

17

a,b

0.0

21

0.1

20a,

b0.0

11

.85

1.5

90.9

60.0

00.4

91.0

60.3

824

250

Diff

icult

Hig

h.5

30

0.0

61

0.0

11

0.3

86

a,b

0.0

20

0.1

02a,

b0.0

12

.92

1.5

90.9

60.0

10.4

71.0

80.3

8

Note

:c

=co

nditio

n;n

=num

ber

ofin

div

idual

s;b i

j21

=item

cate

gory

diff

iculty

dis

trib

ution,Eas

y(a

sses

smen

tin

appro

pri

aten

ess)

=N

(21.5

,1),

Moder

ate

(ass

essm

ent

appro

pria

tenes

s)=

N(0

,1),

Diff

icult

(ass

essm

ent

inap

pro

pri

aten

ess)

=N

(1.5

,1);

a i=

item

dis

crim

inat

ion

dis

trib

ution,

Low

=U

(.31,.5

8),

Hig

h=

U(.58,1

.13);

b=

regr

essi

on

wei

ght;

k=

num

ber

ofitem

s;p

u=

empir

ical

Type

Ier

ror

rate

for

actu

alth

eta

score

s;D

R2 u

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

actu

alth

eta

score

s;p

x=

empir

ical

Type

Ier

ror

rate

for

num

ber-

corr

ect

score

s;D

R2 x

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

num

ber

-corr

ect

score

s;p

u=

empir

ical

Type

Ier

ror

rate

for

estim

ated

thet

asc

ore

s;D

R2 u

=av

erag

e

effe

ctsi

zefo

rsi

gnifi

cant

inte

ract

ions

for

estim

ated

thet

asc

ore

s;a

=av

erag

ein

tern

alco

nsis

tency

for

the

num

ber

-corr

ect

score

s;R

MSE

=ro

ot

mea

nsq

uar

eer

ror

for

the

estim

ated

thet

asc

ore

s;SW

u=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

actu

alth

eta

score

s;SW

x=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

num

ber-

corr

ect

score

s;SW

u=

pro

port

ion

ofn.s

.Sh

apir

o–W

ilkte

sts

for

the

estim

ated

thet

asc

ore

s;sk

x=

|ske

wnes

s|fo

rth

enum

ber

-corr

ect

score

s(a

bs.

valu

e);s

ku

=|s

kew

nes

s|fo

rth

ees

tim

ated

thet

asc

ore

s.It

erat

ions

per

conditio

n=

1,0

00.

a Sign

ifica

nt

Type

IErr

or

rate

bas

edon

the

resu

lts

ofa

bin

om

ialte

st.

bSi

gnifi

cant

Type

IErr

or

rate

bas

edon

the

alpha

+/-

.5al

pha

criter

ion.

132

Page 12: Morse et al 2012

Tab

le2.

Res

ults

ofSi

mula

tion

2(N

orm

alFi

del

ity,

Dis

trib

ution

ofLa

tent

Const

ruct

Score

s=

Stan

dar

dN

orm

alN

(0,1

))

cn

b i,j2

1a i

bk

pu

DR

2 up

xD

R2 x

pu

DR

2 ua

RM

SESW

uSW

xSW

usk

xsk

u

25

750

Eas

yLo

w.3

15

0.0

49

0.0

06

0.0

74

a0.0

07

0.0

56

0.0

07

.66

0.8

20.9

50.0

00.4

40.5

50.2

326

750

Eas

yLo

w.3

30

0.0

49

0.0

06

0.0

69

a0.0

07

0.0

57

0.0

07

.80

0.8

40.9

50.0

00.4

50.5

70.2

427

750

Eas

yLo

w.5

15

0.0

49

0.0

04

0.1

67

a,b

0.0

07

0.0

89

a,b

0.0

06

.66

0.8

20.9

50.0

00.6

50.5

50.2

228

750

Eas

yLo

w.5

30

0.0

49

0.0

04

0.2

22

a,b

0.0

06

0.0

79

a,b

0.0

06

.80

0.8

40.9

50.0

10.6

60.5

60.2

429

750

Eas

yH

igh

.315

0.0

49

0.0

06

0.1

62

a,b

0.0

08

0.0

84

a,b

0.0

06

.86

1.3

10.9

50.0

00.1

11.0

00.3

130

750

Eas

yH

igh

.330

0.0

49

0.0

06

0.1

42

a,b

0.0

07

0.0

66

a0.0

06

.92

1.3

10.9

50.0

00.1

11.0

10.3

231

750

Eas

yH

igh

.515

0.0

49

0.0

04

0.6

27

a,b

0.0

09

0.1

73

a,b

0.0

06

.86

1.3

10.9

50.0

00.5

21.0

00.3

132

750

Eas

yH

igh

.530

0.0

49

0.0

04

0.7

10

a,b

0.0

09

0.1

58

a,b

0.0

06

.92

1.3

10.9

50.0

00.5

21.0

00.3

233

750

Moder

ate

Low

.315

0.0

49

0.0

06

0.0

56

0.0

06

0.0

52

0.0

06

.69

0.6

80.9

50.3

20.7

60.0

20.0

134

750

Moder

ate

Low

.330

0.0

49

0.0

06

0.0

46

0.0

07

0.0

43

0.0

07

.82

0.6

60.9

50.4

40.7

70.0

20.0

235

750

Moder

ate

Low

.515

0.0

49

0.0

04

0.0

46

0.0

06

0.0

55

0.0

05

.69

0.6

80.9

50.6

90.8

20.0

20.0

236

750

Moder

ate

Low

.530

0.0

49

0.0

04

0.0

43

0.0

05

0.0

38

0.0

06

.82

0.6

60.9

50.8

10.8

20.0

20.0

237

750

Moder

ate

Hig

h.3

15

0.0

49

0.0

06

0.0

50

0.0

06

0.0

53

0.0

06

.88

0.5

10.9

50.0

10.6

90.0

30.0

038

750

Moder

ate

Hig

h.3

30

0.0

49

0.0

06

0.0

44

0.0

06

0.0

44

0.0

06

.94

0.5

30.9

50.0

10.7

10.0

30.0

139

750

Moder

ate

Hig

h.5

15

0.0

49

0.0

04

0.0

65

a0.0

05

0.0

68

a0.0

05

.88

0.5

00.9

50.5

10.7

90.0

30.0

040

750

Moder

ate

Hig

h.5

30

0.0

49

0.0

04

0.0

56

0.0

04

0.0

55

0.0

05

.94

0.5

30.9

50.6

70.8

00.0

30.0

241

750

Diff

icult

Low

.315

0.0

48

0.0

06

0.0

75

a0.0

07

0.0

58

0.0

07

.66

0.8

60.9

50.0

00.3

40.5

80.2

142

750

Diff

icult

Low

.330

0.0

49

0.0

06

0.0

81

a,b

0.0

08

0.0

43

0.0

07

.79

0.8

70.9

50.0

00.3

60.6

00.2

043

750

Diff

icult

Low

.515

0.0

49

0.0

04

0.1

64

a,b

0.0

07

0.0

75

a0.0

07

.66

0.8

60.9

50.0

00.5

60.5

80.1

844

750

Diff

icult

Low

.530

0.0

49

0.0

04

0.2

69

a,b

0.0

07

0.0

59

0.0

06

.79

0.8

70.9

50.0

00.6

00.6

00.1

945

750

Diff

icult

Hig

h.3

15

0.0

49

0.0

06

0.1

59

a,b

0.0

08

0.0

66

a0.0

07

.85

1.4

00.9

50.0

00.0

71.0

70.3

346

750

Diff

icult

Hig

h.3

30

0.0

49

0.0

06

0.1

80

a,b

0.0

08

0.0

65

a0.0

07

.92

1.4

10.9

50.0

00.0

71.0

90.3

347

750

Diff

icult

Hig

h.5

15

0.0

49

0.0

04

0.6

35

a,b

0.0

10

0.1

93

a,b

0.0

06

.85

1.4

00.9

50.0

00.4

91.0

70.3

348

750

Diff

icult

Hig

h.5

30

0.0

49

0.0

04

0.7

65

a,b

0.0

10

0.1

92

a,b

0.0

06

.92

1.4

10.9

50.0

00.4

81.0

90.3

3

Note

:c

=co

nditio

n;n

=num

ber

ofin

div

idual

s;b

ij2

1=

item

cate

gory

diff

iculty

dis

trib

ution,

Eas

y(a

sses

smen

tin

appro

pri

aten

ess)

=N

(21.5

,1),

Moder

ate

(ass

essm

ent

appr

opri

aten

ess)

=N

(0,1),

Diff

icult

(ass

essm

ent

inap

pro

pri

aten

ess)

=N

(1.5

,1);

a i=

item

dis

crim

inat

ion

dis

trib

ution,L

ow

=U

(.31,

.58),

Hig

h=

U(.58,1

.13);

b=

regr

essi

on

wei

ght;

k=

num

ber

ofitem

s;p

u=

empir

ical

Type

Ier

ror

rate

for

actu

alth

eta

score

s;D

R2 u

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

actu

alth

eta

score

s;p

x=

empir

ical

Type

Ier

ror

rate

for

num

ber

-corr

ect

score

s;D

R2 x

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

num

ber

-corr

ect

score

s;p

u=

empir

ical

Type

Ier

ror

rate

for

estim

ated

thet

asc

ore

s;D

R2 u

=av

erag

e

effe

ctsi

zefo

rsi

gnifi

cant

inte

ract

ions

for

estim

ated

thet

asc

ore

s;a

=av

erag

ein

tern

alco

nsi

sten

cyfo

rth

enum

ber

-corr

ect

score

s;R

MSE

=ro

ot

mea

nsq

uar

eer

ror

for

the

estim

ated

thet

asc

ore

s;SW

u=

pro

port

ion

ofn.s

.Sh

apir

o–W

ilkte

sts

for

the

actu

alth

eta

score

s;SW

x=

pro

port

ion

ofn.s

.Sh

apir

o–W

ilkte

sts

for

the

num

ber

-corr

ect

score

s;SW

u=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

estim

ated

thet

asc

ore

s;sk

x=

|ske

wnes

s|fo

rth

enum

ber-

corr

ect

score

s(a

bs.

valu

e);sk

u=

|ske

wnes

s|fo

rth

ees

tim

ated

thet

asc

ore

s.It

erat

ions

per

cond

itio

n=

1,0

00.

a Sign

ifica

nt

Type

IErr

or

rate

bas

edon

the

resu

lts

ofa

bin

om

ialte

st.

bSi

gnifi

cant

Type

IErr

or

rate

bas

edon

the

alpha

+/-

.5al

pha

criter

ion.

133

Page 13: Morse et al 2012

Tab

le3.

Res

ults

ofSi

mula

tion

3(H

igh

Fidel

ity,

Dis

trib

ution

ofLa

tent

Const

ruct

Score

s=

Stan

dar

dN

orm

alN

(0,1

))

cn

b i,j2

1a i

bk

pu

DR

2 up

xD

R2 x

pu

DR

2 ua

RM

SESW

uSW

xSW

usk

xsk

u

49

250

Eas

yLo

w.3

15

0.0

62

0.0

18

0.0

67

a0.0

21

0.0

55

0.0

20

.64

0.7

0.9

60.0

10.7

80.6

40.2

350

250

Eas

yLo

w.3

30

0.0

62

0.0

18

0.0

78

a,b

0.0

20

0.0

54

0.0

22

.78

0.6

90.9

60.0

10.7

50.6

70.2

451

250

Eas

yLo

w.5

15

0.0

62

0.0

11

0.0

99

a,b

0.0

18

0.0

64

a0.0

18

.64

0.7

00.9

60.1

00.8

90.6

40.2

352

250

Eas

yLo

w.5

30

0.0

62

0.0

11

0.1

32

a,b

0.0

18

0.0

57

0.0

19

.78

0.6

90.9

60.1

50.8

30.6

70.2

453

250

Eas

yH

igh

.315

0.0

62

0.0

18

0.1

28

a,b

0.0

24

0.0

79

a,b

0.0

22

.84

1.5

70.9

60.0

00.0

31.3

40.7

654

250

Eas

yH

igh

.330

0.0

61

0.0

18

0.1

52

a,b

0.0

23

0.0

85

a,b

0.0

21

.91

1.5

60.9

60.0

00.0

41.3

80.7

655

250

Eas

yH

igh

.515

0.0

61

0.0

11

0.3

90

a,b

0.0

23

0.2

15

a,b

0.0

19

.84

1.5

70.9

60.0

00.2

61.3

40.7

756

250

Eas

yH

igh

.530

0.0

62

0.0

11

0.4

67

a,b

0.0

23

0.2

24

a,b

0.0

19

.91

1.5

60.9

60.0

00.2

71.3

70.7

657

250

Moder

ate

Low

.315

0.0

62

0.0

18

0.0

47

0.0

20

0.0

44

0.0

21

.68

0.5

60.9

60.7

70.9

60.0

10.0

058

250

Moder

ate

Low

.330

0.0

62

0.0

18

0.0

44

0.0

20

0.0

58

0.0

20

.81

0.5

60.9

60.8

00.9

70.0

10.0

059

250

Moder

ate

Low

.515

0.0

62

0.0

11

0.0

41

0.0

16

0.0

50

0.0

16

.68

0.5

60.9

60.8

90.9

70.0

10.0

060

250

Moder

ate

Low

.530

0.0

62

0.0

11

0.0

40

0.0

15

0.0

50

0.0

17

.81

0.5

60.9

60.9

00.9

60.0

10.0

061

250

Moder

ate

Hig

h.3

15

0.0

62

0.0

18

0.0

47

0.0

19

0.0

56

0.0

18

.88

0.3

80.9

60.1

10.9

30.0

20.0

162

250

Moder

ate

Hig

h.3

30

0.0

62

0.0

18

0.0

42

0.0

20

0.0

59

0.0

19

.93

0.3

90.9

60.1

40.9

30.0

20.0

163

250

Moder

ate

Hig

h.5

15

0.0

62

0.0

11

0.0

31

0.0

14

0.0

42

0.0

14

.88

0.3

80.9

60.7

90.9

50.0

20.0

164

250

Moder

ate

Hig

h.5

30

0.0

61

0.0

11

0.0

32

0.0

14

0.0

54

0.0

14

.93

0.3

90.9

60.8

60.9

50.0

20.0

165

250

Diff

icult

Low

.315

0.0

61

0.0

18

0.0

75

a0.0

21

0.0

50

0.0

20

.63

0.7

20.9

60.0

10.7

40.6

60.2

466

250

Diff

icult

Low

.330

0.0

61

0.0

18

0.0

66

a0.0

22

0.0

49

0.0

21

.78

0.7

10.9

60.0

10.7

30.6

90.2

467

250

Diff

icult

Low

.515

0.0

61

0.0

11

0.1

15

a,b

0.0

20

0.0

71

a0.0

18

.63

0.7

20.9

60.0

70.8

30.6

60.2

468

250

Diff

icult

Low

.530

0.0

61

0.0

11

0.1

50

a,b

0.0

18

0.0

67

a0.0

18

.78

0.7

00.9

60.1

10.8

30.6

90.2

469

250

Diff

icult

Hig

h.3

15

0.0

61

0.0

18

0.1

41

a,b

0.0

25

0.0

98

a,b

0.0

22

.83

1.6

50.9

60.0

00.0

31.3

90.8

070

250

Diff

icult

Hig

h.3

30

0.0

61

0.0

18

0.1

55

a,b

0.0

24

0.0

93

a,b

0.0

22

.91

1.6

50.9

60.0

00.0

21.4

20.8

071

250

Diff

icult

Hig

h.5

15

0.0

61

0.0

11

0.4

11

a,b

0.0

25

0.2

46

a,b

0.0

21

.83

1.6

50.9

60.0

00.2

21.4

00.8

172

250

Diff

icult

Hig

h.5

30

0.0

61

0.0

11

0.4

88

a,b

0.0

25

0.2

35

a,b

0.0

20

.91

1.6

50.9

60.0

00.2

41.4

30.8

1

Note

:c=

cond

itio

n;n

=num

ber

ofin

div

idual

s;b

i,j2

1=

item

cate

gory

diff

iculty

dis

trib

ution,E

asy

(ass

essm

ent

inap

pro

pri

aten

ess)

=N

(21.5

,0.5

),M

oder

ate

(ass

essm

ent

appro

pri

aten

ess)

=N

(0,0.5

),D

iffic

ult

(ass

essm

ent

inap

pro

pri

aten

ess)

=N

(1.5

,0.5

);a i

=item

dis

crim

inat

ion

dis

trib

ution,L

ow

=U

(.31,

.58),

Hig

h=

U(.58,1

.13);

b=

regr

essi

on

wei

ght;

k=

num

ber

ofitem

s;p

u=

empir

ical

Type

Ier

ror

rate

for

actu

alth

eta

score

s;D

R2 u

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

actu

alth

eta

score

s;p

x=

empir

ical

Type

Ier

ror

rate

for

num

ber-

corr

ect

score

s;D

R2 x

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

num

ber-

corr

ect

score

s;p

u=

empir

ical

Type

Ier

ror

rate

for

estim

ated

thet

asc

ore

s;D

R2 u

=

aver

age

effe

ctsi

zefo

rsi

gnifi

cant

inte

ract

ions

for

estim

ated

thet

asc

ore

s;a

=av

erag

ein

tern

alco

nsis

tency

for

the

num

ber

-corr

ect

score

s;R

MSE

=ro

ot

mea

nsq

uar

eer

ror

for

the

estim

ated

thet

asc

ore

s;SW

u=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

actu

alth

eta

score

s;SW

x=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

num

ber

-corr

ect

score

s;SW

u=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

estim

ated

thet

asc

ore

s;sk

x=

|ske

wnes

s|fo

rth

enum

ber

-corr

ect

score

s(a

bs.va

lue)

;sk

u=

|ske

wnes

s|fo

rth

ees

tim

ated

thet

asc

ore

s.

Iter

atio

ns

per

cond

itio

n=

1,0

00.

a Sign

ifica

nt

Type

IErr

or

rate

bas

edon

the

resu

lts

ofa

bin

om

ialt

est.

bSi

gnifi

cant

Type

IErr

or

rate

bas

edon

the

alpha

+/-

.5al

pha

criter

ion.

134

Page 14: Morse et al 2012

Tab

le4.

Res

ults

ofSi

mula

tion

4(H

igh

Fidel

ity,

Dis

trib

ution

ofLa

tent

Const

ruct

Score

s=

Stan

dar

dN

orm

alN

(0,1

))

cn

b i,j–

1a i

bk

pu

DR

2 up

xD

R2 x

pu

DR

2 ua

RM

SESW

uSW

xSW

usk

xsk

u

73

750

Eas

yLo

w.3

15

0.0

49

0.0

06

0.0

86

a,b

0.0

07

0.0

66

b0.0

07

.64

0.6

50.9

50.0

00.1

60.6

50.2

674

750

Eas

yLo

w.3

30

0.0

49

0.0

06

0.0

80

a,b

0.0

07

0.0

56

0.0

07

.78

0.6

50.9

50.0

00.1

60.6

70.2

675

750

Eas

yLo

w.5

15

0.0

49

0.0

04

0.1

83

a,b

0.0

07

0.1

00

a,b

0.0

06

.64

0.6

50.9

50.0

00.5

40.6

50.2

676

750

Eas

yLo

w.5

30

0.0

49

0.0

04

0.2

68

a,b

0.0

06

0.0

86

a,b

0.0

06

.78

0.6

50.9

50.0

00.5

50.6

70.2

677

750

Eas

yH

igh

.315

0.0

49

0.0

06

0.2

16

a,b

0.0

08

0.1

09

a,b

0.0

07

.84

1.3

60.9

50.0

00.0

01.3

50.6

278

750

Eas

yH

igh

.330

0.0

49

0.0

06

0.1

99

a,b

0.0

09

0.0

95

a,b

0.0

07

.91

1.3

60.9

50.0

00.0

01.3

80.6

279

750

Eas

yH

igh

.515

0.0

49

0.0

04

0.7

40

a,b

0.0

11

0.4

07

a,b

0.0

07

.84

1.3

60.9

50.0

00.0

71.3

50.6

280

750

Eas

yH

igh

.530

0.0

49

0.0

04

0.8

42

a,b

0.0

12

0.3

88

a,b

0.0

07

.91

1.3

50.9

50.0

00.0

81.3

80.6

281

750

Moder

ate

Low

.315

0.0

49

0.0

06

0.0

47

0.0

06

0.0

49

0.0

06

.68

0.5

60.9

50.1

50.9

00.0

10.0

082

750

Moder

ate

Low

.330

0.0

49

0.0

06

0.0

45

0.0

07

0.0

46

0.0

06

.81

0.5

60.9

50.2

30.8

90.0

10.0

083

750

Moder

ate

Low

.515

0.0

49

0.0

04

0.0

43

0.0

05

0.0

47

0.0

05

.68

0.5

60.9

50.6

10.9

30.0

10.0

084

750

Moder

ate

Low

.530

0.0

49

0.0

04

0.0

46

0.0

05

0.0

42

0.0

06

.81

0.5

60.9

50.7

40.9

40.0

10.0

085

750

Moder

ate

Hig

h.3

15

0.0

49

0.0

06

0.0

43

0.0

06

0.0

49

0.0

06

.88

0.3

90.9

50.0

00.7

00.0

20.0

086

750

Moder

ate

Hig

h.3

30

0.0

49

0.0

06

0.0

41

0.0

06

0.0

42

0.0

06

.93

0.3

90.9

50.0

00.7

20.0

20.0

087

750

Moder

ate

Hig

h.5

15

0.0

49

0.0

04

0.0

45

0.0

05

0.0

54

0.0

04

.88

0.3

90.9

50.3

20.9

20.0

20.0

088

750

Moder

ate

Hig

h.5

30

0.0

49

0.0

04

0.0

40

0.0

04

0.0

47

0.0

05

.93

0.3

90.9

50.4

60.9

20.0

20.0

089

750

Diff

icult

Low

.315

0.0

48

0.0

06

0.0

80

a,b

0.0

08

0.0

59

0.0

07

.64

0.6

80.9

50.0

00.1

20.6

70.2

690

750

Diff

icult

Low

.330

0.0

49

0.0

06

0.0

94

a,b

0.0

07

0.0

41

0.0

07

.78

0.6

70.9

50.0

00.1

20.6

90.2

691

750

Diff

icult

Low

.515

0.0

49

0.0

04

0.1

80

a,b

0.0

07

0.0

94

a,b

0.0

07

.64

0.6

80.9

50.0

00.5

00.6

70.2

692

750

Diff

icult

Low

.530

0.0

49

0.0

04

0.3

15

a,b

0.0

07

0.0

76

a,b

0.0

06

.78

0.6

70.9

50.0

00.5

10.6

90.2

793

750

Diff

icult

Hig

h.3

15

0.0

49

0.0

06

0.1

99

a,b

0.0

09

0.1

06

a,b

0.0

08

.84

1.4

60.9

50.0

00.0

01.4

00.6

694

750

Diff

icult

Hig

h.3

30

0.0

49

0.0

06

0.2

36

a,b

0.0

09

0.1

07

a,b

0.0

08

.91

1.4

60.9

50.0

00.0

01.4

30.6

595

750

Diff

icult

Hig

h.5

15

0.0

49

0.0

04

0.7

34

a,b

0.0

12

0.4

36

a,b

0.0

08

.84

1.4

60.9

50.0

00.0

61.4

00.6

696

750

Diff

icult

Hig

h.5

30

0.0

49

0.0

04

0.8

49

a,b

0.0

13

0.4

04

a,b

0.0

08

.91

1.4

60.9

50.0

00.0

51.4

30.6

5

Note

:c

=co

nditio

n;n

=num

ber

ofin

div

idua

ls;b

i,j2

1=

item

cate

gory

diff

iculty

dis

trib

ution,Eas

y(a

sses

smen

tin

appro

pri

aten

ess)

=N

(21.5

,0.5

),M

oder

ate

(ass

essm

ent

appro

pri

aten

ess)

=N

(0,0.5

),D

iffic

ult

(ass

essm

ent

inap

pro

pri

aten

ess)

=N

(1.5

,0.5

);a i

=item

dis

crim

inat

ion

dis

trib

ution,Lo

w=

U(.31,

.58),

Hig

h=

U(.58,1

.13);

b=

regr

essi

on

wei

ght;

k=

num

ber

ofitem

s;p

u=

empir

ical

Type

Ier

ror

rate

for

actu

alth

eta

score

s;D

R2 u

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

actu

alth

eta

score

s;p

x=

empir

ical

Type

Ier

ror

rate

for

num

ber-

corr

ect

score

s;D

R2 x

=av

erag

eef

fect

size

for

sign

ifica

nt

inte

ract

ions

for

num

ber-

corr

ect

score

s;p

u=

empir

ical

Type

Ier

ror

rate

for

estim

ated

thet

asc

ore

s;D

R2 u

=

aver

age

effe

ctsi

zefo

rsi

gnifi

cant

inte

ract

ions

for

estim

ated

thet

asc

ore

s;a

=av

erag

ein

tern

alco

nsis

tency

for

the

num

ber

-corr

ect

score

s;R

MSE

=ro

ot

mea

nsq

uar

eer

ror

for

the

estim

ated

thet

asc

ore

s;SW

u=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

actu

alth

eta

score

s;SW

x=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

num

ber

-corr

ect

score

s;SW

u=

pro

port

ion

ofn.s

.Shap

iro–W

ilkte

sts

for

the

estim

ated

thet

asc

ore

s;sk

x=|s

kew

nes

s|fo

rth

enum

ber

-corr

ect

score

s(a

bs.

valu

e);s

ku

=|s

kew

nes

s|fo

rth

ees

tim

ated

thet

asc

ore

s.

Iter

atio

ns

per

conditio

n=

1,0

00.

a Sign

ifica

nt

Type

IErr

or

rate

bas

edon

the

resu

lts

ofa

bin

om

ialte

st.

bSi

gnifi

cant

Type

IErr

or

rate

bas

edon

the

resu

lts

ofth

eal

pha

+/-

.5al

pha

criter

ion.

135

Page 15: Morse et al 2012

Figure 3. Empirical Type I error rates for the interaction term of a simulated moderated multipleregression model under conditions of assessment appropriateness

Figure 2. Distribution of spurious interactions for number-correct scores and estimated theta scores

136 Applied Psychological Measurement 36(2)

Page 16: Morse et al 2012

from inappropriate assessments were used, x2(1, N = 96,000) = 5,008.55, p \ .001, odds ratio =

8.13. In addition, estimated theta scores resulted in empirical Type I error rates that were above

the acceptable interval in 33 of the 64 (51%) inappropriate assessment conditions. A direct

logistic regression analysis indicated that the likelihood of committing a Type I error was 2.4

times greater when estimated theta scores from inappropriate assessments were used, x2(1, N =

96,000) = 918.15, p \ .001, odds ratio = 2.40.

Impact of the Independent Variables on the Empirical Type I Error Rate

Table 5 represents the mean empirical Type I error rate as well as direct logistic regression tests

for the levels of each independent variable. The dependent variable in the logistic regression

analyses was the occurrence of a Type I error and the iteration level, and was coded as a 1 if the

DR2 between the additive and multiplicative model was significant (p \ .05) or a 0 if it was not

significant. All of the independent variables were entered into the model simultaneously as cate-

gorical predictors. A general pattern can be identified in these results such that higher empirical

Type I error rates were observed for the stronger level of each independent variable. This pat-

tern would indicate that each psychometric characteristic that was varied in the simulations had

an overall effect on the empirical Type I error rates for the interaction term.

Several important findings can be identified from these results. First, the psychometric char-

acteristics that were manipulated in this simulation had a stronger overall effect on Type I

errors when the variables were operationalized as number-correct scores when compared with

estimated theta scores. These results suggest that number-correct scores are more sensitive to

Figure 4. Empirical Type I error rates for the interaction term of a simulated moderated multipleregression model under conditions of assessment inappropriateness

Morse et al. 137

Page 17: Morse et al 2012

Tab

le5.

Impac

tofIn

div

idual

Pre

dic

tors

on

Em

pir

ical

Type

IErr

or

Rat

es

MD

irec

tlo

gist

icre

gres

sion

for

num

ber

-corr

ect

score

Type

Ier

rors

aD

irec

tlo

gist

icre

gres

sion

for

estim

ated

thet

asc

ore

Type

Ier

rors

b

pu

px

pu

Wal

dc2

dfB

OR

cW

ald

x2

dfB

OR

c

Appro

pri

aten

ess

(diff

iculty)

Appro

pri

ate

0.0

60.0

40.0

55,0

08.5

5***

12.0

96

8.1

3918.1

5***

10.8

75

2.4

0In

appro

pri

ate

0.0

60.2

40.1

1D

iscr

imin

atio

nLo

w0.0

60.1

00.0

64,4

37.1

9***

11.3

39

3.8

21,1

85.0

1***

10.8

36

2.3

0H

igh

0.0

60.2

50.1

2Sa

mple

size

250

0.0

60.1

60.0

9154.0

9***

10.2

34

1.2

69.9

3***

10.0

73

1.0

8750

0.0

60.1

90.0

9Fi

del

ity

Norm

al0.0

60.1

60.0

7165.4

2***

10.2

42

1.2

7364.7

7***

10.4

46

1.5

6H

igh

0.0

60.1

90.1

1It

ems

15

0.0

60.1

60.0

9154.0

9***

10.2

34

1.2

69.9

3***

10.0

73

1.0

830

0.0

60.1

90.0

9Bet

aw

eigh

ts.3

0.0

60.0

90.0

64,8

76.1

3***

11.4

17

4.1

2881.9

8***

10.7

10

2.0

3.5

0.0

60.2

60.1

2

Note

:O

R=

odds

ratio.

a Om

nib

usfu

llm

odel

,x

2(1

,N

=96,0

00)

=17,1

57.5

1,p

\.0

01,R

2=

.27.

bO

mnib

usfu

llm

odel

,x

2(1

,N

=96,

000)

=3,5

71.4

7,p

\.0

01,R

2=

.08.

c Inea

chca

se,

the

OR

report

edco

rres

pond

sto

incr

ease

sin

the

pre

dic

tor

vari

able

(e.g

.,in

crea

sed

asse

ssm

ent

inap

pro

pri

aten

ess

resu

lts

inhig

her

likel

ihoods

of

Type

Ier

rors

,

incr

ease

sin

dis

crim

inat

ion

resu

lts

inhig

her

likel

ihoods

ofTy

peIer

rors

,et

c.).

***p

\.0

01.

138

Page 18: Morse et al 2012

measurement effects in parametric analyses than are IRT-derived theta estimates. For both

dependent variables, assessment appropriateness was the most impactful predictor of Type I

errors, followed by item discrimination and regression weights. This result confirms and

extends the effects of assessment appropriateness identified by Kang and Waller (2005), as well

as arguments raised by Busemeyer (1980) on the role of assessment difficulty in parametric

statistics.

Strength of Spurious Interaction Effects

Finally, the authors were interested in understanding how assessment appropriateness affected

the strength of spurious interactions for the different scoring methods. The columns labeled with

DR2 for each respective scoring method in Tables 1 through 4 indicate the average strength of

the interaction when a spurious interaction was identified. Because sample size is known to

affect the strength of interaction effects, the authors used a multivariate analysis of covariance

(MANCOVA) to determine the effect of assessment appropriateness using sample size as a cov-

ariate. After adjusting for the effects of sample size, the results indicated a significant effect of

assessment appropriateness on the strength of spurious interaction effects for number-correct

scores F(1, 93) = 51.92, p \ .001, partial h2 = .36 such that the average interaction strength in

the inappropriate assessment conditions (M = 0.015, SD = 0.007) was significantly greater than

the appropriate assessment conditions (M = 0.011, SD = 0.006). A similar, albeit much weaker,

result was also identified for estimated theta scores F(1, 93) = 4.47, p \ .05, partial h2 = .05

such that the average interaction strength in the inappropriate assessment conditions (M = 0.013,

SD = 0.006) was significantly greater than the appropriate assessment conditions (M = 0.012,

SD = 0.006). No significant difference was identified for actual theta scores. These results indi-

cate that assessment appropriateness has an effect on the strength of spurious interaction effects

for number-correct scores and estimated theta scores and that the effect is considerably stronger

in the number-correct score conditions.

Discussion

Theoretical and empirical evidence has emerged to suggest that using IRT to operationalize an

individual’s standing on a latent construct has important measurement implications over the use

of number-correct scores (Borsboom, 2008; Embretson, 1996, 2006; Embretson & DeBoeck,

1994; Harwell & Gatti, 2001; Kang & Waller, 2005; Perline et al., 1979; Reise & Haviland,

2005; Reise et al., 2005; Wainer, 1982). Specifically, IRT-derived theta scores have been

demonstrated to be resistant to inflated Type I error rates in moderated statistical models poten-

tially due to achieving an interval, or nearly interval, scale of measurement (Embretson, 1996;

Kang & Waller, 2005). This previous work has been limited to applications of dichotomous

data and restrictive IRT models. Therefore, the authors’ goal was to extend their understanding

of these potentially beneficial measurement properties by modeling multicategory data and

implementing a polytomous IRT model. These studies represent a generalizability trend such

that each successive study branches further away from the measurement ideal and into the reali-

ties of psychological data.

It is imperative to point out that under certain conditions, significantly inflated Type I error

rates were observed for both the estimated theta scores and the number-correct scores. This

result for the number-correct scores was expected; however, this result for the estimated theta

scores was somewhat unexpected. Strong psychometric influences still resulted in inflated

Type I error rates for the estimated theta scores from the GRM. However, it was often the case

that the Type I error rate of the number-correct scores far exceeded that of the estimated theta

Morse et al. 139

Page 19: Morse et al 2012

scores. For example, in conditions 7, 8, 23, and 24 in Table 1, the Type I error rates for the

number-correct scores ranged from .296 to .386 whereas the respective Type I error rates for

the estimated theta scores ranged from .089 to .120. Clearly, given the alternative, the estimated

theta scores would be more attractive to researchers in these conditions. In addition, in cases

where the Type I error rates were grossly inflated, such as conditions 31, 32, 47, and 48 in

Table 2 and 79, 80, 95, and 96 in Table 4, the Type I error rate for the number-correct scores

was approximately 200% to 450% higher than the Type I error rate for the estimated theta

scores. An examination of Figure 2 clearly reveals these disparities between the number-correct

scores and estimated theta scores. Although the estimated theta scores did not perform perfectly

within the acceptable limits, these results demonstrate a clear preference for their use in applied

research when certain psychometric conditions exist.

Another prominent result was that of the role of assessment appropriateness on Type I error

rates for both number-correct scores and estimated theta scores. Assessment appropriateness is

defined as the congruence between the reliability of an assessment and the latent construct dis-

tribution of the individuals responding to an assessment (see Figure 1). The results of these

simulations demonstrated that, under conditions of assessment appropriateness, there is no con-

cern as to unacceptable Type I error rates for any psychometric condition or scoring technique.

The complimentary result, that spurious interactions were only observed under some conditions

of assessment inappropriateness, was also true. Embretson (1996) and Kang and Waller (2005)

also identified assessment appropriateness and inappropriateness as the primary factor in their

simulations of spurious interaction effects. Embretson (1996) determined that the degree and

direction of the inappropriateness fully accounted for the nature of the interaction with regard to

treatment groups in a simulated factorial ANOVA. Prior to these studies, Maxwell and Delaney

(1985) demonstrated how various distributional shapes of latent constructs can interact with

assessment difficulty (appropriateness) to result in artificial group mean differences when the

observed scores and latent scores were related through a nonlinear, monotonic relationship.

However, the expectation that this effect would not occur when using estimated theta scores

was not fully supported. This finding suggests that certain conditions may reduce the level of

linearity in the theta–estimated theta relationship. Specifically, an examination of the root mean

square error (RMSE) values in these conditions indicates less precise parameter recovery. This

would suggest that the degree of congruence between the actual and estimated theta scores was

being eroded in exceedingly easy or difficult assessments. Ferrando (2009) explored the issue of

measurement inappropriateness in which the appropriateness of an item can be constrained to a

particular range of the trait continuum, thus degrading the fit of a linear model above and below

the ceiling and floor limits. Indices for determining this range are provided for binary items in a

unidimensional, two-parameter model as well as in multidimensional cases and can provide use-

ful information for determining an acceptable range of measurement appropriateness.

These results also suggest a more complex relationship underlying data structures and the

assessment of moderators in MMR analyses than perhaps previously thought. Paunonen and

Jackson (1988) conducted a simulation in which Type I error rates were compared between ordi-

nary least squares regression (OLS) and principal components regression (PCR) in relation to

the multicollinearity of the predictors. Their results indicated that OLS performed much better

than did PCR with regard to accurate moderator detection, and that linear transformations of the

data had little effect on the Type I error rates for either procedure. In their study, the researchers

simulated random effects data from normal distributions just as in this study, but did not investi-

gate any influences of psychometric characteristics on the data (i.e., difficulty, discrimination,

assessment appropriateness, etc.). Conceptually, Paunonen and Jackson (1988) generated data

as if they were able to collect actual theta scores. It is not surprising, therefore, that their results

were well within the normal Type I error rate for MMR. An examination of the empirical

140 Applied Psychological Measurement 36(2)

Page 20: Morse et al 2012

Type I error rates for the actual theta scores in Tables 1 through 4 replicate these findings. The

generalizability of the results from Paunonen and Jackson’s simulation are therefore limited

because actual latent trait scores are unobservable (directly). The results of this study indicate

that psychometric characteristics such as assessment appropriateness and the operationalization

of the scores have a significant influence on the performance of MMR analyses.

Limitations and Extensions

Violations of the normality assumption in MMR models were present in these simulations,

lending some caution to the interpretation of the results. Specifically, skewness was observed in

the simulated dependent variables, and the magnitude of the skew was positively related to the

empirical Type I error rate for both number-correct scores and estimated theta scores. In addi-

tion, the results of the Shapiro–Wilk tests indicated that cases of residual nonnormality also

increased with the empirical Type I error rate for both number-correct scores and estimated

theta scores (see Tables 1-4). To examine these results more closely, score distributions were

generated for three separate conditions in which the skew of the scores appeared to be strongly

related to their respective Type I Error rate. Figure 5 contains distributions for the actual theta

scores and the derived scores (estimated theta and number-correct scores) from conditions 12,

44, and 96. These conditions represent cases in which number-correct scores and theta scores

performed equally well (condition 12), number-correct scores performed poorly but theta scores

performed well (condition 44), and neither number-correct scores nor theta scores performed

well (condition 96). The commonality that can be observed here is that the poor performance is

clearly aligned with the nonnormal score distributions, and the greatest amount of nonnormality

is observed under conditions of assessment inappropriateness in which the latent construct dis-

tribution is poorly matched with the test information function (reliability) of the assessment.

These findings create a potential confound such that the empirical Type I error rates may be

partially related to these violations of normality, which were observed as a result of the simula-

tions rather than a controlled factor. Kang and Waller (2005) identified a similar pattern in their

study involving dichotomous data, and in a brief investigation, the researchers were able to cor-

rect some of the effects of nonnormality using the Box-Cox transformation. Specifically, moder-

ate empirical Type I error rates responded well to this correction, but higher rates were not fully

mitigated to the nominal level of .05 (Kang & Waller, 2005).

It is conceivable that variations in observed nonnormality are a result of variations in nonli-

nearity between the overall scale (test) response function and the latent trait. A preliminary

investigation into scatterplots of individual iterations confirmed this relationship (results avail-

able on request). Thus, these relationships could be used as an indicator of a risk for spurious

interactions.2 An additional indication of this effect is the pattern of parameter recovery values

(RMSE) reported in Tables 1 through 4. As nonnormality increased (and in inappropriate assess-

ment conditions), the IRT parameter estimation of the theta scores worsened.

Finally, it is pertinent to discuss the inherent arbitrariness of the scaling metrics and the use

of the GRM to generate the response score matrices. The actual theta scores were generated

without a significant interaction and derived scores were tested for a significant interaction.

Although this practice is common in measurement simulations involving the translation from

IRT scores to other scoring methods (Embretson, 1996; Harwell et al., 1996; Kang & Waller,

2005), an argument can be made that the authors are favoring the GRM scores as a nonarbitrary

trait scale and implicitly satisfying the assumptions of the IRT scale but not the number-correct

score scale. A limiting factor that may arise here is that the results only extend to empirical

situations in which the theta scale is regarded as the true metric. This limitation highlights the

Morse et al. 141

Page 21: Morse et al 2012

possibility that the interaction null hypothesis is scale specific and the findings may not gener-

alize to situations in which another metric is considered to be the true measurement scale.3

Conclusion

Overall, the results of this study provide support for the use of IRT for the operationalization of

latent constructs. However, this support is not ubiquitous and the direct utility of IRT-derived

theta scores for parametric statistics is limited to suboptimal psychometric conditions (i.e.,

assessment inappropriateness). These conditions resulted in ceiling and floor effects, which can

be identified in preliminary data analyses; however, some evidence indicates that extreme cases

will not be fully corrected with typical score transformations.

If a researcher can demonstrate that the reliability of the assessment and the distribution of

construct scores for the individuals are reasonably matched, there is no evidence here that the

Type I error rate will reach an unacceptable level for any scoring technique. However,

Figure 5. Score distributions for conditions 12, 44, and 96

142 Applied Psychological Measurement 36(2)

Page 22: Morse et al 2012

Embretson (1996) cautioned that these conditions are difficult to anticipate, and the use of theta

estimates could be justified as a default measurement technique to assuage any concerns.

Furthermore, De Boeck and Wilson (2004) suggested that experimental hypotheses can be

tested within an appropriate IRT model. This integration of measurement and experimental test-

ing could provide another avenue for greater decision-making accuracy.

The investigation of the performance of polytomous IRT models in a variety of contexts is

still an important avenue of measurement research. Harwell and Gatti (2001) specifically iden-

tified a need for a deeper understanding of the properties of the scores generated from complex

models such as the GRM as well as various item and scale properties that can influence those

scores. The simulations conducted in this study represent an important step forward in filling

these needs.

Appendix

The error variance term can be derived in the following manner. First, the predictor variables u1

and u2 and the criterion variable u3 are normally distributed with a mean and standard deviation

equal to N(0.00, 1.00). Because the standard deviation is simply the square root of the variance,

the variance of the predictor and criterion variables is equal to one. Given these conditions, the

following derivation gives the error term for the regression models:

s2u3

= b21s2

u1+ b2

1s2u2

+ s2e ; ðA1aÞ

where s2u1

, s2u2

, and s2u3

are all equal to 1.00; therefore,

1 = b21 + b2

1 + s2e ; ðA1bÞ

1� b21 � b2

1 = s2e ; ðA1cÞ

1� b21 + b2

1

� �¼ s2

e ; ðA1dÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1� b2

1 + b21

� �q= se: ðA1eÞ

Finally, given that the two levels of regression weights b1 and b2 were simulated as .3 and

.5, Equation 6 can be further reduced to the following form.

u3 = :5 u1ð Þ+ :5 u2ð Þ+ :7071 eð Þ; ðA2aÞ

u3 = :3 u1ð Þ+ :3 u2ð Þ + :9055 eð Þ: ðA2bÞ

In the simulation, the criterion variable u3 was operationalized with Equations A2a and A2b

for the two levels of the regression weights, .3 and .5, respectively. An alternative way of speci-

fying this derivation would be to say that the error associated with each regression model is

being sampled from a N(0.00, 0.7071) and N(0.00, 0.9055) distribution for each level of b.

Authors’ Note

The data presented in this study were generated as a part of the first author’s doctoral dissertation.

Morse et al. 143

Page 23: Morse et al 2012

Acknowledgments

The authors would like to acknowledge Niels Waller for his assistance with the R programming code for

portions of the simulation. We would additionally like to thank Dr. Paula Popovich, Dr. Jeff Vancouver,

and Dr. Victor Heh for their comments on a previous version of this manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or pub-

lication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. Readers interested in the basic components of IRT models are referred to Embretson and Reise (2000).

2. We would like to thank an anonymous reviewer for calling our attention to this point.

3. We would like to thank Mark Davison for bringing this issue to light in his helpful role as the editor

for this article.

References

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing

moderating effects of categorical variables using multiple regression: A 30-year review. Journal of

Applied Psychology, 90, 94-107.

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand

Oaks, CA: SAGE.

Baker, F. B., & Kim, S. H. (2004). Item response theory parameter estimation techniques (2nd ed.). New

York, NY: Marcel Dekker.

Borsboom, D. (2008). Latent variable theory. Measurement, 6, 25-53.

Busemeyer, J. R. (1980). Importance of measurement theory, error theory, and experimental design for

testing the significance of interactions. Psychological Bulletin, 88, 237-244.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis

for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.

Davison, M. L., & Sharma, R. (1988). Parametric statistics and levels of measurement. Psychological

Bulletin, 104, 137-144.

Davison, M. L., & Sharma, R. (1990). Parametric statistics and levels of measurement: Factorial designs

and multiple regression. Psychological Bulletin, 107, 394-400.

De Boeck, P., & Wilson, M. (2004). A framework for item response models. In P. De Boeck & M. Wilson

(Eds.), Explanatory item response models: A generalized linear and non-linear approach (pp. 3-39).

New York, NY: Springer.

Embretson, S. E. (1996). Item response theory models and spurious interaction effects in factorial ANOVA

designs. Applied Psychological Measurement, 20, 201-212.

Embretson, S. E. (2006). The continued search for non-arbitrary metrics in psychology. American

Psychologist, 61, 50-55.

Embretson, S. E. & DeBoeck, P. (1994). Latent trait theory. In R. J. Sternberg (Ed.), Encyclopedia of

Intelligence (p. 644–647). New York, NY: MacMillan.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Earlbaum.

Ferrando, P. J. (2009). Difficulty, discrimination, and information indices in the linear factor analysis

model for continuous item responses. Applied Psychological Measurement, 33, 9-24.

144 Applied Psychological Measurement 36(2)

Page 24: Morse et al 2012

Fields, D. L. (2002). Taking the measure of work: A guide to validated scales for organizational research

and diagnosis. Thousand Oaks, CA: SAGE.

Fischer, G. H. (1995). Derivations of the Rasch model. In G. Fischer & I. Molenaar (Eds.), Rasch models:

Foundations, recent developments, and applications (pp. 15-38). New York, NY: Springer-Verlag.

Gagne, P., Furlow, C., & Ross, T. (2009). Increasing the number of replications in item response theory

simulations. Educational and Psychological Measurement, 69, 79-84.

Gardner, P. L. (1975). Scales and statistics. Review of Educational Research, 45, 43-57.

Granberg-Rademacker, J. S. (2010). An algorithm for converting ordinal scale measurement data to

interval/ratio scale. Educational and Psychological Measurement, 70, 74-90.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory.

Newbury Park, CA: SAGE.

Harwell, M., & Gatti, G. G. (2001). Rescaling ordinal data to interval data in educational research. Review

of Educational Research, 71, 105-131.

Harwell, M., Stone, C. A., Hsu, T., & Kirisci, L. (1996). Monte Carlo studies in item response theory.

Applied Psychological Measurement, 20, 101-125.

Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of

Computational and Graphical Statistics, 5, 299-314.

Kang, S., & Waller, G. (2005). Moderated multiple regression, spurious interaction effects, and IRT.

Applied Psychological Measurement, 29, 87-105.

Maxwell, S., & Delaney, H. (1985). Measurement and statistics: An examination of construct validity.

Psychological Bulletin, 97, 85-93.

McNemar, Q. (1969). Psychological statistics (4th ed.). New York, NY: Wiley.

Meade, A. W., Lautenschlager, G. J., & Johnson, E. C. (2007). A Monte Carlo examination of the

sensitivity of the differential functioning of items and tests framework for tests of measurement

invariance with Likert data. Applied Psychological Measurement, 31, 430-455.

Mislevy, R. J., & Stocking, M. L. (1989). A consumer’s guide to LOGIST and BILOG. Applied

Psychological Measurement, 13, 57-75.

Muraki, E., & Bock, R. D. (2003). PARSCALE: IRT item analysis and test scoring for rating-scale data

(Version 4.1) [Computer software]. Lincolnwood, IL: Scientific Software International.

Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: SAGE.

Paunonen, S. V., & Jackson, D. N. (1988). Type I error rates for moderated multiple regression. Journal of

Applied Psychology, 73, 569-573.

Perline, R., Wright, B. D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement.

Applied Psychological Measurement, 3, 237-255.

R Development Core Team. (2008). R: A language and environment for statistical computing. Vienna,

Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org

Reise, S. P., Ainsworth, A. T., & Haviland, M. G. (2005). Item response theory: Fundamentals,

applications, and promise in psychological research. Current Directions in Psychological Science, 14,

95-101.

Reise, S. P., & Haviland, M. G. (2005). Item response theory and the measurement of clinical change.

Journal of Personality Assessment, 84, 228-238.

Reise, S. P., & Waller, G. (2003). How many IRT parameters does it take to model psychopathology

items? Psychological Methods, 8, 164-184.

Robey, R. R., & Barcikowski, R. S. (1992). Type I error and the number of iterations in Monte Carlo

studies of robustness. British Journal of Mathematical and Statistical Psychology, 45, 283-288.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika

(Monograph Suppl. 17), 35, 139.

Samejima, F. (1996). The graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.),

Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Stine, W. W. (1989). Meaningful inference: The role of measurement in statistics. Psychological Bulletin,

105, 147-155.

Morse et al. 145

Page 25: Morse et al 2012

Stocking, M. L. (1987). Two simulated feasibility studies in computerised adaptive testing. Applied

Psychology: An International Review, 36, 263-277.

Stone, C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-parameter logistic

response model: An evaluation of MULTILOG. Applied Psychological Measurement, 16, 1-6.

Takane, Y., & De Leeuw, J. (1987). On the relationship between item response theory and factor analysis

of discretized variables. Psychometrika, 52, 392-408.

Wainer, H. (1982). Robust statistics: A survey and some prescriptions. In G. Keren (Ed.), Statistical and

methodological issues in psychology and social sciences research (pp. 187-214). Hillsdale, NJ:

Erlbaum.

Witt, L. A. (1998). Enhancing organizational goal congruence: A solution to organizational politics.

Journal of Applied Psychology, 83, 666-674.

146 Applied Psychological Measurement 36(2)