Received: 01-03-2023 | Approved: 16-03-2023
Satyendra Chakrabartty, Indian Statistical Institute, Indian Maritime University, Indian Ports Association (chakrabarttysatyendra3139@gmail.com)
How to Cite:
Chakrabartty, S. (2023). Speed and Power Indices of tests and items. [RMd] RevistaMultidisciplinar, 5(1), 219-232. https://doi.org/10.23882/rmd.23136
Abstract:
A simple index is proposed to measure the speed or power component of a test. The index is independent of the position of the items, provides a necessary and sufficient condition for a pure speed test and a pure power test, and enables testing of a statistical hypothesis to infer whether the test can be taken as a speed test or a power test. A similar index is also proposed for each item, reflecting whether the item is a speed item or a power item. The proposed index C is a ratio such that $C = 0 \Leftrightarrow$ pure power test and $C = 1 \Leftrightarrow$ pure speed test, facilitating computation of a similar index for each item and a statistical test of significance. Properties of the index are discussed, and an operational method is outlined to modify a test into a speed or power test. Items can be ranked with respect to the item-wise index. Identification of power items and speed items helps to modify the test into a speed or power test by deleting items in stages, if speediness (or power) is not intended. The relationship between the index for the test and the item-wise indices is derived.
Keywords: Error scores, Unattempted items, Random guessing, Speed test, Power test.
1. Introduction:
Major challenges of tests relate to assessment
of “ability” for power tests and “speed” for speed tests. However, ability and
speed jointly affect response behavior in tests (Partchev
et al., 2013; Van der Linden, 2009). Primary sources of individual differences
in speed tests and power tests are speed of response or speed
of information processing (SIP) and
accuracy of response. Abilities measured by a test under speeded conditions differ from those measured under unspeeded conditions (Lord, 1956). Van der Linden (2009) found low or even negative
correlations between accuracy and
response time across persons. Speed may be manifested in the form of random guessing, unattempted items, inattentiveness, etc. Subjects taking a test may need to balance accuracy against time to maximize their scores. Speeded responses do not
depend solely on a test taker's ability and are therefore not appropriate for
traditional item response theory (Cintron, 2021). However, an
inattentive response is broader than a pure random response (Meade & Craig,
2012).
Methods of analysis like item analysis, reliability, validity,
etc. and interpretation of scores are different for these two types of tests. For example, split-half
reliabilities are erroneously high for speed tests and may be taken as an upper
bound for the reliability coefficient (Gulliksen, 1950). Substantial degrees of speededness tend to lower the estimated validity of tests (Lu & Sireci, 2007). Reliability and validity of speed tests are influenced by the speededness component, since the variance of a speed test is not due to the mental ability of interest. Problems are aggravated because most tests combine speed and power in unknown proportions, which makes development of appropriate theorems in test theory more difficult than for pure-type tests (Gulliksen, 1950).
Thus, the question arises of how to quantify the speed and power components of a test and its items. Constructs of ability and speed are common
primarily in cognitive domains. However, response times are also considered in
non-cognitive domains like personality, attitudes, etc. (Ferrando
& Lorenzo-Seva, 2007; Ranger & Kuhn, 2012). Attempts have been made to isolate the speed component that is unrelated to the trait of interest in speeded tests, both using external information like response times and without using any external information (see Lu & Sireci, 2007).
Based on Stafford's (1971) Speededness Quotient (SQ) for items, Estrada et al. (2017) separated speed and power components of tests of mental ability without considering other information like response times; a rule of thumb was suggested for identifying items affected by speediness.
A need is felt to derive measures reflecting the degree of speed and power of a test. This paper proposes an index C as a ratio such that $C = 0 \Leftrightarrow$ pure power test and $C = 1 \Leftrightarrow$ pure speed test, facilitating computation of a similar index for each item and a statistical test of significance. Properties of the index are discussed, and an operational method is outlined to modify a test into a speed or power test.
2. Literature survey:
2.1 Definitions
and Important terms:
In a speed test, items are so easy that if a subject attempts an item, he/she gets it correct. However, due to the large number of items and insufficient time, nobody can finish the test within the specified time limit. The time limit of a power test is chosen so that each subject gets the opportunity to attempt all the items, but some items are so difficult that not every subject can answer each and every item of the test correctly. Thus, in a speed test, score differences reflect variations in speed of response, while in a power test, score differences indicate variations in accuracy of responses.
Different types of error scores in the context of speed–power issues are:
Wrong answers (W) refer to the items which a subject failed to answer correctly. Unattempted items (U) comprise omitted items (items a subject decided not to answer after reading them – primarily in power tests) and not-reached items (items not attempted due to insufficient time – primarily in speed tests).
Thus, the error score E is given by

$$E = W + U \quad (1)$$

It may be noted that subjects can answer the items in any sequence, say from the end or by skipping alternate items. Thus, it is not justified to treat not-reached items as those at the end of a test. For practical purposes, unattempted items are items not endorsed by the subjects.
2.2
Measures of Speed and Power:
Attempts have been made to measure speededness through two administrations of a test, with and without time limits. Using two such administrations, Cronbach and Warrington (1951) suggested a measure denoted as tau ($\tau$) in terms of correlations between test scores, corrected for attenuation. However, tau does not consider the difference of scores under the speed and power administrations. Strictly speaking, the two versions of the test under speed and power conditions may not be parallel, since the mean and variance are likely to differ between the two versions. In other words, if Version 1 ($v_1$) and Version 2 ($v_2$) are parallel, at least the following two conditions need to be satisfied:
- Mean of $(X_{v_1} - X_{v_2}) = 0$
- $S^2_{X_{v_1}} = S^2_{X_{v_2}}$
From a single administration of a test, and denoting the standard deviations of the U-scores, W-scores and error scores respectively by $S_U$, $S_W$ and $S_E$, Gulliksen (1950) proposed the following two inequalities:

For power tests: $1 + \frac{S_U}{S_W} > \frac{S_E}{S_W} > 1 - \frac{S_U}{S_W}$ (2)

For speed tests: $1 + \frac{S_W}{S_U} > \frac{S_E}{S_U} > 1 - \frac{S_W}{S_U}$ (3)
However, Rindler (1979) showed the difficulty in interpreting the contribution of speed as per Gulliksen's inequalities when $S_U$ or $S_W$ are large. Focusing on the proportion of total errors instead of the proportion of test variance, Stafford (1971) proposed the Speededness Quotient (SQ), discussed further in Section 4.
IRT-based approaches involving a set of assumptions like unidimensionality, local independence, etc. have been adopted to estimate speededness from a single administration of tests. Hambleton et al. (1991) considered the 3-PL IRT model defined as

$$P_j(\theta) = c_j + (1 - c_j)\,\frac{e^{a_j(\theta - b_j)}}{1 + e^{a_j(\theta - b_j)}}$$

where
$P_j(\theta)$ = probability of a correct answer to the j-th item by a subject with ability $\theta$,
$a_j$ = item discrimination,
$b_j$ = item difficulty value, and
$c_j$ = pseudo-guessing parameter.
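As a minimal illustration, the function below implements the logistic form of the 3-PL response probability given above; the parameter values in the example call are arbitrary.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """3-PL probability of a correct response: c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Example: a moderately discriminating item with a guessing floor of 0.2
print(p_correct_3pl(theta=0.0, a=1.2, b=-0.5, c=0.2))
```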
Bejar (1985) proposed an item-level index and an examinee-level index, making further assumptions. But the values of both indices may vary depending on other sources of error confounded with the effect of speededness, and interpretations of the indices are difficult (Lu & Sireci, 2007).
The effect of random guessing due to speededness has given contrasting results. With a small amount of random guessing due to speededness, Attali (2005) found largely attenuated inter-item correlations and attenuated Cronbach's alpha, but a large amount of random guessing due to low motivation could result in inflated Cronbach's alpha (Wise & DeMars, 2009). A major reason for the different conclusions could be the use of real data by the latter and conclusions based on analytical derivations and simulations by the former. Other factors, like pooled samples, can potentially inflate reliability (Flinn et al., 2015). Random responses are independent of item content and the latent trait of the respondent and may arise due to speededness, low motivation, inattentiveness, and tendencies of respondents to rush to maximize the number of attempted items.
Models
for response times differ in approaches, assumptions, statistical distributions
considered, complexities and findings. Different statistical distributions were
used in different models, viz. log-normal (Van der Linden, 2007), gamma (Maris, 1993), Weibull (Rouder et al., 2003), and the Box–Cox transformation to approximate almost any distribution (Klein Entink et al., 2009).
The impact of speeded responses on item–total correlations and Cronbach's alpha was studied by Hong and Cheng (2019) with two types of manifestations of test speededness, i.e., random guessing versus reduced ability. They found that inter-item correlations may inflate or deflate in different cases depending on the combinations of item parameters, and that the mean Cronbach's alpha rarely increases under simulations using real test parameters, even with different manifestations of speededness. Thus, an inflated Cronbach's alpha may be an artifact of a sample and not a population behavior. However, there are other manifestations of speededness that give rise to insufficient effort responding (IER) in surveys (Huang et al., 2012). Despite the issue of inflated or deflated inter-item correlations, factor analysis of SAT data was undertaken (CEEB, 1984), which found that factors attributable to speed accounted for about 5% to 10% of the variance of test scores.
IRT, with its flexibility in choosing the data collection plan, offers important advantages. However, the conceptually and procedurally complex IRT is based on strong assumptions, the satisfaction of which needs to be tested. For example, IRT assumes that the probability of an examinee answering an item correctly does not depend on whether the item is placed at the beginning, in the middle, or at the end of the test. Moreover, the probability of hitting the correct answer by guessing alone cannot be determined by the usual IRT model.
3. Proposed method:
For the i-th subject, let $E_i$ be the total error score, which is the sum of $W_i$ (number of wrong answers) and $U_i$ (number of unattempted items, i.e., not-reached items + omitted items). From equation (1), $E_i = W_i + U_i$.

If the test consisting of K items is administered to n subjects, the mean error score equals the sum of the means of the W-scores and U-scores, i.e.,

$$\bar{E} = \bar{W} + \bar{U} \quad (4)$$

and

$$S_E^2 = S_W^2 + S_U^2 + 2\,r_{WU}\,S_W S_U \quad (5)$$

An index is conceptualized to measure the degree of power and degree of speed as

$$C = \frac{\bar{U}}{K} \quad (6)$$

where the maximum value of $\bar{U}$, namely K, is obtained when everybody fails to attempt even a single item, i.e., $U_i = K$ for each i. Thus, C is a ratio lying between zero and one for a general test.

For a pure power test, $U_i = 0$ for each subject, so $\bar{U} = 0$. This implies $C = 0$ for a pure power test. Conversely, $C = 0 \Rightarrow \bar{U} = 0 \Rightarrow U_i = 0$ for each i, i.e., the test is a pure power test.

Following similar logic, it can be proved that $C = 1$ for a pure speed test, and $C = 1 \Rightarrow \bar{U} = K$, i.e., the test is a pure speed test. Thus, pure power test $\Leftrightarrow C = 0$ and pure speed test $\Leftrightarrow C = 1$. In other words, the necessary and sufficient condition for a pure power test is $C = 0$, and the same for a pure speed test is $C = 1$. In practice, one may not always get a power test for which $C = 0$; one can make a statistical test to see whether the obtained value of C is significantly different from zero, i.e., testing $H_0: C = 0$. Alternately, the obtained value of C may be taken as a measure of departure from the pure power position.
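A minimal computational sketch of the index follows. The U-scores are hypothetical, and since no specific test statistic is prescribed here for $H_0: C = 0$, the bootstrap percentile interval below is only one plausible way to judge departure from zero.

```python
import numpy as np

rng = np.random.default_rng(42)

K = 40                                   # number of items (assumed)
U = np.array([0, 1, 0, 2, 0, 0, 1, 3])   # hypothetical unattempted counts U_i

C = U.mean() / K                         # index (6): C = U-bar / K
print("C =", C)

# Bootstrap percentile interval for C (one reasonable choice of
# significance check; the exact test is left open in the text).
boot = np.array([rng.choice(U, size=U.size, replace=True).mean() / K
                 for _ in range(10_000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% interval for C: [{lo:.4f}, {hi:.4f}]")  # 0 inside => compatible with a power test
```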
3.1 Pure Power test:
A
pure power test can be formally defined as follows:
Definition 1: A test X is said to be a pure power test if and only if the index C as defined in (6) is zero for the test.

For all practical purposes, a test X can be considered a power test if the index C is not significantly different from zero. Rejection of $H_0: C = 0$ implies that the test cannot be regarded as a pure power test.
The proposed index helps to improve the criterion for a power test given by Gulliksen (1950) by using the following theorem:

Theorem 1. Let $Y_1, Y_2, \ldots, Y_n$ be n independent observations of a variable Y such that $Y_i \geq 0$. If $\bar{Y}$ is close to zero ($\bar{Y} \to 0$), then $S_Y \to 0$.

Proof: If each $Y_i = 0$, the theorem is trivially true. Assume that not all $Y_i$ are zero. Call $\bar{Y} = \epsilon$, where $\epsilon$ is a small positive number.

Then,

$$nS_Y^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 \leq n\epsilon \sum_{i=1}^{n} Y_i - n\epsilon^2 = n^2\epsilon^2 - n\epsilon^2$$

since each $Y_i \leq \sum_{i=1}^{n} Y_i = n\epsilon$. Hence $S_Y^2 \leq (n-1)\epsilon^2 \to 0$ as $\epsilon \to 0$.

Remarks: The converse of the theorem is not true, since if each observation is equal to a large number (say, $Y_i = M > 0$ for all i), then $S_Y = 0$ but $\bar{Y} = M \neq 0$.
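A quick numerical check of the theorem's bound $S_Y^2 \leq (n-1)\bar{Y}^2$ and of the failed converse; all values are arbitrary illustrative data.

```python
import numpy as np

# Non-negative observations with a small mean: the variance bound
# S_Y^2 <= (n - 1) * Y-bar^2 from the proof holds.
Y = np.array([0.0, 0.02, 0.0, 0.01, 0.0])
n = Y.size
print(Y.var(), "<=", (n - 1) * Y.mean() ** 2)   # 6.4e-05 <= 1.44e-04

# Converse fails: constant large observations give S_Y = 0 but Y-bar != 0.
Z = np.full(5, 100.0)
print("S_Z =", Z.std(), " Z-bar =", Z.mean())
```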
3.2 Improving Gulliksen’s
criteria for power test:
Gulliksen's criterion for a power test is

$$1 + \frac{S_U}{S_W} > \frac{S_E}{S_W} > 1 - \frac{S_U}{S_W} \quad (7)$$

For a pure power test, $U_i = 0$ for each i, so $\bar{U} = 0$. As per Theorem 1, $S_U \to 0$. Thus, for a pure power test $1 + \frac{S_U}{S_W} \to 1$ and $1 - \frac{S_U}{S_W} \to 1$, and the inequality (7) becomes

$$1 + \frac{S_U}{S_W} \geq \frac{S_E}{S_W} \geq 1 - \frac{S_U}{S_W} \quad (8)$$

However, the converse is not true, i.e., $S_U = 0$ does not imply U = 0. Consider a test where $U_i = m$ for each subject and m is a large positive integer less than or equal to the total number of items. Here, $S_U = 0$ but $\bar{U} = m \neq 0$, and the test is not a power test.

So, $C = 0 \Leftrightarrow$ pure power test is a more general statement than Gulliksen's criterion. In fact, C = 0 is the necessary and sufficient condition for a pure power test. If a test is moderately power-oriented (i.e., $\bar{U}$ is close to zero), then $S_U \to 0$ by the theorem, which implies $\frac{S_E}{S_W} \to 1$, and inequality (2) holds.
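A numerical version of the counterexample, with assumed values of m and K:

```python
import numpy as np

K, m = 40, 30
U = np.full(8, m)           # every subject leaves the same m items unattempted
print("S_U =", U.std())     # 0.0, so the Gulliksen-style bounds collapse
print("C =", U.mean() / K)  # 0.75 -> clearly not a power test
```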
3.3
Index of Speed test:
For a pure speed test, the index C = 1, and vice versa. When the number of unattempted items for each subject is equal to the total number of items in the test, C = 1. One can test $H_0: C = 1$, and $(1 - C)$ can be taken as the departure from a pure speed test. So, a pure speed test is defined as follows:

Definition 2: A test X is said to be a pure speed test if and only if the index C as defined in (6) is equal to one for the test.

Gulliksen's condition for a speed test (inequality 3) can be improved considering C = 1. From equation (6),

$$C = 1 \Rightarrow \bar{U} = K$$

Thus, $U_i = K$ for each i, since $U_i \leq K$ and $\bar{U} = K$. In other words, if C = 1, every subject leaves all K items unattempted.

As per equation (4), $\bar{E} = \bar{W} + \bar{U}$. Putting $\bar{U} = K$ for C = 1, for a pure speed test, $\bar{W} = \bar{E} - K \leq 0$; since $\bar{W} \geq 0$, it follows that $\bar{W} = 0$. Using Theorem 1, we have $S_W \to 0$.

For a pure speed test, $\bar{W} = 0$ and $S_W = 0$. From (1), $E_i = U_i$ and $S_E = S_U$. Thus, $1 + \frac{S_W}{S_U} \to 1$ and $1 - \frac{S_W}{S_U} \to 1$.

Accordingly, Gulliksen's criterion for a speed test (3) boils down to

$$1 + \frac{S_W}{S_U} \geq \frac{S_E}{S_U} \geq 1 - \frac{S_W}{S_U} \quad (9)$$

Therefore, for C = 1, Gulliksen's condition for a speed test is improved to accommodate the pure speed test. However, the converse is not true; Gulliksen's condition holds in one direction only. $C = 1 \Leftrightarrow$ pure speed test is the more general statement and is a necessary and sufficient condition for a pure speed test.
3.4 Speed
and Power items:
Consider the matrix U with n rows for n examinees and K columns for K items, where the (i, j)-th cell

$u_{ij} = 1$ if the i-th individual has not attempted the j-th item, and
$u_{ij} = 0$ if the i-th individual has attempted the j-th item.

Here, the total of the j-th column gives the number of examinees who did not attempt the j-th item. The C-index for the j-th item is

$$C_j = \frac{\sum_{i=1}^{n} u_{ij}}{n} \quad (10)$$

Clearly, the maximum value $C_j = 1$ arises when no examinee could attempt the j-th item, indicating that the j-th item is a pure speed item. The minimum value $C_j = 0$ arises when each examinee attempted the j-th item, indicating that the j-th item is a pure power item.

The items may be ranked with respect to $C_j$, which facilitates identification of speed items along with assessment of the degree of speededness. In reality, $C_j$ may be close to one ($C_j \to 1$).

A thumb rule of accepting the j-th item as a speed item if $C_j$ exceeds a chosen cut-off is arbitrary. It is better to undertake testing of $H_0: C_j = 1$. Acceptance of $H_0: C_j = 1$ implies the j-th item is a speed item, and rejection of $H_0: C_j = 1$ indicates that the j-th item is not a speed item. A similar exercise can be undertaken for power items with $H_0: C_j = 0$, along with identification of power items.

If the C-index of the test is denoted by $C_{Test}$, the average of the $C_j$ is

$$\frac{1}{K}\sum_{j=1}^{K} C_j = \frac{1}{K}\sum_{j=1}^{K}\frac{\sum_{i=1}^{n} u_{ij}}{n} = \frac{1}{nK}\sum_{i=1}^{n} U_i = \frac{\bar{U}}{K} = C_{Test} \quad (11)$$

Equation (11) gives the relationship between $C_{Test}$ and the item-wise indices $C_j$. Identification of power items and speed items helps to modify the test into a speed or power test by deleting items in stages, as illustrated in the sketch below.
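The sketch builds a hypothetical indicator matrix, computes the item-wise indices of equation (10) as column proportions, ranks the items, and verifies relationship (11) numerically.

```python
import numpy as np

# Hypothetical indicator matrix: u[i, j] = 1 if examinee i did not attempt item j
u = np.array([[0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])
n, K = u.shape

C_items = u.mean(axis=0)            # equation (10): C_j = column total / n
C_test = u.sum(axis=1).mean() / K   # equation (6): C = U-bar / K

print("item indices:", C_items)                        # [0.   0.25 0.75 1.  ]
print("ranked (most speeded first):", np.argsort(-C_items))
print("mean of item indices:", C_items.mean(), "== C_test:", C_test)  # equation (11)
```

Deleting the highest-ranked (most speeded) items in stages and recomputing C after each stage moves the test toward a power test, and vice versa.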
4. Discussion:
The proposed index of speediness looks similar to the Speededness Quotient (SQ) proposed by Stafford (1971). SQ is defined as the percentage of unattempted items in the total number of errors, at the individual level and at the test level. SQ = 100 for a purely speeded test, and for a purely power test SQ = 0. Like the proposed C-index, SQ focuses on the proportion of total errors, unlike the proportion of test variance affected by speed in Gulliksen's approach. However, the proposed index in terms of $U_i/K$ for the i-th individual may not have a one-to-one correspondence with SQ.
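The lack of one-to-one correspondence can be seen with two hypothetical subjects: one with very few errors, all of them unattempted items (SQ of 100, yet a small $U_i/K$), and one with many errors split evenly (a lower SQ but a larger $U_i/K$).

```python
import numpy as np

K = 40
U = np.array([2, 10])   # unattempted items per subject (hypothetical)
W = np.array([0, 10])   # wrong answers per subject (hypothetical)
E = W + U

SQ = 100 * U / E        # Stafford's quotient, per subject
print("SQ: ", SQ)       # [100.  50.] -> first subject looks fully 'speeded'
print("U/K:", U / K)    # [0.05 0.25] -> but contributes little to C
```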
Tests with no penalty for wrong answers will significantly decrease the value of SQ. For example, about 99.6% of the examinees answered each item in the Swedish Scholastic Aptitude Test (SweSAT) (Marcus, 2021), for which SQ will be close to zero irrespective of the speededness or power components of SweSAT. In addition, the C-index allows one to test $H_0: C = 0$ or $H_0: C = 1$ and helps to identify items measuring speed.
5.
Limitations:
The proposed indices for the test and the items cannot help to find the effect of random guessing under different manifestations of speededness. Values of the indices are also affected by the homogeneity or heterogeneity of the sample.
6. Conclusion:
A simple index in terms of a ratio is proposed for measuring the degree of speed or the degree of power of a test. The index is independent of the position of the items and is equal to $C = \bar{U}/K$, where $\bar{U}$ denotes the mean number of unattempted items by the n examinees under a prescribed time limit. The test becomes close to a power test as C tends to zero and close to a speed test as C tends to one; the converse is also true. In fact, the necessary and sufficient condition for a pure power test is C = 0, and the same for a pure speed test is C = 1. Gulliksen's inequalities, separately for power tests and speed tests, were modified to include the pure power test and the pure speed test.

The index facilitates a statistical test to see whether the obtained value of C is significantly different from zero, i.e., testing $H_0: C = 0$. In case of rejection of the null hypothesis, the test cannot be regarded as a pure power test.

Following a similar approach, the C-index for the j-th item was defined, which reflects a pure power item if $C_j = 0$ and a pure speed item if $C_j = 1$. The items of the test can be ranked with respect to $C_j$, which helps in identification of speed items along with assessment of the degree of speededness. In reality, $C_j$ may be close to one ($C_j \to 1$). Acceptance of the statistical hypothesis $H_0: C_j = 1$ implies that the j-th item is a speed item, and rejection of $H_0: C_j = 1$ indicates that the j-th item is not a speed item. A similar exercise can be undertaken for power items with $H_0: C_j = 0$, along with identification of power items. Identification of power items and speed items helps to modify the test into a speed or power test by deleting items in stages, if speediness (or power) is not intended. The relationship between $C_{Test}$ and the item-wise indices $C_j$ was derived. The method can be best applied when it is desirable to minimize test speededness or when speed is not a part of the latent trait being measured.

Future empirical studies with real and/or simulated data sets may be undertaken for further investigation of the indices and their effect on psychometric qualities.
References:
Attali, Y. (2005). Reliability of speeded number-right multiple-choice tests. Applied Psychological Measurement, 29, 357-368. https://doi.org/10.1177/0146621605276676
Bejar, I. I. (1985). Test speededness under number-right scoring: An analysis of the Test of English as a Foreign Language. Report No. ETS-RR-85-11, Educational Testing Service, Princeton, NJ.
Cintron,
D. (2021). Methods for Measuring Speededness: Chronology,
Classification, and Ensuing Research and Development. ETS Research Report
Series, 2021(1). https://doi.org/10.1002/ets2.12337
College Entrance
Examination Board (1984). The College Board technical handbook for the
scholastic aptitude test and achievement tests. New York
Cronbach, L. J., & Warrington, W. G. (1951). Time limit tests: Estimating their reliability and degree of speeding. Psychometrika, 16, 167-188.
Estrada, E., Román, F. J., Abad, F. J., & Colom, R.
(2017). Separating power and speed components of standardized
intelligence measures. Intelligence, 61,
159-168.
Ferrando, P. J., &
Lorenzo-Seva, U. (2007). An item response theory model for incorporating
response time data in binary personality items. Applied Psychological
Measurement, 31(6), 525–543. https://doi.org/10.1177/0146621606295197
Flinn, L., Braham, L., & Das
Nair, R. (2015). How reliable are case formulations? A systematic
literature review. British
Journal of Clinical Psychology, 54, 266-290. https://doi.org/10.1111/bjc.12073
Gulliksen, H. (1950). Theory of Mental Tests. New York: Wiley, 177-198.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Newbury Park, CA: SAGE Publications.
Hong, M. R., & Cheng, Y. (2019). Clarifying the effect of test speededness. Applied Psychological Measurement, 43(8), 611-623. https://doi.org/10.1177/0146621618817783
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27, 99-114. https://doi.org/10.1007/s10869-011-9231-8
Klein Entink, R. H., van der Linden, W. J., & Fox, J. P. (2009). A Box-Cox normal model for response times. British Journal of Mathematical and Statistical Psychology, 62, 621-640.
Lord, F. (1956). A study of speed factors in tests and academic grades. Psychometrika,
21, 31-50.
Lu, Y., & Sireci, S. G. (2007). Validity issues in
test speededness. Educational Measurement: Issues
and Practice, 26(4), 29–37. https://doi.org/10.1111/j.1745-3992.2007.00106.x
Marcus S. Hjärne (2021). Just Enough Time to Level the Playing Field:
Time Adaptation in a College Admission Test, Scandinavian Journal of Educational Research, 65(6), 941-955.
https://doi.org/10.1080/00313831.2020.1788143
Maris, E. (1993).
Additive and multiplicative models for gamma distributed random variables, and
their application as psychometric models for response times. Psychometrika,
58, 445-469.
Meade, A. W.,
& Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17,
437-455. https://doi.org/10.1037/a0028085
Partchev, I., De Boeck, P., & Steyer, R. (2013). How much power and
speed is measured in this test? Assessment, 20(2),
242–252. https://doi.org/10.1177/1073191111411658
Ranger, J., &
Kuhn, J.T. (2012). Improving item response theory model calibration by
considering response times in psychological tests. Applied Psychological
Measurement, 36(3), 214–231. https://doi.org/10.1177/0146621612439796
Rindler, S. E.
(1979). Pitfalls in assessing test speededness. Journal
of Educational Measurement, 16(4), 261–270.
Rouder, J.,
Sun, D., Speckman, P., Lu, J., & Zhou, D. (2003).
A hierarchical Bayesian statistical framework for response time distributions. Psychometrika,
68, 589-606.
Stafford, R. E. (1971). The speededness quotient: A new descriptive statistic for tests. Journal of Educational Measurement, 8, 275-278.
Swineford, F.
(1974). The test analysis manual (SR-74-06). Princeton, NJ: Educational Testing
Service.
Van der Linden, W. J. (2009). Conceptual issues
in response-time modeling. Journal of Educational Measurement, 46(3),
247–272. https://doi.org/10.1111/j.1745-3984.2009.00080.x
Van der Linden, W.
J. (2007). A hierarchical framework for modeling speed and accuracy on test
items. Psychometrika, 72, 287-308.
Wise, S. L., & DeMars, C. E. (2009). A clarification of the effects of rapid guessing on coefficient α: A note on Attali's "Reliability of speeded number-right multiple-choice tests". Applied Psychological Measurement, 33, 488-490. https://doi.org/10.1177/0146621607304655