W2 - Chapter 5 - Reliability (DN) Flashcards (73 cards)
1
Q

alternate forms

A
  • are simply DIFFERENT VERSIONS of a TEST that have been constructed to be as similar as possible to the original
    e.g., hard copy, online, oral, etc.
  • a measure of reliability across time
  • unlike PARALLEL FORMS, alternate forms are NOT required to have the same MEAN & VARIANCE as the original test, so they provide a weaker estimate
    p.151
2
Q

alternate-forms reliability

A
  • an estimate of the extent to which the ALTERNATE (different) FORMS of a test have been affected by ITEM SAMPLING ERROR, or OTHER ERROR
  • an estimate of a test’s reliability across time
    p. 151-152, 161
3
Q

average proportional distance (APD)

A

a measure used to evaluate the INTERNAL CONSISTENCY of a test

  • focuses on the DEGREE of DIFFERENCE that exists between ITEM SCORES
  • typically calculated for a GROUP of TESTTAKERS
    p. 157-158
4
Q

classical test theory (CTT)

A
  • also known as ‘true score theory’ & ‘true score model’
  • system of assumptions about measurement
  • a TEST SCORE is made up of a relatively stable component (what the test/individual item is designed to measure) PLUS a component that is ERROR.
    p. 123 (164-166, 280-281)
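[Supplementary note, not from the deck] In symbols, CTT's assumption is commonly written as

X = T + E

where X = the observed test score, T = the stable "true score" component, & E = the error component.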
5
Q

coefficient α (alpha)

A
  • developed by Cronbach (1951); elaborated on by others.
  • also referred to as CRONBACH’S ALPHA and ALPHA
  • a statistic widely employed in TEST CONSTRUCTION
  • the preferred statistic for obtaining INTERNAL CONSISTENCY RELIABILITY
  • only requires ONE administration of the test
  • assists in deriving an ESTIMATE of RELIABILITY; more technically, it is equal to the MEAN of ALL SPLIT-HALF RELIABILITIES
  • suitable for use on tests with NON-DICHOTOMOUS ITEMS
  • unlike Pearson r (which ranges from -1 to +1), COEFFICIENT ALPHA ranges from 0 to 1 because it is used to gauge the SIMILARITY of data sets: 0 = absolutely NO SIMILARITY; 1 = PERFECTLY IDENTICAL
    p.157
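[Supplementary note, not from the deck] The standard formula for coefficient alpha is

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)

where k = the number of items, \sigma_i^2 = the variance of item i, & \sigma_X^2 = the variance of total test scores.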
6
Q

coefficient of equivalence

A

the estimate of the degree of relationship that exists BETWEEN various FORMS of a TEST

  • can be evaluated with an alternate-forms or parallel-forms RELIABILITY COEFFICIENT (both are known as the COEFFICIENT OF EQUIVALENCE) p.151
7
Q

coefficient of generalisability

A

represents an estimate of the INFLUENCE of particular FACETS on the test score

e.g., - Is the score affected by group as opposed to one-on-one administration? or
- Is the score affected by the time of day the test is administered?
p.168

8
Q

coefficient of inter-scorer reliability

A

the estimate of the degree of CONSISTENCY AMONG SCORERS in the scoring of a test

  • this is the COEFFICIENT of CORRELATION for inter-scorer consistency (reliability)
    p. 159
9
Q

coefficient of stability

A

an estimate of TEST-RETEST RELIABILITY obtained when the interval between tests is GREATER than SIX MONTHS

  • this is a significant estimate as the passage of time can be a source of ERROR VARIANCE i.e., the more time passed, the greater likelihood of a lower reliability coefficient p.151
10
Q

confidence interval

A

a RANGE or BAND of test scores that is likely to contain the ‘TRUE SCORE’
p.177
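[Supplementary note, not from the deck] A confidence interval is typically built around the observed score using the standard error of measurement (SEM, card 50), e.g., for a 95% level:

CI_{95\%} = X_{obs} \pm 1.96 \times SEM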

11
Q

content sampling

A
  • the VARIETY of SUBJECT MATTER contained in the test ITEMS.
  • one source of variance in the measurement process is the VARIATION among items WITHIN a test or BETWEEN tests
    i.e., the way in which a test is CONSTRUCTED is a source of ERROR VARIANCE
  • also referred to as ITEM SAMPLING p.147
12
Q

criterion-referenced test

A
  • way of DERIVING MEANING from test scores by evaluating an individual’s score with reference to a SET STANDARD (CRITERION)
  • also referred to as “domain-referenced testing” & “content-referenced testing and assessment”

DISTINCTION:
CONTENT-REFERENCED interpretations are those where the score is directly interpreted in terms of performance AT EACH POINT on the achievement continuum being measured
- while CRITERION-REFERENCED interpretations are those where the score is DIRECTLY INTERPRETED in terms of performance at ANY GIVEN POINT on the continuum of an EXTERNAL VARIABLE.
p.139-141 (163-164, 243)

13
Q

decision study

A
  • conducted at the conclusion of a generalizability study
  • designed to EXPLORE the UTILITY & VALUE of TEST SCORES in making DECISIONS.
    p. 168
14
Q

dichotomous test item

A
  • a TEST ITEM or QUESTION that can be answered with ONLY one of two responses e.g., true/false or yes/no
    p. 169
15
Q

discrimination

A
  • In IRT
  • the DEGREE to which an ITEM DIFFERENTIATES among people with HIGHER or LOWER levels of the TRAIT, ABILITY or whatever is being measured by a test
    p. 169
16
Q

domain sampling theory

A
  • while Classical Test Theory seeks to estimate the proportion of a test score due to ERROR
  • Domain Sampling Theory seeks to estimate the proportion of a test score that is due to specific sources of variation under defined conditions (i.e., context/domain)
  • in DST, the test’s RELIABILITY is looked upon as an OBJECTIVE MEASURE of how precisely the test score assesses the DOMAIN from which the test DRAWS a SAMPLE
  • of the three TYPES of ESTIMATES of RELIABILITY, measures of INTERNAL CONSISTENCY are the most compatible with DST
    p. 166 & 167
17
Q

dynamic characteristic

A
  • a TRAIT, STATE, or ABILITY presumed to be EVER-CHANGING as a function of SITUATIONAL and COGNITIVE EXPERIENCES; contrast with static characteristic
    p. 162
18
Q

error variance

A

error from IRRELEVANT, RANDOM sources - ERROR VARIANCE plus TRUE VARIANCE = TOTAL VARIANCE p.126,146
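[Supplementary note, not from the deck] In symbols, the variance decomposition on this card is

\sigma^2 = \sigma^2_{true} + \sigma^2_{error}

i.e., TOTAL VARIANCE is the sum of TRUE VARIANCE & ERROR VARIANCE.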

19
Q

estimate of inter-item consistency

A
  • the degree of correlation among ALL items on a scale
  • the CONSISTENCY or HOMOGENEITY of ALL items on a test
  • estimated by techniques such as the SPLIT-HALF RELIABILITY method
  • p.152 - 154
20
Q

facet

A
  • in GENERALIZABILITY THEORY, the variables of the testing situation; FACETS include things like the number of items on a test, the amount of training the test scorers have had, & the purpose of the test administration
    p. 167
21
Q

generalizability study

A
  • examines how GENERALIZABLE SCORES from a PARTICULAR test are if the test is administered in DIFFERENT SITUATIONS i.e., it examines how much of an IMPACT DIFFERENT FACETS of the UNIVERSE have on a test score p.167, 168
22
Q

generalizability theory

A
  • based on the idea that a person’s test scores VARY from testing to testing because of variables in the TESTING SITUATION
  • test score in its context - DN
  • encourages test users to describe details of a particular test situation (or UNIVERSE) leading to a particular test score
  • a ‘UNIVERSE SCORE’ replaces a ‘TRUE SCORE’
  • Cronbach (1970) & colleagues
    p. 167
23
Q

heterogeneity

A

the degree to which a test measures DIFFERENT FACTORS i.e., the test contains items that measure MORE THAN ONE TRAIT (FACTOR) (also NONHOMOGENEOUS) p.154

24
Q

homogeneity

A
  • When a test contains ITEMS that MEASURE a SINGLE TRAIT i.e., the DEGREE to which a test measures a SINGLE FACTOR - i.e., the extent to which items in a scale are UNIFACTORIAL
  • the more HOMOGENEOUS a test, the more INTER-ITEM CONSISTENCY
  • it is expected to have higher Internal Consistency than a HETEROGENEOUS TEST
  • homogeneity is desirable as it provides straightforward INTERPRETATION (i.e., similar scores = similar abilities on the variable of interest)
    p. 154-155
25
Q

inflation of range/variance

A
  • SAMPLING PROCEDURES may impact the variance of either variable in a correlation analysis
    OUTCOME
  • if variance of EITHER variable is INFLATED by the sampling procedure, then the resulting CORRELATION COEFFICIENT tends to be HIGHER (i.e., giving a falsely inflated indication of correlation)
    (thought to self - is this also a validity issue e.g., false positive)
  • the converse is referred to as RESTRICTION OF RANGE/VARIANCE
  • if variance of EITHER variable is RESTRICTED by the sampling procedure used, then there tends to be a LOWER CORRELATION COEFFICIENT (i.e., masking the true correlation)
    (thought to self - is this also a validity issue e.g., failing to detect - a miss!!!)
    p.162
26
Q

information function

A
  • an IRT TOOL
  • helps test users determine the RANGE of THETA over which an item is most useful in DISCRIMINATING among groups of testtakers
    p. 171
27
Q

inter-item consistency

A
  • the CONSISTENCY or HOMOGENEITY of ALL items on a test
  • ESTIMATED by techniques such as the SPLIT-HALF RELIABILITY method
  • the DEGREE of CORRELATION among ALL ITEMS on a scale - p.154
28
Q

internal consistency estimate of reliability

A

an ESTIMATE of the RELIABILITY of a test

- obtained from a MEASURE of INTER-ITEM CONSISTENCY p.152

29
Q

inter-scorer reliability

A
  • An ESTIMATE of the DEGREE of agreement or CONSISTENCY between TWO or more SCORERS on a test.
  • also referred to as INTER-RATER reliability; OBSERVER reliability; JUDGE reliability; SCORER reliability.
  • p.159, 161
30
Q

item characteristic curve (ICC)

A
  • graphic representation of the PROBABILISTIC RELATIONSHIP between a person’s LEVEL of the TRAIT (ability, characteristic) being measured and the PROBABILITY of responding to an item in a PREDICTED way
  • also known as a CATEGORY RESPONSE CURVE or an ITEM TRACE LINE
    p. 177, 281
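[Supplementary sketch, not from the deck] In the common two-parameter logistic (2PL) IRT model, the ICC for item i is

P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

where \theta = the testtaker's trait level, a_i = the item's discrimination, & b_i = its difficulty.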
31
Q

item response theory (IRT)

A
  • another alternative to the true score model
  • a family of theories/methods (well over 100 varieties of IRT models)
  • each model is designed to HANDLE data with CERTAIN ASSUMPTIONS
  • a way of modelling (predicting?) the PROBABILITY that a person with X ability will be able to perform at a LEVEL OF Y.
  • also referred to as the LATENT-TRAIT MODEL
    p. 166, 168-173
32
Q

item sampling

A
  • one source of VARIANCE in the measurement process is the VARIATION among items WITHIN a test, or BETWEEN tests i.e., the way in which a test is CONSTRUCTED is a source of ERROR VARIANCE
  • also CONTENT SAMPLING
    p. 147
33
Q

Kuder-Richardson formula 20 (KR-20)

A

a series of EQUATIONS developed by G. F. Kuder & M. W. Richardson

  • designed to ESTIMATE the INTER-ITEM CONSISTENCY of tests
  • only appropriate for use on tests with DICHOTOMOUS ITEMS (true/false)
    p. 155-156, 163
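[Supplementary note, not from the deck] The KR-20 formula is

KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2}\right)

where k = the number of items, p_j = the proportion of testtakers answering item j correctly, q_j = 1 - p_j, & \sigma^2 = the variance of total test scores.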
34
Q

latent-trait theory

A
  • a synonym for IRT (Item Response Theory) in the academic literature
  • a system of ASSUMPTIONS about measurement
  • includes ASSUMPTION that a TRAIT being measured is UNIDIMENSIONAL
  • (note to self: go back & check p.168 - the extent to which each test item measures the targeted trait)
  • also referred to as the LATENT-TRAIT MODEL
    p.168
35
Q

measurement error

A

all factors associated with the PROCESS of measuring some variable OTHER than the actual variable being measured p.146

36
Q

odd-even reliability

A
  • an ESTIMATE of the SPLIT-HALF RELIABILITY of a test

- Splitting a test by assigning odd-numbered items to one half & even-numbered items to the other half of the test p.153

37
Q

parallel forms

A

FORMS of a test are PARALLEL when, on each form, the MEANS & VARIANCES of OBSERVED TEST SCORES are EQUAL p.151

38
Q

parallel-forms reliability

A
  • an estimate of the consistency of two versions of a test across time
  • an ESTIMATE of the extent to which ITEM SAMPLING & OTHER ERRORS have affected test scores on versions of the SAME test, for which MEANS & VARIANCES of OBSERVED TEST SCORES are EQUAL.
    (contrast with alternate forms reliability & also coefficient of equivalence) p.151-152
39
Q

polytomous test item

A

a test item or question with THREE OR MORE ALTERNATIVE RESPONSES

  • where ONLY ONE is scored CORRECT or is CONSISTENT with a TARGETED TRAIT or other CONSTRUCT
    p. 169
40
Q

power test

A
  • a test, usually of achievement or ability, that has
    1) either NO TIME LIMIT, or such a long time limit that ALL TESTTAKERS can attempt ALL ITEMS, and
    2) some items SO DIFFICULT that NO TESTTAKER can obtain a PERFECT SCORE
    (so it’s isolating the ‘power’ or ‘ability’ variable)
    (contrast with speed test)
    p.163
41
Q

random error

A

a source of ERROR when measuring a target variable due to UNPREDICTABLE FLUCTUATIONS & INCONSISTENCIES of OTHER VARIABLES in the measurement process - sometimes referred to as “NOISE” - contrast with systematic error p.146

42
Q

Rasch model

A

a reference to an IRT MODEL with VERY SPECIFIC ASSUMPTIONS about the UNDERLYING DISTRIBUTION
p.169

43
Q

reliability

A

the proportion of the TOTAL VARIANCE attributable to TRUE VARIANCE - the GREATER the proportion of TRUE VARIANCE, the GREATER the RELIABILITY of the test - p.157-158

44
Q

reliability coefficient

A
  • general term
  • an INDEX of RELIABILITY - or the RATIO of TRUE SCORE VARIANCE to TOTAL SCORE VARIANCE on a test
    p. 145
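[Supplementary note, not from the deck] As a ratio,

r_{xx} = \frac{\sigma^2_{true}}{\sigma^2_{total}}

so a reliability coefficient of .90 means 90% of the score variance is attributable to true variance.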
45
Q

restriction of range/variance

A
  • SAMPLING PROCEDURES may impact the variance of either variable in a correlation analysis
    OUTCOME
  • if variance of EITHER variable is RESTRICTED by the sampling procedure used, then there tends to be a LOWER CORRELATION COEFFICIENT (i.e., masking the true correlation)
    (thought to self - is this also a validity issue e.g., failing to detect - a miss!!!)
  • the converse is referred to as INFLATION OF RANGE/VARIANCE
  • if variance of EITHER variable is INFLATED by the sampling procedure, then the resulting CORRELATION COEFFICIENT tends to be HIGHER (i.e., giving a falsely inflated indication of correlation)
    (thought to self - is this also a validity issue e.g., false positive)
    p.162
46
Q

Spearman-Brown formula

A

allows a test developer/user to estimate INTERNAL CONSISTENCY RELIABILITY from a correlation of TWO HALVES of a test, & to estimate the reliability of a test that has been LENGTHENED or SHORTENED (see the formula below).

  • inappropriate for use with HETEROGENEOUS tests or SPEED tests
    p. 153-154
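[Supplementary note, not from the deck] The general Spearman-Brown formula is

r_{SB} = \frac{n \, r_{xy}}{1 + (n - 1) r_{xy}}

where r_{xy} = the observed correlation (e.g., between two half-tests) & n = the factor by which the test length is changed (n = 2 in the split-half case).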
47
Q

speed test

A
  • a test, usually of achievement or ability which has a TIME LIMIT
  • usually contains ITEMS of UNIFORM difficulty (usually uniformly low)
  • so that when given GENEROUS TIME ALL TESTTAKERS should be able to complete ALL ITEMS CORRECTLY

(so it’s isolating the SPEED variable)
(contrast with ‘power test’)
p.163, 272

48
Q

split-half reliability

A

an ESTIMATE of the INTERNAL CONSISTENCY of a test - obtained by CORRELATING two PAIRS of SCORES taken from EQUIVALENT HALVES of a SINGLE TEST administered ONCE - p.152-154

49
Q

standard error of a score

A
  • in TRUE SCORE THEORY
  • a STATISTIC designed to ESTIMATE how far an OBSERVED SCORE DEVIATES from a TRUE SCORE
    (also called standard error of measurement (SEM))
    p.175
50
Q

standard error of measurement (SEM)

A
  • in TRUE SCORE THEORY
  • a STATISTIC designed to ESTIMATE how far an OBSERVED SCORE DEVIATES from a TRUE SCORE
    (also called STANDARD ERROR OF A SCORE)
    p.132, 175-178
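[Supplementary note, not from the deck] SEM is typically estimated as

SEM = \sigma \sqrt{1 - r_{xx}}

where \sigma = the standard deviation of test scores & r_{xx} = the test's reliability coefficient.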
51
Q

standard error of the difference

A
  • a STATISTIC designed to aid in determining HOW LARGE a DIFFERENCE between two scores should be BEFORE it is considered STATISTICALLY SIGNIFICANT
    p. 132, 178
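[Supplementary note, not from the deck] When the two scores are on the same scale, the standard error of the difference can be estimated as

\sigma_{diff} = \sigma \sqrt{2 - r_1 - r_2}

where \sigma = the standard deviation of the scores & r_1, r_2 = the reliability coefficients of the two tests (or of the same test administered twice).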
52
Q

static characteristic

A

a TRAIT, STATE, or ABILITY presumed to be relatively UNCHANGING (STATIC) over TIME
(contrast with dynamic characteristic)
p.162

53
Q

systematic error

A
  • a source of ERROR in the measurement process
  • typically CONSTANT or PROPORTIONATE to what is presumed to be the TRUE VALUE of the target variable being measured
  • once known, it is predictable & FIXABLE - relative standings remain unchanged
  • may not be VALID but is RELIABLE - p. 146
54
Q

test battery

A

typically composed of TESTS designed to measure DIFFERENT VARIABLES.

  • quite often psychologists rely on a BATTERY of tests in the process of EVALUATION.
    p. 155n5, 502-504 see also specific batteries
55
Q

test-retest reliability

A

an estimate of reliability obtained by CORRELATING pairs of scores from the SAME PEOPLE on TWO DIFFERENT administrations of the test
- appropriate when EVALUATING the RELIABILITY of a test purporting to measure something relatively STABLE over TIME e.g., a personality trait p.150-151, 161

56
Q

theta level (in IRT)

A
  • a reference to the DEGREE of the underlying ability or trait that a TESTTAKER is presumed to BRING TO the test
  • also referred to as THETA
    p. 170
57
Q

transient error

A

a source of error attributable to the testtaker’s FEELINGS, MOODS, or MENTAL STATE OVER TIME p.160

58
Q

true score

A
  • according to CLASSICAL TEST THEORY

- a value that GENUINELY reflects an individual’s ABILITY or TRAIT level as measured by a particular test p.164

59
Q

true variance

A
  • in the TRUE SCORE MODEL
  • the COMPONENT of a score attributable to TRUE DIFFERENCES in the ability or trait being measured
  • can be in an OBSERVED SCORE or a DISTRIBUTION of SCORES p.146
60
Q

universe

A
  • in GENERALIZABILITY THEORY
  • the TOTAL CONTEXT of a particular test situation
  • including ALL the FACTORS that lead to an individual testtaker’s score
  • p.167
61
Q

universe score

A
  • in GENERALIZABILITY THEORY
  • a test score corresponding to the PARTICULAR UNIVERSE being assessed or evaluated
    p. 167
62
Q

variance

A

a statistic useful in describing SOURCES of test score variability

  • equal to the MEAN of the SQUARES of the DIFFERENCES between SCORES in a distribution and THEIR MEAN
  • calculated by SQUARING & SUMMING all the DEVIATION SCORES then DIVIDING by the total number of scores p.95, 146
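[Supplementary note, not from the deck] In symbols, the calculation described on this card is

\sigma^2 = \frac{\sum (X - \bar{X})^2}{n}

where X = each score, \bar{X} = the mean, & n = the total number of scores.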
63
Q

What is the main challenge of a test creator?

A

to MAXIMIZE the proportion of TOTAL VARIANCE that is TRUE VARIANCE and to MINIMIZE the proportion that is ERROR VARIANCE - p.147

64
Q

What are four main SOURCES of ERROR VARIANCE?

A

1) TEST CONSTRUCTION
- item sampling/content sampling
2) TEST ADMINISTRATION
- test environment; testtaker variables; examiner-related variables
3) TEST SCORING and INTERPRETATION
- scorers; scoring systems
4) OTHER SOURCES OF ERROR
- sampling error
- methodological error (researchers not properly trained, ambiguous wording, item biases)
p.147-149

65
Q

What are some methods of measuring INTERNAL CONSISTENCY of a test’s items?

A

1) SPEARMAN-BROWN FORMULA p.153-4
2) KUDER-RICHARDSON FORMULAS p.155-6
3) COEFFICIENT ALPHA p.157
4) AVERAGE PROPORTIONAL DISTANCE (APD) p.157

66
Q

How is obtaining estimates of ALTERNATE-FORMS reliability & PARALLEL FORMS reliability SIMILAR?

A

1) Two test administrations with the SAME GROUP are required
2) Test scores between tests may be AFFECTED by factors such as MOTIVATION, FATIGUE, or INTERVENING EVENTS (practice, learning, or therapy) - although not as much as if the EXACT SAME test had been administered twice

67
Q

What is an INHERENT source of ERROR-VARIANCE when computing an ALTERNATE or PARALLEL-FORMS reliability coefficient?

A

ITEM SAMPLING ERROR

p. 152

68
Q

What are the THREE steps of computation of a COEFFICIENT of SPLIT-HALF RELIABILITY?

A

Step 1 - Divide the test into EQUIVALENT HALVES

Step 2 - calculate a Pearson r between scores on the TWO HALVES of the test

Step 3 - adjust the HALF-TEST reliability using the SPEARMAN-BROWN FORMULA (see the sketch below)

(p.152-153)
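[Supplementary sketch, not from the deck] A minimal Python illustration of the three steps, assuming a small, made-up matrix of dichotomously scored items (`scores` is hypothetical):

```python
import numpy as np

# Hypothetical item-score matrix: rows = testtakers, columns = items (1 = correct, 0 = incorrect)
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

# Step 1 - divide the test into equivalent halves (here, an odd-even split)
odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

# Step 2 - calculate a Pearson r between scores on the two halves
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3 - adjust the half-test reliability with the Spearman-Brown formula
# (n = 2 because the full test is twice the length of each half)
r_full = (2 * r_half) / (1 + r_half)

print(f"half-test r = {r_half:.3f}, Spearman-Brown adjusted = {r_full:.3f}")
```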

69
Q

Contrast the Coefficient alpha & Pearson r

A

Coefficient alpha (Ca): ranges from 0 to 1; gauges how SIMILAR data sets are
Pearson r (Pr): ranges from -1 to +1; deals with both SIMILARITY & DISSIMILARITY

70
Q

How does the FOCUS of the Average Proportional Distance (APD) method DIFFER from the FOCUS of the SPLIT-HALF methods & CRONBACH’S ALPHA?

A

APD - focus is on the DEGREE of DIFFERENCE between item scores
SH & CA - focus is on SIMILARITIES between item scores
p.157

71
Q

What are the 3 approaches to ESTIMATING RELIABILITY?

A

1) test-retest
2) alternate or parallel forms
3) internal or inter-item consistency
method chosen will depend on a number of factors - e.g., the PURPOSE & NATURE of the measure being obtained
p. 160

72
Q

How do we decide which RELIABILITY COEFFICIENT to CHOOSE (use)?

A
  • the method chosen will depend on a number of factors - e.g., the PURPOSE & NATURE of the measure being obtained
    NOTE: the various RELIABILITY COEFFICIENTS DO NOT all reflect the same SOURCES of ERROR VARIANCE
    see p.161 (important to understand why each method is selected; also refer to Table 5-4)
73
Q

What are the 3 ASSUMPTIONS made when using IRT?

A

1) Unidimensionality
2) Local Independence
3) Monotonicity
p. 170