Test Construction Flashcards

1
Q

Some test experts use the term _____________ to refer to the extent to which test items contribute to achieving the stated goals of testing.

A

relevance

2
Q

A determination of relevance is based on a qualitative judgement that takes into account which factors?

A

Content appropriateness: Does the item actually assess the content or behavior domain that the test is designed to evaluate?
Taxonomic level: Does the item reflect the appropriate cognitive or ability level?
Extraneous abilities: To what extent does the item require knowledge, skills, or abilities outside the domain of interest?

3
Q

An item’s difficulty is measured by calculating an item difficulty index (p), which is what equation?

A

p = the number of examinees who answered the item correctly divided by the total number of examinees. The value of p ranges from 0 to 1.0, with larger values indicating easier items. When p is equal to 1.0, this means the item was answered correctly by all examinees; when p is 0, this indicates that none of the examinees answered the item correctly.
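As a quick sketch (not part of the original card), the difficulty index defined above can be computed directly from a list of scored responses:

```python
# Sketch: item difficulty index (p) for a single item.
# responses holds one 0/1 entry per examinee (1 = answered correctly).
def item_difficulty(responses):
    return sum(responses) / len(responses)

# Example: 8 of 10 examinees answered correctly, so p = 0.8 (an easy item).
print(item_difficulty([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))  # 0.8
```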

4
Q

In most situations, a p value of _____ is optimal. One exception is the case of a true/false test, for which the optimal p value is _____.

A

.50; .75

5
Q

______________________ refers to the extent to which a test item differentiates between examinees who obtain high versus low scores on the entire test or on an external criterion.

A

Item discrimination

6
Q

The item discrimination index ranges from _____ to _____.

A

-1.0; +1.0
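A hedged sketch of one common way D is computed (the extreme-groups method, which is an assumption here, not stated on the card): the proportion of the upper-scoring group answering correctly minus the proportion of the lower-scoring group answering correctly.

```python
# Sketch (assumed extreme-groups method): D ranges from -1.0 to +1.0.
def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    return upper_correct / upper_n - lower_correct / lower_n

# All of the upper group and none of the lower group answered correctly:
print(discrimination_index(20, 20, 0, 20))   # 1.0
# Both groups perform identically: the item does not discriminate.
print(discrimination_index(10, 20, 10, 20))  # 0.0
```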

7
Q

For most tests, an item with a discrimination index of _____ or higher is considered acceptable.

A

.35

8
Q

When using item response theory, an ______________________ is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically-derived estimate of a latent ability or trait.

A

Item characteristic curve (ICC)

9
Q

The theory of measurement that regards observed variability in test scores as reflecting two components: true differences between examinees on the attributes measured by the test and the effects of measurement (random) error

A

Classical test theory

10
Q

______________ is a measure of true score variability. It refers to the consistency of test scores; i.e., the extent to which a test measures an attribute without being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or across different forms.

A

Reliability

11
Q

When a test is ____________, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym.

A

Reliable

12
Q

What are some methods for establishing reliability?

A

test-retest, alternate forms, split-half, coefficient alpha, and inter-rater

13
Q

Most methods for estimating reliability produce a ______________________, which is a correlation coefficient that ranges in value from 0.0 to 1.0.

A

Reliability coefficient

14
Q

What does it mean if a test’s reliability coefficient is 0.0?

A

All variability in obtained test scores is due to measurement error.

15
Q

When a test’s reliability coefficient is 1.0, this indicates that all variability reflects what?

A

True score variability

16
Q

If a test has a reliability coefficient of .91, this means that ____% of variability in obtained test scores is due to ______________ variability, while the remaining 9% reflects _____________.

A

91; true score; measurement error

17
Q

Match the method for estimating reliability to the correct definition:
a. Test-Retest Reliability
b. Alternate (Equivalent, Parallel) Forms Reliability
c. Internal Consistency Reliability
d. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability
1. ___ To assess this, two equivalent forms of the test are administered to the same group of examinees and the two sets of scores are correlated. Indicates the consistency of responding to different item samples and, when the forms are administered at different times, the consistency of responding over time. Considered the most thorough method.
2. ___ Involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores. It is used for determining the reliability of tests designed to measure attributes that are relatively stable over time and that are not affected by repeated measurement (e.g., aptitude).
3. ___ Split-half reliability and coefficient alpha are two methods for evaluating this. Both involve administering the test once to a single group of examinees. It is useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test.
4. ___ Is of concern whenever test scores depend on a rater's judgment. It is assessed either by calculating a correlation coefficient or by determining the percent of agreement between two or more raters.

A
1. b; 2. a; 3. c; 4. d
18
Q

Match the terms that belong together:
a. Spearman-Brown formula
b. KR-20
c. Kappa statistic
1. Inter-rater reliability
2. Split-half reliability
3. Coefficient alpha

A
1. c; 2. a; 3. b
19
Q

_________________ reliability is the most thorough method for estimating reliability.

A

Alternate forms

20
Q

_________________ reliability is not appropriate for speed tests.

A

Internal consistency

21
Q

The magnitude of a reliability coefficient is affected by several factors. In general, the longer a test, the _______________ its reliability coefficient. The _______________ formula is used to estimate the effect of lengthening or ______________ a test on its reliability coefficient. If the new items do not represent the same content domain as the original items or are more susceptible to measurement error, this formula is likely to _____________ the effects of lengthening the test. Like other correlation coefficients, the reliability coefficient is affected by the range of scores: The greater the range, the ___________ the reliability coefficient. To maximize a test's reliability coefficient, the sample of examinees should include people who are ___________ with regard to the attributes measured by the test. A reliability coefficient is also affected by the probability that an examinee can select the correct answer to a test question by guessing. The easier it is to guess the correct answer, the ___________ the reliability coefficient.

A

larger; Spearman-Brown; shortening; overestimate; larger; heterogeneous; smaller
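The Spearman-Brown formula named on this card can be sketched as follows (a standard statement of the formula, where k is the factor by which the test is lengthened):

```python
# Sketch of the Spearman-Brown prophecy formula:
# r is the current reliability; k is the length factor (2 = doubled,
# 0.5 = halved). Returns the estimated reliability of the altered test.
def spearman_brown(r, k):
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test whose reliability is .60 raises the estimate to .75:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```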

22
Q

While the reliability coefficient is useful for assessing the amount of variability in test scores that is due to _____________ variability for a group of examinees, it does not directly indicate how much we can expect an individual examinee's obtained score to reflect his or her true score. For this purpose, the standard error of ________________ is used. It is calculated by multiplying the standard deviation of the test scores by the ___________________ of one minus the reliability coefficient.

A

true score; measurement; square root
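The calculation described on this card, as a short sketch:

```python
import math

# Sketch: standard error of measurement, per the card's definition:
# SEM = SD of test scores * square root of (1 - reliability coefficient).
def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# A test with SD = 15 and reliability = .91:
print(round(sem(15, 0.91), 2))  # 4.5
```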

23
Q

_____________ refers to a test’s accuracy.

A

Validity

24
Q

There are three main forms of validity: ___________ validity is of concern whenever a test has been designed to measure one or more content or behavior domains. ________________ validity is important when a test will be used to measure a hypothetical trait such as achievement, motivation, intelligence, or mechanical aptitude. ___________ validity is of interest when a test has been designed to estimate or predict performance on another measure.

A

Content Construct Criterion-related

25
Q

One method for assessing a test’s construct validity is to determine if the test has both ______________ and _______________ validity.

A

Convergent; discriminant (divergent)

26
Q

When a test has high correlations with measures that assess the same construct, this provides evidence of the test's _______________ validity; when a test has low correlations with measures of unrelated characteristics, this indicates that the test has _______________ validity.

A

Convergent; discriminant (divergent)

27
Q

_____________________ is used to identify the dimensions that underlie the intercorrelations among a set of tests.

A

Factor analysis

28
Q

In factor analysis, a test is shown to have construct validity when it has _______ correlations with the factors it is expected to correlate with and ______ correlations with the factors it is not expected to correlate with.

A

high; low

29
Q

__________________ validity is of interest whenever test scores are to be used to draw conclusions about an examinee’s likely standing or performance on another measure.

A

Criterion-related

30
Q

What are the two forms of criterion related validity?

A

Concurrent and predictive

31
Q

When establishing ______________ validity, the predictor is administered to a sample of examinees prior to the criterion. It is the appropriate type of validity when the goal of testing is to predict __________ status on the criterion. When evaluating _____________ validity, the predictor and criterion are administered at about the same time. It is the preferred method for assessing validity when the purpose of testing is to estimate __________ status on the criterion.

A

predictive; future; concurrent; current

32
Q

The data collected in a concurrent or predictive validity study can also be used to assess a predictor’s ________________, or the increase in correct decisions that can be expected if the predictor is used as a decision-making tool.

A

Incremental validity

33
Q

_______________ occurs when a rater’s knowledge of a person’s predictor performance affects how he/she rates the person on the criterion.

A

Criterion contamination

34
Q

A ____________ expresses an examinee’s raw score in terms of the percentage of examinees in the norm sample who achieved lower scores.

A

percentile rank
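The definition on this card can be sketched as a small computation (the norm sample below is made up for illustration):

```python
# Sketch: percentile rank = percentage of the norm sample who scored
# below a given raw score (the definition on this card).
def percentile_rank(score, norm_sample):
    below = sum(1 for s in norm_sample if s < score)
    return 100 * below / len(norm_sample)

# Hypothetical norm sample of 10 scores; 5 fall below a raw score of 80.
sample = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(80, sample))  # 50.0
```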

35
Q

When an examinee’s raw test score is converted to a ___________________, the transformed score indicates the examinee’s position in the normative sample in terms of standard deviations from the mean.

A

Standard score

36
Q

The ________ equivalent for an examinee's raw score is calculated by subtracting the mean of the distribution from the raw score to obtain a deviation score and then dividing the deviation score by the distribution's standard deviation.

A

z-score; z = (X - M) / SD
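The card's formula as a one-line sketch (the M = 100, SD = 15 example values are illustrative, not from the card):

```python
# Sketch of the z-score transformation on this card: z = (X - M) / SD.
def z_score(x, mean, sd):
    return (x - mean) / sd

# A raw score of 120 on a test with M = 100 and SD = 15:
print(round(z_score(120, 100, 15), 2))  # 1.33
```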

37
Q

The optimal item difficulty level for a true/false test is:

a. .25
b. .50
c. .75
d. 1.00

A

c

38
Q

For a test item that has an item discrimination index (D) of +1.0, you would expect:

a. high achievers to be more likely to answer the item correctly than low achievers
b. low achievers to be more likely to answer the item correctly than high achievers
c. low and high achievers to be equally likely to answer the item correctly
d. low and high achievers to be equally likely to answer the item incorrectly

A

a. When all examinees in the upper group and none in the lower group answered the item correctly, D is equal to +1.0

39
Q

Dina receives a percentile rank of 48 on a test, and her twin brother, Dino, receives a percentile rank of 98. Their teacher realizes that she made an error in scoring their tests and adds four points to Dina's and Dino's raw scores. (The other students' tests were scored correctly.) When she recalculates Dina's and Dino's percentile ranks, she will find that:

a. Dina’s percentile rank will change by more points than Dino’s
b. Dino’s percentile rank will change by more points than Dina’s
c. Dina and Dino’s percentile ranks will change by the same number of points
d. Dina and Dino’s percentile ranks will not change

A

a

40
Q

Percentile ranks and standard scores share in common which of the following:

a. both types of transformed scores are normally distributed regardless of the shape of the raw score distribution
b. both report an examinee’s test score in terms of standard deviation units from the mean
c. both reference an examinee’s score to a prespecified external standard
d. both reference an examinee’s score to those achieved by examinees in the standardization sample

A

d. Percentile ranks and standard scores are both norm-referenced scores

41
Q

A Wechsler IQ score is a(n):

a. percentile rank
b. standard score
c. ipsative score
d. stanine score

A

b

42
Q

Assuming a normal distribution, which of the following represents the highest score:

a. a z-score of 1.5
b. a T-score of 70
c. A WAIS score of 120
d. a percentile rank of 92

A

b. a T score of 70 is two standard deviations above the mean

43
Q

A test developer uses a sample of 50 employees to develop a new selection technique. When she correlates scores on the selection test with scores on a measure of job performance, she obtains a validity coefficient of .35. When the test developer administers the test and measure of job performance to another sample of 50 employees, she will most likely obtain a validity coefficient that is:

a. greater than .35
b. less than .35
c. about .35
d. negative in value

A

b. When a test is cross-validated on another sample, the validity coefficient ordinarily shrinks (is smaller)

44
Q

In terms of item response theory, the slope (steepness) of the item characteristic curve indicates the item's:

a. difficulty
b. discriminability
c. reliability
d. validity

A

b

45
Q

A researcher correlates scores on two alternate forms of an achievement test and obtains a correlation coefficient of .80. This means that ___% of observed test score variability reflects true score variability.

a. 80
b. 64
c. 36
d. 20

A

a

46
Q

To estimate the effects of lengthening a 50-item test to 100 items on the test’s reliability, you would use which of the following:

a. Pearson r
b. Kuder-Richardson Formula 20
c. kappa coefficient
d. Spearman-Brown Formula

A

d