Assessment: Principles of Test Construction Flashcards

1
Q

Validity

A

How accurately an instrument measures a given construct. Validity is concerned with what an instrument measures, how well it does so, and the extent to which meaningful inferences can be made from the instrument’s results. The three main types of validity are (a) content validity, the extent to which an instrument’s content seems appropriate to its intended purpose; (b) criterion-related validity, the effectiveness of an instrument in predicting an individual’s performance on a specific criterion, which can be either predictive or concurrent; and (c) construct validity, the extent to which an instrument measures a theoretical construct (i.e., an idea or concept).

2
Q

six types of reliability:

A
  • test-retest
  • alternate form
  • internal consistency
  • split-half reliability
  • inter-item consistency
  • inter-rater

3
Q

face validity

A

A superficial measure that is concerned with whether an instrument looks valid or credible. Face validity is not a true type of validity.

4
Q

validity coefficient

A

A correlation between test scores and the criterion measure; often used to report criterion-related validity.
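
For illustration, a minimal sketch (hypothetical scores; numpy assumed available) of computing a validity coefficient as the Pearson correlation between test scores and a criterion measure:

```python
import numpy as np

# Hypothetical data: admissions test scores and a later criterion (first-year GPA)
test_scores = np.array([52, 47, 61, 38, 55, 66, 44, 59])
criterion = np.array([3.1, 2.8, 3.6, 2.4, 3.2, 3.8, 2.9, 3.4])

# The validity coefficient is the correlation between the test scores and the criterion
validity_coefficient = np.corrcoef(test_scores, criterion)[0, 1]
print(f"validity coefficient r = {validity_coefficient:.2f}")
```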

5
Q

factor analysis

A

A statistical test used to reduce a larger number of variables (often items on an assessment) to a smaller number of factors (groups of related variables). The two forms of factor analysis are (a) exploratory factor analysis (EFA), which involves an initial examination of potential models (or factor structures) that best categorize the variables, and (b) confirmatory factor analysis (CFA), which is used to confirm the factor structure identified by the EFA.
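
As a rough illustration of the exploratory step, a sketch with simulated data (scikit-learn's FactorAnalysis is assumed available; dedicated EFA/CFA packages would normally be used for real instruments):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 respondents answering 6 items driven by 2 underlying factors
latent = rng.normal(size=(200, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],   # items 1-3 load on factor 1
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])  # items 4-6 load on factor 2
items = latent @ loadings.T + rng.normal(scale=0.3, size=(200, 6))

# Exploratory step: extract 2 factors and inspect the estimated loadings;
# a confirmatory step would then test how well this structure fits new data
efa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(np.round(efa.components_.T, 2))  # rows = items, columns = factors
```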

6
Q

standard error of estimate

A

A statistic that indicates the expected margin of error in a predicted criterion score due to the imperfect validity of the test.
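
A small worked example (hypothetical numbers) using the standard formula SEE = SD of the criterion times the square root of (1 minus the squared validity coefficient):

```python
import math

# Hypothetical values for a test used to predict a criterion
sd_criterion = 10.0  # standard deviation of the criterion scores
r_xy = 0.60          # validity coefficient between the test and the criterion

# Standard error of estimate: expected margin of error in a predicted criterion score
see = sd_criterion * math.sqrt(1 - r_xy ** 2)
print(f"SEE = {see:.1f}")  # 8.0; roughly 68% of actual scores fall within +/- 8 of the prediction
```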

7
Q

sensitivity

A

The instrument’s ability to accurately identify the presence of a phenomenon.

8
Q

specificity

A

The instrument’s ability to accurately identify the absence of a phenomenon.

9
Q

false positive

A

An instrument inaccurately identifying the presence of a phenomenon.

10
Q

false negative

A

An instrument inaccurately identifying the absence of a phenomenon.

11
Q

efficiency

A

Total correct decisions divided by the total number of decisions.

12
Q

incremental validity

A

The extent to which an instrument enhances the accuracy of prediction of a specific criterion.

13
Q

decision accuracy

A

The accuracy of an instrument in supporting counselor decisions. Decision accuracy often assesses sensitivity (the instrument’s ability to accurately identify the presence of a phenomenon); specificity (the instrument’s ability to accurately identify the absence of a phenomenon); false positive error (an instrument inaccurately identifying the presence of a phenomenon); false negative error (an instrument inaccurately identifying the absence of a phenomenon); efficiency (total correct decisions divided by the total number of decisions); and incremental validity (the extent to which an instrument enhances the accuracy of prediction of a specific criterion).
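
A minimal sketch (hypothetical counts) showing how sensitivity, specificity, and efficiency are computed from a 2 x 2 decision table:

```python
# Hypothetical screening results cross-tabulated against a diagnostic "gold standard"
true_positives = 40    # instrument flags the condition; condition actually present
false_positives = 10   # instrument flags the condition; condition actually absent
false_negatives = 5    # instrument misses the condition; condition actually present
true_negatives = 45    # instrument rules out the condition; condition actually absent

total = true_positives + false_positives + false_negatives + true_negatives

sensitivity = true_positives / (true_positives + false_negatives)  # presence correctly identified
specificity = true_negatives / (true_negatives + false_positives)  # absence correctly identified
efficiency = (true_positives + true_negatives) / total             # correct decisions / total decisions

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, efficiency = {efficiency:.2f}")
```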

14
Q

Reliability

A

Consistency of scores attained by the same person on different administrations of the same test. Reliability is concerned with measuring the error, or difference, between an individual’s observed test score and true test score: X = T + e (observed score = true score + error). There are several different types: (a) test-retest reliability (sometimes called temporal stability) determines the correlation between the scores obtained from two different administrations of the same test, thus evaluating the consistency of scores across time; (b) alternate form reliability (sometimes called equivalent form reliability or parallel form reliability) compares the consistency of scores from two alternative, but equivalent, forms of the same test; (c) internal consistency measures the consistency of responses within a single administration of the instrument (two common types of internal consistency are split-half reliability and interitem reliability, e.g., the KR-20 and coefficient alpha); and (d) interscorer reliability, sometimes called inter-rater reliability, is used to calculate the degree of consistency of ratings between two or more persons observing the same behavior or assessing an individual through observational or interview methods.
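
As an illustration of interitem (internal) consistency, a minimal sketch (hypothetical responses; numpy assumed available) of Cronbach's coefficient alpha:

```python
import numpy as np

# Hypothetical responses: 6 test-takers x 4 items on a 1-5 scale
responses = np.array([
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
])

k = responses.shape[1]                               # number of items
item_variances = responses.var(axis=0, ddof=1)       # variance of each item
total_variance = responses.sum(axis=1).var(ddof=1)   # variance of total scores

# Cronbach's coefficient alpha, one index of internal consistency
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"coefficient alpha = {alpha:.2f}")
```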

15
Q

reliability coefficient

A

A measure of the reliability of a set of scores on a test. It ranges from 0 to 1.00; the closer the coefficient is to 1.00, the more reliable the scores.

16
Q

standard error of measurement (SEM)

A

A statistic that indicates how scores from repeated administrations of the same instrument to the same individual are distributed around the true score. The standard error of measurement is computed using the standard deviation and reliability coefficient of the test instrument.
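
A small worked example (hypothetical values) using the usual formula SEM = SD times the square root of (1 minus the reliability coefficient):

```python
import math

# Hypothetical instrument values
sd = 15.0           # standard deviation of the test scores
reliability = 0.91  # reliability coefficient of the test

# Standard error of measurement: spread of obtained scores around the true score
sem = sd * math.sqrt(1 - reliability)
print(f"SEM = {sem:.1f}")  # 4.5; a 68% confidence band is the obtained score +/- 4.5
```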

17
Q

Item Analysis

A

A procedure that involves statistically examining test-taker responses to individual test items with the intent to assess the quality of test items as well as the test as a whole. Item analysis is frequently used to eliminate confusing, overly easy, or overly difficult items from a test that will be used again.

18
Q

item difficulty

A

The percentage of test-takers who answer a test item correctly, calculated by dividing the number of individuals who correctly answered the item by the total number of test-takers.
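
A quick worked example (hypothetical counts):

```python
# Hypothetical item: 80 of 100 test-takers answered it correctly
number_correct = 80
total_test_takers = 100

# Item difficulty (p): proportion answering correctly; higher values mean an easier item
item_difficulty = number_correct / total_test_takers
print(f"p = {item_difficulty:.2f}")  # 0.80
```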

19
Q

item discrimination

A

The degree to which a test item is able to correctly differentiate test-takers who vary according to the construct measured by the test. It is calculated by subtracting the performance of the bottom quarter of total scorers from that of the top quarter on a given test item.
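
A quick worked example (hypothetical counts) of the upper-minus-lower-group discrimination index:

```python
# Hypothetical item results for the top and bottom quarters of total scorers (25 each)
top_group_correct = 22     # test-takers in the top quarter who answered the item correctly
bottom_group_correct = 9   # test-takers in the bottom quarter who answered it correctly
group_size = 25

# Discrimination index: top-group performance minus bottom-group performance
discrimination = (top_group_correct - bottom_group_correct) / group_size
print(f"D = {discrimination:.2f}")  # 0.52; values near +1 discriminate well, values near 0 do not
```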

20
Q

Test Theory

A

Assumes that test constructs, in order to be considered empirical, must be measurable for quality and quantity (Erford, 2013); consequently, test theory strives to reduce test error and enhance construct reliability and validity. The two common types of test theory are (a) classical test theory, which postulates that an individual’s observed score is the sum of the true score and the amount of error present during test administration and (b) item response theory, also referred to as modern test theory, which applies mathematical models to the data collected from assessments to evaluate how well individual test items and the test as a whole work.
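
A minimal sketch contrasting the two ideas (hypothetical numbers; the two-parameter logistic shown here is just one common IRT model):

```python
import math

# Classical test theory: an observed score is the true score plus random error (X = T + e)
true_score, error = 100, -3
observed_score = true_score + error  # X = 97

# Item response theory (two-parameter logistic model): probability that a test-taker
# with ability theta answers an item correctly
def p_correct(theta, a, b):
    """a = item discrimination, b = item difficulty, theta = test-taker ability."""
    return 1 / (1 + math.exp(-a * (theta - b)))

print(observed_score)
print(f"P(correct) = {p_correct(theta=1.0, a=1.2, b=0.5):.2f}")  # ability exceeds item difficulty
```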

21
Q

Scale

A

A collection of items or questions that combine to form a composite score on a single variable. Scales can measure discrete or continuous variables and can describe data quantitatively or qualitatively.

22
Q

Likert scale

A

Commonly used to measure attitudes or opinions; typically includes a statement regarding the concept in question followed by answer choices that range from Strongly Agree to Strongly Disagree. Sometimes called a Likert-type scale.
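
A minimal scoring sketch (hypothetical 5-point items; reverse-scoring negatively worded items is a common practice, not something this definition requires):

```python
# Hypothetical 5-point Likert scoring, including one reverse-worded item
scale_points = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
                "Agree": 4, "Strongly Agree": 5}

responses = ["Agree", "Strongly Agree", "Disagree", "Agree"]  # item 3 is reverse-worded
reverse_items = {2}  # zero-based index of the reverse-worded item

scores = [scale_points[r] for r in responses]
scores = [6 - s if i in reverse_items else s for i, s in enumerate(scores)]  # reverse-score
print(sum(scores))  # composite attitude score: 4 + 5 + 4 + 4 = 17
```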

23
Q

semantic differential

A

A scaling technique rooted in the belief that people think dichotomously; it commonly presents an affective question or statement followed by a scale that asks test-takers to place a mark between two dichotomous (opposite) adjectives. Also referred to as self-anchored scales.

24
Q

Thurstone scale

A

Measures multiple dimensions of an attitude by asking respondents to express their beliefs through agreeing or disagreeing with item statements. Thurstone scales can be constructed using equal-appearing intervals, successive intervals, or the paired comparison method.

25
Q

Guttman scale

A

Measures the intensity of the variable being measured. Items are presented in progressive order so that a respondent who agrees with an extreme test item will also agree with all previous, less extreme items.
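
A minimal sketch (hypothetical responses) of the cumulative pattern a Guttman scale assumes:

```python
# Hypothetical Guttman-style items ordered from least to most extreme; 1 = agree, 0 = disagree.
# A response pattern fits the scale if agreement with a more extreme item is always
# accompanied by agreement with every less extreme item before it.
def fits_guttman_pattern(responses):
    return all(responses[i] >= responses[i + 1] for i in range(len(responses) - 1))

print(fits_guttman_pattern([1, 1, 1, 0, 0]))  # True: agrees up to a point, then stops
print(fits_guttman_pattern([1, 0, 1, 0, 0]))  # False: skips a less extreme item
```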