In addition to adequate norms, a test must also be reliable and valid if it is to be useful and accurate.
RELIABILITY
Reliability refers to the consistency of a test. If a youngster is tested in the morning and again in the afternoon, we expect fairly similar scores. If not, the test isn't consistent, or "reliable." A test must give us accurate results; it is of little value if it gives Chester one score today and a very different score tomorrow.
A reliable test is:
-stable
-dependable
-predictable
A test's predictability is described by a number called a "reliability coefficient" (a correlation; the closer to 1.0, the more reliable the test).
3 Types of Reliability
1. Test-Retest (most commonly reported type of reliability)
2. Equivalent Form
3. Split-half
Test-Retest
If you use a test with a student in the morning and then administer it again in the afternoon, would you expect about the same results? (yes)
Which would have the highest test-retest reliability coefficient (correlation)?
-retesting a youngster on an achievement test within the same week (***answer...less time to learn/change)
-retesting in different years
Would you use a reading test that had a test-retest reliability of .40? (I hope not)
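The test-retest reliability coefficient is simply the correlation between the two sets of scores. Here is a minimal sketch in Python, with made-up scores for illustration:

```python
# Test-retest reliability: correlate scores from two administrations of
# the same test to the same students. All scores here are hypothetical.
from statistics import mean, stdev

morning   = [85, 92, 78, 88, 95, 70, 82, 90]  # first administration
afternoon = [83, 94, 75, 90, 93, 72, 80, 91]  # second administration

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

r = pearson_r(morning, afternoon)
print(f"Test-retest reliability coefficient: {r:.2f}")  # near 1.0 = consistent
```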
Equivalent Form Reliability
Reliability is established by using similar/alternate forms (Forms A & B) that measure the same trait/knowledge.
Two forms are created by randomly splitting the test questions before administration. One group of students takes Form A first, then Form B; another group takes Form B first, then Form A. The scores on the two forms are then correlated, producing a correlation or reliability coefficient. (A brief sketch of this process appears after the list below.)
Many tests have two forms for your use. Why would you want two forms at your disposal?
-prevent cheating
-prevent learning of the answers before the next administration
-prevent the student saying "Not this one again! I already took this test."
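A minimal sketch of the equivalent-form procedure, assuming a hypothetical item pool and made-up scores (the random split and the final correlation are the two key steps):

```python
# Equivalent-form reliability sketch: randomly split an item pool into
# Forms A and B; after a counterbalanced administration (half the students
# take A first, half take B first), correlate scores on the two forms.
import random
from statistics import correlation  # Python 3.10+

item_pool = [f"question_{i}" for i in range(1, 41)]  # hypothetical 40 items
random.shuffle(item_pool)
form_a, form_b = item_pool[:20], item_pool[20:]      # two 20-item forms

# Hypothetical scores: each student's total on Form A and on Form B.
scores_a = [18, 15, 12, 19, 16, 10, 14, 17]
scores_b = [17, 16, 11, 19, 15, 11, 13, 18]

r = correlation(scores_a, scores_b)
print(f"Equivalent-form reliability coefficient: {r:.2f}")
```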
Split-half Reliability
After the test has been administered to a large group of students, the test developers correlate the odd vs. the even numbered questions (or the 1st half vs. 2nd half, etc.). This process measures the test's "internal consistency."
Which factors can influence the reliability of a test?
(Whether kids get similar scores or not on repeated administrations of a test)
Test Length
Are more questions better? Yes, but why?
More questions assess more of a trait or characteristic, so a single lucky guess or misread item matters less.
Test-Retest Interval
What happens if you wait too long?
(People change)
Guessing
(a student gets an item right one time and wrong the next)
Changes in the Testing Situation
such as:
-vacuuming in the hallway
-opportunity to cheat
-headache
-tiredness
Another type of reliability-related characteristic is SEM (?)
standard error of measurement
Indicates the amount of error in a score obtained on a test/survey
Tries to account for error that occurred during administration, scoring, & interpretation
Because no test is totally reliable, we never really know someone's "True" score (the score you would have obtained if the test were totally reliable and free of error)
Stated in "confidence intervals"...a range inside which you can be sure, to some degree, that the true score will fall
*The smaller the SEM, the more reliable the test and the test score.
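The notes don't give the formula, but the standard classical formula computes SEM from the test's standard deviation and its reliability coefficient: SEM = SD * sqrt(1 - r). A minimal sketch with hypothetical numbers:

```python
# Standard error of measurement: SEM = SD * sqrt(1 - r), where SD is the
# standard deviation of the test's scores and r is its reliability
# coefficient. The numbers below are hypothetical.
import math

sd = 15.0           # standard deviation of scores (e.g., an IQ-style scale)
reliability = 0.90  # reported reliability coefficient

sem = sd * math.sqrt(1 - reliability)
obtained_score = 104

# Confidence interval for the "true" score: obtained +/- z * SEM
# (z = 1.0 gives about 68% confidence, z = 1.96 about 95%).
for z, pct in [(1.0, 68), (1.96, 95)]:
    lo, hi = obtained_score - z * sem, obtained_score + z * sem
    print(f"{pct}% confidence interval: {lo:.1f} to {hi:.1f}")

# A less reliable test (smaller r) yields a larger SEM and a wider,
# less informative confidence interval.
```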
VALIDITY
The degree to which a test really measures what it claims to assess.
The extent to which the developers' claim is truthful and genuine.
What is the connection between reliability and validity? Is one needed to achieve the other? (yes)
A test must be reliable if it is to be valid. Because reliability reflects the amount of error in a test, you can't say that you're accurately measuring some trait/characteristic if scores vary a great deal. Reliability is necessary for validity, but not sufficient (more information is needed).
You CAN have good reliability WITHOUT validity. You can attain consistent scores, but the test might not be measuring what you think you're measuring. For example:
Back in the 1960s and 1970s, the Frostig Test of Visual Perception yielded reliable scores on test-retest, but its developers claimed that it measured the degree to which a student had a reading impairment. We now know that troubles with visual perception DO NOT necessarily contribute to reading problems. (Many kids with good visual perception have poor reading ability, while many kids with very poor visual perception skills are excellent readers. Most poor readers have no visual perception problems.)
3 Types of Validity
1. Criterion
2. Content
3. Construct
Criterion Validity
The degree of relationship between two variables (things)
Subcategories
a. Concurrent - Test developers provide evidence of a strong correlation between scores on their test and scores on another test accepted as the standard (the best test available previous to this one). In other words, look at the correlation between the test under consideration and another test that is accepted as being a good measure of that trait or domain.
b. Predictive - the degree to which the test predicts something that will happen in the future (e.g., SAT scores and college grades).
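A minimal sketch of a predictive validity check, using made-up SAT scores and later college GPAs: correlate the two, then fit a simple least-squares line to see what the test would predict.

```python
# Predictive validity sketch: correlate a test score (hypothetical SAT
# scores) with the future outcome it claims to predict (college GPA),
# and fit a least-squares line for prediction.
from statistics import correlation, linear_regression  # Python 3.10+

sat = [1050, 1200, 980, 1380, 1150, 1280, 900, 1420]  # hypothetical scores
gpa = [2.8,  3.2,  2.5, 3.7,  3.0,  3.4,  2.3, 3.6]   # hypothetical outcomes

r = correlation(sat, gpa)
print(f"Predictive validity coefficient: {r:.2f}")

fit = linear_regression(sat, gpa)   # gpa ~ slope * sat + intercept
predicted = fit.slope * 1100 + fit.intercept
print(f"Predicted GPA for an SAT score of 1100: {predicted:.2f}")
```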
Content Validity
The test developers pretty much argue their case logically and present proof that the test is adequate to measure the trait they claim it is assessing.
3 Parts to Content Validity
1. Test Item Appropriateness (the questions are at the correct age, grade, and curriculum level; they are appropriate for youngsters at that age/grade)
2. Completeness of the Item Sample (enough questions are provided to adequately assess a particular area). For example, there should be more than one question on regrouping on a math test.
3. Question Complexity
-How questions were asked and what was demanded of the student(s)
-The wording of the question isn't too complex
-Refers to the level of Bloom's taxonomy at which the question assesses knowledge/ability
(Does the question just demand repeating a fact? Does it require analysis, comparison, or synthesis?)
Construct Validity
a. Specifically define the behaviors/traits/skills you are attempting to measure
b. Form two groups: those who "have it" and those who don't
c. The test should discriminate between the two groups (see the sketch below)
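A minimal sketch of step (c), with made-up scores: if the test has construct validity, the group known to "have" the trait should score clearly higher than the group known not to.

```python
# Construct validity sketch: compare test scores for a group independently
# known to have the trait vs. a group known not to. Scores are hypothetical.
from statistics import mean, stdev

has_trait   = [88, 92, 85, 90, 95, 87]  # e.g., identified as having the trait
lacks_trait = [60, 55, 68, 62, 58, 65]  # identified as not having it

gap = mean(has_trait) - mean(lacks_trait)
pooled_sd = (stdev(has_trait) + stdev(lacks_trait)) / 2  # rough pooled spread
effect_size = gap / pooled_sd  # crude effect size: the gap in SD units

print(f"Mean difference: {gap:.1f} points ({effect_size:.1f} SDs)")
# Large, consistent separation supports the claim that the test measures
# the defined construct; heavily overlapping groups undermine it.
```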
Types of Non-valid "Validity" (Not a real validity, but used by companies to boost sales)
a. "Cash validity" - Companies talk
about how many copies have been sold and how popular the test has become.
b. "Clinical validity" - Uses testimonials
claiming effectiveness/usefulness/accuracy
c. "Internal consistency" - States
that all the questions seem to measure the same thing
(But that's RELIABILITY, not VALIDITY...can't prove that questions measure
the thing they say it measures)
Tom Mcintyre at www.BehaviorAdvisor.com