Validity and Reliability of a Test
 
 

In addition to adequate norms, a test must also be reliable and valid if it is to be useful and accurate.


RELIABILITY

    The consistency of a test.  If a youngster is tested in the morning and again in the afternoon, we expect fairly similar scores.  If not, the test isn't consistent or "reliable".  A test must give us consistent results.  It is of little value if it gives a score for Chester today that is very different from the score he gets tomorrow.

Refers to the consistency of a test.  It should be:
     -stable
     -dependable
     -predictable

A test's predictability is described by a number called a "reliability coefficient" (a correlation between two sets of scores; the closer to 1.0, the more reliable the test).
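For illustration, a minimal Python sketch (the student scores are hypothetical, and statistics.correlation requires Python 3.10+): the coefficient is simply the correlation between two sets of scores from the same students.

```python
from statistics import correlation

# Hypothetical scores for the same six students tested twice
morning   = [85, 72, 90, 64, 78, 88]
afternoon = [83, 75, 91, 60, 80, 86]

# The reliability coefficient is the correlation between the two sets of
# scores: near 1.0 = very consistent, near 0.0 = not consistent at all.
r = correlation(morning, afternoon)
print(f"Reliability coefficient: {r:.2f}")
```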



3 Types of Reliability
    1. Test-Retest (the most commonly reported type of reliability)
    2. Equivalent Form
    3. Split-half

 

Test-Retest
?If you use a test with a student in the morning and then administer it again in the afternoon, would you expect about the same results? (yes)

 

?Which would have the highest test-retest reliability coefficient (correlation)?
     -retesting a youngster on an achievement test within the same week (***answer...less time to learn/change)
     -retesting in different years
 

?Would you use a reading test that had a test-retest reliability of .40? (I hope not...squaring that coefficient shows the two administrations share only 16% of their score variance)
 

Equivalent Form Reliability
Reliability is established by using similar/alternate forms (Forms A & B) that measure the same trait/knowledge.

Two forms are created by randomly splitting the pool of questions before the forms are administered.  One group of students takes Form "A" first, then "B"; another group takes Form "B" first, then "A" (counterbalancing the order cancels out practice effects).  The scores on the two forms are then correlated, producing the reliability coefficient.
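A minimal sketch of that procedure, with an invented 10-item pool and invented scores (Python 3.10+ for statistics.correlation):

```python
import random
from statistics import correlation

# Hypothetical pool of 10 item IDs, split randomly into two forms
items = list(range(1, 11))
random.shuffle(items)
form_a, form_b = sorted(items[:5]), sorted(items[5:])
print("Form A items:", form_a)
print("Form B items:", form_b)

# Hypothetical total scores for the same six students on each form
# (half took A first, half took B first, to cancel out practice effects)
scores_a = [42, 38, 47, 30, 44, 35]
scores_b = [40, 39, 45, 28, 46, 33]

# Equivalent-form reliability = correlation between the two forms' scores
print(f"Equivalent-form reliability: {correlation(scores_a, scores_b):.2f}")
```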

Many tests have two forms for your use.  Why would you want two forms at your disposal?
     -prevent cheating
     -prevent learning of the answers before the next administration
     -prevent a student from saying "Not this one again!  I already took this test."
 
 

Split-half Reliability
After the test has been administered to a large group of students, the test developers correlate scores on the odd-numbered vs. the even-numbered questions (or the 1st half vs. the 2nd half, etc.).  This process measures the test's "internal consistency".
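A minimal sketch of the odd/even split (the response data are invented; the Spearman-Brown step-up at the end is standard practice for split-half coefficients, though not mentioned above, because each half is only half the test's length):

```python
from statistics import correlation

# Hypothetical right (1) / wrong (0) responses for six students on an
# eight-item test; rows are students, columns are items 1..8.
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
]

# Each student's total on the odd-numbered vs. even-numbered items
odd  = [sum(row[0::2]) for row in responses]
even = [sum(row[1::2]) for row in responses]

r_half = correlation(odd, even)

# Spearman-Brown correction: estimates full-length reliability from the
# half-length coefficient.
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r: {r_half:.2f}; corrected full-test r: {r_full:.2f}")
```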
 



 

Which factors can influence the reliability of a test?
     (Whether kids get similar scores or not on repeated administrations of a test)

Test Length
     Are more questions better?
          Yes, but why?

               More questions assess more of a trait or characteristic
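One classical way to put a number on this (an addition here, not from the notes above) is the Spearman-Brown "prophecy" formula, which predicts how reliability rises when a test is lengthened with comparable questions:

```python
def predicted_reliability(r: float, k: float) -> float:
    """Spearman-Brown prophecy: the predicted reliability of a test made
    k times longer, given its current reliability r."""
    return (k * r) / (1 + (k - 1) * r)

# A test with reliability .60, doubled or tripled in length
for k in (1, 2, 3):
    print(f"{k}x length -> predicted r = {predicted_reliability(0.60, k):.2f}")
```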
 

Test-Retest Interval
     What happens if you wait too long?
          (People change)
 

Guessing
     (a student gets an item right one time and wrong the next)
 
 

Changes in the Testing Situation, like: (?)
     -vacuuming in the hallway
     -opportunity to cheat
     -headache
     -tiredness


Another reliability-related characteristic is SEM (?)  standard error of measurement
     Indicates the amount of error in a score obtained on a test/survey
     Tries to account for error that occurred during administration, scoring, & interpretation
     Because no test is totally reliable, we never really know someone's "True" score
           (the score that would have been obtained if the test were totally reliable and free of error)
     Stated in "confidence intervals"
          ...a range within which you can be sure, to some degree of confidence, that the true score will fall
 

*The smaller the SEM, the more reliable the test and the more trustworthy the score.
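A minimal worked example using the standard textbook formula SEM = SD x sqrt(1 - r); the score, standard deviation, and reliability below are invented:

```python
import math

sd = 15.0        # standard deviation of the test's scores (hypothetical)
r = 0.91         # the test's reliability coefficient (hypothetical)
obtained = 104   # one student's obtained score (hypothetical)

# Standard error of measurement: the more reliable the test (r near 1),
# the smaller the SEM.
sem = sd * math.sqrt(1 - r)

# Confidence intervals: the true score falls within +/- 1 SEM of the
# obtained score about 68% of the time, within +/- 1.96 SEM about 95%.
print(f"SEM = {sem:.1f}")
print(f"68% interval: {obtained - sem:.1f} to {obtained + sem:.1f}")
print(f"95% interval: {obtained - 1.96 * sem:.1f} to {obtained + 1.96 * sem:.1f}")
```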
 


VALIDITY

 The degree to which a test really measures what it claims to assess.  The extent to which the developers' claim is truthful and genuine.
 
 

What is the connection between reliability and validity?  Is one needed to achieve the other? (yes)

A test must be reliable if it is to be valid.  Because reliability reflects the amount of error in a test, you can't say that you're accurately measuring some trait/characteristic if scores vary a great deal.  Reliability is necessary for validity, but not sufficient (more information is needed).
 

You CAN have good reliability WITHOUT validity.  You can attain consistent scores, but the test might not be measuring what you think you're measuring.  For example:

Back in the 1960s and 1970s, the Frostig Test of Visual Perception yielded reliable scores on test-retest, but its developers claimed that it measured the degree to which a student had a reading impairment.  We now know that troubles with visual perception DO NOT necessarily contribute to reading problems. (Many kids with good visual perception have poor reading ability, many kids with very poor visual perception skills are excellent readers, and most poor readers have no visual perception problems.)


3 Types of Validity

Criterion
Content
Construct
 

Criterion Validity
    The degree of relationship between two variables (here, the test's scores and an outside "criterion" measure)

        Subcategories
        a.  Concurrent - Test developers provide evidence of a strong correlation between scores on their test and scores on another test accepted as the standard (the best test available prior to this one).  Look at the correlation between the test under consideration and another test that is accepted as being a good measure of that trait or domain.
        b.  Predictive - The degree to which the test predicts something that will happen in the future (e.g., SAT scores and later college grades), as sketched below.
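A minimal sketch of a predictive validity check (all numbers invented; Python 3.10+ for statistics.correlation):

```python
from statistics import correlation

# Hypothetical pairs: each student's entrance-test score and the college
# GPA that student later earned.
test_scores = [1150, 980, 1320, 1040, 1230, 890]
later_gpa   = [3.1, 2.5, 3.7, 2.9, 3.4, 2.2]

# Predictive validity is reported as the correlation between the test
# and the future criterion it claims to predict.
print(f"Predictive validity: {correlation(test_scores, later_gpa):.2f}")
```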
 
 

Content Validity
The test developers pretty much argue their case logically and present proof that the test is adequate to measure the trait they claim it is assessing.

    3 parts to Content Validity
1. Test Item Appropriateness (the questions are at the correct age, grade, and curriculum level, and are appropriate for youngsters at that age/grade)
2. Completeness of the Item Sample
    (enough questions are provided to adequately assess a particular area)
        For example, there should be more than one question on regrouping on a math test.
3. Question Complexity
    -How questions are asked and what is demanded of the student(s)
    -The wording of the question isn't too complex
    -Speaks to the level of "Bloom's taxonomy" at which the questions assess knowledge/ability
        (Does the question just demand repeating of a fact?  Does the question require analysis, comparison, or synthesis?)
 
 

Construct Validity
a.  Specifically define the behaviors/traits/skills you are attempting to measure
b.  Form two groups: Those who "have it" and those who don't
c.  The test should discriminate between the two groups
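A minimal sketch of step (c), the "known groups" comparison (the trait, group memberships, and scores are all invented):

```python
from statistics import mean, stdev

# Hypothetical scores on a test claiming to measure some trait, for a
# group independently known to "have it" and a group known not to.
has_trait   = [78, 84, 71, 90, 82, 75]
lacks_trait = [45, 52, 60, 48, 55, 50]

# If the test measures the construct, the two groups' score
# distributions should be far apart, with little overlap.
gap = mean(has_trait) - mean(lacks_trait)
avg_sd = (stdev(has_trait) + stdev(lacks_trait)) / 2  # rough spread estimate
print(f"Mean difference: {gap:.1f} points (about {gap / avg_sd:.1f} SDs)")
```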


Types of Non-valid "Validity" (not real forms of validity, but claims used by companies to boost sales)

    a.  "Cash validity" - Companies talk about how many copies have been sold and how popular the test has become.
    b.  "Clinical validity" - Uses testimonials claiming effectiveness/usefulness/accuracy
    c.  "Internal consistency" - States that all the questions seem to measure the same thing
                (But that's RELIABILITY, not VALIDITY...it can't prove that the questions measure the thing the developers say they measure)

Tom Mcintyre at www.BehaviorAdvisor.com