Assessment in school: validity and reliability
The reliability of an assessment tool is the extent to which it measures learning consistently.
The validity of an assessment tool is the extent to which it measures what it was designed to measure.
Reliability
The reliability of an assessment tool is the extent to which it consistently and accurately measures learning.
When the results of an assessment are reliable, we can be confident that repeated or equivalent assessments will provide consistent results. This puts us in a better position to make generalised statements about a student’s level of achievement, which is especially important when we are using the results of an assessment to make decisions about teaching and learning, or when we are reporting back to students and their parents or caregivers. No results, however, can be completely reliable. There is always some random variation that may affect the assessment, so educators should always be prepared to question results.
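One common way to quantify this consistency is test-retest reliability: the correlation between students' scores on two administrations of the same or an equivalent assessment. The sketch below is a minimal illustration using invented scores for ten students; in practice you would rely on a tool's published reliability statistics.

```python
import numpy as np

# Invented scores for the same ten students on two equivalent
# administrations of an assessment (illustration only).
first_sitting  = np.array([12, 15,  9, 18, 14, 11, 16, 13, 10, 17])
second_sitting = np.array([13, 14, 10, 19, 13, 12, 15, 14,  9, 18])

# Test-retest reliability: the Pearson correlation between the two
# sets of scores. Values close to 1.0 indicate consistent results.
reliability = np.corrcoef(first_sitting, second_sitting)[0, 1]
print(f"Test-retest reliability: {reliability:.2f}")
```

A coefficient near 1.0 supports the kind of generalised statements described above; a low coefficient suggests that random variation is large relative to real differences between students.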
Factors which can affect reliability:
- The length of the assessment – a longer assessment generally produces more reliable results (the Spearman–Brown sketch after this list quantifies this effect).
- The suitability of the questions or tasks for the students being assessed.
- The phrasing and terminology of the questions.
- The consistency in test administration – for example, the length of time given for the assessment and the instructions given to students before the test.
- The design of the marking schedule and moderation of marking procedures.
- The readiness of students for the assessment – for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.
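The first factor above – the effect of length on reliability – can be estimated with the Spearman–Brown prediction formula. This is a minimal sketch; the starting coefficient of 0.70 is an assumed figure, not a property of any particular assessment tool.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict the reliability of an assessment whose length is
    scaled by `length_factor`, given its current reliability."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Assumed starting reliability of 0.70, for illustration only.
print(spearman_brown(0.70, 2.0))  # doubling the assessment: ~0.82
print(spearman_brown(0.70, 0.5))  # halving the assessment:  ~0.54
```

Note the diminishing returns: lengthening an assessment pushes reliability towards, but never to, 1.0.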
Validity
Educational assessment should always have a clear purpose. Nothing will be gained from assessment unless the assessment has some validity for the purpose. For that reason, validity is the most important single attribute of a good test.
The validity of an assessment tool is the extent to which it measures what it was designed to measure, without contamination from other characteristics. For example, a test of reading comprehension should not require mathematical ability.
There are several different types of validity:
- Face validity: do the assessment items appear to be appropriate?
- Content validity: does the assessment content cover what you want to assess?
- Criterion-related validity: do the results correspond with an external criterion – an established measure of the same skill or knowledge? (See the sketch after this list.)
- Construct validity: are you measuring what you think you're measuring?
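Criterion-related validity, in particular, is usually reported as a correlation between scores on the assessment and scores on the criterion measure. A minimal sketch, again with invented data (the scipy dependency and the example scales are assumptions):

```python
from scipy.stats import pearsonr

# Invented data: eight students' scores on a new reading assessment
# and their levels on an established criterion measure.
new_assessment = [34, 28, 41, 25, 38, 30, 45, 27]
criterion      = [5, 4, 6, 3, 6, 4, 7, 4]

# Criterion-related validity coefficient: the correlation between
# the new assessment and the established criterion.
validity, _ = pearsonr(new_assessment, criterion)
print(f"Validity coefficient: {validity:.2f}")
```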
It is fairly obvious that a valid assessment should have good coverage of the criteria (concepts, skills and knowledge) relevant to the purpose of the assessment. The important notion here is the purpose. For example:
- STAR (Supplementary Test of Achievement in Reading) is not designed as a comprehensive test of reading ability. It focuses on assessing students’ vocabulary understanding, basic sentence comprehension and paragraph comprehension. It is most appropriately used for students who don’t score well on more general testing (such as PAT or e-asTTle), as it provides a more fine-grained analysis of basic comprehension strategies.
There is an important relationship between reliability and validity. An assessment that has very low reliability will also have low validity; clearly a measurement with very poor accuracy or consistency is unlikely to be fit for its purpose. But, by the same token, the things required to achieve a very high degree of reliability can impact negatively on validity. For example, consistency in assessment conditions leads to greater reliability because it reduces 'noise' (variability) in the results. On the other hand, one of the things that can improve validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to be set appropriate to the learning context and to be made relevant to particular groups of students. Insisting on highly consistent assessment conditions to attain high reliability will result in little flexibility, and might therefore limit validity.
The Overall Teacher Judgment balances these notions, weighing the reliability of a formal assessment tool against the flexibility to use other evidence to make a judgment.