Course: Educational Measurement and Evaluation (6507)
Semester: Autumn, 2021
What is the importance of feedback in student progress? Explain the difference between summative and formative assessment.
Formative assessment and summative assessment are two overlapping, complementary ways of assessing pupil progress in schools. While the common goal is to establish the development, strengths and weaknesses of each student, each assessment type provides different insights and actions for educators. The key to holistic assessment practice is to understand what each method contributes to the end goals — improving school attainment levels and individual pupils’ learning — and to maximise the effectiveness of each.
Both terms are ubiquitous, yet teachers sometimes lack clarity around the most effective types of summative assessment and more creative methods of formative assessment. In our latest State of Technology in Education report, we learnt that more educators are using online tools to track summative assessment than formative, for example. Yet this needn’t be the case. In this post we will explain the difference between these two types of assessment, outline some methods of evaluation, and assess why both are essential to student development.
Summative assessment explained
Summative assessment aims to evaluate student learning and academic achievement at the end of a term, year or semester by comparing it against a universal standard or school benchmark. Summative assessments often have a high point value, take place under controlled conditions, and therefore have more visibility.
Summative assessment examples:
- End-of-term or midterm exams
- Cumulative work over an extended period such as a final project or creative portfolio
- End-of-unit or chapter tests
- Standardised tests that demonstrate school accountability or that are used for pupil admissions
Assessment allows both instructor and student to monitor progress towards achieving learning objectives, and can be approached in a variety of ways. Formative assessment refers to tools that identify misconceptions, struggles, and learning gaps along the way and assess how to close those gaps. It includes effective tools for helping to shape learning, and can even bolster students’ abilities to take ownership of their learning when they understand that the goal is to improve learning, not apply final marks (Trumbull and Lash, 2013). It can include students assessing themselves, peers, or even the instructor, through writing, quizzes, conversation, and more. In short, formative assessment occurs throughout a class or course, and seeks to improve student achievement of learning objectives through approaches that can support specific student needs (Theall and Franklin, 2010, p. 151).
In contrast, summative assessments evaluate student learning, knowledge, proficiency, or success at the conclusion of an instructional period, like a unit, course, or program. Summative assessments are almost always formally graded and often heavily weighted (though they do not need to be). Summative assessment can be used to great effect in conjunction and alignment with formative assessment, and instructors can consider a variety of ways to combine these approaches.
Formative Assessment

Ideally, formative assessment strategies improve teaching and learning simultaneously. Instructors can help students grow as learners by actively encouraging them to self-assess their own skills and knowledge retention, and by giving clear instructions and feedback. Seven principles (adapted from Nicol and Macfarlane-Dick, 2007 with additions) can guide instructor strategies:
Keep clear criteria for what defines good performance – Instructors can explain criteria for A-F graded papers, and encourage student discussion and reflection about these criteria (this can be accomplished through office hours, rubrics, post-grade peer review, or exam / assignment wrappers). Instructors may also hold class-wide conversations on performance criteria at strategic moments throughout a term.
Encourage students’ self-reflection – Instructors can ask students to utilize course criteria to evaluate their own or a peer’s work, and to share what kinds of feedback they find most valuable. In addition, instructors can ask students to describe the qualities of their best work, either through writing or group discussion.
Give students detailed, actionable feedback – Instructors can consistently provide specific feedback tied to predefined criteria, with opportunities to revise or apply feedback before final submission. Feedback may be corrective and forward-looking, rather than just evaluative. Examples include comments on multiple paper drafts, criterion discussions during 1-on-1 conferences, and regular online quizzes.
Encourage teacher and peer dialogue around learning – Instructors can invite students to discuss the formative learning process together. This practice primarily revolves around mid-semester feedback and small group feedback sessions, where students reflect on the course and instructors respond to student concerns. Students can also identify examples of feedback comments they found useful and explain how they helped. A particularly useful strategy, instructors can invite students to discuss learning goals and assignment criteria, and weave student hopes into the syllabus.
Promote positive motivational beliefs and self-esteem – Students will be more motivated and engaged when they are assured that an instructor cares for their development. Instructors can allow for rewrites/resubmissions to signal that an assignment is designed to promote development of learning. These rewrites might utilize low-stakes assessments, or even automated online testing that is anonymous, and (if appropriate) allows for unlimited resubmissions.
Provide opportunities to close the gap between current and desired performance – Related to the above, instructors can improve student motivation and engagement by making visible any opportunities to close gaps between current and desired performance. Examples include opportunities for resubmission, specific action points for writing or task-based assignments, and sharing study or process strategies that an instructor would use in order to succeed.
Collect information which can be used to help shape teaching – Instructors can feel free to collect useful information from students in order to provide targeted feedback and instruction. Students can identify where they are having difficulties, either on an assignment or test, or in written submissions. This approach also promotes metacognition, as students are asked to think about their own learning. Poorvu Center staff can also perform a classroom observation or conduct a small group feedback session that can give instructors insight into potential student struggles.
Instructors can find a variety of other formative assessment techniques in Angelo and Cross (1993), Classroom Assessment Techniques.
Summative Assessment

Because summative assessments are usually higher-stakes than formative assessments, it is especially important to ensure that the assessment aligns with the goals and expected outcomes of the instruction.
Use a Rubric or Table of Specifications – Instructors can use a rubric to lay out expected performance criteria for a range of grades. Rubrics will describe what an ideal assignment looks like, and “summarize” expected performance at the beginning of term, providing students with a trajectory and sense of completion.
Design Clear, Effective Questions – If designing essay questions, instructors can ensure that questions meet criteria while allowing students freedom to express their knowledge creatively and in ways that honor how they digested, constructed, or mastered meaning. Instructors can read about ways to design effective multiple choice questions.
Assess Comprehensiveness – Effective summative assessments provide an opportunity for students to consider the totality of a course’s content, making broad connections, demonstrating synthesized skills, and exploring deeper concepts that drive or found a course’s ideas and content.
Make Parameters Clear – When approaching a final assessment, instructors can ensure that parameters are well defined (length of assessment, depth of response, time and date, grading standards); knowledge assessed relates clearly to content covered in course; and students with disabilities are provided required space and support.
Consider Blind Grading – Instructors may wish to know whose work they grade, in order to provide feedback that speaks to a student’s term-long trajectory. If instructors wish to provide truly unbiased summative assessment, they can also consider a variety of blind grading techniques.
Differentiate between the concepts of content and construct validity. What is the importance of adequacy of test items in testing?
Construct validity is established experimentally to demonstrate that a survey distinguishes between people who do and do not have certain characteristics. For example, a survey researcher who claims construct validity for a measure of satisfaction will have to demonstrate in a scientific manner that satisfied respondents behave differently from dissatisfied respondents.
Construct validity is commonly established in at least two ways:
The survey researcher hypothesizes that the new measure correlates with one or more measures of a similar characteristic (convergent validity) and does not correlate with measures of dissimilar characteristics (discriminant validity). For example, a survey researcher who is validating a new quality-of-life survey might posit that it is highly correlated with another quality-of-life instrument, a measure of functioning, and a measure of health status. At the same time, the survey researcher would hypothesize that the new measure does not correlate with selected measures of social desirability (the tendency to answer questions so as to present yourself in a more positive light) and of hostility.
The survey researcher hypothesizes that the measure can distinguish one group from the other on some important variable. For example, a measure of compassion should be able to demonstrate that people who are high scorers are compassionate but that people who are low scorers are unfeeling. This requires translating a theory of compassionate behavior into measurable terms, identifying people who are compassionate and those who are unfeeling (according to the theory), and proving that the measure consistently and correctly distinguishes between the two groups.
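The convergent/discriminant logic above can be illustrated with a small calculation. All scores below are invented for illustration: a new quality-of-life measure should correlate highly with an established quality-of-life instrument (convergent validity) and only weakly with a social-desirability scale (discriminant validity).

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical scores for five respondents on three instruments.
new_qol      = [10, 14, 18, 22, 26]   # new quality-of-life measure
existing_qol = [11, 13, 19, 21, 27]   # established quality-of-life measure
soc_desire   = [ 7,  3,  9,  2,  6]   # social-desirability scale

convergent   = pearson(new_qol, existing_qol)   # expect high (near 1)
discriminant = pearson(new_qol, soc_desire)     # expect near 0
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```

A validation study would run such correlations on real respondent data and check that the pattern of high and low correlations matches the researcher's hypotheses.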
Construct validity concerns the identification of the causes, effects, settings, and participants that are present in a study. For example, a medication might have an effect not because its putative active ingredients are absorbed into the bloodstream but because of its placebo effects. In this case, the cause would be misidentified if its effects were attributed to the absorption of the medication’s active ingredients rather than to a placebo effect. Assessing the causal chain by which a treatment has its effects (i.e., determining the variables that mediate an effect) can reduce such misattributions. Alternatively, a cause can be misspecified when the treatment is implemented with less strength or integrity than intended. Manipulation checks are often used, especially in laboratory studies, to assess whether treatments are implemented as planned.
Content and Construct Validity
Construct validity means the test measures the skills/abilities that should be measured.
Content validity means the test measures appropriate content.
ETS gathers information from graduate and professional school programs, including business and law schools, about the skills that they consider essential for success in their programs.
Verbal Reasoning Section
The Verbal Reasoning section of the GRE® General Test measures skills that faculty have identified through surveys as important for graduate-level success. The capabilities that are assessed include:
the ability to understand text (such as the ability to understand the meanings of sentences, to recognize an accurate summary of a text or to distinguish major points from irrelevant points in a passage); and
the ability to interpret discourse (such as the ability to draw conclusions, to infer missing information or to identify assumptions).
Quantitative Reasoning Section
The Quantitative Reasoning section of the GRE General Test measures skills that are consistent with those outlined in the Mathematical Association of America’s Quantitative Reasoning for College Graduates: A Complement to the Standards, and that are based on feedback from faculty surveys. The skills that are assessed in the GRE quantitative measure include:
- reading and understanding quantitative information
- interpreting and analyzing quantitative information, including drawing inferences from data
- using mathematical methods to solve quantitative problems
Analytical Writing Section
Interviews with graduate-level faculty, surveys of graduate-level faculty, and the work of the GRE Writing Test Committee have consistently identified critical thinking and writing skills as important for success in graduate programs.
The two tasks that comprise the Analytical Writing section (evaluating an issue and evaluating an argument) are both considered essential in many fields of graduate study. Thus, the structure of the test can be shown to have content validity because the test assesses skills identified by the graduate community as essential for success in many fields of graduate-level work.
Validity is the most important issue in selecting a test. Validity refers to what characteristic the test measures and how well the test measures that characteristic.
Validity tells you if the characteristic being measured by a test is related to job qualifications and requirements.
Validity gives meaning to the test scores. Validity evidence indicates that there is linkage between test performance and job performance. It can tell you what you may conclude or predict about someone from his or her score on the test. If a test has been demonstrated to be a valid predictor of performance on a specific job, you can conclude that persons scoring high on the test are more likely to perform well on the job than persons who score low on the test, all else being equal.
Validity also describes the degree to which you can make specific conclusions or predictions about people based on their test scores. In other words, it indicates the usefulness of the test.
Methods for conducting validation studies
The Uniform Guidelines discuss the following three methods of conducting validation studies. The Guidelines describe conditions under which each type of validation strategy is appropriate. They do not express a preference for any one strategy to demonstrate the job-relatedness of a test.
Criterion-related validation requires demonstration of a correlation or other statistical relationship between test performance and job performance. In other words, individuals who score high on the test tend to perform better on the job than those who score low on the test. If the criterion is obtained at the same time the test is given, it is called concurrent validity; if the criterion is obtained at a later time, it is called predictive validity.
Content-related validation requires a demonstration that the content of the test represents important job-related behaviors. In other words, test items should be relevant to and measure directly important requirements and qualifications for the job.
Construct-related validation requires a demonstration that the test measures the construct or characteristic it claims to measure, and that this characteristic is important to successful performance on the job.
The three methods of validity (criterion-related, content, and construct) should be used to provide validation support depending on the situation. These three general methods often overlap, and, depending on the situation, one or more may be appropriate. French (1990) offers situational examples of when each method of validity may be applied.
Explain practical considerations in developing a test blueprint. Develop a protocol for test administration.
The content areas listed in the test blueprint, or table of specifications, are frequently drawn directly from the results of a job analysis. These content areas comprise the knowledge, skills, and abilities that have been determined to be the essential elements of competency for the job or occupation being assessed. In addition to the listing of content areas, the test blueprint specifies the number or proportion of items that are planned to be included on each test form for each content area. These proportions reflect the relative importance of each content area to competency in the occupation.
Most test blueprints also indicate the levels of cognitive processing that the examinees will be expected to use in responding to specific items (e.g., Knowledge, Application). It is critical that your test blueprint and test items include a substantial proportion of items targeted above the Knowledge-level of cognition. A typical test blueprint is presented in a two-way matrix with the content areas listed in the table rows and the cognitive processes in the table columns. The total number of items specified for each column indicates the proportional plan for each cognitive level on the overall test, just as the total number of items for each row indicates the proportional emphasis of each content area.
The test blueprint is used to guide and target item writing as well as for test form assembly. Use of a test blueprint improves consistency across test forms as well as helping ensure that the goals and plans for the test are met in each operational test. An example of a test blueprint is provided next.
Example of a Test Blueprint
In this (artificial) test blueprint for a Real Estate licensure exam, the overall test length is specified as 80 items. This relatively small test blueprint includes four major content areas for the exam (e.g., Real Estate Law) and three levels of cognitive processing: Knowledge, Comprehension, and Application.
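The two-way matrix described above can be sketched in code. Only the 80-item total and the "Real Estate Law" content area come from the example; the other content areas and all weights below are hypothetical.

```python
# Hypothetical 80-item blueprint: content-area weights (rows) crossed with
# cognitive-level proportions (columns), as in a two-way specification table.
TOTAL_ITEMS = 80
content_weights = {          # proportional emphasis per content area
    "Real Estate Law":     0.30,
    "Financing":           0.25,
    "Valuation/Appraisal": 0.25,
    "Contracts":           0.20,
}
cognitive_weights = {"Knowledge": 0.25, "Comprehension": 0.35, "Application": 0.40}

# Each cell plans the number of items for one content area at one cognitive level.
blueprint = {
    area: {level: round(TOTAL_ITEMS * aw * lw)
           for level, lw in cognitive_weights.items()}
    for area, aw in content_weights.items()
}

for area, cells in blueprint.items():
    print(f"{area:20s} {cells}  row total = {sum(cells.values())}")
```

Row totals give the planned emphasis per content area and column totals the planned emphasis per cognitive level, exactly as the blueprint matrix prescribes.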
A test blueprint is a list of key components defining your test, including:
The purpose of the test: It might be something simple, such as assessing knowledge prior to instruction to get a baseline of what students know before taking a course. Alternatively, the test purpose might be more complex, such as assessing retention of material learned across several organ-system courses to determine eligibility for advancement.
The content framework: Start with the schemas or frameworks commonly used to organize and consolidate medical knowledge. For example, basic science (e.g., biochemistry, genetics) or clinical science (e.g., surgery, pediatrics) disciplines are common schemas.
The testing time: This includes amount of testing time available and the need for breaks, as well as other logistical issues related to the test administration.
The content weighting (aka, number of items per content area): The number of questions per topic category should reflect the importance of the topic; that is, they should correlate with the amount of time spent on that topic in the course. For example, if there are 20 one-hour lectures, there may be 10 questions from each hour of lecture or associated with each hour of expected study. The number of questions per category can be adjusted up or down to better balance the overall test content and represent the importance of each lecture, as well as the total lecture time.
The item formats (e.g., MCQ, essay question): The item formats should always be appropriate for the purpose of the assessment.
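The content-weighting step above can be sketched as a simple proportional calculation. The topic names, lecture hours, and total test length below are invented for illustration:

```python
# Sketch: derive questions-per-topic from hours of instruction, assuming a
# fixed total test length. Topics and hours are hypothetical.
lecture_hours = {"Pharmacokinetics": 6, "Drug Interactions": 4,
                 "Dosing Calculations": 5, "Adverse Effects": 5}
total_questions = 40

total_hours = sum(lecture_hours.values())   # 20 hours of instruction
weights = {topic: hours / total_hours for topic, hours in lecture_hours.items()}
questions = {topic: round(w * total_questions) for topic, w in weights.items()}
print(questions)   # items per topic, proportional to instructional time
```

As the text notes, these counts are a starting point; the number per category can then be adjusted up or down to balance the overall test content.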
Benefits of Test Blueprints
Test blueprints will help ensure that your tests:
- Appropriately assess the instructional objectives of the course
- Appropriately reflect key course goals and objectives – the material to be learned
- Include the appropriate item formats for the skills being assessed
Test blueprints can be used for additional purposes besides test construction:
- Demonstrate to students the topics you value, and serve as a study guide for them
- Facilitate learning by providing a framework or mental schema for students
- Ensure consistent coverage of exam content from year to year
- Communicate course expectations to stakeholders (e.g., trainees, other faculty, administration)
Examinations are a frequent assessment method used in higher education to objectively measure student competency in attaining course learning objectives.1,2 Examinations can serve as powerful motivators by communicating to students which concepts and material are particularly important.2,3 Faculty may then use the results to identify student misconceptions, evaluate learning objectives/activities, and make decisions regarding instructional practices.
The Accreditation Council for Pharmacy Education standards hold pharmacy programs accountable to ensure the validity of individual student assessments and integrity of student work.5 Among specific requirements and suggestions, faculty should ensure that examinations take place under circumstances that minimize academic misconduct, confirm the identity of students taking the examination, and consider examination validation and enhancement to ensure appropriate student progression.5 Sound principles for examination construction and validation may not be fully understood by all faculty. Additionally, uniform agreement is lacking on the best procedures for examination administration, and whether examinations should be returned to students or retained by faculty. This commentary provides an overview for best practices in examination construction and blueprinting, considerations for ensuring optimal and secure administration of examinations, and guidance for examination reviews and feedback on student performance.
An examination is considered reliable if the results it generates are consistent and reproducible, such that a student would perform similarly on multiple versions of the examination. An examination is considered valid if it is both reliable and measures the student’s knowledge and skill(s) that it intends to measure. The goal of validity for examinations is to ensure that a representative sample of the intended learning objectives is measured, and that students have satisfied the minimum performance level to be competent with respect to the stated objectives.1,2
There are four important principles that faculty need to consider when creating a content-valid examination. First, establish the purpose of the examination and take steps to ensure that it measures the desired construct(s). Second, link items on the examination to the course learning objectives and the intended teaching taxonomies. Third, ensure that items are clearly written and well-structured; items that are ambiguous or lack congruence with the objectives may confuse students and directly affect examination scores. The last principle specifies that experts in the field should review the examination to ensure the other three principles have been met.
Why are essay type items considered easy to administer but difficult to score? Explain the challenges of scoring essay type test items with practical examples.
The item analysis is an important phase in the development of an exam program. In this phase statistical methods are used to identify any test items that are not working well. If an item is too easy, too difficult, failing to show a difference between skilled and unskilled examinees, or even scored incorrectly, an item analysis will reveal it. The two most common statistics reported in an item analysis are the item difficulty, which is a measure of the proportion of examinees who responded to an item correctly, and the item discrimination, which is a measure of how well the item discriminates between examinees who are knowledgeable in the content area and those who are not. An additional analysis that is often reported is the distractor analysis. The distractor analysis provides a measure of how well each of the incorrect options contributes to the quality of a multiple choice item. Once the item analysis information is available, an item review is often conducted.
Item Analysis Statistics
Item Difficulty Index
The item difficulty index is one of the most useful, and most frequently reported, item analysis statistics. It is a measure of the proportion of examinees who answered the item correctly; for this reason it is frequently called the p-value. As the proportion of examinees who got the item right, the p-value might more properly be called the item easiness index, rather than the item difficulty. It can range between 0.0 and 1.0, with a higher value indicating that a greater proportion of examinees responded to the item correctly, and it was thus an easier item. For criterion-referenced tests (CRTs), with their emphasis on mastery-testing, many items on an exam form will have p-values of .9 or above. Norm-referenced tests (NRTs), on the other hand, are designed to be harder overall and to spread out the examinees’ scores. Thus, many of the items on an NRT will have difficulty indexes between .4 and .6.
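The p-value calculation described above is simply a column mean over a scored response matrix. The responses below are invented for illustration:

```python
# Minimal sketch: item p-values from a 0/1 response matrix
# (rows = examinees, columns = items). Data are illustrative.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
]

n_examinees = len(responses)
p_values = [sum(row[i] for row in responses) / n_examinees
            for i in range(len(responses[0]))]
print(p_values)   # → [0.8, 0.8, 0.2, 1.0]
```

Here items 1, 2, and 4 would sit comfortably on a mastery-oriented CRT, while item 3 (p = .2) would be flagged as very hard and reviewed.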
Item Discrimination Index
The item discrimination index is a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not, or between masters and non-masters. There are actually several ways to compute an item discrimination, but one of the most common is the point-biserial correlation. This statistic looks at the relationship between an examinee’s performance on the given item (correct or incorrect) and the examinee’s score on the overall test. For an item that is highly discriminating, in general the examinees who responded to the item correctly also did well on the test, while in general the examinees who responded to the item incorrectly also tended to do poorly on the overall test.
The possible range of the discrimination index is -1.0 to 1.0; however, if an item has a discrimination below 0.0, it suggests a problem. When an item is discriminating negatively, overall the most knowledgeable examinees are getting the item wrong and the least knowledgeable examinees are getting the item right. A negative discrimination index may indicate that the item is measuring something other than what the rest of the test is measuring. More often, it is a sign that the item has been mis-keyed.
When interpreting the value of a discrimination it is important to be aware that there is a relationship between an item’s difficulty index and its discrimination index. If an item has a very high (or very low) p-value, the potential value of the discrimination index will be much less than if the item has a mid-range p-value. In other words, if an item is either very easy or very hard, it is not likely to be very discriminating. A typical CRT, with many high item p-values, may have most item discriminations in the range of 0.0 to 0.3. A useful approach when reviewing a set of item discrimination indexes is to also view each item’s p-value at the same time. For example, if a given item has a discrimination index below .1, but the item’s p-value is greater than .9, you may interpret the item as being easy for almost the entire set of examinees, and probably for that reason not providing much discrimination between high ability and low ability examinees.
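The point-biserial discrimination mentioned above is just the Pearson correlation between a 0/1 item score and the total test score. The scores below are invented so that high scorers get the item right, which should yield a strongly positive index:

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """Point-biserial: Pearson correlation between a 0/1 item and total score."""
    mi, mt = mean(item_scores), mean(total_scores)
    cov = sum((i - mi) * (t - mt)
              for i, t in zip(item_scores, total_scores)) / len(item_scores)
    return cov / (pstdev(item_scores) * pstdev(total_scores))

# Illustrative data: the three highest total scorers answered correctly,
# the three lowest incorrectly, so the item discriminates positively.
item   = [1, 1, 1, 0, 0, 0]
totals = [48, 45, 40, 30, 28, 25]
print(f"{point_biserial(item, totals):.2f}")   # → 0.95
```

Reversing the item column (low scorers correct, high scorers wrong) would flip the sign, producing the negative discrimination that signals a mis-keyed item.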
One important element in the quality of a multiple choice item is the quality of the item’s distractors. However, neither the item difficulty nor the item discrimination index considers the performance of the incorrect response options, or distractors. A distractor analysis addresses the performance of these incorrect response options.
Just as the key, or correct response option, must be definitively correct, the distractors must be clearly incorrect (or clearly not the “best” option). In addition to being clearly incorrect, the distractors must also be plausible. That is, the distractors should seem likely or reasonable to an examinee who is not sufficiently knowledgeable in the content area. If a distractor appears so unlikely that almost no examinee will select it, it is not contributing to the performance of the item. In fact, the presence of one or more implausible distractors in a multiple choice item can make the item artificially far easier than it ought to be.
In a simple approach to distractor analysis, the proportion of examinees who selected each of the response options is examined. For the key, this proportion is equivalent to the item p-value, or difficulty. If the proportions are summed across all of an item’s response options they will add up to 1.0, or 100% of the examinees’ selections.
The proportion of examinees who select each of the distractors can be very informative. For example, it can reveal an item mis-key. Whenever the proportion of examinees who selected a distractor is greater than the proportion of examinees who selected the key, the item should be examined to determine if it has been mis-keyed or double-keyed. A distractor analysis can also reveal an implausible distractor. In CRTs, where the item p-values are typically high, the proportions of examinees selecting all the distractors are, as a result, low. Nevertheless, if examinees consistently fail to select a given distractor, this may be evidence that the distractor is implausible or simply too easy.
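The simple distractor analysis described above can be sketched directly. The key, options, and responses below are invented for illustration:

```python
from collections import Counter

# Sketch of a distractor analysis: proportion of examinees choosing each
# option. "B" is the key; responses are illustrative.
key = "B"
responses = ["B", "B", "C", "B", "A", "B", "B", "C", "B", "B"]

counts = Counter(responses)
n = len(responses)
proportions = {opt: counts.get(opt, 0) / n for opt in "ABCD"}
print(proportions)   # the key's proportion equals the item p-value

# A distractor chosen more often than the key suggests a mis-key;
# a distractor no one selects may be implausible.
for opt, p in proportions.items():
    if opt != key and p > proportions[key]:
        print(f"possible mis-key: {opt}")
    if opt != key and p == 0:
        print(f"possibly implausible distractor: {opt}")
```

In this toy data, option D attracts no examinees and would be flagged as implausible, while B's proportion (.7) is the item's difficulty index.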
Once the item analysis data are available, it is useful to hold a meeting of test developers, psychometricians, and subject matter experts. During this meeting the items can be reviewed using the information provided by the item analysis statistics. Decisions can then be made about item changes that are needed or even items that ought to be dropped from the exam. Any item that has been substantially changed should be returned to the bank for pretesting before it is again used operationally. Once these decisions have been made, the exams should be rescored, leaving out any items that were dropped and using the correct key for any items that were found to have been mis-keyed. This corrected scoring will be used for the examinees’ score reports.
Essay tests are useful for teachers when they want students to select, organize, analyze, synthesize, and/or evaluate information. In other words, they rely on the upper levels of Bloom’s Taxonomy. There are two types of essay questions: restricted and extended response.
Restricted Response – These essay questions limit what the student will discuss in the essay based on the wording of the question. For example, “State the main differences between John Adams’ and Thomas Jefferson’s beliefs about federalism,” is a restricted response. What the student is to write about has been expressed to them within the question.
Extended Response – These allow students to select what they wish to include in order to answer the question. For example, “In Of Mice and Men, was George’s killing of Lennie justified? Explain your answer.” The student is given the overall topic, but they are free to use their own judgment and integrate outside information to help support their opinion.
Student Skills Required for Essay Tests
Before expecting students to perform well on either type of essay question, we must make sure that they have the required skills to excel. Following are four skills that students should have learned and practiced before taking essay exams:
The ability to select appropriate material from the information learned in order to best answer the question.
The ability to organize that material in an effective manner.
The ability to show how ideas relate and interact in a specific context.
The ability to write effectively in both sentences and paragraphs.
Constructing an Effective Essay Question
Following are a few tips to help in the construction of effective essay questions:
Begin with the lesson objectives in mind. Make sure to know what you wish the student to show by answering the essay question.
Decide if your goal requires a restricted or extended response. In general, if you wish to see if the student can synthesize and organize the information that they learned, then restricted response is the way to go. However, if you wish them to judge or evaluate something using the information taught during class, then you will want to use the extended response.
If you are including more than one essay, be cognizant of time constraints. You do not want to punish students because they ran out of time on the test.
Write the question in a novel or interesting manner to help motivate the student.
State the number of points that the essay is worth. You can also provide them with a time guideline to help them as they work through the exam.
If your essay item is part of a larger objective test, make sure that it is the last item on the exam.
Scoring the Essay Item
One of the downfalls of essay tests is that they lack reliability. Even when teachers grade essays with a well-constructed rubric, subjective decisions are made. Therefore, it is important to be as consistent as possible when scoring your essay items. Here are a few tips to help improve reliability in grading:
Determine whether you will use a holistic or analytic scoring system before you write your rubric. With the holistic grading system, you evaluate the answer as a whole, rating papers against each other. With the analytic system, you list specific pieces of information and award points for their inclusion.
Prepare the essay rubric in advance. Determine what you are looking for and how many points you will be assigning for each aspect of the question.
Avoid looking at names. Some teachers have students put numbers on their essays to try and help with this.
Score one item at a time. This helps ensure that you use the same thinking and standards for all students.
Avoid interruptions when scoring a specific question. Again, consistency will be increased if you grade the same item on all the papers in one sitting.
If an important decision like an award or scholarship is based on the score for the essay, obtain two or more independent readers.
Beware of negative influences that can affect essay scoring. These include handwriting and writing style bias, the length of the response, and the inclusion of irrelevant material.
Review papers that are on the borderline a second time before assigning a final grade.
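To make the holistic/analytic distinction above concrete, here is a minimal sketch of analytic scoring, where specific criteria each carry a point value and the awarded points are summed. The criteria and weights are invented for illustration, not prescribed by the text.

```python
# Hypothetical analytic scoring rubric: each criterion has a maximum
# point value, and the grader awards points per criterion.
rubric = {"thesis": 5, "evidence": 10, "organization": 5, "mechanics": 5}

def score_essay(awarded, rubric):
    """Sum awarded points, capping each criterion at its rubric maximum."""
    return sum(min(awarded.get(c, 0), maximum) for c, maximum in rubric.items())

total = score_essay(
    {"thesis": 4, "evidence": 8, "organization": 5, "mechanics": 3}, rubric
)
print(total)  # prints: 20 (out of a possible 25)
```

Holistic scoring, by contrast, would assign a single overall rating to the answer rather than summing criterion-level points.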
What is the role of authentic assessment in student progress? How can students benefit from continuous testing and feedback?
When considering how to assess student learning in a course, most instructors would agree that the ideal assessment is one that not only assesses students’ learning but also teaches students and improves their skills and understanding of course content. One fundamental aspect of such assessments is that they are authentic.
An authentic assignment is one that requires application of what students have learned to a new situation, and that demands judgment to determine what information and skills are relevant and how they should be used. Authentic assignments often focus on messy, complex real-world situations and their accompanying constraints; they can involve a real-world audience of stakeholders or “clients” as well. According to Grant Wiggins (1998), an assignment is authentic if it
requires judgment and innovation.
asks the student to “do” the subject.
replicates or simulates the contexts in which adults are “tested” in the workplace or in civic or personal life.
assesses the student’s ability to efficiently and effectively use a repertoire of knowledge and skills to negotiate a complex task.
allows appropriate opportunities to rehearse, practice, consult resources, and get feedback on and refine performances and products.
Authentic assessments can be contrasted with conventional test questions, which are often indirect measures of a student’s ability to apply the knowledge and skills gained in a course. Conventional tests have an important place in college courses, but cannot take the place of authentic assessments.
Assessment is a critical component of the online classroom. It provides students with an idea of their progress in a course, identifies individual strengths and weaknesses, and ultimately serves as the measure of whether students achieve the course’s learning objectives. Although each of these characteristics serves a valuable instructional or pedagogical function, it’s also important that assessments engage students and prepare them with the skills they’ll need in future courses, practicums, and even their careers.
Assessment isn’t just important from a student perspective. With the online marketplace becoming increasingly crowded, it’s critical that institutions ensure they are offering the courses and experiences that students are looking for.
In their survey of more than 1,500 past, present, and prospective online college students, Magda and Aslanian (2018) found that 74% of online college students are pursuing their program for career-focused reasons, such as transitioning to a new career, updating the skills required for their current job, increasing wages, or meeting employers’ requirements.
Because the majority of online students are career-focused, courses and degree programs must provide those ties to the real world for institutions to stand out in the online landscape. Given its importance to the online classroom, assessment can be a great starting point for integrating this type of relevance into your course.
Authentic assessment is the idea of using creative learning experiences to test students’ skills and knowledge in realistic situations. Authentic assessment measures students’ success in a way that’s relevant to the skills required of them once they’ve finished your course or degree program. In this article, we’ll discuss the benefits and challenges of this type of assessment and how you can incorporate authentic assessment in your online course.
Before we dig too far into how to create an authentic assessment, let’s return to what makes an assessment authentic. As outlined above, Wiggins (1998) holds that an authentic assessment requires judgment and innovation; asks the student to “do” the subject; replicates or simulates the contexts in which adults are “tested” in the workplace and in civic or personal life; assesses the student’s ability to efficiently and effectively use a repertoire of knowledge and skills to negotiate a complex task; and allows appropriate opportunities to rehearse, practice, consult resources, and get feedback on and refine performances and products.
Simply put, an authentic assessment is one that requires students to apply what they’ve learned in a new, complex circumstance or situation. Typically, this can take one of two forms: real-world assessments that require students to engage in actual situations in their field, or realistic assessments that are relevant in nature but have students engage in situations that mimic the real world (e.g., a case study).
Regardless of the type, authentic assessment is often coupled with opportunities for rehearsal and/or practice. Authentic assessments are often scaffolded throughout a course and allow the instructor to provide feedback that students can then implement in subsequent drafts. Unlike traditional assessments (such as essays and multiple-choice exams), authentic assessments ask students to engage in scenarios or practices that are complex, realistic, and sometimes messy.
Benefits of Authentic Assessment
Shank (2009) identifies a few key challenges of assessments in the online environment: expecting a bell curve, using the wrong type of assessment (performance assessments vs. test assessments), not creating valid (enough) assessments, and using poorly written multiple-choice tests. Although authentic assessment is unlikely to overcome all of these challenges, it offers a number of benefits in an online course.
Notably, authentic assessment breaks the traditional paradigm of multiple-choice or automatically scored tests and quizzes, which can lead students to believe that learning means staying up all night and cramming to memorize terms or expected answers. Instead, authentic assessments tend to be more student centered, as they ask students to demonstrate their learning through hands-on activities. Rather than asking students to memorize and recall facts, authentic assessments ask students to actively participate in situations that require them to apply the principles they’ve learned about in the instructional material. Thus, learning isn’t about recalling; it’s about performing, which, ideally, will motivate students to engage in the course and succeed in their endeavors.
Define Relevant Tasks
After identifying the learning objective(s) you’re looking to measure through your authentic assessment, you can then start defining what students will actually do. Given that the assessment should be, well, authentic, start by looking at what professionals in your field do on a daily basis and how those tasks might relate to your selected learning objective. Although your task doesn’t ultimately have to relate to your field, it should require students to apply themselves in a relevant and new situation. Ultimately, the relevance of your assessment to students’ lives and/or goals should be clear. In fact, it’s helpful to state the relevance explicitly at the outset of the assessment.

If you’re struggling to identify a relevant task, consider starting with the verb of your learning objective. Oftentimes, you’ll find that you’re able to define your task by looking at what the objective asks students to do. For instance, if the objective for a business course is that students will be able to analyze the local and global impact of organizational decisions, consider creating a fictional scenario in which students have to make organizational decisions for a business and analyze the impact of those decisions. Looking at your objective’s verb also ensures that your assessment aligns with your learning objective, which is a pillar of effective course design.
Identify Essential Performance Criteria
If the previous step was to define what students will do to complete your authentic assessment, this step focuses on how you’ll know whether they’ve done it well. After all, just because an authentic assessment doesn’t look like a traditional assessment doesn’t mean that the goal isn’t the same. You still need to have an indication of how well students have performed and whether they’ve achieved mastery.
With this in mind, it’s important for these performance criteria to align with the nature of your task. To return to our business example from earlier, you’d want to make sure that the way you measure students’ performance is reflective of or similar to the expectations they would encounter in a business scenario. For example, you’d want to create performance criteria specific to how students should make the organizational decisions and how accurately and/or appropriately they analyze those decisions. Although students shouldn’t be held to the same standards as professionals in the field (they’re novices, after all), it’s still possible to measure student success in a new and relevant way.
Develop a Rubric
Rubrics are a powerful tool for many assessment types, and they’re an essential component of authentic assessment. After all, authentic assessments are fairly subjective, and rubrics help ensure instructors are grading fairly and consistently from assessment to assessment and student to student. With this in mind, once you’ve identified the task and essential performance criteria (that is, what students will do and what benchmarks exist to make sure students do it well), the next step is to develop a rubric.
You might be thinking that this seems pretty similar to that last step. Well, it is! When designing your rubric, you should use the performance criteria you’ve identified and come up with measurable levels for each. Once you’ve developed your rubric, consider presenting it to students before they begin the assessment. That way, they know what you expect of them and can more readily gauge their own performance.
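A rubric with measurable levels for each performance criterion can be sketched as a simple data structure. The criteria, level descriptors, and point values below are invented for illustration (loosely following the business-course scenario discussed earlier); they are not a prescribed rubric.

```python
# Hypothetical rubric: each criterion maps score levels to short descriptors,
# so both instructor and students can see what each level requires.
rubric_levels = {
    "decision quality": {
        3: "Decisions are well reasoned and grounded in course concepts",
        2: "Decisions are plausible but only loosely tied to course concepts",
        1: "Decisions are unsupported or inconsistent with the scenario",
    },
    "impact analysis": {
        3: "Analyzes both local and global impact with specific evidence",
        2: "Analyzes only one level of impact, or evidence is thin",
        1: "Impact is asserted rather than analyzed",
    },
}

def grade(selected_levels, rubric_levels):
    """Total the level chosen for each criterion; the top level earns full marks."""
    return sum(selected_levels[c] for c in rubric_levels)

print(grade({"decision quality": 3, "impact analysis": 2}, rubric_levels))  # prints: 5
```

Sharing a structure like this with students before they begin is exactly the transparency described above: they can gauge their own performance against the same level descriptors the instructor will grade with.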