Taking a closer look at assessment programs. What does genAI do to the validity of an assessment program?
From: Esther van Dijk, Steven Raaijmakers, Laura Koenders and Frans Prins. Translated using ChatGPT (OpenAI, 2025, GPT-4o)
The rise of generative AI (genAI) raises fundamental questions about the value of assessment results in higher education. When students use genAI tools, such as ChatGPT, during assessments without instructors knowing what exactly they are doing, what does their performance still say about their own knowledge and skills? In other words, genAI in some cases reduces the validity of our assessments. This potentially undermines the value of the diploma.
Directors of Education are responsible for the quality of assessment in their programmes (Handreiking Kwaliteitszorg Examinering) and therefore play a crucial role in analysing vulnerabilities introduced by genAI technologies. The step-by-step plan we present here is a tool for Directors of Education to gain insight into the validity of the assessment program now that genAI is widely accessible. This is a diagnostic step, intended to determine what adjustments to assessment may be necessary.
Decisions about awarding a degree are based on information from multiple assessments conducted throughout a study programme. The combination of assessment types and forms used over the years is called the assessment program. This combination is deliberately designed to match the goals, content, structure and coherence of the curriculum (Van Schilt-Mol & Joosten-ten Brinke, 2023). The impact of genAI varies greatly between assessment types and functions, which calls for differentiated adjustments. A programme-wide perspective, that is, insight into the assessment program, is therefore a necessary step for programmes to make informed choices about potential adaptations.
This instrument guides you step-by-step through an analysis to systematically map out the assessment program of a degree programme and to assess the influence of genAI on its validity. The analysis helps programmes to:
make the validity of an assessment program as a whole visible, and understand the influence of genAI on it;
interpret the influence of genAI from an educational vision; and
formulate targeted actions to adapt assessment programs, including genAI where wanted and necessary.
Basic principles
This instrument is based on the following assumptions:
- Students are using genAI tools on a large scale. According to recent surveys, approximately 65%-83% of students are currently using genAI tools (Deschenes and McMahon, 2024; Chung et al., 2024).
- GenAI performance is now advanced enough to successfully complete a wide range of exams (see e.g. Ghosh & Bir, 2023; Kumah-Crystal et al., 2023).
- As the technology evolves, genAI use is becoming increasingly difficult to detect from the content of student work (Fleckenstein et al., 2024). There are no automatic AI-content detection tools that reliably distinguish whether a text was written with or without genAI (Elkhatat et al., 2023).
Purpose
This step-by-step plan is not a normative framework but a diagnostic instrument that helps educational programmes gain insight into the impact of genAI on their summative assessments. It identifies which end terms are vulnerable because genAI has affected the validity of the assessments used to evaluate them. The insights generated through this analysis support both quality assurance and accreditation preparation, and may lead to curriculum development where needed.
The ultimate goal is to design an assessment program that, despite the influence of genAI, still provides sufficient insight into the student's knowledge and skills. Thus, this analysis is not an endpoint but a starting point for further discussion about the educational vision on genAI and its implications for end terms, assessment, and teaching and learning activities within the programme.
Step 1: Delimitation
The aim of this step is to determine which parts of the programme you will analyse, what exactly you want to analyse, and how you will approach it. A clear scope prevents the analysis from becoming too large, too vague, or unmanageable. It also ensures that decisions are made explicitly and transparently, something that is important in the context of accreditation or educational development.
Scope: which courses will be included in the analysis?
This step-by-step plan focuses on the analysis of assessments with a summative function, and therefore concentrates on educational units that collectively lead to a degree, such as bachelor's, master's, or executive programmes. The minimum scope for this analysis includes all mandatory courses for all students. If the programme includes tracks or specialisations with additional required courses, it is recommended to conduct a separate analysis for each track. This will provide clearer insights into track-specific vulnerabilities and safeguard quality within each track.
There may also be reasons to include certain elective courses in the analysis, in addition to the required courses. For example, when students must complete a fixed number of electives from a predefined list to earn their degree. In such cases, vulnerabilities in those electives may be just as critical for the attainment of final qualifications as those in mandatory courses.
Objective: gaining insight into genAI's impact on the validity of the assessment program
GenAI affects two aspects of the validity of assessment programmes (van Berkel et al., 2023): 1) the extent to which the assessment results provide information on student performance in relation to the attainment of all intended learning outcomes (coverage), and 2) the degree to which the combination of assessment formats is suitable for determining whether the intended end term has been achieved (fitness of form).
By collecting information that gives insight into these aspects, you can determine for each intended learning outcome whether you can still make well-founded claims about student mastery.
Possible extensions of the analysis
This analysis may also serve as a foundation for broader insights into the overall quality of the assessment program. For example, it may raise questions about progression in complexity, the balance between individual and group assignments, or the interaction between formative and summative assessments. To make broader claims (beyond the influence of genAI) about the quality of the assessment program, we recommend using the Kwaliteitsinstrument Toetsprogramma. To answer all questions from this instrument, more extensive information collection is required.
How to approach this?
This analysis can be carried out by the Director of Education or delegated to an individual or project team. We recommend clearly defining the intended output, the stakeholders involved, the activities to be undertaken, and the timeline in advance.
- End Product: Will the result be an analysis only, or will it include recommendations? In what form? Report, presentation, or another format?
- Stakeholders: Who will be involved and at what stage? For example, during the setup of the analysis, data collection, or interpretation. It is important to align with existing quality assurance processes, roles in the program, and institutional practices.
- Activities: Consider whether it is better to request and compile the data yourself, or to collect it collaboratively with responsible stakeholders. The latter approach has the benefit of creating shared understanding and ownership from the start.
Step 2: Data Collection
After determining the scope, objectives, and approach of the analysis, relevant data about the educational programme and its assessments must be collected and structured. To structure this data, we recommend using Excel or a similar tool and following the steps below to create an overview. Figure 1 illustrates what such an overview could look like.
In this analysis, we opted for a direct link between end terms and assessments. Linking via intended learning objectives (assessment → learning objective → end term) is possible but not necessary to draw conclusions about the validity of the overall assessment program in the context of genAI.
One important consideration in data collection is the estimation of the 'genAI vulnerability' of assessments, as well as the match between each end term and its associated assessments. This information will in most cases be provided by the course coordinator (see also "how to approach this"). Making these estimations requires both strong educational knowledge of assessment design (see Box 1) and understanding of the (im)possibilities of genAI. If this knowledge is lacking, the estimation will be less reliable. This can result in either an overly optimistic or overly pessimistic view of the quality of the assessment program.
To analyse the influence of genAI on different aspects of validity, the following information is required:
- the end terms;
- the course names and/or course codes;
- the summative assessments per course;
- which end terms are assessed by each assessment; and
- an interpretation made by the course coordinator: how much insight is there into the student's own contribution to the assessment, as opposed to input from genAI. We recommend coding this aspect. An example coding could be:
- Clear insight into the student's contribution: The instructor has full insight into the student's own input during the assessment. This means genAI use is either not possible or entirely transparent. Examples include proctored exams and in-person oral exams.
- Limited insight into the student's contribution: The instructor has partial insight into the student's input and some visibility into the use of genAI. The assessment process is designed in such a way that the instructor can confidently judge the student's abilities, for instance when there is regular contact throughout the process. Examples include essay assignments where close interaction occurs between student and instructor, or tasks followed by a live presentation with follow-up questions.
- No insight into the student's contribution: The instructor has no visibility into the student's own input or the use of genAI. Examples include take-home assignments where the instructor lacks insight into how the product was created.
Box 1. Example of matching assessments to end terms. An assessment task often provides information on the mastery of multiple end terms, to varying degrees. An essay as assessment, for example, can evaluate both 'conceptual understanding of content' and 'writing skills', thereby contributing to judgments about the level of multiple end terms. What is actually assessed depends on the exact assignment instructions, and can also be inferred from the assessment instrument. In an analytical rubric, for example, it is clearly indicated which components count and to what extent. For the interpretation step, it is therefore important to carefully examine both the assignment and the assessment instrument, in order to make deliberate decisions about which end term the assessment provides information for.
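To make the overview described above concrete, the sketch below shows one possible way to record it as a flat table, using Python and pandas rather than Excel. It is only an illustration: the end terms, course codes and assessment names are invented placeholders, and the insight codes follow the example coding given above.

```python
# A minimal sketch (not part of the original instrument) of how the Step 2
# overview could be structured as a flat table: one row links one summative
# assessment to one end term, with the agreed insight code.
# All end terms, course codes and assessment names are invented placeholders.
import pandas as pd

INSIGHT_CODES = ["clear", "limited", "none"]  # insight into the student's own contribution

overview = pd.DataFrame(
    [
        # end term,       course code, summative assessment,    insight code
        ("ET1 knowledge", "XX-B101",   "proctored written exam", "clear"),
        ("ET2 writing",   "XX-B102",   "supervised essay",       "limited"),
        ("ET2 writing",   "XX-B203",   "take-home assignment",   "none"),
        ("ET3 research",  "XX-B203",   "take-home assignment",   "none"),
    ],
    columns=["end_term", "course_code", "assessment", "insight"],
)

# Basic sanity check: every row should use one of the agreed insight codes.
assert overview["insight"].isin(INSIGHT_CODES).all()
print(overview)
```

The same structure can of course be kept in Excel; the point is simply that each row links one assessment to one end term and carries one insight code, so the data can later be summarised per end term.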
How to approach this?
There are two main approaches to data collection:
- Centralised Approach: A project lead or group requests all necessary data and compiles it into an overview. This means each course coordinator individually assesses the extent to which students can use genAI during assessments (and how disruptive that is to the summative decision). Given the complexity of these judgments, there is a significant risk of inconsistency, especially when coordinators lack sufficient knowledge of genAI or its implications for assessment validity.
- Collaborative Approach: Course coordinators come together in a facilitated session to fill out the overview collectively, making shared judgments about the vulnerability of assessments. An example of this method is described in the article by Jongkind et al. (2025). This approach can be supported by an educational specialist if needed.
Step 3: Analysis, Interpretation, and Reporting
Through the analysis, you can gain better insight into potential risks concerning the coverage and fitness of form of the assessment program. The table below describes how to analyse the data and what the results may imply for the validity of the assessment program.
Before starting the analysis, it is useful to reflect on which intended learning outcomes are thematically related and how the different learning outcomes are prioritised within the programme. This will help guide the interpretation of individual outcomes and the overall results.
The analysis will produce signals indicating which clusters of intended learning outcomes, and which individual outcomes, require closer scrutiny. We recommend returning from the numerical results to the actual assessment tasks in order to determine whether the validity of the assessment program is indeed at stake.
| Quality criterion | Data analysis | Interpretation |
| --- | --- | --- |
| Coverage | If an end term is assessed only or mainly through assessments in which the instructor has little or no insight into the student's learning, this may pose a coverage risk. Deeper analysis can provide more insight, for example: when are most low-visibility assessments scheduled? Are there patterns in specific years or learning trajectories? | The key question here is: when is coverage insufficient? If an end term is not assessed at all, no valid conclusions can be drawn about student performance on that outcome; this is clearly insufficient. In some cases, one well-designed assessment may be enough to cover an end term; in others, multiple assessments may be necessary. Additionally, genAI may invalidate certain assessments or formats, leaving too few or no valid assessments for a specific end term. In such cases, the end term is no longer sufficiently covered. |
| Fitness of form | Check whether the assessment format and context align with the content and level of the end term. Examples include end terms that require evaluating genAI usage, or skills that cannot be reliably assessed in uncontrolled settings, such as collaboration or self-regulation. | Note: coverage and fitness of form are interrelated. If an end term is assessed using an inappropriate format, this also undermines its coverage and thus the overall validity. |
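As an illustration of how such signals could be derived from the overview built in Step 2, the sketch below counts assessments per end term by insight code and flags end terms that are not assessed at all or are assessed only through low-visibility assessments. The placeholder data, column names and signal labels are assumptions made for this example, not part of the instrument.

```python
# A minimal sketch of deriving coverage signals from the Step 2 overview.
# The invented placeholder data is repeated here so the sketch runs on its own.
import pandas as pd

overview = pd.DataFrame(
    [
        ("ET1 knowledge", "XX-B101", "proctored written exam", "clear"),
        ("ET2 writing",   "XX-B102", "supervised essay",       "limited"),
        ("ET2 writing",   "XX-B203", "take-home assignment",   "none"),
        ("ET3 research",  "XX-B203", "take-home assignment",   "none"),
    ],
    columns=["end_term", "course_code", "assessment", "insight"],
)
# All end terms of the programme, including any that no assessment maps to.
all_end_terms = ["ET1 knowledge", "ET2 writing", "ET3 research", "ET4 collaboration"]

# Count assessments per end term, split by insight code.
counts = (
    overview.pivot_table(index="end_term", columns="insight",
                         values="assessment", aggfunc="count", fill_value=0)
    .reindex(all_end_terms, fill_value=0)
)
counts["total"] = counts[[c for c in ["clear", "limited", "none"] if c in counts]].sum(axis=1)

# Signals for closer inspection, not verdicts: return to the actual
# assessment tasks before concluding that validity is at stake.
counts["signal"] = "ok"
counts.loc[counts["total"] == 0, "signal"] = "not assessed"
only_low_visibility = (counts["total"] > 0) & (counts["total"] == counts.get("none", 0))
counts.loc[only_low_visibility, "signal"] = "only low-visibility assessments"
print(counts[["total", "signal"]])
```

In this hypothetical data, one end term is not assessed at all and one is assessed only through take-home assignments, so both would be flagged for a closer look at the underlying assessment tasks.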
Reporting
Reporting the results of the analysis and their interpretation is a key step in justifying policies (e.g. for accreditation or examination boards), ensuring continuity within the programme during staff transitions, and engaging teaching staff in follow-up actions and improvement initiatives. The final report can take several forms, but it should at a minimum address the core question: "What risks does genAI pose to the validity of the assessment program, in terms of coverage and fitness of form?"
There are multiple ways to structure this report, for instance by clustering end terms based on thematic content, or by grouping end terms according to the level of identified risk.
How to approach this?
- Many programmes already have internal expertise for analysing this type of data. We recommend discussing the findings together with involved staff to incorporate important contextual insights into the interpretation. An educational consultant can support the interpretation process or help critically review the conclusions.
- When drafting the final report, consider how to involve others in the process of understanding the analysis, its interpretation, and the key conclusions.
Step 4: From Diagnosis to Reform
If the analysis reveals risks to the validity of the assessment program, various interventions can be considered. In all cases, it is essential to maintain constructive alignment (Biggs, 1996), ensuring that adjustments to the end terms go hand in hand with adjustments to assessment. Moreover, further development of a shared vision on the role of genAI within the discipline (and thus the degree programme) is required.
Vision development involves addressing questions such as: What knowledge and skills should graduates possess, and how can the curriculum ensure these are taught and assessed? In what ways will graduates use genAI in further study or in the professional field, and how can the programme prepare them for this? How does assessment throughout the programme support different functions, such as qualification and the support of learning? The answers to these and related questions will shape decisions about which programme adjustments are needed.
Adjustments to the Programme
Learning Outcomes
Instead of requiring students to achieve a learning outcome without using genAI, its use could be explicitly allowed or even included as a new or extended learning outcome. Within the Faculty of Social and Behavioural Sciences, several scenarios have been developed that specify varying levels of genAI use. In other faculties, this is referred to as the AI Index, which defines these levels and can be a helpful tool. This clarity helps both students and instructors understand expectations and which skills are being assessed. It is important to note that enforcing restrictions on genAI use in contexts where instructors lack visibility into student activity is still not feasible at the time of writing.
Assessment
Several strategies exist for adjusting assessment practices. Corbin et al. (2025) distinguish between a discursive approach, which focuses on providing students with clear guidance about desirable and undesirable genAI use during assessments, and a structural approach, which refers to revising the assessment program as a whole. Depending on the extent of validity risks, programmes can adopt a strategy that suits their context. We anticipate that most programmes will require a combination of both approaches.
If a structural revision is chosen, the work of Liu and Bridgeman (2023) offers useful guidance. They argue that validity risks posed by genAI can be mitigated by designing assessments in which instructors have sufficient visibility into the student's contributions. This could include controlled assessment environments (e.g. multiple-choice or open-ended exams without access to genAI), or assessments where instructors closely monitor progress through guided interaction, such as oral exams or projects with continuous supervision.
However, this approach poses two challenges. First, it only improves validity if the assessment format is appropriate for the intended learning outcome; otherwise, it may introduce new validity issues. Second, these forms of assessment are time-intensive and may not be feasible in terms of logistics, staffing, or cost.
Liu and Bridgeman (2023) therefore conclude that it is difficult to replace all assessments with formats that completely eliminate genAI use or allow for full instructor oversight. They recommend making deliberate choices, ensuring that for each end term at least one carefully designed, summative assessment is administered at key moments in the curriculum (ensuring sufficient coverage). Other assessments can then take on a formative function, focused on student learning. This approach is also known as the two-lane approach. See also the Npuls vision document (Beekman, 2025) for more detail.
How to approach this?
- The input of diverse lecturers from the programme is essential in shaping the vision and proposed adjustments. Different lecturers bring different areas of expertise, and genAI may impact these areas in varying ways. Collective ownership also helps ensure the vision is widely supported and effectively implemented in the programme.
- Making curriculum changes is inherently complex. Stakeholders bring different interests, perspectives, and levels of knowledge. At the same time, reform is urgent. Therefore, the process must balance these interests while also developing a clear vision and making necessary decisions. An educational consultant can help guide this process effectively.
Developing AI literacy in education
At Utrecht University, active steps are being taken to increase AI literacy. This is happening in various ways, from shaping a shared vision to the practical integration of generative AI into teaching and assessment programmes. Would you like to contribute to this process, or do you have questions about how AI can be effectively applied in your field? Then please contact Laura Koenders (see contact details below).
Author / point of contact
Publication date: July 2025
References
- Baartman, L. & Prins, F. (2023). Kwaliteit van toetsprogramma's. In: H. van Berkel, A. Bax, D. Joosten-ten Brinke, T. van Schilt-Mol (Eds), Toetsen in het hoger onderwijs (5e editie). Boom. ISBN 9789024456161
- Beekman, K., Draaijer, S., Beckers, J., Schagen, E., & Hofman, I. (2025). Visie op toetsing en examinering in het tijdperk van AI. Utrecht. Npuls.
- Biggs, J. (1996). Enhancing teaching through constructive alignment. Higher Education, 32(3), 347-364.
- Chung, J., Henderson, M., Pepperell, N., Slade, C., & Liang, Y. (2024). Student perspectives on AI in Higher Education: Student Survey. Student Perspectives on AI in Higher Education Project.
- Corbin, T., Dawson, P. & Liu, D. (2025). Talk is cheap: why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education, 1-11.
- Deschenes, A. & McMahon, M. (2024). A Survey on Student Use of Generative AI Chatbots for Academic Research. Evidence Based Library and Information Practice, 19(2), 2–22.
- Elkhatat, A.M., Elsaid, K. & Almeer, S. (2023). Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19, 17.
- Fleckenstein, J., Meyer, J., Jansen, T., Keller, S.D., Köller, O., & Möller, J. (2024). Do teachers spot AI? Evaluating the detectability of AI-generated texts among student essays. Computers and Education: Artificial Intelligence, 6.
- Ghosh, A. & Bir, A. (2023). Evaluating ChatGPT's ability to solve higher-order questions on the competency-based medical education curriculum in medical biochemistry. Cureus, 15(4).
- Jongkind, R., Elings, E., Joukes, E., Broens, T., Leopold, H., Wiesman, F., & Meinema, J. (2025). Is your curriculum GenAI-proof? A method for GenAI impact assessment and a case study (pre-print).
- Kumah-Crystal, Y., Mankowitz, S., Embi, P. & Lehmann, C.U. (2023). ChatGPT and the clinical informatics board examination: the end of unproctored maintenance of certification? Journal of the American Medical Informatics Association, 30(9), 1558–1560.
- Liu, D., & Bridgeman, A. (2023). Embracing the future of assessment at the University of Sydney. innovation.sydney.edu.au/teaching@sydney/embracing-the-future-of-assessment-at-the-university-of-sydney/
- Lodge, J., Howard, S., Bearman, M., & Dawson, P. (2023). Assessment reform for the age of Artificial Intelligence. Tertiary Education Quality and Standards Agency.
- Van Schilt-Mol, T. & Joosten-ten Brinke, D. (2023). Kwaliteit van toetsing geoperationaliseerd. In: H. van Berkel, A. Bax, D. Joosten-ten Brinke, T. van Schilt-Mol (Eds), Toetsen in het hoger onderwijs (5e editie). Boom. ISBN 9789024456161
Toetsprogramma's onder de loep © 2025 by Van Dijk, Raaijmakers, Koenders and Prins is licensed under CC BY-NC 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/