Computational Thinking (CT) is widely recognised as a transversal competence essential for learning, problem solving, and knowledge transfer across disciplines. However, its effective integration into school education remains strongly dependent on the availability of assessment instruments that are pedagogically meaningful, psychometrically sound, and applicable across diverse educational contexts. This paper presents COMATH, a cross-national assessment instrument designed to evaluate CT in students aged 9–14. The instrument adopts a phase-based development and validation framework that integrates Bebras-inspired tasks, Item Response Theory, factor-analytic methods, learning analytics, and teacher and student feedback. The assessment was iteratively developed and piloted between 2023 and 2025 in six European countries, with data collected from 6,480 students and 155 teachers. The findings demonstrate that a phased assessment approach enables systematic calibration of task difficulty, robust evaluation of item functioning, and meaningful interpretation of student performance across age groups and national contexts. The results further highlight how well-designed CT assessment can support instructional decision-making rather than serve solely as a summative measure. The study argues for conceptualising CT assessment as a dynamic and iterative process that links measurement, psychometric validation, and pedagogical use in school education.
Computerized Adaptive Testing (CAT) is now widely used. However, adding new items to a CAT item bank requires considerable effort, which makes the broad application of CAT in classroom teaching impractical. One solution is to use the tacit knowledge of teachers or experts to pre-classify new items and then calibrate them while they are administered in tests. This research therefore presents a comparative case study between a Stratified Adaptive Test (SAT), based on the tacit knowledge of a teacher, and a CAT based on Item Response Theory (IRT). The tests were applied in seven Computer Networks courses. The results indicate that students reported more favorable anxiety levels with the SAT than with the CAT, and that the SAT was simpler to implement. It is therefore recommended to implement a SAT whose strata are initially defined by the teacher's tacit knowledge and later refined through IRT calibration.
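To make the idea of a teacher-stratified adaptive test concrete, the following is a minimal sketch, not the implementation used in the study. It assumes one common stratified scheme: items are grouped into difficulty strata by the teacher, the test starts in the middle stratum, and the student moves up one stratum after a correct answer and down one after an incorrect answer. The names `StratifiedItemBank`, `run_sat`, and `answer_fn` are hypothetical, and the step rule itself is an assumption rather than something stated in the abstract.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class StratifiedItemBank:
    # strata[k] holds the item ids the teacher placed at difficulty level k
    # (0 = easiest). This pre-classification stands in for IRT parameters.
    strata: list[list[str]]
    used: set[str] = field(default_factory=set)

    def next_item(self, level: int, rng: random.Random) -> str | None:
        """Pick a random unused item from the given stratum, or None if empty."""
        candidates = [i for i in self.strata[level] if i not in self.used]
        if not candidates:
            return None
        item = rng.choice(candidates)
        self.used.add(item)
        return item


def run_sat(bank, answer_fn, n_items, start_level=None, seed=0):
    """Administer up to n_items; answer_fn(item_id) returns True if correct.

    Assumed rule: go up one stratum after a correct answer, down one after
    an incorrect answer, staying within the available strata.
    """
    rng = random.Random(seed)
    top = len(bank.strata) - 1
    level = top // 2 if start_level is None else start_level
    history = []  # (item_id, stratum, correct) triples
    for _ in range(n_items):
        item = bank.next_item(level, rng)
        if item is None:  # stratum exhausted: stop rather than guess
            break
        correct = bool(answer_fn(item))
        history.append((item, level, correct))
        level = min(top, level + 1) if correct else max(0, level - 1)
    return history
```

The recorded `history` is the kind of response data that could later feed an IRT calibration, at which point the strata could be redrawn from estimated item difficulties instead of the teacher's initial judgment.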