LAK 2026 Workshop

LLM Psychometrics

Applying Assessment Theories to Evaluate LLMs and LLM-Supported Learners

Bergen, Norway · April 27, 2026

Large Language Models (LLMs) are increasingly used for tasks traditionally performed by human learners, such as reading, writing, problem solving, and programming. While current evaluations rely heavily on benchmarks, mature frameworks from educational measurement – such as Item Response Theory (IRT), cognitive diagnostic models (e.g., DINA), and learning taxonomies – offer principled approaches for understanding their capabilities and limitations. This workshop will explore how these theories can inform the evaluation of LLMs and human-AI collaboration, highlight divergences and alignments with human learning processes, and address concerns around responsible AI use in education.
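As a hypothetical illustration of what theory-grounded evaluation can look like, the sketch below treats each LLM as an examinee and each benchmark item as a test question, then fits a Rasch (1PL) IRT model to the binary response matrix by joint maximum likelihood. Every name, size, and value here is a synthetic assumption for illustration, not part of the workshop materials.

```python
# Minimal IRT sketch: LLMs as "examinees", benchmark items as questions.
# The Rasch model assumes P(model i answers item j correctly) =
# sigmoid(theta_i - b_j), with ability theta and difficulty b.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_models, n_items = 8, 40               # hypothetical: 8 LLMs, 40 items
theta_true = rng.normal(size=n_models)  # latent ability per model
b_true = rng.normal(size=n_items)       # latent difficulty per item
p = 1 / (1 + np.exp(-(theta_true[:, None] - b_true[None, :])))
X = rng.binomial(1, p)                  # simulated correct/incorrect matrix

def neg_log_lik(params):
    theta, b = params[:n_models], params[n_models:]
    logits = theta[:, None] - b[None, :]
    # Bernoulli log-likelihood; logaddexp(0, z) = log(1 + e^z)
    return -(X * logits - np.logaddexp(0.0, logits)).sum()

res = minimize(neg_log_lik, np.zeros(n_models + n_items), method="L-BFGS-B")
theta_hat = res.x[:n_models]
theta_hat -= theta_hat.mean()           # fix the location indeterminacy
print("estimated model abilities:", np.round(theta_hat, 2))
```

Because abilities and difficulties land on a common logit scale, such estimates remain comparable across overlapping but non-identical item sets, which a raw accuracy leaderboard cannot guarantee; a cognitive-diagnostic variant is sketched after the topic list below.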

Our workshop goals are to (1) advance theory-grounded evaluation of LLMs and LLM-supported learners, (2) connect researchers across learning analytics, educational measurement, and AI, and (3) surface actionable patterns in LLM performance and translate them into practices for responsible assessment and human–AI collaboration.

Call for Papers

We accept two types of contributions: (1) short empirical work-in-progress papers, and (2) short discussion papers that provoke debate on key issues and challenges in assessing LLMs and LLM-supported learners.

Topics of interest include:

  • Application of educational measurement theories and methods to understanding AI and human-AI learning and performance (see the illustrative sketch after this list)
  • Systematic comparisons of human vs AI learner performance
  • Systematic comparisons of human-human vs human-AI team performance
  • Adaptation of conceptual frameworks for designing tasks involving AI and human-AI learners
  • Methods for detecting AI participation in solving complex tasks, including detection of GenAI-enabled cheating, misuse, and implications for academic integrity
  • Methods for studying AI knowledge representation and its impact on learning and performance
  • Error analyses of AI performance on complex tasks that provide insights into AI knowledge, skills, and learning
  • Ethical considerations and limitations of using human-based psychometric models to study LLMs
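For the first topic above, here is a minimal, purely illustrative sketch of the cognitive-diagnostic angle: under the DINA model, a Q-matrix records which skills each benchmark item requires, and an LLM's item-level success probabilities follow from a latent skill-mastery profile plus slip and guess parameters. The Q-matrix, profile, and parameter values below are invented for illustration.

```python
# Illustrative DINA sketch: an item is answered correctly with probability
# 1 - slip when the profile masters all skills the item requires, and with
# probability guess otherwise. All values are hypothetical.
import numpy as np

Q = np.array([[1, 0, 0],   # item 1 requires skill 1 only
              [0, 1, 0],   # item 2 requires skill 2 only
              [1, 1, 0],   # item 3 requires skills 1 and 2
              [0, 1, 1],   # item 4 requires skills 2 and 3
              [1, 1, 1]])  # item 5 requires all three skills
slip, guess = 0.1, 0.2     # shared s_j and g_j for brevity

def p_correct(alpha):
    # eta_j = 1 iff the profile alpha masters every skill item j requires
    eta = (Q <= alpha[None, :]).all(axis=1)
    return np.where(eta, 1.0 - slip, guess)

alpha_llm = np.array([1, 1, 0])  # hypothetical LLM: masters skills 1-2 only
print(p_correct(alpha_llm))      # -> [0.9 0.9 0.9 0.2 0.2]
```

In practice the inference runs the other way: observed right/wrong patterns are used to estimate the mastery profile and the slip/guess parameters, yielding a skill-level diagnosis rather than a single accuracy score.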

Important Dates

  1. Submission deadline: Dec 18, 2025
  2. Notification: Jan 15, 2026
  3. Camera-ready: Jan 22, 2026
  4. Workshop day: Apr 27, 2026

* All dates in local conference time.

Submission Guidelines

  • Length: 4–6 pages (not including tables, figures, references, acknowledgements, AI declarations, and ethics statements)
  • Format: CEUR Workshop Proceedings template
  • Review: Double-blind
  • Proceedings: Accepted papers will be presented during the workshop and published in CEUR-WS open workshop proceedings (workshop papers are not included in the Companion Proceedings of LAK2026).

Program Schedule

  • 09:00–09:15
    • Welcome and overview of workshop goals
    • Introduction of organizers and agenda
    • Brief participant introductions and ice-breaker
  • 09:15–10:00
    Keynote: Alina von Davier, Chief of Assessment, Duolingo (talk + Q&A)
  • 10:00–11:00
    Long talks (13 min each)
  • 11:00–11:15
    Coffee break
  • 11:15–12:00
    Short talks (10 min each)
  • 12:00–12:30
    Poster session and discussion, continuing into lunch

Accepted Papers

Long Talks

Short Talks

Assessing the Ability of Large Language Models to Give Learning Recommendations with Knowledge Space Theory

Peter Steiner, Jan Hochweber

From Proprietary to Open-Source: A Comparative Evaluation of LLMs for Automatic Feedback Generation

Elisabetta Mazzullo, Okan Bulut

Exploring Evaluation Methods for Generative AI-Powered Chatbot Output

Magdalen Beiting-Parrish, Jodi Casabianca

Toward Measurement Equivalence: LLM-Powered Translation of Critical Thinking Assessments

Euigyum Kim, Hyo Jeong Shin, Alina A. von Davier, Salah Khalil

Assessment Design in the AI Era: Applying Psychometric Theory to Identify Items on Which Humans and Chatbots Diverge

Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron

Correcting Human Labels for Rater Effects Using Item Response Theory Models

Magdalen Beiting-Parrish, Jodi Casabianca

Psychometric Analysis of LLM Differential Sensitivity to Word Predictability using Explanatory MFRM

Wesley Morris, Langdon Holmes, Scott Crossley

From Leaderboards to Decisions: A Psychometric Perspective on Criterion-Referenced Evaluation of Large Language Models

Chengyuan Yao, Zhen Xu, Jiayu Zheng, Renzhe Yu
