arXiv:2511.17069 (cs)
[Submitted on 21 Nov 2025]
Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments
Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability -- Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
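The scoring step described above — mapping human-interpretable feature values to an ordinal score — can be sketched as a proportional-odds (ordinal logistic) model. This is a minimal illustration, not the paper's implementation: the feature names, weights, and thresholds below are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_score(features, weights, thresholds):
    """Proportional-odds model: P(score <= k) = sigmoid(theta_k - w.x).

    `thresholds` holds K-1 increasing cutpoints for K score levels;
    returns the most probable score and the full probability vector.
    """
    eta = float(np.dot(weights, features))
    cum = sigmoid(np.asarray(thresholds) - eta)    # P(score <= k), k = 0..K-2
    cum = np.append(cum, 1.0)                      # P(score <= K-1) = 1
    probs = np.diff(np.concatenate(([0.0], cum)))  # per-level probabilities
    return int(np.argmax(probs)), probs

# Hypothetical interpretable features extracted by the LLM featurizer:
# [mentions_key_concept, cites_evidence, meets_length_requirement]
features = np.array([1.0, 1.0, 0.0])
weights = np.array([1.5, 2.0, 0.5])  # illustrative learned coefficients
thresholds = np.array([1.0, 3.0])    # cutpoints separating scores {0, 1, 2}
score, probs = ordinal_score(features, weights, thresholds)
```

Because the weights attach directly to named, human-readable features, a rater can trace why a response received a given score — the interpretability property the FGTI principles target.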
Comments: 16 pages, 2 figures
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2511.17069 [cs.CL]
  (or arXiv:2511.17069v1 [cs.CL] for this version)
 
https://doi.org/10.48550/arXiv.2511.17069
Submission history
From: Yunsung Kim [view email]
[v1] Fri, 21 Nov 2025 09:19:05 UTC (183 KB)