update
65
storage/4CUYM6ZK/.zotero-ft-cache
Normal file
@@ -0,0 +1,65 @@
Computer Science > Computation and Language

arXiv:1907.11692 (cs)

[Submitted on 26 Jul 2019]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1907.11692 [cs.CL] (or arXiv:1907.11692v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.1907.11692

Submission history
From: Myle Ott
[v1] Fri, 26 Jul 2019 17:48:29 UTC (45 KB)
355
storage/4CUYM6ZK/1907.html
Normal file
File diff suppressed because one or more lines are too long
270
storage/4LIWKFFQ/.zotero-ft-cache
Normal file
File diff suppressed because one or more lines are too long
489
storage/5JWXXUR3/.zotero-ft-cache
Normal file
@@ -0,0 +1,489 @@
Artificial Intelligence Review (2022) 55:2495–2527 https://doi.org/10.1007/s10462-021-10068-2
An automated essay scoring systems: a systematic literature review
Dadi Ramesh (1,2) · Suresh Kumar Sanampudi (3)

Published online: 23 September 2021
© The Author(s), under exclusive licence to Springer Nature B.V. 2021

Abstract
Assessment plays a significant role in the education system for judging student performance. The present evaluation system relies on human assessment. As the student-teacher ratio gradually increases, manual evaluation becomes complicated: it is time-consuming, lacks reliability, and has further drawbacks. In this connection, online examination systems have evolved as an alternative to pen-and-paper methods. Present computer-based evaluation systems work only for multiple-choice questions; there is no proper evaluation system for grading essays and short answers. Many researchers have worked on automated essay grading and short answer scoring over the last few decades, but assessing an essay by considering all parameters, such as the relevance of the content to the prompt, development of ideas, cohesion, and coherence, remains a challenge. A few researchers focused on content-based evaluation, while many addressed style-based assessment. This paper provides a systematic literature review of automated essay scoring systems. We studied the Artificial Intelligence and Machine Learning techniques used for automatic essay scoring and analyzed the limitations of the current studies and the research trends. We observed that essay evaluation is not performed based on the relevance and coherence of the content.
Keywords: Assessment · Short answer scoring · Essay grading · Natural language processing · Deep learning

* Dadi Ramesh
dadiramesh44@gmail.com

Suresh Kumar Sanampudi
sureshsanampudi@jntuh.ac.in

1 School of Computer Science and Artificial Intelligence, SR University, Warangal, TS, India
2 Research Scholar, JNTU, Hyderabad, India
3 Department of Information Technology, JNTUH College of Engineering, Nachupally, Kondagattu, Jagtial, TS, India
1 Introduction
Due to the COVID-19 outbreak, an online educational system has become inevitable. In the present scenario, almost all educational institutions, from schools to colleges, have adopted online education. Assessment plays a significant role in measuring the learning ability of the student. Most automated evaluation is available for multiple-choice questions, but assessing short and essay answers remains a challenge. The education system is shifting to online mode, with computer-based exams and automatic evaluation. Automatic evaluation is a crucial application in the education domain that uses natural language processing (NLP) and Machine Learning techniques. The evaluation of essays is impossible with simple programming and simple techniques like pattern matching: for a single question, we get many responses from students, each with a different explanation, so we need to evaluate all the answers with respect to the question.

Automated essay scoring (AES) is a computer-based assessment system that automatically scores or grades student responses by considering appropriate features. AES research started in 1966 with the Project Essay Grader (PEG) (Ajay et al. 1973). PEG evaluates writing characteristics such as grammar, diction, and construction to grade the essay. A modified version of PEG (Shermis et al. 2001) was released, which focuses on grammar checking with a correlation between human evaluators and the system. Foltz et al. (1999) introduced the Intelligent Essay Assessor (IEA), which evaluates content using latent semantic analysis to produce an overall score. E-rater (Powers et al. 2002), IntelliMetric (Rudner et al. 2006), and the Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang 2002) use natural language processing (NLP) techniques that focus on style and content to obtain the score of an essay. The vast majority of essay scoring systems in the 1990s followed traditional approaches like pattern matching and statistical methods. Over the last decade, essay grading systems have started using regression-based and natural language processing techniques. AES systems developed from 2014 onwards, such as Dong et al. (2017), use deep learning techniques that induce syntactic and semantic features, producing better results than earlier systems.

Ohio, Utah, and most US states use AES systems in school education; for example, the Utah Compose tool and the Ohio standardized test (an updated version of PEG) evaluate millions of student responses every year. These systems work for both formative and summative assessments and give feedback to students on the essay. Utah provides basic essay evaluation rubrics (six characteristics of essay writing): development of ideas, organization, style, word choice, sentence fluency, and conventions. Educational Testing Service (ETS) has been conducting significant research on AES for more than a decade and has designed algorithms to evaluate essays in different domains, providing an opportunity for test-takers to improve their writing skills; their current research is on content-based evaluation. The evaluation of essays and short answers should consider the relevance of the content to the prompt, development of ideas, cohesion, coherence, and domain knowledge. Proper assessment of these parameters defines the accuracy of the evaluation system. However, these parameters cannot play an equal role in essay scoring and short answer scoring. Short answer evaluation requires domain knowledge; for example, the meaning of "cell" in physics and biology is different. Essay evaluation requires checking the implementation of ideas with respect to the prompt. The system should also assess the completeness of the responses and provide feedback.
Several studies have examined AES systems, from the earliest to the latest. Blood (2011) provided a literature review of PEG from 1984 to 2010, which covered only generalized aspects of AES systems, such as ethical considerations and system performance; it did not cover implementation, was not a comparative study, and did not discuss the actual challenges of AES systems. Burrows et al. (2015) reviewed AES systems on six dimensions: dataset, NLP techniques, model building, grading models, evaluation, and effectiveness of the model. They did not cover feature extraction techniques and the challenges in feature extraction, covered Machine Learning models only briefly, and did not provide a comparative analysis of AES systems in terms of feature extraction, model building, or the level of relevance, cohesion, and coherence. Ke et al. (2019) provided a state of the art of AES systems but covered very few papers, did not list all challenges, and offered no comparative study of AES models. Hussein et al. (2019) studied two categories of AES systems, four papers using handcrafted features and four using neural network approaches; they discussed few challenges and did not cover feature extraction techniques or the performance of AES models in detail. Klebanov et al. (2020) reviewed 50 years of AES systems and listed and categorized all essential features that need to be extracted from essays, but provided no comparative analysis of the work and did not discuss the challenges.

This paper aims to provide a systematic literature review (SLR) of automated essay grading systems. An SLR is an evidence-based systematic review that summarizes the existing research; it critically evaluates and integrates the findings of all relevant studies and addresses the specific research questions of the research domain. Our research methodology follows the guidelines given by Kitchenham et al. (2009) for conducting the review process, which provide a well-defined approach to identify gaps in current research and suggest further investigation. We present our research method, research questions, and the selection process in Sect. 2; the results of the research questions are discussed in Sect. 3; the synthesis of all the research questions is addressed in Sect. 4; and the conclusion and possible future work are discussed in Sect. 5.
2 Research method
We framed the research questions with PICOC criteria:
Population (P): student essay and answer evaluation systems.
Intervention (I): evaluation techniques, datasets, feature extraction methods.
Comparison (C): comparison of various approaches and results.
Outcomes (O): estimate of the accuracy of AES systems.
Context (C): not applicable.
2.1 Research questions
To collect and provide research evidence from the available studies in the domain of automated essay grading, we framed the following research questions (RQ):

RQ1: What are the datasets available for research on automated essay grading?
The answer to this question provides a list of the available datasets, their domains, and access to the datasets. It also provides the number of essays and corresponding prompts.

RQ2: What are the features extracted for the assessment of essays?

The answer to this question provides insight into the various features extracted so far and the libraries used to extract those features.

RQ3: Which evaluation metrics are available for measuring the accuracy of algorithms?

The answer provides the different evaluation metrics for accurate measurement of each Machine Learning approach and the commonly used measurement techniques.

RQ4: What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?

The answer provides insight into the various Machine Learning techniques used for implementing essay grading systems, such as regression models, classification models, and neural networks, and into the different assessment approaches for automated essay grading.

RQ5: What are the challenges/limitations in the current research?

The answer provides the limitations of existing research approaches regarding cohesion, coherence, completeness, and feedback.
2.2 Search process
We conducted an automated search on well-known computer science repositories, namely ACL, ACM, IEEE Xplore, Springer, and Science Direct, for the SLR. We considered papers published from 2010 to 2020, as much of the work during these years focused on advanced technologies like deep learning and natural language processing for automated essay grading systems. The availability of free datasets like Kaggle (2012) and the Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) by Yannakoudakis et al. (2011) also encouraged research in this domain. Search strings: we used search strings like "Automated essay grading" OR "Automated essay scoring" OR "short answer scoring systems" OR "essay scoring systems" OR "automatic essay evaluation", and searched on metadata.
2.3 Selection criteria
After collecting all relevant documents from the repositories, we prepared selection criteria for the inclusion and exclusion of documents. With inclusion and exclusion criteria, it becomes more feasible for the research to be accurate and specific.

Inclusion criterion 1: Our approach is to work with datasets comprising essays written in English; we excluded essays written in other languages.
Inclusion criterion 2: We included papers implementing an AI approach and excluded traditional methods from the review.
Inclusion criterion 3: As the study is on essay scoring systems, we included only research carried out on text datasets, rather than other datasets such as image or speech.
Exclusion criterion: We removed papers in the form of review papers, survey papers, and state-of-the-art papers.
2.4 Quality assessment
In addition to the inclusion and exclusion criteria, we assessed each paper against quality assessment questions to ensure the article's quality. We included documents that clearly explained the approach used, the result analysis, and the validation. The quality checklist questions were framed based on the guidelines from Kitchenham et al. (2009). Each quality assessment question was graded as either 1 or 0, so the final score of a study ranges from 0 to 3. The cut-off score was 2 points: papers scoring 2 or 3 points were included in the final evaluation. We framed the following quality assessment questions for the final study: Quality Assessment 1: internal validity; Quality Assessment 2: external validity; Quality Assessment 3: bias.
Two reviewers reviewed each paper to select the final list of documents. We used the Quadratic Weighted Kappa score to measure the final agreement between the two reviewers. The resulting average kappa score is 0.6942, a substantial agreement between the reviewers. The result of the evaluation criteria is shown in Table 1. After quality assessment, the final list of papers for review is shown in Table 2. The complete selection process is shown in Fig. 1, and the number of selected papers per year is shown in Fig. 2.
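As a concrete illustration of this agreement measure, the minimal sketch below computes a quadratic weighted kappa between two reviewers' quality scores with scikit-learn; the score arrays are made-up examples, not the review's actual data.

# Minimal sketch: quadratic weighted kappa between two raters (hypothetical data).
from sklearn.metrics import cohen_kappa_score

reviewer_a = [3, 2, 0, 1, 3, 2, 1, 0, 3, 2]  # quality scores from reviewer A
reviewer_b = [3, 1, 0, 1, 2, 2, 1, 1, 3, 2]  # quality scores from reviewer B

# weights="quadratic" penalizes larger disagreements more heavily than adjacent ones.
qwk = cohen_kappa_score(reviewer_a, reviewer_b, weights="quadratic")
print(f"QWK agreement: {qwk:.4f}")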
3 Results
3.1 What are the datasets available for research on automated essay grading?
To work with a problem statement, especially in the Machine Learning and deep learning domains, we require a considerable amount of data to train the models. To answer this question, we listed all the datasets used for training and testing automated essay grading systems.

Table 1 Quality assessment analysis
Number of papers | Quality assessment score
50 | 3
12 | 2
59 | 1
23 | 0

Table 2 Final list of papers
Database | Paper count
ACL | 28
ACM | 5
IEEE Xplore | 19
Springer | 5
Other | 5
Total | 62

The Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) (Yannakoudakis et al. 2011) contains 1244 essays and ten prompts. This corpus evaluates whether a student can write relevant English sentences without grammatical or spelling mistakes; this type of corpus helps to test models built for GRE- and TOEFL-type exams. It gives scores between 1 and 40. Bailey and Meurers (2008) created a dataset (CREE reading comprehension) for language learners and automated short answer scoring systems; the corpus consists of 566 responses from intermediate students. Mohler and Mihalcea (2009) created a dataset for the computer science domain consisting of 630 responses to data structure assignment questions; the scores range from 0 to 5, given by two human raters. Dzikovska et al. (2012) created the Student Response Analysis (SRA) corpus. It consists of two sub-groups: the BEETLE corpus, with 56 questions and approximately 3000 student responses in the electrical and electronics domain, and the SCIENTSBANK (SemEval-2013) corpus (Dzikovska et al. 2013a, b), with 10,000 responses to 197 prompts in various science domains. The student responses are labeled "correct, partially correct incomplete, contradictory, irrelevant, non-domain". In the Kaggle (2012) competition, a total of three corpora covering essays and short answers were released under the Automated Student Assessment Prize (ASAP) (https://www.kaggle.com/c/asap-sas/).
Fig. 1 Selection process
Fig. 2 Year-wise publications

The ASAP-AES corpus has nearly 17,450 essays and provides up to 3000 essays per prompt. It has eight prompts that test US students from grades 7 to 10, with score ranges of [0-3] and [0-60] depending on the prompt. The limitations of these corpora are: (1) different prompts have different score ranges; (2) evaluation relies on statistical features such as named entity extraction and lexical word features. ASAP++ is one more dataset from Kaggle, with six prompts, each with more than 1000 responses, for a total of 10,696 responses from 8th-grade students. Another corpus (ASAP-SAS) contains ten prompts from the science and English domains and a total of 17,207 responses; two human graders evaluated all these responses. Correnti et al. (2013) created the Response-to-Text Assessment (RTA) dataset, used to check student writing skills in all directions, like style, mechanics, and organization; students in grades 4-8 give the responses to RTA. Basu et al. (2013) created a power grading dataset with 700 responses to ten different prompts from US immigration exams; it contains all short answers for assessment. The TOEFL11 corpus (Blanchard et al. 2013) contains 1100 essays evenly distributed over eight prompts; it is used to test the English language skills of candidates taking the TOEFL exam and scores the language proficiency of a candidate as low, medium, or high. The International Corpus of Learner English (ICLE) (Granger et al. 2009) is a corpus of 3663 essays covering different dimensions: 12 prompts with 1003 essays test the organizational skill of essay writing, and 13 prompts, each with 830 essays, examine thesis clarity and prompt adherence. Argument Annotated Essays (AAE) (Stab and Gurevych 2014) is a corpus of 102 essays with 101 prompts taken from the essayforum site; it tests the persuasive nature of student essays. The SCIENTSBANK corpus used by Sakaguchi et al. (2015), available on GitHub, contains 9804 answers to 197 questions in 15 science domains. Table 3 illustrates all datasets related to AES systems.

Table 3 Datasets used in automatic scoring systems
Data set | Language | Total responses | Number of prompts
Cambridge Learner Corpus-First Certificate in English exam (CLC-FCE) | English | 1244 | 10
CREE | English | 566 | -
CS | English | 630 | -
SRA | English | 3000 | 56
SCIENTSBANK (SemEval-2013) | English | 10,000 | 197
ASAP-AES | English | 17,450 | 8
ASAP-SAS | English | 17,207 | 10
ASAP++ | English | 10,696 | 6
Power grading | English | 700 | 10
TOEFL11 | English | 1100 | 8
International Corpus of Learner English (ICLE) | English | 3663 | -
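To make the dataset descriptions concrete, here is a minimal sketch of loading the Kaggle ASAP-AES training data with pandas and inspecting essays and score ranges per prompt. The file name training_set_rel3.tsv and the column names (essay_set, domain1_score) are assumptions based on the public Kaggle release; adjust them if your copy differs.

# Minimal sketch: inspect the ASAP-AES corpus (assumed Kaggle file layout).
import pandas as pd

# Assumed file and column names from the Kaggle ASAP release.
df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

print(df["essay_set"].value_counts())  # essays per prompt (eight prompts)
print(df.groupby("essay_set")["domain1_score"].agg(["min", "max"]))  # score range per prompt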
3.2 RQ2 What are the features extracted for the assessment of essays?
Features play a major role in neural network and other supervised Machine Learning approaches. Automatic essay grading systems score student essays based on different types of features, which play a prominent role in training the models. Based on their syntax and semantics, the features are categorized into three groups: (1) statistical features (Contreras et al. 2018; Kumar et al. 2019; Mathias and Bhattacharyya 2018a, b), (2) style-based (syntax) features (Cummins et al. 2016; Darwish and Mohamed 2020; Ke et al. 2019), and (3) content-based features (Dong et al. 2017). A good set of features and appropriate models together evolve into better AES systems. The vast majority of researchers use regression models when the features are statistical; for neural network models, researchers use both style-based and content-based features. Table 4 lists the various features used in existing AES systems.

Table 4 Types of features
Statistical features | Style-based features | Content-based features
Essay length with respect to the number of words | Sentence structure | Cohesion between sentences in a document
Essay length with respect to sentences | POS | Overlapping (prompt)
Average sentence length | Punctuation | Relevance of information
Average word length | Grammatical | Semantic role of words
N-gram | Logical operators | Correctness
Vocabulary | Consistency | Sentences expressing key concepts

Fig. 3 Usages of tools

We also studied the feature-extracting NLP libraries used in the papers, shown in Fig. 3. NLTK is an NLP tool used to retrieve statistical features like POS tags, word count, sentence count, etc.; with NLTK alone, we can miss the essay's semantic features. To find semantic features, Word2Vec (Mikolov et al. 2013) and GloVe (Pennington et al. 2014) are the most used libraries for retrieving the semantics of the essay text. In some systems, the model is trained directly on word embeddings to find the score. As observed from Fig. 4, non-content-based feature extraction is more common than content-based extraction.
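As an illustration of these two feature families, the sketch below derives a few NLTK-based statistical features and a gensim Word2Vec embedding for a toy essay; the corpus, vector size, and training settings are made up for the example and are not from any cited system.

# Minimal sketch: statistical features via NLTK, semantic features via gensim Word2Vec.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk
from gensim.models import Word2Vec

essay = "Computers help students learn. Students write essays on computers."
sentences = nltk.sent_tokenize(essay)
tokens = [nltk.word_tokenize(s) for s in sentences]

# Statistical (style) features: simple counts and POS tags.
word_count = sum(len(t) for t in tokens)
sentence_count = len(sentences)
pos_tags = nltk.pos_tag([w for t in tokens for w in t])

# Semantic features: a tiny Word2Vec model trained on the toy corpus (illustrative only).
w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, epochs=20)
vector_for_computers = w2v.wv["computers"]  # 50-dimensional word embedding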
3.3 RQ3 Which evaluation metrics are available for measuring the accuracy of algorithms?
The majority of AES systems use three evaluation metrics: (1) quadratic weighted kappa (QWK), (2) mean absolute error (MAE), and (3) Pearson correlation coefficient (PCC) (Shehab et al. 2016). Quadratic weighted kappa measures the agreement between the human evaluation score and the system evaluation score and produces a value ranging from 0 to 1. Mean absolute error is the average absolute difference between the human-rated score and the system-generated score. Mean square error (MSE) measures the average of the squares of the errors, i.e., the average squared difference between the human-rated and system-generated scores; MSE is always non-negative. Pearson's correlation coefficient (PCC) measures the correlation between the two sets of scores and ranges from -1 to 1: 0 means the human-rated and system scores are unrelated, 1 means the two scores increase together, and -1 indicates a negative relationship between the two scores.
3.4 RQ4 What are the Machine Learning techniques used for automatic essay grading, and how are they implemented?
After scrutinizing all documents, we categorize the techniques used in automated essay grading systems into four groups: (1) regression techniques, (2) classification models, (3) neural networks, and (4) ontology-based approaches.
Fig. 4 Number of papers on content-based features
All the existing AES systems developed in the last ten years employ supervised learning techniques. Researchers using supervised methods view the AES task as either regression or classification: the goal of the regression task is to predict the score of an essay, while the classification task is to classify essays as, for example, low, medium, or highly relevant to the question's topic. In the last three years, most AES systems have been built around neural networks.
3.4.1 Regression-based models
Mohler and Mihalcea (2009) proposed text-to-text semantic similarity to assign a score to student essays, using two kinds of text similarity measures: knowledge-based measures and corpus-based measures. They evaluated eight knowledge-based similarity measures. Shortest path similarity is determined by the length of the shortest path between two concepts; Leacock & Chodorow find the similarity based on the length of the shortest path between two concepts using node counting; Lesk similarity finds the overlap between the corresponding definitions; and the Wu & Palmer algorithm finds similarity based on the depth of the two given concepts in the WordNet taxonomy. Resnik, Lin, Jiang & Conrath, and Hirst & St-Onge find similarity based on different parameters, like the concept, probability, normalization factor, and lexical chains. Among the corpus-based measures were LSA BNC, LSA Wikipedia, and ESA Wikipedia; latent semantic analysis trained on Wikipedia has excellent domain knowledge, and among all similarity scores, LSA Wikipedia's scoring accuracy (correlation) is the highest. However, these similarity measure algorithms do not use NLP concepts; these pre-2010 models are basic concept models, and research on automated essay grading continued with updated algorithms, neural networks, and content-based features.

Adamson et al. (2014) proposed an automatic essay grading system using a statistical approach. They retrieved features like POS, character count, word count, sentence count, misspelled words, and n-gram representations of words to prepare an essay vector, formed a matrix with all these vectors, and applied LSA to score each essay. It is a statistical approach that does not consider the semantics of the essay; the agreement between the human rater score and the system was 0.532.

Cummins et al. (2016) proposed a Timed Aggregate Perceptron vector model to rank all the essays, later converting the ranking algorithm to predict the score of each essay. The model was trained with features like word unigrams and bigrams, POS, essay length, grammatical relations, maximum word length, and sentence length. It is multi-task learning: it ranks the essays and predicts the score for each essay. The performance evaluated through QWK is 0.69, a substantial agreement between the human rater and the system.

Sultan et al. (2016) proposed a Ridge regression model for short answer scoring with question demoting. Question demoting is a concept included in the final assessment to eliminate words duplicated from the question in the response. The extracted features are text similarity (the similarity between the student response and the reference answer), question demoting (the number of question-word repeats in a student response), term weights assigned with inverse document frequency, and the sentence length ratio (the number of words in the student response). With these features, the Ridge regression model achieved an accuracy of 0.887.

Contreras et al. (2018) proposed an ontology-based text mining model that scores essays in phases. In phase I, they generated ontologies with OntoGen and used SVM to find the concepts and similarity in the essay. In phase II, they retrieved from the ontologies features like essay length, word counts, correctness, vocabulary, types of words used, and domain information. After retrieving the statistical data, they used a linear regression model to find the score of the essay; the accuracy score averages 0.5.

Darwish and Mohamed (2020) proposed the fusion of a fuzzy ontology with LSA. They retrieve two types of features: syntax features and semantic features. For the syntax features, they perform lexical analysis on the tokens and construct a parse tree; if the parse tree is broken, the essay is inconsistent, and a separate grade is assigned to the essay with respect to syntax features. The semantic features include similarity analysis (to find duplicate sentences) and spatial data analysis (finding the Euclidean distance between the center and the parts). Later, they combine the syntax-feature and morphological-feature scores for the final score. The accuracy achieved with the multiple linear regression model, mostly on statistical features, is 0.77.

Süzen Neslihan et al. (2020) proposed a text mining approach for short answer grading. First, they compare the model answer with the student response by calculating the distance between the two sentences; through this comparison, they find the completeness of the answer and provide feedback. In this approach, the model vocabulary plays a vital role in grading: with this vocabulary, a grade is assigned to the student's response along with feedback. The correlation between the student answers and model answers is 0.81.
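As a minimal sketch of this feature-plus-regression recipe (a generic illustration, not any specific cited system), the code below trains a Ridge model on a few handcrafted features; the feature values and scores are fabricated for the example.

# Minimal sketch: Ridge regression over handcrafted essay features (hypothetical data).
import numpy as np
from sklearn.linear_model import Ridge

# Columns: word_count, avg_word_length, similarity_to_reference (all made-up values).
X_train = np.array([
    [120, 4.1, 0.62],
    [300, 4.8, 0.81],
    [ 80, 3.9, 0.40],
    [210, 4.5, 0.75],
])
y_train = np.array([2.0, 4.5, 1.0, 3.5])  # human-assigned scores

model = Ridge(alpha=1.0).fit(X_train, y_train)
print(model.predict(np.array([[150, 4.3, 0.70]])))  # predicted score for a new essay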
3.4.2 Classification-based models
Persing and Ng (2013) used a support vector machine to score essays. The features extracted are POS, n-grams, and semantic text, used to train the model; keywords identified from the essay give the final score. Sakaguchi et al. (2015) proposed two methods, response-based and reference-based scoring. In response-based scoring, the extracted features are response length, an n-gram model, and syntactic elements, used to train a support vector regression model. In reference-based scoring, features such as sentence similarity computed with word2vec are used: the cosine similarity of the sentences is the final score of the response. The scores were first computed individually and later combined to obtain a final score, and this combination gave a remarkable increase in performance. Mathias and Bhattacharyya (2018a, b) proposed an automated essay grading dataset with essay attribute scores. Feature selection depends on the essay type; the common attributes are content, organization, word choice, sentence fluency, and conventions. In this system, each attribute is scored individually, identifying the strength of each attribute; a random forest classifier assigns scores to the individual attributes. The accuracy in QWK is 0.74 for prompt 1 of the ASAP dataset (https://www.kaggle.com/c/asap-sas/). Ke et al. (2019) used a support vector machine to find the response score, with features like agreeability, specificity, clarity, relevance to the prompt, conciseness, eloquence, confidence, direction of development, justification of opinion, and justification of importance. The individual parameter scores were obtained first and later combined to give a final response score, and the features were used in a neural network to determine whether a sentence is relevant to the topic. Salim et al. (2019) proposed an XGBoost Machine Learning classifier to assess the essays. The algorithm was trained on features like word count, POS, parse tree depth, and coherence in the articles with sentence similarity percentage; cohesion and coherence are considered for training. They implemented K-fold cross-validation; the resulting average accuracy is 68.12%.
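A minimal sketch of such a feature-based classifier with K-fold cross-validation, using XGBoost's scikit-learn wrapper on fabricated features and labels (not the cited system's actual pipeline):

# Minimal sketch: XGBoost classifier with K-fold CV on hypothetical essay features.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))          # e.g., word count, POS ratio, tree depth, similarity
y = rng.integers(0, 3, size=200)  # e.g., low / medium / high relevance labels

clf = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="mlogloss")
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
print(scores.mean())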
3.4.3 Neural network models
Shehab et al. (2016) proposed a neural network method using learning vector quantization, trained on human-scored essays; after training, the network can score ungraded essays. The essay is first spell-checked, then preprocessing steps like document tokenization, stop word removal, and stemming are performed before submitting it to the neural network. Finally, the model provides feedback on whether the essay is relevant to the topic. The correlation coefficient between the human rater and the system score is 0.7665.

Kopparapu and De (2016) proposed automatic ranking of essays using structural and semantic features. This approach constructs a super essay from all the responses; a student essay is then ranked against this super-essay. The derived structural and semantic features help obtain the scores: 15 structural features per paragraph, like the average number of sentences, the average length of sentences, and the counts of words, nouns, verbs, adjectives, etc., are used to obtain a syntactic score, and a similarity score serves as the semantic feature to calculate the overall score.

Dong and Zhang (2016) proposed a hierarchical CNN model. The first layer uses word embeddings to represent the words; the second is a word convolution layer with max-pooling to find word vectors; the next is a sentence-level convolution layer with max-pooling to find the sentence's content and synonyms; and a fully connected dense layer produces an output score for the essay. The hierarchical CNN model achieved an average QWK of 0.754.

Taghipour and Ng (2016) proposed the first neural approach for essay scoring, in which convolutional and recurrent neural network concepts help in scoring an essay. The network uses a lookup table with a one-hot representation of the word vectors of an essay. The final network with LSTM achieved an average QWK of 0.708.

Dong et al. (2017) proposed an attention-based scoring system with CNN + LSTM. For the CNN, the input parameters were character and word embeddings (obtained with NLTK), with attention pooling layers; the output is a sentence vector providing sentence weights. The CNN is followed by an LSTM layer with an attention pooling layer, and this final layer yields the final score of the responses. The average QWK score is 0.764.

Riordan et al. (2017) proposed a neural network with CNN and LSTM layers, with word embeddings given as input. An LSTM layer retrieves the window features and delivers them to an aggregation layer, a shallow layer that takes the correct window of words and feeds successive layers to predict the answer's score. The network achieved a QWK of 0.90.

Zhao et al. (2017) proposed a memory-augmented neural network with four layers: an input representation layer, a memory addressing layer, a memory reading layer, and an output layer. The input layer represents all essays in vector form based on essay length. After converting to word vectors, the memory addressing layer takes a sample of the essay and weighs all the terms; the memory reading layer takes the input from the memory addressing segment and finds the content to finalize the score; finally, the output layer provides the final score of the essay. The accuracy of the essay scores is 0.78, far better than the LSTM neural network.

Mathias and Bhattacharyya (2018a, b) proposed deep learning networks using LSTM with a CNN layer and GloVe pre-trained word embeddings. They retrieved features like the sentence count of essays, word count per sentence, number of OOVs in a sentence, language model score, and the text's perplexity. The network predicts a goodness score for each essay: the higher the goodness score, the higher the rank, and vice versa.

Nguyen and Dery (2016) proposed neural networks for automated essay grading: a single-layer bi-directional LSTM accepting word vectors as input. GloVe vectors used in this method resulted in an accuracy of 90%.

Ruseti et al. (2018) proposed a recurrent neural network capable of memorizing the text and generating a summary of an essay. A Bi-GRU network with a max-pooling layer is built on the word embeddings of each document; the essay is scored by comparing it with a summary of the essay from another Bi-GRU network. The result obtained an accuracy of 0.55.

Wang et al. (2018a, b) proposed an automatic scoring system with a bi-LSTM recurrent neural network, retrieving the features with the word2vec technique. This method generated word embeddings from the essay words using the skip-gram model; the word embeddings were then used to train the neural network to find the final score, with a softmax layer in the LSTM obtaining the importance of each word. This method achieved a QWK score of 0.83.

Dasgupta et al. (2018) proposed a technique for essay scoring that augments textual qualitative features, extracting three types of features associated with a text document: linguistic, cognitive, and psychological. The linguistic features are part of speech (POS), universal dependency relations, structural well-formedness, lexical diversity, sentence cohesion, causality, and informativeness of the text; the psychological features are derived from the Linguistic Inquiry and Word Count (LIWC) tool. They implemented a convolutional recurrent neural network that takes word embeddings and sentence vectors, retrieved from GloVe word vectors, as input; the second layer is a convolution layer to find local features, and the next layer is a recurrent neural network (LSTM) to find correspondences in the text. This method achieved an average QWK of 0.764.

Liang et al. (2018) proposed a symmetrical neural network AES model with Bi-LSTM. They extract features from sample essays and student essays and prepare an embedding layer as input; the embedding layer output is transferred to a convolution layer, from which the LSTM is trained. The LSTM model has a self-feature extraction layer, which finds the essay's coherence. The average QWK score of SBLSTMA is 0.801.

Liu et al. (2019) proposed two-stage learning: in the first stage, a score is assigned based on semantic data from the essay; the second-stage scoring is based on handcrafted features like grammar correction, essay length, number of sentences, etc. The average score of the two stages is 0.709.

Pedro Uria Rodriguez et al. (2019) proposed a sequence-to-sequence learning model for automatic essay scoring. They used BERT (Bidirectional Encoder Representations from Transformers), which extracts the semantics of a sentence from both directions, and the XLNet sequence-to-sequence learning model to extract features like the next sentence in an essay. With these pre-trained models, they captured coherence from the essay to give the final score; the average QWK score of the model is 0.755.
Xia et al. (2019) proposed a two-layer bi-directional LSTM neural network for scoring essays. The features were extracted with word2vec to train the LSTM; the model's accuracy is an average QWK of 0.870.

Kumar et al. (2019) proposed AutoSAS for short answer scoring. It uses pre-trained Word2Vec and Doc2Vec models, trained on the Google News corpus and a Wikipedia dump respectively, to retrieve the features. First, they POS-tagged every word and found the weighted words in the response. AutoSAS also finds prompt overlap to observe how relevant the answer is to the topic, with defined lexical overlaps like noun overlap, argument overlap, and content overlap. The method further uses statistical features like word frequency, difficulty, diversity, number of unique words in each response, type-token ratio, sentence statistics, word length, and logical-operator-based features, and trains a random forest model on the dataset. The dataset has sample responses with their associated scores; the model retrieves features from both graded and ungraded short answers along with the questions. The accuracy of AutoSAS in QWK is 0.78, and it works on any topic, like science, arts, biology, and English.

Jiaqi Lun et al. (2020) proposed automatic short answer scoring with BERT, comparing student responses with a reference answer and assigning scores. Data augmentation is done with a neural network, and with one correct answer from the dataset, the remaining responses are classified as correct or incorrect.

Zhu and Sun (2020) proposed a multimodal Machine Learning approach for automated essay scoring. First, they compute a grammar score with the spaCy library, along with numerical counts such as the number of words and sentences from the same library. With this input, they trained single- and Bi-LSTM neural networks to find the final score. For the LSTM model, they prepared sentence vectors with GloVe and word embeddings with NLTK; the Bi-LSTM checks each sentence in both directions to find the semantics of the essay. The average QWK score with the multiple models is 0.70.
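For flavor, the sketch below wires up a small embedding + Conv1D + LSTM regressor in Keras, in the spirit of the CNN + LSTM scorers surveyed here; the vocabulary size, sequence length, and layer widths are arbitrary placeholders, not any cited architecture's settings.

# Minimal sketch: a CNN + LSTM essay-score regressor in Keras (placeholder sizes).
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10000  # assumed vocabulary size after tokenization

model = keras.Sequential([
    layers.Embedding(vocab_size, 64),         # word embeddings
    layers.Conv1D(64, 5, activation="relu"),  # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(64),                          # sequence-level context
    layers.Dense(1, activation="sigmoid"),    # normalized score in [0, 1]
])
model.compile(optimizer="adam", loss="mse")
model.summary()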
3.4.4 Ontology-based approach
Mohler et al. (2011) proposed a graph-based method to find semantic similarity in short answer scoring, using a support vector regression model to rank the answers; the bag of words is the main feature extracted in the system. Ramachandran et al. (2015) also proposed a graph-based approach to find lexically based semantics; identified phrase patterns and text patterns are the features used to train a random forest regression model to score the essays, with an accuracy in QWK of 0.78. Zupanc et al. (2017) proposed sentence similarity networks to find the essay's score. Ajetunmobi and Daramola (2017) recommended an ontology-based information extraction approach and a domain-based ontology to find the score.
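To illustrate the kind of knowledge-based similarity these ontology and graph approaches build on, here is a minimal WordNet sketch with NLTK, computing path and Wu-Palmer similarities between two hand-picked concepts (purely illustrative choices):

# Minimal sketch: WordNet-based concept similarity with NLTK (illustrative concepts).
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

essay_term = wn.synset("computer.n.01")
prompt_term = wn.synset("machine.n.01")

print(essay_term.path_similarity(prompt_term))  # shortest-path similarity
print(essay_term.wup_similarity(prompt_term))   # Wu & Palmer depth-based similarity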
3.4.5 Speech response scoring
Automatic scoring comes in two forms: text-based scoring and speech-based scoring. This paper has discussed text-based scoring and its challenges; we now cover speech scoring and the common points between text- and speech-based scoring. Evanini and Wang (2013) worked on speech scoring of non-native school students, extracted features with SpeechRater, and trained a linear regression model, concluding that accuracy varies based on voice pitch. Loukina et al. (2015) worked on feature selection from speech data and trained an SVM. Malinin et al. (2016) used neural network models to train the data. Loukina et al. (2017) proposed speech- and text-based automatic scoring, extracting text-based and speech-based features and training a deep neural network for speech-based scoring; they extracted 33 types of features based on acoustic signals (Malinin et al. 2017). Wu Xixin et al. (2020) worked on deep neural networks for spoken language assessment, incorporating and testing different types of models. Ramanarayanan et al. (2017) worked on feature extraction methods, extracted punctuation, fluency, and stress features, and trained different Machine Learning models for scoring. Knill et al. (2018) worked on automatic speech recognizers and how their errors impact speech assessment.
3.4.5.1 The state of the art

This section provides an overview of the existing AES systems with a comparative study with respect to the models, features applied, datasets, and evaluation metrics used for building the automated essay grading systems. We divided all 62 papers into two sets; the first set of reviewed papers is presented in Table 5 with a comparative study of the AES systems.
3.4.6 Comparison of all approaches
In our study, we divided the major AES approaches into three categories: regression models, classification models, and neural network models. The regression models fail to find cohesion and coherence in an essay when trained on BoW (Bag of Words) features. In processing data from input to output, regression models are less complicated than neural networks, but they are unable to find many intricate patterns in the essay or to find sentence connectivity. Even in the neural network approach, if we train the model with BoW features, the model never considers the essay's cohesion and coherence.

To train a Machine Learning algorithm with essays, all the essays are first converted to vector form. We can form a vector with BoW, TF-IDF, or Word2vec; the BoW and Word2vec vector representations of essays are shown in Table 6. The BoW vector representation with TF-IDF does not incorporate the essay's semantics: it is just statistical learning from a given vector. A Word2vec vector comprises the semantics of an essay in a unidirectional way. In BoW, the vector contains the frequency of word occurrences in the essay: the vector holds 1 or more depending on the occurrences of a word in the essay and 0 when the word is absent, so the BoW vector maintains no relationship with adjacent words; it covers only single words. In word2vec, the vector represents the relationship of words with other words and with the sentence prompt in a multi-dimensional way. But word2vec prepares vectors unidirectionally, not bidirectionally: word2vec fails to find the right semantic vector when a word has two meanings and the meaning depends on the adjacent words. Table 7 presents a comparison of Machine Learning models and feature extraction methods.

In AES, cohesion and coherence checks compare the content of the essay with the essay prompt; these can be extracted from the essay in vector form. Two more parameters for assessing an essay are completeness and feedback: completeness checks whether the student's response is sufficient, even if what the student wrote is correct. Table 8 compares all four parameters for essay grading, and Table 9 compares all approaches based on various features like grammar, spelling, organization of the essay, and relevance.
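A minimal sketch of the two representations on a toy pair of responses, using scikit-learn's CountVectorizer for BoW and gensim for Word2vec (toy data, illustrative dimensions):

# Minimal sketch: BoW vs. Word2vec vectors for two toy student responses.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

responses = [
    "computers benefit us in many ways",
    "not everyone agrees that computers benefit society",
]

# BoW: one count per vocabulary word; no word order or semantic information.
bow = CountVectorizer().fit_transform(responses)
print(bow.toarray())

# Word2vec: dense per-word embeddings capturing co-occurrence semantics.
tokens = [r.split() for r in responses]
w2v = Word2Vec(sentences=tokens, vector_size=20, min_count=1, epochs=50)
print(w2v.wv["computers"][:5])  # first 5 dimensions of one word's embedding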
Table 5 State of the art

System | Approach | Dataset | Features applied | Evaluation metric and results
Mohler and Mihalcea (2009) | Shortest path similarity, LSA, regression model | - | Word vector | Finds the shortest path
Niraj Kumar and Lipika Dey (2013) | Word-Graph | ASAP Kaggle | Content and style-based features | 63.81% accuracy
Alex Adamson et al. (2014) | LSA regression model | ASAP Kaggle | Statistical features | QWK 0.532
Nguyen and Dery (2016) | LSTM (single-layer bidirectional) | ASAP Kaggle | Statistical features | 90% accuracy
Keisuke Sakaguchi et al. (2015) | Classification model | ETS (Educational Testing Service) | Statistical, style-based features | QWK 0.69
Ramachandran et al. (2015) | Regression model | ASAP Kaggle short answer | Statistical and style-based features | QWK 0.77
Sultan et al. (2016) | Ridge regression model | SciEntBank answers | Statistical features | RMSE 0.887
Dong and Zhang (2016) | CNN neural network | ASAP Kaggle | Statistical features | QWK 0.734
Taghipour and Ng (2016) | CNN + LSTM neural network | ASAP Kaggle | Lookup table (one-hot representation of word vector) | QWK 0.761
Shehab et al. (2016) | Learning vector quantization neural network | Mansoura University students' essays | Statistical features | Correlation coefficient 0.7665
Cummins et al. (2016) | Regression model | ASAP Kaggle | Statistical features, style-based features | QWK 0.69
Kopparapu and De (2016) | Neural network | ASAP Kaggle | Statistical features, style-based | -
Dong et al. (2017) | CNN + LSTM neural network | ASAP Kaggle | Word embedding, content-based | QWK 0.764
Ajetunmobi and Daramola (2017) | Wu-Palmer algorithm | - | Statistical features | -
Siyuan Zhao et al. (2017) | LSTM (memory network) | ASAP Kaggle | Statistical features | QWK 0.78
Mathias and Bhattacharyya (2018a) | Random forest classifier (classification model) | ASAP Kaggle | Style- and content-based features | Classified which feature set is required
Brian Riordan et al. (2017) | CNN + LSTM neural network | ASAP Kaggle short answer | Word embeddings | QWK 0.90
Tirthankar Dasgupta et al. (2018) | CNN-bidirectional LSTM neural network | ASAP Kaggle | Content and psychological features | QWK 0.786
Wu and Shih (2018) | Classification model | SciEntBank answers | unigram_recall, unigram_precision, unigram_F_measure, log_bleu_recall, log_bleu_precision, log_bleu_F_measure, BLEU features | Squared correlation coefficient 59.568
Yucheng Wang et al. (2018b) | Bi-LSTM | ASAP Kaggle | Word embedding sequence | QWK 0.724
Anak Agung Putri Ratna et al. (2018) | Winnowing algorithm | - | - | Accuracy 86.86
Sharma and Jayagopi (2018) | GloVe, LSTM neural network | ASAP Kaggle | Handwritten essay images | QWK 0.69
Jennifer O. Contreras et al. (2018) | OntoGen (SVM), linear regression | University of Benghazi dataset | Statistical, style-based features | -
Mathias and Bhattacharyya (2018b) | GloVe, LSTM neural network | ASAP Kaggle | Statistical features, style features | Predicted goodness score for essay
Stefan Ruseti et al. (2018) | BiGRU Siamese architecture | Summaries collected via the Amazon Mechanical Turk online research service | Word embedding | Accuracy 55.2
Zining Wang et al. (2018a) | LSTM (semantic), HAN (hierarchical attention network) neural network | ASAP Kaggle | Word embedding | QWK 0.83
Guoxi Liang et al. (2018) | Bi-LSTM | ASAP Kaggle | Word embedding, coherence of sentence | QWK 0.801
Ke et al. (2019) | Classification model | ASAP Kaggle | Content-based | Pearson's correlation coefficient (PC) 0.39; ME 0.921
Tsegaye Misikir Tashu and Horváth (2019) | Unsupervised learning (locality-sensitive hashing) | ASAP Kaggle | Statistical features | Root mean squared error
Kumar and Dey (2019) | Random forest; CNN, RNN neural network | ASAP Kaggle short answer | Style- and content-based features | QWK 0.82
Pedro Uria Rodriguez et al. (2019) | BERT, XLNet | ASAP Kaggle | Error correction, sequence learning | QWK 0.755
Jiawei Liu et al. (2019) | CNN, LSTM, BERT | ASAP Kaggle | Semantic data, handcrafted features like grammar correction, essay length, number of sentences, etc. | QWK 0.709
Darwish and Mohamed (2020) | Multiple linear regression | ASAP Kaggle | Style- and content-based features | QWK 0.77
Jiaqi Lun et al. (2020) | BERT | SemEval-2013 | Student answer, reference answer | Accuracy 0.8277 (2-way)
Süzen Neslihan et al. (2020) | Text mining | Student assignments from an introductory computer science class at the University of North Texas | Sentence similarity | Correlation score 0.81
Wilson Zhu and Yu Sun (2020) | RNN (LSTM, Bi-LSTM) | ASAP Kaggle | Word embedding, grammar count, word count | QWK 0.70
Salim Yafet et al. (2019) | XGBoost machine learning classifier | ASAP Kaggle | Word count, POS, parse tree, coherence, cohesion, type-token ratio | Accuracy 68.12
Andrzej Cader (2020) | Deep neural network | University of Social Sciences in Lodz students' answers | Asynchronous feature | Accuracy 0.99
Tashu TM and Horváth T (2019) | Rule-based algorithm, similarity-based algorithm | ASAP Kaggle | Similarity-based | Accuracy 0.68
Masaki Uto and Masashi Okano (2020) | Item response theory models (CNN-LSTM, BERT) | ASAP Kaggle | - | QWK 0.749
Table 6 Vector representation of essays

Student 1 response: "I believe that using computers will benefit us in many ways like talking and becoming friends will others through websites like facebook and mysace"
BoW vector: << 0.00000 0.00000 0.165746 0.280633 ... 0.00000 0.280633 0.280633 0.280633 >>
Word2vec vector: << 3.9792988e-03 −1.9810481e-03 1.9830784e-03 9.0381579e-04 −2.9438005e-03 2.1778699e-03 4.4950014e-03 2.9508960e-03 −2.2331756e-03 −3.8774475e-03 3.5967759e-03 −4.0194849e-03 −3.0412588e-03 −2.4055617e-03 4.8296354e-03 2.4813593e-03 ... −2.7158875e-03 −1.4563646e-03 1.4072991e-03 −5.2228488e-04 −2.3597316e-03 6.2979700e-04 −3.0249553e-03 4.4125126e-04 2.1633594e-03 −4.9487003e-03 9.9755758e-05 −2.4388896e-03 >>

Student 2 response: "More and more people use computers, but not everyone agrees that this benefits society. Those who support advances in technology believe that computers have a positive effect on people"
BoW vector: << 0.26043 0.26043 0.153814 0.000000 ... 0.26043 0.000000 0.000000 0.000000 >>
Word2vec vector: << 3.9792988e-03 −1.9810481e-03 1.9830784e-03 9.0381579e-04 −2.9438005e-03 2.1778699e-03 4.4950014e-03 2.9508960e-03 −2.2331756e-03 −3.8774475e-03 3.5967759e-03 −4.0194849e-03 ... −2.7158875e-03 −1.4563646e-03 1.4072991e-03 −5.2228488e-04 −2.3597316e-03 6.2979700e-04 −3.0249553e-03 4.4125126e-04 3.7868773e-03 −4.4193151e-03 3.0735810e-03 2.5546195e-03 2.1633594e-03 −4.9487003e-03 9.9755758e-05 −2.4388896e-03 >>
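Vectors like those in Table 6 can be produced with standard libraries. The sketch below, using scikit-learn for the tf-idf bag of words and gensim for Word2Vec, is illustrative only: the exact numbers in Table 6 depend on the corpus, vocabulary, vector size, and random seed the original authors used, and averaging word vectors is one conventional essay-level representation, not necessarily theirs.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

responses = [
    "I believe that using computers will benefit us in many ways",
    "More and more people use computers, but not everyone agrees",
]

# Bag of words (tf-idf): one weight per vocabulary term; word order is lost.
bow_vectors = TfidfVectorizer().fit_transform(responses).toarray()

# Word2Vec: a dense vector per word; here an essay vector is taken as the
# mean of its word vectors.
tokens = [r.lower().split() for r in responses]
w2v = Word2Vec(sentences=tokens, vector_size=100, min_count=1, seed=1)
essay_vectors = np.array(
    [np.mean([w2v.wv[t] for t in toks], axis=0) for toks in tokens]
)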
3.5 What are the challenges/limitations in the current research?
As the results discussed in the previous sections show, many researchers have approached automated essay scoring with a wide range of techniques: statistical methods, classification methods, and neural network approaches. The main goal of an automated essay grading system is to reduce human effort and improve consistency, and the vast majority of systems focus on the efficiency of the algorithm. Many challenges remain, however. An essay should be assessed against parameters such as the relevance of the content to the prompt, development of ideas, cohesion, coherence, and domain knowledge. No model addresses relevance of content, that is, whether the student's response or explanation is relevant to the given prompt and, if so, how appropriate it is; and the cohesion and coherence of essays are rarely discussed. Most studies concentrate on extracting features with NLP libraries, training models, and testing results, with no account of consistency or completeness in the evaluation. Palma and Atkinson (2018) did describe coherence-based essay evaluation, and Zupanc and Bosnic (2014) also used coherence to evaluate essays, measuring consistency with latent semantic analysis (LSA); the dictionary meaning of coherence is "the quality of being logical and consistent." Another limitation is the lack of domain-knowledge-based evaluation of essays with machine learning models: the meaning of "cell," for example, differs between biology and physics. Many models extract features with Word2Vec and GloVe, and these libraries cannot produce distinct vectors for a word that has two or more meanings.
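The one-vector-per-word limitation is easy to see directly. A minimal gensim sketch (the two toy sentences are ours, chosen to contrast the two senses of "bank"):

from gensim.models import Word2Vec

sentences = [
    ["i", "took", "a", "bank", "loan"],          # financial sense
    ["we", "sat", "on", "the", "river", "bank"], # geographic sense
]
model = Word2Vec(sentences=sentences, vector_size=50, min_count=1, seed=1)

# Word2Vec stores exactly one vector per surface form, so both senses of
# "bank" collapse into the same row; the model cannot separate them.
vec = model.wv["bank"]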
3.5.1 Other challenges that influence automated essay scoring systems
All of these approaches aim to improve the QWK score of their models. QWK, however, does not assess a model in terms of feature extraction or constructed irrelevant answers; it does not tell us whether the model is actually assessing the answer correctly. Many challenges remain concerning how students' responses interact with an automatic scoring system.
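For reference, QWK itself is straightforward to compute; the limitation discussed above lies not in the formula but in what agreement with human raters can and cannot reveal. A minimal sketch:

import numpy as np

def quadratic_weighted_kappa(human, machine, min_rating, max_rating):
    """QWK between two equal-length vectors of integer ratings."""
    human = np.asarray(human) - min_rating
    machine = np.asarray(machine) - min_rating
    n = max_rating - min_rating + 1
    observed = np.zeros((n, n))
    for h, m in zip(human, machine):
        observed[h, m] += 1
    # Quadratic disagreement weights; expected matrix under independence.
    weights = np.array([[(i - j) ** 2 for j in range(n)] for i in range(n)],
                       dtype=float) / (n - 1) ** 2
    expected = np.outer(np.bincount(human, minlength=n),
                        np.bincount(machine, minlength=n)) / len(human)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()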
Table 7 Comparison of models

Model | BoW features | Word2vec features
Regression/classification models | Low cohesion and coherence | Low to medium cohesion and coherence
Neural networks (LSTM) | Low cohesion and coherence | Medium to high cohesion and coherence
Table 8 Comparison of all models with respect to cohesion, coherence, completeness, feedback

Authors | Cohesion | Coherence | Completeness | Feedback
Mohler and Mihalcea (2009) | Low | Low | Low | Low
Mohler et al. (2011) | Medium | Low | Medium | Low
Persing and Ng (2013) | Medium | Low | Low | Low
Adamson et al. (2014) | Low | Low | Low | Low
Ramachandran et al. (2015) | Medium | Medium | Low | Low
Sakaguchi et al. (2015) | Medium | Low | Low | Low
Cummins et al. (2016) | Low | Low | Low | Low
Sultan et al. (2016) | Medium | Medium | Low | Low
Shehab et al. (2016) | Low | Low | Low | Low
Kopparapu and De (2016) | Medium | Medium | Low | Low
Dong and Zhang (2016) | Medium | Low | Low | Low
Taghipour and Ng (2016) | Medium | Medium | Low | Low
Zupanc et al. (2017) | Medium | Medium | Low | Low
Dong et al. (2017) | Medium | Medium | Low | Low
Riordan et al. (2017) | Medium | Medium | Medium | Low
Zhao et al. (2017) | Medium | Medium | Low | Low
Contreras et al. (2018) | Medium | Low | Low | Low
Mathias and Bhattacharyya (2018a) | Medium | Medium | Low | Low
Mathias and Bhattacharyya (2018b) | Medium | Medium | Low | Low
Nguyen and Dery (2016) | Medium | Medium | Medium | Medium
Ruseti et al. (2018) | Medium | Low | Low | Low
Dasgupta et al. (2018) | Medium | Medium | Low | Low
Liu et al. (2018) | Low | Low | Low | Low
Wang et al. (2018b) | Medium | Low | Low | Low
Guoxi Liang et al. (2018) | High | High | Low | Low
Wang et al. (2018a) | Medium | Medium | Low | Low
Chen and Li (2018) | Medium | Medium | Low | Low
Li et al. (2018) | Medium | Medium | Low | Low
Alva-Manchego et al. (2019) | Low | Low | Low | Low
Jiawei Liu et al. (2019) | High | High | Medium | Low
Pedro Uria Rodriguez et al. (2019) | Medium | Medium | Medium | Low
Changzhi Cai (2019) | Low | Low | Low | Low
Xia et al. (2019) | Medium | Medium | Low | Low
Chen and Zhou (2019) | Low | Low | Low | Low
Kumar et al. (2019) | Medium | Medium | Medium | Low
Ke et al. (2019) | Medium | Low | Medium | Low
Andrzej Cader (2020) | Low | Low | Low | Low
Jiaqi Lun et al. (2020) | High | High | Low | Low
Wilson Zhu and Yu Sun (2020) | Medium | Medium | Low | Low
Süzen, Neslihan, et al. (2020) | Medium | Low | Medium | Low
Salim Yafet et al. (2019) | High | Medium | Low | Low
Darwish and Mohamed (2020) | Medium | Low | Low | Low
Tashu and Horváth (2020) | Medium | Medium | Low | Medium
Tashu (2020) | Medium | Medium | Low | Low
In the evaluation approaches surveyed, no model has examined how to handle constructed irrelevant and adversarial answers. Black-box approaches such as deep learning models in particular give students more opportunities to bluff automated scoring systems, and machine learning models that rely on statistical features are very vulnerable. Powers et al. (2001) and Bejar et al. (2014) showed that E-rater failed against the Construct-Irrelevant Response Strategy (CIRS). Bejar et al. (2013) and Higgins and Heilman (2014) observed that when a student response contains irrelevant content or shell language echoing the prompt, it influences the final score in an automated scoring system. In deep learning approaches, most models read the essay's features automatically, some working on word-based embeddings and others on character-based embeddings. Riordan et al. (2019) found that character-based embedding systems do not prioritize spelling correction, even though spelling influences the final score of the essay. Horbach and Zesch (2019) identified various factors that influence AES systems, such as dataset size, prompt type, answer length, training set, and the human scorers used for content-based scoring. Ding et al. (2020) showed that automated scoring systems are vulnerable when a student response repeats many words from the prompt. Parekh et al. (2020) and Kumar et al. (2020) tested various neural network AES models by iteratively adding important words, deleting unimportant words, shuffling the words, and repeating sentences in an essay, and found no change in the final score; these neural network models fail to recognize the lack of common sense in adversarial essays and give students more options to bluff the automated systems. Beyond NLP and ML techniques for AES, work from Wresch (1993) to Madnani and Cahill (2018) has discussed the complexity of AES systems and the standards that need to be followed, such as assessment rubrics to test subject knowledge, handling of irrelevant responses, and ethical aspects of an algorithm such as measuring the fairness of its treatment of student responses. Fairness is an essential factor for automated systems; in AES, for example, fairness can be measured as the agreement between human and machine scores. Beyond this, Loukina et al. (2019) note that fairness standards include overall score accuracy, overall score differences, and conditional score differences between human and system scores, and that scoring responses with attention to constructed relevant and irrelevant content will improve fairness. Madnani et al. (2017a, b) discussed the fairness of AES systems for constructed responses and presented the RMS open-source tool for detecting biases in models, with which one can adapt fairness standards to one's own analysis. Following Berzak et al.'s (2018) approach, behavioral factors are a significant challenge in automated scoring systems; such factors help gauge language proficiency, identify word characteristics (essential words in the text), predict critical patterns, find related sentences in an essay, and give a more accurate score.
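The perturbation tests of Parekh et al. (2020) and Kumar et al. (2020) are easy to reproduce in spirit. The sketch below assumes a trained scorer exposed as score_fn (a hypothetical callable, not any specific system); a robust scorer should change its score on most of these probes:

import random

def perturbation_probes(essay: str) -> dict:
    """Simple construct-irrelevant rewrites of an essay."""
    words = essay.split()
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    shuffled = words[:]
    random.shuffle(shuffled)
    return {
        "shuffled_words": " ".join(shuffled),
        "repeated_sentences": ". ".join(sentences + sentences) + ".",
        "truncated_half": " ".join(words[: len(words) // 2]),
    }

def unnoticed_probes(score_fn, essay: str, tolerance: float = 0.5) -> dict:
    """Return the probes whose score stays within `tolerance` of the
    original score, i.e. perturbations the scoring model fails to notice."""
    base = score_fn(essay)
    flagged = {}
    for name, text in perturbation_probes(essay).items():
        score = score_fn(text)
        if abs(score - base) <= tolerance:
            flagged[name] = score
    return flagged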
Table 8 (continued)

Authors | Cohesion | Coherence | Completeness | Feedback
Masaki Uto and Masashi Okano (2020) | Medium | Medium | Medium | Medium
Panitan Muangkammuen and Fumiyo Fukumoto (2020) | Medium | Medium | Medium | Low
Table 9 Comparison of all approaches on various features

Approaches | Grammar | Style (word choice, sentence structure) | Mechanics (spelling, punctuation, capitalization) | Development | BoW (tf-idf) | Relevance
Mohler and Mihalcea (2009) | No | No | No | No | Yes | No
Mohler et al. (2011) | Yes | No | No | No | Yes | No
Persing and Ng (2013) | Yes | Yes | Yes | No | Yes | Yes
Adamson et al. (2014) | Yes | No | Yes | No | Yes | No
Ramachandran et al. (2015) | Yes | No | Yes | Yes | Yes | Yes
Sakaguchi et al. (2015) | No | No | Yes | Yes | Yes | Yes
Cummins et al. (2016) | Yes | No | Yes | No | Yes | No
Sultan et al. (2016) | No | No | No | No | Yes | Yes
Shehab et al. (2016) | Yes | Yes | Yes | No | Yes | No
Kopparapu and De (2016) | No | No | No | No | Yes | No
Dong and Zhang (2016) | Yes | No | Yes | No | Yes | Yes
Taghipour and Ng (2016) | Yes | No | No | No | Yes | Yes
Zupanc et al. (2017) | No | No | No | No | Yes | No
Dong et al. (2017) | No | No | No | No | No | Yes
Riordan et al. (2017) | No | No | No | No | No | Yes
Zhao et al. (2017) | No | No | No | No | No | Yes
Contreras et al. (2018) | Yes | No | No | No | Yes | Yes
Mathias and Bhattacharyya (2018a) | No | Yes | Yes | No | No | Yes
Mathias and Bhattacharyya (2018b) | Yes | No | Yes | No | Yes | Yes
Nguyen and Dery (2016) | No | No | No | No | Yes | Yes
Ruseti et al. (2018) | No | No | No | Yes | No | Yes
Dasgupta et al. (2018) | Yes | Yes | Yes | Yes | No | Yes
Liu et al. (2018) | Yes | Yes | No | No | Yes | No
Wang et al. (2018b) | No | No | No | No | No | Yes
Guoxi Liang et al. (2018) | No | No | No | No | No | Yes
Wang et al. (2018a) | No | No | No | No | No | Yes
Table 9 (continued)

Approaches | Grammar | Style (word choice, sentence structure) | Mechanics (spelling, punctuation, capitalization) | Development | BoW (tf-idf) | Relevance
Chen and Li (2018) | No | No | No | No | No | Yes
Li et al. (2018) | Yes | No | No | No | No | Yes
Alva-Manchego et al. (2019) | Yes | No | No | Yes | No | Yes
Jiawei Liu et al. (2019) | Yes | No | No | Yes | No | Yes
Pedro Uria Rodriguez et al. (2019) | No | No | No | No | Yes | Yes
Changzhi Cai (2019) | No | No | No | No | No | Yes
Xia et al. (2019) | No | No | No | No | No | Yes
Chen and Zhou (2019) | No | No | No | No | No | Yes
Kumar et al. (2019) | Yes | Yes | No | Yes | Yes | Yes
Ke et al. (2019) | No | Yes | No | Yes | Yes | Yes
Andrzej Cader (2020) | No | No | No | No | No | Yes
Jiaqi Lun et al. (2020) | No | No | No | No | No | Yes
Wilson Zhu and Yu Sun (2020) | No | No | No | No | No | Yes
Süzen, Neslihan, et al. (2020) | No | No | No | No | Yes | Yes
Salim Yafet et al. (2019) | Yes | Yes | Yes | No | Yes | Yes
Darwish and Mohamed (2020) | Yes | Yes | No | No | No | Yes
Rupp (2018) discussed methodologies for designing, evaluating, and deploying AES systems, and identified notable characteristics for deployment: model performance, evaluation metrics, threshold values, dynamically updated models, and the framework. First, model performance should be checked on different datasets and parameter settings before operational deployment. Evaluation metrics for AES models are typically QWK, a correlation coefficient, or sometimes both. Kelley and Preacher (2012) discussed three categories of threshold values: marginal, borderline, and acceptable; the values can vary with data size, model performance, and type of model (single scoring vs. multiple scoring models). Once deployed and evaluating millions of responses, a model needs to be updated dynamically based on the prompt and incoming data to stay optimal. Finally, framework design concerns how prompts are presented for test takers to write their responses: one can design a single scoring model for a single methodology, or multiple scoring models for multiple concepts. When multiple scoring models are deployed, each prompt can be trained separately, or a generalized model can serve all prompts, in which case accuracy may vary, and this remains challenging.
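A deployment gate based on such thresholds can be as simple as the sketch below. The band names follow Kelley and Preacher (2012), but the cutoff values are hypothetical placeholders that would have to be set per assessment program:

# Hypothetical QWK cutoffs; real values vary with data size, model type,
# and program requirements, as discussed above.
THRESHOLDS = [("acceptable", 0.70), ("borderline", 0.60), ("marginal", 0.50)]

def deployment_band(qwk: float) -> str:
    """Map a validation QWK to a deployment decision band."""
    for band, cutoff in THRESHOLDS:
        if qwk >= cutoff:
            return band
    return "reject"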
4 Synthesis
Our systematic literature review of automated essay grading systems first collected 542 papers with selected keywords from various databases. After applying the inclusion and exclusion criteria, 139 articles remained; applying quality assessment criteria to these papers with two reviewers, we finally selected 62 papers for review. Our observations on automated essay grading systems from 2010 to 2020 are as follows:
• The implementation techniques of automated essay grading systems fall into four buckets: 1. regression models, 2. classification models, 3. neural networks, and 4. ontology-based methods. Researchers using neural networks obtain higher accuracy than the other techniques; the state of the art for each method is given in Table 3.
• The majority of the regression and classification models for essay scoring used statistical features to find the final score. These systems are trained on parameters such as word count and sentence count: although the parameters are extracted from the essay, the algorithm is not trained directly on the essay but on numbers derived from it, so if the numbers match, the composition gets a good score and otherwise a low one. The evaluation process is entirely numeric, irrespective of the essay itself, so there is a real chance of missing the coherence and relevance of the essay when an algorithm is trained only on statistical parameters.
• In the neural network approaches, models are often trained on Bag of Words (BoW) features. BoW misses the relationships between words and the semantic meaning of a sentence. E.g., for Sentence 1, "John killed Bob," and Sentence 2, "Bob killed John," the BoW representation is identical: "John," "killed," "Bob."
• With the Word2Vec library, if we prepare word vectors from an essay in a unidirectional way, each vector carries dependencies on other words and captures semantic relationships with them. But if a word has two or more meanings, as in "bank loan" and "river bank" (here "bank" has two senses), its adjacent words decide
the sentence meaning; in this case, Word2Vec cannot recover the real meaning of the word from the sentence.
• The features extracted from essays in scoring systems fall into three types: statistical features, style-based features, and content-based features, as explained in RQ2 and Table 3. Statistical features play a significant role in some systems and a negligible one in others. In Shehab et al. (2016), Cummins et al. (2016), Dong et al. (2017), Dong and Zhang (2016), and Mathias and Bhattacharyya (2018a, b), the assessment relies entirely on statistical and style-based features, with no content-based features retrieved; in the other systems that do extract content, statistical features serve only for preprocessing the essays and are not included in the final grading.
• In AES systems, coherence is a central feature to consider when evaluating essays. Coherence literally means "to stick together": the logical connection of sentences (local coherence) and of paragraphs (global coherence) in a text. Without coherence, the sentences in a paragraph are independent and meaningless. Coherence is a powerful feature for finding the semantics of an essay: with it, one can assess whether all sentences connect in a flow and whether all paragraphs justify the prompt. Retrieving the coherence level from an essay remains a critical task for researchers in AES systems (a sketch of one simple local-coherence measure follows this list).
• In automatic essay grading systems, assessing an essay with respect to its content is critical, since that yields the student's actual score. Most studies used statistical features such as sentence length, word count, and number of sentences, but according to the collected results only 32% of the systems used content-based features for essay scoring. Examples of content-based assessment include Taghipour and Ng (2016), Persing and Ng (2013), Wang et al. (2018a, b), Zhao et al. (2017), and Kopparapu and De (2016); Kumar et al. (2019), Mathias and Bhattacharyya (2018a, b), and Mohler and Mihalcea (2009) used both content-based and statistical features. The results are shown in Fig. 3. Content-based features are mainly extracted with the word2vec NLP library. Word2vec captures the context of a word in a document, semantic and syntactic similarity, and relations with other terms, but it captures context in only one direction (left or right), so if a word has multiple meanings there is a chance of missing its context in the essay. After analyzing all the papers, we find that content-based assessment amounts to a qualitative assessment of essays.
• On the other hand, Horbach and Zesch (2019), Riordan et al. (2019), Ding et al. (2020), and Kumar et al. (2020) showed that neural network models are vulnerable when a student response contains constructed irrelevant or adversarial answers: a student can easily bluff an automated scoring system by submitting responses that repeat sentences or prompt words in an essay. Following Loukina et al. (2019) and Madnani et al. (2017b), the fairness of an algorithm is an essential factor to consider in AES systems.
• For speech assessment, the datasets contain audio of up to one minute in duration. Feature extraction techniques differ entirely from text assessment, and accuracy varies with speaking fluency, pitch, and speaker voice (male vs. female, child vs. adult), but the training algorithms are the same for text and speech assessment.
• Once an AES system can evaluate essays and short answers accurately in all these respects, there will be massive demand for automated systems in education and related fields. AES systems are already deployed in the GRE and TOEFL exams; beyond these, they could be deployed in massive open online courses such as Coursera (https://coursera.org/learn//machine-learning//exam) and
NPTEL (https://swayam.gov.in/explorer), etc., which still assess student performance with multiple-choice questions. From another perspective, AES systems could be deployed in information retrieval systems such as Quora and Stack Overflow to check whether a retrieved response is appropriate to the question and to rank the retrieved answers.
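The local-coherence sketch referenced in the list above: mean cosine similarity between adjacent sentences in a reduced LSA space, in the spirit of Zupanc and Bosnic (2014). This is one simple proxy, not the exact method of any single surveyed paper:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def local_coherence(essay: str, n_components: int = 50) -> float:
    """Mean adjacent-sentence cosine similarity in an LSA space."""
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    if len(sentences) < 2:
        return 1.0
    tfidf = TfidfVectorizer().fit_transform(sentences)
    k = min(n_components, tfidf.shape[1] - 1, len(sentences) - 1)
    if k < 1:
        return 1.0  # too little vocabulary to reduce; degenerate input
    vectors = TruncatedSVD(n_components=k).fit_transform(tfidf)
    sims = [cosine_similarity(vectors[i : i + 1], vectors[i + 1 : i + 2])[0, 0]
            for i in range(len(vectors) - 1)]
    return float(np.mean(sims))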
5 Conclusion and future work
As per our systematic literature review, we studied 62 papers. Significant challenges remain for researchers implementing automated essay grading systems, and several researchers are working rigorously on building robust AES systems despite the difficulty of the problem. The evaluation methods are not assessed for coherence, relevance, completeness, feedback, or domain knowledge, and 90% of the essay grading systems use the Kaggle ASAP (2012) dataset, which contains general essays from students and requires no domain knowledge; domain-specific essay datasets are therefore needed for training and testing. Feature extraction relies on the NLTK, Word2Vec, and GloVe NLP libraries, which have many limitations when converting a sentence into vector form. Apart from feature extraction and training machine learning models, no system assesses an essay's completeness, provides feedback on the student response, or retrieves coherence vectors from the essay; from another perspective, constructed irrelevant and adversarial student responses still call AES systems into question. Our proposed research will pursue content-based assessment of essays with domain knowledge, scoring essays for internal and external consistency, and we will create a new dataset for one domain. Feature extraction techniques are another area we can improve. This study included only four digital databases for study selection and may therefore miss some relevant studies on the topic; however, we hope we covered most of the significant work, as we also manually collected papers published in relevant journals.
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s10462-021-10068-2.
Funding Not Applicable.
References
Adamson, A., Lamb, A., & December, R. M. (2014). Automated essay grading.
Ajay HB, Tillett PI, Page EB (1973) Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development
Ajetunmobi SA, Daramola O (2017) Ontology-based information extraction for subject-focussed automatic essay evaluation. In: 2017 International Conference on Computing Networking and Informatics (ICCNI) p 1–6. IEEE
Alva-Manchego F, et al. (2019) EASSE: Easier automatic sentence simplification evaluation. ArXiv abs/1908.04567
Bailey S, Meurers D (2008) Diagnosing meaning errors in short answers to reading comprehension questions. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications (Columbus), p 107–115
Basu S, Jacobs C, Vanderwende L (2013) Powergrading: a clustering approach to amplify human effort for short answer grading. Trans Assoc Comput Linguist (TACL) 1:391–402
Bejar, I. I., Flor, M., Futagi, Y., & Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing, 22, 48–59.
Bejar I, et al. (2013) Length of textual response as a construct-irrelevant response strategy: The case of shell language. Research Report ETS RR-13-07, ETS Research Report Series
Berzak Y, et al. (2018) Assessing language proficiency from eye movements in reading. ArXiv abs/1804.07329
Blanchard D, Tetreault J, Higgins D, Cahill A, Chodorow M (2013) TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15
Blood, I. (2011). Automated essay scoring: a literature review. Studies in Applied Linguistics and TESOL, 11(2).
Burrows S, Gurevych I, Stein B (2015) The eras and trends of automatic short answer grading. Int J Artif Intell Educ 25:60–117. https://doi.org/10.1007/s40593-014-0026-8
Cader, A. (2020, July). The potential for the use of deep neural networks in e-learning student evaluation with new data augmentation method. In International Conference on Artificial Intelligence in Education (pp. 37–42). Springer, Cham.
Cai C (2019) Automatic essay scoring with recurrent neural network. In: Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications
Chen M, Li X (2018) Relevance-based automated essay scoring via hierarchical recurrent model. In: 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, p 378–383. https://doi.org/10.1109/IALP.2018.8629256
Chen Z, Zhou Y (2019) Research on automatic essay scoring of composition based on CNN and OR. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, p 13–18. https://doi.org/10.1109/ICAIBD.2019.8837007
Contreras JO, Hilles SM, Abubakar ZB (2018) Automated essay scoring with ontology based on text mining and NLTK tools. In: 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE), p 1–6
Correnti R, Matsumura LC, Hamilton L, Wang E (2013) Assessing students' skills at writing analytically in response to texts. Elem Sch J 114(2):142–177
Cummins, R., Zhang, M., & Briscoe, E. (2016, August). Constrained multi-task learning for automated essay scoring. Association for Computational Linguistics.
Darwish SM, Mohamed SK (2020) Automated essay evaluation based on fusion of fuzzy ontology and latent semantic analysis. In: Hassanien A, Azar A, Gaber T, Bhatnagar RF, Tolba M (eds) The International Conference on Advanced Machine Learning Technologies and Applications
Dasgupta T, Naskar A, Dey L, Saha R (2018) Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 93–102
Ding Y, et al. (2020) Don't take "nswvtnvakgxpm" for an answer: The surprising vulnerability of automatic content scoring systems to adversarial input. In: Proceedings of the 28th International Conference on Computational Linguistics
Dong F, Zhang Y (2016) Automatic features for essay scoring: an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1072–1077
Dong F, Zhang Y, Yang J (2017) Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) p 153–162
Dzikovska M, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Dang HT (2013a) SemEval-2013 Task 7: The joint student response analysis and 8th recognizing textual entailment challenge
Dzikovska MO, Nielsen R, Brew C, Leacock C, Giampiccolo D, Bentivogli L, Clark P, Dagan I, Trang Dang H (2013b) SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. *SEM 2013: The First Joint Conference on Lexical and Computational Semantics
Educational Testing Service (2008) Criterion online writing evaluation service. Retrieved from http://www.ets.org/s/criterion/pdf/9286_CriterionBrochure.pdf
Evanini, K., & Wang, X. (2013, August). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).
Foltz PW, Laham D, Landauer TK (1999) The Intelligent Essay Assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2). http://imej.wfu.edu/articles/1999/2/04/index.asp
Granger, S., Dagneaux, E., Meunier, F., & Paquot, M. (Eds.). (2009). International corpus of learner English. Louvain-la-Neuve: Presses universitaires de Louvain.
Higgins, D., & Heilman, M. (2014). Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior. Educational Measurement: Issues and Practice, 33(3), 36–46.
Horbach A, Zesch T (2019) The influence of variance in learner answers on automatic content scoring. Front Educ 4:28. https://doi.org/10.3389/feduc.2019.00028
https://www.coursera.org/learn/machine-learning/exam/7pytE/linear-regression-with-multiple-variables/attempt
Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, e208.
Ke Z, Ng V (2019) Automated essay scoring: a survey of the state of the art. IJCAI
Ke, Z., Inamdar, H., Lin, H., & Ng, V. (2019, July). Give me more feedback II: Annotating thesis strength and related attributes in student essays. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3994–4004).
Kelley K, Preacher KJ (2012) On effect size. Psychol Methods 17(2):137–152
Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering: a systematic literature review. Inf Softw Technol 51(1):7–15
Klebanov, B. B., & Madnani, N. (2020, July). Automated evaluation of writing: 50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7796–7810).
Knill K, Gales M, Kyriakopoulos K, et al. (2018) Impact of ASR performance on free speaking language assessment. In: Interspeech 2018, 02–06 Sep 2018, Hyderabad, India. International Speech Communication Association (ISCA)
Kopparapu SK, De A (2016) Automatic ranking of essays using structural and semantic features. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), p 519–523
Kumar, Y., Aggarwal, S., Mahata, D., Shah, R. R., Kumaraguru, P., & Zimmermann, R. (2019, July). Get it scored using AutoSAS: an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 9662–9669).
Kumar Y, et al. (2020) Calling out bluff: attacking the robustness of automatic scoring systems with simple adversarial testing. ArXiv abs/2007.06796
Li X, Chen M, Nie J, Liu Z, Feng Z, Cai Y (2018) Coherence-based automated essay scoring using self-attention. In: Sun M, Liu T, Wang X, Liu Z, Liu Y (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2018, NLP-NABD 2018. Lecture Notes in Computer Science, vol 11221. Springer, Cham. https://doi.org/10.1007/978-3-030-01716-3_32
Liang G, On B, Jeong D, Kim H, Choi G (2018) Automated essay scoring: a siamese bidirectional LSTM neural network architecture. Symmetry 10:682
Liu, H., Ye, Y., & Wu, M. (2018, April). Ensemble learning on scoring student essay. In 2018 International Conference on Management and Education, Humanities and Social Sciences (MEHSS 2018). Atlantis Press.
Liu J, Xu Y, Zhao L (2019) Automated essay scoring based on two-stage learning. ArXiv abs/1901.07744
Loukina A, et al. (2015) Feature selection for automated speech scoring. BEA@NAACL-HLT
Loukina A, et al. (2017) Speech- and text-driven features for automated scoring of English-speaking tasks. SCNLP@EMNLP 2017
Loukina A, et al. (2019) The many dimensions of algorithmic fairness in educational applications. BEA@ACL
Lun J, Zhu J, Tang Y, Yang M (2020) Multiple data augmentation strategies for improving performance on automatic short answer scoring. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34(09): 13389–13396
Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).
Madnani N, et al. (2017b) Building better open-source tools to support fairness in automated scoring. EthNLP@EACL
Malinin A, et al. (2016) Off-topic response detection for spontaneous spoken English assessment. ACL
Malinin A, et al. (2017) Incorporating uncertainty into deep learning for spoken language assessment. ACL
Mathias S, Bhattacharyya P (2018a) Thank "Goodness"! A way to measure style in student essays. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications p 35–41
Mathias S, Bhattacharyya P (2018b) ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Mikolov T, et al. (2013) Efficient estimation of word representations in vector space. ICLR
Mohler M, Mihalcea R (2009) Text-to-text semantic similarity for automatic short answer grading. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) p 567–575
Mohler M, Bunescu R, Mihalcea R (2011) Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies p 752–762
Muangkammuen P, Fukumoto F (2020) Multi-task learning for automated essay scoring with sentiment analysis. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop p 116–123
Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1–11.
Palma D, Atkinson J (2018) Coherence-based automatic essay assessment. IEEE Intell Syst 33(5):26–36
Parekh S, et al. (2020) My teacher thinks the world is flat! Interpreting automatic essay scoring mechanism. ArXiv abs/2012.13872
Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).
Persing I, Ng V (2013) Modeling thesis clarity in student essays. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) p 260–269
Powers DE, Burstein JC, Chodorow M, Fowles ME, Kukich K (2001) Stumping E-Rater: challenging the validity of automated essay scoring. ETS Res Rep Ser 2001(1):i–44
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002). Stumping e-rater: challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134.
Ramachandran L, Cheng J, Foltz P (2015) Identifying patterns for short answer scoring using graph-based lexico-semantic text matching. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications p 97–106
Ramanarayanan V, et al. (2017) Human and automated scoring of fluency, pronunciation and intonation during human-machine spoken dialog interactions. INTERSPEECH
Riordan B, Horbach A, Cahill A, Zesch T, Lee C (2017) Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications p 159–168
Riordan B, Flor M, Pugh R (2019) How to account for misspellings: Quantifying the benefit of character representations in neural content scoring models. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Rodriguez P, Jafari A, Ormerod CM (2019) Language models and automated essay scoring. ArXiv abs/1909.09482
Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2).
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetric essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).
Rupp A (2018) Designing, evaluating, and deploying automated scoring systems with validity in mind: methodological design decisions. Appl Meas Educ 31:191–214
Ruseti S, Dascalu M, Johnson AM, McNamara DS, Balyan R, McCarthy KS, Trausan-Matu S (2018) Scoring summaries using recurrent neural networks. In: International Conference on Intelligent Tutoring Systems p 191–201. Springer, Cham
Sakaguchi K, Heilman M, Madnani N (2015) Effective feature integration for automated short answer scoring. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1049–1054
Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., & Suhartono, D. (2019, December). Automated English digital essay grader using machine learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) (pp. 1–6). IEEE.
Shehab A, Elhoseny M, Hassanien AE (2016) A hybrid scheme for automated essay grading based on LVQ and NLP techniques. In: 12th International Computer Engineering Conference (ICENCO), Cairo, 2016, p 65–70
Shermis MD, Mzumara HR, Olson J, Harrington S (2001) On-line grading of student essays: PEG goes on the World Wide Web. Assess Eval High Educ 26(3):247–259
Stab C, Gurevych I (2014) Identifying argumentative discourse structures in persuasive essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) p 46–56
Sultan MA, Salazar C, Sumner T (2016) Fast and easy short answer grading with high accuracy. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies p 1070–1075
Süzen, N., Gorban, A. N., Levesley, J., & Mirkes, E. M. (2020). Automatic short answer grading and feedback using text mining methods. Procedia Computer Science, 169, 726–743.
Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing p 1882–1891
Tashu TM (2020) Off-topic essay detection using C-BGRU siamese. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, p 221–225. https://doi.org/10.1109/ICSC.2020.00046
Tashu TM, Horváth T (2019) A layered approach to automatic essay evaluation using word-embedding. In: McLaren B, Reilly R, Zvacek S, Uhomoibhi J (eds) Computer Supported Education. CSEDU 2018. Communications in Computer and Information Science, vol 1022. Springer, Cham
Tashu TM, Horváth T (2020) Semantic-based feedback recommendation for automatic essay evaluation. In: Bi Y, Bhatia R, Kapoor S (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1038. Springer, Cham
Uto M, Okano M (2020) Robust neural automated essay scoring using item response theory. In: Bittencourt I, Cukurova M, Muldner K, Luckin R, Millán E (eds) Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, vol 12163. Springer, Cham
Wang Z, Liu J, Dong R (2018a) Intelligent auto-grading system. In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) p 430–435. IEEE
Wang Y, et al. (2018b) Automatic essay scoring incorporating rating schema via reinforcement learning. EMNLP
Zhu W, Sun Y (2020) Automated essay scoring system using multi-model machine learning. In: Wyld DC, et al. (eds) MLNLP, BDIOT, ITCCMA, CSITY, DTMN, AIFZ, SIGPRO
Wresch W (1993) The imminence of grading essays by computer: 25 years later. Comput Compos 10:45–58
Wu, X., Knill, K., Gales, M., & Malinin, A. (2020). Ensemble approaches for uncertainty in spoken language assessment.
Xia L, Liu J, Zhang Z (2019) Automatic essay scoring model based on two-layer bi-directional long-short term memory network. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence p 133–137
Yannakoudakis H, Briscoe T, Medlock B (2011) A new dataset and method for automatically grading ESOL texts. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies p 180–189
Zhao S, Zhang Y, Xiong X, Botelho A, Heffernan N (2017) A memory-augmented neural model for automated grading. In: Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale p 189–192
Zupanc K, Bosnic Z (2014) Automated essay evaluation augmented with semantic coherence measures. In: 2014 IEEE International Conference on Data Mining p 1133–1138. IEEE
Zupanc K, Savić M, Bosnić Z, Ivanović M (2017) Evaluating coherence of essays using sentence-similarity networks. In: Proceedings of the 18th International Conference on Computer Systems and Technologies p 65–72
Dzikovska, M. O., Nielsen, R., & Brew, C. (2012, June). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 200–210).
Kumar, N., & Dey, L. (2013, November). Automatic quality assessment of documents with application to essay grading. In 2013 12th Mexican International Conference on Artificial Intelligence (pp. 216–222). IEEE.
Wu, S. H., & Shih, W. F. (2018, July). A short answer grading system in Chinese by support vector approach. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications (pp. 125–129).
Agung Putri Ratna, A., Lalita Luhurkinanti, D., Ibrahim, I., Husna, D., & Dewi Purnamasari, P. (2018). Automatic essay grading system for Japanese language examination using winnowing algorithm. In 2018 International Seminar on Application for Technology of Information and Communication, pp. 565–569. https://doi.org/10.1109/ISEMANTIC.2018.8549789
Sharma, A., & Jayagopi, D. B. (2018). Automated grading of handwritten essays. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 279–284. https://doi.org/10.1109/ICFHR-2018.2018.00056
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1
storage/5JWXXUR3/.zotero-reader-state
Normal file
@@ -0,0 +1 @@
{"pageIndex":0,"scale":312,"top":670,"left":-48,"scrollMode":0,"spreadMode":0}
Binary file not shown.
58
storage/6N9Q6CGV/.zotero-ft-cache
Normal file
@@ -0,0 +1,58 @@
Computer Science > Computation and Language
arXiv:2511.17069 (cs)
[Submitted on 21 Nov 2025]
Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments
Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability -- Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
Comments: 16 pages, 2 figures
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2511.17069 [cs.CL]
(or arXiv:2511.17069v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2511.17069
Submission history
From: Yunsung Kim [view email]
[v1] Fri, 21 Nov 2025 09:19:05 UTC (183 KB)
351
storage/6N9Q6CGV/2511.html
Normal file
File diff suppressed because one or more lines are too long
61
storage/DYJM3EZJ/.zotero-ft-cache
Normal file
@@ -0,0 +1,61 @@
Computer Science > Computation and Language
arXiv:1804.00540 (cs)
[Submitted on 29 Mar 2018]
A Systematic Review of Automated Grammar Checking in English Language
Madhvi Soni, Jitendra Singh Thakur
Grammar checking is the task of detection and correction of grammatical errors in the text. English is the dominating language in the field of science and technology. Therefore, the non-native English speakers must be able to use correct English grammar while reading, writing or speaking. This generates the need of automatic grammar checking tools. So far many approaches have been proposed and implemented. But less efforts have been made in surveying the literature in the past decade. The objective of this systematic review is to examine the existing literature, highlighting the current issues and suggesting the potential directions of future research. This systematic review is a result of analysis of 12 primary studies obtained after designing a search strategy for selecting papers found on the web. We also present a possible scheme for the classification of grammar errors. Among the main observations, we found that there is a lack of efficient and robust grammar checking tools for real time applications. We present several useful illustrations- most prominent are the schematic diagrams that we provide for each approach and a table that summarizes these approaches along different dimensions such as target error types, linguistic dataset used, strengths and limitations of the approach. This facilitates better understandability, comparison and evaluation of previous research.
Comments: 23 pages
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1804.00540 [cs.CL]
(or arXiv:1804.00540v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.1804.00540
Submission history
From: Jitendra Singh Thakur [view email]
[v1] Thu, 29 Mar 2018 10:42:03 UTC (3,362 KB)
356
storage/DYJM3EZJ/1804.html
Normal file
File diff suppressed because one or more lines are too long
281
storage/GIPXBXHB/.zotero-ft-cache
Normal file
@@ -0,0 +1,281 @@
Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments
Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille∗, and Chris Piech∗
Stanford University yunsung@stanford.edu
Abstract
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability – Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) – targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the ANALYTICSCORE framework for short answer scoring as a baseline reference framework for future research. ANALYTICSCORE operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression module for scoring. In terms of scoring accuracy, ANALYTICSCORE outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of ANALYTICSCORE aligns well with that of humans.
1 Introduction
Accurate and credible assessment of knowledge and skills forms the basis for effective decision making in a variety of educational contexts, from student learning and instructional design to program development and policy making (Berman et al., 2019). When the set of knowledge and skills to be gauged involves complex, open-ended problem-solving and communication abilities, AI-driven automated scoring systems can offer rapid, accessible, and scalable alternatives to the otherwise labor-intensive and costly process of training and deploying human scorers (Foltz et al., 2020). Automated scoring systems have been increasingly adopted across various assessment contexts over the past several decades. Today's scoring algorithms achieve acceptable levels of scoring accuracy in various areas of human learning (Whitmer and Beiting-Parrish, 2023, 2024).
Despite progress, automated scoring of open-ended responses has yet to reliably obtain generalizable scoring accuracy across diverse scoring contexts. Even when automated scoring meets acceptable levels of scoring accuracy, errors or biases inherent in the scoring algorithm can profoundly harm student learning and equity, policy evaluation, and public trust (Berman et al., 2019, Pellegrino, 2022). For these reasons, improving transparency and interpretability in automated scoring has now
* Equal Advising
become a moral imperative, not a mere technical preference (Holmes et al., 2022, Khosravi et al., 2022, Memarian and Doleck, 2023, Schlippe et al., 2022). Yet, in spite of the growing research on interpretable and explainable AI as well as its applications specifically within educational assessment, interpretable automated scoring remains mostly confined to academic research with limited adoption in large-scale, real-world assessment (Institute of Education Statistics, 2023, Whitmer and Beiting-Parrish, 2023, 2024).
In this paper, we take a principled approach towards building a practical interpretable automated scoring solution for large-scale assessments. An effective interpretability solution begins by identifying the diverse needs of each stakeholder in understanding the system’s decisions, and by grounding the development of interpretable AI systems in those needs (Bhatt et al., 2020, Páez, 2019, Preece et al., 2018). Research on explainable automated scoring, on the other hand, has largely ignored this need-finding process. As we observe later in this paper (Section 2), this neglect has often led to several claimed interpretability solutions that fail to address the diverse and nuanced interpretability needs of the human actors in educational assessment.
We identify the needs and benefits of model explanations for various large-scale assessment stakeholder groups consisting of test takers, assessment developers, and test users (Section 2.1). Targeted at those needs, we develop the principles of faithful, grounded, traceable, and interchangeable model interpretations for AI-driven automated scoring (Section 2.2).
We further illustrate the feasibility of implementing these principles in practice and establish a concrete baseline for future work (Section 3). ANALYTICSCORE is the first interpretable automated short-answer scoring framework to embody our principles. It operates by extracting explicitly identifiable elements from unannotated response texts and featurizing each response into human-interpretable values based on those elements. These features are input to an intuitive ordinal logistic regression module for scoring.
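To make the scoring step concrete, here is a minimal sketch of a proportional-odds (ordinal logistic regression) scorer over already-extracted interpretable features. This is our illustration of the model class the paper names, not the authors' released code; the feature values and names are hypothetical stand-ins for the output of the featurization step:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_ordinal_logistic(X, y, n_levels):
    """Proportional-odds model: P(y <= k | x) = sigmoid(theta_k - w.x),
    with strictly increasing thresholds across the K-1 score boundaries."""
    n, d = X.shape

    def unpack(params):
        w = params[:d]
        raw = params[d:]
        # First threshold is free; positive softplus increments keep order.
        theta = np.cumsum(np.concatenate(([raw[0]], np.log1p(np.exp(raw[1:])))))
        return w, theta

    def neg_log_likelihood(params):
        w, theta = unpack(params)
        z = X @ w
        cum = expit(theta[None, :] - z[:, None])          # P(y <= k | x)
        cum = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
        p = cum[np.arange(n), y + 1] - cum[np.arange(n), y]
        return -np.sum(np.log(np.clip(p, 1e-12, None)))

    result = minimize(neg_log_likelihood,
                      np.zeros(d + n_levels - 1), method="L-BFGS-B")
    return unpack(result.x)

# Hypothetical interpretable features per response, e.g.
# [mentions_key_concept, gives_example, answer_length_band]:
X = np.array([[1, 1, 2], [1, 0, 1], [0, 0, 1], [0, 1, 0]], dtype=float)
y = np.array([2, 1, 0, 1])  # integer score levels 0..2
w, theta = fit_ordinal_logistic(X, y, n_levels=3)

Because the weights w attach to human-readable features and the thresholds theta mark score boundaries, every scoring decision in such a model can be traced back to named evidence, which is the property the framework is after.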
We measure the performance of ANALYTICSCORE on a real-world response dataset by measuring (1) scoring accuracy and (2) alignment of featurization behaviors with human judgments (Sections 4 and 5). ANALYTICSCORE outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. The featurization behavior of ANALYTICSCORE also aligns well with humans (0.90, 0.72, 0.81 QWK across assessment areas). Our findings indicate strong potential for implementing accurate and well-aligned interpretability solutions that meet the real needs of assessment stakeholders.
Automated Scoring and Interpretable AI As AI-driven automated scoring systems became more complex and opaque, researchers have increasingly noted the need to enhance the transparency of these systems through model explanations (Bauer and Zapata-Rivera, 2020, Bennett and Zhang, 2015, Schlippe et al., 2022). Several approaches have been proposed to address interpretable automated scoring, and we discuss them in detail in Section 2.2 in connection with our four principles. Despite growing research interests, interpretable automated scoring still lacks practical adoption and meaningful field use. The 2023 NAEP Math Automated Scoring Challenge1 for open-ended math responses organized by the US National Center for Education Statistics (NCES) found that none of the submissions met the criteria for “interpretability” despite several methods achieving nearhuman scoring accuracy (Institute of Education Statistics, 2023, Whitmer and Beiting-Parrish, 2024). This gap highlights the need for human-centered explainability solutions driven by assessment stakeholders’ real needs.
|
||||
1https://github.com/NAEP-AS-Challenge/math-prediction
|
||||
2
|
||||
|
||||
|
||||
2 Building the Principles of Interpretable Automated Scoring
Insights derived from scoring support various stakeholders throughout the overall assessment process. Below we analyze three main stakeholder groups in large-scale assessment – test takers, assessment developers, and test users (AERA et al., 2014; Berman et al., 2019). Each stakeholder group's distinct roles and priorities uniquely shape how interpretable automated scoring can address their specific needs and improve their assessment experience.
2.1 Explainability Needs and Potential Benefits
Test Takers  The needs and benefits of interpretable scoring vary depending on the assessment type: summative or formative. Most large-scale assessments are summative assessments – assessments of learning that support evaluating learner achievement, assigning grades, or determining proficiency levels (Harlen, 2005). Because these assessments often drive high-stakes decisions, test takers need to trust the fairness and justifiability of scoring decisions (Williamson et al., 2012). Provided that the scoring algorithm implements sound scoring logic, allowing test takers or their representatives to examine traceable explanations of scoring decisions can foster trust (Bauer and Zapata-Rivera, 2020; Ferrara and Qunbar, 2022). These explanations can also streamline quality control by facilitating the identification and correction of errors, improving the overall integrity of the assessment (see Bennett and Zhang (2015) and Ferrara and Qunbar (2022)).

Formative assessments are assessments for learning, intended to guide and improve learner performance through frequent practice, progress monitoring, and skill diagnosis (Black and Wiliam, 1998; Wiliam, 2011). In this context, the function of automated scoring is primarily to provide timely, effective, and actionable feedback that supports learning (Bennett, 2006; DiCerbo et al., 2020). Effective feedback should help learners understand the discrepancy between their work and a desired outcome (Schwartz et al., 2016). A step-by-step explanation of the features observed in a learner's work, coupled with human-understandable descriptions of how those features were processed, can provide such elaborative feedback.

Assessment Developers  Scoring algorithms should reliably identify evidence of the constructs (target knowledge, skills, and abilities) measured by the task (Bejar et al., 2016). Understanding the types of evidence that an automated scoring algorithm reliably detects also informs other key aspects of assessment design, such as construct selection and task design (Bennett and Bejar, 1998). Model explanations can facilitate this understanding by transparently revealing the features used by the scoring algorithm and its intermediate reasoning steps. Explanations can also help determine which parts of the algorithm can be reused, avoiding the costly and time-consuming process of training a new scoring algorithm for each new task (see DiCerbo et al. (2020)).

Model explanations also yield specific insights into where and how the scoring model can be improved. Scoring models often need to be tuned for various reasons. For instance, models trained on data may reflect biases related to response strategies specific to certain student groups (Ferrara and Qunbar, 2022; Rupp, 2018). Scoring models may also become less stable over time as the test-taker population and/or scoring criteria change (Bejar et al., 2016). Transparent inspection of model decisions helps identify problematic model elements, enabling targeted data collection and modified training objectives to improve the model.

Test Users  Test users – including professionals who select and administer tests, educators, administrators, and policymakers – depend on score reliability and interpretation validity to make system-level decisions or to differentiate instruction; they rely heavily on the integrity and validity of scores to drive decisions (AERA et al., 2014). Model explanations provide concrete evidence to validate the choice of the scoring model2. This includes understanding whether the extracted features and scoring logic fully capture the rubric and the construct definition, and whether the internal structure of the automated scores aligns with the construct of interest (Bennett and Zhang, 2015).

2 More examples of validity arguments on the use of automated scoring can be found in (Bennett and Zhang, 2015, Table 7.7).
2.2 The FGTI Interpretability Principles
We develop four foundational interpretability principles – Faithful, Grounded, Traceable, and Interchangeable (FGTI) – targeting the needs and benefits of large-scale assessment stakeholders from Section 2.1. Our first foundational principle is that explanations should be faithful (Jacovi and Goldberg, 2020). Faithfulness is an important requirement in many high-stakes applications of interpretable AI (Rudin, 2019). Similar expectations extend to assessments, and all of the needs and benefits outlined in Section 2.1 depend crucially on faithfulness.

Principle 1 (Faithful). Explanations of scoring decisions should accurately reflect the computational mechanism behind the scoring model's prediction.

A notable example of unfaithful scoring explanations is text produced by prompting LLMs to generate an explanation (e.g., Lee et al. (2024) and Li et al. (2025)). Stepwise reasoning verbalized by an LLM through prompting strategies such as chain-of-thought (Wei et al., 2022) is not an explanation of its internal computation (Sarkar, 2024) and often fails to reflect the model's true reasoning behavior (Arcuschin et al., 2025; Turpin et al., 2023). Moreover, LLMs are highly sensitive to superficial changes in prompts and input text, frequently exhibiting inconsistent judgments (Wang et al., 2024). Therefore, even when LLMs achieve high scoring accuracy, prompting them for explanations cannot reliably address the stakeholder needs identified in Section 2.1.

Next, the model should use meaningful features that are explicitly linked to each student's work and rely only on those features for the downstream computation.

Principle 2 (Grounded). Initial features computed by the scoring model should represent human-understandable, explicitly identifiable elements of the student work and item task.

Regardless of the routine used to derive those features, the feature values should possess meaning that is understandable to humans and be explicitly grounded in both the student work and the item task. For instance, the cosine similarity of sentence embeddings used as an input feature (e.g., Condor and Pardos (2024)) is less human-understandable than discrete features whose values carry clear, verbalizable meaning. Grounded features address stakeholders' need to scrutinize the features used by scoring engines.

How should the model process these features to ultimately produce a final score? Scoring is inherently an evidentiary reasoning process, in which elements of the student response and item task serve as evidence to support the inference about student knowledge and skills that the score represents (DiCerbo et al., 2020; Mislevy, 2020). Stakeholders need to be able to inspect and interact with the internal structure of the scoring model to ensure soundness, construct relevance, fairness, and model understanding (Section 2.1). To meet this need, the model's evidentiary reasoning process must be decomposable into clear, sequential steps that a human could reliably execute and, if necessary, intervene in. Our next two principles state that the scoring model should be conducive to this decomposition and intervention:

Principle 3 (Traceable). The scoring model should consist of subroutines that each represent a specific, well-defined evidentiary reasoning step on clearly specified inputs.

Principle 4 (Interchangeable). A human should be able to act interchangeably on each of the reasoning subroutines.

Not all intermediate representations calculated by the model need to be understandable by humans, but the reasoning subroutines should collectively account for the entire scoring logic. Moreover, humans should be able to act interchangeably with each decomposed module, replacing module outputs with human-generated results if deemed necessary.

Many proposed interpretability approaches are not grounded, traceable, or interchangeable in the sense described above. These include, for instance, calculating feature importance values (Asazuma et al., 2023; Kumar and Boulanger, 2020, 2021; Schlippe et al., 2022), displaying feature attribution maps (Li et al., 2025; Schlippe et al., 2022), and presenting confidence metrics for scoring decisions (Conijn et al., 2023). Such approaches limit the capacity to thoroughly inspect the model's features and internal structure, which is critical to meeting the needs of the stakeholders.

[Figure 1 image: a three-panel schematic of the ANALYTICSCORE pipeline. Panel 1, "Extract": analytic components C1–C7 are extracted from student responses such as "Panda and koala are specialists because they only eat one thing almost exclusively. But a python can eat many types of foods and survive in many different locations" (components listed: "Pandas and koalas are specialists", "Pythons are generalists", "Pandas and koalas only eat one food", "Pythons eat a variety of food", "Pandas and koalas have limited habitat", "Pythons adapt well to new environments", "Pythons are carnivores"). Panel 2, "Featurize": each response is labeled against each component (✅ "Direct paraphrase of entire statement"; 🔼 "Direct paraphrase of main conceptual elements"; ❌ "No direct paraphrase of the ideas"). Panel 3, "Grade": the weights w1, ..., w7 selected by the feature labels are summed into η, and θ₁ ≤ η < θ₂ yields a final score of 1, together with an explanation of the scoring process: "The response fully states [ ], [ ], and [ ]. It conveys the main elements of [ ]. The total evidence value for the response is S, which qualifies for the final score of 1 out of 2. If instead the response had fully stated [ ] or stated elements of [ ], the response would have qualified for a score of 2."]

Figure 1: Schematic of the ANALYTICSCORE framework. The example question is: "Explain how pandas in China and koalas in Australia are similar, and how they both are different from pythons."
3 A Principled Framework for Interpretable Automated Scoring
How would the FGTI principles be implemented in practice? To illustrate their feasibility and to set a baseline for future research, we present ANALYTICSCORE as a reference framework in the domain of short-answer scoring. In this setting, students write a short 1–5 sentence answer to an assessment item, which is scored with an emphasis on content correctness and demonstrated reasoning (Leacock and Chodorow, 2003; Shermis, 2015). The scoring model has access to a training set of student response texts paired with human-annotated scores (r1, s1), ..., (rn, sn) and possibly additional unannotated responses {rn+1, ..., rm}. The goal at inference time is to predict the score s for a new response r.

ANALYTICSCORE (Figure 1) is a 3-phase, LLM-based framework grounded in our four principles of interpretable automated scoring. Phase 1 identifies explicitly grounded analytic components to be used. Phase 2 catalogs, or featurizes, the presence of these components in student responses. Phase 3 uses the features to compute a score. Phases 1 and 2 depend only on the response texts, without any annotations; human score labels are used only in Phase 3.
3.1 Phase 1: Extracting Analytic Components
With the response texts from the training set (and optionally assessment content), ANALYTICSCORE first extracts a set of analytic components, which are the explicitly identifiable elements of student responses described in Principle 2. In this work, we consider a specific type of component: representative, atomic units of explicit statements, arguments, or claims, as in Figure 1:

[c1, ..., ck] = Extract(r1, ..., rm)

Component extraction is implemented using an LLM with the prompts shown in Figure 2. Having too many analytic components could diminish the interchangeability (Principle 4) of the overall framework by exploding the number of features used in scoring (Lipton, 2018), so we limit extraction to 15 components per request.
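As a concrete illustration, the sketch below shows what a Phase 1 extraction call could look like. The helper name, message layout, and line-based parsing are our assumptions rather than the authors' released code; the prompt abbreviates the Component Extractor Prompt of Figure 2, and the call uses the official OpenAI Python client.

```python
# A minimal sketch of Phase 1 (assumed helper and parsing, illustrative only).
from openai import OpenAI

client = OpenAI()

def extract_components(question: str, responses: list[str], k: int = 15) -> list[str]:
    prompt = (
        f"Students were asked the following question:\n```{question}```\n\n"
        "Here are several examples of student responses to the question:\n\n"
        + "\n".join(f"Student Response: {r}" for r in responses)
        + f"\n\nPlease tell me {k} short, simple, and representative statements, "
          "claims, or arguments that are common across many student responses. "
          "List one per line."
    )
    out = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model lists one component per line, possibly with bullets.
    lines = out.choices[0].message.content.splitlines()
    return [ln.lstrip("-*0123456789. ").strip() for ln in lines if ln.strip()][:k]
```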
3.2 Phase 2: Featurizing Responses

Once the analytic components have been identified, student responses are featurized according to the presence of the components c1, ..., ck in each response r. This step uses a labeling function f(r; c) whose outputs are associated with human-understandable meaning (Principles 2 & 3). The exact label definitions can be specified in natural language. In this work, we explore the following general-purpose labeling function for f(r; c):

f(r; c) =
  2, if r contains a direct paraphrase of c;
  1, if r contains a partial paraphrase of c;
  0, if r does not contain a paraphrase of c.    (1)
We implement f(r; c) using a chain-of-thought (Wei et al., 2022) prompting template shown in Figure 2. Inspired by the self-consistency decoding strategy for LLMs (Wang et al., 2022), we apply a first-to-three aggregation rule to account for possibly diverse interpretations of the labeling criteria when selecting the final output. Easily interpretable one-hot encodings of each f(r; ci) are then concatenated to produce a 3k-dimensional binary featurization of r:

Featurize(r) = OneHot(f(r; c1)) ∥ · · · ∥ OneHot(f(r; ck))
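The following minimal sketch illustrates Phase 2 under our reading of the aggregation rule: `sample_label` stands in for one CoT-prompted LLM call returning the A/B/C choice from the Featurizer Prompt, and "first-to-three" is interpreted as sampling until one label has been selected three times. Names and structure are illustrative assumptions.

```python
# A minimal sketch of Phase 2: labeling with first-to-three aggregation,
# followed by one-hot featurization into a 3k-dimensional binary vector.
from collections import Counter

LABEL_VALUE = {"A": 2, "B": 1, "C": 0}   # direct / partial / no paraphrase

def f(response: str, component: str, sample_label) -> int:
    counts: Counter = Counter()
    while True:
        choice = sample_label(response, component)  # one LLM request
        counts[choice] += 1
        if counts[choice] == 3:                     # first label to reach 3 wins
            return LABEL_VALUE[choice]

def one_hot(v: int) -> list[int]:
    return [int(v == j) for j in (2, 1, 0)]

def featurize(response: str, components: list[str], sample_label) -> list[int]:
    feats: list[int] = []
    for c in components:
        feats.extend(one_hot(f(response, c, sample_label)))
    return feats
```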
Distilling the LLM Featurizer into an Open-Source Model  Using proprietary LLMs for featurization can quickly become too expensive in large-scale assessment settings, especially with many analytic components. To avoid the linearly growing cost of featurization, we distilled the featurizer into a small open-source model via supervised fine-tuning on a subsample of (r, c) pairs, where r is a response from the training set and c is an analytic component from Phase 1. More specifically, we randomly sampled 10k pairs across all 10 items, calculated the featurization labels on these samples using o4-mini, and collected the full LLM requests and outputs generated during this process that aligned with the aggregated final decision. This dataset was used to fine-tune Llama-3.1-8b-Instruct with QLoRA (Dettmers et al., 2023).
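The sketch below indicates, under stated assumptions, how such a QLoRA distillation could be set up with the Hugging Face transformers and peft libraries; the LoRA rank, alpha, and target modules are illustrative choices, not values reported by the paper, and the training loop itself is elided.

```python
# A minimal sketch (assumed setup, not the authors' released training script)
# of QLoRA distillation: 4-bit NF4 quantization plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),  # rank/alpha/targets are illustrative
)
# Then train for 2 epochs (batch size 4, lr 1e-4) on the 10k (response,
# component) prompts, with o4-mini's featurizer outputs as supervision targets.
```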
3 Note that CoT is used solely as a prompting technique; the generated "thoughts" are explicitly discarded.
Component Extractor Prompt:

Students were asked the following question: ```[QUESTION PROMPT]```

Here are several examples of student responses to the question:

Student Response: [RESPONSE TEXT]   (x 1000)

Please tell me 15 short, simple, and representative statement, claims, or arguments that are common across many student responses and that distinguish these student responses from others. Each of these "features" should identify an actual statement or claim made in the student responses, not a general description of it.

When identifying statements or claims, adhere to the following guidelines:
- Use the specific wording mentioned in the student response when it is crucial to distinguishing between responses.
- Ensure that each statement or claim is **atomic and isolated**. This means that the statements and claims should **not combine** multiple ideas, contexts, or supporting details. For example, instead of:
  * "[Statement A], because [Statement B]", or
  * "[Statement A], and [Statement B]", or
  * "[Statement A], so [Statement B]" etc.,
  "[Statement A]" and "[Statement B]" should be listed separately.

Featurizer Prompt:

Students were asked the following question: ```[QUESTION PROMPT]```

Here is a response from a student: ```[RESPONSE]```

Here is a statement: ```[ANALYTIC COMPONENT]```

Choose among the following:
- A: The text contains a *direct* paraphrase of the given statement using clearly synonymous wording.
- B: The text does not contain a direct paraphrase of the entire statement, but it contains direct paraphrases of the main parts of the statement.
- C: The text may potentially imply the given statement, but it does not contain a direct paraphrase of the ideas in the statement.

Limit your answer to 100 words, and format your answer in the following python dictionary format:
{
  "Explanation": "[Why the text disqualifies for other options]",
  "Answer": "A"/"B"/"C"
}

Figure 2: Prompts used in ANALYTICSCORE.
3.3 Phase 3: Logically Traceable Scoring
Based on the featurized responses, a traceable and interchangeable model (Principles 3 and 4) is selected and trained using the labeled response pairs (r1, s1), ..., (rn, sn). Given the ordinal nature of the score categories, we employ the Immediate-Threshold variant of ordinal logistic regression (Pedregosa et al., 2017; Rennie and Srebro, 2005) as our scoring module. Combined with the one-hot featurization from Phase 2, the resulting algorithm calculates the sum of the trained weights selected by each component's feature label, η = Σ_{i=1}^{k} w_{i, f(r; c_i)}, where w are the trained weights. Scores are determined by comparing η to a set of learned thresholds θ_j; the predicted score corresponds to the ordinal category j for which θ_j ≤ η < θ_{j+1}. η can be understood as the "evidence value" used for scoring.
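A minimal sketch of this scoring rule follows; the weights and thresholds shown are illustrative, not fitted values.

```python
# A minimal sketch of the Phase 3 scorer: sum the trained weight for each
# (component, observed label) pair into the evidence value eta, then find the
# ordinal category bracketed by the learned thresholds.
import bisect

def score(labels: list[int], weights: list[dict[int, float]],
          thresholds: list[float]) -> int:
    """labels[i] = f(r; c_i); weights[i][label] = trained weight;
    thresholds = sorted cut points theta_1 < ... < theta_{J-1}."""
    eta = sum(weights[i][lab] for i, lab in enumerate(labels))
    return bisect.bisect_right(thresholds, eta)  # j with theta_j <= eta < theta_{j+1}

weights = [{2: 1.2, 1: 0.5, 0: -0.1}, {2: 0.9, 1: 0.3, 0: -0.4}]
print(score([2, 0], weights, thresholds=[0.0, 1.5]))  # eta = 0.8 -> score 1
```

The entire decision reduces to one sum and one threshold lookup, which is what makes the trace in Figure 1 executable by a human.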
3.4 Analysis of ANALYTICSCORE

An example of ANALYTICSCORE's model explanation is shown in the right panel of Figure 1. By presenting human-understandable features of the response (Principle 2) and the exact decision process (Principle 3), the explanation transparently and faithfully reveals the actual scoring mechanism used (Principle 1). If, based on the explanation, the model is suspected of having made an error (e.g., C6 should be a check, not a triangle), a human inspector can modify the featurization and rerun the scoring algorithm (Principle 4), which is also how the "if instead..." explanation is generated.
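To make the intervention step concrete, here is a small sketch of the "if instead..." counterfactual, reusing `score()` and the illustrative weights and thresholds from the sketch above: a human overrides one feature label and the transparent scorer is simply re-run.

```python
# A minimal sketch of the interchangeability workflow (illustrative values).
def what_if(labels, index, corrected_label, weights, thresholds):
    before = score(labels, weights, thresholds)
    fixed = list(labels)
    fixed[index] = corrected_label          # human-supplied feature value
    after = score(fixed, weights, thresholds)
    return before, after

# e.g. "if the second component had been a full paraphrase (2) rather than
# absent (0)": eta rises from 0.8 to 2.1, crossing the second threshold.
print(what_if([2, 0], 1, 2, weights, thresholds=[0.0, 1.5]))  # -> (1, 2)
```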
The structure of ANALYTICSCORE's scoring model is akin to Concept Bottleneck Models (Koh et al., 2020; Yang et al., 2023) in that we enforce a layer of intermediate representations with human-understandable "concepts." Our framework ensures that the intermediate features have human-understandable values that are associated with explicitly identifiable elements (Principle 2), as opposed to characteristics that are inferred from the response.
Item | Token Len.  | Train | Valid | Test | Assessment Area
Q1   | 47.5 ± 22.2 | 1,341 | 331 | 557 | Science
Q2   | 59.2 ± 22.6 | 1,024 | 254 | 426 | Science
Q3   | 47.9 ± 14.6 | 1,445 | 363 | 406 | Reading (Informational Text)
Q4   | 40.3 ± 15.5 | 1,308 | 349 | 295 | Reading (Informational Text)
Q5   | 25.1 ± 21.5 | 1,459 | 336 | 598 | Science
Q6   | 23.8 ± 22.6 | 1,418 | 379 | 599 | Science
Q7   | 41.3 ± 25.1 | 1,432 | 367 | 599 | Reading (Literature)
Q8   | 53.0 ± 32.6 | 1,446 | 353 | 599 | Reading (Literature)
Q9   | 49.7 ± 36.3 | 1,453 | 345 | 599 | Reading (Informational Text)
Q10  | 41.1 ± 28.5 | 1,314 | 326 | 546 | Science

Table 1: ASAP-SAS dataset detail by item.
4 Evaluating ANALYTICSCORE
Having introduced ANALYTICSCORE and discussed its interpretability, we now evaluate its scoring performance and how well its featurization aligns with human judgments on a real-world response scoring dataset.

Dataset  The ASAP-SAS dataset (Shermis, 2015)4 is the largest publicly available dataset of short-answer responses from schoolchildren, covering 10 different open-ended exam questions. Human raters double-scored each student response and assigned a single number using a 3- or 4-point rubric. The assessment area for each question, as well as the sample sizes and response lengths, are reported in Table 1. We use the original test set and split the public training set into training and validation sets with an 8:2 ratio.

4 https://www.kaggle.com/competitions/asap-sas/data

ANALYTICSCORE Implementation Details  For each assessment item, we used GPT-4.1 as the base LLM and extracted 15 analytic components, except for Q7. This item uses a two-part scoring scheme to separately assess a character trait identified from the reading and its supporting evidence; we extracted 15 analytic components from each part, totaling 30 components. For the featurizer, we experimented with GPT-4.1-mini and Llama-3.1-8B-Instruct as the base LLM, each with temperature settings of 0.7 and 1.0. We distilled the Llama featurizer for 2 epochs using a batch size of 4 and a learning rate of 1e-4. All proprietary LLM calls were made through the official OpenAI API. Fine-tuning was conducted on an Ubuntu 20.04 machine with 2 RTX A6000 GPUs (49 GB memory each), 16 AMD EPYC 9224 24-core processors, and 250 GB of CPU RAM.
4.1 Scoring Accuracy Experiment
We measured scoring accuracy as the quadratic weighted kappa (QWK) between model scores and human scores on the test set, following the convention of the automated scoring literature (Institute of Education Sciences, 2023; Shermis, 2015).
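For reference, QWK can be computed with scikit-learn's `cohen_kappa_score`; the score vectors below are purely illustrative.

```python
# A minimal sketch of the reported metric: quadratic weighted kappa between
# human and model scores on a test set.
from sklearn.metrics import cohen_kappa_score

human = [0, 1, 2, 2, 1, 0, 2, 1]   # illustrative human scores
model = [0, 1, 2, 1, 1, 0, 2, 2]   # illustrative model scores
print(cohen_kappa_score(human, model, weights="quadratic"))
```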
The following baseline models were compared:
Few-Shot Prompting: We few-shot prompt GPT-4.1 with 10 randomly selected responses from each score category, along with a rubric for the score categories.

Supervised Fine-tuned LLM: The following LLM-based classifiers were fine-tuned on the response–score pairs: BERT (Devlin et al., 2019), DeBERTa (He et al., 2020), Llama-3.1-8b, and Llama-3.1-8b-Instruct (Grattafiori et al., 2024). We also fine-tuned Llama-3.1-8B-Instruct with a rubric of the score categories added to the input.

Automated Scorer Baselines: The methods included are AutoSAS (Kumar et al., 2019), AsRRN (Li et al., 2023), and NAM (Condor and Pardos, 2024).

The only baseline method with aspects of interpretability is NAM. This method requires hand-crafting a specific form of rubric describing the key phrases and concepts expected in the response. Using sentence embeddings with n-gram matching as its features, it implements a logistic regression score classifier. To implement this baseline, we replaced the rubrics with the analytic components extracted by ANALYTICSCORE.
4.2 Featurization Alignment Experiment
The feature labeling task described in Figure 2 was designed to produce human-understandable features (Principle 2). But how well does the LLM's featurization behavior align with how humans actually understand this task? Even more fundamentally, how well do humans themselves agree in their understanding of the task?

To answer these questions, we sampled 50 (response, analytic component) pairs for each of the 3 assessment areas. To ensure balanced representation, the sample included a balanced number of pairs from each of the three label categories, as initially determined by the GPT-4.1-mini featurizer. We then asked 7 human annotators to conduct the labeling task on these samples. The human annotators consisted of five volunteers from an R1 university and two of the study's authors. None of the annotators had prior exposure to any of the LLM's featurization outputs. All annotators had advanced academic training (PhD level) and teaching experience, and five of them have been instructors at the primary, secondary, and/or post-secondary level.

The annotators received an oral presentation of the purpose of the study, along with links to 3 Qualtrics forms to be filled out, one for each assessment area. Each form reiterated the study's purpose, explained the task, and presented 50 items to annotate, each containing the context of the assessment item and the same featurizer prompt shown in Figure 2. The overall process took each annotator between 2.5 and 3.5 hours.

The aggregate human label was generated by majority voting (ties resolved randomly). We calculated inter-rater reliability among human labelers (Krippendorff's α) and alignment between each LLM featurizer and the aggregate human labels (QWK and class-wise F1). We report the 95% bootstrap CI of each metric, reweighting the sampling probability to account for the initial balanced sampling of label categories.
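The sketch below shows, under an assumed data layout (one row per annotator, one column per sampled pair), how these quantities can be computed: majority voting with random tie-breaking, Krippendorff's α via the third-party `krippendorff` package, and a plain 95% bootstrap CI for QWK. The paper's prevalence reweighting of the bootstrap sampling is omitted here for brevity.

```python
# A minimal sketch of the aggregation and reliability computations.
import random
from collections import Counter
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

def majority(votes: list[int]) -> int:
    counts = Counter(votes)
    top = max(counts.values())
    return random.choice([lab for lab, n in counts.items() if n == top])

def alpha(ratings: np.ndarray) -> float:
    # ratings: shape (n_annotators, n_items), np.nan for missing labels
    return krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="ordinal")

def qwk_ci(human: np.ndarray, llm: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(human), len(human))  # resample with replacement
        stats.append(cohen_kappa_score(human[idx], llm[idx], weights="quadratic"))
    return np.percentile(stats, [2.5, 97.5])           # 95% bootstrap CI
```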
5 Experiment Results
5.1 Scoring Accuracy Results
Table 2 shows the results of the scoring accuracy experiments. Across items and within each assessment area, ANALYTICSCORE outperforms* several automated scoring baselines on average and, given its interpretability, achieves reasonable performance compared to state-of-the-art black-box models. Except for the untuned Llama featurizer, each ANALYTICSCORE variant outperforms* the few-shot prompting and automated scoring baselines. Compared to the best-performing models in each assessment area, these three ANALYTICSCORE models are, on average, within 0.06 QWK over all items, 0.04 QWK for Science, 0.08 QWK for Reading (Informational Text), and 0.09 QWK for Reading (Literature) items.

Also noticeable is the striking improvement* in the performance of the Llama featurizer post-distillation, with an average increase of 0.13 QWK; the distilled Llama featurizer performs comparably to both GPT-4.1-mini variants. The increase in average QWK is most notable for Science items (+0.19), followed by Reading (Literature) (+0.12) and Reading (Informational Text) (+0.05).

Model | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | All Avg. | Sci Avg. | R(Inf) Avg. | R(Lit) Avg.
Human | 0.95 | 0.93 | 0.77 | 0.75 | 0.95 | 0.93 | 0.96 | 0.86 | 0.84 | 0.87 | 0.88±0.02 | 0.93±0.01 | 0.79±0.03 | 0.91±0.05
ANALYTICSCORE
  w/ GPT-4.1-mini (*) | 0.80 | 0.86 | 0.64 | 0.59 | 0.79 | 0.78 | 0.61 | 0.59 | 0.80 | 0.68 | 0.72±0.03 | 0.78±0.03 | 0.68±0.06 | 0.60±0.01
  w/ Llama-3.1-8b (*) | 0.57 | 0.57 | 0.59 | 0.56 | 0.69 | 0.47 | 0.52 | 0.45 | 0.74 | 0.60 | 0.58±0.03 | 0.58±0.04 | 0.63±0.05 | 0.48±0.04
  + Distillation (*) | 0.80 | 0.82 | 0.68 | 0.59 | 0.81 | 0.76 | 0.62 | 0.59 | 0.78 | 0.64 | 0.71±0.03 | 0.77±0.03 | 0.68±0.06 | 0.60±0.01
Few-shot
  GPT-4.1 | 0.69 | 0.65 | 0.61 | 0.65 | 0.72 | 0.61 | 0.34 | 0.57 | 0.76 | 0.69 | 0.63±0.04 | 0.67±0.02 | 0.68±0.04 | 0.45±0.12
Supervised LLM
  BERT | 0.80 | 0.80 | 0.70 | 0.70 | 0.80 | 0.81 | 0.69 | 0.68 | 0.84 | 0.71 | 0.75±0.02 | 0.79±0.02 | 0.74±0.05 | 0.69±0.01
  DeBERTa | 0.85 | 0.86 | 0.66 | 0.70 | 0.81 | 0.83 | 0.71 | 0.64 | 0.79 | 0.71 | 0.76±0.03 | 0.81±0.03 | 0.72±0.04 | 0.67±0.04
  Llama-3.1-8b Inst. | 0.84 | 0.73 | 0.72 | 0.71 | 0.82 | 0.81 | 0.71 | 0.66 | 0.82 | 0.75 | 0.76±0.02 | 0.79±0.02 | 0.75±0.03 | 0.69±0.02
  w/ rubric | 0.87 | 0.80 | 0.68 | 0.77 | 0.85 | 0.80 | 0.72 | 0.65 | 0.84 | 0.79 | 0.78±0.02 | 0.82±0.02 | 0.76±0.05 | 0.68±0.04
  Llama-3.1-8b | 0.83 | 0.75 | 0.70 | 0.77 | 0.82 | 0.84 | 0.68 | 0.65 | 0.82 | 0.74 | 0.76±0.02 | 0.80±0.02 | 0.76±0.03 | 0.67±0.02
Baseline
  AutoSAS | 0.68 | 0.47 | 0.57 | 0.61 | 0.50 | 0.54 | 0.37 | 0.44 | 0.77 | 0.68 | 0.56±0.04 | 0.57±0.04 | 0.65±0.06 | 0.41±0.04
  AsRRN | 0.60 | 0.43 | 0.57 | 0.60 | 0.61 | 0.64 | 0.59 | 0.51 | 0.71 | 0.66 | 0.59±0.02 | 0.59±0.04 | 0.63±0.04 | 0.55±0.04
  NAM (*) | 0.63 | 0.62 | 0.43 | 0.35 | 0.72 | 0.63 | 0.42 | 0.38 | 0.76 | 0.62 | 0.56±0.05 | 0.64±0.02 | 0.52±0.13 | 0.40±0.02

Table 2: Test-time quadratic weighted kappa (QWK) of scoring models per item, along with averages per assessment area. Best, second-best, and at human-level performance scores are marked respectively. Sci.: Science (Q1, 2, 5, 6, 10). R(Inf): Reading (Informational Text) (Q3, 4, 9). R(Lit): Reading (Literature) (Q7, 8). (*) marks methods that are considered interpretable.
5.2 Featurization Alignment Results
Table 4 displays the Krippendorff's α5 measured among the human raters in conducting the featurization task from Section 3.2. For all assessment areas, we observe 0.667 ≤ α < 0.8, which falls into a range of acceptable inter-rater reliability (Krippendorff, 2018). We interpret this as a good level of rater agreement on the featurization process as defined in this work, while acknowledging that there is still potential to refine and improve the task further.

Next, the alignment of each featurizing model with the majority-vote human ratings is shown in Table 3. Most notably, the distilled Llama featurizer achieves substantially high agreement with the aggregate human features across all assessment areas. Other featurizers also achieve high agreement in Science and Reading (Informational Text), but only moderate agreement in Reading (Literature).

* p < 0.05 for Wilcoxon signed-rank test across all items. Due to small n, no area-specific difference was statistically significant.
5 α ranges between -1 and 1; 0 indicates chance agreement.
Assessment Area | Featurizer Model | QWK | Label Distribution6: 2 / 1 / 0 | Label-wise F1: 2 / 1 / 0

Science | Human | – | 15.32% / 3.70% / 80.98% | –
Science | GPT-4.1-mini | (0.89, 0.89) | 7.56% / 12.59% / 79.85% | (0.83, 0.84) / (0.20, 0.22) / (0.96, 0.96)
Science | o4-mini | (0.94, 0.95) | 14.35% / 8.17% / 77.47% | (0.93, 0.93) / (0.49, 0.51) / (0.98, 0.98)
Science | Llama-3.1-8B (Distilled) | (0.90, 0.90) | 11.54% / 5.27% / 83.19% | (0.89, 0.89) / (0.20, 0.22) / (0.97, 0.97)

Reading (Informational Text) | Human | – | 11.64% / 18.53% / 69.83% | –
Reading (Informational Text) | GPT-4.1-mini | (0.72, 0.72) | 9.82% / 22.35% / 67.83% | (0.68, 0.69) / (0.54, 0.55) / (0.87, 0.88)
Reading (Informational Text) | o4-mini | (0.81, 0.81) | 18.30% / 16.60% / 65.10% | (0.73, 0.74) / (0.68, 0.69) / (0.94, 0.94)
Reading (Informational Text) | Llama-3.1-8B (Distilled) | (0.72, 0.73) | 20.51% / 10.16% / 69.34% | (0.61, 0.62) / (0.24, 0.26) / (0.91, 0.91)

Reading (Literature) | Human | – | 9.48% / 6.57% / 83.95% | –
Reading (Literature) | GPT-4.1-mini | (0.54, 0.56) | 2.60% / 13.86% / 83.54% | (0.45, 0.47) / (0.67, 0.69) / (0.95, 0.95)
Reading (Literature) | o4-mini | (0.52, 0.54) | 7.22% / 6.80% / 85.98% | (0.50, 0.52) / (0.12, 0.14) / (0.92, 0.92)
Reading (Literature) | Llama-3.1-8B (Distilled) | (0.81, 0.81) | 7.22% / 7.52% / 85.26% | (0.83, 0.84) / (0.20, 0.22) / (0.92, 0.92)

Table 3: Alignment between LLM featurizers and the aggregate human featurization obtained by majority voting, for different models and assessment areas. QWK and F1 values are 95% bootstrap CIs.
Assessment Area | Krippendorff's α
Science | (0.718, 0.723)
Reading (Informational Text) | (0.696, 0.700)
Reading (Literature) | (0.669, 0.678)

Table 4: Inter-rater reliability among human raters for the featurization alignment experiment (95% bootstrap CI).
F1 scores and the label distribution for each feature label6 provide more detailed insight and reveal areas for further improvement. Notice that the F1 score is exceptionally high (near or above 0.9) for label 0, and moderate-to-high (0.6–0.93) for label 2, with higher agreement for Science items. Yet alignment for label 1 is moderate-to-low, ranging from 0.68 down to 0.12. We believe this is due to the relatively ambiguous nature of label category 1, coupled with the rarity of label 1 in human ratings. While LLM featurizers achieve high overall alignment with aggregate human featurization, further work is needed to ensure that the labeling task incurs less ambiguity and that the featurizer models match the natural distribution of human labels.

6 For Aggregate Human and o4-mini, prevalence weighting was used to extrapolate the label distribution from the 50 study samples for comparison with the full distribution of all (r, c) pairs.

6 Discussion and Limitations
The principles outlined in Section 2.2 address a specific aspect of "interpretability" in the broader domain of automated scoring. While these principles are foundational to supporting the design of a valid scoring process and fostering trust in the scoring system, simply adhering to the principles does not, on its own, guarantee that the needs and potential benefits of the stakeholders (Section 2.1) will be met.
The educational assessment literature is rife with guidelines, frameworks, and best practices for ensuring that automated scoring properly serves stakeholders' needs and produces valid, reliable, and fair scoring results (e.g., Bejar et al. (2016), Bennett and Bejar (1998), Bennett and Zhang (2015), Williamson et al. (2012)). These studies emphasize that evidence for the validity of automated scoring should be collected throughout the assessment process from a variety of evidentiary sources, such as model features, agreement with human raters, treatment of unusual responses, generalizability of score interpretations, population invariance of scores, and impact on teaching and learning (Bennett and Zhang, 2015).
7 Conclusion
The AI and education research community has yet to produce a practical interpretability solution for automated scoring in large-scale educational assessments, despite a pressing need. To address this challenge, we analyzed the needs and potential benefits of interpretable automated scoring for various assessment stakeholders (test takers, assessment developers, and test users) and developed four foundational principles – Faithful, Grounded, Traceable, and Interchangeable (FGTI) – aimed at addressing those needs. We also demonstrated the feasibility of implementing these principles by developing ANALYTICSCORE for short-answer scoring. This framework generates human-interpretable features for each response based on explicitly identifiable elements and uses an intuitive ordinal logistic regression scorer. On a real-world short-answer scoring dataset, ANALYTICSCORE outperforms many uninterpretable scoring methods, achieves a narrow performance gap relative to the uninterpretable SOTA, and demonstrates featurization behavior that aligns well with human judgment. Our findings show strong promise for implementing accurate and well-aligned interpretability solutions that address the real needs of assessment stakeholders. We hope our work illuminates exciting new directions in developing practical and effective interpretable automated scoring for large-scale educational assessments.
References
AERA, APA, and NCME. The standards for educational and psychological testing. 2014.

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.

Yuya Asazuma, Hiroaki Funayama, Yuichiroh Matsubayashi, Tomoya Mizumoto, Paul Reisert, and Kentaro Inui. Take no shortcuts! Stick to the rubric: A method for building trustworthy short answer scoring models. In International Conference on Higher Education Learning Methodologies and Technologies Online, pages 337–358. Springer, 2023.

Malcolm I Bauer and Diego Zapata-Rivera. Cognitive foundations of automated scoring. In Handbook of Automated Scoring, pages 13–28. Chapman and Hall/CRC, 2020.

Isaac I Bejar, Robert J Mislevy, and Mo Zhang. Automated scoring with validity in mind. The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications, pages 226–246, 2016.

Randy Elliot Bennett. Moving the field forward: Some thoughts on validity and automated scoring. Automated Scoring of Complex Tasks in Computer-Based Testing, pages 403–412, 2006.

Randy Elliot Bennett and Isaac I Bejar. Validity and automated scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17(4):9–17, 1998.

Randy Elliot Bennett and Mo Zhang. Validity and automated scoring. In Technology and Testing, pages 142–173. Routledge, 2015.

Amy I Berman, Michael J Feuer, and James W Pellegrino. What use is educational assessment?, 2019.

Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 648–657, 2020.

Paul Black and Dylan Wiliam. Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1):7–74, 1998.

Aubrey Condor and Zachary Pardos. Explainable automatic grading with neural additive models. In International Conference on Artificial Intelligence in Education, pages 18–31. Springer, 2024.

Rianne Conijn, Patricia Kahr, and Chris CP Snijders. The effects of explanations in automated essay scoring systems on student trust and motivation. Journal of Learning Analytics, 10(1):37–53, 2023.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36:10088–10115, 2023.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

Kristen DiCerbo, Emily Lai, and Matthew Ventura. Assessment design with automated scoring in mind. In Handbook of Automated Scoring, pages 29–48. Chapman and Hall/CRC, 2020.

Steve Ferrara and Saed Qunbar. Validity arguments for AI-based automated scores: Essay scoring as an illustration. Journal of Educational Measurement, 59(3):288–313, 2022.

Peter W Foltz, Duanli Yan, and André A Rupp. The past, present, and future of automated scoring. In Handbook of Automated Scoring, pages 1–10. Chapman and Hall/CRC, 2020.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Wynne Harlen. Teachers' summative practices and assessment for learning – tensions and synergies. Curriculum Journal, 16(2):207–223, 2005.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

Wayne Holmes, Kaska Porayska-Pomsta, Ken Holstein, Emma Sutherland, Toby Baker, Simon Buckingham Shum, Olga C Santos, Mercedes T Rodrigo, Mutlu Cukurova, Ig Ibert Bittencourt, et al. Ethics of AI in education: Towards a community-wide framework. International Journal of Artificial Intelligence in Education, pages 1–23, 2022.

Institute of Education Sciences. Math autoscoring is finally here—let's tap its potential for improving student performance. https://ies.ed.gov/learn/blog/math-autoscoring-finally-here-lets-tap-its-potential-improving-student-performance, Oct 2023. [Accessed: Feb 21, 2025].

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, 2020.

Hassan Khosravi, Simon Buckingham Shum, Guanliang Chen, Cristina Conati, Yi-Shan Tsai, Judy Kay, Simon Knight, Roberto Martinez-Maldonado, Shazia Sadiq, and Dragan Gašević. Explainable artificial intelligence in education. Computers and Education: Artificial Intelligence, 3:100074, 2022.

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR, 2020.

Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 2018.

Vivekanandan Kumar and David Boulanger. Explainable automated essay scoring: Deep learning really has pedagogical value. In Frontiers in Education, volume 5, page 572367. Frontiers Media SA, 2020.

Vivekanandan S Kumar and David Boulanger. Automated essay scoring and the deep learning black box: How are rubric scores determined? International Journal of Artificial Intelligence in Education, 31(3):538–584, 2021.

Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Roger Zimmermann. Get it scored using AutoSAS—an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9662–9669, 2019.

Claudia Leacock and Martin Chodorow. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4):389–405, 2003.

Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming Zhai. Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6:100213, 2024.

Jiazheng Li, Artem Bobrov, David West, Cesare Aloisi, and Yulan He. An automated explainable educational assessment system built on LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29658–29660, 2025.

Zhaohui Li, Susan Lloyd, Matthew Beckman, and Rebecca J Passonneau. Answer-state recurrent relational network (AsRRN) for constructed response assessment and feedback grouping. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3879–3891, 2023.

Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.

Bahar Memarian and Tenzin Doleck. Fairness, accountability, transparency, and ethics (FATE) in artificial intelligence (AI) and higher education: A systematic review. Computers and Education: Artificial Intelligence, 5:100152, 2023.

Robert J Mislevy. An evidentiary-reasoning perspective on automated scoring: Commentary on Part I. In Handbook of Automated Scoring, pages 151–168. Chapman and Hall/CRC, 2020.

Andrés Páez. The pragmatic turn in explainable artificial intelligence (XAI). Minds and Machines, 29(3):441–459, 2019.

Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18(55):1–35, 2017.

James W. Pellegrino. A learning sciences perspective on the design and use of assessment in education. In Cambridge Handbooks in Psychology, pages 238–258. Cambridge University Press, 2022.

Alun Preece, Dan Harborne, Dave Braines, Richard Tomsett, and Supriyo Chakraborty. Stakeholders in explainable AI. arXiv preprint arXiv:1810.00184, 2018.

Jason DM Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, volume 1, pages 1–6. AAAI Press, Menlo Park, CA, 2005.

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.

André A Rupp. Designing, evaluating, and deploying automated scoring systems with validity in mind: Methodological design decisions. Applied Measurement in Education, 31(3):191–214, 2018.

Advait Sarkar. Large language models cannot explain themselves. arXiv preprint arXiv:2405.04382, 2024.

Tim Schlippe, Quintus Stierstorfer, Maurice ten Koppel, and Paul Libbrecht. Explainability in automatic short answer grading. In International Conference on Artificial Intelligence in Education Technology, pages 69–87. Springer, 2022.

Daniel L Schwartz, Jessica M Tsang, and Kristen P Blair. The ABCs of How We Learn: 26 Scientifically Proven Approaches, How They Work, and When to Use Them. WW Norton & Company, 2016.

Mark D Shermis. Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20(1):46–65, 2015.

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440–9450, 2024.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

John Whitmer and Magdalen Beiting-Parrish. Results of NAEP math item automated scoring data challenge & comparison between reading & math challenges. 2023.

John Whitmer and Magdalen Beiting-Parrish. Lessons learned about transparency, fairness, and explainability from two automated scoring challenges. In AI for Education: Bridging Innovation and Responsibility, 2024.

Dylan Wiliam. Embedded Formative Assessment. Solution Tree Press, 2011.

David M Williamson, Xiaoming Xi, and F Jay Breyer. A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1):2–13, 2012.

Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197, 2023.
Binary file not shown.
217
storage/JBP5J79X/.zotero-ft-cache
Normal file
217
storage/JBP5J79X/.zotero-ft-cache
Normal file
@@ -0,0 +1,217 @@
A Layered Software Specification Architecture

M. Snoeck, S. Poelmans, and G. Dedene
Management Information Systems Group, Katholieke Universiteit Leuven, Naamsestraat 69, 3000 Leuven
email: {monique.snoeck, stephan.poelmans, guido.dedene}@econ.kuleuven.ac.be

Abstract. Separation of concerns is a determining factor of the quality of object-oriented software development. Done well, it can provide substantial benefits such as additive rather than invasive change and improved adaptability, customizability, and reuse. In this paper we propose a software architecture that integrates concepts from business process modeling with concepts of object-oriented systems development. The presented architecture is a layered one: the concepts are arranged in successive layers in such a way that each layer only uses concepts of its own layer or of layers below. The guiding principle in the design of this layered architecture is the separation of concerns: on the one hand, workflow aspects are separated from functional support for tasks, and on the other hand, domain modeling concepts are separated from information system support. The concept of events (workflow events, information system events and business events) is used as a bridging concept between the different layers.

1. Introduction

The proponents of object-oriented software development attribute a number of qualities to it, such as improved adaptability, maintainability and reuse. In spite of more than a decade of experience with object-oriented technology, the improvements are not really overwhelming. Although separation of concerns is recognized as a determining factor for the adaptability and maintainability of software, there is not yet a generally accepted way to achieve it. Separation of concerns can be pursued at different levels of abstraction in the software development process. In this paper we present a layered software architecture that represents a separation of concerns at a high level of abstraction: it classifies specifications into different layers such that each layer only relies on concepts of the same layer or of the layers below. At the same time, this architecture outlines a basic implementation architecture. To obtain this layered architecture, we start from the assumption that a full-fledged information systems development method should take all aspects into account: aspects of business process modeling and aspects of the functional part should be linked together. Because most object-oriented analysis and design methods do not yet integrate business process modeling aspects in an adequate way, we motivate this assumption and briefly present the approach in Section 2. In Section 3 we present the four basic layers of our architecture. In Section 4 these layers are further refined. Finally, Section 5 presents some conclusions and topics for further research.

A.H.F. Laender, S.W. Liddle, V.C. Storey (Eds.): ER2000 Conference, LNCS 1920, pp. 454-469, 2000. © Springer-Verlag Berlin Heidelberg 2000
2. The Need for Integration of Workflow Aspects in Information System Modeling

The layered architecture proposed in this paper encompasses all the aspects of IT support in an organization. On the one hand there is the required support for supervising, recording and controlling business activities; on the other hand there is the required functional support for these business activities. The first type of support constitutes the Workflow System, whereas the functional support constitutes the Business Information System. Current object-oriented analysis and design methods focus primarily on the analysis and design of business information systems. In this section we first look at the requirements for adequate business process support. We then argue that current object-oriented analysis and design methods lack a business process view. Finally, we briefly explain how the concept of "business event" can be used to link concepts from business process modeling with object-oriented analysis and design concepts.
2.1 Requirements for an Adequate Workflow System Support

In the literature on workflow modeling, several techniques are proposed to define and represent the structure of a business process (Petri nets, flow charts, etc.). In most cases, the method to be followed is imposed by the vendor of the workflow package [13]. Nevertheless, some general requirements can be put forward that are necessary to be able to model a process:

Requirement 1. Processes and activities need to be defined in a hierarchical manner. Process design typically requires a top-down decomposition of high-level processes into subprocesses, down to atomic activities. It is the division of labor in the organization that determines the subdivision into sub-processes and activities. Activities and tasks might have different meanings in different organizational theories; in the field of workflow modeling, however, both terms are often used interchangeably, and we do the same in this paper.

Requirement 2. The modeling of dependencies between activities and between activities and agents is crucial. The main goal of a workflow system is, in fact, the automation or support of the co-ordination between activities and between activities and agents. Co-ordination can be defined as the management of dependencies [16]. In what way does one activity depend on the results of another activity? The modeling of dependencies constitutes the heart of workflow modeling. The existence of dependencies implies a certain order of execution: some (sub)tasks cannot be performed before previous tasks have been completed, other tasks need to be executed in parallel, and so on.

Requirement 3. Agents are humans or computer applications that are assigned to roles. The interaction between activities and agents also needs to be planned ahead. Agents can be human end-users or computer applications that perform activities. When human agents execute certain tasks, they might be assisted by computer applications that support them. Only when applications are directly coupled to the workflow system are they considered agents. When an application is invoked by the workflow system and performs a certain activity without any intervention of the end-user, it is called an autonomous agent. An agent is called semi-autonomous when it is directly coupled to a workflow system but an intervention of the end-user is still required. Agents are assigned to activities via the construction of roles. A role defines the responsibility for the performance of a (collection of) task(s) [14].

Requirement 4. The specification of the business process (or workflow) needs to be a persistent artefact able to control, supervise and record performed activities. The workflow process is specified as a model in a formal textual and/or visual language. This model specification is used whenever a new workflow instance needs to be created. Each time a workflow instance is created, the persistent workflow model is needed for controlling, supervising and recording the performed activities. Moreover, in order to monitor and improve performance, it is often also required to save the states of instantiated processes that have been enacted. Historical data regarding the actual course of processes can be useful and even necessary to improve the persistent process model.

The dependencies between activities and between activities and agents can be considered the control logic of business processes. The functional part contains the necessary data and the applications that (partly) perform the activities (the non-human agents). The isolation of the control structure from the data and functional structure is a typical characteristic of workflow systems [25].
2.2 The Lack of a Business Process View in Object-Oriented Systems Development
|
||||
Workflow systems and object-oriented technology have undoubtedly been some of the most important domains of interest of information technology over the past decade. Both domains however, have largely evolved independently, and not much research can be found in which workflow modeling principles and concepts have been applied to OO systems development. In object-oriented development, the primary emphasis is on the specification and development of the functional part of the information system, whereas the business process part is largely neglected or supposed to be given [e.g. 2, 3, 4, 6, 7, 11, 19, 23]. Although recently there is an increased support for the software development process and its workflows [9, 18], business process modeling is still treated in a fairly limited way.
|
||||
|
||||
|
||||
A Layered Software Specification Architecture 457
|
||||
In the first place, the top down decomposition of processes (requirement 1) is barely supported in object-oriented development. Functional decomposition, a vital concept from the structured programming world, is often considered as old-fashioned and ineffective by object-oriented developers [26]. One way to introduce some of the business process aspects in information systems analysis is the use of Use Cases [11, 2]. Use cases describe the functional requirements by identifying actors and scenarios of system usage by these actors. As such this technique is a valid candidate to model the interaction between a user and the system. The technique offers some possibilities for modularization by allowing use cases to "include" and "extend" other use cases, although this is not a functional decomposition as described in requirement 1 above. In the UML approach [11, 2], use cases are mainly a support for information system design: they are used for finding objects and determining the systems structure. More importantly, use cases are not intended to model the assignment of agents to activities and the co-ordination between activities (requirement 2) and can therefore not be considered as a workflow modeling technique. In addition the process logic is not designed to be implemented as an (persistent) application (requirement 4). Because of their affinity with petri-nets, activity diagrams are much better suited for modeling activities and their dependencies. Although activity diagrams can be stored as persistent artefacts in a CASE-tool, they are not used to realize a workflow engine that controls, supervises and records the performed activities (requirement 4). Finally, several other dynamic representations (like state transition diagrams and sequence diagrams) are created in the development phase. The process logic in this type of diagrams is however mainly relevant for the functional aspects of the application. In some cases, aspects of a business process can be found in this type of diagrams. However, such business process logic is not explicitly and separately implemented as described in requirement 4 above.
2.3 Advantages to Gain

A separation of concerns is a key element in keeping systems maintainable and adaptable. In current object-oriented system development practice, the organizational aspect of an information system is often not explicitly modeled, and when it is, it is not always taken as an important element in guiding design decisions. By integrating business modeling concepts into object-oriented modeling, the link between the services that an information system has to render and the organizational elements becomes more apparent. This can be an important help in designing more adaptable systems. In addition, when workflow elements are not modeled separately, they are often hidden in the procedural logic of class methods. The explicit separation of workflow elements from process elements that are inherent to the domain or to the procedural logic of an implementation also allows for more adaptable systems. For example, sequence constraints on events that result from the business logic are part of the domain model (e.g. in a library, the return of a copy to the library must be preceded by a borrowing event). These types of sequence constraints are less likely to change over time than sequence constraints that are the result of workflow aspects (e.g. if a member of the library does not show up after five reminders, set all the books (s)he borrowed to the state "lost").
2.4 Using Business Events as Bridging Concept

Events play a central role in the set of layers proposed in the next paragraphs. In most object-oriented approaches, events are subordinate to objects: they only serve as triggers for the execution of an object's method. In the approach proposed below, events are raised to the same level of importance as objects. Indeed, events are a fundamental part of the structure of experience [4]. Events are atomic units of action that represent things that happen in the world. Without events nothing would happen: they are the way information and objects come into existence (creating events), are modified (modifying events) and disappear from our universe of discourse (ending events).

In the context of the business information system that gives the functional support for the workflow system, we make a distinction between business events (also called real-world events) and information-system events such as keystrokes and mouse actions. The separation between these two types of events allows a more user-oriented and task-oriented view of information system design. Business events are those events that occur in the real world, even if there is no information system around. Information-system events are directly related to the presence of a computerized system. They are designed to allow the external user to register the occurrence of, or invoke, a real-world event. For example, the use of an ATM machine to withdraw money from one's account will invoke the business event "withdraw" by means of several information-system events such as "insert card", "enter PIN code", "enter amount", and so on.

Using events as a fundamental concept integrates well with the object-oriented approach, as demonstrated in methods such as Syntropy [4], Catalysis [6], OO-SSADM [19] and MERODE [21, 23]. Business process modeling takes an action- and process-oriented view of the domain. As a result, tasks and activities are easier to formulate in terms of business events than in terms of business objects (which are better suited for modeling structural aspects). From a business modeling perspective, only business events are of particular interest. Information-system events such as keyboard actions and mouse clicks are modeled as elements in the information system, but are not relevant elements in a business process model. As business events appear both in the functional part and in the business process part, they can serve as the bridging concept between workflow activities and information system design.

The figures below represent a meta-model for the concepts used in the proposed system development approach. In a first step, business processes are modeled at a conceptual level by decomposing them down to the activity level and by indicating which business event each activity invokes. The meta-model for business process modeling only shows the BUSINESS PROCESS and WORKFLOW ACTIVITY classes. We assume, however, that the complete set of workflow concepts is modeled in an object-oriented way, as for example in the TriGSflow model [14]. For the functional aspects, the considered domain is modeled at the conceptual level by identifying domain object classes and by indicating by which business events they are affected. As a result, business domain object interaction can be modeled by joint involvement in business event types, rather than by message passing. This makes the domain object interaction scheme more implementation-independent and more adaptable to changed requirements. The effect of an event on a domain object class is recorded in a domain object class method. Fig. 1 shows a (simplified) meta-model relating the modeling concepts at this stage of the specification process. At the highest conceptual level, business events directly link business process modeling concepts to domain modeling concepts.

In a refining step, the business processes and the business domain are analyzed in search of information system support. So, next to the description of the domain of interest in the domain model, we need a specification of the services (also called user functions) that the information system has to render to the prospective users. This part of the specification is closely related to the specification of the workflow model: it is the description of the functional support for the activities of the workflow model. The activities that have to be performed by agents can be further classified as manual, interactive or fully automated. Interactive and automated activities are realized by means of an information system service. In this refined model, the information system services interface the workflow system with the domain model by giving computerized support for the invocation of business events. Fig. 2 represents the meta-model for this more detailed level of specifications.
Fig. 1. Meta-model for conceptual modeling. A BUSINESS PROCESS is decomposed into WORKFLOW ACTIVITYs; a workflow activity invokes BUSINESS EVENTs; a business event affects DOMAIN OBJECT CLASSes through DOMAIN OBJECT CLASS METHODs. The left-hand side holds the business process modeling concepts, the right-hand side the domain modeling concepts.
Fig. 2. Information systems modeling meta-model. A BUSINESS PROCESS is decomposed into WORKFLOW ACTIVITYs, specialized into manual, interactive and automated activities; interactive and automated activities are supported by INFORMATION SERVICEs; an information service invokes BUSINESS EVENTs, which are realised by DOMAIN OBJECT CLASS METHODs affecting DOMAIN OBJECT CLASSes. The left-hand side holds the business process modeling concepts, the right-hand side the business information system modeling concepts.
3. First Set of Basic Layers

The first set of layers is dictated by the separation of the workflow aspects from the functional support of tasks by information system services. Hence, at the highest level of abstraction, we have one layer for the workflow aspects and another layer for the information system aspects. A further refinement of the layers is obtained by applying the principle of model-driven development. Model-driven development is based on the idea that requirements should be captured in different models, according to their origin, as described in the work of Zachman [27, 24] and Maes [15]. Here we retain the separation of domain modeling from information system support modeling.

Some specifications stem from fundamental business requirements (including business objects, business events as well as business constraints). This type of requirement is also valid if there is no information system. Other specifications are typically related to the presence of an information system. They describe the required functional support such as input facilities, generation of reports, EDI formatting, and so on. The specifications of the first type constitute a business domain model that contains the relevant domain knowledge for running a business. On top of this business domain model, an information service model is built as a set of input and output services, offering the desired information functionality to the users of the information system. Output services allow users to extract information from the business domain model and present it in the right format on paper, on a workstation or in an electronic format. Input services allow users to register new or modified information that is relevant for the business.

The model-driven approach can also be applied to the workflow layer. The workflow domain model describes the essential concepts of business process modeling, such as agent, workflow activity, business process, task dependencies, worklists, and so on. It is by populating the workflow domain model with instances of agents, tasks, dependencies, business processes, etc. that the business processes of a particular organization are defined. At the same time, the populated domain model is a persistent model of the organization's business processes (requirement 4), as sketched below. The workflow service layer describes the information system support offered by the workflow system, such as facilities to view a worklist, to add a work item to a worklist, to pass on work items, to register the accomplishment of a task in a worklist, to create new tasks, and so on, but also services that allow one to control, supervise and record performed activities (requirement 4).

As a result, we obtain an architecture with four different layers (Fig. 3). The dynamic aspects of each layer are triggered by four different kinds of events. Workflow-system events are information-system events related to the workflow system. They can, for example, be keyboard actions and mouse clicks captured by the workflow system and aiming at the invocation of a business-process event. Business-process events are real-world events that trigger the dynamic aspects of the workflow domain, such as the creation of a new task, assigning a person to a task, finishing a task and so on. They are related to the organizational aspects of the business. Information-system events trigger the dynamics of the information system. They will be used to invoke the execution of information system services that give functional support for the tasks in the business process model. Finally, the business events trigger the behavior of domain objects. They are the real-world events that constitute the dynamic part of a business.
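
As a minimal illustration of how populating a workflow domain model defines and persistently records an organization's processes, consider the following Python sketch. It is our own illustration, not taken from the paper, and all class names are hypothetical:

# Hypothetical sketch: populating a workflow domain model defines the processes.

class Agent:
    def __init__(self, name):
        self.name = name
        self.worklist = []               # work items assigned to this agent

class Activity:
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)   # dependencies between activities

class BusinessProcess:
    def __init__(self, name, activities):
        self.name = name
        self.activities = activities

# The populated model *is* the persistent process definition (requirement 4):
review = Activity("review claim")
decide = Activity("decide claim", depends_on=[review])
claims = BusinessProcess("claim handling", [review, decide])

Stored in a database, such instances can then be consulted by the workflow engine to control, supervise and record what actually happens.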
Fig. 3. Four basic layers. The Workflow Layer consists of the workflow service model on top of the workflow domain model; the Business Information System Layer consists of the information system services model on top of the business domain model. Each model uses the one below it (workflow service model, workflow domain model, information system services model, business domain model), and the four models are triggered by workflow-system events, business-process events, information-system events and business events respectively.
The link between the workflow layer and the business information system layer is achieved by linking tasks in the workflow domain model with the supporting information system services. Execution of a task or work item means that either this service is automatically invoked by the workflow system (autonomous software agent) or the user invokes the service him- or herself through the business information system user interface.
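
To make this link concrete, here is a minimal Python sketch (again our own illustration, not the paper's implementation; all names are hypothetical) of a workflow activity holding a reference to its supporting information system service:

# Hypothetical sketch of the task-to-service link ("supported by" in Fig. 2).

class InformationService:
    """A unit of functional support offered by the business information system."""
    def __init__(self, name):
        self.name = name

    def execute(self, **params):
        print(f"Executing service '{self.name}' with {params}")

class WorkflowActivity:
    """A task in the workflow domain model, linked to its supporting service."""
    def __init__(self, name, service, automated=False):
        self.name = name
        self.service = service
        self.automated = automated

    def perform(self, **params):
        if self.automated:
            # autonomous software agent: the workflow system invokes the service
            self.service.execute(**params)
        else:
            # interactive task: the user invokes the service via the UI
            print(f"Task '{self.name}': please run service '{self.service.name}'")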
4. Refining the Layers
The four basic layers are each refined into two sub-layers. The business domain layer is subdivided into a business event layer and a business domain objects layer. As mentioned in section 2, it is assumed that business domain objects interact by being jointly involved in business event types rather than through message passing. To realize this kind of interaction, it is assumed that events are broadcast to objects. This means that when an event is invoked, each object that is involved in the event checks whether the constraints imposed by this object on events of that type are satisfied. If all the involved objects accept the event, all corresponding methods in the involved objects are executed simultaneously. This way of communication is similar to communication as defined in the process algebras CSP [10] and ACP [1] and has been formalized in [5, 23]. Message passing is more similar to the CCS process algebra [17].

There exist various mechanisms for the implementation of such synchronous execution of methods. In the layered architecture proposed in this paper, we assume that there is an event handling mechanism that filters the incoming events by checking all the constraints each event must satisfy. If all constraints are satisfied, the event is broadcast to the participating objects; if not, it is rejected. In either case the invoking class is notified accordingly of the rejection, acceptance, and successful or unsuccessful execution of the event. For each type of business event, the event handling layer contains one class that is responsible for handling events of that type. This class will first check the validity of the event and, if appropriate, broadcast the event to all involved objects by means of the method 'broadcast'. This approach to domain object interaction enhances the adaptability of the domain model compared to a conventional approach where domain objects dispatch the event by sending messages to each other.

The information services layer can be further subdivided by separating user interface aspects from transaction aspects. This separation is similar to the classical three-tier architecture, where control logic is also separated from user interface aspects. In this approach, however, the control logic is partly in the transaction layer, partly in the business event layer, and partly in the methods of the business domain objects. User interface objects are responsible for all presentation aspects and for syntactical user input validation. Input transactions invoke one or more business events using parameter values received from the user interface objects. Output transactions query the set of domain objects to retrieve the requested information. The transaction layer can be used to group event invocations according to task requirements. Commit and roll-back features can also be implemented in the transaction layer.

To better illustrate the responsibilities of the different layers, we will exemplify objects in the four business information system layers by considering four examples of information system services for an order handling system. Let us assume that the domain model contains the four object types CUSTOMER, ORDER, ORDER LINE and PRODUCT, ORDER being existence dependent on CUSTOMER, and ORDER LINE being existence dependent on both ORDER and PRODUCT. The corresponding ER-schema is given in Fig. 4.
Fig. 4. ER-schema for the order handling system: CUSTOMER (1)-(M) ORDER (1)-(M) ORDER LINE (M)-(1) PRODUCT.
Business event types are create_customer, modify_customer, end_customer, create_order, modify_order, end_order, create_orderline, modify_orderline, end_orderline, create_product, modify_product, end_product. The object-event table (see Table 1) shows which object types are affected by which types of events and also indicates the type of involvement: C for creation, M for modification and E for terminating an object's life. For example, create_orderline creates a new occurrence of the class ORDERLINE, modifies an occurrence of the class PRODUCT because it requires adjustment of the stock level of the ordered product, modifies the state of the order to which it belongs and modifies the state of the customer of the order. Notice that Table 1 shows a maximal number of object-event involvements. If we do not want to record a state change in the customer object when an order line is added to one of his/her orders, it suffices to simply remove the corresponding object-event participation from the object-event table. Full details of how to construct such an object-event table and validate it against the data model and the behavioral model are beyond the scope of this paper but can be found in [21, 23].
Table 1. Object-event table for the order handling system
                   CUSTOMER   ORDER   ORDERLINE   PRODUCT
 create_customer      C
 modify_customer      M
 end_customer         E
 create_order         M         C
 modify_order         M         M
 end_order            M         E
 create_orderline     M         M         C          M
 modify_orderline     M         M         M          M
 end_orderline        M         M         E          M
 create_product                                      C
 modify_product                                      M
 end_product                                         E
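
Read as code, a row of Table 1 says which domain classes implement a method for that event type. Continuing the hypothetical sketch above (our own illustration; the attributes and rules are invented), the create_orderline row could translate into one method per involved class, with C creating an occurrence and M modifying one:

# Hypothetical domain object classes for the create_orderline row of Table 1.

class Customer:
    def accepts(self, event_type, event):
        return True                       # no constraint on this event type

    def create_orderline(self, event):    # involvement type M
        self.last_activity = event["date"]

class Order:
    def accepts(self, event_type, event):
        return True

    def create_orderline(self, event):    # involvement type M
        self.total = getattr(self, "total", 0) + event["quantity"] * event["unit_price"]

class OrderLine:
    def accepts(self, event_type, event):
        return event["quantity"] > 0      # a business rule checked before broadcast

    def create_orderline(self, event):    # involvement type C: initializes the new occurrence
        self.quantity = event["quantity"]

class Product:
    def accepts(self, event_type, event):
        return getattr(self, "stock", 0) >= event["quantity"]  # enough stock?

    def create_orderline(self, event):    # involvement type M: adjust the stock level
        self.stock = getattr(self, "stock", 0) - event["quantity"]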
We consider four possible information system services: viewing a list of customers, creating a new customer, creating a new order with one or more order lines, and deleting an order. For each of the services we identify relevant objects in each of the business information system layers and explain the interaction between these objects. Fig. 5 represents this graphically.

- Viewing a list of existing customers

This will require an output transaction that queries the set of domain objects and presents the result of the query in a window. The execution of this transaction is invoked by means of user interface objects. Possibly, the user interface can allow the user to enter some search criteria, the syntax of which is validated by the user interface objects. The result of the transaction is passed to user interface objects responsible for presenting the resulting list of customers on screen.

- Creating a new customer

This service requires an input transaction that will invoke the create_customer business event. The service is requested via the user interface layer, which is responsible for accepting user input of parameter values (e.g. customer name, address, phone number) and for syntactical validation of these values (e.g. format of phone number, name must be alphabetical, ...). The user interface objects pass the values to the transaction object Create New Customer in the transaction layer. The transaction object is responsible for creating an occurrence of the create_customer event type and invoking the execution of this event. In the event handling layer, the CREATE_CUSTOMER object is responsible for the further validation of the event against business rules (e.g. checking a uniqueness constraint on customer name and address) and for broadcasting the event to the involved objects in the business domain objects layer. In this case, the create_customer event will create a new occurrence of the CUSTOMER class.
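
Put end to end, the path from the user interface to the domain objects for this service might look like the following sketch (hypothetical names, reusing the EventHandler sketched earlier; this is our illustration, not the paper's code):

# Hypothetical transaction object for the "Creating a new customer" service.

class CreateNewCustomerTransaction:
    """Input transaction: builds a create_customer event and invokes it."""
    def __init__(self, event_handler):
        self.event_handler = event_handler  # the CREATE_CUSTOMER handler

    def run(self, name, address, phone):
        # parameter values arrive from the user interface objects,
        # which have already performed the syntactical validation
        event = {"type": "create_customer", "name": name,
                 "address": address, "phone": phone}
        result = self.event_handler.invoke(event)  # validate + broadcast
        return result                              # "accepted" or "rejected"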
- Creating a new order with one or more order lines

A possible implementation would allow users to enter the required data for an order together with a number of order lines on a single screen. These data are then passed to a multiple-event transaction object in the transaction layer. This example illustrates that a single transaction New Order can be further subdivided into sub-transactions such as Create Order and Add line to Order. The New Order transaction invokes the Create Order sub-transaction once and the Add line to Order sub-transaction one or more times. These sub-transactions in their turn invoke a create_order event and create_orderline events respectively. The integration of an additional service allowing the user to view a list of products that can be ordered (which would be a separate output transaction) must be done in the user interface layer.
- Deleting an order chosen from a list of orders

This service requires the combination of an output transaction that generates a list of orders with an input transaction that invokes the end_order and end_orderline events. Commit and roll-back features can also be implemented in the transaction layer. In this service the transaction should, for example, only be committed if all order lines and the order itself were deleted successfully. If something went wrong during the transaction, e.g. one of the end_orderline events was not invoked and broadcast successfully, the roll-back feature allows all objects to be put back in the state they were in before the transaction was invoked.
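
A multiple-event transaction with such commit and roll-back behavior could be sketched as follows. This is again a hypothetical illustration; snapshot and restore stand in for whatever state-saving mechanism an implementation provides:

# Hypothetical multi-event transaction with commit/roll-back semantics.

class DeleteOrderTransaction:
    """Invokes end_orderline for every line, then end_order; all or nothing."""
    def __init__(self, event_handler_for):
        self.handler_for = event_handler_for  # maps event type -> EventHandler

    def run(self, order, snapshot, restore):
        saved = snapshot()                    # remember state before the transaction
        events = [{"type": "end_orderline", "line": line} for line in order.lines]
        events.append({"type": "end_order", "order": order})
        for event in events:
            if self.handler_for(event["type"]).invoke(event) != "accepted":
                restore(saved)                # roll back: one of the events failed
                return "rolled back"
        return "committed"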
The workflow domain layer and the workflow services layer are subdivided in exactly the same way. As a result, we obtain the four layers of Fig. 6 for the workflow layer.
Fig. 5. Sublayers within the business information system layer: the information systems user interface layer (CREATE CUSTOMER WINDOW, CUSTOMER LIST WINDOW, NEW ORDER WINDOW, DELETE ORDER WINDOW), the information systems transaction layer (Create New Customer, View Customers, New Order with sub-transactions Create Order and Add line to Order, Delete Order with End Order and End Order line), the business events layer (CREATE_CUSTOMER, CR_ORDER, CR_ORDERLINE, END_ORDER, END_ORDERLINE, each with Check-Validity and Broadcast) and the business domain objects layer (CUSTOMER, ORDER, ORDERLINE and PRODUCT, with their attributes and event methods).
5. Conclusion
This paper proposes a layered software specification architecture, guided by the principle of separation of concerns. The first set of layers was obtained by separating workflow aspects from functional support. By explicitly incorporating a workflow layer, we ensure that business process aspects are modeled separately. In the absence of such a layer, control aspects of business processes are often hidden in the objects that constitute the functional support for tasks, which goes against the principle of separation of concerns.

These two layers were refined by separating domain modeling from information system modeling. Defining a domain model is an important requirements engineering step in the development of an information system: all business rules described in the domain model must be supported by the information system. Methods such as JSD [12], OO-SSADM [19], Syntropy [4] and MERODE [23] even explicitly define domain modeling as a separate step in the development process. Interestingly, these methods also recognize events as fundamental modeling concepts. The information system services are modeled as a layer on top of the kernel layer constituted by the domain model. More importantly, information system services are independent units that can be plugged into and out of the system without affecting the underlying domain layer. The services are glued together by means of the user interface layer. Again, this layer can be stripped off without affecting the lower layers.

Fig. 6. Sublayers for the workflow layer: the workflow system user interface layer (NEW PERSON WINDOW, NEW TASK LIST WINDOW, Invoke Information system service), the workflow transaction layer (New Human Agent, New Task, New Task list), the workflow events layer (CR_H_AGENT, CR_TASK, RUN_SOFT_AGENT, each with check_validity and broadcast) and the workflow domain objects layer (HUMAN AGENT, SOFTWARE AGENT and TASK, with their event methods).

The layers for the business information system aspects are an integral part of the MERODE approach to software development, which means that there is about 10 years of real-life experience with the business information system layers. A survey amongst MERODE users reveals that the separation of information system services aspects from domain modeling aspects indeed has a positive impact on modularity and hence on maintenance costs [22]. But according to the same survey, the separation of business knowledge from functional support also has other advantages: it results in a better understanding of the functioning of the business and in more transparent systems.

The business information system layers can easily be compared to classical three-tier architectures. Such architectures have, for example, an application tier which contains the user interface aspects and part of the application logic, a domain tier and a persistent tier [7]. Jacobson [11] identifies three types of objects, presumably located in three corresponding tiers: entity objects (which constitute the domain model), control objects and user interface objects. The main difference with such a three-tier architecture is the identification of different types of control logic. In our approach the control logic is spread across different layers, according to the aspects it refers to. Business rules are stored as methods of domain classes or as general event constraints in the event handling layer. Application control aspects are located in the transaction and user interface layers. Control logic related to the organization of business processes is stored in the definition of workflow domain objects, and finally, the workflow service layer captures control logic related to the use of a workflow system.

The model-driven approach, which is one of the cornerstones of this paper, is derived from the Zachman Information System Architecture [27, 24]. This architecture also contains a scope layer and a technology layer. The proposal of Maes [15] only retains the three lower layers, arguing that scope is a matter of ICT strategy development. As presented in this paper, the architecture does not yet include the third layer, that is to say, the technology aspects. In the Zachman and Sowa architecture [24], data, functionality, network and other aspects are considered as orthogonal dimensions to the four basic dimensions (see Fig. 7). We expect that technology aspects have to be considered as an orthogonal dimension to the architecture of this paper. Indeed, different technology choices can be made for the realization of the different layers. For example, the business domain model and the workflow domain model can be realized with different database management systems and/or different programming languages. Also network aspects can be very different from one layer to another. One approach to dealing with technology aspects is to combine code generation with the reuse of patterns and frameworks, as proposed in [8].
Fig. 7. The extended Zachman framework for information systems architecture: rows SCOPE, ENTERPRISE, SYSTEM and TECHNOLOGY against the columns Data (what), Function (how), Network (where), People (who), Timing (when) and Motivation (why).
References
1. Baeten, J.C.M., Procesalgebra [Process Algebra], Kluwer programmatuurkunde, 1986.
2. Booch, G., Rumbaugh, J., Jacobson, I., The Unified Modeling Language User Guide, Addison Wesley, 1999.
3. Coleman, D. et al., Object-Oriented Development: The FUSION Method, Prentice Hall, 1994.
4. Cook, S., Daniels, J., Designing Object Systems: Object-Oriented Modelling with Syntropy, Prentice Hall, 1994.
5. Dedene, G., Snoeck, M., Formal deadlock elimination in an object oriented conceptual schema, Data and Knowledge Engineering, Vol. 15 (1995), 1-30.
6. D'Souza, D.F., Wills, A.C., Objects, Components and Frameworks with UML: The Catalysis Approach, Addison-Wesley, 1999.
7. Fowler, M., Analysis Patterns: Reusable Object Models, Addison Wesley Longman, 1997.
8. Goebl, W., Improving Productivity in Building Data-Oriented Information Systems - Why Object Frameworks Are Not Enough, Proc. of the 1998 Int'l Conf. on Object-Oriented Information Systems, Paris, 9-11 September, Springer, 1998.
9. Graham, I., Henderson-Sellers, B., Younessi, H., The OPEN Process Specification (OPEN Series), Addison Wesley, 1997.
10. Hoare, C.A.R., Communicating Sequential Processes, Prentice-Hall International, Series in Computer Science, 1985.
11. Jacobson, I., Christerson, M., Jonsson, P. et al., Object-Oriented Software Engineering: A Use Case Driven Approach, Addison Wesley, rev. 4th pr., 1997.
12. Jackson, M.A., System Development, Prentice Hall, Englewood Cliffs, N.J., 1983.
13. Joosten, S., Werkstromen: een overzicht [Workflows: an overview], Informatie, jaargang 37, nr. 9, pp. 519-528.
14. Kappel, G., Lang, P., Rausch-Schott, S., Retschitzegger, W., Workflow management based on objects, rules and roles, Bulletin of the Technical Committee on Data Engineering, 18(1), March 1995, pp. 11-18.
15. Maes, R., Dedene, G., Reframing the Zachman Information System Architecture Framework, Tinbergen Institute, discussion paper TI 96-32/2, 1996.
16. Malone, T.W., Crowston, K., The Interdisciplinary Study of Co-ordination, ACM Computing Surveys, Vol. 26, No. 1, March 1994, pp. 87-119.
17. Milner, R., A Calculus of Communicating Systems, Springer, Berlin, Lecture Notes in Computer Science, 1980.
18. Rational Software Corporation, The Rational Unified Process, http://www.rational.com/
19. Robinson, K., Berrisford, G., Object-Oriented SSADM, Prentice Hall, 1994.
20. Snoeck, M., Poels, G., Improving the Reuse Possibilities of the Behavioral Aspects of Object-Oriented Domain Models, Proc. 19th Int'l Conf. on Conceptual Modeling (ER2000), Salt Lake City, 2000.
21. Snoeck, M., Dedene, G., Existence Dependency: The Key to Semantic Integrity Between Structural and Behavioral Aspects of Object Types, IEEE Transactions on Software Engineering, Vol. 24, No. 4, April 1998, pp. 233-251.
22. Snoeck, M., Dedene, G., Experiences with Object-Oriented Model-Driven Development, Proceedings of the STEP'97 Conference, London, July 1997.
23. Snoeck, M., Dedene, G., Verhelst, M., Depuydt, A.M., Object-Oriented Enterprise Modelling with MERODE, Leuven University Press, 1999.
24. Sowa, J.F., Zachman, J.A., Extending and formalizing the framework for information systems architecture, IBM Systems Journal, 31(3), 1992, 590-616.
25. Vaishnavi, V., Joosten, S., Kuechler, B., Representing Workflow Management Systems with Smart Objects, 1997.
26. Wolber, D., Reviving Functional Decomposition in Object-Oriented Design, JOOP, October 1997, pp. 31-38.
27. Zachman, J.A., A framework for information systems architecture, IBM Systems Journal, 26(3), 1987, 276-292.
1
storage/JBP5J79X/.zotero-reader-state
Normal file
1
storage/JBP5J79X/.zotero-reader-state
Normal file
@@ -0,0 +1 @@
{"pageIndex":15,"scale":266,"top":714,"left":76,"scrollMode":0,"spreadMode":0}
Binary file not shown.
233
storage/KZ5DVECT/.zotero-ft-cache
Normal file
233
storage/KZ5DVECT/.zotero-ft-cache
Normal file
@@ -0,0 +1,233 @@
Automated essay scoring

From Wikipedia, the free encyclopedia
Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification.
Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost.

The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways (i.e. teaching to the test).

History
Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page.[1] In 1966, he argued[2] for the possibility of scoring essays by computer, and in 1968 he published[3] his successful work with a program called Project Essay Grade (PEG). Using the technology of that time, computerized essay scoring would not have been cost-effective,[4] so Page abated his efforts for about two decades. Eventually, Page sold PEG to Measurement Incorporated.
By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling and grammar advice.[5] In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s.[6]

Peter Foltz and Thomas Landauer developed a system using a scoring engine called the Intelligent Essay Assessor (IEA). IEA was first used to score essays in 1997 for their undergraduate courses.[7] It is now a product from Pearson Educational Technologies and used for scoring within a number of commercial products and state and national exams.

IntelliMetric is Vantage Learning's AES engine. Its development began in 1996.[8] It was first used commercially to score essays in 1998.[9]

Educational Testing Service offers "e-rater", an automated essay scoring program. It was first used commercially in February 1999.[10] Jill Burstein was the team leader in its development. ETS's Criterion Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback.

Lawrence Rudner has done some work with Bayesian scoring, and developed a system called BETSY (Bayesian Essay Test Scoring sYstem).[11] Some of his results have been published in print or online, but no commercial system incorporates BETSY as yet.

Under the leadership of Howard Mitzel and Sue Lottridge, Pacific Metrics developed a constructed response automated scoring engine, CRASE. Currently utilized by several state departments of education and in a U.S. Department of Education-funded Enhanced Assessment Grant, Pacific Metrics' technology has been used in large-scale formative and summative assessment environments since 2007.

Measurement Inc. acquired the rights to PEG in 2002 and has continued to develop it.[12]

In 2012, the Hewlett Foundation sponsored a competition on Kaggle called the Automated Student Assessment Prize (ASAP).[13] 201 challenge participants attempted to predict, using AES, the scores that human raters would give to thousands of essays written to eight different prompts. The intent was to demonstrate that AES can be as reliable as human raters, or more so. The competition also hosted a separate demonstration among nine AES vendors on a subset of the ASAP data. Although the investigators reported that the automated essay scoring was as reliable as human scoring,[14] this claim was not substantiated by any statistical tests because some of the vendors required that no such tests be performed as a precondition for their participation.[15] Moreover, the claim that the Hewlett Study demonstrated that AES can be as reliable as human raters has since been strongly contested,[16][17] including by Randy E. Bennett, the Norman O. Frederiksen Chair in Assessment Innovation at the Educational Testing Service.[18] Some of the major criticisms of the study have been that five of the eight datasets consisted of paragraphs rather than essays, that four of the eight datasets were graded by human readers for content only rather than for writing ability, and that rather than measuring human readers and the AES machines against the "true score" (the average of the two readers' scores), the study employed an artificial construct, the "resolved score", which in four datasets consisted of the higher of the two human scores if there was a disagreement. This last practice, in particular, gave the machines an unfair advantage by allowing them to round up for these datasets.[16]

In 1966, Page hypothesized that, in the future, the computer-based judge would be better correlated with each human judge than the other human judges are.[2] Despite criticizing the applicability of this approach to essay marking in general, this hypothesis was supported for marking free-text answers to short questions, such as those typical of the British GCSE system.[19] Results of supervised learning demonstrate that the automatic systems perform well when marking by different human teachers is in good agreement. Unsupervised clustering of answers showed that excellent papers and weak papers formed well-defined clusters, and the automated marking rule for these clusters worked well, whereas marks given by human teachers for the third cluster ('mixed') can be controversial, and the reliability of any assessment of works from the 'mixed' cluster can often be questioned (both human and computer-based).[19]

Different dimensions of essay quality
According to a recent survey,[20] modern AES systems try to score different dimensions of an essay's quality in order to provide feedback to users. These dimensions include the following items:

Grammaticality: following grammar rules
Usage: use of prepositions, word usage
Mechanics: following rules for spelling, punctuation, capitalization
Style: word choice, sentence structure variety
Relevance: how relevant the content is to the prompt
Organization: how well the essay is structured
Development: development of ideas with examples
Cohesion: appropriate use of transition phrases
Coherence: appropriate transitions between ideas
Thesis Clarity: clarity of the thesis
Persuasiveness: convincingness of the major argument

Procedure
From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored.[21] The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters—quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores of new essays.
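
As a toy illustration of this procedure (our own sketch, not any vendor's engine; the features, essays and scores are invented), the following Python fragment fits a least-squares model from surface features to human scores and applies it to a new essay:

# Toy AES sketch: surface features + least-squares fit (invented data).
import numpy as np

def surface_features(essay):
    words = essay.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)
    n_commas = essay.count(",")
    return [1.0, n_words, avg_word_len, n_commas]   # leading 1.0 = intercept term

# Hand-scored training set (invented examples).
training_essays = ["Short essay.",
                   "A somewhat longer essay, with a clause.",
                   "A much longer essay, with several clauses, and more words."]
training_scores = np.array([1.0, 3.0, 5.0])

X = np.array([surface_features(e) for e in training_essays])
coef, *_ = np.linalg.lstsq(X, training_scores, rcond=None)  # fit the model

new_essay = "Another essay, of moderate length, to be scored."
predicted = float(np.array(surface_features(new_essay)) @ coef)
print(f"predicted score: {predicted:.2f}")

Real systems use far richer feature sets and training corpora, but the same fit-then-apply structure.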
Recently, one such mathematical model was created by Isaac Persing and Vincent Ng,[22] which not only evaluates essays on the above features, but also on their argument strength. It evaluates various features of the essay, such as the agreement level of the author and the reasons for it, adherence to the prompt's topic, locations of argument components (major claim, claim, premise), errors in the arguments, and cohesion in the arguments, among various other features. In contrast to the other models mentioned above, this model comes closer to duplicating human insight while grading essays. Due to the growing popularity of deep neural networks, deep learning approaches have been adopted for automated essay scoring, generally obtaining superior results, often surpassing inter-human agreement levels.[23]

The various AES programs differ in what specific surface features they measure, how many essays are required in the training set, and most significantly in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression or other machine learning techniques, often in combination with other statistical techniques such as latent semantic analysis[24] and Bayesian inference.[11]

The automated essay scoring task has also been studied in the cross-domain setting using machine learning models, where the models are trained on essays written for one prompt (topic) and tested on essays written for another prompt. Successful approaches in the cross-domain scenario are based on deep neural networks[25] or models that combine deep and shallow features.[26]

Criteria for success
Any method of assessment must be judged on validity, fairness, and reliability.[27] An instrument is valid if it actually measures the trait that it purports to measure. It is fair if it does not, in effect, penalize or privilege any one class of people. It is reliable if its outcome is repeatable, even when irrelevant external factors are altered.
Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters. If the scores differed by more than one point, a more experienced third rater would settle the disagreement. In this system, there is an easy way to measure reliability: by inter-rater agreement. If raters do not consistently agree within one point, their training may be at fault. If a rater consistently disagrees with how other raters look at the same essays, that rater probably needs extra training.

Various statistics have been proposed to measure inter-rater agreement. Among them are percent agreement, Scott's π, Cohen's κ, Krippendorff's α, Pearson's correlation coefficient r, Spearman's rank correlation coefficient ρ, and Lin's concordance correlation coefficient.
Percent agreement is a simple statistic applicable to grading scales with scores from 1 to n, where usually 4 ≤ n ≤ 6. It is reported as three figures, each a percent of the total number of essays scored: exact agreement (the two raters gave the essay the same score), adjacent agreement (the raters differed by at most one point; this includes exact agreement), and extreme disagreement (the raters differed by more than two points). Expert human graders were found to achieve exact agreement on 53% to 81% of all essays, and adjacent agreement on 97% to 100%.[28]
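
A minimal Python sketch of these three figures, using invented scores on a 1-6 scale:

# Percent agreement between two raters (invented example scores).
rater_a = [4, 3, 5, 2, 4, 6, 3, 4]
rater_b = [4, 4, 5, 4, 3, 6, 3, 1]

n = len(rater_a)
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n  # includes exact
extreme = sum(abs(a - b) > 2 for a, b in zip(rater_a, rater_b)) / n

print(f"exact agreement:      {exact:.0%}")
print(f"adjacent agreement:   {adjacent:.0%}")
print(f"extreme disagreement: {extreme:.0%}")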
Inter-rater agreement can now be applied to measuring the computer's performance. A set of essays is given to two human raters and an AES program. If the computer-assigned scores agree with one of the human raters as well as the raters agree with each other, the AES program is considered reliable. Alternatively, each essay is given a "true score" by taking the average of the two human raters' scores, and the two humans and the computer are compared on the basis of their agreement with the true score.

Some researchers have reported that their AES systems can, in fact, do better than a human. Page made this claim for PEG in 1994.[6] Scott Elliot said in 2003 that IntelliMetric typically outperformed human scorers.[8] AES machines, however, appear to be less reliable than human readers for any kind of complex writing test.[29]

In current practice, high-stakes assessments such as the GMAT are always scored by at least one human. AES is used in place of a second rater. A human rater resolves any disagreements of more than one point.[30]

Criticism

AES has been criticized on various grounds. Yang et al. mention "the over-reliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies."[30] Several critics are concerned that students' motivation will be diminished if they know that no human will read their writing.[31] Among the most telling critiques are reports of intentionally gibberish essays being given high scores.[32]

HumanReaders.Org Petition
On 12 March 2013, HumanReaders.Org launched an online petition, "Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment". Within weeks, the petition gained thousands of signatures, including Noam Chomsky,[33] and was cited in a number of newspapers, including The New York Times,[34] and on a number of education and technology blogs.[35]
The petition describes the use of AES for high-stakes testing as "trivial", "reductive", "inaccurate", "undiagnostic", "unfair" and "secretive".[36]

In a detailed summary of research on AES, the petition site notes, "RESEARCH FINDINGS SHOW THAT no one—students, parents, teachers, employers, administrators, legislators—can rely on machine scoring of essays ... AND THAT machine scoring does not measure, and therefore does not promote, authentic acts of writing."[37]

The petition specifically addresses the use of AES for high-stakes testing and says nothing about other possible uses.

Software

Most resources for automated essay scoring are proprietary.

eRater – published by Educational Testing Service
IntelliMetric – by Vantage Learning
Project Essay Grade[38] – by Measurement, Inc.
comparison of top AI essay grading tools[39]
References

^ Page, E.B. (2003). "Project Essay Grade: PEG", p. 43. In Shermis, Mark D., and Jill Burstein, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Larkey, Leah S., and W. Bruce Croft (2003). "A Text Categorization Approach to Automated Essay Grading", p. 55. In Shermis, Mark D., and Jill Burstein, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Keith, Timothy Z. (2003). "Validity of Automated Essay Scoring Systems", p. 153. In Shermis, Mark D., and Jill Burstein, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
- Shermis, Mark D., Jill Burstein, and Claudia Leacock (2006). "Applications of Computers in Assessment and Analysis of Writing", p. 403. In MacArthur, Charles A., Steve Graham, and Jill Fitzgerald, eds., Handbook of Writing Research. Guilford Press, New York, ISBN 1-59385-190-1
- Attali, Yigal, Brent Bridgeman, and Catherine Trapani (2010). "Performance of a Generic Approach in Automated Essay Scoring", p. 4. Journal of Technology, Learning, and Assessment, 10(3)
- Wang, Jinhao, and Michelle Stallone Brown (2007). "Automated Essay Scoring Versus Human Scoring: A Comparative Study", p. 6. Journal of Technology, Learning, and Assessment, 6(2)
- Bennett, Randy Elliot, and Anat Ben-Simon (2005). "Toward Theoretically Meaningful Automated Essay Scoring", p. 6. Archived 7 October 2007 at the Wayback Machine. Retrieved 19 March 2012.
^ a b Page, E. B. (1966). "The imminence of... grading essays by computer". The Phi Delta Kappan. 47 (5): 238–243. JSTOR 20371545.
^ Page, E.B. (1968). "The Use of the Computer in Analyzing Student Essays", International Review of Education, 14(3), 253-263.
^ Page, E.B. (2003), pp. 44-45.
^ MacDonald, N.H., L.T. Frase, P.S. Gingrich, and S.A. Keenan (1982). "The Writers Workbench: Computer Aids for Text Analysis", IEEE Transactions on Communications, 3(1), 105-110.
^ a b Page, E.B. (1994). "New Computer Grading of Student Prose, Using Modern Concepts and Software", Journal of Experimental Education, 62(2), 127-142.
^ Rudner, Lawrence. "Three prominent writing assessment programs". Archived 9 March 2012 at the Wayback Machine. Retrieved 6 March 2012.
^ a b Elliot, Scott (2003). "IntelliMetric TM: From Here to Validity", p. 75. In Shermis, Mark D., and Jill Burstein, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
^ "IntelliMetric®: How it Works", Vantage Learning. Retrieved 28 February 2012.
^ Burstein, Jill (2003). "The E-rater(R) Scoring Engine: Automated Essay Scoring with Natural Language Processing", p. 113. In Shermis, Mark D., and Jill Burstein, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
^ a b Rudner, Lawrence (ca. 2002). "Computer Grading using Bayesian Networks-Overview". Archived 8 March 2012 at the Wayback Machine. Retrieved 7 March 2012.
^ "Assessment Technologies". Archived 29 December 2011 at the Wayback Machine, Measurement Incorporated. Retrieved 9 March 2012.
^ "Hewlett prize". Archived 30 March 2012 at the Wayback Machine. Retrieved 5 March 2012.
^ "Man and machine: Better writers, better grades". University of Akron. 12 April 2012. Retrieved 4 July 2015.
- Shermis, Mark D., and Jill Burstein, eds. Handbook of Automated Essay Evaluation: Current Applications and New Directions. Routledge, 2013.
^ Rivard, Ry (15 March 2013). "Humans Fight Over Robo-Readers". Inside Higher Ed. Retrieved 14 June 2015.
^ a b Perelman, Les (August 2013). "Critique of Mark D. Shermis & Ben Hamner, 'Contrasting State-of-the-Art Automated Scoring of Essays: Analysis'". Journal of Writing Assessment. 6 (1). Retrieved 13 June 2015.
^ Perelman, L. (2014). "When 'the state of the art is counting words'", Assessing Writing, 21, 104-111.
^ Bennett, Randy E. (March 2015). "The Changing Nature of Educational Assessment". Review of Research in Education. 39 (1): 370–407. doi:10.3102/0091732X14554179. S2CID 145592665.
^ a b Süzen, N.; Mirkes, E. M.; Levesley, J.; Gorban, A. N. (2020). "Automatic short answer grading and feedback using text mining methods". Procedia Computer Science. 169: 726–743. arXiv:1807.10543. doi:10.1016/j.procs.2020.02.171.
^ Ke, Zixuan (9 August 2019). "Automated Essay Scoring: A Survey of the State of the Art" (PDF). Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. pp. 6300–6308. doi:10.24963/ijcai.2019/879. ISBN 978-0-9992411-4-1. Retrieved 11 April 2020.
^ Keith, Timothy Z. (2003), p. 149.
^ Persing, Isaac, and Vincent Ng (2015). "Modeling Argument Strength in Student Essays", pp. 543-552. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Retrieved 22 October 2015.
^ Yang, Ruosong; Cao, Jiannong; Wen, Zhiyuan; Wu, Youzheng; He, Xiaodong (2020). "Enhancing Automated Essay Scoring Performance via Fine-tuning Pre-trained Language Models with Combination of Regression and Ranking". Findings of the Association for Computational Linguistics: EMNLP 2020: 1560–1569. doi:10.18653/v1/2020.findings-emnlp.141. hdl:10397/105512. S2CID 226299478.
^ Bennett, Randy Elliot, and Anat Ben-Simon (2005), p. 7.
^ Cao, Yue; Jin, Hanqi; Wan, Xiaojun; Yu, Zhiwei (25 July 2020). "Domain-Adaptive Neural Automated Essay Scoring". Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '20. New York, NY, USA: Association for Computing Machinery. pp. 1011–1020. doi:10.1145/3397271.3401037. ISBN 978-1-4503-8016-4. S2CID 220730151.
^ Cozma, Mădălina; Butnaru, Andrei; Ionescu, Radu Tudor (2018). "Automated essay scoring with string kernels and word embeddings". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics: 503–509. arXiv:1804.07954. doi:10.18653/v1/P18-2080. S2CID 5070986.
^ Chung, Gregory K.W.K., and Eva L. Baker (2003). "Issues in the Reliability and Validity of Automated Scoring of Constructed Responses", p. 23. In Shermis, Mark D., and Jill Burstein, eds., Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, New Jersey, ISBN 0805839739
^ Elliot, Scott (2003), p. 77.
- Burstein, Jill (2003), p. 114.
^ Bennett, Randy E. (May 2006). "Technology and Writing Assessment: Lessons Learned from the US National Assessment of Educational Progress" (PDF). International Association for Educational Assessment. Archived from the original (PDF) on 24 September 2015. Retrieved 5 July 2015.
- McCurry, D. (2010). "Can machine scoring deal with broad and open writing tests as well as human readers?". Assessing Writing. 15 (2): 118–129. doi:10.1016/j.asw.2010.04.002.
- Bridgeman, R. (2013). In Shermis, Mark D.; Burstein, Jill (eds.), Handbook of Automated Essay Evaluation. New York: Routledge. pp. 221–232.
^ a b Yang, Yongwei, Chad W. Buckendahl, Piotr J. Juszkiewicz, and Dennison S. Bhola (2002). "A Review of Strategies for Validating Computer-Automated Scoring". Archived 13 January 2016 at the Wayback Machine, Applied Measurement in Education, 15(4). Retrieved 8 March 2012.
^ Wang, Jinhao, and Michelle Stallone Brown (2007), pp. 4-5.
- Dikli, Semire (2006). "An Overview of Automated Scoring of Essays". Archived 8 April 2013 at the Wayback Machine, Journal of Technology, Learning, and Assessment, 5(1)
- Ben-Simon, Anat (2007). "Introduction to Automated Essay Scoring (AES)", PowerPoint presentation, Tbilisi, Georgia, September 2007.
^ Winerip, Michael (22 April 2012). "Facing a Robo-Grader? Just Keep Obfuscating Mellifluously". The New York Times. Retrieved 5 April 2013.
^ "Signatures >> Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Archived from the original on 18 November 2019. Retrieved 5 April 2013.
^ Markoff, John (4 April 2013). "Essay-Grading Software Offers Professors a Break". The New York Times. Retrieved 5 April 2013.
- Garner, Richard (5 April 2013). "Professors angry over essays marked by computer". The Independent. Retrieved 5 April 2013.
^ Corrigan, Paul T. (25 March 2013). "Petition Against Machine Scoring Essays, HumanReaders.Org". Teaching & Learning in Higher Ed. Retrieved 5 April 2013.
- Jaffee, Robert David (5 April 2013). "Computers Cannot Read, Write or Grade Papers". Huffington Post. Retrieved 5 April 2013.
^ "Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
^ "Research Findings >> Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
- "Works Cited >> Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment". HumanReaders.Org. Retrieved 5 April 2013.
^ "Assessment Technologies". Archived 24 February 2019 at the Wayback Machine, Measurement, Inc.
^ "Best AI Essay Graders – 2025 Comparison", AI Essay Grader Blog
1
storage/KZ5DVECT/.zotero-reader-state
Normal file
1
storage/KZ5DVECT/.zotero-reader-state
Normal file
@@ -0,0 +1 @@
|
||||
{"scale":1,"scrollYPercent":100}
|
||||
744
storage/KZ5DVECT/index.html
Normal file
744
storage/KZ5DVECT/index.html
Normal file
File diff suppressed because one or more lines are too long
230
storage/PI737J5V/.zotero-ft-cache
Normal file
230
storage/PI737J5V/.zotero-ft-cache
Normal file
@@ -0,0 +1,230 @@
|
||||
arXiv:1907.11692v1 [cs.CL] 26 Jul 2019
|
||||
RoBERTa: A Robustly Optimized BERT Pretraining Approach
|
||||
Yinhan Liu∗§ Myle Ott∗§ Naman Goyal∗§ Jingfei Du∗§ Mandar Joshi† Danqi Chen§ Omer Levy§ Mike Lewis§ Luke Zettlemoyer†§ Veselin Stoyanov§
|
||||
† Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
|
||||
{mandar90,lsz}@cs.washington.edu
|
||||
§ Facebook AI
|
||||
{yinhanliu,myleott,naman,jingfeidu, danqi,omerlevy,mikelewis,lsz,ves}@fb.com
|
||||
Abstract
|
||||
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.1
|
||||
1 Introduction
|
||||
Self-training methods such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019), and XLNet (Yang et al., 2019) have brought significant performance gains, but it can be challenging to determine which aspects of the methods contribute the most. Training is computationally expensive, limiting the amount of tuning that can be done, and is often done with private training data of varying sizes, limiting our ability to measure the effects of the modeling advances.
|
||||
∗Equal contribution. 1Our models and code are available at:
|
||||
https://github.com/pytorch/fairseq
|
||||
We present a replication study of BERT pretraining (Devlin et al., 2019), which includes a careful evaluation of the effects of hyperparameter tuning and training set size. We find that BERT was significantly undertrained and propose an improved recipe for training BERT models, which we call RoBERTa, that can match or exceed the performance of all of the post-BERT methods. Our modifications are simple; they include: (1) training the model longer, with bigger batches, over more data; (2) removing the next sentence prediction objective; (3) training on longer sequences; and (4) dynamically changing the masking pattern applied to the training data. We also collect a large new dataset (CC-NEWS) of comparable size to other privately used datasets, to better control for training set size effects. When controlling for training data, our improved training procedure improves upon the published BERT results on both GLUE and SQuAD. When trained for longer over additional data, our model achieves a score of 88.5 on the public GLUE leaderboard, matching the 88.4 reported by Yang et al. (2019). Our model establishes a new state-of-the-art on 4/9 of the GLUE tasks: MNLI, QNLI, RTE and STS-B. We also match state-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT’s masked language model training objective is competitive with other recently proposed training objectives such as perturbed autoregressive language modeling (Yang et al., 2019).2 In summary, the contributions of this paper are: (1) We present a set of important BERT design choices and training strategies and introduce
|
||||
2It is possible that these other methods could also improve with more tuning. We leave this exploration to future work.
|
||||
|
||||
|
||||
alternatives that lead to better downstream task performance; (2) We use a novel dataset, CC-NEWS, and confirm that using more data for pretraining further improves performance on downstream tasks; (3) Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods. We release our model, pretraining and fine-tuning code implemented in PyTorch (Paszke et al., 2017).
|
||||
2 Background
|
||||
In this section, we give a brief overview of the BERT (Devlin et al., 2019) pretraining approach and some of the training choices that we will examine experimentally in the following section.
|
||||
2.1 Setup
|
||||
BERT takes as input a concatenation of two segments (sequences of tokens), x_1, …, x_N and y_1, …, y_M. Segments usually consist of more than one natural sentence. The two segments are presented as a single input sequence to BERT with special tokens delimiting them: [CLS], x_1, …, x_N, [SEP], y_1, …, y_M, [EOS].
|
||||
M and N are constrained such that M + N < T, where T is a parameter that controls the maximum sequence length during training. The model is first pretrained on a large unlabeled text corpus and subsequently finetuned using end-task labeled data.
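As an illustration of this input construction, a minimal Python sketch follows; the token strings and the truncation policy are illustrative assumptions, not the paper's implementation.

```python
def bert_input(segment_x, segment_y, max_len=512):
    """Concatenate two token segments with BERT-style special delimiters.

    Enforces M + N < T by truncating the second segment if needed."""
    budget = max_len - 3 - len(segment_x)          # room for [CLS]/[SEP]/[EOS]
    segment_y = segment_y[:max(budget, 0)]
    return ["[CLS]"] + segment_x + ["[SEP]"] + segment_y + ["[EOS]"]

print(bert_input(["the", "cat", "sat"], ["it", "slept"]))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'slept', '[EOS]']
```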
|
||||
2.2 Architecture
|
||||
BERT uses the now ubiquitous transformer architecture (Vaswani et al., 2017), which we will not review in detail. We use a transformer architecture with L layers. Each block uses A self-attention heads and hidden dimension H.
|
||||
2.3 Training Objectives
|
||||
During pretraining, BERT uses two objectives: masked language modeling and next sentence prediction.
|
||||
Masked Language Model (MLM) A random sample of the tokens in the input sequence is selected and replaced with the special token [MASK ]. The MLM objective is a cross-entropy loss on predicting the masked tokens. BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK ], 10% are left unchanged,
|
||||
and 10% are replaced by a randomly selected vocabulary token. In the original implementation, random masking and replacement is performed once in the beginning and saved for the duration of training, although in practice, data is duplicated so the mask is not always the same for every training sentence (see Section 4.1).
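The 80/10/10 corruption rule described above can be sketched as follows; the toy vocabulary and token-level (rather than subword-id) implementation are assumptions for readability.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "dog", "ran", "fast", "home"]      # toy vocabulary

def mlm_mask(tokens, mask_prob=0.15):
    """Apply the 80/10/10 corruption rule; return inputs and labels."""
    tokens, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:            # uniformly select ~15%
            labels[i] = tok                        # loss is computed only here
            r = random.random()
            if r < 0.8:
                tokens[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(VOCAB)   # 10%: random token
            # else: remaining 10% keep the original token unchanged
    return tokens, labels
```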
|
||||
Next Sentence Prediction (NSP) NSP is a binary classification loss for predicting whether two segments follow each other in the original text. Positive examples are created by taking consecutive sentences from the text corpus. Negative examples are created by pairing segments from different documents. Positive and negative examples are sampled with equal probability. The NSP objective was designed to improve performance on downstream tasks, such as Natural Language Inference (Bowman et al., 2015), which require reasoning about the relationships between pairs of sentences.
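A minimal sketch of this pair construction, assuming documents are represented as lists of token segments and that the corpus holds at least two documents:

```python
import random

def nsp_example(doc, corpus):
    """Draw one NSP pair: consecutive segments from `doc` with label 1
    (p = 0.5), otherwise a segment from a different document with label 0."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1               # positive: consecutive
    other = random.choice([d for d in corpus if d is not doc])
    return doc[i], random.choice(other), 0         # negative: cross-document
```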
|
||||
2.4 Optimization
|
||||
BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.
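In PyTorch terms this configuration corresponds roughly to the sketch below; the `model` is a stand-in module and the schedule helper is illustrative, not the released training code.

```python
import torch

model = torch.nn.Linear(768, 768)                  # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,              # peak rate after warmup
                             betas=(0.9, 0.999),   # β1, β2
                             eps=1e-6,
                             weight_decay=0.01)    # L2-style decay, as in BERT

def learning_rate(step, peak=1e-4, warmup=10_000, total=1_000_000):
    """Linear warmup over the first 10k steps, then linear decay to zero."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)
```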
|
||||
2.5 Data
|
||||
BERT is trained on a combination of BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA, which totals 16GB of uncompressed text.3
|
||||
3 Experimental Setup
|
||||
In this section, we describe the experimental setup for our replication study of BERT.
|
||||
3.1 Implementation
|
||||
We reimplement BERT in FAIRSEQ (Ott et al., 2019). We primarily follow the original BERT
|
||||
3Yang et al. (2019) use the same dataset but report having only 13GB of text after data cleaning. This is most likely due to subtle differences in cleaning of the Wikipedia data.
|
||||
|
||||
|
||||
optimization hyperparameters, given in Section 2, except for the peak learning rate and number of warmup steps, which are tuned separately for each setting. We additionally found training to be very sensitive to the Adam epsilon term, and in some cases we obtained better performance or improved stability after tuning it. Similarly, we found setting β2 = 0.98 to improve stability when training with large batch sizes. We pretrain with sequences of at most T = 512 tokens. Unlike Devlin et al. (2019), we do not randomly inject short sequences, and we do not train with a reduced sequence length for the first 90% of updates. We train only with full-length sequences. We train with mixed precision floating point arithmetic on DGX-1 machines, each with 8 × 32GB Nvidia V100 GPUs interconnected by Infiniband (Micikevicius et al., 2018).
|
||||
3.2 Data
|
||||
BERT-style pretraining crucially relies on large quantities of text. Baevski et al. (2019) demonstrate that increasing data size can result in improved end-task performance. Several efforts have trained on datasets larger and more diverse than the original BERT (Radford et al., 2019; Yang et al., 2019; Zellers et al., 2019). Unfortunately, not all of the additional datasets can be publicly released. For our study, we focus on gathering as much data as possible for experimentation, allowing us to match the overall quality and quantity of data as appropriate for each comparison. We consider five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text. We use the following text corpora:
|
||||
• BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA. This is the original data used to train BERT. (16GB).
|
||||
• CC-NEWS, which we collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). The data contains 63 million English news articles crawled between September 2016 and February 2019. (76GB after filtering).4
|
||||
• OPENWEBTEXT (Gokaslan and Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019). The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB).5
|
||||
4We use news-please (Hamborg et al., 2017) to collect and extract CC-NEWS. CC-NEWS is similar to the REALNEWS dataset described in Zellers et al. (2019).
|
||||
|
||||
• STORIES, a dataset introduced in Trinh and Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB).
|
||||
3.3 Evaluation
|
||||
Following previous work, we evaluate our pretrained models on downstream tasks using the following three benchmarks.
|
||||
GLUE The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) is a collection of 9 datasets for evaluating natural language understanding systems.6 Tasks are framed as either single-sentence classification or sentence-pair classification tasks. The GLUE organizers provide training and development data splits as well as a submission server and leaderboard that allows participants to evaluate and compare their systems on private held-out test data. For the replication study in Section 4, we report results on the development sets after finetuning the pretrained models on the corresponding single-task training data (i.e., without multi-task training or ensembling). Our finetuning procedure follows the original BERT paper (Devlin et al., 2019). In Section 5 we additionally report test set results obtained from the public leaderboard. These results depend on several task-specific modifications, which we describe in Section 5.1.
|
||||
SQuAD The Stanford Question Answering Dataset (SQuAD) provides a paragraph of context and a question. The task is to answer the question by extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1 and V2.0 (Rajpurkar et al., 2016, 2018). In V1.1 the context always contains an answer, whereas in
|
||||
5The authors and their affiliated institutions are not in any way affiliated with the creation of the OpenWebText dataset. 6The datasets are: CoLA (Warstadt et al., 2018), Stanford Sentiment Treebank (SST) (Socher et al., 2013), Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), Semantic Textual Similarity Benchmark (STS) (Agirre et al., 2007), Quora Question Pairs (QQP) (Iyer et al., 2016), Multi-Genre NLI (MNLI) (Williams et al., 2018), Question NLI (QNLI) (Rajpurkar et al., 2016), Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and Winograd NLI (WNLI) (Levesque et al., 2011).
|
||||
|
||||
|
||||
V2.0 some questions are not answered in the provided context, making the task more challenging. For SQuAD V1.1 we adopt the same span prediction method as BERT (Devlin et al., 2019). For SQuAD V2.0, we add an additional binary classifier to predict whether the question is answerable, which we train jointly by summing the classification and span loss terms. During evaluation, we only predict span indices on pairs that are classified as answerable.
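One conventional way to write this joint objective as code (a sketch; the tensor shapes and the halving of the start/end terms are assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def squad_v2_loss(start_logits, end_logits, ans_logits,
                  start_pos, end_pos, is_answerable):
    """Sum of the span loss and the answerability classification loss.

    start/end_logits: (batch, seq_len); ans_logits: (batch,);
    start/end_pos: (batch,) token indices; is_answerable: (batch,) 0/1."""
    span = (F.cross_entropy(start_logits, start_pos)
            + F.cross_entropy(end_logits, end_pos)) / 2
    cls = F.binary_cross_entropy_with_logits(ans_logits,
                                             is_answerable.float())
    return span + cls
```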
|
||||
RACE The ReAding Comprehension from Examinations (RACE) (Lai et al., 2017) task is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle and high school students. In RACE, each passage is associated with multiple questions. For every question, the task is to select one correct answer from four options. RACE has significantly longer context than other popular reading comprehension datasets and the proportion of questions that requires reasoning is very large.
|
||||
4 Training Procedure Analysis
|
||||
This section explores and quantifies which choices are important for successfully pretraining BERT models. We keep the model architecture fixed.7 Specifically, we begin by training BERT models with the same configuration as BERTBASE (L = 12, H = 768, A = 12, 110M params).
|
||||
4.1 Static vs. Dynamic Masking
|
||||
As discussed in Section 2, BERT relies on randomly masking and predicting tokens. The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training. We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model. This becomes crucial when pretraining for more steps or with larger datasets.
|
||||
7Studying architectural changes, including larger architectures, is an important area for future work.
|
||||
Masking                   SQuAD 2.0   MNLI-m   SST-2
reference                 76.3        84.3     92.8
Our reimplementation:
  static                  78.3        84.3     92.5
  dynamic                 78.7        84.0     92.9
|
||||
Table 1: Comparison between static and dynamic masking for BERTBASE . We report F1 for SQuAD and accuracy for MNLI-m and SST-2. Reported results are medians over 5 random initializations (seeds). Reference results are from Yang et al. (2019).
|
||||
Results Table 1 compares the published BERTBASE results from Devlin et al. (2019) to our reimplementation with either static or dynamic masking. We find that our reimplementation with static masking performs similarly to the original BERT model, and dynamic masking is comparable to or slightly better than static masking. Given these results and the additional efficiency benefits of dynamic masking, we use dynamic masking in the remainder of the experiments.
|
||||
4.2 Model Input Format and Next Sentence Prediction
|
||||
In the original BERT pretraining procedure, the model observes two concatenated document segments, which are either sampled contiguously from the same document (with p = 0.5) or from distinct documents. In addition to the masked language modeling objective, the model is trained to predict whether the observed document segments come from the same or distinct documents via an auxiliary Next Sentence Prediction (NSP) loss. The NSP loss was hypothesized to be an important factor in training the original BERT model. Devlin et al. (2019) observe that removing NSP hurts performance, with significant performance degradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the necessity of the NSP loss (Lample and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019). To better understand this discrepancy, we compare several alternative training formats:
|
||||
• SEGMENT-PAIR+NSP: This follows the original input format used in BERT (Devlin et al., 2019), with the NSP loss. Each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens.
|
||||
|
||||
|
||||
Model                         SQuAD 1.1/2.0   MNLI-m   SST-2   RACE
Our reimplementation (with NSP loss):
  SEGMENT-PAIR                90.4/78.7       84.0     92.9    64.2
  SENTENCE-PAIR               88.7/76.2       82.9     92.1    63.0
Our reimplementation (without NSP loss):
  FULL-SENTENCES              90.4/79.1       84.7     92.5    64.8
  DOC-SENTENCES               90.6/79.7       84.7     92.7    65.6
BERTBASE                      88.5/76.3       84.3     92.8    64.3
XLNetBASE (K = 7)             –/81.3          85.8     92.7    66.1
XLNetBASE (K = 6)             –/81.0          85.6     93.4    66.7
|
||||
Table 2: Development set results for base models pretrained over BOOKCORPUS and WIKIPEDIA. All models are trained for 1M steps with a batch size of 256 sequences. We report F1 for SQuAD and accuracy for MNLI-m, SST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERTBASE and XLNetBASE are from Yang et al. (2019).
|
||||
• SENTENCE-PAIR+NSP: Each input contains a pair of natural sentences, either sampled from a contiguous portion of one document or from separate documents. Since these inputs are significantly shorter than 512 tokens, we increase the batch size so that the total number of tokens remains similar to SEGMENT-PAIR+NSP. We retain the NSP loss.
|
||||
• FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents. We remove the NSP loss. (A packing sketch follows this list.)
|
||||
• DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so we dynamically increase the batch size in these cases to achieve a similar number of total tokens as FULLSENTENCES. We remove the NSP loss.
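The FULL-SENTENCES packing described above might look like this in outline; the separator token and the buffer policy are assumptions, and every sentence is assumed shorter than the length budget.

```python
def pack_full_sentences(docs, sep="[SEP]", max_len=512):
    """Fill each training input with whole sentences, crossing document
    boundaries, with an extra separator token between documents."""
    inputs, buf = [], []
    for doc in docs:                                # doc: list of token lists
        for sent in doc:
            if len(buf) + len(sent) > max_len:      # flush before overflowing
                inputs.append(buf)
                buf = []
            buf.extend(sent)
        if len(buf) + 1 > max_len:                  # room for the boundary marker
            inputs.append(buf)
            buf = []
        buf.append(sep)                             # document boundary marker
    if buf:
        inputs.append(buf)
    return inputs
```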
|
||||
Results Table 2 shows results for the four different settings. We first compare the original SEGMENT-PAIR input format from Devlin et al. (2019) to the SENTENCE-PAIR format; both formats retain the NSP loss, but the latter uses single sentences. We find that using individual sentences hurts performance on downstream tasks, which we hypothesize is because the model is not able to learn long-range dependencies.
|
||||
We next compare training without the NSP loss and training with blocks of text from a single document (DOC-SENTENCES). We find that this setting outperforms the originally published BERTBASE results and that removing the NSP loss matches or slightly improves downstream task performance, in contrast to Devlin et al. (2019). It is possible that the original BERT implementation may only have removed the loss term while still retaining the SEGMENT-PAIR input format. Finally we find that restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than packing sequences from multiple documents (FULL-SENTENCES). However, because the DOC-SENTENCES format results in variable batch sizes, we use FULLSENTENCES in the remainder of our experiments for easier comparison with related work.
|
||||
4.3 Training with large batches
|
||||
Past work in Neural Machine Translation has shown that training with very large mini-batches can both improve optimization speed and end-task performance when the learning rate is increased appropriately (Ott et al., 2018). Recent work has shown that BERT is also amenable to large batch training (You et al., 2019). Devlin et al. (2019) originally trained BERTBASE for 1M steps with a batch size of 256 sequences. This is equivalent in computational cost, via gradient accumulation, to training for 125K steps with a batch size of 2K sequences, or for 31K steps with a batch size of 8K.
|
||||
|
||||
|
||||
bsz   steps   lr     ppl    MNLI-m   SST-2
256   1M      1e-4   3.99   84.7     92.7
2K    125K    7e-4   3.68   85.2     92.9
8K    31K     1e-3   3.77   84.6     92.8
|
||||
Table 3: Perplexity on held-out training data (ppl) and development set accuracy for base models trained over BOOKCORPUS and WIKIPEDIA with varying batch sizes (bsz). We tune the learning rate (lr) for each setting. Models make the same number of passes over the data (epochs) and have the same computational cost.
|
||||
In Table 3 we compare perplexity and end-task performance of BERTBASE as we increase the batch size, controlling for the number of passes through the training data. We observe that training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy. Large batches are also easier to parallelize via distributed data parallel training,8 and in later experiments we train with batches of 8K sequences. Notably, You et al. (2019) train BERT with even larger batch sizes, up to 32K sequences. We leave further exploration of the limits of large batch training to future work.
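The batch-size/step equivalence above rests on gradient accumulation, sketched below with a toy model and data (names and sizes are illustrative).

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]

accum = 8                                          # 8 micro-batches per update
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum  # keep the mean
    loss.backward()                                # gradients sum in .grad
    if (step + 1) % accum == 0:
        optimizer.step()                           # one update per large batch
        optimizer.zero_grad()
```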
|
||||
4.4 Text Encoding
|
||||
Byte-Pair Encoding (BPE) (Sennrich et al., 2016) is a hybrid between character- and word-level representations that allows handling the large vocabularies common in natural language corpora. Instead of full words, BPE relies on subword units, which are extracted by performing statistical analysis of the training corpus. BPE vocabulary sizes typically range from 10K–100K subword units. However, Unicode characters can account for a sizeable portion of this vocabulary when modeling large and diverse corpora, such as the ones considered in this work. Radford et al. (2019) introduce a clever implementation of BPE that uses bytes instead of Unicode characters as the base subword units. Using bytes makes it possible to learn a subword vocabulary of a modest size (50K units) that can still encode any input text without introducing any “unknown” tokens.
|
||||
8Large batch training can improve training efficiency even without large scale parallel hardware through gradient accumulation, whereby gradients from multiple mini-batches are accumulated locally before each optimization step. This functionality is supported natively in FAIRSEQ (Ott et al., 2019).
|
||||
The original BERT implementation (Devlin et al., 2019) uses a character-level BPE vocabulary of size 30K, which is learned after preprocessing the input with heuristic tokenization rules. Following Radford et al. (2019), we instead consider training BERT with a larger byte-level BPE vocabulary containing 50K subword units, without any additional preprocessing or tokenization of the input. This adds approximately 15M and 20M additional parameters for BERTBASE and BERTLARGE, respectively. Early experiments revealed only slight differences between these encodings, with the Radford et al. (2019) BPE achieving slightly worse end-task performance on some tasks. Nevertheless, we believe the advantages of a universal encoding scheme outweigh the minor degradation in performance and use this encoding in the remainder of our experiments. A more detailed comparison of these encodings is left to future work.
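To make the merge procedure concrete, here is a toy BPE sketch, character-level for readability; the byte-level variant simply uses UTF-8 bytes as the base symbols.

```python
from collections import Counter

def merge_word(w, a, b):
    """Replace every adjacent (a, b) pair in symbol list w with the merge a+b."""
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(w[i])
            i += 1
    return out

def learn_bpe(words, num_merges=3):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        corpus = [merge_word(w, a, b) for w in corpus]
    return merges

print(learn_bpe(["lower", "lowest", "low"]))   # ['lo', 'low', 'lowe']
# Byte-level base alphabet: any text encodes without unknown tokens.
print(list("naïve".encode("utf-8")))           # [110, 97, 195, 175, 118, 101]
```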
|
||||
5 RoBERTa
|
||||
In the previous section we propose modifications to the BERT pretraining procedure that improve end-task performance. We now aggregate these improvements and evaluate their combined impact. We call this configuration RoBERTa for Robustly optimized BERT approach. Specifically, RoBERTa is trained with dynamic masking (Section 4.1), FULL-SENTENCES without NSP loss (Section 4.2), large mini-batches (Section 4.3) and a larger byte-level BPE (Section 4.4). Additionally, we investigate two other important factors that have been under-emphasized in previous work: (1) the data used for pretraining, and (2) the number of training passes through the data. For example, the recently proposed XLNet architecture (Yang et al., 2019) is pretrained using nearly 10 times more data than the original BERT (Devlin et al., 2019). It is also trained with a batch size eight times larger for half as many optimization steps, thus seeing four times as many sequences in pretraining compared to BERT. To help disentangle the importance of these factors from other modeling choices (e.g., the pretraining objective), we begin by training RoBERTa following the BERTLARGE architecture (L = 24, H = 1024, A = 16, 355M parameters). We pretrain for 100K steps over a comparable BOOKCORPUS plus WIKIPEDIA dataset as was used in
|
||||
|
||||
|
||||
Model                        data    bsz   steps   SQuAD (v1.1/2.0)   MNLI-m   SST-2
RoBERTa
  with BOOKS + WIKI          16GB    8K    100K    93.6/87.3          89.0     95.3
  + additional data (§3.2)   160GB   8K    100K    94.0/87.7          89.3     95.6
  + pretrain longer          160GB   8K    300K    94.4/88.7          90.0     96.1
  + pretrain even longer     160GB   8K    500K    94.6/89.4          90.2     96.4
BERTLARGE
  with BOOKS + WIKI          13GB    256   1M      90.9/81.8          86.6     93.7
XLNetLARGE
  with BOOKS + WIKI          13GB    256   1M      94.0/87.8          88.4     94.4
  + additional data          126GB   2K    500K    94.5/88.8          89.8     95.6
|
||||
Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows above. RoBERTa matches the architecture and training objective of BERTLARGE . Results for BERTLARGE and XLNetLARGE are from Devlin et al. (2019) and Yang et al. (2019), respectively. Complete results on all GLUE tasks can be found in the Appendix.
|
||||
Devlin et al. (2019). We pretrain our model using 1024 V100 GPUs for approximately one day.
|
||||
Results We present our results in Table 4. When controlling for training data, we observe that RoBERTa provides a large improvement over the originally reported BERTLARGE results, reaffirming the importance of the design choices we explored in Section 4. Next, we combine this data with the three additional datasets described in Section 3.2. We train RoBERTa over the combined data with the same number of training steps as before (100K). In total, we pretrain over 160GB of text. We observe further improvements in performance across all downstream tasks, validating the importance of data size and diversity in pretraining.9 Finally, we pretrain RoBERTa for significantly longer, increasing the number of pretraining steps from 100K to 300K, and then further to 500K. We again observe significant gains in downstream task performance, and the 300K and 500K step models outperform XLNetLARGE across most tasks. We note that even our longest-trained model does not appear to overfit our data and would likely benefit from additional training. In the rest of the paper, we evaluate our best RoBERTa model on the three different benchmarks: GLUE, SQuAD and RACE. Specifically
|
||||
9Our experiments conflate increases in data size and diversity. We leave a more careful analysis of these two dimensions to future work.
|
||||
we consider RoBERTa trained for 500K steps over all five of the datasets introduced in Section 3.2.
|
||||
5.1 GLUE Results
|
||||
For GLUE we consider two finetuning settings. In the first setting (single-task, dev) we finetune RoBERTa separately for each of the GLUE tasks, using only the training data for the corresponding task. We consider a limited hyperparameter sweep for each task, with batch sizes ∈ {16, 32} and learning rates ∈ {1e−5, 2e−5, 3e−5}, with a linear warmup for the first 6% of steps followed by a linear decay to 0. We finetune for 10 epochs and perform early stopping based on each task’s evaluation metric on the dev set. The rest of the hyperparameters remain the same as during pretraining. In this setting, we report the median development set results for each task over five random initializations, without model ensembling.
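In outline, the single-task sweep and its warmup schedule look like this; `finetune` is a hypothetical training routine named only for illustration.

```python
from itertools import product

def lr_schedule(step, total, peak, warmup_frac=0.06):
    """Linear warmup over the first 6% of steps, then linear decay to 0."""
    warmup = max(1, int(warmup_frac * total))
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)

# Grid from the text; each run is finetuned for 10 epochs with early
# stopping on the task's dev metric (training loop omitted).
for bsz, peak_lr in product([16, 32], [1e-5, 2e-5, 3e-5]):
    pass  # finetune(task, batch_size=bsz, peak_lr=peak_lr, epochs=10)
```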
|
||||
In the second setting (ensembles, test), we compare RoBERTa to other approaches on the test set via the GLUE leaderboard. While many submissions to the GLUE leaderboard depend on multitask finetuning, our submission depends only on single-task finetuning. For RTE, STS and MRPC we found it helpful to finetune starting from the MNLI single-task model, rather than the baseline pretrained RoBERTa. We explore a slightly wider hyperparameter space, described in the Appendix, and ensemble between 5 and 7 models per task.
|
||||
|
||||
|
||||
                 MNLI        QNLI   QQP    RTE    SST    MRPC   CoLA   STS    WNLI   Avg
Single-task single models on dev
BERTLARGE        86.6/–      92.3   91.3   70.4   93.2   88.0   60.6   90.0   –      –
XLNetLARGE       89.8/–      93.9   91.8   83.8   95.6   89.2   63.6   91.8   –      –
RoBERTa          90.2/90.2   94.7   92.2   86.6   96.4   90.9   68.0   92.4   91.3   –
Ensembles on test (from leaderboard as of July 25, 2019)
ALICE            88.2/87.9   95.7   90.7   83.5   95.2   92.6   68.6   91.1   80.8   86.3
MT-DNN           87.9/87.4   96.0   89.9   86.3   96.5   92.7   68.4   91.1   89.0   87.6
XLNet            90.2/89.8   98.6   90.3   86.3   96.8   93.0   67.8   91.6   90.4   88.4
RoBERTa          90.8/90.2   98.9   90.2   88.2   96.7   92.3   67.8   92.2   89.0   88.5
|
||||
Table 5: Results on GLUE. All results are based on a 24-layer architecture. BERTLARGE and XLNetLARGE results are from Devlin et al. (2019) and Yang et al. (2019), respectively. RoBERTa results on the development set are a median over five runs. RoBERTa results on the test set are ensembles of single-task models. For RTE, STS and MRPC we finetune starting from the MNLI model instead of the baseline pretrained model. Averages are obtained from the GLUE leaderboard.
|
||||
Task-specific modifications Two of the GLUE tasks require task-specific finetuning approaches to achieve competitive leaderboard results.

QNLI: Recent submissions on the GLUE leaderboard adopt a pairwise ranking formulation for the QNLI task, in which candidate answers are mined from the training set and compared to one another, and a single (question, candidate) pair is classified as positive (Liu et al., 2019b,a; Yang et al., 2019). This formulation significantly simplifies the task, but is not directly comparable to BERT (Devlin et al., 2019). Following recent work, we adopt the ranking approach for our test submission, but for direct comparison with BERT we report development set results based on a pure classification approach.

WNLI: We found the provided NLI-format data to be challenging to work with. Instead we use the reformatted WNLI data from SuperGLUE (Wang et al., 2019a), which indicates the span of the query pronoun and referent. We finetune RoBERTa using the margin ranking loss from Kocijan et al. (2019). For a given input sentence, we use spaCy (Honnibal and Montani, 2017) to extract additional candidate noun phrases from the sentence and finetune our model so that it assigns higher scores to positive referent phrases than for any of the generated negative candidate phrases. One unfortunate consequence of this formulation is that we can only make use of the positive training examples, which excludes over half of the provided training examples.10
|
||||
10While we only use the provided WNLI training data, our results could potentially be improved by augmenting this with additional pronoun disambiguation datasets.
|
||||
Results We present our results in Table 5. In the first setting (single-task, dev), RoBERTa achieves state-of-the-art results on all 9 of the GLUE task development sets. Crucially, RoBERTa uses the same masked language modeling pretraining objective and architecture as BERTLARGE, yet consistently outperforms both BERTLARGE and XLNetLARGE. This raises questions about the relative importance of model architecture and pretraining objective, compared to more mundane details like dataset size and training time that we explore in this work. In the second setting (ensembles, test), we submit RoBERTa to the GLUE leaderboard and achieve state-of-the-art results on 4 out of 9 tasks and the highest average score to date. This is especially exciting because RoBERTa does not depend on multi-task finetuning, unlike most of the other top submissions. We expect future work may further improve these results by incorporating more sophisticated multi-task finetuning procedures.
|
||||
5.2 SQuAD Results
|
||||
We adopt a much simpler approach for SQuAD compared to past work. In particular, while both BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) augment their training data with additional QA datasets, we only finetune RoBERTa using the provided SQuAD training data. Yang et al. (2019) also employed a custom layer-wise learning rate schedule to finetune
|
||||
|
||||
|
||||
|
||||
Model                      SQuAD 1.1        SQuAD 2.0
                           EM     F1        EM      F1
Single models on dev, w/o data augmentation
BERTLARGE                  84.1   90.9      79.0    81.8
XLNetLARGE                 89.0   94.5      86.1    88.8
RoBERTa                    88.9   94.6      86.5    89.4
Single models on test (as of July 25, 2019)
XLNetLARGE                 –      –         86.3†   89.1†
RoBERTa                    –      –         86.8    89.8
XLNet + SG-Net Verifier    –      –         87.0†   89.9†
|
||||
Table 6: Results on SQuAD. † indicates results that depend on additional external training data. RoBERTa uses only the provided SQuAD data in both dev and test settings. BERTLARGE and XLNetLARGE results are from Devlin et al. (2019) and Yang et al. (2019), respectively.
|
||||
XLNet, while we use the same learning rate for all layers. For SQuAD v1.1 we follow the same finetuning procedure as Devlin et al. (2019). For SQuAD v2.0, we additionally classify whether a given question is answerable; we train this classifier jointly with the span predictor by summing the classification and span loss terms.
|
||||
Results We present our results in Table 6. On the SQuAD v1.1 development set, RoBERTa matches the state-of-the-art set by XLNet. On the SQuAD v2.0 development set, RoBERTa sets a new state-of-the-art, improving over XLNet by 0.4 points (EM) and 0.6 points (F1). We also submit RoBERTa to the public SQuAD 2.0 leaderboard and evaluate its performance relative to other systems. Most of the top systems build upon either BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019), both of which rely on additional external training data. In contrast, our submission does not use any additional data. Our single RoBERTa model outperforms all but one of the single model submissions, and is the top scoring system among those that do not rely on data augmentation.
|
||||
5.3 RACE Results
|
||||
In RACE, systems are provided with a passage of text, an associated question, and four candidate answers. Systems are required to classify which of the four candidate answers is correct.
|
||||
Model          Accuracy   Middle   High
Single models on test (as of July 25, 2019)
BERTLARGE      72.0       76.6     70.1
XLNetLARGE     81.7       85.4     80.2
RoBERTa        83.2       86.5     81.3
|
||||
Table 7: Results on the RACE test set. BERTLARGE and XLNetLARGE results are from Yang et al. (2019).
|
||||
We modify RoBERTa for this task by concatenating each candidate answer with the corresponding question and passage. We then encode each of these four sequences and pass the resulting [CLS] representations through a fully-connected layer, which is used to predict the correct answer. We truncate question-answer pairs that are longer than 128 tokens and, if needed, the passage so that the total length is at most 512 tokens.
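A sketch of this input construction follows; the ordering of the passage and question–answer segments and the special-token placement are assumptions, and `tokenize` is a hypothetical tokenizer callable.

```python
def race_inputs(passage, question, options, tokenize, max_len=512):
    """Build one sequence per candidate answer. QA pairs longer than 128
    tokens are truncated, then the passage is truncated so each full
    input fits in max_len tokens."""
    seqs = []
    for opt in options:
        qa = tokenize(question + " " + opt)[:128]
        ctx = tokenize(passage)[: max_len - len(qa) - 3]   # 3 special tokens
        seqs.append(["[CLS]"] + ctx + ["[SEP]"] + qa + ["[SEP]"])
    return seqs

# e.g. race_inputs("Some passage ...", "What is ...?",
#                  ["A1", "A2", "A3", "A4"], tokenize=str.split)
```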
|
||||
Results on the RACE test sets are presented in Table 7. RoBERTa achieves state-of-the-art results on both middle-school and high-school settings.
|
||||
6 Related Work
|
||||
Pretraining methods have been designed with different training objectives, including language modeling (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018), machine translation (McCann et al., 2017), and masked language modeling (Devlin et al., 2019; Lample and Conneau, 2019). Many recent papers have used a basic recipe of finetuning models for each end task (Howard and Ruder, 2018; Radford et al., 2018), and pretraining with some variant of a masked language model objective. However, newer methods have improved performance by multi-task finetuning (Dong et al., 2019), incorporating entity embeddings (Sun et al., 2019), span prediction (Joshi et al., 2019), and multiple variants of autoregressive pretraining (Song et al., 2019; Chan et al., 2019; Yang et al., 2019). Performance is also typically improved by training bigger models on more data (Devlin et al., 2019; Baevski et al., 2019; Yang et al., 2019; Radford et al., 2019). Our goal was to replicate, simplify, and better tune the training of BERT, as a reference point for better understanding the relative performance of all of these methods.
|
||||
|
||||
|
||||
7 Conclusion
|
||||
We carefully evaluate a number of design decisions when pretraining BERT models. We find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. Our improved pretraining procedure, which we call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results illustrate the importance of these previously overlooked design decisions and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives. We additionally use a novel dataset, CC-NEWS, and release our models and code for pretraining and finetuning at: https://github.com/pytorch/fairseq.
|
||||
References
|
||||
Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007).
|
||||
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785.
|
||||
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment.
|
||||
Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge.
|
||||
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).
|
||||
William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.
|
||||
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising textual entailment.
|
||||
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS).
|
||||
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).
|
||||
William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.
|
||||
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
|
||||
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing.
|
||||
Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus.
|
||||
Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. news-please: A generic news crawler and extractor. In Proceedings of the 15th International Symposium of Information Science.
|
||||
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
|
||||
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
|
||||
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
|
||||
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2016. First quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
|
||||
|
||||
|
||||
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
|
||||
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
|
||||
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema challenge. arXiv preprint arXiv:1905.06290.
|
||||
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
|
||||
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
|
||||
Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
|
||||
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
|
||||
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
|
||||
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems (NIPS), pages 6297–6308.
|
||||
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In International Conference on Learning Representations.
|
||||
Sebastian Nagel. 2016. CC-News. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available.
|
||||
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations.
|
||||
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT).
|
||||
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
|
||||
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).
|
||||
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
|
||||
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.
|
||||
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL).
|
||||
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
|
||||
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL), pages 1715–1725.
|
||||
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP).
|
||||
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML).
|
||||
Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
|
||||
Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
|
||||
|
||||
|
||||
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.
|
||||
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537.
|
||||
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).
|
||||
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint 1805.12471.
|
||||
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL).
|
||||
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
|
||||
Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Reducing bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962.
|
||||
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.
|
||||
Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.
|
||||
Appendix for “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
|
||||
A Full results on GLUE
|
||||
In Table 8 we present the full set of development set results for RoBERTa. We present results for a LARGE configuration that follows BERTLARGE, as well as a BASE configuration that follows BERTBASE.
|
||||
B Pretraining Hyperparameters
|
||||
Table 9 describes the hyperparameters for pretraining of RoBERTaLARGE and RoBERTaBASE.
|
||||
C Finetuning Hyperparameters
|
||||
Finetuning hyperparameters for RACE, SQuAD and GLUE are given in Table 10. We select the best hyperparameter values based on the median of 5 random seeds for each task.
|
||||
|
||||
|
||||
                             MNLI   QNLI   QQP    RTE    SST    MRPC   CoLA   STS
RoBERTaBASE
  + all data + 500k steps    87.6   92.8   91.9   78.7   94.8   90.2   63.6   91.2
RoBERTaLARGE
  with BOOKS + WIKI          89.0   93.9   91.9   84.5   95.3   90.2   66.3   91.6
  + additional data (§3.2)   89.3   94.0   92.0   82.7   95.6   91.4   66.1   92.2
  + pretrain longer 300k     90.0   94.5   92.2   83.3   96.1   91.1   67.4   92.3
  + pretrain longer 500k     90.2   94.7   92.2   86.6   96.4   90.9   68.0   92.4
|
||||
Table 8: Development set results on GLUE tasks for various configurations of RoBERTa.
|
||||
Hyperparam              RoBERTaLARGE   RoBERTaBASE
Number of Layers        24             12
Hidden size             1024           768
FFN inner hidden size   4096           3072
Attention heads         16             12
Attention head size     64             64
Dropout                 0.1            0.1
Attention Dropout       0.1            0.1
Warmup Steps            30k            24k
Peak Learning Rate      4e-4           6e-4
Batch Size              8k             8k
Weight Decay            0.01           0.01
Max Steps               500k           500k
Learning Rate Decay     Linear         Linear
Adam ε                  1e-6           1e-6
Adam β1                 0.9            0.9
Adam β2                 0.98           0.98
Gradient Clipping       0.0            0.0
|
||||
Table 9: Hyperparameters for pretraining RoBERTaLARGE and RoBERTaBASE .
|
||||
Hyperparam            RACE     SQuAD    GLUE
Learning Rate         1e-5     1.5e-5   {1e-5, 2e-5, 3e-5}
Batch Size            16       48       {16, 32}
Weight Decay          0.1      0.01     0.1
Max Epochs            4        2        10
Learning Rate Decay   Linear   Linear   Linear
Warmup ratio          0.06     0.06     0.06
|
||||
Table 10: Hyperparameters for finetuning RoBERTaLARGE on RACE, SQuAD and GLUE.
|
||||
1
storage/PI737J5V/.zotero-reader-state
Normal file
1
storage/PI737J5V/.zotero-reader-state
Normal file
@@ -0,0 +1 @@
|
||||
{"pageIndex":7,"scale":173,"top":51,"left":-185,"scrollMode":0,"spreadMode":0}
|
||||
Binary file not shown.
152
storage/W4P75KJE/.zotero-ft-cache
Normal file
152
storage/W4P75KJE/.zotero-ft-cache
Normal file
@@ -0,0 +1,152 @@
|
||||
Information Research Communications, 2025; 2(2):175-181.
|
||||
https://inforescom.org Research Article
|
||||
Information Research Communications, Vol 2, Issue 2, May-Aug, 2025 175
|
||||
DOI: 10.5530/irc.2.2.14
|
||||
Copyright Information :
|
||||
Copyright Author (s) 2025 Distributed under Creative Commons CC-BY 4.0
|
||||
Publishing Partner : ScienScript Digital. [www.scienscript.com.sg]
|
||||
Evaluating Small BERT Based Models on Automated Essay Scoring Task
|
||||
Megat Norulazmi Megat Mohamed Noor*, Muhammad Firdaus Mohamed Badauraudine
|
||||
Department of Computer Engineering Technology, Malaysian Institute of Information Technology, Kuala Lumpur, MALAYSIA.
|
||||
ABSTRACT
|
||||
Automated Essay Scoring (AES) systems address the limitations of manual grading, such as inefficiency, subjectivity, and scalability issues in educational assessments. This study evaluates the performance of lightweight BERT-based models on AES tasks, aiming to identify the most effective and computationally efficient variant for integration into a proposed AI Essay Score Bot. It focuses on smaller, distilled transformer models that balance high accuracy with reduced resource demands, building on advancements in Natural Language Processing (NLP) from models such as BERT, RoBERTa, and their variants. The publicly available ASAP-AES dataset was used, encompassing essays from prompts 1 to 8, and eleven small-scale BERT-based models were tested: DistilBERT-base-uncased, DistilBERT-base-uncased-distilled-SQuAD, ALBERT-base-v1, ALBERT-base-v2, DistilRoBERTa-base, SqueezeBERT-uncased, SqueezeBERT-MNLI, SqueezeBERT-MNLI-headless, BERT-base-uncased, RoBERTa-base, and BORT. Training involved 5-fold cross-validation with 80% training and 20% validation splits, and hyperparameter tuning across batch sizes (8, 16, 20), learning rates (1e-4, 3e-4, 3e-5, 4e-5, 5e-5), and epochs (5, 10, 15, 20); the EXPATS toolkit facilitated model implementation, training, and evaluation under an in-domain schema, with performance measured using the Quadratic Weighted Kappa (QWK) metric. DistilBERT-base-uncased achieved the highest average QWK of 0.926 (batch size 16, 10 epochs), outperforming the others across most prompts, with further improvements under some hyperparameter configurations (e.g., learning rate 3e-5); ALBERT-base-v1 followed closely with a maximum QWK of 0.920 (batch size 8, 20 epochs), despite GPU memory constraints limiting batch sizes. Smaller models such as DistilRoBERTa-base (average QWK 0.907) and SqueezeBERT-uncased (0.910) surpassed larger counterparts such as BERT-base-uncased (0.903) and RoBERTa-base (0.860). Prompts 1 and 7 were consistently challenging: SqueezeBERT-uncased scored highest on Prompt 1 (QWK 0.880) and SqueezeBERT-MNLI on Prompt 7 (0.780). Underperforming models included BORT (average 0.770) and ALBERT-base-v2 (0.830), both affected by architectural simplifications. The superior performance of distilled models like DistilBERT and ALBERT underscores the benefits of knowledge distillation and parameter-optimization techniques, which achieve better QWK scores with lower computational overhead than full-sized BERT and RoBERTa; increasing the batch size (up to 20) helped the DistilBERT variants, particularly on difficult prompts, while ALBERT’s efficiency stemmed from factorized embeddings and cross-layer parameter sharing. The difficulty of Prompts 1 and 7 suggests dataset-specific complexities, such as varied essay structures, indicating that lightweight models are well suited to resource-limited applications; further distillation of ALBERT could yield even more compact variants without sacrificing precision. Distilled transformer models, especially DistilBERT-base-uncased and ALBERT-base-v1, offer an optimal trade-off between accuracy and efficiency for AES tasks, making them suitable for real-world deployment in web-based AI tools, and future research should investigate distilling ALBERT to enhance compactness and explore transfer learning for cross-domain AES improvements.
Keywords: Automated Essay Scoring, BERT, Natural Language Processing, Quadratic Weighted Kappa.

INTRODUCTION

Automated Essay Scoring (AES) has emerged as a critical area of study within educational assessment, addressing the limitations of manual grading processes, which are time-consuming and prone to subjectivity (Page, 1966). Early AES systems relied on rule-based approaches, such as Project Essay Grade (PEG), developed by Ellis Batten Page in the 1960s (Page, 1967). These systems, while pioneering, lacked the ability to capture the complexities of language and context effectively. The advent of Natural Language Processing (NLP) and machine learning techniques has led to significant advancements in AES. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks have been instrumental in capturing sequential dependencies in text data (Hochreiter and Schmidhuber, 1997). Furthermore, transformer-based architectures, such as Bidirectional Encoder Representations from Transformers (BERT), introduced by Devlin et al. (2018), have revolutionized AES research by enabling models to capture contextual information bidirectionally.

Received: 14-05-2025; Revised: 29-07-2025; Accepted: 02-09-2025.

Correspondence: Dr. Megat Norulazmi Megat Mohamed Noor

Department of Computer Engineering Technology, Universiti Kuala Lumpur - Malaysian Institute of Information Technology, Kuala Lumpur-50250, MALAYSIA. Email: megatnorulazmi@unikl.edu.my
Studies have demonstrated the effectiveness of BERT in various NLP tasks (Devlin et al., 2018), including sentiment analysis (Sun et al., 2019), named entity recognition (Jin et al., 2019), and document classification (Lee et al., 2019). BERT's ability to pre-train on large corpora of text data and fine-tune on task-specific datasets has contributed to its success in AES tasks (Devlin et al., 2018). In addition to BERT, other transformer-based architectures, such as Generative Pre-trained Transformer (GPT) developed by OpenAI, have shown promise in AES research (Radford et al., 2018). GPT's ability to generate coherent text and make contextually relevant predictions can be leveraged for scoring essays (Radford et al., 2018).

Overall, the integration of deep learning techniques, particularly transformer-based models like BERT and GPT, represents a significant advancement in AES research, offering improved accuracy and scalability compared to traditional approaches.
METHODOLOGY
Evaluation Metrics

The primary evaluation metric for our AES models is the Quadratic Weighted Kappa (QWK), an agreement metric ranging from 0 (chance agreement) to 1 (perfect agreement); negative values indicate less agreement than expected by chance. QWK, the official metric of the ASAP-AES competition, is used to report the performance of our models on each prompt and to provide a concise summary of performance across prompts.
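For reference, QWK is available off the shelf as Cohen's kappa with quadratic weights; a minimal sketch using scikit-learn follows, in which the example score arrays are illustrative rather than taken from our data:

# Minimal sketch: computing QWK with scikit-learn's quadratically weighted
# Cohen's kappa. The example score arrays are illustrative, not from ASAP-AES.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 2, 5, 3, 4, 1]   # gold scores from human raters
model_scores = [2, 3, 3, 2, 5, 4, 4, 1]   # predictions from the AES model

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")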
Evaluation Schemas

We employ an in-domain evaluation scheme, in which the system is trained and evaluated on the same prompt. A cross-domain evaluation, by contrast, trains and evaluates the system on different prompts and is used to assess AES systems employing transfer learning techniques.
Data Preparation and Model Training

Model training employs a 5-fold cross-validation approach with separate train/validation/test splits. The official ASAP-AES training data are divided into 5 folds, with 80% allocated for training and 20% for validation. The batch size is chosen from {8, 16, 20} and the learning rate is tuned within the range {1e-4, 3e-4, 3e-5, 4e-5, 5e-5}. During hyperparameter tuning, the model's performance is evaluated solely on the validation sets, recording the best performance achieved across the epoch range {5, 10, 15, 20}. The training process stops at the specified epoch budget across the validation folds. After hyperparameter tuning, the final models are trained by combining the training and validation sets.
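The tuning loop described above can be summarized as the following sketch; train_and_evaluate is a hypothetical stand-in for a single EXPATS training run, and the essay and score inputs are assumed to be NumPy arrays:

# Sketch of the 5-fold cross-validation and grid search described above.
# train_and_evaluate() is a hypothetical stand-in for one EXPATS training run
# that returns the best validation QWK seen within the given epoch budget;
# essays and scores are assumed to be NumPy arrays.
from itertools import product
from sklearn.model_selection import KFold

BATCH_SIZES = [8, 16, 20]
LEARNING_RATES = [1e-4, 3e-4, 3e-5, 4e-5, 5e-5]
EPOCH_BUDGETS = [5, 10, 15, 20]

def tune(essays, scores, train_and_evaluate):
    best_config, best_qwk = None, -1.0
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # 80/20 splits
    for bs, lr, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCH_BUDGETS):
        fold_qwks = []
        for train_idx, val_idx in kfold.split(essays):
            qwk = train_and_evaluate(
                train=(essays[train_idx], scores[train_idx]),
                val=(essays[val_idx], scores[val_idx]),
                batch_size=bs, learning_rate=lr, max_epochs=epochs,
            )
            fold_qwks.append(qwk)
        mean_qwk = sum(fold_qwks) / len(fold_qwks)
        if mean_qwk > best_qwk:
            best_config, best_qwk = (bs, lr, epochs), mean_qwk
    return best_config, best_qwk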
Automated Text Scoring Toolkit

For this study, we employed EXPATS (Manabe and Hagiwara, 2021), an automated text scoring toolkit designed for a variety of Automated Text Scoring (ATS) tasks, including automated essay scoring and readability assessment. EXPATS is an open-source framework that facilitates the rapid development and experimentation of diverse ATS models through its user-friendly components, configuration system, and command-line interface. Moreover, the toolkit seamlessly integrates with the Language Interpretability Tool (LIT), enabling users to interpret and visualize models along with their predictions.

BERT-based models, batch size = 8

Model                                        P1    P2    P3    P4    P5     P6    P7    P8    Avg
DistilBERT-base-uncased                      0.88  0.94  0.94  0.97  0.961  0.97  0.75  0.95  0.92
DistilBERT-base-uncased-distilled-SQuAD      0.87  0.93  0.92  0.96  0.956  0.95  0.74  0.94  0.91
ALBERT-base-v1                               0.86  0.93  0.93  0.97  0.958  0.96  0.78  0.94  0.91
ALBERT-base-v2                               0.63  0.93  0.89  0.96  0.839  0.95  0.56  0.92  0.83
DistilRoBERTa-base                           0.85  0.94  0.93  0.96  0.96   0.96  0.72  0.94  0.91
SqueezeBERT-uncased                          0.88  0.92  0.93  0.96  0.897  0.95  0.78  0.92  0.91
SqueezeBERT-MNLI                             0.84  0.85  0.90  0.94  0.951  0.95  0.78  0.90  0.89
SqueezeBERT-MNLI-headless                    0.85  0.92  0.92  0.95  0.946  0.96  0.74  0.92  0.90
BERT-base-uncased                            0.85  0.93  0.93  0.97  0.959  0.96  0.70  0.92  0.90
RoBERTa-base                                 0.84  0.93  0.82  0.96  0.952  0.91  0.57  0.93  0.86
BORT                                         0.85  0.93  0.94  0.96  0.009  0.96  0.64  0.91  0.77

DistilBERT-based models, batch sizes 16 and 20

Model                                        P1    P2    P3    P4    P5     P6    P7    P8    Avg
DistilBERT-base-uncased (16)                 0.89  0.94  0.94  0.97  0.958  0.96  0.82  0.93  0.93
DistilBERT-base-uncased-distilled-SQuAD (16) 0.88  0.93  0.94  0.97  0.964  0.96  0.71  0.93  0.910
DistilRoBERTa-base (16)                      0.85  0.94  0.92  0.96  0.952  0.96  0.74  0.87  0.90
DistilBERT-base-uncased (20)                 0.83  0.93  0.94  0.97  0.96   0.96  0.83  0.90  0.92
DistilBERT-base-uncased-distilled-SQuAD (20) 0.86  0.93  0.94  0.96  0.96   0.96  0.84  0.92  0.920

Table 1: QWK results per prompt (P1-P8) and on average, based on batch sizes 8, 16 and 20.
EXPERIMENTATION

BERT-based models

Our initial experiment aimed to assess the performance of various small BERT-based models, including DistilBERT-base-uncased, DistilBERT-base-uncased-distilled-SQuAD, ALBERT-base-v1, ALBERT-base-v2, DistilRoBERTa-base, SqueezeBERT-uncased, SqueezeBERT-MNLI, SqueezeBERT-MNLI-headless, BERT-base-uncased, RoBERTa-base, and BORT. DistilBERT-base-uncased is a distilled version of BERT designed to be smaller and faster while retaining much of BERT's performance across NLP tasks, using a transformer-based architecture for pre-training on large text corpora. DistilBERT-base-uncased-distilled-SQuAD builds on this by being specifically optimized for question answering through additional distillation on the SQuAD dataset. ALBERT-base-v1 introduces parameter reduction techniques for efficiency without compromising performance, while ALBERT-base-v2 further refines this with improved dropout and broader training data. DistilRoBERTa-base is a compact version of RoBERTa that maintains effectiveness with a smaller model size. SqueezeBERT-uncased offers a compressed BERT model for resource-constrained settings via pruning and quantization, and its variant, SqueezeBERT-MNLI, is tailored for natural language inference tasks. SqueezeBERT-MNLI-headless removes the task-specific classification layer for flexible downstream use. BERT-base-uncased is the original benchmark transformer model for bidirectional text representations. RoBERTa-base extends BERT with enhanced training strategies and data processing for improved task performance. Finally, BORT is designed for efficient inference, leveraging quantization, pruning, and sparse attention to minimize computational load, albeit with some trade-offs in accuracy. Our objective is to identify the smallest model capable of achieving optimal performance in our future AES application, as determined by the highest average Quadratic Weighted Kappa (QWK) score.
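All of these checkpoints are published on the Hugging Face Hub, so their relative sizes are easy to verify; a minimal sketch follows (the checkpoint identifiers follow the Hub's public naming, and only a subset of the eleven models is shown):

# Sketch: loading a subset of the evaluated checkpoints from the Hugging Face
# Hub and comparing their parameter counts (identifiers follow the Hub naming).
from transformers import AutoModel

CHECKPOINTS = [
    "distilbert-base-uncased",
    "albert-base-v1",
    "distilroberta-base",
    "squeezebert/squeezebert-uncased",
    "bert-base-uncased",
]

for name in CHECKPOINTS:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")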
DistilBERT-base-uncased

Epochs  Config  P1     P2     P3     P4     P5     P6     P7     P8     Avg
5       1       0.781  0.929  0.911  0.959  0.958  0.952  0.761  0.943  0.899
10      7       0.882  0.933  0.940  0.965  0.958  0.963  0.824  0.939  0.926
15      6       0.891  0.936  0.940  0.966  0.960  0.964  0.821  0.942  0.928
15      7       0.886  0.938  0.937  0.969  0.961  0.960  0.810  0.944  0.926
20      6       0.896  0.935  0.944  0.966  0.962  0.964  0.801  0.942  0.926
20      8       0.891  0.942  0.926  0.952  0.961  0.964  0.828  0.954  0.927
20      10      0.872  0.942  0.935  0.965  0.965  0.961  0.838  0.941  0.927
20      1       0.884  0.941  0.925  0.959  0.955  0.959  0.833  0.948  0.926

ALBERT-base-v1

Epochs  Config  P1     P2     P3     P4     P5     P6     P7     P8     Avg
5       1       0.802  0.906  0.894  0.924  0.950  0.936  0.703  0.919  0.879
5       2       0.830  0.857  0.884  0.939  0.946  0.947  0.735  0.893  0.879
10      2       0.843  0.938  0.934  0.964  0.963  0.963  0.790  0.938  0.917
15      1       0.856  0.925  0.935  0.905  0.966  0.958  0.787  0.937  0.909
15      2       0.875  0.934  0.934  0.963  0.960  0.962  0.747  0.940  0.915
20      1       0.856  0.925  0.935  0.905  0.966  0.958  0.787  0.937  0.909
20      2       0.875  0.934  0.934  0.963  0.960  0.962  0.747  0.940  0.915

Table 2: Optimum results for hyperparameter and epoch settings ("Config" numbers refer to the ten hyperparameter configurations described in the text).
Figure 1: DistilBERT hyperparameter configurations 1 to 10 results.
Model Evaluation

All models were trained for 10 epochs with a learning rate of 4e-5 and a validation ratio of 0.2. The DistilBERT-base-uncased and DistilBERT-base-uncased-distilled-SQuAD models were evaluated using batch sizes of 8, 16, and 20. DistilRoBERTa-base was trained using batch sizes of 8 and 16.

The remaining models (ALBERT-base-v1, ALBERT-base-v2, SqueezeBERT-uncased, SqueezeBERT-MNLI, SqueezeBERT-MNLI-headless, BERT-base-uncased, RoBERTa-base, and BORT) were all trained with a batch size of 8. Initially, the learning rate was set to 4e-5, with 10 epochs and a validation ratio of 0.2. Due to the limited memory of our 12 GB RTX 3080 Ti GPU, batch sizes varied among models. Specifically, ALBERT-base-v1, ALBERT-base-v2, SqueezeBERT-uncased, SqueezeBERT-MNLI, SqueezeBERT-MNLI-headless, BERT-base-uncased, RoBERTa-base, and BORT encountered "GPU memory full" errors when batch sizes exceeded 8. In contrast, DistilBERT-based models could handle batch sizes up to 20, while DistilRoBERTa could manage a batch size of up to 16 without errors. These findings indicate that distillation-based techniques produce smaller models than the original BERT model and other lite BERT models, allowing for larger batch sizes and more efficient GPU memory usage.
Table 1 reveals that the DistilBERT-base-uncased model exhibited superior performance across prompts 2, 3, 4, 5, 6, and 8, with an average Quadratic Weighted Kappa (QWK) score of 0.918. The ALBERT-base-v1 model, which ranked second, achieved an average QWK score of 0.914. These results surpass those of the original BERT-base-uncased model, which recorded a QWK of 0.903. Even smaller models, such as DistilRoBERTa-base and SqueezeBERT-uncased, outperformed the original BERT-based model. However, prompts 1 and 7 posed significant challenges for most models. The SqueezeBERT-uncased model achieved the highest QWK for Prompt 1 with a score of 0.880, while the SqueezeBERT-MNLI model scored 0.780 for Prompt 7. The bottom three performing models were BORT (negatively impacted by prompts 5 and 7), ALBERT-base-v2 (affected by prompts 1 and 7), and RoBERTa-base (affected by prompt 7). Notably, the BORT model performed the worst, likely because it is overly simplified. Similarly, the performance of the newer ALBERT-base-v2 deteriorated relative to the older v1 version, likely due to changes in the dropout ratio and tuning on larger, more diverse data.
Table 1 also presents the evaluation results for DistilBERT-base-uncased, DistilBERT-base-uncased-distilled-SQuAD, and DistilRoBERTa-base, the third best-performing model. We aimed to verify whether varying batch sizes could enhance their performance. The overall results indicate that DistilBERT-base-uncased achieves the highest average QWK of 0.926 with a batch size of 16, while DistilBERT-base-uncased-distilled-SQuAD achieves an average QWK of 0.920 with a batch size of 20. These improvements in the DistilBERT models are primarily due to increased QWK scores on prompt 7, with slight effects on the QWK scores of other prompts. However, the performance of DistilRoBERTa-base decreases from a QWK of 0.907 to 0.898 when the batch size is increased from 8 to 16. Despite this, DistilRoBERTa-base still outperforms its original RoBERTa-base counterpart. The improvements observed in the smaller DistilBERT and DistilRoBERTa models compared to their larger BERT and RoBERTa parent models suggest that the distillation method (Hinton et al., 2015) is a significant factor in achieving better QWK results. Additionally, as shown in Table 1, the ALBERT-base-v1 model produced commendable QWK results, even outperforming BERT-base, RoBERTa-base, and some other distilled models on prompts 1 and 7. Therefore, we presume that a distilled ALBERT-base-v1 model would be able to achieve better QWK scores than the DistilBERT and DistilRoBERTa models. Consequently, we decided to conduct further investigation into hyperparameter tuning for the DistilBERT- and ALBERT-v1-based models.
DistilBERT and ALBERT Model Evaluations

The subsequent evaluation process for the DistilBERT and ALBERT models was repeated across 5, 10, 15, and 20 epochs under a set of hyperparameter configurations. The configurations consist of ten different combinations of learning rates and batch sizes. Configurations 1 through 5 use a batch size of 8, with learning rates of 4e-5, 3e-5, 5e-5, 3e-4, and 1e-4, respectively. Configurations 6 through 10 use the same learning rates in the same order, but with a batch size of 16. These combinations are designed to explore the effect of varying learning rates and batch sizes on model performance. For the ALBERT model, however, the batch size is limited to 8.
The results presented in Figures 1 and 2 indicate that achieving an epoch count of 10 or higher yields favourable QWK outcomes for both models. Conversely, configuration 4, which employs a Learning Rate (LR) of 3e-4 and a batch size of 8, consistently demonstrates poor performance across all epoch settings for both models. Similarly, configuration 9 (LR 3e-4, batch size 16) is suboptimal for the distilbert model; although higher epochs improve the QWK, the results remain significantly lower than other configurations.
Configuration 5, with an LR of 1e-4 and a batch size of 8, also shows unsatisfactory QWK results for the ALBERT-base model, but increasing the epochs to 20 raises the QWK to 0.844. Table 2 indicates that the QWK results at epoch 20 for the DistilBERT model are stable, averaging 0.926 across most configurations, with the highest QWK of 0.928 occurring under configuration 6 (LR=3e-5, batch size=16). For the ALBERT model, QWK results are stable at an average of 0.915 at epoch 15, with the highest QWK of 0.92 achieved under configuration 3 (LR=5e-5, batch size=8) at epoch 20. The results show that even though ALBERT has not been distilled and its batch size is capped at 8, its performance is comparable with DistilBERT.

Figure 2: ALBERT hyperparameter configurations 1 to 5 results.
CONCLUSION

RoBERTa (Liu et al., 2019) is a more complex model compared to the original BERT due to its increased number of parameters, which result from training on a larger dataset and possibly incorporating additional layers. As shown in Table 1, BERT (Devlin et al., 2019) outperforms RoBERTa, suggesting that the AES-ASAP dataset benefits from smaller models. However, the distilled version of RoBERTa (DistilRoBERTa) surpasses BERT in performance, indicating that a smaller model, further optimized through distillation, is even better suited for AES-ASAP data.
ALBERT, a lite variant derived from BERT, is designed to reduce model size and enhance efficiency while preserving performance. ALBERT achieves this by employing factorized embedding parameterization and cross-layer parameter sharing, which reduces the number of parameters and improves performance on AES-ASAP data, surpassing its predecessor, BERT, and DistilRoBERTa (Sanh et al., 2019).
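The parameter saving from factorized embedding parameterization is easy to illustrate: the V x H embedding table is replaced by a V x E table followed by an E to H projection. A minimal PyTorch sketch with illustrative dimensions (not ALBERT's exact configuration):

# Sketch of ALBERT-style factorized embedding parameterization with
# illustrative sizes: a V x H table becomes V x E plus an E -> H projection.
import torch.nn as nn

V, E, H = 30000, 128, 768  # vocab size, embedding size, hidden size

factorized = nn.Sequential(
    nn.Embedding(V, E),  # V * E parameters
    nn.Linear(E, H),     # E * H weights (+ H bias)
)
full = nn.Embedding(V, H)  # V * H parameters

count = lambda module: sum(p.numel() for p in module.parameters())
print(count(factorized), "vs", count(full))  # ~3.9M vs ~23.0M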
DistilBERT (Sanh et al., 2019) is another variant that utilizes knowledge distillation, where a smaller model (the student) is trained to replicate the behavior of a larger model (the teacher). In this case, the teacher model is the original BERT. DistilBERT reduces the number of layers by approximately 50%, while retaining the same hidden size and other architectural parameters to maintain a significant portion of the original model's performance.
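A minimal sketch of the distillation objective (Hinton et al., 2015): the student is trained against temperature-softened teacher logits in addition to the hard labels. The temperature and weighting below are illustrative, not the exact DistilBERT recipe:

# Sketch of a knowledge-distillation loss (Hinton et al., 2015): the student
# matches temperature-softened teacher logits plus the gold labels. The
# temperature and weighting are illustrative, not the exact DistilBERT recipe.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened output distributions,
    # rescaled by T^2 so gradients keep a comparable magnitude.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)  # standard supervised loss
    return alpha * soft + (1 - alpha) * hard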
Applying distillation to ALBERT (Lan et al., 2020), which already benefits from parameter sharing and factorized embedding parameterization, could potentially result in an even more compact model. This model would be faster and require less memory during inference for AES data, while still retaining a high level of performance. Distillation is thus an effective method to preserve performance in a smaller model, compared to merely reducing the number of layers or parameters without the guidance of a teacher model.
ACKNOWLEDGEMENT

This research is supported by the Universiti Kuala Lumpur MIIT Research and Innovation Section.

CONFLICT OF INTEREST

The authors declare that there is no conflict of interest.

FUNDING

This research is funded by the Universiti Kuala Lumpur Short Term Research Grant (UniKL/CoRI/str23021).

ABBREVIATIONS

AES: Automated Essay Scoring; AI: Artificial Intelligence; ASAP-AES: Automated Student Assessment Prize - Automated Essay Scoring; ATS: Automated Text Scoring; BERT: Bidirectional Encoder Representations from Transformers; BORT: BERT Optimized for Resource-constrained Tasks; EXPATS: Explainable Automated Text Scoring Toolkit; GPT: Generative Pre-trained Transformer; LIT: Language Interpretability Tool; LSTM: Long Short-Term Memory; ML: Machine Learning; NLP: Natural Language Processing; QWK: Quadratic Weighted Kappa; RNN: Recurrent Neural Network; SQuAD: Stanford Question Answering Dataset; v1/v2: Version 1/Version 2 (used with ALBERT model variants).
SUMMARY

This study evaluates a range of lightweight BERT-based models for the Automated Essay Scoring (AES) task using the ASAP-AES dataset. The goal is to identify the most efficient and high-performing model suitable for deployment in a forthcoming AI Essay Score Bot web application. The models tested include DistilBERT, ALBERT, DistilRoBERTa, SqueezeBERT, RoBERTa, BERT, and BORT, evaluated across prompts 1 to 8 using the Quadratic Weighted Kappa (QWK) metric. Among them, DistilBERT-base-uncased consistently outperformed the others, with an average QWK of 0.918, and showed further improvement (up to 0.926) with hyperparameter tuning. Prompts 1 and 7 were the most challenging for most models; however, SqueezeBERT-uncased and SqueezeBERT-MNLI performed better on these specific prompts. Models like BORT and ALBERT-base-v2 underperformed, possibly due to architectural simplifications or tuning for different tasks. Further hyperparameter tuning of DistilBERT and ALBERT-base-v1 showed that ALBERT, despite its batch size limitation, achieved results comparable to DistilBERT, whose top QWK reached 0.928. The results confirm that smaller, distilled models not only reduce computational load but can also outperform their larger counterparts on AES tasks. The study concludes that distilled models, especially DistilBERT and potentially a distilled version of ALBERT, are ideal candidates for real-world AES applications due to their high efficiency and competitive performance.
REFERENCES

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Jin, D., Jin, Z., Zhou, J., & Szolovits, P. (2019). Multi-channel BERT for named entity recognition. arXiv preprint arXiv:1906.09423.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations (ICLR).

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. https://doi.org/10.1093/bioinformatics/btz682

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Manabe, H., & Hagiwara, M. (2021). EXPATS: A toolkit for explainable automated text scoring. Available at Papers with Code.

OpenAI. (2024). ChatGPT [Large language model]. https://chatgpt.com/c/a62e5f28-6a48-43f3-b757-563a75880cd9

Page, E. B. (1966). The project essay grade (PEG) system: Automation of essay scoring. Educational Technology, 6(12), 20-24.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper, and lighter. arXiv preprint arXiv:1910.01108.

Sun, C., Qiu, X., Xu, Y., Huang, X., Zhang, Y., & Wei, F. (2019). How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583.
Cite this article: Noor MNMM, Badauraudine MFM. Evaluating Small BERT Based Models on Automated Essay Scoring Task. Info Res Com. 2025;2(2):175-81.
A Systematic Review of Automated Grammar Checking in English Language

MADHVI SONI, Jabalpur Engineering College, India

JITENDRA SINGH THAKUR, Jabalpur Engineering College, India

Grammar checking is the task of detection and correction of grammatical errors in text. English is the dominating language in the field of science and technology. Therefore, non-native English speakers must be able to use correct English grammar while reading, writing or speaking. This generates the need for automatic grammar checking tools. So far many approaches have been proposed and implemented, but little effort has been made to survey the literature of the past decade. The objective of this systematic review is to examine the existing literature, highlighting the current issues and suggesting potential directions of future research. This systematic review is the result of an analysis of 12 primary studies obtained after designing a search strategy for selecting papers found on the web. We also present a possible scheme for the classification of grammar errors. Among the main observations, we found that there is a lack of efficient and robust grammar checking tools for real-time applications. We present several useful illustrations; most prominent are the schematic diagrams that we provide for each approach and a table that summarizes these approaches along different dimensions such as target error types, linguistic dataset used, and strengths and limitations of the approach. This facilitates better understandability, comparison and evaluation of previous research.
Keywords: Systematic review, Grammar checking, Classification of errors, Error detection, Automatic error correction.

Madhvi Soni and Jitendra Singh Thakur. 2018. A Systematic Review of Automated Grammar Checking in English Language. 1, 1 (April 2018), 23 pages.
1 INTRODUCTION

English is a West Germanic language which is the second most common language of the world. Over 600 million speakers use English as a second language (ESL) or English as a foreign language (EFL). While writing text in their second or foreign language, people might make errors. Therefore, it is essential to be able to detect these grammar errors and correct them as well. Grammar checking by a human becomes inconvenient at times, such as when human resources are limited, the size of the document is large, or the grammar checking is to be done on a regular basis. Therefore, it would be beneficial to automate the process of grammar checking. A grammar checking tool can provide automatic detection and correction of any faulty, unconventional or controversial usage of the underlying grammar.

The trend of developing such tools has evolved from the 1980s till now. The earliest grammar checking tools (e.g., Writer's Workbench [12]) were aimed at detecting punctuation errors and style errors. In the 1990s, many tools were made available in the form of commercialized software packages (e.g., RightWriter [15]). In recent decades, rapid development has been seen in this field. For example, Park et al [16] developed a grammar checker as a web application for university ESL students, Tschumi et al [21] developed a tool aimed at French native speakers writing in English, Naber developed a tool named LanguageTool [14] to detect a variety of English grammar errors, Brockett et al [3] presented error correction using machine translation, and Felice et al [6] presented a hybrid system.
Authors' addresses: Madhvi Soni, Jabalpur Engineering College, Department of Computer Science & Engineering, Jabalpur, M.P., 482011, India, madhvi.soni21@gmail.com; Jitendra Singh Thakur, Jabalpur Engineering College, Department of Computer Science & Engineering, Jabalpur, M.P., 482011, India, jsthakur@jecjabalpur.ac.in, jsthakur@iiitdmj.ac.in.

© 2018 Manuscript

Existing approaches are hard to compare since most of their tools are not available. Moreover, they are developed on different datasets and target the detection of different types of errors. Study and comparative analysis of previous literature is important for setting future research directions, yet very little effort has been put into surveying grammar checking approaches in the last decade. Therefore, we are highly motivated to review the existing literature to identify the related issues and concerns, and to present them in a single study to our research community.
This paper reports on a systematic review [9] that focuses on various approaches for automatic detection and correction of grammar errors in English text. While reviewing the literature, we have tried to summarize as many details as possible, explaining the complete step-by-step workflow of each approach along with its strengths and limitations (if any). Our intention is to provide a platform for comparing the existing approaches that will help in taking further research decisions. We have also searched the literature for various types of errors, but found that each researcher addresses a different set of errors. Thus, we identify major types of errors and suggest an error classification scheme based on five criteria. We explain these types of errors along with demonstrative examples. To the best of our knowledge, our study is the first one of its kind.

The paper is organized into the following sections: Section II presents the method of performing the systematic review. This section describes our research questions, search strategy, paper selection criteria and method of data extraction from the selected papers. Section III presents our suggested scheme to classify various English grammar errors. Section IV presents the classification of grammar checking techniques. Section V presents a detailed review of various approaches whose results are significant in this field. Finally, Section VI concludes our paper and suggests some directions for further research.
2 SYSTEMATIC REVIEW METHOD

A systematic literature review is a well-planned procedure to search, identify, extract from, analyze, evaluate and interpret the existing literature relevant to a particular research interest [26],[9]. A systematic review is different from a conventional review as it summarizes the existing work in a more complete and unbiased manner [9]. Systematic reviews are undertaken to sum up the existing approaches, identify their limitations, suggest further research directions, and provide a background for new research actions [9].

We report a systematic review on grammar checking in the English language. As per the recommended guidelines [9], we have adopted five necessary steps to carry out this review. In the first step, we formulate the research questions that will be addressed by this systematic review. In the second step, we design a strategy to search for the research papers online. The third step defines the paper selection criteria to identify relevant works. The fourth step is extraction of data from primary studies and finally, in the last step, we examine the data.
2.1 Research Questions:

RQ1 What are the different types of errors in English grammar?
RQ2 How can we classify them? Is there a classification scheme in the literature?
RQ3 What are the various techniques of grammar checking?
RQ4 What are the strengths and limitations of these techniques?
RQ5 What are the existing approaches of grammar checking? What are the methods they use?
RQ6 Is there any experiment conducted by the authors to evaluate the performance of the approach?
RQ7 If yes, what results have been obtained?
RQ8 What types of errors are detected and corrected by these approaches?
RQ9 How far are these approaches able to correctly identify the errors?
RQ10 Is there any tool support available?
2.2 Search Strategy:

Our search strategy starts by defining a query string. To form the string, we identified three groups of search terms: population terms, intervention terms and outcome terms.

• Population terms: These are the keywords that represent the domain of research (e.g., grammar checking, grammar correction, English grammar errors, types of errors, error classification, and ESL errors).
• Intervention terms: These are the keywords that represent the techniques applied to the population to achieve an objective (e.g., automatic detection, detect, detecting, automatic correction, correct, correcting and identification).
• Outcome terms: These are the related factors of importance (e.g., better, faster, efficient and improved performance).

We performed an exhaustive search on Google Scholar to identify the papers to be reviewed. Since the search resulted in a large number of papers, it was necessary to identify only the useful papers that can answer our specific research questions. Thus, we applied inclusion/exclusion criteria to select papers that can serve as primary studies in this systematic review.
2.3 Inclusion/exclusion criteria:

Our inclusion/exclusion criteria are completely based on our previously defined research questions. For each paper, we read the paper's title and abstract to identify the relevant papers. Furthermore, the full text was read to take the final decision. The following points were considered while deciding on the selection of primary studies:

• Papers irrelevant to the task of grammar checking are excluded.
• Papers proposing grammar checking for languages other than English are completely ignored.
• Papers describing types of errors made by native speakers of a specific language (e.g., errors made by only Arab writers) were excluded.
• Papers that do not provide sufficient technical information about their approach were excluded (e.g., [13]).
• In the case of approaches that participated in a shared task (CoNLL-2013 and 2014), we include only the best performing approach.

After the electronic search, a total of 113 papers were identified to investigate. 35 duplicates were eliminated and 36 papers were eliminated in the first round by reading the abstract and introduction, leaving 42 papers for further investigation. After reading the full text, 29 papers were eliminated, and finally 1 more was eliminated [13] due to lack of implementation details. Thus, we identified 12 primary studies.
2.4 Data Extraction:

For data extraction, we used a tabular format where each primary study is reviewed under table headings such as name of the approach, technique used, steps involved in the approach, types of errors addressed by the approach, experiments conducted by the authors (if any), dataset used in the experiment, outcomes of the experiment, name of the software tool designed (if any), and strengths and shortcomings of the approach (if any). Later, the content of this table is used to write a detailed review of each primary study.
3 TYPES OF ERRORS

This section addresses our research questions RQ1 and RQ2. Before actual implementation of any grammar checking approach, it is important to identify major types of errors and their classification on the basis of some criteria. For example, some researchers have classified the errors in the corpus based on whether they are automatically detectable or need human assistance. Naber [14] classifies various errors into four types, namely spelling errors, style errors, grammar (syntax) errors and semantic errors. Wagner et al [22] report four types of errors, namely agreement errors, real-word spelling errors (contextual errors), missing word errors and extra word errors. Lee et al [11] report two types of errors, namely syntax errors and semantic errors. Z. Yuan in her doctoral thesis [25] states five types of errors, namely lexical errors, syntactic errors, semantic errors, discourse errors and pragmatic errors. Other than this, there is no general classification of grammar errors to the best of our knowledge. However, an overview of major types of errors can be found in many web articles. Thus, we are highly motivated to suggest an error classification scheme. Please see figures 2 and 3 for a comparison of our scheme with previous schemes.
We have considered the following points while designing our suggested classification scheme.
• Frequency of error: More frequent errors should be kept in separate groups. For instance, five types of syntax errors are the most frequent errors that occur in ESL text [17], so they are classified into separate groups. Similarly, spelling and punctuation errors are also very common. See figure 1(a).

• Validity of text: Errors should be separated on the basis of how they make the text invalid. For instance, a syntax error invalidates a text due to violation of grammar rules. Similarly, a sentence structure error invalidates a sentence due to violation of sentence structuring rules [7], and a spelling error invalidates a word if it violates language orthography. See figure 1(b).

• Level of an error: Some errors are detected at sentence level while others can be detected at word level, i.e., taking two or three words. For instance, there is no need to check a complete sentence to detect spelling errors. Similarly, checking the words before and after a preposition would be sufficient to detect a preposition error, while fragments can be detected using the parse tree pattern of a complete sentence. See figure 1(c).

• Nature of error: The errors that are more annoying and difficult to detect should be separated from simpler ones. For instance, a spelling error is rather formal and can easily be detected using a spell checker, while detection of a semantic error requires real-world knowledge.

• Error type overlap: The error types in the classification scheme are overlapping. This cannot be completely avoided, but we have tried to minimize it. For example, a run-on sentence can also be a punctuation error, and a missing preposition error can also be a sentence structure error.
Again considering the frequency, nature and validity, we kept punctuation errors in a separate class. Trying to minimize the overlap, we arrived at the final classification shown in figure 3.
Fig. 1. Classification of errors based on (a) frequency, (b) validity, (c) level and (d) combining (a), (b) and (c).
Fig. 2. Classification schemes given by (a) Naber [14], (b) Lee et al [11], (c) Wagner et al [22], (d) Z. Yuan [25].
Here, we describe our suggested classification of errors. We give erroneous sentences for each type of error, with their corrections given in brackets. All the examples have been taken from [23].
Fig. 3. Our Suggested Scheme for Classification of Errors

(1) Sentence Structure Error: Sentence structure refers to the organization of different POS components within a sentence to give it meaning. Structuring has a high impact on a sentence's readability. Hornby [7] has formulated 25 patterns of English sentences. If none of those patterns is found, the sentence can be considered ill-formed or erroneous. Such an ill-formed sentence can further be classified as a fragment or a run-on. A fragment is an incomplete sentence in which either the subject or the verb is missing, or a sentence having a dependent clause without the main clause [24]. A run-on sentence is two independent clauses missing punctuation or a necessary conjunction between them, which affects the readability of the text. Sentence structure errors may contain other types of errors within them. Examples 1 and 2 are correctly constructed, while examples 3 to 7 are erroneous. Examples 4, 5 and 6 are fragments, while example 7 is a run-on.

Example 1- She began singing. (S-V-Gerund)
Example 2- She wants to go. (S-V-to-infinitive)
Example 3- She began to singing. (Misplaced 'to' or '-ing')
Example 4- Wants to go. (Subject is missing)
Example 5- A fair little girl under a tree. (Verb is missing)
Example 6- Because he is ill. ('because' makes it a dependent clause, main clause is missing)
Example 7- I ran fast missed the train. (Conjunction 'but' is missing)
(2) Punctuation Error: Punctuation marks like the comma, semi-colon, full stop etc. are used to separate sentence elements. A missing punctuation mark or an unnecessary one can alter the meaning of the sentence. Hence, it is important to detect and correct punctuation errors in English text.

Example 8- He lost lands money reputation and friends. (lands, money, reputation and friends)
Example 9- Alas she is dead ! (Alas ! She is dead.)
Example 10- How are you? Mohan? (How are you, Mohan?)
Example 11- Exactly so, said Alice. ("Exactly so,")
(3) Spelling Error: A spelling error is the generation of a meaningless string of characters. A common reason for such errors is typing mistakes made by the writers. These are the most common error types and can be found easily by any spell or grammar checking tool. Generally these tools have a list of known words; any word outside this list is considered a spelling error.

Example 12- Death lays his icey hand on kings. (icy)
Example 13- Many are called, but few are choosen. (chosen)
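A word-list spell check of this kind takes only a few lines; in this minimal sketch, the tiny lexicon is a toy stand-in for a real dictionary:

# Sketch of word-list spell checking: any token outside the known-word list is
# flagged. The tiny lexicon is a toy stand-in for a real dictionary.
KNOWN_WORDS = {"death", "lays", "his", "icy", "hand", "on", "kings"}

def spelling_errors(sentence):
    tokens = sentence.lower().rstrip(".").split()
    return [t for t in tokens if t not in KNOWN_WORDS]

print(spelling_errors("Death lays his icey hand on kings."))  # ['icey']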
(4) Syntax Error: Any error violating the English grammar rules is called a syntax error. Syntax errors can be of many types depending upon the inherent relationship between the words of a sentence. Most grammar checkers aim at detecting various types of syntax errors. Syntax errors can be subdivided into five subtypes:

(a) Subject-Verb Agreement Error: A sentence written in English must have agreement between subject and verb in terms of person and number. This agreement is shown in examples 14 and 15.

Example 14- He is not to blame. (subject 'he' (third person singular); verb 'is' (third person singular))
Example 15- They are not on good terms. (subject 'they' (third person plural); verb 'are' (third person plural))
(b) Article or Determiner Error: This type of error occurs either when an article or determiner is missing in the sentence or when a wrong article or determiner is used.

Example 16- Book you want is out of print. (The book)
Example 17- He returned after a hour. (an hour)
(c) Noun Number Error: In English, uncountable or mass nouns do not have plurals. So a noun number error occurs when a plural form of an uncountable noun is used in the text.

Example 18- He paid a sum of money for the informations. (information)
Example 19- The sceneries here are very good. (The scenery here is very good.)
(d) Verb Tense or Verb Form Error: Verb tense or verb form conveys the time and state of the idea or event. This type of error occurs when a writer uses a different tense or form of verb from the intended one.

Example 20- It is raining since yesterday. (has been raining) ('since' gives the idea that the event started in the past and is still continuing)
Example 21- She leaves school last year. (left) ('last year' indicates a finished event of the past)
Example 22- The boys are play hockey. (playing) (the event is currently happening, so the -ing form of the verb is required)
(e) Preposition Error: Prepositions are the words preceding a noun or pronoun, used to express a relation to another element in the clause. In the literature, preposition errors are addressed separately because of the fact that they are difficult to master.

Example 23- He sat a stool. (He sat on a stool.)
Example 24- He has recovered of his illness. (from his illness)
(5) Semantic Error: Errors that do not violate English grammar rules but make the sentence senseless or absurd are called semantic errors. A semantic error can be a contextual error [2] or a wrong word choice error. When a wrongly typed word is a real word in the language, it is not detected as a spelling error, yet it does not fit in the given context; such errors are called contextual errors. A wrong word choice error is the use of a rare word (possibly due to limited knowledge of vocabulary) which is often not used in the given context. Examples 25 and 26 are contextual errors, while 27 and 28 are word choice errors.

Example 25- Our team is better then theirs. ('then' is not a spelling mistake, but the context gives an idea of comparison, indicating the correct word as 'than')
Example 26- The jury were divided in there opinions. (their opinions)
Example 27- A group of cattle is passing. (A herd of cattle)
Example 28- I am going to the library to buy a book. (use 'bookstore' instead of 'library')
4 CLASSIFICATION OF TECHNIQUES

This section will address our research questions RQ3 and RQ4. See figure 4. There are three main techniques of grammar checking:
4.1 Rule based technique:

The classical approach to grammar checking is to manually design grammar rules, as shown in [14]. These high-quality rules are designed by linguistic experts. An English text tagged with parts of speech (henceforth POS) is checked against the defined set of rules, and a matching rule is applied to correct any error. The technique appears simple, as it is easy to add, edit or remove a rule; however, writing rules needs extensive knowledge of the underlying language's grammar. Rule-based systems can provide detailed explanations of flagged errors, which makes them extremely helpful for computer-aided language learning. But manual maintenance of hundreds of grammar rules is quite tedious.
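A minimal sketch of such a rule, in the spirit of (though much simpler than) LanguageTool's XML rules; the single POS pattern is illustrative, and NLTK's punkt and averaged_perceptron_tagger resources are assumed to be installed:

# Sketch of a rule-based check: a hand-written pattern over POS tags plus an
# explanation message. Requires NLTK's "punkt" and
# "averaged_perceptron_tagger" resources.
import nltk

def check(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    errors = []
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        # Rule: the article "a"/"an" must not precede a plural noun (NNS).
        if w1.lower() in ("a", "an") and t2 == "NNS":
            errors.append(f"'{w1} {w2}': article '{w1}' before a plural noun")
    return errors

print(check("She bought a books yesterday."))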
4.2 Machine Learning based technique:

Machine learning is currently the most popular technique for grammar checking. Methods that use supervised learning provide the best results [19]. These methods use an annotated corpus, which in turn is used to perform statistical analysis on the text to automatically detect and correct grammar errors. Unlike rule-based systems, it is difficult to explain the errors reported by these systems. Machine learning based systems do not require extensive knowledge of the grammar, since they are completely dependent on the underlying corpus. Non-availability of a large annotated corpus hinders the application of such techniques for grammar checking. Also, the results greatly depend on how clean the corpus is.
4.3 Hybrid technique:

A combination of machine learning and rule-based techniques can be utilized to improve the performance of the system. Some errors are better solved by the rule-based technique (e.g., use of 'a' or 'an') and some are better solved by machine learning (e.g., determiner errors), so each part of the hybrid technique should be implemented according to its 'competence' [19]. As experimented in [4], a corpus of text can be used to train the system to identify correct patterns of sentences, and the results can be filtered by applying hand-crafted rules. The hybrid technique is helpful in addressing a wide range of complex errors. Also, the tedious job of writing so many rules can be reduced to a great extent.
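A minimal sketch of this division of labour; ml_flags is a hypothetical stand-in for a trained statistical detector, and the letter-based vowel test is deliberately crude (it mishandles words such as "hour"):

# Sketch of a hybrid checker: a statistical component proposes candidate flags
# while a deterministic rule covers a case rules solve best ("a" vs "an").
# ml_flags is a hypothetical stand-in for a trained classifier.
VOWELS = set("aeiou")

def rule_a_an(tokens):
    flags = []
    for prev, word in zip(tokens, tokens[1:]):
        if prev.lower() == "a" and word[0].lower() in VOWELS:
            flags.append(f"use 'an' before '{word}'")
        elif prev.lower() == "an" and word[0].lower() not in VOWELS:
            flags.append(f"use 'a' before '{word}'")
    return flags

def hybrid_check(sentence, ml_flags):
    tokens = sentence.split()
    return ml_flags(tokens) + rule_a_an(tokens)

print(hybrid_check("She ate a apple", lambda tokens: []))  # ["use 'an' before 'apple'"]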
Fig. 4. Classification of Grammar Checking Techniques

5 LITERATURE REVIEW

In this section, we present our study of the various approaches that we have selected as our primary studies. For each primary study, we explain the approach, give a graphical representation of it, discuss the types of errors that can be detected or corrected by it, discuss the experiments and results presented by the authors in their respective papers, and discuss the strengths and limitations of the approach. This section will address RQ5, RQ6 and RQ10. RQ7 and RQ8 will be addressed by table 1 and table 2.
5.1 English Grammar Checker (1997):

Park et al [16] developed a web interface named English Grammar Checker, which aims at detecting grammatical errors commonly made by university students. The approach utilizes Combinatory Categorial Grammar (CCG) to derive the syntax information of a sentence in a categorical lexicon. Each categorical lexicon is a collection of lexical entries. An entry is a kind of rule which defines acceptable categories of words that are local to a given word. For example, for the article 'a', the entry would describe that after the article 'a', the category NP (third person singular) is expected, and further a category VP (compatible with NP) can be expected to form a sentence. If a sentence derivation violates such a rule, an associated error message is displayed. The authors have tested the approach for identifying errors made by students of University of X in their English essays. See figure 5.

This is a purely syntactic approach, where grammar errors concerning the wrong syntax of a sentence can be detected. A sentence is rejected if its derivation is not acceptable; to accept a sentence, a new entry is simply added to the lexicon. The approach is able to detect spelling errors, article or determiner errors, agreement errors, missing or extra elements and verb tense errors. All other types of errors, such as wrong word choice errors, preposition errors and run-ons, could not be detected. Also, it reported some level of misdiagnosis. The interface is currently not available on the web.
Fig. 5. Schematic Diagram of English Grammar Checker [16]
5.2 Island processing based Approach (1997):

Tschumi et al [21] developed an English grammar checking tool for native speakers of French using a method called island processing. The tool works in four steps. In the first step, the input text is broken into sentences and words, and the words are assigned a syntax category (POS tag). In the second step, a set of finite state automata is used to identify noun phrases, verb phrases and prepositional phrases as the important islands in the sentence. Depending upon the type of noun or preposition, they are assigned specific features and stored in registers. The third step calls the error detection automata, which match the word features to decide on an error and suggest a correction. The authors compared their prototype with other commercial grammar checkers and reportedly performed better. However, they did not discuss the data which they used for the comparison. See figure 6.
Fig. 6. Schematic Diagram of Island Processing based Approach [21]
The proposed prototype uses a scaled-down version of a full dictionary, which consists of words along with their syntactic categories, to assign POS tags instead of using a parser; this saves time when parsing an ill-formed sentence. The island processing method also lowers the processing time. To reduce overflagging, an error must be correctly identified; for this purpose the tool provides a user interaction module that asks questions to the user, and a problem word highlighter explaining problematic word usage. Though the method successfully reduces the overflagging of errors, it fails at automatic correction. The tool explains an error, suggests possible corrections and often asks questions to the user, which can seem annoying. The tool is not available online.
5.3 LanguageTool (2003):

Naber [14] proposed LanguageTool, an open-source English grammar checker based on the traditional rule-based approach. The method splits the text into chunks and POS-tags all the words. The task of spell checking is done by the Snakespell Python module integrated with the system. It uses the probabilistic tagger Qtag with a rule-based extension for POS tagging, and a rule-based chunker for chunking the text into phrases. Next, the manually designed XML-based rules are applied to detect errors in the text. These rules define erroneous patterns of POS tags. When applied, each rule matches the tag pattern given in the rule with the tag pattern present in the text. If a match occurs, an error is detected and the system provides explanation messages and example sentences. See figure 7.
Fig. 7. Schematic Diagram of LanguageTool [14]
The author did not discuss the data on which the tool was tested. LanguageTool is a very precise grammar checker available on the web. It can be used as a standalone web application and can also be integrated with a text editor. It supports more than 20 languages with different numbers of rules; for English, it has 1614 XML rules. The rule set can be extended by simply adding a rule to the XML file. The obvious drawback of such a system is the complex and time-consuming task of rule development. Also, the large number of rules needed to cover the majority of errors results in low recall.
5.4 Arboretum (2004):

Bender et al [1] proposed Arboretum, a tool to correct English sentences based on rules called mal-rules. The authors classified mal-rules into three categories, namely syntactic construction mal-rules, lexical mal-rules and mal-lexical entries. The rules are then used to map a correct string from an incorrect one using a best-first generation method which they named 'aligned generation'. Aligned generation generates a sentence that closely matches the structure and lexical yield of some reference sentence. Priorities are assigned to the generation tasks and, working through the priorities, the first complete tree found by the generator is considered the closest to the reference parse. See figure 8. The system was tested on a sample of 221 items taken from the SST corpus. The authors report that the tool is able to generate the correct string in 80% of the cases in the experiment. However, the experimental dataset was small, with the aim of finding a few types of errors. The proposed strategy failed in some cases due to its inability to identify lexical entries and phrasal tasks. The tool is not available online, so it is not clear whether it supports automatic correction or not.
Fig. 8. Schematic Diagram of Arboretum [1]
5.5 SMT based approach (2006):

The approach proposed by Brockett et al [3] makes use of Statistical Machine Translation (SMT) to detect and correct grammar errors. Aiming at mass noun errors, the authors advocate translation of the whole erroneous phrase instead of individual words. A noisy channel model was used for error correction using the SMT technique. The work identifies 14 nouns that frequently occur erroneously in the CLEC corpus. The sentences containing these errors are used to create training data which can map an erroneous string to a correct one. See figure 9.
Fig. 9. Schematic Diagram of SMT based Approach [3]
The system was tested on 123 example sentences taken from English websites in China. During testing, the approach was able to correct 61.81% of mass noun errors. Errors like subject-verb agreement and punctuation errors were simply ignored. The system was not able to correct an error where a word is both a mass noun and a count noun; for example, the word 'paper' in the two phrases 'many paper' and 'five pieces of papers'. Also, the training data did not cover all the other types of grammar errors made by ESL learners. This system is not available online.
5.6 Maximum Entropy Classifier based approach (2007):

The approach proposed by Chodorow et al [4] aims at detecting preposition errors in a corpus of ESL text. For this task, a maximum entropy model is used, trained with prepositions along with a set of associated feature-value pairs (their context). The sentences are POS-tagged and chunked. 25 features were used to train the maximum entropy model, where each feature is associated with some values. The feature-value pairs having a very low frequency of occurrence were eliminated. The model is then tested on a different dataset. The model predicts the probability of each preposition in the given context and then compares it with the preposition used by the writer. The erroneous preposition is replaced with the most probable preposition. Subsequently, each context is classified into one of the 34 classes of prepositions. To solve the problem of detecting extra prepositions, the authors devised two rules: Rule 1 deals with repetition of the same preposition, where an error is detected when the same POS tag is used; Rule 2 deals with the wrong addition of a preposition between a plural noun and a quantifier. See figure 10.
Fig. 10. Schematic Diagram of Maximum Entropy Classifier based Approach [4]
|
||||
This approach uses a huge dataset for training and testing purpose. Training is done on 7 million prepositional contexts taken from MetaMetrics corpus and newspaper text and testing is done on 18157 prepositional contexts taken from a portion of Lexile text and 2000 contexts from ESL essays. This approach deliberately skips the contexts in the following cases- when there is a slight difference between the most probable and second most probable preposition, when adjacent words are misspelled, when there are comma errors, when the writer uses antonym of a preposition, and also in case when benefactives are used. Also, the rules for detecting extraneous preposition are insufficient to cover other types. No tool support is available for this approach.
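A maximum entropy classifier of this kind can be approximated with multinomial logistic regression. In the sketch below, the toy contexts, feature names and the probability-gap threshold are illustrative assumptions rather than the paper's actual 25-feature set; the skipping of near-ties mirrors the policy described above:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy contexts; a real system extracts surrounding words, POS tags and
# chunk heads from millions of prepositional contexts.
train = [
    ({"prev_word": "interested", "next_word": "music"}, "in"),
    ({"prev_word": "depends", "next_word": "you"}, "on"),
    ({"prev_word": "arrived", "next_word": "noon"}, "at"),
    ({"prev_word": "interested", "next_word": "art"}, "in"),
]
X, y = zip(*train)

# Multinomial logistic regression is the standard maximum-entropy classifier.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(list(X), list(y))

def check(context: dict, written: str, margin: float = 0.2):
    """Flag the writer's preposition only when the model is clearly confident."""
    probs = sorted(zip(model.classes_, model.predict_proba([context])[0]),
                   key=lambda p: -p[1])
    best, second = probs[0], probs[1]
    # Skip near-ties between the two most probable prepositions.
    if best[0] != written and best[1] - second[1] > margin:
        return f"suggest '{best[0]}' instead of '{written}'"
    return "no flag"

print(check({"prev_word": "interested", "next_word": "music"}, "on"))
```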
5.7 AIS based approach (2007):

Kumar et al. [10] proposed an approach to grammar checking inspired by the human immune system. Just as the immune system generates immune cells to detect antigens, a large corpus can be used to generate detectors for ungrammatical sentences. The detectors are the sentence constructs that do not appear in the corpus. A test sentence is taken to form bigrams, trigrams and tetragrams, which are tagged with extended POS tags. A sequence of tags that does not exist in the corpus is called a detector and is used to flag an error. Next, the detector is cloned to repair the ungrammatical construct into a correct one. The authors used a Real Valued Negative Selection Algorithm to generate detectors and to fine-tune the set of detectors so that they identify errors more quickly and accurately (see Fig. 11).
Fig. 11. Schematic Diagram of AIS based Approach [10]

This is a language-independent approach based on an Artificial Immune System (AIS), where the underlying corpus (Reuters-21578) mimics the human immune system. Any sentence construct outside the corpus is regarded as an error even if it is grammatically correct. The authors tested the system on sentences taken from a book of grammar errors; however, the size of the testing data and the results of the experiments are not discussed in their paper. The approach is able to identify eight types of errors, namely subject-verb agreement errors, wrong verb tense, adverb, adjective, article and pronoun errors, wrong noun number and missing verb errors. All other types of errors go undetected. The authors argue that these shortcomings can be addressed by extending the POS tag set. Still, the task of creating a corpus large enough to include all types of correct sentences seems practically infeasible. No tool support is available for this approach.
5.8 LSPs based approach (2007):

Sun et al. [20] proposed an approach that combines pattern discovery and machine learning to classify a sentence into two classes: correct and erroneous. To build this classification model, labeled sequential patterns (LSPs) are used as input features. The training data is POS-tagged, and frequently occurring patterns are discovered from both correct and erroneous sentences. Based on whether a pattern satisfies the given support and confidence constraints, it is labeled as erroneous or correct. Along with LSPs, other linguistic features like syntactic score, lexical collocation, function word density and perplexity are also used to detect different types of errors (see Fig. 12).

The method was implemented with SVM and Bayesian classifiers using the HEL, JLE and CLEC corpora. Different experiments were carried out to analyze and compare the results, and the authors found that the LSP feature performs better in every case. They also compared their method with two prototypes, and it outperformed both in terms of precision, recall and F-score. The method can detect various grammar errors, lexical collocation errors and wrong sentence structure errors. However, automatic correction of detected errors is not supported, and spelling errors are simply ignored. No tool support is available for this approach.
Fig. 12. Schematic diagram of LSP based Approach[20]
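A simplified sketch of the pattern-labeling step follows. The toy POS sequences and thresholds are illustrative, and patterns are labeled only for the 'erroneous' class here, whereas the full method labels both classes and mines variable-length patterns:

```python
from collections import Counter

# Toy POS-tagged corpora; a real system tags HEL/JLE/CLEC sentences.
erroneous = [["PRP", "VB", "NN"], ["PRP", "VB", "TO", "NN"]]
correct   = [["PRP", "VBZ", "TO", "NN"], ["PRP", "VBZ", "NN"]]

def count_patterns(sentences, n=2):
    """Count all length-n tag patterns in a set of sentences."""
    c = Counter()
    for tags in sentences:
        for i in range(len(tags) - n + 1):
            c[tuple(tags[i:i + n])] += 1
    return c

err_counts, ok_counts = count_patterns(erroneous), count_patterns(correct)

min_support, min_confidence = 2, 0.75
lsps = {}
for pat in set(err_counts) | set(ok_counts):
    support = err_counts[pat] + ok_counts[pat]
    confidence = err_counts[pat] / support  # confidence for the "erroneous" label
    if support >= min_support and confidence >= min_confidence:
        lsps[pat] = ("erroneous", confidence)

print(lsps)  # e.g. {('PRP', 'VB'): ('erroneous', 1.0)}
```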
5.9 Auto-Editing (2010):

Huang et al. [8] developed an online tool for automatic grammar error correction. The approach uses a manually created corpus of paired sentences collected from the website lang-8.com, where a pair consists of an erroneous sentence and its corrected counterpart. The corpus is used to derive sentence correction rules. A rule is represented as A→B, where A is a word pattern found in the erroneous sentence and B is the pattern found in its corrected counterpart. These patterns are identified by computing the edit distance at the word level: the edit (Levenshtein) distance is the minimum number of insert, delete or substitute operations required to transform the erroneous pattern into the correct one. This results in the generation of candidate rules; among these, a rule that is able to transform an erroneous sentence of the corpus into its correct form is applied, while the others are discarded (see Fig. 13).
Fig. 13. Schematic Diagram of Auto-Editing[8]
This approach is an application of pattern mining to English sentences, in which the rules are automatically derived from the corpus itself. The candidate rule set is refined using a condensing algorithm and by ranking the rules based on user feedback (frequently used rules are top ranked). Though it achieves better precision and recall when compared with ESL Assistant and the Microsoft Word 2007 grammar checker, it detects mostly spelling and phrasal errors and does not cover other types of grammar errors like run-on sentences. The demo webpage of Auto-Editing is currently not available.
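Word-level alignment of a sentence pair to extract A→B rule candidates can be sketched with the standard library; the example pair is illustrative of the kind collected from lang-8.com:

```python
from difflib import SequenceMatcher

def derive_rules(wrong: str, right: str):
    """Extract candidate correction rules A -> B from a sentence pair by
    aligning the two word sequences (word-level edit operations)."""
    a, b = wrong.split(), right.split()
    rules = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op in ("replace", "delete", "insert"):
            rules.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return rules

print(derive_rules("I am agree with you", "I agree with you"))
# -> [('am', '')], i.e. delete "am"; adding surrounding words yields a
# contextual rule such as "am agree" -> "agree".
```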
5.10 ASO based approach (2011):

Dahlmeier et al. [5] proposed grammar error correction using a linear classifier, aiming at the correction of article and preposition errors. The article or preposition and its context are treated as a feature vector, and the corrections are treated as the classes. A combination of learner and non-learner text is used for training. When training on learner text, the context of the article or preposition forms the feature vector and the correct class is provided by a human annotator; when training on non-learner text, the observed article or preposition is also added to the feature set and serves as the correct class. The classifier is trained using the Alternating Structure Optimization (ASO) algorithm, which learns the common structure of multiple related problems. This common structure is learned by creating auxiliary problems, which help in predicting a wrong article or determiner in the user text. The classifier is then trained on these auxiliary problems to classify articles into 3 classes and prepositions into 36 classes (see Fig. 14).
Fig. 14. Schematic Diagram of ASO based Approach[5]
Training was done on the NUCLE corpus and the Gigaword corpus, and testing was done on the Wall Street Journal. The results were compared with two baseline methods; the ASO model outperforms both, with an F1 measure of 19.29% for articles and 11.15% for prepositions. The authors also compared the ASO method with two commercial grammar checking tools, and its performance was far better than both. Still, there is large scope for improvement, as the precision and recall values remain low, and the problems of unidentified errors and false flags persist. No tool support is currently available for this approach.
5.11 UI System (2013):

This system was developed by Rozovskaya et al. [17] for the CoNLL-2013 shared task, which aims at the correction of five types of errors, namely article/determiner, preposition, noun number, subject-verb agreement and verb form errors. The University of Illinois (UI) system is a combination of five machine learning classifier models, where each model is specialized to correct a specific type of error. To correct article errors, an Averaged Perceptron (AP) model is used, trained on the NUCLE corpus using a rich set of features generated by a POS tagger and chunker. Artificial article errors were introduced into the NUCLE corpus to reduce error sparseness. To correct all other types of errors, a Naïve Bayes (NB) classifier is trained on the Google Web 1T 5-gram corpus using word n-gram features. Each individual model predicts the most probable word from its candidate set. The candidate sets for articles and prepositions are (a, the, φ) and the 12 most frequent prepositions, respectively; for noun, verb agreement and verb form errors, the candidate set includes the respective morphological variants. Finally, the results of the individual classifiers are combined, filtered for false alarms and then applied to correct the sentence (see Fig. 15).
Fig. 15. Schematic Diagram of UI System[17]
This system was later extended for the CoNLL-2014 shared task, where it was implemented to correct more types of errors and to address the correction of two or more related errors using a joint inference method [18]. Though the system performed best in the given task, the recall (8.81) and F1-score (14.84) are low for preposition errors. The developed system is not available online.
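The candidate-set mechanism can be sketched with a count-based scorer; the trigram counts below are a tiny stand-in for Google Web 1T 5-gram statistics, and the count-maximization shown simplifies the system's Naïve Bayes scoring:

```python
# Toy web-scale n-gram counts (stand-in for Google Web 1T 5-gram counts).
trigram_counts = {
    ("lot", "of", "information"): 9500,
    ("lot", "of", "informations"): 12,
}

def best_candidate(left_context, candidates):
    """Pick the candidate with the highest n-gram count in context, the way
    each per-error classifier proposes its most probable word."""
    return max(candidates,
               key=lambda w: trigram_counts.get((*left_context, w), 0))

# Noun-number candidate set = the word's morphological variants.
print(best_candidate(("lot", "of"), ["information", "informations"]))
# -> "information"
```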
5.12 Hybrid System (2014):

The hybrid system was developed by Felice et al. [6] for the CoNLL-2014 shared task; it combines a rule-based system and a statistical machine translation system in a pipeline. The rule-based module automatically derives rules from the Cambridge Learner Corpus (CLC) that detect erroneous unigrams, bigrams or trigrams and generate a list of candidate corrections. These candidates are ranked (most probable correction first) using a language model (LM) built from Microsoft's web n-grams. The results of the LM are pipelined into the SMT system, which was trained on multiple corpora including NUCLE v3.1, the 2014 shared task dataset, the IELTS dataset from the CLC corpus, the EVP corpus and the FCE corpus. The SMT model generates the 10 best correction candidates, which are further ranked by the language model. Next, unnecessary corrections (corrections with error types reordering, word acronym or run-ons) are filtered out and the best correction is applied to replace the input string (see Fig. 16).
Fig. 16. Schematic Diagram of Hybrid System[6]
The SMT system is built using well-established tools: Pialign for word alignment, IRSTLM to build the target language model and Moses for decoding. The system achieves the best values for precision, recall and F-score. It is best suited for correcting agreement errors, verb form errors, noun number, pronoun reference, punctuation, capitalization and spelling errors; however, it performs poorly on sentence fragments, run-ons, word reordering and collocation errors. The developed system is not available online.
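The rule-then-rank stage of such a pipeline can be sketched as below. The rule, the scoring dictionary and the example sentence are all hypothetical; a real pipeline would query an n-gram LM and feed the winner into the SMT decoder:

```python
def rule_candidates(sentence: str):
    """Hypothetical rule module: a singular article before a plural noun
    triggers candidate corrections."""
    if " a informations " in f" {sentence} ":
        return [sentence.replace("a informations", "information"),
                sentence.replace("a informations", "some information")]
    return [sentence]

def lm_score(sentence: str) -> float:
    """Stand-in scorer; a real system uses a language model built from
    large collections of web n-grams."""
    good = {"information": 2.0, "some information": 3.0}
    return sum(v for k, v in good.items() if k in sentence)

def pipeline(sentence: str) -> str:
    # Rank the rule module's candidates by LM score and keep the best.
    return max(rule_candidates(sentence), key=lm_score)

print(pipeline("She gave me a informations about the course"))
```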
Table 1. Errors detected by various Grammar Checking approaches

Columns: Approach | Sentence structure | Fragments | Run-ons | Spelling | Syntax errors | S-V Agreements | Vform, Vtense | Noun Number | Art or Det | Preposition | Punctuation | Semantic Errors | Contextual | Word choice

[16] ✓ ✓ ✓ ✓ ✓
[21] ✓ ✓ ✓ ✓ ✓ ✓
[14] ✓ ✓ ✓ ✓ ✓ ✓ ✓
[1] ✓ ✓ ✓ ✓
[3] ✓
[4] ✓
[10] ✓ ✓ ✓ ✓
[20] ✓ ✓ ✓ ✓ ✓ ✓
[8] ✓ ✓ ✓ ✓ ✓ ✓
[5] ✓ ✓
[17] ✓ ✓ ✓ ✓ ✓
[6] ✓ ✓ ✓ ✓ ✓ ✓ ✓
Table 2. Summary of Various Grammar Checking Approaches

[16] Target error types: wrong capitalization, agreement errors, Vform errors, missing fragments. Linguistic data used: essays written by university students. Results: not specified. Strengths: simple; customizable to identify frequent errors. Limitations: handmade rules; no automatic correction.

[21] Target error types: spelling errors, S-V agreement, Vtense, word choice errors, sentence errors, noun errors. Linguistic data used: self-created corpus of 27,000 words of text by French native speakers. Results: not specified. Strengths: reduced processing time for tagging; reduced error overflagging. Limitations: overflagging is still high (43.5%).

[14] Target error types: punctuation, syntax, semantic and style errors. Linguistic data used: mailing list error corpus of 224 sentences. Results: not specified. Strengths: simple; large set of rules; easy rule addition. Limitations: no automatic correction; difficult to manage a large number of handmade rules.

[1] Target error types: determiner errors, noun number errors, verb tense errors, word choice, Vform errors. Linguistic data used: SST corpus of 221 sentences. Results: success rate = 80%. Strengths: better error correction due to the best-first method used. Limitations: overflagging of errors; poor performance on S-V agreement errors, missing auxiliary, complement and Vform errors.

[3] Target error types: mass noun errors. Linguistic data used: Reuters newswire articles, CLEC corpus, English sentences from Chinese websites. Results: success rate = 61.81%. Strengths: automatic correction. Limitations: unable to detect when a mass noun is also a count noun.

[4] Target error types: preposition errors. Linguistic data used: MetaMetrics corpus of 1100 & 1200 Lexile text, newspaper text, ESL essays by Chinese, Japanese and Russian learners. Results: precision = 0.8, recall = 0.304. Strengths: provides better results due to the hybrid approach. Limitations: insufficient number of rules; low recall; many errors were deliberately skipped.

[10] Target error types: S-V agreement, article, noun number, verb errors, wrong adjective, adverb or pronoun. Linguistic data used: Reuters-21578 corpus, sentences from the book Avoid Errors by A.K. Misra. Results: not specified. Strengths: language-independent method; quicker response through frequent detectors. Limitations: any pattern outside the corpus is flagged as an error even if it is correct.

[20] Target error types: syntax errors, word choice errors, sentence structure errors. Linguistic data used: Hiroshima English Learners corpus, Japanese Learners of English corpus & Chinese Learner Error corpus. Results: accuracy = 81.3, precision = 83.09, recall = 81.24, F-score = 81.25. Strengths: good feature set; provides better error detection. Limitations: does not detect spelling errors.

[8] Target error types: spelling, phrases, S-V agreement, punctuation, article, preposition, verb tense, gerund misuse & other POS errors. Linguistic data used: self-created corpus of incorrect and correct sentences collected from lang-8.com. Results: success rate = 67.2%, precision = 40.16%, recall = 20.28%. Strengths: automatic rule generation. Limitations: most of the detected errors are spelling or phrasal errors.

[5] Target error types: article and preposition errors. Linguistic data used: NUCLE corpus, Gigaword corpus, section 23 of the Wall Street Journal. Results: F1 = 19.29% (articles), F1 = 11.15% (prepositions). Strengths: supports automatic correction; performance is better than commercial tools. Limitations: false flags; unidentified errors; low recall & precision.

[17] Target error types: noun number, agreement errors, Vform, ArtOrDet, preposition errors. Linguistic data used: NUCLE, Google Web 1T 5-gram corpus. Results: precision = 62.19, recall = 31.87, F1 score = 42.14. Strengths: supports automatic correction; best performance on the target error types. Limitations: low recall for prepositions; inconsistent predictions due to globally interacting errors.

[6] Target error types: 28 types of errors in total [6]. Linguistic data used: NUCLE, CLC, FCE, EVP & CoNLL-2014 task dataset. Results: precision = 46.70, recall = 34.30, F0.5 score = 43.55. Strengths: supports automatic correction; best performance on punctuation, spelling, capitalization, noun number, Vform and agreement errors. Limitations: cannot handle fragments, run-ons, acronyms, idioms, word reordering & collocation errors.
6 CONCLUSIONS AND FUTURE RESEARCH
Grammar checking is a major part of Natural Language Processing (NLP), whose applications range from proofreading to language learning. Much work has been done on the development of grammar checking tools in the past decade; however, fewer efforts have been made to survey the existing literature. We therefore present a comprehensive study of English grammar checking techniques, highlighting the capabilities and challenges associated with them. We systematically selected, examined and reviewed 12 approaches to grammar checking, which can be classified into three categories: (1) rule-based techniques, (2) machine learning based techniques, and (3) hybrid techniques. Each technique has its own advantages and limitations. Rule-based techniques are best suited for language learning, but rule designing is a laborious task. Machine learning alleviates this labor but depends on the size and type of the corpus used. A hybrid technique combines the best of both, but each part of the hybrid should be implemented according to its suitability.
In this paper, we have also presented an error classification scheme which identifies five types of errors, namely sentence structure errors, punctuation errors, spelling errors, syntax errors, and semantic errors; these errors are further subcategorized. This classification scheme helps researchers and developers in the following ways: (1) identifying the most frequent errors tells what type of errors must be targeted for correction, (2) identifying the level of an error tells what length of text should be examined to detect it, (3) identifying the cause of invalid text helps in finding a solution to write valid text. This simplifies the task of grammar checking.
Based on our detailed review of the various approaches, our observations are as follows: (1) no existing approach is able to detect all types of errors efficiently; (2) most of the tools are not available for research or public use; (3) all approaches use different experimental data, so it is hard to compare their results; (4) most approaches have addressed syntax errors and their subtypes, while very few efforts have been made to detect errors at the sentence and semantic levels; (5) detection and correction of run-on sentences is yet another untouched research area; (6) no tools are suitable for real-time applications like proofreading of technical papers, language tutoring or writing assistance; (7) our research question RQ9 remains unanswered, since we could not check the results for individual error types against gold standards; (8) although the performance of tools has improved gradually with time, there is much scope for further improvement.
Based on our observations, we suggest the following emerging research directions:
Classification of errors: As noted in our study, there is a lack (or even absence) of a general classification scheme to identify types of errors. We motivate further research to suggest more classification schemes. This will simplify the task of grammar checking by identifying how to handle a particular type of error.
Evaluation on standard test data: Since all the previous approaches have been evaluated on different test sets, it is difficult to compare their performance. A standard test set of erroneous sentences with well-defined correct forms would help in establishing which systems are more efficient and robust.
Analysis based on types of errors: All the previous approaches deal with a different set of error types to be corrected. An annotated corpus that labels erroneous sentences into one of the five types and their subtypes, followed by a study of the performance of the various approaches on each of these types, would be helpful in identifying the best method for handling a specific error type.
Coverage of different types of errors: From the data in Table 2, we observe that current approaches are limited in handling all types of errors, specifically sentence structure errors and semantic errors. Future work may focus on these areas.
REFERENCES
[1] Emily M Bender, Dan Flickinger, Stephan Oepen, Annemarie Walsh, and Timothy Baldwin. 2004. Arboretum: Using a precision grammar for grammar checking in CALL. In InSTIL/ICALL Symposium 2004.
[2] Johnny Bigert. 2004. Probabilistic Detection of Context-Sensitive Spelling Errors. In LREC.
[3] Chris Brockett, William B Dolan, and Michael Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 249–256.
[4] Martin Chodorow, Joel R Tetreault, and Na-Rae Han. 2007. Detection of grammatical errors involving prepositions. In Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions. Association for Computational Linguistics, 25–30.
[5] Daniel Dahlmeier and Hwee Tou Ng. 2011. Grammatical error correction with alternating structure optimization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 915–923.
[6] Mariano Felice, Zheng Yuan, Øistein E Andersen, Helen Yannakoudakis, and Ekaterina Kochmar. 2014. Grammatical error correction using hybrid systems and type filtering. In CoNLL Shared Task. 15–24.
[7] AS Hornby. 1995. Guide to Patterns and Usage in English.
[8] An-Ta Huang, Tsung-Ting Kuo, Ying-Chun Lai, and Shou-De Lin. 2010. Discovering Correction Rules for Auto Editing. Computational Linguistics and Chinese Language Processing 15, 3-4 (2010), 219–236.
[9] Staffs Keele et al. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Report, Ver. 2.3, EBSE.
[10] Akshat Kumar and Shivashankar Nair. 2007. An artificial immune system based approach for English grammar checking. Artificial Immune Systems (2007), 348–357.
[11] John Lee and Stephanie Seneff. 2008. Correcting Misuse of Verb Forms. In ACL. 174–182.
[12] Nina H Macdonald. 1983. Human factors and behavioral science: The UNIX™ Writer's Workbench software: Rationale and design. Bell Labs Technical Journal 62, 6 (1983), 1891–1908.
[13] Maxim Mozgovoy. 2011. Dependency-based rules for grammar checking with LanguageTool. In Computer Science and Information Systems (FedCSIS), 2011 Federated Conference on. IEEE, 209–212.
[14] Daniel Naber. 2003. A rule-based style and grammar checker. (2003).
[15] Michael Neuman. 1991. RightWriter 3.1. (1991).
[16] Jong C Park, Martha Stone Palmer, and Clay Washburn. 1997. An English Grammar Checker as a Writing Aid for Students of English as a Second Language. In ANLP. 24.
[17] Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, and Dan Roth. 2013. The University of Illinois system in the CoNLL-2013 shared task. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task. 13–19.
[18] Alla Rozovskaya, Kai-Wei Chang, Mark Sammons, Dan Roth, and Nizar Habash. 2014. The Illinois-Columbia System in the CoNLL-2014 Shared Task. In CoNLL Shared Task. 34–42.
[19] Grigori Sidorov, Anubhav Gupta, Martin Tozer, Dolors Catala, Angels Catena, and Sandrine Fuentes. 2013. Rule-based System for Automatic Grammar Correction Using Syntactic N-grams for English Language Learning (L2). In CoNLL Shared Task. 96–101.
[20] Guihua Sun, Xiaohua Liu, Gao Cong, Ming Zhou, Zhongyang Xiong, John Lee, and Chin-Yew Lin. 2007. Detecting erroneous sentences using automatically mined sequential patterns. In ACL. 23–30.
[21] Cornelia Tschichold, Franck Bodmer, Etienne Cornu, François Grosjean, Lysiane Grosjean, N Kübler, N Léwy, and Corinne Tschumi. 1997. Developing a new grammar checker for English as a second language. Proc. From Research to Commercial Applications: Making NLP Work in Practice (1997), 7–12.
[22] Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2007. A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors. Association for Computational Linguistics.
[23] PC Wren and H Martin. 2000. English Grammar & Composition. S. Chand & Company Ltd (2000).
[24] Chak Yan Yeung and John Lee. 2015. Automatic Detection of Sentence Fragments. In ACL (2). 599–603.
[25] Zheng Yuan. 2017. Grammatical error correction in non-native English. Technical Report, University of Cambridge, Computer Laboratory.
[26] Tao Yue, Lionel C Briand, and Yvan Labiche. 2011. A systematic review of transformation approaches between user requirements and analysis models. Requirements Engineering 16, 2 (2011), 75–99.
1
storage/WN37MWI5/.zotero-reader-state
Normal file
1
storage/WN37MWI5/.zotero-reader-state
Normal file
@@ -0,0 +1 @@
{"pageIndex":20,"scale":"page-width","top":786,"left":-7,"scrollMode":0,"spreadMode":0}
Binary file not shown.
233
storage/WR5LLUBH/.zotero-ft-cache
Normal file
233
storage/WR5LLUBH/.zotero-ft-cache
Normal file
@@ -0,0 +1,233 @@
Automated essay scoring with SBERT embeddings and LSTM-Attention networks
Yuzhe Nie
School of Foreign Languages, Shanghai University, Shanghai, China
ABSTRACT
Automated essay scoring (AES) is essential in the field of educational technology, providing rapid and accurate evaluations of student writing. This study presents an innovative AES method that integrates Sentence-BERT (SBERT) with Long Short-Term Memory (LSTM) networks and attention mechanisms to improve the scoring process. SBERT generates embedding vectors for each essay, which are subsequently analyzed using a bidirectional LSTM (BiLSTM) to learn the features of these embedding vectors. An attention layer is introduced to enable the system to prioritize the most significant components of the essay. Evaluated on a benchmark dataset, our approach shows significant improvements in scoring accuracy, highlighting its ability to improve the reliability and efficiency of automated assessment systems.
Subjects: Artificial Intelligence, Data Mining and Machine Learning, Natural Language and Speech, Neural Networks
Keywords: Automated essay scoring, NLP, Deep learning, BERT, LSTM
Submitted 7 November 2024; Accepted 5 December 2024; Published 11 February 2025
Corresponding author: Yuzhe Nie, nieyuzhe1900@163.com
Academic editor: Trang Do
Additional Information and Declarations can be found on page 12
DOI 10.7717/peerj-cs.2634
Copyright 2025 Nie. Distributed under Creative Commons CC-BY 4.0
How to cite this article: Nie Y. 2025. Automated essay scoring with SBERT embeddings and LSTM-Attention networks. PeerJ Comput. Sci. 11:e2634 DOI 10.7717/peerj-cs.2634

INTRODUCTION

Automated essay scoring (AES) systems utilize computer algorithms to assess and grade essays through the analysis of their textual content. The systems typically comprise two primary components: a feature extraction module that collects linguistic data, including word count, grammar, and syntactic complexity, and a scoring module that evaluates and assigns grades based on these features. AES models demonstrate an ability to provide scores that frequently match closely with human evaluations.

The development of AES can be traced back to 1966, when Page (1966) introduced the Project Essay Grader (PEG). This innovative statistical method established a connection between the surface characteristics of writing, like word length and sentence complexity, and the scores assigned by human evaluators. Initial AES systems mainly depended on features and scoring criteria that were manually developed to replicate human evaluation. Palmer, Williams & Dreher (2002) presented the application of Latent Semantic Analysis (LSA) for evaluating essay content through the measurement of semantic similarity among words in a text, thereby enhancing the content evaluation dimension of AES. Despite the advancements made, Reilly (2013) highlighted concerns regarding possible biases in AES, especially within the context of Massive Open Online Courses (MOOCs), where differences between machine and human grading were noted. Later developments in AES involved the use of regression models by Alikaniotis, Yannakoudakis & Rei (2016) to evaluate essays by analyzing linguistic features linked to essay quality. Taghipour & Ng (2016) illustrated how deep learning models, particularly neural networks, can more effectively capture the complexities of essay quality. In a recent study, Cozma, Butnaru & Ionescu (2018) investigated the combination of character n-grams and word embeddings, demonstrating enhanced performance relative to previous methods. With the rise of neural network architectures and natural language processing methods (Wu et al., 2023; Gu et al., 2023; Ding et al., 2023), the progress in data accessibility and computational frameworks has driven the development of AES (Li & Jianxing, 2024). Consequently, there has been increasing emphasis on enhancing scoring criteria to incorporate various aspects of writing quality, resulting in more comprehensive evaluations (Carlile et al., 2018).

Among the innovative techniques, Bidirectional Encoder Representations from Transformers (BERT) has emerged as a powerful tool for AES. Its capacity to recognize hidden contextual relationships within text has yielded superior performance compared to conventional models. For example, Wang et al. (2022) demonstrated the advantages of BERT in learning multi-scale essay representations, resulting in enhanced performance compared to models based on Long Short-Term Memory (LSTM). LSTM networks, known for their effectiveness with sequential data, have gained significant popularity in AES applications. Janda (2019) highlighted the effectiveness of LSTM models in monitoring semantic shifts across an essay, demonstrating their capability to capture gradual changes in meaning. Additionally, Attali & Burstein (2004) investigated the problem of essay length bias in automated essay scoring and discovered that LSTM-based models might mitigate these biases, leading to more reliable and equitable evaluations.

Another innovative direction in AES research involves the integration of coherence features into scoring models. Farag, Yannakoudakis & Briscoe (2018) emphasized that modeling the logical flow of ideas can improve the accuracy of essay evaluation by concentrating on the structural integrity of the text. This perspective is further supported by Uto, Xie & Ueno (2020), who proposed that the integration of handcrafted features with neural network methodologies could enhance scoring accuracy. Alongside neural networks, ensemble methods have become increasingly prominent in AES studies due to their capacity to combine various scoring features. Nadeem et al. (2019) introduced neural models that are sensitive to discourse, integrating various essay features to enhance scoring precision, which aligns with the increasing focus in automated essay scoring on multidimensional evaluation. This transition indicates a shift away from one-dimensional scoring towards a more comprehensive assessment, taking into account elements like coherence, argumentation, and content quality in addition to linguistic characteristics (Carlile et al., 2018). These developments represent an important step in improving the sophistication and reliability of AES systems.
MATERIALS AND METHODS
Data collection and preprocessing
For this study, we utilized the Automated Student Assessment Prize (ASAP) dataset (Hamner et al., 2012), a widely recognized benchmark for evaluating AES systems. This dataset was originally introduced as part of a shared task designed to compare AES system performance against human-assigned scores. The essays in the ASAP dataset were written by middle school students in the U.S., ranging from grades 7 to 10. The dataset consists of essays responding to eight distinct prompts, each characterized by unique language features such as varying levels of concreteness, open-endedness, and scoring scales (as outlined in Fig. 1). An overview of the dataset's structure is provided in Table 1.

To prepare the textual data for analysis, we first cleaned the raw text by removing non-alphabetic characters, except for certain punctuation marks necessary for preserving the integrity of sentence structure. Following this, the text was tokenized into individual words, and English stopwords were filtered out to reduce noise in the data. We then applied stemming to each token, transforming words to their root forms, which helps to standardize variations of words for better analysis in natural language processing tasks. The result was a preprocessed text string, optimized for further AES model development.

We divided the dataset into three subsets: training, validation, and test sets, in a 70:10:20 ratio. This resulted in 9,342 essays for training, 1,038 for validation, and 2,596 for testing.
This division ensured that the model had sufficient data to learn, validate its performance, and finally be tested on unseen essays for unbiased evaluation.

Figure 1 Score distribution of each essay set. (Full-size image: DOI 10.7717/peerj-cs.2634/fig-1)
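A minimal sketch of this preprocessing and splitting pipeline follows, using NLTK's Porter stemmer and stopword list; the exact cleaning regex and tooling are assumptions rather than details taken from the paper:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
# requires: nltk.download('punkt'); nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(essay: str) -> str:
    # Keep letters and sentence-final punctuation, drop other characters.
    text = re.sub(r"[^A-Za-z.!?\s]", " ", essay)
    tokens = nltk.word_tokenize(text.lower())
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

# 70:10:20 split: hold out 20% for test, then 1/8 of the rest for validation.
essays = ["An example student essay about computers."] * 10
labels = list(range(10))
train_x, test_x, train_y, test_y = train_test_split(essays, labels, test_size=0.2)
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.125)
print(preprocess(essays[0]))
```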
Rationale for model selection
Many deep learning architectures employed for essay scoring are relatively simplistic and do not fully leverage the unique features embedded in the data. To address this limitation, we designed a computational framework based on Sentence-BERT (SBERT) (Reimers & Gurevych, 2019) and LSTM with attention mechanisms, aimed at enhancing both the semantic processing and predictive capabilities of the model in natural language tasks.

SBERT, built on the transformer architecture, generates embedding vectors that capture the meaning of individual sentences more effectively than traditional models. Its ability to create high-quality sentence representations allows for better identification and comparison of semantic nuances within essays. Transformer-based models, such as SBERT, are well known for their attention mechanisms, which have gained prominence for their ability to focus on the most important parts of the input data (Vaswani et al., 2017; Leow, Nguyen & Chua, 2021; Nguyen-Vo et al., 2021; Badaro, Saeed & Papotti, 2023). These mechanisms allow the model to selectively attend to key elements in an essay, improving its ability to grasp the deeper semantic relationships.

Furthermore, incorporating LSTM with attention mechanisms adds another layer of sophistication to the model. LSTMs excel at processing sequential data by retaining critical information over time, and when combined with attention, they can focus on the most relevant parts of the sequence. This combination enhances the model's capacity to handle long sequences without losing important context, ultimately improving the accuracy of essay scoring. By integrating SBERT's powerful sentence embeddings with LSTM's attention-enhanced sequential processing, this model not only optimizes performance in scoring tasks but also provides deeper insights into complex linguistic patterns, making it a robust tool for analyzing and evaluating essays.
Table 1 Statistics of the ASAP dataset.

Prompt | Essays | Avg length | Score range | WordPiece length
1 | 1,783 | 350 | 2–12 | 649
2 | 1,800 | 350 | 1–6 | 704
3 | 1,726 | 150 | 0–3 | 219
4 | 1,772 | 150 | 0–3 | 203
5 | 1,805 | 150 | 0–4 | 258
6 | 1,800 | 150 | 0–4 | 289
7 | 1,569 | 250 | 0–30 | 371
8 | 723 | 650 | 0–60 | 1,077

Assessment metrics

To assess the effectiveness of the models, we employed various evaluation metrics including Mean Squared Error (MSE), the coefficient of determination (R2), Mean Absolute Error (MAE), Root Mean Square Deviation (RMSE), and Quadratic Weighted Kappa (QWK). This allows us to conduct a comprehensive evaluation of the precision and consistency of the predictions.
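These metrics can be computed with scikit-learn and NumPy. The toy score arrays below are illustrative; QWK is obtained from Cohen's kappa with quadratic weights after rounding the continuous predictions to discrete labels:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, r2_score,
                             mean_absolute_error, cohen_kappa_score)

y_true = np.array([8, 9, 7, 10, 6])           # gold essay scores
y_pred = np.array([7.6, 9.2, 7.4, 9.1, 6.3])  # model predictions

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
# QWK compares discrete score labels, so predictions are rounded first.
qwk = cohen_kappa_score(y_true, np.rint(y_pred).astype(int),
                        weights="quadratic")
print(mse, r2, mae, rmse, qwk)
```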
Proposed method
Model architecture
To develop the essay scoring system, we propose the architecture illustrated in Fig. 2. The process begins with a pre-trained sequence transformer for extracting features from the input essays. These features are then processed through a bidirectional Long Short-Term Memory (BiLSTM) layer, which effectively captures the underlying patterns in the data. To further refine the model's attention to critical essay components, an attention mechanism is applied to the output from the BiLSTM layer. The final essay scores are generated through a fully connected (FC) layer with dimensions (140, 1). For optimization, we utilize two loss functions, MSE and cosine similarity (SIM), which will be explained in detail in the "Loss Functions" section.

Figure 2 Model architecture: input essay tokens ([CLS], t1 ... t4, [SEP]) pass through SBERT, a BiLSTM (num of layers = 1) with attention, and an FC (140, 1) layer. (Full-size image: DOI 10.7717/peerj-cs.2634/fig-2)
Essay representation
To represent essays at both the token and document levels, we utilize a pre-trained BERT model, augmented by a BiLSTM network and an attention mechanism. The process begins by tokenizing each essay using the BERT tokenizer, resulting in a token sequence $T_1 = [t_1, t_2, \ldots, t_n]$, where $t_i$ is the $i$-th token and $n$ represents the total number of tokens. BERT's tokenization follows the WordPiece method, and since BERT's maximum input sequence length is 512 tokens, we construct a new sequence $T_2$ from $T_1$ as follows:

$$
T_2 = \begin{cases}
[\mathrm{CLS}] + [t_1, t_2, \ldots, t_L] + [\mathrm{SEP}], & \text{if } n > L \\
[\mathrm{CLS}] + T_1 + [\mathrm{SEP}], & \text{if } n = L \\
[\mathrm{CLS}] + T_1 + [\mathrm{PAD}] \times (L - n) + [\mathrm{SEP}], & \text{if } n < L
\end{cases} \tag{1}
$$

Here, $L = 510$ is the maximum sequence length allowed for tokens between the special tokens [CLS] and [SEP], marking the start and end of the sequence, respectively. If the essay contains fewer tokens than $L$, padding is applied to maintain the fixed sequence length. The token, segmentation, and position embeddings are then combined to create the final input representation fed into BERT.

After obtaining the contextualized token embeddings from BERT, we use a BiLSTM network to capture the sequential dependencies within the essay. The BiLSTM processes the token sequence $H = [h_1, h_2, \ldots, h_L]$, where $h_i$ represents the hidden state corresponding to token $t_i$. The forward and backward passes of the LSTM are defined as:

$$
\overrightarrow{h_i} = \mathrm{LSTM}_{\mathrm{fwd}}(t_i), \qquad \overleftarrow{h_i} = \mathrm{LSTM}_{\mathrm{bwd}}(t_i). \tag{2}
$$

The final output for each token from the BiLSTM is the concatenation of the forward and backward hidden states: $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$. To further enhance the model's focus on key tokens, we apply a self-attention mechanism over the BiLSTM outputs. The attention score $a_i$ for the $i$-th token is computed as:

$$
a_i = \frac{\exp(W h_i)}{\sum_{j=1}^{L} \exp(W h_j)}, \tag{3}
$$

where $W$ is a learnable weight matrix. The final essay representation $v$ is then obtained as a weighted sum of the BiLSTM outputs:

$$
v = \sum_{i=1}^{L} a_i h_i. \tag{4}
$$
This representation v serves as the input for downstream tasks, such as classification or regression, depending on the application.
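A minimal PyTorch sketch of eqs. (2)–(4) follows. The hidden size of 70 (so that the concatenated BiLSTM states are 140-dimensional, matching FC(140, 1)) and the 768-dimensional input embeddings are assumptions consistent with the stated architecture, not the released code:

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Sketch of eqs. (2)-(4): BiLSTM over token embeddings, additive
    self-attention over the hidden states, and a final scoring layer."""
    def __init__(self, embed_dim: int = 768, hidden: int = 70):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1, bias=False)  # the weight W in eq. (3)
        self.fc = nn.Linear(2 * hidden, 1)                # FC(140, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim), e.g. BERT/SBERT token vectors
        h, _ = self.bilstm(embeddings)              # (batch, seq_len, 140)
        a = torch.softmax(self.attn(h), dim=1)      # eq. (3): attention scores
        v = (a * h).sum(dim=1)                      # eq. (4): weighted sum
        return self.fc(v).squeeze(-1)               # predicted essay score

model = BiLSTMAttention()
dummy = torch.randn(2, 512, 768)   # two essays of 512 token embeddings
print(model(dummy).shape)          # torch.Size([2])
```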
Loss functions
MSE is a metric used to measure the average squared differences between the predicted scores and the actual labels, defined as follows:

$$
\mathrm{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2, \tag{5}
$$

where $y_i$ represents the true score of the $i$-th essay, $\hat{y}_i$ is the corresponding predicted score, and $N$ is the total number of essays evaluated. MSE penalizes larger errors more heavily, which can improve the overall accuracy of the model in cases where large errors are unacceptable. Because the essays have a wide range of scores, this loss function helps enhance the model's predictions.

To assess the similarity between two vectors, we use the SIM function, which measures the alignment between vectors based on their orientation. The SIM loss helps assess the similarity of the input texts, since the data contains many texts with high similarity; this allows the model to produce more accurate and consistent results for essays that are highly similar. During training, the SIM loss encourages the model to recognize similar pairs of vectors, enhancing its ability to capture relationships within the batch of essays. The SIM loss is defined as:

$$
\mathrm{SIM}(y, \hat{y}) = 1 - \cos(y, \hat{y}). \tag{6}
$$

The total loss function combines these two components, MSE and SIM, into a single objective, formulated as:

$$
\mathrm{Loss}_{\mathrm{total}}(y, \hat{y}) = \alpha \, \mathrm{MSE}(y, \hat{y}) + \beta \, \mathrm{SIM}(y, \hat{y}), \tag{7}
$$

where $\alpha$ and $\beta$ are weight parameters optimized based on the model's performance on the validation set.
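In PyTorch, the combined objective of eq. (7) can be sketched as below; the alpha and beta values shown are placeholders to be tuned on the validation set:

```python
import torch
import torch.nn.functional as F

def total_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
               alpha: float = 1.0, beta: float = 0.5) -> torch.Tensor:
    """Eq. (7): alpha * MSE + beta * (1 - cosine similarity), computed over
    a batch of predicted and gold scores. alpha/beta are placeholder values."""
    mse = F.mse_loss(y_pred, y_true)                      # eq. (5)
    sim = 1 - F.cosine_similarity(y_pred.unsqueeze(0),    # eq. (6)
                                  y_true.unsqueeze(0)).squeeze()
    return alpha * mse + beta * sim

print(total_loss(torch.tensor([7.5, 9.1]), torch.tensor([8.0, 9.0])))
```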
EXPERIMENTAL RESULTS AND DISCUSSION
We trained our model for 50 epochs with a learning rate of 0.01, utilizing the Adam optimizer (Kingma & Ba, 2014). The training process was implemented using PyTorch 2.0.0 and executed on an RTX 3060 GPU with 12 GB of memory. All computations were conducted on a machine running Windows 11, equipped with an AMD Ryzen 7 5800X 8-Core Processor (3.80 GHz) and 32 GB of RAM.
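A self-contained sketch of this training setup follows; a linear stand-in model keeps the example short, whereas the full system would train the SBERT + BiLSTM-attention network with the combined loss above:

```python
import torch
import torch.nn as nn

# Minimal loop matching the stated setup: Adam optimizer, lr = 0.01, 50 epochs.
model = nn.Linear(768, 1)   # stand-in for the full scoring network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

features = torch.randn(32, 768)   # toy essay embeddings
scores = torch.rand(32) * 10      # toy gold scores

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(features).squeeze(-1), scores)
    loss.backward()
    optimizer.step()
print(f"final training loss: {loss.item():.4f}")
```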
Baseline and benchmarking models
To evaluate the effectiveness of the LSTM-Attention mechanism in our model, we established a variety of baseline methods using SBERT embeddings. These baseline models include widely used machine learning techniques, such as BERT combined with Support Vector Machines (SVM (Pal & Mather, 2005)), Random Forests (RF (Breiman, 2001)), k-Nearest Neighbors (KNN (Kramer, 2013)), Extreme Gradient Boosting (XGB (Chen & Guestrin, 2016)), and 1D Convolutional Neural Networks (CNN1D (Kiranyaz et al., 2021)). Additionally, we incorporated a BERT + LSTM configuration to specifically assess the contribution of the attention mechanism. These traditional LSTM and LSTM-Attention models serve as foundational baselines for comparison, particularly in sequence-based learning tasks.

For a comprehensive performance analysis, we benchmarked our proposed model against several well-established AES approaches. These include Tran-BERT-MS-ML-R (Wang et al., 2022), which utilizes BERT's robust contextual embeddings to capture intricate essay features, and XLNet (Jeon & Strube, 2021), which addresses the challenge of essay length variability and its effect on scoring accuracy. We also compared our model to SkipFlow (Tay et al., 2018), a system that emphasizes coherence as a key factor in the scoring process, and Many Hands Make Light Work (MHMLW (Kumar et al., 2021)), which focuses on essay-specific traits to enhance assessment reliability. Furthermore, we evaluated against Automatic Features (AF (Dong & Zhang, 2016)), which provides a broad analysis of feature extraction techniques, and Flexible Domain Adaptation (FDA (Phandi, Chai & Ng, 2015)), known for its innovative methods in adapting scoring models across multiple domains. By systematically comparing our model to these approaches, we aim to demonstrate the efficiency of our architecture and its potential advantages in improving automated essay scoring, particularly in terms of semantic understanding, feature extraction, and handling complex essay characteristics.
Performance comparison with baseline models
The performance comparison between our proposed model and several baseline models is presented in Table 2. The model exhibited exceptional performance across various evaluation metrics, including MSE, R2, MAE, RMSE, and QWK. It is important to highlight that it achieved the lowest MSE of 4.7645 and an RMSE of 2.1828, demonstrating a significant decrease in prediction error relative to other models. Additionally, our model achieved the highest R2 value of 0.9286, indicating a robust correlation between the predicted and actual essay scores. Furthermore, the model achieved a QWK of 0.7876, indicating a strong alignment with human-assigned scores, which further confirms its scoring accuracy.

Table 2 Comparison results with baseline models.

Models | MSE | R2 | MAE | RMSE | QWK
LSTM-based Model | 18.3864 | 0.6315 | 2.7686 | 4.2879 | 0.4748
LSTM-Attention | 13.0475 | 0.7643 | 2.3010 | 3.6121 | 0.5862
BERT + Random Forest | 6.1241 | 0.8842 | 1.5280 | 2.4747 | 0.6987
BERT + Support Vector Machines | 13.0475 | 0.7643 | 2.3010 | 3.6121 | 0.5862
BERT + k-Nearest Neighbor | 11.6149 | 0.7965 | 2.1703 | 3.4081 | 0.6132
BERT + eXtreme Gradient Boosting | 10.6173 | 0.8167 | 2.0743 | 3.2584 | 0.6426
BERT + LSTM | 7.0722 | 0.8876 | 1.6766 | 2.6594 | 0.7302
BERT + CNN | 6.0048 | 0.8859 | 1.5268 | 2.4505 | 0.7125
Ours | 4.7645 | 0.9286 | 1.3544 | 2.1828 | 0.7876
Note: Bold indicates the best performance.

Among the baseline models, the combinations of BERT with machine learning algorithms, including BERT + CNN and BERT + Random Forest, demonstrated competitive performance, especially in reducing prediction errors and improving accuracy. This highlights the importance of utilizing advanced transformer-based embeddings such as BERT for effective feature extraction. Nonetheless, in spite of their strong performance, these models did not reach the accuracy levels of our proposed LSTM-Attention model, highlighting the advantages of attention mechanisms in enhancing prediction precision.

In contrast, sequence-based models like LSTM and LSTM-Attention, when used without embeddings from pre-trained language models, displayed the lowest performance, with MSEs of 18.38 and 13.05, respectively. This underscores the importance of leveraging pre-trained embeddings, such as SBERT, particularly for tasks involving smaller, less diverse datasets. The incorporation of SBERT enabled our model to learn additional semantic and contextual details, resulting in enhanced accuracy in essay scoring.
Performance comparison with benchmarking models
The findings presented in Table 3 provide additional validation of the efficacy of our proposed model across all performance metrics. Our model demonstrated remarkable performance with an MSE of 4.7645, showing a substantial decrease in prediction errors compared to other models, which highlights its accuracy in predicting essay scores. Furthermore, the model achieved an impressive R2 value of 0.9286, accounting for around 92.86% of the variance in the dataset and indicating a strong alignment with the underlying data distribution. The MAE of 1.3544 reflects the average magnitude of errors, demonstrating the model's capacity to produce predictions with minimal deviations. Finally, the RMSE of 2.1828 underscores the strength of our methodology, as this metric places greater emphasis on larger errors, and the comparatively low value suggests a high level of prediction accuracy.
Table 3 Comparison results with other benchmarking models.

Models | MSE | R2 | MAE | RMSE | QWK
AF (Dong & Zhang, 2016) | 20.8331 | 0.5604 | 2.9501 | 4.5643 | 0.4448
FDA (Phandi, Chai & Ng, 2015) | 16.4041 | 0.6840 | 2.6094 | 4.0502 | 0.5157
MHMLW (Kumar et al., 2021) | 10.0127 | 0.8304 | 2.0055 | 3.1643 | 0.6532
Tran-BERT-MS-ML-R (Wang et al., 2022) | 9.0755 | 0.8490 | 1.9112 | 3.0126 | 0.6891
XLNet (Jeon & Strube, 2021) | 8.3130 | 0.8640 | 1.8147 | 2.8832 | 0.7042
SkipFlow (Tay et al., 2018) | 6.2887 | 0.8804 | 1.5542 | 2.5077 | 0.6820
Ours | 4.7645 | 0.9286 | 1.3544 | 2.1828 | 0.7876
Note: Bold indicates the best performance.
Moreover, the QWK score of 0.7876 highlights the model's ability to capture and correlate with human-assigned scores while also accounting for the severity of differences between predicted and actual scores. This metric underscores the model's capability to evaluate essays with a significant level of consistency and dependability. In comparison, although models like SkipFlow and XLNet showed notable performance, achieving QWK scores of 0.6820 and 0.7042 respectively, our model surpassed them across all metrics. This demonstrates the enhanced capability of our architecture to utilize key essay characteristics, leading to improved and reliable scoring results.
Stability analysis
To evaluate the stability of our proposed model, we conducted additional experiments by repeating the data splitting and training process nine more times, each using a different random seed. This approach ensures that the reported performance is not influenced by a specific data split and provides a robust assessment of the model’s consistency. Table 4 summarizes the performance of our model across all 10 trials. Trial 0 corresponds to the results presented in Tables 2 and 3, while trials 1 through 9 represent additional runs with randomized data splits. For each trial, we evaluated the model using the same metrics as in the original experiments, ensuring comparability. As observed in Table 4, our model demonstrates stable performance across all evaluation metrics, with minimal variance. This consistency highlights the robustness of our model and its ability to generalize well across different dataset splits. The small variance in all the performance metrics further emphasizes the reliability of our approach.
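The repeated-trials protocol can be sketched as below. The `train_and_score` function is a hypothetical wrapper around the full pipeline, and the placeholder scores are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_and_score(train_idx, test_idx, seed):
    """Hypothetical wrapper: train the full model on train_idx and return
    the test MSE. A placeholder value stands in for the real pipeline."""
    rng = np.random.default_rng(seed)
    return 4.78 + rng.normal(0, 0.05)

essays = np.arange(1000)   # toy essay indices
results = []
for seed in range(10):     # one trial per random seed
    train_idx, test_idx = train_test_split(essays, test_size=0.2,
                                           random_state=seed)
    results.append(train_and_score(train_idx, test_idx, seed))

print(f"MSE mean={np.mean(results):.4f} std={np.std(results):.4f}")
```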
Table 4 Performance of the proposed model across multiple trials.

Trial | MSE | R2 | MAE | RMSE | QWK
0 | 4.7645 | 0.9286 | 1.3544 | 2.1828 | 0.7876
1 | 4.8393 | 0.9282 | 1.3602 | 2.1998 | 0.7877
2 | 4.7540 | 0.9297 | 1.3514 | 2.1804 | 0.7925
3 | 4.8210 | 0.9270 | 1.3659 | 2.1957 | 0.7897
4 | 4.7441 | 0.9281 | 1.3487 | 2.1781 | 0.8005
5 | 4.8291 | 0.9284 | 1.3593 | 2.1975 | 0.7872
6 | 4.8287 | 0.9279 | 1.3583 | 2.1974 | 0.7853
7 | 4.7160 | 0.9306 | 1.3462 | 2.1716 | 0.7885
8 | 4.7829 | 0.9281 | 1.3540 | 2.1870 | 0.7918
9 | 4.7353 | 0.9302 | 1.3541 | 2.1761 | 0.7922
Mean | 4.7815 | 0.9287 | 1.3553 | 2.1866 | 0.7903
STD | 0.0450 | 0.0011 | 0.0058 | 0.0103 | 0.0043

Limitations and future directions

While our proposed model demonstrates strong performance in AES, several limitations should be acknowledged. First, the dataset used in this study, though widely recognized as a benchmark, is relatively limited in terms of diversity, as it consists primarily of essays written by middle school students in the U.S. This restricts the generalizability of our findings across different age groups, educational levels, or cultural contexts. Expanding the scope to include a broader range of essays, such as those from higher education or essays written in different languages, could provide more comprehensive insights into the model's robustness.

Second, while the SBERT embeddings proved effective in capturing semantic nuances, further exploration into other advanced transformer-based models, such as GPT or T5, might offer additional performance improvements. Additionally, the current implementation of the LSTM-Attention mechanism, while effective, could be enhanced by experimenting with more sophisticated attention mechanisms, such as multi-head attention or dynamic attention models, to further refine the focus on critical essay components.

Another limitation is the computational intensity required for training transformer-based models, which may present challenges for scalability in real-world applications. Future work could explore optimization techniques or model compression methods to reduce the computational load without sacrificing performance. Furthermore, future research could also investigate the application of this model in adaptive learning systems, where real-time essay scoring and feedback could be integrated into personalized learning platforms.

Future directions for this study could explore leveraging advancements in open-vocabulary models and synthetic data augmentation techniques to enhance the scalability and robustness of AES systems. For instance, integrating approaches similar to those used in Shi, Dao & Cai (2024) and Shi, Hayat & Cai (2024) could enable AES models to generalize across diverse linguistic and cultural contexts by adapting to new essay topics and vocabularies without extensive retraining. Additionally, addressing the dataset diversity limitations noted in this study, techniques like synthetic data generation inspired by Wang, Chukova & Nguyen (2023) could enrich the training dataset with balanced and representative samples from underrepresented groups, improving the model's fairness and inclusivity. These advancements would contribute to developing AES systems that are not only accurate but also equitable and versatile across global educational settings.
CONCLUSIONS
In this study, we introduced a novel approach to AES by combining SBERT embeddings with LSTM networks and attention mechanisms. Our results show that this hybrid architecture significantly improves prediction accuracy and reduces errors compared to traditional and state-of-the-art models. By leveraging SBERT's rich semantic representations and enhancing sequential processing through LSTM and attention, the model demonstrated superior performance across multiple metrics, including MSE, R2, RMSE, MAE, and QWK.

Despite the model's success, limitations related to dataset diversity and computational demands remain. Future research should explore the application of more diverse datasets, advanced attention mechanisms, and optimization strategies to further refine the model's scalability and adaptability. Overall, our study highlights the potential of integrating deep learning techniques in AES to provide more accurate, efficient, and reliable assessments, contributing to the ongoing advancement of educational technology.
ADDITIONAL INFORMATION AND DECLARATIONS
Funding
The authors received no funding for this work.
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Yuzhe Nie conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability: The code and data used in the experiments are available in the Supplemental Files. The data is originally from the Kaggle competition The Hewlett Foundation: Automated Essay Scoring: https://www.kaggle.com/competitions/asap-aes.
Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.2634#supplemental-information.
REFERENCES
Alikaniotis D, Yannakoudakis H, Rei M. 2016. Automatic text scoring using neural networks. ArXiv preprint DOI 10.48550/arXiv.1606.04289.
|
||||
Attali Y, Burstein J. 2004. Automated essay scoring with e-raterÒ v.2.0. Ets Research Report Series (2):i-221 DOI 10.1002/j.2333-8504.2004.tb01972.x.
|
||||
Badaro G, Saeed M, Papotti P. 2023. Transformers for tabular data representation: a survey of models and applications. Transactions of the Association for Computational Linguistics 11(3):227–249 DOI 10.1162/tacl_a_00544.
|
||||
Breiman L. 2001. Random forests. Machine Learning 45(1):5–32 DOI 10.1023/A:1010933404324.
|
||||
Carlile W, Gurrapadi N, Ke Z, Ng V. 2018. Give me more feedback: annotating argument persuasiveness and related attributes in student essays. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Kerrville: Association for Computational Linguistics, 621–631.
Chen T, Guestrin C. 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16. New York: Association for Computing Machinery, 785–794.
Cozma M, Butnaru AM, Ionescu RT. 2018. Automated essay scoring with string kernels and word embeddings. ArXiv preprint DOI 10.48550/arXiv.1804.07954.
Ding J, Chen X, Lu P, Yang Z, Li X, Du Y. 2023. DialogueINAB: an interaction neural network based on attitudes and behaviors of interlocutors for dialogue emotion recognition. The Journal of Supercomputing 79(18):20481–20514 DOI 10.1007/s11227-023-05439-1.
Dong F, Zhang Y. 2016. Automatic features for essay scoring—an empirical study. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Kerrville: Association for Computational Linguistics, 1072–1077.
Farag Y, Yannakoudakis H, Briscoe T. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. ArXiv preprint DOI 10.48550/arXiv.1804.06898.
Gu X, Chen X, Lu P, Lan X, Li X, Du Y. 2023. SiMaLSTM-SNP: novel semantic relatedness learning model preserving both Siamese networks and membrane computing. The Journal of Supercomputing 80(3):3382–3411 DOI 10.1007/s11227-023-05592-7.
Hamner B, Morgan J, Vandev L, Shermis M, Vander Ark T. 2012. The Hewlett foundation: automated essay scoring. Available at https://www.kaggle.com/competitions/asap-aes.
Janda HK. 2019. Use of semantic, syntactic and sentiment features to automate essay evaluation. PhD thesis. Lakehead University, Thunder Bay, Ontario, Canada.
Jeon S, Strube M. 2021. Countering the influence of essay length in neural essay scoring. In: Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing. Kerrville: Association for Computational Linguistics, 32–38.
Kingma DP, Ba J. 2014. Adam: a method for stochastic optimization. ArXiv preprint DOI 10.48550/arXiv.1412.6980.
Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman DJ. 2021. 1D convolutional neural networks and applications: a survey. Mechanical Systems and Signal Processing 151:107398 DOI 10.1016/j.ymssp.2020.107398.
Kramer O. 2013. K-nearest neighbors. In: Dimensionality Reduction with Unsupervised Nearest Neighbors. Cham: Springer, 13–23.
Kumar R, Mathias S, Saha S, Bhattacharyya P. 2021. Many hands make light work: using essay traits to automatically score essays. ArXiv preprint DOI 10.48550/arXiv.2102.00781.
Leow EKW, Nguyen BP, Chua MCH. 2021. Robo-advisor using genetic algorithm and BERT sentiments from tweets for hybrid portfolio optimisation. Expert Systems with Applications 179(2):115060 DOI 10.1016/j.eswa.2021.115060.
Li D, Jianxing W. 2024. The effect of gamified learning monitoring systems on students’ learning behavior and achievement: an empirical study. Entertainment Computing 52:100907 DOI 10.1016/j.entcom.2024.100907.
Nadeem F, Nguyen H, Liu Y, Ostendorf M. 2019. Automated essay scoring with discourse-aware neural models. In: Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications. Kerrville: Association for Computational Linguistics, 484–493.
Nguyen-Vo T-H, Trinh QH, Nguyen L, Do TTT, Chua MCH, Nguyen BP. 2021. Predicting antimalarial activity in natural products using pretrained bidirectional encoder representations from transformers. Journal of Chemical Information and Modeling 62(21):5050–5058 DOI 10.1021/acs.jcim.1c00584.
Page EB. 1966. The imminence of... grading essays by computer. The Phi Delta Kappan 47(5):238–243.
Pal M, Mather P. 2005. Support vector machines for classification in remote sensing. International Journal of Remote Sensing 26(5):1007–1011 DOI 10.1080/01431160512331314083.
Palmer J, Williams R, Dreher H. 2002. Automated essay grading systems applied to a first year university subject: how can we do it better? In: IS2002 Informing Science and IT Education Conference. Santa Rosa: Informing Science Institute, 1221–1229.
Phandi P, Chai KMA, Ng HT. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Kerrville: Association for Computational Linguistics, 431–439.
Reilly C. 2013. MOOCs deconstructed: variables that affect MOOC success rates. In: E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education. Waynesville: Association for the Advancement of Computing in Education (AACE), 1308–1338.
Reimers N, Gurevych I. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. ArXiv preprint DOI 10.48550/arXiv.1908.10084.
Shi H, Dao SD, Cai J. 2024. LLMFormer: large language model for open-vocabulary semantic segmentation. International Journal of Computer Vision 34(4):17864 DOI 10.1007/s11263-024-02171-y.
Shi H, Hayat M, Cai J. 2024. Unified open-vocabulary dense visual prediction. IEEE Transactions on Multimedia 26:8704–8716 DOI 10.1109/TMM.2024.3381835.
Taghipour K, Ng HT. 2016. A neural approach to automated essay scoring. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Kerrville: Association for Computational Linguistics, 1882–1891.
Tay Y, Phan M, Tuan LA, Hui SC. 2018. SkipFlow: incorporating neural coherence features for end-to-end automatic text scoring. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. Washington D.C.: AAAI Press, 5948–5955.
Uto M, Xie Y, Ueno M. 2020. Neural automated essay scoring incorporating handcrafted features. In: Proceedings of the 28th International Conference on Computational Linguistics. Kerrville: Association for Computational Linguistics, 6077–6088.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. 2017. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Advances in Neural Information Processing Systems. Vol. 30. Red Hook: Curran Associates, Inc.
Wang AX, Chukova SS, Nguyen BP. 2023. Synthetic minority oversampling using edited displacement-based k-nearest neighbors. Applied Soft Computing 148(04):110895 DOI 10.1016/j.asoc.2023.110895.
Wang Y, Wang C, Li R, Lin H. 2022. On the use of BERT for automated essay scoring: joint learning of multi-scale essay representation. ArXiv preprint DOI 10.48550/arXiv.2205.03835.
Wu L, Long Y, Gao C, Wang Z, Zhang Y. 2023. MFIR: multimodal fusion and inconsistency reasoning for explainable fake news detection. Information Fusion 100:101944 DOI 10.1016/j.inffus.2023.101944.
1
storage/WR5LLUBH/.zotero-reader-state
Normal file
@@ -0,0 +1 @@
{"pageIndex":9,"scale":103,"top":433,"left":-505,"scrollMode":0,"spreadMode":0}
Binary file not shown.
@@ -1,15 +1,15 @@
 {
 	"translatorID": "660fcf3e-3414-41b8-97a5-e672fc2e491d",
+	"translatorType": 4,
 	"label": "EBSCO Discovery Layer",
 	"creator": "Sebastian Karcher",
 	"target": "^https?://(discovery|research)\\.ebsco\\.com/",
 	"minVersion": "5.0",
-	"maxVersion": "",
+	"maxVersion": null,
 	"priority": 100,
 	"inRepository": true,
-	"translatorType": 4,
 	"browserSupport": "gcsibv",
-	"lastUpdated": "2024-02-04 04:24:48"
+	"lastUpdated": "2025-12-22 18:15:00"
 }
 
 /*
@@ -117,9 +117,20 @@ async function scrape(doc, url = doc.location.href) {
 	let risURL = `/linkprocessor/v2-ris?recordId=${recordId}&opid=${opid}&lang=en`;
 	// Z.debug(risURL)
 
-	// this won't work always
-	let pdfURL = `/linkprocessor/v2-pdf?recordId=${recordId}&sourceRecordId=${recordId}&profileIdentifier=${opid}&intent=download&lang=en`;
-
+	let pdfURL;
+	try {
+		let [{ result }] = await requestJSON(`/api/viewer/v6/htmlfulltext/${recordId}?opid=${opid}`);
+		let { links } = result;
+		Z.debug('Links:');
+		Z.debug(links);
+		let downloadLink = links['v2-downloadLinks']?.find(link => link.type === 'pdf');
+		if (!downloadLink) downloadLink = links.downloadLinks.find(link => link.type === 'pdf');
+		pdfURL = downloadLink.url;
+	}
+	catch (e) {
+		Zotero.debug('Error while locating PDF download link: ' + e);
+		pdfURL = `/linkprocessor/v2-pdf?recordId=${recordId}&sourceRecordId=${recordId}&profileIdentifier=${opid}&intent=download&lang=en`
+	}
 
 	let risText = await requestText(risURL);
 	// Z.debug(risText)
@@ -143,5 +154,6 @@ async function scrape(doc, url = doc.location.href) {
 }
+
 /** BEGIN TEST CASES **/
 var testCases = [
 ]
 /** END TEST CASES **/
BIN
zotero.sqlite
Binary file not shown.
BIN
zotero.sqlite.1.bak
Normal file
Binary file not shown.
BIN
zotero.sqlite.bak
Normal file
Binary file not shown.