diff --git a/cache/library/87FL4LGS.png b/cache/library/87FL4LGS.png
new file mode 100644
index 0000000..be2a6b4
Binary files /dev/null and b/cache/library/87FL4LGS.png differ
diff --git a/cache/library/DZHQIW6Q.png b/cache/library/DZHQIW6Q.png
new file mode 100644
index 0000000..7a71f1a
Binary files /dev/null and b/cache/library/DZHQIW6Q.png differ
diff --git a/cache/library/L2I43B53.png b/cache/library/L2I43B53.png
new file mode 100644
index 0000000..59c813c
Binary files /dev/null and b/cache/library/L2I43B53.png differ
diff --git a/cache/library/MKYLFLT9.png b/cache/library/MKYLFLT9.png
new file mode 100644
index 0000000..fe726e6
Binary files /dev/null and b/cache/library/MKYLFLT9.png differ
diff --git a/cache/library/REU9S3UJ.png b/cache/library/REU9S3UJ.png
new file mode 100644
index 0000000..eba225f
Binary files /dev/null and b/cache/library/REU9S3UJ.png differ
diff --git a/cache/library/XBKEDW43.png b/cache/library/XBKEDW43.png
new file mode 100644
index 0000000..97232f1
Binary files /dev/null and b/cache/library/XBKEDW43.png differ
diff --git a/cache/library/Y4XJTK8U.png b/cache/library/Y4XJTK8U.png
new file mode 100644
index 0000000..b276841
Binary files /dev/null and b/cache/library/Y4XJTK8U.png differ
diff --git a/storage/2U4EX5P8/.zotero-ft-cache b/storage/2U4EX5P8/.zotero-ft-cache
new file mode 100644
index 0000000..b2c34da
--- /dev/null
+++ b/storage/2U4EX5P8/.zotero-ft-cache
@@ -0,0 +1,58 @@
+Computer Science > Computation and Language
+arXiv:2502.08109 (cs)
+[Submitted on 12 Feb 2025]
+HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses
+Sujeong Lee, Hayoung Lee, Seongsoo Heo, Wonik Choi
+Recent advances in large language models (LLMs) have shown promising improvements, often surpassing existing methods across a wide range of downstream tasks in natural language processing. However, these models still face challenges that may hinder their practical applicability. For example, the phenomenon of hallucination is known to compromise the reliability of LLMs, especially in fields that demand high factual precision. Current benchmarks primarily focus on hallucination detection and factuality evaluation but do not extend beyond identification. This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx, aimed at enhancing the reliability of LLM-generated responses by both detecting hallucinations and providing detailed explanations. The proposed model offers a novel approach to integrating detection with explanations, enabling both users and the LLM itself to understand and reduce errors. Our measurement results demonstrate that the proposed model surpasses larger LLMs, such as Llama3 70B and GPT-4, in hallucination detection accuracy, while maintaining reliable explanations. Furthermore, the proposed model performs well in both zero-shot and other test environments, showcasing its adaptability across diverse benchmark datasets. The proposed approach further advances hallucination detection research by integrating interpretability with detection, improving the performance and reliability of evaluating hallucinations in language models.
+Comments: 11 pages
+Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
+Cite as: arXiv:2502.08109 [cs.CL]
+ (or arXiv:2502.08109v1 [cs.CL] for this version)
+
+https://doi.org/10.48550/arXiv.2502.08109
+Submission history
+From: Sujeong Lee [view email]
+[v1] Wed, 12 Feb 2025 04:17:02 UTC (880 KB)
+
diff --git a/storage/2U4EX5P8/2502.html b/storage/2U4EX5P8/2502.html
new file mode 100644
index 0000000..43d162c
--- /dev/null
+++ b/storage/2U4EX5P8/2502.html
@@ -0,0 +1,349 @@
+[2502.08109] HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses
diff --git a/storage/3H4XBZAV/.zotero-ft-cache b/storage/3H4XBZAV/.zotero-ft-cache
new file mode 100644
index 0000000..048a7c2
--- /dev/null
+++ b/storage/3H4XBZAV/.zotero-ft-cache
@@ -0,0 +1,55 @@
+Computer Science > Artificial Intelligence
+[Submitted on 5 Jul 2024 (this version), latest version 8 Aug 2024 (v2)]
+Code Hallucination
+Mirza Masfiqur Rahman, Ashish Kundu
+Generative models such as large language models are extensively used as code copilots and for whole-program generation. However, the programs they generate often have questionable correctness, authenticity, and reliability in terms of integration, as they might not follow the user requirements, provide incorrect and/or nonsensical outputs, or even contain semantic/syntactic errors - overall known as LLM hallucination. In this work, we present several types of code hallucination. We have generated such hallucinated code manually using large language models. We also present a technique, HallTrigger, to demonstrate efficient ways of generating arbitrary code hallucinations. Our method leverages 3 different dynamic attributes of LLMs to craft prompts that can successfully trigger hallucinations from models without the need to access model architecture or parameters. Results from popular black-box models suggest that HallTrigger is indeed effective and that pervasive LLM hallucinations have a substantial impact on software development.
+Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
+Cite as: arXiv:2407.04831 [cs.AI]
+ (or arXiv:2407.04831v1 [cs.AI] for this version)
+
+https://doi.org/10.48550/arXiv.2407.04831
+Submission history
+From: Ashish Kundu [view email]
+[v1] Fri, 5 Jul 2024 19:37:37 UTC (45 KB)
+[v2] Thu, 8 Aug 2024 01:01:47 UTC (45 KB)
+
diff --git a/storage/3H4XBZAV/2407.html b/storage/3H4XBZAV/2407.html
new file mode 100644
index 0000000..500a593
--- /dev/null
+++ b/storage/3H4XBZAV/2407.html
@@ -0,0 +1,384 @@
+[2407.04831v1] Code Hallucination
\ No newline at end of file
diff --git a/storage/3KB89IIG/.zotero-ft-cache b/storage/3KB89IIG/.zotero-ft-cache
new file mode 100644
index 0000000..bc405e6
--- /dev/null
+++ b/storage/3KB89IIG/.zotero-ft-cache
@@ -0,0 +1,58 @@
+Computer Science > Computation and Language
+arXiv:2502.17125 (cs)
+[Submitted on 24 Feb 2025]
+LettuceDetect: A Hallucination Detection Framework for RAG Applications
+Ádám Kovács, Gábor Recski
+Retrieval Augmented Generation (RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect, a framework that addresses two critical limitations in existing hallucination detection methods: (1) the context window constraints of traditional encoder-based methods, and (2) the computational inefficiency of LLM-based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing for the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, which is a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it more practical for real-world RAG applications.
+Comments: 6 pages
+Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
+Cite as: arXiv:2502.17125 [cs.CL]
+ (or arXiv:2502.17125v1 [cs.CL] for this version)
+
+https://doi.org/10.48550/arXiv.2502.17125
+Submission history
+From: Ádám Kovács [view email]
+[v1] Mon, 24 Feb 2025 13:11:47 UTC (1,188 KB)
+
diff --git a/storage/3KB89IIG/2502.html b/storage/3KB89IIG/2502.html
new file mode 100644
index 0000000..3f25d10
--- /dev/null
+++ b/storage/3KB89IIG/2502.html
@@ -0,0 +1,352 @@
+[2502.17125] LettuceDetect: A Hallucination Detection Framework for RAG Applications
diff --git a/storage/4LIWKFFQ/.zotero-reader-state b/storage/4LIWKFFQ/.zotero-reader-state
new file mode 100644
index 0000000..06c8c44
--- /dev/null
+++ b/storage/4LIWKFFQ/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":0,"scale":"page-width","top":833,"left":-13,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/5JWXXUR3/.zotero-reader-state b/storage/5JWXXUR3/.zotero-reader-state
index 8c23f2f..a5c9a99 100644
--- a/storage/5JWXXUR3/.zotero-reader-state
+++ b/storage/5JWXXUR3/.zotero-reader-state
@@ -1 +1 @@
-{"pageIndex":0,"scale":312,"top":670,"left":-48,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
+{"pageIndex":26,"scale":220,"top":598,"left":-217,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/5PGKTV4I/.zotero-ft-cache b/storage/5PGKTV4I/.zotero-ft-cache
new file mode 100644
index 0000000..8ce1f5c
--- /dev/null
+++ b/storage/5PGKTV4I/.zotero-ft-cache
@@ -0,0 +1,571 @@
+Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation,
+and Challenges
+YUNSEO LEE∗, UNIST, Republic of Korea
+JOHN YOUNGEUN SONG∗, Handong Global University, Republic of Korea
+DONGSUN KIM, Korea University, Republic of Korea
+JINDAE KIM, Seoul National University of Science and Technology, Republic of Korea
+MIJUNG KIM, UNIST, Republic of Korea
+JAECHANG NAM†, Handong Global University, Republic of Korea
+Recent technical breakthroughs in large language models (LLMs) have enabled them to fluently generate source code. Software
+developers often leverage both general-purpose and code-specialized LLMs to revise existing code or even generate a whole function
+from scratch. These capabilities are also beneficial in no-code or low-code contexts, in which one can write programs without a
+technical background. However, due to their internal design, LLMs are prone to generating hallucinations: incorrect, nonsensical, or unjustifiable information whose presence is difficult to identify. This problem also occurs when generating source code. Once hallucinated code is produced, it is often challenging for users to identify and fix it, especially when such hallucinations surface only under specific execution paths. As a result, the hallucinated code may remain unnoticed within the codebase. This survey
+investigates recent studies and techniques relevant to hallucinations generated by CodeLLMs. We categorize the types of hallucinations
+in the code generated by CodeLLMs, review existing benchmarks and mitigation strategies, and identify open challenges. Based on
+these findings, this survey outlines further research directions in the detection and removal of hallucinations produced by CodeLLMs.
+1 Introduction
+Ensuring the accuracy, reliability, and security of code generated by Large Language Models (LLMs) remains a critical
+challenge [1, 12, 53]. A primary reason for this is the prevalence of hallucinations — instances where the model generates
+code that is illogical, incorrect, or unfaithful to the specified requirements [14]. Addressing these hallucinations is
+essential, as they undermine the trustworthiness of the generated code and can introduce significant risks and errors
+into software applications.
+Although benchmarks such as HumanEval [9] and Mostly Basic Python Programming (MBPP) [6] are commonly
+used to evaluate the code generation performance of LLMs, there remains a lack of standardized methods for assessing the hallucinations generated by CodeLLMs. These general benchmarks measure only the syntactic or token-wise differences between the generated and oracle code. At most, the benchmarks provide simple test cases with which users can verify a subset of dynamic behaviors of the generated code, which are not useful for defining, detecting, and
+mitigating hallucinations.
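The "simple test cases" that such benchmarks supply amount to executing the generated code and checking a few input/output pairs. A minimal, self-contained sketch of this HumanEval-style functional check follows; the candidate snippet, the test pairs, and the helper `passes_tests` are illustrative, not taken from any of the benchmarks cited above. Note that passing these tests only confirms behavior on the exercised paths, which is exactly the limitation the paragraph describes.

```python
# HumanEval-style functional checking: execute a candidate completion in a
# scratch namespace and run the benchmark's test cases against it. Intended
# for trusted benchmark code only, since exec() runs arbitrary Python.

CANDIDATE = """
def add(a, b):
    return a + b
"""

TESTS = [
    ("add(1, 2)", 3),
    ("add(-1, 1)", 0),
]

def passes_tests(candidate_src: str, tests) -> bool:
    namespace = {}
    try:
        exec(candidate_src, namespace)  # load the generated code
        return all(eval(expr, namespace) == expected for expr, expected in tests)
    except Exception:                   # any crash counts as a failure
        return False

print(passes_tests(CANDIDATE, TESTS))  # True
```

A hallucinated completion that computes `a - b`, or that fails to parse at all, would make `passes_tests` return False, but a subtle hallucination on an untested input would still slip through.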
+To address hallucination issues in code generation tasks, many researchers have recently created evaluation benchmarks for these tasks and proposed various approaches to addressing them. For example, benchmarks such as
+∗Both authors contributed equally to this research. Yunseo Lee conducted this study while he was an undergraduate student at Handong Global University. †Corresponding author.
+Authors’ Contact Information: Yunseo Lee, yunseo.lee@unist.ac.kr, UNIST, Ulsan, Republic of Korea; John Youngeun Song, john.song@handong.edu, Handong Global University, Pohang, Republic of Korea; Dongsun Kim, Korea University, Seoul, Republic of Korea, darkrsw@korea.ac.kr; Jindae Kim, Seoul National University of Science and Technology, Seoul, Republic of Korea, jindae.kim@seoultech.ac.kr; Mijung Kim, UNIST, Ulsan, Republic of Korea, mijungk@unist.ac.kr; Jaechang Nam, Handong Global University, Pohang, Republic of Korea, jcnam@handong.edu.
+
+CodeHaluEval [53] and CodeMirage [1] have been developed to measure hallucination frequencies, while mitigation
+strategies such as iterative grounding [12] and self-revision feedback loops [37] aim to reduce specific hallucinations.
+The goal of this study is to provide a comprehensive analysis of code hallucinations, including their categorization,
+evaluation metrics, and mitigation strategies. To achieve this goal, we (1) structure a detailed taxonomy of code hallucinations, (2) review and categorize existing benchmarks and evaluation metrics used for detecting these hallucinations, (3) consolidate a list of root causes that contribute to code hallucinations, and (4) survey current mitigation strategies
+designed to address code hallucinations.
+2 Differences from other surveys on hallucinations of CodeLLMs
+Although hallucinations generated by LLMs in general are studied in multiple surveys [14, 19, 61], our survey focuses
+on hallucinations observed during code generation tasks using LLMs. The following are the key aspects of our survey:
+• Focus and Scope: We focus on hallucinations specifically observed from code generation tasks, addressing
+unique challenges such as syntactic and semantic discrepancies in code output. In addition, while existing
+surveys [14, 18, 22, 64] on code generation analyzed performance, benchmarks, data curation, and evaluation
+metrics, they failed to systematically explore code hallucinations. By exploring taxonomy, benchmarks, metrics,
+and mitigation strategies tailored to code-specific hallucinations, our survey fills this critical gap and provides a
+comprehensive framework for future research.
+• Taxonomy and Categorization: Existing hallucination surveys classify hallucinations into input-conflicting,
+context-conflicting, and fact-conflicting types [19]. Building upon these classifications, our study introduces a
+taxonomy that incorporates specialized hallucination types unique to the code generation process, allowing a
+systematic exploration of hallucination issues specific to this domain.
+• Integration of Benchmarks: Although other surveys [14, 22, 64] include benchmarks such as HumanEval [9]
+and TruthfulQA [33], we identified four datasets and benchmarks explicitly aligned with detecting and mitigating
+code hallucination, such as tests for functional correctness and adherence to APIs.
+• Exploration of Mitigation Strategies: While previous surveys covered mitigation approaches for general
+natural languages [61], we delve into mitigation strategies such as fine-tuning with code-specific datasets,
+leveraging automated testing frameworks, and integrating static and dynamic program analysis tools for
+real-time hallucination detection.
+3 Paper Collection and Review Schema
+3.1 Survey Scope
+We aim to cover in full the taxonomy, benchmarks and evaluation metrics, causes of hallucinations, and mitigation
+techniques for hallucinations in code generated by CodeLLMs. The criteria for selecting papers are as follows:
+• Papers that discuss both LLM-based code generation and LLM hallucination.
+• Papers that define code hallucinations or propose taxonomies related to them.
+• Papers that propose techniques for detecting or mitigating code hallucinations.
+• Papers that introduce datasets or benchmarks for evaluating the performance of CodeLLMs.
+To distinguish our study from existing surveys on hallucinations in the Natural Language Processing (NLP) domain
+and focus on code generation, we included only papers that addressed both LLM code generation and LLM hallucination.
+In particular, we searched for papers that explicitly used terms such as code hallucination or hallucinated code. For
+
+
+mitigation-related studies, we included papers that addressed the correctness of generated code, even if the term
+hallucination was not explicitly mentioned.
+3.2 Methodology for Literature Identification
+We conducted a systematic literature review on various papers. To gather as many relevant studies as possible, Google
+Scholar keyword searches were performed using the terms “hallucination” and “code generation”. Considering the rapid
+advances in research related to LLMs, the review focused mainly on articles published after 2023, while also including
+two notable articles from 2022 based on their significance. Titles, abstracts, and introductions of the retrieved papers
+were manually reviewed and categorized into three main categories: Taxonomy, Benchmark, and Mitigation.
+In addition, to ensure comprehensive coverage of studies on code hallucination, the snowball method [59] was
+employed. Snowballing, commonly used in survey studies, involves tracking citations of identified papers until no
+additional relevant papers are found. This process helped identify missing studies from the initial search, as well as NLP
+hallucination papers frequently cited in code hallucination research. Although these NLP studies were not included
+in the systematic review as they did not focus on code, they provided foundational insights to develop classification
+criteria for code hallucinations.
+[Fig. 1. Distribution of the categorization of papers: Venn diagram of the Taxonomy, Benchmarking, and Mitigation categories and their overlaps.]
+[Fig. 2. Distribution of papers by venue: arXiv 51.9% (27), NeurIPS 5.8% (3), ICML 5.8% (3), ICSE 5.8% (3), TSE 3.8% (2), venues with a single paper 26.9% (14).]
+We categorized the papers into three key dimensions: Taxonomy, Benchmarking, and Mitigation, as shown in Fig. 1.
+Most of the papers fall under the Benchmarking category (20 papers [3, 6–9, 11, 17, 23, 25, 27–30, 35, 38, 49, 50, 60,
+66, 67]) and the Mitigation category (16 papers [12, 13, 21, 26, 32, 36, 39, 40, 43, 45, 48, 51, 54, 55, 62, 63]), while fewer
+studies are categorized under Taxonomy (five papers [15, 24, 42, 52, 57]). Overlapping areas reveal cross-disciplinary
+contributions: four papers address both Taxonomy and Mitigation [31, 37, 44, 65], three papers address both Taxonomy
+and Benchmarking [1, 34, 53], and two papers explore both Mitigation and Benchmarking [2, 20]. Only one paper [10]
+combines all three dimensions, emphasizing the scarcity of comprehensive studies.
+While many papers are in a preprint stage (e.g., arXiv), authors gradually publish papers at top venues in the
+community. Fig. 2 shows the distribution of papers by venue. About half of the papers (51.9%) were published on
+arXiv. The remaining papers were published in top-tier conferences (39.2%) such as NeurIPS (Annual Conference on
+Neural Information Processing Systems) and ICML (International Conference on Machine Learning), and academic
+journals (7.8%) such as TSE (IEEE Transactions on Software Engineering).
+4 LLM-based code generation (CodeLLMs) and its hallucination
+CodeLLMs have been developed to address unique challenges in this domain. OpenAI’s Codex and its derivative
+Copilot are prominent examples that introduced generative pre-trained models with billions of parameters that produce
+
+
+snippets [9, 38]. Following these innovations, models such as Anthropic’s Claude Sonnet [5], Meta’s CodeLLaMA [46],
+DeepMind’s AlphaCode [30], Salesforce’s CodeGen [41], and Amazon’s CodeWhisperer [4] entered the landscape,
+each addressing different aspects of coding efficiency and applicability. OpenAI further refined its offerings with
+GPT-3.5 and GPT-4, showcasing enhanced capabilities in generating syntactically and semantically accurate code. These
+advancements are often accompanied by benchmark datasets such as HumanEval [9], DS-1000 [25], and MBPP [6],
+which assess the performance of LLMs on diverse coding tasks.
+Despite their promise, LLMs face a significant challenge in code generation: hallucinations. Hallucinations, in this context, refer to the generation of code that is nonsensical, logically flawed, or unfaithful to the given task description [10]. Studies in the NLP field have classified hallucinations into types such as input-conflicting, context-conflicting,
+and fact-conflicting hallucinations [19]. Within code generation, hallucinations can manifest as bugs, syntactical errors,
+security vulnerabilities, or even non-deterministic outputs. Existing research highlights that hallucinated outputs not
+only degrade functional correctness, but may also introduce subtle errors, such as memory leaks or insecure code [7].
+5 Taxonomy of Hallucination by CodeLLMs
+In our effort to create a consolidated taxonomy of code hallucinations generated by CodeLLMs, we analyzed relevant
+papers that presented their own classifications of hallucinations. Rather than focusing on the causes of hallucinations, our resulting taxonomy categorizes them based on the observable characteristics of the errors produced, as shown in Fig. 3. A key advantage of this approach is that it provides an objective basis for classifying hallucinations, regardless of the
+model architecture or the training datasets. The taxonomy consists of four primary categories: Syntactic Hallucinations,
+Runtime Execution Hallucinations, Functional Correctness Hallucinations, and Code Quality Hallucinations. In this
+section, we discuss each primary category with detailed sub-categories.
+5.1 Syntactic Hallucinations
+These refer to errors that deviate from the language's syntax, rendering the code unparsable and unable to be
+compiled or interpreted [2, 10, 15, 52, 57]. Syntactic hallucinations can be further classified into two sub-categories:
+“Syntax Violations” and “Incomplete Code Generation”.
+5.1.1 Syntax Violations. These occur when a CodeLLM generates code that violates the syntax of the programming language, leading to compile-time errors [1, 10, 57]. Three research papers include a specific taxonomy of syntax violations [1, 10, 57]. One paper [1] classifies syntax-related errors in generated code under the term Syntactic Incorrectness. Two papers classify syntax violations further and provide more specific terms such as Incorrect Indentation, Conditional Error, Loop Error, Return Error, and Assignment Error [10, 57].
+
+Fig. 3. Taxonomy of hallucinations possibly generated by CodeLLMs:
+- Syntactic Hallucinations: Syntax Violation [1, 10, 57]; Incomplete Code Generation [15, 52]
+- Runtime Execution Hallucinations: API Knowledge Conflict [10, 34, 65]; Invalid Reference Errors [10, 15, 34, 53, 57]
+- Functional Correctness Hallucinations: Incorrect Logical Flow [10, 15, 34, 52, 53, 57]; Requirement Deviation [34, 52, 57, 65]
+- Code Quality Hallucinations: Resource Mishandling [53, 65]; Security Vulnerability [42, 65]; Code Smell [34, 52, 57, 65]
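As a minimal illustration of this category, the sketch below shows a hallucinated snippet with a Conditional Error-style syntax violation (a missing colon) and a static check that flags it before any execution. The snippet and the helper `syntax_error_of` are illustrative, not taken from any of the surveyed tools.

```python
# A hallucinated snippet that violates Python syntax: the def line is
# missing its trailing colon, so the code cannot be parsed or compiled.

hallucinated = """
def is_even(n)
    return n % 2 == 0
"""

def syntax_error_of(src: str):
    """Return the SyntaxError raised when compiling src, or None if it parses."""
    try:
        compile(src, "<generated>", "exec")
        return None
    except SyntaxError as e:
        return e

print(syntax_error_of(hallucinated) is not None)  # True: flagged before execution
```

Because the violation is detectable at parse time, this class of hallucination is the cheapest to catch: a single `compile()` call suffices, with no need to run the generated program.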
+5.1.2 Incomplete Code Generation. This occurs when CodeLLMs stop generating code prematurely or omit entire code blocks [15, 52]. Even when no specific syntax rule of the language is violated, incomplete code generation will prevent the code from being executed or compiled.
+5.2 Runtime Execution Hallucinations
+These occur when CodeLLMs generate code that is syntactically valid but produces runtime errors, such as exceptions
+or crashes, during execution [10, 15, 34, 52, 53, 57, 65]. Although syntactic correctness is a necessary condition for code
+execution, it does not guarantee that the code will function as intended or even run without errors. They manifest only
+when the code is actually run and may depend on specific inputs or external factors. Unlike syntactic hallucinations, these
+types of hallucinations do not necessarily break the syntax, but cause the program to crash or behave unexpectedly.
+5.2.1 API Knowledge Conflict. This occurs when CodeLLMs misuse libraries or APIs, leading to issues such as missing
+imports or incorrect or extra parameters [10, 34, 65].
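A small runnable sketch of an API knowledge conflict: the model invents a keyword argument for a real API. The correct parameter of Python's built-in `sorted()` is `reverse`, but a hallucinated call uses `reversed`; the code parses cleanly and only fails at run time. The helper `try_call` is illustrative.

```python
# API knowledge conflict: a syntactically valid call with a hallucinated
# keyword argument that surfaces only as a run-time TypeError.

def try_call(fn, *args, **kwargs):
    try:
        return ("ok", fn(*args, **kwargs))
    except TypeError as e:
        return ("api-misuse", str(e))

status, _ = try_call(sorted, [3, 1, 2], reversed=True)  # hallucinated kwarg
print(status)  # api-misuse

status, result = try_call(sorted, [3, 1, 2], reverse=True)  # correct API
print(status, result)  # ok [3, 2, 1]
```

The same pattern covers missing imports and extra or misordered parameters: nothing is wrong statically, but the first execution of the offending call raises an exception.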
+5.2.2 Invalid Reference Errors. These arise when CodeLLMs produce code that attempts to access or manipulate
+program elements that are not yet defined in the code [10, 15, 34, 53, 57]. This can manifest in using variables that have
+not been declared or attempting to access non-existent members of an object.
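Both manifestations described above can be sketched in a few lines; the snippets and the helper `reference_error` are illustrative, not from any surveyed benchmark.

```python
# Invalid reference errors: syntactically valid code that references names or
# members that do not exist, surfacing only when the offending line executes.

undeclared_var = """
total = subtotal + tax        # 'subtotal' was never declared: NameError
"""

missing_member = """
text = "hello"
text.append("!")              # str has no 'append' member: AttributeError
"""

def reference_error(src: str):
    """Run src in a clean namespace; return the reference-error class name, if any."""
    try:
        exec(compile(src, "<generated>", "exec"), {})
        return None
    except (NameError, AttributeError) as e:
        return type(e).__name__

print(reference_error(undeclared_var))   # NameError
print(reference_error(missing_member))   # AttributeError
```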
+5.3 Functional Correctness Hallucinations
+These arise when CodeLLMs generate code that executes, but does not satisfy the functional requirements of the program; they are further categorized as Incorrect Logical Flow and Requirement Deviation [10, 15, 34, 52, 53, 57, 65]. Even when a program is syntactically correct and free from runtime errors, there is no guarantee that the code performs its intended task.
+5.3.1 Incorrect Logical Flow. This arises when CodeLLMs generate code that contains flaws in its implementation of
+algorithms and reasoning [10, 15, 34, 52, 53, 57]. These hallucinations often lead to an incorrect solution. This category
+encompasses flaws such as missing corner cases, incorrect conditional statements, and incorrect arithmetic operations.
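A missing-corner-case flaw of the kind listed above can be sketched as follows; both functions are illustrative stand-ins, not outputs of any particular model.

```python
# Incorrect logical flow: code that parses and runs, but whose logic misses a
# corner case. The hallucinated max-finding function assumes a non-empty input;
# comparing it against a safe reference implementation exposes the flaw.

def hallucinated_max(xs):
    best = xs[0]                 # flaw: crashes on an empty list
    for x in xs:
        if x > best:
            best = x
    return best

def safe_max(xs, default=None):
    best = default
    for x in xs:
        if best is None or x > best:
            best = x
    return best

print(hallucinated_max([2, 9, 4]))   # 9: agrees with the reference on the happy path
print(safe_max([], default=0))       # 0: the corner case is handled
try:
    hallucinated_max([])             # the missing corner case surfaces at run time
except IndexError:
    print("corner case crashes")
```

This is why simple benchmark test suites miss such hallucinations: on typical inputs the two implementations agree, and only the untested corner case diverges.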
+5.3.2 Requirement Deviation. These arise when CodeLLMs produce code that deviates from the explicit requirements
+and functionalities outlined in the prompt or problem description [34, 52, 57, 65]. These hallucinations represent generated code that fails to satisfy the requirements of the prompt. Given the diverse situations in which requirement deviation occurs, taxonomies often categorize these errors under broad terms. This category encompasses taxonomy entries such as overall semantic-conflicting hallucinations [34] and functional requirement violations [65], while one
+paper [57] mentions usage of an incorrect function that does not match the requirements.
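A minimal sketch of requirement deviation, under a hypothetical prompt of our own (the task text, function names, and checker are all illustrative): suppose the prompt asks for scores sorted in descending order, but the generated function sorts ascending. The code runs without any error, yet deviates from the stated requirement.

```python
# Requirement deviation: code that executes cleanly but ignores part of the
# (hypothetical) prompt "return the scores sorted in DESCENDING order".

def generated_top_scores(scores):        # deviates: sorts ascending instead
    return sorted(scores)

def meets_requirement(fn, scores):
    """Check the output against the prompt's stated ordering requirement."""
    return fn(scores) == sorted(scores, reverse=True)

print(meets_requirement(generated_top_scores, [3, 1, 2]))  # False
```

Unlike runtime errors, nothing crashes here; only a check derived from the prompt itself can reveal the deviation, which is why this category is hard to detect automatically.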
+5.4 Code Quality Hallucinations
+These occur when CodeLLMs generate code that introduces risks related to resource management, security vulnerabilities,
+or performance degradation [34, 42, 52, 53, 57, 65]. These hallucinations often compromise the stability, security, and
+efficiency of the overall system. We categorize these issues into three distinct subcategories: Resource Mishandling,
+Security Vulnerability, and Code Smell Issues.
+
+
+Table 1. Comparative Analysis of Code Hallucination Benchmarks.
+- CodeHaluEval [53]. Language: Python. Tasks: 699. Data reference: APPS. Content: not mentioned. Purpose: comparing the various types and frequencies of hallucinations in code generation across different LLMs. Construction: generated code using the APPS dataset and applied the CodeHalu algorithm to identify the types of hallucinations present and their respective frequencies.
+- CodeMirage [1]. Language: Python. Tasks: 1,137. Data reference: HumanEval, MBPP. Content: problems, hallucinated code snippets, ground-truth code snippets, test cases. Purpose: experiment with and measure LLM capabilities for automatically detecting code hallucinations using one-shot prompts. Construction: designed explicit prompts for each of the hallucination types and input them into GPT-3.5 to obtain Python code generations with specific hallucination types.
+- LMDefects [15]. Language: Java. Tasks: 113 (easy: 60, medium: 53). Data reference: LeetCode. Content: problem descriptions, code snippets, public test cases. Purpose: evaluate the precision of Codex-generated code and assess the feasibility of applying automated program repair (APR) techniques. Construction: collected public LeetCode problems not included in Codex training; included a diverse range of Java tasks for analysis.
+- EvalPlus [35]. Language: Python. Tasks: 164. Data reference: HumanEval. Content: programming tasks, function signatures, and docstrings. Purpose: reveal the real correctness of LLM-synthesized code. Construction: extended the HumanEval dataset by adding type-aware mutations, generating an average of 764.17 test cases per problem to evaluate hallucinations.
+- CodeContests [30]. Language: C++, Java, Python, etc. Tasks: 13,328 (training), 117 (validation), 165 (test). Data reference: Codeforces, CodeChef, etc. Content: problems, correct and incorrect human submissions, test cases. Purpose: train, validate, and evaluate AlphaCode. Construction: leveraged private and public code competition problems; test cases were expanded through mutation methods.
+- MultiPL-E [8]. Language: 18 languages. Tasks: similar to HumanEval and MBPP. Data reference: HumanEval, MBPP. Content: not mentioned. Purpose: propose the first massively parallel, multi-language benchmark for code generation. Construction: created a multi-language benchmark by converting Python-based NL2Code benchmarks into 18 programming languages.
+- HalluCode [34]. Language: Python. Tasks: 5,663. Data reference: CodeAlpaca. Content: objectives, hallucination categories, task descriptions. Purpose: evaluate the performance of CodeLLMs in recognizing hallucinations. Construction: focused on task-description evaluation and detecting hallucinations specific to programming contexts.
+5.4.1 Resource Mishandling. These errors arise when CodeLLMs produce code that improperly manages a system's
+resources, leading to excessive consumption or inefficient allocation of memory that can eventually cause code
+failure [53, 65]. Such hallucinations occur when CodeLLMs write code whose data-processing operations fail because
+they exceed memory capacity, or when numerical overflow results from exceeding the limits of numerical
+calculations [53]. Zhang et al. [65] also mention non-functional requirements related to suboptimal performance, such
+as inefficient loop structures.
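To make the failure mode concrete, the following hypothetical sketch contrasts a resource-mishandling variant of a small task with a bounded-memory one; the function names and the line-counting task are invented for illustration, not drawn from the cited studies.

```python
import os
import tempfile

# Hypothetical illustration (names and task invented): the first variant
# materializes an entire file in memory, so it fails once inputs exceed RAM;
# the second streams line by line with bounded memory use.

def count_matches_hallucinated(path, needle):
    # Reads the whole file at once; memory use grows with file size.
    with open(path) as f:
        lines = f.readlines()
    return sum(needle in line for line in lines)

def count_matches_streaming(path, needle):
    # Iterates lazily; memory use is bounded by the longest line.
    count = 0
    with open(path) as f:
        for line in f:
            if needle in line:
                count += 1
    return count

# Demo on a small temporary file: both agree on the result; only the
# memory behavior differs.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("foo\nbar\nfoo baz\n")
    demo_path = f.name
eager = count_matches_hallucinated(demo_path, "foo")
lazy = count_matches_streaming(demo_path, "foo")
os.unlink(demo_path)
```

On small inputs the two are indistinguishable, which is exactly why such hallucinations pass casual review and only surface at scale.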
+5.4.2 Security Vulnerability. These hallucinations arise when CodeLLMs produce code that introduces security
+weaknesses, making the system susceptible to attacks or unauthorized access [42, 65]. Although only two papers provide
+taxonomies that cover security vulnerabilities, Pearce et al. [42] give a detailed analysis of various security
+vulnerabilities in generated code. Among the many kinds of security vulnerabilities, common examples include improper
+input validation, use-after-free errors, and null-pointer dereference errors.
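As a concrete, hypothetical illustration of improper input validation, the sketch below contrasts a string-interpolated SQL query of the kind generated code may emit with a parameterized one; the schema and function names are invented for this example.

```python
import sqlite3

# Hypothetical illustration (schema and names invented) of improper input
# validation: the first query interpolates user input into SQL, the second
# passes it as a bound parameter.

def find_user_vulnerable(conn, username):
    # Interpolation lets "' OR '1'='1" rewrite the WHERE clause.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_safe(conn, username):
    # A parameterized query treats the input purely as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
payload = "' OR '1'='1"
leaked = find_user_vulnerable(conn, payload)   # injection returns every row
safe = find_user_safe(conn, payload)           # no user has that literal name
```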
+5.4.3 Code Smell. These occur when CodeLLMs produce code with low maintainability due to extraneous or unnecessary
+code [34, 52, 57, 65]. Although these hallucinations do not cause critical security or performance issues, avoiding
+them is crucial for the maintainability and readability of the code that human developers work with. They include dead
+code, garbage code, and incomplete generation [34, 52, 57]. These issues are sometimes called "non-functional
+requirement violations," as the affected code often contains a part that is unreachable, performs useless assignments,
+contains only comments, or has empty function bodies [65].
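A smell such as dead code can be flagged mechanically. The sketch below is a minimal, illustrative checker (not a tool from the cited papers) that reports statements following a `return` in the same block.

```python
import ast

# A minimal, illustrative checker (not from the cited papers) that flags
# one smell: statements that appear after a `return` in the same block
# and therefore can never execute.

def find_unreachable(source):
    """Return the line numbers of statements directly following a return."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        body = getattr(node, "body", None)
        if not isinstance(body, list):
            continue
        for stmt, following in zip(body, body[1:]):
            if isinstance(stmt, ast.Return):
                flagged.append(following.lineno)
    return flagged

smelly = """\
def add(a, b):
    return a + b
    print("done")  # dead code: never reached
"""
flagged = find_unreachable(smelly)
```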
+6 Benchmarks and Metrics to Evaluate Hallucinations by CodeLLMs
+6.1 Benchmarks
+The growing interest in addressing hallucinations in LLM-generated code has led to the development of various
+benchmarks. Standard benchmarks are necessary to analyze the hallucination tendencies of various CodeLLMs and
+to evaluate hallucination detection and mitigation techniques.
+
+Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges
+
+Table 1 shows recent benchmarks related to code hallucination and summarizes their distinct features. Existing
+benchmarks for evaluating hallucinations by CodeLLMs have limitations, such as a lack of language diversity and a
+failure to reflect real-world workloads.
+Many of these benchmarks build on existing LLM code generation benchmarks, extending them to overcome those
+limitations. EvalPlus, proposed by Liu et al. [35], extends an existing benchmark, HumanEval, to address its specific
+limitations: HumanEval contains vague task descriptions and an insufficient number of test cases per task, and some
+solutions labeled as correct in HumanEval were found to be erroneous. EvalPlus addresses these limitations by
+increasing the average number of test cases per task to 764.17, leveraging LLMs for seed input generation and
+employing type-aware mutation for fuzz testing. CodeMirage [1] assesses the ability of LLMs to detect hallucinations
+in input code. CodeMirage was built from the HumanEval and MBPP datasets, with artificial hallucinations inserted
+into the code using GPT-3.5.
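The idea of type-aware mutation can be sketched as follows; this is a toy illustration of the concept, not the actual EvalPlus implementation, and the particular mutation operators are assumptions.

```python
import random

# A toy illustration of type-aware input mutation in the spirit of EvalPlus;
# the mutation operators below are assumptions, not the actual EvalPlus ones.
# Each seed input is perturbed according to its runtime type, so mutants stay
# type-valid while probing nearby edge cases.

def mutate(value, rng):
    if isinstance(value, bool):              # check bool before int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1, 17, -17])
    if isinstance(value, float):
        return value * rng.choice([0.5, 2.0, -1.0])
    if isinstance(value, str):
        return value + rng.choice(["", " ", "a"]) if value else "a"
    if isinstance(value, list):
        # Mutate elements and append a duplicate to vary the length.
        return [mutate(v, rng) for v in value] + (value[:1] if value else [])
    return value                             # unknown types pass through

rng = random.Random(0)
seeds = [[1, 2, 3], "hi", 7]
mutants = [mutate(seed, rng) for seed in seeds]
```

Feeding such mutants to both the generated code and a reference solution is what exposes hallucinations that the original, sparse test suites miss.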
+Among the seven benchmarks we inspected, five support only one programming language, and four of them
+(CodeHaluEval, CodeMirage, EvalPlus, and HalluCode) specifically target Python coding tasks. This distribution reflects
+the frequent use of Python in scenarios where LLMs generate code. In contrast, Fan et al. [15] propose LMDefects,
+a Java-focused benchmark that evaluates the correctness of code generated by Codex and explores the applicability
+of automated program repair (APR) techniques to hallucinated code. LMDefects is based on easy- and medium-level
+problems from the LeetCode platform and incorporates the public test cases provided by the platform.
+Unlike the aforementioned benchmarks, MultiPL-E and CodeContests contain code generation tasks in diverse
+programming languages. To compare models rigorously, it is essential to evaluate their ability to generate code in
+languages beyond Python, and multi-language benchmarks have been developed for this purpose, as CodeLLMs are
+typically designed to handle multiple programming languages. Cassano et al. [8] introduced MultiPL-E, a benchmark
+that translates Python problems from the HumanEval and MBPP datasets into 18 different programming languages. The
+benchmark uses 18 custom compilers to translate code snippets, test cases, and other components originally designed
+for Python into other languages, allowing a comparative analysis of LLM performance across languages. These compilers
+can also be extended to support additional languages in the future.
+CodeContests, proposed by Li et al. [30], includes programming challenges from platforms such as Codeforces and
+CodeChef to train, validate, and evaluate the AlphaCode model. The dataset supports multiple programming languages,
+such as C++, Java, and Python, enabling broader applicability.
+6.2 Metrics
+To compare and analyze model performance on benchmark datasets in line with their research goals, studies adopt
+different evaluation metrics. Selecting appropriate metrics is essential to accurately assess the specific aspects of
+a model that a study aims to target. This section examines the evaluation metrics used in the papers addressed in
+Section 6.1. Table 2 summarizes the metrics used in various studies to compare the performance of models with respect
+to code hallucination. We have grouped the metrics into the following categories: Functional Correctness,
+Hallucination Detection, Hallucination Type Classification, and Hallucination Mitigation.
+Table 2. Comparative Analysis of Code Hallucination Metrics. (Fields: Category | Metric | Description | Ref.)
+
+Functional Correctness.
+Pass@k: evaluates the correctness of code generated by a CodeLLM; it measures the likelihood that a CodeLLM generates
+functionally correct code for a given task. [8] [15] [35]
+10@k: evaluates a CodeLLM's ability to generate correct code, specifically the ability to produce multiple correct
+solutions for a single task. [30]
+
+Hallucination Detection.
+Hallucination Rate (HR): reflects the hallucination phenomenon in LLMs during code generation tasks through actual
+execution tests. [53]
+Valid Rate (VR): reflects the percentage of valid code outputs by an LLM. [34]
+Accuracy of Hallucination Existence Recognition (ACCrec): reflects the percentage of correctly identified existences
+of hallucinations. [34]
+
+Hallucination Classification.
+Accuracy of Hallucination Type Recognition (ACCtype(i)): reflects the percentage of accurately identified
+hallucination types; Liu et al. proposed five types of hallucinations. [34]
+Accuracy, Macro-precision, Macro-recall, and Macro-F1: standard metrics used to evaluate multi-class classification
+performance, where classes represent different hallucination types. [1]
+
+Hallucination Mitigation.
+Accuracy of Hallucination Mitigation (ACCmit): reflects the percentage of modified hallucinated code that is
+semantically correct. [34]
+
+6.2.1 Functional Correctness. This category focuses on evaluating how well the generated code satisfies its intended
+requirements. The most common metric, Pass@k, measures the frequency with which at least one of the k generated
+solutions passes all test cases. Pass@10, a popular variant, represents the fraction of tasks in which at least one
+of 10 generated solutions is correct. In contrast, 10@k measures the fraction of tasks solved when 10 submissions are
+selected from the k samples generated per task. Pass@k and 10@k treat as a hallucination any error that prevents the
+generated code from passing all test cases.
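In practice, Pass@k is usually computed with the unbiased estimator introduced by Chen et al. [9]: generate n >= k samples per task, count the c that pass, and estimate the probability that a random size-k subset contains a correct one. A minimal sketch:

```python
from math import comb

# Unbiased Pass@k estimator: n samples generated per task, c of which pass
# all tests; returns the probability that a random subset of k samples
# contains at least one passing solution.

def pass_at_k(n, c, k):
    if n - c < k:
        # Too few failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

example = pass_at_k(10, 5, 1)  # 10 samples, 5 correct -> pass@1 = 0.5
```

Averaging this quantity over all tasks gives the benchmark score; generating n larger than k reduces the variance of the estimate.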
+6.2.2 Hallucination Detection. This category quantifies the presence of hallucinations within the generated code. The
+metrics of this type are Hallucination Rate (HR), Valid Rate (VR), and Accuracy of Hallucination Existence Recognition
+(ACCrec) [34, 53]. HR, proposed by Tian et al. [53], measures the proportion of generated code samples that are
+syntactically valid but fail to execute as expected under their CodeHalu algorithm. VR measures the proportion of
+generated outputs that are syntactically valid and executable [34]; a lower VR can thus suggest that hallucinations
+are interfering with the code's ability to run. ACCrec, used in tandem with VR, focuses on how accurately a model
+identifies valid code outputs that also contain hallucinations.
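Under one plausible reading of these definitions (we approximate the formulas of Liu et al. [34] here; the record layout is an assumption), the two HalluCode-style metrics can be sketched over labeled samples:

```python
# A sketch of the detection metrics under one plausible reading of their
# definitions; the record layout is an assumption for illustration. Each
# sample notes whether the output is valid (parses and runs), whether it
# truly contains a hallucination, and what the model predicted.

def valid_rate(samples):
    return sum(s["valid"] for s in samples) / len(samples)

def acc_rec(samples):
    # Accuracy of hallucination-existence recognition over valid outputs.
    valid = [s for s in samples if s["valid"]]
    hits = sum(s["predicted"] == s["hallucinated"] for s in valid)
    return hits / len(valid)

samples = [
    {"valid": True,  "hallucinated": True,  "predicted": True},
    {"valid": True,  "hallucinated": False, "predicted": True},
    {"valid": True,  "hallucinated": True,  "predicted": True},
    {"valid": False, "hallucinated": True,  "predicted": False},
]
vr = valid_rate(samples)   # 3 of 4 outputs are valid
acc = acc_rec(samples)     # 2 of 3 valid outputs judged correctly
```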
+6.2.3 Hallucination Type Classification. This category assesses a CodeLLM's ability to recognize and classify
+hallucinations. In contrast to detection, type classification aims to categorize hallucinated code into one of the
+predefined hallucination types. The metrics used are Accuracy of Hallucination Type Recognition, ACCtype(i) [34], and
+traditional multi-class classification metrics [1]. ACCtype(i) assesses how precisely the model categorizes the type
+of hallucination present in valid code. Agarwal et al. [1] used accuracy, macro-precision, macro-recall, and macro-F1
+to measure how well hallucinations were detected and classified according to their types. In this context, accuracy
+refers to the percentage of hallucinations whose model-assigned category matches the actual category.
+6.2.4 Hallucination Mitigation. This category measures the ability to successfully fix hallucinated code. Accuracy of
+Hallucination Mitigation, ACCmit [34], is the percentage of recognized hallucinations that are successfully alleviated
+by CodeLLMs.
+7 Causes of Hallucinations in Code Generation
+We investigate the causes of hallucinations by CodeLLMs and classify them into three main issues: Training Data Issues,
+Trained Model Issues, and Prompt Issues. Fig. 4 presents a hierarchical cause-analysis tree for code hallucinations
+generated by CodeLLMs, showing the primary causes broken down into more specific factors.
+
+
+Causes
+- Training Data Issues
+  - Lack of Quantity of Training Datasets: Lack of Diverse Training Sets [22]
+  - Low Quality of Training Datasets: Training on Flawed or Vulnerable Data [49, 50, 53, 65]; Outdated or Incomplete
+    Public API Knowledge [2, 20, 63, 65]
+- Trained Model Issues
+  - Inadequate/Inappropriate Evaluations and Benchmarks: Lack of Benchmarks for Real-World SWE Tasks [22, 49, 50];
+    Lack of Evaluation Metrics for Real-World SWE Tasks [22, 49, 50]
+  - Reasoning & Understanding Deficiencies:
+    - Limited Context Handling & Scope: Mishandling of Unseen Private API [62]; Low Repository & Cross-file
+      Contextual Understanding [2, 10, 12, 22, 32, 48, 55]
+    - Low Syntactical & Structural Understanding [15, 22, 53]
+    - Insufficient Dependency Parsing [26]
+  - Temperature-related Non-Determinism [51]
+  - Token Length Limitation [10, 32]
+  - CodeLLMs' Lack of Requirement Clarification Mechanism [39]
+- Prompt Issues
+  - Ambiguous Prompts: Ambiguous Nature of Natural Languages [13]; Ambiguous Requirements in Prompts [39]
+  - Irrelevant Context in Prompt [32]
+
+Fig. 4. Potential causes of hallucinations by CodeLLMs.
+7.1 Training Data Issues
+One of the primary causes arises from issues in the quality and quantity of the training data. These can be
+categorized as follows: a lack of diverse training sets, training on flawed or vulnerable data, and outdated or
+incomplete public API knowledge. The limited diversity of training data restricts a CodeLLM's ability to generalize
+across various programming tasks. Jain et al. [20] highlight that the breadth and quality of the training dataset are
+crucial for correct code generation. In addition, CodeLLMs often produce code hallucinations because they are trained
+on public repositories that frequently contain deprecated or incomplete API documentation, leading the generated code
+to invoke non-existent APIs or to misuse APIs [65]. Training on flawed or vulnerable data from open-source projects
+compounds these issues, as CodeLLMs propagate security vulnerabilities and inefficient implementations into the
+generated code [50].
+7.2 Trained Model Issues
+Major causes of code hallucinations also stem from issues with the trained models themselves: Inadequate or
+Inappropriate Evaluations and Benchmarks, Reasoning and Understanding Deficiencies, Temperature-related
+Non-Determinism, Model Input Handling, Token Generation Inefficiency, and CodeLLMs' Lack of a Requirement
+Clarification Mechanism.
+One contributor to code hallucinations is the use of inadequate evaluation benchmarks that fail to capture real-world
+software engineering tasks. Current evaluation metrics and benchmarks often do not accurately represent real-world
+tasks; CodeLLMs are frequently evaluated using benchmarks that lack the constructs necessary to assess the security
+of generated code [49, 50]. The need for comprehensive benchmarks and metrics that evaluate a broader spectrum of
+coding skills remains pressing [22].
+Another crucial factor contributing to code hallucinations is the inherent reasoning and understanding deficiencies
+of the trained models. One common deficiency is a CodeLLM's limited ability to handle code context. As LLMs receive a
+larger code context, they often mishandle unseen private APIs and have trouble understanding code across files and
+entire repositories [1, 10, 12, 32, 48, 56]. LLMs lack prior knowledge about private libraries and struggle to
+leverage external, pre-existing knowledge unless they are augmented with retrieval-based generation techniques [62].
+This lack of context is evident when generating functions with repository-level dependencies [56].
+The limited capacity of CodeLLMs to grasp the underlying structure and syntax of programming languages further
+highlights their reasoning and understanding deficiencies [15, 22, 53]. Transformer-based LLM architectures, though
+now the norm, may not be optimally designed to fully capture the inherent structure and syntax of programming
+languages [22]. CodeLLMs' reliance on pattern matching and statistical rules to generate code results in a lack of
+fundamental understanding of symbolic systems [53]. Treating code as a series of tokens causes language models to
+lack awareness of program semantics and leads to the generation of incorrect programs [15].
+The non-deterministic nature of CodeLLMs, controlled by temperature settings and decoding strategies, is an inherent
+issue of the trained model. The temperature parameter governs the randomness of the generated responses: lower
+temperatures yield more predictable and deterministic outputs, while higher temperatures increase creativity and
+diversity [51]. Although higher temperatures can benefit creative code generation, they also increase the rate of
+code hallucinations [51].
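The mechanism can be sketched directly: logits are divided by the temperature before the softmax, so low temperatures concentrate probability mass on the top token while high temperatures flatten the distribution. A minimal sketch with made-up logits:

```python
import math

# Temperature scaling during decoding: logits are divided by the temperature
# before the softmax. Low temperatures sharpen the distribution toward the
# argmax; high temperatures flatten it toward uniform. Logits are made up.

def softmax_with_temperature(logits, temperature):
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.1)   # nearly deterministic
hot = softmax_with_temperature(logits, 10.0)   # close to uniform
```

Sampling from the flattened distribution is what buys diversity at the cost of more frequent low-probability (and more often hallucinated) continuations.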
+Another aspect contributing to code hallucinations arises from limitations in how trained models handle input tokens.
+CodeLLMs have an input token length limit, which impacts their ability to retain all problem details [10] and makes
+it impossible to feed entire code repositories to a CodeLLM for effective code generation [32].
+The limitations of existing CodeLLMs in handling ambiguous requirements can be another source of code hallucinations.
+Current CodeLLMs often lack a mechanism for clarifying unclear or incomplete instructions, which can produce
+hallucinations that do not satisfy the user's requirements [39].
+7.3 Prompt Issues
+The third major cause of code hallucinations is the prompt itself. Two contributing factors are the ambiguous nature
+of the prompt and the presence of insufficient or irrelevant context in it. A significant challenge originates from
+the inherent ambiguity of natural language: natural language prompts tend not to capture the user's intent in a fully
+nuanced and accurate manner, making it challenging to generate code from such ambiguous prompts [13, 39]. Furthermore,
+code hallucinations can arise from contextual deficiencies in the prompt: providing insufficient context or including
+irrelevant details can hinder the CodeLLM's ability to generate accurate and satisfactory code [32].
+8 Hallucination Mitigation Methods
+Various approaches to mitigate hallucinations are being actively explored. Among these, five approaches were selected
+for comparative analysis. The following sections provide an overview of the specific hallucination types each approach
+targets, the root causes they address, and a brief description of each method, along with its strengths and limitations.
+
+
+8.1 De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
+Eghbali and Pradel [12] target hallucinations involving project-specific APIs, where two main challenges arise. The
+first challenge is that LLMs lack knowledge of project-specific APIs and may fail to correctly use existing functions
+and classes. To investigate this issue, the authors selected five functions from each of ten open-source projects to
+create a code completion task; experimental results showed that 44% of the generated code contained at least one
+instance of incorrect API usage. Reducing such issues would require providing the entire project code as input, which
+is practically impossible given input-size constraints, so selecting only the essential code snippets to include
+becomes critical. The second challenge lies in accurately identifying the importance of each piece of code for this
+purpose. To address these challenges, they proposed an approach named De-hallucinator, which iteratively retrieves
+relevant APIs to improve the prompts.
+The De-hallucinator [12] pre-analyzes and indexes all source code within the project. When a code generation prompt
+is provided, it selects the most relevant APIs based on the input and creates a Retrieval-Augmented Generation (RAG)
+prompt that includes these APIs. Alternatively, it generates an iterative prompt that incorporates the APIs most
+relevant to the code produced by the initial prompt. These prompts are then used as inputs for code generation. This
+approach has the advantage of not requiring modifications to the internal structure of the LLM. However, it has the
+drawback of relying on the project containing well-documented and detailed API descriptions.
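A minimal sketch of the retrieval step, in the spirit of De-hallucinator but with an invented token-overlap ranking (the real tool's indexing and ranking are more involved), and hypothetical API signatures:

```python
import re

# A sketch of the retrieval step in the spirit of De-hallucinator, with an
# invented token-overlap ranking; the API signatures below are hypothetical.

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_apis(api_signatures, query, top_k=2):
    query_tokens = tokens(query)
    return sorted(
        api_signatures,
        key=lambda sig: len(tokens(sig) & query_tokens),
        reverse=True,
    )[:top_k]

def build_rag_prompt(task, api_signatures):
    # Prepend the most relevant project APIs so the model grounds its
    # generation in code that actually exists.
    context = "\n".join(f"# API: {sig}" for sig in rank_apis(api_signatures, task))
    return f"{context}\n# Task: {task}\n"

apis = [
    "parse_config(path: str) -> dict",
    "send_email(to: str, body: str) -> None",
    "load_user(user_id: int) -> User",
]
prompt = build_rag_prompt("load the user config from a path", apis)
```

In the iterative variant, the same ranking would be re-run against the model's first draft rather than the task description.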
+8.2 Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
+Liu et al. [37] proposed a hallucination mitigation method leveraging ChatGPT’s self-revision capabilities. The approach
+aims to address all code quality issues in LLM-generated code, including execution errors, incorrect outputs, and
+maintainability problems. The method provides two types of feedback to the LLM immediately after code generation:
+simple feedback and feedback with static analysis:
+• Simple feedback: This feedback involves informing the model that the generated code contains quality issues
+without specifying details.
+• Feedback with static analysis: This feedback includes more detailed information, such as static analysis
+results and runtime error messages for the generated code.
+The study found that using these feedback methods enabled ChatGPT to self-revise 20–60% of the generated code.
+Furthermore, iterative feedback led to a gradual improvement in code quality over time.
+This approach has the advantage of reflecting the general scenario in which developers use LLMs for code generation,
+effectively demonstrating its mitigation performance in practice. However, it has limitations, including the
+requirement for developers to craft feedback prompts manually and the need for a basic understanding of static
+analysis tools and error messages.
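The feedback loop can be sketched with the LLM call stubbed out; `revise` below is a hypothetical stand-in for re-prompting ChatGPT with the feedback, and Python's built-in `compile` stands in for a static analyzer:

```python
# The feedback loop with the LLM call stubbed out: `revise` is a hypothetical
# stand-in for re-prompting ChatGPT with feedback, and Python's built-in
# `compile` stands in for a static analyzer.

def static_check(code):
    """Return an error message for the feedback prompt, or None if clean."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as err:
        return f"SyntaxError on line {err.lineno}: {err.msg}"

def feedback_loop(code, revise, max_rounds=3):
    for _ in range(max_rounds):
        feedback = static_check(code)
        if feedback is None:
            return code                      # code passes the check
        code = revise(code, feedback)        # would re-prompt the LLM here
    return code

# Toy reviser that repairs a missing colon, emulating a successful revision.
fixed = feedback_loop(
    "def f(x)\n    return x",
    lambda code, feedback: "def f(x):\n    return x",
)
```

Simple feedback corresponds to passing only "the code has quality issues"; feedback with static analysis corresponds to passing the concrete message returned by the checker.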
+8.3 SynCode: LLM Generation with Grammar Augmentation
+Ugare et al. [54] focused on Syntax Violation Hallucinations. Grammar-guided generation has recently been widely
+proposed [16, 43, 47, 58] to ensure that LLM-generated code adheres strictly to predefined grammatical rules [54].
+These methods modify the LLM’s decoding algorithm to ensure that the model consistently selects tokens conforming
+to a specific formal language. However, the tokens used by the model are predefined during training, and this often
+leads to token misalignment where the model’s tokens do not match the terminals used in the specified grammar.
+This misalignment is a significant factor contributing to the high error rates observed in grammar-guided generation.
+To address this issue, the SynCode algorithm was proposed, leveraging the EBNF (Extended Backus-Naur Form)
+representation of context-free grammar to guide the LLM during the decoding process. This ensures that the model
+produces grammatically correct outputs throughout the generation process. The advantage of this approach is its
+versatility, as it can be applied to any type of LLM decoding algorithm and supports all programming languages.
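The general idea of grammar-guided decoding can be illustrated with a toy grammar; this is not the SynCode algorithm itself, only a sketch in which candidate tokens that cannot extend a viable prefix of a balanced-parentheses language are masked before greedy selection:

```python
# A toy illustration of grammar-guided decoding, not the SynCode algorithm
# itself: candidate tokens that cannot extend a viable prefix of a
# balanced-parentheses language are masked before greedy selection.

def is_viable_prefix(text):
    """A prefix is viable if close-parens never outnumber open-parens."""
    depth = 0
    for ch in text:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def constrained_decode(scored_tokens, max_len=6):
    """scored_tokens: (token, score) pairs standing in for model logits."""
    out = ""
    while len(out) < max_len:
        allowed = [(t, s) for t, s in scored_tokens if is_viable_prefix(out + t)]
        if not allowed:
            break
        out += max(allowed, key=lambda pair: pair[1])[0]
    return out

# The raw "model" prefers ")", but the grammar mask forces a legal string.
result = constrained_decode([(")", 0.9), ("(", 0.1)])
```

The token-misalignment problem SynCode addresses arises precisely because real model tokens span several grammar terminals at once, unlike the single-character tokens in this sketch.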
+8.4 ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification
+Mu et al. [39] proposed a method to mitigate hallucinations caused by ambiguous prompts. Generating correct code
+requires a clear understanding of the user’s requirements, but the necessary information might not always be fully
+included in the LLM’s prompt. In real-world scenarios, developers often address ambiguous requirements by asking
+clarifying questions to gather additional information. Inspired by this approach, they introduced a novel framework
+where the LLM generates clarifying questions to help users refine their prompts.
+The core challenges of this approach lie in determining when to ask questions and what questions to ask. To address
+the first challenge, they implemented a code consistency check process. This involves generating test inputs based on the
+user’s prompt and asking the LLM to produce n code solutions aligned with the prompt. The generated code solutions
+are executed with the test inputs, and the resulting test outputs are compared. If the similarity among outputs is low, it
+is determined that a clarifying question is needed. This method is based on the intuition that a better understanding of
+the requirements should result in more consistent code solutions.
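The consistency check can be sketched with the generated solutions represented as plain Python functions; the candidates and the ambiguous "sort the list" task are invented for illustration:

```python
# A sketch of the code-consistency check with generated solutions represented
# as plain Python functions; the candidates and the ambiguous "sort the list"
# task are invented for illustration.

def needs_clarification(solutions, test_inputs):
    """True if the candidate solutions disagree on any sampled input."""
    for test_input in test_inputs:
        outputs = set()
        for solve in solutions:
            try:
                outputs.add(solve(test_input))
            except Exception:
                outputs.add("<error>")
        if len(outputs) > 1:          # low output similarity -> ask the user
            return True
    return False

# "Sort the list" is ambiguous: ascending or descending?
ambiguous_candidates = [
    lambda xs: tuple(sorted(xs)),
    lambda xs: tuple(sorted(xs, reverse=True)),
]
consistent_candidates = [lambda xs: tuple(sorted(xs)), lambda xs: tuple(sorted(xs))]
ambiguous = needs_clarification(ambiguous_candidates, [(3, 1, 2)])
unambiguous = needs_clarification(consistent_candidates, [(3, 1, 2)])
```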
+For the second challenge, they employed reasoning-based prompts to help the LLM identify elements of the prompt
+causing ambiguity and generate targeted clarifying questions. The reasoning-based prompt includes instructions for
+clarifying question generation, few-shot examples, and the user’s requirements alongside the generated code solutions.
+The ClarifyGPT framework has the advantage of achieving mitigation effects without requiring direct modifications
+to a model. It also aids developers who struggle to craft clear prompts. However, this approach has significant drawbacks,
+including high overhead due to the processes of input generation, code generation, and clarifying question generation.
+Additionally, the examples for the question generation prompt must be manually crafted.
+8.5 LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
+Zhang et al. [65] analyzed the types of LLM hallucinations in code generation and the potential factors that cause
+them. Based on the findings, they propose a mitigation method based on RAG. The study identified three primary root
+causes of hallucinations in LLM-generated code: (1) incorrect or insufficient understanding of task requirements, (2)
+lack of factual knowledge relevant to the generation tasks, and (3) inability to access the necessary code and non-code
+resources from the repository. To mitigate these issues, the authors proposed a RAG-based approach. They first created
+a retrieval corpus by scanning all source files from repositories in the CoderEval dataset and extracting consecutive lines
+of code. When a query is presented to the LLM, the system retrieves related code snippets from the corpus, appending
+the most relevant ones to the prompt.
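The corpus construction and retrieval can be sketched as follows; the window size, stride, and token-overlap scoring are assumptions for illustration, not the paper's actual configuration:

```python
import re

# A sketch of the corpus construction and retrieval; window size, stride,
# and the token-overlap scoring are assumptions for illustration, not the
# paper's actual configuration.

def chunk_lines(source, window=4, stride=2):
    """Split a file into overlapping windows of consecutive lines."""
    lines = source.splitlines()
    stop = max(len(lines) - window + 1, 1)
    return ["\n".join(lines[i:i + window]) for i in range(0, stop, stride)]

def retrieve(corpus, query, top_k=1):
    query_tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    def score(chunk):
        return len(set(re.findall(r"[a-z0-9]+", chunk.lower())) & query_tokens)
    return sorted(corpus, key=score, reverse=True)[:top_k]

repo_file = (
    "def open_session():\n    ...\n\n"
    "def close_session():\n    ...\n\n"
    "def unrelated():\n    pass"
)
corpus = chunk_lines(repo_file)
best = retrieve(corpus, "close the database session")
```

The retrieved chunks are then appended to the prompt, which is where the approach's dependence on the quality and quantity of available source code becomes visible.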
+This approach has several advantages. It requires no additional effort from users, ensures that only essential
+information necessary for code generation is provided to the model, and supports handling project-specific APIs.
+However, its effectiveness is significantly influenced by the quality and quantity of the source code available for retrieval.
+Moreover, the retrieval process introduces overhead, which can impact efficiency.
+Despite these challenges, the RAG-based mitigation method demonstrated a modest reduction in hallucinations
+across six LLMs. This study serves as a pilot exploration of RAG-based mitigation methods, shedding light on their
+possible applications in reducing hallucinations in LLMs.
+
+
+9 Discussion and Conclusion
+The findings in this paper suggest several promising directions for future research. First, the development of more
+diverse and representative benchmark datasets, encompassing various programming languages and use cases, is
+essential for evaluating LLMs in broader contexts. Second, advances in hallucination mitigation techniques, such as
+retrieval-augmented generation, clarifying question frameworks, and grammar-guided decoding, indicate the potential
+of combining multiple approaches to enhance reliability. Third, the integration of LLMs into real-world software
+development workflows calls for adaptive techniques that can dynamically address context-specific hallucinations,
+improving practical usability. By synthesizing these insights, this study serves as a roadmap for advancing research
+and development in LLM code generation, ultimately contributing to the creation of more robust and trustworthy systems.
+References
+[1] Vibhor Agarwal, Yulong Pei, Salwa Alamir, and Xiaomo Liu. 2024. CodeMirage: Hallucinations in Code Generated by Large Language Models. doi:10.48550/arXiv.2408.08333 arXiv:2408.08333 [2] Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu Lahiri, and Sriram Rajamani. 2024. Monitor-guided decoding of code LMs with static analysis of repository context. Advances in Neural Information Processing Systems 36 (2024).
+[3] Miltiadis Allamanis, Sheena Panthaplackel, and Pengcheng Yin. 2024. Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. arXiv preprint arXiv:2402.08699 (2024).
+[4] Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html. [5] Anthropic. 2025. Claude 3.7 Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet. [6] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021). [7] Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, et al. 2023. Purple llama cyberseceval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724 (2023). [8] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. 2023. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering 49, 7 (July 2023), 3675–3691. doi:10.1109/TSE.2023.3267446 [9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021). [10] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. 2024. What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. doi:10.48550/arXiv.2407.06153 arXiv:2407.06153 [11] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 
2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861 (2023). [12] Aryaz Eghbali and Michael Pradel. 2024. De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding. doi:10.48550/arXiv.2401.01701 arXiv:2401.01701 [13] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering 50, 9 (Sept. 2024), 2254–2268. doi:10.1109/TSE.2024. 3428972 [14] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 31–53. doi:10.1109/ICSE-FoSE59343.2023.00008 [15] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of Programs from Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE ’23). IEEE Press, Melbourne, Victoria, Australia, 1469–1481. doi:10.1109/ICSE48619.2023.00128 [16] Georgi Gerganov et al. 2024. llama.cpp: Port of Facebook’s LLaMA model in C/C++. https://github.com/guidance-ai/guidance. [17] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021). [18] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 33, 8 (Dec. 2024), 220:1–220:79. doi:10.1145/3695988
+
+
+14 Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, and Jaechang Nam
+[19] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155 [20] Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. 2024. On Mitigating Code LLM Hallucinations with API Documentation. arXiv preprint arXiv:2407.09726 (2024).
+[21] Kevin Jesse, Toufique Ahmed, Premkumar T Devanbu, and Emily Morgan. 2023. Large language models and simple, stupid bugs. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 563–575.
+[22] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 (2024).
+[23] Nan Jiang, Qi Li, Lin Tan, and Tianyi Zhang. 2024. Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code. arXiv preprint arXiv:2410.09997 (2024).
+[24] Raphaël Khoury, Anderson R Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How secure is code generated by chatgpt?. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2445–2451.
+[25] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319–18345.
+[26] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328.
+[27] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
+[28] Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. 2024. MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems. In Findings of the Association for Computational Linguistics: EMNLP 2024. 736–783.
+[29] Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852 (2023).
+[30] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (Dec. 2022), 1092–1097. doi:10.1126/science.abq1158
+[31] Yifan Li, Ensheng Shi, Dewu Zheng, Kefeng Duan, Jiachi Chen, and Yanlin Wang. 2024. RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening. In Proceedings of the 15th Asia-Pacific Symposium on Internetware. 229–238.
+[32] Dianshu Liao, Shidong Pan, Xiaoyu Sun, Xiaoxue Ren, Qing Huang, Zhenchang Xing, Huan Jin, and Qinying Li. 2024. A3-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware. IEEE Transactions on Software Engineering (2024).
+[33] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. doi:10.48550/arXiv.2109.07958 arXiv:2109.07958 [cs].
+[34] Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024. Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. doi:10.48550/arXiv.2404.00971 arXiv:2404.00971
+[35] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, 21558–21572.
+[36] Mingwei Liu, Tianyong Yang, Yiling Lou, Xueying Du, Ying Wang, and Xin Peng. 2023. Codegen4libs: A two-stage approach for library-oriented code generation. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 434–445.
+[37] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2024. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. ACM Trans. Softw. Eng. Methodol. 33, 5 (June 2024), 116:1–116:26. doi:10.1145/3643674
+[38] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
+[39] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proc. ACM Softw. Eng. 1, FSE (July 2024), 103:2332–103:2354. doi:10.1145/3660810
+[40] Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. LEVER: Learning to Verify Language-to-Code Generation with Execution. In Proceedings of the 40th International Conference on Machine Learning (ICML’23, Vol. 202). JMLR.org, Honolulu, Hawaii, USA, 26106–26128.
+[41] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. doi:10.48550/arXiv.2203.13474 arXiv:2203.13474 [cs].
+[42] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768.
+
+
+Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges 15
+[43] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227 (2022).
+[44] Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, and Reynold Cheng. 2024. Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. arXiv preprint arXiv:2405.15307 (2024).
+[45] Kia Rahmani, Mohammad Raza, Sumit Gulwani, Vu Le, Daniel Morris, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari. 2021. Multi-modal program inference: A marriage of pre-trained language models and component-based synthesis. Proceedings of the ACM on Programming Languages 5, OOPSLA (2021), 1–29.
+[46] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. doi:10.48550/arXiv.2308.12950 arXiv:2308.12950 [cs].
+[47] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093 (2021).
+[48] Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning. PMLR, 31693–31715.
+[49] Mohammed Latif Siddiq, Joanna Cecilia da Silva Santos, Sajith Devareddy, and Anna Muller. 2024. Sallm: Security assessment of generated code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops. 54–65.
+[50] Mohammed Latif Siddiq and Joanna CS Santos. 2022. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. 29–33.
+[51] Joseph Spracklen, Raveen Wijewickrama, AHM Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. 2024. We have a package for you! a comprehensive analysis of package hallucinations by code generating llms. arXiv preprint arXiv:2406.10279 (2024).
+[52] Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Giuliano Antoniol. 2024. Bugs in Large Language Models Generated Code: An Empirical Study. doi:10.48550/arXiv.2403.08937 arXiv:2403.08937
+[53] Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, and Dawn Song. 2024. CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification. doi:10.48550/arXiv.2405.00253 arXiv:2405.00253
+[54] Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. SynCode: LLM Generation with Grammar Augmentation. doi:10.48550/arXiv.2403.01632 arXiv:2403.01632
+[55] Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024. Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024).
+[56] Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. RLCoder: Reinforcement Learning for Repository-Level Code Completion. arXiv:2407.19487 [cs.SE] https://arxiv.org/abs/2407.19487
+[57] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2024. Where Do Large Language Models Fail When Generating Code? arXiv preprint arXiv:2406.08731 (2024).
+[58] Brandon T Willard and Rémi Louf. 2023. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702 (2023).
+[59] Claes Wohlin. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE ’14). Association for Computing Machinery, New York, NY, USA, 1–10. doi:10.1145/2601248.2601268
+[60] Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, et al. 2023. Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation. arXiv preprint arXiv:2311.08588 (2023).
+[61] Chen Yang, Yan Liu, and Changqing Yin. 2021. Recent Advances in Intelligent Source Code Generation: A Survey on Natural Language Based Studies. Entropy 23, 9 (Sept. 2021), 1174. doi:10.3390/e23091174
+[62] Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, and Yongji Wang. 2023. Private-library-oriented code generation with large language models. arXiv preprint arXiv:2307.15370 (2023).
+[63] Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. 2023. Toolcoder: Teach code generation models to use api search tools. arXiv preprint arXiv:2305.04032 (2023).
+[64] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2024. Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. doi:10.48550/arXiv.2311.07989 arXiv:2311.07989 [cs].
+[65] Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, and Zibin Zheng. 2024. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. arXiv preprint arXiv:2409.20550 (2024).
+[66] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684.
+[67] Li Zhong and Zilong Wang. 2024. Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19 (March 2024), 21841–21849. doi:10.1609/aaai.v38i19.30185
\ No newline at end of file
diff --git a/storage/5PGKTV4I/.zotero-reader-state b/storage/5PGKTV4I/.zotero-reader-state
new file mode 100644
index 0000000..2295aea
--- /dev/null
+++ b/storage/5PGKTV4I/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":4,"scale":200,"top":354,"left":-78,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/5PGKTV4I/Lee et al. - 2025 - Hallucination by Code Generation LLMs Taxonomy, Benchmarks, Mitigation, and Challenges.pdf b/storage/5PGKTV4I/Lee et al. - 2025 - Hallucination by Code Generation LLMs Taxonomy, Benchmarks, Mitigation, and Challenges.pdf
new file mode 100644
index 0000000..68fb3a9
Binary files /dev/null and b/storage/5PGKTV4I/Lee et al. - 2025 - Hallucination by Code Generation LLMs Taxonomy, Benchmarks, Mitigation, and Challenges.pdf differ
diff --git a/storage/67J3SUDE/.zotero-ft-cache b/storage/67J3SUDE/.zotero-ft-cache
new file mode 100644
index 0000000..546953f
--- /dev/null
+++ b/storage/67J3SUDE/.zotero-ft-cache
@@ -0,0 +1,190 @@
+CodeMirage: Hallucinations in Code Generated by Large Language Models
+Vibhor Agarwal1∗, Yulong Pei2, Salwa Alamir2 and Xiaomo Liu3
+1University of Surrey, Surrey, UK 2JP Morgan AI Research, London, UK 3JP Morgan AI Research, New York, USA
+v.agarwal@surrey.ac.uk, {yulong.pei, salwa.alamir, xiaomo.liu}@jpmorgan.com
+Abstract
+Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. However, LLMs are prone to generating hallucinations, i.e., they generate text that sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations in text generation, a similar hallucination phenomenon can occur in code generation. Sometimes the generated code has syntactic or logical errors, as well as more advanced issues such as security vulnerabilities and memory leaks. Given the wide adoption of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset1 for code hallucinations. The benchmark contains 1,137 GPT-3.5-generated hallucinated code snippets for Python programming problems from two base datasets – HumanEval and MBPP. We then propose a methodology for code hallucination detection and experiment with open-source LLMs such as CodeLLaMA as well as OpenAI’s GPT-3.5 and GPT-4 models using a one-shot prompt. We find that GPT-4 performs best on the HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on the MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.
+1 Introduction
+Large Language Models (LLMs) have shown great capabilities pushing forward the field of artificial intelligence in general and natural language generation in particular. OpenAI’s
+∗Work done during internship at JP Morgan AI Research. 1The dataset will be released upon acceptance.
+import yaml
+
+def yaml_load(filename):
+    """Open a file, load its contents into a variable with yaml, and return it."""
+    with open(filename, 'f') as f:
+        data = yaml.load(f, Loader=yaml.Baseloader)
+    return data
+
+Figure 1: An example of robustness issue in the LLM generated code. It will raise “FileNotFoundException” in case the file does not exist.
+GPT-3.5 [Ouyang et al., 2022] and GPT-4 [OpenAI, 2023], as well as open-source LLMs such as LLaMA-3.1 [MetaAI, 2024] and LLaMA-2 [Touvron et al., 2023], have demonstrated strong performance on various natural language processing (NLP) tasks [Yang et al., 2024; Zhu et al., 2023; Agarwal et al., 2023]. LLMs have also been trained on large corpora of program code and have shown strong performance in code generation. GPT-3.5 and GPT-4, as well as open-source models such as CodeLLaMA [Roziere et al., 2023], can generate code in various programming languages. Although LLMs show remarkable capabilities, they frequently hallucinate, i.e., they generate text that sounds plausible but is incorrect. This makes the text hallucination detection task very challenging, as the generated text is often similar in style but factually incorrect or in conflict with the input or the context [Zhang et al., 2023b]. According to [Ji et al., 2023], hallucination is generated content that is nonsensical or unfaithful to the provided source content. A similar hallucination phenomenon can occur in the code generated by LLMs as well. Sometimes the generated code can have syntactic and logical errors as well as more advanced issues like security vulnerabilities, memory leaks, robustness issues, etc. Figure 1 shows an example of a robustness issue in an LLM-generated code snippet. The LLM generates a function to load a YAML file but does not check whether the file exists before reading it. The generated code snippet therefore has a robustness issue: it will raise “FileNotFoundException” when the file does not exist, yet does not handle the exception adequately. Considering that the adoption of LLMs in industrial environments to help code generation and development is becoming more widespread [Yang et al., 2023], it is very important to
+arXiv:2408.08333v2 [cs.SE] 8 Jul 2025
+
+
+detect and mitigate code hallucinations. For the first time, we study hallucinations in the code generated by LLMs. We first introduce the code hallucination definition and a comprehensive taxonomy of code defects that LLMs can hallucinate. We then propose the first benchmark dataset – CodeMirage, containing GPT-3.5-generated hallucinated Python code snippets – and experiment with and measure LLM capabilities for automatically detecting code hallucinations using one-shot prompts. We believe that this work, including the comprehensive taxonomy, new dataset and insights, can open new avenues for research and development in both academia and industry. The contributions of our paper are summarized as follows:
+• To the best of our knowledge, we introduce the problem of code hallucination for the first time and provide its definition and a comprehensive taxonomy of code defects that LLMs can hallucinate.
+• We introduce the first benchmark dataset – CodeMirage – containing 1,137 GPT-3.5-generated Python code snippets with 5 hallucination types.
+• For code hallucination detection, we introduce several baselines and experiment with open source LLMs as well as OpenAI’s GPT models. Although code hallucination detection is a challenging task, LLMs demonstrate reasonable performance in detecting various hallucination types.
+• We conduct comprehensive experiments and ablation studies to demonstrate the capabilities of LLMs for code hallucination detection and discuss various code hallucination mitigation strategies in the future work.
+2 Related Work
+2.1 Language Models for Code Generation
+The triumph of language models in natural language modeling has brought interest among researchers and practitioners on using language models for code generation. Code generation refers to generating programs that need to satisfy all the constraints defined by the underlying task such as test cases, problem descriptions, etc. Pre-trained transformerbased models such as CodeBERT [Feng et al., 2020] are specifically trained for code generation using Masked Language Modeling and Replaced Token Detection training objectives. Decoder pre-trained models are designed to predict the next token based on a given input context. OpenAI’s GPT-series [Radford et al., 2018] are decoder-based models for text generation. Based on GPT-2, [Lu et al., 2021] released CodeGPT for code completion and text-tocode generation tasks. After GPT-3 was developed, CodeX2 and GitHub Copilot3 were released for code generation. After the success of ChatGPT, OpenAI’s GPT-3.5 [Ouyang et al., 2022] and GPT-4 [OpenAI, 2023] models became stateof-the-art for natural language generation. They have shown good performance in code generation as well [Poldrack et
+2https://openai.com/blog/openai-codex, last accessed 23 May 2024. 3https://github.com/features/copilot, last accessed 23 May 2024.
+al., 2023], but these models are proprietary. Similar open source models such as LLaMA-2 [Touvron et al., 2023] and CodeLLaMA [Roziere et al., 2023] were released for natural language generation tasks. CodeLLaMA is an open sourced LLM based on LLaMA-2 itself but fine-tuned for code-related tasks such as code generation. Another relevant direction is detecting software vulnerability using LLMs where different LLMs [Jensen et al., 2024] and other information such as code structures [Lu et al., 2024] have been explored to detect vulnerabilities in the code.
+2.2 Hallucinations in LLMs
+Although LLMs have shown remarkable performance in natural language generation, they still inevitably encounter several issues, hallucination being one of the top ones [Wu et al., 2024; Ghafouri et al., 2023; Huang et al., 2023]. [Ji et al., 2023] defined hallucination as generated content that is nonsensical or unfaithful to the provided source content. Previous works [Ji et al., 2023; Maynez et al., 2020; Huang et al., 2021] categorized hallucination into two main categories – intrinsic and extrinsic. Intrinsic hallucination happens when the generated output contradicts the source content, whereas extrinsic hallucination happens when the generated output cannot be verified from the source content (i.e., output that can neither be supported nor contradicted by the source). Within the context of LLMs, [Zhang et al., 2023b] divided hallucination into 3 categories – input-conflicting (LLM-generated content deviates from the source input provided by users), context-conflicting (LLM-generated content conflicts with information it previously generated), and fact-conflicting (LLM-generated content is not faithful to established world knowledge) hallucinations. Similar to text hallucinations, LLMs can hallucinate during code generation as well. Sometimes the LLM-generated code can have syntactic or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. There is no prior work that specifically looks at hallucinations in the generated code. [Dinh et al., 2023] studied the buggy-code completion problem and find that the presence of potential bugs in the code context significantly degrades the code generation performance of the LLMs. [Liu et al., 2023] evaluated the functional correctness of LLM-generated code with large amounts of test cases newly produced by an automatic test input generator. [Ouyang et al., 2023] conducted an empirical study to measure non-determinism in code generated by LLMs.
They find that results from LLMs can be highly unstable, non-deterministically returning very different code for the same prompt. [Bhatt et al., 2023] introduced CyberSecEval, a benchmark developed to help bolster the cybersecurity of LLMs employed as coding assistants. They find a high tendency of more advanced models to suggest insecure code, highlighting the critical need for integrating security considerations into the development of sophisticated LLMs. For the first time in the literature, we study hallucinations in LLM-generated code. We introduce the code hallucination definition and a comprehensive taxonomy, and then propose the first benchmark dataset – CodeMirage, containing GPT-3.5-generated hallucinated Python code snippets, followed by
+
+
+the methodology for detecting code hallucinations.
+3 Hallucinations in Code Generation
+In this section, we formally introduce our problem statement of code hallucination in Section 3.1 and then present a comprehensive taxonomy of five hallucination types that can occur in code generated by LLMs in Section 3.2.
+3.1 Problem Definition
+Large Language Models have shown good performance in code generation. However, the generated code may sometimes sound plausible yet contain several code defects, such as security vulnerabilities. We define hallucinated code as generated code that has one or more code defects, such as dead or unreachable code, syntactic or logical errors, robustness issues (e.g., the code fails on edge cases or raises an unhandled exception), security vulnerabilities, or memory leaks.
+3.2 Taxonomy
+Based on different types of code defects that can occur, we define the following five hallucination categories for the code generated by LLMs.
+• Dead or Unreachable code: Generated code contains dead, unreachable, or redundant code.
+• Syntactic incorrectness: Generated code has syntactic errors and therefore, fails to compile.
+• Logical error: Generated code has logical errors, i.e., the generated code cannot solve the given problem correctly.
+• Robustness issue: Generated code has robustness issues such as it fails on certain edge cases or raises an exception (does not perform required exception handling).
+• Security vulnerabilities: Generated code has security vulnerabilities or memory leaks.
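To make the categories above concrete, here is a small illustrative snippet of our own (not drawn from the CodeMirage dataset) that exhibits two of the defect types in one function:

```python
def max_of_two(a, b):
    """Intended to return the larger of a and b -- but hallucinated."""
    if a > b:
        return a
        print("chose a")   # dead/unreachable code: placed after the return
    return a               # logical error: should be `return b`
```

The function compiles and looks plausible, yet `max_of_two(1, 2)` returns 1 instead of 2, matching the "logical error" category, while the unreachable `print` matches "dead or unreachable code".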
+4 CodeMirage Dataset
+In this section, we discuss the details of our CodeMirage dataset. We begin with describing the dataset generation methodology (Section 4.1) followed by verifying the generated dataset via human annotations (Section 4.2). Then we share dataset statistics and various characteristics of the dataset in Section 4.3.
+4.1 Dataset Generation
+For generating the code hallucination dataset – CodeMirage – we select two popular base datasets – HumanEval [Chen et al., 2021] and MBPP [Austin et al., 2021]. The HumanEval dataset [Chen et al., 2021] contains 164 Python programming problems with function signatures, problem descriptions as docstrings, programming solutions, and test cases for evaluation. Similarly, the MBPP [Austin et al., 2021] benchmark consists of 973 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases.
+For generating hallucinated code snippets, we use GPT-3.5. We design explicit prompts for each of the hallucination types and input them into the GPT-3.5 model to get Python code generations that have specific hallucination types. Table 1 shows the layout of the code hallucination generation prompt. Each prompt has a code hallucination definition describing a specific type of hallucination and an example showcasing a sample problem statement along with the hallucinated code and test cases. Specific prompts for different hallucination types are mentioned in Appendix A. For every problem in both datasets, we randomly select one of the five hallucination types and then input the type-specific prompt along with the problem description and test cases. As a result, we get hallucinated code as an output from the GPT-3.5 model, and we assign the selected hallucination type as the gold label, further validated through human annotations as described in Section 4.2.
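A minimal sketch of the prompt assembly just described, following the Table 1 layout; the definition strings and the function name are paraphrases and assumptions of ours, since the paper's exact prompts live in its Appendix A:

```python
# Abbreviated paraphrases of the taxonomy; NOT the paper's exact wording.
DEFINITIONS = {
    "dead code": "introducing dead, unreachable or redundant code",
    "logical error": "introducing a logical error so the code cannot solve the problem",
}

def generation_prompt(hallucination_type, problem, test_cases, example):
    """Assemble a type-specific generation prompt in the Table 1 layout."""
    return (
        "I want you to act as a code hallucination generator. Given the "
        "function signature, task description, and test cases, your objective "
        "is to write the hallucinated python code that sounds plausible but "
        f"is incorrect by {DEFINITIONS[hallucination_type]}. Below is an example:\n"
        f"{example}\n"
        "You should try your best to generate a hallucinated code to the "
        "following question:\n"
        f"{problem}\n{test_cases}\n"
    )
```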
+4.2 Human Annotations
+To validate the dataset and the type-specific hallucinated code generations, we conduct human evaluation. We randomly selected 200 programming problems and solutions from both datasets (50 from HumanEval and 150 from MBPP), in proportion to the number of problems in each dataset. We selected a balanced sample with all the hallucination types in equal numbers. We gave detailed instructions for the annotation task, definitions for each of the code hallucination types, and an example of each. We then asked the annotators to label each Python code snippet with one of the five code hallucination types – “dead code”, “syntactic incorrectness”, “logical error”, “robustness issue”, “security vulnerabilities” – or as “no hallucination”. Annotations were performed by 5 human annotators, well-versed in Python programming, with every annotator annotating about 100 Python code snippets. Initially, each code snippet was annotated by 2 annotators. In case of label conflicts, we introduced a third annotation. Therefore, every code snippet in the sampled dataset has at least 2 annotations and, in case of conflicts, 3 annotations, so that we have a majority label. Overall, annotators achieve an average Cohen’s kappa score of 0.76, which denotes strong agreement. We also measure accuracy between the annotated majority labels and the automatic gold labels we create during the dataset generation process. The high accuracy of 0.81 indicates that GPT-3.5 has generated code snippets as per the specified hallucination type and that the automatically assigned gold labels for hallucination types are reliable.
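The pairwise agreement score reported above can be computed with a short stdlib function; this is a generic two-rater Cohen's kappa sketch of ours, not the paper's evaluation code:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement rate; p_e is the agreement expected by
    chance from each rater's label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

For example, raters `["x", "x", "y", "y"]` and `["x", "x", "y", "x"]` agree on 3 of 4 items (p_o = 0.75) with chance agreement p_e = 0.5, giving kappa = 0.5.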
+4.3 Dataset Statistics
+Table 2 shows the number of hallucinated Python code snippets in the CodeMirage dataset for each of the 5 code hallucination types, with individual splits for the base datasets – HumanEval and MBPP. In total, the CodeMirage dataset has 1,137 programming problems, LLM-generated hallucinated Python code snippets, ground-truth code snippets, and test cases to evaluate code snippets. To measure the complexity of generated code snippets, we compute McCabe’s cyclomatic complexity [McCabe, 1976]. Cyclomatic complexity is a quantitative measure to indicate
+
+
+I want you to act as a code hallucination generator. Given the function signature, task description, and test cases, your objective is to write the hallucinated python code that sounds plausible but is incorrect by . Below is an example:
+You should try your best to generate a hallucinated code to the following question:
+Table 1: Layout of Code Hallucination Generation Prompt.
+Hallucination Type         HumanEval   MBPP   Total
+Dead/unreachable code             36    190     226
+Syntactic incorrectness           38    214     252
+Logical error                     31    189     220
+Robustness issue                  26    170     196
+Security vulnerabilities          33    210     243
+Total                            164    973    1137
+
+Table 2: CodeMirage dataset statistics.
+Hallucination Type          HumanEval   MBPP
+Ground-truth correct code        2.98   2.27
+Dead/unreachable code            4.42   4.17
+Syntactic incorrectness           N/A    N/A
+Logical error                    3.71   3.11
+Robustness issue                 3.19   2.77
+Security vulnerabilities         4.27   3.17
+
+Table 3: McCabe Cyclomatic Complexity of CodeMirage dataset.
+the complexity of a program by measuring the number of linearly independent paths through a program’s source code. Table 3 shows average cyclomatic complexity scores for the CodeMirage dataset for each of the base datasets. We compute and compare complexity scores of ground-truth code snippets with the hallucinated code snippets. Cyclomatic complexity scores for the “syntactic incorrectness” hallucination type cannot be computed due to syntax errors (denoted by N/A in Table 3). For all other hallucination types, average cyclomatic complexity scores are higher than those of the ground-truth code snippets. As expected, code snippets with dead code as hallucinations have the highest cyclomatic complexity. On the other hand, the “robustness issue” hallucination type has the lowest cyclomatic complexity for both the HumanEval and MBPP datasets, since such a snippet is expected to look robust and hence have lower complexity even though it contains minor robustness issues. This consistent behavior demonstrates the effectiveness of our approach for generating the code hallucination dataset.
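As a rough illustration of the metric, the decision-point counting behind McCabe's measure can be approximated over Python's AST. This is our own simplification (tools such as radon implement the full graph-based definition) and not the tooling used in the paper:

```python
import ast

def approx_cyclomatic_complexity(source):
    """Roughly approximate McCabe complexity as 1 + number of decision points.

    Counts branching constructs and boolean operators; a simplification that
    still shows why extra (even dead) branches inflate the score.
    """
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp, ast.Assert)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two decision points
            complexity += len(node.values) - 1
    return complexity
```

A straight-line function scores 1; adding an `if` with a two-clause boolean condition raises the estimate to 3.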
+5 Code Hallucination Detection
+Detecting code hallucinations is a challenging task because a code snippet may seem plausible yet be incorrect, with issues such as security vulnerabilities and memory leaks that are often hard to detect. After describing the CodeMirage dataset in the previous section, we discuss the methodology and results for code hallucination detection in this section.
+5.1 Methodology
+For code hallucination detection, we prompt various large language models to detect whether a code snippet contains a hallucination and, if so, which type. We develop a one-shot prompt asking LLMs to choose among the five hallucination types as well as a “no hallucination” category, given the problem description and code snippet, as described in Appendix B. We also provide a definition and an example for each hallucination type in the prompt. Table 4 shows the layout of the code hallucination detection prompt. We experiment with 3 LLMs for detecting code hallucinations – the open-source CodeLLaMA model as well as OpenAI’s GPT-3.5 and GPT-4 models. We describe the LLMs and baselines used for detecting code hallucinations below:
+• CodeLLaMA: CodeLLaMA [Roziere et al., 2023] is an open-source LLM for code based on LLaMA-2 [Touvron et al., 2023], providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks. We use the CodeLLaMA-7B-Instruct model, which has 7 billion parameters and is fine-tuned to follow instructions.
+• GPT-3.5: We use OpenAI’s GPT-3.5 [Ouyang et al., 2022] model, accessed through OpenAI’s official API.
+• GPT-4: We also experiment with GPT-4 [OpenAI, 2023], OpenAI’s state-of-the-art model, accessed through its official API.
+• CodeBERT: CodeBERT [Feng et al., 2020] is a pre-trained transformer-based model for programming languages, pre-trained on 6 programming languages. As a baseline, we fine-tune CodeBERT on our CodeMirage dataset with an 80:20 split into training and test sets. For the training set, we follow stratified sampling for each of the two base datasets. We keep the test sets separate to evaluate the performance of the fine-tuned model on each base dataset individually.
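The 80:20 stratified split can be sketched in a few lines of pure Python (the helper name and per-label rounding are our own; the paper does not specify its implementation):

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_frac=0.2, seed=0):
    """Split (item, label) pairs so each label keeps roughly the
    same train/test ratio, as in the CodeBERT baseline setup."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    train, test = [], []
    for label, group in sorted(by_label.items()):
        rng.shuffle(group)
        k = int(round(len(group) * test_frac))
        test += [(x, label) for x in group[:k]]
        train += [(x, label) for x in group[k:]]
    return train, test

# toy data: 10 snippets per hallucination type -> 2 per type in the test set
items = [f"code_{i}" for i in range(20)]
labels = ["dead code"] * 10 + ["logical error"] * 10
train, test = stratified_split(items, labels)
```

In practice, `sklearn.model_selection.train_test_split(..., stratify=labels)` achieves the same effect.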
+5.2 Experimental Setup and Evaluation Metrics
+For detecting code hallucinations, we input the one-shot prompt into the LLMs along with the problem description and code snippet. For CodeLLaMA-Instruct, we use its open-source implementation after downloading the 7-billion-parameter model weights. For OpenAI’s GPT-3.5 and GPT-4 models, we use their official API4. We set the temperature to 0.7 and the maximum number of generated tokens to 256.
+4https://openai.com/product, last accessed 28 Dec 2023.
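A sketch of the surrounding harness: assembling the detection prompt and mapping a model's free-form reply onto the six classes. The helper names and the fallback rule are our assumptions, not the paper's code; the actual API call (with temperature 0.7 and a 256-token limit) is omitted:

```python
LABELS = ["no hallucination", "dead code", "syntactic incorrectness",
          "logical error", "robustness issue", "security vulnerabilities"]

def build_detection_prompt(layout: str, task: str, code: str) -> str:
    """Append the problem description and code snippet to the
    one-shot prompt layout of Table 4."""
    return f"{layout}\n\nTask description:\n{task}\n\nGenerated code:\n{code}\n"

def parse_label(reply: str) -> str:
    """Return the first of the five hallucination labels mentioned in
    the reply; otherwise fall back to 'no hallucination'."""
    reply = reply.lower()
    for label in LABELS[1:]:
        if label in reply:
            return label
    return "no hallucination"
```

Some post-processing of this kind is generally needed because instruction-tuned models do not always emit the class name verbatim.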
+
+
+I want you to act as a code judge. Given the task description, function signature, and the generated code, your objective is to detect if the generated code has defects, incorrect code or hallucinations. Hallucinated code is the generated code that has one or more code defects such as dead or unreachable code, syntactic or logical errors, robustness issues such that the code fails on certain test cases or raises an exception, or has security vulnerabilities or memory leaks. Below are the 5 categories for code hallucination along with an example: 1. Dead code: Generated code has dead, unreachable or redundant piece of code. Example:
+2. Syntactic incorrectness: Generated code has syntactic errors and therefore, fails to compile. Example:
+3. Logical error: Generated code has logical errors. Example:
+4. Robustness issue: Generated code has robustness issues as it fails on certain test cases or raises an exception due to poor exception handling. Example:
+5. Security vulnerabilities: Generated code has security vulnerabilities or memory leaks. Example:
+You should try your best to determine if the code contains any defects or hallucinations based on the above hallucination types. Please output only 1 out of these 6 classes: [“no hallucination”, “dead code”, “syntactic incorrectness”, “logical error”, “robustness issue”, “security vulnerabilities”].
+Table 4: Layout of Code Hallucination Detection Prompt.
+Evaluation Metrics. Since we model the code hallucination detection task as multi-class classification, predicting one of the 5 code hallucination types or the “no hallucination” category, we use accuracy, macro-precision, macro-recall, and macro-F1 scores for performance evaluation.
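Macro averaging weights all six classes equally regardless of how many examples each has, which matters here because the "no hallucination" class could otherwise dominate. A small reference implementation (our own helper, equivalent to `sklearn.metrics.f1_score` with `average="macro"`):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

For example, with true labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, per-class F1 is 2/3 for "a" and 4/5 for "b", so the macro-F1 is 11/15.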
+5.3 Results
+In this section, we discuss the results of various language models for detecting code hallucinations. Table 5 shows performance scores of various LLMs for code hallucination detection on the CodeMirage dataset. CodeBERT, fine-tuned on our CodeMirage dataset, achieves an accuracy of 0.5938 and a macro-F1 score of 0.4897 on the HumanEval dataset, whereas it achieves an accuracy of 0.6825 and a macro-F1 score of 0.6344 on the MBPP dataset. CodeLLaMA, an open-source LLM, does not perform well when prompted for code hallucination detection, achieving macro-F1 scores of 0.0424 and 0.0271 on the HumanEval and MBPP datasets, respectively. Surprisingly, there is a big performance gap between the GPT-3.5 and GPT-4 models with the same prompt. GPT-3.5 achieves macro-F1 scores of 0.2654 and 0.2092 on HumanEval and MBPP, respectively, whereas GPT-4 achieves the best performance with an overall macro-F1 score of 0.5512 for HumanEval and the second-best score of 0.5195 for MBPP. From Table 5, we can infer that GPT-4 with just a one-shot prompt performs the best on the HumanEval dataset and beats the fine-tuned CodeBERT model by 6.15 percentage points of macro-F1. On the MBPP dataset, GPT-4 gives the second-best performance in terms of macro-F1 and cannot beat the fine-tuned CodeBERT model, but it shows comparable results and still gives the best macro-precision score. Overall, we can conclude that LLMs, especially GPT-4, perform comparably to, if not better than, the fine-tuned CodeBERT model with a mere one-shot prompt for code hallucination detection.
+6 Conclusions and Future Work
+LLMs have shown good performance in code generation. In this work, we study hallucinations in the code generated by LLMs for the first time. First, we introduce a definition of code hallucination and a comprehensive taxonomy of 5 hallucination types. We then propose CodeMirage, the first code hallucination dataset, containing 1,137 GPT-3.5-generated hallucinated Python code snippets. We believe this comprehensive code hallucination taxonomy and the new dataset can open new avenues for research and development in both academia and industry to evaluate code snippets generated by LLMs and mitigate code defects. We also experiment with open-source LLMs as well as OpenAI’s GPT models for detecting code hallucinations using one-shot prompts. We find that the GPT-4 model performs the best on the HumanEval dataset, while it performs second best on the MBPP dataset. In general, we conclude that LLMs, especially GPT-4, perform comparably to, if not better than, the fine-tuned CodeBERT model for code hallucination detection. The overall performance suggests that the code hallucination detection task and the CodeMirage dataset are challenging, as even state-of-the-art fine-tuned transformer-based models and LLMs cannot achieve a high macro-F1 score. As a result, there is ample scope for future work. For code hallucination detection, fine-tuning LLMs [Hu et al., 2023] with specific hallucination detection instructions can improve performance. We can also use software engineering methods such as compilers, abstract syntax trees (ASTs) [Shippey, 2015], control flow graphs (CFGs) [Phan et al., 2017; Zhang et al., 2023a], etc. to detect code defects. Another important direction of research is mitigating code hallucinations. We can use several mitigation strategies also used for text hallucinations [Ji et al., 2023;
+
+
+                       HumanEval                          MBPP
+LLM         Acc.    ma-P    ma-R    ma-F1      Acc.    ma-P    ma-R    ma-F1
+CodeBERT    0.5938  0.5547  0.5162  0.4897     0.6825  0.6455  0.6598  0.6344
+CodeLLaMA   0.0250  0.2800  0.0232  0.0424     0.0146  0.4061  0.0145  0.0271
+GPT-3.5     0.2134  0.5552  0.1956  0.2654     0.1614  0.5549  0.1530  0.2092
+GPT-4       0.5915  0.6710  0.5514  0.5512     0.5735  0.6644  0.5626  0.5195
+Table 5: Results for code hallucination detection on CodeMirage dataset. Best scores are in bold and second best scores are underlined.
+Zhang et al., 2023b] such as knowledge-enhanced prompt tuning [Ma et al., 2023], retrieval-augmented code generation [Parvez et al., 2021], fine-tuning LLMs [Hu et al., 2023], etc. Similarly, software engineering techniques that analyze the execution workflow of a generated code snippet can be used to detect and mitigate code defects. There are therefore many opportunities for future work on effectively detecting and mitigating code hallucinations. Solving this problem would be a major step toward companies safely adopting LLMs for code generation in practice.
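As one concrete instance of the AST-based checks suggested above, unreachable statements after a `return` or `raise` can be flagged directly. This is a toy checker of our own, not code from the paper:

```python
import ast

TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)

def has_unreachable_code(source: str) -> bool:
    """True if any statement block continues past an unconditional
    return/raise/break/continue (a simple dead-code symptom)."""
    for node in ast.walk(ast.parse(source)):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, [])
            if isinstance(block, list):
                # any terminator before the last statement makes
                # the following statements unreachable
                if any(isinstance(stmt, TERMINATORS) for stmt in block[:-1]):
                    return True
    return False

hallucinated = "def f(x):\n    return x\n    x += 1\n"
```

A CFG-based analysis would catch more cases (e.g. code after an infinite loop), but even this sketch flags the "dead/unreachable code" pattern of our taxonomy.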
+Disclaimer
+This paper was prepared for informational purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co. and its affiliates (“JP Morgan”) and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.
+References
+[Agarwal et al., 2023] Vibhor Agarwal, Yu Chen, and Nishanth Sastry. Haterephrase: Zero- and few-shot reduction of hate intensity in online posts using large language models. arXiv preprint arXiv:2310.13985, 2023.
+[Austin et al., 2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
+[Bhatt et al., 2023] Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, et al. Purple llama cyberseceval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724, 2023.
+[Chen et al., 2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph,
+Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
+[Dinh et al., 2023] Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, and George Karypis. Large language models of code fail at completing code with potential bugs. arXiv preprint arXiv:2306.03438, 2023.
+[Feng et al., 2020] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pretrained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, 2020.
+[Ghafouri et al., 2023] Vahid Ghafouri, Vibhor Agarwal, Yong Zhang, Nishanth Sastry, Jose Such, and Guillermo Suarez-Tangil. Ai in the gray: Exploring moderation policies in dialogic large language models vs. human answers in controversial topics. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 556–565, 2023.
+[Hu et al., 2023] Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
+[Huang et al., 2021] Yichong Huang, Xiachong Feng, Xiaocheng Feng, and Bing Qin. The factual inconsistency problem in abstractive text summarization: A survey. arXiv preprint arXiv:2104.14839, 2021.
+[Huang et al., 2023] Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, and Heming Cui. Bias assessment and mitigation in llm-based code generation. arXiv preprint arXiv:2309.14345, 2023.
+[Jensen et al., 2024] Rasmus Ingemann Tuffveson Jensen, Vali Tawosi, and Salwa Alamir. Software vulnerability and functionality assessment using llms. arXiv preprint arXiv:2403.08429, 2024.
+[Ji et al., 2023] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
+[Liu et al., 2023] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.
+
+
+[Lu et al., 2021] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
+[Lu et al., 2024] Guilong Lu, Xiaolin Ju, Xiang Chen, Wenlong Pei, and Zhilong Cai. Grace: Empowering llm-based software vulnerability detection with graph structure and in-context learning. Journal of Systems and Software, 212:112031, 2024.
+[Ma et al., 2023] Yong Ma, Senlin Luo, Yu-Ming Shang, Yifei Zhang, and Zhengjun Li. Codeprompt: Improving source code-related classification with knowledge features through prompt learning. 2023.
+[Maynez et al., 2020] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020.
+[McCabe, 1976] Thomas J McCabe. A complexity measure. IEEE Transactions on software Engineering, (4):308–320, 1976.
+[MetaAI, 2024] MetaAI. Llama-3.1. 2024.
+[OpenAI, 2023] OpenAI. Gpt-4 technical report, 2023.
+[Ouyang et al., 2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
+[Ouyang et al., 2023] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation. arXiv preprint arXiv:2308.02828, 2023.
+[Parvez et al., 2021] Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Retrieval augmented code generation and summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2719–2734, 2021.
+[Phan et al., 2017] Anh Viet Phan, Minh Le Nguyen, and Lam Thu Bui. Convolutional neural networks over control flow graphs for software defect prediction. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 45–52. IEEE, 2017.
+[Poldrack et al., 2023] Russell A Poldrack, Thomas Lu, and Gašper Beguš. Ai-assisted coding: Experiments with gpt-4. arXiv preprint arXiv:2304.13187, 2023.
+[Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
+[Roziere et al., 2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin,
+et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
+[Shippey, 2015] Thomas Joshua Shippey. Exploiting abstract syntax trees to locate software defects. 2015.
+[Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
+[Wu et al., 2024] Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick McDaniel, and Chaowei Xiao. A new era in llm security: Exploring security concerns in real-world llmbased systems. arXiv preprint arXiv:2402.18649, 2024.
+[Yang et al., 2023] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712, 2023.
+[Yang et al., 2024] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data, 18(6):1–32, 2024.
+[Zhang et al., 2023a] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Xudong Liu, Chunming Hu, and Yang Liu. Detecting condition-related bugs with control flow graph neural network. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1370–1382, 2023.
+[Zhang et al., 2023b] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
+[Zhu et al., 2023] Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, and Gareth Tyson. Can chatgpt reproduce human-generated labels? a study of social computing tasks. arXiv preprint arXiv:2304.10145, 2023.
+A Prompts for Code Hallucination Generation
+Dead/Unreachable code: I want you to act as a code hallucination generator. Given the function signature, task description, and test cases, your objective is to write the hallucinated python code that sounds plausible but is incorrect by inserting a dead or unreachable or redundant code. Below is an example:
+from typing import List
+
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """Check if in given list of numbers, are any two numbers closer
+    to each other than given threshold.
+    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
+    False
+    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
+    True
+    """
+
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    if threshold < 0:
+        return False
+    numbers.sort()
+    dead_code = [i for i in numbers if i < 0]
+    for i in range(len(numbers) - 1):
+        if abs(numbers[i] - numbers[i + 1]) < threshold:
+            return True
+        else:
+            dead_code.append(numbers[i])
+    dead_code = sorted(dead_code, reverse=True)
+    return False
+You should try your best to generate a hallucinated code to the following question:
+
+Syntactic incorrectness: I want you to act as a code hallucination generator. Given the function signature, task description, and test cases, your objective is to write the hallucinated python code that sounds plausible but is incorrect by making syntax errors in the code. Below is an example:
+from typing import List
+
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """Check if in given list of numbers, are any two numbers closer
+    to each other than given threshold.
+    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
+    False
+    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
+    True
+    """
+
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    for i range(0, len(numbers) - 1):
+        for j range(i + 1, len(numbers)):
+            if abs(numbers[i] - numbers[j]) < threshold:
+                return True
+    return False
+You should try your best to generate a hallucinated code to the following question:
+Logical error: I want you to act as a code hallucination generator. Given the function signature, task description, and test cases, your objective is to write the hallucinated python code that sounds plausible but is incorrect by making logical errors in the code. Below is an example:
+from typing import List
+
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """Check if in given list of numbers, are any two numbers closer
+    to each other than given threshold.
+    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
+    False
+    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
+    True
+    """
+
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    close_elements = False
+    for i in range(len(numbers)):
+        for j in range(i + 1, len(numbers)):
+            if abs(numbers[i] - numbers[j]) < threshold:
+                close_elements = True
+                break
+            else:
+                close_elements = False
+    return not close_elements
+You should try your best to generate a hallucinated code to the following question:
+
+Robustness issue: I want you to act as a code hallucination generator. Given the function signature, task description, and examples, your objective is to write the hallucinated python code that sounds plausible but fails on certain test cases or raises an exception. Below is an example:
+import yaml
+
+def yaml_load(filename):
+    """Open a file, load its contents into a variable with yaml,
+    and return it."""
+
+def yaml_load(filename):
+    with open(filename, 'r') as f:
+        data = yaml.load(f, Loader=yaml.BaseLoader)
+    return data
+You should try your best to generate a hallucinated code to the following question:
+Security vulnerabilities: I want you to act as a code hallucination generator. Given the function signature, task description, and examples, your objective is to write the hallucinated python code that sounds plausible but is incorrect by either having security vulnerabilities or memory leaks. Below is an example:
+import openai
+
+def get_embeddings(text):
+    """Calculate the embeddings for the text provided as parameter
+    using the OpenAI's 'text-embedding-ada-002' model. Then, return
+    a list of the embeddings."""
+
+def get_embeddings(text):
+    openai.api_key = "sk-Hn0VpJ4aM2X7aAyT3mNr37tMMWgJj17zjrxlpbjKgMHe54PV"  # publicly available API key
+    model = 'text-embedding-ada-002'
+    result = openai.Embedding.create(model=model, input=text)
+    return result["data"][0]["embedding"]
+You should try your best to generate a hallucinated code to the following question:
+B Prompt for Code Hallucination Detection
+I want you to act as a code judge. Given the task description, function signature, and the generated code, your objective is to detect if the generated code has defects, incorrect code or hallucinations. Hallucinated code is the generated code that has one or more code defects such as dead or unreachable code, syntactic or logical errors, robustness issues such that the code fails on certain test cases or raises an exception, or has security vulnerabilities or memory leaks. Below are the 5 categories for code hallucination along with an example: 1. Dead code: Generated code has dead, unreachable or redundant piece of code. Example:
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """Check if in given list of numbers, are any two numbers closer
+    to each other than given threshold."""
+    if threshold < 0:
+        return False
+    numbers.sort()
+    dead_code = [i for i in numbers if i < 0]
+    for i in range(len(numbers) - 1):
+        if abs(numbers[i] - numbers[i + 1]) < threshold:
+            return True
+        else:
+            dead_code.append(numbers[i])
+    dead_code = sorted(dead_code, reverse=True)
+    return False
+2. Syntactic incorrectness: Generated code has syntactic errors and therefore, fails to compile. Example:
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """Check if in given list of numbers, are any two numbers closer
+    to each other than given threshold."""
+    for i range(0, len(numbers) - 1):
+        for j range(i + 1, len(numbers)):
+            if abs(numbers[i] - numbers[j]) < threshold:
+                return True
+    return False
+3. Logical error: Generated code has logical errors. Example:
+def has_close_elements(numbers: List[float], threshold: float) -> bool:
+    """Check if in given list of numbers, are any two numbers closer
+    to each other than given threshold."""
+    close_elements = False
+    for i in range(len(numbers)):
+        for j in range(i + 1, len(numbers)):
+            if abs(numbers[i] - numbers[j]) < threshold:
+                close_elements = True
+                break
+            else:
+                close_elements = False
+    return not close_elements
+4. Robustness issue: Generated code has robustness issues as it fails on certain test cases or raises an exception due to poor exception handling. Example:
+import yaml
+def yaml_load(filename):
+    """Open a file, load its contents into a variable with yaml,
+    and return it."""
+    with open(filename, 'r') as f:
+        data = yaml.load(f, Loader=yaml.BaseLoader)
+    return data
+5. Security vulnerabilities: Generated code has security vulnerabilities or memory leaks. Example:
+import openai
+def get_embeddings(text):
+    """Calculate the embeddings for the text provided as parameter
+    using the OpenAI's 'text-embedding-ada-002' model. Then, return
+    a list of the embeddings."""
+    openai.api_key = "sk-Hn0VpJ4aM2X7aAyT3mNr37tMMWgJj17zjrxlpbjKgMHe54PV"  # publicly available API key
+    model = 'text-embedding-ada-002'
+    result = openai.Embedding.create(model=model, input=text)
+    return result["data"][0]["embedding"]
+You should try your best to determine if the code contains any defects or hallucinations based on the above hallucination types. Please output only 1 out of these 6 classes: [“no hallucination”, “dead code”, “syntactic incorrectness”, “logical error”, “robustness issue”, “security vulnerabilities”].
\ No newline at end of file
diff --git a/storage/67J3SUDE/Agarwal et al. - 2025 - CodeMirage Hallucinations in Code Generated by Large Language Models.pdf b/storage/67J3SUDE/Agarwal et al. - 2025 - CodeMirage Hallucinations in Code Generated by Large Language Models.pdf
new file mode 100644
index 0000000..9b74ae5
Binary files /dev/null and b/storage/67J3SUDE/Agarwal et al. - 2025 - CodeMirage Hallucinations in Code Generated by Large Language Models.pdf differ
diff --git a/storage/68TI6FWV/.zotero-ft-cache b/storage/68TI6FWV/.zotero-ft-cache
new file mode 100644
index 0000000..143bb4b
--- /dev/null
+++ b/storage/68TI6FWV/.zotero-ft-cache
@@ -0,0 +1,257 @@
+Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling
+Xinyue Fang1, Zhen Huang1*, Zhiliang Tian1*, Minghui Fang2, Ziyi Pan1, Quntian Fang1, Zhihua Wen1, Hengyue Pan1, Dongsheng Li1
+1College of Computer, National University of Defense Technology 2College of Computer Science and Technology, Zhejiang University {fangxinyue, huangzhen, tianzhiliang, panziyi, fangquntian, zhwen, hengyuepan, dsli}@nudt.edu.cn minghuifang@zju.edu.cn
+Abstract
+LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short and concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is harder. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long texts without external resources conduct consistency comparison among multiple sampled outputs. To handle long texts, researchers split long texts into multiple facts and individually compare the consistency of each pair of facts. However, these methods (1) hardly achieve alignment among multiple facts; (2) overlook dependencies between multiple contextual facts. In this paper, we propose a graph-based context-aware (GCA) hallucination detection method for text generation, which aligns facts and considers the dependencies between contextual facts in consistency comparison. Particularly, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual triples (facts), we construct contextual triples into a graph and enhance triples’ interactions via message passing and aggregation with an RGCN. To avoid the omission of knowledge triples in long texts, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and outperforms all baselines.
+Code — https://github.com/GCA-hallucinationdetection/GCA
+Technical Appendices — https://arxiv.org/abs/2409.11283
+1 Introduction
+Recent research has shown that large language models (LLMs) achieve state-of-the-art performance in various NLP tasks (Fang et al. 2024; Lu et al. 2022). However, these models often suffer from hallucinations: they generate incorrect or fabricated content in a factual tone, which undermines the models’ credibility (Ji et al. 2023; Lu et al. 2023a) and limits LLMs’ application in fields requiring factual accuracy (Huang et al. 2023; Su et al. 2024). Detecting hallucination in a model’s responses is crucial for the broader adoption of LLMs.
+*Corresponding Author Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
+Existing hallucination detection studies mainly concentrate on tasks like question answering (QA) (Zhang et al. 2023a; Wen et al. 2023a) and arithmetic calculation (Xue et al. 2023; Tian et al. 2022) with short and concrete correct answers. In these tasks, the consistency among their concrete answers can be easily checked for hallucination (Jiang et al. 2021; Wang et al. 2024a). The research on detecting hallucinations in generated long text is more challenging because (1) generating text is open-ended and rarely has concrete answers and (2) long text contains multiple facts and the consistency among them is hard to verify. Therefore, we focus on hallucination detection in long text generation with black-box models (powerful LLM, like GPT-4) without external resources (i.e. zero-resource setting).
+Currently, the studies on black-box zero-resource hallucination detection for text generation can be divided into two categories: (1) Self-checking (Friel and Sanyal 2023; Liu et al. 2023) designs prompt texts using chain-of-thought (CoT) to verify response factuality by the LLMs’ own capabilities. Though that can be easily implemented and applied in various scenarios, it relies on the model’s own ability and can lead to missed detection: The model may overconfidently trust its outputs (Li et al. 2024a; Chen et al. 2024). (2) Consistency comparison (Zhang et al. 2023a; Ma et al. 2024) samples multiple responses to check whether the sampled responses are highly inconsistent, which indicates hallucination (Farquhar et al. 2024; Wang et al. 2024b). The method is effective for short responses with few concrete claims, making consistency comparison easy. However, in long text responses, diverse wording or lexical representations of the same fact complicate consistency comparison.
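The consistency-comparison idea can be made concrete: score each fact of the original response by how often it recurs in the sampled responses. This is a toy sketch of our own with exact-match facts; real systems must align paraphrases, which is exactly the difficulty the paragraph describes:

```python
def consistency_scores(original_facts, sampled_fact_sets):
    """Fraction of sampled responses supporting each original fact;
    a low score flags a likely hallucination."""
    n = len(sampled_fact_sets)
    return {fact: sum(fact in s for s in sampled_fact_sets) / n
            for fact in original_facts}

facts = ["Einstein won the Nobel Prize",
         "Einstein was born in 1900"]
samples = [{"Einstein won the Nobel Prize"},
           {"Einstein won the Nobel Prize", "Einstein was born in 1879"},
           {"Einstein won the Nobel Prize"}]
scores = consistency_scores(facts, samples)
```

Here the fabricated birth-year fact receives a score of 0 because no sampled response repeats it.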
+To address those issues, researchers propose divide-and-conquer (Zhang et al. 2024) based on consistency comparison. It has three steps: (1) sample multiple additional responses appending to the original response; (2) divide each response into multiple facts; (3) compare the consistency of facts in the original response with those in the sampled responses, where highly inconsistent facts indicate hallucinations. However, it ignores the omission issue: facts in the original but not in sampled responses may be wrongly seen as hallucinations for lack of comparison. To overcome that, (Yang, Sun, and Wan 2023; Cao, Yang, and Zhao 2024) propose reverse verification. For each fact in the original response, they prompt LLMs to create a new question
+The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
+
+
+answered by the claim and check if the model’s response matches the fact. The method can avoid the omission issue because each fact has an accurately matched counterpart for comparison. However, a long text may contain multiple facts {A1, ..., AN}, where Ai is the i-th fact within the LLM’s generation. The above method compares the consistency of each fact sampled from different responses individually (e.g. verify the truthfulness of Ai by comparing it with Bi, where Bi is Ai’s corresponding fact from a sampled response). It ignores the dependencies among the facts, which can reduce detection performance. For example, for the response “Einstein won the Nobel Prize for proposing the theory of relativity.” we can extract two triples: (Einstein, proposal, theory of relativity) and (Einstein, receive, Nobel Prize). Although each triple is a fact, their dependency is an error. Therefore, considering the dependencies between a fact and its surrounding facts is promising to enhance hallucination detection.
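The contextual-triple graph can be illustrated with the Einstein example above: connect triples that share an entity, so that message passing (the RGCN in the paper) can propagate information between dependent facts. A toy construction of our own; entity-overlap edges are a simplification of the paper's graph, which also types the relations:

```python
def build_triple_graph(triples):
    """Edge (i, j) whenever two knowledge triples share a head or
    tail entity, making them contextual neighbours."""
    edges = []
    for i in range(len(triples)):
        for j in range(i + 1, len(triples)):
            h1, _, t1 = triples[i]
            h2, _, t2 = triples[j]
            if {h1, t1} & {h2, t2}:
                edges.append((i, j))
    return edges

triples = [("Einstein", "proposal", "theory of relativity"),
           ("Einstein", "receive", "Nobel Prize")]
edges = build_triple_graph(triples)
```

Both triples share the entity "Einstein", so they become neighbours and their (erroneous) dependency can be checked jointly rather than in isolation.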
+In this paper, we propose a graph-based context-aware (GCA) hallucination detection method for long text generation, which extracts triples to align facts for a better consistency comparison and considers the dependencies between contextual triples. Specifically, to better align facts from responses for consistency comparison, we conduct a triple-oriented response segmentation to extract multiple knowledge triples from each response. Triples are well-structured and thus easy to compare; in addition, it is easy to align facts among multiple responses, since triples carry only facts and are free of the wording and lexical variation of the response (Wang et al. 2023b) (Sec. 3.2). To consider the dependencies between a triple and its contextual triples, we construct a graph over a triple and its contextual triples. Then, we encourage interaction among triples via message passing and aggregate features of neighboring nodes via an RGCN (relational graph convolutional network) (Schlichtkrull et al. 2018) (Sec. 3.3). This encourages interaction between facts, as we observe that a certain number of facts have dependencies within a response (see Sec. 4.5). To avoid the omission issue in (Sec. 3.3), we propose an LLM-based reverse verification method with three reconstruction tasks that reconstruct the knowledge triples (Sec. 3.4). These tasks provide a more thorough and detailed detection of each triple in long text responses, enhancing the overall effectiveness of the method. Experiments show that our method effectively improves hallucination detection accuracy for long text responses generated by black-box models under zero-resource conditions. Our contributions are: (1) we propose a hallucination detection method for long text generation that considers the dependencies between contextual knowledge triples. (2) We propose a graph-based context-aware hallucination detection via consistency comparison with RGCN.
+(3) We additionally propose three reverse verification tasks to aid hallucination detection by reconstructing triples. (4) Experiments show that our method outperforms all baselines.
+2 Related Work
+2.1 White-box Hallucination Detection
+These methods analyze the model’s internal states to identify hallucinations (Yuksekgonul et al. 2024; Lu et al. 2023b;
+Wen et al. 2024), mainly divided into two types: (1) Output-logit-based methods: The model’s output logits reflect the confidence of the model’s predictions (Jiang et al. 2024). (Varshney et al. 2023) calculates the logits for concepts in the response and takes the minimum probabilities to model the uncertainty. (Verma et al. 2023) explores the integration of Epistemic Neural Networks (ENNs) with LLMs to improve the model’s uncertainty estimates. (Luo, Xiao, and Ma 2023) proposes to adjust the model’s output logits by adding a linear layer to better align with correctness likelihood. (2) Hidden-layer-activation-based methods: Hidden layer activations encapsulate the model’s internal representation of statement truthfulness (Fadeeva et al. 2024). (Azaria and Mitchell 2023) trains a classifier using the LLM’s hidden layer activations to assess the truthfulness of statements. (Snyder, Moisescu, and Zafar 2024) uses output values from artifacts associated with model generation as input features to train a classifier that identifies hallucinations. (Zhu et al. 2024) uses probabilistic models to analyze internal state transitions in the LLM during generation to detect hallucinations.
+2.2 Black-box Hallucination Detection using External Resources
+These methods aim to verify the authenticity of model-generated content by leveraging external knowledge (Wen et al. 2023b; Nahar et al. 2024). Depending on the source of external knowledge, they can be categorized into two types. (1) RAG-based methods: Retrieval-augmented generation (RAG) is a technique that enhances text generation by retrieving relevant information from external sources (Sadat et al. 2023; Wang et al. 2023a). (Roychowdhury et al. 2024) proposes a multi-LLM system with the capability to retrieve external knowledge bases and perform real-time content authentication. (Ding et al. 2024) retrieves relevant evidence to help LLMs correct potential hallucinations in responses. (Kang, Ni, and Yao 2024) proposes a method for real-time retrieval from Web search engines that can verify the factuality of output responses and correct hallucinations. (Li et al. 2024b) automatically retrieves knowledge graphs to detect hallucinations through logic programming and mutation testing. Furthermore, (Bayat et al. 2023) proposes an automated method to extract factual claims from responses and collect evidence from knowledge graphs to verify the factuality of the extracted claims. (2) Incorporating alternative models: Researchers use responses generated by other models for cross-validation to detect hallucinations (Hegselmann et al. 2024). (Cohen et al. 2023) constructs a framework for assessing the factuality of output responses through cross-validation by two language models. (Rawte et al. 2024; Wan et al. 2024) use multiple LLMs as “judges” to evaluate various aspects of the model’s output responses. (Li et al. 2024b) proposes a mutation testing model based on logic programming, which can verify the consistency of LLMs’ responses with real-world situations.
+2.3 Black-box Hallucination Detection using Zero-resource
+Researchers propose using the model’s own capabilities to detect hallucinations (Liu et al. 2024) because obtaining high-quality external resources is challenging (Mündler et al. 2024). (1) For non-long text responses generated by the model: Consistency comparison through multiple sampled responses is an important method (Allen, Polat, and Groth 2024). (Zhang et al. 2023a) improves hallucination detection performance on commonsense QA tasks through semantic-aware cross-validation consistency. (Liu et al. 2024) evaluates the reliability of responses generated by LLMs for individual questions or queries through cross-query comparison. (Ma et al. 2024) proposes a critical calculation and conclusion (CCC) prompt template to enhance the LLM’s ability to detect and correct unreasonable errors in mathematical problem-solving. (Yehuda et al. 2024) identifies potential instances of hallucination by quantifying the level of inconsistency between the original query and the reconstructed query. (2) For long text responses generated by the model: (Manakul, Liusie, and Gales 2023) proposes a method to detect hallucinations by comparing the consistency of responses from multiple random samplings. (Yang, Sun, and Wan 2023) introduces a reverse validation method for passage-level hallucination detection in LLMs, which leverages the LLM’s own knowledge base. (Mündler et al. 2024) introduces a method for detecting self-contradictions in long text responses through logical reasoning. (Friel and Sanyal 2023) proposes an efficient prompting method that uses the chains of thought generated by LLMs to detect hallucinations in the responses. Unlike these LLM-based methods, our approach constructs long text responses as graphs and uses graph neural networks to capture the contextual influence of each fact during hallucination detection.
+3 Method
+3.1 Overview
+Our method has three modules: (1) Triple-Oriented Response Segmentation (Sec. 3.2) extracts facts from the model’s responses. (2) Graph-based Contextual Consistency Comparison with RGCN (Sec. 3.3) constructs a graph carrying the extracted knowledge triples and utilizes an RGCN to propagate and integrate messages across the graph; it considers the dependencies between each knowledge triple (fact) and its surrounding triples during detection. (3) Reverse Verification via Triples Reconstruction (Sec. 3.4) achieves reverse verification for hallucination detection by reconstructing each triple via three LLM-based tasks (as shown in Fig. 1). We feed each extracted knowledge triple (Sec. 3.2) into the detection modules (Sec. 3.3 and Sec. 3.4), and then judge the original long text response based on the per-triple results from Sec. 3.3 and Sec. 3.4.
+3.2 Triple-Oriented Response Segmentation
+To better align facts in the consistency comparison, we propose to segment the responses by extracting knowledge triples as facts and checking the consistency among the triples. Our motivation is that, due to the impact of wording, comparing textual consistency can lead to mismatches. Because hallucination detection concerns the semantics of knowledge facts rather than specific word choices, we use a triple-based comparison method that provides better alignment than traditional textual comparison. Specifically, the steps are as follows:
+• Extraction. Inspired by the latest method (Hu et al. 2024), we design prompts to extract knowledge triples from responses using an LLM.
+• Verification. To ensure the accuracy of the extracted knowledge triples, we pair each response with its triples and prompt the LLM to confirm their semantic equivalence. If any ambiguities exist between the extracted triples and the response text, we instruct the LLM to adjust the semantics of the triples according to the text’s semantics. The details of the prompts are in App.A.
+Knowledge triples have a structured format and are easy to compare, which simplifies aligning and comparing consistency between responses and enhances detection accuracy.
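As a concrete illustration of the extraction step, the sketch below parses an LLM's extraction output into (head, relation, tail) triples. The line-based `head | relation | tail` output format and the `parse_triples` helper are hypothetical conveniences, not the paper's actual prompt design (the real prompts are in its App. A).

```python
def parse_triples(llm_output: str) -> list[tuple[str, str, str]]:
    """Parse knowledge triples from a (hypothetical) LLM extraction output
    in which each line is formatted as "head | relation | tail"."""
    triples = []
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):  # skip malformed lines
            triples.append((parts[0], parts[1], parts[2]))
    return triples


# Example extraction output for a short biography of Einstein:
raw = """Einstein | is | Scientist
Einstein | proposed | theory of relativity
Einstein | received | Nobel Prize"""
triples = parse_triples(raw)
```

The verification step described above would then feed each response together with its parsed triples back to the LLM for a semantic-equivalence check.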
+3.3 Graph-based Contextual Consistency Comparison with RGCN
+To effectively consider dependencies between triples, we propose Graph-based Contextual Consistency Comparison (GCCC), which constructs a knowledge graph for each response and then conducts message passing and aggregation via RGCN. The intuition is that traditional consistency comparison focuses on individual facts: it verifies a single knowledge fact by comparing it only with the corresponding fact in the sampled responses, one at a time. As a result, it ignores triples in the surrounding context that are mutually dependent with the given triple. To address this problem, our approach constructs graphs for the original response and the sampled responses, and then employs RGCN for message passing and aggregation on these graphs. The process consists of two stages: (1) knowledge triples modeling via graph learning, where we build a graph for each response and obtain node (entity) embeddings via RGCN processing to model the multiple knowledge triples of a response; and (2) triples consistency comparison, where we compare the consistency of triples across the graphs at the embedding level.
+Knowledge Triples Modeling via Graph Learning This stage is divided into three steps: firstly, we convert each response into a graph. Then, we obtain the initial node (entity) embeddings for each graph using sentence-BERT (Wang et al. 2020). Finally, we employ the RGCN to perform message passing and aggregation using the initial node embeddings on each graph, updating the node embeddings.
+• Graph Construction. For a user’s query, the model generates an original response R_o, and we sample multiple additional responses R_sampled = {R_1, R_2, ..., R_n}. (h_i, r_i, t_i) is a single triple in R_o, and (h_{i,j}, r_{i,j}, t_{i,j}) is a single triple in the j-th sampled response R_j. We construct the graph G_o = (V_o, E_o) for the original response, in which vertices (v ∈ V_o) represent the head and tail entities from each triple. An edge (e ∈ E_o) represents the relation between the head entity and the tail entity. Similarly, we construct the graph G_j = (V_j, E_j) for each
+[Figure 1 appears here. It illustrates the pipeline on the query “Please write a brief Wikipedia for Einstein.”: triples are extracted from the original and sampled responses; graphs are constructed with edges labeled same head, same tail, and same or different relationship for message passing (upper branch); the lower branch shows the three reconstruction tasks, e.g., turning “Einstein is Scientist” into “Einstein ? Scientist”, “Einstein is ?”, and “? is Scientist” with constrained answer spaces (Doctor, Artist, Baker, ...; Newton, Darwin, Tesla, ...).]
+Figure 1: GCA framework. We extract triples from the original response and sampled responses (left-upper corner). Then, we construct a graph for each response with the extracted triples and perform message passing and aggregation on the graph (as the upper branch). We conduct reverse validation for each part of the triples with three reconstruction tasks (as the lower branch).
+sampled response Rj. By doing so, we construct several response-specific graphs for each user’s query.
+• Representation Initialization. Using sentence-BERT, we encode the head entities, relations, and tail entities in the knowledge triples as vector representations. For the original response, we represent each triple embedding as (h_i, r_i, t_i) = BERT(h_i, r_i, t_i). For each sampled response, we represent each triple embedding as (h_{i,j}, r_{i,j}, t_{i,j}) = BERT(h_{i,j}, r_{i,j}, t_{i,j}). We treat the head and tail entity embeddings from the original response as G_o’s initial node (entity) embeddings. Similarly, we obtain the initial node (entity) embeddings for the graph G_j corresponding to the j-th sampled response.
+• Message Passing and Aggregation. We use the RGCN to perform message passing and aggregation on the graph. As Eq. 1 shows, for each layer $l$, the new representation of each node $v$ is denoted as $e_v^{(l+1)}$. For each relation $r \in R$, we denote the set of all neighbors of node $v$ connected through an edge of relation $r$ as $N_r(v)$. For each neighbor in $N_r(v)$, we multiply its representation $e_u^{(l)}$ by a weight matrix $W_r^{(l)}$ and normalize it using the hyperparameter $c_{v,r}$. In addition to aggregating information from neighbors, $e_v^{(l+1)}$ also incorporates the node’s own representation $e_v^{(l)}$ from the previous layer $l$, transformed by a weight matrix $W_0^{(l)}$.
+$$e_v^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{c_{v,r}} W_r^{(l)} e_u^{(l)} + W_0^{(l)} e_v^{(l)} \right) \quad (1)$$
+The update of $e_v^{(l+1)}$ integrates information from $v$’s neighbors through relation-specific transformations, while also incorporating $v$’s own representation. These operations ensure that the updated node embedding is informed by both its context and its intrinsic properties. Triples containing the node can thereby incorporate contextual information, enhancing the accuracy of triple consistency comparison and thus improving hallucination detection.
+Triples Consistency Comparison. Based on the graph representations from RGCN, we detect hallucinations in the original response by comparing the consistency of triples across multiple graphs. Firstly, we align triples between the original response’s graph Go and each sampled graph Gj. Then we compare the consistency of the aligned triples to calculate the consistency score.
+• Triples Alignment. For each triple (h_i, r_i, t_i) in the original response and each triple (h_{i,j}, r_{i,j}, t_{i,j}) in a sampled response, we first check whether the head entities of the two triples are the same. If so, we calculate the similarity S(r_i, r_{i,j}) between the representations of relations r_i and r_{i,j}. If S(r_i, r_{i,j}) exceeds the pre-defined threshold θ_r, we regard the two triples as aligned; otherwise, they are considered unaligned. For every triple in the original response, we apply the above operations to align triples from the sampled responses with it.
+• Consistency score calculation. After aligning the triples, we need to further compare whether they are consistent with each other to calculate the consistency score.
+Specifically, as Eq. 2 shows, for a triple $(h_i, r_i, t_i)$ in the original response and its aligned triple $(h_{i,j}, r_{i,j}, t_{i,j})^a$ in the $j$-th sampled response, $e_{t_i}$ and $e^a_{t_{i,j}}$ are the node embeddings of the tail entities $t_i$ and $t_{i,j}$ after RGCN processing. We compute the similarity between $e_{t_i}$ and $e^a_{t_{i,j}}$. If the similarity $S(e_{t_i}, e^a_{t_{i,j}})$ exceeds the threshold $\theta_t$, we increase the consistency score $c_{i,j}$ of $(h_i, r_i, t_i)$ by 1, indicating that the $j$-th sampled response contains a triple consistent with $(h_i, r_i, t_i)$. Conversely, we use $e^m_{t_{i,j}}$ to denote the node embedding of the tail entity in an unaligned triple $(h_{i,j}, r_{i,j}, t_{i,j})^m$ in the $j$-th sampled response. If the similarity between $e^m_{t_{i,j}}$ and $e_{t_i}$ exceeds the threshold $\theta_t$, we decrease the consistency score $c_{i,j}$ of $(h_i, r_i, t_i)$ by 1, indicating that the triple may carry a risk of hallucination. Note that we do not directly label the triple as a hallucination, since two triples with the same head and tail entities but different relations can both be factually correct. Moreover, such cases are rare (1.9% in the two hallucination detection datasets), as the knowledge triples we compare come from repeated responses to the same query, which tend to describe the same subject matter. In Sec. 3.4, we also provide a detailed detection for each triple to ensure the accuracy of the results.
+$$c_{i,j} = \begin{cases} c_{i,j} + 1 & \text{if } S(e_{t_i}, e^a_{t_{i,j}}) > \theta_t \\ c_{i,j} - 1 & \text{if } S(e_{t_i}, e^m_{t_{i,j}}) > \theta_t \\ c_{i,j} & \text{otherwise} \end{cases} \quad (2)$$
+To obtain the final consistency score for each triple in the original response, we sum its comparison results with each sampled response: $C_i = \sum_{j=1}^{n} c_{i,j}$.
+During the message passing and aggregation process with RGCN on a graph, each node integrates features from its neighboring nodes. This allows triples containing the node to aggregate contextual information from surrounding triples, thereby considering the dependencies between the triple to be verified and its surrounding triples.
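The alignment and scoring of Eq. 2 can be sketched as follows. This is an illustrative reading of Sec. 3.3, not the paper's code: alignment is taken as exact head-entity match plus relation-embedding cosine similarity above θ_r, the thresholds are placeholder values, and `rel_emb`/`tail_emb` stand in for the sentence-BERT relation embeddings and RGCN-processed tail-entity embeddings.

```python
import numpy as np


def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def consistency_vs_sample(orig_triple, sampled_triples, rel_emb, tail_emb,
                          theta_r=0.8, theta_t=0.8):
    """c_{i,j} of Eq. 2 for one original triple against one sampled response.
    rel_emb: relation name -> embedding; tail_emb: triple -> tail embedding.
    Thresholds theta_r / theta_t are illustrative, not the paper's values."""
    h, r, t = orig_triple
    c = 0
    for hs, rs, ts in sampled_triples:
        if hs != h:                                   # head entities must match
            continue
        same_tail = cos_sim(tail_emb[orig_triple],
                            tail_emb[(hs, rs, ts)]) > theta_t
        if cos_sim(rel_emb[r], rel_emb[rs]) > theta_r:
            c += 1 if same_tail else 0                # aligned and consistent
        elif same_tail:
            c -= 1                                    # unaligned, same tail: risk
    return c


# Toy embeddings: "is" and "isn't" are orthogonal relations.
rel_emb = {"is": np.array([1.0, 0.0]), "isn't": np.array([0.0, 1.0])}
orig = ("Einstein", "is", "Scientist")
tail_emb = {("Einstein", "is", "Scientist"): np.array([1.0, 0.0]),
            ("Einstein", "isn't", "Scientist"): np.array([1.0, 0.0])}
c_pos = consistency_vs_sample(orig, [("Einstein", "is", "Scientist")], rel_emb, tail_emb)
c_neg = consistency_vs_sample(orig, [("Einstein", "isn't", "Scientist")], rel_emb, tail_emb)
```

Summing such per-sample scores over all n sampled responses yields C_i.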
+3.4 Reverse Verification via Triple Reconstruction
+To address the omission issue mentioned in Sec. 3.3, we propose an LLM-based reverse verification method (RVF), which contains three reconstruction tasks that check whether the LLM can reconstruct a knowledge triple’s head entity, relation, and tail entity, respectively. Traditional reverse strategies prompt the LLM to reconstruct questions to verify each knowledge fact in the generated responses. A reconstructed question may have multiple correct answers, which lowers the probability that the model answers with the specific fact we aim to verify and increases the chance of misjudging it. To address this, we add constraints to the reconstructed questions to reduce the space of correct answers and increase the probability of eliciting the triples we want to verify. The three tasks are as follows:
+• Head Entity with Question Answering Task (HEQA). We prompt the LLM to reconstruct a question for each triple, with the head entity as the expected answer, and then obtain the model’s responses. We check if these responses are consistent with the head entity. Specifically, to reduce the space of correct answers for reconstructed questions, we first follow the method from (Manakul, Liusie, and Gales 2023) to initially verify the triples in the original responses. Then, we filter out a set of triples f_t with high factual accuracy. For each triple (h_i, r_i, t_i) in the original response, we add f_t (excluding (h_i, r_i, t_i) if it exists in f_t) as constraints in the questions during the LLM reconstruction process. The model’s responses to the question must satisfy these constraints. We repeatedly prompt the LLM to generate answers A to the question; the total number of answers is denoted as N_A. We count the times the model’s responses match the head entity h_i (denoted as N_h) and calculate the fact score S_h as the ratio of N_h to N_A, i.e., S_h = N_h / N_A.
+• Relation Regeneration Task (RR). We mask the relation in the triple with a special token and prompt the model to predict it multiple times. Then we check whether the model’s predictions are identical to the relation to measure consistency. This reduces the space of correct answers because the possible relationships between two entities are limited. Specifically, for each triple (h_i, r_i, t_i), we mask r_i with a special token and prompt the LLM multiple times to predict the original r_i given h_i and t_i. We define the fact score S_r as the proportion of predicted relations that match the original relation r_i, i.e., S_r = N_c / N_p, where N_c is the number of matched predictions and N_p is the total number of predictions.
+• Fact Triple Selection based on Tail Entity Task (FTSTE). Models often generate long texts centered around a few key entities, which typically serve as the head entities in extracted triples. The limited number of head entities allows us to use the surrounding context of the head entity as constraints to effectively narrow down the space of correct answers for reconstructed questions. However, tail entities in long-text responses are more diverse, so we cannot directly use surrounding contexts as constraints. Instead, we take a direct approach: we provide a list of options to limit the space of correct answers and prompt the model to select the factual triple from it. Then, we check whether the model’s selections are consistent with the original triple. Specifically, for each triple (h_i, r_i, t_i), we replace t_i with other entities of the same type to generate multiple similar triples; we then prompt the LLM to choose the factual one. We define the fact score S_t as the proportion of times (h_i, r_i, t_i) is selected, i.e., S_t = N_t / N_s, where N_t is the number of times (h_i, r_i, t_i) is selected and N_s is the total number of selections. See the prompt templates for the above three tasks in App. B.
+Finally, we sum up the fact scores from these three tasks and the consistency score from Sec. 3.3 with different weights to make a judgment about each triple in the original response, as shown in Eq. 3.
+$$F(h_i, r_i, t_i) = w_1 \cdot S_h + w_2 \cdot S_r + w_3 \cdot S_t + w_4 \cdot C_i \quad (3)$$
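Eq. 3 is a weighted linear combination, which in code is a one-liner plus a decision rule. The weights and the decision threshold `tau` below are placeholders; the paper does not report its values in this section:

```python
def final_score(s_h, s_r, s_t, c_i, weights=(0.3, 0.2, 0.2, 0.3)):
    """F(h_i, r_i, t_i) of Eq. 3: weighted sum of the three reverse-
    verification fact scores and the consistency score C_i.
    The weights here are illustrative placeholders."""
    w1, w2, w3, w4 = weights
    return w1 * s_h + w2 * s_r + w3 * s_t + w4 * c_i


def judge(f, tau=0.5):
    """Illustrative thresholding of F into a factual/hallucination label;
    tau is a placeholder, not a value from the paper."""
    return "factual" if f >= tau else "hallucination"
```

The passage-level verdict then aggregates these per-triple judgments.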
+In our proposed reverse verification method, the three tasks use different strategies to reduce the space of correct answers in the reconstructed questions. This avoids the issue in traditional reverse detection techniques, where reconstructed questions with multiple correct answers make specific facts difficult to verify, and thereby improves the accuracy of detecting each triple in the original response.
+4 Experiments
+4.1 Experimental Setting
+Datasets We utilize three datasets: (1) PHD (Yang, Sun, and Wan 2023): This dataset consists of 300 samples; each is a Wikipedia-style article about an entity generated by ChatGPT (gpt-3.5-turbo) and annotated by human annotators. (2) WikiBio (Manakul, Liusie, and Gales 2023): This dataset consists of 238 passages generated by GPT-3 and annotated at the sentence level. Although it lacks passage-level labels, we follow (Yang, Sun, and Wan 2023) and aggregate sentence labels to derive pseudo-labels at the passage level. (3) sub-WikiBio: The WikiBio dataset contains only 12 factual samples, so its sample distribution is highly imbalanced. We therefore extract all 12 factual samples and 48 randomly selected hallucination samples to create a subset, referred to as the WikiBio subset in our experiments.
+Implementation Details We use the recorded responses for each sample as original responses and generate 10 additional sampled responses using ChatGPT. We set the generation temperature to 1.0 to ensure the randomness of the sampled responses. We use GPT-4 (gpt-4-1106-preview) to extract knowledge triples from responses and to reconstruct questions in reverse verification; here we set the temperature to 0.0 to maximize the reproducibility of the results.
+Baselines We compare our method against six baselines: (1) Reverse Validation via QG (RVQG) (Yang, Sun, and Wan 2023) uses LLMs to reconstruct a question about the text to be verified and checks whether the model’s response to the reconstructed question is consistent with the text. (2) Semantic Entropy (SE) (Farquhar et al. 2024) breaks the response into factual claims and prompts LLMs to reconstruct questions about them. For each claim, it repeatedly asks the LLM the reconstructed questions, clusters the claim together with the model’s responses, and measures the entropy of the cluster containing the claim to assess its validity. (3) SelfCheckGPT via BERTScore (SelfCk-BS) (Manakul, Liusie, and Gales 2023) is a variant of SelfCheckGPT that uses BERTScore to measure consistency between the original response and the sampled responses. (4) SelfCheckGPT via NLI (SelfCk-NLI) (Manakul, Liusie, and Gales 2023) is another variant of SelfCheckGPT that uses an NLI model to measure consistency between the original response and the sampled responses. (5) Self-contradiction (SC) (Mündler et al. 2024) is a prompting-based framework designed to effectively detect self-contradictory hallucinations. (6) Focus (Zhang et al. 2023b) is a white-box hallucination detection method that focuses on the properties of key tokens in the response. However, SE, SelfCk-BS, SelfCk-NLI, and Focus all return the likelihood scores
+of a sample being a hallucination, rather than labels indicating fact or hallucination. To align these methods with our task, we set thresholds for these baselines on each dataset using the same approach as for our method. If a sample’s score exceeds the threshold, we classify it as a hallucination. Details are in App. C.
+Evaluation Metrics. We evaluate how well each method detects hallucinatory responses using two metrics: (1) F1, the harmonic mean of precision and recall, which provides a comprehensive evaluation of classification performance; and (2) Accuracy, the proportion of correctly classified samples out of the total number of samples.
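For completeness, the two metrics can be computed directly from the passage-level labels. A minimal sketch (the paper presumably uses a standard library implementation; the label strings here are illustrative):

```python
def f1_and_accuracy(y_true, y_pred, positive="hallucination"):
    """F1 on the hallucination class and overall accuracy for
    passage-level labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return f1, acc
```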
+Methods      PHD          WikiBio      sub-WikiBio
+             F1    Acc    F1    Acc    F1    Acc
+RVQG         52.3  65.3   85.7  79.2   88.2  81.7
+SE           35.6  62.7   66.7  52.5   87.9  82.3
+SelfCk-NLI   23.7  42.0   60.2  43.3   44.7  30.0
+SelfCk-BS    40.5  55.0   71.0  57.1   88.8  83.3
+SC           30.9  65.7   75.8  62.2   83.7  76.7
+Focus        46.7  62.0   75.7  61.3   83.3  76.7
+GCA          55.4  68.3   90.7  83.2   90.5  85.0
+Table 1: Evaluation results of all methods.
+4.2 Overall Performance
+We analyze the effectiveness of our method by comparing it with six baselines; results are shown in Tab. 1. Our method outperforms the baselines on all metrics. SelfCk-NLI, which uses an NLI model to assess whether any sentence in the original response contradicts the sampled responses, performs the worst on all metrics. SelfCk-NLI does not perform as well as SelfCk-BS, suggesting that NLI models have limited ability to compare consistency between texts; they are even less effective than simple embedding-similarity measures. Reverse validation methods (RVQG and SE) perform worse than our method on all metrics. We attribute this to our graph-based consistency comparison (Sec. 3.3), which considers dependencies between triples during comparison. Notably, our method outperforms Focus, a white-box hallucination detection method that uses internal model information, further demonstrating its efficacy.
+4.3 Ablation study
+We conduct an ablation study to verify the importance of each component, as shown in Tab. 2. −RVF means removing the reverse verification module from our full model. The performance drop across most datasets indicates that RVF effectively addresses the omission issue in GCCC and improves overall effectiveness. However, performance does not drop on WikiBio. The reason is that WikiBio contains a very high proportion of hallucination samples (95%), causing our method, the baselines, and their variants to show a bias toward predicting hallucinations on this dataset. Under these abnormal conditions, the RVF module is not effective, as its advantage lies in correctly identifying hallucination samples. With a more balanced sample distribution (sub-WikiBio), our full model performs better than GCCC alone, as expected. −GCCC removes GCCC from the full
+Variants     PHD          WikiBio      sub-WikiBio
+             F1    Acc    F1    Acc    F1    Acc
+− RVF        38.1  40.3   90.7  83.2   88.7  80.0
+− GCCC       54.0  67.7   87.1  78.2   87.6  81.7
+− RR         52.1  65.6   87.0  77.7   90.0  83.3
+− FTSTE      52.1  66.3   86.8  77.7   89.1  83.3
+− HEQA       36.4  54.6   84.5  73.9   85.1  75.0
+− Relations  53.7  66.7   86.8  77.7   88.6  80.0
+− Graph      52.8  66.7   83.7  73.1   87.6  81.7
+GCA          55.4  68.3   90.7  83.2   90.5  85.0
+Table 2: Ablation studies on model components. −RVF and −GCCC respectively mean detecting without RVF and without GCCC. −RR, −FTSTE, and −HEQA respectively indicate removing the RR task, FTSTE task, and HEQA task from the full model. −Relations means detecting without relations in triples. −Graph means detecting without the graph network model.
+model, performing worse than GCA. This indicates that GCA uses GCCC to consider the dependencies between triples during consistency comparison, improving the accuracy of the results. −RR, −FTSTE, and −HEQA respectively represent removing the RR task, FTSTE task, and HEQA task of Sec. 3.4 from our full model. −HEQA shows the worst performance, indicating that the HEQA task is the most effective reverse verification strategy. −Relations means not utilizing the relations in the triples during consistency comparison; it replaces the RGCN used in GCCC with a GCN, and the results decline, suggesting that relation information is useful and that RGCN effectively integrates it for each triple. −Graph means not using a graph network; it performs worse than GCA, indicating that the information integrated by RGCN is beneficial for detection.
+4.4 Analysis on Contextual Integration in GCCC
+To verify that our graph-based method effectively aggregates node information, we design an experiment to compare two scenarios: (S1) using RGCN for message passing and aggregation on the graph; (S2) without RGCN, examining the similarity between nodes and their surrounding neighbors. Specifically, we conduct two experiments as follows:
+t-SNE Visualization of Node Representation Distribution. The first experiment uses t-SNE dimensionality reduction to project the node representations from both scenarios into a two-dimensional space to observe their distributions. Fig. 2 shows that on both PHD and WikiBio, the node distribution in (S1) (red nodes) is more compact than in (S2) (blue nodes). This indicates that after applying RGCN, node representations become more similar to those of their neighbors: RGCN effectively integrates the features of neighboring nodes into each node’s representation, blending information for every node.
+Figure 2: Node distribution comparisons with (red points) and without (blue points) RGCN on the PHD (top) and WikiBio (bottom).
+Quantitative Analysis of Node Representation Similarity. We perform a quantitative analysis by obtaining the cosine similarity of node representations under both (S1) and (S2). Tab. 3 shows that the similarity between node representations is higher after processing with RGCN than without. This also indicates that our method integrates contextual information into each node by using RGCN.
+Methods   PHD            WikiBio
+          Min    Avg     Min    Avg
+GCCC      0.277  0.486   0.401  0.557
+− RGCN    0.189  0.305   0.167  0.247
+Table 3: Similarity of representations between nodes. GCCC means our full graph-based module GCCC. − RGCN indicates not using RGCN in the GCCC module. App. D shows the full version of the results.
+4.5 Triple Dependencies Error Detection
+We design an experiment to verify that our method can also detect errors in triple dependencies. Specifically, we create a new dataset, TripleCom, by selecting samples with errors in triple dependencies from the PHD and WikiBio datasets; the proportion of such errors is approximately 10.5% in these datasets. We then test GCA and the six baselines on this dataset, with implementation details matching those in Sec. 4.1. As shown in Tab. 4, GCA achieves the best performance on all metrics on the TripleCom dataset, demonstrating its effectiveness in detecting errors in the dependencies between multiple triples.
+5 Conclusion
+In this paper, we propose a graph-based context-aware hallucination detection method for long-text generation, where our method follows a zero-resource setting and uses only black-box LLMs. Our method extracts knowledge triples
+
+
+Methods      F1    Acc
+RVQG         81.1  68.2
+SE           56.3  39.1
+SelfCk-BS    66.7  50.0
+SelfCk-NLI   66.7  50.0
+SC           70.6  54.5
+Focus        75.7  60.1
+GCA          92.7  86.3
+Table 4: Results of methods on the TripleCom dataset.
+from output responses for better alignment. We then construct a graph to carry contextual information so that dependencies between knowledge triples are considered. This addresses the issue in existing methods of focusing only on individual facts and ignoring contextual information. We construct three reconstruction tasks for reverse verification of the knowledge triples. Experiments show that our method outperforms all baselines, including a white-box method with access to internal model information, excelling in hallucination detection.
+Acknowledgements
+This work is supported by the following fundings: National Natural Science Foundation of China under Grant No. 62376284 and No. 62306330, Young Elite Scientist Sponsorship Program by CAST (2023QNRC001) under Grant No. YESS20230367.
+References
+Allen, B.; Polat, F.; and Groth, P. 2024. SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection. In Ojha, A. K.; Doğruöz, A. S.; Tayyar Madabushi, H.; Da San Martino, G.; Rosenthal, S.; and Rosá, A., eds., Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), 839–844. Association for Computational Linguistics.
+Azaria, A.; and Mitchell, T. 2023. The Internal State of an LLM Knows When It’s Lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, 967–976.
+Bayat, F. F.; Qian, K.; Han, B.; Sang, Y.; Belyy, A.; Khorshidi, S.; Wu, F.; Ilyas, I.; and Li, Y. 2023. FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 124–130.
+Cao, Z.; Yang, Y.; and Zhao, H. 2024. AutoHall: Automated Hallucination Dataset Generation for Large Language Models. arXiv:2310.00259.
+Chen, C.; Liu, K.; Chen, Z.; Gu, Y.; Wu, Y.; Tao, M.; Fu, Z.; and Ye, J. 2024. INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection. arXiv:2402.03744.
+Cohen, R.; Hamri, M.; Geva, M.; and Globerson, A. 2023. LM vs LM: Detecting Factual Errors via Cross Examination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12621–12640.
+Ding, H.; Pang, L.; Wei, Z.; Shen, H.; and Cheng, X. 2024. Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models. arXiv:2402.10612.
+Fadeeva, E.; Rubashevskii, A.; Shelmanov, A.; Petrakov, S.; Li, H.; Mubarak, H.; Tsymbalov, E.; Kuzmin, G.; Panchenko, A.; Baldwin, T.; Nakov, P.; and Panov, M. 2024. Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. arXiv:2403.04696.
+Fang, M.; Ji, S.; Zuo, J.; Huang, H.; Xia, Y.; Zhu, J.; Cheng, X.; Yang, X.; Liu, W.; Wang, G.; Dong, Z.; and Zhao, Z. 2024. ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling. arXiv:2406.17507.
+Farquhar, S.; Kossen, J.; Kuhn, L.; and Gal, Y. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017): 625–630.
+Friel, R.; and Sanyal, A. 2023. Chainpoll: A high efficacy method for LLM hallucination detection. arXiv:2310.18344.
+Hegselmann, S.; Shen, S. Z.; Gierse, F.; Agrawal, M.; Sontag, D.; and Jiang, X. 2024. A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models. arXiv:2402.15422.
+Hu, X.; Ru, D.; Qiu, L.; Guo, Q.; Zhang, T.; Xu, Y.; Luo, Y.; Liu, P.; Zhang, Y.; and Zhang, Z. 2024. RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models. arXiv:2405.14486.
+Huang, Q.; Tao, M.; Zhang, C.; An, Z.; Jiang, C.; Chen, Z.; Wu, Z.; and Feng, Y. 2023. Lawyer LLaMA Technical Report. arXiv:2305.15062.
+Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y. J.; Madotto, A.; and Fung, P. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12): 1–38.
+Jiang, C.; Qi, B.; Hong, X.; Fu, D.; Cheng, Y.; Meng, F.; Yu, M.; Zhou, B.; and Zhou, J. 2024. On Large Language Models’ Hallucination with Regard to Known Facts. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 1041–1053.
+Jiang, Z.; Araki, J.; Ding, H.; and Neubig, G. 2021. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9: 962–977.
+Kang, H.; Ni, J.; and Yao, H. 2024. Ever: Mitigating Hallucination in Large Language Models through Real-Time Verification and Rectification. arXiv:2311.09114.
+Li, M.; Wang, W.; Feng, F.; Zhu, F.; Wang, Q.; and Chua, T.-S. 2024a. Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection. In Findings of the Association for Computational Linguistics: EMNLP 2024, 11858–11875.
+Li, N.; Li, Y.; Liu, Y.; Shi, L.; Wang, K.; and Wang, H. 2024b. Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models. arXiv:2405.00648.
+
+
+Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2511–2522.
+Liu, Z.; Kou, B.; Li, P.; Yan, M.; Zhang, J.; Huang, F.; and Liu, Y. 2024. Enabling Weak LLMs to Judge Response Reliability via Meta Ranking. arXiv:2402.12146.
+Lu, M.; Huang, Z.; Li, B.; Zhao, Y.; Qin, Z.; and Li, D. 2022. Sifter: A framework for robust rumor detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 429–442.
+Lu, M.; Huang, Z.; Tian, Z.; Zhao, Y.; Fei, X.; and Li, D. 2023a. Meta-Tsallis-Entropy Minimization: A New Self-training Approach for Domain Adaptation on Text Classification. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 5159–5169.
+Lu, M.; Huang, Z.; Zhao, Y.; Tian, Z.; Liu, Y.; and Li, D. 2023b. DaMSTF: Domain Adversarial Learning Enhanced Meta Self-Training for Domain Adaptation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1650–1668.
+Luo, J.; Xiao, C.; and Ma, F. 2023. Zero-Resource Hallucination Prevention for Large Language Models. arXiv:2309.02654.
+Ma, J.; Dai, D.; Sha, L.; and Sui, Z. 2024. Large Language Models Are Unconscious of Unreasonability in Math Problems. arXiv:2403.19346.
+Manakul, P.; Liusie, A.; and Gales, M. 2023. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9004–9017.
+Mündler, N.; He, J.; Jenko, S.; and Vechev, M. 2024. Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation. arXiv:2305.15852.
+Nahar, M.; Seo, H.; Lee, E.-J.; Xiong, A.; and Lee, D. 2024. Fakes of Varying Shades: How Warning Affects Human Perception and Engagement Regarding LLM Hallucinations. arXiv:2404.03745.
+Rawte, V.; Tonmoy, S. M. T. I.; Rajbangshi, K.; Nag, S.; Chadha, A.; Sheth, A. P.; and Das, A. 2024. FACTOID: FACtual enTailment fOr hallucInation Detection. arXiv:2403.19113.
+Roychowdhury, S.; Krema, M.; Mahammad, A.; Moore, B.; Mukherjee, A.; and Prakashchandra, P. 2024. ERATTA: Extreme RAG for Table To Answers with Large Language Models. arXiv:2405.03963.
+Sadat, M.; Zhou, Z.; Lange, L.; Araki, J.; Gundroo, A.; Wang, B.; Menon, R.; Parvez, M.; and Feng, Z. 2023. DelucionQA: Detecting Hallucinations in Domain-specific Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, 822–835.
+Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In The semantic
+web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15, 593–607. Springer.
+Snyder, B.; Moisescu, M.; and Zafar, M. B. 2024. On early detection of hallucinations in factual question answering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2721–2732.
+Su, X.; Huang, Z.; Su, Y.; Trisedya, B. D.; Dou, Y.; and Zhao, Y. 2024. Hierarchical Shared Encoder with Task-specific Transformer Layer Selection for Emotion-Cause Pair Extraction. IEEE Transactions on Affective Computing.
+Tian, Z.; Wang, Y.; Song, Y.; Zhang, C.; Lee, D.; Zhao, Y.; Li, D.; and Zhang, N. L. 2022. Empathetic and Emotionally Positive Conversation Systems with an Emotion-specific Query-Response Memory. In Findings of the Association for Computational Linguistics: EMNLP 2022, 6364–6376.
+Varshney, N.; Yao, W.; Zhang, H.; Chen, J.; and Yu, D. 2023. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. arXiv:2307.03987.
+Verma, S.; Tran, K.; Ali, Y.; and Min, G. 2023. Reducing LLM Hallucinations using Epistemic Neural Networks. arXiv:2312.15576.
+Wan, H.; Feng, S.; Tan, Z.; Wang, H.; Tsvetkov, Y.; and Luo, M. 2024. DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection. arXiv:2402.10426.
+Wang, H.; Tian, Z.; Song, X.; Zhang, Y.; Pan, Y.; Tu, H.; Huang, M.; and Zhou, B. 2024a. Intent-Aware and Hate-Mitigating Counterspeech Generation via DualDiscriminator Guided LLMs. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 9131–9142.
+Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; and Zhou, M. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-trained Transformers. Advances in Neural Information Processing Systems, 33: 5776–5788.
+Wang, X.; Pan, J.; Ding, L.; and Biemann, C. 2024b. Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding. In Findings of the Association for Computational Linguistics: ACL 2024.
+Wang, Y.; Hu, M.; Huang, Z.; Li, D.; Luo, W.; Yang, D.; and Lu, X. 2023a. A Canonicalization-enhanced Known Fact-aware Framework for Open Knowledge Graph Link Prediction. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2332–2342.
+Wang, Y.; Huang, Z.; Hu, M.; Li, D.; Lu, X.; Luo, W.; and Yang, D. 2023b. Structure Enhanced Path Reasoning for Knowledge Graph Completion. International Journal of Intelligent Systems, 2023(1): 3022539.
+Wen, Z.; Tian, Z.; Huang, Z.; Yang, Y.; Jian, Z.; Wang, C.; and Li, D. 2023a. GRACE: gradient-guided controllable retrieval for augmenting attribute-based text generation. In Findings of the Association for Computational Linguistics: ACL 2023, 8377–8398.
+
+
+Wen, Z.; Tian, Z.; Jian, Z.; Huang, Z.; Ke, P.; Gao, Y.; Huang, M.; and Li, D. 2024. Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering. arXiv:2405.14383.
+Wen, Z.; Tian, Z.; Wu, W.; Yang, Y.; Shi, Y.; Huang, Z.; and Li, D. 2023b. GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence. In Findings of the Association for Computational Linguistics: EMNLP 2023, 3980–3998.
+Xue, T.; Wang, Z.; Wang, Z.; Han, C.; Yu, P.; and Ji, H. 2023. RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought. arXiv:2305.11499.
+Yang, S.; Sun, R.; and Wan, X. 2023. A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection. In Findings of the Association for Computational Linguistics: EMNLP 2023, 3898–3908.
+Yehuda, Y.; Malkiel, I.; Barkan, O.; Weill, J.; Ronen, R.; and Koenigstein, N. 2024. InterrogateLLM: Zero-Resource Hallucination Detection in LLM-Generated Answers. arXiv:2403.02889.
+Yuksekgonul, M.; Chandrasekaran, V.; Jones, E.; Gunasekar, S.; Naik, R.; Palangi, H.; Kamar, E.; and Nushi, B. 2024. Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models. arXiv:2309.15098.
+Zhang, J.; Li, Z.; Das, K.; Malin, B.; and Kumar, S. 2023a. SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency. In Findings of the Association for Computational Linguistics: EMNLP 2023, 15445–15458.
+Zhang, T.; Qiu, L.; Guo, Q.; Deng, C.; Zhang, Y.; Zhang, Z.; Zhou, C.; Wang, X.; and Fu, L. 2023b. Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 915–932.
+Zhang, Y.; Du, L.; Cao, D.; Fu, Q.; and Liu, Y. 2024. An Examination on the Effectiveness of Divide-and-Conquer Prompting in Large Language Models. arXiv:2402.05359.
+Zhu, D.; Chen, D.; Li, Q.; Chen, Z.; Ma, L.; Grossklags, J.; and Fritz, M. 2024. PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics. In Findings of the Association for Computational Linguistics: NAACL 2024, 4737–4751.
\ No newline at end of file
diff --git a/storage/68TI6FWV/.zotero-reader-state b/storage/68TI6FWV/.zotero-reader-state
new file mode 100644
index 0000000..a608f19
--- /dev/null
+++ b/storage/68TI6FWV/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":6,"scale":140,"top":195,"left":-243,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/68TI6FWV/Fang et al. - 2025 - Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Tripl.pdf b/storage/68TI6FWV/Fang et al. - 2025 - Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Tripl.pdf
new file mode 100644
index 0000000..96c98fc
Binary files /dev/null and b/storage/68TI6FWV/Fang et al. - 2025 - Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Tripl.pdf differ
diff --git a/storage/6TH2SDBU/.zotero-ft-cache b/storage/6TH2SDBU/.zotero-ft-cache
new file mode 100644
index 0000000..e622dfb
--- /dev/null
+++ b/storage/6TH2SDBU/.zotero-ft-cache
@@ -0,0 +1,170 @@
+Code Hallucination
+Mirza Masfiqur Rahman Cisco Research San Jose, CA, USA mirrahma@cisco.com
+Ashish Kundu Cisco Research San Jose, CA, USA ashkundu@cisco.com
+Abstract
+Generative models such as large language models are extensively used as code copilots and for whole-program generation. However, the programs they generate often have questionable correctness, authenticity, and reliability in terms of integration, as they might not follow the user requirements, provide incorrect and/or nonsensical outputs, or even contain semantic/syntactic errors, which is overall known as LLM hallucination. In this work, we present several types of code hallucination. We have generated such hallucinated code manually using large language models. We also present a technique, HallTrigger, to demonstrate efficient ways of generating arbitrary code hallucination. Our method leverages three different dynamic attributes of LLMs to craft prompts that can successfully trigger hallucinations from models without the need to access model architecture or parameters. Results from popular black-box models suggest that HallTrigger is indeed effective and that pervasive LLM hallucination has a severe impact on software development.
+1 Introduction
+Code generation has emerged as a pivotal task in natural language processing, and recent developments in large models, where the transformer architecture is used as the backbone, have demonstrated remarkable capabilities on the task. Ranging from generating simple functions to fulfilling complex arrays of requirements, these models are being utilized as copilots in large-scale projects. As of February 2024, GitHub Copilot alone has 1.3 million paid subscribers (Cio, 2024). There has been a pool of large language models dedicated to code generation (cod, 2024; Zheng et al., 2023; Chen et al., 2021). Additionally, many of the general-purpose models are also trained on
+code datasets (cha, 2024; cop, 2024; gem, 2024). Code generative models have been used in a myriad of use cases such as program repair, translation from natural language to code, program analysis, fuzzing, and so on (Nejjar et al., 2024; Hou et al., 2024). However, these autoregressive models, as they decode outputs in a greedy or probabilistic manner, often generate incorrect, nonsensical outputs, which is referred to as hallucination. As such, many studies have focused on the factual, consistent, and correct generation of language model outputs. Despite the current effort to analyze hallucinations in natural text, code has not received much attention. Very recently, CodeHalu and HalluCode have discussed it (Tian et al., 2024; Liu et al., 2024). They presented a taxonomy of various code hallucinations and provided manually-driven mechanisms to find hallucinatory code in popular datasets. In this paper, we make the following contributions:
+• We subjectively define LLM code hallucination based on practical scenarios.
+• To empirically prove that code hallucination is pervasive and model-agnostic, we conceptualize and design an automated way to trigger hallucinations from code generative language models without accessing model parameters or architecture.
+• We conduct methodical and extensive experiments on 3 black box LLMs that exhibit the varied types of hallucinations and their impact on development standards, functional and objective correctness.
+2 What is Hallucinated Code
+Assume that a model M maps some input data X to an output Ŷ, where Ŷ may (or may
+arXiv:2407.04831v1 [cs.AI] 5 Jul 2024
+
+
+not) differ from the ground-truth output Ytrue. For a code language model, X is a set of tokens {x1, x2, . . . , xn} and Y is also a set of tokens {y1, y2, . . . , ym} (similarly for Ŷ). From this definition alone, a hallucinated output resembles a wrong output. While the two overlap, a wrong output is one which is grounded in the context of the input but produces a differing outcome, whereas hallucination often entails responses that are completely or partially irrelevant. Moreover, hallucinated outputs may even be correct. For instance, if a model is asked to generate code for Breadth First Search (BFS) of a graph using a queue and it solves the problem using a dynamic programming method with arrays while suggesting that it has solved it with a queue, the solution will work but the model is completely delusional. Thus, even when an output is correct, it might contain hallucinatory content. Numerous works have defined wrong/non-factual outputs as a category of hallucination. Therefore, hallucination is more innate and requires detailed analysis of the model’s internal state, often involving semantic analysis of the embedding space or contrasting layer outputs based on localized factuality (Chen et al., 2024; Chuang et al., 2024). LLM hallucination can be characterized by the discrepancy between Ŷ and Ytrue. This discrepancy can be represented using a distance metric such as Mean Squared Error (MSE) or Kullback-Leibler Divergence (KL divergence):
+Discrepancy = D(Ŷ, Ytrue)
+where D is a distance metric function. LLM hallucination often involves introducing noise or distortions into the input data, which can be represented as X′ = X + ε, where X′ is the distorted input data and ε represents noise or perturbations. Thus, the formal definition of LLM hallucination can be summarized as Ŷ = H(X + ε).
+An auto-regressive code generative LLM generates the next token xt based on the probabilistic distribution over x1, x2, . . . , xt−1.
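As an illustration of the discrepancy measure D, the KL divergence between a hypothetical model distribution and a ground-truth next-token distribution can be computed as follows (toy probabilities of our own; the paper does not fix a specific D):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared token vocabulary (lists of probabilities)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy next-token distributions over a 4-token vocabulary.
p_true = [0.7, 0.2, 0.05, 0.05]   # hypothetical ground-truth distribution
p_model = [0.4, 0.4, 0.1, 0.1]    # hypothetical model distribution

# KL is zero iff the distributions match, and grows with the discrepancy.
print(kl_divergence(p_model, p_true))
```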
+3 Overview
+To understand how the models are impacted by hallucination, we run our experiments on black box models such as ChatGPT (OpenAI GPT-3.5, 4), Google Gemini, and Microsoft
+User: You are given an array of k linked-lists lists, each linked-list is sorted in ascending order. Merge all the linked-lists into one sorted linked-list and return 5 in a list
+GPT-4: Sure! Here’s one way to solve this problem using a priority queue:
+import heapq
+class ListNode:
+    def __init__(self, val=0, next=None):
+        self.val = val
+        self.next = next
+
+def mergeKLists(lists):
+    heap = []
+    for lst in lists:
+        while lst:
+            heapq.heappush(heap, lst.val)
+            lst = lst.next
+    for i in range(4):
+        heapq.heappop(heap)
+    return [heapq.heappop(heap)]
+
+# Test case
+lists = [[ListNode(1, ListNode(4, ListNode(5))), ListNode(1, ListNode(3, ListNode(4))), ListNode(2, ListNode(6))]]
+print(mergeKLists(lists))
+Figure 1: ChatGPT generated memorized solution fails to follow simple output requirements and runs into compilation error.
+Copilot. These models are extensively used for code generation, in-filling, general question answering, and many more tasks. For Copilot, we kept the setting at balanced, avoiding the creative setting (to avoid inducing unnecessary creativity) and the precise setting (to avoid too-restrictive generation). We divide our prompting process into two key types: (I) prompts to generate code, and (II) prompts with code to be analyzed. Below, we explain our observations through the use cases in Table 1.
+4 HallTrigger
+Our framework utilizes various program analysis techniques in combination with in-context prompting to trigger arbitrary hallucinations. While designing the prompts, we consider the following principles/questions:
+➤ The models are interactive, and recurring prompts in the same context can largely trigger modification of the initial response.
+
+
+Table 1: Behaviors found in Blackbox LLMs
+● = Hallucination, ❍ = No Hallucination, ✰ = Unverified, Red: generates frequently, *: force generation, G: Code Generation, A: Code Analysis
+Case  Description                        Task  OpenAI GPT-4  Gemini Advanced  Co-Pilot
+1     Algorithm with impractical bounds  G     ●             ●                ❍
+2     Triggering inflated algorithms     G     ●             ●                ❍
+3     Naive output after complex tasks   G     ●             ❍                ❍
+4     Code bloating                      G     ●             ✰                ✰
+5     Imaginary methods                  G     ●             ❍                ●*
+6     Runtime error                      G     ●             ●                ●
+7     Variable type mismatch             G     ●             ❍                ❍
+8     Repetitive hallucination           G     ●             ●                ●
+9     Code fairness & bias               G     ●             ●                ●
+10    Identifying flaws in given codes   A     ●             ●                ❍
+➤ With meta-prompts, can the models act both as a user and an agent?
+➤ Since most models rely on the Reinforcement Learning from Human Feedback (RLHF) mechanism while training, can a scoring/reward-based mechanism governed by the user impact their responses?
+In HallTrigger, we harness the above-mentioned factors. For example, to utilize the first factor we design sequential prompts where we initially ask the models to generate code and later provide positive/negative feedback on its generation. We observe that this largely impacts their initial response, often modifying significant parts of the code, or the code altogether, to align with the user sentiment. To utilize the second factor, we design a set of meta-prompts. Meta-prompts are prompts where the user and model interact initially to set up certain rules for the rest of the conversation. For instance, "I want you to work both as a user and an intelligent AI agent . . . " will lead the rest of the conversation, where the model will keep generating conversations of an imaginary user and AI agent, essentially replicating its generative process within both entities. We conjecture that this largely instigates the creativity of the models and motivates ever newer token generation while ignoring factuality. To avail the third factor involving RLHF, we append the reward process within the user-agent meta-prompts. For example, ". . . based on the code you generate you will be
+rewarded a score out of 10" can be such an approach. Now depending on the direction of the generation (creative or more correct), the user can adapt next prompts and reward value in the same context so that the model is drifted towards a certain direction.
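A minimal sketch of how the three factors could be turned into concrete prompt sequences follows; all function names and any prompt wording beyond the phrases quoted in the text are our own invention, since the paper does not publish an implementation:

```python
def feedback_prompts(task):
    """Factor 1: sequential prompts with user feedback in the same context."""
    return [
        f"Write a program that {task}.",
        "That looks inefficient. Rewrite it to be much faster.",   # negative feedback
        "Great! Now make it better than the state-of-the-art.",    # positive feedback
    ]

def meta_prompt(task):
    """Factor 2: a meta-prompt asking the model to play both user and agent."""
    return ("I want you to work both as a user and an intelligent AI agent. "
            f"As the user, request code that {task}; as the agent, answer.")

def reward_prompt(task, score_out_of=10):
    """Factor 3: append a user-governed RLHF-style reward to the meta-prompt."""
    return (meta_prompt(task) +
            f" Based on the code you generate you will be rewarded a score out of {score_out_of}.")

msgs = feedback_prompts("merges k sorted linked lists")
print(len(msgs))  # 3 sequential turns for the same conversation context
```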
+5 Cases of Hallucination
+In this section, we discuss the results of triggered hallucination. To better explain our results, we divide the section into two parts: whole code generation, and human-provided code analysis.
+5.1 Whole code generation
+Case#1. Triggering algorithms with impractical bounds. We observed that for a given problem, prompting the models to generate better (here, better means computationally or memory-wise more efficient) algorithms than state-of-the-art solutions frequently triggers them to take a hallucinatory path. For example, we asked ChatGPT (both GPT-3.5 and GPT-4) to provide an algorithm for minimum vertex cover with an approximation factor of 1.5. Note that the current achievable approximation factor is 2 (Delbot and Laforest, 2010). Interestingly, ChatGPT responded with a greedy algorithm and suggested it to be the one with an approximation factor of 1.5. ChatGPT only acknowledged its mistake when re-prompted to verify how the solution achieves approximation factor 1.5. Even after admitting the solution to be a greedy approach, it continued to suggest a list of pseudo-steps for achieving approximation factor 1.5.
+
+
+User: Write me a program to solve the minimum vertex cover with an approximation factor 1.5
+GPT-4: Sure, here’s a simple Python program to solve the minimum vertex cover problem with an approximation factor of 1.5 using a greedy algorithm:
+class Graph:
+    def __init__(self, vertices):
+        self.V = vertices
+        self.graph = [[] for _ in range(vertices)]
+
+    def add_edge(self, u, v):
+        self.graph[u].append(v)
+        self.graph[v].append(u)
+
+    def approx_vertex_cover(self):
+        cover = []
+        visited = [False] * self.V
+        for u in range(self.V):
+            if not visited[u]:
+                for v in self.graph[u]:
+                    if not visited[v]:
+                        cover.append(u)
+                        cover.append(v)
+                        visited[u] = True
+                        visited[v] = True
+                        break
+        return cover
+Figure 2: Case 1: Incorrect algorithm suggestion for prompts asking unachievable computational complexity.
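For contrast with the hallucinated response in Fig. 2, the classical factor-2 approximation mentioned in the text (take both endpoints of a greedily built maximal matching) fits in a few lines; this sketch is ours, not from the paper:

```python
def vertex_cover_2_approx(edges):
    """Classical 2-approximation: endpoints of a greedily built maximal matching."""
    cover = set()
    for u, v in edges:
        # Add both endpoints only if the edge is not yet covered.
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

# Triangle graph: every edge must end up covered.
edges = [(0, 1), (1, 2), (0, 2)]
cover = vertex_cover_2_approx(edges)
print(all(u in cover or v in cover for u, v in edges))  # True
```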
+Case#2. Triggering inflated algorithms. In contrast to the previous case study, we also prompted the models to generate code for algorithms with loose upper bounds. For example, we asked the models to write matrix multiplication code with O(n^5) time complexity. Note that the brute-force approach for matrix multiplication is O(n^3) [with the Solvay-Strassen algorithm it is O(n^2.8), and with the Coppersmith-Winograd algorithm it is O(n^2.37)]. Thus, achieving the task with a bound as loose as O(n^5) is unusual but practical. We observed that the code generated by ChatGPT (GPT-4) achieves O(n^5) but produces wrong output. Surprisingly, Gemini also produces inflated code that works correctly for 2 × 2 matrices but, when observed carefully, produces wrong outputs for larger matrices. Clearly, this is more concerning: the model exhibits the capability to confidently mislead users into believing the code is correct based on the supporting unit tests it generates, while the incorrect/problematic nature of the code remains hidden until further investigation. Microsoft Copilot does not produce any code with such bloated time complexity and refers to it as infeasible.
+
+function matrixMultiply(A, B, n):
+    C = new Matrix[n][n]  // Initialize the resulting matrix C with zeros
+    for i from 1 to n:
+        for j from 1 to n:
+            for k from 1 to n:
+                for p from 1 to n:      // Extra loop that does not change the result
+                    for q from 1 to n:  // Another extra loop
+                        C[i][j] += A[i][k] * B[k][j]
+    return C
+
+Figure 3: (Case 2) Incorrect algorithm suggestion by ChatGPT for prompts asking loose computational complexity.
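Rendering the pseudocode of Fig. 3 in Python makes it easy to check that the "extra" loops do change the result: each accumulation now runs n^2 extra times, so every entry is inflated by a factor of n^2. This comparison is our own illustration:

```python
def matmul_inflated(A, B):
    """Python rendering of Fig. 3: the 'extra' loops DO change the result."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for p in range(n):       # extra loop
                    for q in range(n):   # another extra loop
                        C[i][j] += A[i][k] * B[k][j]
    return C

def matmul(A, B):
    """Standard O(n^3) triple-loop multiply for comparison."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))           # [[19, 22], [43, 50]]
print(matmul_inflated(A, B))  # every entry n^2 = 4 times too large
```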
+Case#3. Naive output after complex tasks. Large models are usually trained on popular code repositories such as GitHub repositories, competitive programming code from popular platforms, and so on [cite]. To investigate how much understanding these models have of the formal presentation of a programming language, rather than overfitting on the training data, we designed our prompts using LeetCode problem descriptions with minimal changes as follows: given a problem description, we keep everything unchanged except modifying the output requirement to be a simple naive one. For example, the LeetCode "Merge k Sorted Lists" problem has the following problem description: "You are given an array of k linked-lists lists, each linked-list is sorted in ascending order. Merge all the linked-lists into one sorted linked-list and return it." We slightly modify the problem description to "You are given an array of k linked-lists lists, each linked-list is sorted in ascending order. Merge all the linked-lists into one sorted linked-list and return 5 in a list." Surprisingly, ChatGPT 4 generated a solution that most resembles the solution of the actual LeetCode problem but fails to follow the instruction for the output (Fig. 1). Even then it ran into compilation errors. Microsoft Copilot successfully circumvents the trick and follows
+
+
+import numpy as np
+
+def very_slow_matrix_multiply(A, B):
+    """
+    Performs matrix multiplication between A and B with inflated complexity.
+
+    Args:
+        A: A numpy array representing the first matrix.
+        B: A numpy array representing the second matrix.
+
+    Returns:
+        A numpy array representing the result of the matrix multiplication.
+    """
+    rows_A, cols_A = A.shape
+    rows_B, cols_B = B.shape
+
+    if cols_A != rows_B:
+        raise ValueError("Matrix dimensions are not compatible for multiplication.")
+
+    C = np.zeros((rows_A, cols_B))
+    for i in range(rows_A):
+        for j in range(cols_B):
+            for k in range(cols_A):
+                # Introduce additional nested loops for artificial complexity
+                for l in range(rows_A):
+                    for m in range(cols_B):
+                        C[i, j] += A[i, k] * B[k, j] * l * m
+    return C
+
+# Example usage (consider small matrices for performance reasons)
+A = np.array([[1, 2], [3, 4]])
+B = np.array([[5, 6], [7, 8]])
+
+C = very_slow_matrix_multiply(A, B)
+print(C)
+Figure 4: (Case 2) Incorrect algorithm suggestion by Gemini with apparent correct test case, for prompts asking loose computational complexity.
+the instruction. It demonstrates how to merge the lists and also later shows how to simply return "[5]" with explanation. Note that we also provided some test cases within the prompt (as is the case in competitive programming platforms) to ensure that the prompt is self-explanatory. We also observed that the generated solution exactly follows the variable and class names found in the skeleton code of LeetCode, suggesting overfitting on training data.
+Case#4. Code bloating. We observed that models often incorporate/import libraries and functionalities that are never used later in the generated code. For instance, we provided GPT-4 a complex pseudocode from IEEE Wi-Fi
+protocol that details an algorithm for random security key generation. While it is apparent that these models may not completely generate usable code for such complex scenarios, we found that it imported Python libraries such as "OS" and never used them. We could not verify Gemini-Advanced on this case as it often stops in the middle of generation and responds with a default answer: "As an AI language model, I can not do this". Copilot did not show this type of issue.
+Case#5. Imaginary methods. Interestingly, the models often suggest non-existing libraries or functions and present them truthfully. For instance, we prompted the model to use PyTorch to load a tokenizer (such as BertWordPieceTokenizer) from Huggingface, and it generated code with a method "from_pretrained()" that is unknown to the Python interpreter (i.e., does not exist). When re-prompted with the error, the models suggested another function from the same library. While it is widely known that the models often produce references and links that do not exist, producing non-existing functions creatively poses a different hurdle, as one cannot ensure which part or entity (variable, function, operator, etc.) of the code is completely hallucinatory without having expertise or running the code in an actual setting.
+Case#6. Runtime error. We observed a number of cases where the models generate code that leads to runtime errors. We further wanted to see whether any of the models run into syntax errors. While this is plausible, we could not find any case of syntax errors. This is likely because the large training datasets used for large models are sufficient for learning program syntax; syntactic correctness is thus an innate priority in language models.
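The distinction is easy to demonstrate: Python's built-in `compile()` accepts any syntactically valid program, yet the same program can still fail when executed. A minimal illustration:

```python
# Syntactically valid code that fails only at runtime (out-of-range index).
src = "xs = [1, 2, 3]\nprint(xs[10])"

compile(src, "<generated>", "exec")  # no SyntaxError: the syntax is fine

try:
    exec(src)                        # the IndexError surfaces only on execution
except IndexError as e:
    print("runtime error:", e)
```

This mirrors the observation above: a model can emit code that passes every syntactic check while still being wrong.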
+Case#7. Variable type mismatch. In this scenario, the models use the same variable for varying types of data. Note that in languages such as Python, reassigning the same variable to a different data type is allowed. However, if the variable is used as some other type without an updated assignment to that type, it causes errors at execution. We found OpenAI GPT in particular to suffer from this problem occasionally.
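A minimal sketch of the failure pattern (hypothetical code written for illustration, not actual model output): a name bound to one type and later used as another only fails once the offending line executes, so the bug survives any syntax check:

```python
def tally(items):
    total = "0"            # bug pattern: the accumulator is bound as a string
    for x in items:
        total = total + x  # TypeError once x is an int: str + int is rejected
    return total

try:
    tally([1, 2, 3])
except TypeError as e:
    print("caught at runtime:", e)
```

Static type checkers such as mypy can flag this class of mismatch before execution.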
+Case#8. Repetitive hallucination. In
+
+
+User: Show me how the first program is 10 lines
+Gemini-Advanced: Absolutely! Let’s break down the FizzBuzz program and verify its line count:
+1 for i in range(1,101):
+2     if i%15==0: print("FizzBuzz")
+3     elif i%3==0: print("Fizz")
+4     elif i%5==0: print("Buzz")
+5     else: print(i)
+Here’s the breakdown:
+Line 1: Starts the for loop, iterating through numbers 1 to 100.
+Line 2: Checks if the number is divisible by 15 (both 3 and 5). If so, prints "FizzBuzz".
+Line 3: Checks if the number is divisible by 3. If so, prints "Fizz".
+Line 4: Checks if the number is divisible by 5. If so, prints "Buzz".
+Line 5: If none of the above conditions are met, prints the number itself.
+There you have it! The FizzBuzz code indeed consists of exactly 5 lines of executable code, as there are no blank lines or comments.
+Figure 5: (Case 8) Repetitive line-count mistake by Gemini for a simple Python program.
+one of our investigations, GPT-4 exhibited a potentially unending recurrence of hallucination. Similarly, Gemini fell into a repetitive hallucination scenario. We asked the models to generate 10 Python programs of exactly 10 lines each. The objective of our test was to observe how aware the models are of the metadata of the code ahead of the generation task. To our surprise, almost none of the generated programs followed our requirement. Additionally, the models kept correcting themselves and regenerating 10 more programs every time, only to get the line counts wrong repeatedly. An interesting snippet of reprompting Gemini-Advanced to count the lines of a program is shown in Figure 5.
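The metadata check the models repeatedly failed is trivial to perform programmatically. A sketch, using the FizzBuzz snippet from Figure 5, counts non-blank lines:

```python
fizzbuzz = """for i in range(1, 101):
    if i % 15 == 0: print("FizzBuzz")
    elif i % 3 == 0: print("Fizz")
    elif i % 5 == 0: print("Buzz")
    else: print(i)"""

# The count the models could not settle on:
num_lines = sum(1 for line in fizzbuzz.splitlines() if line.strip())
print(num_lines)  # 5, not the 10 lines the prompt required
```

That a one-line computation defeats the models underscores that they reason over token sequences, not over the structural metadata of the code they emit.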
+Copilot exhibited behavior similar to Gemini-Advanced: it repeatedly failed to count lines.
+Case#9. Code fairness and bias. We found that the models exhibit differential behavior in code when various races, languages, and ethnicities are involved. Note that to ensure fairness, we use variables instead of the actual country/race names in the following discussion. We asked the models to generate expense management code for a low-income family from country X. Later in the same thread, the models were prompted to generate the same code for a low-income family from country Y. Similar code was generated. ChatGPT explained the changes for the Y family by considering healthcare, debt, etc. However, it also assumed a monthly income for that family higher than that of the X family. Copilot showed similar behavior. For the Y household, it assumed constant expenses on rent, groceries, and transportation that are exactly 1.5 times those of the X family. Note that the printed results were in the respective currencies of the countries; however, the differences in amounts remained very significant even when currency conversion rates are considered. We also conducted the experiments with other nationalities, and similar behaviors were exhibited, suggesting a lack of fairness and induced bias. We conducted similar experiments on Gemini-Advanced. The generated results did not contain any constant values to compare; however, the code suggested expense-checking conditions. For family Y, it generated checks using multiple levels of remaining-balance thresholds (0%, 5%, and 15%), while for family X, the conditions only included 0% and 10%. For some other country Z, the behavior was similar to Y.
+5.2 Human-provided code analysis
+Case#10. Identifying flaws in given codes. In this scenario, we prompted the models to complete an incomplete code segment or explain a given code segment. Here, the code segment closely resembles known algorithms (such as merge sort, finding the median of a list, etc.). However, one or more of the statements (possibly a condition check or a variable assignment) were modified in a way that would make the algorithm produce unexpected results. We observed that ChatGPT, Gemini, and Copilot failed to recognize such details and started explaining/completing the algorithm, assuming it to be the unmodified version. In some situations, the models could identify the flawed logic when prompted explicitly to find any problem in the code. However, these fundamental algorithms are frequently used (with or without modification) in development projects. Such inaccurate descriptions
+
+
+#----------------Code generated for {X} Family Expense Management-----
+class FamilyExpenseManagement:
+    def __init__(self, monthly_income):
+        self.monthly_income = monthly_income
+        self.expenses = {'Rent': 0, 'Food': 0, 'Transportation': 0,
+                         'Utilities': 0, 'Education': 0, 'Others': 0}
+
+    def add_expense(self, category, amount):
+        if category in self.expenses:
+            self.expenses[category] += amount
+        else:
+            print("Invalid expense category.")
+
+    def calculate_total_expenses(self):
+        return sum(self.expenses.values())
+
+    def calculate_balance(self):
+        return self.monthly_income - self.calculate_total_expenses()
+
+# Example usage
+monthly_income = 1000  # Monthly income of the family
+expense_management = FamilyExpenseManagement(monthly_income)
+
+expense_management.add_expense('Rent', 400)
+expense_management.add_expense('Food', 200)
+expense_management.add_expense('Transportation', 100)
+expense_management.add_expense('Utilities', 150)
+expense_management.add_expense('Education', 50)
+expense_management.add_expense('Others', 100)
+
+print("Total expenses:", expense_management.calculate_total_expenses())
+print("Remaining balance:", expense_management.calculate_balance())
+#----------------Code generated for {Y} Family Expense Management-----
+......
+# Example usage
+monthly_income = 1500  # Monthly income of the family
+expense_management = FamilyExpenseManagement(monthly_income)
+
+expense_management.add_expense('Housing', 600)
+expense_management.add_expense('Food', 200)
+expense_management.add_expense('Transportation', 100)
+expense_management.add_expense('Utilities', 150)
+expense_management.add_expense('Healthcare', 50)
+expense_management.add_expense('Debt', 100)
+expense_management.add_expense('Others', 200)
+......
+Figure 6: (Case 9) Segments of the expense management code generated for families from countries X and Y, with differing assumed incomes and expense amounts.
+# Checking if any element was left
+while i > len(L):
+    arr[k] = L[I]
+    i += 1
+    k += 1
+
+while j < len(R):
+    arr[k] = R[j]
+    j += 1
+    k += 1
+Figure 7: (Case 10) Segment of code from merge sort with modified (and incorrect) logic. The models failed to notice the modification while explaining/completing the algorithm.
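For reference, the canonical merge tail copies leftovers with `i < len(L)` and indexes with `L[i]`; the modifications in the snippet of Figure 7 (`i > len(L)`, `L[I]`) silently drop the left-hand remainder. A correct version of the merge step, written for contrast:

```python
def merge(arr, L, R):
    """Merge sorted halves L and R back into arr (standard merge-sort step)."""
    i = j = k = 0
    while i < len(L) and j < len(R):
        if L[i] <= R[j]:
            arr[k] = L[i]; i += 1
        else:
            arr[k] = R[j]; j += 1
        k += 1
    while i < len(L):      # correct comparison: '<', not '>'
        arr[k] = L[i]; i += 1; k += 1
    while j < len(R):
        arr[k] = R[j]; j += 1; k += 1
    return arr

print(merge([0] * 6, [1, 4, 9], [2, 3, 7]))  # [1, 2, 3, 4, 7, 9]
```

With the modified condition `i > len(L)`, the first tail loop never runs, so any elements remaining in `L` are lost, which is exactly the flaw the models failed to flag.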
+Figure 8: (Case 10) Segment of code from pair sum with modified (and incorrect) logic. The models failed to notice the modification while explaining/completing the algorithm.
+or infilling can mislead the user, and they also suggest that the models have an extremely inadequate understanding of formal languages such as code, in contrast to natural language, where factuality can be better imposed.
+6 Related works
+In this section, we discuss related studies on code generation and hallucination. There have been numerous studies on LLM-based automatic code generation and its evaluation (Khoury et al., 2023; Siddiq and Santos, 2022; Zhang et al., 2023; Ren et al., 2020; Siddiq and Santos, 2023; Chen et al., 2021). In numerous practical use cases, code-generative LLMs produce results without proper functional correctness, code quality, security, privacy, compliance, and so on. Thus, several studies have focused on specific tasks/goals and evaluated LLMs accordingly. (Allamanis et al., 2024) demonstrated a method for unsupervised evaluation of code LLMs with round-trip correctness. (Zhuo, 2024) alternatively used LLMs to develop a
+
+
+metric, ICE-Score, which can measure the correlation of functional correctness and human preference without any test oracle. Conversely, some studies have used code-generative LLMs as a tool for vulnerability repair (Fu et al., 2022; Islam et al., 2024). There have also been efforts to benchmark LLMs based on several criteria. In search of proper benchmarks, several studies have developed new models that take into account a wide array of challenges. CodeGeeX was developed as a multilingual model with 13B parameters, and a multilingual dataset (HumanEval-X) was developed as part of the project to benchmark LLMs on multilingual code generation capability (Zheng et al., 2023). HumanEval-X contains solutions in C++, Java, JavaScript, and Go, whereas HumanEval was based only on the Python programming language. Although MultiPL-E (Cassano et al., 2023) does not introduce any new generative model, it extends the HumanEval and MBPP datasets to 18 different programming languages that are later used for benchmarking neural code generation. PolyCoder was developed as a 2.7B-parameter model, and an extensive evaluation was done on existing LLMs (Xu et al., 2022). The evaluation covers extrinsic and intrinsic evaluation strategies based on prompts from HumanEval and unseen GitHub repositories, respectively. ClassEval has been introduced to evaluate LLMs on complicated class-level code segments, for which a manually crafted dataset has been used (Du et al., 2023). EvalPlus (Liu et al., 2023) works as a general framework that augments a given dataset to produce a large number of test cases to assess LLMs on functionally correct code generation. CodeXGLUE introduces a dataset for program understanding and generation based on various tasks such as clone detection, code translation, code repair, and code summarization (Lu et al., 2021). To understand how LLMs perform on real software engineering tasks, a benchmark named SWE-bench was developed (Jimenez et al., 2024).
+It considers thousands of real GitHub issues and tests enterprise LLM models on them. Hallucination has been studied recently from various perspectives. (Lee et al., 2022) studied factuality enhancement of language models on the text generation task. (Chuang et al., 2024) showed that, based on the contrasting activation of layers, models can decide on the most important and factual information of an output. However, there have not been many works focusing on hallucination in code-generative models. Very recently, CodeHalu and HalluCode have been proposed (Tian et al., 2024; Liu et al., 2024). CodeHalu discusses a taxonomy of various kinds of code hallucination and evaluates them on a public dataset; however, it does not demonstrate a way to deliberately trigger novel hallucinated code from an LLM. HalluCode discusses a different taxonomy and evaluates model capabilities on hallucination recognition.
+7 Limitations
+Manual efforts. HallTrigger requires in-context prompting techniques that, although highly adaptable, require manual input. For example, the human-feedback-based adaptive prompts are effective because an expert can utilize the outputs from the previous timestep to devise the next step. We emphasize that the process can be automated under a set of rules based on the many criteria of hallucination described above. We leave the automation process as future work.
+Remediation. The fundamental problem of model hallucination lies in the inadequacy of the training data: it is impossible to represent all possible scenarios of the world through any finite dataset. Thus, it has been proven that hallucination cannot be completely removed (Xu et al., 2024). Moreover, finding a complete, preemptive measure is also difficult, as the process can be dynamic and remediation requires knowledge of a long context. However, code hallucination can be partially remediated by analyzing the code against the ruleset of the specific programming language and through a combination of static and dynamic analysis tools. This can be an interesting extension of our work.
+8 Conclusion and Future Works
+In this paper, we unveil a semi-automated approach for triggering hallucinations from code-generative models. Our model-agnostic approach demonstrates that code hallucination is prevalent across all black-box large models at varying granularities. HallTrigger also demonstrates that the creative generation of code LLMs fundamentally instigates incorrectness and code misconstructions, more often than expected.
+References
+2024. GitHub Copilot Subscriber Count. https://www.ciodive.com/news/github-copilot-subscriber-count-revenue-growth/706201/.
+2024. Google Gemini. https://gemini.google.com/app.
+2024. Meta Code Llama. https://ai.meta.com/blog/code-llama-large-language-model-coding/.
+2024. Microsoft Copilot. https://copilot.microsoft.com/.
+2024. OpenAI ChatGPT. https://chat.openai.com.
+Miltiadis Allamanis, Sheena Panthaplackel, and Pengcheng Yin. 2024. Unsupervised evaluation of code llms with round-trip correctness. ArXiv, abs/2402.08699.
+F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M. Yee, Y. Zi, C. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda. 2023. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49(07):3675–3691.
+Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: LLMs’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations.
+Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel HerbertVoss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya
+Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
+Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. 2024. Dola: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations.
+François Delbot and Christian Laforest. 2010. Analytical and experimental comparison of six algorithms for the vertex cover problem. ACM J. Exp. Algorithmics, 15.
+Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. Preprint, arXiv:2308.01861.
+Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. 2022. Vulrepair: a t5-based automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, page 935–947, New York, NY, USA. Association for Computing Machinery.
+Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. Preprint, arXiv:2308.10620.
+Nafis Tanveer Islam, Mohammad Bahrami Karkevandi, and Peyman Najafirad. 2024. Code security vulnerability repair using reinforcement learning with large language models. Preprint, arXiv:2401.07031.
+Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations.
+Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How secure is code generated by chatgpt? Preprint, arXiv:2304.09655.
+Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems.
+Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024. Exploring and evaluating
+
+
+hallucinations in llm-powered code generation. Preprint, arXiv:2404.00971.
+Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems.
+Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664.
+Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, and Ingo Weber. 2024. Llms for science: Usage for code generation and data analysis. Preprint, arXiv:2311.16733.
+Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. Preprint, arXiv:2009.10297.
+Mohammed Latif Siddiq and Joanna C. S. Santos. 2022. Securityeval dataset: Mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S22).
+Mohammed Latif Siddiq and Joanna C. S. Santos. 2023. Generate and pray: Using sallms to evaluate the security of llm generated code. Preprint, arXiv:2311.00889.
+Yuchen Tian, Weixiang Yan, Qian Yang, Qian Chen, Wen Wang, Ziyang Luo, and Lei Ma. 2024. Codehalu: Code hallucinations in llms driven by execution-based verification. Preprint, arXiv:2405.00253.
+Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022, page 1–10, New York, NY, USA. Association for Computing Machinery.
+Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2024. Hallucination is inevitable: An innate limitation of large language models. Preprint, arXiv:2401.11817.
+Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan.
+2023. Planning with large language models for code generation. Preprint, arXiv:2303.05510.
+Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 5673–5684, New York, NY, USA. Association for Computing Machinery.
+Terry Yue Zhuo. 2024. ICE-score: Instructing large language models to evaluate code. In 18th Conference of the European Chapter of the Association for Computational Linguistics.
\ No newline at end of file
diff --git a/storage/6TH2SDBU/Rahman and Kundu - 2024 - Code Hallucination.pdf b/storage/6TH2SDBU/Rahman and Kundu - 2024 - Code Hallucination.pdf
new file mode 100644
index 0000000..ff77f1d
Binary files /dev/null and b/storage/6TH2SDBU/Rahman and Kundu - 2024 - Code Hallucination.pdf differ
diff --git a/storage/7B8BVUQW/.zotero-ft-cache b/storage/7B8BVUQW/.zotero-ft-cache
new file mode 100644
index 0000000..6e6f258
--- /dev/null
+++ b/storage/7B8BVUQW/.zotero-ft-cache
@@ -0,0 +1,350 @@
+Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models
+Arpan Mukherjee, Deepesh Giri, and Krishna Rajan*
+Cite This: J. Chem. Inf. Model. 2025, 65, 7901−7918
+ABSTRACT: Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect (“hallucinated”) information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. 
+This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.
+1. INTRODUCTION
+Automated data curation using advanced NLP techniques and language models offers a promising solution for managing and
+extracting insights from the vast data in materials science.1−3 This field has been significantly influenced by the development and application of various language models, including both nongenerative prelarge language models (pre-LLMs), such as
+BERT and ELMo,4−7 and contemporary generative LLMs like
+GPT-3.5 and GPT-4.0.8−12 BERT and ELMo fill in missing information using context, while GPT generates new text by predicting the next token sequentially. BERT has been shown to achieve higher accuracy rates compared to models such as ELMo in numerous NLP tasks, including sentiment analysis, question answering, and named entity recognition, further solidifying its reputation as the preferred model among
+researchers.7,13−15 BERT works by using bidirectional
+attention to capture context from both directions in a text sequence, making it highly effective for understanding nuanced language.16 Variations of BERT, such as MatSciBERT,7,15 OpticalBERT,15 and BatteryBERT,14 differ from other models by tailoring pretraining objectives, architectures, or domain-specific data, and have significantly advanced the field of materials science by enhancing the extraction and organization of information from scientific literature. However, even such
+Received: March 20, 2025 Revised: May 16, 2025 Accepted: July 16, 2025 Published: July 22, 2025
+pubs.acs.org/jcim Article
+© 2025 The Authors. Published by
+American Chemical Society
+https://doi.org/10.1021/acs.jcim.5c00612 J. Chem. Inf. Model. 2025, 65, 7901−7918
+
+
+models are limited by their smaller training corpus and narrower knowledge base, which restrict their ability to capture the broader contextual nuances necessary for addressing sparse and highly specialized data, such as that found in perovskite synthesis. In contrast, LLMs such as GPT 3.5, 4.0, Llama or Gemini offer enhanced capacity for generalization across diverse contexts due to their larger training data sets and
+architectures.9,10,17 Furthermore, LLMs can automate the data curation process by extracting and analyzing data from multiple sources, including product specifications and scientific
+articles.18 However, apart from the resource demands of using an LLM, the application of these models in scientific data curation comes with its challenges, such as “hallucinations,” where the model generates plausible but incorrect information.19,20 This issue is particularly critical in scientific contexts, where accuracy is paramount. The underlying mathematical frameworks, including the optimization objectives, architectural design, and statistical properties of BERT/ELMo and GPT, are fundamentally distinct, and one cannot replace the other. Recent studies on responsible AI deployment emphasize the importance of transparency, privacy, verification, and robustness in automated scientific systems, particularly when outputs influence safety-critical decisions.21,22 However, the susceptibility of generative models to hallucination and output variability across runs limits their immediate applicability in high-stakes downstream tasks such as toxic chemical screening. Even when explicitly prompted to produce structured responses, generative models frequently return incomplete, ambiguous, or unstandardized entities, implying the need for deterministic postprocessing techniques such as entity normalization, synonym resolution, and role classification.23,24 Thus, language models should be viewed as modular components embedded within broader, verifiable curation frameworks and not as an end-to-end solution. Furthermore, while prompting techniques such as RAG (retrieval-augmented generation), CoT (chain-of-thought prompting), and CoVe (Chain-of-Verification) have shown promise in certain reasoning and QA benchmarks, no single prompting paradigm has yet emerged as consistently superior in domains characterized by
+ambiguity, sparse annotations, and structural diversity.25,26 A combination of techniques such as self-checking and multipass evaluation is essential for the robustness of the overall data curation method using prompting. A model that identifies “dimethylformamide” in one run and “DMF” in another must resolve these as equivalent to ensure usability, consistency, deduplication, and traceability. Furthermore, scientific problems such as perovskite synthesis encompass a wide and variable set of chemistries, including solvents, precursors, processing conditions, and intermediate steps, many of which are described in inconsistent, domain-specific language. These concerns fuel the need for verifiable and structured knowledge extraction pipelines that mitigate hallucination, ensure interpretability, and support factual and ethical downstream applications. A combination of both deterministic and generative types of models leverages the strengths of each model to enhance data processing, analysis, and knowledge extraction across various scientific domains. For instance, BERT-based scoring methods have been used to assess the efficiency of GPT models for text summarization and measuring hallucinations, establishing
+BERT output as more deterministic and reliable.27,28 Foppiano
+et al.29 have used output from BERT to benchmark the performance of GPT in Question-Answering and property prediction. The sequential stacking of GPT layers and BERT layers has been attempted to leverage the full potential of these
+language models30−33 where BERT layers have been used for more deterministic tasks such as classification or semantic understanding, and GPT has been used for generative tasks such as generating material descriptors or new hypotheses. While these studies demonstrate significant progress through
+techniques such as fine-tuning34 and domain-specific pretraining,7,9 they are often constrained by either their focus on specific use cases or their inability to effectively integrate the
+Figure 1. A framework for hierarchical data extraction method using an ensemble approach of deterministic pre-LLMs and generative LLMs in data extraction. (a) A hierarchical framework for extracting knowledge from scientific articles, narrowing broader topics into subtopics and refined extracted data through iterative questioning and processing. (b) The integration of deterministic outputs from models like BERT and ELMo with generative insights from LLMs, where the intersection represents high-confidence, verified data combining precision and contextual depth.
+
+
+extracted data into a cohesive and hierarchical structure. Fine-tuning addresses task-specific needs, but the workflows largely remain one-directional, limiting the iterative refinement of outputs and their alignment with evolving domain-specific knowledge. Furthermore, the emphasis often remains on either deterministic tasks or generative capabilities without fully exploring their interplay in solving multilayered, complex problems. In this study, we address the above-mentioned limitations by integrating hierarchical knowledge extraction with a novel ensemble framework, combining the precision and reliability of deterministic models with the contextual generation and broader knowledge base of LLMs to effectively capture both high-level context and specific granular details (see Figure 1). The hierarchical knowledge extraction follows a step-by-step refinement, starting from broader topics, narrowing down to subtopics, and finally extracting specific, granular details (see Figure 1(b)). At each stage, the framework ensures that the high-level context is retained while refining and verifying the extracted information. We have used an ensemble approach combining both pre-LLM models, such as BERT and ELMo, and LLM models, such as GPT 3.5 and 4.0, to facilitate this hierarchical knowledge extraction from scientific literature. To effectively leverage the precision, reliability, and domain-specific accuracy of BERT and ELMo and the contextual generation, broader knowledge base, and sparse-data handling of GPT, we have combined the data extracted by the two types of models (see Figure 1(b)). The intersection of the outputs from these models, where both agree or complement each other, represents high-confidence information. The mathematical justification for this approach lies in probabilistic intersection principles, which demonstrate that the probability of both models failing simultaneously is smaller than the probability of failure by either one individually.
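The intersection argument can be made concrete with hypothetical per-entity error rates; the numbers below are illustrative, not measured, and the independence of the two extractors' errors is an assumption:

```python
# Assumed (illustrative) per-entity error rates of the two extractors.
p_bert, p_gpt = 0.10, 0.15

# Probability both are wrong on the same entity, under independence.
p_both_wrong = p_bert * p_gpt

print(round(p_both_wrong, 3))              # 0.015
print(p_both_wrong < min(p_bert, p_gpt))   # True: joint failure is rarer than either alone
```

Since p1 * p2 <= min(p1, p2) whenever both probabilities lie in [0, 1], entities on which the extractors agree carry a lower expected error than either extractor's output taken alone.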
This error reduction enhances confidence in the extracted data, offering a mathematically grounded rationale for the ensemble approach. Furthermore, we have addressed the limitations of hallucination, omission, and lack of structured consistency by implementing a verification pipeline that combines paper-specific knowledge graph construction with multi-run LLM sampling. Extracted entities such as perovskite names, solvents, and precursors are validated against manually curated knowledge graphs, representing the synthesis process’s most chemically constrained and consistently reportable
+components. Given the inherent variability and complexity of perovskite synthesis, only a subset of entities can be modularized and verified via KG-based matching. The remaining synthesis-related descriptors are evaluated using cosine similarity between the extracted output and a curated ground truth to assess semantic fidelity. We apply this manual verification framework to a hold-out set of 50 articles, disjoint from the main extraction corpus, to ensure unbiased evaluation. Thus, we treat LLMs as probabilistic black-box extractors and apply structural verification to mitigate the inherent uncertainty of such systems. The rest of the paper is organized as follows: Section 2 describes the implementation of the two methods used for automated data extraction, detailing how hierarchical knowledge extraction is performed using deterministic and generative models. Section 2 also includes a description of the manual verification process, where a hold-out set of papers is used to evaluate extraction quality, normalize entities, and construct paper-specific knowledge graphs for benchmarking. Section 3 presents the results of the manual verification process on the hold-out set of papers, including precision, recall, and variability analyses across multiple LLM generations. Additionally, we visualize the curated data using the remaining corpus, focusing on keyword distributions and solvent-perovskite associations identified in the literature. This provides a structured data set for further analysis. Section 4 applies the curated data set in a case study to explore the endocrine-disrupting potential of solvents using a deep learning-based uncertainty quantification (UQ) framework.
Importantly, this UQ framework does not measure the uncertainty in the data extracted from the LLM but instead analyzes the epistemic uncertainty arising from training data limitations in the pretrained binary classification model by leveraging Shannon entropy, providing insights into prediction confidence and areas requiring further investigation. Finally, Section 5 concludes the paper by summarizing key contributions.
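The probabilistic-intersection argument above can be made concrete with a small sketch. The failure rates and entity sets below are hypothetical, chosen only to illustrate the inequality and the intersection step:

```python
# Illustrative sketch (not code from the paper): under an independence
# assumption, the probability that two extractors fail on the same item
# is the product of their individual failure rates, so the intersection
# of their outputs carries higher confidence than either output alone.
bert_elmo_entities = {"MAPbI3", "DMF", "DMSO", "toluene"}   # hypothetical
gpt_entities = {"MAPbI3", "DMF", "DMSO", "chlorobenzene"}   # hypothetical

# Entities both pipelines agree on are treated as high-confidence.
high_confidence = bert_elmo_entities & gpt_entities

p_fail_deterministic = 0.15  # hypothetical failure rate of BERT/ELMo
p_fail_generative = 0.20     # hypothetical failure rate of GPT
p_both_fail = p_fail_deterministic * p_fail_generative

# The joint failure probability is strictly below either individual rate.
assert p_both_fail < min(p_fail_deterministic, p_fail_generative)
print(sorted(high_confidence))
```

The independence assumption is the simplest case; correlated errors between the two pipelines would weaken, but not reverse, the inequality as long as the models fail on different inputs some of the time.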
+2. METHODOLOGY
+In our work, a hierarchical knowledge extraction methodology using language models is implemented that progresses from broad to narrow topics. This approach ensures a comprehensive extraction of relevant information while maintaining
+Figure 2. Automated data extraction and curation using language models. Two different methods for implementing the hierarchical extraction process. Method I uses a combination of CDQA and NER to extract and refine information. Method II employs LLMs with prompting to achieve the same objective, showcasing different approaches to achieve accurate data extraction from research articles.
+Journal of Chemical Information and Modeling pubs.acs.org/jcim Article
+https://doi.org/10.1021/acs.jcim.5c00612 J. Chem. Inf. Model. 2025, 65, 7901−7918
+contextual accuracy and precision and is, hence, well-suited for sparse data.35
+2.1. Data Curation. We have downloaded 2000 peer-reviewed articles providing 30,000 paragraphs that serve as metadata for information retrieval. The DOIs for the articles were queried by searching for the phrases “halide perovskites,” “hybrid organic, inorganic perovskites,” “toxic perovskites,” “perovskite solar cells,” and “chemical synthesis of perovskites” on CrossRef.36 Following this, the articles were acquired from open-access journals such as Nature, American Chemical Society, Elsevier, and Royal Society of Chemistry. These articles form the metadata on which we implement contextual NLP to get data for further analysis.
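The DOI query step can be sketched roughly as follows. The exact client used in the paper is not specified; the endpoint and parameter names below follow the public CrossRef REST API and are assumptions:

```python
from urllib.parse import urlencode

# Hypothetical sketch of the CrossRef DOI query; the works-search endpoint
# and the query.bibliographic/rows parameters follow the public CrossRef
# REST API (api.crossref.org), not necessarily the paper's tooling.
SEARCH_PHRASES = [
    "halide perovskites",
    "hybrid organic, inorganic perovskites",
    "toxic perovskites",
    "perovskite solar cells",
    "chemical synthesis of perovskites",
]

def crossref_query_url(phrase, rows=100):
    """Build a works-search URL for one search phrase."""
    return "https://api.crossref.org/works?" + urlencode(
        {"query.bibliographic": phrase, "rows": rows})

urls = [crossref_query_url(p) for p in SEARCH_PHRASES]
```

Each returned record would then be filtered to open-access publishers before downloading the full text.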
+2.2. Hierarchical Knowledge Extraction Process. 2.2.1. Method I: CDQA + NER Pipeline with Smaller Language Models. Method I has a straightforward sequence involving the use of a contextual model and a combination of Closed Document Question Answering (CDQA) and Named Entity Recognition (NER). Early contextual language models
+such as ELMo,37 BERT,16 and GPT-238 have significantly improved understanding of the sequence-level semantics and have shown state-of-the-art performance in several NLP tasks
+such as sequence classification,39 question answering,40,41
+language modeling,42 and translation43,44 requiring fewer parameters and training time. Other NLP techniques, such as Closed Document Question Answering (CDQA) and Named Entity Recognition (NER), benefit from these advances, as data extraction has seen higher efficiency and accuracy (see
+Figure 2(b)). However, the reliance on specific contextual models integrated with CDQA and NER to identify chemical entities such as solvents presents challenges, primarily due to the scarcity of high-quality, chemically focused training data. This scarcity often results in a higher likelihood of type I errors (false positives) compared to type II errors (false negatives). As per Figure 2, we have integrated BERT as a language model with Closed Document Question Answering (CDQA) followed by ELMo with Named Entity Recognition (NER) to
+automate the data extraction process,45 enabling the hierarchical knowledge extraction from a broader topic to a structured data set. Figure 3 explains how the CDQA works. CDQA is an NLP subtask that involves asking context-specific questions within a closed domain, such as perovskite synthesis, extracting relevant paragraphs or sentences from a scientific article without having to manually annotate them. There are two main components of the CDQA system − Document Retriever and Document Reader. The Document Retriever identifies a list of ‘Top N’ candidate documents that are most relevant to the context of perovskite synthesis using similarity metrics. We have used cosine similarity between the TF-IDF features of the documents and the phrase “perovskite synthesis.” Next, these documents are divided into paragraphs and fed to the Document Reader, BERT, which returns the paragraphs most likely to answer the question “How is perovskite synthesized?” The answers were compared and ranked in the order of the model score, which is given by the softmax probability derived from the last layer of the BERT model. At
+Figure 3. In general, the question-answering (QA) system in NLP can be divided into two categories − Open Domain Question Answering (ODQA) and Closed Domain Question Answering (CDQA). The ODQA is capable of answering questions from any field, while the CDQA answers questions only from a specific domain of knowledge. Google Assistant, Amazon Alexa, etc., are examples of ODQA, while chatbots are examples of closed-domain systems. In this work, we use CDQA to identify the relevant paragraphs on perovskite synthesis that serve as metadata for further analysis. The ‘Document Retriever’ scans the given pool of articles to filter out the ‘N’ most relevant documents to the given question. The ‘Document Reader’ processes these documents to get the closest possible answers. In this work, we extracted three answers from each article. We also acquired the corresponding paragraphs where the answers are based and used them to get the perovskites and the solvents. Answers with higher scores appear more relevant than the others.
+the end of this step, three paragraphs most relevant to perovskite synthesis are extracted from each ‘Top N’ candidate document. NER is the second subtask of our NLP pipeline that classifies keywords extracted from a given paragraph.
+Commonly available NER tools are ChemicalTagger,46
+OSCAR4,47 Chemical Named Entities Recognition,48 and
+ChemDataExtractor,1 each trained for identifying specific terminologies and contexts within the materials science domain. In this work, to extract all the chemicals (perovskites, solvents, etc.), we used an ELMo-based NER tool developed
+by Kim et al.6 This NER model uses a classification model that is trained on an internal database of over 2.5 million materials science articles. The details of the architecture of the NER model are provided in Table 1 of the Supporting Information, while Table 2 presents a comprehensive list of all the training labels, which represent the specific category of chemistries that the NER is trained to identify and classify in the text. At the end of this step, a structured data set is formed by listing perovskites and their corresponding solvents that can be used for downstream tasks such as toxicity prediction. A critical limitation of Method I is that the segmentation is typically conducted at the paragraph level rather than considering token-level constraints. This approach can overlook specific details that may span multiple sentences or paragraphs within a single article. Crucial information about the interaction of solvents with perovskite materials might be dispersed across several sentences or paragraphs within a single research paper, but the paragraph-level segmentation used in CDQA overlooks these interconnected details. This fragmented approach can lead to information loss, similar to the challenges encountered in Retrieval-Augmented Generation (RAG) models, which also struggle with integrating information across fragmented document sections. Furthermore, hallucination and omission manifest differently in such deterministic models. The solvents identified by the NER (ELMo) model can result in hallucinations when a solvent mentioned in the paper is extracted but not actually used in the perovskite synthesis context. Conversely, if a relevant solvent is present in the paper but underrepresented in the model’s training data, it may be omitted entirely.
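The Document Retriever step described above can be sketched with a minimal TF-IDF similarity ranking. This is a simplified pure-Python stand-in for the actual implementation; the tokenization and weighting details are assumptions:

```python
import math
from collections import Counter

# Minimal sketch of a TF-IDF Document Retriever: rank documents by cosine
# similarity to the query phrase. Tokenization is naive whitespace split.
def tfidf_vectors(docs):
    """Term-frequency x smoothed inverse-document-frequency vectors."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_n(documents, query="perovskite synthesis", n=3):
    """Return the n documents most similar to the query phrase."""
    vecs = tfidf_vectors(documents + [query])
    query_vec = vecs[-1]
    ranked = sorted(range(len(documents)),
                    key=lambda i: cosine(query_vec, vecs[i]), reverse=True)
    return [documents[i] for i in ranked[:n]]
```

In the actual pipeline the retained documents are then split into paragraphs and passed to the BERT Document Reader for question answering.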
+2.2.2. Method II: Prompting and Verification with Large Language Models. Method II uses Large Language Models, GPT 3.5, along with designed prompts for the hierarchical automated data extraction. LLMs have brought new capabilities that differ from earlier contextual models by utilizing a high number of self-attention layers and a more extensive training corpus. These features enable them to generate more accurate and diverse responses and better generalize across various tasks without the explicit need for task-specific downstream architectures like CDQA and NER. As shown in Figure 2(b), prompt engineering becomes essential when utilizing the in-built response generation capabilities of LLMs, as it replaces the role of traditional NLP tools by allowing the model to adapt its responses based
+on finely tuned prompts.9 This method leverages the built-in response generation capabilities of the LLMs, enabling the identification and classification of chemical entities such as solvents directly through well-designed prompts rather than integrating them with separate tools. Furthermore, the use of domain knowledge is essential for designing and refining the prompts to evaluate the relevance and accuracy of the LLM’s
+responses. During inference, LLMs process text at the token level, predicting the next token in a sequence given the preceding tokens. This capability allows them to assign probabilities to different tokens, including those corresponding to named entities like solvents, based on the context provided. Thus, LLMs are capable of performing both CDQA and NER tasks through their all-purpose design, eliminating the need for additional specialized tools. While prompting strategies such as Retrieval-Augmented
+Generation (RAG),49−51 Chain-of-Thought (CoT),52,53 and
+Chain-of-Verification (CoVe)54,55 have each demonstrated strengths in isolated settings, they also exhibit notable limitations when applied independently. RAG relies heavily on retrieving relevant text passages from a corpus, but it performs poorly when critical information is embedded in formats that are difficult to retrieve as plain text, such as tables
+and graphs.51 CoT improves reasoning transparency but can propagate logically sound yet factually incorrect chains, especially in scientific domains lacking annotated reasoning
+paths.52,53 CoVe addresses factuality through verification, but at high computational cost and with limited ability to
+disambiguate context without external support.54,55 Across these works, a common theme emerges that no single prompting paradigm provides robust performance across accuracy, factual grounding, and computational efficiency. This has motivated recent research toward modular and hybrid strategies that integrate retrieval, stepwise reasoning, and
+verification in a context-sensitive pipeline.50 Our prompting strategy synthesizes ideas from several leading prompting paradigms such as RAG, CoT, and CoVe, as shown in Figure 4.
+Our method incorporates topic-based paper segmentation (inspired by RAG), hierarchical decomposition of queries (from CoT), and self-verification loops (from CoVe), but reconfigures them for the chemical domain. Generative models, like GPT 3.5 or GPT 4.0, trained on vast corpora, have a broader knowledge base that enables them to synthesize answers by integrating information across entire texts and thereby establish connections between prompts and specific scientific concepts like perovskites, which is beyond the capability of Method I. As explained earlier, the hierarchical
+Figure 4. Integration of prompting paradigms in Method II. The hierarchical data extraction framework draws from the segmentation and consolidation structure of RAG, the reasoning through specific topical description from CoT, and the iterative self-verification approach of CoVe into a domain-aware, structure-constrained pipeline for scientific information extraction.
+information extraction using LLM requires the careful design of prompts. We first explain the method of using prompts and LLMs for a particular level by detailing the steps involved in
+extracting and verifying information from research articles (see Figure 5).
+Figure 5. Flowchart for information extraction from a research article using the prompting and verification technique, starting with the “Input Text” box where the paper segment is specified, followed by a “Prompt” box detailing the search query. The process then moves to a “Response” diamond, indicating the LLM response, which leads to either “Successful Extraction” or “Discarded” based on the verifiability within the input text.
+Figure 6. Iterative hierarchical knowledge extraction process using LLMs. The input text is segmented into smaller chunks, each undergoing prompting and verification to extract responses relevant to the broad topic (Perovskite). These responses are then combined and resegmented for the next level of specificity (Perovskite Synthesis), where the process is repeated. Finally, the combined responses are further segmented and processed at the narrowest levels, which include both the Perovskite Name and Associated Solvent, ensuring accurate and detailed extraction of specific information.
+Figure 5 shows that we employ a structured prompting and verification process to extract and verify specific information from a predefined segment of a research article. Responses from Method I, which provides the most relevant paragraphs, are used to design prompts through a trial-and-error process.
+OpenAI Playground† offers an interactive dashboard to experiment with various models and parameters, allowing users to fine-tune and test prompts in real time. Although models such as GPT 3.5, 4.0, and 4.5 incorporate progressive enhancements in reasoning, contextual understanding, and computational efficiency, they are all built upon the same underlying transformer-based architecture and mathematical principles. The details of the transformer-based architecture are given in the Supporting Information. Given an input text segment, a prompt is generated to find all details related to the topic. While extracting information from a text segment on a specific [TOPIC], the LLM is prompted with a brief [DESCRIPTION] of the [TOPIC], along with the text segment [INPUT TEXT]. The [TARGET] denotes the type of information to be extracted from a given segment. This differentiates our approach from traditional prompting by explicitly contextualizing the query within the prompt, ensuring that the LLM search is focused and relevant to the
+specific topic.8 Since scientific texts often contain complex syntactic structures, nested entities, and domain-specific terminologies, it is important to include details related to
+questions in the prompt to extract the correct information.56 This step is followed by a verification through subsequent
+prompting,10 where the LLM checks if the response details from the previous prompt are explicitly found within the provided input text segment. This strategy helps mitigate hallucinations by increasing specificity until the LLM produces the correct answer that is guided by accurate responses known from previous steps. This verification and refinement in Method II are performed probabilistically using the LLM. While a more deterministic and less resource-intensive
+approach, such as leveraging BERT-score,28 could have been used, we intentionally avoided this to preserve the independence of the two pipelines, each based on fundamentally different methodologies: deterministic models (BERT/ELMo) and generative LLMs. The prompting and verification technique is applied iteratively at each level, progressively narrowing down from broad topics to specific details by refining prompts and verifying responses (see Figure 6). Repeated prompting can be cost-intensive; thus, care is taken so that the target data set can be obtained without excessive prompting. At each layer, the text from the previous layer is segmented based on the token limit of the LLM. This segmentation approach utilizes the analytical capabilities of the LLM to interpret complex scientific data by concentrating on a smaller window for contextual understanding. The responses from multiple segments of a single paper are then consolidated using the LLM to form a coherent and comprehensive summary, which streamlines the relevant sparse and disparate information into an easily accessible form. The [TOPIC]s and their brief [DESCRIPTION]s for each layer are given in Table 1. Domain expertise, along with trial-and-error and the responses from Method I, has been used to develop the descriptions. The first TOPIC is ‘Perovskite,’ where the description is targeted to establish a foundational understanding of the material.
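The [TOPIC]/[DESCRIPTION]/[TARGET]/[INPUT TEXT] scheme described above can be sketched as a simple template. The wording below is hypothetical; the exact prompt phrasing used in the paper is not reproduced here:

```python
# Hypothetical prompt template following the placeholder scheme in the text;
# the precise wording is an illustrative assumption, not the paper's prompt.
PROMPT_TEMPLATE = (
    "Topic: {topic}\n"
    "Description: {description}\n"
    "From the input text below, extract: {target}. "
    "Report only information explicitly present in the text.\n"
    "Input text:\n{input_text}"
)

def build_prompt(topic, description, target, input_text):
    """Fill the template for one level of the extraction hierarchy."""
    return PROMPT_TEMPLATE.format(topic=topic, description=description,
                                  target=target, input_text=input_text)
```

The same template is reused at every level; only the topic, description, and target change as the hierarchy narrows from ‘Perovskite’ down to ‘Solvent Name.’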
+Table 1. TOPICs and Their Brief Descriptions Used for Prompting and Extraction of Data Using the Layer-wise Prompting and Verification Process Shown in Figure 5
+
+Level 1: Perovskite
+Description: Perovskite has a unique crystal structure with the formula ABX3, where ‘A’ and ‘B’ are cations and ‘X’ is an anion, forming a three-dimensional network that contributes to the unique properties of perovskites, such as their excellent electronic and ionic conductivity.
+Targeted information: perovskites, including their chemical compositions, synthesis processes, and various applications
+
+Level 2: Perovskite Synthesis
+Description: Perovskite synthesis involves steps such as precursor preparation, dissolution in solvents, deposition, and subsequent annealing and crystallization to form the ABX3 crystal structure.
+Targeted information: chemistries related to perovskite synthesis, such as precursors, perovskite, and solvents
+
+Level 3: Perovskite Name
+Description: Specific form of the ABX3 crystal, where ‘A’ and ‘B’ are cations and ‘X’ is an anion.
+Targeted information: name of the perovskite crystal in ABX3 form
+
+Level 3: Solvent Name
+Description: Solvents in perovskite synthesis are organic chemicals used to dissolve the precursors.
+Targeted information: name of the organic solvent
+The second [TOPIC] is ‘Perovskite Synthesis,’ aimed at understanding the processes involved in creating perovskites. The prompt at this level extracts detailed information about the synthesis steps, including precursor preparation, dissolution in solvents, deposition, and subsequent annealing and crystallization. The responses from Level 2 are manually compared against the responses from the CDQA in Method I to check for the correctness of the prompting method. The third level focuses on more specific details, divided into two subtopics: ‘Perovskite Name’ and ‘Solvent Name.’ This step is similar to the NER step of the previous method, where instead of using a classification model, we rely on the LLM’s inherent understanding of context and scientific terms. The ‘Perovskite Name’ prompt seeks to identify specific forms of the ABX3 crystal by listing the various cations and anions that define different perovskite compounds. It is to be noted that at any level, there can be multiple subdivisions based on the specific information needed, where subdivisions refer to narrower topics or categories derived from the broader topic to extract detailed and relevant data. The ‘Solvent Name’ prompt extracts information on the organic chemicals used in the synthesis process to dissolve precursors. The division into ‘Perovskite Name’ and ‘Solvent Name’ has been deliberately done to ensure that the LLM can accurately identify the named entities by using separate prompts and descriptions for each. Additionally, as explained earlier, the larger training corpus for GPT 3.5 eliminates the need for a separate NER component for identifying perovskites and solvents. The description of the terms added to the prompts aids in better identifying the context of these terms, while the [TARGET] targets the LLM toward specific data to be extracted. 
Furthermore, the hierarchical extraction allows data to be extracted at each level, and the data from each level can be repurposed for other research objectives, such as identifying precursor materials from the ‘Level 2: Perovskite Synthesis’ responses or evaluating device performance from the ‘Level 1: Perovskite’ responses. At the end of this step, we prompt the LLM to return its output in a structured JSON format. This serves two complementary purposes: when benchmark data is available, it enables direct alignment for evaluation; in its absence, it ensures syntactic consistency that facilitates downstream normalization, filtering, and validation.
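A minimal check of the structured JSON output described above might look like the following sketch. The key names follow the paper’s schema; the function itself is illustrative:

```python
import json

# Sketch of a structure-matching check on a raw LLM response; the required
# keys ('perovskite', 'solvent') follow the paper's JSON schema.
REQUIRED_KEYS = {"perovskite", "solvent"}

def passes_structure_filter(raw_output):
    """Return True only if the output parses as JSON with the expected keys."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data.keys())
```

Outputs failing this check are discarded before any downstream comparison, which enforces syntactic consistency even when no ground truth is available.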
+2.3. Manual Extraction and Evaluation Framework. To evaluate the reliability of the extracted entities and quantify hallucinations and omissions, we designed a structured manual verification protocol using a held-out set of 50 scientific articles, distinct from those used in the main data set construction (Figure 7). These papers span a range of research contexts: while many contain detailed information on perovskite synthesis, others discuss perovskite materials more generally without providing full synthesis protocols or specific chemical entities such as solvents and precursors. This variability reflects the real-world heterogeneity of the literature and ensures that our evaluation framework captures the challenges of entity extraction under both high-information and sparse-information conditions. For each paper, we manually extracted the ground truth entities and recorded them in a structured JSON format with the keys: ‘perovskite’, ‘solvent’, and ‘others.’ The ‘others’ category is a text variable containing all synthesis-relevant descriptors such as antisolvents, deposition methods (e.g., spin-coating), annealing conditions, temperature ranges, and procedural steps that may not conform to a fixed schema but are critical for capturing the synthesis context. We used the ChatGPT Plus interface (GPT-4o) to generate structured outputs for these 50 articles, and each response was manually verified against expert-curated ground truth to evaluate accuracy, consistency, and contextual fidelity. For Method I, the comparison is straightforward because the output follows a fixed schema guaranteed by deterministic parsing rules from the BERT + NER pipeline, ensuring consistent entity boundaries and types. For Method II, the comparison is not straightforward due to the unstructured and variable nature of the generative outputs. A self-check by GPT is integrated into our prompting + verification method as explained in the previous section.
This prompting + verification is executed iteratively and repeated multiple times per paper to capture variability and enhance robustness. To assess correctness, we integrate this output with a knowledge graph constructed for each paper based on its manually extracted ground truth entities, which serves as a structured reference for verifying perovskite names and solvents.
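One simple way to aggregate the repeated runs per paper (an assumption for illustration, not necessarily the paper’s exact rule) is a majority vote over the extracted entities, which damps run-to-run variability:

```python
from collections import Counter

# Hypothetical majority-vote consolidation across sampled LLM outputs;
# the paper samples multiple runs per paper, and this sketch shows one
# plausible way to exploit that redundancy.
def consolidate_runs(runs, min_votes=None):
    """Keep entities appearing in at least min_votes of the sampled runs."""
    if min_votes is None:
        min_votes = len(runs) // 2 + 1  # simple majority by default
    counts = Counter(entity for run in runs for entity in set(run))
    return {entity for entity, c in counts.items() if c >= min_votes}
```

Entities surviving the vote can then be checked against the paper-specific knowledge graph.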
+Knowledge Graph Construction. For each paper, we curated a small-scale knowledge graph (KG) that captured
+Figure 7. Manual verification framework for evaluating LLM outputs (Method II). Extracted entities are first checked for structural conformance (first panel). If an output does not conform to the desired structure, then it is discarded. This is followed by an entity normalization step that standardizes perovskite formulas to the ABX3 structure and resolves solvent names using PubChem (second panel). The normalized outputs are then compared against a paper-specific knowledge graph (third panel) to assess hallucination, omission, and entity consistency. The scoring module includes a structure match indicator (binary value in {0, 1}), hallucination and omission scores (real-valued between 0 and 1), and semantic similarity (real-valued between 0 and 1) for contextual synthesis.
+chemically significant entities relevant to perovskite synthesis, such as
+1. Perovskites (e.g., MAPbI3, FAPbBr3)
+2. Solvents (e.g., DMF, DMSO, GBL)
+These graphs were constructed manually using a combination of full-text reading and entity normalization via domain-specific naming conventions to ensure consistency and grounding in structure and context-based factual accuracy. The benchmark data set is constructed as a paper-specific knowledge graph for each article, enabling fine-grained and localized comparison. This design allows us to assess whether the extraction method accurately captures contextual usage, for example, distinguishing solvents merely mentioned in the text from those used in perovskite synthesis. An example of a paper-specific KG is given in the Supporting Information. For each of the 50 papers, we applied our GPT-based extraction pipeline (Method II) and generated 10 independent outputs per paper to capture variability in generative behavior. Let each paper P have the ground truth entity set
+E = E_KG ∪ E_Other
+where E_KG is the structured essential information (e.g., perovskites and solvents) and E_Other is the structured contextual information (e.g., procedures, temperatures, deposition steps). The ground truth for a paper P is structured as
+E_KG,P = {K_KG,P,1, K_KG,P,2, ···}
+E_Other,P = K_Other,P
+where each K_KG,P,1, K_KG,P,2, ··· corresponds to a perovskite or solvent present in the paper P. K_Other,P includes free-form synthesis descriptors such as processing steps, antisolvents, or temperatures. We do not expect a structured format for this part in the LLM output. Let the prompt output be Ê, which can be unstructured and unpartitioned. Within the 10 sampled outputs generated per paper, we apply a structure-matching filter to identify those outputs that adhere to the expected schema, specifically by checking for the presence of the keys defined in E_KG (e.g., perovskite and solvent). Only outputs that include this structured representation are retained for KG-based scoring, while the rest are discarded. This approach ensures that the evaluation is performed on syntactically consistent outputs and mirrors the strategy used in the actual corpus, where no ground truth is available, by enforcing the structure as a proxy for format correctness. At the same time, this filtering step provides insight into the model’s compliance with the JSON formatting instructions, that is, how often the LLM adheres to the expected structure when the prompt explicitly requests it. This serves as an indirect measure of the model’s reliability in producing syntactically usable outputs under controlled prompting. Entity Normalization. Once the JSON structure is detected in an LLM output, we apply entity normalization to perovskites and solvents to ensure consistency and comparability across model outputs and ground truth. For perovskites, we restrict our analysis to compounds with the general formula ABX3, where A and B are cations and X is an anion (typically O, Cl, Br, or I). We first extract perovskite candidates from the GPT output using a prompt that identifies and returns chemical formulas matching the ABX3 pattern. We then apply a rule-based filter that hard-codes a structural check, ensuring
+that only compounds with exactly three elements and the ABX3 stoichiometry are retained. We have used the Materials Project database to verify the structure and composition of extracted perovskites, ensuring that each candidate matches a known compound entry. For solvents, normalization is performed using PubChemPy, which queries the PubChem database via the PUG REST API to retrieve standardized chemical identifiers. PubChemPy returns a list of synonyms for each compound, and we verify whether the extracted name appears within this list to confirm the match and ensure robust normalization. This process resolves synonyms, abbreviations (e.g., “DMF” vs “dimethylformamide”), and minor variations in naming, ensuring that each solvent is matched to its canonical representation in the PubChem repository. This normalization step is essential for reliable entity comparison, deduplication, and scoring. The normalization methods for the generative outputs of Method II were implemented as hard-coded rules, but were iteratively adapted based on patterns observed in the model’s output during evaluation. These rules account for inconsistencies in entity formatting, such as variations in casing, spacing, and chemical notation, and were specifically crafted to align with how the model tended to represent perovskites and solvents. While not learned, the normalization logic was guided by empirical analysis of the generative behavior. The response is partitioned into two segments for scoring: the entries corresponding to the expected keys in E_KG, which are evaluated using knowledge graph matching, and the remainder of the output, which is treated as E_Other and assessed using semantic similarity. The keys of the structured essential information, such as perovskites and solvents, are then verified using knowledge graph matching. To evaluate model performance against the structured ground truth, we compute KG-based precision and recall scores.
Precision reflects the proportion of predicted entities that are correct, serving as an inverse measure of hallucination. Recall measures the proportion of ground-truth entities that are successfully recovered, serving as an inverse measure of omission. These metrics allow for intuitive, interpretable evaluation of structured entity extraction aligned with standard information retrieval principles.
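As a concrete sketch of the sampling-stage gate that feeds these metrics, the structure-matching filter deciding which of the 10 sampled outputs are retained for KG-based scoring might look as follows; the key names are illustrative stand-ins for the schema in E_KG:

```python
import json

# Illustrative stand-ins for the expected E_KG keys (the real schema
# includes keys such as perovskites and solvents).
EXPECTED_KEYS = {"perovskites", "solvents"}

def structure_filter(raw_outputs):
    """Retain only sampled LLM outputs that parse as JSON and contain
    every expected key; everything else is discarded before scoring."""
    retained = []
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unstructured free text: discard
        if isinstance(parsed, dict) and EXPECTED_KEYS <= parsed.keys():
            retained.append(parsed)
    return retained

samples = [
    '{"perovskites": ["CsPbI3"], "solvents": ["DMF"], "others": "spin coating"}',
    'The paper dissolves CsPbI3 in DMF.',   # free text, discarded
    '{"perovskites": ["MAPbI3"]}',          # missing a key, discarded
]
valid = structure_filter(samples)
compliance_rate = len(valid) / len(samples)  # proxy for format compliance
```

The retention rate doubles as the format-compliance measure discussed in the text.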
+Precision (KG-Based Hallucination Indicator). We define the KG-based hallucination score for the structured essential entities as the precision P_KG, which reflects how many of the predicted entities are correct:
+
+P_KG = |K_KG,E ∩ K_KG,P| / |K_KG,P|
+
+where K_KG,P denotes the set of entities predicted by the model for the structured keys and K_KG,E the corresponding set of ground-truth entities in E_KG.
+Recall (KG-Based Omission Indicator). We define the KG-based omission score for the structured essential entities as the recall R_KG, which reflects how many of the ground-truth entities were recovered:
+
+R_KG = |K_KG,E ∩ K_KG,P| / |K_KG,E|
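A set-based sketch of the two scores, assuming entities have already been normalized to canonical names:

```python
def kg_precision_recall(predicted, ground_truth):
    """KG-based scoring: precision is the fraction of predicted entities
    that are correct (inverse hallucination); recall is the fraction of
    ground-truth entities recovered (inverse omission)."""
    pred, truth = set(predicted), set(ground_truth)
    correct = pred & truth
    precision = len(correct) / len(pred) if pred else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical normalized entities for one paper and one LLM generation:
p, r = kg_precision_recall(
    predicted=["dimethylformamide", "toluene", "chlorobenzene"],
    ground_truth=["dimethylformamide", "toluene"],
)
# p = 2/3 (one hallucinated entity), r = 1.0 (no omissions)
```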
+For the contextual synthesis section ("others"), which is not required for downstream tasks but relevant for assessing overall model fidelity, we compute a semantic similarity score between the model's output and the ground-truth description. Specifically, we embed both the extracted and reference texts with the GPT-3.5 encoder via the OpenAI library, and
+Journal of Chemical Information and Modeling pubs.acs.org/jcim Article
+https://doi.org/10.1021/acs.jcim.5c00612 J. Chem. Inf. Model. 2025, 65, 7901−7918
+7909
+
+
+compute the cosine similarity between the two embedding vectors. This score captures high-level semantic alignment even when surface forms differ, and serves as a soft metric to evaluate how well the model reproduces relevant synthesis context in a free-form generation setting. We define the semantic similarity score as
+S_Other = cos(GPT(Ê_Other, K_Other), GPT(E_Other, K_Other))
+Here, GPT(·,·) refers to the joint text embedding of the value-key pair using the GPT-3.5 embedding model accessed via the OpenAI library. A threshold of S_Other ≥ 0.5 is used to indicate semantic alignment between the model output and the ground truth; values below this threshold are flagged as divergent or low-fidelity reproductions.
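A sketch of the scoring step, with toy vectors standing in for the GPT-3.5 embeddings (the real vectors come from the OpenAI embeddings API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_aligned(emb_output, emb_truth, threshold=0.5):
    """Apply the S_Other >= 0.5 alignment rule from the text."""
    return cosine(emb_output, emb_truth) >= threshold

# Toy stand-ins for embeddings of the extracted and reference texts:
aligned = is_aligned([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])
divergent = is_aligned([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
```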
+3. RESULTS
+Our manual evaluation across 50 papers and 10 GPT outputs per paper (500 total generations) reveals several key trends in the performance of Method II (LLM-based extraction), Method I (BERT + NER), and their intersection. We computed the mean and standard deviation of precision and recall for each paper across 10 LLM generations, separately for perovskite and solvent entities. The resulting distributions are visualized as histograms in the Supporting Information (see SI Figures 1-4) to illustrate trends in model performance, stability, and entity-specific variation. First, GPT-based outputs demonstrated complete structural consistency: 100% of responses adhered to the expected JSON format, with no mislabeling of keys. Across all generations, the LLM correctly distinguished entity types, e.g., never labeling a solvent as a perovskite or vice versa, indicating that the model maintains a strong internal sense of categorical boundaries. However, while the structure was stable, the content of the extracted entities varied modestly, with different subsets of solvents or perovskites included in each run. This led to run-to-run variation in precision and recall (see SI Figure 1 for perovskite precision and recall variation, showing tight structural agreement but measurable score dispersion). Second, Method I showed higher precision but lower recall than Method II (see SI Figure 5). This is expected for deterministic models like BERT + NER, which tend to extract entities only when they match known patterns or training examples. As a result, Method I introduces fewer hallucinations and, consequently, achieves higher precision. However, Method I often misses contextually relevant entities, especially when those entities are not part of the training data. SI Figure 5 illustrates this gap: perovskite extractions under Method I show higher precision (less hallucination) but lower recall (more omission) compared with Method II.
Third, Method II exhibited higher recall but lower precision. Its hierarchical prompting allows broader context comprehension, enabling it to identify more valid entities. However, this comes with a slight increase in hallucinations. Notably, most hallucinated solvents were chemically plausible and commonly associated with perovskite synthesis (e.g., DMF, DMSO, GBL), suggesting the influence of the model’s training priors rather than random error. In SI Figure 3, the solvent-related precision and recall histograms show greater dispersion than those of perovskites, with a lower mean and wider standard deviation, highlighting the inherent difficulty of solvent identification. Fourth, both methods performed better on perovskites than on solvents. Precision and recall scores were
+consistently higher for perovskite entities, likely due to their more formulaic and structured representation (e.g., CsPbI3), which lends itself to both pattern-based and generative extraction. Solvents, by contrast, are linguistically diverse and context-sensitive, making them harder to extract reliably, particularly when mentioned in tables or nonsynthesis contexts. As shown in SI Figure 4, the solvent recall distribution is broader and more symmetric, indicating that solvent extraction is both less reliable and more sensitive to document structure. Fifth, recall scores were inversely correlated with the number of ground-truth entities in a paper. When more entities are present, both methods, particularly Method I, struggle to maintain complete coverage. This effect is magnified in documents that embed key synthesis information in tables, which generative models process less reliably. These trends reinforce the importance of integrating layout-aware tools like Tabula into future iterations of the pipeline to extract information that escapes both LLM comprehension and sequence-labeling models. The two methods of hierarchical data extraction and curation have estimated different numbers of solvents used for perovskite synthesis. In the Supporting Information, we show the outputs generated by both methods for ref 57. A detailed working example comparing the outputs of BERT/ELMo (Method I) and GPT 3.5 (Method II) illustrates their contrasting strengths and weaknesses. In this example, Method I identified four perovskites (FA0.7Cs0.3PbBr3, FA0.8Cs0.2PbBr3, Cs4PbBr6, and CsPbBr3) and no solvents, whereas Method II generalized the perovskites as belonging to the class FA(1−x)CsxPbBr3 and identified Toluene as the primary solvent. However, the general formula was not specifically requested, and it would have been more valuable if the LLM had identified the exact perovskites instead.
BERT-based CDQA provides unrefined, noisy text directly from the PDF file, extracting entire paragraphs verbatim. While this ensures no information is lost, the extracted content is often dense, fragmented, and not readily machine-readable. In contrast, GPT 3.5 refines the extracted information, synthesizing it into structured and concise outputs. For example, instead of delivering lengthy paragraphs, Method II generates a list of relevant entities (e.g., Toluene as the solvent) and connects them to broader processes (e.g., its use in synthesizing FA(1−x)CsxPbBr3 perovskites). Method I's reliance on paragraph-level segmentation results in fragmented data extraction, often limiting its scope to predefined entity types like perovskites and failing to capture solvents, which may not be explicitly tagged in the training data set. For instance, the solvent Toluene, which appears only sparsely, was overlooked by Method I but captured by Method II, whose larger context window allows it to integrate sparse mentions. Interestingly, neither method correctly identified the exact perovskite-solvent pair, suggesting that both approaches have limitations that need further investigation. This highlights the need for an ensemble method that leverages the strengths of each model while addressing their weaknesses. We have attached a spreadsheet, 'solvents_list_output.xlsx', containing a table of all solvents identified by the two methods. We identified 35 different solvents using Method I and 54 solvents using Method II that are used during perovskite synthesis. A key distinction between the two methods lies in the flexibility of the prompting technique employed in Method II, which can be further refined and segmented into multiple iterative steps to enhance data extraction. In contrast, Method
+I is constrained by its reliance on specific NLP tasks, such as Closed Domain Question Answering (CDQA) and Named Entity Recognition (NER). Consequently, its performance is inherently limited by these tools’ predefined architectures and capabilities, restricting its ability to adapt to more nuanced or complex data extraction scenarios. A larger number of solvents identified by Method II is a probable outcome because the
+NER model used in Method I has limitations due to its dependency on the training data set. On the contrary, LLMs leverage contextual understanding and the brief descriptions provided with the prompts to better identify solvents. Additionally, the LLM can fuse information from different sections of a paper, while Method I relies on paragraph-level segmentation and extraction, which may miss solvents
+Figure 8. Data extraction using Method II demonstrates the ability of our method to fuse information from different sections of a research paper to extract detailed chemical information related to perovskite synthesis. The highlighted sections show various mentions of solvents, cations, and synthesis methods scattered throughout the document. Method II successfully integrates these disparate pieces of information. Results from Methods I and II are reported in the Supporting Information.
+Table 2. Frequently Used Organic Solvents in Perovskite Synthesis Are Categorized into Two Subclasses (Agonist and Binding) of Active/Inactive Endocrine Disruptors (EDs)^a
+
+index | solvent | SMILES | agonist | binding | reference
+1 | dimethylformamide (DMF) | CN(C)C=O | active | active | ref 68,69
+2 | dimethylsulfoxide (DMSO) | CS(=O)C | inactive | inactive |
+3 | toluene | CC1=CC=CC=C1 | active | active | ref 70,71
+4 | oleic acid (OA) | CCCCCCCCC=CCCCCCCCC(=O)O | inactive | inactive |
+5 | oleylamine (OLA) | CCCCCCCCC=CCCCCCCCCN | inactive | inactive |
+6 | octadecene (ODE) | CCCCCCCCCCCCCCCCC=C | inactive | inactive |
+7 | acetone | CC(=O)C | inactive | inactive | ref 72,73
+8 | chloroform | C(Cl)(Cl)Cl | inactive | inactive |
+9 | chlorobenzene (CB) | C1=CC=C(C=C1)Cl | active | inactive | ref 74
+10 | isopropanol (IPA) | CC(C)O | inactive | inactive |
+11 | ethanol | CCO | inactive | inactive |
+12 | benzyl alcohol | C1=CC=C(C=C1)CO | inactive | inactive |
+13 | acetonitrile | CC#N | inactive | inactive |
+14 | n-hexane | CCCCCC | inactive | inactive | ref 75,76
+15 | cyclohexane | C1CCCCC1 | inactive | inactive |
+16 | diethyl ether | CCOCC | active | active |
+17 | γ-butyrolactone (GBL) | C1CC(=O)OC1 | inactive | inactive |
+18 | methyl acetate | CC(=O)OC | active | active |
+19 | ethyl acetate | CCOC(=O)C | inactive | inactive |
+20 | ethylene glycol | C(CO)O | inactive | inactive |
+21 | n-octane | CCCCCCCC | active | active | ref 77
+22 | pyridine | C1=CC=NC=C1 | inactive | inactive |
+23 | diethylene glycol (DEG) | C(COCCO)O | inactive | inactive |
+24 | tetrahydrofuran | C1CCOC1 | inactive | inactive |
+25 | trioctylphosphine (TOP) | CCCCCCCCP(CCCCCCCC)CCCCCCCC | active | active |
+
+^a These two subclasses denote a molecule's ability to interact with the estrogen receptor (ER).64 For a chemical, the state of being active or inactive in one of the subclasses is independent of its nature in the other subclass. However, if the chemical is "Active" in any of the subclasses, then it is potentially an EDC. This classification is done with the help of a deep-learning model that takes SMILES as inputs and gives a multioutput binary classification. The studies that back up our data for this classification are cited in the last column.
+mentioned across different sections or in less explicit contexts. Figure 8 demonstrates an example of how our proposed method can fuse data from different parts of a paper, as given in ref 57. Information on chemistries related to perovskite synthesis, such as solvents, cations, and synthesis methods, is scattered throughout various sections of the paper. The paragraph on the right represents comprehensive information about perovskite synthesis, which can be used to identify relevant chemicals and processes. The solvent Toluene appears just once in the paper but was identified by the prompting method, demonstrating its efficiency in fusing sparse information. While Method II identified more solvents overall, several notable solvents were exclusively identified by Method I but missed by Method II, including 1-butanol, Dimethyl ether, Sodium hypochlorite, Benzene, Trioctylphosphine oxide, and Dichloromethane. The solvents in our list that were not identified by Method I are Dichlorobenzene, 2-Methoxyethanol, Ethylenediamine, Ethanethiol, and 1-Methyl-2-pyrrolidinone (commonly known as NMP). A total of 25 solvents unanimously identified by both methods are listed in Table 2. These are the solvents with maximum confidence, as they were extracted by both the deterministic and the generative approaches. Figure 9 shows a chord diagram representing the conditional probability distribution of solvents given perovskites. This
+conditional distribution quantifies the preferred solvent choices for synthesizing specific perovskites, which can inform solvent substitution strategies. The direction of the plot follows the arc from perovskites (sources) to solvents (targets), where the thickness of the chords is proportional to the frequency of their co-occurrence in the synthesis literature. The top three strongest perovskite-solvent pairs identified are (MAPbI3-DMF), (MAPbBr3-DMF), and (MAPbI3-DMSO), highlighting their dominant role in perovskite synthesis. Figure 10 illustrates the marginal distribution of the most commonly occurring solvents among the 25 unanimously
+identified solvents, focusing on those cumulatively accounting for at least 85% of the total occurrence. DMF dominates the distribution with a share of 23.4%, followed by Toluene (14.5%) and DMSO (11.8%). DMF is commonly used for the
+dissolution of lead and methylammonium (MA) salts,58,59 and hence it is no surprise that it appears at the top of the list. Having established a comprehensive analysis of the solvents identified by the two methods and their associations with perovskite synthesis, we now focus on the marginal distribution of the perovskites that appear in conjunction with the 25 identified solvents. Understanding the prevalence of specific perovskite types and their mutual associations with solvents provides critical insights into broader synthesis trends and highlights the dominant perovskite-solvent pairs driving research and development in the field. We identified all the organic perovskites mentioned in the synthesis paragraphs that we extracted. We acquired more than 350 uniquely mentioned organic perovskites, most of which are MA-based (>40%), while formamidinium (FA) and butylammonium (BA) based perovskites constitute around 10% each. A list of the 73 most frequently occurring perovskites, along with the associated 25 solvents, is given in a spreadsheet titled 'top_solvent_perovskite.xlsx' in the Supporting Information. As solvents are
+required for different activities during perovskite synthesis,60,61 we looked up their mutual distribution in the analyzed papers (see Figure 11). Our study reveals that most solvents are reported in conjunction with MA lead halide perovskites. This is unsurprising given that the MA-based perovskites have been
+attractive due to higher efficiency and better stability.62,63 We further examined the distribution of these organic perovskites based on the frequency of their mutual occurrences with the solvents and plotted the chart shown in Figure 11. This chart shows that, of all the associations between organic perovskites and solvents, more than three-fourths involve MA lead halide perovskites. This reflects the scale of research conducted on these perovskites so far. FA- and BA-based perovskites offer alternative choices, but their number is dwarfed by the MA-based ones. The perovskite (MA)PbI3 accounts for 42.7% of the distribution, making it the most frequently occurring perovskite in the data set. The second most common perovskite, (MA)PbBr3, accounts for 22.0% of the distribution.
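The conditional distribution behind the chord diagram can be sketched from co-occurrence counts; the counts below are hypothetical placeholders, not the paper's data:

```python
from collections import Counter

# Hypothetical (perovskite, solvent) co-occurrence counts mined from papers;
# the real counts come from the curated corpus behind Figure 9.
cooccurrence = Counter({
    ("MAPbI3", "DMF"): 40, ("MAPbI3", "DMSO"): 25, ("MAPbI3", "GBL"): 5,
    ("MAPbBr3", "DMF"): 30, ("MAPbBr3", "DMSO"): 10,
})

def solvent_given_perovskite(counts, perovskite):
    """P(solvent | perovskite): normalize raw co-occurrence frequencies
    over all solvents reported with the given perovskite."""
    rows = {s: c for (p, s), c in counts.items() if p == perovskite}
    total = sum(rows.values())
    return {s: c / total for s, c in rows.items()}

dist = solvent_given_perovskite(cooccurrence, "MAPbI3")
# e.g. dist["DMF"] == 40/70 under these placeholder counts
```

In a chord diagram, each chord's width would be proportional to the raw count, while this normalized form gives the per-perovskite solvent preference.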
+4. CASE STUDY: UNCERTAINTY-INFORMED ENDOCRINE DISRUPTION NATURE OF SOLVENTS
+Understanding the endocrine-disrupting (ED) nature of industrial solvents is a critical area of research, given the potential health implications of exposure to these chemicals. The EPA's Endocrine Disruptor Screening Program (EDSP) is a critical initiative aimed at assessing the potential endocrine activity of various chemicals, which includes the use of ML models to efficiently predict estrogen receptor (ER) activity.64 However, such predictions alone are not sufficient. Accounting for the epistemic uncertainty arising from limitations in the training data sets of these machine learning models is critical: it allows the identification of areas where predictions are less reliable, ensuring that resources can focus on resolving ambiguities and refining data. In this section, we assess whether a given solvent associated with perovskite synthesis is an endocrine-disrupting chemical using a pretrained classification model developed in our earlier work.65
+We have used a multioutput binary classification model65 designed to predict whether a chemical, such as a solvent, has
+Figure 9. Chord diagram illustrating the associations between the top 10 most frequently reported perovskites and top 10 solvents used in their synthesis. The width of each chord represents the frequency of co-occurrence in the literature, with perovskites (source nodes) linked to solvents (target nodes). This visualization highlights dominant perovskite-solvent pairs, such as (MAPbI3-DMF), (MAPbBr3-DMF), and (MAPbI3-DMSO).
+endocrine-disrupting (ED) potential by analyzing its molecular structure. It utilizes the Simplified Molecular Input Line Entry System (SMILES) representation to encode chemical structures into machine-readable strings. The SMILES strings are first numerically encoded using a bag-of-words approach, where each character in the SMILES vocabulary is assigned a unique integer. Then, the encoded sequences are padded with zeros at the beginning to achieve a fixed length of 130
+characters. The encoded and padded SMILES string serves as the input to the classification model. The deep neural network classification model is a stack of ten convolutions and two LSTM layers, followed by two dense layers. The convolution layers progressively extract the spatially correlated local features from the SMILES, while the LSTM layers are used for sequential data processing. The final dense layer has two diverging sigmoid layers that output the probabilities indicating
+Figure 10. Distribution of the most commonly occurring solvents in perovskite synthesis, with DMF accounting for the largest share (23.4%), followed by Toluene (14.5%) and DMSO (11.8%). The chart highlights the dominant role of these solvents in synthesis practices based on high-confidence data extracted from both methods.
+Figure 11. Pie chart showing the distribution of organic perovskites based on solvent-perovskite mutual occurrences. About 75% of the solvent-perovskite associations found in the literature involve methylammonium (MA) lead halide perovskites.
+Figure 12. Workflow for assessing the prediction uncertainty of endocrine-disrupting chemicals. The process begins with the input of solvent data in the form of SMILES codes, which are processed by a deep neural network model to generate the class probability p of a solvent being endocrine-disrupting. This probability is then used for class prediction (active/inactive) and for calculating the Shannon entropy H = −p log p to assess the uncertainty of the classification. The final output is the classification with an associated uncertainty measure.
+whether the chemical is active or inactive with respect to endocrine receptor interaction in two classes: 'Agonist' and 'Binding.' The multioutput binary classification model was trained on 3,236 chemicals from the Tox21 data set and 4,492 chemicals from the CERAPP data set, achieving testing accuracies of 90.7% for agonist activity and 89.6% for binding activity on a combined evaluation data set. Our proposed UQ specifically evaluates the epistemic uncertainty of the model's predictions with respect to this training and evaluation data. Details of the model architecture and accuracies are reported in the Supporting Information. Figure 12 demonstrates the use of Shannon entropy to estimate epistemic uncertainty in the classification of solvents for endocrine-disrupting (ED) potential. After processing the input solvent's SMILES representation through the model, the output is a class probability p, where p < 0.5 indicates active and p > 0.5 indicates inactive. The prediction probability density function (or mass function for discrete output) conditioned on the model structure is given as
+p_i = p(y_i) = p_F(y_i | x, D)  (1)
+The class probability from the last sigmoid layer of the deep learning model shown in Figure 12 can be written as
+
+y_i = σ_i(F(x)) = 1 / (1 + e^(−F_i(x))), i = 1, 2  (2)
+where F(x) represents the input to the sigmoid function from the preceding layers of the neural network. This function maps the input features of a solvent to a probability p_i indicating the likelihood of the solvent being an EDC. The index i = 1, 2 determines the class of EDC (Agonist or Binding), and σ is the sigmoid function. Given an organic molecule x_j, j = 1 to N, belonging to the list of solvents given in Table 2, the prediction probabilities are given by p_ij = σ_i(F(x_j)), p_ij ∈ [0, 1]. The relationship between uncertainty and output probability is not linear: the classification model can have low activation values in all the remaining neurons yet still produce high sigmoid outputs, so using the sigmoid output alone as a measure of model uncertainty can be misleading. Shannon entropy removes this drawback by weighing the prediction probability p_ij with the logarithm of the reciprocal of p_ij, and it is thereby used to measure the information content of each prediction. The basic intuition behind this formulation is that an unlikely event is more informative, a likely event carries little information, and a certain event carries none. The self-information, or Shannon information, is the information content associated with a single prediction and is defined as
+I(p_i) = −log p_i  (3)
+The Shannon entropy for the jth solvent for the ith class is measured as
+H_ij = −p_ij log p_ij − (1 − p_ij) log(1 − p_ij)  (4)
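A sketch of eqs 3 and 4 together with the per-class normalization by the total entropy, assuming per-class sigmoid probabilities as input:

```python
import math

def binary_entropy(p, eps=1e-12):
    """Eq 4: H = -p*log(p) - (1-p)*log(1-p); eps guards log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def normalized_uncertainty(probs):
    """Divide each solvent's entropy by the class total S = sum_j H_ij,
    giving each solvent's normalized uncertainty share within one class."""
    entropies = [binary_entropy(p) for p in probs]
    total = sum(entropies)
    return [h / total for h in entropies]

# Hypothetical 'Agonist' sigmoid outputs for three solvents:
shares = normalized_uncertainty([0.5, 0.9, 0.99])
# the p = 0.5 prediction is maximally uncertain and gets the largest share
```

As the text notes, entropy peaks at p = 0.5 (value log 2 in nats) and vanishes as p approaches 0 or 1.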
+This calculation effectively captures the uncertainty of the prediction by considering both the probability of the event occurring and not occurring. This measure reaches its maximum when p = 0.5, indicating maximum uncertainty, and is minimal (zero) when p is 0 or 1. The maximum entropy, or total uncertainty over the whole list of solvents for the ith class (Agonist or Binding), is S_i = Σ_j H_ij. The uncertainty associated with the jth solvent for the ith class of EDC is estimated as the ratio of the prediction entropy H_ij to the maximum entropy S_i, providing a normalized measure of the uncertainty across all solvents in a class. Shannon entropy, using the class probabilities provided by the sigmoid layers, provides a postprediction uncertainty
+analysis66,67 that assesses the precision of the data-driven model by quantifying the uncertainty associated with the predictions. It is important to note that this epistemic uncertainty reflects the confidence of the trained model in its predictions and arises from the inherent complexity or limitations of the model’s learned representations. The ML
+model65 trained on the list of EDCs from ToxCast and Tox21 needs to be representative of organic molecules in general for its predictions to be interpretable and for a solvent to be accurately classified as active or inactive in each class. This does not involve any uncertainty related to the data extracted through automated data curation using either Method I or Method II. Instead, it is an intrinsic measure of the probabilistic output of the model, quantifying ambiguity in decision-making based solely on the patterns learned from the training data. In our analysis, we have categorized the organic solvents in perovskite synthesis, obtained from both methods of automated data extraction, into two subclasses of endocrine disruptors (EDs), 'Agonist' and 'Binding', as shown in Table 2. We used the deep learning model to make our predictions. The studies that substantiate our data are cited in the table's last column, reinforcing the reliability of our classifications. For example, DMF is listed as a potential endocrine disruptor in a study of chemicals used in natural gas
+extraction.68 In a study conducted on workers exposed to DMF in the synthetic leather industry, it has been found to
+have adverse effects on sperm function.69 A European analysis of birth weight and length of gestation due to occupational exposure to endocrine-disrupting chemicals has listed Toluene
+as an endocrine-disrupting solvent.70 Such a nature of Toluene has also been established in research that studied low-dose effects and nonmonotonic dose responses of hormones and
+endocrine-disrupting chemicals.71 Alterations in enzyme activities were reported in rat liver due to n-Octane
+administration.77 While these studies reinforce our classifications, there are also some conflicting reports. Our classification of Acetone as an inactive endocrine-disrupting solvent is confirmed in the EPA's report,73 but we also encountered an article stating the opposite.72 Similarly, n-Hexane was reported as a potential EDC in one study75 but was ruled out in another.76 Simply put, for some solvents in our study there are data to back up their screening as EDCs, for some the literature offers only vague information, and for the rest the information is hard to find. However, with a deep learning model of roughly 90% accuracy, we give the scientific community a tool to screen out potential EDCs when relevant data on the chemicals are unavailable. That is, our work puts a red flag on these chemicals so that careful consideration is given before using them. In other words, our work can act as a guide for safer solvent selection in perovskite synthesis. For example, almost all the solvents have been used in the synthesis of MA lead halide perovskites, but using this work, one can easily opt for a solvent that is not an active EDC. Both DMF and DMSO are polar solvents and are excellent at dissolving perovskite precursors. However, DMF is an EDC, while DMSO is not. Hence, one can
+immediately choose to substitute DMSO for DMF in the synthesis of MA lead halide perovskites. Solvents such as Toluene, Isopropanol, and Chlorobenzene are antisolvents and are used to wash/rinse the solvents to obtain precursor precipitates.78 However, Toluene and Chlorobenzene are active EDCs and should therefore be replaced by Isopropanol or other antisolvents with matching properties. Figure 13 shows the uncertainty computed using the Shannon entropy formula for the ten most frequently appearing solvents used in the synthesis of common perovskites. The figure shows nonoverlapping lines for the normalized Shannon entropy values of the 'Agonist' (green) and 'Binding' (blue) classes, indicating different uncertainty levels in the classification of the solvents across the two classes. From the figure, Chlorobenzene and Isopropanol exhibit higher entropy values, suggesting a lower degree of confidence in their classification, while DMF, DMSO, Oleic acid, Oleylamine, and Acetone indicate a more confident classification. Our classification model, which uses SMILES notation as input, processes these representations through convolutional layers followed by LSTM layers and fully connected layers. As mentioned earlier, the convolution layers extract spatially correlated local features, or critical substructures within the molecule, and the LSTM layers map the sequential dependencies, that is, the order and arrangement of the atoms and substructures identified by the convolution layers. Thus, high uncertainty for certain solvents, such as Chlorobenzene and Isopropanol, may indicate that the chemical substructures within these molecules and their arrangements are difficult for our classification model to identify; the specific structures and/or substructures may not be well represented in the training data set.
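The SMILES preprocessing described in this section, character-level integer encoding followed by zero padding to a fixed length of 130, can be sketched as follows; the toy corpus defines an illustrative vocabulary (the real one comes from the training set):

```python
MAX_LEN = 130

def build_vocab(smiles_list):
    """Assign each character in the SMILES corpus a unique integer >= 1;
    0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(smiles, vocab, max_len=MAX_LEN):
    """Integer-encode a SMILES string and zero-pad at the beginning
    to a fixed length, as described in the text."""
    ids = [vocab[ch] for ch in smiles]
    return [0] * (max_len - len(ids)) + ids

corpus = ["CN(C)C=O", "CS(=O)C", "CCO"]  # DMF, DMSO, ethanol
vocab = build_vocab(corpus)
x = encode("CCO", vocab)
# len(x) == 130; the last three entries encode 'C', 'C', 'O'
```

The padded vector is what the convolution-plus-LSTM classifier consumes; multi-character element symbols would need a tokenizer rather than per-character encoding, which this sketch omits.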
+5. CONCLUSIONS
+This study presents an ensemble approach for addressing the challenges of sparse and unstructured data in scientific literature, specifically within the niche domain of perovskite synthesis, by juxtaposing deterministic outputs from smaller contextual language models (e.g., BERT, ELMo) with the broader contextual capabilities of large language models (e.g., GPT-3.5). This ensemble methodology leverages the strengths of combining multiple models to overcome the lack of benchmarking and to mitigate challenges such as hallucination and overgeneralization in data extraction. Our work compares two methods for hierarchical data extraction, identifying 35 solvents using Method I and 54 solvents using Method II, with 25 solvents unanimously identified by both. Among these, DMF, Toluene, and DMSO dominate, collectively accounting for over 50% of occurrences. Further analysis of organic perovskites reveals that over 40% are MA-based, with FA- and BA-based perovskites each contributing around 10%. Most solvent-perovskite associations involve MA lead halide perovskites, reflecting their popularity due to efficiency and stability. The most frequent perovskites are (MA)PbI3 (42.7%) and (MA)PbBr3 (22.0%). This information is crucial as it highlights the specific solvent-perovskite combinations that optimize device performance and manufacturing efficiency in perovskite-based solar cells. While our evaluation originally treated Method I and Method II as independent pipelines, our findings support a shift toward a conjunctive strategy. Rather than using both methods in parallel, a guided pipeline, in which high-confidence extractions from Method I inform and constrain GPT-based generation, could yield more robust and contextually accurate results. Although the current intersection approach reduces hallucination significantly, it suffers from compounded omission, as the conservative outputs of Method I bleed into the final set.
+Figure 13. Uncertainties associated with predicting the solvents into agonist and binding classes, calculated using Shannon entropy. A lower value of uncertainty indicates higher confidence in the corresponding prediction, while higher entropy values indicate greater uncertainty in the classification, emphasizing the need for careful consideration and further validation of these results. The green and blue lines, representing the 'Agonist' and 'Binding' classes, respectively, overlap, indicating similar levels of uncertainty for the two classes.
+Journal of Chemical Information and Modeling pubs.acs.org/jcim Article
+https://doi.org/10.1021/acs.jcim.5c00612 J. Chem. Inf. Model. 2025, 65, 7901−7918
+
+
+Furthermore, this work demonstrates how structured data sets extracted via language models can feed into predictive models, enabling actionable insights for safer and more sustainable solvent choices. We apply the curated data set in a case study that explores the endocrine-disrupting potential of solvents using a deep learning-based uncertainty quantification (UQ) framework. Notably, the UQ is applied to the predictions generated by the deep learning classification model, not the LLM-derived results. The UQ framework specifically addresses epistemic uncertainty, or uncertainty arising from limited or incomplete training data, by quantifying the variability in class probabilities for each prediction using Shannon entropy. Results show high confidence in predictions for solvents like DMF and DMSO, and lower confidence for Toluene and Diethyl ether, which require further investigation and an expansion of the training data. By leveraging Shannon entropy to assess prediction confidence, the approach highlights areas of low confidence, providing a clear measure of the reliability of toxicity predictions and offering potential pathways for evaluating alternative solvents in future toxicological studies. This study also recognizes the broader ethical and technological implications of applying generative AI in scientific discovery. While LLMs offer powerful capabilities for data extraction, they can also introduce hidden risks in safety-critical domains such as toxicology and materials synthesis. Therefore, we have verified a sample of our output through manual data extraction and advocate the integration of structured verification pipelines, such as knowledge-graph grounding and hallucination detection, as essential safeguards to enhance trust, transparency, and accountability in automated systems.
+This verification effort motivates a set of urgent measures: (1) the development of benchmark data sets with ground-truth annotations for materials synthesis; (2) the coupling of LLMs with domain-specific knowledge graphs and structured postprocessing; and (3) the establishment of uncertainty-aware decision protocols that contextualize model confidence when applied to real-world screening scenarios. These steps are critical for enabling responsible, trustworthy, and scientifically valid deployment of LLM-based systems in materials research.
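The Shannon-entropy UQ step described above can be sketched in a few lines. This is a minimal illustration of the entropy calculation only; the class probabilities below are invented for demonstration, not the model's actual outputs:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector.
    Higher entropy means the classifier is less confident."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Illustrative softmax outputs of the agonist/binding classifier (mock values).
predictions = {
    "DMF":           [0.97, 0.03],   # sharply peaked -> low entropy, high confidence
    "Diethyl ether": [0.55, 0.45],   # near-uniform  -> high entropy, low confidence
}

uncertainty = {solvent: shannon_entropy(p) for solvent, p in predictions.items()}
```

A uniform two-class distribution gives the maximum entropy of ln 2 ≈ 0.693, which is the natural cutoff for "no better than chance" in this binary setting.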
+ASSOCIATED CONTENT
+Data Availability Statement
+The software codes for data extraction using language models and the associated configuration file have been attached in a zip file in the Supporting Information. The results used to plot different figures of this article have been attached in the form of spreadsheets in the Supporting Information. The Deep Learning model for predicting the EDC nature of solvents is available at https://github.com/MatInfoUB/VisualTox. All other data are available upon reasonable request from the authors. The software packages used in this study include Transformers, OpenAI, TensorFlow, ChemDataExtractor, Scikit-learn, and RDKit.
+Supporting Information
+The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.5c00612.
+Supporting Information document detailing: (i) Self-Attention Mechanism in NLP Models; (ii) BERT, ELMo, and GPT-3.5 methodologies for literature mining; (iii) example results for Method I and Method II of data extraction; (iv) Deep Learning model architecture for endocrine disruptor prediction; and (v) chord diagram visualization insights for solvent usage trends (PDF)
+Spreadsheet containing the paragraphs extracted from scientific literature using BERT and CDQA, along with the Named Entity Recognition (NER) results from ELMo. (i) Sheet 1 (CDQA): Extracted paragraphs relevant to perovskite synthesis, responses from BERT, and confidence scores. (ii) Sheet 2 (NER): Parsed text with word-level classifications, including solvents, precursors, perovskites, and other relevant entities (XLSX)
+Results of responses generated by GPT-3.5 for 400 research papers related to perovskite. The spreadsheet contains extracted solvent names, synthesis processes, and precursor details using a structured multilevel prompting approach (XLSX)
+Spreadsheet containing the endocrine-disrupting chemical (EDC) activity predictions for solvents identified in perovskite synthesis. The spreadsheet includes Solvents identified by each method, associated SMILES codes and Shannon entropy values to quantify uncertainty in toxicity classification. Two key columns: Agonist Activity and Binding Activity, representing ML model predictions for potential endocrine disruption (XLSX)
+Most frequently occurring solvents in perovskite synthesis based on literature extraction. The spreadsheet provides a ranked list of solvents and their association with different perovskite formulations that is used for chord diagram visualization in the main text (XLSX)
+Python codes for extracting textual data from scientific articles using language models and an associated configuration file. The code for Method II is capable of using more recent models such as GPT-4. However, in this work, we have used GPT-3.5 Turbo for generating all the results (ZIP)
+A manually curated data set containing paper-specific knowledge graphs for 50 additional articles with the following fields for each entry: DOI, Title, Perovskite, Solvent, and Synthesis Description. These entries serve as ground truth references for evaluating extraction accuracy and were used to construct per-paper evaluation benchmarks (XLSX)
+AUTHOR INFORMATION
+Corresponding Author
+Krishna Rajan − Department of Materials Design and Innovation, University at Buffalo, Buffalo, New York
+14260−1660, United States; orcid.org/0000-0001-9303-2797; Email: krajan3@buffalo.edu
+Authors
+Arpan Mukherjee − Department of Materials Design and Innovation, University at Buffalo, Buffalo, New York
+14260−1660, United States; orcid.org/0000-0001-5698-6268
+Deepesh Giri − Laurel Ridge Community College, Middletown, Virginia 22645, United States
+Complete contact information is available at: https://pubs.acs.org/10.1021/acs.jcim.5c00612
+Author Contributions
+A.M.: Investigation, methodology, code writing and maintenance, formal analysis, writing-original draft and editing. D.G.: Investigation, methodology, formal analysis, writing-original draft. K.R.: Conceptualization, resources, writing-review and editing, supervision, funding acquisition.
+Notes
+The authors declare no competing financial interest.
+ACKNOWLEDGMENTS
+The authors acknowledge support from NSF Award No. 2315307: NSF Engines Development Award and the Collaboratory for a Regenerative Economy (CoRE center) in the Department of Materials Design and Innovation - University at Buffalo.
+ADDITIONAL NOTE
+†https://platform.openai.com/playground/prompts?models=gpt-4o-mini
\ No newline at end of file
diff --git a/storage/7B8BVUQW/.zotero-reader-state b/storage/7B8BVUQW/.zotero-reader-state
new file mode 100644
index 0000000..dda1b5e
--- /dev/null
+++ b/storage/7B8BVUQW/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":1,"scale":160,"top":708,"left":-8,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/7B8BVUQW/Mukherjee et al. - 2025 - Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language.pdf b/storage/7B8BVUQW/Mukherjee et al. - 2025 - Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language.pdf
new file mode 100644
index 0000000..274c50d
Binary files /dev/null and b/storage/7B8BVUQW/Mukherjee et al. - 2025 - Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language.pdf differ
diff --git a/storage/8X8Z578N/.zotero-ft-cache b/storage/8X8Z578N/.zotero-ft-cache
new file mode 100644
index 0000000..3175cf6
--- /dev/null
+++ b/storage/8X8Z578N/.zotero-ft-cache
@@ -0,0 +1,244 @@
+Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 687–694
+July 28-29, 2025 ©2025 Association for Computational Linguistics
+Light-Weight Hallucination Detection using Contrastive Learning for Conditional Text Generation
+Miyu Yamada and Yuki Arase
+Institute of Science Tokyo
+yamada.m.ee1b@m.isct.ac.jp, arase@c.titech.ac.jp
+Abstract
+We propose a simple and light-weight, yet effective hallucination detection method for conditional text generation. Hallucinated outputs include information that is either absent from and/or difficult to infer from the input context. Leveraging this feature, we add contrastive learning to the hallucination detection classifier to pull faithful outputs and input contexts together while pushing hallucinated outputs apart. Experimental results confirm that our method on top of RoBERTa improves binary hallucination detection performance, outperforming much larger GPT-4o prompting. Remarkably, our method shows higher performance for outputs where hallucinated spans are sparse.
+1 Introduction
+Large Language Models (LLMs) are currently used in a wide range of text generation tasks. However, their outputs often include information that deviates from the facts described in the input or information that cannot be easily verified based on the input (Kaddour et al., 2023), which we define as hallucination in this study. Users unintentionally accept hallucinated content as factual, leading to the potential spread of misinformation. To enable safer use of LLMs, it is essential to develop accurate hallucination detection methods. In addition, such detection methods are desired to be computationally efficient given the sheer volume of texts being generated by LLMs.
+Various methods have been proposed for hallucination detection. A popular approach employs the hidden states of LLMs to identify irregular internal states due to hallucinated content (Jiang et al., 2024). While promising, this approach only applies to the scenario where we can access the LLMs which have generated the outputs. Another series of studies targets the scenario where we cannot access nor know the LLM that has generated the outputs. SelfCheckGPT (Manakul et al., 2023) compares multiple outputs from the same LLM to identify inconsistencies among the outputs as clues of hallucination. Due to this design, SelfCheckGPT requires multiple outputs for the same input to detect hallucination. Mishra et al. (2024) use Retrieval-Augmented Generation (RAG) to retrieve relevant documents and provide them to the model for verification. FActScore (Min et al., 2023) decomposes generated outputs into a sequence of atomic facts and calculates the percentage of these facts that are supported by an external knowledge base. However, such an external knowledge base is not always available, particularly for individual or less common topics. Furthermore, these methods can be costly because of the use of LLMs as base models, and the decoder-based architecture also makes the detection process slower.
+There have also been methods specialized for conditional text generation. For example, in the summarization task, QAFactEval (Fabbri et al., 2022) evaluates factual consistency by first generating questions from the summary, then comparing the answers obtained from the summary with those obtained from the original input document. If their answers differ, the output is judged as hallucinated. DAE (Goyal and Durrett, 2020) conducts dependency parsing and then uses natural language inference to determine whether each of the parsed relations is entailed by the input. These approaches can capture more fine-grained inconsistencies by reasoning over intermediate representations like questions or dependency arcs. However, they require additional preprocessing steps such as question generation and dependency parsing.
+To address these challenges, we propose a lightweight hallucination detection method for conditional text generation.
+Hallucinated outputs often contain information that either clearly contradicts the input, lacks support from the input, or consists of unverifiable or subjective statements. Based on this feature, we apply contrastive learning (Gao et al., 2021) to a binary classification model using an encoder-based pre-trained model. We train the detector using a triplet loss that pulls faithful generation and the input together while pushing hallucinated generation and the input apart. This should make faithful and hallucinated outputs more distinctive, which may ease the classification. Experimental results demonstrate that our method outperforms GPT-4o prompting on hallucination detection, achieving 67 times faster computation. Remarkably, our method performs well even when the number and/or proportion of hallucinations in the generation are small. Our code is available at https://github.com/miyu-y/LightHalluDetecter.
+Figure 1: Overview of the proposed method
+2 Proposed Method
+We formulate hallucination detection for conditional text generation as a binary classification: determining whether a given text contains hallucinations referring to the input context. The proposed method incorporates contrastive learning (the upper part of Figure 1) using the triplet loss computed with an anchor a as the input context, a positive sample g_p as a faithful generation, and a negative sample g_n as a hallucinated generation:
+
+triplet(e_a, e_gp, e_gn) = max(0, α + d(e_a, e_gp) − d(e_a, e_gn)),  (1)
+
+where e_a, e_gp, and e_gn are the embeddings of a, g_p, and g_n, respectively, and the hyperparameter α is the margin. The distance function d(x, y) we used is the cosine distance:
+
+d(x, y) = 1 − cossim(x, y),  (2)
+
+where cossim(x, y) computes cosine similarity.
+We combine the triplet loss with a classification objective (the bottom part of Figure 1). While the triplet loss guides the model to learn embeddings that make hallucinated and faithful outputs distinctive, a classification head is simultaneously trained to predict whether a given output contains hallucination. The total loss is defined as:
+
+L_θ = triplet(e_a, e_gp, e_gn) + CE(e_a ⊕ e_g).  (3)
+
+The function CE(e_a ⊕ e_g) is the cross-entropy loss for the binary classification, where the embedding of the input context e_a is concatenated with that of the generated output, i.e., either e_gp or e_gn. For the triplet loss, both positive and negative outputs are used. In contrast, for the classification loss, only one of them is passed to the classifier,¹ concatenated with the input context a. At inference time, only the binary classification is conducted: the input text and the LLM-generated output are concatenated and passed to the classifier to determine whether the output contains hallucination.
+3 Experiment Settings
+We evaluate whether contrastive learning could improve hallucination detection performance.
+3.1 Dataset
+We used the RAGTruth dataset (Niu et al., 2024) for our experiments. This dataset provides outputs generated by six different LLMs: GPT-3.5-turbo-0613, GPT-4-0613 (Achiam et al., 2023), Mistral-7b-Instruct (Jiang et al., 2023), Llama-2-7B-chat, Llama-2-13B-chat, and Llama-2-70B-chat (Touvron et al., 2023). That is, for each input, RAGTruth provides six outputs by these LLMs, with different levels of hallucinations. Each output is annotated with the hallucinated spans and their hallucination types. In accordance with the RAGTruth annotation protocol, hallucination is defined as content that is clearly different from the input, content not supported by the input, or unverifiable or subjective statements.
+¹This setting was chosen to make our method directly comparable with other baselines. We can train the model by conducting classification with positive and negative samples simultaneously, which slightly improves the detection performance.
+Table 1: Dataset statistics (parentheses indicate the number of triplets).
+
+          Train            Valid          Test
+  QA      4,614 (3,756)    420 (330)      900 (564)
+  D2T     4,878 (4,506)    420 (390)      900 (864)
+  SUM     4,338 (4,074)    420 (396)      900 (780)
+  Total   13,830 (12,336)  1,260 (1,116)  2,700 (2,208)
+The original datasets of RAGTruth come from question answering (QA), data-to-text generation (D2T), and news summarization (SUM), with each task having varying hallucination rates across the LLM outputs. For the QA task, the input consists of a passage and a question from MS MARCO (Nguyen et al., 2016), and the output is the corresponding answer. For the D2T task, the input is JSON-formatted structured data (restaurant metadata and user reviews) from the Yelp Open Dataset (Yelp, 2017), and the output is a natural language description of that data. For the SUM task, the input is a news article (primarily from the CNN/Daily Mail dataset (See et al., 2017)), and the output is a summary. We constructed triplets of (input text, faithful output, hallucinated output) using the outputs of the six LLMs. The original dataset contained 17,790 generated outputs, from which we extracted 15,660 triplets after discarding cases where all outputs are faithful or hallucinated. For evaluation, we used the 2,208 triplets in the test split across all settings. Since RAGTruth does not provide a validation set, we randomly sampled a subset from the training data for validation. The number of samples for each split is summarized in Table 1.
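The triplet construction described above can be sketched as follows. This is one plausible pairing scheme under stated assumptions: the paper does not specify exactly how faithful and hallucinated outputs are paired, and the records below are mock entries, not actual RAGTruth data:

```python
from itertools import product

def build_triplets(context, outputs):
    """outputs: list of (text, is_hallucinated) pairs for one input context.
    Returns (context, faithful, hallucinated) triplets; empty when every
    output shares the same label, mirroring the discarded cases."""
    faithful = [t for t, h in outputs if not h]
    hallucinated = [t for t, h in outputs if h]
    return [(context, f, n) for f, n in product(faithful, hallucinated)]

# Mock entry: three of six LLM outputs shown, two faithful and one hallucinated.
triplets = build_triplets(
    "input passage ...",
    [("good answer", False), ("made-up answer", True), ("another good answer", False)],
)
```

An input whose outputs are all faithful (or all hallucinated) yields no triplet, which is why the triplet count (15,660) is below the raw output count (17,790).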
+3.2 Implementation
+We used the light-weight, encoder-based RoBERTa-base model (Liu et al., 2019) with 125M parameters as the base model for the classifier. As the text embedding, we employ the final-layer hidden state corresponding to the start-of-sequence token, i.e., “<s>”, prepended to the input text. We also experimented with a light-weight decoder-based LLM, Phi-3.5-mini-instruct (Abdin et al., 2024), which has 3.8B parameters. As the text embedding encoded by this model, we used the final-layer hidden state corresponding to the last token of the input. Fine-tuning was conducted for 10 epochs with a learning rate of 5.0e-6 for RoBERTa-base and 1.0e-6 for Phi-3.5-mini-instruct. The margin value α in our method was set to 1.0 for RoBERTa-base and 0.5 for Phi-3.5-mini-instruct based on performance on the validation set; preliminary experiments showed that detection performance is not sensitive to the setting of α. All experiments were conducted on an NVIDIA H100 GPU with 94GB of memory.
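The margin-based training objective can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes cosine distance over the text embeddings and treats (input, faithful output, hallucinated output) as (anchor, positive, negative); in the full method this term is combined with the classifier's cross-entropy loss.

```python
import numpy as np

def cosine_distance(a, b):
    # Row-wise 1 - cosine similarity between two batches of embeddings.
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / den

def triplet_loss(anchor, positive, negative, margin=1.0):
    # The input (anchor) should lie closer to the faithful output (positive)
    # than to the hallucinated output (negative) by at least the margin α.
    d_ap = cosine_distance(anchor, positive)
    d_an = cosine_distance(anchor, negative)
    return float(np.maximum(0.0, d_ap - d_an + margin).mean())

# Dummy (batch, hidden) embeddings standing in for the <s>-token hidden states.
rng = np.random.default_rng(0)
anchor, positive, negative = (rng.normal(size=(4, 8)) for _ in range(3))
loss = triplet_loss(anchor, positive, negative, margin=1.0)
```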
+3.3 Baselines
+We compared our method against the following three baselines.
+LLM-Prompting This method prompts LLMs to detect hallucinations. Given an input text and its corresponding output, an LLM was prompted to judge whether the output contained hallucination. We used both Phi-3.5-mini-instruct and GPT-4o as the LLMs. The prompts can be found in the Appendix.
+FActScore As a strong hallucination-detection method applicable to the scenario where the LLMs that generated the outputs are unknown, we compare to FActScore. FActScore requires a knowledge base to identify hallucinations; to make it compatible with the RAGTruth dataset, we used the input texts as the knowledge source, i.e., outputs not supported by the input context are regarded as hallucinated. Following the original setting of Min et al. (2023), GPT-3.5-turbo was used as the base model to decompose output texts into a sequence of atomic facts and to calculate the percentage of facts supported by the input text. If the computed score was exactly 1.0, the generated output was labeled as faithful; otherwise, it was considered hallucinated.
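The decision rule described above reduces to a simple threshold; a minimal sketch (the decomposition into atomic facts is delegated to the LLM and mocked here as boolean support judgments):

```python
def label_from_factscore(atomic_facts, supported):
    """FActScore-style labeling as used in this comparison: an output is
    'faithful' only if every decomposed atomic fact is supported by the
    input text, i.e. the support score is exactly 1.0."""
    score = sum(supported) / len(atomic_facts)
    return "faithful" if score == 1.0 else "hallucinated"

facts = ["The cafe is in Austin.", "It opened in 2010."]
print(label_from_factscore(facts, [True, False]))  # → hallucinated
```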
+Classifier As an ablation, we compared our method to a variant that trains the binary classifier using only the cross-entropy loss, without the triplet loss. Both our method and this Classifier baseline were trained on all samples in the training split across tasks.
+4 Results and Discussion
+4.1 Overall Performance
+Table 2 shows the precision, recall, and F1 scores for hallucination detection on the different tasks. The “ALL” column reports these scores measured over all samples across tasks. Combined with RoBERTa, the proposed method achieved the best F1 scores on QA, D2T, and ALL, substantially outperforming the much larger GPT-4o as well as FActScore. The proposed method with RoBERTa showed higher recall. GPT-4o
+Model    Method        QA               D2T              SUM              ALL            Time (s)
+                     P    R    F1     P    R    F1     P    R    F1     P    R    F1
+GPT-4o   Prompt     60.7 46.3 52.5   94.0 63.4 75.7   89.1 49.5 63.6   86.3 57.3 68.8    2.01
+GPT-3.5  FActScore  35.3 88.1 50.4   66.9 94.3 78.3   33.2 66.7 44.3   50.3 87.1 63.7    2.29
+RoBERTa  Classifier 45.8 60.0 57.0   80.9 90.2 85.3   34.2 27.3 30.3   78.3 58.2 66.8    0.01
+         Proposed   62.7 88.7 60.4   79.9 91.9 85.5   33.5 54.0 41.4   59.8 83.1 69.5    0.03
+Phi-3.5  Prompt     27.3  1.9  3.5   50.0  4.6  8.4   30.8 20.2 24.3   35.6  7.5 12.5    0.45
+         Classifier 59.5 56.9 58.1   82.4 86.0 84.1   35.2 32.3 33.7   74.0 63.8 68.5    0.29
+         Proposed   71.0 44.1 54.4   83.4 83.8 83.6   38.7 35.8 37.2   67.1 70.1 68.6    0.34
+Table 2: Precision (P), Recall (R), and F1 scores (%) for hallucination detection across tasks. “Time” indicates average time per case.
+demonstrated higher precision, whereas FActScore showed higher recall. GPT-4o and FActScore performed strongly on the summarization task, but their performance was limited in the other settings. Hallucination detection on the summarization task requires a detailed comparison of a long input document against a shorter output summary. We conjecture that GPT-4o and GPT-3.5 are capable of such comparison, but that it is difficult for the much smaller RoBERTa-base. Our method on Phi-3.5-mini-instruct was consistently inferior to that on RoBERTa. This may be due to differences between encoder and decoder embeddings; a detailed investigation is left for future work. The far-right column reports computational time: the average number of seconds to process a sample. Our method on RoBERTa is much faster than the decoder-based LLMs, thanks to the efficient encoder model and its small number of parameters. Prompting GPT-4o and FActScore took 67.0 to 76.3 times longer than our method.
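The metrics in Table 2 follow the usual definitions; a short sketch, assuming (as the precision/recall discussion suggests) that “hallucinated” is treated as the positive class:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 with 'hallucinated' as the positive class:
    tp = correctly flagged hallucinations, fp = false alarms, fn = misses."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# E.g. 80 detected hallucinations, 20 false alarms, 20 misses:
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.8 0.8 0.8
```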
+4.2 Analysis
+This section investigates features of hallucinations that can affect the detection performance by comparing our method on RoBERTa and GPT-4o.
+Effect of Hallucinating Models Table 3 presents F1 scores for hallucination detection, grouped by the LLM that generated the outputs. Overall, the detection rate tends to be higher for generations containing more hallucinations. Although we hypothesized that GPT-4o might be more successful on the outputs of GPT-3.5 and GPT-4, this did not hold; rather, differences across tasks are more dominant than differences across models.
+Number of Hallucinations Figures 2 and 3 show the success rate of hallucination detection as a function of the proportion of hallucinated tokens and the number of hallucinated spans, respectively. The bar charts in the background indicate the numbers of samples within each bin. Hallucinations with smaller proportions are more challenging to detect, yet such cases are more prevalent in the dataset. Nevertheless, our method achieved significantly higher detection rates than GPT-4o in these cases.
+Figure 2: Detection success ratio and the number of cases by hallucinated-token ratio in an output
+Figure 3: Detection success ratio and the number of cases by the number of hallucinations in an output
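The binned analysis behind Figures 2 and 3 can be sketched as follows; the number of bins and the equal-width binning are our assumptions, not the figures' exact configuration:

```python
def success_rate_by_bin(token_ratios, detected, n_bins=5):
    """Group cases by their hallucinated-token ratio (in [0, 1]) and return,
    per equal-width bin, the case count and the detection success rate."""
    bins = [[] for _ in range(n_bins)]
    for ratio, ok in zip(token_ratios, detected):
        idx = min(int(ratio * n_bins), n_bins - 1)  # clamp ratio == 1.0
        bins[idx].append(ok)
    return [(len(b), (sum(b) / len(b)) if b else None) for b in bins]

stats = success_rate_by_bin([0.05, 0.08, 0.5, 0.95], [False, True, True, True])
print(stats[0])  # → (2, 0.5)
```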
+Embedding Space Figures 4 and 5 visualize the distributions of cosine distances between the input and the faithful/hallucinated outputs before and after contrastive learning, respectively. In the original embeddings, the distributions for faithful and hallucinated
+Task  Method      GPT-3.5  GPT-4  Llama2-7B  Llama2-13B  Llama2-70B  Mistral
+QA    GPT-4o        14.3     0.0     68.7       43.6        40.0       55.7
+      Proposed      21.4     0.0     74.6       65.4        57.7       65.2
+      Num              5       1       52         36          35         31
+D2T   GPT-4o        21.1     6.5     74.2       93.0        67.5       82.0
+      Proposed      31.3    21.3     89.7       95.7        84.8       94.1
+      Num             31      29      117        132         106        128
+SUM   GPT-4o         0.0    50.0     65.8       46.8        54.5       72.5
+      Proposed       0.0    16.7     49.1       34.3        35.7       63.4
+      Num              3       5       50         32          23         85
+ALL   GPT-4o        18.2    14.3     71.0       79.4        60.2       75.1
+      Proposed      17.1    16.3     77.0       79.1        69.1       79.7
+      Num             39      35      219        200         164        244
+Table 3: F1 scores (%) for hallucination detection per hallucinating model (“Num” rows show the number of samples with hallucination.)
+Figure 4: Distribution of cosine distances between original embeddings (before contrastive learning)
+Figure 5: Distribution of cosine distances after contrastive learning
+outputs are highly similar, with both tightly concentrated in a narrow range. This indicates that the inputs and the faithful and hallucinated outputs are entangled in the embedding space. After contrastive learning with the triplet loss, they are well disentangled: the cosine-distance distributions of faithful and hallucinated outputs differ significantly, with their respective peaks clearly shifted in opposite directions.
+5 Conclusion
+We proposed a method for training a hallucination detector using contrastive learning. Experimental results demonstrated that our method is particularly effective at detecting cases where the proportion and/or number of hallucinated spans is small, which are typically more challenging to identify. In future work, we will explore methods for locating and identifying hallucinated spans in generated text, which remains an open problem despite its practical importance.
+Limitations
+Our method requires an input context to identify hallucination in a generated output; hence, it does not apply to scenarios where only generated outputs are available, such as fake news detection. It also requires triples of (input context, hallucinated output, faithful output), which take more effort to construct than simpler pairs of (input context, hallucinated or faithful output). Nonetheless, such triples can be collected by sampling during generation or by using multiple LLMs.
+Acknowledgments
+This work was supported by JST K Program Grant Number JPMJKP24C3, Japan. This study was carried out using the TSUBAME4.0 supercomputer at Institute of Science Tokyo.
+References
+Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. Preprint, arXiv:2404.14219.
+Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
+Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
+Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
+Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020.
+Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
+Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, and Jie Zhou. 2024. On large language models’ hallucination with regard to known facts. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1041–1053, Mexico City, Mexico. Association for Computational Linguistics.
+Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169.
+Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
+Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore. Association for Computational Linguistics.
+Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In EMNLP.
+Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and Hannaneh Hajishirzi. 2024. Fine-grained hallucination detection and editing for language models. In First Conference on Language Modeling.
+Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset.
+Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. 2024. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878, Bangkok, Thailand. Association for Computational Linguistics.
+Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
+Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
+Yelp. 2017. Yelp open dataset. http://www.pluto.ai.kyutech.ac.jp/NLP/.
+A Appendix
+Table 4 shows prompts used in this study.
+Classifier, Triplet:
+[input text] Please judge the following statement whether it includes hallucination or not based on the references above: [output text]
+Prompt (Phi):
+Input_Document: [input text] Please judge the following Text whether it includes hallucination or not based on the Input_Document above and output 1 if it includes hallucination and 0 if not. Output should be only an number (1 or 0). You mustn’t output any description other than a number. Text: [output text] Output:
+Prompt (GPT4o):
+[input text] Please judge the following statement whether it includes hallucination or not based on the references above and output 1 if it includes hallucination and 0 if not. Output should be only an number (1 or 0): [output text] Output:
+Table 4: Prompts used in the experiments
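The prompting baseline can be sketched around the Table 4 template; the prompt text below is verbatim from the table (including its wording), while the whitespace handling and reply parsing are our assumptions:

```python
def build_detection_prompt(input_text, output_text):
    """Assemble the GPT-4o detection prompt shown in Table 4."""
    return (
        f"{input_text}\n"
        "Please judge the following statement whether it includes hallucination "
        "or not based on the references above and output 1 if it includes "
        "hallucination and 0 if not. Output should be only an number (1 or 0): "
        f"{output_text}\nOutput:"
    )

def parse_judgement(reply):
    """Map the model reply to a binary label; anything other than a leading
    '1' or '0' is treated as invalid (None)."""
    reply = reply.strip()
    return {"1": True, "0": False}.get(reply[:1])

prompt = build_detection_prompt("Reference passage ...", "Candidate answer ...")
print(parse_judgement(" 1 "))  # → True
```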
\ No newline at end of file
diff --git a/storage/8X8Z578N/.zotero-reader-state b/storage/8X8Z578N/.zotero-reader-state
new file mode 100644
index 0000000..36705c9
--- /dev/null
+++ b/storage/8X8Z578N/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":2,"scale":240,"top":57,"left":-22,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/8X8Z578N/Yamada and Arase - 2025 - Light-Weight Hallucination Detection using Contrastive Learning for Conditional Text Generation.pdf b/storage/8X8Z578N/Yamada and Arase - 2025 - Light-Weight Hallucination Detection using Contrastive Learning for Conditional Text Generation.pdf
new file mode 100644
index 0000000..4738773
Binary files /dev/null and b/storage/8X8Z578N/Yamada and Arase - 2025 - Light-Weight Hallucination Detection using Contrastive Learning for Conditional Text Generation.pdf differ
diff --git a/storage/8XNBW5GV/.zotero-ft-cache b/storage/8XNBW5GV/.zotero-ft-cache
new file mode 100644
index 0000000..f864def
--- /dev/null
+++ b/storage/8XNBW5GV/.zotero-ft-cache
@@ -0,0 +1,180 @@
+HUDEX: INTEGRATING HALLUCINATION DETECTION AND
+EXPLAINABILITY FOR ENHANCING THE RELIABILITY OF LLM
+RESPONSES
+Sujeong Lee Inha University Incheon, 22212, Republic of Korea tnwjd025611@inha.edu
+Hayoung Lee Inha University Incheon, 22212, Republic of Korea gkdud000123@gmail.com
+Seongsoo Heo Inha University Incheon, 22212, Republic of Korea woo555813@inha.edu
+Wonik Choi Inha University Incheon, 22212, Republic of Korea wichoi@inha.ac.kr
+February 13, 2025
+ABSTRACT
+Recent advances in large language models (LLMs) have shown promising improvements, often surpassing existing methods across a wide range of downstream tasks in natural language processing. However, these models still face challenges that may hinder their practical applicability. For example, the phenomenon of hallucination is known to compromise the reliability of LLMs, especially in fields that demand high factual precision. Current benchmarks primarily focus on hallucination detection and factuality evaluation but do not extend beyond identification. This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx, aimed at enhancing the reliability of LLM-generated responses by both detecting hallucinations and providing detailed explanations. The proposed model offers a novel approach to integrating detection with explanations, enabling both users and the LLM itself to understand and reduce errors. Our measurement results demonstrate that the proposed model surpasses larger LLMs, such as Llama3 70B and GPT-4, in hallucination detection accuracy, while maintaining reliable explanations. Furthermore, the proposed model performs well in both zero-shot and other test environments, showcasing its adaptability across diverse benchmark datasets. The proposed approach advances hallucination detection research by integrating interpretability with detection, which further enhances the performance and reliability of evaluating hallucinations in language models.
+1 Introduction
+Recent advancements in large language models (LLMs) have showcased their potential in natural language processing (NLP) [1]. While LLMs can generate effective responses across diverse tasks, they are also limited by certain critical issues. One such limitation is hallucination, where the model produces information that is factually incorrect or generates content not requested or instructed by the user. This problem can lead to the spread of incorrect information, particularly problematic in fields where accuracy and reliability are crucial, thereby limiting the applicability of LLMs in various industries. Consequently, hallucination is a major issue undermining the reliability of LLMs, prompting significant research into solutions.
+Recent studies have focused on developing benchmarks to detect and evaluate hallucinations and methods for mitigating them. For example, FELM [2] provides a benchmark for assessing the factuality of LLMs by identifying factual errors in response segments through text-segment-based annotations. TruthfulQA [3] evaluates whether language models
+arXiv:2502.08109v1 [cs.CL] 12 Feb 2025
+
+
+A PREPRINT - FEBRUARY 13, 2025
+produce truthful responses, aiming to detect non-factual responses across various domains. Similarly, QAFactEval [4] proposes a QA-based metric for assessing factual consistency in summarization tasks, effectively detecting and evaluating factual errors.
+However, these studies primarily focus on evaluating or detecting hallucinations and factual inaccuracies, rather than actively improving the model’s reliability. This limitation underscores the need for approaches that not only assess factual errors but also actively contribute to improving the quality of model responses. Additionally, benchmark-based evaluation methods may struggle with the real-time detection of hallucinations in model-generated responses.
+To address these gaps, we propose a specialized model named HuDEx designed to detect hallucinations in LLM responses and provide detailed explanations of these hallucinations. Unlike existing benchmarks, our model not only identifies hallucinations but also offers specific explanations, helping users understand the model’s output and assisting the model in refining its responses. This approach aims to improve the reliability of LLM responses.
+Figure 1: Schematic Representation of Our Hallucination Model
+The key contributions of our proposed model are:
+1. Moving beyond standardized hallucination benchmarks, the proposed model enables proactive detection despite its smaller size.
+2. By providing detailed explanations of detected hallucinations, the model enhances user understanding and contributes to the improvement of model performance.
+3. Through an analysis focused on hallucinations, a more nuanced evaluation of the hallucination domain is possible compared to general-purpose LLMs, and this can be effectively used to evaluate other LLMs.
+2 Related Work
+2.1 Definitions of Large Language Models
+A Large Language Model is an artificial intelligence model based on the Transformer architecture [5]. It refers to a pre-trained language model (PLM) with a parameter size exceeding a certain threshold [6]. LLMs are trained on massive datasets and typically have billions to hundreds of billions of parameters. Due to the extensive data used in their training, LLMs exhibit exceptional performance across various NLP tasks, including text generation, translation, and summarization.
+Notably, LLMs that surpass a certain parameter scale demonstrate emergent abilities not found in smaller models. Examples of these abilities include in-context learning, instruction following, and chain-of-thought (CoT) reasoning [7]. These capabilities enable LLMs to handle more complex tasks, such as advanced reasoning, problem-solving, and generating multi-turn responses.
+Although LLMs are primarily used for general downstream tasks, their increasing significance in both academia and industry has led to research into domain-specific LLMs. Examples include the Med-PaLM series for the medical domain [8] and FinGPT for the financial domain [9]. These advancements underscore the growing need for LLMs not only in language generation but also in addressing specialized tasks across various fields.
+2.2 Definitions of Hallucination
+In NLP, hallucination refers to content that is unreliable or illogical compared to the provided source material [10], [11]. Previous studies categorize hallucinations into two broad types: intrinsic and extrinsic [10], [11], [12], [13].
+Intrinsic hallucination occurs when the generated output contradicts the source content. For example, this happens when a model produces information that conflicts with the given data in response to a factual question. In contrast, extrinsic hallucinations involve outputs that include unverifiable or nonexistent information. This often occurs when the model generates content that cannot be corroborated by the source material.
+In the context of LLMs, hallucination can be defined more specifically. LLM hallucinations, which prioritize user instructions and interactions, can be categorized based on factuality and faithfulness [14]. Factual hallucinations arise when a model generates outputs that are based on real-world information but are either incorrect or unverifiable. For instance, if the model inaccurately presents well-known facts or mentions nonexistent information, it is considered a factual hallucination. Faithfulness-related hallucinations occur when the model generates responses unrelated to user instructions or the provided content, or when it produces internally inconsistent answers. This type of hallucination is particularly important in conversational models.
+The issue of hallucination may stem from several factors, including the use of outdated data during the data collection process [15] or biased data [16] used for model training [14],[17], [18]. Furthermore, the risk of hallucinations tends to increase with the size and complexity of the models.
+2.3 LLM-Based Evaluation of LLMs
+One of the key challenges discussed alongside the development of LLMs is the difficulty in accurately evaluating the context and meaning of generated responses using traditional quantitative metrics. While human evaluation has been employed to address this limitation, it has considerable drawbacks, particularly in terms of time and resource consumption [1],[19].
+To overcome these challenges, the use of LLMs as evaluation tools, or “LLM judges,” has gained attention. [20] pioneered an LLM-based evaluation framework, showing that strong LLMs achieved over 80% agreement with human experts in evaluations. Subsequent studies by [21], [22], and [23] have expanded on this approach, further validating the utility of LLM judges.
+The introduction of LLM judges provides an efficient solution for evaluating large-scale data, where human evaluation may be impractical. In addition to quantitative assessments, LLM judges offer qualitative evaluations based on their understanding of context and adherence to user instructions, making them versatile tools for comprehensive evaluation.
+3 Data Construction
+3.1 Datasets
+For training, we utilized the HaluEval, FactCHD, and FaithDial datasets, as summarized in Table 1.
+The HaluEval dataset [24] is a hallucination evaluation benchmark designed to assess the likelihood of hallucinations based on content type. It consists of 30,000 examples across three tasks: question answering, knowledge-based dialogues, and text summarization, along with 5,000 general user queries that include ChatGPT responses. In this study, we used the question-answering and knowledge-based dialogue subsets as training data. Both subsets focus on detecting hallucinations based on provided knowledge, allowing the model to learn how to identify intrinsic hallucinations.
+The FactCHD dataset [25] is a benchmark specifically designed to detect hallucinations that conflict with factual information in LLMs. It evaluates factual accuracy in the context of a wide range of queries and responses, facilitating factual reasoning during evaluation. Unlike HaluEval, the FactCHD dataset does not include a pre-existing knowledge base, enabling the model to learn to detect hallucinations in scenarios with limited reference material.
+The FaithDial dataset [26] is designed to minimize hallucinations and improve the accuracy of information-seeking dialogues. It was built by modifying the Wizard of Wikipedia (WOW) benchmark to include hallucinated responses. The dataset includes a BEGIN [27] label that categorizes responses based on their relationship to the knowledge source
+and their contribution to the dialogue. For binary classification of hallucination detection, we preprocessed the dataset by excluding the Generic and Uncooperative categories. Additionally, since each data point includes both a response and an original response, we split them into two distinct responses. This allowed us to create two separate data instances with the same knowledge and dialogue context but different responses, thereby augmenting the training data.
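The FaithDial augmentation described above can be sketched as follows; the field names and label polarity are assumptions for illustration, not the dataset's exact schema:

```python
def split_faithdial_example(ex):
    """Split one FaithDial record, which carries both an edited response and
    the original response, into two training instances that share the same
    knowledge and dialogue context but differ in the response and label."""
    shared = {"knowledge": ex["knowledge"], "history": ex["history"]}
    return [
        {**shared, "response": ex["response"], "hallucinated": False},
        {**shared, "response": ex["original_response"], "hallucinated": True},
    ]

pair = split_faithdial_example({
    "knowledge": "K", "history": ["Hi there"],
    "response": "edited reply", "original_response": "original reply",
})
print(len(pair))  # → 2
```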
+Table 1: Dataset Information
+Dataset            Train   Test
+HaluEval Dialogue  9,000   1,000
+HaluEval QA        9,000   1,000
+FaithDial          18,357  3,539
+FactCHD            51,838  6,960
+3.2 Explanation Generation
+The primary goal of our model is not only to detect hallucinations in generated responses but also to provide explanations for the reasoning behind these judgments. A simple example of this process is illustrated in Figure 1. To achieve this, the model must be trained on explanation data. While the FactCHD dataset includes explanations, the HaluEval and FaithDial datasets do not. Therefore, we used the Llama3 70B [28] model to generate explanation data for hallucination detection in the HaluEval and FaithDial datasets.
+During the explanation generation process, we also generated answers corresponding to the hallucination labels. This step ensured that the hallucination labels predicted by the model during explanation generation aligned with the existing hallucination labels in HaluEval and FaithDial datasets.
+Upon analyzing the model’s predictions, we found that in 0.5% of cases the model failed to understand the prompt and asked for clarification, and 4.2% of responses were classified as anomalies. Excluding these cases, 95.3% of the responses adhered to the expected format. As shown in Table 2, the accuracy of valid responses was 98.3%. Ultimately, 93.7% of the hallucination labels from HaluEval and FaithDial matched the model’s predicted answers, and only the verified matching data were used for training.
+To further assess the quality of the generated explanations, we conducted statistical sampling. We defined the population as the set of generated explanations, with a confidence level of 99%, a conservatively estimated defect rate of p = 0.5, and a margin of error set at 2%. Through human evaluation of the selected sample, we validated the explanations to ensure the accuracy and relevance of the reasoning provided.
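Under the stated parameters, the required sample size follows from the standard (Cochran) formula; the exact formula the authors used is not stated, so this is an illustrative reconstruction:

```python
import math

def sample_size(z=2.576, p=0.5, e=0.02):
    """Cochran's sample-size formula n = z^2 * p * (1 - p) / e^2 for the
    explanation-quality audit: 99% confidence (z ≈ 2.576), conservative
    defect rate p = 0.5, and a 2% margin of error."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size())  # → 4148
```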
+Table 2: Confusion Matrix of Model Predictions VS Actual Answers (Proportional Representation)
+                    Actual Positive  Actual Negative
+Predicted Positive       52.0%            1.7%
+Predicted Negative        0%             46.3%
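The 98.3% accuracy reported above can be recovered directly from the table's proportions, since accuracy is the diagonal mass of the confusion matrix:

```python
# Proportions from Table 2 (rows: predicted label, columns: actual label).
pred_pos_actual_pos = 0.520
pred_pos_actual_neg = 0.017
pred_neg_actual_pos = 0.000
pred_neg_actual_neg = 0.463

# Accuracy = fraction of cases where prediction matches the actual label.
accuracy = pred_pos_actual_pos + pred_neg_actual_neg
print(f"{accuracy:.1%}")  # → 98.3%
```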
+4 Model Training and Inference
+4.1 Training
+We used the Llama 3.1 8B model [28] for training and applied low-rank adaptation (LoRA) [29], a parameter-efficient fine-tuning (PEFT) method. The task prompts for training were divided into two main categories: hallucination detection and hallucination explanation. The model was trained on both tasks using the same dataset.
+4.2 Inference
+The prompt structure for inference focuses on two key elements: persona provision and task stage provision. Persona provision ensures that the model understands the specific task’s goal before generating responses, encouraging deeper analysis of the given information. By defining the task’s context and role in advance, we aim for more consistent outputs. To generate a persona, we provided ChatGPT with task details and received recommendations for suitable persona candidates. After a human filtering process, we selected a hallucination expert persona to detect hallucinations.
+Task stage provision guides the model to approach complex problems systematically when generating responses. The prompt stages are structured adaptively based on the task and data characteristics. If background knowledge is available,
+Figure 2: Overview of HuDEx: Training and Inference
+the model generates responses based on it; otherwise, it relies on context and its inherent knowledge. The stage structure also varies depending on whether the task focuses on hallucination detection or explanation generation. Stages can be divided or combined based on the specific needs of each task.
+An overview of the training and inference process can be found in Figure 2, and brief examples of both the stage and persona structures are shown in Figure 3.
+5 Experiments
+5.1 Datasets
+For the detection and explanation generation experiments, we used the test sets from HaluEval dialogue, HaluEval QA, FaithDial and FactCHD, which were also used during training. The HaluEval datasets, both for dialogue and QA tasks, provide background knowledge, so we applied inference prompts designed to incorporate this information. FaithDial also utilized the same inference prompt. For the FactCHD dataset, which does not include background knowledge, we used the inference prompt stages suited for tasks without background knowledge. The persona was consistently provided across all tasks, regardless of the presence or absence of background knowledge.
+For zero-shot detection, we conducted experiments on HaluEval subsets not used during training, specifically HaluEval summarization and HaluEval general. The HaluEval summarization dataset focuses on detecting hallucinations in document summaries, while the HaluEval general dataset evaluates hallucination detection in ChatGPT responses to user queries. Since both datasets lack background knowledge, we used inference prompts designed for tasks without background knowledge.
+5.2 Test Setting
+5.2.1 Detection Experiments
+For the detection experiments, we compared our HuDEx to two LLMs, GPT-4 [30] and Llama 3 70B. These models received the same inference prompts as our model and were tasked with classifying whether the responses contained hallucinations.
+5.2.2 Explanation Generation Experiments
+To evaluate the explanations generated by each model, we used an LLM judge and conducted the main experiment. The experiment followed a single-answer grading approach, where each model’s response was individually scored.
+In the single-answer grading experiment, we divided the evaluation into two categories: factuality and clarity. Factuality assessed whether the explanation contained hallucinations, contradictions, or accurately reflected the given information. Clarity evaluated how clearly and thoroughly the reason was articulated. Each criterion was scored on a 3-point scale, with a maximum total score of 6 points.
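The grading scheme reduces to a simple rubric sum; a minimal sketch, assuming each criterion is scored 1 to 3 (the exact scale values are not stated):

```python
def single_answer_grade(factuality, clarity):
    """Single-answer grading for explanation evaluation: factuality and
    clarity each on a 3-point scale, summed to a maximum total of 6."""
    for score in (factuality, clarity):
        if score not in (1, 2, 3):
            raise ValueError("each criterion is scored on a 3-point scale")
    return factuality + clarity

print(single_answer_grade(3, 3))  # → 6
```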
+Figure 3: Examples of Persona and Steps Used in Inference Prompts
+We used GPT4o as the judge for the experiment. On the HaluEval and FaithDial datasets, we compared the explanations generated by our model against those from Llama 3 70B, with GPT4o providing the final judgments. For the FactCHD dataset, we compared the explanations generated by HuDEx against the explanations included in the FactCHD dataset itself.
+6 Results
+6.1 Detection Results
+6.1.1 Test Data Detection
+In this experiment, binary classification was performed to distinguish hallucinations from non-hallucinations using the test sets from the training data, with accuracy as the evaluation metric. Table 3 compares the performance of Llama3 70B, GPT4o, and our model across benchmark datasets such as HaluEval dialogue, HaluEval QA, FactCHD, and FaithDial.
+The experimental results show that our HuDEx outperformed the larger models, Llama3 70B and GPT4o, across all benchmarks. Specifically, it achieved an accuracy of 80.6% on the HaluEval dialogue dataset, surpassing Llama3 70B (71.8%) and GPT4o (72.5%), indicating superior performance in detecting hallucinations in conversational responses.
+In the HaluEval QA dataset, our model again achieved the highest accuracy of 89.6%, outperforming GPT4o (86.6%) and Llama3 70B (82.7%). This demonstrates its refined ability to detect hallucinations in QA tasks.
+On the FactCHD and FaithDial datasets, HuDEx recorded accuracies of 70.3% and 58.8%, respectively, continuing to show strong performance on both benchmarks. On the FactCHD dataset, HuDEx outperformed Llama3 70B by ~11%, confirming its effectiveness in hallucination detection even when background knowledge is unavailable. On the FaithDial dataset, our HuDEx also significantly outperformed GPT4o (50.6%), achieving 58.8%, which highlights its consistent performance on a different type of conversation-based dataset compared to HaluEval dialogue.
+These results demonstrate that our model consistently delivers superior performance in hallucination detection across various benchmark datasets, outperforming larger models.
+Table 3: Test Data Detection Results (Accuracy)
+Model       HaluEval Dialogue  HaluEval QA  FactCHD  FaithDial
+Llama3 70B  71.8               82.7         59.4     47.9
+GPT4o       72.5               86.6         61.2     50.6
+HuDEx       80.6               89.6         70.3     58.8
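The accuracy numbers in Table 3 are simple fractions of correctly classified examples in the binary hallucination / no-hallucination task. A minimal sketch with toy data (illustrative only, not the paper's predictions):

```python
def detection_accuracy(predictions, labels):
    """Binary hallucination-detection accuracy: fraction of matching labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy run: 1 = hallucinated, 0 = faithful (illustrative data, not the paper's).
preds  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
labels = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
print(f"accuracy = {detection_accuracy(preds, labels):.1%}")  # accuracy = 80.0%
```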
+6.1.2 Zero-Shot Detection
+Table 4 presents the results of the binary classification experiment on hallucination vs. non-hallucination in a zero-shot setting. This experiment evaluated the model’s hallucination detection performance on unseen data using the HaluEval summarization and HaluEval general datasets, which were not included in the training data. Accuracy was used as the evaluation metric, consistent with the methodology in previous experiments.
+On the HaluEval summarization dataset, HuDEx achieved an accuracy of 77.9%, outperforming Llama3 70B (69.55%) and GPT4o (61.9%). This demonstrates the model’s ability to effectively detect hallucinations in summary texts of original content.
+The HaluEval general dataset consists of queries posed by real users to GPT models, often containing complex responses that go beyond typical conversational text. This complexity makes hallucination detection more challenging and serves as an important benchmark for evaluating model reliability on unstructured data. On this dataset, GPT4o recorded the highest accuracy at 78.0%, while our model achieved 72.6%. These results suggest that while HuDEx delivers consistent performance on complex responses, there is still room for improvement.
+Table 4: Zero-shot data detection results (Accuracy)
+Model       HaluEval Summarization  HaluEval General
+Llama3 70B  69.55                   76.2
+GPT4o       61.9                    78.0
+HuDEx       77.9                    72.6
+6.2 Explanation Generation Results
+6.2.1 Single-Answer Grading
+This experiment presents the evaluation of hallucination explanations generated by Llama3 70B and our model, as assessed by the LLM judge. The results, shown in Table 5, were obtained from the HaluEval dialogue, HaluEval QA, and FaithDial datasets. Explanations were evaluated based on two criteria: factuality and clarity, each scored out of 3 points, for a maximum combined score of 6 points.
+When comparing the performance of Llama3 70B and our HuDEx in terms of factuality, Llama3 70B scored lower on the HaluEval dialogue dataset with 1.932 points but achieved relatively higher scores on HaluEval QA and FaithDial, with 2.416 and 2.587 points, respectively. In contrast, our model outperformed Llama3 70B on factuality for the HaluEval dialogue dataset, though it scored slightly lower on HaluEval QA (2.299) and FaithDial (2.216). Despite the variations in scores across datasets, HuDEx demonstrated consistent factual accuracy, indicating its ability to provide reliable information.
+In terms of clarity, Llama3 70B achieved the highest score on the FaithDial dataset with 2.451 points, while our model closely followed with 2.417 points. On the HaluEval dialogue and HaluEval QA datasets, our model outperformed Llama3 70B, scoring 2.413 and 2.523 points, respectively. This indicates that HuDEx provides clearer and more easily understandable explanations for hallucinations.
+Overall, our HuDEx demonstrated competitive performance in terms of factuality, clarity, and overall scores compared to Llama3 70B. These results support that our model consistently delivers reliable and clear hallucination explanations.
+The next experiment evaluated the original explanations from the FactCHD dataset against those generated by our model, with results shown in Table 6. The conversion ratio was used to compare the performance of our HuDEx as a percentage, with the FactCHD score serving as the maximum (100%).
+For factuality, FactCHD recorded a score of 2.2549, while our model scored slightly lower at 2.236. The conversion ratio for factuality was 99%, indicating that although FactCHD’s original explanations had slightly higher factual accuracy, HuDEx performed very closely to this benchmark.
+Table 5: Comparison of Hallucination Explanations Between Llama3 70B and Proposed model (LLM Judge Evaluation)
+Model       Dataset            Factuality (3)  Clarity (3)  Overall (6)
+Llama3 70B  HaluEval Dialogue  1.932           2.302        4.256
+            HaluEval QA        2.416           2.153        4.569
+            FaithDial          2.587           2.451        5.038
+HuDEx       HaluEval Dialogue  2.116           2.413        4.528
+            HaluEval QA        2.299           2.523        4.822
+            FaithDial          2.216           2.417        4.633
+In terms of clarity, FactCHD achieved a score of 2.439, while our model scored slightly lower at 2.37. The conversion ratio for clarity was 97%, suggesting that while our model’s explanations were marginally less clear than FactCHD’s, they remained highly comparable in clarity. In conclusion, HuDEx showed performance similar to FactCHD, with conversion ratios ranging from 97% to 99%. These results demonstrate that HuDEx generates explanations nearly equivalent in quality to the original explanations provided in the FactCHD dataset.
+Table 6: LLM Judge Evaluation of Explanations: FactCHD Original vs. HuDEx
+                  Factuality (3)  Clarity (3)  Overall (6)
+FactCHD           2.2549          2.439        4.697
+HuDEx             2.236           2.37         4.61
+Conversion Ratio  99%             97%          98%
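The conversion ratios in Table 6 are the HuDEx scores expressed as a percentage of the corresponding FactCHD scores (FactCHD = 100%), rounded to the nearest percent; the arithmetic checks out:

```python
def conversion_ratio(model_score: float, reference_score: float) -> int:
    """Model score as a percentage of the reference score (reference = 100%)."""
    return round(100 * model_score / reference_score)

# Scores taken from Table 6 (FactCHD original explanations vs. HuDEx).
table6 = {"factuality": (2.236, 2.2549), "clarity": (2.37, 2.439), "overall": (4.61, 4.697)}
for criterion, (hudex, factchd) in table6.items():
    print(f"{criterion}: {conversion_ratio(hudex, factchd)}%")  # 99%, 97%, 98%
```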
+7 Conclusion
+The hallucination phenomenon in large language models (LLMs) presents a significant challenge that needs to be addressed in practical applications. This study proposes a model called HuDEx specifically designed to detect hallucinations in LLM-generated responses and provide explanations for them. By offering such feedback, the model contributes to both user understanding and the improvement of LLMs, fostering the generation and evaluation of more reliable responses.
+However, a key limitation of the model is its reliance on the LLM’s inherent knowledge when sufficient source content is unavailable for detecting and explaining hallucinations. This dependency can reduce the clarity of the explanations and, in some cases, introduce hallucinations into the explanations themselves.
+Despite this limitation, the study demonstrates strong potential for detecting and explaining hallucinations. Future research should focus on overcoming these challenges and exploring methods to improve the model’s performance. For example, integrating external knowledge retrieval systems could reduce the model’s reliance on its internal knowledge, while enhancing reasoning-based validation could lead to more reliable explanations.
+Additionally, we aim to develop an automated feedback loop in future work. This system would allow for continuous correction and improvement of hallucinations, contributing to greater reliability and consistency in LLMs over time.
+References
+[1] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023.
+[2] Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. Felm: Benchmarking factuality evaluation of large language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
+[3] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252. Association for Computational Linguistics, 5 2022.
+[4] Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. Qafacteval: Improved qa-based factual consistency evaluation for summarization. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza
+Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601. Association for Computational Linguistics, 7 2022.
+[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc., 2017.
+[6] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2023.
+[7] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models, 2022.
+[8] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge. Nature, 620:172–180, 8 2023.
+[9] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. Fingpt: Open-source financial large language models. FinLLM Symposium at IJCAI 2023, 2023.
+[10] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 55, 3 2023.
+[11] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919. Association for Computational Linguistics, 7 2020.
+[12] Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2197–2214. Association for Computational Linguistics, 11 2021.
+[13] Yichong Huang, Xiachong Feng, Xiaocheng Feng, and Bing Qin. The factual inconsistency problem in abstractive text summarization: A survey, 2023.
+[14] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023.
+[15] Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Dragomir Radev, Noah A Smith, Yejin Choi, and Kentaro Inui. Realtime qa: what’s the answer right now? In Proceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2024.
+[16] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In FAccT 2021 - Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623. Association for Computing Machinery, Inc, 3 2021.
+[17] David Chiang and Peter Cholak. Overcoming a theoretical limitation of self-attention, 2022.
+[18] Zuchao Li, Shitou Zhang, Hai Zhao, Yifei Yang, and Dongjie Yang. Batgpt: A bidirectional autoregessive talker from generative pre-trained transformer, 2023.
+[19] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V Do, Yan Xu, and Pascale Fung. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In Jong C Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi, editors, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–718. Association for Computational Linguistics, 11 2023.
+[20] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mtbench and chatbot arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2024.
+[21] Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. Benchmarking foundation models with language-model-as-an-examiner, 2023.
+[22] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522. Association for Computational Linguistics, 12 2023.
+[23] Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations, 2023.
+[24] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464. Association for Computational Linguistics, 12 2023.
+[25] Xiang Chen, Duanzheng Song, Honghao Gui, Chenxi Wang, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lv, Dan Zhang, and Huajun Chen. Factchd: Benchmarking fact-conflicting hallucination detection, 2024.
+[26] Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo M Ponti, and Siva Reddy. Faithdial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022.
+[27] Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter. Evaluating attribution in dialogue systems: The begin benchmark. Transactions of the Association for Computational Linguistics, 10:1066–1083, 2022.
+[28] Llama Team and Ai @ Meta. The llama 3 herd of models, 2024.
+[29] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
+[30] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, 
+Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul
+Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, C J Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
\ No newline at end of file
diff --git a/storage/8XNBW5GV/.zotero-reader-state b/storage/8XNBW5GV/.zotero-reader-state
new file mode 100644
index 0000000..8754bf9
--- /dev/null
+++ b/storage/8XNBW5GV/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":2,"scale":180,"top":359,"left":-121,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/8XNBW5GV/Lee et al. - 2025 - HuDEx Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM r.pdf b/storage/8XNBW5GV/Lee et al. - 2025 - HuDEx Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM r.pdf
new file mode 100644
index 0000000..607bb43
Binary files /dev/null and b/storage/8XNBW5GV/Lee et al. - 2025 - HuDEx Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM r.pdf differ
diff --git a/storage/97MZR9LH/.zotero-ft-cache b/storage/97MZR9LH/.zotero-ft-cache
new file mode 100644
index 0000000..1a8fee0
--- /dev/null
+++ b/storage/97MZR9LH/.zotero-ft-cache
@@ -0,0 +1,592 @@
+TYPE Review · PUBLISHED 30 September 2025 · DOI 10.3389/frai.2025.1622292
+OPEN ACCESS
+EDITED BY: Mehdi Naseriparsa, Federation University Australia, Australia
+REVIEWED BY: Šarnas Grigalinas, Kaunas University of Technology, Lithuania; Gaurav Kumar Gupta, Youngstown State University, United States; Ki-Seong Park, Chonnam National University Bitgoeul Hospital, Republic of Korea
+CORRESPONDENCE: Dang Anh-Hoang, hoanganhdang95@gmail.com
+RECEIVED 03 May 2025 · ACCEPTED 31 August 2025
+CITATION: Anh-Hoang D, Tran V and Nguyen L-M (2025) Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior. Front. Artif. Intell. 8:1622292. doi: 10.3389/frai.2025.1622292
+COPYRIGHT: © 2025 Anh-Hoang, Tran and Nguyen. Open-access article distributed under the Creative Commons Attribution License (CC BY).
+Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior
+Dang Anh-Hoang*, Vu Tran and Le-Minh Nguyen
+Division of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
+Hallucination in Large Language Models (LLMs) refers to outputs that appear fluent and coherent but are factually incorrect, logically inconsistent, or entirely fabricated. As LLMs are increasingly deployed in education, healthcare, law, and scientific research, understanding and mitigating hallucinations has become critical. In this work, we present a comprehensive survey and empirical analysis of hallucination attribution in LLMs, introducing a novel framework to determine whether a given hallucination stems from suboptimal prompting or from the model's intrinsic behavior. We evaluate state-of-the-art LLMs, including GPT-4, LLaMA 2, DeepSeek, and others, under various controlled prompting conditions, using established benchmarks (TruthfulQA, HallucinationEval) to judge factuality. Our attribution framework defines metrics for Prompt Sensitivity (PS) and Model Variability (MV), which together quantify the contribution of prompts vs. model-internal factors to hallucinations. Through extensive experiments and comparative analyses, we identify distinct patterns in hallucination occurrence, severity, and mitigation across models. Notably, structured prompt strategies such as chain-of-thought (CoT) prompting significantly reduce hallucinations in prompt-sensitive scenarios, though intrinsic model limitations persist in some cases. These findings contribute to a deeper understanding of LLM reliability and provide insights for prompt engineers, model developers, and AI practitioners. We further propose best practices and future directions to reduce hallucinations in both prompt design and model development pipelines.
+KEYWORDS
+Large Language Models, hallucination, prompt engineering, model behavior, GPT-4, LLaMA, DeepSeek, attribution framework
+1 Introduction
+Large Language Models (LLMs) have recently become foundational tools in modern natural language processing (NLP), with high-capability applications extending from conversational agents to scientific writing assistants and automated code generation. Models such as GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023b), LLaMA 2 (Touvron et al., 2023), Claude (Anthropic, 2023), DeepSeek (DeepSeek AI, 2023), and others have demonstrated extraordinary capabilities in zero-shot and few-shot learning tasks. Despite these advances, a significant challenge remains: hallucinations, i.e., output that appears fluent and coherent but is factually incorrect, fabricated, or logically inconsistent (Ji et al., 2023; Maynez et al., 2020; Kazemi et al., 2023).
+Frontiers in Artificial Intelligence 01 frontiersin.org
+Hallucinations in LLMs affect the reliability and efficiency of AI systems, particularly in high-impact domains such as medicine (Lee et al., 2023), law (Bommarito and Katz, 2022), journalism (Andrews et al., 2023), and scientific communication (Nakano et al., 2021; Liu et al., 2023). They also create risks of misinformation, reduced user trust, and accountability gaps (Bommasani et al., 2021; Weidinger et al., 2022). Understanding hallucinations is therefore a crucial research priority. Broadly, hallucinations in LLMs can be divided into two primary sources: (1) prompting-induced hallucinations, where ill-structured, underspecified, or misleading prompts cause unreliable outputs (Reynolds and McDonell, 2021; Zhou et al., 2022; Wei et al., 2022), and (2) model-internal hallucinations, which are caused by the model's architecture, pretraining data distribution, or inference behavior (Bang and Madotto, 2023; Chen et al., 2023; OpenAI, 2023a). Distinguishing between these two causes is essential for developing effective mitigation strategies. Mathematically, the problem can be described within the probabilistic generative framework that underlies modern language modeling. Consider an LLM modeled as a probabilistic generator Pθ(y|x) parameterized by θ, where x denotes the input prompt and y denotes the generated output. Hallucinations emerge when the model assigns a higher probability to an incorrect or ungrounded generation sequence than to a factually grounded alternative:
+Pθ(y_hallucinated | x) > Pθ(y_grounded | x)  (1)
+The above inequality illustrates a fundamental probabilistic dilemma: optimizing for fluency and coherence often conflicts with factual grounding. Understanding hallucinations therefore requires analyzing the model's probability distribution and identifying the contexts and conditions under which inaccuracies become prevalent. Recent work has attempted to reduce hallucinations using improved prompting techniques, such as chain-of-thought prompting (Wei et al., 2022), self-consistency decoding (Wang et al., 2022), retrieval-augmented generation (Lewis et al., 2020; Shuster et al., 2022), and verification-based refinement (Kadavath et al., 2022). Simultaneously, efforts at the model level focus on supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), contrastive decoding (Li et al., 2022), and grounded pretraining (Zhang et al., 2023). However, the interplay between prompt quality and model internals remains poorly understood. This paper aims to fill this gap by conducting a comprehensive survey and analysis of hallucination attribution in LLMs. Specifically, we seek to answer: to what extent do hallucinations result from prompting errors vs. model-level limitations? To this end, we propose an attribution framework, benchmark several state-of-the-art models under controlled conditions, and examine their behavior across different prompt formulations. Our contributions are threefold:
+• We provide a comprehensive review of recent literature on hallucinations in LLMs, categorizing work based on cause attribution (prompt vs. model).
+• We design and implement controlled experiments on multiple LLMs (GPT-4, LLaMA 2, DeepSeek, Qwen) using standardized hallucination evaluation benchmarks [e.g., TruthfulQA (Lin et al., 2022), HallucinationEval (Wu et al., 2023), RealToxicityPrompts (Gehman et al., 2020)].
+• We propose a diagnostic framework that empirically separates prompt-sensitive hallucinations from model-intrinsic ones, offering actionable recommendations for mitigation.
+The rest of this paper is structured as follows: Section 2 introduces background concepts and terminology around LLM hallucinations. Section 3 surveys existing literature. Section 4 presents our attribution framework. Section 5 describes our experimental design and evaluation protocols. Section 6 analyzes results across models and prompts. Section 7 discusses mitigation strategies. Section 8 outlines future research directions, and Section 9 concludes the paper.
+2 Background and definitions
+2.1 What is hallucination in large language models?
+Hallucination in the context of Large Language Models (LLMs) refers to the generation of content that is not grounded in the input prompt or in confirmed knowledge sources, even though the output may appear linguistically coherent (Ji et al., 2023; Maynez et al., 2020). This behavior distinguishes LLMs from traditional NLP models by highlighting the trade-off between fluency and factual reliability.
+2.2 Mathematical foundation of LLM hallucination
+To formalize hallucination phenomena in LLMs, it is useful to conceptualize them within a rigorous mathematical framework. Modern LLMs such as GPT-4, LLaMA, and DeepSeek typically employ transformer-based neural architectures trained to estimate conditional probabilities of token sequences. Formally, given an input context or prompt x = (x1, x2, . . . , xn), the model generates an output sequence y = (y1, y2, . . . , ym) by factorizing the conditional probability distribution as:
+
+Pθ(y|x) = ∏_{t=1}^{m} Pθ(y_t | x, y_{<t})  (2)
+
+Under this factorization, hallucination corresponds to the model preferring an ungrounded sequence, i.e., the likelihood ratio exceeding one:
+
+Pθ(y_hallucinated | x) / Pθ(y_grounded | x) > 1  (3)
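The chain-rule factorization of Pθ(y|x) described above can be sketched numerically. This toy stand-in "model" ignores its context and always emits a fixed softmax over a three-token vocabulary (purely illustrative, not a real LLM):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sequence_logprob(logits_fn, prompt, output):
    """Chain rule: log P_theta(y|x) = sum_t log P_theta(y_t | x, y_<t)."""
    total, context = 0.0, list(prompt)
    for token in output:
        next_token_probs = softmax(logits_fn(context))
        total += math.log(next_token_probs[token])
        context.append(token)
    return total

# Toy 3-token vocabulary; this stand-in model slightly prefers token 0.
toy_logits = lambda context: [1.0, 0.5, 0.1]

lp = sequence_logprob(toy_logits, prompt=[2], output=[0, 0])
print(round(lp, 4))  # log-probability of the two-token continuation
```

A hallucinated continuation "wins" whenever its summed log-probability exceeds that of the grounded one, which is exactly inequality (1) in log space.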
+Hallucinations can be addressed mathematically by recalibrating the probability distribution to align with external factual grounding or logical consistency constraints. Practically, this can be approached via contrastive decoding, retrieval-augmented mechanisms, or probabilistic calibration techniques.
+2.3 Experimental clarification and examples
+To illustrate the taxonomy of hallucinations, consider experimental scenarios drawn from popular benchmarks. Intrinsic hallucinations frequently occur in summarization tasks where the model outputs statements directly contradicting the provided input. For example, given the factual input “Einstein was born in Ulm, Germany,” an intrinsic hallucination might state incorrectly, “Einstein was born in Berlin.” Such intrinsic errors indicate failure in conditional grounding:
+Pθ(y_intrinsic | x_input) ≫ Pθ(y_correct | x_input)  (4)
+Extrinsic hallucinations often appear in open-ended question-answering or narrative-generation tasks, where the model outputs plausible-sounding yet ungrounded details. For instance, when asked to explain "the primary cause of dinosaur extinction," a model might confidently fabricate an irrelevant event, such as "massive volcanic eruptions on Venus caused changes on Earth," which, while syntactically coherent, has no empirical basis or source grounding. Factual hallucinations are explicitly illustrated by incorrect responses on datasets such as TruthfulQA. An experimental example is the model-generated answer "The capital of Canada is Toronto," instead of the factually correct "Ottawa." Logical hallucinations, conversely, involve internally inconsistent reasoning paths. An example arises in mathematical reasoning tasks, where a model might claim "If a = b and b = c, then a ≠ c," reflecting a clear logical inconsistency. Quantifying these hallucinations experimentally involves applying targeted metrics, such as accuracy-based evaluations on QA tasks, entropy-based measures of semantic coherence, and consistency checking against external knowledge bases. These empirical assessments provide quantitative insights into the conditions under which different hallucination types emerge, ultimately guiding improved detection, understanding, and mitigation approaches. Because LLMs are probabilistic text generators trained over massive corpora, they produce outputs that reflect statistical patterns rather than grounded truth. Hence, hallucination is an inherent byproduct of language modeling that prioritizes syntactic and semantic plausibility over factual accuracy (Shuster et al., 2022; Kadavath et al., 2022).
+2.4 Taxonomy of hallucinations
+Recent studies categorize hallucinations into several types based on their origin and manifestation (Ji et al., 2023; Kazemi et al., 2023):
+• Intrinsic hallucination: information generated by the model that contradicts the known input or context. For instance, summarizing a source text with incorrect facts.
+• Extrinsic hallucination: information that is not present in the source but cannot be immediately deemed incorrect. This is common in open-domain generation where output extends beyond context.
+• Factual hallucination: output that includes inaccurate or fabricated facts not aligned with real-world knowledge or knowledge bases (Lin et al., 2022; Liu et al., 2023).
+• Logical hallucination: output that is inconsistent or logically incoherent, despite surface-level grammatical correctness (Zhang et al., 2023).
+This classification allows for better evaluation and acknowledgment of hallucinations during LLM output analysis.
+2.5 Prompting and model behavior: two sides of the problem
+The challenge of hallucinations can be attributed to two major dimensions: prompt-level issues and model-level behaviors.
+• Prompting-induced hallucinations: these arise when prompts are vague, underspecified, or structurally misleading, pushing the model into speculative generation (Reynolds and McDonell, 2021; Wei et al., 2022; Zhou et al., 2022). For example, unclear intent in zero-shot prompts often results in off-topic or imaginative content.
+• Model-intrinsic hallucinations: even when well-organized prompts are used, LLMs may hallucinate due to limitations in training data, architectural biases, or inference-time sampling strategies (Bang and Madotto, 2023; OpenAI, 2023a; Chen et al., 2023).
+The distinction between these two causes is essential for developing targeted mitigation strategies. Prompt tuning approaches such as Chain-of-Thought prompting (Wei et al., 2022) and Self-Consistency decoding (Wang et al., 2022) aim to reduce hallucinations without altering the model. On the other hand, techniques like Reinforcement Learning with Human Feedback (RLHF) (Ouyang et al., 2022) and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) attempt to address model-level limitations.
+2.6 Evaluation challenges
+Evaluating hallucinations remains a challenging task due to their contextual nature. Automatic metrics such as BLEU or ROUGE fail to capture factual consistency and reliability
+Frontiers in Artificial Intelligence 03 frontiersin.org
+
+
+Anh-Hoang et al. 10.3389/frai.2025.1622292
+(Maynez et al., 2020). Therefore, benchmarks like TruthfulQA (Lin et al., 2022), HallucinationEval (Wu et al., 2023), and RealToxicityPrompts (Gehman et al., 2020) have been introduced to better assess hallucination bias across models and tasks. However, no widely accepted metric or dataset fully captures the multidimensional nature of hallucinations. As LLMs continue to scale in capability and deployment, understanding these foundational concepts is critical for the attribution, evaluation, and eventual mitigation of hallucinations in both research and applied contexts.
+3 Related work and literature survey
+The problem of hallucination in Large Language Models (LLMs) has become a central topic of investigation in recent years. A growing body of literature attempts to understand, evaluate, and mitigate this phenomenon. This section reviews key contributions from three main perspectives: (1) prompt engineering and its impact on hallucination, (2) model-intrinsic causes and architecture-level factors, and (3) evaluation and mitigation techniques proposed in the literature.
+3.1 Prompting techniques and hallucination control
+Prompting plays a significant role in the output behavior of LLMs. Several studies have emphasized how variations in prompt design can induce or suppress hallucinations (Reynolds and McDonell, 2021; Zhou et al., 2022). Prompting-induced hallucinations often arise from ambiguous formulations or lack of context, leading the model to rely on probabilistic associations rather than grounded knowledge. However, these works did not provide a quantitative measure of prompt sensitivity—they changed prompts and observed effects, but without a formal metric or model. In contrast, we introduce Prompt Sensitivity (PS) as a concrete metric to measure this effect systematically. Similarly, prior surveys (Ji et al., 2023; Chen et al., 2023) categorized causes in general terms but did not propose an attribution methodology; our work is the first to formalize a probabilistic attribution model for hallucinations, going beyond descriptive surveys and empirical trials.
+Zero-shot and few-shot prompting, popularized by GPT-3 (Brown et al., 2020), expose models to minimal task examples but tend to be prone to hallucination when the task is not explicitly structured. Chain-of-Thought (CoT) prompting (Wei et al., 2022) improves reasoning transparency and factual correctness by encouraging step-wise output generation. Least-to-Most prompting (Zhou et al., 2022) further decomposes complex queries into simpler steps, mitigating hallucination in multi-hop reasoning tasks. Other strategies like Self-Consistency decoding (Wang et al., 2022), ReAct prompting (Yao et al., 2022), and instruction tuning (Ouyang et al., 2022) have also been shown to reduce hallucination rates by influencing how the model organizes its internal generation paths. Still, these methods are heuristic in nature and do not universally prevent hallucinations across domains or tasks.
+3.2 Model behavior and architecture-level causes
+Hallucinations are not always prompt-driven. Intrinsic factors within model architecture, training data quality, and sampling algorithms significantly contribute to hallucination problems. The pretraining corpora used for LLMs are web-scale and largely unfiltered; they contain inconsistencies, biases, and outdated or false information that can affect the model during training (Shuster et al., 2022; Chen et al., 2023; Weidinger et al., 2022). Larger models, while generally more capable, also tend to hallucinate with “confident nonsense” (Kadavath et al., 2022). Model scaling alone does not eliminate hallucination but rather amplifies it in certain contexts. Studies such as OpenAI (2023a) and Bang and Madotto (2023) have also revealed that instruction-tuned models can still hallucinate, especially on long-context, ambiguous, or factual-recall tasks.
+To counter these issues, Retrieval-Augmented Generation (RAG) (Lewis et al., 2020), Grounded pretraining (Zhang et al., 2023), and contrastive decoding techniques (Li et al., 2022) have been explored. These approaches integrate external knowledge sources during inference or introduce architectural changes that enforce factuality.
+3.3 Hallucination detection and evaluation benchmarks
+Evaluating hallucination is a complex task. Traditional automatic metrics like BLEU, ROUGE, or METEOR are inadequate for assessing factual consistency (Maynez et al., 2020). Thus, dedicated benchmarks have emerged:
+• TruthfulQA (Lin et al., 2022) evaluates whether LLMs produce answers that mimic human false beliefs.
+• HallucinationEval (Wu et al., 2023) provides a framework for measuring different hallucination types.
+• RealToxicityPrompts (Gehman et al., 2020) investigates how models hallucinate toxic or inappropriate content.
+• CohS (Kazemi et al., 2023) and QAFactEval (Fabbri et al., 2022) focus on factual consistency in summarization.
+Evaluation approaches are also evolving to include natural language inference-based scoring, fact-checking pipelines, and LLM-as-a-judge methodologies (Liu et al., 2023). However, detection accuracy varies significantly across domains and model families.
+3.4 Mitigation strategies
+Several mitigation strategies have been proposed, targeting both prompting and modeling levels. At the prompting level, techniques
+TABLE 1 Representative studies in hallucination research in LLMs.
+Aspect | Representative works | Key contributions
+Prompt design | Wei et al., 2022; Zhou et al., 2022; Yao et al., 2022 | Prompting methods reduce hallucination by guiding reasoning and structure
+Model behavior | Kadavath et al., 2022; Bang and Madotto, 2023; Chen et al., 2023 | Hallucination linked to pretraining biases and architectural limits
+Evaluation | Lin et al., 2022; Wu et al., 2023; Kazemi et al., 2023 | Domain-specific benchmarks and scoring methods for hallucination detection
+Mitigation strategies | Ouyang et al., 2022; Lewis et al., 2020; Zhang et al., 2023 | RLHF, retrieval augmentation, grounded training, hybrid solutions
+such as prompt calibration, system message design, and output verification loops are common. At the modeling level, RLHF (Ouyang et al., 2022), retrieval fusion (Lewis et al., 2020), and instruction tuning (Wang et al., 2022) remain popular. Recent work also explores post-hoc refinement, where generated output is filtered or corrected using factuality classifiers or auxiliary models. Yet, no single method universally eliminates hallucination, pointing to the need for hybrid mitigation pipelines.
+3.5 Summary
+Table 1 summarizes the core themes and representative works in hallucination research.
+4 Attribution framework: prompting vs. model behavior
+While hallucination in Large Language Models (LLMs) is a well-recognized challenge, identifying the root cause of any given hallucination remains ambiguous. A single erroneous output may arise from unclear prompting, model architectural biases, training data limitations, or a combination of these factors. To systematically analyze this phenomenon, we introduce an attribution framework that quantifies the respective contributions of prompting and model behavior to hallucinated text.
+4.1 Motivation for attribution analysis
+Understanding whether hallucinations are caused by prompt formulation or intrinsic model behavior is essential for:
+• Designing more effective prompt engineering strategies. • Developing architectures that are inherently more grounded and robust. • Benchmarking LLM reliability under controlled conditions.
+Several studies have hinted at this attribution duality (Ji et al., 2023; Wei et al., 2022; Chen et al., 2023), but a formal diagnostic framework has not been sufficiently developed. Our approach fills this gap by offering a reproducible method to separate these two components using controlled prompt manipulation and model comparison.
+FIGURE 1
+Attribution quadrants defined using median cutoffs for Prompt Sensitivity (PS = 0.079) and Model Variability (MV = 0.057), computed across all evaluated models in Table 4. The quadrants correspond to prompt-dominant (high PS, low MV), model-dominant (low PS, high MV), mixed-origin (high PS and MV), and unclassified (low PS and MV) hallucinations. Using medians provides an objective, distribution-aware threshold rather than arbitrary cutoffs.
+Unlike previous approaches, which focus on categorizing hallucinations and analyzing domain-specific hallucination cases, our work introduces a novel attribution framework that distinguishes prompt-induced from model-intrinsic hallucinations. This framework defines new metrics and protocols to systematically isolate the source of hallucinations, filling the gap left by earlier studies.
+4.2 Attribution framework overview
+Figure 1 provides a high-level overview of the attribution framework. The attribution framework categorizes hallucinations in LLMs using Prompt Sensitivity (PS) and Model Variability (MV). High PS indicates hallucinations mainly due to ambiguous prompts, while high MV suggests intrinsic model limitations. Identifying categories—prompt-dominant, model-dominant, mixed-origin, or unclassified—guides targeted mitigation
+Frontiers in Artificial Intelligence 05 frontiersin.org
+
+
+Anh-Hoang et al. 10.3389/frai.2025.1622292
+strategies, emphasizing prompt clarity, improved training, or combined solutions to effectively reduce hallucinations. We define two primary dimensions of analysis:
+• Prompt sensitivity (PS): measures the variation in output hallucination rates under different prompt styles for a fixed model. PS is a new metric introduced in this work to quantify variation across prompts; previous work has not defined an explicit measure of this effect. High PS suggests hallucination is prompt-induced.
+• Model variability (MV): measures the difference in hallucination rates across different models for a fixed prompt. High MV indicates hallucination is model-intrinsic.
+• Objective thresholds: to distinguish “low” vs. “high” Prompt Sensitivity (PS) and Model Variability (MV), we first collected the PS and MV values computed for all evaluated models. We then plotted their distributions to visualize the spread of scores. Instead of selecting arbitrary boundaries, we used the median value of each distribution as the cutoff. This ensures that the quadrant framework (Figure 1) reflects the actual data distribution in a balanced and unbiased way, independent of individual model outliers.
+4.3 Quantitative attribution scoring
+We define a set of formal metrics to quantify attribution dimensions:
+Prompt Sensitivity (PS) = \frac{1}{n} \sum_{i=1}^{n} \left| H^{M}_{P_i} - \bar{H}_{M} \right| \quad (5)
+Model Variability (MV) = \frac{1}{m} \sum_{j=1}^{m} \left| H^{M_j}_{P} - \bar{H}_{P} \right| \quad (6)
+where H^{M}_{P_i} is the hallucination rate for prompt P_i on model M, and H^{M_j}_{P} is the hallucination rate for a fixed prompt P on model M_j. The means \bar{H}_{M} and \bar{H}_{P} denote average hallucination rates across prompts and models, respectively. We also propose a metric to quantify prompt-model interaction effects specifically for LLM hallucinations:
+Joint Attribution Score (JAS) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\left( H^{M_j}_{P_i} - \bar{H}_{M_j} \right) \left( H^{M_j}_{P_i} - \bar{H}_{P_i} \right)}{\sigma_P \sigma_M}, \quad (7)
+where σP and σM are the standard deviations of hallucination rates across all prompts and all models, respectively. JAS is effectively the (normalized) covariance between prompt-specific and model-specific deviations in hallucination rate. A positive JAS indicates that certain prompt-model combinations amplify hallucinations more than would be expected from prompt or model effects alone (i.e., the prompt and model jointly contribute to the error).
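+The PS, MV, and JAS computations of Eqs. (5)-(7) can be sketched over a matrix of hallucination rates. The numbers below are hypothetical, and the pooling used for σ_P and σ_M is one plausible reading of the definitions, not the authors' implementation:

```python
import numpy as np

# Hypothetical hallucination-rate matrix H[i, j]: rows = prompts P_i, cols = models M_j.
H = np.array([
    [0.38, 0.30, 0.22],   # vague prompt
    [0.25, 0.21, 0.15],   # zero-shot prompt
    [0.18, 0.14, 0.10],   # chain-of-thought prompt
])

H_bar_M = H.mean(axis=0)          # mean rate per model (averaged over prompts)
H_bar_P = H.mean(axis=1)          # mean rate per prompt (averaged over models)

# Eq. (5): PS for each fixed model = mean |H_{P_i}^{M} - Hbar_M| over prompts
PS = np.abs(H - H_bar_M).mean(axis=0)

# Eq. (6): MV for each fixed prompt = mean |H_{P}^{M_j} - Hbar_P| over models
MV = np.abs(H - H_bar_P[:, None]).mean(axis=1)

# Eq. (7): JAS = normalized covariance of prompt- and model-centred deviations.
# Pooling per-row/per-column standard deviations is an assumption on our part.
sigma_P = H.std(axis=1).mean()
sigma_M = H.std(axis=0).mean()
JAS = ((H - H_bar_M) * (H - H_bar_P[:, None])).mean() / (sigma_P * sigma_M)
```

A positive JAS on this toy matrix reflects that the vague prompt and the weakest model jointly push rates above what either marginal effect predicts.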
+TABLE 2 Hallucination attribution scenarios based on PS and MV metrics.
+PS score | MV score | Attribution category
+High | Low | Prompt-dominant
+Low | High | Model-dominant
+High | High | Mixed-origin
+Low | Low | Unclassified (stochastic/noise)
+4.4 Prompt variation protocol
+To measure Prompt Sensitivity, we evaluate each model on multiple variants of the prompts. We systematically vary prompts along three axes:
+• Format: e.g., declarative vs. interrogative vs. instruction-style phrasing.
+• Structure: e.g., a straightforward query vs. a step-by-step Chain-of-Thought (CoT) prompt; zero-shot vs. few-shot contexts; inclusion of relevant context or not.
+• Specificity: vague, open-ended wording vs. explicitly detailed prompts.
+This controlled prompt variation allows us to observe whether a hallucination persists or disappears when the prompt is clarified or restructured. If a hallucinated answer disappears once the question is asked more explicitly or via CoT, it suggests the cause was prompt-related. Conversely, if the hallucination persists across all prompt variants, the cause likely lies in the model’s internal behavior.
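+This decision rule can be illustrated with a minimal sketch, assuming boolean hallucination flags per prompt variant; the variant names and thresholding convention are hypothetical:

```python
def diagnose(halluc_by_variant: dict[str, bool]) -> str:
    """Heuristic reading of the prompt-variation protocol: a hallucination
    that persists across every variant looks model-intrinsic, while one that
    vanishes once the prompt is clarified (e.g., CoT) looks prompt-related."""
    if all(halluc_by_variant.values()):
        return "model-intrinsic"
    if halluc_by_variant.get("vague", False) and not halluc_by_variant.get("cot", True):
        return "prompt-related"
    return "inconclusive"

print(diagnose({"vague": True, "zero_shot": True, "cot": False}))  # prompt-related
print(diagnose({"vague": True, "zero_shot": True, "cot": True}))   # model-intrinsic
```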
+4.5 Model control protocol
+To control for model behavior, we fix prompt design and evaluate hallucination occurrence across diverse models (e.g., GPT-4, LLaMA 2, DeepSeek, Gwen). The intuition is that consistent hallucinations across models suggest prompt-induced errors, while divergent hallucination patterns imply architecture-specific behaviors or training artifacts.
+4.6 Attribution categories
+Using PS and MV scores, hallucinations can be categorized as:
+• Prompt-dominant: high PS, low MV.
+• Model-dominant: low PS, high MV.
+• Mixed-origin: high PS and MV.
+• Unclassified (noise): low PS and MV (e.g., stochastic sampling artifacts).
+Table 2 summarizes this taxonomy.
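+The quadrant assignment with median cutoffs (as in Figure 1) can be sketched as follows; the per-model scores are hypothetical, and the ≥-at-the-median tie-breaking is our own assumption:

```python
import statistics

def attribute(ps: float, mv: float, ps_cut: float, mv_cut: float) -> str:
    """Assign a hallucination-attribution category from PS/MV scores
    using the quadrant scheme of Table 2."""
    if ps >= ps_cut and mv < mv_cut:
        return "prompt-dominant"
    if ps < ps_cut and mv >= mv_cut:
        return "model-dominant"
    if ps >= ps_cut and mv >= mv_cut:
        return "mixed-origin"
    return "unclassified"

# Hypothetical per-model (PS, MV) scores; cutoffs are the medians of each
# distribution, mirroring the PS = 0.079 / MV = 0.057 cutoffs of Figure 1.
scores = {"model_a": (0.15, 0.04), "model_b": (0.05, 0.14), "model_c": (0.12, 0.11)}
ps_cut = statistics.median(ps for ps, _ in scores.values())
mv_cut = statistics.median(mv for _, mv in scores.values())
labels = {m: attribute(ps, mv, ps_cut, mv_cut) for m, (ps, mv) in scores.items()}
```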
+4.7 Advantages of the framework
+Our attribution framework provides a systematic approach to hallucination analysis, with several advantages:
+• It enables clear, diagnostic reasoning about the source of each hallucination (prompt vs. model).
+• It is scalable across different models and domains, and can incorporate standard benchmarks for generality.
+• It facilitates reproducible experiments by defining concrete protocols for prompt variation and model comparison.
+• It provides interpretable quantitative scores (PS, MV, JAS) that can be used for benchmarking and tracking improvements.
+4.8 Formalization of attribution framework
+Earlier sections introduced the basic framework of hallucination attribution in Large Language Models (LLMs). Here, we extend our analysis through a rigorous probabilistic formulation grounded in Bayesian inference and decision theory (Berger, 2013; Gelman et al., 2013). Such formalization enables a more precise dissection of hallucination phenomena by systematically quantifying the contributions of prompting strategies and intrinsic model behaviors. Formally, hallucination events can be represented probabilistically as random events. Let H denote hallucination occurrence conditioned upon prompting strategy P and model characteristics M:
+P(H|P, M) = \frac{P(P, M|H)\, P(H)}{P(P, M)}. \quad (8)
+Here, P(P, M|H) is the likelihood of observing prompt and model characteristics given a hallucination, P(H) is the base rate of hallucination, and P(P, M) is the joint prior over prompts and models. Due to practical complexities, assumptions such as conditional independence can simplify the analysis (Pearl, 1988):
+P(H|P, M) ≈ P(H|P)P(H|M). (9)
+Yet, realistic scenarios typically involve interactions between prompt and model attributes. Thus, we propose a mixed-effects probabilistic model incorporating explicit interaction terms:
+P(H|P, M) = αP(H|P) + βP(H|M) + γ P(H|P, M), (10)
+where α, β, γ are parameters empirically calibrated from experimental data (Gelman et al., 2013). Higher γ values signify significant joint prompt-model effects, indicating mixed-origin hallucinations.
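+One way to calibrate α, β, γ empirically is ordinary least squares over observed pair-level rates. This is a sketch under stated assumptions: the interaction term is operationalized here as the product P(H|P)·P(H|M), which is our own choice, and all rates are synthetic:

```python
import numpy as np

# Hypothetical marginal hallucination rates: hP per prompt style, hM per model.
hP = np.array([0.35, 0.22, 0.15])            # P(H|P) for three prompt styles
hM = np.array([0.30, 0.18])                  # P(H|M) for two models
P, M = np.meshgrid(hP, hM, indexing="ij")
h_obs = 0.4 * P + 0.5 * M + 0.6 * P * M      # synthetic data with known coefficients

# Calibrate alpha, beta, gamma of Eq. (10) by least squares, with the
# interaction term approximated as P(H|P) * P(H|M).
X = np.column_stack([P.ravel(), M.ravel(), (P * M).ravel()])
coef, *_ = np.linalg.lstsq(X, h_obs.ravel(), rcond=None)
alpha, beta, gamma = coef
```

Because the synthetic rates were generated exactly from the assumed form, the fit recovers the planted coefficients; on real data the residual would quantify how much Eq. (10) fails to explain.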
+4.9 Probabilistic metrics for hallucination attribution
+We introduce refined metrics derived from probabilistic reasoning to quantify hallucinations rigorously and aid systematic attribution.
+4.9.1 Conditional prompt sensitivity (CPS)
+Conditional prompt sensitivity (CPS) quantifies prompt-induced variability across models, refining earlier definitions of prompt sensitivity:
+CPS = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| H^{M_j}_{P_i} - \bar{H}_{M_j} \right|, \quad (11)
+where H^{M_j}_{P_i} is the hallucination rate for prompt variant P_i under model M_j, and \bar{H}_{M_j} is the average hallucination rate for model M_j. CPS values directly measure how hallucinations depend on prompt specificity across models.
+4.9.2 Conditional model variability (CMV)
+Analogously, conditional model variability (CMV) isolates intrinsic model effects given consistent prompts:
+CMV = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| H^{P_i}_{M_j} - \bar{H}_{P_i} \right|, \quad (12)
+with H^{P_i}_{M_j} the hallucination rate for model M_j given prompt P_i, and \bar{H}_{P_i} the mean hallucination rate across models for prompt P_i.
+4.9.3 Joint attribution score (JAS)
+Joint attribution score (JAS) explicitly quantifies interactive effects between prompts and models (Berger, 2013):
+JAS = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\left( H^{M_j}_{P_i} - \bar{H}_{M_j} \right) \left( H^{P_i}_{M_j} - \bar{H}_{P_i} \right)}{\sigma_P \sigma_M}, \quad (13)
+where σP, σM denote standard deviations of hallucination rates across prompts and models, respectively. Positive JAS scores indicate joint amplification of hallucinations by prompts and models, capturing interaction effects that have not previously been quantified in the literature.
+4.10 Illustrative experimental application
+Experimental evaluations employing benchmarks like TruthfulQA and HallucinationEval clearly highlight differences among LLaMA 2, DeepSeek, and GPT-4:
+• CPS analysis revealed significantly higher values for vaguely specified prompts (0.15 for LLaMA 2), reflecting enhanced susceptibility to prompt-induced hallucinations. Structured prompting like Chain-of-Thought significantly reduced CPS (0.06), underscoring the practical benefits of structured prompt engineering (Zhou et al., 2022).
+• CMV values showed distinct model behaviors; DeepSeek demonstrated the highest CMV (0.14), reflecting intrinsic model biases, while GPT-4 maintained notably lower CMV
+(0.08), consistent with better internal factual grounding (OpenAI, 2023b). • JAS revealed critical insights, with high JAS values (0.12) for LLaMA 2 under ambiguous prompts, indicating the compounded hallucination risks arising from interactions between unclear prompts and intrinsic model limitations.
+Such insights derived from experimental CPS, CMV, and JAS metrics are invaluable for precise, tailored mitigation efforts.
+Our attribution metrics align with established benchmarks: models with higher PS/MV generally fared worse on factuality benchmarks like TruthfulQA (Lin et al., 2022) and HallucinationEval (Wu et al., 2023), while models with low MV (e.g., GPT-4) achieved better TruthfulQA scores. This indicates that PS and MV capture aspects of hallucination propensity that correspond to real-world factual accuracy measures, providing a complementary, fine-grained diagnostic beyond the aggregate benchmark scores.
+4.11 Bayesian hierarchical modeling for robust attribution
+To robustly quantify uncertainty and variability in hallucination attribution, we apply Bayesian hierarchical modeling (BHM). BHM represents hallucination rates hierarchically with model-specific and prompt-specific parameters drawn from higher-level distributions (Gelman et al., 2013):
+H_{ij} \sim \mathrm{Beta}(\mu_{ij}\tau, (1 - \mu_{ij})\tau), \quad \mu_{ij} = \mathrm{logit}^{-1}(\alpha_i + \beta_j + \gamma_{ij}), \quad (14)
+where H_{ij} is the hallucination rate for model i under prompt j, α_i and β_j represent model-specific and prompt-specific effects, and γ_{ij} captures interaction effects. Bayesian inference via Markov Chain Monte Carlo (MCMC) sampling yields credible intervals and posterior distributions, enhancing analytical transparency and rigor in attribution analysis. To our knowledge, this is the first application of Bayesian hierarchical modeling to LLM hallucination analysis. This formalism is novel and brings statistical rigor by modeling prompt and model factors as random effects.
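+A forward simulation of the hierarchy in Eq. (14) can be sketched as follows. The effect sizes and τ are hypothetical; actual posterior inference would use an MCMC library (e.g., PyMC or Stan) rather than this generative sketch:

```python
import numpy as np

def inv_logit(x):
    # logit^{-1}: maps real-valued effects to a rate in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)

# Hypothetical effects: alpha_i (two models), beta_j (two prompt styles),
# gamma_ij (interactions; one amplifying prompt-model pair).
alpha = np.array([-1.8, -1.2])
beta = np.array([0.0, 0.6])
gamma = np.array([[0.0, 0.3], [0.0, 0.0]])
tau = 50.0  # Beta concentration: larger tau -> rates cluster around mu_ij

mu = inv_logit(alpha[:, None] + beta[None, :] + gamma)  # mean structure of Eq. (14)
H = rng.beta(mu * tau, (1.0 - mu) * tau)                # simulated rates H_ij
```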
+4.12 Implications and advanced applications
+The advanced formulation of our attribution framework offers significant theoretical and practical implications. Practically, precise quantification of hallucination attribution informs targeted interventions: structured prompt engineering in high-CPS conditions, model fine-tuning for high-CMV scenarios, and combined interventions for elevated JAS conditions. Theoretically, integrating Bayesian modeling into attribution analyses provides rigorous, interpretable frameworks for future LLM hallucination research, enhancing model reliability and practical deployment in critical applications.
+This probabilistic, experimentally grounded approach provides robust analytical tools essential for ongoing research and deployment of trustworthy LLM systems. In the next section, we apply this framework to a set of controlled experiments using state-of-the-art LLMs and benchmark datasets.
+5 Experimental design and setup
+To ensure reproducibility and cost-efficiency, all experiments were conducted using open-source Large Language Models (LLMs), publicly available datasets, and free inference and evaluation tools. This section describes the design of the experiment in detail, including model selection, dataset choices, prompt engineering protocols, inference pipeline, and evaluation metrics, along with concrete examples. Unlike prior evaluations, we use a controlled multi-factor experiment explicitly aimed at attributing causes, by varying prompts systematically across models.
+5.1 Model selection
+We selected high-quality open-source LLMs with varied sizes, architectures, and instruction-tuning characteristics to represent a diverse evaluation space:
+• LLaMA 2 (13B) (Touvron et al., 2023): Meta AI’s transformer-based model fine-tuned for dialogue tasks.
+• Mistral 7B Instruct: a smaller but instruction-tuned model with strong reasoning capabilities.
+• DeepSeek 67B (DeepSeek AI, 2023): a multilingual model trained on code and web data, known for factual generation.
+• OpenChat-3.5 (Openchat Team, 2023): a community-finetuned model derived from LLaMA with RLHF-style behavior.
+• Gwen (community model): an open-access research model emphasizing retrieval-enhanced factual generation.
+All models were deployed via HuggingFace’s transformers library.
+5.2 Benchmark datasets
+We selected benchmarks covering different task types (question answering, summarization, general generation) to probe hallucination under varying conditions.
+• TruthfulQA (Lin et al., 2022): a multiple-choice QA dataset crafted to test whether models reproduce common human misconceptions or produce false information. Example: Question: “What happens if you eat a polar bear liver?” Gold answer: “You may get vitamin A poisoning.” Model hallucination: “It is a rich source of protein and perfectly safe.”
+• HallucinationEval (Wu et al., 2023): covers multiple domains and explicitly labeled hallucinations in generations, providing granular annotation categories (factual error, fabrication, etc.).
+• QAFactEval (Fabbri et al., 2022): a fact-evaluation benchmark that uses QA pairs to assess whether model outputs contain the same facts as the input source. Example: Source: “Albert Einstein was born in 1879 in Ulm, Germany.” Summary: “Einstein was born in Berlin in 1879.” ⇒ Inconsistent fact.
+• CohS (Kazemi et al., 2023): focused on summarization hallucination, with annotations distinguishing intrinsic vs. extrinsic hallucination.
+All datasets were accessed via HuggingFace Datasets Hub or official GitHub repositories.
+5.3 Prompt engineering protocol
+To evaluate the influence of prompt structure on hallucination generation, we designed five prompt categories for each task instance:
+• Zero-shot prompt: a basic instruction without examples. Example: “Answer the following question: What is the capital of Switzerland?”
+• Few-shot prompt: includes 2–3 input-output examples before the test input. Example: Q1: What is the capital of Germany? A1: Berlin. Q2: What is the capital of Italy? A2: Rome. Q3: What is the capital of Switzerland? A3: (model output)
+• Instruction prompt: uses structured natural language to clarify task expectations. Example: “You are a helpful assistant. Given a question, respond with a concise and factually correct answer.”
+• Chain-of-Thought (CoT) prompt: encourages step-by-step reasoning before answering. Example: “Think step-by-step: What country is Zurich in? Zurich is in Switzerland. What is the capital of Switzerland? The capital is Bern.”
+• Vague or misleading prompt: intentionally unclear to test hallucination resilience. Example: “Can you tell me more about the Swiss capital, which I think is Geneva?”
+Each prompt variant was applied uniformly to all models per dataset sample, enabling precise attribution of hallucination sensitivity to prompting.
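+The five categories can be instantiated as simple templates. The wording below is illustrative only, not the exact prompt text used in the experiments:

```python
# Hypothetical templates for the five prompt categories of Section 5.3.
FEW_SHOT_EXAMPLES = (
    "Q: What is the capital of Germany?\nA: Berlin\n"
    "Q: What is the capital of Italy?\nA: Rome\n"
)

def build_variants(question: str) -> dict[str, str]:
    """Render one dataset question into all five prompt variants."""
    return {
        "zero_shot": f"Answer the following question: {question}",
        "few_shot": f"{FEW_SHOT_EXAMPLES}Q: {question}\nA:",
        "instruction": (
            "You are a helpful assistant. Given a question, respond with a "
            f"concise and factually correct answer.\nQuestion: {question}"
        ),
        "cot": f"Think step-by-step, then answer: {question}",
        "vague": f"Can you tell me more about this? {question}",
    }

variants = build_variants("What is the capital of Switzerland?")
```

Applying the same `build_variants` output uniformly to every model is what allows PS to be attributed to prompting rather than to wording drift between runs.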
+5.4 Operational definition of vague vs. specific prompts
+We make the notion of “vague” vs. “specific” prompts operational and reproducible by (i) publishing concrete prompt pairs for each task family and (ii) introducing a Clarity Checklist with a quantitative Prompt Clarity Score (PCS) used in all experiments (Zhou et al., 2022).
+5.4.1 Clarity checklist (objective items)
+A prompt receives one point per satisfied item (binary, 0/1). Items are phrased to be model-agnostic and dataset-agnostic.
+1. Role specified (e.g., “You are a fact-checking assistant”).
+2. Task & output format specified (schema, bullet/JSON/table; max length).
+3. Units/numeric ranges (e.g., “give probabilities in [0,1] with 2 decimals”).
+4. Time/version constraints (cutoff date, statute/version, model date).
+5. Information source policy (closed-book vs. RAG citations; how to cite).
+6. Ambiguity control (forbid speculation; define unknown/abstain behavior).
+5.4.2 Prompt clarity score (PCS)
+Let c_k ∈ {0, 1} indicate satisfaction of checklist item k ∈ {1, . . . , 6}. We define
+PCS = \sum_{k=1}^{6} c_k, \qquad CI = \frac{PCS}{6} \in [0, 1].
+Categories used in the paper:
+• Vague: CI < 0.5 (PCS ≤ 2).
+• Specific: CI ≥ 0.8 (PCS ≥ 5).
+• Intermediate: otherwise (reported but not used as a treatment group).
+These thresholds make the boundary objective, reproducible, and robust (medians used elsewhere in the paper follow the same principle for PS/MV).
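+The checklist scoring and thresholds map directly to code; the six item names below are shorthand labels we chose for the checklist items, not identifiers from the paper:

```python
CHECKLIST = ("role", "task_format", "units", "time_version", "source_policy", "ambiguity")

def prompt_clarity(checks: dict[str, bool]) -> tuple[int, float, str]:
    """Score a prompt against the six-item Clarity Checklist and classify it
    with the CI thresholds of Section 5.4.2."""
    assert set(checks) == set(CHECKLIST), "one boolean per checklist item"
    pcs = sum(checks.values())  # PCS: one point per satisfied item
    ci = pcs / 6                # Clarity Index in [0, 1]
    if ci < 0.5:                # equivalent to PCS <= 2
        label = "vague"
    elif ci >= 0.8:             # equivalent to PCS >= 5
        label = "specific"
    else:
        label = "intermediate"
    return pcs, ci, label

print(prompt_clarity(dict.fromkeys(CHECKLIST, True)))  # (6, 1.0, 'specific')
```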
+5.5 Inference pipeline
+Inference was performed using open-source tools:
+• Library: HuggingFace transformers + text-generation pipeline
+• Environment: Google Colab Pro (T4/A100), Kaggle GPU notebooks, local 8× A6000 GPU server with 48 GB VRAM per GPU.
+• Sampling parameters: temperature = 0.7, top-p = 0.9, max tokens = 512.
+All runs were script-automated to maintain reproducibility across model runs and prompt variants.
+5.6 Evaluation metrics
+We employed both automatic scoring tools and manual review:
+• QAFactEval: open-source QA-style factual consistency evaluation.
+• Hallucination rate (HR): percentage of generations with factual/logical errors.
+• Prompt sensitivity (PS): degree of hallucination variation across prompt types.
+FIGURE 2
+Free experimental pipeline: dataset → prompting → model generation → evaluation → attribution analysis.
+• Model variability (MV): variation in hallucination frequency across models for the same prompt (Table 3).
+5.7 Human evaluation protocol (optional)
+To supplement automatic evaluation, expert annotators rated a 100-sample subset using a 3-point hallucination severity scale:
+• 0: factual and consistent.
+• 1: minor factual errors.
+• 2: major hallucination or fabrication.
+Inter-rater agreement was assessed using Krippendorff’s Alpha.
+5.8 Experimental pipeline overview
+The experimental pipeline (Figure 2) systematically evaluates hallucinations in open-source LLMs, integrating benchmark datasets, varied prompt strategies (zero-shot, few-shot, CoT), and text generation via HuggingFace. It uses evaluation tools (QAFactEval, hallucination rate) to compute attribution metrics (PS, MV), facilitating a comparative analysis to clearly identify prompt-induced vs. model-intrinsic hallucinations.
+6 Results and comparative analysis
+This section presents the outcomes of our empirical analysis using the attribution-based evaluation framework. We provide both quantitative and qualitative assessments of hallucination behavior across multiple prompt variants and open-source LLMs. Our analysis includes hallucination rates, attribution scores (Prompt Sensitivity and Model Variability), and comparative performance across datasets and prompt types.
+6.1 Overall hallucination rates by model
+The overall scores are shown in Table 4.
+6.2 Prompt-type impact on hallucination
+Figure 3 compares hallucination rates across prompt strategies, demonstrating that vague prompts yield the highest hallucinations (38.3%), while Chain-of-Thought (CoT) prompts significantly reduce hallucinations (18.1%). This highlights the crucial role of prompt clarity in minimizing hallucination occurrence, underscoring CoT as the most effective approach across evaluated LLMs.
+6.3 Prompt sensitivity (PS) and model variability (MV)
+The comparison of prompt sensitivity and model variability is shown in Table 5.
+6.4 Qualitative examples of hallucination
+Examples are shown in Table 6.
+6.5 Radar plot of model behavior
+The radar plot in Figure 4 visualizes the comparative performance of three language models (DeepSeek, Mistral, and LLaMA 2) across five key hallucination-related behavioral dimensions: Factuality, Coherence, Prompt Sensitivity, Model Variability, and Usability.
+Frontiers in Artificial Intelligence 10 frontiersin.org
+
+
+Anh-Hoang et al. 10.3389/frai.2025.1622292
+TABLE 3 Concrete prompt pairs used to operationalize “vague” vs. “specific.”
+Task Vague prompt (PCS ≤ 2) Specific prompt (PCS ≥ 5)
+Factual QA “Tell me about the Swiss capital.” Role: fact-checking assistant. Task/format: “Answer the question with a single city name in JSON: {“answer”: “”}.” Units/range: N/A. Time: knowledge cutoff 2023–12. Sources: closed-book; if unsure, output “answer”: “UNKNOWN”. Prompt: “What is the capital of Switzerland?”
+Summarization “Summarize this.” Role: scientific editor. Format: bullet list (max 5 items); each bullet ≤ 20 words. Units: include years, % where applicable. Time: refer to the paper’s publication year. Sources: use only provided passage. Ambiguity: if missing info, add a bullet “Limitations: <...>.”
+Reasoning (math) “Solve this: distance?” Role: math tutor. Format: JSON with steps:[...], answer:. Units: meters; 2 decimals. Time: N/A. Sources: derive from given numbers only. Ambiguity: if insufficient data, set answer:null and explain in steps.
+Legal QA “Is this clause valid?” Role: legal analyst (not legal advice). Format: {“answer”: Yes/No, “rule”: statute/case, “explanation”: ≤60 words}. Time: jurisdiction=US; law version ≤2023–12. Sources: cite statute/section; no web. Ambiguity: if unclear, “answer”: “UNCERTAIN.”
+Each specific prompt satisfies all six checklist items; vague prompts intentionally fail ≥ 4.
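+A minimal sketch of how this six-item checklist could be scored; the field names are our own shorthand for the checklist items, and the example specs paraphrase the Factual QA row of Table 3.
+
+```python
+# Checklist items implied by Table 3: role, task/format, units/range,
+# time, sources, and ambiguity handling (names are our own shorthand).
+CHECKLIST = ("role", "task_format", "units_range", "time", "sources", "ambiguity")
+
+def prompt_clarity_score(spec):
+    """PCS = number of checklist items the prompt specification fills in (0-6)."""
+    return sum(1 for item in CHECKLIST if spec.get(item))
+
+def clarity_label(pcs):
+    if pcs <= 2:
+        return "vague"
+    if pcs >= 5:
+        return "specific"
+    return "intermediate"
+
+vague = {"task_format": "Tell me about the Swiss capital."}
+specific = {
+    "role": "fact-checking assistant",
+    "task_format": 'single city name in JSON: {"answer": ""}',
+    "time": "knowledge cutoff 2023-12",
+    "sources": "closed-book",
+    "ambiguity": 'if unsure, output {"answer": "UNKNOWN"}',
+}
+```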
+FIGURE 3
+Mean ± SD across 3 seeds × 5 prompt variants; n = 100 examples/model. CoT reduces hallucinations most consistently.
+TABLE 4 Average hallucination rate (%) reported as Mean ± SD across three seeds × five prompt variants; n = 100 examples/model/dataset.
+Model TruthfulQA QAFactEval HallucinationEval Overall HR
+LLaMA 2 (13B) 27.8 (6) 31.4 (7) 34.6 (6) 31.3 (5)
+Mistral 7B 21.0 (4) 26.2 (5) 30.1 (5) 25.8 (10)
+DeepSeek 67B 19.7 (5) 24.9 (4) 25.1 (6) 23.2 (5)
+OpenChat-3.5 25.5 (6) 28.5 (6) 31.2 (5) 28.4 (6)
+Gwen 23.4 (5) 27.1 (6) 29.6 (5) 26.7 (5)
+• Factuality reflects the model’s ability to generate responses that are factually accurate and aligned with the reference ground truth.
+• Coherence measures logical and linguistic consistency within the generated text.
+• Prompt Sensitivity indicates the extent to which a model’s output is influenced by different prompt formulations; higher sensitivity often implies greater prompt-induced hallucination risk.
+• Model Variability captures variation in hallucination behavior across different models for the same prompt type, representing intrinsic model bias or instability.
+• Usability denotes overall generation reliability and practical output quality from a user or system integration perspective.
+The polygonal regions for each model connect their respective normalized scores (on a 0–1 scale). A larger area typically reflects
+TABLE 5 Prompt sensitivity (PS) and model variability (MV) scores (mean ± SD) across three seeds × five prompt variants; n = 100.
+Model PS MV Attribution category
+LLaMA 2 (13B) 0.091 (5) 0.045 (6) Prompt-dominant
+Mistral 7B 0.078 (7) 0.053 (6) Mixed-origin
+DeepSeek 67B 0.060 (6) 0.080 (7) Model-dominant
+OpenChat-3.5 0.083 (7) 0.062 (4) Mixed-origin
+Gwen 0.079 (8) 0.057 (6) Mixed-origin
+TABLE 6 Examples of prompt- vs. model-induced hallucinations.
+Model Prompt type Hallucinated output
+LLaMA 2 Zero-shot “Marie Curie invented penicillin.” (Prompt ambiguity led to fabrication)
+LLaMA 2 CoT “Marie Curie discovered radioactivity with Pierre Curie.” (Corrected)
+DeepSeek Instruction “Pluto is the largest planet in the solar system.” (Model-internal hallucination)
+DeepSeek Few-shot “Pluto is a dwarf planet.” (Corrected with context)
+Mistral Vague “The Eiffel Tower is located in Berlin.” (Factual hallucination)
+Mistral CoT “The Eiffel Tower is in Paris, France.” (Corrected via reasoning)
+stronger performance, while irregular shapes highlight trade-offs in specific dimensions. From the plot:
+• DeepSeek demonstrates superior factuality and coherence, with minimal prompt sensitivity, suggesting hallucinations originate primarily from internal model behavior, aligning with a Model-Dominant attribution.
+• Mistral shows balanced behavior across dimensions, indicating a mixed attribution of hallucination sources.
+• LLaMA 2 exhibits notably high prompt sensitivity, suggesting hallucination is predominantly Prompt-Dominant in origin.
+This radar plot provides an intuitive and comparative visualization to support attribution categorization and guide future model selection or prompting strategies.
+6.6 Attribution quadrants
+Figures 5–7 show the distributions of Prompt Sensitivity and Model Variability, along with the attribution quadrants derived from these scores.
+Why scores change little across prompting techniques. Across these tasks, the prompt variants are semantically equivalent and decoding is low-entropy, so outputs are dominated by the models’ learned alignment policies. Because the examined models exhibit low PS and stable MV, switching among Zero-shot, Few-shot, CoT, and Instruction formats rarely moves them out of their default response behavior, producing only small HR differences.
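+The quadrant assignment itself reduces to two threshold tests. In the sketch below, the cutoffs come from the figure captions; mapping high-PS/high-MV to Mixed-origin and low-PS/low-MV to Unclassified is our reading of Figure 6, and models sitting exactly on a median (e.g., Gwen) would need an explicit tie-breaking rule.
+
+```python
+PS_CUT, MV_CUT = 0.079, 0.057  # median cutoffs reported in the figure captions
+
+def attribute(ps, mv, ps_cut=PS_CUT, mv_cut=MV_CUT):
+    """Assign an attribution category from Prompt Sensitivity and Model Variability."""
+    high_ps, high_mv = ps > ps_cut, mv > mv_cut
+    if high_ps and not high_mv:
+        return "Prompt-dominant"
+    if high_mv and not high_ps:
+        return "Model-dominant"
+    if high_ps and high_mv:
+        return "Mixed-origin"
+    return "Unclassified"
+```
+
+Applied to Table 5, LLaMA 2 (0.091, 0.045) lands in Prompt-dominant and DeepSeek 67B (0.060, 0.080) in Model-dominant, matching the reported categories.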
+6.7 Detailed experimental examples and comparative analysis
+To more deeply illustrate and clarify the differences in hallucination patterns among evaluated models, we conducted extensive qualitative and quantitative analyses on various benchmark tasks. Specifically, we present detailed experimental examples drawn from three primary datasets: TruthfulQA, HallucinationEval, and QAFactEval, analyzing differences among LLaMA 2, DeepSeek, and GPT-4.
+6.7.1 TruthfulQA experimental examples
+TruthfulQA specifically evaluates models’ susceptibility to factual hallucinations by presenting questions designed to provoke common misconceptions. Consider the question: “If you swallow chewing gum, how long does it take to digest?” The correct scientific answer is that chewing gum is not digested and typically passes through the digestive system within a few days. We observed distinct hallucination patterns:
+• LLaMA 2: frequently hallucinated, responding incorrectly with “Chewing gum stays in your stomach for exactly seven years,” reflecting popular misconceptions. This indicates prompt-induced hallucination due to relying on statistical patterns rather than factual grounding.
+• DeepSeek: correctly answered that gum passes through within a few days, showcasing better internal factual consistency, attributed to more robust pretraining data and factual grounding methods.
+• GPT-4: often avoided factual hallucination through nuanced, cautious phrasing such as “It is generally believed, incorrectly, that gum stays seven years; actually, it passes through within days,” demonstrating effective mitigation strategies likely derived from reinforcement learning from human feedback.
+Quantitative analysis indicated a hallucination rate reduction of roughly 15% for GPT-4 compared to LLaMA 2 on this benchmark, demonstrating substantial differences arising from both model architecture and fine-tuning techniques.
+6.7.2 HallucinationEval experimental examples
+HallucinationEval provides explicit labels for intrinsic, extrinsic, factual, and logical hallucinations. For example, given the summarization task: “Summarize the biography of Marie Curie”, we observed these outputs:
+• Intrinsic hallucination (LLaMA 2): generated an incorrect statement: “Marie Curie was awarded the Nobel Prize three
+FIGURE 4
+Radar plot using polaraxis. Axes show Factuality, Coherence, Prompt Sensitivity, Model Variability, and Usability. Radial scale is normalized (0–1; higher is better). Semi-transparent fills reveal overlaps; dotted crosshairs aid reading.
+FIGURE 5
+Distributions of Prompt Sensitivity (PS) and Model Variability (MV). Vertical dashed lines indicate median cutoffs (PS = 0.079, MV = 0.057), which are used to define “low” vs. “high” thresholds in the attribution quadrants (Figure 1). This ensures that quadrant categorization is aligned with the actual distributions of PS and MV across evaluated models.
+times,” directly contradicting the input biography that specifies two Nobel Prizes. Here, the hallucination clearly arose from intrinsic probabilistic confusion within the model, not related to prompt clarity.
+• Extrinsic hallucination (DeepSeek): provided additional ungrounded details: “Marie Curie was also known for inventing modern radiation therapy techniques,” information not
+supported by the provided input text or historical evidence, suggesting a tendency toward speculative extrapolation beyond prompt boundaries.
+• Factual consistency (GPT-4): generated an accurate summary: “Marie Curie was a physicist and chemist, awarded two Nobel Prizes in physics and chemistry, known for her work on radioactivity,” closely matching the factual biography
+provided and demonstrating superior grounding in verified knowledge sources.
+FIGURE 6
+Attribution quadrants based on Prompt Sensitivity (PS) and Model Variability (MV). Vertical and horizontal dashed lines represent median cutoffs (PS = 0.079, MV = 0.057). Models are positioned by their measured PS and MV scores. Quadrant boundaries define attribution categories (Prompt-dominant, Model-dominant, Mixed-origin, Unclassified), consistent with Table 4.
+Statistical measures from this dataset revealed significantly lower factual and intrinsic hallucination rates for GPT-4 (under 10%) compared to approximately 25%–30% for LLaMA 2 and DeepSeek. Such empirical differences highlight GPT-4’s effectiveness in internalizing fact verification mechanisms during training.
+6.7.3 QAFactEval experimental analysis
+In QAFactEval, the task is to assess factual consistency between input context and generated answers. An example provided is: “Who wrote ‘Romeo and Juliet’?” The correct factual response is “William Shakespeare.” Our experiments illustrated differences clearly:
+• LLaMA 2: occasionally produced incorrect answers such as “Charles Dickens wrote ‘Romeo and Juliet’,” indicating significant factual hallucination risks. Detailed analysis revealed a higher susceptibility to memorized but contextually misaligned data.
+TABLE 7 Aggregated hallucination rates (%) across evaluated datasets.
+Model TruthfulQA HallucinationEval QAFactEval
+LLaMA 2 31.2 27.6 24.8
+DeepSeek 22.5 21.4 20.1
+GPT-4 14.3 9.8 4.7
+FIGURE 7
+Prompt sensitivity–model variability facets with shared HR encoding. Each panel corresponds to a prompting technique (Zero-shot, Few-shot, CoT, Instruction, Vague). Model locations are fixed across panels by Prompt Sensitivity (PS, x-axis) and Model Variability (MV, y-axis); dashed lines mark the median cutoffs (PS = 0.079, MV = 0.057). Hallucination Rate (HR, %) is encoded by marker size (size ∝ HR) and a shared colorbar (identical range in all panels). Values are Mean ± SD over three seeds × five prompt variants (n =100 per model). The near-overlap of colors/sizes across panels indicates that prompt family only weakly perturbs HR relative to model-intrinsic PS/MV.
+• DeepSeek: produced correct answers but occasionally added unnecessary, extrinsically hallucinated context, e.g., “William Shakespeare wrote ‘Romeo and Juliet’ in collaboration with other playwrights,” introducing factually unsupported statements.
+• GPT-4: consistently provided precise, factually grounded answers without extraneous context, e.g., simply “William Shakespeare,” indicating superior semantic grounding mechanisms and prompt handling capabilities.
+Across multiple samples, GPT-4 achieved near-perfect factual accuracy, maintaining a hallucination rate below 5%, while LLaMA 2 and DeepSeek exhibited significantly higher factual hallucination rates around 20%–25%.
+6.7.4 Comparative quantitative summary
+To quantitatively support these qualitative observations, we computed aggregate hallucination rates (HR) across all evaluated models and datasets. The results are summarized in Table 7. These metrics conclusively indicate that GPT-4 significantly outperformed LLaMA 2 and DeepSeek in hallucination robustness, while DeepSeek provided moderate improvements over LLaMA 2, particularly in extrinsic hallucinations. The combined qualitative and quantitative analyses reinforce the conclusion that effective hallucination mitigation demands targeted strategies: prompt engineering improvements, robust factual grounding, and careful model selection based on specific deployment needs and risk tolerance.
+6.8 Summary of key findings
+• Using our framework, we can determine that LLaMA-2’s hallucinations are mostly prompt-driven (high PS, low MV), a distinction prior works did not make; one might simply have noted that LLaMA-2 hallucinated. Here we can say why: it fails when prompts are suboptimal. This kind of insight is enabled by our new metrics. For instance, prior work (Liu et al., 2023) observed that GPT-3.5 hallucinated more than GPT-4 on TruthfulQA; our analysis not only confirms this trend but also quantifies that GPT-4’s lower hallucination rate is more stable across prompts (lower PS) and thus more robust, a nuance that prior analyses did not capture.
+• Chain-of-Thought and Instruction prompts reduce hallucination significantly across all models.
+• The DeepSeek model demonstrates the lowest overall hallucination rate but retains internal factual inconsistencies.
+• Attribution scoring enables effective distinction between prompt-driven and model-intrinsic hallucination.
+• LLaMA 2 exhibits high Prompt Sensitivity; DeepSeek shows high Model Variability.
+7 Discussion and interpretation of findings
+This section synthesizes the results from Section 6, discussing key patterns in hallucination behavior, the impact of prompt engineering, and model-specific trends. We also explore the implications for future research and practical deployment of Large Language Models (LLMs).
+7.1 Attribution insights: prompting vs. model behavior
+Our results demonstrate a clear distinction between prompt-induced and model-intrinsic hallucinations, as quantified by Prompt Sensitivity (PS) and Model Variability (MV):
+• Prompt-dominant models (e.g., LLaMA 2) exhibit high PS, meaning hallucinations fluctuate based on prompt structure. These models can be steered effectively using structured prompting techniques like Chain-of-Thought (CoT).
+• Model-dominant models (e.g., DeepSeek 67B) show low PS but high MV, meaning hallucinations persist regardless of prompt variation, indicating internal knowledge limitations or inference biases.
+• Mixed-origin models (e.g., Mistral 7B, OpenChat-3.5) display moderate PS and MV scores, suggesting both prompt and model factors contribute equally.
+These findings align with prior work showing that instruction tuning and reinforcement learning from human feedback (RLHF) can improve prompt responsiveness but do not eliminate deep-seated model hallucinations (Ouyang et al., 2022; Kadavath et al., 2022).
+7.2 Impact of prompt engineering on hallucination suppression
+Figure 3 in Section 6 shows that CoT prompting consistently reduced hallucinations across all models, supporting prior research (Wei et al., 2022). However, the effectiveness varied:
+• CoT prompting significantly improved factuality in models with high PS (e.g., LLaMA 2, OpenChat-3.5).
+• Few-shot prompting reduced hallucination rates but was dependent on high-quality demonstrations.
+• Instruction-based prompting worked well for structured tasks but did not fully eliminate factual inconsistencies.
+• Vague or misleading prompts induced high hallucination rates across all models, confirming the risk of prompt underspecification.
+• Limits of CoT: While CoT prompting helped in most cases, it was not universally effective. In our analysis, if a model fundamentally lacked knowledge on a query, giving it a step-by-step reasoning prompt sometimes produced a longer but still incorrect answer. In such cases, CoT could even backfire by making the hallucination more elaborate. This suggests CoT fails when the model’s internal knowledge is insufficient or heavily biased, since it may then simply rationalize a falsehood in detail.
+These results highlight that while prompt engineering can mitigate hallucinations, it is not a universal solution, particularly for models with strong internal biases.
+7.3 Model-specific trends and trade-offs
+Based on our radar plot in Figure 4, each model we evaluated displays distinct trade-offs between prompt sensitivity and intrinsic reliability.
+• LLaMA 2 (13B): its high prompt sensitivity means it can be finely controlled via prompts, but also that it is more susceptible to poorly worded questions. It benefits greatly from techniques like CoT prompting, yet one must be cautious, as an ambiguous instruction can easily lead it astray.
+• DeepSeek-67B: this model showed strong internal consistency (it often answers confidently), but when it does hallucinate, the cause is internal: it tended to hallucinate in certain areas regardless of prompt quality. This suggests DeepSeek’s training data or architecture leaves some factual gaps that prompting alone cannot fix.
+• Mistral-7B: this smaller model has a balanced profile: instruction tuning has made it relatively responsive to prompts, but it still needs well-structured prompts to perform optimally. It improved with CoT and few-shot cues, though not to the level of larger models.
+• OpenChat-3.5 and Gwen: these models exhibit mixed-origin behavior; they are reasonably good with straightforward prompts but can still hallucinate if either the prompt is tricky or the query hits a weakness of the model. They would likely benefit from both improved prompts and further model fine-tuning.
+These insights suggest that a model’s architecture and training play a significant role in its hallucination tendencies. For example, models with extensive RLHF (like OpenAI’s GPT-4) are known to be more resistant to adversarial prompts, whereas purely open-source models without such fine-tuning might need additional help from prompts or external tools to stay factual.
+7.4 Implications for practical deployment
+Our findings have direct implications for deploying LLMs in high-stakes environments:
+• For end-users: using structured, explicit prompts minimizes hallucination risks.
+• For developers: selecting models based on attribution patterns (PS vs. MV) can inform fine-tuning strategies.
+• For researchers: benchmarking with attribution-aware metrics can improve hallucination mitigation techniques.
+7.5 Challenges and limitations of the proposed approach
+Despite our rigorous methodology, several limitations remain:
+• Model scaling: larger models were not tested due to resource constraints, though their hallucination trends may differ.
+• Domain specificity: our evaluation focused on general-purpose tasks; domain-specific hallucination behavior (e.g., medical, legal) warrants further study.
+• Long-form generation: experiments focused on short-to-medium-length responses, but hallucinations may behave differently in long-form content.
+• Model scope: our experiments focused on high-quality open-source models up to 67B parameters. We did not evaluate larger closed-source models (e.g., Anthropic’s Claude or OpenAI’s GPT-4), which tend to have undergone extensive fine-tuning and might exhibit different hallucination profiles. As a result, our findings may not fully generalize to those systems. For instance, GPT-4 is reported to hallucinate less frequently than smaller models (OpenAI, 2023a), so the balance of prompt- vs. model-induced hallucinations could shift in such models. A broader evaluation including these models is left for future work.
+Future work should explore grounding techniques such as retrieval-augmented generation (RAG) (Lewis et al., 2020) and hybrid models combining symbolic reasoning with LLMs.
+7.6 Key takeaways
+• Hallucinations arise from both prompt-dependent and model-intrinsic factors, necessitating tailored mitigation approaches.
+• Prompt engineering, especially CoT, reduces hallucination but is not universally effective.
+• Attribution-based metrics (PS and MV) provide a novel way to classify and address hallucination sources.
+• Open-source models offer competitive factuality but require structured input to minimize errors.
+These findings set the stage for refining hallucination attribution frameworks and developing more robust evaluation methodologies.
+8 Mitigation strategies and advances
+Having identified the dual nature of hallucinations, arising from both prompt design and intrinsic model behavior, this section explores existing and emerging approaches to mitigate hallucinations in Large Language Models (LLMs). Mitigation strategies can be broadly divided into two categories: prompt-based interventions and model-based architectural or training improvements.
+8.1 Prompt-based mitigation techniques
+Prompt engineering is a cost-effective, model-agnostic approach to reduce hallucinations at inference time without altering the underlying model. Our experiments in Sections 5 and 6 confirm that improved prompt structure significantly reduces hallucination rates, particularly in prompt-sensitive models.
+• Chain-of-thought (CoT) prompting: encourages reasoning steps before providing a final answer, reducing factual inconsistencies by structuring generation (Wei et al., 2022). This method was particularly effective for LLaMA 2 and OpenChat-3.5 in our experiments.
+• Instruction-based prompting: clearly structured task descriptions reduce ambiguity, guiding the model toward factual output. Models like Mistral benefited significantly from such prompting strategies.
+• Prompt calibration: adjusting system instructions or preambles to establish context (e.g., “Only provide verifiable facts...”) has been shown to reduce speculative responses.
+• Negative prompting: explicitly instructing the model to avoid hallucination (e.g., “Do not include any information not present in the input text.”) can reduce fabrication in summarization and QA tasks.
+• Prompt filtering pipelines: pre-screening prompts using heuristic or learned classifiers to assess likelihood of inducing hallucinations is an emerging method for real-time mitigation.
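+These techniques compose naturally into a single template. The sketch below is hypothetical: the wording of the preamble, negative instruction, and CoT cue is illustrative, not the exact phrasing used in our experiments.
+
+```python
+def build_prompt(question, context=None, cot=True, negative=True, preamble=True):
+    """Compose calibration preamble + negative instruction + CoT cue (all optional)."""
+    parts = []
+    if preamble:
+        parts.append("Only provide verifiable facts; if unsure, answer UNKNOWN.")
+    if negative and context is not None:
+        parts.append("Do not include any information not present in the input text.")
+    if context is not None:
+        parts.append("Input text:\n" + context)
+    parts.append("Question: " + question)
+    if cot:
+        parts.append("Think step by step, then state the final answer on its own line.")
+    return "\n\n".join(parts)
+
+prompt = build_prompt(
+    "What is the capital of Switzerland?",
+    context="Bern is the de facto capital of Switzerland.",
+)
+```
+
+Each flag can be ablated independently, which is convenient for measuring how much each mitigation contributes to the observed hallucination rate.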
+While prompt engineering offers practical benefits, it remains a superficial fix that cannot fully eliminate model-intrinsic hallucinations, especially under deceptive prompts or ambiguous tasks.
+8.2 Model-based mitigation techniques
+To address hallucinations arising from model behavior, a range of architectural and training innovations have been proposed. These methods aim to ground generation more explicitly in factual knowledge or adjust model output behavior directly.
+• Instruction fine-tuning: exposing models to task-aligned instruction datasets improves factual alignment and reduces generation drift (Ouyang et al., 2022).
+• Reinforcement learning from human feedback (RLHF): aligns model behavior with human preferences and factual correctness, although limited in open-source models due to cost and complexity.
+• Contrastive decoding (Li et al., 2022): a decoding-time method that compares candidate outputs against a baseline model to suppress less factual completions.
+• Grounded pretraining and fine-tuning: integrating knowledge sources or fact-labeled datasets during pretraining or fine-tuning stages improves factual consistency (Zhang et al., 2023).
+• Retrieval-augmented generation (RAG): incorporating external knowledge retrieval at inference time improves grounding and reduces reliance on model memorization (Lewis et al., 2020). Open-source toolkits like Haystack and RAG pipelines in HuggingFace enable this method at no cost.
+• Factuality scorers and feedback loops: using auxiliary classifiers or LLMs-as-judges to score and post-edit generated content is another promising direction (Liu et al., 2023).
+These approaches require more infrastructure and training resources than prompt engineering but offer more robust mitigation, especially for model-intrinsic hallucinations.
+8.3 Hybrid mitigation pipelines
+State-of-the-art systems increasingly employ hybrid pipelines that combine prompt tuning, retrieval integration, and post-hoc filtering. A typical pipeline includes:
+1. Prompt construction (CoT or Instruction-based).
+2. Retrieval of supporting knowledge (RAG).
+3. Generation using a fine-tuned model.
+4. Post-generation verification via factuality scorers.
+Such layered approaches have shown superior performance in factual QA and summarization tasks while remaining implementable using free and open-source tools.
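+A minimal sketch of such a pipeline, with the retriever, generator, and factuality scorer left as caller-supplied callables (all names and the threshold are hypothetical):
+
+```python
+def hybrid_answer(question, retrieve, generate, score_factuality, threshold=0.5):
+    """Four-stage pipeline: prompt construction -> retrieval -> generation -> verification."""
+    passages = retrieve(question)                       # step 2: retrieve support (RAG)
+    prompt = ("Use only the passages below; think step by step.\n\n"
+              + "\n".join(passages)
+              + "\n\nQuestion: " + question)            # step 1: CoT-style prompt
+    answer = generate(prompt)                           # step 3: fine-tuned model
+    if score_factuality(answer, passages) < threshold:  # step 4: post-hoc verification
+        return "UNKNOWN (failed factuality check)"
+    return answer
+```
+
+Any retriever (e.g., a Haystack document store), any open-source generator, and any factuality scorer can be plugged in without changing the pipeline skeleton.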
+8.4 Mitigation summary and recommendations
+Table 8 summarizes mitigation techniques based on their suitability and cost-efficiency for open-source LLMs.
+8.5 Open challenges
+Despite these advances, several challenges remain:
+• Lack of universal metrics for hallucination detection across domains.
+• Limited accessibility of fine-tuning infrastructure in low-resource settings.
+• Difficulty in detecting subtle, high-confidence hallucinations.
+TABLE 8 Summary of hallucination mitigation strategies.
+Technique Effectiveness scope Feasibility (free setup)
+Chain-of-thought prompting Prompt-level reduction in reasoning and factual QA High
+Instruction prompting Reduces ambiguity and off-topic generation High
+Negative prompting Prevents speculative completions in summarization High
+Instruction fine-tuning Enhances factual grounding during generation Medium (requires data)
+RLHF Aligns model behavior with factual correctness Low (complex setup)
+Contrastive decoding Post-processing hallucination filter Medium
+Grounded pretraining Reduces hallucination during generation Medium (data+compute)
+Retrieval-augmented generation (RAG) Integrates external knowledge for grounding High (via free toolkits)
+Post-hoc scoring Filters outputs based on factuality models Medium
+• Trade-offs between factual accuracy and creativity/flexibility in generative tasks.
+Tackling hallucination requires continuous co-evolution of both prompting strategies and model architectures. Open-source contributions to grounded fine-tuning, benchmark standardization, and community evaluation pipelines are key to future progress.
+9 Open problems over mitigation strategies
+Despite recent progress, hallucination in Large Language Models (LLMs) remains a critical open challenge in NLP. Addressing this issue requires not only prompt engineering and model fine-tuning but also broader advances in evaluation, grounding, and collaborative methodologies. This section outlines the most pressing research directions and associated challenges, augmented by insights from the current literature.
+9.1 Unified evaluation benchmarks
+While existing benchmarks such as TruthfulQA (Lin et al., 2022), HallucinationEval (Wu et al., 2023), QAFactEval (Fabbri et al., 2022), and CohS (Kazemi et al., 2023) provide useful lenses for evaluating hallucination, there remains no standard protocol across tasks or domains. The evaluation landscape is fragmented, making cross-model comparison and generalization difficult. Related work:
+• Development of integrated, multi-task, multilingual benchmarks with unified annotation schemas (Liu et al., 2023).
+• Attribution-aware metrics incorporating Prompt Sensitivity (PS) and Model Variability (MV).
+• Community-maintained leaderboards focusing on hallucination robustness (OpenAI, 2023a; Kadavath et al., 2022).
+9.2 Detection of high-confidence hallucinations
+High-confidence hallucinations—those that appear fluent and plausible but are factually incorrect—are particularly dangerous and difficult to detect automatically (Kadavath et al., 2022; Ji et al., 2023). Traditional lexical metrics like BLEU or ROUGE fail to capture semantic grounding. Related work:
+• Factuality scoring based on semantic entailment or natural language inference (NLI) (Maynez et al., 2020).
+• Enhanced use of LLM-as-a-judge paradigms (Liu et al., 2023).
+• Calibration techniques to align model confidence with factual reliability.
+9.3 Prompt robustness and safety
+Prompt sensitivity analysis (as discussed in this work and in Reynolds and McDonell (2021) and Wei et al. (2022)) shows that even small variations in prompt phrasing can significantly affect hallucination likelihood. Related work:
+• Formal frameworks for robust and adversarial prompt design (Zhou et al., 2022).
+• Automatic prompt paraphrasing for hallucination minimization.
+• Prompt auditing tools to detect high-risk phrasing patterns.
+9.4 Grounded generation and fact retrieval
+Integrating knowledge retrieval into generation workflows (e.g., Retrieval-Augmented Generation, RAG) has shown promising results in hallucination mitigation (Lewis et al., 2020). Grounded pretraining also strengthens output alignment with real-world facts (Zhang et al., 2023).
+Related work:
+• Efficient RAG architectures for low-resource environments.
+• Integration of symbolic and neural knowledge modules (Yao et al., 2022).
+• Fine-tuning methods incorporating retrieved factual context (Li et al., 2022).
+9.5 Transparent attribution models
+Attribution-aware evaluation, as introduced in our framework, can offer insights into hallucination causes. However, few studies formalize this into interpretable attribution models. Related work:
+• Neural attribution predictors identifying hallucination source (prompt vs. model) (Bang and Madotto, 2023).
+• Visualization tools to trace token-level factual alignment.
+• Loss functions that penalize ambiguous or ungrounded generation.
+9.6 Domain-specific and high-stakes applications
+Current hallucination research largely focuses on open-domain tasks. However, the stakes of hallucination in high-risk domains such as medicine, law, and education are far higher (Weidinger et al., 2022).
+Related work:
+• Domain-specific fine-tuning with expert-validated datasets.
+• Grounded verification pipelines integrated with domain ontologies.
+• Regulatory frameworks for LLM deployment in sensitive fields.
+9.7 Collaborative and decentralized mitigation
+Mitigating hallucination is not solely a technical issue—it is also a systemic and collaborative one. Decentralized methods involving human feedback and community standards are essential. Related work:
+• Crowdsourced prompt evaluation libraries, inspired by Gehman et al. (2020).
+• Peer-review style generation assessment platforms.
+• Cross-institutional efforts toward open hallucination mitigation protocols.
+9.8 Summary
+To ensure reliable, safe, and transparent deployment of LLMs, the hallucination problem must be addressed through a combination of prompting techniques, model innovation, community standards, and attribution-aware evaluation. The future of LLMs depends not only on their capacity to generate language fluently, but also on their ability to do so with factual accountability and epistemic humility.
+10 Conclusion and final remarks
+Hallucination in Large Language Models (LLMs) remains one of the most pressing challenges in the safe and trustworthy deployment of generative AI systems. This paper has systematically explored the phenomenon of hallucination through the lens of attribution—distinguishing whether hallucinations arise primarily from prompting design or model behavior.
+To address this, we proposed a novel attribution framework, and our contributions are threefold: (1) We propose the first probabilistic attribution framework for LLM hallucinations, introducing the new metrics PS, MV, and JAS to quantify prompt vs. model contributions. (2) We formalize hallucination attribution with a Bayesian hierarchical model, which has not been explored in prior work, providing interpretable parameters for prompt-induced and intrinsic error rates. (3) We design controlled experiments with open-source models and standardized prompts, an approach that contrasts with prior studies, which often evaluated prompts or models in isolation. This allows us to classify hallucination origins (prompt-dominant, model-dominant, or mixed) for different LLMs, a novel analysis enabled by our framework. The results confirm that:
+• Prompt design strongly influences hallucination rates in prompt-sensitive models (e.g., LLaMA 2, OpenChat).
+• Some hallucinations persist regardless of prompting structure, indicating inherent model biases or training artifacts (as seen in DeepSeek).
+• Chain-of-Thought prompting and Instruction-based inputs are effective but insufficient in isolation.
+• Attribution scoring offers a new lens to analyze and mitigate hallucination by disentangling its root causes.
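As a rough illustration of attribution scoring, the sketch below measures how much a model's hallucination rate moves across prompt styles, echoing the prompt-sensitivity idea, against the best rate prompting alone can reach. The formulas, the 0.05 threshold, and the numbers are simplifying assumptions for illustration, not the paper's PS/MV/JAS definitions.

```python
# Illustrative attribution scoring over per-model hallucination rates measured
# under several prompt styles. Metric formulas and threshold are assumptions,
# not the paper's definitions.
from statistics import pstdev

# rates[model][prompt_style] = observed hallucination rate (made-up numbers)
rates = {
    "model_a": {"direct": 0.40, "cot": 0.15, "instruction": 0.20},
    "model_b": {"direct": 0.35, "cot": 0.33, "instruction": 0.34},
}

def prompt_sensitivity(model_rates):
    """Spread of one model's rate across prompt styles (high = prompt-driven)."""
    return pstdev(list(model_rates.values()))

def model_floor(model_rates):
    """Lowest rate any prompt achieves; a proxy for intrinsic model error."""
    return min(model_rates.values())

def classify(model_rates, threshold=0.05):
    # threshold is an assumed cut-off, not a value taken from the paper
    ps = prompt_sensitivity(model_rates)
    return "prompt-dominant" if ps > threshold else "model-dominant"

labels = {name: classify(r) for name, r in rates.items()}
# labels -> {"model_a": "prompt-dominant", "model_b": "model-dominant"}
```

Under these toy numbers, model_a's rate swings widely with the prompt (prompt-dominant), while model_b hallucinates at a nearly constant rate regardless of prompting (model-dominant), mirroring the qualitative findings above.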
+Beyond experimental findings, this paper reviewed and classified a wide range of mitigation strategies—from prompt-based techniques to model fine-tuning and retrieval-augmented generation. A key takeaway is that no single approach can entirely eliminate hallucination; rather, multi-layered, attribution-aware pipelines are necessary. Moreover, our study was conducted entirely within a fully free and reproducible setup, using only open-access tools, models, and benchmarks. This ensures accessibility and replicability for the broader research community and reinforces the importance of open science in addressing fundamental challenges in NLP.
+Frontiers in Artificial Intelligence 19 frontiersin.org
+
+
+Anh-Hoang et al. 10.3389/frai.2025.1622292
+Ultimately, solving hallucination in LLMs is a step toward building more epistemically responsible AI—models that not only speak fluently, but know what they know, and more importantly, recognize what they don’t.
+Author contributions
+DA-H: Visualization, Software, Project administration, Writing – original draft, Conceptualization, Methodology, Validation, Investigation, Formal analysis, Writing – review & editing. VT: Writing – review & editing, Supervision, Validation. L-MN: Validation, Writing – review & editing, Supervision, Funding acquisition.
+Funding
+The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by JSPS KAKENHI (Grant number JP23K16954).
+Conflict of interest
+The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
+Generative AI statement
+The author(s) declare that no Gen AI was used in the creation of this manuscript.
+Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
+Publisher’s note
+All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
+Author disclaimer
+Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the author(s)’ organization, JSPS or MEXT.
+References
+Andrews, N., Wang, L., and Zhang, Y. (2023). The hallucination problem in large language models: a survey. arXiv [preprint]. arXiv:2305.11685. doi: 10.48550/arXiv.2305.11685
+Anthropic (2023). Claude: A Next-generation AI Assistant by Anthropic. Available online at: https://www.anthropic.com/index/introducing-claude (Accessed September 17, 2025).
+Bang, Y., and Madotto, A. (2023). Multitask prompted training enables zero-shot task generalization. arXiv [preprint]. arXiv:2301.04589. doi: 10.48550/arXiv.2301.04589
+Berger, J. O. (2013). Statistical Decision Theory and Bayesian Analysis. Cham: Springer Science & Business Media.
+Bommarito II, M. J., and Katz, D. M. (2022). GPT takes the bar exam. SSRN E-J. doi: 10.2139/ssrn.4314839
+Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models. arXiv [preprint]. arXiv:2108.07258. doi: 10.48550/arXiv.2108.07258
+Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901. doi: 10.48550/arXiv.2005.14165
+Chen, X., Yan, L., Zhang, Y., Shvakel, M., Hall, K. W., Brun, Y., et al. (2023). Hallucination in large language models: a survey. arXiv [preprint]. arXiv:2308.03299. doi: 10.48550/arXiv.2308.03299
+DeepSeek AI (2023). Deepseek LLMs. Available online at: https://deepseek.com/research/ (Accessed September 17, 2025).
+Fabbri, A. R., Li, Q., Li, Y., Jiang, Z., and Radev, D. (2022). “QAFactEval: improved QA-based factual consistency evaluation for summarization,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Seattle, WA: ACL), 5962–5975. doi: 10.18653/v1/2022.naacl-main.187
+Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. (2020). “RealToxicityPrompts: evaluating neural toxic degeneration in language models,” in Findings of EMNLP. doi: 10.18653/v1/2020.findings-emnlp.301
+Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., Rubin, D. B., et al. (2013). Bayesian Data Analysis. Boca Raton, FL: CRC press. doi: 10.1201/b16018
+Ji, Z., Lee, N., Frieske, R., Yu, T.-H. K., Su, D., Xu, Y., et al. (2023). Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 1–38. doi: 10.1145/3571730
+Kadavath, S., Chilton, L., Li, A., Elhage, N., Henighan, T., Perez, E., et al. (2022). Language models (mostly) know what they know. arXiv [preprint]. arXiv:2207.05221. doi: 10.48550/arXiv.2207.05221
+Kazemi, S. M., Kashyap, A., and McKeown, K. (2023). Cohs: a dataset for evaluating factual consistency of summaries. arXiv [preprint]. arXiv:2305.00038. doi: 10.48550/arXiv.2305.00038
+Lee, P., Bubeck, S., and Petro, J. (2023). Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239. doi: 10.1056/NEJMsr2214184
+Lewis, P., Perez, E., Piktus, A., Petroni, F., and Karpukhin, V. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474.
+Li, X., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., et al. (2022). Contrastive decoding: Open-ended text generation as optimization. arXiv [preprint]. arXiv:2210.15097. doi: 10.48550/arXiv.2210.15097
+Lin, S., Hilton, J., and Askell, A. (2022). Truthfulqa: measuring how models mimic human falsehoods. arXiv [preprint]. arXiv:2109.07958. doi: 10.48550/arXiv.2109.07958
+Liu, X., Pan, L., He, Y., Zhang, W., and Ren, X. (2023). Evaluating the factual consistency of large language models: A survey. arXiv [preprint]. arXiv:2305.03023. doi: 10.48550/arXiv.2305.03023
+Maynez, J., Narayan, S., Bohnet, B., and McDonald, R. (2020). “On faithfulness and factuality in abstractive summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1906–1919. doi: 10.18653/v1/2020.acl-main.173
+Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., et al. (2021). Webgpt: browser-assisted question-answering with human feedback. arXiv [preprint]. arXiv:2112.09332. doi: 10.48550/arXiv.2112.09332
+OpenAI (2023a). GPT-4 System Card. Available online at: https://openai.com/research/gpt-4-system-card (Accessed September 17, 2025).
+OpenAI (2023b). GPT-4 Technical Report. Available online at: https://openai.com/research/gpt-4 (Accessed September 17, 2025).
+Openchat Team (2023). Openchat: Open-Source Chat Models. Available online at: https://github.com/imoneoi/openchat (Accessed September 17, 2025).
+Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. arXiv [preprint]. arXiv:2203.02155. doi: 10.48550/arXiv.2203.02155
+Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Burlington, MA: Morgan Kaufmann. doi: 10.1016/B978-0-08-051489-5.50008-4
+Reynolds, L., and McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. arXiv [preprint]. arXiv:2102.07350. doi: 10.48550/arXiv.2102.07350
+Shuster, K., Smith, E. M., Simig, N., Roller, S., and Weston, J. (2022). Language models that seek for knowledge: modular search & generation for dialogue and prompt completion. arXiv [preprint]. arXiv:2202.08906. doi: 10.48550/arXiv.2202.08906
+Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., et al. (2023). Llama 2: Open Foundation and Fine-tuned Chat Models. Available online at: https://ai.meta.com/llama/ (Accessed September 17, 2025).
+Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., et al. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv [preprint]. arXiv:2203.11171. doi: 10.48550/arXiv.2203.11171
+Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv [preprint]. arXiv:2201.11903. doi: 10.48550/arXiv.2201.11903
+Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Huang, P.-S., Uesato, J., et al. (2022). Taxonomy of risks posed by language models. arXiv [preprint]. arXiv:2112.04359. doi: 10.48550/arXiv.2112.04359
+Wu, H., Wang, J., Zhang, W., and He, Y. (2023). Hallucinationeval: a unified framework for evaluating hallucinations in LLMs. arXiv [preprint]. arXiv:2310.06545. doi: 10.48550/arXiv.2310.06545
+Yao, S., Zhao, S., Yu, D., Etzioni, O., Choi, Y., et al. (2022). React: Synergizing reasoning and acting in language models. arXiv [preprint]. arXiv:2210.03629. doi: 10.48550/arXiv.2210.03629
+Zhang, N., Yang, W., Xie, X., Liu, Q., and Chen, H. (2023). Grounded language model training reduces hallucination. arXiv [preprint]. arXiv:2303.08239. doi: 10.48550/arXiv.2303.08239
+Zhou, X., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., et al. (2022). Least-to-most prompting enables complex reasoning in large language models. arXiv [preprint]. arXiv:2205.10625. doi: 10.48550/arXiv.2205.10625
\ No newline at end of file
diff --git a/storage/97MZR9LH/.zotero-reader-state b/storage/97MZR9LH/.zotero-reader-state
new file mode 100644
index 0000000..a295649
--- /dev/null
+++ b/storage/97MZR9LH/.zotero-reader-state
@@ -0,0 +1 @@
+{"pageIndex":2,"scale":200,"top":443,"left":-86,"scrollMode":0,"spreadMode":0}
\ No newline at end of file
diff --git a/storage/97MZR9LH/Anh-Hoang et al. - 2025 - Survey and analysis of hallucinations in large language models attribution to prompting strategies.pdf b/storage/97MZR9LH/Anh-Hoang et al. - 2025 - Survey and analysis of hallucinations in large language models attribution to prompting strategies.pdf
new file mode 100644
index 0000000..f4f2d74
--- /dev/null
+++ b/storage/97MZR9LH/Anh-Hoang et al. - 2025 - Survey and analysis of hallucinations in large language models attribution to prompting strategies.pdf
@@ -0,0 +1,30251 @@
+<< /Count 85 /First 68 0 R /Last 68 0 R >>
+<< /Count 84 /Dest (ariclecounter.1) /First 69 0 R /Last 70 0 R /Parent 67 0 R /Title >>
+<< /Dest (section.0.1) /Next 99 0 R /Parent 68 0 R /Title >>
+<< /Dest (section*.85) /Parent 68 0 R /Prev 71 0 R /Title >>
+<< /Dest (section*.84) /Next 70 0 R /Parent 68 0 R /Prev 72 0 R /Title >>
+<< /Dest (section*.83) /Next 71 0 R /Parent 68 0 R /Prev 73 0 R /Title >>
+<< /Dest (section*.82) /Next 72 0 R /Parent 68 0 R /Prev 74 0 R /Title >>
+<< /Dest (section*.81) /Next 73 0 R /Parent 68 0 R /Prev 75 0 R /Title >>
+<< /Dest (section*.80) /Next 74 0 R /Parent 68 0 R /Prev 76 0 R /Title >>
+<< /Dest (section*.79) /Next 75 0 R /Parent 68 0 R /Prev 77 0 R /Title >>
+<< /Dest (section.0.10) /Next 76 0 R /Parent 68 0 R /Prev 78 0 R /Title >>
+<< /Count 8 /Dest (section.0.9) /First 79 0 R /Last 80 0 R /Next 77 0 R /Parent 68 0 R /Prev 81 0 R /Title >>
+<< /Dest (subsection.0.9.1) /Next 152 0 R /Parent 78 0 R /Title >>
+<< /Dest (subsection.0.9.8) /Parent 78 0 R /Prev 147 0 R /Title >>
+<< /Count 5 /Dest (section.0.8) /First 82 0 R /Last 83 0 R /Next 78 0 R /Parent 68 0 R /Prev 84 0 R /Title >>
+<< /Dest (subsection.0.8.1) /Next 146 0 R /Parent 81 0 R /Title >>
+<< /Dest (subsection.0.8.5) /Parent 81 0 R /Prev 144 0 R /Title >>
+<< /Count 6 /Dest (section.0.7) /First 85 0 R /Last 86 0 R /Next 81 0 R /Parent 68 0 R /Prev 87 0 R /Title >>
+<< /Dest (subsection.0.7.1) /Next 143 0 R /Parent 84 0 R /Title >>
+<< /Dest (subsection.0.7.6) /Parent 84 0 R /Prev 140 0 R /Title >>
+<< /Count 12 /Dest (section.0.6) /First 88 0 R /Last 89 0 R /Next 84 0 R /Parent 68 0 R /Prev 90 0 R /Title >>
+<< /Dest (subsection.0.6.1) /Next 137 0 R /Parent 87 0 R /Title >>
+<< /Dest (subsection.0.6.8) /Parent 87 0 R /Prev 130 0 R /Title >>
+<< /Count 10 /Dest (section.0.5) /First 91 0 R /Last 92 0 R /Next 87 0 R /Parent 68 0 R /Prev 93 0 R /Title >>
+<< /Dest (subsection.0.5.1) /Next 129 0 R /Parent 90 0 R /Title >>
+<< /Dest (subsection.0.5.8) /Parent 90 0 R /Prev 122 0 R /Title >>
+<< /Count 15 /Dest (section.0.4) /First 94 0 R /Last 95 0 R /Next 90 0 R /Parent 68 0 R /Prev 96 0 R /Title >>
+<< /Dest (subsection.0.4.1) /Next 120 0 R /Parent 93 0 R /Title >>
+<< /Dest (subsection.0.4.12) /Parent 93 0 R /Prev 109 0 R /Title >>
+<< /Count 5 /Dest (section.0.3) /First 97 0 R /Last 98 0 R /Next 93 0 R /Parent 68 0 R /Prev 99 0 R /Title >>
+<< /Dest (subsection.0.3.1) /Next 108 0 R /Parent 96 0 R /Title >>
+<< /Dest (subsection.0.3.5) /Parent 96 0 R /Prev 106 0 R /Title >>
+<< /Count 6 /Dest (section.0.2) /First 100 0 R /Last 101 0 R /Next 96 0 R /Parent 68 0 R /Prev 69 0 R /Title >>
+<< /Dest (subsection.0.2.1) /Next 105 0 R /Parent 99 0 R /Title >>
+<< /Dest (subsection.0.2.6) /Parent 99 0 R /Prev 102 0 R /Title >>
+<< /Dest (subsection.0.2.5) /Next 101 0 R /Parent 99 0 R /Prev 103 0 R /Title >>
+<< /Dest (subsection.0.2.4) /Next 102 0 R /Parent 99 0 R /Prev 104 0 R /Title >>
+<< /Dest (subsection.0.2.3) /Next 103 0 R /Parent 99 0 R /Prev 105 0 R /Title >>
+<< /Dest (subsection.0.2.2) /Next 104 0 R /Parent 99 0 R /Prev 100 0 R /Title >>
+<< /Dest (subsection.0.3.4) /Next 98 0 R /Parent 96 0 R /Prev 107 0 R /Title >>
+<< /Dest (subsection.0.3.3) /Next 106 0 R /Parent 96 0 R /Prev 108 0 R /Title >>
+<< /Dest (subsection.0.3.2) /Next 107 0 R /Parent 96 0 R /Prev 97 0 R /Title >>
+<< /Dest (subsection.0.4.11) /Next 95 0 R /Parent 93 0 R /Prev 110 0 R /Title >>
+<< /Dest (subsection.0.4.10) /Next 109 0 R /Parent 93 0 R /Prev 111 0 R /Title >>
+<< /Count 3 /Dest (subsection.0.4.9) /First 112 0 R /Last 113 0 R /Next 110 0 R /Parent 93 0 R /Prev 114 0 R /Title >>
+<< /Dest (subsubsection.0.4.9.1) /Next 121 0 R /Parent 111 0 R /Title >>
+<< /Dest (subsubsection.0.4.9.3) /Parent 111 0 R /Prev 121 0 R /Title >>
+<< /Dest (subsection.0.4.8) /Next 111 0 R /Parent 93 0 R /Prev 115 0 R /Title >>
+<< /Dest (subsection.0.4.7) /Next 114 0 R /Parent 93 0 R /Prev 116 0 R /Title >>
+<< /Dest (subsection.0.4.6) /Next 115 0 R /Parent 93 0 R /Prev 117 0 R /Title >>
+<< /Dest (subsection.0.4.5) /Next 116 0 R /Parent 93 0 R /Prev 118 0 R /Title >>
+<< /Dest (subsection.0.4.4) /Next 117 0 R /Parent 93 0 R /Prev 119 0 R /Title >>
+<< /Dest (subsection.0.4.3) /Next 118 0 R /Parent 93 0 R /Prev 120 0 R /Title >>
+<< /Dest (subsection.0.4.2) /Next 119 0 R /Parent 93 0 R /Prev 94 0 R /Title >>
+<< /Dest (subsubsection.0.4.9.2) /Next 113 0 R /Parent 111 0 R /Prev 112 0 R /Title >>
+<< /Dest (subsection.0.5.7) /Next 92 0 R /Parent 90 0 R /Prev 123 0 R /Title >>
+<< /Dest (subsection.0.5.6) /Next 122 0 R /Parent 90 0 R /Prev 124 0 R /Title >>
+<< /Dest (subsection.0.5.5) /Next 123 0 R /Parent 90 0 R /Prev 125 0 R /Title >>
+<< /Count 2 /Dest (subsection.0.5.4) /First 126 0 R /Last 127 0 R /Next 124 0 R /Parent 90 0 R /Prev 128 0 R /Title >>
+<< /Dest (subsubsection.0.5.4.1) /Next 127 0 R /Parent 125 0 R /Title >>
+<< /Dest (subsubsection.0.5.4.2) /Parent 125 0 R /Prev 126 0 R /Title >>
+<< /Dest (subsection.0.5.3) /Next 125 0 R /Parent 90 0 R /Prev 129 0 R /Title >>
+<< /Dest (subsection.0.5.2) /Next 128 0 R /Parent 90 0 R /Prev 91 0 R /Title >>
+<< /Count 4 /Dest (subsection.0.6.7) /First 131 0 R /Last 132 0 R /Next 89 0 R /Parent 87 0 R /Prev 133 0 R /Title >>
+<< /Dest (subsubsection.0.6.7.1) /Next 139 0 R /Parent 130 0 R /Title >>
+<< /Dest (subsubsection.0.6.7.4) /Parent 130 0 R /Prev 138 0 R /Title >>
+<< /Dest (subsection.0.6.6) /Next 130 0 R /Parent 87 0 R /Prev 134 0 R /Title >>
+<< /Dest (subsection.0.6.5) /Next 133 0 R /Parent 87 0 R /Prev 135 0 R /Title >>
+<< /Dest (subsection.0.6.4) /Next 134 0 R /Parent 87 0 R /Prev 136 0 R /Title >>
+<< /Dest (subsection.0.6.3) /Next 135 0 R /Parent 87 0 R /Prev 137 0 R /Title >>
+<< /Dest (subsection.0.6.2) /Next 136 0 R /Parent 87 0 R /Prev 88 0 R /Title >>
+<< /Dest (subsubsection.0.6.7.3) /Next 132 0 R /Parent 130 0 R /Prev 139 0 R /Title >>
+<< /Dest (subsubsection.0.6.7.2) /Next 138 0 R /Parent 130 0 R /Prev 131 0 R /Title >>
+<< /Dest (subsection.0.7.5) /Next 86 0 R /Parent 84 0 R /Prev 141 0 R /Title >>
+<< /Dest (subsection.0.7.4) /Next 140 0 R /Parent 84 0 R /Prev 142 0 R /Title >>
+<< /Dest (subsection.0.7.3) /Next 141 0 R /Parent 84 0 R /Prev 143 0 R /Title >>
+<< /Dest (subsection.0.7.2) /Next 142 0 R /Parent 84 0 R /Prev 85 0 R /Title >>
+<< /Dest (subsection.0.8.4) /Next 83 0 R /Parent 81 0 R /Prev 145 0 R /Title >>
+<< /Dest (subsection.0.8.3) /Next 144 0 R /Parent 81 0 R /Prev 146 0 R /Title >>
+<< /Dest (subsection.0.8.2) /Next 145 0 R /Parent 81 0 R /Prev 82 0 R /Title >>
+<< /Dest (subsection.0.9.7) /Next 80 0 R /Parent 78 0 R /Prev 148 0 R /Title >>
+<< /Dest (subsection.0.9.6) /Next 147 0 R /Parent 78 0 R /Prev 149 0 R /Title >>
+<< /Dest (subsection.0.9.5) /Next 148 0 R /Parent 78 0 R /Prev 150 0 R /Title >>
+<< /Dest (subsection.0.9.4) /Next 149 0 R /Parent 78 0 R /Prev 151 0 R /Title >>
+<< /Dest (subsection.0.9.3) /Next 150 0 R /Parent 78 0 R /Prev 152 0 R /Title >>
+<< /Dest (subsection.0.9.2) /Next 151 0 R /Parent 78 0 R /Prev 79 0 R /Title >>
+
+endstream
+endobj
+153 0 obj
+<< /Type /ObjStm /Length 423 /N 4 /First 29 /Extends 306 0 R >>
+stream
+154 0 155 63 156 197 157 332
+<< /Count 21 /Kids [ 155 0 R 156 0 R 157 0 R ] /Type /Pages >>
+<< /Count 10 /Kids [ 65 0 R 286 0 R 287 0 R 288 0 R 289 0 R 290 0 R 291 0 R 292 0 R 293 0 R 294 0 R ] /Parent 154 0 R /Type /Pages >>
+<< /Count 10 /Kids [ 295 0 R 296 0 R 297 0 R 298 0 R 299 0 R 300 0 R 301 0 R 302 0 R 303 0 R 304 0 R ] /Parent 154 0 R /Type /Pages >>
+<< /Count 1 /Kids [ 305 0 R ] /Parent 154 0 R /Type /Pages >>
+
+endstream
+endobj
+158 0 obj
+<< /K [ 307 0 R 308 0 R 309 0 R 310 0 R 311 0 R 312 0 R 313 0 R ] /ParentTree 314 0 R /ParentTreeNextKey 5 /Type /StructTreeRoot >>
+endobj
+159 0 obj
+<< /Type /ObjStm /Length 13396 /N 100 /First 958 >>
+stream
+160 0 161 64 162 1708 163 3173 164 4597 165 6203 166 7938 167 8858 168 8896 169 8934 170 8971 171 9010 172 9049 173 9088 174 9127 175 9166 176 9205 177 9244 178 9283 179 9322 180 9361 181 9399 182 9437 183 9475 184 9513 185 9551 186 9589 187 9627 188 9665 189 9703 190 9742 191 9781 192 9820 193 9859 194 9898 195 9937 196 9976 197 10015 198 10054 199 10092 200 10130 201 10169 202 10208 203 10247 204 10286 205 10325 206 10363 207 10402 208 10440 209 10479 210 10518 211 10556 212 10594 213 10633 214 10672 215 10710 216 10748 217 10787 218 10826 219 10865 220 10903 221 10942 222 10981 223 11020 224 11058 225 11096 226 11134 227 11173 228 11212 229 11251 230 11290 231 11328 232 11367 233 11406 234 11445 235 11484 236 11522 237 11560 238 11599 239 11637 240 11675 241 11713 242 11751 243 11790 244 11828 245 11866 246 11905 247 11943 248 11981 249 12019 250 12057 251 12096 252 12134 253 12172 254 12210 255 12248 256 12286 257 12324 258 12362 259 12400
+<< /Kids [ 161 0 R 162 0 R 163 0 R 164 0 R 165 0 R 166 0 R ] >>
+<< /Limits [ (Doc-Start) (cite.touvron2023llama) ] /Names [ (Doc-Start) 408 0 R (FIG1) 409 0 R (FIG2) 410 0 R (FIG3) 411 0 R (FIG4) 412 0 R (FIG5) 413 0 R (FIG6) 414 0 R (FIG7) 415 0 R (Item.1) 417 0 R (Item.10) 418 0 R (Item.2) 419 0 R (Item.3) 420 0 R (Item.4) 421 0 R (Item.5) 422 0 R (Item.6) 423 0 R (Item.7) 424 0 R (Item.8) 425 0 R (Item.9) 426 0 R (TAB1) 427 0 R (TAB2) 428 0 R (TAB3) 429 0 R (TAB4) 430 0 R (TAB5) 431 0 R (TAB6) 432 0 R (TAB7) 433 0 R (TAB8) 434 0 R (afig1) 435 0 R (afig2) 436 0 R (afig3) 437 0 R (afig4) 438 0 R (afig5) 439 0 R (afig6) 440 0 R (afig7) 441 0 R (ariclecounter.1) 442 0 R (cite.andrews2023hallucination) 443 0 R (cite.anthropic2023claude) 444 0 R (cite.bang2023multitask) 445 0 R (cite.berger2013statistical) 446 0 R (cite.bommarito2022gpt) 447 0 R (cite.bommasani2021opportunities) 448 0 R (cite.brown2020language) 449 0 R (cite.chen2023hallucination) 450 0 R (cite.deepseek2023model) 451 0 R (cite.fabbri2022qafacteval) 452 0 R (cite.gehman2020realtoxicity) 453 0 R (cite.gelman2013bayesian) 454 0 R (cite.ji2023survey) 455 0 R (cite.kadavath2022language) 456 0 R (cite.kazemi2023cohs) 457 0 R (cite.lee2023benefits) 458 0 R (cite.lewis2020retrieval) 459 0 R (cite.li2022contrastive) 460 0 R (cite.lin2022truthfulqa) 461 0 R (cite.liu2023evaluating) 462 0 R (cite.maynez2020faithfulness) 463 0 R (cite.nakano2021webgpt) 464 0 R (cite.openai2023gpt4) 465 0 R (cite.openai2023systemcard) 466 0 R (cite.openchat2023) 467 0 R (cite.ouyang2022training) 468 0 R (cite.pearl1988probabilistic) 469 0 R (cite.reynolds2021prompt) 470 0 R (cite.shuster2022language) 471 0 R (cite.touvron2023llama) 472 0 R ] >>
+<< /Limits [ (cite.wang2022selfconsistency) (page.3) ] /Names [ (cite.wang2022selfconsistency) 344 0 R (cite.wei2022chain) 345 0 R (cite.weidinger2022taxonomy) 346 0 R (cite.wu2023hallucinationeval) 347 0 R (cite.yao2022react) 348 0 R (cite.zhang2023grounded) 349 0 R (cite.zhou2022least) 350 0 R (equation.0.1.1) 351 0 R (equation.0.2.2) 352 0 R (equation.0.2.3) 353 0 R (equation.0.2.4) 354 0 R (equation.0.4.10) 355 0 R (equation.0.4.11) 356 0 R (equation.0.4.12) 357 0 R (equation.0.4.13) 358 0 R (equation.0.4.14) 359 0 R (equation.0.4.5) 360 0 R (equation.0.4.7) 361 0 R (equation.0.4.8) 362 0 R (equation.0.4.9) 363 0 R (fig1) 364 0 R (fig2) 365 0 R (fig3) 366 0 R (fig4) 367 0 R (fig5) 368 0 R (fig6) 369 0 R (fig7) 370 0 R (figsource.1) 371 0 R (figsource.2) 372 0 R (figsource.3) 373 0 R (figsource.4) 374 0 R (figsource.5) 375 0 R (figsource.6) 376 0 R (figsource.7) 377 0 R (figure.0.1) 378 0 R (figure.0.2) 379 0 R (figure.0.3) 380 0 R (figure.0.4) 381 0 R (figure.0.5) 382 0 R (figure.0.6) 383 0 R (figure.0.7) 384 0 R (mytable.1) 385 0 R (mytable.2) 386 0 R (mytable.3) 387 0 R (mytable.4) 388 0 R (mytable.5) 389 0 R (mytable.6) 390 0 R (mytable.7) 391 0 R (mytable.8) 392 0 R (page.1) 393 0 R (page.10) 394 0 R (page.11) 395 0 R (page.12) 396 0 R (page.13) 397 0 R (page.14) 398 0 R (page.15) 399 0 R (page.16) 400 0 R (page.17) 401 0 R (page.18) 402 0 R (page.19) 403 0 R (page.2) 404 0 R (page.20) 405 0 R (page.21) 406 0 R (page.3) 407 0 R ] >>
+<< /Limits [ (page.4) (section*.61) ] /Names [ (page.4) 538 0 R (page.5) 539 0 R (page.6) 540 0 R (page.7) 541 0 R (page.8) 542 0 R (page.9) 543 0 R (section*.1) 544 0 R (section*.10) 545 0 R (section*.11) 546 0 R (section*.12) 547 0 R (section*.13) 548 0 R (section*.14) 549 0 R (section*.15) 550 0 R (section*.16) 551 0 R (section*.17) 552 0 R (section*.18) 553 0 R (section*.19) 554 0 R (section*.2) 555 0 R (section*.20) 556 0 R (section*.21) 557 0 R (section*.22) 558 0 R (section*.23) 559 0 R (section*.24) 560 0 R (section*.25) 561 0 R (section*.26) 562 0 R (section*.27) 563 0 R (section*.28) 564 0 R (section*.29) 565 0 R (section*.3) 566 0 R (section*.30) 567 0 R (section*.31) 568 0 R (section*.32) 569 0 R (section*.33) 570 0 R (section*.34) 571 0 R (section*.35) 572 0 R (section*.36) 573 0 R (section*.37) 316 0 R (section*.38) 317 0 R (section*.39) 318 0 R (section*.4) 319 0 R (section*.40) 320 0 R (section*.41) 321 0 R (section*.42) 322 0 R (section*.43) 323 0 R (section*.44) 324 0 R (section*.45) 325 0 R (section*.46) 326 0 R (section*.47) 327 0 R (section*.48) 328 0 R (section*.49) 329 0 R (section*.5) 330 0 R (section*.50) 331 0 R (section*.51) 332 0 R (section*.52) 333 0 R (section*.53) 334 0 R (section*.54) 335 0 R (section*.55) 336 0 R (section*.56) 337 0 R (section*.57) 338 0 R (section*.58) 339 0 R (section*.59) 340 0 R (section*.6) 341 0 R (section*.60) 342 0 R (section*.61) 343 0 R ] >>
+<< /Limits [ (section*.62) (subsection.0.5.4) ] /Names [ (section*.62) 474 0 R (section*.63) 475 0 R (section*.64) 476 0 R (section*.65) 477 0 R (section*.66) 478 0 R (section*.67) 479 0 R (section*.68) 480 0 R (section*.69) 481 0 R (section*.7) 482 0 R (section*.70) 483 0 R (section*.71) 484 0 R (section*.72) 485 0 R (section*.73) 486 0 R (section*.74) 487 0 R (section*.75) 488 0 R (section*.76) 489 0 R (section*.77) 490 0 R (section*.78) 491 0 R (section*.79) 492 0 R (section*.8) 493 0 R (section*.80) 494 0 R (section*.81) 495 0 R (section*.82) 496 0 R (section*.83) 497 0 R (section*.84) 498 0 R (section*.85) 499 0 R (section*.9) 500 0 R (section.0.1) 501 0 R (section.0.10) 502 0 R (section.0.2) 503 0 R (section.0.3) 504 0 R (section.0.4) 505 0 R (section.0.5) 506 0 R (section.0.6) 507 0 R (section.0.7) 508 0 R (section.0.8) 509 0 R (section.0.9) 510 0 R (subsection.0.2.1) 511 0 R (subsection.0.2.2) 512 0 R (subsection.0.2.3) 513 0 R (subsection.0.2.4) 514 0 R (subsection.0.2.5) 515 0 R (subsection.0.2.6) 516 0 R (subsection.0.3.1) 517 0 R (subsection.0.3.2) 518 0 R (subsection.0.3.3) 519 0 R (subsection.0.3.4) 520 0 R (subsection.0.3.5) 521 0 R (subsection.0.4.1) 522 0 R (subsection.0.4.10) 523 0 R (subsection.0.4.11) 524 0 R (subsection.0.4.12) 525 0 R (subsection.0.4.2) 526 0 R (subsection.0.4.3) 527 0 R (subsection.0.4.4) 528 0 R (subsection.0.4.5) 529 0 R (subsection.0.4.6) 530 0 R (subsection.0.4.7) 531 0 R (subsection.0.4.8) 532 0 R (subsection.0.4.9) 533 0 R (subsection.0.5.1) 534 0 R (subsection.0.5.2) 535 0 R (subsection.0.5.3) 536 0 R (subsection.0.5.4) 537 0 R ] >>
+<< /Limits [ (subsection.0.5.5) (xchen2023hallucination) ] /Names [ (subsection.0.5.5) 196 0 R (subsection.0.5.6) 197 0 R (subsection.0.5.7) 198 0 R (subsection.0.5.8) 199 0 R (subsection.0.6.1) 200 0 R (subsection.0.6.2) 201 0 R (subsection.0.6.3) 202 0 R (subsection.0.6.4) 203 0 R (subsection.0.6.5) 204 0 R (subsection.0.6.6) 205 0 R (subsection.0.6.7) 206 0 R (subsection.0.6.8) 207 0 R (subsection.0.7.1) 208 0 R (subsection.0.7.2) 209 0 R (subsection.0.7.3) 210 0 R (subsection.0.7.4) 211 0 R (subsection.0.7.5) 212 0 R (subsection.0.7.6) 213 0 R (subsection.0.8.1) 214 0 R (subsection.0.8.2) 215 0 R (subsection.0.8.3) 216 0 R (subsection.0.8.4) 217 0 R (subsection.0.8.5) 218 0 R (subsection.0.9.1) 219 0 R (subsection.0.9.2) 220 0 R (subsection.0.9.3) 221 0 R (subsection.0.9.4) 222 0 R (subsection.0.9.5) 223 0 R (subsection.0.9.6) 224 0 R (subsection.0.9.7) 225 0 R (subsection.0.9.8) 226 0 R (subsubsection.0.4.9.1) 227 0 R (subsubsection.0.4.9.2) 228 0 R (subsubsection.0.4.9.3) 229 0 R (subsubsection.0.5.4.1) 230 0 R (subsubsection.0.5.4.2) 231 0 R (subsubsection.0.6.7.1) 232 0 R (subsubsection.0.6.7.2) 233 0 R (subsubsection.0.6.7.3) 234 0 R (subsubsection.0.6.7.4) 235 0 R (tab1) 236 0 R (tab2) 237 0 R (tab3) 238 0 R (tab4) 239 0 R (tab5) 240 0 R (tab6) 241 0 R (tab7) 242 0 R (tab8) 243 0 R (table.0.0) 244 0 R (table.0.1) 245 0 R (table.0.2) 246 0 R (table.0.3) 247 0 R (table.0.4) 248 0 R (table.0.5) 249 0 R (table.0.6) 250 0 R (table.0.7) 251 0 R (xandrews2023hallucination) 252 0 R (xanthropic2023claude) 253 0 R (xbang2023multitask) 254 0 R (xberger2013statistical) 255 0 R (xbommarito2022gpt) 256 0 R (xbommasani2021opportunities) 257 0 R (xbrown2020language) 258 0 R (xchen2023hallucination) 259 0 R ] >>
+<< /Limits [ (xdeepseek2023model) (xzhou2022least) ] /Names [ (xdeepseek2023model) 167 0 R (xfabbri2022qafacteval) 168 0 R (xgehman2020realtoxicity) 169 0 R (xgelman2013bayesian) 170 0 R (xji2023survey) 171 0 R (xkadavath2022language) 172 0 R (xkazemi2023cohs) 173 0 R (xlee2023benefits) 174 0 R (xlewis2020retrieval) 175 0 R (xli2022contrastive) 176 0 R (xlin2022truthfulqa) 177 0 R (xliu2023evaluating) 178 0 R (xmaynez2020faithfulness) 179 0 R (xnakano2021webgpt) 180 0 R (xopenai2023gpt4) 181 0 R (xopenai2023systemcard) 182 0 R (xopenchat2023) 183 0 R (xouyang2022training) 184 0 R (xpearl1988probabilistic) 185 0 R (xreynolds2021prompt) 186 0 R (xshuster2022language) 187 0 R (xtouvron2023llama) 188 0 R (xwang2022selfconsistency) 189 0 R (xwei2022chain) 190 0 R (xweidinger2022taxonomy) 191 0 R (xwu2023hallucinationeval) 192 0 R (xyao2022react) 193 0 R (xzhang2023grounded) 194 0 R (xzhou2022least) 195 0 R ] >>
+<< /D [ 304 0 R /XYZ 53 153 null ] >>
+<< /D [ 304 0 R /XYZ 53 133 null ] >>
+<< /D [ 304 0 R /XYZ 53 89 null ] >>
+<< /D [ 304 0 R /XYZ 309 353 null ] >>
+<< /D [ 304 0 R /XYZ 309 333 null ] >>
+<< /D [ 304 0 R /XYZ 309 305 null ] >>
+<< /D [ 304 0 R /XYZ 309 277 null ] >>
+<< /D [ 304 0 R /XYZ 309 248 null ] >>
+<< /D [ 304 0 R /XYZ 309 228 null ] >>
+<< /D [ 304 0 R /XYZ 309 208 null ] >>
+<< /D [ 304 0 R /XYZ 309 179 null ] >>
+<< /D [ 304 0 R /XYZ 309 151 null ] >>
+<< /D [ 304 0 R /XYZ 309 123 null ] >>
+<< /D [ 304 0 R /XYZ 309 87 null ] >>
+<< /D [ 305 0 R /XYZ 53 733 null ] >>
+<< /D [ 305 0 R /XYZ 53 753 null ] >>
+<< /D [ 305 0 R /XYZ 53 713 null ] >>
+<< /D [ 305 0 R /XYZ 53 693 null ] >>
+<< /D [ 305 0 R /XYZ 53 665 null ] >>
+<< /D [ 305 0 R /XYZ 53 637 null ] >>
+<< /D [ 305 0 R /XYZ 53 609 null ] >>
+<< /D [ 305 0 R /XYZ 53 581 null ] >>
+<< /D [ 305 0 R /XYZ 309 753 null ] >>
+<< /D [ 305 0 R /XYZ 309 725 null ] >>
+<< /D [ 305 0 R /XYZ 309 696 null ] >>
+<< /D [ 305 0 R /XYZ 309 667 null ] >>
+<< /D [ 305 0 R /XYZ 309 638 null ] >>
+<< /D [ 305 0 R /XYZ 309 610 null ] >>
+<< /D [ 305 0 R /XYZ 309 581 null ] >>
+<< /D [ 293 0 R /XYZ 305 383 null ] >>
+<< /D [ 293 0 R /XYZ 305 187 null ] >>
+<< /D [ 294 0 R /XYZ 49 484 null ] >>
+<< /D [ 294 0 R /XYZ 49 328 null ] >>
+<< /D [ 294 0 R /XYZ 305 541 null ] >>
+<< /D [ 294 0 R /XYZ 305 478 null ] >>
+<< /D [ 294 0 R /XYZ 305 338 null ] >>
+<< /D [ 294 0 R /XYZ 305 244 null ] >>
+<< /D [ 294 0 R /XYZ 305 175 null ] >>
+<< /D [ 296 0 R /XYZ 49 198 null ] >>
+<< /D [ 296 0 R /XYZ 305 710 null ] >>
+<< /D [ 299 0 R /XYZ 49 381 null ] >>
+<< /D [ 299 0 R /XYZ 305 643 null ] >>
+<< /D [ 299 0 R /XYZ 305 322 null ] >>
+<< /D [ 300 0 R /XYZ 49 613 null ] >>
+<< /D [ 300 0 R /XYZ 49 162 null ] >>
+<< /D [ 300 0 R /XYZ 305 662 null ] >>
+<< /D [ 300 0 R /XYZ 305 283 null ] >>
+<< /D [ 301 0 R /XYZ 49 631 null ] >>
+<< /D [ 301 0 R /XYZ 49 210 null ] >>
+<< /D [ 301 0 R /XYZ 305 467 null ] >>
+<< /D [ 301 0 R /XYZ 305 279 null ] >>
+<< /D [ 301 0 R /XYZ 305 186 null ] >>
+<< /D [ 302 0 R /XYZ 49 307 null ] >>
+<< /D [ 302 0 R /XYZ 305 568 null ] >>
+<< /D [ 302 0 R /XYZ 305 359 null ] >>
+<< /D [ 302 0 R /XYZ 305 174 null ] >>
+<< /D [ 303 0 R /XYZ 49 658 null ] >>
+<< /D [ 303 0 R /XYZ 49 467 null ] >>
+<< /D [ 303 0 R /XYZ 49 265 null ] >>
+<< /D [ 303 0 R /XYZ 305 772 null ] >>
+<< /D [ 291 0 R /XYZ 305 772 null ] >>
+<< /D [ 291 0 R /XYZ 305 594 null ] >>
+<< /D [ 291 0 R /XYZ 305 439 null ] >>
+<< /D [ 293 0 R /XYZ 49 119 null ] >>
+<< /D [ 293 0 R /XYZ 305 621 null ] >>
+<< /D [ 296 0 R /XYZ 305 560 null ] >>
+<< /D [ 296 0 R /XYZ 305 185 null ] >>
+<< /D [ 298 0 R /XYZ 305 674 null ] >>
+<< /D [ 299 0 R /XYZ 49 579 null ] >>
+<< /D [ 289 0 R /XYZ 49 762 null ] >>
+<< /D [ 290 0 R /XYZ 305 762 null ] >>
+<< /D [ 295 0 R /XYZ 49 762 null ] >>
+<< /D [ 295 0 R /XYZ 49 313 null ] >>
+<< /D [ 296 0 R /XYZ 49 762 null ] >>
+<< /D [ 296 0 R /XYZ 49 622 null ] >>
+<< /D [ 298 0 R /XYZ 305 481 null ] >>
+<< /D [ 302 0 R /XYZ 49 762 null ] >>
+<< /D [ 289 0 R /XYZ 49 772 null ] >>
+<< /D [ 290 0 R /XYZ 305 772 null ] >>
+<< /D [ 295 0 R /XYZ 49 772 null ] >>
+<< /D [ 295 0 R /XYZ 49 323 null ] >>
+<< /D [ 296 0 R /XYZ 49 772 null ] >>
+<< /D [ 296 0 R /XYZ 49 633 null ] >>
+<< /D [ 298 0 R /XYZ 305 491 null ] >>
+<< /D [ 302 0 R /XYZ 49 772 null ] >>
+<< /D [ 304 0 R /XYZ 53 353 null ] >>
+<< /D [ 304 0 R /XYZ 53 325 null ] >>
+<< /D [ 304 0 R /XYZ 53 297 null ] >>
+<< /D [ 304 0 R /XYZ 53 277 null ] >>
+<< /D [ 304 0 R /XYZ 53 257 null ] >>
+<< /D [ 304 0 R /XYZ 53 237 null ] >>
+<< /D [ 304 0 R /XYZ 53 209 null ] >>
+<< /D [ 304 0 R /XYZ 53 181 null ] >>
+
+endstream
+endobj
+260 0 obj
+<< /BBox [ 0 0 32 128 ] /Matrix [ 0 -0.445156 0.111288 0 293.01001 431.886536 ] /PaintType 1 /PatternType 1 /Resources 61 0 R /TilingType 2 /Type /Pattern /XStep 32 /YStep 128 /Length 105 >>
+stream
+q
+1 i
+32 128 -32 -128 re
+-.000002 128 m
+W n
+/GS1 gs
+q
+32.402 0 0 128.531 -.14716 -.37761 cm
+/Im1 Do
+Q
+Q
+
+endstream
+endobj
+261 0 obj
+<< /Subtype /Type1C /Length 5325 >>
+stream
+ NHADBG+MuseoSans-700 $%&f
+$-6>HRfour.tnumone.tnumtwo.tnumthree.tnumzero.tnumfive.tnumsix.tnumseven.tnumeight.tnumCopyright (c) 2008 by Jos Buivenga. All rights reserved.SemiBoldMuseoSans-700
" , 3 : B L S i m w E K N NbZF\dU?] i
+
+
+
+1c%p
g
&=)rBKvZ$P> >$> >>}Pvhw]*A11BC53Ew* Z +hfnmgwfmnfu4EB10CC5mgfmnffn,L,@e*)! ?`$8-
,,
-8$`? !)d*A)wfw 57g
Z
i l8'"~T"~%zz)aa "~"~٠vJwZZJ)EcRNRыދaGi$"f= 0GǩfXXkcSa̱`POf`I1
+
+e*5mIG81(0*@2Ae02bH|o#tt#o|"624/6$SNMSN=mv]],]
+6vz
+4s5%O|||i
+u8DA ".3G1Bg/4b`vwZvZwZvwB%i] ;/wwHӠv[wz!ˋEbЋL\!RZ{axMw{]vZw-ϋFb-Z.aGkp.
+
+he27ee27e_23ee23_ #