Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation,
and Challenges
YUNSEO LEE, UNIST, Republic of Korea
JOHN YOUNGEUN SONG, Handong Global University, Republic of Korea
DONGSUN KIM, Korea University, Republic of Korea
JINDAE KIM, Seoul National University of Science and Technology, Republic of Korea
MIJUNG KIM, UNIST, Republic of Korea
JAECHANG NAM†, Handong Global University, Republic of Korea
Recent technical breakthroughs in large language models (LLMs) have enabled them to fluently generate source code. Software
developers often leverage both general-purpose and code-specialized LLMs to revise existing code or even generate a whole function
from scratch. These capabilities are also beneficial in no-code or low-code contexts, in which one can write programs without a
technical background. However, due to their internal design, LLMs are prone to generating hallucinations: content that is incorrect, nonsensical, or unjustifiable, and whose presence is difficult to identify. This problem also occurs when generating source code.
Once hallucinated code is produced, it is often challenging for users to identify and fix it, especially when the hallucination manifests only under specific execution paths. As a result, hallucinated code may remain unnoticed within the codebase. This survey
investigates recent studies and techniques relevant to hallucinations generated by CodeLLMs. We categorize the types of hallucinations
in the code generated by CodeLLMs, review existing benchmarks and mitigation strategies, and identify open challenges. Based on
these findings, this survey outlines further research directions in the detection and removal of hallucinations produced by CodeLLMs.
1 Introduction
Ensuring the accuracy, reliability, and security of code generated by Large Language Models (LLMs) remains a critical
challenge [1, 12, 53]. A primary reason for this is the prevalence of hallucinations — instances where the model generates
code that is illogical, incorrect, or unfaithful to the specified requirements [14]. Addressing these hallucinations is
essential, as they undermine the trustworthiness of the generated code and can introduce significant risks and errors
into software applications.
Although benchmarks such as HumanEval [9] and Mostly Basic Python Programming (MBPP) [6] are commonly
used to evaluate the code generation performance of LLMs, there remains a lack of standardized methods for assessing
the hallucinations generated by CodeLLMs. These general benchmarks only measure syntactical or token-wise
differences between the generated and oracle code. At most, they provide simple test cases with which users can
verify a subset of the dynamic behaviors of the generated code, which is not sufficient for defining, detecting, and
mitigating hallucinations.
To address hallucination issues in code generation tasks, researchers have recently created evaluation benchmarks
and proposed various approaches to addressing these issues. For example, benchmarks such as
Both authors contributed equally to this research. Yunseo Lee conducted this study while he was an undergraduate student at Handong Global University. †Corresponding author.
Authors Contact Information: Yunseo Lee, yunseo.lee@unist.ac.kr, UNIST, Ulsan, Republic of Korea; John Youngeun Song, john.song@handong.edu, Handong Global University, Pohang, Republic of Korea; Dongsun Kim, Korea University, Seoul, Republic of Korea, darkrsw@korea.ac.kr; Jindae Kim, Seoul National University of Science and Technology, Seoul, Republic of Korea, jindae.kim@seoultech.ac.kr; Mijung Kim, UNIST, Ulsan, Republic of Korea, mijungk@unist.ac.kr; Jaechang Nam, Handong Global University, Pohang, Republic of Korea, jcnam@handong.edu.
1
2 Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, and Jaechang Nam
CodeHaluEval [53] and CodeMirage [1] have been developed to measure hallucination frequencies, while mitigation
strategies such as iterative grounding [12] and self-revision feedback loops [37] aim to reduce specific hallucinations.
The goal of this study is to provide a comprehensive analysis of code hallucinations, including their categorization,
evaluation metrics, and mitigation strategies. To achieve this goal, we (1) structure a detailed taxonomy of code
hallucinations, (2) review and categorize existing benchmarks and evaluation metrics used for detecting these
hallucinations, (3) consolidate a list of root causes that contribute to code hallucinations, and (4) survey current
mitigation strategies designed to address them.
2 Differences from other surveys on hallucinations of CodeLLMs
Although hallucinations generated by LLMs in general have been studied in multiple surveys [14, 19, 61], our survey focuses
on hallucinations observed during code generation tasks using LLMs. The following are the key aspects of our survey:
• Focus and Scope: We focus on hallucinations specifically observed from code generation tasks, addressing
unique challenges such as syntactic and semantic discrepancies in code output. In addition, while existing
surveys [14, 18, 22, 64] on code generation analyzed performance, benchmarks, data curation, and evaluation
metrics, they failed to systematically explore code hallucinations. By exploring taxonomy, benchmarks, metrics,
and mitigation strategies tailored to code-specific hallucinations, our survey fills this critical gap and provides a
comprehensive framework for future research.
• Taxonomy and Categorization: Existing hallucination surveys classify hallucinations into input-conflicting,
context-conflicting, and fact-conflicting types [19]. Building upon these classifications, our study introduces a
taxonomy that incorporates specialized hallucination types unique to the code generation process, allowing a
systematic exploration of hallucination issues specific to this domain.
• Integration of Benchmarks: Although other surveys [14, 22, 64] include benchmarks such as HumanEval [9]
and TruthfulQA [33], we identified four datasets and benchmarks explicitly aligned with detecting and mitigating
code hallucination, such as tests for functional correctness and adherence to APIs.
• Exploration of Mitigation Strategies: While previous surveys navigated mitigation approaches for general
natural languages [61], we delve into mitigation strategies such as fine-tuning with code-specific datasets,
leveraging automated testing frameworks, and integrating static and dynamic program analysis tools for
real-time hallucination detection.
3 Paper Collection and Review Schema
3.1 Survey Scope
We aim to cover in full the taxonomy, benchmarks and evaluation metrics, causes of hallucinations, and mitigation
techniques for hallucinations in code generated by CodeLLMs. The criteria for selecting papers are as follows:
• Papers that discuss both LLM-based code generation and LLM hallucination.
• Papers that define code hallucinations or propose taxonomies related to them.
• Papers that propose techniques for detecting or mitigating code hallucinations.
• Papers that introduce datasets or benchmarks for evaluating the performance of CodeLLMs.
To distinguish our study from existing surveys on hallucinations in the Natural Language Processing (NLP) domain
and focus on code generation, we included only papers that addressed both LLM code generation and LLM hallucination.
In particular, we searched for papers that explicitly used terms such as code hallucination or hallucinated code. For
mitigation-related studies, we included papers that addressed the correctness of generated code, even if the term
hallucination was not explicitly mentioned.
3.2 Methodology for Literature Identification
We conducted a systematic literature review. To gather as many relevant studies as possible, we performed Google
Scholar keyword searches using the terms “hallucination” and “code generation”. Considering the rapid
advances in research related to LLMs, the review focused mainly on articles published after 2023, while also including
two notable articles from 2022 based on their significance. Titles, abstracts, and introductions of the retrieved papers
were manually reviewed and categorized into three main categories: Taxonomy, Benchmark, and Mitigation.
In addition, to ensure comprehensive coverage of studies on code hallucination, the snowball method [59] was
employed. Snowballing, commonly used in survey studies, involves tracking citations of identified papers until no
additional relevant papers are found. This process helped identify missing studies from the initial search, as well as NLP
hallucination papers frequently cited in code hallucination research. Although these NLP studies were not included
in the systematic review as they did not focus on code, they provided foundational insights to develop classification
criteria for code hallucinations.
Fig. 1. Distribution of the categorization of papers. [Venn diagram: Taxonomy (5), Benchmarking (20), and Mitigation (16); overlaps: Taxonomy and Benchmarking 3, Taxonomy and Mitigation 4, Benchmarking and Mitigation 2, all three 1.]
Fig. 2. Distribution of papers by venue. [Pie chart: arXiv 51.9% (27), TSE 3.8% (2), NeurIPS 5.8% (3), ICML 5.8% (3), ICSE 5.8% (3), venues with a single paper 26.9% (14).]
We categorized the papers into three key dimensions: Taxonomy, Benchmarking, and Mitigation, as shown in Fig. 1.
Most of the papers fall under the Benchmarking category (20 papers [3, 6–9, 11, 17, 23, 25, 27–30, 35, 38, 49, 50, 60,
66, 67]) and the Mitigation category (16 papers [12, 13, 21, 26, 32, 36, 39, 40, 43, 45, 48, 51, 54, 55, 62, 63]), while fewer
studies are categorized under Taxonomy (five papers [15, 24, 42, 52, 57]). Overlapping areas reveal cross-disciplinary
contributions: four papers address both Taxonomy and Mitigation [31, 37, 44, 65], three papers address both Taxonomy
and Benchmarking [1, 34, 53], and two papers explore both Mitigation and Benchmarking [2, 20]. Only one paper [10]
combines all three dimensions, emphasizing the scarcity of comprehensive studies.
While many papers are still at the preprint stage (e.g., arXiv), authors gradually publish them at top venues in the
community. Fig. 2 shows the distribution of papers by venue. About half of the papers (51.9%) were published on
arXiv. The remaining papers appeared in top-tier conferences (39.2%) such as NeurIPS (Annual Conference on
Neural Information Processing Systems) and ICML (International Conference on Machine Learning), and in academic
journals (7.8%) such as TSE (IEEE Transactions on Software Engineering).
4 LLM-based code generation (CodeLLMs) and its hallucination
CodeLLMs have been developed to address unique challenges in this domain. OpenAI's Codex and its derivative
Copilot are prominent examples that introduced generative pre-trained models with billions of parameters that produce
code snippets [9, 38]. Following these innovations, models such as Anthropic's Claude Sonnet [5], Meta's CodeLLaMA [46],
DeepMind's AlphaCode [30], Salesforce's CodeGen [41], and Amazon's CodeWhisperer [4] entered the landscape,
each addressing different aspects of coding efficiency and applicability. OpenAI further refined its offerings with
GPT-3.5 and GPT-4, showcasing enhanced capabilities in generating syntactically and semantically accurate code. These
advancements are often accompanied by benchmark datasets such as HumanEval [9], DS-1000 [25], and MBPP [6],
which assess the performance of LLMs on diverse coding tasks.
Despite their promise, LLMs face a significant challenge in code generation: hallucinations. Hallucinations,
in this context, refer to the generation of code that is nonsensical, logically flawed, or unfaithful to the given task
description [10]. Studies in the NLP field have classified hallucinations into types such as input-conflicting, context-conflicting,
and fact-conflicting hallucinations [19]. Within code generation, hallucinations can manifest as bugs, syntactic errors,
security vulnerabilities, or even non-deterministic outputs. Existing research highlights that hallucinated outputs not
only degrade functional correctness, but may also introduce subtle errors, such as memory leaks or insecure code [7].
5 Taxonomy of Hallucination by CodeLLMs
In our effort to create a consolidated taxonomy of code hallucinations generated by CodeLLMs, we analyzed relevant
papers that presented their own classifications of hallucinations. Rather than focusing on the causes of hallucinations,
our resulting taxonomy categorizes hallucinations based on the observable characteristics of the errors produced, as shown
in Fig. 3. A key advantage of this approach is that it provides an objective basis for classifying hallucinations, regardless of the
model architecture or the training datasets. The taxonomy consists of four primary categories: Syntactic Hallucinations,
Runtime Execution Hallucinations, Functional Correctness Hallucinations, and Code Quality Hallucinations. In this
section, we discuss each primary category with detailed sub-categories.
5.1 Syntactic Hallucinations
These refer to errors that deviate from the language's syntax, rendering the code unparsable and therefore impossible
to compile or interpret [2, 10, 15, 52, 57]. Syntactic hallucinations can be further classified into two sub-categories:
“Syntax Violations” and “Incomplete Code Generation”.
5.1.1 Syntax Violations. These occur when a CodeLLM generates code that violates the syntax of the programming
language, leading to compile-time errors [1, 10, 57]. Three research papers provide specific taxonomies of syntax
violations [1, 10, 57]. One paper [1] classifies errors in generated code that are related to syntax
Fig. 3. Taxonomy of hallucinations possibly generated by CodeLLMs. [Tree: Code Hallucinations branches into Syntactic Hallucinations (Syntax Violation [1, 10, 57]; Incomplete Code Generation [15, 52]), Runtime Execution Hallucinations (API Knowledge Conflict [10, 34, 65]; Invalid Reference Errors [10, 15, 34, 53, 57]), Functional Correctness Hallucinations (Incorrect Logical Flow [10, 15, 34, 52, 53, 57]; Requirement Deviation [34, 52, 57, 65]), and Code Quality Hallucinations (Resource Mishandling [53, 65]; Security Vulnerability [42, 65]; Code Smell [34, 52, 57, 65]).]
under the term Syntactic Incorrectness. Two papers classify syntax violations further and provide more specific terms
such as Incorrect Indentation, Conditional Error, Loop Error, Return Error, and Assignment Error [10, 57].
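For illustration, a syntax violation of this kind can be caught statically by attempting to parse the generated text. The snippet below is a hypothetical hallucinated generation with a missing colon, not an example drawn from the surveyed papers:

```python
# Illustrative sketch: detecting a syntax-violation hallucination by parsing.
hallucinated = "def add(a, b)\n    return a + b\n"  # missing ':' after the signature
intended = "def add(a, b):\n    return a + b\n"

def is_syntactically_valid(source: str) -> bool:
    """Return True if the source parses as Python code."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

print(is_syntactically_valid(hallucinated))  # False
print(is_syntactically_valid(intended))      # True
```

Such a parse check is the cheapest possible filter: it catches syntax violations before any test case is run.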
5.1.2 Incomplete Code Generation. This occurs when a CodeLLM stops generating code midway or omits entire code
blocks [15, 52]. Regardless of whether a specific syntax rule is violated, incomplete code generation prevents the
code from being executed or compiled.
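A generation whose function body was dropped entirely may still parse; a simple AST check (an illustrative heuristic of ours, not a technique from the surveyed papers) can flag such empty stubs:

```python
import ast

def has_stub_body(source: str) -> bool:
    """Flag functions whose body is only a docstring and/or pass/ellipsis."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # drop docstrings / bare constants, keep real statements
            real = [s for s in node.body
                    if not (isinstance(s, ast.Expr)
                            and isinstance(s.value, ast.Constant))]
            if not real or all(isinstance(s, ast.Pass) for s in real):
                return True
    return False

incomplete = 'def solve(nums):\n    """Sort nums."""\n    pass\n'
complete = "def solve(nums):\n    return sorted(nums)\n"
print(has_stub_body(incomplete), has_stub_body(complete))  # True False
```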
5.2 Runtime Execution Hallucinations
These occur when CodeLLMs generate code that is syntactically valid but produces runtime errors, such as exceptions
or crashes, during execution [10, 15, 34, 52, 53, 57, 65]. Although syntactic correctness is a necessary condition for code
execution, it does not guarantee that the code will function as intended or even run without errors. These hallucinations
manifest only when the code is actually run and may depend on specific inputs or external factors. Unlike syntactic
hallucinations, they do not break the syntax, but cause the program to crash or behave unexpectedly.
5.2.1 API Knowledge Conflict. This occurs when CodeLLMs misuse libraries or APIs, leading to issues such as missing
imports or incorrect or extra parameters [10, 34, 65].
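A common concrete form is a fabricated parameter on a real API. As a hedged illustration, the hypothetical generation below hallucinates a `descending` keyword for Python's built-in `sorted()`, whose real keyword is `reverse`; the misuse surfaces only when the call executes:

```python
nums = [3, 1, 2]

def call_generated(values):
    # Hallucinated call: sorted() has no `descending` parameter.
    # The fabricated argument raises TypeError only at run time.
    return sorted(values, descending=True)

try:
    call_generated(nums)
    api_conflict = False
except TypeError:
    api_conflict = True

print(api_conflict)                # True: the fabricated parameter fails
print(sorted(nums, reverse=True))  # [3, 2, 1]: the intended call
```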
5.2.2 Invalid Reference Errors. These arise when CodeLLMs produce code that attempts to access or manipulate
program elements that are not yet defined in the code [10, 15, 34, 53, 57]. This can manifest as the use of variables that
have not been declared or attempts to access non-existent members of an object.
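For illustration, such an error can be reproduced by executing a hypothetical generated snippet (our own example) that reads a name never defined in its namespace:

```python
# The "generated" snippet reads `values`, which is never defined anywhere.
generated = "total = 0\nfor x in values:\n    total += x\nprint(total)\n"

try:
    exec(generated, {})  # run in a fresh, empty namespace
    invalid_reference = False
except NameError:
    invalid_reference = True

print(invalid_reference)  # True: the undefined name is caught at run time
```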
5.3 Functional Correctness Hallucinations
These arise when CodeLLMs generate code that can execute, but does not satisfy the functional requirements of the
program; they are further categorized as Incorrect Logical Flow and Requirement Deviation [10, 15, 34, 52, 53, 57, 65].
Even if a program is syntactically correct and free from runtime errors, there is no guarantee that it performs its
intended task.
5.3.1 Incorrect Logical Flow. This arises when CodeLLMs generate code that contains flaws in its implementation of
algorithms and reasoning [10, 15, 34, 52, 53, 57]. These hallucinations often lead to an incorrect solution. This category
encompasses flaws such as missing corner cases, incorrect conditional statements, and incorrect arithmetic operations.
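As a hedged illustration, the hypothetical generation below parses and runs, yet an off-by-one loop bound makes its result wrong for most inputs; only a functional test exposes the flaw:

```python
def generated_factorial(n):
    result = 1
    for i in range(1, n):  # off-by-one: should be range(1, n + 1)
        result *= i
    return result

print(generated_factorial(0))  # 1 -- the corner case happens to pass
print(generated_factorial(5))  # 24, although 5! = 120
```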
5.3.2 Requirement Deviation. These arise when CodeLLMs produce code that deviates from the explicit requirements
and functionalities outlined in the prompt or problem description [34, 52, 57, 65]. Given the diverse situations in which
requirement deviation occurs, taxonomies often categorize these errors under broad terms: this category encompasses
terms such as semantic-conflicting hallucinations [34] and functional requirement violations [65], while one
paper [57] mentions the use of an incorrect function that does not match the requirements.
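For illustration, consider a hypothetical prompt asking for scores "in descending order"; the generation below runs cleanly but deviates from that requirement, which only a requirement-derived test reveals:

```python
def generated_solution(scores):
    # Executes without error, but deviates from the requirement:
    # it sorts ascending instead of descending.
    return sorted(scores)

def meets_requirement(solution):
    """The prompt's requirement distilled into a single test case."""
    return solution([1, 3, 2]) == [3, 2, 1]

print(meets_requirement(generated_solution))                 # False
print(meets_requirement(lambda s: sorted(s, reverse=True)))  # True
```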
5.4 Code Quality Hallucinations
These occur when CodeLLMs generate code that introduces risks related to resource management, security vulnerabilities,
or performance degradation [34, 42, 52, 53, 57, 65]. These hallucinations often compromise the stability, security, and
efficiency of the overall system. We categorize these issues into three distinct subcategories: Resource Mishandling,
Security Vulnerability, and Code Smell Issues.
Table 1. Comparative Analysis of Code Hallucination Benchmarks.

CodeHaluEval [53]
• Language: Python; Tasks: 699; Data reference: APPS; Content: not mentioned.
• Purpose: Compare the types and frequencies of hallucinations in code generation across different LLMs.
• Construction: Generated code using the APPS dataset and applied the CodeHalu algorithm to identify the types of hallucinations present and their respective frequencies.

CodeMirage [1]
• Language: Python; Tasks: 1,137; Data reference: HumanEval, MBPP; Content: problems, hallucinated code snippets, ground-truth code snippets, test cases.
• Purpose: Measure LLM capabilities for automatically detecting code hallucinations using one-shot prompts.
• Construction: Designed explicit prompts for each hallucination type and input them into GPT-3.5 to obtain Python code generations exhibiting specific hallucination types.

LMDefects [15]
• Language: Java; Tasks: 113 (easy: 60, medium: 53); Data reference: LeetCode; Content: problem descriptions, code snippets, public test cases.
• Purpose: Evaluate the precision of Codex-generated code and assess the feasibility of applying automated program repair (APR) techniques.
• Construction: Collected public LeetCode problems not included in Codex training, covering a diverse range of Java tasks.

EvalPlus [35]
• Language: Python; Tasks: 164; Data reference: HumanEval; Content: programming tasks, function signatures, and docstrings.
• Purpose: Reveal the real correctness of LLM-synthesized code.
• Construction: Extended HumanEval with type-aware mutations, generating an average of 764.17 test cases per problem to evaluate hallucinations.

CodeContests [30]
• Language: C++, Java, Python, etc.; Tasks: 13,328 (training), 117 (validation), 165 (test); Data reference: Codeforces, CodeChef, etc.; Content: problems, correct and incorrect human submissions, test cases.
• Purpose: Train, validate, and evaluate AlphaCode.
• Construction: Leveraged private and public code-competition problems; test cases were expanded through mutation methods.

MultiPL-E [8]
• Language: 18 languages; Tasks: similar to HumanEval and MBPP; Data reference: HumanEval, MBPP; Content: not mentioned.
• Purpose: Propose the first massively parallel, multi-language benchmark for code generation.
• Construction: Converted Python-based NL2Code benchmarks into 18 programming languages.

HalluCode [34]
• Language: Python; Tasks: 5,663; Data reference: CodeAlpaca; Content: objectives, hallucination categories, task descriptions.
• Purpose: Evaluate the performance of CodeLLMs in recognizing hallucinations.
• Construction: Focused on task-description evaluation and detecting hallucinations specific to programming contexts.
5.4.1 Resource Mishandling. These errors arise when CodeLLMs produce code that improperly manages a system's
resources, leading to excessive consumption or inefficient allocation of memory that can eventually cause the code to
fail [53, 65]. Such hallucinations occur, for example, when generated data-processing operations fail because they exceed
memory capacity, or when numerical overflow occurs because calculation limits are mishandled [53]. In addition,
Zhang et al. [65] mention non-functional requirement issues related to suboptimal performance, such as inefficient
loop structures.
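A minimal, illustrative sketch of this pattern (our own example, not from the surveyed papers): materializing a large intermediate list where a generator would suffice, with the difference in peak memory measured via Python's tracemalloc:

```python
import tracemalloc

def generated_sum(n):
    return sum([x * x for x in range(n)])  # materializes the whole list

def frugal_sum(n):
    return sum(x * x for x in range(n))    # streams values, O(1) extra memory

tracemalloc.start()
generated_sum(100_000)
peak_list = tracemalloc.get_traced_memory()[1]
tracemalloc.reset_peak()
frugal_sum(100_000)
peak_gen = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(generated_sum(10) == frugal_sum(10))  # True: identical result
print(peak_list > peak_gen)                 # True: far larger peak footprint
```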
5.4.2 Security Vulnerability. This arises when CodeLLMs produce code that introduces security weaknesses that
make the system susceptible to attacks or unauthorized access [42, 65]. Although only two papers present taxonomies
covering security vulnerabilities, Pearce et al. [42] provide a detailed analysis of various security vulnerabilities
in generated code. Among the many kinds of security vulnerabilities are improper input validation, use-after-free
errors, and null-pointer dereference errors.
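As a hedged illustration of improper input validation (a hypothetical example of ours), the generated query below is built by string formatting, which lets crafted input reshape the query; the parameterized form keeps the input as data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "nobody' OR '1'='1"

# Vulnerable pattern: attacker-controlled text becomes query syntax.
unsafe = conn.execute(
    f"SELECT secret FROM users WHERE name = '{user_input}'").fetchall()

# Safe pattern: the placeholder treats the input strictly as a value.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (user_input,)).fetchall()

print(unsafe)  # [('s3cret',)] -- the injection leaked a row
print(safe)    # []            -- no user is literally named that
```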
5.4.3 Code Smell. These occur when CodeLLMs produce code with low maintainability due to extraneous or unnecessary
code [34, 52, 57, 65]. Although these hallucinations are not critical security or performance issues, avoiding them
is crucial for the maintainability and readability of the code that human developers use. These issues include
dead code, garbage code, and incomplete generation [34, 52, 57]. They are sometimes called “non-functional
requirement violations”, as the affected code often contains parts that are unreachable, perform useless assignments,
contain only comments, or have empty function bodies [65].
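For illustration, one such pattern, unreachable statements after a return, can be flagged with a simple AST check (an illustrative heuristic of ours, not a tool from the surveyed papers):

```python
import ast

def has_dead_code(source: str) -> bool:
    """Flag statements that follow an unconditional return/raise in a block."""
    for node in ast.walk(ast.parse(source)):
        body = getattr(node, "body", None)
        if isinstance(body, list):
            for stmt in body[:-1]:  # anything after these is unreachable
                if isinstance(stmt, (ast.Return, ast.Raise)):
                    return True
    return False

smelly = "def f(x):\n    return x + 1\n    print('never runs')\n"
clean = "def f(x):\n    return x + 1\n"
print(has_dead_code(smelly), has_dead_code(clean))  # True False
```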
6 Benchmarks and Metrics to Evaluate Hallucinations by CodeLLMs
6.1 Benchmarks
The growing interest in addressing hallucinations in LLM-generated code has led to the development of various
benchmarks. Standard benchmarks are necessary to analyze the hallucination tendencies of various CodeLLMs and
to evaluate hallucination detection and mitigation techniques. Table 1 shows recent benchmarks related to code
hallucination and summarizes their distinct features. Existing benchmarks for evaluating hallucinations by CodeLLMs
have limitations, such as a lack of language diversity and a failure to reflect real-world workloads.
Many of these benchmarks build on existing LLM code generation benchmarks, extending them to overcome those
limitations. EvalPlus, proposed by Liu et al. [35], extends the existing HumanEval benchmark to address its specific
limitations: HumanEval contains vague task descriptions and an insufficient number of test cases per task, and some
solutions labeled as correct were found to be erroneous. EvalPlus addresses these limitations by increasing the average
number of test cases per task to 764.1, leveraging LLMs for seed input generation and employing type-aware mutation
for fuzz testing. CodeMirage [1] assesses the ability of LLMs to detect hallucinations in input code. It was generated
from the HumanEval and MBPP datasets, with artificial hallucinations inserted into the code using the GPT-3.5 model.
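The type-aware mutation idea can be sketched roughly as follows; the mutation rules here are our own illustrative assumptions, not EvalPlus's actual operators:

```python
import random

def type_aware_mutate(value, rng=None):
    """Mutate a seed input while preserving its Python type (illustrative rules)."""
    rng = rng or random.Random(0)
    if isinstance(value, bool):      # bool before int: bool is an int subtype
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1, rng.randint(-100, 100)])
    if isinstance(value, str):
        return value + rng.choice(["a", " ", "0"])
    if isinstance(value, list):
        return [type_aware_mutate(v, rng) for v in value]
    return value                     # unknown types pass through unchanged

seed = [3, 1, 2]
mutant = type_aware_mutate(seed)
print(isinstance(mutant, list) and all(isinstance(v, int) for v in mutant))  # True
```

Because the mutant keeps the seed's type, it remains a valid input for the function under test while exercising new values.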
Among the seven benchmarks we inspected, five support only one programming language, and four of them
(CodeHaluEval, CodeMirage, EvalPlus, and HalluCode) specifically target Python coding tasks. This distribution reflects
the frequent use of Python in scenarios where LLMs generate code. In contrast, Fan et al. [15] propose LMDefects,
a Java-focused benchmark that evaluates the correctness of code generated by Codex and explores the applicability
of automated program repair (APR) techniques to hallucinated code. LMDefects is based on easy- and medium-level
problems from the LeetCode platform and incorporates the public test cases provided by the platform.
Unlike the aforementioned benchmarks, MultiPL-E and CodeContests contain code generation tasks in diverse
programming languages. To rigorously compare models, it is essential to evaluate their ability to generate code in
languages beyond Python; multi-language benchmarks have been developed for this purpose, as CodeLLMs are typically
designed to handle multiple programming languages. Cassano et al. [8] introduced MultiPL-E, a benchmark that
translates Python problems from the HumanEval and MBPP datasets into 18 different programming languages. It
uses 18 custom compilers to translate code snippets, test cases, and other components originally designed for Python
into other languages, allowing a comparative analysis of LLM performance across languages. These compilers are also
extendable to support additional languages in the future.
CodeContests, proposed by Li et al. [30], includes programming challenges from platforms such as Codeforces and
CodeChef to train, validate, and evaluate the AlphaCode model. This dataset supports multiple programming languages,
such as C++, Java, and Python, enabling broader applicability.
6.2 Metrics
To compare and analyze model performance on benchmark datasets in line with their research goals, studies
adopt different evaluation metrics. Selecting appropriate metrics is essential to accurately assess the specific aspects
of a model that a study aims to target. This section examines the evaluation metrics used in the papers addressed
in Section 6.1. Table 2 summarizes the metrics used in various studies to compare the performance of models with
respect to code hallucination. We group the metrics into four categories: Functional Correctness, Hallucination
Detection, Hallucination Type Classification, and Hallucination Mitigation metrics.
Table 2. Comparative Analysis of Code Hallucination Metrics.

Functional Correctness:
• Pass@k – Evaluates the correctness of code generated by a CodeLLM; measures the likelihood that a CodeLLM generates functionally correct code for a given task. [8, 15, 35]
• 10@k – Evaluates a CodeLLM's ability to generate correct code, specifically the ability to produce multiple correct solutions for a single task. [30]

Hallucination Detection:
• Hallucination Rate (HR) – Reflects the hallucination phenomenon in LLMs during code generation tasks through actual execution tests. [53]
• Valid Rate (VR) – Reflects the percentage of valid code outputs by an LLM. [34]
• Accuracy of Hallucination Existence Recognition (ACCrec) – Reflects the percentage of correctly identified existences of hallucinations. [34]

Hallucination Classification:
• Accuracy of Hallucination Type Recognition (ACCtype(i)) – Reflects the percentage of accurately identified hallucination types; Liu et al. propose five types of hallucinations. [34]
• Accuracy, macro-precision, macro-recall, and macro-F1 – Standard metrics used to evaluate multi-class classification performance, where classes represent different hallucination types. [1]

Hallucination Mitigation:
• Accuracy of Hallucination Mitigation (ACCmit) – Reflects the percentage of modified hallucinated code that is semantically correct. [34]

6.2.1 Functional Correctness. This category focuses on evaluating how well the generated code satisfies its intended
requirements. The most common metric, Pass@k, measures the frequency with which at least one of the k generated
solutions passes all test cases. Pass@10, a popular variation, represents the fraction of tasks in which at least one of the
10 generated solutions is correct. On the other hand, 10@k measures the percentage of tasks solved when 10 submissions
are selected from the k samples generated per task. Pass@k and 10@k consider as hallucinations any error that prevents
the generated code from passing all test cases.
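The widely used unbiased estimator of Pass@k (introduced with HumanEval) can be sketched as follows: given n samples per task of which c pass all tests, it estimates the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=0, k=5))            # 0.0: no correct samples at all
print(pass_at_k(n=10, c=10, k=1))           # 1.0: every sample is correct
print(round(pass_at_k(n=10, c=2, k=1), 10)) # 0.2: equals the plain pass rate at k=1
```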
6.2.2 Hallucination Detection. This category quantifies the presence of hallucinations within the generated code. The
metrics of this type are Hallucination Rate (HR), Valid Rate (VR), and Accuracy of Hallucination Existence Recognition
(ACCrec) [34, 53]. HR, as proposed by Tian et al. [53], measures the proportion of generated code samples
that are syntactically valid but fail to execute as expected under their CodeHalu algorithm. VR serves as a measure of
the proportion of generated outputs that are syntactically valid and executable [34]. Thus, a lower VR can suggest that
hallucinations are interfering with the code's ability to run. ACCrec, used in tandem with VR, focuses on how accurately
a model identifies valid code outputs that also contain hallucinations.
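A toy computation makes the relationship concrete; this is a simplified reading of VR and HR over hypothetical execution outcomes, not the exact CodeHalu procedure:

```python
# Toy outcomes for four generated samples: `valid` = parses and executes,
# `passes` = passes all tests. (Hypothetical data for illustration only.)
samples = [
    {"valid": True,  "passes": True},
    {"valid": True,  "passes": False},  # executes but behaves incorrectly
    {"valid": False, "passes": False},  # does not even execute
    {"valid": True,  "passes": False},
]

vr = sum(s["valid"] for s in samples) / len(samples)
hr = sum(s["valid"] and not s["passes"] for s in samples) / len(samples)

print(vr)  # 0.75: share of outputs that are valid and executable
print(hr)  # 0.5:  share that execute yet deviate from expected behavior
```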
6.2.3 Hallucination Type Classification. This category assesses a CodeLLM's ability to recognize and classify
hallucinations. In contrast to detection, type classification aims to categorize given hallucinated code into one of the
predefined hallucination types. The metrics used are Accuracy of Hallucination Type Recognition, ACCtype(i) [34], and
traditional multi-class classification metrics [1]. ACCtype(i) assesses the precision of the model in categorizing the type
of hallucination present in valid code. Agarwal et al. [1] used accuracy, macro-precision, macro-recall, and macro-F1 as
metrics to measure how well hallucinations were detected and classified according to their hallucination types. In this
context, accuracy refers to the percentage of hallucinations whose model-assigned category matches the actual category.
6.2.4 Hallucination Mitigation. This category measures the ability to successfully fix hallucinated code. Accuracy
of Hallucination Mitigation, ACCmit [34], shows the percentage of recognized hallucinations that are successfully
alleviated by CodeLLMs.
7 Causes of Hallucinations in Code Generation
We investigate the causes of hallucinations by CodeLLMs and classify them into three main groups: Training Data Issues,
Trained Model Issues, and Prompt Issues. Fig. 4 presents a hierarchical cause analysis tree for code hallucinations
generated by CodeLLMs, showing the primary causes broken down into more specific factors.
Fig. 4. Potential causes of hallucinations by CodeLLMs. [Tree: Training Data Issues: lack of quantity of training datasets (lack of diverse training sets [22]); low quality of training datasets (training on flawed or vulnerable data [49, 50, 53, 65]; outdated or incomplete public API knowledge [2, 20, 63, 65]). Trained Model Issues: inadequate/inappropriate evaluations and benchmarks (lack of benchmarks [22, 49, 50] and of evaluation metrics [22, 49, 50] for real-world SWE tasks); reasoning and understanding deficiencies, including limited context handling and scope (mishandling of unseen private APIs [62]; low repository and cross-file contextual understanding [2, 10, 12, 22, 32, 48, 55]; low syntactical and structural understanding [15, 22, 53]; insufficient dependency parsing [26]); temperature-related non-determinism [51]; token length limitation [10, 32]; CodeLLMs' lack of a requirement clarification mechanism [39]. Prompt Issues: ambiguous prompts (ambiguous nature of natural languages [13]; ambiguous requirements in prompts [39]); irrelevant context in prompt [32].]
7.1 Training Data Issues
One of the primary causes arises from issues in the quality and quantity of the training data. These can be categorized
as follows: a lack of diverse training sets, training on flawed or vulnerable data, and outdated or incomplete public API
knowledge. The limited diversity of training data restricts a CodeLLM's ability to generalize across various programming
tasks. Jain et al. [20] highlight that the breadth and quality of the training dataset are crucial for correct code generation.
In addition, CodeLLMs often produce code hallucinations because they are trained on public repositories that often
contain deprecated or incomplete API documentation, leading the generated code to invoke non-existent APIs or contain API
misuse [65]. Training on flawed or vulnerable data from open-source projects compounds these issues, as such CodeLLMs
propagate security vulnerabilities and inefficient implementations into the generated code [50].
7.2 Trained Model Issues
Major causes of code hallucinations also stem from issues with the trained models themselves: Inadequate or
Inappropriate Evaluations and Benchmarks, Reasoning and Understanding Deficiencies, Temperature-related
Non-Determinism, Model Input Handling, Token Generation Inefficiency, and CodeLLMs' Lack of a Requirement
Clarification Mechanism.
One contributor to code hallucinations is the use of inadequate evaluation benchmarks that fail to capture real-world
software engineering tasks. Current evaluation metrics and benchmarks often do not accurately represent real-world
tasks. CodeLLMs are frequently evaluated using benchmarks that lack the constructs necessary to assess the security
10 Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, and Jaechang Nam
of generated code [49, 50]. The need for comprehensive benchmarks and metrics that evaluate a broader spectrum of
coding skills remains pressing [22].
Another crucial factor contributing to code hallucinations is the inherent reasoning and understanding deficiencies
of the trained models. One common deficiency is a CodeLLM's limited ability to handle code context. As LLMs receive
a larger code context, they often mishandle unseen private APIs and have trouble understanding code across files
and entire repositories [1, 10, 12, 32, 48, 56]. LLMs lack prior knowledge about private libraries and struggle
to leverage external, pre-existing knowledge unless they are augmented with retrieval-based generation
techniques [62]. This lack of context is evident when generating functions with repository-level dependencies [56].
The limited capacity of CodeLLMs to grasp the underlying structure and syntax of programming languages also highlights
their reasoning and understanding deficiencies [15, 22, 53]. Transformer-based LLM architectures, though now the norm,
may not be optimally designed to fully capture the inherent structure and syntax of programming languages [22].
The reliance of CodeLLMs on pattern matching and statistical rules to generate code results in a lack of fundamental
understanding of symbolic systems [53]. Treating code as a mere series of tokens causes language models to lack awareness
of program semantics and leads to the generation of incorrect programs [15].
The non-deterministic nature of CodeLLMs, which is controlled by temperature settings and decoding strategies, is
an inherent issue with the trained model. The temperature parameter in CodeLLMs governs the randomness of the
generated responses: lower temperatures yield more predictable and deterministic outputs, while higher temperatures
increase creativity and diversity [51]. While higher temperatures benefit creative code generation,
they also increase the rate of code hallucinations [51].
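A minimal sketch of temperature-scaled sampling shows why the parameter controls determinism. This is the standard softmax-with-temperature formulation, not code from any of the surveyed systems:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample a token index from logits after temperature scaling.

    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more diverse, more hallucination-prone).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

At temperature near zero the scaled logits make the top token's probability approach 1, so sampling degenerates to greedy decoding; raising the temperature spreads probability mass onto lower-ranked (and potentially hallucinated) tokens.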
Another source of code hallucinations arises from limitations in how trained models handle input
tokens. CodeLLMs have an input token length limit, which impacts their ability to retain all problem details [10]. This
makes it impossible to feed entire code repositories to CodeLLMs to effectively generate code [32].
The limitations of existing CodeLLMs in handling ambiguous requirements can be another source of code hallucinations.
Current CodeLLMs often lack a mechanism to clarify unclear or incomplete instructions, which can cause hallucinations
that do not satisfy the user's requirements [39].
7.3 Prompt Issues
The third major cause of code hallucinations lies in the prompts themselves. Two contributing factors are the ambiguous
nature of the prompt and the presence of insufficient or irrelevant context in the prompt. A significant challenge
originates from the inherent ambiguity of natural language: natural language prompts tend not to capture the user's
intent in a fully nuanced and accurate manner, which makes generating code from such ambiguous
prompts challenging [13, 39]. Furthermore, code hallucinations can arise from contextual deficiencies in the
prompt. Providing insufficient context or including irrelevant details can hinder a CodeLLM's ability to generate
accurate and satisfactory code [32].
8 Hallucination Mitigation Methods
Various approaches to mitigate hallucinations are being actively explored. Among these, five approaches were selected
for comparative analysis. The following sections provide an overview of the specific hallucination types each approach
targets, the root causes they address, and a brief description of each method, along with its strengths and limitations.
8.1 De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
Eghbali and Pradel [12] target hallucinated API usages in code completion, where two main challenges arise. The first
is that LLMs lack knowledge of project-specific APIs and may fail to correctly use existing functions and classes. To
investigate this issue, they selected five functions from each of ten open-source projects to create a code completion
task. Experimental results showed that 44% of the generated code contained at least one instance of incorrect API usage.
Reducing such issues would require providing the entire project code as input, but this is practically impossible
due to input length constraints; selecting only the essential code snippets to include therefore becomes critical. The
second challenge lies in accurately identifying how important each piece of code is for this purpose. To address these
challenges, they proposed an approach named De-Hallucinator that iteratively retrieves relevant APIs to improve the prompts.
De-Hallucinator [12] pre-analyzes and indexes all source code within the project in advance. When a code
generation prompt is provided, it selects the most relevant APIs based on the input and creates a Retrieval-Augmented
Generation (RAG) prompt that includes these APIs. Alternatively, it generates an iterative prompt that incorporates
the APIs most relevant to the code produced by the initial prompt. These prompts are then used as inputs for code
generation. This approach has the advantage of not requiring modifications to the internal structure of the LLM.
However, it has the drawback of relying on the project to contain well-documented and detailed API descriptions.
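The retrieval step can be sketched as follows. The token-overlap ranking and the `generate` callable are simplified, hypothetical stand-ins for De-Hallucinator's actual retrieval mechanism and model interface:

```python
import re

def tokens(s):
    """Lowercase word tokens; underscores and punctuation split identifiers."""
    return set(re.findall(r"[a-z]+", s.lower()))

def jaccard(a, b):
    """Token-set similarity between two code/text snippets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_apis(query, api_index, k=3):
    """Rank pre-indexed project API signatures by similarity to the query."""
    ranked = sorted(api_index, key=lambda sig: jaccard(query, sig), reverse=True)
    return ranked[:k]

def iterative_prompt(user_prompt, api_index, generate, rounds=2, k=3):
    """Re-prompt the model, each round grounding it with the APIs most
    relevant to its previous draft (`generate` is a hypothetical callable)."""
    draft = generate(user_prompt)
    for _ in range(rounds):
        apis = retrieve_apis(user_prompt + " " + draft, api_index, k)
        grounded = "# Relevant project APIs:\n" + "\n".join(apis) + "\n" + user_prompt
        draft = generate(grounded)
    return draft
```

The key design point is that the second and later rounds retrieve against the model's own draft, so APIs the model already tried (perhaps incorrectly) pull their true project-level signatures into the prompt.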
8.2 Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
Liu et al. [37] proposed a hallucination mitigation method leveraging ChatGPT's self-revision capabilities. The approach
aims to address all code quality issues in LLM-generated code, including execution errors, incorrect outputs, and
maintainability problems. The method provides two types of feedback to the LLM immediately after code generation:
simple feedback and feedback with static analysis:
• Simple feedback: This feedback involves informing the model that the generated code contains quality issues
without specifying details.
• Feedback with static analysis: This feedback includes more detailed information, such as static analysis
results and runtime error messages for the generated code.
The study found that using these feedback methods enabled ChatGPT to self-revise 20% to 60% of the generated code.
Furthermore, iterative feedback led to a gradual improvement in code quality over time.
This approach has the advantage of reflecting common scenarios in which developers use LLMs for code generation, effectively
demonstrating its mitigation performance. However, it has limitations, including the requirement for developers to
craft prompts manually and the need for a basic understanding of static analysis tools and error messages.
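The feedback loop can be sketched as follows, assuming hypothetical `generate` (the LLM) and `check` (a static analyzer or test runner returning a list of issue messages) callables; this is a simplification of Liu et al.'s setup:

```python
def refine(generate, check, prompt, max_rounds=3):
    """Iteratively ask the model to revise its code based on feedback.

    `check` returns a list of issue messages (e.g., from a static
    analyzer or runtime errors); an empty list means the code is clean.
    """
    code = generate(prompt)
    for _ in range(max_rounds):
        issues = check(code)
        if not issues:
            break
        feedback = "The code has these issues:\n" + "\n".join(issues)
        code = generate(prompt + "\n" + feedback + "\nPlease fix the code.")
    return code
```

Passing concrete analyzer output in `feedback` corresponds to the "feedback with static analysis" variant; replacing it with a generic "the code has quality issues" message corresponds to "simple feedback".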
8.3 SynCode: LLM Generation with Grammar Augmentation
Ugare et al. [54] focused on Syntax Violation Hallucinations. Grammar-guided generation has recently been widely
proposed [16, 43, 47, 58] to ensure that LLM-generated code adheres strictly to predefined grammatical rules [54].
These methods modify the LLM's decoding algorithm to ensure that the model consistently selects tokens conforming
to a specific formal language. However, the tokens used by the model are predefined during training, and this often
leads to token misalignment, where the model's tokens do not match the terminals used in the specified grammar.
This misalignment is a significant factor contributing to the high error rates observed in grammar-guided generation.
To address this issue, the SynCode algorithm was proposed, leveraging the EBNF (Extended Backus-Naur Form)
representation of context-free grammar to guide the LLM during the decoding process. This ensures that the model
produces grammatically correct outputs throughout the generation process. The advantage of this approach is its
versatility, as it can be applied to any type of LLM decoding algorithm and supports all programming languages.
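A toy version of grammar-driven token masking illustrates the core idea; the `allowed` grammar oracle below is a hypothetical stand-in for SynCode's EBNF-based machinery:

```python
def constrained_decode(step_logits, allowed, vocab):
    """Greedy decoding where, at each step, tokens outside the set
    permitted by the grammar are masked out.

    `allowed(prefix)` is a hypothetical grammar oracle returning the
    set of vocabulary strings legal after the generated prefix.
    """
    out = []
    for logits in step_logits:
        prefix = "".join(out)
        legal = allowed(prefix)
        # pick the highest-scoring token that the grammar permits
        best = max((i for i, t in enumerate(vocab) if t in legal),
                   key=lambda i: logits[i])
        out.append(vocab[best])
    return "".join(out)
```

Even when the model's raw logits favor a syntactically illegal token, the mask forces the choice back into the formal language, which is how such approaches eliminate syntax-violation hallucinations by construction.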
8.4 ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification
Mu et al. [39] proposed a method to mitigate hallucinations caused by ambiguous prompts. Generating correct code
requires a clear understanding of the user's requirements, but the necessary information might not always be fully
included in the LLM's prompt. In real-world scenarios, developers often address ambiguous requirements by asking
clarifying questions to gather additional information. Inspired by this practice, they introduced a novel framework
in which the LLM generates clarifying questions to help users refine their prompts.
The core challenges of this approach lie in determining when to ask questions and what questions to ask. To address
the first challenge, they implemented a code consistency check. This involves generating test inputs based on the
user's prompt and asking the LLM to produce n code solutions aligned with the prompt. The generated code solutions
are executed with the test inputs, and the resulting outputs are compared. If the similarity among the outputs is low,
a clarifying question is deemed necessary. This method is based on the intuition that a better understanding of
the requirements should result in more consistent code solutions.
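The consistency check can be sketched as follows; the similarity measure used here (fraction of solutions agreeing on each test input) is a simplification of ClarifyGPT's actual output comparison:

```python
from collections import Counter

def needs_clarification(solutions, test_inputs, threshold=0.8):
    """Run n candidate solutions on shared test inputs; if their outputs
    disagree too much, the requirement is likely ambiguous and a
    clarifying question should be asked."""
    agreement = []
    for x in test_inputs:
        outputs = []
        for solve in solutions:
            try:
                outputs.append(repr(solve(x)))
            except Exception as e:  # a crashing solution counts as its own "output"
                outputs.append("error:" + type(e).__name__)
        most_common = Counter(outputs).most_common(1)[0][1]
        agreement.append(most_common / len(solutions))
    return sum(agreement) / len(agreement) < threshold
```

The `threshold` value is illustrative; in practice it would be tuned so that genuinely unambiguous prompts rarely trigger a question.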
For the second challenge, they employed reasoning-based prompts to help the LLM identify the elements of the prompt
causing ambiguity and generate targeted clarifying questions. The reasoning-based prompt includes instructions for
clarifying question generation, few-shot examples, and the user's requirements alongside the generated code solutions.
The ClarifyGPT framework has the advantage of achieving mitigation effects without requiring direct modifications
to a model. It also aids developers who struggle to craft clear prompts. However, this approach has significant drawbacks,
including high overhead due to the processes of input generation, code generation, and clarifying question generation.
Additionally, the examples for the question-generation prompt must be manually crafted.
8.5 LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
Zhang et al. [65] analyzed the types of LLM hallucinations in code generation and the potential factors that cause
them. Based on the findings, they suggested a mitigation method based on RAG. The study identified three primary root
causes of hallucinations in LLM-generated code: (1) incorrect or insufficient understanding of task requirements, (2)
lack of factual knowledge relevant to the generation tasks, and (3) inability to access the necessary code and non-code
resources from the repository. To mitigate these issues, the authors proposed a RAG-based approach. They first created
a retrieval corpus by scanning all source files from repositories in the CoderEval dataset and extracting consecutive lines
of code. When a query is presented to the LLM, the system retrieves related code snippets from the corpus, appending
the most relevant ones to the prompt.
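Corpus construction can be sketched as a sliding window over source lines; the window and stride values below are illustrative, not the parameters used in the paper:

```python
def build_corpus(files, window=8, stride=4):
    """Slice repository source files into overlapping line windows to form
    a retrieval corpus (a simplified stand-in for the corpus built from
    consecutive lines of CoderEval repositories)."""
    corpus = []
    for text in files:
        lines = text.splitlines()
        for start in range(0, max(len(lines) - window + 1, 1), stride):
            corpus.append("\n".join(lines[start:start + window]))
    return corpus
```

Overlapping windows trade corpus size for recall: a relevant definition split across two non-overlapping chunks would otherwise never be retrieved whole.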
This approach has several advantages. It requires no additional effort from users, ensures that only essential
information necessary for code generation is provided to the model, and supports handling project-specific APIs.
However, its effectiveness is significantly influenced by the quality and quantity of the source code available for retrieval.
Moreover, the retrieval process introduces overhead, which can impact efficiency.
Despite these challenges, the RAG-based mitigation method demonstrated a modest reduction in hallucinations
across six LLMs. This study serves as a pilot exploration of RAG-based mitigation methods, shedding light on their
possible applications in reducing hallucinations in LLMs.
9 Discussion and Conclusion
The findings in this paper suggest several promising directions for future research. First, the development of more
diverse and representative benchmark datasets, encompassing various programming languages and use cases, is
essential for evaluating LLMs in broader contexts. Second, advances in hallucination mitigation techniques, such as
retrieval-augmented generation, clarifying question frameworks, and grammar-guided decoding, indicate the potential
of combining multiple approaches to enhance reliability. Third, the integration of LLMs into real-world software
development workflows calls for adaptive techniques that can dynamically address context-specific hallucinations,
improving practical usability. By synthesizing these insights, this study serves as a roadmap for advancing research and
development in LLM code generation, ultimately contributing to the creation of more robust and trustworthy systems.
References
[1] Vibhor Agarwal, Yulong Pei, Salwa Alamir, and Xiaomo Liu. 2024. CodeMirage: Hallucinations in Code Generated by Large Language Models. doi:10.48550/arXiv.2408.08333 arXiv:2408.08333
[2] Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu Lahiri, and Sriram Rajamani. 2024. Monitor-guided decoding of code LMs with static analysis of repository context. Advances in Neural Information Processing Systems 36 (2024).
[3] Miltiadis Allamanis, Sheena Panthaplackel, and Pengcheng Yin. 2024. Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. arXiv preprint arXiv:2402.08699 (2024).
[4] Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html.
[5] Anthropic. 2025. Claude 3.7 Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet.
[6] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[7] Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, et al. 2023. Purple Llama CyberSecEval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724 (2023).
[8] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. 2023. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering 49, 7 (July 2023), 3675-3691. doi:10.1109/TSE.2023.3267446
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[10] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. 2024. What's Wrong with Your Code Generated by Large Language Models? An Extensive Study. doi:10.48550/arXiv.2407.06153 arXiv:2407.06153
[11] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation. arXiv preprint arXiv:2308.01861 (2023).
[12] Aryaz Eghbali and Michael Pradel. 2024. De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding. doi:10.48550/arXiv.2401.01701 arXiv:2401.01701
[13] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering 50, 9 (Sept. 2024), 2254-2268. doi:10.1109/TSE.2024.3428972
[14] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 31-53. doi:10.1109/ICSE-FoSE59343.2023.00008
[15] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of Programs from Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE '23). IEEE Press, Melbourne, Victoria, Australia, 1469-1481. doi:10.1109/ICSE48619.2023.00128
[16] Georgi Gerganov et al. 2024. llama.cpp: Port of Facebook's LLaMA model in C/C++. https://github.com/guidance-ai/guidance.
[17] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938 (2021).
[18] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 33, 8 (Dec. 2024), 220:1-220:79. doi:10.1145/3695988
[19] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155
[20] Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. 2024. On Mitigating Code LLM Hallucinations with API Documentation. arXiv preprint arXiv:2407.09726 (2024).
[21] Kevin Jesse, Toufique Ahmed, Premkumar T Devanbu, and Emily Morgan. 2023. Large language models and simple, stupid bugs. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 563-575.
[22] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 (2024).
[23] Nan Jiang, Qi Li, Lin Tan, and Tianyi Zhang. 2024. Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code. arXiv preprint arXiv:2410.09997 (2024).
[24] Raphaël Khoury, Anderson R Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How secure is code generated by ChatGPT? In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2445-2451.
[25] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319-18345.
[26] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314-21328.
[27] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
[28] Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. 2024. MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems. In Findings of the Association for Computational Linguistics: EMNLP 2024. 736-783.
[29] Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852 (2023).
[30] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (Dec. 2022), 1092-1097. doi:10.1126/science.abq1158
[31] Yifan Li, Ensheng Shi, Dewu Zheng, Kefeng Duan, Jiachi Chen, and Yanlin Wang. 2024. RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening. In Proceedings of the 15th Asia-Pacific Symposium on Internetware. 229-238.
[32] Dianshu Liao, Shidong Pan, Xiaoyu Sun, Xiaoxue Ren, Qing Huang, Zhenchang Xing, Huan Jin, and Qinying Li. 2024. A3-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware. IEEE Transactions on Software Engineering (2024).
[33] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. doi:10.48550/arXiv.2109.07958 arXiv:2109.07958 [cs]
[34] Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024. Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. doi:10.48550/arXiv.2404.00971 arXiv:2404.00971
[35] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, 21558-21572.
[36] Mingwei Liu, Tianyong Yang, Yiling Lou, Xueying Du, Ying Wang, and Xin Peng. 2023. CodeGen4Libs: A two-stage approach for library-oriented code generation. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 434-445.
[37] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2024. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. ACM Trans. Softw. Eng. Methodol. 33, 5 (June 2024), 116:1-116:26. doi:10.1145/3643674
[38] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[39] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proc. ACM Softw. Eng. 1, FSE (July 2024), 103:2332-103:2354. doi:10.1145/3660810
[40] Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. LEVER: Learning to Verify Language-to-Code Generation with Execution. In Proceedings of the 40th International Conference on Machine Learning (ICML '23, Vol. 202). JMLR.org, Honolulu, Hawaii, USA, 26106-26128.
[41] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. doi:10.48550/arXiv.2203.13474 arXiv:2203.13474 [cs]
[42] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754-768.
[43] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227 (2022).
[44] Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, and Reynold Cheng. 2024. Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. arXiv preprint arXiv:2405.15307 (2024).
[45] Kia Rahmani, Mohammad Raza, Sumit Gulwani, Vu Le, Daniel Morris, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari. 2021. Multi-modal program inference: A marriage of pre-trained language models and component-based synthesis. Proceedings of the ACM on Programming Languages 5, OOPSLA (2021), 1-29.
[46] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. doi:10.48550/arXiv.2308.12950 arXiv:2308.12950 [cs]
[47] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093 (2021).
[48] Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning. PMLR, 31693-31715.
[49] Mohammed Latif Siddiq, Joanna Cecilia da Silva Santos, Sajith Devareddy, and Anna Muller. 2024. SALLM: Security assessment of generated code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops. 54-65.
[50] Mohammed Latif Siddiq and Joanna CS Santos. 2022. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. 29-33.
[51] Joseph Spracklen, Raveen Wijewickrama, AHM Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. 2024. We have a package for you! A comprehensive analysis of package hallucinations by code generating LLMs. arXiv preprint arXiv:2406.10279 (2024).
[52] Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Giuliano Antoniol. 2024. Bugs in Large Language Models Generated Code: An Empirical Study. doi:10.48550/arXiv.2403.08937 arXiv:2403.08937
[53] Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, and Dawn Song. 2024. CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification. doi:10.48550/arXiv.2405.00253 arXiv:2405.00253
[54] Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. SynCode: LLM Generation with Grammar Augmentation. doi:10.48550/arXiv.2403.01632 arXiv:2403.01632
[55] Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024. Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024).
[56] Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. RLCoder: Reinforcement Learning for Repository-Level Code Completion. arXiv:2407.19487 [cs.SE] https://arxiv.org/abs/2407.19487
[57] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2024. Where Do Large Language Models Fail When Generating Code? arXiv preprint arXiv:2406.08731 (2024).
[58] Brandon T Willard and Rémi Louf. 2023. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702 (2023).
[59] Claes Wohlin. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE '14). Association for Computing Machinery, New York, NY, USA, 1-10. doi:10.1145/2601248.2601268
[60] Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, et al. 2023. CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation. arXiv preprint arXiv:2311.08588 (2023).
[61] Chen Yang, Yan Liu, and Changqing Yin. 2021. Recent Advances in Intelligent Source Code Generation: A Survey on Natural Language Based Studies. Entropy 23, 9 (Sept. 2021), 1174. doi:10.3390/e23091174
[62] Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, and Yongji Wang. 2023. Private-library-oriented code generation with large language models. arXiv preprint arXiv:2307.15370 (2023).
[63] Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. 2023. ToolCoder: Teach code generation models to use API search tools. arXiv preprint arXiv:2305.04032 (2023).
[64] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2024. Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. doi:10.48550/arXiv.2311.07989 arXiv:2311.07989 [cs]
[65] Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, and Zibin Zheng. 2024. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. arXiv preprint arXiv:2409.20550 (2024).
[66] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673-5684.
[67] Li Zhong and Zilong Wang. 2024. Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19 (March 2024), 21841-21849. doi:10.1609/aaai.v38i19.30185