Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges

YUNSEO LEE∗, UNIST, Republic of Korea
JOHN YOUNGEUN SONG∗, Handong Global University, Republic of Korea
DONGSUN KIM, Korea University, Republic of Korea
JINDAE KIM, Seoul National University of Science and Technology, Republic of Korea
MIJUNG KIM, UNIST, Republic of Korea
JAECHANG NAM†, Handong Global University, Republic of Korea
Recent technical breakthroughs in large language models (LLMs) have enabled them to fluently generate source code. Software developers often leverage both general-purpose and code-specialized LLMs to revise existing code or even generate whole functions from scratch. These capabilities are also beneficial in no-code or low-code contexts, in which one can write programs without a technical background. However, due to their internal design, LLMs are prone to generating hallucinations: incorrect, nonsensical, or unjustifiable content whose presence is difficult to identify. This problem also occurs when generating source code. Once hallucinated code is produced, it is often challenging for users to identify and fix it, especially when such hallucinations manifest only under specific execution paths. As a result, hallucinated code may remain unnoticed within the codebase. This survey investigates recent studies and techniques relevant to hallucinations generated by CodeLLMs. We categorize the types of hallucinations in the code generated by CodeLLMs, review existing benchmarks and mitigation strategies, and identify open challenges. Based on these findings, this survey outlines further research directions in the detection and removal of hallucinations produced by CodeLLMs.
1 Introduction

Ensuring the accuracy, reliability, and security of code generated by Large Language Models (LLMs) remains a critical challenge [1, 12, 53]. A primary reason for this is the prevalence of hallucinations — instances where the model generates code that is illogical, incorrect, or unfaithful to the specified requirements [14]. Addressing these hallucinations is essential, as they undermine the trustworthiness of the generated code and can introduce significant risks and errors into software applications.
Although benchmarks such as HumanEval [9] and Mostly Basic Python Programming (MBPP) [6] are commonly used to evaluate the code generation performance of LLMs, there remains a lack of standardized methods for assessing the hallucinations generated by CodeLLMs. These general benchmarks measure only the syntactic or token-wise differences between the generated and oracle code. At most, the benchmarks provide simple test cases with which users can verify a subset of the dynamic behaviors of the generated code, which is not sufficient for defining, detecting, and mitigating hallucinations.
To address hallucination issues in code generation tasks, many researchers have recently created evaluation benchmarks and proposed various approaches to addressing them. For example, benchmarks such as CodeHaluEval [53] and CodeMirage [1] have been developed to measure hallucination frequencies, while mitigation strategies such as iterative grounding [12] and self-revision feedback loops [37] aim to reduce specific hallucinations.

∗Both authors contributed equally to this research. Yunseo Lee conducted this study while he was an undergraduate student at Handong Global University. †Corresponding author.
Authors’ Contact Information: Yunseo Lee, yunseo.lee@unist.ac.kr, UNIST, Ulsan, Republic of Korea; John Youngeun Song, john.song@handong.edu, Handong Global University, Pohang, Republic of Korea; Dongsun Kim, Korea University, Seoul, Republic of Korea, darkrsw@korea.ac.kr; Jindae Kim, Seoul National University of Science and Technology, Seoul, Republic of Korea, jindae.kim@seoultech.ac.kr; Mijung Kim, UNIST, Ulsan, Republic of Korea, mijungk@unist.ac.kr; Jaechang Nam, Handong Global University, Pohang, Republic of Korea, jcnam@handong.edu.
The goal of this study is to provide a comprehensive analysis of code hallucinations, including their categorization, evaluation metrics, and mitigation strategies. To achieve this goal, we (1) structure a detailed taxonomy of code hallucinations, (2) review and categorize existing benchmarks and evaluation metrics used for detecting these hallucinations, (3) consolidate a list of root causes that contribute to code hallucinations, and (4) survey current mitigation strategies designed to address code hallucinations.
2 Differences from other surveys on hallucinations of CodeLLMs

Although hallucinations generated by LLMs in general have been studied in multiple surveys [14, 19, 61], our survey focuses on hallucinations observed during code generation tasks using LLMs. The following are the key aspects of our survey:
• Focus and Scope: We focus on hallucinations specifically observed in code generation tasks, addressing unique challenges such as syntactic and semantic discrepancies in code output. In addition, while existing surveys [14, 18, 22, 64] on code generation analyzed performance, benchmarks, data curation, and evaluation metrics, they did not systematically explore code hallucinations. By exploring taxonomy, benchmarks, metrics, and mitigation strategies tailored to code-specific hallucinations, our survey fills this critical gap and provides a comprehensive framework for future research.
• Taxonomy and Categorization: Existing hallucination surveys classify hallucinations into input-conflicting, context-conflicting, and fact-conflicting types [19]. Building upon these classifications, our study introduces a taxonomy that incorporates specialized hallucination types unique to the code generation process, allowing a systematic exploration of hallucination issues specific to this domain.
• Integration of Benchmarks: Although other surveys [14, 22, 64] include benchmarks such as HumanEval [9] and TruthfulQA [33], we identified four datasets and benchmarks explicitly aligned with detecting and mitigating code hallucination, such as tests for functional correctness and adherence to APIs.
• Exploration of Mitigation Strategies: While previous surveys navigated mitigation approaches for general natural languages [61], we delve into mitigation strategies such as fine-tuning with code-specific datasets, leveraging automated testing frameworks, and integrating static and dynamic program analysis tools for real-time hallucination detection.
3 Paper Collection and Review Schema

3.1 Survey Scope

We aim to comprehensively cover the taxonomy, benchmarks and evaluation metrics, causes of hallucinations, and mitigation techniques for hallucinations in code generated by CodeLLMs. The criteria for selecting papers are as follows:
• Papers that discuss both LLM-based code generation and LLM hallucination.
• Papers that define code hallucinations or propose taxonomies related to them.
• Papers that propose techniques for detecting or mitigating code hallucinations.
• Papers that introduce datasets or benchmarks for evaluating the performance of CodeLLMs.
To distinguish our study from existing surveys on hallucinations in the Natural Language Processing (NLP) domain and to focus on code generation, we included only papers that addressed both LLM code generation and LLM hallucination. In particular, we searched for papers that explicitly used terms such as code hallucination or hallucinated code. For mitigation-related studies, we included papers that addressed the correctness of generated code, even if the term hallucination was not explicitly mentioned.
3.2 Methodology for Literature Identification

We conducted a systematic literature review. To gather as many relevant studies as possible, Google Scholar keyword searches were performed using the terms “hallucination” and “code generation”. Considering the rapid advances in research related to LLMs, the review focused mainly on articles published after 2023, while also including two notable articles from 2022 based on their significance. Titles, abstracts, and introductions of the retrieved papers were manually reviewed and categorized into three main categories: Taxonomy, Benchmark, and Mitigation.
In addition, to ensure comprehensive coverage of studies on code hallucination, the snowball method [59] was employed. Snowballing, commonly used in survey studies, involves tracking citations of identified papers until no additional relevant papers are found. This process helped identify studies missing from the initial search, as well as NLP hallucination papers frequently cited in code hallucination research. Although these NLP studies were not included in the systematic review because they did not focus on code, they provided foundational insights for developing classification criteria for code hallucinations.
Fig. 1. Distribution of the categorization of papers (Venn diagram over Taxonomy, Benchmarking, and Mitigation; the counts appear in the text below).
Fig. 2. Distribution of papers by venue: arXiv 51.9% (27); venues with a single paper 26.9% (14); NeurIPS 5.8% (3); ICML 5.8% (3); ICSE 5.8% (3); TSE 3.8% (2).
We categorized the papers into three key dimensions: Taxonomy, Benchmarking, and Mitigation, as shown in Fig. 1. Most of the papers fall under the Benchmarking category (20 papers [3, 6–9, 11, 17, 23, 25, 27–30, 35, 38, 49, 50, 60, 66, 67]) and the Mitigation category (16 papers [12, 13, 21, 26, 32, 36, 39, 40, 43, 45, 48, 51, 54, 55, 62, 63]), while fewer studies are categorized under Taxonomy (five papers [15, 24, 42, 52, 57]). Overlapping areas reveal cross-disciplinary contributions: four papers address both Taxonomy and Mitigation [31, 37, 44, 65], three papers address both Taxonomy and Benchmarking [1, 34, 53], and two papers explore both Mitigation and Benchmarking [2, 20]. Only one paper [10] combines all three dimensions, emphasizing the scarcity of comprehensive studies.
While many papers are still at the preprint stage (e.g., arXiv), authors are gradually publishing their work at top venues in the community. Fig. 2 shows the distribution of papers by venue. About half of the papers (51.9%) were published on arXiv. The remaining papers were published in top-tier conferences (39.2%) such as NeurIPS (Annual Conference on Neural Information Processing Systems) and ICML (International Conference on Machine Learning), and academic journals (7.8%) such as TSE (IEEE Transactions on Software Engineering).
4 LLM-based code generation (CodeLLMs) and its hallucination

CodeLLMs have been developed to address unique challenges in this domain. OpenAI’s Codex and its derivative Copilot are prominent examples that introduced generative pre-trained models with billions of parameters that produce snippets [9, 38]. Following these innovations, models such as Anthropic’s Claude Sonnet [5], Meta’s CodeLLaMA [46], DeepMind’s AlphaCode [30], Salesforce’s CodeGen [41], and Amazon’s CodeWhisperer [4] entered the landscape, each addressing different aspects of coding efficiency and applicability. OpenAI further refined its offerings with GPT-3.5 and GPT-4, showcasing enhanced capabilities in generating syntactically and semantically accurate code. These advancements are often accompanied by benchmark datasets such as HumanEval [9], DS-1000 [25], and MBPP [6], which assess the performance of LLMs on diverse coding tasks.
Despite their promise, LLMs face significant challenges in code generation, including hallucinations. Hallucinations, in this context, refer to the generation of code that is nonsensical, logically flawed, or unfaithful to the given task description [10]. Studies in the NLP field have classified hallucinations into types such as input-conflicting, context-conflicting, and fact-conflicting hallucinations [19]. Within code generation, hallucinations can manifest as bugs, syntactical errors, security vulnerabilities, or even non-deterministic outputs. Existing research highlights that hallucinated outputs not only degrade functional correctness, but may also introduce subtle errors, such as memory leaks or insecure code [7].
5 Taxonomy of Hallucination by CodeLLMs

In our effort to create a consolidated taxonomy of code hallucinations generated by CodeLLMs, we analyzed relevant papers that presented their own classifications of hallucinations. Rather than focusing on the causes of hallucination, our resulting taxonomy categorizes hallucinations based on the observable characteristics of the errors produced, as shown in Fig. 3. A key advantage of this approach is that it provides an objective basis for classifying hallucinations, regardless of the model architecture or the training datasets. The taxonomy consists of four primary categories: Syntactic Hallucinations, Runtime Execution Hallucinations, Functional Correctness Hallucinations, and Code Quality Hallucinations. In this section, we discuss each primary category with detailed sub-categories.
5.1 Syntactic Hallucinations

These refer to errors that deviate from the syntax of the target language, rendering the code unparsable and thus unable to be compiled or interpreted [2, 10, 15, 52, 57]. Syntactic hallucinations can be further classified into two sub-categories: “Syntax Violations” and “Incomplete Code Generation”.
5.1.1 Syntax Violations. These occur when a CodeLLM generates code that violates the syntax of the programming language, leading to compile-time errors [1, 10, 57]. Three research papers include a specific taxonomy of the kinds of syntax violations [1, 10, 57]. One paper [1] classifies syntax-related errors in generated code under the term Syntactic Incorrectness. Two papers classify syntax violations further and provide more specific terms such as Incorrect Indentation, Conditional Error, Loop Error, Return Error, and Assignment Error [10, 57].

Fig. 3. Taxonomy of hallucinations possibly generated by CodeLLMs.
• Syntactic Hallucinations: Syntax Violation [1, 10, 57]; Incomplete Code Generation [15, 52]
• Runtime Execution Hallucinations: API Knowledge Conflict [10, 34, 65]; Invalid Reference Errors [10, 15, 34, 53, 57]
• Functional Correctness Hallucinations: Incorrect Logical Flow [10, 15, 34, 52, 53, 57]; Requirement Deviation [34, 52, 57, 65]
• Code Quality Hallucinations: Resource Mishandling [53, 65]; Security Vulnerability [42, 65]; Code Smell [34, 52, 57, 65]
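Violations like these can be flagged before any execution by running the generated snippet through the target language's own parser. A minimal sketch in Python using the built-in compile() function; the hallucinated snippet below is invented for illustration:

```python
# Flag syntactic hallucinations by parsing generated code with Python's own
# compiler front end; compile() raises SyntaxError on any syntax violation.
def has_syntax_violation(code: str) -> bool:
    try:
        compile(code, "<generated>", "exec")
        return False
    except SyntaxError:
        return True

# Hypothetical hallucinated snippet with a malformed conditional.
hallucinated = "if x > 0\n    print(x)\n"   # missing ':' after the condition
well_formed = "if x > 0:\n    print(x)\n"

print(has_syntax_violation(hallucinated))  # True
print(has_syntax_violation(well_formed))   # False
```

The same idea generalizes to compiled languages by invoking the compiler's parse-only mode on the generated file.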
5.1.2 Incomplete Code Generation. This occurs when a CodeLLM stops generating code midway or omits entire code blocks [15, 52]. Like syntax violations, incomplete code generation prevents the code from being compiled or executed.
5.2 Runtime Execution Hallucinations

These occur when CodeLLMs generate code that is syntactically valid but produces runtime errors, such as exceptions or crashes, during execution [10, 15, 34, 52, 53, 57, 65]. Although syntactic correctness is a necessary condition for code execution, it does not guarantee that the code will function as intended or even run without errors. These hallucinations manifest only when the code is actually run and may depend on specific inputs or external factors. Unlike syntactic hallucinations, they do not break the syntax, but they cause the program to crash or behave unexpectedly.
5.2.1 API Knowledge Conflict. This occurs when CodeLLMs misuse libraries or APIs, leading to issues such as missing imports or incorrect or extra parameters [10, 34, 65].
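When the misused API is importable, the parameter-mismatch form of this hallucination can be caught without running anything by binding the generated call's arguments against the real signature. A minimal sketch; the resize function here is a hypothetical stand-in for a real library API:

```python
import inspect

def call_matches_api(fn, *args, **kwargs) -> bool:
    """Check whether a call site agrees with fn's real signature,
    without executing fn; Signature.bind raises TypeError on
    missing or extra parameters."""
    try:
        inspect.signature(fn).bind(*args, **kwargs)
        return True
    except TypeError:
        return False

# Hypothetical stand-in for a library API the model might misremember.
def resize(image, width, height):
    return (image, width, height)

print(call_matches_api(resize, "img", 640, 480))              # True
print(call_matches_api(resize, "img", 640))                   # False: missing 'height'
print(call_matches_api(resize, "img", 640, 480, mode="fit"))  # False: extra parameter
```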
5.2.2 Invalid Reference Errors. These arise when CodeLLMs produce code that attempts to access or manipulate program elements that are not defined in the code [10, 15, 34, 53, 57]. This can manifest as using variables that have not been declared or attempting to access non-existent members of an object.
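A rough static check for the undeclared-variable form walks the AST and compares the names a snippet reads against those it binds. The sketch below is deliberately simplistic (it ignores nested scoping rules, so it may misreport on real-world code), and the snippet it analyzes is invented:

```python
import ast
import builtins

def undefined_names(code: str) -> set:
    """Roughly flag names that are read but never bound.
    Illustrative only: nested scopes are not modeled."""
    tree = ast.parse(code)
    bound = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)                       # assignments, loop targets
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            bound.add(node.name)                     # definitions
        elif isinstance(node, ast.arg):
            bound.add(node.arg)                      # function parameters
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
    loads = {n.id for n in ast.walk(tree)
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return loads - bound

# Hypothetical hallucinated snippet: 'totals' is referenced but never defined.
snippet = "counts = [1, 2, 3]\nprint(sum(counts) + totals)\n"
print(undefined_names(snippet))  # {'totals'}
```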
5.3 Functional Correctness Hallucinations

These arise when CodeLLMs generate code that executes but does not satisfy the functional requirements of the program; they are further categorized as Incorrect Logical Flow and Requirement Deviation [10, 15, 34, 52, 53, 57, 65]. Even when a program is syntactically correct and free from runtime errors, there is no guarantee that it performs its intended task.
5.3.1 Incorrect Logical Flow. This arises when a CodeLLM generates code that contains flaws in its implementation of algorithms and reasoning [10, 15, 34, 52, 53, 57]. These hallucinations often lead to an incorrect solution. This category encompasses flaws such as missing corner cases, incorrect conditional statements, and incorrect arithmetic operations.
5.3.2 Requirement Deviation. These arise when CodeLLMs produce code that deviates from the explicit requirements and functionalities outlined in the prompt or problem description [34, 52, 57, 65]. Given the diverse situations in which requirement deviation occurs, taxonomies often categorize these errors under broad terms: this category encompasses terms such as semantic conflicting hallucinations [34] and functional requirement violations [65], while one paper [57] mentions usage of an incorrect function that does not match the requirements.
5.4 Code Quality Hallucinations

These occur when CodeLLMs generate code that introduces risks related to resource management, security vulnerabilities, or performance degradation [34, 42, 52, 53, 57, 65]. These hallucinations often compromise the stability, security, and efficiency of the overall system. We categorize these issues into three distinct subcategories: Resource Mishandling, Security Vulnerability, and Code Smell.
Table 1. Comparative Analysis of Code Hallucination Benchmarks.

• CodeHaluEval [53]. Language: Python. Number of tasks: 699. Data reference: APPS. Content: not mentioned. Purpose: compare various types and frequencies of hallucinations in code generation across different LLMs. Construction: generated code using the APPS dataset and applied the CodeHalu algorithm to identify the types of hallucinations present and their respective frequencies.
• CodeMirage [1]. Language: Python. Number of tasks: 1,137. Data reference: HumanEval, MBPP. Content: problems, hallucinated code snippets, ground-truth code snippets, test cases. Purpose: experiment with and measure LLM capabilities for automatically detecting code hallucinations using one-shot prompts. Construction: designed explicit prompts for each of the hallucination types and input them into GPT-3.5 to obtain Python code generations that exhibit specific hallucination types.
• LMDefects [15]. Language: Java. Number of tasks: 113 (easy: 60, medium: 53). Data reference: LeetCode. Content: problem descriptions, code snippets, public test cases. Purpose: evaluate the precision of Codex-generated code and assess the feasibility of applying automated program repair (APR) techniques. Construction: collected public problems from LeetCode not included in Codex training; included a diverse range of Java tasks for analysis.
• EvalPlus [35]. Language: Python. Number of tasks: 164. Data reference: HumanEval. Content: programming tasks, function signatures, and docstrings. Purpose: reveal the real correctness of LLM-synthesized code. Construction: extended the HumanEval dataset by adding type-aware mutations and generating an average of 764.1 test cases per problem to evaluate hallucinations.
• CodeContests [30]. Languages: C++, Java, Python, etc. Number of tasks: 13,328 (training), 117 (validation), 165 (test). Data reference: Codeforces, CodeChef, etc. Content: problems, correct and incorrect human submissions, test cases. Purpose: train, validate, and evaluate AlphaCode. Construction: leveraged private and public code competition problems; test cases were expanded through mutation methods.
• MultiPL-E [8]. Languages: 18 languages. Number of tasks: similar to HumanEval, MBPP. Data reference: HumanEval, MBPP. Content: not mentioned. Purpose: propose the first massively parallel, multi-language benchmark for code generation. Construction: created a multi-language benchmark by converting Python-based NL2Code benchmarks into 18 programming languages.
• HalluCode [34]. Language: Python. Number of tasks: 5,663. Data reference: CodeAlpaca. Content: objectives, hallucination categories, task descriptions. Purpose: evaluate the performance of CodeLLMs in recognizing hallucinations. Construction: focused on task description evaluation and detecting hallucinations specific to programming contexts.
5.4.1 Resource Mishandling. These errors arise when CodeLLMs produce code that improperly manages a system’s resources, leading to excessive consumption or inefficient allocation of memory that can eventually lead to code failure [53, 65]. Hallucinations like these occur when CodeLLMs write code that includes data processing operations that fail because memory capacity is exceeded, or that suffers numerical overflow by exceeding the limits of numerical calculations [53]. In addition, Zhang et al. [65] mention non-functional requirements related to suboptimal performance, such as inefficient loop structures.
5.4.2 Security Vulnerability. This arises when CodeLLMs produce code that introduces security weaknesses that make the system susceptible to attacks or unauthorized access [42, 65]. While only two papers provide a taxonomy that can be categorized under security vulnerabilities, Pearce et al. [42] give a detailed analysis of various security vulnerabilities in generated code. Among the many kinds of security vulnerabilities are improper input validation, use-after-free errors, and null-pointer dereference errors.
5.4.3 Code Smell. These occur when CodeLLMs produce code with low maintainability due to extraneous or unnecessary code [34, 52, 57, 65]. Although these hallucinations are not critical from a security or performance standpoint, avoiding them is crucial for the maintainability and readability of the code that human developers use. These issues include dead code, garbage code, or incomplete generation [34, 52, 57]. Such issues are sometimes called “non-functional requirement violations”, as the affected code often contains a part that is unreachable, performs useless assignments, only contains comments, or has empty function bodies [65].
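One narrow instance of the dead-code smell, statements that trail a return in the same block, can be detected with a few lines of AST analysis. A minimal illustrative sketch (real smell detectors cover far more patterns); the analyzed function is invented:

```python
import ast

def statements_after_return(code: str) -> list:
    """Report line numbers of statements that directly follow a 'return'
    in the same block: one concrete form of the dead-code smell.
    Illustrative only; branches like try/else bodies are not inspected."""
    dead = []
    for node in ast.walk(ast.parse(code)):
        body = getattr(node, "body", None)
        if not isinstance(body, list):
            continue
        seen_return = False
        for stmt in body:
            if seen_return:
                dead.append(stmt.lineno)
            if isinstance(stmt, ast.Return):
                seen_return = True
    return dead

# Hypothetical generated function with unreachable trailing code.
snippet = (
    "def mean(xs):\n"
    "    return sum(xs) / len(xs)\n"
    "    result = 0\n"      # unreachable
    "    print(result)\n"   # unreachable
)
print(statements_after_return(snippet))  # [3, 4]
```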
6 Benchmarks and Metrics to Evaluate Hallucinations by CodeLLMs

6.1 Benchmarks

The growing interest in addressing hallucinations in LLM-generated code has led to the development of various benchmarks. Standard benchmarks are necessary to analyze the hallucination tendencies of various CodeLLMs and to evaluate hallucination detection and mitigation techniques. Table 1 shows recent benchmarks related to code hallucination and summarizes their distinct features. Existing benchmarks to evaluate hallucinations by CodeLLMs have limitations, such as a lack of language diversity and a failure to reflect real-world workloads.
Many of these benchmarks build on existing LLM code generation benchmarks, extending them to overcome those limitations. EvalPlus, proposed by Liu et al. [35], extends an existing benchmark, HumanEval, to address its specific limitations: HumanEval contains vague task descriptions and an insufficient number of test cases per task, and some solutions labeled as correct in HumanEval were found to be erroneous. EvalPlus addresses these limitations by increasing the average number of test cases per task to 764.1, leveraging LLMs for seed input generation and employing type-aware mutation for fuzz testing. CodeMirage [1] assesses the ability of LLMs to detect hallucinations in input code. CodeMirage was generated from the HumanEval and MBPP datasets, with artificial hallucinations inserted into the code using the GPT-3.5 model.
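At a high level, type-aware mutation perturbs seed inputs while preserving their types, so the mutated inputs remain type-correct for the task; the exact operators are defined in [35]. The toy mutator below is a simplified stand-in for the idea, not EvalPlus's actual implementation:

```python
import random

def mutate(value):
    """Toy type-aware mutation: perturb a value while keeping its type,
    loosely in the spirit of EvalPlus's input mutation (simplified)."""
    if isinstance(value, bool):          # check bool before int (bool is an int subclass)
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1, value])
    if isinstance(value, float):
        return value * random.choice([0.5, 2.0]) + random.choice([-1.0, 1.0])
    if isinstance(value, str):
        i = random.randrange(len(value) + 1)
        return value[:i] + random.choice("abcXYZ") + value[i:]
    if isinstance(value, list):
        return [mutate(v) for v in value]
    return value

random.seed(0)
seed_input = [3, "hi", [1.0, True]]
print(mutate(seed_input))
```

Running candidate code on many such mutants, and differentially comparing its outputs against a reference solution, is what exposes hallucinations that a handful of hand-written tests miss.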
Among the seven benchmarks we inspected, five support only one programming language, and four of them (CodeHaluEval, CodeMirage, EvalPlus, and HalluCode) specifically target Python coding tasks. This distribution reflects the frequent use of Python in scenarios where LLMs generate code. In contrast, Fan et al. [15] propose LMDefects, a Java-focused benchmark that evaluates the correctness of code generated by Codex and explores the applicability of automated program repair (APR) techniques to hallucinated code. LMDefects is based on easy- and medium-level problems from the LeetCode platform and incorporates public test cases provided by the platform.
Unlike the aforementioned benchmarks, MultiPL-E and CodeContests contain code generation tasks in diverse programming languages. To rigorously compare models, it is essential to evaluate their ability to generate code in languages beyond Python; multi-language benchmarks have been developed for this purpose, as CodeLLMs are typically designed to handle multiple programming languages. Cassano et al. [8] introduced MultiPL-E, a benchmark that translates Python problems from the HumanEval and MBPP datasets into 18 different programming languages. The benchmark uses 18 custom compilers to translate code snippets, test cases, and other components originally designed for Python into other languages, allowing a comparative analysis of LLM performance across languages. These compilers are also extendable to support additional languages in the future.
CodeContests, proposed by Li et al. [30], includes programming challenges from platforms such as Codeforces and CodeChef to train, validate, and evaluate the AlphaCode model. This dataset supports multiple programming languages, such as C++, Java, and Python, enabling broader applicability.
6.2 Metrics

To compare and analyze model performance on benchmark datasets in line with their research goals, studies adopt different evaluation metrics. Selecting appropriate metrics is essential to accurately assess the specific aspects of a model that a study aims to target. This section examines the evaluation metrics used in the papers addressed in Section 6.1. Table 2 summarizes the metrics used in various studies to compare the performance of models with respect to code hallucination. We have grouped the metrics as follows: Functional Correctness, Hallucination Detection, Hallucination Type Classification, and Hallucination Mitigation metrics.
Table 2. Comparative Analysis of Code Hallucination Metrics.

Functional Correctness:
• Pass@k: Evaluates the correctness of code generated by a CodeLLM. It measures the likelihood that a CodeLLM generates functionally correct code for a given task. [8, 15, 35]
• 10@k: Evaluates a CodeLLM’s ability to generate correct code, specifically the ability to produce multiple correct solutions for a single task. [30]

Hallucination Detection:
• Hallucination Rate (HR): Reflects the hallucination phenomenon in LLMs during code generation tasks through actual execution tests. [53]
• Valid Rate (VR): Reflects the percentage of valid code outputs by an LLM. [34]
• Accuracy of Hallucination Existence Recognition (ACCrec): Reflects the percentage of correctly identified occurrences of hallucinations. [34]

Hallucination Classification:
• Accuracy of Hallucination Type Recognition (ACCtype(i)): Reflects the percentage of accurately identified hallucination types. Liu et al. proposed five types of hallucinations. [34]
• Accuracy, Macro-precision, Macro-recall, and Macro-F1: Standard metrics used to evaluate multi-class classification performance, where classes represent different hallucination types. [1]

Hallucination Mitigation:
• Accuracy of Hallucination Mitigation (ACCmit): Reflects the percentage of modified hallucinated code that is semantically correct. [34]

6.2.1 Functional Correctness. This category focuses on evaluating how well the generated code satisfies its intended requirements. The most common metric, Pass@k, measures the frequency with which at least one of the k generated solutions passes all test cases. Pass@10, a popular variation, represents the fraction of tasks in which at least one of the 10 generated solutions is correct. In contrast, 10@k measures the percentage of tasks solved when 10 candidate solutions are selected from k generated samples per task. Pass@k and 10@k consider hallucinations in generated code to be any error that prevents the generated code from passing all test cases.
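In practice, Pass@k is usually computed with the unbiased estimator introduced alongside HumanEval [9]: generate n >= k samples per task, count the c samples that pass all tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations pass all test cases."""
    if n - c < k:       # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 200 samples generated, 12 of them pass all tests.
print(round(pass_at_k(200, 12, 1), 4))   # 0.06
print(round(pass_at_k(200, 0, 10), 4))   # 0.0
```

Averaging pass_at_k over all tasks in a benchmark yields the reported Pass@k score.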
6.2.2 Hallucination Detection. This category quantifies the presence of hallucinations within the generated code. Hallucination Rate (HR), Valid Rate (VR), and Accuracy of Hallucination Existence Recognition (ACCrec) fall under this type of metric [34, 53]. HR, as proposed by Tian et al. [53], measures the proportion of generated code samples that are syntactically valid but fail to execute as expected under their CodeHalu algorithm. VR measures the proportion of generated outputs that are syntactically valid and executable [34]; thus, a lower VR can suggest that hallucinations are interfering with the code’s ability to run. ACCrec, used in tandem with VR, focuses on how accurately a model identifies valid code outputs that also contain hallucinations.
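A VR-style measurement can be sketched by compiling and executing each generated sample and counting those that run cleanly. Real benchmarks execute samples in a sandbox against test harnesses; the toy version below only illustrates the idea, and all the snippets in it are invented:

```python
def valid_rate(samples: list) -> float:
    """Toy VR-style metric: fraction of generated samples that are
    syntactically valid and run without raising. Illustrative only;
    production setups sandbox execution and apply time/memory limits."""
    valid = 0
    for code in samples:
        try:
            # Fresh namespace per sample so snippets cannot interfere.
            exec(compile(code, "<sample>", "exec"), {})
            valid += 1
        except Exception:
            pass
    return valid / len(samples)

samples = [
    "x = 1 + 1",                        # valid and runs
    "y = undefined_var",                # NameError at runtime
    "def broken(:",                     # SyntaxError
    "z = [i * i for i in range(3)]",    # valid and runs
]
print(valid_rate(samples))  # 0.5
```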
6.2.3 Hallucination Type Classification. This category assesses a CodeLLM’s ability to recognize and classify hallucinations. In contrast to detection, type classification aims to categorize given hallucinated code into one of the predefined hallucination types. The metrics used are Accuracy of Hallucination Type Recognition, ACCtype(i) [34], and traditional multi-class classification metrics [1]. ACCtype(i) assesses the precision of the model in categorizing the type of hallucination present in valid code. Agarwal et al. [1] used accuracy, macro-precision, macro-recall, and macro-F1 to measure how well hallucinations were detected and classified according to their hallucination types. In this context, accuracy refers to the percentage of hallucinations whose model-assigned categories match the actual categories.
6.2.4 Hallucination Mitigation. This category measures the ability to successfully fix hallucinated code. Accuracy of Hallucination Mitigation, ACCmit [34], shows the percentage of recognized hallucinations that are successfully alleviated by CodeLLMs.
7 Causes of Hallucinations in Code Generation

We investigate the causes of hallucinations by CodeLLMs and classify them into three main categories: Training Data Issues, Trained Model Issues, and Prompt Issues. Fig. 4 presents a hierarchical cause-analysis tree for code hallucinations generated by CodeLLMs, breaking the primary causes down into more specific factors.
Causes
  Training Data Issues
    Lack of Quantity of Training Datasets
      Lack of Diverse Training Sets [22]
    Low Quality of Training Datasets
      Training on Flawed or Vulnerable Data [49] [50] [53] [65]
      Outdated or Incomplete Public API Knowledge [63] [2] [20] [65]
  Trained Model Issues
    Inadequate/Inappropriate Evaluations and Benchmarks
      Lack of Benchmarks for Real-World SWE Tasks [49] [50] [22]
      Lack of Evaluation Metrics for Real-World SWE Tasks [49] [50] [22]
    Reasoning & Understanding Deficiencies
      Limited Context Handling & Scope
        Mishandling of Unseen Private API [62]
        Low Repository & Cross-file Contextual Understanding [55] [22] [2] [32] [48] [10] [12]
      Low Syntactical & Structural Understanding [22] [53] [15]
      Insufficient Dependency Parsing [26]
    Temperature-related Non-Determinism [51]
    Token Length Limitation [10] [32]
    CodeLLMs' Lack of Requirement Clarification Mechanism [39]
  Prompt Issues
    Ambiguous Prompts
      Ambiguous Nature of Natural Languages [13]
      Ambiguous Requirements in Prompts [39]
    Irrelevant Context in Prompt [32]

Fig. 4. Potential causes of hallucinations by CodeLLMs.
7.1 Training Data Issues

One of the primary causes arises from issues in the quality and quantity of the training data. These can be categorized as follows: a lack of diverse training sets, training on flawed or vulnerable data, and outdated or incomplete public API knowledge. The limited diversity of training data restricts a CodeLLM's ability to generalize across various programming tasks. Jain et al. [20] highlight that the breadth and quality of the training dataset are crucial for correct code generation. In addition, CodeLLMs often produce code hallucinations because they are trained on public repositories that frequently contain deprecated or incomplete API documentation, leading the generated code to invoke non-existent APIs or misuse existing ones [65]. Training on flawed or vulnerable data from open-source projects compounds the issue, as these CodeLLMs propagate security vulnerabilities and inefficient implementations to the generated code [50].
7.2 Trained Model Issues

Major causes of code hallucinations stem from issues with the trained models themselves: inadequate or inappropriate evaluations and benchmarks; reasoning and understanding deficiencies; temperature-related non-determinism; model input handling; token generation inefficiency; and CodeLLMs' lack of a requirement clarification mechanism.
One contributor to code hallucinations is the use of inadequate evaluation benchmarks that fail to capture real-world software engineering tasks. Current evaluation metrics and benchmarks often do not accurately represent such tasks: CodeLLMs are frequently evaluated using benchmarks that lack the constructs necessary to assess the security of generated code [49, 50]. The need for comprehensive benchmarks and metrics that evaluate a broader spectrum of coding skills remains pressing [22].
Another crucial aspect contributing to code hallucinations is the inherent reasoning and understanding deficiencies of the trained models. One common deficiency is a CodeLLM's limited ability to handle code context. As LLMs receive larger code contexts, they often mishandle unseen private APIs and struggle to understand code across files and entire repositories [1, 10, 12, 32, 48, 56]. LLMs lack prior knowledge about private libraries and struggle to leverage external, pre-existing knowledge unless they are augmented with retrieval-based generation techniques [62]. This lack of context is evident when generating functions with repository-level dependencies [56].
The limited capacity of CodeLLMs to grasp the underlying structure and syntax of programming languages further highlights these deficiencies [15, 22, 53]. Transformer-based LLM architectures, now the norm, may not be optimally designed to fully capture the inherent structure and syntax of programming languages [22]. CodeLLMs' reliance on pattern matching and statistical rules to generate code results in a lack of fundamental understanding of symbolic systems [53]. Because code is treated as a mere series of tokens, language models lack awareness of program semantics and generate incorrect programs [15].
The non-deterministic nature of CodeLLMs, controlled by temperature settings and decoding strategies, is an inherent issue of the trained model. The temperature parameter governs the randomness of the generated responses: lower temperatures yield more predictable and deterministic outputs, while higher temperatures increase creativity and diversity [51]. While higher temperatures benefit creative code generation, they also increase the code hallucination rate [51].
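The temperature mechanism described here is standard temperature-scaled softmax sampling, which can be sketched as follows; the logits are made up for illustration, and this is not any particular model's decoder.

```python
import math
import random

def sample_token(logits, temperature):
    """Sample a token index from a temperature-scaled softmax distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]

# Hypothetical next-token logits over four candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5]
random.seed(0)  # fixed seed so the sketch is reproducible
low_t  = [sample_token(logits, 0.2) for _ in range(1000)]  # near-deterministic
high_t = [sample_token(logits, 2.0) for _ in range(1000)]  # far more diverse
distinct_low, distinct_high = len(set(low_t)), len(set(high_t))
```

At temperature 0.2 the top-scoring token dominates almost every draw, while at temperature 2.0 the flattened distribution makes all four candidates likely to appear, which is the diversity-versus-determinism trade-off the hallucination-rate observation [51] rests on.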
Another aspect contributing to code hallucinations arises from limitations in how trained models handle input tokens. CodeLLMs have an input token length limit, which impacts their ability to retain all problem details [10]. This makes it impossible to feed entire code repositories to a CodeLLM to effectively generate code [32].
The limited ability of existing CodeLLMs to handle ambiguous requirements can be another source of code hallucinations. Current CodeLLMs often lack a mechanism for clarifying unclear or incomplete instructions, which can yield hallucinated code that does not satisfy the user's requirements [39].
7.3 Prompt Issues

The third major cause of code hallucinations is the prompt itself. Two contributing factors are the ambiguity of the prompt and insufficient or irrelevant context within it. A significant challenge originates from the inherent ambiguity of natural language: prompts written in natural language tend not to capture the user's intent in a fully nuanced and accurate manner, making it difficult to generate correct code from them [13, 39]. Furthermore, code hallucinations can arise from contextual deficiencies in the prompt; providing insufficient context or including irrelevant details can hinder the CodeLLM's ability to generate accurate and satisfactory code [32].
8 Hallucination Mitigation Methods

Various approaches to mitigating hallucinations are being actively explored. Among these, five approaches were selected for comparative analysis. The following sections describe, for each approach, the specific hallucination types it targets, the root causes it addresses, and its strengths and limitations.
8.1 De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding

Eghbali and Pradel [12] identify two main challenges in this setting. The first is that LLMs lack knowledge of project-specific APIs and may fail to correctly use existing functions and classes. To investigate this issue, they selected five functions from each of ten open-source projects to create a code completion task; experimental results showed that 44% of the generated code contained at least one instance of incorrect API usage. Reducing such issues would require providing the entire project code as input, which is practically impossible due to context-length constraints, so selecting only the essential code snippets to include becomes critical. The second challenge therefore lies in accurately identifying how important each piece of code is for this purpose. To address these challenges, they proposed De-Hallucinator, which iteratively retrieves relevant APIs to improve the prompts.

De-Hallucinator [12] pre-analyzes and indexes all source code within the project. When a code generation prompt is provided, it selects the most relevant APIs based on the input and creates a Retrieval-Augmented Generation (RAG) prompt that includes these APIs. Alternatively, it generates an iterative prompt that incorporates the APIs most relevant to the code produced by the initial prompt. These prompts are then used as inputs for code generation. This approach has the advantage of not requiring modifications to the internal structure of the LLM. However, it relies on the project containing well-documented, detailed API descriptions.
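The iterative grounding loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `generate` function is a hypothetical stand-in for an LLM call, and plain token overlap stands in for the actual retrieval scoring.

```python
def relevance(snippet, text):
    """Score a snippet by token overlap with the query text (simplified)."""
    return len(set(snippet.lower().split()) & set(text.lower().split()))

def retrieve_apis(index, text, k=2):
    """Return the k indexed API signatures most relevant to `text`."""
    return sorted(index, key=lambda s: relevance(s, text), reverse=True)[:k]

def generate(prompt):
    """Hypothetical stand-in for a call to the code LLM."""
    return "result = parse_config(path)\n"

def dehallucinate(index, task, rounds=2):
    """Iterative grounding: re-prompt with APIs relevant to the last draft."""
    draft = generate(task)
    for _ in range(rounds):
        # Retrieve against both the task and the previous draft, so APIs
        # referenced (or hallucinated) in the draft pull in real signatures.
        apis = retrieve_apis(index, task + "\n" + draft)
        prompt = "# Relevant project APIs:\n" + "\n".join(apis) + "\n" + task
        draft = generate(prompt)
    return draft

# Hypothetical pre-built index of project API signatures.
index = ["def parse_config(path)", "def save_report(data)", "def load_user(uid)"]
code = dehallucinate(index, "Read the config file and return the settings.")
```

The key design point is that retrieval is re-run against each successive draft, so the prompt converges toward the project APIs the model is actually trying to use.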
8.2 Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues

Liu et al. [37] proposed a hallucination mitigation method that leverages ChatGPT's self-revision capabilities. The approach aims to address all code quality issues in LLM-generated code, including execution errors, incorrect outputs, and maintainability problems. The method provides two types of feedback to the LLM immediately after code generation:

• Simple feedback: informing the model that the generated code contains quality issues, without specifying details.
• Feedback with static analysis: providing more detailed information, such as static analysis results and runtime error messages for the generated code.

The study found that these feedback methods enabled ChatGPT to self-revise 20–60% of the generated code, and that iterative feedback led to a gradual improvement in code quality over time.

This approach has the advantage of generalizing to scenarios where developers use LLMs for code generation, effectively demonstrating its mitigation performance. However, it requires developers to craft prompts manually and to have a basic understanding of static analysis tools and error messages.
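The two feedback modes drive an iterative revision loop that can be sketched as follows; `ask_llm` and `run_static_analysis` are hypothetical stand-ins for the model call and the analyzer, not the authors' tooling.

```python
def ask_llm(prompt):
    """Hypothetical LLM call; returns revised code for the prompt."""
    return "def add(a, b):\n    return a + b\n"

def run_static_analysis(code):
    """Hypothetical analyzer; returns a list of issue messages."""
    return [] if "return" in code else ["warning: function returns nothing"]

def revise(code, detailed=True, max_rounds=3):
    """Iteratively feed quality issues back to the model, in the spirit of [37]."""
    for _ in range(max_rounds):
        issues = run_static_analysis(code)
        if not issues:
            return code
        if detailed:
            # Feedback with static analysis: include the concrete findings.
            feedback = "Fix these issues:\n" + "\n".join(issues)
        else:
            # Simple feedback: only say that quality issues exist.
            feedback = "The code above has quality issues. Please revise it."
        code = ask_llm(feedback + "\n" + code)
    return code

fixed = revise("def add(a, b):\n    print(a + b)\n", detailed=True)
```

The loop terminates as soon as the analyzer reports no issues, mirroring the finding that iterative rounds of feedback gradually improve code quality.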
8.3 SynCode: LLM Generation with Grammar Augmentation

Ugare et al. [54] focus on syntax-violation hallucinations. Grammar-guided generation has recently been widely proposed [16, 43, 47, 58] to ensure that LLM-generated code adheres strictly to predefined grammatical rules [54]. These methods modify the LLM's decoding algorithm so that the model consistently selects tokens conforming to a specific formal language. However, the model's tokens are fixed during training, which often leads to token misalignment: the model's tokens do not match the terminals of the specified grammar. This misalignment is a significant factor contributing to the high error rates observed in grammar-guided generation. To address this issue, the SynCode algorithm leverages the EBNF (Extended Backus-Naur Form) representation of a context-free grammar to guide the LLM during decoding, ensuring that the model produces grammatically correct output throughout the generation process. The advantage of this approach is its versatility: it can be applied to any LLM decoding algorithm and supports all programming languages.
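The core mechanism of grammar-guided decoding, masking the vocabulary so that only tokens which keep the output a valid prefix of the grammar survive, can be sketched as below. This is a toy illustration: a hand-written prefix check over a tiny assignment language stands in for SynCode's EBNF-driven machinery, and the vocabulary is hypothetical.

```python
import re

# Hypothetical token vocabulary of a tiny model.
VOCAB = ["x", "=", "1", "+", ";", "!!"]

# Valid complete programs have the form x=1(+1)*;  The regex below accepts
# every prefix of such a program (a stand-in for consulting an EBNF grammar).
PREFIX = re.compile(r"(x(=(1(\+1)*(\+|;)?)?)?)?")

def allowed_tokens(prefix):
    """Mask the vocabulary: keep only tokens that extend a valid prefix."""
    return [t for t in VOCAB if PREFIX.fullmatch(prefix + t)]
```

At each decoding step the model's next-token distribution would be restricted to this mask: `allowed_tokens("")` yields `['x']`, `allowed_tokens("x=1")` yields `['+', ';']`, and the ungrammatical token `!!` is never reachable, so syntax-violation hallucinations are ruled out by construction.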
8.4 ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification

Mu et al. [39] proposed a method to mitigate hallucinations caused by ambiguous prompts. Generating correct code requires a clear understanding of the user's requirements, but the necessary information is not always fully included in the LLM's prompt. In real-world scenarios, developers often resolve ambiguous requirements by asking clarifying questions to gather additional information. Inspired by this practice, the authors introduced a framework in which the LLM generates clarifying questions to help users refine their prompts.

The core challenges of this approach lie in determining when to ask questions and what questions to ask. To address the first challenge, they implemented a code consistency check: test inputs are generated from the user's prompt, the LLM is asked to produce n code solutions aligned with the prompt, the solutions are executed on the test inputs, and the resulting outputs are compared. If the similarity among outputs is low, a clarifying question is deemed necessary. This is based on the intuition that a better understanding of the requirements should yield more consistent code solutions.

For the second challenge, they employed reasoning-based prompts that help the LLM identify the elements of the prompt causing ambiguity and generate targeted clarifying questions. The reasoning-based prompt includes instructions for clarifying-question generation, few-shot examples, and the user's requirements alongside the generated code solutions.

The ClarifyGPT framework achieves its mitigation effect without requiring direct modifications to the model, and it aids developers who struggle to craft clear prompts. However, it incurs high overhead from test-input generation, code generation, and clarifying-question generation, and the few-shot examples for the question-generation prompt must be crafted manually.
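The code consistency check can be sketched as follows; `generate_solutions` is a hypothetical LLM stand-in that returns n candidate implementations as Python callables, simulating an ambiguous "round the number" requirement.

```python
def generate_solutions(prompt, n=3):
    """Hypothetical LLM stand-in: two candidates round half-up, one truncates,
    reflecting an ambiguous 'round the number' requirement."""
    return [lambda x: int(x + 0.5), lambda x: int(x + 0.5), lambda x: int(x)]

def needs_clarification(prompt, test_inputs, threshold=1.0):
    """Ask a clarifying question when the candidates' outputs disagree."""
    solutions = generate_solutions(prompt)
    agreement = 0.0
    for x in test_inputs:
        outputs = [s(x) for s in solutions]
        # Share of candidates agreeing with the majority answer on this input.
        agreement += max(outputs.count(o) for o in outputs) / len(outputs)
    return agreement / len(test_inputs) < threshold

ambiguous = needs_clarification("Round the number.", test_inputs=[0.5, 1.5, 2.2])
```

Because the truncating candidate disagrees with the other two on inputs like 0.5 and 1.5, the average agreement falls below the threshold and the framework would proceed to generate a clarifying question instead of returning code.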
8.5 LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

Zhang et al. [65] analyzed the types of LLM hallucinations in code generation and the potential factors that cause them, and, based on their findings, proposed a RAG-based mitigation method. The study identified three primary root causes of hallucinations in LLM-generated code: (1) incorrect or insufficient understanding of task requirements, (2) lack of factual knowledge relevant to the generation task, and (3) inability to access the necessary code and non-code resources from the repository. To mitigate these issues, the authors first created a retrieval corpus by scanning all source files from repositories in the CoderEval dataset and extracting consecutive lines of code. When a query is presented to the LLM, the system retrieves related code snippets from the corpus and appends the most relevant ones to the prompt.

This approach has several advantages: it requires no additional effort from users, it provides the model with only the information essential for code generation, and it supports handling project-specific APIs. However, its effectiveness is significantly influenced by the quality and quantity of the source code available for retrieval, and the retrieval process introduces overhead that can impact efficiency.

Despite these challenges, the RAG-based mitigation method demonstrated a modest reduction in hallucinations across six LLMs. This study serves as a pilot exploration of RAG-based mitigation, shedding light on its possible applications in reducing hallucinations in LLMs.
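The corpus construction and retrieval steps can be sketched with a simple token-set Jaccard retriever; this is a deliberate simplification of the paper's retriever, and the repository file is hypothetical.

```python
import re

def chunk(source, window=2):
    """Split a source file into overlapping snippets of consecutive lines."""
    lines = source.strip().splitlines()
    return ["\n".join(lines[i:i + window]) for i in range(len(lines) - window + 1)]

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(corpus, query, k=1):
    """Return the k corpus snippets most similar to the query."""
    return sorted(corpus, key=lambda s: jaccard(s, query), reverse=True)[:k]

# Hypothetical repository file scanned into the retrieval corpus.
repo_file = """def load_settings(path):
    return json.load(open(path))
def send_report(data):
    post(URL, data)"""
corpus = chunk(repo_file)

best = retrieve(corpus, "load the settings from a json path")[0]
prompt = "# Related repository code:\n" + best + "\n# Task: ..."
```

The retrieved snippet is simply prepended to the generation prompt, which is why the method needs no user effort but remains sensitive to how much relevant source code the corpus actually contains.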
9 Discussion and Conclusion

The findings in this paper suggest several promising directions for future research. First, the development of more diverse and representative benchmark datasets, encompassing various programming languages and use cases, is essential for evaluating LLMs in broader contexts. Second, advances in hallucination mitigation techniques, such as retrieval-augmented generation, clarifying-question frameworks, and grammar-guided decoding, indicate the potential of combining multiple approaches to enhance reliability. Third, the integration of LLMs into real-world software development workflows calls for adaptive techniques that can dynamically address context-specific hallucinations, improving practical usability. By synthesizing these insights, this study serves as a roadmap for advancing research and development in LLM code generation, ultimately contributing to the creation of more robust and trustworthy systems.
References
|
||
[1] Vibhor Agarwal, Yulong Pei, Salwa Alamir, and Xiaomo Liu. 2024. CodeMirage: Hallucinations in Code Generated by Large Language Models. doi:10.48550/arXiv.2408.08333 arXiv:2408.08333 [2] Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu Lahiri, and Sriram Rajamani. 2024. Monitor-guided decoding of code LMs with static analysis of repository context. Advances in Neural Information Processing Systems 36 (2024).
|
||
[3] Miltiadis Allamanis, Sheena Panthaplackel, and Pengcheng Yin. 2024. Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. arXiv preprint arXiv:2402.08699 (2024).
|
||
[4] Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html. [5] Anthropic. 2025. Claude 3.7 Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet. [6] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021). [7] Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, et al. 2023. Purple llama cyberseceval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724 (2023). [8] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. 2023. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering 49, 7 (July 2023), 3675–3691. doi:10.1109/TSE.2023.3267446 [9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021). [10] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. 2024. What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. doi:10.48550/arXiv.2407.06153 arXiv:2407.06153 [11] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 
2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861 (2023). [12] Aryaz Eghbali and Michael Pradel. 2024. De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding. doi:10.48550/arXiv.2401.01701 arXiv:2401.01701 [13] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering 50, 9 (Sept. 2024), 2254–2268. doi:10.1109/TSE.2024. 3428972 [14] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 31–53. doi:10.1109/ICSE-FoSE59343.2023.00008 [15] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of Programs from Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE ’23). IEEE Press, Melbourne, Victoria, Australia, 1469–1481. doi:10.1109/ICSE48619.2023.00128 [16] Georgi Gerganov et al. 2024. llama.cpp: Port of Facebook’s LLaMA model in C/C++. https://github.com/guidance-ai/guidance. [17] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021). [18] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 33, 8 (Dec. 2024), 220:1–220:79. doi:10.1145/3695988
|
||
|
||
|
||
14 Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, and Jaechang Nam
|
||
[19] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155 [20] Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. 2024. On Mitigating Code LLM Hallucinations with API Documentation. arXiv preprint arXiv:2407.09726 (2024).
|
||
[21] Kevin Jesse, Toufique Ahmed, Premkumar T Devanbu, and Emily Morgan. 2023. Large language models and simple, stupid bugs. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 563–575.
|
||
[22] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 (2024).
|
||
[23] Nan Jiang, Qi Li, Lin Tan, and Tianyi Zhang. 2024. Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code. arXiv preprint arXiv:2410.09997 (2024).
|
||
[24] Raphaël Khoury, Anderson R Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How secure is code generated by chatgpt?. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2445–2451.
|
||
[25] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319–18345. [26] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 21314–21328. [27] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
|
||
[28] Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. 2024. MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems. In Findings of the Association for Computational Linguistics: EMNLP 2024. 736–783. [29] Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852 (2023).
|
||
[30] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-Level Code Generation with AlphaCode. Science 378, 6624 (Dec. 2022), 1092–1097. doi:10.1126/science.abq1158 [31] Yifan Li, Ensheng Shi, Dewu Zheng, Kefeng Duan, Jiachi Chen, and Yanlin Wang. 2024. RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening. In Proceedings of the 15th Asia-Pacific Symposium on Internetware. 229–238.
|
||
[32] Dianshu Liao, Shidong Pan, Xiaoyu Sun, Xiaoxue Ren, Qing Huang, Zhenchang Xing, Huan Jin, and Qinying Li. 2024. A 3-CodGen: A RepositoryLevel Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware. IEEE Transactions on Software Engineering (2024).
|
||
[33] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. doi:10.48550/arXiv.2109.07958 arXiv:2109.07958 [cs]. [34] Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024. Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. doi:10.48550/arXiv.2404.00971 arXiv:2404.00971 [35] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, 21558–21572. [36] Mingwei Liu, Tianyong Yang, Yiling Lou, Xueying Du, Ying Wang, and Xin Peng. 2023. Codegen4libs: A two-stage approach for library-oriented code generation. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 434–445.
|
||
[37] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2024. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. ACM Trans. Softw. Eng. Methodol. 33, 5 (June 2024), 116:1–116:26. doi:10.1145/3643674 [38] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021). [39] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification. Proc. ACM Softw. Eng. 1, FSE (July 2024), 103:2332–103:2354. doi:10.1145/ 3660810 [40] Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. LEVER: Learning to Verify Language-toCode Generation with Execution. In Proceedings of the 40th International Conference on Machine Learning (ICML’23, Vol. 202). JMLR.org, Honolulu, Hawaii, USA, 26106–26128. [41] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. doi:10.48550/arXiv.2203.13474 arXiv:2203.13474 [cs]. [42] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768.
|
||
|
||
|
||
Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges 15
|
||
[43] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227 (2022). [44] Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, and Reynold Cheng. 2024. Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation. arXiv preprint arXiv:2405.15307 (2024). [45] Kia Rahmani, Mohammad Raza, Sumit Gulwani, Vu Le, Daniel Morris, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari. 2021. Multi-modal program inference: A marriage of pre-trained language models and component-based synthesis. Proceedings of the ACM on Programming Languages 5, OOPSLA (2021), 1–29. [46] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. doi:10.48550/arXiv.2308.12950 arXiv:2308.12950 [cs]. [47] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093 (2021).
|
||
[48] Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2023. Repository-level prompt generation for large language models of code. In International Conference on Machine Learning. PMLR, 31693–31715.
|
||
[49] Mohammed Latif Siddiq, Joanna Cecilia da Silva Santos, Sajith Devareddy, and Anna Muller. 2024. Sallm: Security assessment of generated code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops. 54–65.
|
||
[50] Mohammed Latif Siddiq and Joanna CS Santos. 2022. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. 29–33. [51] Joseph Spracklen, Raveen Wijewickrama, AHM Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. 2024. We have a package for you! a comprehensive analysis of package hallucinations by code generating llms. arXiv preprint arXiv:2406.10279 (2024). [52] Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, and Giuliano Antoniol. 2024. Bugs in Large Language Models Generated Code: An Empirical Study. doi:10.48550/arXiv.2403.08937 arXiv:2403.08937 [53] Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, and Dawn Song. 2024. CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification. doi:10.48550/arXiv.2405.00253 arXiv:2405.00253 [54] Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. SynCode: LLM Generation with Grammar Augmentation. doi:10.48550/arXiv.2403.01632 arXiv:2403.01632 [55] Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024. Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024). [56] Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2024. RLCoder: Reinforcement Learning for Repository-Level Code Completion. arXiv:2407.19487 [cs.SE] https://arxiv.org/abs/2407.19487 [57] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2024. Where Do Large Language Models Fail When Generating Code? arXiv preprint arXiv:2406.08731 (2024).
[58] Brandon T Willard and Rémi Louf. 2023. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702 (2023).
[59] Claes Wohlin. 2014. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE ’14). Association for Computing Machinery, New York, NY, USA, 1–10. doi:10.1145/2601248.2601268
[60] Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, et al. 2023. CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation. arXiv preprint arXiv:2311.08588 (2023).
[61] Chen Yang, Yan Liu, and Changqing Yin. 2021. Recent Advances in Intelligent Source Code Generation: A Survey on Natural Language Based Studies. Entropy 23, 9 (Sept. 2021), 1174. doi:10.3390/e23091174
[62] Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, and Yongji Wang. 2023. Private-library-oriented code generation with large language models. arXiv preprint arXiv:2307.15370 (2023).
[63] Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. 2023. ToolCoder: Teach code generation models to use API search tools. arXiv preprint arXiv:2305.04032 (2023).
[64] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2024. Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. doi:10.48550/arXiv.2311.07989 arXiv:2311.07989 [cs]
[65] Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, and Zibin Zheng. 2024. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. arXiv preprint arXiv:2409.20550 (2024).
[66] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684.
[67] Li Zhong and Zilong Wang. 2024. Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 19 (March 2024), 21841–21849. doi:10.1609/aaai.v38i19.30185