From Data to Code

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

A research resource on how training data quality issues propagate into LLM-generated code quality issues, and how they can be detected, mapped, and governed across the LLM lifecycle.

🌐 Documentation Website | 📄 arXiv

114 primary studies reviewed · 9 generated-code quality dimensions · 18 propagation mapping mechanisms

Overview

Modern code LLMs do not fail only at generation time. Their defects often reflect upstream problems in the data they were trained or fine-tuned on: vulnerable snippets, noisy text, duplicated samples, distribution gaps, privacy leakage, benchmark contamination, and other forms of low-quality training signal.

This repository accompanies the systematic literature review “Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code.” It organizes the reviewed evidence into taxonomies, propagation mappings, detection methods, and governance strategies for understanding the path from flawed data to flawed generated code.

📖 Abstract

Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation.

Fig. 1. Overview of the paper collection and filtering process.

Fig. 2. Conceptual Framework of Quality Issues and Mitigation in the LLM Lifecycle.

📢 News

[2026-05] 📝 Our paper is now available on arXiv!
[2026-04] 🚀 Official documentation website is now live: From-Data-to-Code Website
[2026-04] 🚀 The From-Data-to-Code repository is officially launched.

📚 Findings

💻 RQ1: Generated Code Quality Issues

We discard vague concepts like generic "code hallucination" and establish a unified taxonomy encompassing 9 core dimensions of quality issues in LLM-generated code:

Correctness: Functional accuracy and executability, categorized into syntax errors, logical flaws, and API misuse.
Security: Resilience against malicious exploitation, categorized into inherent design flaws and external vulnerabilities.
Compliance: Adherence to legal, ethical, and safety standards, categorized into copyright infringement, privacy leakage, and malicious code generation.
Robustness: Ability to handle abnormal inputs gracefully, manifesting as inadequate error handling and boundary condition failures.
Maintainability: Ease of long-term code modification, categorized into disorganized structure and low reusability.
Understandability: Human-readability and clarity, manifesting as poor naming conventions and lack of documentation.
Efficiency: Optimal system resource utilization, categorized into suboptimal time complexity and improper memory management.
Parsimony of Output: Conciseness of generated results, manifesting as redundant logic, useless loops, and extreme verbosity.
Miscellaneous: Anomalies outside the other eight dimensions, primarily manifesting as instruction-following failures.

Fig. 3. Taxonomy of Generated Code Quality Issues

📄 Papers Referenced in this Section:

LLMs Meet Library Evolution: LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion [Jun-24] [paper]
Copilot Security: Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code? [Apr-22] [paper]
Copilot Evaluation: An Empirical Evaluation of GitHub Copilot’s Code Suggestions [Jan-25] [paper]
HalluCode: Exploring and Evaluating Hallucinations in LLM-Powered Code Generation [Apr-24] [paper]
CodeHalu: CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [May-24] [paper]
EffiBench: EffiBench: Benchmarking the Efficiency of Automatically Generated Code [Feb-24] [paper]
Mercury: Mercury: A Code Efficiency Benchmark for Code Large Language Models [Feb-24] [paper]
SStuBs: Large Language Models and Simple, Stupid Bugs [Mar-23] [paper]
package hallucinations: We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs [Jun-24] [paper]
HallTrigger: Code Hallucination [Jul-24] [paper]
Large Language Models for Code: Large Language Models for Code: Security Hardening and Adversarial Testing [Feb-23] [paper]
Purple Llama CYBERSECEVAL: Purple Llama CYBERSECEVAL: A Secure Coding Benchmark for Language Models [Dec-23] [paper]
Lost at C: Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants [Aug-22] [paper]
AI Assistants Security: Do Users Write More Insecure Code with AI Assistants? [Nov-22] [paper]
The Counterfeit Conundrum: The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations? [Feb-24] [paper]
Bugs in LLM-generated Code: Bugs in Large Language Models Generated Code: An Empirical Study [Mar-24] [paper]
GitHub Copilot, Amazon CodeWhisperer, ChatGPT: Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT [Apr-23] [paper]
ChatGPT Code Quality: No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT [Aug-23] [paper]
CloudAPIBench: On Mitigating Code LLM Hallucinations with API Documentation [Jul-24] [paper]
CodeMirage: CodeMirage: Hallucinations in Code Generated by Large Language Models [Aug-24] [paper]
LLM-generated Code Efficiency: On Evaluating the Efficiency of Source Code Generated by LLMs [Apr-24] [paper]
AutoAPIEval: A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models [Sep-24] [paper]
DeSec: Decoding Secret Memorization in Code LLMs Through Token-Level Characterization [Oct-24] [paper]
When Fine-Tuning LLMs Meets Data Privacy: When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair [Dec-24] [paper]
Bias Unveiled: Bias Unveiled: Investigating Social Bias in LLM-Generated Code [Nov-24] [paper]
FairCoder: FairCoder: Evaluating Social Bias of LLMs in Code Generation [Jan-25] [paper]
CodeIP: CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [Apr-24] [paper]
From Effectiveness to Efficiency: From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions [Jun-24] [paper]
ENAMEL: How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [Jun-24] [paper]
DeVAIC: DeVAIC: A Tool for Security Assessment of AI-generated Code [Apr-24] [paper]
PTMs: Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written [Nov-24] [paper]
Software Librarian: Is ChatGPT a Good Software Librarian? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations [Aug-24] [paper]
Codequal Analyzer: Improving LLM-Generated Code Quality with GRPO [Jun-25] [paper]
Artificial-Intelligence Generated Code Considered Harmful: Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation [Sep-24] [paper]
Unveiling Inefficiencies in LLM-Generated Code: Unveiling Inefficiencies in LLM-Generated Code: Toward a Comprehensive Taxonomy [Mar-25] [paper]
Python Tests Quality: Quality Assessment of Python Tests Generated by Large Language Models [Jun-25] [paper]
CoQuIR: CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval [Jun-25] [paper]
REAL: Training Language Models to Generate Quality Code with Program Analysis Feedback [May-25] [paper]
CIDRe: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement [May-25] [paper]
Infinite-Instruct: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification [May-25] [paper]
Quality In, Quality Out: Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation [Mar-25] [paper]
Security and Quality in LLM-Generated Code: Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis [Feb-25] [paper]
SwallowCode: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code [May-25] [paper]
ROSE: ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells [Jul-25] [paper]
Refining ChatGPT-Generated Code: Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [Jul-23] [paper]
ReCode: ReCode: Updating Code API Knowledge with Reinforcement Learning [Jun-25] [paper]
Seed-Coder: Seed-Coder: Let the Code Model Curate Data for Itself [Jun-25] [paper]
Data-efficient Fine-tuning: Data-efficient LLM Fine-tuning for Code Generation [Apr-25] [paper]
CRPE: CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation [May-25] [paper]
DeepSeek-Coder: DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [Jan-24] [paper]
CodeSmellEval: How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [Dec-24] [paper]
RPG: Rethinking Repetition Problems of LLMs in Code Generation [May-25] [paper]
Repetition In Repetition Out: Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [Oct-23] [paper]
Beyond Correctness: Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models [Jul-24] [paper]
Generated Code Diversity: Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes [Aug-24] [paper]
CodeMI: Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [Apr-24] [paper]
CodeCipher: CodeCipher: Learning to Obfuscate Source Code Against LLMs [Oct-24] [paper]
Code Llama: Code Llama: Open Foundation Models for Code [Aug-23] [paper]
Codex: Evaluating Large Language Models Trained on Code [Jul-21] [paper]
Path Planning Evaluation: Assessing LLM code generation quality through path planning tasks [Apr-25] [paper]
CODEJUDGE: CODEJUDGE: Evaluating Code Generation with Large Language Models [Jan-24] [paper]
Synthetic Data Generation: Synthetic Data Generation Using Large Language Models: Advances in Text and Code [Jan-25] [paper]
Unseen Horizons: Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar [Apr-25] [paper]
Code Generation Survey: A Survey on Large Language Models for Code Generation [Aug-24] [paper]
DataRecipe: DataRecipe --- How to Cook the Data for CodeLLM? [Oct-24] [paper]
aiXcoder-7B: aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing [Apr-25] [paper]
Imperfect Code Generation: Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models [May-24] [paper]
ClassEval: Evaluating Large Language Models in Class-Level Code Generation [Jun-24] [paper]
UCD-Training: Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs [Feb-26] [paper]
DRAINCODE: DRAINCODE: Stealthy Energy Consumption Attacks on Retrieval-Augmented Code Generation via Context Poisoning (Preprint) [Jan-26] [paper]
RealSec-Bench: RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [Jan-26] [paper]
ShortCoder: ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation (Preprint) [Jan-26] [paper]
APIKG4SYN: Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [Nov-25] [paper]
MultiCodeIF: A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback [Jul-25] [paper]
Beyond Functional Correctness: Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [Jun-24] [paper]
AdaDec: AdaDec: Uncertainty-Guided Adaptive Decoding for LLM-based Code Generation [Jun-25] [paper]
Code Copycat Conundrum: Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation [Apr-25] [paper]
AllianceCoder: What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [Mar-25] [paper]
RustEvo^2: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation [Mar-25] [paper]
RobGen: A Preliminary Study on the Robustness of Code Generation by Large Language Models [Mar-25] [paper]
LLM Hallucinations in Practical Code Generation: LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation [Sep-24] [paper]
COFFE: COFFE: A Code Efficiency Benchmark for Code Generation [Feb-25] [paper]
AATK Benchmark: Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions [Aug-21] [paper]

📊 RQ2: Training Data Quality Issues

We categorize intrinsic flaws within pre-training and fine-tuning corpora into two core dimensions:

Code Attribute Quality Issues: Inherent defects within individual code samples that models explicitly learn, categorized into correctness, security, compliance, robustness, maintainability, understandability, and efficiency flaws.
Non-Code Attribute Quality Issues: Non-code textual noise and macro-level dataset flaws, categorized into:
- Compliance and Security Risks (Textual): Hazards inherent in textual data, categorized into illegal/harmful, copyright-infringing, and privacy-leaking text.
- Distribution Imbalance Issues: Skewed dataset proportions, manifesting as imbalances across programming languages, domains, data types, or difficulty levels.
- Redundancy Issues: Excessive repetition, manifesting as duplicate samples or synthetic data degradation.
- Inadequate Diversity: Insufficient coverage of real-world scenarios, manifesting as underrepresented edge cases or niche business logic.
- Data Contamination Risks: Leakage of evaluation data, primarily manifesting as benchmark test sets embedded in training corpora.
- Low-Value Data: Data contributing little or negatively to learning, categorized into meaningless text, format noise, low-information-density text, erroneous text, and incomplete data.
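
To make the "Data Contamination Risks" category concrete, the following is a minimal sketch of an n-gram overlap check between a training sample and a set of benchmark items. It is an illustration under stated assumptions, not a method taken from the reviewed studies: the function names `token_ngrams` and `contamination_ratio`, whitespace tokenization, and the window size `n` are all hypothetical choices.

```python
from typing import Iterable, Set, Tuple


def token_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text` (assumed tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(sample: str, benchmark_items: Iterable[str], n: int = 8) -> float:
    """Fraction of the sample's n-grams that also occur in any benchmark item."""
    sample_grams = token_ngrams(sample, n)
    if not sample_grams:
        return 0.0  # sample shorter than n tokens: nothing to compare
    bench_grams: Set[Tuple[str, ...]] = set()
    for item in benchmark_items:
        bench_grams |= token_ngrams(item, n)
    return len(sample_grams & bench_grams) / len(sample_grams)


# Toy data, invented for illustration only.
benchmark = ["def add(a, b):\n    return a + b"]
leaked = "# utils\ndef add(a, b):\n    return a + b"
clean = "def mul(x, y):\n    return x * y"

print(contamination_ratio(leaked, benchmark, n=4))  # high overlap
print(contamination_ratio(clean, benchmark, n=4))   # no overlap
```

A sample whose ratio exceeds a chosen threshold could be flagged for review or removal. Both the threshold and the window size are corpus-dependent and would need tuning.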

Fig. 4. Taxonomy of Training Data Quality Issues

📄 Papers Referenced in this Section:

LLMs Meet Library Evolution: LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion [Jun-24] [paper]
Less is More: Less is More: On the Importance of Data Quality for Unit Test Generation [Feb-25] [paper]
DataMan: DataMan: Data Manager for Pre-training Large Language Models [Feb-25] [paper]
Phi-4: Phi-4 Technical Report [Dec-24] [paper]
SStuBs: Large Language Models and Simple, Stupid Bugs [Mar-23] [paper]
DeSec: Decoding Secret Memorization in Code LLMs Through Token-Level Characterization [Oct-24] [paper]
CIDRe: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement [May-25] [paper]
Infinite-Instruct: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification [May-25] [paper]
Quality In, Quality Out: Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation [Mar-25] [paper]
SwallowCode: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code [May-25] [paper]
Seed-Coder: Seed-Coder: Let the Code Model Curate Data for Itself [Jun-25] [paper]
CRPE: CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation [May-25] [paper]
GPT-4: GPT-4 Technical Report [Mar-23] [paper]
Code Pretraining: How Does Code Pretraining Affect Language Model Task Performance? [Sep-24] [paper]
StarCoder 2 and The Stack v2: StarCoder 2 and The Stack v2: The Next Generation [Feb-24] [paper]
Repetition In Repetition Out: Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [Oct-23] [paper]
Every Sample Matters: Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM [Mar-25] [paper]
Code Data Training Stage: At Which Training Stage Does Code Data Help LLMs Reasoning? [Sep-23] [paper]
WaveCoder: WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning [Dec-23] [paper]
Brevity is the soul of wit: Brevity is the soul of wit: Pruning long files for code generation [Jul-24] [paper]
Benchmark Builders: Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [Apr-25] [paper]
CodeMI: Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [Apr-24] [paper]
CodeCipher: CodeCipher: Learning to Obfuscate Source Code Against LLMs [Oct-24] [paper]
Code Pre-training Impact: To Code, or Not To Code? Exploring Impact of Code in Pre-training [Aug-24] [paper]
DataComp-LM: DataComp-LM: In search of the next generation of training sets for language models [Jun-24] [paper]
Logical Inference Pre-training: Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance? [Oct-24] [paper]
Code Llama: Code Llama: Open Foundation Models for Code [Aug-23] [paper]
Codex: Evaluating Large Language Models Trained on Code [Jul-21] [paper]
Path Planning Evaluation: Assessing LLM code generation quality through path planning tasks [Apr-25] [paper]
Datasets for Large Language Models: Datasets for Large Language Models: A Comprehensive Survey [Feb-24] [paper]
Synthetic Data Generation: Synthetic Data Generation Using Large Language Models: Advances in Text and Code [Jan-25] [paper]
Cracks in The Stack: Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets [May-25] [paper]
Unseen Horizons: Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar [Apr-25] [paper]
RTL-Breaker: RTL-Breaker: Assessing the Security of LLMs Against Backdoor Attacks on HDL Code Generation [Mar-25] [paper]
MG-Verilog: MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [Jun-24] [paper]
Code Generation Survey: A Survey on Large Language Models for Code Generation [Aug-24] [paper]
DataRecipe: DataRecipe --- How to Cook the Data for CodeLLM? [Oct-24] [paper]
Training Data Extraction: Understanding Privacy Risks of Large Language Models in Japanese Based on Training Data Extraction Attacks [Aug-25] [paper]
aiXcoder-7B: aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing [Apr-25] [paper]
Imperfect Code Generation: Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models [May-24] [paper]
LLM-ProS: LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving [May-25] [paper]
Uncovering Pretraining Code in LLMs: Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach [Nov-25] [paper]
APIKG4SYN: Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [Nov-25] [paper]
MultiCodeIF: A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback [Jul-25] [paper]
RustEvo^2: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation [Mar-25] [paper]
AATK Benchmark: Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions [Aug-21] [paper]

🔗 RQ3: Mapping: Data to Code

How do data defects cause code generation failures? We summarize 18 propagation mechanisms bridging the gap between dataset flaws and generated code defects:

Direct Mappings (10 types): The classic "garbage in, garbage out" replication. The model explicitly memorizes dataset flaws and replicates them.
Indirect Mappings (8 types): Insidious propagation. Non-code defects do not inject explicit errors but disrupt the model's internal representations via mechanisms such as entropy collapse, representation bias, or semantic drift.
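
For readers who want to work with the mappings programmatically, one possible encoding is a small record type that tags each mapping as direct or indirect. This is only a sketch: the class names, field names, and the two example entries are hypothetical placeholders chosen for illustration and are not copied from the review's mapping table.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class MappingKind(Enum):
    DIRECT = "direct"      # flaw is memorized and replicated
    INDIRECT = "indirect"  # flaw distorts internal representations


@dataclass(frozen=True)
class PropagationMapping:
    data_issue: str   # training data quality issue (source)
    code_issue: str   # generated code quality dimension (target)
    kind: MappingKind
    mechanism: str = "replication"


# Hypothetical entries for illustration; consult the paper for the actual 18 mappings.
EXAMPLE_MAPPINGS = [
    PropagationMapping("security flaws in code samples", "Security", MappingKind.DIRECT),
    PropagationMapping("duplicate samples", "Parsimony of Output",
                       MappingKind.INDIRECT, "entropy collapse"),
]


def count_by_kind(mappings: Iterable[PropagationMapping]) -> Counter:
    """Count how many mappings fall under each kind."""
    return Counter(m.kind for m in mappings)


print(count_by_kind(EXAMPLE_MAPPINGS))
```

A flat list of records like this is enough to reproduce source-to-target flows such as those in a Sankey diagram, since each record is one edge.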

Sankey Diagram of Mapping from Data Issues to Code Issues

Fig. 5. Mapping mechanisms from Training Data Issues to Generated Code Issues.

📄 Papers Referenced in this Section:

LLMs Meet Library Evolution: LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion [Jun-24] [paper]
Less is More: Less is More: On the Importance of Data Quality for Unit Test Generation [Feb-25] [paper]
Qwen: Qwen Technical Report [Sep-23] [paper]
DataMan: DataMan: Data Manager for Pre-training Large Language Models [Feb-25] [paper]
Phi-4: Phi-4 Technical Report [Dec-24] [paper]
Copilot Security: Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code? [Apr-22] [paper]
Copilot Evaluation: An Empirical Evaluation of GitHub Copilot’s Code Suggestions [Jan-25] [paper]
HalluCode: Exploring and Evaluating Hallucinations in LLM-Powered Code Generation [Apr-24] [paper]
CodeHalu: CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [May-24] [paper]
SStuBs: Large Language Models and Simple, Stupid Bugs [Mar-23] [paper]
package hallucinations: We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs [Jun-24] [paper]
HallTrigger: Code Hallucination [Jul-24] [paper]
Large Language Models for Code: Large Language Models for Code: Security Hardening and Adversarial Testing [Feb-23] [paper]
Purple Llama CYBERSECEVAL: Purple Llama CYBERSECEVAL: A Secure Coding Benchmark for Language Models [Dec-23] [paper]
Lost at C: Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants [Aug-22] [paper]
AI Assistants Security: Do Users Write More Insecure Code with AI Assistants? [Nov-22] [paper]
Bugs in LLM-generated Code: Bugs in Large Language Models Generated Code: An Empirical Study [Mar-24] [paper]
GitHub Copilot, Amazon CodeWhisperer, ChatGPT: Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT [Apr-23] [paper]
ChatGPT Code Quality: No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT [Aug-23] [paper]
CloudAPIBench: On Mitigating Code LLM Hallucinations with API Documentation [Jul-24] [paper]
CodeMirage: CodeMirage: Hallucinations in Code Generated by Large Language Models [Aug-24] [paper]
Syntactic Robustness: Syntactic Robustness for LLM-based Code Generation [Apr-24] [paper]
NLPerturbator: NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations [Jun-24] [paper]
AutoAPIEval: A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models [Sep-24] [paper]
DeSec: Decoding Secret Memorization in Code LLMs Through Token-Level Characterization [Oct-24] [paper]
When Fine-Tuning LLMs Meets Data Privacy: When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair [Dec-24] [paper]
Bias Unveiled: Bias Unveiled: Investigating Social Bias in LLM-Generated Code [Nov-24] [paper]
FairCoder: FairCoder: Evaluating Social Bias of LLMs in Code Generation [Jan-25] [paper]
CodeIP: CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [Apr-24] [paper]
DeVAIC: DeVAIC: A Tool for Security Assessment of AI-generated Code [Apr-24] [paper]
Software Librarian: Is ChatGPT a Good Software Librarian? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations [Aug-24] [paper]
Codequal Analyzer: Improving LLM-Generated Code Quality with GRPO [Jun-25] [paper]
Artificial-Intelligence Generated Code Considered Harmful: Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation [Sep-24] [paper]
Unveiling Inefficiencies in LLM-Generated Code: Unveiling Inefficiencies in LLM-Generated Code: Toward a Comprehensive Taxonomy [Mar-25] [paper]
Python Tests Quality: Quality Assessment of Python Tests Generated by Large Language Models [Jun-25] [paper]
CoQuIR: CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval [Jun-25] [paper]
REAL: Training Language Models to Generate Quality Code with Program Analysis Feedback [May-25] [paper]
CIDRe: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement [May-25] [paper]
Infinite-Instruct: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification [May-25] [paper]
Quality In, Quality Out: Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation [Mar-25] [paper]
Security and Quality in LLM-Generated Code: Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis [Feb-25] [paper]
SwallowCode: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code [May-25] [paper]
ROSE: ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells [Jul-25] [paper]
Refining ChatGPT-Generated Code: Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [Jul-23] [paper]
Qwen2.5: Qwen2.5 Technical Report [Dec-24] [paper]
ReCode: ReCode: Updating Code API Knowledge with Reinforcement Learning [Jun-25] [paper]
Data-efficient Fine-tuning: Data-efficient LLM Fine-tuning for Code Generation [Apr-25] [paper]
CRPE: CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation [May-25] [paper]
DeepSeek-Coder: DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [Jan-24] [paper]
GPT-4: GPT-4 Technical Report [Mar-23] [paper]
Code Pretraining: How Does Code Pretraining Affect Language Model Task Performance? [Sep-24] [paper]
StarCoder 2 and The Stack v2: StarCoder 2 and The Stack v2: The Next Generation [Feb-24] [paper]
CodeSmellEval: How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [Dec-24] [paper]
RPG: Rethinking Repetition Problems of LLMs in Code Generation [May-25] [paper]
Repetition In Repetition Out: Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [Oct-23] [paper]
Code Data Training Stage: At Which Training Stage Does Code Data Help LLMs Reasoning? [Sep-23] [paper]
Brevity is the soul of wit: Brevity is the soul of wit: Pruning long files for code generation [Jul-24] [paper]
Benchmark Builders: Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [Apr-25] [paper]
Generated Code Diversity: Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes [Aug-24] [paper]
CodeMI: Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [Apr-24] [paper]
CodeCipher: CodeCipher: Learning to Obfuscate Source Code Against LLMs [Oct-24] [paper]
Code Pre-training Impact: To Code, or Not To Code? Exploring Impact of Code in Pre-training [Aug-24] [paper]
DataComp-LM: DataComp-LM: In search of the next generation of training sets for language models [Jun-24] [paper]
RedStone: RedStone: Curating General, Code, Math, and QA Data for Large Language Models [Dec-24] [paper]
Code Llama: Code Llama: Open Foundation Models for Code [Aug-23] [paper]
Codex: Evaluating Large Language Models Trained on Code [Jul-21] [paper]
Path Planning Evaluation: Assessing LLM code generation quality through path planning tasks [Apr-25] [paper]
CODEJUDGE: CODEJUDGE: Evaluating Code Generation with Large Language Models [Jan-24] [paper]
Datasets for Large Language Models: Datasets for Large Language Models: A Comprehensive Survey [Feb-24] [paper]
Synthetic Data Generation: Synthetic Data Generation Using Large Language Models: Advances in Text and Code [Jan-25] [paper]
Cracks in The Stack: Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets [May-25] [paper]
Unseen Horizons: Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar [Apr-25] [paper]
RTL-Breaker: RTL-Breaker: Assessing the Security of LLMs Against Backdoor Attacks on HDL Code Generation [Mar-25] [paper]
MG-Verilog: MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [Jun-24] [paper]
Code Generation Survey: A Survey on Large Language Models for Code Generation [Aug-24] [paper]
DataRecipe: DataRecipe --- How to Cook the Data for CodeLLM? [Oct-24] [paper]
Training Data Extraction: Understanding Privacy Risks of Large Language Models in Japanese Based on Training Data Extraction Attacks [Aug-25] [paper]
aiXcoder-7B: aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing [Apr-25] [paper]
Imperfect Code Generation: Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models [May-24] [paper]
Inter-Dataset Code Duplication: On Inter-Dataset Code Duplication and Data Leakage in Large Language Models [Jan-25] [paper]
LLM-ProS: LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving [May-25] [paper]
UCD-Training: Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs [Feb-26] [paper]
ShortCoder: ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation (Preprint) [Jan-26] [paper]
Beyond Functional Correctness: Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [Jun-24] [paper]
RustEvo^2: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation [Mar-25] [paper]

🔍 RQ4: Detection Methods

Detection techniques are evolving from rigid static analysis to dynamic, model-driven, and hybrid evaluation frameworks. They form the diagnostic foundation of LLM quality governance and are classified into two categories:

1. Code-Level Detection

Identifies defects in generated code (e.g., runtime failures, hallucinations, security vulnerabilities) using three main paradigms:

Dynamic Analysis: Test-based execution (unit tests, functional benchmarks) and runtime monitoring to assess execution accuracy and resource efficiency.
Static Analysis: Rule-based detection (via tools like SonarQube, Semgrep) and manual inspection to find syntax errors, vulnerabilities, and code smells without executing the code.
Model-based Detection: "LLM-as-a-judge" techniques (direct, prompt-engineered, or fine-tuned evaluation) and lightweight ML classifiers for scalable semantic filtering.
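
The static and dynamic paradigms above can be sketched in a few lines of Python. This is a simplified illustration rather than any tool from the reviewed studies: the helper names `static_findings` and `dynamic_pass_rate` are hypothetical, the only static rule shown is a bare `except` check via the standard `ast` module, and the dynamic part runs the code with `exec`, which is only acceptable for trusted input or inside a sandbox.

```python
import ast
from typing import Dict, List, Tuple


def static_findings(source: str) -> List[str]:
    """Static analysis: parse without executing and report simple rule violations."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]
    findings = []
    for node in ast.walk(tree):
        # A bare `except:` swallows every exception, a common robustness smell.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"bare except at line {node.lineno}")
    return findings


def dynamic_pass_rate(source: str, func_name: str,
                      cases: List[Tuple[tuple, object]]) -> float:
    """Dynamic analysis: execute the code and return the fraction of passing test cases."""
    namespace: Dict[str, object] = {}
    exec(source, namespace)  # WARNING: run untrusted generated code only in a sandbox
    func = namespace[func_name]
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime failure counts as a failed case
    return passed / len(cases) if cases else 0.0


# Toy "generated" snippet, invented for illustration.
generated = (
    "def safe_div(a, b):\n"
    "    try:\n"
    "        return a / b\n"
    "    except:\n"
    "        return 0\n"
)
cases = [((6, 3), 2.0), ((1, 0), 0)]

print(static_findings(generated))
print(dynamic_pass_rate(generated, "safe_div", cases))
```

The toy snippet passes both tests yet is still flagged statically, which shows why the two paradigms are complementary rather than interchangeable.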

2. Data-Level Detection

Targets the integrity, provenance, and representativeness of the underlying training data:

Dynamic Analysis: Execution-based validation (checking if scraped code compiles) and metric drift monitoring (detecting data leakage or contamination through training loss curves).
Static Analysis: Rule-based detection, human review, and provenance tracing (using file hashes to identify duplicate or benchmark-contaminated data).
Model-based Detection: High-throughput semantic screening using LLMs or lightweight classifiers to evaluate sample readability, information entropy, and potential hazards.
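
As a rough illustration of hash-based provenance tracing and compile-based validation, the sketch below screens a small corpus of Python samples. The function names (`normalized_hash`, `screen_corpus`, `compiles`) and the whitespace normalization are assumptions made for this example. Exact hashing only catches copies that are identical after normalization, so near-duplicates would need a fuzzier technique.

```python
import hashlib


def normalized_hash(code: str) -> str:
    """Hash a sample after stripping per-line whitespace and blank lines.

    Assumption: this coarse normalization is acceptable for matching;
    it deliberately ignores indentation differences.
    """
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def compiles(code: str) -> bool:
    """Check whether a scraped Python sample at least compiles."""
    try:
        compile(code, "<sample>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def screen_corpus(samples, benchmark_solutions):
    """Split samples into kept, duplicate, and benchmark-contaminated lists."""
    benchmark_hashes = {normalized_hash(s) for s in benchmark_solutions}
    seen = set()
    kept, duplicates, contaminated = [], [], []
    for sample in samples:
        digest = normalized_hash(sample)
        if digest in benchmark_hashes:
            contaminated.append(sample)   # overlaps with an evaluation set
        elif digest in seen:
            duplicates.append(sample)     # repeated training sample
        else:
            seen.add(digest)
            kept.append(sample)
    return kept, duplicates, contaminated


corpus = ["def f():\n    return 1\n", "def f():\n  return 1\n"]
kept, duplicates, contaminated = screen_corpus(corpus, benchmark_solutions=[])
print(len(kept), len(duplicates), len(contaminated))
```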

Fig. 6. Taxonomy of Code Issue Detection Techniques

Fig. 7. Taxonomy of Training Data Issue Detection Techniques

📄 Papers Referenced in this Section:

LLMs Meet Library Evolution: LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion [Jun-24] [paper]
Less is More: Less is More: On the Importance of Data Quality for Unit Test Generation [Feb-25] [paper]
Qwen: Qwen Technical Report [Sep-23] [paper]
Qwen2: Qwen2 Technical Report [Jul-24] [paper]
DataMan: DataMan: Data Manager for Pre-training Large Language Models [Feb-25] [paper]
Phi-4: Phi-4 Technical Report [Dec-24] [paper]
Copilot Security: Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code? [Apr-22] [paper]
Copilot Evaluation: An Empirical Evaluation of GitHub Copilot’s Code Suggestions [Jan-25] [paper]
HalluCode: Exploring and Evaluating Hallucinations in LLM-Powered Code Generation [Apr-24] [paper]
CodeHalu: CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [May-24] [paper]
EffiBench: EffiBench: Benchmarking the Efficiency of Automatically Generated Code [Feb-24] [paper]
Mercury: Mercury: A Code Efficiency Benchmark for Code Large Language Models [Feb-24] [paper]
SStuBs: Large Language Models and Simple, Stupid Bugs [Mar-23] [paper]
package hallucinations: We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs [Jun-24] [paper]
HallTrigger: Code Hallucination [Jul-24] [paper]
Large Language Models for Code: Large Language Models for Code: Security Hardening and Adversarial Testing [Feb-23] [paper]
Purple Llama CYBERSECEVAL: Purple Llama CYBERSECEVAL: A Secure Coding Benchmark for Language Models [Dec-23] [paper]
Lost at C: Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants [Aug-22] [paper]
AI Assistants Security: Do Users Write More Insecure Code with AI Assistants? [Nov-22] [paper]
The Counterfeit Conundrum: The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations? [Feb-24] [paper]
Bugs in LLM-generated Code: Bugs in Large Language Models Generated Code: An Empirical Study [Mar-24] [paper]
GitHub Copilot, Amazon CodeWhisperer, ChatGPT: Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT [Apr-23] [paper]
ChatGPT Code Quality: No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT [Aug-23] [paper]
CloudAPIBench: On Mitigating Code LLM Hallucinations with API Documentation [Jul-24] [paper]
CodeMirage: CodeMirage: Hallucinations in Code Generated by Large Language Models [Aug-24] [paper]
LLM-generated Code Efficiency: On Evaluating the Efficiency of Source Code Generated by LLMs [Apr-24] [paper]
Syntactic Robustness: Syntactic Robustness for LLM-based Code Generation [Apr-24] [paper]
DeSec: Decoding Secret Memorization in Code LLMs Through Token-Level Characterization [Oct-24] [paper]
Bias Unveiled: Bias Unveiled: Investigating Social Bias in LLM-Generated Code [Nov-24] [paper]
FairCoder: FairCoder: Evaluating Social Bias of LLMs in Code Generation [Jan-25] [paper]
From Effectiveness to Efficiency: From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions [Jun-24] [paper]
ENAMEL: How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [Jun-24] [paper]
DeVAIC: DeVAIC: A Tool for Security Assessment of AI-generated Code [Apr-24] [paper]
PTMs: Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written [Nov-24] [paper]
Codequal Analyzer: Improving LLM-Generated Code Quality with GRPO [Jun-25] [paper]
Artificial-Intelligence Generated Code Considered Harmful: Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation [Sep-24] [paper]
Unveiling Inefficiencies in LLM-Generated Code: Unveiling Inefficiencies in LLM-Generated Code: Toward a Comprehensive Taxonomy [Mar-25] [paper]
Python Tests Quality: Quality Assessment of Python Tests Generated by Large Language Models [Jun-25] [paper]
CoQuIR: CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval [Jun-25] [paper]
REAL: Training Language Models to Generate Quality Code with Program Analysis Feedback [May-25] [paper]
CIDRe: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement [May-25] [paper]
Infinite-Instruct: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification [May-25] [paper]
Quality In, Quality Out: Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation [Mar-25] [paper]
Security and Quality in LLM-Generated Code: Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis [Feb-25] [paper]
SwallowCode: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code [May-25] [paper]
ROSE: ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells [Jul-25] [paper]
Refining ChatGPT-Generated Code: Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [Jul-23] [paper]
Qwen3: Qwen3 Technical Report [May-25] [paper]
Qwen2.5: Qwen2.5 Technical Report [Dec-24] [paper]
TeleChat: Technical Report of TeleChat2, TeleChat2.5 and T1 [Jul-25] [paper]
Kimi K2: Kimi K2: Open Agentic Intelligence [Jul-25] [paper]
ReCode: ReCode: Updating Code API Knowledge with Reinforcement Learning [Jun-25] [paper]
Seed-Coder: Seed-Coder: Let the Code Model Curate Data for Itself [Jun-25] [paper]
Data-efficient Fine-tuning: Data-efficient LLM Fine-tuning for Code Generation [Apr-25] [paper]
CRPE: CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation [May-25] [paper]
DeepSeek-Coder: DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [Jan-24] [paper]
StarCoder 2 and The Stack v2: StarCoder 2 and The Stack v2: The Next Generation [Feb-24] [paper]
CodeSmellEval: How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [Dec-24] [paper]
RPG: Rethinking Repetition Problems of LLMs in Code Generation [May-25] [paper]
Repetition In Repetition Out: Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [Oct-23] [paper]
Every Sample Matters: Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM [Mar-25] [paper]
WaveCoder: WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning [Dec-23] [paper]
Brevity is the soul of wit: Brevity is the soul of wit: Pruning long files for code generation [Jul-24] [paper]
Benchmark Builders: Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [Apr-25] [paper]
Beyond Correctness: Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models [Jul-24] [paper]
Generated Code Diversity: Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes [Aug-24] [paper]
CodeMI: Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [Apr-24] [paper]
DataComp-LM: DataComp-LM: In search of the next generation of training sets for language models [Jun-24] [paper]
Codex: Evaluating Large Language Models Trained on Code [Jul-21] [paper]
Path Planning Evaluation: Assessing LLM code generation quality through path planning tasks [Apr-25] [paper]
CODEJUDGE: CODEJUDGE: Evaluating Code Generation with Large Language Models [Jan-24] [paper]
Datasets for Large Language Models: Datasets for Large Language Models: A Comprehensive Survey [Feb-24] [paper]
Synthetic Data Generation: Synthetic Data Generation Using Large Language Models: Advances in Text and Code [Jan-25] [paper]
Cracks in The Stack: Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets [May-25] [paper]
Unseen Horizons: Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar [Apr-25] [paper]
MG-Verilog: MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [Jun-24] [paper]
Code Generation Survey: A Survey on Large Language Models for Code Generation [Aug-24] [paper]
DataRecipe: DataRecipe --- How to Cook the Data for CodeLLM? [Oct-24] [paper]
Training Data Extraction: Understanding Privacy Risks of Large Language Models in Japanese Based on Training Data Extraction Attacks [Aug-25] [paper]
aiXcoder-7B: aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing [Apr-25] [paper]
Imperfect Code Generation: Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models [May-24] [paper]
Inter-Dataset Code Duplication: On Inter-Dataset Code Duplication and Data Leakage in Large Language Models [Jan-25] [paper]
LLM-ProS: LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving [May-25] [paper]
ClassEval: Evaluating Large Language Models in Class-Level Code Generation [Jun-24] [paper]
Uncovering Pretraining Code in LLMs: Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach [Nov-25] [paper]
RealSec-Bench: RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [Jan-26] [paper]
ShortCoder: ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation [Jan-26] [paper]
APIKG4SYN: Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [Nov-25] [paper]
MultiCodeIF: A hierarchical and evolvable benchmark for fine-grained code instruction following with multi-turn feedback [Jul-25] [paper]
Beyond Functional Correctness: Beyond functional correctness: Investigating coding style inconsistencies in large language models [Jun-24] [paper]
Adadec: Adadec: Uncertainty-guided adaptive decoding for LLM-based code generation [Jun-25] [paper]
Code Copycat Conundrum: Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation [Apr-25] [paper]
AllianceCoder: What to retrieve for effective retrieval-augmented code generation? an empirical study and beyond [Mar-25] [paper]
RustEvo^2: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation [Mar-25] [paper]
RobGen: A Preliminary Study on the Robustness of Code Generation by Large Language Models [Mar-25] [paper]
LLM Hallucinations in Practical Code Generation: LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation [Sep-24] [paper]
COFFE: COFFE: A Code Efficiency Benchmark for Code Generation [Feb-25] [paper]
AATK Benchmark: Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions [Aug-21] [paper]

🛠️ RQ5: Governance Strategies

We synthesize a Multi-layered Governance Framework spanning the entire data lifecycle and model inference stages to address quality defects:

1. Code-Level Mitigation

Model-level: SFT, RLHF/DPO, Reward-based optimization (combining execution correctness with static metrics), and Regularization-based stabilization to prevent mode collapse.
Generation-level:
- Pre-generation: Prompt Engineering, RAG, and Agent-based workflows.
- In-generation: Adaptive decoding constraints and Iterative Self-reflection.
- Post-generation: Automated AST-level repairs and sandbox execution filtering.
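
Post-generation execution filtering can be sketched as follows. This is an illustrative example under stated assumptions, not a method from a specific reviewed paper: `passes_filter` and `filter_candidates` are hypothetical names, and a child Python process with a timeout provides only process isolation, not a real sandbox.

```python
import ast
import subprocess
import sys


def passes_filter(candidate: str, test_code: str, timeout: float = 5.0) -> bool:
    """Keep a candidate only if it parses and its tests pass in a child process."""
    try:
        ast.parse(candidate)  # cheap AST-level gate before any execution
    except SyntaxError:
        return False
    program = candidate + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,  # guards against non-terminating candidates
        )
    except subprocess.TimeoutExpired:
        return False
    # A failing assertion or uncaught exception yields a non-zero exit code.
    return result.returncode == 0


def filter_candidates(candidates, test_code, timeout: float = 5.0):
    """Return the candidates that survive the filter, in their original order."""
    return [c for c in candidates if passes_filter(c, test_code, timeout)]


candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]
print(len(filter_candidates(candidates, "assert add(1, 2) == 3")))
```

In a production setting the child process would typically be replaced by a container or another restricted runtime, since generated code may touch the file system or network.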

2. Data-Level Mitigation

Cleaning & Filtering: Execution-feedback elimination, static rule sanitization, and LLM-driven semantic cleaning to remove noise and vulnerabilities.
Data Balancing: Stratified resampling across programming languages, domains, and difficulty levels to mitigate representation bias.
Data Enhancement: Using LLMs or formatting tools to refactor, add docstrings, and standardize existing low-quality code.
Data Augmentation: Expanding datasets via high-quality synthetic generation (rule/LLM-based) and integration of curated open-source repositories.
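
The data balancing step can be pictured with a small stratified resampling sketch. The function name `stratified_resample`, the `lang` field, and the fixed per-stratum target are assumptions for illustration only. Upsampling by repetition is used here for simplicity, although it can reintroduce the duplication issues discussed elsewhere in the review.

```python
import random
from collections import defaultdict


def stratified_resample(samples, key, per_stratum, seed=0):
    """Return a list with exactly `per_stratum` samples for each stratum.

    `key` maps a sample to its stratum (for example, programming language).
    Large strata are downsampled without replacement; small strata keep all
    of their samples and are topped up by sampling with replacement.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[key(sample)].append(sample)

    balanced = []
    for name in sorted(strata):  # sorted for deterministic output order
        group = strata[name]
        if len(group) >= per_stratum:
            balanced.extend(rng.sample(group, per_stratum))
        else:
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=per_stratum - len(group)))
    return balanced


corpus = (
    [{"lang": "python", "id": i} for i in range(6)]
    + [{"lang": "rust", "id": i} for i in range(2)]
)
balanced = stratified_resample(corpus, key=lambda s: s["lang"], per_stratum=3)
print(len(balanced))
```

The same function can balance across domains or difficulty levels by passing a different `key`.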

Fig. 8. Taxonomy of Code Issue Mitigation Strategies

Fig. 9. Taxonomy of Training Data Issue Mitigation Strategies

📄 Papers Referenced in this Section:

LLMs Meet Library Evolution: LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion [Jun-24] [paper]
Less is More: Less is More: On the Importance of Data Quality for Unit Test Generation [Feb-25] [paper]
Qwen: Qwen Technical Report [Sep-23] [paper]
Qwen2: Qwen2 Technical Report [Jul-24] [paper]
DataMan: DataMan: Data Manager for Pre-training Large Language Models [Feb-25] [paper]
Phi-4: Phi-4 Technical Report [Dec-24] [paper]
SStuBs: Large Language Models and Simple, Stupid Bugs [Mar-23] [paper]
package hallucinations: We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs [Jun-24] [paper]
Large Language Models for Code: Large Language Models for Code: Security Hardening and Adversarial Testing [Feb-23] [paper]
CloudAPIBench: On Mitigating Code LLM Hallucinations with API Documentation [Jul-24] [paper]
AutoAPIEval: A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models [Sep-24] [paper]
DeSec: Decoding Secret Memorization in Code LLMs Through Token-Level Characterization [Oct-24] [paper]
Codequal Analyzer: Improving LLM-Generated Code Quality with GRPO [Jun-25] [paper]
REAL: Training Language Models to Generate Quality Code with Program Analysis Feedback [May-25] [paper]
CIDRe: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement [May-25] [paper]
Infinite-Instruct: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification [May-25] [paper]
Quality In, Quality Out: Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation [Mar-25] [paper]
SwallowCode: Rewriting Pre-Training Data Boosts LLM Performance in Math and Code [May-25] [paper]
Refining ChatGPT-Generated Code: Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues [Jul-23] [paper]
Qwen3: Qwen3 Technical Report [May-25] [paper]
Qwen2.5: Qwen2.5 Technical Report [Dec-24] [paper]
TeleChat: Technical Report of TeleChat2, TeleChat2.5 and T1 [Jul-25] [paper]
Kimi K2: Kimi K2: Open Agentic Intelligence [Jul-25] [paper]
ReCode: ReCode: Updating Code API Knowledge with Reinforcement Learning [Jun-25] [paper]
Seed-Coder: Seed-Coder: Let the Code Model Curate Data for Itself [Jun-25] [paper]
Data-efficient Fine-tuning: Data-efficient LLM Fine-tuning for Code Generation [Apr-25] [paper]
CRPE: CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation [May-25] [paper]
DeepSeek-Coder: DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [Jan-24] [paper]
Code Pretraining: How Does Code Pretraining Affect Language Model Task Performance? [Sep-24] [paper]
StarCoder 2 and The Stack v2: StarCoder 2 and The Stack v2: The Next Generation [Feb-24] [paper]
CodeSmellEval: How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study [Dec-24] [paper]
RPG: Rethinking Repetition Problems of LLMs in Code Generation [May-25] [paper]
Repetition In Repetition Out: Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [Oct-23] [paper]
Brevity is the soul of wit: Brevity is the soul of wit: Pruning long files for code generation [Jul-24] [paper]
Benchmark Builders: Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [Apr-25] [paper]
CodeCipher: CodeCipher: Learning to Obfuscate Source Code Against LLMs [Oct-24] [paper]
DataComp-LM: DataComp-LM: In search of the next generation of training sets for language models [Jun-24] [paper]
RedStone: RedStone: Curating General, Code, Math, and QA Data for Large Language Models [Dec-24] [paper]
Code Llama: Code Llama: Open Foundation Models for Code [Aug-23] [paper]
Codex: Evaluating Large Language Models Trained on Code [Jul-21] [paper]
Path Planning Evaluation: Assessing LLM code generation quality through path planning tasks [Apr-25] [paper]
CODEJUDGE: CODEJUDGE: Evaluating Code Generation with Large Language Models [Jan-24] [paper]
Synthetic Data Generation: Synthetic Data Generation Using Large Language Models: Advances in Text and Code [Jan-25] [paper]
Cracks in The Stack: Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets [May-25] [paper]
MG-Verilog: MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation [Jun-24] [paper]
Code Generation Survey: A Survey on Large Language Models for Code Generation [Aug-24] [paper]
DataRecipe: DataRecipe --- How to Cook the Data for CodeLLM? [Oct-24] [paper]
aiXcoder-7B: aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing [Apr-25] [paper]
Imperfect Code Generation: Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models [May-24] [paper]
Inter-Dataset Code Duplication: On Inter-Dataset Code Duplication and Data Leakage in Large Language Models [Jan-25] [paper]
LLM-ProS: LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving [May-25] [paper]
UCD-Training: Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs [Feb-26] [paper]
ShortCoder: ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation [Jan-26] [paper]
APIKG4SYN: Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [Nov-25] [paper]
MultiCodeIF: A hierarchical and evolvable benchmark for fine-grained code instruction following with multi-turn feedback [Jul-25] [paper]
Beyond Functional Correctness: Beyond functional correctness: Investigating coding style inconsistencies in large language models [Jun-24] [paper]
Adadec: Adadec: Uncertainty-guided adaptive decoding for LLM-based code generation [Jun-25] [paper]
Code Copycat Conundrum: Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation [Apr-25] [paper]
AllianceCoder: What to retrieve for effective retrieval-augmented code generation? an empirical study and beyond [Mar-25] [paper]
RustEvo^2: RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation [Mar-25] [paper]
RobGen: A Preliminary Study on the Robustness of Code Generation by Large Language Models [Mar-25] [paper]
LLM Hallucinations in Practical Code Generation: LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation [Sep-24] [paper]
COFFE: COFFE: A Code Efficiency Benchmark for Code Generation [Feb-25] [paper]

🤝 Contribution

We warmly welcome contributions from the community! If you have new research or have discovered missing classic papers, please follow these steps:

Fork this repository.
Add your paper to the corresponding RQ section following the existing table format.
Submit a Pull Request.
