Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity

Cotroneo, Domenico; Improta, Cristina; Liguori, Pietro
2025

Abstract

As AI code assistants become increasingly integrated into software development workflows, understanding how their code compares to human-written programs is critical for ensuring reliability, maintainability, and security. In this paper, we present a large-scale comparison of code authored by human developers and three state-of-the-art LLMs, namely ChatGPT, DeepSeek-Coder, and Qwen-Coder, along multiple dimensions of software quality: code defects, security vulnerabilities, and structural complexity. Our evaluation spans over 500k code samples in two widely used languages, Python and Java, classifying defects via Orthogonal Defect Classification and security vulnerabilities using the Common Weakness Enumeration. We find that AI-generated code is generally simpler and more repetitive, yet more prone to unused constructs and hardcoded debugging, while human-written code exhibits greater structural complexity and a higher concentration of maintainability issues. Notably, AI-generated code also contains more high-risk security vulnerabilities. These findings highlight the distinct defect profiles of AI- and human-authored code and underscore the need for specialized quality assurance practices in AI-assisted programming.
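To make the defect and vulnerability categories named above concrete, the following is a minimal illustrative sketch in Python; the function, names, and values are hypothetical and not drawn from the paper's dataset. It shows the two patterns the abstract attributes to AI-generated code, an unused construct and hardcoded debugging output, plus a hard-coded credential, one example of a weakness class catalogued by the Common Weakness Enumeration (CWE-798); whether that specific weakness figures in the study's results is not stated here.

    # Illustrative sketch only; names and values are hypothetical,
    # not taken from the paper's dataset.
    import sqlite3

    DB_PASSWORD = "hunter2"  # hard-coded credential (CWE-798)

    def fetch_user(conn: sqlite3.Connection, user_id: int):
        retries = 3  # unused construct: assigned but never read
        print("DEBUG fetching user", user_id)  # hardcoded debugging output left in place
        # Parameterized query; interpolating user_id into the SQL string
        # instead would introduce CWE-89 (SQL injection), a classic
        # high-risk weakness.
        return conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()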
Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity / Cotroneo, Domenico; Improta, Cristina; Liguori, Pietro. - (2025), pp. 252-263. (36th IEEE International Symposium on Software Reliability Engineering, ISSRE 2025) [10.1109/issre66568.2025.00035].
File in this record:
Human-Written_vs._AI-Generated_Code_A_Large-Scale_Study_of_Defects_Vulnerabilities_and_Complexity.pdf
License: publisher copyright. Format: Adobe PDF. Size: 344.65 kB. Access: authorized users only (copy available on request).

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/1043615
Citations
  • PMC: not available
  • Scopus: 1
  • Web of Science: not available