Copilot vs. Humans: Comparing Vulnerability Rates in Code Generation

28 May 2024

Authors:

(1) Vahid Majdinasab, Department of Computer and Software Engineering, Polytechnique Montreal, Canada;

(2) Michael Joshua Bishop, School of Mathematical and Computational Sciences, Massey University, New Zealand;

(3) Shawn Rasheed, Information & Communication Technology Group, UCOL - Te Pukenga, New Zealand;

(4) Arghavan Moradidakhel, Department of Computer and Software Engineering, Polytechnique Montreal, Canada;

(5) Amjed Tahir, School of Mathematical and Computational Sciences, Massey University, New Zealand;

(6) Foutse Khomh, Department of Computer and Software Engineering, Polytechnique Montreal, Canada.

Table of Links

Abstract and Introduction

Original Study

Replication Scope and Methodology

Results

Discussion

Several studies have evaluated security issues in code generated by LLMs, particularly code generated by Copilot.

A recent study by Fu et al. [15] on the security of Copilot-generated code in GitHub projects reported that around 36% of the Copilot-generated code contains CWEs. The studied snippets revealed weaknesses related to 42 different CWEs, including eleven that appear in MITRE’s 2022 Top-25 CWEs. The most frequently reported weaknesses include OS Command Injection (CWE-78), Use of Insufficiently Random Values (CWE-330), and Improper Check or Handling of Exceptional Conditions (CWE-703).
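To make two of these weakness classes concrete, the following is a minimal Python sketch of the patterns behind CWE-78 and CWE-330 next to commonly recommended alternatives. It is an illustration only and is not taken from Fu et al. [15]; the function names and the example file name are hypothetical.

```python
import secrets
import shlex
from typing import List


def build_unsafe_command(filename: str) -> str:
    """CWE-78 pattern: untrusted input concatenated into a shell string."""
    # If this string were handed to a shell, anything after ';' in
    # `filename` would be executed as a second command.
    return "cat " + filename


def build_safe_command(filename: str) -> List[str]:
    """Safer pattern: an argument list intended for subprocess.run without a shell."""
    # "--" tells cat that no further options follow, so a name that
    # starts with "-" is still treated as a file name.
    return ["cat", "--", filename]


def build_quoted_command(filename: str) -> str:
    """If a shell string cannot be avoided, quote the untrusted part."""
    return "cat -- " + shlex.quote(filename)


def make_session_token(num_bytes: int = 32) -> str:
    """CWE-330 mitigation: draw from the OS CSPRNG via `secrets`, not `random`."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    payload = "report.txt; echo pwned"  # hypothetical attacker-controlled input
    print(build_unsafe_command(payload))
    print(build_safe_command(payload))
    print(build_quoted_command(payload))
    print(make_session_token(16))
```

In the unsafe variant the payload becomes part of the command text, whereas in the list form it stays a single argument. The `random` module is designed for simulation rather than security, which is why the sketch draws tokens from `secrets` instead.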

Previous studies, including the work of Khoury et al. [20], inspected code generated by ChatGPT for common vulnerabilities and examined how the model responds when prompted to improve vulnerable code. The study found that while the model is conceptually “aware” of the vulnerabilities present in the code, it nevertheless continues to produce code that contains them. Hajipour et al. [21] investigated vulnerabilities introduced by specially engineered prompts. By inverting the target models, the study was able to extract prompts that would induce vulnerabilities in the generated code.

A study by He et al. [8] used adversarial testing to implement security hardening on pre-trained LLMs. This process showed a significant improvement in the security of the output code without having to retrain the models; the study used CodeQL to validate the generated code samples. Asare et al. [22] compared the rate at which humans and Copilot introduce vulnerabilities. Of the code samples tested, 33% reintroduced the same vulnerabilities as the original code, while 25% were the same as the fixed code. The remaining 42% were dissimilar to both the vulnerable and the fixed code.

Several studies have also examined the security of code generated by ChatGPT models. Shi et al. [23] proposed a backdoor attack that may be used to exploit the security vulnerabilities of ChatGPT. Initial experiments show that attackers may manipulate generated text with this approach. Erner et al. [24] explored various attack vectors for ChatGPT and performed a qualitative analysis of the security implications of these vectors. Given the large attack surface, the study concluded that more research is required into each of the vectors to inform professionals and policymakers going forward.

Go et al. [25] demonstrated the use of GitHub’s code search to find “Simple Stupid Insecure Practices” (SSIPs) in open-source software projects across the site. The study shows that SSIPs are common, exploitable vulnerabilities that can easily be found using GitHub. Perry et al. [26] performed a study comparing how users complete programming tasks with and without AI code assistants. The study found that users who had access to one of OpenAI’s code generation models wrote significantly less secure code than those without access. Huang et al. [27] surveyed the safety and trustworthiness of LLMs and the viability of various verification and validation techniques. The paper is intended as an organized collection of literature to facilitate a quick understanding of LLM safety and trustworthiness from the perspective of verification and validation.

A growing number of studies have investigated different aspects of GitHub Copilot’s code quality. Several of those studies have focused on the productivity aspects of Copilot. Dakhel et al. [5] analyzed the viability of Copilot as a pair programmer by comparing the correctness of the solutions it provides with those written by programmers. The study reported that the programmers’ solutions have a higher correctness ratio than those of Copilot. However, Copilot’s buggy solutions were found to require less effort to repair than the programmers’ buggy solutions.

Evaluating the practical quality of Copilot suggestions, Nguyen et al. [28] used LeetCode questions to create queries for Copilot in four programming languages. Java was found to have the highest rate of correct suggestions at 57%, while JavaScript had the lowest at 27%. Shortcomings of Copilot noted in the study include generating incomplete code that relies on undefined helper functions, as well as overcomplicated and circuitous code.
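The "undefined helper function" shortcoming can be illustrated with a small static check. The sketch below uses Python's standard `ast` module to list plain function calls whose names are never defined, imported, or assigned in a snippet. It is a rough heuristic written for illustration, not the method used by Nguyen et al. [28]; the function name `undefined_called_names` and the sample snippet (including its `build_index` helper) are hypothetical.

```python
import ast
import builtins
from typing import Set


def undefined_called_names(source: str) -> Set[str]:
    """Return names called as plain functions but never bound in `source`.

    Heuristic only: scopes are ignored, and attribute calls such as
    obj.method() are not inspected.
    """
    tree = ast.parse(source)
    bound = set(dir(builtins))  # built-ins such as len() count as defined
    called = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            called.add(node.func.id)
    return called - bound


# Hypothetical generated snippet that depends on a helper it never defines.
SNIPPET = '''
def two_sum(nums, target):
    seen = build_index(nums)
    for i, n in enumerate(nums):
        j = seen.get(target - n)
        if j is not None and j != i:
            return [i, j]
    return []
'''

if __name__ == "__main__":
    print(undefined_called_names(SNIPPET))
```

A check of this kind only flags missing definitions; it says nothing about whether the rest of the suggestion is correct or secure.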

This paper is available on arXiv under a CC 4.0 license.

VI. RELATED WORK