Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code: Abstract & Intro

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Mohammed Latif Siddiq, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame;

(2) Joanna C. S. Santos, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame.

Abstract

With the growing popularity of Large Language Models (e.g., GitHub Copilot, ChatGPT, etc.) in software engineers’ daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers be more productive, prior empirical studies have shown that LLMs can generate insecure code. Two factors contribute to this insecure code generation. First, existing datasets used to evaluate Large Language Models (LLMs) do not adequately represent genuine software engineering tasks sensitive to security. Instead, they are often based on competitive programming challenges or classroom-type coding tasks. In real-world applications, the code produced is integrated into larger codebases, introducing potential security risks. There is a clear absence of benchmarks that focus on evaluating the security of the generated code. Second, existing evaluation metrics primarily focus on the functional correctness of the generated code while ignoring security considerations. Metrics such as pass@k gauge the probability of obtaining the correct code in the top k suggestions. Other popular metrics like BLEU, CodeBLEU, ROUGE, and METEOR similarly emphasize functional accuracy, neglecting security implications. In light of these research gaps, in this paper, we describe SALLM, a framework to systematically benchmark LLMs’ abilities to generate secure code. This framework has three major components: a novel dataset of security-centric Python prompts, an evaluation environment to test the generated code, and novel metrics to evaluate the models’ performance from the perspective of secure code generation.

1 Introduction

A code LLM is a Large Language Model (LLM) that has been trained on a large dataset consisting of both text and code. As a result, code LLMs can generate code written in a specific programming language from a given prompt. These prompts provide a high-level specification of a developer’s intent [34].


Prompts can include single-line or multi-line code comments, code expressions (e.g., a function definition), text, or a combination of these. Given a prompt as input, the LLM generates new tokens, one by one, until it reaches a stop sequence (i.e., a pre-configured sequence of tokens) or the maximum number of tokens is reached.
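The token-by-token loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate`, `make_scripted_model`, and the scripted "model" are hypothetical names introduced here, and a real LLM would sample each token from a probability distribution rather than replay a fixed script.

```python
from typing import Callable, List, Sequence


def generate(
    prompt_tokens: Sequence[str],
    next_token: Callable[[List[str]], str],
    stop_sequence: Sequence[str],
    max_new_tokens: int,
) -> List[str]:
    """Emit tokens one by one until the stop sequence or the token limit."""
    generated: List[str] = []
    stop = list(stop_sequence)
    for _ in range(max_new_tokens):
        # The model sees the prompt plus everything generated so far.
        token = next_token(list(prompt_tokens) + generated)
        generated.append(token)
        # Halt as soon as the output ends with the stop sequence.
        if stop and generated[-len(stop):] == stop:
            del generated[-len(stop):]  # do not return the stop sequence itself
            break
    return generated


def make_scripted_model(script: Sequence[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for an LLM: replays a fixed list of tokens."""
    tokens = iter(script)
    return lambda context: next(tokens, "<eos>")


prompt = ["def", "add", "(", "a", ",", "b", ")", ":"]
script = ["return", "a", "+", "b", "\n", "\n", "def"]

# Stops at the blank line (two newlines) before the limit is reached.
completion = generate(prompt, make_scripted_model(script), ["\n", "\n"], 50)

# Stops because the maximum number of new tokens is reached first.
truncated = generate(prompt, make_scripted_model(script), ["\n", "\n"], 2)
```

Here `completion` holds the four tokens of the function body, while `truncated` holds only the first two, which shows that either stopping condition can end generation.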


With the recent releases of GitHub Copilot [25] and ChatGPT [2], developers increasingly use LLM-based source code generation tools to reduce software development effort [77]. A recent survey of 500 US-based developers who work for large companies showed that 92% of them use LLMs to generate code for work and personal use [60]. Part of this rapid, widespread adoption is due to the increased productivity perceived by developers; LLMs help them automate repetitive tasks so that they can focus on higher-level, more challenging tasks [77].


Although LLM-based code generation techniques may produce functionally correct code, prior work has shown that they can also generate code with vulnerabilities and security smells [51, 52, 58]. A prior study has also demonstrated that training sets commonly used to train and/or fine-tune LLMs contain harmful coding patterns, which leak into the generated code [62]. Moreover, a recent study [52] with 47 participants showed that individuals who used the codex-davinci-002 LLM wrote code that was less secure than the code written by those who did not use it. Even worse, participants who used the LLM were more likely to believe that their code was secure, unlike their peers who did not use the LLM to write code.
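To make the distinction between functional correctness and security concrete, here is a small illustrative Python sketch. It is not taken from the paper or its dataset: the function names, the table layout, and the choice of SQL injection as the weakness are all assumptions made for this example. Both functions return the right answer for a benign input, yet only one of them is safe against a crafted input.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_insecure(conn: sqlite3.Connection, username: str) -> list:
    # Vulnerable: user input is interpolated directly into the SQL string.
    query = f"SELECT name FROM users WHERE name = '{username}'"
    return [row[0] for row in conn.execute(query)]


def find_user_secure(conn: sqlite3.Connection, username: str) -> list:
    # Safer: a parameterized query keeps the input out of the SQL syntax.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (username,))]


conn = make_db()

# A purely functional test cannot tell the two versions apart.
benign_insecure = find_user_insecure(conn, "alice")
benign_secure = find_user_secure(conn, "alice")

# A crafted input makes the WHERE clause always true in the insecure version.
payload = "' OR '1'='1"
leaked = find_user_insecure(conn, payload)
blocked = find_user_secure(conn, payload)
```

With the benign input both versions return the single matching user. With the crafted input, `leaked` contains every row in the table while `blocked` is empty, so a benchmark that only checks the benign case would score both versions as equally correct.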


There are two major factors contributing to this unsafe code generation. First, code LLMs are evaluated using benchmarks that do not include constructs to evaluate the security of the generated code [63, 75]. Second, existing evaluation metrics (e.g., pass@k [11], CodeBLEU [56], etc.) assess models’ performance with respect to their ability to produce functionally correct code while ignoring security concerns. Therefore, the performance reported for these models focuses heavily on how often the generated code passes the functional test cases of these benchmarks, without evaluating the security of the produced code.
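For reference, pass@k is commonly computed with an unbiased estimator: generate n samples per problem, count the c samples that pass the functional tests, and estimate the probability that at least one of k randomly drawn samples passes as 1 - C(n-c, k) / C(n, k). The sketch below implements that widely used formula; the function name and the example numbers are assumptions chosen for illustration, not values from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which pass the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the ways to draw k samples that all fail;
    # it is 0 when fewer than k failing samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 samples for one prompt, 3 pass the functional tests.
p1 = pass_at_k(n=10, c=3, k=1)   # 1 - 7/10
p5 = pass_at_k(n=10, c=3, k=5)   # 1 - 21/252
```

Note that c counts samples that pass functional tests only, so a sample that passes while containing a vulnerability still raises the score. That is the gap the paragraph above points out.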



The contributions of this paper are:


• A novel framework to systematically and automatically evaluate the security of LLM-generated code;


• A publicly available dataset of Python prompts [1];


• Two novel metrics (secure@k and vulnerability@k) and a demonstration of how to compute them statically and dynamically;


• A benchmarking of five LLMs (CodeGen-2B-mono, CodeGen-2.5-7B-mono, StarCoder, GPT-3.5, and GPT-4) using our framework.


The rest of this paper is organized as follows: Section 2 introduces the core concepts necessary to understand this paper. Section 3 describes our framework in detail. Section 4 describes the empirical investigation we performed to benchmark LLMs. Section 5 presents the results of our experiments. Section 6 explains SALLM’s limitations. Section 7 presents related work. Finally, Section 8 concludes this paper while describing plans for future work.




[1] The dataset will be made public on GitHub upon acceptance and submitted to the artifact evaluation track.