Many generative AI systems are very good at writing code, and developers increasingly use them to develop high-quality software faster and more efficiently than they could without these systems. But companies must take precautions when using AI to write code to ensure that they do not infringe third-party copyrights. We addressed general concerns related to IP risk and AI coding in a previous article. Here we focus on risks related to the use of open-source software (OSS).
The OSS Challenge
Generative AI systems are trained on massive troves of data, including large amounts of code. Some of the largest available sources of code are OSS repositories, such as the thousands of publicly available OSS projects on GitHub. In response to user prompts, AI systems may generate new code based on code from these repositories, sometimes changing it little or not at all. Moreover, it is difficult for users to know when open-source code has been incorporated into generated output, or where that code originally came from.
Nearly all popular OSS licenses contain conditions. “Permissive licenses” often require users to provide notice or attribution when they distribute the OSS code. “Copyleft licenses” (e.g., the “GNU General Public License” family) can include more onerous obligations, including the requirement to make works derived from the OSS available under an open-source license.
Companies that use AI to generate code must take care to ensure they do not unknowingly incorporate snippets of third-party code made available under an OSS license. If they do, they could be liable for copyright infringement. This is particularly important for companies that develop code for use in proprietary products.
Addressing Risks Before, During, and After Coding
Companies can take three main steps to reduce OSS-related risks when using AI to generate code.
- Before coding, set some boundaries. When possible, set configuration options on the AI system to reduce the likelihood that it will generate problematic code. GitHub Copilot, for example, allows users to enable a filter that prevents the system from suggesting code that is a close or exact match to public code that is available on GitHub. Companies should also consider putting limits on the length of AI-generated code snippets that developers are allowed to use. Smaller segments may be less likely to include third-party code that is entitled to copyright protection.
- While coding, tag code that was generated by AI. Companies should consider requiring developers to include tags in source code files to identify AI-generated code wherever it is included. This simple intervention can facilitate additional scrutiny in code reviews.
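To illustrate how such tags can feed into code review, the sketch below collects tagged regions from a source file so reviewers can give them extra scrutiny. The `# AI-GENERATED: begin` / `# AI-GENERATED: end` comment markers and the helper function are hypothetical conventions chosen for this example, not an industry standard; any tagging scheme a company adopts consistently would serve the same purpose.

```python
# Sketch of a review aid for a hypothetical AI-code tagging convention.
# The "AI-GENERATED" markers are illustrative, not an industry standard.

BEGIN_TAG = "# AI-GENERATED: begin"
END_TAG = "# AI-GENERATED: end"

def find_ai_regions(source_lines):
    """Return (start, end) line-number pairs for tagged AI-generated regions."""
    regions = []
    start = None
    for lineno, line in enumerate(source_lines, start=1):
        stripped = line.strip()
        if stripped.startswith(BEGIN_TAG):
            start = lineno
        elif stripped.startswith(END_TAG) and start is not None:
            regions.append((start, lineno))
            start = None
    return regions

example = """\
def helper():
    return 1

# AI-GENERATED: begin (tool name and date could be recorded here)
def parse_input(raw):
    return raw.split(",")
# AI-GENERATED: end
""".splitlines()

print(find_ai_regions(example))  # → [(4, 7)]
```

A report like this could also be used to enforce the snippet-length limits discussed above, by flagging any tagged region that exceeds a maximum number of lines.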
- Prior to production, use scanning tools to vet code generated by AI. Companies can deploy scanning tools to detect open-source and other third-party code that may be included in a code base. Many software composition analysis (SCA) tools can identify whole third-party files by matching file hashes and can detect open-source licenses by matching keywords, but options for identifying unlabeled snippets of third-party code are more limited. SCA tools that do perform snippet matching, such as Black Duck SCA and Revenera Code Insight, can accommodate generative AI use cases.
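For intuition, the keyword-matching technique that SCA tools use for license detection can be sketched in a few lines. The keyword list and function below are a deliberately simplified illustration of the concept and are no substitute for a real SCA tool, which maintains far larger license databases and also performs hash and snippet matching.

```python
# Simplified illustration of keyword-based license detection, one technique
# SCA tools use. Not a substitute for a real SCA scan.

LICENSE_KEYWORDS = {
    "GNU General Public License": "GPL (copyleft)",
    "GNU Lesser General Public License": "LGPL (weak copyleft)",
    "Apache License": "Apache (permissive)",
    "MIT License": "MIT (permissive)",
}

def detect_licenses(text):
    """Return sorted labels for each license keyword found in the text."""
    return sorted(
        label
        for keyword, label in LICENSE_KEYWORDS.items()
        if keyword.lower() in text.lower()
    )

header = """\
# This file is distributed under the terms of the
# GNU General Public License, version 3 or later.
"""
print(detect_licenses(header))  # → ['GPL (copyleft)']
```

A hit from a check like this on AI-generated code would be a signal to stop and investigate the code's provenance before shipping it in a proprietary product.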
* * *
AI is making it easier for companies to generate code and develop software, creating new opportunities for business innovation but also introducing risks that can have significant implications.