Does GitHub Copilot Really Copy-Paste Code? An In-Depth AI Analysis

This article looks at GitHub Copilot’s AI-driven code suggestions, the Codex model behind them, its multi-language support, and its performance on LeetCode challenges. It then turns to the controversy over code copying: a classification of the observed cases, the statistics behind them, and what they show about how rare actual code recitation is and what form it takes.

Programmer DD

Jul 6, 2021

Introduction

GitHub Copilot is built on OpenAI’s Codex model, which was trained on terabytes of public code from GitHub along with English-language text. According to GitHub, it draws on strings, comments, function names, and the surrounding code to generate matching snippets, and it supports languages such as Python, JavaScript, TypeScript, Ruby, and Go.

Performance on LeetCode

Early users tested Copilot on LeetCode problems and reported that its solutions passed the test cases and were generated almost in real time, leading some to claim that AI may write code better than humans. Skeptics pointed out that the generated comments and templates closely resembled existing LeetCode solutions, which suggests those solutions may have been part of the training data.

Copy‑Paste Allegations

Critics accused Copilot of simply “copy‑pasting” well‑known code, such as the fast inverse square‑root algorithm (magic number 0x5f3759df), and even GPL‑licensed snippets. GitHub responded that verbatim copying from the training set occurs in about 0.1% of suggestions and that most output is original.
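For readers who have not seen it, the fast inverse square‑root trick is short enough to show in full. The following is a Python port written for illustration only; it is neither the original C code nor Copilot output, and the function name `fast_inverse_sqrt` and the `struct`-based bit reinterpretation are choices made for this sketch.

```python
import struct


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for a positive x representable as float32."""
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The well-known magic constant minus half the bit pattern gives a first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer back as a 32-bit float.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


if __name__ == "__main__":
    print(fast_inverse_sqrt(4.0))  # close to 0.5
```

Because the constant and the structure are so distinctive, a verbatim reproduction of this routine is easy to recognize, which is why it became a popular example in the debate.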

Case Classification

Albert Ziegler, a GitHub researcher, collected 453,780 Python suggestions made by Copilot, flagged those that overlapped with the training data, and sorted the flagged suggestions into five groups:

Category 1 – Near‑duplicate suggestions after a prior accepted suggestion.

Category 2 – Long, repetitive sequences (e.g., repeated <p> tags).

Category 3 – Standard lists such as natural numbers, primes, or Greek letters.

Category 4 – Common solutions for low‑entropy tasks (e.g., using BeautifulSoup to parse Wikipedia lists).

Category 5 – Cases that match training data closely enough to be considered “recitation”.
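One way to picture what "matching training data closely enough" could mean is a token-level overlap check. The sketch below is an assumption for illustration, not Ziegler's actual method: the tokenizer, the helper names `tokenize`, `longest_shared_run`, and `looks_recited`, and the `min_tokens` threshold are all hypothetical.

```python
import re


def tokenize(code: str) -> list[str]:
    # Identifiers, digit runs, and single non-space characters; whitespace is ignored.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def longest_shared_run(suggestion: str, known: str) -> int:
    """Length of the longest contiguous token sequence present in both snippets."""
    a, b = tokenize(suggestion), tokenize(known)
    best = 0
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the run that ended at the previous token pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def looks_recited(suggestion: str, corpus: list[str], min_tokens: int = 60) -> bool:
    # min_tokens is an illustrative threshold, not a value taken from the study.
    return any(longest_shared_run(suggestion, doc) >= min_tokens for doc in corpus)
```

A check like this only finds candidates. As the categories show, a long overlap can still be a repeated tag sequence or a standard list rather than recited code, so a manual pass is needed to separate Category 5 from Categories 2‑4.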

Statistical Findings

After removing Category 1, 185 suggestions remained for manual review. Of these, 144 fell into Categories 2‑4, leaving 41 cases in Category 5, which Ziegler regards as true code recitation. Most of these 41 cases appear in more than 100 different files, so the recited code was already widely duplicated.
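The counts above can be checked with a few lines of arithmetic. The ratio computed at the end is a naive share of all collected suggestions, worked out here for orientation; it is not a metric reported in the study, and the variable names are made up for this sketch.

```python
total_suggestions = 453_780          # Python suggestions collected
after_removing_category_1 = 185      # left once Category 1 is removed
categories_2_to_4 = 144              # repetitive sequences, standard lists, common solutions

# What remains is Category 5, the cases treated as recitation.
category_5 = after_removing_category_1 - categories_2_to_4

# Naive share of all collected suggestions (an assumption about the denominator).
recitation_share = category_5 / total_suggestions

print(category_5)                    # 41
print(f"{recitation_share:.4%}")     # 0.0090%
```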

Conclusion

Ziegler concludes that Copilot can occasionally reproduce exact code fragments, but that this is rare and usually involves widely used snippets. He recommends that the UI indicate when a suggestion duplicates training data, so that users can attribute the code or reject it.
