Why Did GPT-4’s Performance Plummet Between March and June 2023?
A Stanford‑Berkeley study reveals that between March and June 2023 GPT‑4’s accuracy on prime‑checking fell from 97.6% to 2.4%, code generation quality dropped sharply, and sensitivity handling changed, underscoring the rapid, unpredictable shifts in large language model performance over short periods.
The Stanford University and University of California, Berkeley collaboration titled “How Is ChatGPT's Behavior Changing Over Time?” evaluated GPT‑3.5 and GPT‑4 versions from March 2023 and June 2023 on four tasks: solving math problems, answering sensitive/dangerous questions, code generation, and visual reasoning.
Using a 500‑question dataset that required the model to determine whether a given integer is prime, GPT‑4 (March 2023) answered 488 correctly (97.6% accuracy). In contrast, GPT‑4 (June 2023) answered only 12 correctly (2.4%). GPT‑3.5 (June 2023) performed markedly better than its March counterpart.
The researchers also applied “Chain‑of‑Thought” prompting, asking “Is 17077 a prime? Think step by step.” GPT‑4 (June 2023) incorrectly answered “No” and failed to produce any intermediate reasoning steps.
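For reference, 17077 is in fact prime, so the correct answer is “Yes.” A short trial‑division check confirms this (a generic sketch for illustration, not the study’s grading code):

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); fast enough for small inputs like 17077."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True -> the correct answer is "Yes"
```

Because the grading is binary, a model that answers without any real reasoning can still be scored, which is how a drop from 97.6% to 2.4% on the same 500 questions can be measured cleanly.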
Compared with March, GPT‑4 in June was less willing to answer sensitive questions, and both GPT‑4 and GPT‑3.5 generated more code with formatting errors. Directly executable code dropped from 52.0% to 10.0% for GPT‑4 and from 22.0% to 2.0% for GPT‑3.5, while verbosity increased (GPT‑4’s responses grew by about 20%).
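The “directly executable” metric counts a response as passing only if the generated code runs as‑is, with no manual cleanup. A minimal sketch of that kind of check (the function name and the subprocess approach are assumptions for illustration, not the study’s actual harness):

```python
import os
import subprocess
import sys
import tempfile

def is_directly_executable(code: str, timeout: float = 5.0) -> bool:
    """Run generated Python in a fresh interpreter; pass iff exit code is 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Leaving markdown fences like ```python ... ``` in the output is a
# SyntaxError when run verbatim -- the formatting regression described above.
print(is_directly_executable("print('hello')"))               # True
print(is_directly_executable("```python\nprint('hi')\n```"))  # False
```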
In visual reasoning, both models showed slight improvement, but for over 90% of queries the outputs were identical across versions, with overall low performance (GPT‑4 27.4%, GPT‑3.5 12.2%). On some individual queries, GPT‑4’s June answers were worse than its March ones.
Researchers conclude that the behavior of “identical” LLM services can change dramatically within a short time frame, highlighting the necessity of continuous quality monitoring. They plan to keep evaluating GPT‑3.5, GPT‑4, and other LLMs in an ongoing longitudinal study and advise users and companies that rely on LLMs to perform similar monitoring analyses.
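The monitoring analysis the authors recommend can be as simple as re‑running a fixed benchmark against each model snapshot and flagging accuracy drift. A hedged sketch of that loop (`query_model` is a hypothetical stand‑in for your LLM API call; the 5% tolerance is an arbitrary example threshold):

```python
from typing import Callable

def benchmark_accuracy(query_model: Callable[[str], str],
                       dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose response contains the expected answer."""
    hits = sum(1 for prompt, expected in dataset
               if expected.lower() in query_model(prompt).lower())
    return hits / len(dataset)

def check_drift(old_acc: float, new_acc: float,
                tolerance: float = 0.05) -> bool:
    """Flag a regression when accuracy drops by more than `tolerance`."""
    return (old_acc - new_acc) > tolerance

# The prime-checking numbers from the study: 97.6% in March vs. 2.4% in June
print(check_drift(0.976, 0.024))  # True -> regression detected
```

Running the same fixed prompt set on every snapshot is what lets a change like the prime‑checking collapse be attributed to the model rather than to the evaluation.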
More details can be found in the full report: https://arxiv.org/pdf/2307.09009.pdf
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"