Can GPT‑4 Be Considered Early AGI? Insights from Microsoft’s 155‑Page Study

This article reviews Microsoft’s extensive 155‑page work on early experiments with GPT‑4, exploring how the model approaches artificial general intelligence, its testing methodology, multimodal capabilities, programming and mathematical performance, interaction with tools and humans, limitations, societal impact, and future research directions.

21CTO

Apr 2, 2023

Welcome to the first issue of Paper Sync, a column that reads and discusses notable AI papers. This issue covers Microsoft’s 155‑page paper Sparks of Artificial General Intelligence: Early experiments with GPT‑4. Because of its length, our discussion is split into two parts.

Intelligence is a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test‑taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings – “catching on”, “making sense” of things, or “figuring out” what to do.

How to Define AGI

Microsoft’s researchers argue that GPT‑4’s performance is strikingly close to human level and that the model can be viewed as an early (though still incomplete) version of an artificial general intelligence (AGI) system. They adopt the definition of intelligence from a 1994 statement signed by 52 psychologists: “Intelligence is a general mental ability that includes reasoning, planning, problem solving, abstract thinking, understanding complex ideas, rapid learning, and learning from experience.” Under this definition, AGI refers to systems that show these capabilities at or above human level.

Testing and Presentation

Instead of traditional benchmarks like Super‑Natural Instructions or BIG‑bench, the authors propose a psychology‑inspired evaluation using human creativity to generate novel, difficult tasks. They categorize questions into four major groups (natural language, programming & math, planning & problem solving, human psychology & common sense) and six sub‑abilities, also discussing GPT‑4’s limitations, societal impact, and future directions.

Multimodal

The early GPT‑4 model was trained on pure text and not multimodal data. However, after fine‑tuning it can understand visual input. It can also draw through code, generating SVG or JavaScript that renders into images. Example prompts are shown below.

A frog hops into a bank and asks the teller, ‘Do you have any free lily pads?’ The teller responds, ‘No, but we do offer low interest loans for pond upgrades.’

A fantasy landscape of floating islands, waterfalls, and bridges, with a dragon flying in the sky and a castle on the largest island.
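To make "drawing through code" concrete, here is a minimal sketch of the kind of SVG a text-only model could emit for the frog prompt above. The scene layout, colors, and the function names `frog_scene_svg` and `is_well_formed` are illustrative assumptions, not output from GPT‑4 or code from the paper.

```python
import xml.etree.ElementTree as ET


def frog_scene_svg(width: int = 200, height: int = 120) -> str:
    """Build a tiny hand-written SVG scene: a bank with a frog on a lily pad.

    Everything here (shapes, positions, colors) is a made-up illustration.
    """
    shapes = [
        # Background
        f'<rect width="{width}" height="{height}" fill="#cfe8ff"/>',
        # Bank building and its sign
        '<rect x="110" y="30" width="70" height="60" fill="#d9d9d9"/>',
        '<text x="125" y="50" font-size="12">BANK</text>',
        # Lily pad and frog
        '<ellipse cx="50" cy="100" rx="35" ry="10" fill="#2e8b57"/>',
        '<circle cx="50" cy="85" r="14" fill="#6abf4b"/>',
        '<circle cx="44" cy="74" r="4" fill="white"/>',
        '<circle cx="56" cy="74" r="4" fill="white"/>',
    ]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">' + "".join(shapes) + "</svg>"
    )


def is_well_formed(svg_text: str) -> bool:
    """Return True if the text parses as XML (a cheap sanity check)."""
    try:
        ET.fromstring(svg_text)
        return True
    except ET.ParseError:
        return False
```

Saving the returned string to a file with an `.svg` extension and opening it in a browser renders the picture. The image exists only as text until a renderer interprets it, which is why a model trained on text can still produce it.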

Cross‑disciplinary Combination Ability

The model’s ability to integrate knowledge from multiple domains is demonstrated through tasks that combine skills from different fields, such as writing a proof of the infinitude of prime numbers in the style of a Shakespeare play.
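For readers who want the mathematics behind that task, the standard argument is Euclid's. The block below is a textbook sketch of that proof, not the text GPT‑4 produced.

```latex
\textbf{Claim.} There are infinitely many primes.

\textbf{Proof sketch.} Suppose, for contradiction, that $p_1, p_2, \dots, p_n$
are all the primes, and let
\[
  N = p_1 p_2 \cdots p_n + 1 .
\]
Since $N > 1$, it has some prime divisor $q$. By assumption $q = p_i$ for
some $i$, so $q$ divides both $N$ and $p_1 p_2 \cdots p_n$, and therefore
\[
  q \mid N - p_1 p_2 \cdots p_n = 1 ,
\]
which is impossible for a prime. Hence no finite list contains every prime.
\qed
```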

Programming

GPT‑4 shows strong programming abilities, including the ability to simulate the execution of code directly from its source. On 100 new LeetCode problems published after GPT‑4’s pre‑training ended, the authors report pass@1 and pass@5 results comparable to those of human LeetCode users and well above those of earlier models.
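The pass@k metric counts a problem as solved if at least one of k generated solutions passes all tests. A common way to compute it is the unbiased estimator introduced with the HumanEval benchmark, sketched below. Whether the GPT‑4 study used this exact estimator or simply counted successes among the first k attempts is an assumption here, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of solutions sampled for the problem
    c: number of those samples that passed all tests
    k: attempt budget being evaluated (1 <= k <= n)

    Formula: 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that k samples drawn without replacement are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `n == k`, the estimator reduces to the simple rule: 1.0 if any sample passed, otherwise 0.0.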

Mathematical Ability

GPT‑4 handles basic arithmetic and simple algebra correctly but struggles with more complex polynomial problems, where it often makes calculation errors. Carefully crafted prompts that have it write out intermediate steps can lead it to correct answers, yet it remains far from expert‑level mathematical reasoning.

Interaction with the World

When prompted with a description of the available tools, GPT‑4 can decide on its own to invoke external APIs. The behavior resembles the Toolformer approach, but it lets the model solve tool‑dependent problems without any additional training.
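The prompting pattern can be sketched as a small loop: the model emits a textual tool call, a harness runs it, and the result is appended to the transcript for the next turn. The code below is a hypothetical harness, not the paper's setup. The tool names `CALC` and `CHARACTER` loosely follow the function-call style used in such prompts, while `model_step`, `fake_model`, and the `RESULT:` convention are assumptions made for illustration.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_calc(expr: str):
    """Evaluate +, -, *, / on numeric literals without using eval()."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))


# Matches a tool call at the end of the model's output, e.g. CALC(1 + 2).
TOOL_CALL = re.compile(r"\b(CALC|CHARACTER)\((.*)\)\s*$")


def run_tool(output: str):
    """Execute a tool call found in the output; return None if there is none."""
    match = TOOL_CALL.search(output.strip())
    if match is None:
        return None
    name, arg = match.groups()
    if name == "CALC":
        return str(safe_calc(arg))
    # CHARACTER("string", index) returns one character of the string.
    text, index = ast.literal_eval(f"({arg})")
    return text[index]


def answer_with_tools(model_step, question: str, max_turns: int = 5) -> str:
    """Alternate between the model and the tools until a final answer appears."""
    transcript = question
    for _ in range(max_turns):
        output = model_step(transcript)
        result = run_tool(output)
        if result is None:
            return output  # No tool call: treat the output as the answer.
        transcript += f"\n{output}\nRESULT: {result}"
    raise RuntimeError("no final answer within max_turns")


def fake_model(transcript: str) -> str:
    """Stand-in for a real model: asks for one calculation, then answers."""
    if "RESULT:" not in transcript:
        return "CALC(7 * 4 + 8 * 8)"
    return "The answer is " + transcript.rsplit("RESULT: ", 1)[1]
```

Calling `answer_with_tools(fake_model, "What is 7 * 4 + 8 * 8?")` drives one tool round trip with the stub. Swapping in a real model only requires a function that maps the transcript so far to the next piece of text.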

Human Interaction (Theory of Mind)

The authors evaluate GPT‑4’s Theory of Mind capabilities, showing that the model can infer the mental states, goals, and preferences of conversation participants, and can provide self‑consistent explanations for its answers.

Discrimination Ability

GPT‑4 can identify personally identifiable information (PII) in text with an accuracy of 77.4%, outperforming specialized privacy‑preserving tools such as Presidio.

Limitations

Despite its impressive capabilities, GPT‑4 has fundamental limitations that the authors trace to its autoregressive training objective. The model produces output sequentially and linearly, which resembles fast, intuitive thinking (System 1), and it lacks the planning and reflection of slow, deliberate thinking (System 2). Examples include failures on a Tower of Hanoi planning task and grammatical errors in generated text.

The authors cite LeCun’s proposed framework as a possible direction to address these shortcomings.
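The Tower of Hanoi shows why planning matters: each move is only correct in light of the moves that must follow it. The sketch below gives the classic recursive solution for the standard three-peg puzzle, plus a checker that replays a move list. The paper's exact puzzle instance may differ, and the peg labels and function names are assumptions.

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Return the (from_peg, to_peg) moves that transfer n disks to target."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # clear the n-1 smaller disks
        + [(source, target)]                   # move the largest disk
        + hanoi(n - 1, spare, target, source)  # restack the smaller disks
    )


def is_legal(n: int, moves, source: str = "A", target: str = "C",
             spare: str = "B") -> bool:
    """Replay moves and check that they legally solve the n-disk puzzle."""
    # Disks are numbered by size; the end of each list is the top of the peg.
    pegs = {source: list(range(n, 0, -1)), target: [], spare: []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from this peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[target] == list(range(n, 0, -1))
```

The recursion needs 2^n - 1 moves for n disks, so `len(hanoi(3))` is 7. A solver that commits to moves one at a time, with no lookahead, has no comparable way to know where the smaller disks need to go first.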

Social Impact

The paper discusses potential societal harms of GPT‑4, including misinformation, malicious manipulation, bias, and impacts on professional knowledge, employment, and the economy.

Direction and Future

To move large language models toward more general AI, the authors point to several areas for future work: reducing hallucination, adding long‑term memory and continual learning, supporting personalization, improving planning and concept‑driven creativity, increasing transparency, interpretability, and consistency, reducing cognitive biases and irrational reasoning, and making the model more robust to variations in prompts.

References
[1] Mainstream science on intelligence: An editorial with 52 signatories, history, and bibliography, 1997.
[2] Super‑NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
[3] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
[4] GPT‑4 Technical Report
[5] PaLM‑E: An Embodied Multimodal Language Model
[6] Toolformer: Language Models Can Teach Themselves to Use Tools
[7] Theory of Mind May Have Spontaneously Emerged in Large Language Models
[8] Privacy protection with AI: Survey of data‑anonymization techniques
[9] Yann LeCun. A path towards autonomous machine intelligence. Open Review, 2022.
[10] Reflexion: an autonomous agent with dynamic memory and self‑reflection
[11] GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
