How OpenAI’s o1 Models Impact Developers: Performance, Limits, Cost, and Prompting

This article evaluates OpenAI's o1 series (o1-preview, o1-mini, and the upcoming full model) against GPT-4o: stronger complex reasoning, slower inference, higher pricing, and tighter API restrictions, plus the prompting practices the new models call for, to help developers decide when to adopt them.


Performance

Applicable Scenarios

The o1 series uses reinforcement learning during training and a pre‑thinking token phase during inference, which significantly improves complex reasoning ability and is described as a “logic expert” in industry benchmarks. For creative generation tasks, its performance does not surpass the previous‑generation GPT‑4o, so developers should evaluate their specific use cases before switching.

Model Versions

Two versions are currently available: o1-preview and o1-mini, with a full-scale o1 planned for later release. o1-preview (and the eventual full o1) has broad world knowledge and targets complex reasoning. o1-mini is smaller, faster, and cheaper, making it better suited for programming, mathematics, and other tasks that do not require extensive general knowledge.

Inference Speed

Because o1 generates a batch of pre‑thinking tokens before producing the final output, its time‑to‑first‑token (TTFT) is longer than GPT‑4o's, and its effective tokens‑per‑second (TPS) is inherently lower.

Benchmark on a Simplified-to-Traditional Chinese conversion task (effective token counts, pre‑thinking tokens excluded):

gpt-4o: inference time 6.6 s, tokens 564.3, TPS 86.0
o1-preview: inference time 28.8 s, tokens 579.0, TPS 20.4
o1-mini: inference time 11.9 s, tokens 578.7, TPS 49.2

When pre‑thinking tokens are included, the totals become:

o1-preview: inference time 28.8 s, total tokens 2392.3, total TPS 82.8
o1-mini: inference time 11.9 s, total tokens 1624.0, total TPS 138.1
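
As a sanity check, TPS here is just token count divided by wall-clock inference time. The quick script below reproduces the figures above; small deviations from the reported TPS come from rounding in the published times:

```python
# Reproduce the TPS figures above: tokens divided by wall-clock seconds.
# Small gaps versus the reported TPS come from rounding in the published times.
runs = [
    ("gpt-4o (effective)",     564.3,  6.6),
    ("o1-preview (effective)", 579.0, 28.8),
    ("o1-mini (effective)",    578.7, 11.9),
    ("o1-preview (total)",    2392.3, 28.8),
    ("o1-mini (total)",       1624.0, 11.9),
]
for name, tokens, seconds in runs:
    print(f"{name}: {tokens / seconds:.1f} tokens/s")
```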

Multiple test rounds show noticeable variance in o1’s results, likely due to the stochastic nature of the pre‑thinking phase.

The inference‑speed disadvantage can be mitigated by cloud‑side scheduling; a dedicated article will explore this later.

Overall, the slower speed means o1 is not a “one‑size‑fits‑all” model and is unsuitable for direct user‑facing conversational scenarios.

Limitations

o1 is still in testing and lacks capabilities available in ChatGPT, such as web search, file upload, image recognition, and powering custom GPTs.

No streaming output: The pre‑thinking phase and hidden tokens block streaming, making real‑time chat use cases impractical.

No system prompt support: System prompts are used internally for pre‑thinking and are not exposed via the API; developers must move system‑prompt logic into user messages.
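
One workaround is to fold the would-be system prompt into the first user message. A minimal sketch using the official openai Python SDK; the model name and instruction text are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The instructions that would normally go in a system message...
system_text = "You are a meticulous code reviewer. Answer concisely."
# ...get prepended to the user message instead, since o1 rejects the system role.
user_text = "Review this function for off-by-one errors: ..."

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": f"{system_text}\n\n{user_text}"}],
)
print(response.choices[0].message.content)
```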

No tool calls or structured output: These features are unavailable in the current release.

No multimodal capability: Although o1 is reportedly multimodal, image, video, and audio handling are not yet open.

API parameter restrictions: Parameters such as logprobs, temperature, top_p, and n are either fixed or unavailable.

Cost

Pricing (USD per million tokens), compared with the GPT‑4o series:

gpt-4o: input $2.50, output $10.00
gpt-4o-mini: input $0.15, output $0.60
o1-preview: input $15.00, output $60.00
o1-mini: input $3.00, output $12.00

While o1-mini is cheaper than o1-preview, the o1 series remains more expensive than the GPT‑4o lineup.

o1 supports prompt caching, which can reduce costs in many scenarios.

Pre‑thinking tokens are counted as output tokens and billed, so developers pay for tokens that are not visible.
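
To make the billing concrete, here is a hypothetical cost calculation for a single o1-preview call using the prices above; the token counts are invented for illustration:

```python
# Hypothetical o1-preview call; prices per million tokens from the table above.
INPUT_PRICE, OUTPUT_PRICE = 15.00, 60.00

prompt_tokens = 1_000
visible_completion_tokens = 600
reasoning_tokens = 1_800  # hidden pre-thinking tokens, still billed as output

billed_output = visible_completion_tokens + reasoning_tokens
cost = (prompt_tokens * INPUT_PRICE + billed_output * OUTPUT_PRICE) / 1_000_000
print(f"${cost:.4f}")  # $0.1590, about two thirds of it for tokens you never see
```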

API Protocol Changes

Input Parameter Changes

The former max_tokens parameter is renamed to max_completion_tokens for o1.

Only user and assistant message roles are accepted; using system results in an error because system prompts are not exposed.
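
A minimal sketch of a request under the new protocol, again with the official Python SDK (the prompt is illustrative):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-mini",
    # max_tokens is rejected for o1; use max_completion_tokens, which caps
    # visible output and hidden reasoning tokens together.
    max_completion_tokens=4000,
    # Only "user" and "assistant" roles are accepted; a "system" message errors out.
    # Sampling controls such as temperature, top_p, n, and logprobs are fixed
    # or unavailable, so they are simply omitted here.
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)
```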

Output Field Changes

The response now includes usage.completion_tokens_details.reasoning_tokens, which reports the number of pre‑thinking tokens used. This confirms that those tokens are billed as part of the completion.
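
Continuing the sketch above, the hidden token count can be read off the usage object returned by the SDK:

```python
usage = response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens

print("billed completion tokens:", usage.completion_tokens)
print("  hidden reasoning tokens:", reasoning)
print("  visible answer tokens:  ", usage.completion_tokens - reasoning)
```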

Prompt Engineering for o1

Because o1 already performs internal chain‑of‑thought reasoning, traditional prompting tricks need adjustment:

Keep prompts simple and direct. Overly complex instructions can interfere with the model’s built‑in reasoning.

Avoid explicit chain‑of‑thought prompts. The model’s internal reasoning makes such prompts redundant.

Use delimiters (code blocks, XML tags, headings) to separate sections. This helps the model parse the prompt structure.

In Retrieval‑Augmented Generation (RAG) scenarios, limit extraneous context. Excess unrelated knowledge can distract the model.
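
Putting these tips together, here is a sketch of an o1-friendly prompt: terse instruction, delimiter-separated sections, no explicit chain-of-thought request, and only the retrieved passages that matter. The tag names are just a convention, not anything the API requires:

```python
# Keep the instruction terse; let the model's internal reasoning do the work.
instruction = "Answer the question using only the context below. Cite the passage you used."

# RAG: include only the top-ranked relevant passages, not the whole corpus.
context_passages = [
    "Passage 1: o1 models bill hidden reasoning tokens as output tokens.",
    "Passage 2: o1-mini targets coding and math workloads at lower cost.",
]

question = "Why can an o1 response cost more than its visible length suggests?"

# XML-style delimiters make the prompt structure unambiguous.
prompt = (
    f"{instruction}\n\n"
    f"<context>\n" + "\n".join(context_passages) + "\n</context>\n\n"
    f"<question>\n{question}\n</question>"
)
```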

Conclusion

o1 models demonstrate strong complex reasoning but currently suffer from slower speed, higher cost, and several API and feature limitations. Developers should weigh these trade‑offs against their specific requirements before adopting o1, and anticipate that future updates may alleviate many of the current drawbacks.

