Artificial Intelligence 13 min read

Turning Base Models from Semi‑Finished to Killer AI Products: A PM’s Playbook

The article breaks down how AI product managers can transform a raw base model into a market‑ready, high‑impact product by applying supervised fine‑tuning, tool‑use routing, RLHF alignment, and chain‑of‑thought reasoning, while highlighting trade‑offs, cost shifts, and evaluation metrics.

PMTalk Product Manager Community

Jan 5, 2026

Turning Base Models from Semi‑Finished to Killer AI Products: A PM’s Playbook

Supervised Fine‑Tuning (SFT)

During pre‑training the model learns a next‑token prediction objective on massive web crawls. SFT keeps the same objective but replaces the data with tightly formatted <User, Assistant> dialogue pairs. This embeds interaction design directly into the weights, enabling behaviours such as polite refusal or strict adherence to coding standards (e.g., PEP‑8). Because the dataset is small (typically 10 k–100 k examples) the cost centre shifts from GPU clusters to hiring domain experts to author high‑quality, vertical‑specific examples. The trade‑off is that SFT cannot inject new factual knowledge: if the base model has never seen a concept, SFT will only teach it how to *talk* about the concept, not the concept itself. Forcing the model to answer unknown questions leads to “hallucination” – the model learns to answer confidently while fabricating content.

Data quality over quantity: expert‑written examples outweigh noisy web data; in medical AI, clinicians must author the reference answers.

Cost structure change: budget moves from GPU hardware to subject‑matter‑expert salaries.

Limitation: SFT cannot create knowledge; new facts must be supplied via retrieval‑augmented generation (RAG) or tool calls.

Tool Use – Turning the Model into a Router

Post‑training acknowledges the model’s blind spot – it often "doesn’t know what it doesn’t know." The product logic becomes a routing pipeline:

User query → Intent recognition → Decision to call Search / Calculator / Code interpreter → Retrieve factual data → Synthesize answer

Old logic: query → model generates answer from memory → possible hallucination. New logic inserts an explicit tool‑call decision . Success is measured by tool‑call accuracy rather than raw token throughput. PMs should build test suites that verify, for example, that a weather query triggers the weather API and that a code‑generation request invokes the code interpreter.

Reinforcement Learning from Human Feedback (RLHF)

RLHF adds a Reward Model (RM) that learns human preferences from ranked answer pairs (A > B). The workflow is:

Human annotators rank two model outputs.

Train a smaller RM to predict the ranking.

Use the RM to score the main LLM during reinforcement‑learning updates, creating an automated, continuous A/B testing loop.

This aligns the model’s distribution toward traits such as helpfulness, harmlessness, and honesty. Two known risks are highlighted:

Goodhart’s Law: when a metric becomes the target, it ceases to be a good proxy for quality.

Reward hacking: the model may learn to produce verbose but meaningless text that scores highly on the RM.

Mitigation strategies include adding diversity constraints to the RLHF dataset and monitoring for over‑politeness or loss of personality.

Thinking & Reasoning – System 1 vs System 2

2025 saw the rise of reasoning‑oriented models (e.g., DeepSeek‑R1, OpenAI o1). Inspired by AlphaGo’s self‑play, these models perform internal chain‑of‑thought before emitting an answer, allowing self‑correction and multi‑step problem solving.

AlphaGo analogy: the model plays against itself, receiving a binary win/loss signal; in math or code the signal is “does the code run?” or “is the answer correct?”

Chain‑of‑Thought internalization: the model generates a hidden “thinking” sequence (often thousands of tokens) that is not part of the user‑visible output.

System 1 (fast): suitable for casual chat, email drafting – low latency, low hidden token cost.

System 2 (slow): required for complex logic, long code generation, legal analysis – incurs extra hidden token compute (e.g., 5 000 hidden tokens for a 100‑token answer), raising latency and cost.

Product managers must design dynamic model routing: assess query difficulty, then dispatch either a System 1 model for cheap, quick responses or a System 2 reasoning model when accuracy outweighs cost. The hidden token economics raise questions about pricing – whether to absorb the extra compute into per‑turn fees or to charge based on outcome effectiveness in B‑to‑B scenarios.

Integrated Post‑Training Stack

The four pillars – SFT for interaction norms, tool use for factual grounding, RLHF for user‑aligned behaviour, and reasoning (System 2) for deep problem solving – together form a modular stack. Rather than chasing ever larger base models, the engineering focus shifts to:

Curating high‑quality SFT data for the target domain.

Implementing reliable intent classifiers and tool‑call orchestration.

Training and monitoring a reward model to avoid Goodhartian collapse.

Deploying a hybrid inference layer that selects between fast and deliberative models based on cost‑benefit analysis.

This workflow reflects the technical reality of 2025 AI product development.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

artificial-intelligence reasoning product management SFT RLHF Tool Use

Written by

PMTalk Product Manager Community

One of China's top product manager communities, gathering 210,000 product managers, operations specialists, designers and other internet professionals; over 800 leading product experts nationwide are signed authors; hosts more than 70 product and growth events each year; all the product manager knowledge you want is right here.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Supervised Fine‑Tuning (SFT)

Tool Use – Turning the Model into a Router

Reinforcement Learning from Human Feedback (RLHF)

Thinking & Reasoning – System 1 vs System 2

Integrated Post‑Training Stack

PMTalk Product Manager Community

How this landed with the community

Was this worth your time?

0 Comments

Thinking & Reasoning – System 1 vs System 2