AI Product Managers Have Stopped Sketching Wireframes – Here’s Why
The article explains how AI product managers have shifted from creating prototype diagrams to designing continuous evaluation “exams”, using real‑world examples, data‑driven testing, cross‑team collaboration, and iterative error analysis to deliver truly useful AI products.
When a friend who had switched to AI product management was asked how many prototype diagrams he had produced, he surprised the asker by pulling out a stack of test-case sheets instead. The anecdote reveals a common misconception: many still view AI product managers through the lens of traditional internet product roles that revolve around PRDs and pixel-perfect mockups.
In reality, AI products are more like raising a child than assembling Lego blocks. You cannot set a fixed requirement such as “the model must grow to 1.8 m”; instead, you must repeatedly evaluate and adjust the model’s behavior.
Consider a smart‑customer‑service bot built with an open‑source API that can answer simple queries like “What are the opening hours?” – this is a 60‑point product: it works, but it fails on nuanced questions such as “Which return policy is better, ours or the competitor’s?” A 100‑point product must handle such “out‑of‑scope” questions reliably and consistently.
AI product managers therefore spend their time designing “exams” for the AI. They categorize questions into basic functional items (e.g., checking the weather), logical reasoning tasks (e.g., planning an optimal route that avoids rush hour and passes a market), and edge‑case “trick” questions (e.g., a string of emojis like "🚗+⛽️+🗺️=?").
Data sets become the question bank; they must contain not only easy questions but also complex logic and creative edge cases. For instance, a team added emoji‑based queries to their dataset after encountering a user who asked "🚗+⛽️+🗺️=?"; initially the AI answered "I don’t know", but after augmenting the data it learned to return the nearest gas‑station prices.
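The question bank described above can be sketched as categorized test cases. This is a minimal illustration, assuming a plain dict-based structure; the field names and `EVAL_BANK` itself are illustrative, not part of any real framework:

```python
# A minimal sketch of an eval "question bank" (all names are illustrative).
EVAL_BANK = [
    # Basic functional items
    {"category": "functional", "prompt": "What are the opening hours?",
     "expect": "store hours"},
    {"category": "functional", "prompt": "Check the weather",
     "expect": "weather report"},
    # Logical reasoning tasks
    {"category": "reasoning",
     "prompt": "Plan a route that avoids rush hour and passes a market",
     "expect": "route satisfying both constraints"},
    # Edge-case "trick" questions, added after real users sent them
    {"category": "edge", "prompt": "🚗+⛽️+🗺️=?",
     "expect": "nearest gas-station prices"},
]

def cases_by_category(bank):
    """Group test cases so coverage gaps per category become visible."""
    groups = {}
    for case in bank:
        groups.setdefault(case["category"], []).append(case)
    return groups

coverage = {cat: len(cases)
            for cat, cases in cases_by_category(EVAL_BANK).items()}
print(coverage)  # {'functional': 2, 'reasoning': 1, 'edge': 1}
```

Tracking counts per category makes it obvious when, say, the "edge" bucket is thin relative to the basics, which is exactly the gap the emoji incident exposed.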
Automated scoring (e.g., using GPT‑5 to grade answers) can evaluate thousands of questions quickly, yet it can also mis‑score, such as giving a high score to a bland response like "Drink more hot water" for "How to comfort a girlfriend". Human review is therefore essential to catch failures that machines miss.
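One way to combine machine speed with human judgment is to auto-grade everything but route suspicious answers to a review queue. The sketch below is an assumption about how such a pipeline might look; `llm_grade` is a stub standing in for a real call to a grading model, and the heuristics for "suspicious" are deliberately naive placeholders:

```python
def llm_grade(question, answer):
    """Stub grader: a real one would call a model API and parse a score.
    This placeholder just rewards longer answers, which is exactly the
    kind of bias that makes human review necessary."""
    return 9.0 if len(answer) > 20 else 8.0

def grade_with_review(cases):
    """Auto-grade every (question, answer) pair, but queue short or
    low-scoring answers for human spot checks."""
    auto_scores, review_queue = [], []
    for question, answer in cases:
        score = llm_grade(question, answer)
        # Machines over-score bland replies, so flag very short answers
        # (and anything the grader itself marks low) for a human pass.
        if score < 6.0 or len(answer.split()) < 5:
            review_queue.append((question, answer, score))
        auto_scores.append(score)
    return auto_scores, review_queue

cases = [
    ("How to comfort a girlfriend", "Drink more hot water"),
    ("What are the opening hours?", "We are open daily from 9 am to 9 pm."),
]
scores, queue = grade_with_review(cases)
print(queue)  # the bland "hot water" reply lands in the human queue
```

The point of the design is that the human queue is small and targeted: reviewers see only the cases where the automated grader is most likely to be wrong.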
When AI answers are wrong, the root cause may lie in data (the model never saw that pattern), algorithm (confusing "Apple phone" with "Apple fruit"), or generation (correct logic but awkward phrasing). Teams often conduct deep post‑mortems, labeling each error and prioritizing fixes based on impact (e.g., 30% of users abandoning the product due to a specific bug).
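The triage step above can be sketched as labeling each failure with a root cause and ranking causes by user impact. The taxonomy (data / algorithm / generation) comes from the text; the weights and field names are illustrative assumptions:

```python
# Post-mortem error triage: label each failure, then rank root causes
# by total user impact so the highest-leverage fix comes first.
ERRORS = [
    {"cause": "data", "desc": "pattern never seen in training",
     "affected_users": 0.30},  # the bug driving 30% abandonment
    {"cause": "algorithm", "desc": "'Apple phone' read as 'Apple fruit'",
     "affected_users": 0.08},
    {"cause": "generation", "desc": "correct logic, awkward phrasing",
     "affected_users": 0.05},
]

def prioritize(errors):
    """Sum impact per root cause and sort descending."""
    impact = {}
    for e in errors:
        impact[e["cause"]] = impact.get(e["cause"], 0.0) + e["affected_users"]
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)

print(prioritize(ERRORS))  # data-driven failures rank first
```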
Cross‑functional collaboration is crucial: evaluation results may be dismissed by engineers as insignificant, while the UX team sees a major user‑experience issue. The AI product manager acts as a referee, using data to argue for resource allocation (e.g., requesting 5,000 new data points for a novel scenario).
Metrics beyond raw accuracy are needed; a model with high accuracy may still be verbose or miss user intent. Stale test sets are another problem: relying on a year-old exam bank for an AI education product led to irrelevant recommendations and user complaints.
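A multi-metric scorecard makes these blind spots visible alongside accuracy. The function below is a minimal sketch under assumed field names (`correct`, `answer`, `matched_intent`); the verbosity proxy (average answer length) is a deliberately simple stand-in for a real metric:

```python
def score_card(results):
    """results: list of dicts with 'correct' (bool), 'answer' (str),
    and 'matched_intent' (bool) per test case."""
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    avg_words = sum(len(r["answer"].split()) for r in results) / n
    intent_rate = sum(r["matched_intent"] for r in results) / n
    return {
        "accuracy": accuracy,
        "avg_answer_words": avg_words,  # flags verbose models
        "intent_match": intent_rate,    # flags "correct but off-topic"
    }

sample = [
    {"correct": True, "answer": "Open 9 am to 9 pm daily.",
     "matched_intent": True},
    {"correct": True, "answer": "Per our policy documentation, section 4.2, "
     "returns are accepted within 30 days...", "matched_intent": False},
]
print(score_card(sample))  # 100% accurate, yet only 50% on intent
```

A model can score 100% on `accuracy` here while failing half its users on intent, which is precisely why accuracy alone is a misleading headline number.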
In a competitive AI market, 60‑point products are being phased out. Users now demand not just "usable" but "delightful" experiences, which requires AI product managers to continuously refine evaluation, turn uncertainty into confidence, and iterate on both data and model improvements.
PMTalk Product Manager Community
One of China's top product manager communities, gathering 210,000 product managers, operations specialists, designers and other internet professionals; over 800 leading product experts nationwide are signed authors; hosts more than 70 product and growth events each year; all the product manager knowledge you want is right here.
