Why AI Buzzwords Multiply Faster Than My Hair Falls
The article maps three generations of AI engineering—Prompt Engineering, Context Engineering, and Harness Engineering—explaining their core capabilities, key terms like LLM, RAG, Agent, and evaluation methods, while offering practical tips, pitfalls, and a concise three‑question checklist to stay grounded amid the rapid influx of new AI jargon.
Three Generations of AI Engineering
AI application development has progressed through three capability jumps. Each generation adds a new layer of engineering:
First Generation – Prompt Engineering : design prompts (Prompt, System Prompt, Chain‑of‑Thought, Few‑shot, Zero‑shot, JSON Mode) so the model can understand human language.
Second Generation – Context Engineering : inject relevant external information (RAG, Memory, Vector DB, Embedding, Function Calling, Model Context Protocol (MCP), Agent‑to‑Agent (A2A), Skill, Agent, OpenClaw) before the model answers.
Third Generation – Harness (Quality‑Control) Engineering : build systematic evaluation pipelines (Eval Harness, Benchmark, MMLU, HumanEval, GSM8K, A/B Test, Regression Test) to verify that AI does not hallucinate.
Four Foundational Modules
LLM – the “brain” that understands, reasons, and generates text.
Memory – a notebook that stores dialogue, knowledge, and state.
Tools – the hands and feet that let the model query APIs, browse the web, or run code.
Planning – the project manager that breaks tasks into ordered steps.
Missing any module leaves the AI system crippled.
First Generation – Prompt Engineering
Core goal: make the model understand the request and produce the desired output.
Prompt
The raw text fed to the model; its quality directly determines output quality.
System Prompt
A hidden “top‑level design” that defines the model’s role and behavior boundaries (e.g., “You are a senior frontend architect, answer concisely with code examples”).
Chain‑of‑Thought (CoT)
Adding “think step by step” forces the model to spell out its reasoning, which is essential for math and logic problems.
Few‑shot
Provide a few “question → answer” examples before the real query so the model can mimic the style.
Zero‑shot
Ask without examples; the model relies on pre‑training. Works for simple questions but can drift on complex ones.
JSON Mode
Force the model to output JSON, making downstream parsing reliable—otherwise you’d have to scrape free‑form text with fragile regexes.
Current state: Prompt techniques are now a basic skill (2023 value, cheap to learn). However, prompts alone cannot access internal documents or trigger external actions, prompting the shift to the second generation.
Second Generation – Context Engineering
Core goal: inject the most relevant information into the model’s context window before it answers.
Local models have knowledge cut‑offs and cannot see enterprise data. Context engineering solves the “open‑book exam” problem.
Retrieval‑Augmented Generation (RAG)
Essence: attach an external knowledge base. When a user asks, the system first retrieves relevant documents, injects them into the prompt, then generates an answer.
Key components
Vector DB – stores document fragments as vectors; similarity search finds relevant pieces (e.g., Chroma, Pinecone, Milvus, pgvector).
Embedding – converts text or images into vectors; semantic similarity becomes distance in vector space.
Memory – a broader concept covering retrieval, conversation history, user profiles, and long‑term knowledge.
Function Calling
Essence: give the model “hands”. When it needs external data, it outputs a structured JSON command; the application executes the command and feeds the result back.
Model decides it needs weather → outputs {"tool": "get_weather", "city": "Beijing"} Your code calls the weather API and gets “8°C, sunny”.
Result is returned to the model.
Model replies in natural language.
Model Context Protocol (MCP)
Essence: a “USB‑C standard” for function calling. Instead of N×M custom integrations, MCP provides a unified interface (N+M complexity). Service providers implement an MCP server (GitHub, PostgreSQL, Notion, Puppeteer, Filesystem), and clients plug in with an MCP client.
Agent‑to‑Agent Protocol (A2A)
Essence: MCP connects AI to tools; A2A connects AI to other AIs. It standardizes communication among multiple agents (frontend, backend, testing agents) for discovery, task assignment, result sharing, and state sync.
Think of MCP as an AI power strip and A2A as an AI group chat.
Skill
Pre‑packaged instruction sets + tool‑calling logic for specific tasks. A “real” Skill has concrete tool integration, defined I/O contracts, and is composable; a “fake” Skill is just a prompt template.
Agent
An autonomous AI system that can plan, call tools, and maintain state. The full four‑module stack (LLM decides, Memory stores state, Tools act, Planning decomposes) enables sophisticated agents.
Evolution ladder: tool‑enhanced chat → single‑task agent → multi‑agent collaboration (A2A) → general autonomous agent.
OpenClaw
An open‑source agent framework that packages the four modules and context engineering into a ready‑to‑use product, similar to how Next.js builds on React.
OpenClaw is an application‑layer implementation of second‑generation technology; it does not invent new principles but bundles existing capabilities.
Current state: Without context engineering you’re like a SQL‑only developer who can’t handle data‑warehouse tasks.
Third Generation – Harness (Quality‑Control) Engineering
Core goal: build a standardized, reproducible, automated evaluation pipeline to answer “Does this AI system work reliably?”
The first two generations make AI “able to work”; the third ensures it works reliably.
Eval Harness
A standardized toolchain (e.g., EleutherAI’s lm‑evaluation‑harness or custom CI/CD pipelines) that runs benchmark suites to assess model or system capabilities.
Benchmark
Industry‑wide test sets. High scores don’t guarantee real‑world performance, but low scores almost always indicate problems.
MMLU
Massive Multitask Language Understanding – 57 subjects covering math, history, CS, law, etc. Serves as a “comprehensive exam” for models.
HumanEval
Measures code‑generation ability by having the model complete functions and then running tests. Important for code‑assistant selection.
GSM8K
Evaluates multi‑step math reasoning on elementary‑to‑middle‑school problems; reasoning models (e.g., o1, DeepSeek‑R1) double accuracy over pattern‑matching models.
A/B Test
Online comparison of two models or strategies on real users; focus on business metrics (adoption, satisfaction, task completion) rather than benchmark scores.
Regression Test
Run a “gold‑standard” question set after any change (prompt tweak, model swap, RAG adjustment) to ensure existing good cases still pass.
Skipping regression testing is like deploying a refactored component without unit tests—brave, but risky.
Practical advice
Allocate 5‑10% of traffic to the experiment group.
Monitor business metrics, not just benchmark numbers.
Also track latency, cost, and security interception rates.
Common pitfalls include over‑relying on Function Calling vs. MCP, assuming RAG is a cure‑all (garbage‑in‑garbage‑out), believing distilled models are always sufficient, treating reasoning models as universally better, and mistaking benchmark scores for real performance.
Pitfall Cheat‑Sheet (One‑Line Version)
Function Calling vs. MCP – they serve different purposes; use FC internally, MCP for external services.
RAG is not universal – bad retrieval yields bad output; add re‑ranking and citation tracing.
Distilled models are not always enough – edge cases still need large models; use small model for routine, large model as fallback.
Reasoning models are not always better – they are slower and may over‑think simple queries; use ordinary models for simple tasks, reasoning models for hard ones.
Benchmark scores are not everything – scores can be gamed; after passing benchmarks, run business‑specific cases and online A/B tests.
Skill cannot replace Agent – Skill is a pre‑cooked dish, Agent is a full kitchen; combine atomic Skills with Agent orchestration.
Conclusion
Master the three questions—how to ask (Prompt), what to feed (RAG/MCP/A2A), and how to verify (Harness)—to stay steady amid the flood of new AI buzzwords.
When a new term appears, ask:
Does it help me ask better? → First generation.
Does it feed the model more accurately? → Second generation.
Does it let me evaluate more scientifically? → Third generation.
If a term doesn’t fit any of these, it is likely marketing fluff that can be ignored.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
大转转FE
Regularly sharing the team's thoughts and insights on frontend development
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
