
This is Part 1 of a 3-part series on the layers of working with LLMs: Prompt Engineering, then Context Engineering, then Harness Engineering. Each layer builds on the one before it, and each is increasingly the responsibility of the platform you're using rather than you, the user. We'll come back to that distinction at the end.
There are thousands of prompt engineering posts out there, which is a lot of noise to sort through if you just want to know what works. So for this one we wanted to go to the source and see what the three major labs themselves are saying. The post is built off the primary documents from each: Anthropic's Prompting best practices and their interactive tutorial, OpenAI's Prompt engineering guide, and Google's two Vertex AI guides, the Introduction to prompting and the more detailed Overview of prompting strategies.
Most of what they say overlaps. Some of it doesn't, and the disagreements are where the more useful guidance lives.
First, what is a prompt?
It helps to be precise about the basic mechanic before talking about technique.
A prompt is everything the model sees before it generates the next token. That includes the system instructions (set by whichever platform or API call is talking to the model), the user message (what you typed), any prior turns in the conversation, and any documents, tool outputs or retrieved snippets the platform has stuffed in for you.
The model has no memory outside this window. It is not thinking about your problem in the background. As Google's docs put it, a language model given some text "can predict what is likely to come next, like a sophisticated autocompletion tool."
If it isn't in the prompt, the model doesn't know it. Your tone preferences, your team's norms, the fact that this is for a B2B audience: none of it exists unless something put it there. Everything in the prompt also competes for attention, so a focused prompt with the right structure usually beats a long, vague one with everything thrown in.
Prompt engineering exists as a discipline because what you put in this window is the single biggest lever you have over what comes out.
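To make that concrete, here is a minimal sketch of what the window looks like when assembled as a chat-style message list. The content is illustrative; the shape is the point: system instructions, prior turns, injected context, and your latest message all arrive as one flat sequence.

```python
# Minimal sketch of "everything the model sees", assembled as one message list.
# The snippet and turns are placeholders; only the shape matters.
retrieved_snippet = "Q3 pricing research: PMs at 10-500 person startups prefer per-seat pricing."

messages = [
    # System instructions: set by the platform or the API caller, not the end user
    {"role": "system", "content": "You are a concise assistant for product managers."},
    # Prior turns in the conversation
    {"role": "user", "content": "Summarise our pricing options."},
    {"role": "assistant", "content": "Here are the three options we discussed..."},
    # Retrieved context the platform has stuffed in, plus the new user message
    {"role": "user", "content": f"<context>{retrieved_snippet}</context>\n\n"
                                "Which option fits a B2B audience best?"},
]
# Nothing outside this list exists for the model when it generates the next token.
```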
What the three labs all agree on
The three docs converge on roughly the same definition:
- Anthropic frames it around clarity, examples, XML structure, role assignment, and thinking as the foundational levers.
- OpenAI defines prompt engineering as the process of writing effective instructions for a model, such that it consistently generates content that meets your requirements.
- Google decomposes a prompt into an Objective and Instructions (required), plus a long list of optional components: System instructions, Persona, Constraints, Tone, Context, Few-shot examples, Reasoning steps, Response format, Recap, and Safeguards.
Google's component list is the most detailed breakdown of the three. Twelve named pieces, each with a clear job. Even if you never name them out loud while writing a prompt, knowing they exist gives you a checklist you can hold any prompt against.
The five techniques that show up everywhere
1. Be clear and specific, and write for a stranger
Anthropic's framing is the one worth quoting: "Think of Claude as a brilliant but new employee who lacks context on your norms and workflows." They follow it up with what they call the golden rule of prompting:
Show your prompt to a colleague with minimal context on the task and ask them to follow it. If they'd be confused, Claude will be too.
OpenAI says effectively the same thing in a more mechanical way. GPT models "benefit from very precise instructions", while reasoning models "will provide better results on tasks with only high-level guidance." Their analogy: a reasoning model is "like a senior co-worker… you can give them a goal", whereas a GPT model is "like a junior coworker" who needs explicit instructions.
Google's prompt-health checklist is the most concrete on what "specific" means in practice. Avoid relative qualifiers. Instead of "write a brief summary", write "write a summary of 3 sentences or less."
So the practical rule has two parts. Match your specificity to the model (more handholding for non-reasoning models, more goal-setting for reasoning ones), and replace fuzzy adjectives with measurable constraints regardless of which model you're using.
2. Assign a persona, or use system instructions
All three docs treat this as a first-class technique rather than a trick.
Anthropic: "Setting a role in the system prompt focuses Claude's behavior and tone for your use case. Even a single sentence makes a difference." OpenAI describes the developer message as "the system's rules and business logic, like a function definition," with the user message as the "inputs and configuration… like arguments to a function." Google defines persona as "who or what the model is acting as," and system instructions as "technical or environmental directives that may involve controlling or altering the model's behavior across a set of tasks."
Google's strategies doc has a complete worked example built around a math-tutor persona. The components fit together like this:
| Component | Example |
|---|---|
| Objective | Help students with math problems without directly giving the answer |
| Persona | You are a math tutor here to help students with their math homework |
| Instructions | Understand what the problem is asking; understand where the student is stuck; give a hint for the next step |
| Constraints | Don't give the answer directly. Give hints. If the student is completely lost, give them detailed steps |
| Tone | Respond in a casual and technical manner |
| Context | A copy of the student's lesson plans for math |
| Recap | Don't give away the answer; provide hints instead. Always format the response in Markdown |
Each row carries a different job. The persona sets the lens, the constraints define what cannot happen, the tone shapes the voice, and the recap restates the hard rules at the end so they don't get lost in a long generation.
A note that becomes important later: when you're typing into a chatbot like ChatGPT, Claude or Gemini, you don't usually get to set the system instruction directly. The platform has already set one for you. The persona and tone you ask for in your user message effectively layer on top of that system instruction. We'll come back to who controls what at the end.
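When you do control the system prompt (for example, when calling the API yourself), the split looks roughly like this. A minimal sketch using the Anthropic Python SDK, with the math-tutor persona from Google's table; the model name is illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    # System prompt: persona, constraints, tone -- the layer a platform normally owns
    system=(
        "You are a math tutor helping students with their homework. "
        "Never give the answer directly; give a hint for the next step. "
        "If the student is completely lost, walk through the steps in detail. "
        "Respond in a casual tone and format responses in Markdown."
    ),
    # User message: the per-request layer the end user actually controls
    messages=[
        {"role": "user", "content": "I'm stuck on 3x + 5 = 20. What do I do first?"}
    ],
)
print(response.content[0].text)
```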
3. Use examples, and use enough of them
Anthropic is the most prescriptive of the three:
Examples are one of the most reliable ways to steer Claude's output format, tone, and structure… Include 3–5 examples for best results.
They also give you the structural advice the others imply but don't spell out. Wrap each example in <example> tags, and the whole set in <examples> tags, so the model can tell instructions from demonstrations. Examples should be relevant (they should mirror the actual use case), diverse (they should cover edge cases), and structured.
OpenAI agrees: "When providing examples, try to show a diverse range of possible inputs with the desired outputs." Their code samples use the same XML-tagged pattern (<product_review>, <assistant_response>).
Google goes further and gives you a reusable template you can copy verbatim, lifted from their strategies doc:
<OBJECTIVE_AND_PERSONA>
You are a [persona]. Your task is to...
</OBJECTIVE_AND_PERSONA>
<INSTRUCTIONS>
1. ...
2. ...
</INSTRUCTIONS>
<CONSTRAINTS>
Dos and don'ts:
1. Dos
2. Don'ts
</CONSTRAINTS>
<CONTEXT>
The provided context
</CONTEXT>
<OUTPUT_FORMAT>
The output format must be...
</OUTPUT_FORMAT>
<FEW_SHOT_EXAMPLES>
Example #1
Input:
Thoughts:
Output:
</FEW_SHOT_EXAMPLES>
<RECAP>
Re-emphasise the constraints, output format, etc.
</RECAP>
Three labs, one shape: tagged sections, examples in their own block, the actual ask separated cleanly from the demonstrations. If you only adopt one habit from this post, this is the one to adopt.
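If you're assembling prompts in code rather than typing them, the tagged structure is easy to generate. A small illustrative helper (not from any of the three docs) that wraps few-shot pairs the way Anthropic describes:

```python
# Illustrative helper: wrap few-shot pairs in <example> tags and the whole set
# in <examples> tags, so demonstrations are clearly separated from instructions.
def format_examples(pairs: list[tuple[str, str]]) -> str:
    blocks = [
        f"<example>\nInput: {inp}\nOutput: {out}\n</example>"
        for inp, out in pairs
    ]
    return "<examples>\n" + "\n".join(blocks) + "\n</examples>"

examples = format_examples([
    ("Refund request, order #1042", "Category: billing | Priority: high"),
    ("Feature idea: dark mode", "Category: product-feedback | Priority: low"),
    ("App crashes on login", "Category: bug | Priority: high"),
])

prompt = (
    "Classify each support ticket into a category and priority.\n\n"
    f"{examples}\n\n"
    "Ticket: Can't export my data to CSV"
)
```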
4. Structure with XML and Markdown, and put things in the right order
Anthropic and OpenAI both recommend mixing Markdown for headings and sections with XML tags for content blocks. As Anthropic puts it, "XML tags help Claude parse complex prompts unambiguously, especially when your prompt mixes instructions, context, examples, and variable inputs."
OpenAI's recommended order for a developer message is:
- Identity (purpose, communication style, high-level goals)
- Instructions (rules, what to do and not do)
- Examples (inputs and desired outputs)
- Context (private data, retrieved documents)
Note where they put context: at the end. The reason is prompt caching:
Keep content that you expect to use over and over in your API requests at the beginning of your prompt… This enables you to maximize cost and latency savings from prompt caching.
The practical implication is to keep stable content at the top of the prompt and per-request content at the bottom.
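In API terms, that looks something like the sketch below (OpenAI Python SDK; the model name, policy text and helper are placeholders). The identity, instructions and examples sit in the system message and never change between calls, so OpenAI's automatic prefix caching can reuse them; the retrieved document and the actual question arrive last.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stable content: identical across requests, so the cached prefix can be reused
STABLE_PREFIX = (
    "Identity: You are a support assistant for a billing team.\n"
    "Instructions: Answer in two sentences or less and cite the relevant policy.\n"
    "Examples:\n<examples>...</examples>"
)

def answer(ticket_text: str, retrieved_policy: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": STABLE_PREFIX},
            # Per-request content last: retrieved documents and the actual ask
            {"role": "user", "content": f"<policy>{retrieved_policy}</policy>\n\n{ticket_text}"},
        ],
    )
    return response.choices[0].message.content
```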
Anthropic gives a slightly different rule for one specific case: long-context prompts of 20k tokens or more.
Put longform data at the top: place your long documents and inputs near the top of your prompt, above your query, instructions, and examples… Queries at the end can improve response quality by up to 30% in tests.
For short prompts, instructions go first and context goes last. For long-document prompts, the documents go first and the query goes last. This is the kind of detail that matters in production and rarely shows up in shorter writeups.
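For the long-document case, here is a rough sketch of the flipped ordering (the document tags follow Anthropic's conventions; the file names and contents are placeholders):

```python
# Rough sketch of a long-context prompt: longform documents first, query last.
documents = [
    ("annual_report_2024.pdf", "...tens of thousands of tokens of report text..."),
    ("competitor_analysis.md", "...more long content..."),
]

doc_block = "\n".join(
    f'<document index="{i}">\n<source>{name}</source>\n'
    f"<document_contents>\n{text}\n</document_contents>\n</document>"
    for i, (name, text) in enumerate(documents, start=1)
)

prompt = (
    f"<documents>\n{doc_block}\n</documents>\n\n"
    # Instructions and the query go last, after the longform data
    "Using only the documents above, summarise the three biggest pricing risks "
    "in five bullet points."
)
```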
5. Tell it what to do, not what not to do
A small but underrated rule that Anthropic states twice in their docs:
Tell Claude what to do instead of what not to do. Instead of "Do not use markdown in your response", try "Your response should be composed of smoothly flowing prose paragraphs."
OpenAI's GPT-5 cookbook makes the same point. Negative instructions are weaker signals than positive ones. Whenever you find yourself writing a "don't", try to rewrite it as the behavior you want.
There is a useful exception. Google's strategies doc points out that constraints, framed as a dedicated section ("don't give the student the answer directly"), do work, but only when they're written as crisp rules and ideally paired with a positive instruction about what the model should do instead.
Where the labs disagree
On verbosity instructions, Anthropic's recent guidance for Claude Opus 4.7 says the model "calibrates response length to how complex it judges the task to be", and recommends adding explicit length instructions if your product depends on a particular style. OpenAI's GPT-5 guidance flips the polarity. GPT-5 is "highly steerable and responsive to well-specified prompts" but tends to under-instruct itself if you don't push it. The two labs are pointing at the same problem from opposite default behaviours.
On chain-of-thought, Anthropic's tutorial dedicates a whole chapter to "Precognition: Thinking Step by Step", but their newest guidance is to prefer general instructions over prescriptive steps: "A prompt like 'think thoroughly' often produces better reasoning than a hand-written step-by-step plan." OpenAI's reasoning-model guidance lands in the same place; don't choreograph the chain of thought, just give the goal. Google's strategies doc explicitly tells you to test removing your step-by-step instructions when you're using a Thinking model.
The pattern across all three: the smarter the model, the less you should try to choreograph its thinking. Step-by-step plans help with older or non-reasoning models. They get in the way with frontier reasoning models.
On "be thorough" type instructions, Anthropic warns directly against blanket thoroughness prompts on Claude 4.6 and later: "Tools that undertriggered in previous models are likely to trigger appropriately now. Instructions like 'If in doubt, use [tool]' will cause overtriggering." Google's prompt-health checklist makes a related point about emotional appeals and threats. "Very bad things will happen if you don't get this correct" used to help. It now usually hurts. Prompts get stale, and you should re-read your system prompts every time you upgrade a model.
A worked example
Take this vague prompt:
Help me think about pricing for our new product.
Here is the same intent rewritten against Google's template, using Anthropic's XML conventions:
<OBJECTIVE_AND_PERSONA>
You are a pricing strategist who has launched B2B SaaS products
in the $50 to $500 per month range. You think in terms of
willingness-to-pay, expansion revenue, and packaging clarity.
</OBJECTIVE_AND_PERSONA>
<CONTEXT>
- Product: an AI assistant for product managers
- Target users: PMs at startups (10 to 500 employees)
- Adjacent tools: Notion AI ($10/seat), ChatGPT Team ($25/seat), Linear ($14/seat)
- Current ARR: $0, pre-launch
</CONTEXT>
<INSTRUCTIONS>
Propose three pricing models we should seriously consider.
For each, give: a one-line description, who it's best for,
the single biggest risk, and a rough monthly price point.
</INSTRUCTIONS>
<FEW_SHOT_EXAMPLES>
<example>
Model: Per-seat flat
Best for: teams that all use it equally
Biggest risk: caps expansion, since PMs are a small slice of headcount
Price: $X/seat/month
</example>
</FEW_SHOT_EXAMPLES>
<OUTPUT_FORMAT>
A markdown table with columns: Model | Best for | Biggest risk | Price
</OUTPUT_FORMAT>
<RECAP>
Three models. Markdown table. Be specific about risk, not generic.
</RECAP>
The rewritten version is more useful not because it's longer, but because each section has a job. The persona sets the lens, the context constrains the recommendations, the example pins down the format, and the recap restates the hard requirements so they survive a long generation.
Who actually controls what
Here is the part that makes prompt engineering only one third of the work, and which sets up the rest of this series.
Typing into a chatbot composer can feel like writing "the prompt." You're really only writing one of several layers, and most of the interesting work lives in the layers around it.
The system instruction is set by whoever built the platform: OpenAI, Anthropic, Google, or any product built on top of them (including Idam AI). You don't see it. It defines the persona, the safety rules, the response style, the boundaries.
The context (past turns, retrieved documents, project files, tool outputs, your onboarding data) is triggered by you but assembled by the platform. On a raw LLM provider it might be as simple as a "Project" folder you drop files into. On a product platform built for a specific job, it can involve significant retrieval, ranking, summarisation, and data shaping. That's an engineering discipline of its own.
The harness (the loop that decides when to call the model, when to call a tool, when to retry, when to hand off to another agent, when to stop) is almost entirely platform code. Of the four layers, your user message is the only one you fully own.
The techniques in this post matter for all four layers. Whether you're typing into ChatGPT, writing a system prompt for an internal Claude bot, or designing the agent loop for a platform like the one we're building at Idam AI, the same primitives are doing the work: clarity, persona, examples, structure, positive instructions. What changes from layer to layer is who's writing them and how much else sits around them.
The next two posts pick up from here:
- Context Engineering. What the model sees, when, and how. RAG, conversation memory, document chunking, cross-agent context, and why on a real product platform this stops being a writing problem and becomes a systems problem.
- Harness Engineering. The loop around the model: tool use, retries, evals, guardrails, multi-agent orchestration, and everything else that turns a single model call into a working system.
A short checklist
Before you send (or design) a prompt you care about, run through these:
- Can a colleague with no context follow it? (Anthropic's golden rule)
- Have you covered the components that matter: Objective, Persona, Instructions, Constraints, Context, Examples, Output format, Recap? (Google's twelve)
- Is the stable stuff at the top and the per-request stuff at the bottom? (For long-document prompts, flip it: documents first, query last.)
- Are examples wrapped in tags, with 3 to 5 of them, covering edge cases?
- Have you replaced "don't do X" with "do Y"?
- If you're on a reasoning model, have you stripped out the step-by-step choreography you wrote for the previous one?
- Have you removed emotional appeals, threats, and redundant instructions? (Google's prompt-health checklist)
Sources
- Anthropic, Prompting best practices
- Anthropic, Prompt engineering interactive tutorial (GitHub)
- OpenAI, Prompt engineering guide
- Google Vertex AI, Introduction to prompting
- Google Vertex AI, Overview of prompting strategies
Next up: Context Engineering. How the platform decides what goes into the context window, and why at scale that becomes an engineering problem more than a writing one.