Most AI Agent Skills Are Built Backwards

Description-only skills let agents generate code at runtime. Pre-built scripts remove the guessing entirely. Here's the architecture difference, why it matters for cost and reliability, and the open-source gpt-image-2 skill that proves it.

By David Sharkey · April 22, 2026 · 7 min read

TL;DR

Most AI skills describe the API and let the agent generate code at runtime. That burns tokens, hallucinates function signatures, and locks you into frontier models. Ship pre-built scripts instead: the agent reads context, picks the script, passes flags. Lower-tier models can drive it. Roughly 13x fewer tokens over a 15-image session. Zero hallucinated API calls. Open-source gpt-image-2 skill included.

Claude Code can't reliably call OpenAI's image API from memory. Neither can Codex. Neither can Cursor.

Not because image generation is complicated. Because the model's training data contains hundreds of slightly different versions of the same function signature, and it doesn't know which one is current. So it guesses. I watched it guess four times in a row on a simple product mockup — wrong signature, wrong signature, deprecated parameter from gpt-image-1, then a silent skip where it said "I've created the mockup" without producing a file. Four attempts. Zero images. Roughly 12,000 tokens burned on Python scripts that never executed.

The fix is a skill. But the way I built it matters more than the fact that I built it, because most skills reproduce the exact same failure mode they're supposed to solve.

The two architectures

There are two ways to give an agent a new capability. The difference between them determines whether the skill actually works or just looks like it should.

Architecture 1: Description-only. You write a SKILL.md that explains the API — endpoints, parameters, constraints, example calls. The agent reads it, generates a Python script at runtime, and executes it.

This is what most people build. It's also what fails. You're asking the model to do two things simultaneously: understand when to use the tool, and generate correct code to call it. The first task is easy. The second is where tokens go to die.

The model has to hold the full API specification in working memory, reason about parameter types and validation rules, generate syntactically correct code, and get it right on the first attempt — because retry loops compound the cost and degrade the conversation. For trivial APIs this works. For anything with non-obvious constraints, it doesn't.

Architecture 2: Context + pre-built scripts. You write a SKILL.md that explains when to use the tool and what the flags mean. Then you ship the actual scripts the agent runs. No runtime code generation. The agent reads the context, matches the user's intent to the right script, constructs the CLI flags, and executes.

The agent's decision space shrinks from "understand the API, write the code, handle the errors" to "should I use this tool, and with what parameters?" Everything else is already solved.
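
One way to picture that smaller decision space is as a function from intent to a command line. The sketch below is illustrative only: the two script paths and the flag names come from the examples later in this article, while build_command and the SCRIPTS mapping are hypothetical helpers, not part of the published skill.

```python
import shlex
from typing import List, Optional, Sequence

# Script paths as used in this article's examples.
# The mapping itself is a hypothetical illustration.
SCRIPTS = {
    "generate": "gpt-image/scripts/generate.py",
    "edit": "gpt-image/scripts/edit.py",
}


def build_command(
    intent: str,
    prompt: str,
    output: str,
    images: Optional[Sequence[str]] = None,
    **flags: object,
) -> List[str]:
    """Turn a chosen script and a few parameters into an argv list."""
    if intent not in SCRIPTS:
        raise ValueError(f"unknown intent: {intent!r}")
    argv = ["uv", "run", SCRIPTS[intent], "--prompt", prompt]
    if images:
        argv += ["--images", *images]
    # Keyword names map to CLI flags: output_compression -> --output-compression
    for name, value in flags.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    argv += ["--output", output]
    return argv


if __name__ == "__main__":
    cmd = build_command(
        "generate",
        "A vintage 1960s travel poster for Kyoto in autumn",
        "kyoto.png",
        size="1024x1536",
        quality="high",
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Everything the agent has to decide fits in the arguments to that one call. Nothing about the HTTP request, the client library, or error handling appears at this level.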

Why pre-built scripts win

Lower-tier models work

This is the part nobody talks about. Description-only skills effectively require a frontier model. Generating correct API calls for a specific version of a specific service requires memorising exact function signatures — signatures that may have changed since the model's training cutoff. Claude Sonnet 4.6 and GPT-5.2 Instant will fumble this. Not because they can't reason, but because the task demands recall of implementation details that shift every few months.

Pre-built scripts turn code generation into pattern matching. The model reads the skill description, matches intent, and constructs CLI flags from a documented list. Sonnet 4.6 handles that without issues. So does everything above it.

This matters because per-token prices rise with model tier. If your skill only works on Opus, you're paying Opus prices for what should be a utility call.

Zero token spend on code generation

A description-only skill burns tokens twice per invocation: once to read the API specification, again to generate the code. For gpt-image-2, that's roughly 3,000–4,000 tokens of code generation overhead — before the model even thinks about the user's actual request.

Pre-built scripts cost only the tokens needed to read the SKILL.md (a few hundred) and construct the CLI command (a few dozen). Over a session with 15 image generations (variants for an ad campaign, say) the difference is roughly 60,000 tokens (15 × 4,000) versus about 4,500 (15 × 300). That's around 13x.
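
The session arithmetic is easy to reproduce. This is a back-of-envelope sketch, not a measurement: the 4,000-token figure is the upper end of the range quoted above, and the 300-token per-call figure is an assumption inferred from 4,500 tokens spread over 15 generations.

```python
def session_tokens(invocations: int, tokens_per_call: int) -> int:
    """Total overhead tokens for a session of identical invocations."""
    return invocations * tokens_per_call


GENERATIONS = 15
DESCRIPTION_ONLY_PER_CALL = 4_000  # upper end of the 3,000-4,000 estimate
PREBUILT_PER_CALL = 300            # assumption: 4,500 tokens / 15 calls

description_only = session_tokens(GENERATIONS, DESCRIPTION_ONLY_PER_CALL)
prebuilt = session_tokens(GENERATIONS, PREBUILT_PER_CALL)
ratio = description_only / prebuilt

print(f"{description_only} vs {prebuilt}: {ratio:.1f}x")
# prints "60000 vs 4500: 13.3x"
```

Swap in your own per-call numbers to see how the ratio moves. At the lower 3,000-token estimate it comes out at 10x.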

No hallucinated function signatures

This is the failure mode that is hardest to see coming. The model generates a plausible API call with a parameter name that doesn't exist, or a size value that violates a constraint it doesn't know about. The script runs. The API returns a 400. The model retries with a different guess. Sometimes it converges. Sometimes it loops 3–4 times and produces nothing.

gpt-image-2 is a good example of an API that punishes guessing. Both image edges must be multiples of 16. Maximum edge is 3,840px. Long-to-short ratio can't exceed 3:1. Total pixel count must fall between 655,360 and 8,294,400. No model reliably memorises all of that. A script that validates inputs before calling the API never gets it wrong.
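
A pre-flight check for those four rules fits in a few lines. The sketch below encodes the limits exactly as quoted in the paragraph above, so treat the numbers as this article's figures and not as a verified reading of OpenAI's documentation. The function names are hypothetical, and the published scripts may validate differently.

```python
from typing import List, Tuple

# Limits as quoted in this article; not checked against OpenAI's docs.
EDGE_MULTIPLE = 16
MAX_EDGE = 3840
MAX_ASPECT_RATIO = 3
MIN_PIXELS = 655_360
MAX_PIXELS = 8_294_400


def parse_size(size: str) -> Tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string such as '1024x1536'."""
    width, sep, height = size.lower().partition("x")
    if not sep or not width.isdigit() or not height.isdigit():
        raise ValueError(f"expected WIDTHxHEIGHT, got {size!r}")
    return int(width), int(height)


def validate_size(width: int, height: int) -> List[str]:
    """Return a list of violated constraints; empty means the size is valid."""
    if width <= 0 or height <= 0:
        return ["both edges must be positive"]
    errors = []
    if width % EDGE_MULTIPLE or height % EDGE_MULTIPLE:
        errors.append(f"both edges must be multiples of {EDGE_MULTIPLE}")
    if max(width, height) > MAX_EDGE:
        errors.append(f"longest edge exceeds {MAX_EDGE}px")
    # Integer comparison avoids floating-point ratio edge cases.
    if max(width, height) > MAX_ASPECT_RATIO * min(width, height):
        errors.append(f"aspect ratio exceeds {MAX_ASPECT_RATIO}:1")
    pixels = width * height
    if not MIN_PIXELS <= pixels <= MAX_PIXELS:
        errors.append(f"pixel count {pixels} outside {MIN_PIXELS}-{MAX_PIXELS}")
    return errors


if __name__ == "__main__":
    for size in ("1024x1536", "1000x1000", "3840x1024"):
        problems = validate_size(*parse_size(size))
        print(size, "ok" if not problems else "; ".join(problems))
```

Returning every violation at once, instead of stopping at the first, gives the agent a single error message to act on, so it can correct the size in one step.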

The full API surface is available from day one

When a model generates code, it reaches for the parameters it's most confident about. For image generation, that usually means the prompt and the size, maybe quality. Output format, compression level, multi-image compositing with up to 10 inputs, batch variant generation: these all exist in the API, but the model rarely surfaces them on its own.

Pre-built scripts expose everything as CLI flags. The SKILL.md documents what each flag does. The agent uses --output-compression 85 or --n 4 because the options are right there in the context. No guessing. No subset.
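
Exposing the surface as flags mostly comes down to an argument parser. The following is a minimal sketch using Python's standard argparse module. The flag names are the ones mentioned in this article, but the types, defaults, and help text are assumptions, and the real generate.py may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI surface for a text-to-image script."""
    parser = argparse.ArgumentParser(
        description="Sketch of a generate.py-style CLI (illustrative only)."
    )
    parser.add_argument("--prompt", required=True, help="Text prompt for the image.")
    parser.add_argument("--output", required=True, help="Path of the file to write.")
    # Defaults of None mean "omit the parameter and let the API decide".
    parser.add_argument("--size", default=None, help="WIDTHxHEIGHT, e.g. 1024x1536.")
    parser.add_argument("--quality", default=None, help="Quality level, e.g. high.")
    parser.add_argument(
        "--output-compression",
        type=int,
        default=None,
        help="Compression level for the output file, e.g. 85.",
    )
    parser.add_argument("--n", type=int, default=1, help="Number of variants.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--prompt", "a red bicycle", "--output", "bike.png", "--n", "4"]
    )
    print(args)
```

The help strings matter as much as the flags here. They are the same text the SKILL.md can reuse, so the documentation and the parser stay in step.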

What this looks like in practice

I built a gpt-image-2 skill using this architecture. It ships two Python scripts: generate.py for text-to-image and edit.py for compositing up to 10 reference images, each with a flag for every parameter the API exposes.

Generate from scratch:

uv run gpt-image/scripts/generate.py \
  --prompt "A vintage 1960s travel poster for Kyoto in autumn" \
  --size 1024x1536 \
  --quality high \
  --output kyoto.png

Edit an existing image:

uv run gpt-image/scripts/edit.py \
  --prompt "Replace the sky with a dramatic thunderstorm. Keep everything else identical." \
  --images photo.jpg \
  --output photo-stormy.png

Multi-image compositing:

uv run gpt-image/scripts/edit.py \
  --prompt "Place the dog from Image 2 next to the woman in Image 1. Match lighting and scale." \
  --images scene.png dog.png \
  --output composite.png

Dependencies install on first run via uv and PEP 723 inline metadata. No manual pip install. No virtual environment to manage. No setup beyond cloning the repo, having uv installed, and exporting an OpenAI API key.
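
For readers who haven't seen PEP 723, the metadata is a comment block at the top of the script that a runner such as uv reads before executing it. The sketch below shows what such a header can look like and how a tool can locate it, using the regular expression from the PEP's reference implementation. The header contents (the Python version and the openai dependency) are assumptions for illustration, not copied from the repository.

```python
import re
from typing import Optional

# A hypothetical script header; the real scripts may declare other dependencies.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "openai",
# ]
# ///

print("script body runs after dependencies are resolved")
'''

# Regular expression from the PEP 723 reference implementation.
INLINE_METADATA = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_inline_metadata(source: str) -> Optional[str]:
    """Return the TOML text of the 'script' block, or None if absent."""
    for match in INLINE_METADATA.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Strip the leading "# " (or bare "#") from each comment line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


if __name__ == "__main__":
    print(read_inline_metadata(EXAMPLE_SCRIPT))
```

Because the metadata lives in comments, the same file still runs under a plain Python interpreter as long as the dependencies are already installed.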

The SKILL.md tells Claude when to trigger the skill, what each script does, and what every flag means. The scripts do the work. Total context cost per invocation: ~800 tokens. Total code generation: zero.

The skill as an SDK

The analogy that keeps coming back: you don't hand a developer a specification and ask them to write their own HTTP client for every request. You give them an SDK with methods they call. The SDK handles serialisation, error codes, retries, auth. The developer handles intent.

Skills should work the same way. The SKILL.md is the documentation. The scripts are the SDK. The agent is the developer.

A description-only skill is like handing someone the RFC and asking them to implement the client from scratch every time they need to make a request. It works if they're sharp enough and have enough time. It fails the moment conditions aren't perfect.

Install

# User-wide (recommended)
git clone https://github.com/dshark3y/gpt-image-2-skill.git ~/.claude/skills/gpt-image

# Export your key
export OPENAI_API_KEY="sk-..."

Claude Code picks it up on next launch. Works as a skill or standalone from the terminal.

Source, flag reference, prompting guide, and size constraints at github.com/dshark3y/gpt-image-2-skill.

The instinct when building skills is to describe the capability and trust the model to figure out the rest. That instinct produces skills that work on demos and break in production — because code generation is the wrong task to delegate when a deterministic script would do the job with zero variance.

Context tells the agent when to act. Scripts tell it exactly how. Most people building skills are solving the wrong problem at the wrong layer.

And almost nobody's shipping the scripts.