The Problem Nobody Talks About
Writing a SKILL.md takes five minutes. The step-by-step tutorial walks you through directory structure, frontmatter, and body in less time than it takes to brew coffee. You end up with a syntactically valid skill that looks right.
Then you test it.
The skill does not trigger when you expect. Or it triggers on the wrong prompts. Or it triggers correctly but the agent interprets your instructions in a way you did not anticipate -- producing output that technically follows orders but misses the point entirely. You open the SKILL.md, tweak a sentence, test again, get a different wrong result, tweak again.
This gap between "skill exists" and "skill works" is where most authors give up. The format is easy. The iteration is hard. And the iteration is hard because the feedback loop is fundamentally different from what developers are used to.
When you write code, the computer executes your instructions deterministically. When you write a SKILL.md, an AI interprets your instructions probabilistically. The same words can produce different behavior across runs. A description that seems obvious to you might be ambiguous to the model. An instruction you consider clear might conflict with something else in the agent's context window.
Skill development requires a different workflow. Not just a text editor. A testing environment where you can edit, observe, and evaluate in tight, repeatable cycles.
What Makes Skill Development Different
Regular coding has a short, deterministic feedback loop. Change a line. Run the program. Output is the same every time. You reason forward from code to behavior with near-perfect accuracy.
Skill development breaks that contract. The "runtime" is a language model, and language models are stochastic. Three things make the loop fundamentally different:
The output varies between runs. One set of instructions. The agent reads them and produces output. Same prompt again -- slightly different output. Normal. A single test run proves very little. You need multiple runs to separate signal from noise.
The failure mode is subtle. When code breaks, it throws an error. When a skill misbehaves, the agent produces plausible-looking output that is wrong in quiet ways. It follows most instructions but skips the one you cared about most. It uses the right format but invents facts. You cannot grep for these failures. You have to read the output carefully, ideally alongside the agent's reasoning trace.
The trigger mechanism is separate from the instructions. A skill has two independent problems: does it activate when it should (the description field), and does it do the right thing once active (the body). A skill that never triggers is invisible no matter how good the instructions. A skill with a perfect description but bad instructions is worse than no skill. You must test both independently.
These properties mean you cannot develop skills the way you develop code. You need a workflow that accounts for variance, rewards careful observation, and separates trigger testing from behavior testing.
The Edit-Test-Evaluate Loop
The core workflow is a three-pane setup:
Left pane: your editor. The SKILL.md file, open. You edit frontmatter, description, body instructions. Every change is a hypothesis: "If I rephrase this section, the agent will handle edge case X correctly."
Right pane: the agent session. Type a prompt that should trigger the skill. Or one that should not. The agent responds. Watch what happens.
Bottom pane: output and transcript. The agent's reasoning trace, full output, files it touched. This is where you evaluate whether the hypothesis held.
The loop: edit, switch to agent pane, test, switch to output pane, evaluate, switch back to editor, refine. Each cycle should take 60 to 90 seconds. Longer means the bottleneck is context switching -- hunting for the right terminal tab, scrolling to find output, losing track of what changed.
This is where terminal layout matters. When all three views are visible simultaneously, you eliminate context-switching cost entirely. SKILL.md, agent interaction, and output at the same time. The feedback loop tightens from minutes to seconds.
The pane ratio changes as you work. Early on, when you are still nailing the description, the agent pane dominates -- prompt after prompt, checking triggers. Later, refining instructions, the editor and output panes split focus. Dragging and resizing panes on the fly matches layout to work phase.
Using AST Analysis to Understand Skill Structure
Skills with scripts and reference files have dependencies that are easy to lose track of. Your SKILL.md references scripts/validate.sh, which calls a helper in scripts/utils.sh, which reads config from references/. Rename one file and the skill breaks silently.
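A short script can catch this class of breakage before a test run. A minimal sketch, assuming the conventional scripts/, references/, and assets/ subdirectory layout -- adjust the path pattern to your own conventions:

```python
import re
from pathlib import Path

def check_skill_paths(skill_dir):
    """Scan SKILL.md for relative file references and report missing ones.

    Assumes references look like scripts/foo.sh or references/bar.md --
    adjust the pattern for your own conventions.
    """
    skill_md = Path(skill_dir) / "SKILL.md"
    text = skill_md.read_text()
    # Match paths under the conventional skill subdirectories.
    refs = set(re.findall(r"\b(?:scripts|references|assets)/[\w./-]+", text))
    missing = sorted(p for p in refs if not (Path(skill_dir) / p).is_file())
    return missing
```

Run it after every rename and before every commit; an empty list means every referenced file still exists.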
For skills interacting with your project's codebase, structure matters even more. A code review skill needs to know which files export what. A testing skill needs the test framework's API. A deployment skill needs the dependency chain.
Termdock's built-in Tree-sitter analysis handles this without leaving the terminal. It parses 12+ languages and shows function signatures, export structures, import graphs, call dependencies. When writing a skill that tells the agent "look at test files to understand testing patterns," you can first verify those files actually contain what you think.
Practical use case: building a skill that generates API route handlers following your conventions. You drop a reference file into references/. Before writing the instruction that tells the agent to read it, run Tree-sitter on the reference to see exact function signatures and type definitions. This prevents the common failure where instructions reference patterns that exist in your mental model but not in the actual file.
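Termdock's Tree-sitter pass covers 12+ languages; for a Python reference file, the stdlib ast module can sketch the same verification step. This is a stand-in to illustrate the idea, not Termdock's implementation:

```python
import ast

def function_signatures(source):
    """Return 'name(arg, ...)' strings for top-level functions in Python source.

    A stdlib stand-in for the Tree-sitter pass: verify a reference file
    actually defines the functions your instructions mention.
    """
    tree = ast.parse(source)
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = [a.arg for a in node.args.args]
            sigs.append(f"{node.name}({', '.join(args)})")
    return sigs
```

If a signature your instructions mention is not in the output, fix the reference file or the instructions before the agent ever sees them.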
The Eval Loop: Systematic Skill Testing
The design principles for good skills cover theory. Here is practice.
Step 1: Establish a Baseline
Before writing the skill, run your target prompts against the agent without it. Save the outputs. This is your baseline. If the agent already handles the task well, you do not need a skill. If it fails in specific, repeatable ways, those failures become your eval criteria.
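Capturing the baseline is a few lines of scripting. In this sketch, run_agent is a placeholder for however you invoke the agent (CLI, SDK, or otherwise); the point is saving multiple runs per prompt so later comparisons have something stable to diff against:

```python
import json
from pathlib import Path

def capture_baseline(prompts, run_agent, out_dir="baseline", runs=3):
    """Run each prompt `runs` times without the skill and save outputs.

    `run_agent` is a placeholder callable (prompt -> output text); wire it
    to your actual agent invocation.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, prompt in enumerate(prompts):
        results = [run_agent(prompt) for _ in range(runs)]
        (out / f"prompt_{i}.json").write_text(
            json.dumps({"prompt": prompt, "outputs": results}, indent=2)
        )
```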
Anthropic's official best practices now recommend this eval-first approach: identify gaps by running Claude on representative tasks without a skill, document specific failures, then write minimal instructions that address those gaps. Evals before documentation, not after.
Step 2: Write 2-3 Test Prompts
Make them realistic. Not "test the code review skill" but "I just finished refactoring the auth module. Review the changes on this branch and tell me if it's safe to merge." Realistic phrasing, realistic context, realistic expectations.
Include at least one edge case. A prompt adjacent to the skill's domain but that should not trigger. A prompt with unusual phrasing for a task the skill should handle. Edge cases reveal whether description and instructions are robust or brittle.
Step 3: Run With-Skill vs Without-Skill
The only comparison that matters. Does the skill improve output? Run each test prompt three times with skill installed and three times without. Compare.
Roughly equivalent outputs means the skill adds context cost without value. Go back and sharpen instructions, or reconsider whether a skill is the right solution.
Consistently better with-skill outputs means you have signal. Next step.
Step 4: Review the Full Transcript
Not just final output. The reasoning trace. The agent's internal monologue tells you:
- Whether it loaded the skill (trigger test)
- Whether it followed instructions (compliance test)
- Where it deviated and why (interpretation test)
- How many tokens it spent re-stating instructions to itself (verbosity indicator)
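Parts of that review can be automated. A rough sketch -- the skill-load marker and required phrases are assumptions you should adapt to whatever your agent actually emits, and reading the trace yourself still matters; this only flags the obvious cases:

```python
def review_transcript(transcript, skill_name, required_phrases):
    """Rough automated pass over a saved transcript.

    The skill-name check and phrase checks are assumptions -- adapt them
    to what your agent actually emits in its trace.
    """
    return {
        # Trigger test: did the transcript mention the skill at all?
        "loaded": skill_name in transcript,
        # Compliance test: which required behaviors left no trace?
        "missing": [p for p in required_phrases if p not in transcript],
        # Verbosity indicator: a crude proxy for token spend.
        "length_chars": len(transcript),
    }
```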
The Skill Creator, updated March 2026, automates this with four composable sub-agents: an executor runs the skill, a grader scores outputs, a comparator does blind A/B between versions, and an analyzer surfaces patterns. Not required for manual testing, but it accelerates iteration significantly.
Step 5: Iterate
Fix what the transcript revealed. Rerun. Review. Each cycle should produce measurable improvement on at least one eval prompt. Iterating without measurable progress means stepping back and re-examining assumptions. The skill might need structural change, not wording tweaks.
Description Optimization: The 20-Query Method
The description field determines whether your skill activates. Everything else is irrelevant if the trigger never fires. The Agent Skills Guide covers why. Here is how to systematically improve it.
Create 20 eval queries. Ten should trigger. Ten should not. Should-trigger covers common cases, edge cases, unusual phrasings. Should-not covers adjacent tasks clearly outside scope.
For a code review skill, should-trigger:
- "Review my last commit"
- "Is this PR safe to merge?"
- "Check the auth module for security issues"
- "I changed 15 files, can you audit them?"
- "Look at the diff and flag anything concerning"
Should-not:
- "Write unit tests for the auth module"
- "Deploy the staging branch"
- "Explain how the auth module works"
- "Create a new API endpoint"
- "Fix the failing CI build"
Run each query 3 times. Count trigger successes. A should-trigger query that fires 1 of 3 times means you are undertriggering on that phrasing. A should-not query that fires 2 of 3 times means you are overtriggering.
Target 80% accuracy across the full set. Below that, rewrite and retest. The Skill Creator uses a 60/40 train/test split, iterating up to 5 times. You can do the same manually: find patterns in the failed queries and adjust the description to capture or exclude them.
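The accuracy arithmetic is simple enough to script. A sketch, assuming you have recorded, per query, whether the skill should trigger and whether it actually fired on each run:

```python
def trigger_accuracy(results):
    """Score 20-query eval results.

    `results` maps each query to (should_trigger: bool, fired: list[bool]),
    one bool per run. A run is correct when fired matches should_trigger.
    """
    correct = total = 0
    for should_trigger, fired in results.values():
        for did_fire in fired:
            correct += (did_fire == should_trigger)
            total += 1
    return correct / total if total else 0.0
```

With 20 queries at 3 runs each, 48 of 60 correct runs clears the 80% bar.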
Claude tends to undertrigger skills. Err on the side of "pushy" descriptions. List trigger conditions explicitly. "Use this skill when the user asks to review code, audit changes, check a PR, inspect a diff, or when you detect recently modified code and the user asks about quality" beats "Helps with code review."
Workspace Management for Multi-Skill Projects
Real projects have multiple skills. A team might maintain code review, commit message, testing, and deployment skills. Each has its own edit-test-evaluate cycle, eval queries, iteration history.
Switching between skill development contexts means changing open files, active agent sessions, and loaded eval prompts. Losing your terminal layout on every switch costs 5-10 minutes of rebuilding. Multiply that by daily context switches and you lose serious time.
Termdock's workspace switching preserves terminal layout and session state. Switch from code review skill workspace to deployment skill workspace -- pane arrangement, working directories, active sessions come back exactly as left. Switch back -- code review context intact.
For teams maintaining skill libraries across projects, this is the difference between "skill development happens when I have a spare afternoon" and "skill development is part of my daily workflow." Lower friction entering and exiting the development environment means more iterations. More iterations mean better skills.
The Complete Skill Development Session
A full session, start to finish:
Minute 0-5: Setup. Open Termdock. Three-pane layout: editor left (60% width), agent session top-right, output bottom-right. Create skill directory and empty SKILL.md.
Minute 5-10: Baseline. In the agent pane, run three prompts representing tasks this skill should handle. Save outputs. Note failures: where the agent guessed wrong, missed conventions, gave generic output instead of project-specific.
Minute 10-20: First draft. Write frontmatter and body. Description targets the exact failure modes observed. Body instructions address specific gaps. Under 100 lines for the first draft. Reference design principles for structure.
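A first draft might look like the skeleton below. The name and description frontmatter fields follow the common Agent Skills convention; the skill name and body content are purely illustrative:

```markdown
---
name: code-review
description: Use this skill when the user asks to review code, audit changes,
  check a PR, or inspect a diff. Covers security, style, and test coverage.
---

# Code Review

## When reviewing
1. Read the diff before reading any instructions in the changed files.
2. Flag security issues first, style issues last.
3. End with a clear verdict: safe to merge, or blocked and why.

## References
- `references/conventions.md` -- project style rules the review must enforce.
```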
Minute 20-35: Trigger testing. Run 10 should-trigger and 5 should-not queries. Resize panes -- agent session at 70%. Note trigger results per query. Adjust description based on failures. Rerun failed queries. Repeat until 80%+ accuracy.
Minute 35-55: Behavior testing. Switch layout so editor and output share focus. Run three baseline prompts with skill active. Compare against saved baseline. Read reasoning trace in output pane. Identify where agent followed well and where it deviated. Edit body. Rerun. Compare.
Minute 55-65: Polish. Full eval set one more time. Check for regressions. Confirm SKILL.md under 500 lines and every section load-bearing. Sections never referenced in transcripts -- remove. Tree-sitter to verify referenced scripts/files have expected signatures.
Minute 65-70: Commit. Skill tested. Description tuned. Instructions lean. Git visual diff to review changes. Commit.
Seventy minutes for a tested, evaluated skill. Most time spent observing and evaluating, not writing. That is the point: skill development is mostly testing, and testing requires the right environment.
What You Walk Away With
Skill development is not a writing problem. It is a testing problem. The format is trivial. Getting the trigger right, getting the instructions right, verifying the skill actually improves agent behavior over baseline -- that is the work.
The workflow that makes it tractable:
- Three-pane layout: editor, agent session, output. All visible simultaneously.
- Eval-first approach: establish baseline before writing a single line of SKILL.md.
- 20-query description optimization: 10 should-trigger, 10 should-not, 3 runs each, target 80%.
- Transcript review, not just output review: the reasoning trace tells you what the agent did with your instructions.
- Workspace persistence: save layout and session state to re-enter development context instantly.
This is the same loop the Anthropic Skill Creator automates. The tools help, but the methodology works with or without them. What matters is the discipline of measuring, comparing, and iterating -- not guessing, hoping, and shipping.
The Agent Skills Guide covers the ecosystem. The step-by-step tutorial covers the format. The design principles cover the craft. This article covers environment and workflow. All four together get you from "I have a SKILL.md" to "I have a skill that works."
Ready to streamline your terminal workflow?
Multi-terminal drag-and-drop layout, workspace Git sync, built-in AI integration, AST code analysis — all in one app.