
The Skill Developer's Workflow: Building and Testing Agent Skills in Termdock

Writing a SKILL.md takes 5 minutes. Getting it to work reliably takes 5 hours. Learn the edit-test-evaluate loop, systematic description optimization, and the terminal layout that closes the gap.

DH
Danny Huang

The Problem Nobody Talks About

Writing a SKILL.md takes five minutes. The step-by-step tutorial walks you through directory structure, frontmatter, and body in less time than it takes to brew coffee. You end up with a syntactically valid skill that looks right.

Then you test it.

The skill does not trigger when you expect. Or it triggers on the wrong prompts. Or it triggers correctly but the agent interprets your instructions in a way you did not anticipate, producing output that is technically following orders but missing the point entirely. You open the SKILL.md, tweak a sentence, test again, get a different wrong result, tweak again, test again.

This gap between "skill exists" and "skill works" is where most authors give up. The format is easy. The iteration is hard. And the iteration is hard because the feedback loop is fundamentally different from what developers are used to.

When you write code, the computer executes your instructions deterministically. When you write a SKILL.md, an AI interprets your instructions probabilistically. The same words can produce different behavior across runs. A description that seems obvious to you might be ambiguous to the model. An instruction you consider clear might conflict with something else in the agent's context window.

This means skill development requires a different kind of workflow. Not just a text editor. A testing environment where you can edit, observe, and evaluate in tight, repeatable cycles.

What Makes Skill Development Different

Regular coding has a short, deterministic feedback loop. You change a line, run the program, and the output is the same every time. You can reason forward from code to behavior with near-perfect accuracy.

Skill development breaks that contract. The "runtime" is a language model, and language models are stochastic. Three things make the loop fundamentally different:

The output varies between runs. You write one set of instructions. The agent reads them and produces output. You run the exact same prompt again and get slightly different output. This is normal. It means a single test run proves very little. You need multiple runs to distinguish signal from noise.

The failure mode is subtle. When code breaks, it throws an error. When a skill misbehaves, the agent produces plausible-looking output that is wrong in quiet ways. It follows most of your instructions but skips the one you cared about most. It uses the right format but invents facts. You cannot grep for these failures. You have to read the output carefully, ideally alongside the agent's reasoning trace.

The trigger mechanism is separate from the instructions. A skill has two independent problems: does it activate when it should (the description field), and does it do the right thing once active (the body). A skill that never triggers is invisible no matter how good the instructions are. A skill with a perfect description but bad instructions is worse than no skill at all. You have to test both independently.

These three properties mean you cannot develop skills the way you develop code. You need a workflow that accounts for variance, rewards careful observation, and separates trigger testing from behavior testing.

The Edit-Test-Evaluate Loop

The core workflow for skill development is a three-pane setup:

Left pane: your editor. The SKILL.md file is open. You edit the frontmatter, the description, the body instructions. Every change is a hypothesis: "If I rephrase this section, the agent will handle edge case X correctly."

Right pane: the agent session. You type a prompt that should trigger the skill. Or a prompt that should not trigger it. The agent responds. You watch what happens.

Bottom pane: the output and transcript. The agent's reasoning trace, the full output, the files it touched. This is where you evaluate whether the hypothesis held.

The loop is: edit, switch to the agent pane, test, switch to the output pane, evaluate, switch back to the editor, refine. Each cycle should take 60 to 90 seconds. If it takes longer, the bottleneck is usually context switching: hunting for the right terminal tab, scrolling to find the output, losing track of what you changed.

This is where terminal layout matters. When all three views are visible simultaneously, you eliminate the context-switching cost entirely. You see the SKILL.md, the agent interaction, and the output at the same time. The feedback loop tightens from minutes to seconds.

Try Termdock: drag-resize terminals work out of the box. Free download →

The ratio of panes changes as you work. Early in development, when you are still getting the description right, the agent pane dominates. You are running prompt after prompt, checking whether the skill triggers. Later, when you are refining instructions, the editor and output panes split the focus. Being able to drag and resize panes on the fly lets you match the layout to the phase of work.

Using AST Analysis to Understand Skill Structure

Skills that include scripts and reference files have dependencies that are easy to lose track of. Your SKILL.md references scripts/validate.sh, which calls a helper function from scripts/utils.sh, which reads a config file from references/. Rename one file and the skill breaks silently.
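This kind of silent breakage is easy to catch with a small script that scans the SKILL.md for path-like references and checks that each one still exists on disk. A minimal sketch; the regex and the `scripts/` and `references/` naming convention are assumptions taken from the example above, not a fixed standard:

```python
import re
from pathlib import Path

def find_missing_references(skill_dir: str) -> list[str]:
    """Scan SKILL.md for path-like references (scripts/..., references/...)
    and return any that do not exist relative to the skill directory."""
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text()
    # Assumed convention: references look like scripts/foo.sh or references/bar.md
    paths = re.findall(r"\b(?:scripts|references)/[\w./-]+", text)
    return [p for p in sorted(set(paths)) if not (root / p).exists()]
```

Running this before each test cycle turns "the skill breaks silently" into a one-line report of dangling references.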

For skills that interact with your project's actual codebase, understanding the structure matters even more. A code review skill needs to know which files export what. A testing skill needs to understand the test framework's API. A deployment skill needs to trace the dependency chain.

Termdock's built-in Tree-sitter analysis handles this without leaving the terminal. It parses 12+ languages and shows you function signatures, export structures, import graphs, and call dependencies. When you are writing a skill that tells the agent "look at the test files to understand the project's testing patterns," you can first run the AST analysis yourself to verify that what you are pointing the agent toward actually contains what you think it contains.

The practical use case: you are building a skill that generates API route handlers following your project's conventions. You drop a reference file into the skill's references/ directory. Before writing the instruction that tells the agent to read it, you run Tree-sitter on the reference to see the exact function signatures and type definitions the agent will encounter. This prevents the common failure mode where your instructions reference patterns that exist in your mental model but not in the actual file.
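If the reference file is Python, you can do a rough version of this check with the standard-library `ast` module. This is a stand-in sketch, not Termdock's Tree-sitter implementation: it lists top-level function signatures so you can compare them against what your instructions claim is in the file:

```python
import ast

def list_signatures(source: str) -> list[str]:
    """Return 'name(arg, ...)' for every top-level function in Python source."""
    tree = ast.parse(source)
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"{node.name}({args})")
    return sigs
```

If `list_signatures` disagrees with your mental model of the reference file, fix the instructions before the agent ever reads them.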

Try Termdock: AST code analysis works out of the box. Free download →

The Eval Loop: Systematic Skill Testing

The design principles for good skills cover the theory. Here is the practice.

Step 1: Establish a Baseline

Before you write the skill, run your target prompts against the agent without it. Save the outputs. This is your baseline. If the agent already handles the task well, you do not need a skill. If it handles the task poorly in specific, repeatable ways, those failures become your eval criteria.

Anthropic's official best practices now recommend this eval-first approach: identify gaps by running Claude on representative tasks without a skill, document specific failures, then write minimal instructions that address those gaps. Evals before documentation, not after.

Step 2: Write 2-3 Test Prompts

These should be realistic. Not "test the code review skill" but "I just finished refactoring the auth module. Review the changes on this branch and tell me if it's safe to merge." Realistic phrasing, realistic context, realistic expectations.

Include at least one edge case. A prompt that is adjacent to the skill's domain but should not trigger it. A prompt that uses unusual phrasing for a task the skill should handle. Edge cases reveal whether your description and instructions are robust or brittle.

Step 3: Run With-Skill vs Without-Skill

This is the only comparison that matters. Does the skill improve the output? Run each test prompt three times with the skill installed and three times without. Compare the results.

If the outputs are roughly equivalent, the skill is adding context cost without adding value. Go back and make the instructions more specific, or reconsider whether a skill is the right solution.

If the with-skill outputs are consistently better, you have signal. Move to the next step.
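The comparison itself is mechanical enough to script. A minimal sketch, where `run_agent` is a hypothetical hook you would wire to your agent's CLI or API, not a real library call:

```python
from typing import Callable

def compare_runs(prompts: list[str],
                 run_agent: Callable[[str, bool], str],
                 runs: int = 3) -> dict:
    """Run each prompt `runs` times with and without the skill installed.
    run_agent(prompt, skill_enabled) is an assumed hook to your agent."""
    results = {}
    for prompt in prompts:
        results[prompt] = {
            "with_skill": [run_agent(prompt, True) for _ in range(runs)],
            "without_skill": [run_agent(prompt, False) for _ in range(runs)],
        }
    return results
```

Saving the paired outputs to files makes the side-by-side read in the output pane much faster.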

Step 4: Review the Full Transcript

Do not just read the final output. Read the reasoning trace. The agent's internal monologue tells you:

  • Whether it loaded the skill (trigger test)
  • Whether it followed the instructions (compliance test)
  • Where it deviated and why (interpretation test)
  • How many tokens it spent re-stating your instructions back to itself (verbosity indicator)
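The first three checks above can be roughed out automatically before the careful read. A sketch, with the caveat that keyword matching is a crude proxy for actually reading the trace, and the whitespace count only approximates tokens:

```python
def audit_transcript(transcript: str, skill_name: str,
                     must_follow: list[str]) -> dict:
    """Rough first-pass checks on an agent transcript: did the skill load,
    and which required instructions are echoed anywhere in the trace?"""
    lowered = transcript.lower()
    return {
        "skill_loaded": skill_name.lower() in lowered,
        "instructions_followed": {k: k.lower() in lowered for k in must_follow},
        "approx_tokens": len(transcript.split()),  # whitespace split, rough proxy
    }
```

Anything the audit flags as missing is where to start reading the full trace by hand.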

The Skill Creator, updated in March 2026, automates this with four composable sub-agents: an executor that runs the skill, a grader that scores outputs, a comparator that does blind A/B testing between skill versions, and an analyzer that surfaces patterns. You do not need the Skill Creator to follow this process manually, but it accelerates iteration significantly if you have access.

Step 5: Iterate

Fix what the transcript revealed. Rerun. Review again. Each cycle should produce a measurably better result on at least one eval prompt. If you are iterating without measurable improvement, step back and re-examine your assumptions. The skill might need a structural change, not a wording tweak.

Description Optimization: The 20-Query Method

The description field determines whether your skill activates. Everything else is irrelevant if the trigger never fires. The Agent Skills Guide covers why this matters. Here is how to systematically improve it.

Create 20 eval queries. Ten should trigger the skill. Ten should not. The should-trigger set covers the common case, edge cases, and unusual phrasings. The should-not set covers adjacent tasks that are close but clearly outside the skill's scope.

For a code review skill, the should-trigger set might include:

  • "Review my last commit"
  • "Is this PR safe to merge?"
  • "Check the auth module for security issues"
  • "I changed 15 files, can you audit them?"
  • "Look at the diff and flag anything concerning"

The should-not set:

  • "Write unit tests for the auth module"
  • "Deploy the staging branch"
  • "Explain how the auth module works"
  • "Create a new API endpoint"
  • "Fix the failing CI build"

Run each query 3 times. Count trigger successes. If a should-trigger query only fires 1 out of 3 runs, your description is undertriggering for that phrasing. If a should-not query fires 2 out of 3, you are overtriggering.

Target 80% accuracy across the full set. Below that, rewrite the description and retest. The Skill Creator's description optimization uses a 60/40 train/test split and iterates up to 5 times. You can do the same manually: take the queries that failed, figure out what words or patterns they share, and adjust the description to capture or exclude them.
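Scoring the set is simple bookkeeping once you record each run's trigger outcome as a boolean. A sketch of the accuracy calculation described above:

```python
def trigger_accuracy(results: dict, should_trigger: set) -> float:
    """results maps each query to its per-run trigger outcomes (e.g. 3 booleans).
    A run is correct when it triggered iff the query is in should_trigger."""
    correct = total = 0
    for query, runs in results.items():
        expected = query in should_trigger
        correct += sum(1 for fired in runs if fired == expected)
        total += len(runs)
    return correct / total
```

A should-trigger query that fires 2 of 3 runs and a should-not query that stays quiet all 3 runs together score 5/6, just above the 80% bar.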

Claude tends to undertrigger skills, so err on the side of being "pushy" in your description. List trigger conditions explicitly. "Use this skill when the user asks to review code, audit changes, check a PR, inspect a diff, or when you detect that code has been recently modified and the user is asking about quality" is better than "Helps with code review."
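Concretely, a "pushy" description in the SKILL.md frontmatter might look like this. The `name` and `description` fields follow the standard SKILL.md frontmatter convention; the wording itself is illustrative:

```markdown
---
name: code-review
description: >
  Use this skill when the user asks to review code, audit changes, check a
  PR, inspect a diff, or when code has been recently modified and the user
  is asking about quality. Do not use it for writing tests, deploying, or
  explaining how code works.
---
```

Note that the description names the should-not cases too; explicit exclusions are how you fix overtriggering revealed by the eval set.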

Workspace Management for Multi-Skill Projects

Most real projects have multiple skills. A team might maintain a code review skill, a commit message skill, a testing skill, and a deployment skill. Each skill has its own edit-test-evaluate cycle, its own eval queries, its own iteration history.

Switching between skill development contexts means changing which files are open, which agent session is active, which eval prompts are loaded. If you lose your terminal layout every time you switch, you lose 5-10 minutes rebuilding the environment. Multiply that by the number of context switches in a day and you are losing serious time.

Termdock's workspace switching preserves terminal layout and session state. When you switch from the code review skill workspace to the deployment skill workspace, your pane arrangement, working directories, and active sessions come back exactly as you left them. When you switch back, the code review context is intact.

For teams maintaining skill libraries across multiple projects, this is the difference between "skill development is something I do when I have a spare afternoon" and "skill development is part of my daily workflow." The lower the friction of entering and exiting the development environment, the more iterations happen, and more iterations mean better skills.

The Complete Skill Development Session

Here is what a full skill development session looks like, from start to finish.

Minute 0-5: Setup. Open Termdock. Create a three-pane layout: editor on the left (60% width), agent session top-right, output bottom-right. Create the skill directory and an empty SKILL.md.

Minute 5-10: Baseline. In the agent pane, run three prompts that represent the task this skill should handle. Save the outputs. Note specific failures: where the agent guessed wrong, where it missed a project convention, where it produced generic output instead of project-specific output.

Minute 10-20: First draft. In the editor pane, write the frontmatter and body. The description targets the exact failure modes you observed. The body instructions address the specific gaps. Keep it under 100 lines for the first draft. Reference the design principles for guidance on structure.

Minute 20-35: Trigger testing. Run 10 should-trigger and 5 should-not-trigger queries. Resize the panes so the agent session takes 70% of the screen. For each query, note whether the skill triggered. Adjust the description based on failures. Run the failing queries again. Repeat until trigger accuracy is above 80%.

Minute 35-55: Behavior testing. Switch the layout so the editor and output panes share focus equally. Run the three baseline prompts with the skill active. Compare outputs against the baseline you saved earlier. Read the reasoning trace in the output pane. Identify where the agent followed instructions well and where it deviated. Edit the SKILL.md body. Rerun. Compare.

Minute 55-65: Polish. Run the full eval set one more time. Check for regressions: did fixing one behavior break another? Verify that the SKILL.md is under 500 lines and that every section is load-bearing. If any section was never referenced in the transcripts, remove it. Use Tree-sitter to verify that any referenced scripts or files have the signatures your instructions assume.

Minute 65-70: Commit. The skill is tested, the description is tuned, the instructions are lean. Use the git visual diff to review exactly what changed, then commit.

The whole session is 70 minutes for a well-tested, properly evaluated skill. Most of that time is spent observing and evaluating, not writing. That is the point: skill development is mostly testing, and testing requires the right environment.

What You Walk Away With

Skill development is not a writing problem. It is a testing problem. The format is trivial. Getting the trigger right, getting the instructions right, and verifying that the skill actually improves agent behavior over the baseline: that is the work.

The workflow that makes this tractable:

  1. Three-pane layout: editor, agent session, output. All visible simultaneously.
  2. Eval-first approach: establish a baseline before writing a single line of SKILL.md.
  3. 20-query description optimization: 10 should-trigger, 10 should-not, 3 runs each, target 80% accuracy.
  4. Transcript review, not just output review: the reasoning trace tells you what the agent did with your instructions.
  5. Workspace persistence: save your layout and session state so you can re-enter the development context instantly.

This is the same loop the Anthropic Skill Creator automates. The tools help, but the methodology works with or without them. What matters is the discipline of measuring, comparing, and iterating, not guessing, hoping, and shipping.

The Agent Skills Guide covers the full ecosystem. The step-by-step tutorial covers the format. The design principles cover the craft. This article covers the environment and the workflow. All four pieces together get you from "I have a SKILL.md" to "I have a skill that works."


Ready to streamline your terminal workflow?

Multi-terminal drag-and-drop layout, workspace Git sync, built-in AI integration, AST code analysis — all in one app.

Download Termdock →
#agent-skills #skill-md #claude-code #termdock #developer-workflow #testing
