skill.yaml reference

Every skill in a skillpack bundle is a top-level directory that contains a skill.yaml — skills are auto-discovered by scanning the bundle root, so there is no skills/ wrapper. The directory name is the skill's name (slug-style, [a-z0-9][a-z0-9_-]*, unique within the bundle). Each skill.yaml is the source of truth for one skill: its entrypoint, model, schemas, tools, examples, and evals. The bundle root may also carry an optional puras.yaml — the pack manifest: the CLI's remote binding plus the skillpack's own page content (title, description) — see _build_pack.

Bundle layout:

my-skillpack/            # the bundle root
├── puras.yaml           # optional — pack manifest (binding + pack-page content)
└── my-skill/            # any top-level dir with a skill.yaml is a skill
    ├── skill.yaml       # this manifest
    ├── SKILL.md         # agentic only — the system prompt
    ├── scripts/main.py  # deterministic only — the entrypoint function
    └── references/      # optional — files the agent can read at runtime

Top-level skill.yaml fields:

yaml

title: Human-friendly Name      # optional — display label (cards, playground, SEO);
                                # falls back to the skill's directory name
description: One-line summary    # what the skill does — the lede on its public page

entrypoint: SKILL.md            # required. A `.md` file runs an agentic LLM loop with
                                # the file as the system prompt; `scripts/main.py:run`
                                # runs a deterministic function called with the inputs.
mode: task                      # optional, agentic only — `task` (default: one input
                                # runs to a structured result and ends) or `chat` (the
                                # skill is a multi-turn chat agent — its SKILL.md is the
                                # system prompt and its tools are the agent's tools; each
                                # user message continues the same conversation and the
                                # reply is the model's free-form text). A `chat` skill
                                # needs no input/output schema. See "Chat skills" below.

text_model: claude/sonnet-4-6   # optional, agentic only — the LLM, as a `family/variant`
                                # slug (see /pricing). `model:` is the rejected old name.
image_model: google/nano-banana # optional — default family for generate_image calls
video_model: bytedance/seedance # optional — default family for generate_video calls
audio_model: elevenlabs/tts     # optional — default family for generate_audio calls
visible_fields: [brief, product_images] # optional — non-required input fields
                                # to show expanded in the playground
disable_bash: false             # optional, agentic only — strip the bash tool
cache_ttl: 5m                   # optional, agentic only — Anthropic prompt-cache TTL:
                                # "5m" (default) or "1h" (survives long tool gaps
                                # between LLM turns, at a higher cache-write cost)

input_schema:  { Puras dialect } # required — validated before the run; drives the
                                 # playground form. Keeps an explicit `required` list.
output_schema: { Puras dialect } # required — validated after the run; drives result
                                 # rendering. Omits `required` (see the output contract).

tools:     [ ... ]              # optional, agentic only — agent-callable helpers
examples:  [ ... ]              # optional — playground seed scenarios
evals:     [ ... ]              # optional, agentic only — per-run graders
routing:   { ... }              # optional, agentic only — model tiering/escalation
allowed_tools: [ ... ]          # optional, agentic only — least-privilege tool whitelist
tool_limits: { tool: max }      # optional, agentic only — per-run, per-tool call caps
guardrails: { ... }             # optional, agentic only — runtime safety rails the worker
                                # ENFORCES on inputs/outputs/tool calls/retrievals (distinct
                                # from evals, which only score). See the Guardrails section.

Each block (tools, examples, evals) is detailed in its own section below.

Chat skills (`mode: chat`)

By default a skill is a task: one inputs payload runs the agent to a single structured output_schema result and the run ends. Set mode: chat to instead use the skill as a chat agent — a back-and-forth conversation where the skill's SKILL.md is the system prompt and its declared tools: are the agent's tools, exactly like a task skill, but the run pauses after each assistant reply to wait for the next user message instead of finishing.

yaml

entrypoint: SKILL.md
mode: chat                      # this skill is a conversational agent
# input_schema / output_schema are optional for chat skills — the turn input is
# always the user's message and the reply is the model's free-form text.
tools:                          # the agent's tools work identically to a task skill
  - name: search_docs
    entrypoint: tools/search.py:run
    input_schema:  { ... }
    output_schema: { ... }

A conversation is opened against a chat skill (POST /v1/conversations), and each POST /v1/conversations/{id}/messages runs one turn: the worker loads the conversation so far, appends the user message, runs the same agent tool-use loop the platform runs for task skills (events stream over the job's SSE), then persists the updated transcript and returns the assistant's reply. confirm: true tool gates, per-call usage/billing, the prompt cache, and model routing all apply unchanged. mode: chat is agentic-only.

Subskills

A top-level skill can have subskills nested under <X>/subskills/<Y>/. They use the same skill.yaml format, but they're treated differently:

Their qualified name is <X>/<Y> (parent / sub).
They are hidden from the API submit endpoint, the public explore listing, and the playground — i.e. they can't be invoked as top-level skills.
They are callable ONLY from their parent skill's runtime, via puras.subagent.run("<Y>", ...). The /v1/subagent/invoke resolver tries <parent>/<Y> first when the caller is a parent skill, so the parent references its subskills by their bare name.

Use subskills for pipeline-internal helpers that don't make sense to publish — research stages, render stages, etc. Use top-level skills for anything you'd want callable from elsewhere (MCP, dashboard, marketplace).

Tools (`tools:`)

A tool declared in an agentic skill's tools: list — a helper the agent can call from inside the loop, on top of the built-in tools. Agentic skills only.

Two shapes:

Local Python tool: entrypoint + input_schema + output_schema are all set; the tool runs as a function in this skill's deployment. The worker validates the call's input and the function's return against those schemas (same Puras dialect as the skill's own schemas).
Skill tool: skill is set (a bare skill name that must resolve in this same deployment). The tool dispatches via /v1/subagent/invoke; its schemas + description are copied from the target skill at load time, so you don't restate them (and must not). Use this to give an agentic skill other skills — top-level OR its own subskills — as callable tools.

Add confirm: true to a side-effectful tool (send, publish, delete, pay) to gate it behind a human approval: the run pauses and a person must approve or deny the call from the dashboard before it executes. The gate is enforced by the worker off this deploy-time flag, so the model can't bypass it.

yaml

tools:
  - name: search_inventory          # required — slug style, unique per skill
    description: Look up SKU stock.  # shown to the agent
    entrypoint: tools/inventory.py:run   # local Python tool
    input_schema:  { type: object, required: [sku], properties: { sku: { type: string } } }
    output_schema: { type: object, properties: { in_stock: { type: integer } } }

  - name: send_invoice               # side-effect → needs human approval
    description: Email the customer their invoice.
    entrypoint: tools/invoice.py:run
    input_schema:  { type: object, required: [to], properties: { to: { type: string } } }
    output_schema: { type: object, properties: { sent: { type: boolean } } }
    confirm: true                    # pause the run for an approve/deny

  - name: research                   # skill tool — schemas copied from the target
    skill: deep-research

Examples (`examples:`)

One entry in a skill's examples: list — a playground seed scenario.

Each example is a complete, ready-to-run inputs payload that must match the skill's input_schema, plus optional labels. The playground seeds its form with examples[0].inputs on mount and renders the rest as clickable chips, so good examples double as "Try it" discoverability and as a smoke baseline you can test against.

yaml

examples:
  - title: short label         # optional — chip label; falls back to "Example N"
    description: 1-line note    # optional — tooltip on the chip
    inputs: { ... }             # required — full input matching input_schema
    outputs: { ... }            # optional — pre-computed result, display only

Evals (`evals:`)

One grader in an agentic skill's evals: list. Evals are to a skill what unit tests are to code: each grader scores a run's output in [0,1]; the weighted mean (×100) is the run's eval_score, shown on the job cards/tables. A skill with no evals: simply produces no score. Agentic skills only.

Graders run in two contexts: after every live run (scoring that run) and — when the skill ships an evals.dataset — across an offline eval suite (POST /v1/skillpacks/{id}/evals), which runs the skill against each dataset case N times and aggregates pass-rate / cost / latency / variance. Four kinds:

kind: check — a deterministic Python grader. entrypoint (<file.py>:<func>, relative to the skill dir, same form as a tool) is called with (inputs, output) and returns {score, passed, detail} — the objective, unit-test layer (limits, counts, schema-shape assertions).
kind: rubric — an LLM-as-judge grader. criteria (+ optional anchored levels, a {"0": "...", "1": "..."} map) is handed to the skill's text model, which returns a 0..1 score with reasoning — the qualitative layer (voice, fidelity, language).
kind: exact_match — deterministic, free. Compares the output to the case's expected (from the dataset). field (optional dotted path like label or result.category) narrows it to one value; omit it to compare the whole output. Only scores in a suite where the case carries an expected; skipped on a live run.
kind: schema — deterministic, free. Validates the output against a JSON Schema. schema (a Puras-dialect mapping) gives an explicit shape; omit it to validate against the skill's own output_schema.

Each grader takes an optional weight (positive number, default 1.0) that sets its share of the weighted mean.

yaml

evals:
  dataset: evals/cases.jsonl     # optional — JSONL cases for the offline suite
  graders:
    - name: within_limits        # required — slug style, unique per skill
      kind: check
      weight: 2                   # optional — default 1.0
      entrypoint: evals/limits.py:grade
    - name: right_category
      kind: exact_match           # output (or `field`) must equal case.expected
      field: category
    - name: well_formed
      kind: schema                # validates output against the skill's output_schema
    - name: on_brand
      kind: rubric
      criteria: The copy matches the brand voice — confident, warm, never salesy.
      levels:                     # optional — anchored score → meaning
        "0": Off-brand or salesy.
        "1": Perfectly on-brand.

evals: may also be written as a bare list of graders (no dataset:); both forms parse identically.

Tool scope & limits (`allowed_tools:`, `tool_limits:`)

Two optional, agentic-only controls that run a skill with least privilege and guard against runaway tool loops. Both are enforced on every run by the agent loop.

yaml

allowed_tools:                 # whitelist — if present, the agent is offered
  - bash                       # ONLY these tools (built-ins AND your declared
  - file_read                  # tools). Anything outside it is dropped from the
  - web_search                 # tool list and refused at call time (defense in
  - my_custom_tool             # depth). Omit = every tool is available.

tool_limits:                   # per-run, per-tool call caps. A call past the
  web_search: 10               # cap becomes a soft tool error the model can
  generate_image: 20           # react to — it can't loop a tool forever.

set_output is reserved run infrastructure: it is never gated by allowed_tools or tool_limits, so a tight whitelist can't strand a skill that needs to return its output. A global platform cap (MAX_TOOL_CALLS_PER_RUN) applies under any per-skill tool_limits.

Guardrails (`guardrails:`)

The guardrails: block lists runtime safety rails. Guardrails are to a skill what a firewall is to a service: each rail ENFORCES a policy at runtime and can change or stop the run — distinct from evals, which only score after the fact. A skill with no guardrails: runs unguarded. Agentic skills only.

A rail fires at one of four phases (on:): input (the incoming request, before the agent sees it), output (the final result, before it's returned), tool_call (each tool invocation, before it runs) or retrieval (documents pulled in by RAG/search, before they enter context). Each rail has a kind that decides what it inspects and an action that decides what happens when it trips.

Six kinds:

kind: regex — pattern (a non-empty, compilable regex) matched against the text. The cheap, deterministic layer for known bad strings.
kind: pii — detects personal data; entities narrows it to a subset of email, phone, ssn, credit_card, ip, iban, api_key (omit to scan all). Pairs naturally with action: redact.
kind: check — a deterministic Python rail. entrypoint (<file.py>:<func>, relative to the skill dir, same form as a tool) is called and decides pass/fail — arbitrary custom policy.
kind: schema — validates the payload against a JSON Schema (schema, a Puras-dialect mapping; omit it to reuse the skill's output_schema). Only on on: output or on: retrieval.
kind: llm_judge — an LLM-as-judge rail. criteria (non-empty) is handed to the model, which judges whether the payload complies — the qualitative layer.
kind: classifier — a safety classifier; categories narrows it to a subset of prompt_injection, jailbreak, toxicity, hate, self_harm, sexual, violence, illegal, pii (omit to score all). threshold (0..1, default 0.7) sets how confident a hit must be to trip.

Six actions (action:, defaults to the block's on_violation):

block — stop the run and fail with a violation.
flag — record the violation but let the run continue.
redact — strip the offending span from the payload and continue.
rewrite — have the model rewrite the payload to comply (on: output only).
require_approval — pause for a human approve/deny decision (on: tool_call only — gates a specific tool: when named, else every tool call).
escalate — hand off / escalate the run (on: output only).

The block accepts a bare list of rails, or a mapping with an on_violation: default action (one of block, flag, redact) that rails inherit when they omit their own action, plus an optional deploy gate:. The gate (max_violation_rate, 0..100) blocks a hosted deploy whose offline guardrail suite trips on more than that share of cases; it needs an evals.dataset to run the suite against.

yaml

guardrails:
  on_violation: block            # optional — default action for rails (block|flag|redact)
  gate:                          # optional — deploy gate; needs an evals.dataset
    max_violation_rate: 5        #   reject the deploy if >5% of suite cases trip
  rails:
    - name: no_secrets           # required — slug style, unique per skill
      on: output                 # required — input | output | tool_call | retrieval
      kind: regex
      pattern: "sk-[A-Za-z0-9]{20,}"
      action: redact             # optional — falls back to on_violation
    - name: strip_pii
      on: output
      kind: pii
      entities: [email, phone, ssn]
      action: redact
    - name: no_injection
      on: input
      kind: classifier
      categories: [prompt_injection, jailbreak]
      threshold: 0.8
      action: block
    - name: confirm_payments
      on: tool_call
      kind: check
      tool: charge_card          # optional — scope to one tool (tool_call only)
      entrypoint: guardrails/payments.py:check
      action: require_approval
    - name: on_brand
      on: output
      kind: llm_judge
      criteria: The reply never promises a refund outside policy.
      detail: Refunds are capped at 30 days.   # optional — note shown on a violation

guardrails: may also be written as a bare list of rails (no on_violation:/ gate: wrapper); both forms parse identically.

The pack manifest (`puras.yaml`)

The optional pack manifest — a puras.yaml at the bundle root. One file, two jobs: the CLI's remote binding (which skillpack this directory deploys to) and the authored content for the skillpack's own public page.

yaml

slug: my-pack            # CLI's remote binding hint (informational, ignored server-side)

title: My Pack           # display name — synced to the pack page H1 on an
                         # activating deploy
description: One-liner   # the lede under the title, and the pack's
                         # description in listings / search

Unknown keys are rejected (name: specifically points to title:), so a typo fails at deploy instead of being silently ignored — same contract as skill.yaml's TOP_LEVEL_KEYS. The legacy skillpack_id: key (still written by older CLIs) is accepted but ignored, so it never breaks a deploy.

Input & output schemas: the Puras dialect

Skill authors write input_schema / output_schema in a small Puras dialect that adds end-user-meaningful types on top of JSON Schema:

type: image | video | audio | file   # accept a file ref (string OR object)
type: text                            # multi-line string (textarea widget)
type: color                           # hex string (color-picker widget)

Standard JSON Schema types (string, number, integer, boolean, array, object, null) pass through unchanged. The frontend reads the dialect schema to pick widgets; validators read the translated schema (via to_jsonschema) for jsonschema-compatible Draft202012 validation.

The dialect recognizes these type values:

Puras-added (translated to JSON Schema before validation): audio, color, file, image, text, video.
Standard JSON Schema (passed through unchanged): array, boolean, integer, null, number, object, string.

The output contract

Output schemas in the Puras dialect don't spell out required: the contract is that a skill returns everything it declares. This walks a translated JSON Schema (post-to_jsonschema) and, for every object node that declares properties but no explicit required, sets required = list(properties).

An explicit required is left untouched, so an author who genuinely needs an optional output field can still opt it out by writing the list by hand.

Pairs with prune_extras: undeclared keys are dropped before validation, so output schemas need neither required nor additionalProperties.

Chat skills (mode: chat)

Subskills

Tools (tools:)

Examples (examples:)

Evals (evals:)

Tool scope & limits (allowed_tools:, tool_limits:)

Guardrails (guardrails:)

The pack manifest (puras.yaml)

Input & output schemas: the Puras dialect

The output contract

Chat skills (`mode: chat`)

Tools (`tools:`)

Examples (`examples:`)

Evals (`evals:`)

Tool scope & limits (`allowed_tools:`, `tool_limits:`)

Guardrails (`guardrails:`)

The pack manifest (`puras.yaml`)