
Agent attachments

How to feed images, PDFs, and text files to an agentic skill — via inputs.attachments at submit time, or the file_read tool mid-run.

Agentic skills can see images and documents — not just text. There are two routes:

  1. inputs.attachments — submit files together with the prompt. They become part of the first user message the model sees.
  2. file_read tool — the agent calls it mid-run to attach drive files to its own context.

Both routes produce the same content blocks (image / document / text) for the model. If the chosen model lacks vision or document support, a job submitted with such attachments fails upfront with a clear error.

inputs.attachments — submitting files with the prompt

json
{
  "skill": "ad-creative",
  "inputs": {
    "prompt": "Write a 60-word product blurb from this photo.",
    "attachments": [
      { "drive_path": "uploads/shoe.jpg" },
      { "url": "https://cdn.example.com/spec.pdf" },
      { "base64": "iVBORw0KGgo...", "media_type": "image/png" }
    ]
  }
}

(The submit body has no type field — agentic vs deterministic is decided by the skill's entrypoint in skill.yaml.)

Each entry takes one of three shapes. The convention matches the function inputs described in inputs-and-drive, but attachments is an explicit list so the worker knows to attach each file rather than inline it as text:

  • drive_path — file already in this project's drive. Resolved with path-traversal protection. Best for anything the user uploaded.
  • url — public HTTPS URL. For images and PDFs we pass the URL straight to the model — the worker doesn't download. For text/* we fetch first.
  • base64 — raw base64 (no data:..., prefix). Set media_type explicitly. Best for tiny generated payloads.

Optional media_type overrides MIME sniffing on any of the three shapes.
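
The three shapes can be checked client-side before submitting. The sketch below is only an illustration of the rules listed above: validateAttachment is a hypothetical helper, not part of the platform, and the platform's own validation may be stricter or differ in wording.

```javascript
// Hypothetical client-side check; not a platform API.
const SHAPE_KEYS = ["drive_path", "url", "base64"];

// Returns null when the entry looks well-formed, otherwise a short reason string.
function validateAttachment(entry) {
  const present = SHAPE_KEYS.filter(
    (key) => typeof entry[key] === "string" && entry[key] !== ""
  );
  if (present.length !== 1) {
    return `expected exactly one of ${SHAPE_KEYS.join(", ")}, got ${present.length}`;
  }
  const shape = present[0];
  if (shape === "url" && !entry.url.startsWith("https://")) {
    return "url must be a public HTTPS URL";
  }
  if (shape === "base64") {
    if (entry.base64.startsWith("data:")) {
      return "base64 must be raw, without a data: prefix";
    }
    if (!entry.media_type) {
      return "base64 entries need an explicit media_type";
    }
  }
  return null;
}
```

Running each entry through a check like this before the POST turns a malformed attachment into a local error instead of a rejected job.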

Supported file types

Kind | MIME | How the model sees it
Image | image/jpeg, image/png, image/gif, image/webp | Vision block: the model looks at the pixels directly.
Document | application/pdf | PDF block: text and page images together. Supported on the Claude family only.
Text | text/*, application/json, application/xml, application/yaml | Inlined as a text block (utf-8).

Anything else is rejected with a clear error. Hard limit: 5 MB per file. Text files are inlined up to 100k characters, then truncated.
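
The type table and limits above can be mirrored in a small pre-flight check. This is a minimal sketch under stated assumptions: kindForMime, precheck, and truncateText are hypothetical names, "5 MB" is taken to mean 5 × 1024 × 1024 bytes, and the 100k-character cap is approximated with JavaScript string length (UTF-16 code units). The platform's exact accounting may differ.

```javascript
// Assumption: "5 MB" means 5 MiB; the platform may count differently.
const MAX_FILE_BYTES = 5 * 1024 * 1024;
const MAX_TEXT_CHARS = 100000;

const IMAGE_TYPES = new Set(["image/jpeg", "image/png", "image/gif", "image/webp"]);
const TEXT_LIKE_TYPES = new Set(["application/json", "application/xml", "application/yaml"]);

// Maps a MIME type to the kind of block the model would receive, or null if unsupported.
function kindForMime(mediaType) {
  if (IMAGE_TYPES.has(mediaType)) return "image";
  if (mediaType === "application/pdf") return "document";
  if (mediaType.startsWith("text/") || TEXT_LIKE_TYPES.has(mediaType)) return "text";
  return null;
}

// Checks type and size before any upload or submit happens.
function precheck(mediaType, sizeBytes) {
  const kind = kindForMime(mediaType);
  if (kind === null) return { ok: false, reason: `unsupported type: ${mediaType}` };
  if (sizeBytes > MAX_FILE_BYTES) return { ok: false, reason: "file exceeds 5 MB" };
  return { ok: true, kind };
}

// Approximates the inline cap for text files; anything past the cap is dropped.
function truncateText(text) {
  return text.length > MAX_TEXT_CHARS ? text.slice(0, MAX_TEXT_CHARS) : text;
}
```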

file_read — letting the agent pull files into its own context

When the agent needs to look at something it didn't get up front (or wants to inspect a file it just wrote), it calls file_read:

jsonc
file_read({
  "paths": ["uploads/photo.jpg", "docs/spec.md", "renders/output.pdf"]
})

The tool result is a block list — one labeled header per file, then the content:

=== uploads/photo.jpg (image/jpeg, 234.1KB) ===
<image attached, model sees it>

=== docs/spec.md (text/markdown, 12.4KB) ===
# Product spec
...

=== renders/output.pdf (application/pdf, 1.2MB) ===
<document attached, model sees it>

Constraints:

  • Drive paths only. A leading drive/ is accepted and stripped. For arbitrary URLs, the agent should call download_url first, then file_read.
  • Max 10 paths per call, same 5 MB / 100k-char per-file limit as inputs.attachments.
  • On non-vision models, image/document files in the path list are skipped with an error in the result; text files still come through.
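
The path rules above can be illustrated with a short sketch. The agent itself calls file_read, so this is only useful if you are building something that prepares path lists for it (for example, a prompt template or an orchestration layer). normalizeDrivePath and batchPaths are hypothetical helpers, and the URL check is a simple scheme test, not the platform's resolver.

```javascript
const MAX_PATHS_PER_CALL = 10;

// Strips an optional leading "drive/" and rejects anything that looks like a URL.
function normalizeDrivePath(path) {
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(path)) {
    throw new Error(`not a drive path (use download_url first): ${path}`);
  }
  return path.startsWith("drive/") ? path.slice("drive/".length) : path;
}

// Splits a long path list into groups that fit the 10-paths-per-call limit.
function batchPaths(paths) {
  const normalized = paths.map(normalizeDrivePath);
  const batches = [];
  for (let i = 0; i < normalized.length; i += MAX_PATHS_PER_CALL) {
    batches.push(normalized.slice(i, i + MAX_PATHS_PER_CALL));
  }
  return batches;
}
```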

file_read is always exposed; you don't declare it in skill.yaml. It joins the platform-provided agent tools (bash, media, web_search, image_search, web_fetch, download_url). The mcp-tools page covers a different set: the tools an MCP client invokes, not the tools the skill agent has at runtime.

Worked example — image-in / text-out skill

SKILL.md:

markdown
You are a product copywriter. The user will attach a single product photo.
Look at the photo and write a 60-word marketing blurb. Reply with just the blurb.

App submits:

js
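// uploadFile is assumed to be the app's own upload helper; it resolves to
// { drive_path: "uploads/<uuid>.jpg" }.
const up = await uploadFile(file);

// Submit the job with the uploaded photo attached to the first user message.
const res = await fetch(`${API_BASE}/v1/jobs`, {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    skill: "product-copywriter",
    inputs: {
      prompt: "Write the blurb.",
      attachments: [{ drive_path: up.drive_path }],
    },
  }),
});
if (!res.ok) throw new Error(`Job submit failed: HTTP ${res.status}`);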

The agent's first message arrives as a multimodal block list (text + image). No bash cat, no function tool. The model sees the photo directly and replies.

Mid-run pattern — search, download, look

When the agent needs to find an image online and then study it:

1. image_search("vintage red bicycle")          → list of URLs
2. download_url(url, path="research/ref.jpg")   → saved to drive
3. file_read(paths=["research/ref.jpg"])        → attached to context
4. (model now reasons over the actual image)

This is the canonical way to bring external visual material into the agent's working memory.

Choosing a model

Image and PDF attachments require a model that supports them. The platform fails the job upfront with a clear error when the chosen model can't process attached images or documents. Safe defaults:

  • claude/opus-4-7 — supports both images and PDFs.
  • claude/sonnet-4-7 — supports both.
  • gpt/4o, gemini/2.5-pro — images only (no native PDF). The job fails if a PDF is attached.
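
The list above can be turned into a local lookup so an app can warn before it submits a job that would fail. This is a sketch, not a platform registry: MODEL_CAPS only encodes the four slugs named above, and unsupportedKinds is a hypothetical helper.

```javascript
// Capabilities as listed above; illustrative only, not an exhaustive registry.
const MODEL_CAPS = {
  "claude/opus-4-7": { image: true, document: true },
  "claude/sonnet-4-7": { image: true, document: true },
  "gpt/4o": { image: true, document: false },
  "gemini/2.5-pro": { image: true, document: false },
};

// Returns the attachment kinds the model cannot take, or null for an unknown slug.
// Text attachments are inlined as text, so they are never flagged here.
function unsupportedKinds(model, kinds) {
  const caps = MODEL_CAPS[model];
  if (!caps) return null;
  return kinds.filter((kind) => kind !== "text" && !caps[kind]);
}
```

For example, unsupportedKinds("gpt/4o", ["image", "document"]) flags the document, which matches the rule that a PDF attached to that model fails the job.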

Set the model in the skill's skill.yaml:

yaml
# skills/product-copywriter/skill.yaml
description: Write product blurbs from a photo.
entrypoint: SKILL.md
model: claude/sonnet-4-7
input_schema: { ... }
output_schema: { ... }

Conventions

  • Use drive_path for anything > ~200 KB. Inline base64 lives in jobs.inputs forever (Postgres jsonb); large blobs there bloat the table.
  • Prefer inputs.attachments over file_read when the file is known at submit time. The model sees it without burning a tool round-trip.
  • Use file_read for files the agent decides to look at, like outputs of download_url, files the user dropped into the drive between jobs, or one of several candidates picked at runtime.
  • Don't paste image URLs into the prompt text expecting the model to "browse" them. Models don't auto-fetch — the URL is just text. Put it in attachments or call download_url + file_read.
  • Don't pass a PDF to a non-Claude model. GPT and Gemini slugs accept images but not documents; convert pages to images upstream if you need them.
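
The ~200 KB rule of thumb from the first convention can be applied mechanically. A minimal sketch, assuming 200 KB means 200 × 1024 bytes and that the payload is already a raw base64 string; base64DecodedBytes and shouldInline are hypothetical helpers.

```javascript
// Assumption: the ~200 KB rule of thumb is read as 200 * 1024 bytes.
const INLINE_LIMIT_BYTES = 200 * 1024;

// Decoded size of a raw base64 string, computed without decoding it.
function base64DecodedBytes(b64) {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return Math.floor((b64.length * 3) / 4) - padding;
}

// True when the payload is small enough to send inline as a base64 attachment;
// otherwise upload it to the drive and send a drive_path entry instead.
function shouldInline(b64) {
  return base64DecodedBytes(b64) <= INLINE_LIMIT_BYTES;
}
```

A caller would branch on shouldInline: small generated payloads go in as { base64, media_type }, everything else is uploaded first and referenced by drive_path.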

See inputs-and-drive for the deterministic-skill side of file handling (how a Python skill or a per-skill tool reads the same input shapes), and example-project for a complete starter project.