Agent attachments

Agentic skills can see images and documents — not just text. There are two routes:

inputs.attachments — submit files together with the prompt. They become part of the first user message the model sees.
file_read tool — the agent calls it mid-run to attach drive files to its own context.

Both routes produce the same content blocks (image / document / text) for the model. Models without vision/document support fail the job upfront with a clear error.

inputs.attachments — submitting files with the prompt

json

{
  "skill": "ad-creative",
  "inputs": {
    "prompt": "Write a 60-word product blurb from this photo.",
    "attachments": [
      { "drive_path": "uploads/shoe.jpg" },
      { "url": "https://cdn.example.com/spec.pdf" },
      { "base64": "iVBORw0KGgo...", "media_type": "image/png" }
    ]
  }
}

(The submit body has no type field — agentic vs deterministic is decided by the skill's entrypoint in skill.yaml.)

Each entry is one of three shapes (same convention as inputs-and-drive's function inputs, but as an explicit list so the worker knows to attach instead of inline-as-text):

drive_path — file already in this project's drive. Resolved with path-traversal protection. Best for anything the user uploaded.
url — public HTTPS URL. For images and PDFs we pass the URL straight to the model — the worker doesn't download. For text/* we fetch first.
base64 — raw base64 (no data:..., prefix). Set media_type explicitly. Best for tiny generated payloads.

Optional media_type overrides MIME sniffing on any of the three shapes.

Supported file types

Kind	MIME	How the model sees it
Image	`image/jpeg`, `image/png`, `image/gif`, `image/webp`	Vision block — model literally looks at the pixels.
Document	`application/pdf`	PDF block — text + page images together. Supported on the Claude family only.
Text	`text/*`, `application/json`, `application/xml`, `application/yaml`	Inlined as a text block (utf-8).

Anything else is rejected with a clear error. Hard limit: 5 MB per file. Text files are inlined up to 100k characters, then truncated.

file_read — letting the agent pull files into its own context

When the agent needs to look at something it didn't get up front (or wants to inspect a file it just wrote), it calls file_read:

jsonc

file_read({
  "paths": ["uploads/photo.jpg", "docs/spec.md", "renders/output.pdf"]
})

The tool result is a block list — one labeled header per file, then the content:

=== uploads/photo.jpg (image/jpeg, 234.1KB) ===
<image attached, model sees it>

=== docs/spec.md (text/markdown, 12.4KB) ===
# Product spec
...

=== renders/output.pdf (application/pdf, 1.2MB) ===
<document attached, model sees it>

Constraints:

Drive paths only. A leading drive/ is accepted and stripped. For arbitrary URLs, the agent should call download_url first, then file_read.
Max 10 paths per call, same 5 MB / 100k-char per-file limit as inputs.attachments.
On non-vision models, image/document files in the path list are skipped with an error in the result; text files still come through.

file_read is always exposed — you don't declare it in skill.yaml. It joins the platform-provided agent tools (bash, media, web_search, image_search, web_fetch, download_url). See mcp-tools for tools an MCP client invokes, not what the skill agent has at runtime.

Worked example — image-in / text-out skill

SKILL.md:

markdown

You are a product copywriter. The user will attach a single product photo.
Look at the photo and write a 60-word marketing blurb. Reply with just the blurb.

App submits:

const up = await uploadFile(file);   // → { drive_path: "uploads/<uuid>.jpg" }
await fetch(`${API_BASE}/v1/jobs`, {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    skill: "product-copywriter",
    inputs: {
      prompt: "Write the blurb.",
      attachments: [{ drive_path: up.drive_path }],
    },
  }),
});

The agent's first message arrives as a multimodal block list (text + image). No bash cat, no function tool. The model sees the photo directly and replies.

Mid-run pattern — search, download, look

When the agent needs to find an image online and then study it:

1. image_search("vintage red bicycle")          → list of URLs
2. download_url(url, path="research/ref.jpg")   → saved to drive
3. file_read(paths=["research/ref.jpg"])        → attached to context
4. (model now reasons over the actual image)

This is the canonical way to bring external visual material into the agent's working memory.

Choosing a model

Vision/PDF features require a vision-capable model. The platform fails the job upfront with a clear error when the chosen model can't process attached images or documents. Safe defaults:

claude/opus-4-7 — supports both images and PDFs.
claude/sonnet-4-7 — supports both.
gpt/4o, gemini/2.5-pro — images only (no native PDF). The job fails if a PDF is attached.

Set the model in the skill's skill.yaml:

yaml

# skills/product-copywriter/skill.yaml
description: Write product blurbs from a photo.
entrypoint: SKILL.md
model: claude/sonnet-4-7
input_schema: { ... }
output_schema: { ... }

Conventions

Use drive_path for anything > ~200 KB. Inline base64 lives in jobs.inputs forever (Postgres jsonb); large blobs there bloat the table.
Prefer inputs.attachments over file_read when the file is known at submit time. The model sees it without burning a tool round-trip.
Use file_read for files the agent decides to look at, like outputs of download_url, files the user dropped into the drive between jobs, or one of several candidates picked at runtime.
Don't paste image URLs into the prompt text expecting the model to "browse" them. Models don't auto-fetch — the URL is just text. Put it in attachments or call download_url + file_read.
Don't pass a PDF to a non-Claude model. GPT and Gemini slugs accept images but not documents; convert pages to images upstream if you need them.

See inputs-and-drive for the deterministic-skill side of file handling (how a Python skill or a per-skill tool reads the same input shapes), and example-project for a complete starter project.