Agent attachments
How to feed images, PDFs, and text files to an agentic skill — via inputs.attachments at submit time, or the file_read tool mid-run.
Agentic skills can see images and documents — not just text. There are two routes:
- inputs.attachments: submit files together with the prompt. They become part of the first user message the model sees.
- file_read tool: the agent calls it mid-run to attach drive files to its own context.
Both routes produce the same content blocks (image / document / text) for the model. If the chosen model lacks vision or document support, the job fails upfront with a clear error.
inputs.attachments — submitting files with the prompt
{
  "skill": "ad-creative",
  "inputs": {
    "prompt": "Write a 60-word product blurb from this photo.",
    "attachments": [
      { "drive_path": "uploads/shoe.jpg" },
      { "url": "https://cdn.example.com/spec.pdf" },
      { "base64": "iVBORw0KGgo...", "media_type": "image/png" }
    ]
  }
}
(The submit body has no type field — agentic vs deterministic is decided by the skill's entrypoint in skill.yaml.)
Each entry is one of three shapes (same convention as inputs-and-drive's function inputs, but as an explicit list so the worker knows to attach instead of inline-as-text):
- drive_path: file already in this project's drive. Resolved with path-traversal protection. Best for anything the user uploaded.
- url: public HTTPS URL. For images and PDFs we pass the URL straight to the model; the worker doesn't download. For text/* we fetch first.
- base64: raw base64 (no data:..., prefix). Set media_type explicitly. Best for tiny generated payloads.
Optional media_type overrides MIME sniffing on any of the three shapes.
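The three shapes can be checked on the client before submitting. The following is a minimal sketch, not the platform's own validation: validateAttachment and SHAPE_KEYS are hypothetical names, and the rules encode only what the list above states (exactly one shape per entry, HTTPS for url, no data: prefix and an explicit media_type for base64).

```javascript
// Hypothetical client-side pre-check; the platform's real validation may differ.
const SHAPE_KEYS = ["drive_path", "url", "base64"];

// Returns null when the entry looks well-formed, otherwise a short error string.
function validateAttachment(entry) {
  const present = SHAPE_KEYS.filter(
    (k) => typeof entry[k] === "string" && entry[k] !== ""
  );
  if (present.length !== 1) {
    return `expected exactly one of ${SHAPE_KEYS.join(", ")}, got ${present.length}`;
  }
  const shape = present[0];
  if (shape === "url" && !entry.url.startsWith("https://")) {
    return "url must be a public HTTPS URL";
  }
  if (shape === "base64") {
    if (entry.base64.startsWith("data:")) {
      return "base64 must not carry a data: prefix";
    }
    if (!entry.media_type) {
      return "base64 entries should set media_type explicitly";
    }
  }
  return null;
}
```

Running this over every entry in inputs.attachments before the POST catches malformed entries without spending a job submission.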
Supported file types
| Kind | MIME | How the model sees it |
|---|---|---|
| Image | image/jpeg, image/png, image/gif, image/webp | Vision block — model literally looks at the pixels. |
| Document | application/pdf | PDF block — text + page images together. Supported on the Claude family only. |
| Text | text/*, application/json, application/xml, application/yaml | Inlined as a text block (utf-8). |
Anything else is rejected with a clear error. Hard limit: 5 MB per file. Text files are inlined up to 100k characters, then truncated.
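The table and limits above can be mirrored in a small pre-flight helper. This is a sketch under stated assumptions: classify and truncateText are hypothetical names, "5 MB" is taken to mean 5 × 1024 × 1024 bytes (the platform may count differently), and "characters" is approximated by JavaScript string length, which counts UTF-16 code units.

```javascript
// Assumption: 5 MB is read as 5 MiB; the platform may use 5,000,000 bytes instead.
const MAX_FILE_BYTES = 5 * 1024 * 1024;
const MAX_TEXT_CHARS = 100_000;

const IMAGE_TYPES = new Set(["image/jpeg", "image/png", "image/gif", "image/webp"]);
const TEXT_LIKE = new Set(["application/json", "application/xml", "application/yaml"]);

// Maps a MIME type to the kind from the table, or null if it would be rejected.
function classify(mediaType) {
  // Drop parameters such as "; charset=utf-8" before matching.
  const base = mediaType.split(";")[0].trim().toLowerCase();
  if (IMAGE_TYPES.has(base)) return "image";
  if (base === "application/pdf") return "document";
  if (base.startsWith("text/") || TEXT_LIKE.has(base)) return "text";
  return null;
}

// True when the file is within the per-file size limit.
function withinSizeLimit(byteLength) {
  return byteLength <= MAX_FILE_BYTES;
}

// Cuts text at the inline limit, mirroring the truncation described above.
function truncateText(text) {
  return text.length > MAX_TEXT_CHARS ? text.slice(0, MAX_TEXT_CHARS) : text;
}
```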
file_read — letting the agent pull files into its own context
When the agent needs to look at something it didn't get up front (or wants to inspect a file it just wrote), it calls file_read:
file_read({
  "paths": ["uploads/photo.jpg", "docs/spec.md", "renders/output.pdf"]
})
The tool result is a block list — one labeled header per file, then the content:
=== uploads/photo.jpg (image/jpeg, 234.1KB) ===
<image attached, model sees it>
=== docs/spec.md (text/markdown, 12.4KB) ===
# Product spec
...
=== renders/output.pdf (application/pdf, 1.2MB) ===
<document attached, model sees it>
Constraints:
- Drive paths only. A leading drive/ is accepted and stripped. For arbitrary URLs, the agent should call download_url first, then file_read.
- Max 10 paths per call, same 5 MB / 100k-char per-file limit as inputs.attachments.
- On non-vision models, image/document files in the path list are skipped with an error in the result; text files still come through.
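To make the first two constraints concrete, here is a small sketch of how a test harness or per-skill helper might prepare path lists. normalizeDrivePath and batchPaths are hypothetical names. The tool itself strips the drive/ prefix, so the normalization here only illustrates that rule.

```javascript
const MAX_PATHS_PER_CALL = 10; // limit stated in the constraints above

// Mirrors the documented behavior: a leading "drive/" is accepted and stripped.
function normalizeDrivePath(path) {
  return path.startsWith("drive/") ? path.slice("drive/".length) : path;
}

// Splits a long path list into groups that each fit in one file_read call.
function batchPaths(paths) {
  const clean = paths.map(normalizeDrivePath);
  const batches = [];
  for (let i = 0; i < clean.length; i += MAX_PATHS_PER_CALL) {
    batches.push(clean.slice(i, i + MAX_PATHS_PER_CALL));
  }
  return batches;
}
```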
file_read is always exposed; you don't declare it in skill.yaml. It joins the platform-provided agent tools (bash, media, web_search, image_search, web_fetch, download_url). The mcp-tools page covers a different set: the tools an MCP client invokes, not what the skill agent has at runtime.
Worked example — image-in / text-out skill
SKILL.md:
You are a product copywriter. The user will attach a single product photo.
Look at the photo and write a 60-word marketing blurb. Reply with just the blurb.
App submits:
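// uploadFile, API_BASE, apiKey, and file come from your app; they are not defined here.
const up = await uploadFile(file); // → { drive_path: "uploads/<uuid>.jpg" }
const res = await fetch(`${API_BASE}/v1/jobs`, {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    skill: "product-copywriter",
    inputs: {
      prompt: "Write the blurb.",
      attachments: [{ drive_path: up.drive_path }],
    },
  }),
});
// Surface HTTP-level failures instead of silently ignoring them.
if (!res.ok) throw new Error(`Job submit failed: HTTP ${res.status}`);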
The agent's first message arrives as a multimodal block list (text + image). No bash cat, no function tool. The model sees the photo directly and replies.
Mid-run pattern — search, download, look
When the agent needs to find an image online and then study it:
1. image_search("vintage red bicycle") → list of URLs
2. download_url(url, path="research/ref.jpg") → saved to drive
3. file_read(paths=["research/ref.jpg"]) → attached to context
4. (model now reasons over the actual image)
This is the canonical way to bring external visual material into the agent's working memory.
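The sequence can also be written out as plain code, which is useful for testing the flow against fakes. This is only an illustration: in a real run the agent issues these as tool calls, not JavaScript functions. The tools object, its method signatures, and searchDownloadLook are hypothetical stand-ins that mirror the tool names in the steps above.

```javascript
// `tools` is a hypothetical object whose methods mirror the tool names above.
// The real agent invokes these as tool calls; argument shapes here are assumed.
async function searchDownloadLook(tools, query, savePath) {
  // 1. Search returns a list of candidate URLs.
  const urls = await tools.image_search(query);
  if (urls.length === 0) {
    throw new Error(`no image results for "${query}"`);
  }
  // 2. Save the first candidate to the drive.
  await tools.download_url(urls[0], savePath);
  // 3. Attach the saved file to the model's context.
  return tools.file_read({ paths: [savePath] });
}
```

Passing the tools in as an argument keeps the flow testable: a fake object that records calls is enough to check the ordering.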
Choosing a model
Vision/PDF features require a vision-capable model. The platform fails the job upfront with a clear error when the chosen model can't process attached images or documents. Safe defaults:
- claude/opus-4-7: supports both images and PDFs.
- claude/sonnet-4-7: supports both.
- gpt/4o, gemini/2.5-pro: images only (no native PDF). The job fails if a PDF is attached.
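A client can mirror this list to catch a mismatch before submitting. The sketch below hard-codes only the four slugs named above. MODEL_SUPPORT and unsupportedKinds are hypothetical names, and the platform's own upfront check remains the source of truth.

```javascript
// Capabilities copied from the list above; any other slug is treated as unknown.
const MODEL_SUPPORT = {
  "claude/opus-4-7": { image: true, document: true },
  "claude/sonnet-4-7": { image: true, document: true },
  "gpt/4o": { image: true, document: false },
  "gemini/2.5-pro": { image: true, document: false },
};

// Returns the attachment kinds the model cannot process,
// or null when the slug is not in the table.
function unsupportedKinds(model, kinds) {
  const caps = MODEL_SUPPORT[model];
  if (!caps) return null;
  // Text attachments are inlined as text, so they need no special capability.
  return kinds.filter((kind) => kind !== "text" && !caps[kind]);
}
```

For example, unsupportedKinds("gpt/4o", ["image", "document"]) reports the PDF as the problem, which matches the failure described above.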
Set the model in the skill's skill.yaml:
# skills/product-copywriter/skill.yaml
description: Write product blurbs from a photo.
entrypoint: SKILL.md
model: claude/sonnet-4-7
input_schema: { ... }
output_schema: { ... }
Conventions
- Use drive_path for anything > ~200 KB. Inline base64 lives in jobs.inputs forever (Postgres jsonb); large blobs there bloat the table.
- Prefer inputs.attachments over file_read when the file is known at submit time. The model sees it without burning a tool round-trip.
- Use file_read for files the agent decides to look at, like outputs of download_url, files the user dropped into the drive between jobs, or one of several candidates picked at runtime.
- Don't paste image URLs into the prompt text expecting the model to "browse" them. Models don't auto-fetch; the URL is just text. Put it in attachments or call download_url + file_read.
- Don't pass a PDF to a non-Claude model. GPT and Gemini slugs accept images but not documents; convert pages to images upstream if you need them.
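The first convention can be applied mechanically when building an attachment entry. This is a minimal sketch that assumes a Node.js runtime (it uses Buffer) and a hypothetical upload callback that resolves to { drive_path }, like uploadFile in the worked example. The 200 KB threshold is the rough figure from the list, not a platform-enforced limit.

```javascript
// Rough threshold from the conventions above; not enforced by the platform.
const INLINE_LIMIT_BYTES = 200 * 1024;

// `upload` is a hypothetical callback that stores the bytes in the drive
// and resolves to { drive_path }.
async function toAttachment(bytes, mediaType, upload) {
  if (bytes.length > INLINE_LIMIT_BYTES) {
    // Large payloads go to the drive so they don't sit in jobs.inputs.
    const { drive_path } = await upload(bytes, mediaType);
    return { drive_path };
  }
  // Small payloads are inlined as raw base64, with an explicit media_type.
  return {
    base64: Buffer.from(bytes).toString("base64"),
    media_type: mediaType,
  };
}
```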
See inputs-and-drive for the deterministic-skill side of file handling (how a Python skill or a per-skill tool reads the same input shapes), and example-project for a complete starter project.