🗡️🐟 sailfish

your agents' tool calls → a fast private model on your own GPU

Sailfish turns your own agent tool-call history into a fast, private model that runs on your own hardware. Point it at your local nemesis8 logs, fine-tune Gemma-4 on a rented GPU, download the model, and serve it locally, accelerated by speculative decoding on cards that can hold it. Once the model is on your disk, inference needs no cloud, no API, and nothing rented. The fastest fish in the ocean, ported to your box.

on a $300 RTX 3060 (12 GB)

66 → 82 tok/s

+21–24% over bare Q4_K_M (+26–28% over the old n-gram default) via our trained 156 MB draft head · tool-harness avg 98 tok/s · accuracy unchanged (6/6) · measured 2026-07-04

on datacenter Ampere (A10G)

504.8 tok/s

INT4 + MTP drafter @ PPL 2.394 · verified Fast-Gemma-Challenge entry

measured on the same RTX 3060, llama.cpp, greedy, 256-token generations:

config                        agentic       prose         accuracy
bare Q4_K_M                   66.6 tok/s    67.6 tok/s    6/6
n-gram lookup (old default)   65.8 tok/s    63.7 tok/s    6/6
trained VSD draft head        82.6 tok/s    81.6 tok/s    6/6  ← shipped

The draft head is Google's reference drafter retrained on this machine's own tool-call sessions (910k on-policy tokens) to maximize the number of guesses that survive verification: 2.64 accepted tokens per draft versus 1.35 for the stock drafter, about 2×. Speculative decoding is lossless: the big model verifies every guess, so under greedy decoding the answers are token-identical. We checked that with an independent scorer (PPL moved 0.0002 over 61,797 tokens under a drafter swap). Speed is the only thing that changes.
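To make the "lossless" claim concrete, here is a toy sketch of greedy speculative decoding. It is not Sailfish's or llama.cpp's implementation: the two "models" are tiny stand-in functions, and every name (`speculative_greedy`, `target_model`, `draft_model`, `k`) is made up for illustration. The point is only that the output never depends on the drafter, while the accepted-per-draft count does.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def plain_greedy(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the target model generates every token itself."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[Token],
                       n_new: int, k: int = 4) -> Tuple[List[Token], float]:
    """Return (new tokens, mean accepted draft tokens per round)."""
    out = list(prompt)
    accepted_total, rounds = 0, 0
    while len(out) - len(prompt) < n_new:
        # 1) The cheap drafter guesses k tokens ahead.
        ctx, guesses = list(out), []
        for _ in range(k):
            guess = draft(ctx)
            guesses.append(guess)
            ctx.append(guess)
        # 2) The target checks the guesses in order and stops at the first miss.
        #    (A real engine scores all k positions in one batched forward pass.)
        accepted = 0
        for guess in guesses:
            if target(out) != guess:
                break
            out.append(guess)
            accepted += 1
        # 3) The target always adds one token of its own: the correction after
        #    a miss, or a bonus token when every guess was accepted.
        out.append(target(out))
        accepted_total += accepted
        rounds += 1
    return out[len(prompt):len(prompt) + n_new], accepted_total / rounds


# Toy stand-ins (assumptions, not real models).
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 11


def draft_model(ctx: List[Token]) -> Token:
    # Agrees with the target except when the last token is a multiple of 4.
    return target_model(ctx) if ctx[-1] % 4 else 0
```

Every token that lands in the output was either checked against or produced by the target, so a better drafter only raises the accepted-per-round number; it cannot change the text.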

You can train one on your own sessions — the corpus scraper, acceptance evaluator, and training scaffold are on GitHub. Drafter weights + the full training writeup: soon. Stay tuned.


run it on your machine

One line. Autodetects your GPU and serves the fastest valid config on localhost:22343 — OpenAI-compatible, so any agent or harness talks to it.

# Windows (PowerShell)
irm https://sailfish.nuts.services/install.ps1 | iex

# macOS / Linux
curl -fsSL https://sailfish.nuts.services/install.sh | sh

how it picks your setup

< 16 GB  (RTX 3060 12GB)   Gemma-4-E4B Q4, llama.cpp, trained 156 MB draft head (draft-mtp)
>= 16 GB (5090, A10G, L4)   stock INT4 + MTP drafter — the challenge stack, full speed
detection is automatic — VRAM + architecture, at container start
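As a rough sketch, the VRAM half of that rule fits in a few lines. This is not the installer's actual code: `pick_config` and `ServeConfig` are hypothetical names, the tier letters are borrowed from the throughput table on this page, and the architecture check is left out because its rules are not spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeConfig:
    tier: str
    stack: str


def pick_config(vram_gb: float) -> ServeConfig:
    """Pick a serving stack from total VRAM alone (architecture check omitted)."""
    if vram_gb >= 16:
        # Big cards: the challenge stack.
        return ServeConfig(tier="A", stack="stock INT4 + MTP drafter")
    # Smaller cards: llama.cpp with the trained draft head.
    return ServeConfig(
        tier="B",
        stack="Gemma-4-E4B Q4, llama.cpp, trained 156 MB draft head (draft-mtp)",
    )
```

Under this sketch, `pick_config(12)` lands on the llama.cpp path and `pick_config(24)` on the challenge stack.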

throughput by card, agentic output:

card              tier   throughput               basis
RTX 3060  12 GB    B      82.6 tok/s · 98 harness  measured (trained draft head)
A10G      24 GB    A      504.8 tok/s              measured (verified challenge entry)
RTX 5090  32 GB    A      ~280–420 tok/s           estimate · unvalidated

The 3060 and A10G rows are measured. The 5090 row is a bandwidth-scaled estimate for the llama.cpp + draft-head path (the full challenge stack would go higher) — don't hold us to it until we run one.
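One plausible way to arrive at a bandwidth-scaled number like that, sketched under loud assumptions: the memory-bandwidth figures below are vendor spec-sheet values as we recall them (360 GB/s for the RTX 3060 12 GB, 1792 GB/s for the RTX 5090), not measurements from this page, and the 0.7 to 1.0 efficiency band is a guess. This is not necessarily how the estimate in the table was produced.

```python
def bandwidth_scaled_estimate(measured_tok_s: float,
                              ref_bandwidth_gb_s: float,
                              target_bandwidth_gb_s: float,
                              efficiency: float = 1.0) -> float:
    """Scale a measured decode rate by the ratio of memory bandwidths.

    Single-stream decoding is usually memory-bound, so tokens/s tends to track
    bandwidth; `efficiency` discounts for everything that does not scale.
    """
    if ref_bandwidth_gb_s <= 0 or target_bandwidth_gb_s <= 0:
        raise ValueError("bandwidths must be positive")
    return measured_tok_s * (target_bandwidth_gb_s / ref_bandwidth_gb_s) * efficiency


# Assumed spec-sheet bandwidths in GB/s (not from this page).
RTX_3060_BW = 360.0
RTX_5090_BW = 1792.0

# 82.6 tok/s is the measured 3060 figure from the table above.
low = bandwidth_scaled_estimate(82.6, RTX_3060_BW, RTX_5090_BW, efficiency=0.7)
high = bandwidth_scaled_estimate(82.6, RTX_3060_BW, RTX_5090_BW, efficiency=1.0)
```

With those inputs, `low` and `high` both fall inside the ~280–420 tok/s band quoted for the 5090, which is why the row is labeled an estimate rather than a measurement.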

the loop

1. Harvest: pull your tool-call history from nemesis8 (or a local scrape).
2. Curate: a frontier model cleans it into training data; you review it, with a cost cap.
3. Train: fine-tune on an ephemeral cloud GPU (your Google Cloud, or ours).
4. Serve: download, hot-swap, and your model answers at reading speed, offline, yours.

talk to it

curl http://localhost:22343/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
  "model":"gemma4-e4b",
  "messages":[{"role":"user","content":"what time is it?"}],
  "tools":[ ...your tools... ]
}'

Same shape as OpenAI. Point Hyperia, your agent harness, or anything else at it.
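If you would rather call it from code, here is a minimal standard-library sketch of the same request. The endpoint and model name come from the curl example above; the `get_time` tool definition is purely hypothetical (it follows the OpenAI function-tool shape), and `build_chat_request` / `chat` are illustrative helper names, not part of Sailfish.

```python
import json
import urllib.request

BASE_URL = "http://localhost:22343/v1"  # local endpoint from the curl example

# Hypothetical tool in the OpenAI function-tool format; swap in your own.
GET_TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current local time.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}


def build_chat_request(prompt, tools, model="gemma4-e4b", base_url=BASE_URL):
    """Build (but do not send) a POST to the chat-completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, tools, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(prompt, tools, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# With the server running:
# reply = chat("what time is it?", [GET_TIME_TOOL])
# print(reply["choices"][0]["message"])
```

Building the request separately from sending it keeps the payload easy to inspect, and passing a different `base_url` points the same code at any other OpenAI-compatible server.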

already use Ollama?

Run the same model straight from Hugging Face in the Ollama you already have — no container required:

# the stock model, today
ollama run hf.co/ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M

# your own tool-tuned model, once your BYO-cloud run pushes it as a GGUF repo
ollama run hf.co/your-hf-user/gemma4-e4b-toolft

Then point your agents at Ollama's OpenAI endpoint http://localhost:11434/v1 — same shape as everything else.

A LoRA fine-tune saves as safetensors; convert it to GGUF (llama.cpp convert_hf_to_gguf.py) before hf.co/… works in Ollama, or just let the Sailfish container serve it directly. Ollama gives you the model; the Sailfish container gives you the model plus the speculative-decoding speedup that Ollama's runner doesn't provide (66.6 → 82.6 tok/s on the RTX 3060 in the head-to-head above).