Sailfish turns your own agent tool-call history into a fast, private model that runs on your own hardware. Point it at your local nemesis8 logs, fine-tune Gemma-4 on a rented GPU, download the model, and serve it, accelerated by speculative decoding on cards that can hold it. Once it is serving: no cloud, no API, nothing rented. The fastest fish in the ocean, ported to your box.
measured on the same RTX 3060, llama.cpp, greedy, 256-token generations:
| config | agentic | prose | accuracy |
|---|---|---|---|
| bare Q4_K_M | 66.6 tok/s | 67.6 tok/s | 6/6 |
| n-gram lookup (old default) | 65.8 tok/s | 63.7 tok/s | 6/6 |
| trained VSD draft head (shipped) | 82.6 tok/s | 81.6 tok/s | 6/6 |
The draft head is Google's reference drafter retrained on this machine's own tool-call sessions (910k on-policy tokens) to maximize surviving guesses: 2.64 vs 1.35 accepted per draft, ~2× stock. Speculative decoding is lossless: the big model verifies every guess, so answers are token-identical. We verified that on an independent scorer (PPL moved 0.0002 over 61,797 tokens under a drafter swap). Speed is the only thing that changes.
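To make the "lossless" claim concrete, here is a minimal sketch of greedy speculative decoding with toy stand-in models. Everything in it is an assumption for illustration: `target_next`, `draft_next`, and the arithmetic inside them are hypothetical placeholders, not Sailfish's models or llama.cpp's implementation. The point it shows is structural: only tokens the big model itself would have produced are ever kept, so the drafter can change speed but not output.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # greedy "next token" function


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the big model: deterministic in its context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the drafter: usually agrees with the target, not always."""
    guess = target_next(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain greedy decoding, one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], float]:
    """Greedy speculative decoding; returns (tokens, accepted guesses per draft)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    drafts = accepted = 0
    while len(out) < goal:
        # 1. The drafter proposes up to k tokens, each conditioned on its own guesses.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        drafts += 1
        # 2. The target checks the guesses in order. A real engine scores all k
        #    positions in one batched forward pass, which is where the speedup
        #    comes from; this loop only mirrors the accept/reject logic.
        for guess in proposal:
            truth = target(out)
            out.append(truth)      # always keep the target's own token
            if guess != truth:
                break              # later guesses assumed a wrong prefix
            accepted += 1
    return out, accepted / drafts
```

Swapping `draft_next` for a better or worse drafter changes only the second return value (accepted guesses per draft), which is the quantity the retraining above is meant to raise.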
You can train one on your own sessions: the corpus scraper, acceptance evaluator, and training scaffold are on GitHub. Drafter weights + the full training writeup: soon. Stay tuned.
One line. Autodetects your GPU and serves the fastest valid config on localhost:22343. It is OpenAI-compatible, so any agent or harness talks to it.
# Windows (PowerShell)
irm https://sailfish.nuts.services/install.ps1 | iex
# macOS / Linux
curl -fsSL https://sailfish.nuts.services/install.sh | sh
< 16 GB (RTX 3060 12GB): Gemma-4-E4B Q4, llama.cpp, trained 156 MB draft head (draft-mtp)
>= 16 GB (5090, A10G, L4): stock INT4 + MTP drafter (the challenge stack, full speed)

detection is automatic (VRAM + architecture) at container start
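The VRAM half of that rule can be sketched in a few lines. This is an illustration only: `ServeConfig` and `pick_config` are hypothetical names, the tier letters are borrowed from the throughput table, and the architecture check that the real detection also performs is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServeConfig:
    tier: str   # "A" or "B", as used in the throughput table
    stack: str  # human-readable description of what gets served


def pick_config(vram_gb: float) -> ServeConfig:
    """Choose a serving stack from VRAM alone (assumed 16 GB threshold)."""
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    if vram_gb >= 16:
        return ServeConfig("A", "stock INT4 + MTP drafter")
    return ServeConfig("B", "Gemma-4-E4B Q4, llama.cpp, trained draft head (draft-mtp)")
```

For example, `pick_config(12)` selects the tier B llama.cpp path and `pick_config(24)` selects tier A.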
throughput by card, agentic output:
| card | tier | throughput | basis |
|---|---|---|---|
| RTX 3060 12 GB | B | 82.6 tok/s · 98 harness | measured (trained draft head) |
| A10G 24 GB | A | 504.8 tok/s | measured (verified challenge entry) |
| RTX 5090 32 GB | A | ~280-420 tok/s | estimate · unvalidated |
The 3060 and A10G rows are measured. The 5090 row is a bandwidth-scaled estimate for the llama.cpp + draft-head path (the full challenge stack would go higher), so don't hold us to it until we run one.
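One way a bandwidth-scaled estimate like this can be formed is to treat single-stream decoding as memory-bandwidth-bound and scale a measured number by the ratio of memory bandwidths. The sketch below is a guess at the shape of that arithmetic, not the method actually used for the table: `bandwidth_scaled` is a hypothetical helper, the 360 GB/s and 1792 GB/s figures are the published spec-sheet bandwidths for the RTX 3060 12 GB and RTX 5090 (not values from this document), and the `efficiency` factor is an arbitrary allowance for imperfect scaling.

```python
def bandwidth_scaled(
    measured_tok_s: float,
    src_bw_gb_s: float,
    dst_bw_gb_s: float,
    efficiency: float = 1.0,
) -> float:
    """Scale a measured decode rate by the memory-bandwidth ratio of two cards."""
    if src_bw_gb_s <= 0 or dst_bw_gb_s <= 0:
        raise ValueError("bandwidths must be positive")
    return measured_tok_s * (dst_bw_gb_s / src_bw_gb_s) * efficiency


# Measured 3060 figure from the table; bandwidths are spec-sheet assumptions.
ideal = bandwidth_scaled(82.6, 360.0, 1792.0)                         # perfect scaling
conservative = bandwidth_scaled(82.6, 360.0, 1792.0, efficiency=0.7)  # assumed 70%
```

Under those assumptions the two values come out near 411 and 288 tok/s, the same ballpark as the 5090 row, which is why the row stays labeled an estimate until it is measured.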
1. Harvest: pull your tool-call history from nemesis8 (or a local scrape).
2. Curate: a frontier model cleans it into training data; you review it, with a cost cap.
3. Train: fine-tune on an ephemeral cloud GPU (your Google Cloud, or ours).
4. Serve: download, hot-swap, and your model answers at reading speed, offline, yours.
curl localhost:22343/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model":"gemma4-e4b",
"messages":[{"role":"user","content":"what time is it?"}],
"tools":[ ...your tools... ]
}'
Same shape as OpenAI. Point Hyperia, your agent harness, or anything else at it.
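For harnesses that talk to the endpoint from code rather than curl, a standard-library Python sketch of the same request might look like this. The helper names (`build_chat_request`, `chat`) and the `get_time` tool are hypothetical, added only for illustration; the URL, model name, and message come from the curl example above, and the tool schema follows the OpenAI function-calling shape.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_chat_request(
    base_url: str,
    model: str,
    messages: List[Dict[str, Any]],
    tools: Optional[List[Dict[str, Any]]] = None,
) -> urllib.request.Request:
    """Build a POST to <base_url>/chat/completions in the OpenAI request shape."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if tools:
        payload["tools"] = tools
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(req: urllib.request.Request, timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical tool definition, standing in for "...your tools...".
GET_TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current local time.",
        "parameters": {"type": "object", "properties": {}},
    },
}

req = build_chat_request(
    "http://localhost:22343/v1",
    "gemma4-e4b",
    [{"role": "user", "content": "what time is it?"}],
    tools=[GET_TIME_TOOL],
)
# With the server running: reply = chat(req)
```

Because the base URL is a parameter, the same helper should work against any other OpenAI-compatible server by changing only that string.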
Run the same model straight from Hugging Face in the Ollama you already have, no container required:
# the stock model, today
ollama run hf.co/ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M
# your own tool-tuned model, once your BYO-cloud run pushes it as a GGUF repo
ollama run hf.co/your-hf-user/gemma4-e4b-toolft
Then point your agents at Ollama's OpenAI endpoint, http://localhost:11434/v1, which has the same shape as everything else.
A LoRA fine-tune saves as safetensors; convert it to GGUF (llama.cpp convert_hf_to_gguf.py) before hf.co/… works in Ollama, or just let the Sailfish container serve it directly. Ollama gives you the model; the Sailfish container gives you the model plus the speculative-decoding speed (5-12× on agent traffic; see the head-to-head above), which Ollama's runner doesn't provide.