The model that lasted three days (plus token bonfires)

2026-06-24

The Fizz

The Weekly Fizz.

Hello!

The Drip

Fable Five had a three-day life. It shipped, someone demonstrated a jailbreak to a group inside the US government, and the directive came down that no foreign national could use it, which, with no reliable way to enforce that, meant nobody could. So it got pulled. The irony wasn't lost on us: this landed right after a stretch of Dario writing about the dangers of moving too fast and the case for a coordinated slowdown. We got about three days with it. Worth talking about what we saw.

Inside The Bottle

Two things stuck with us this week.

First, model choice is really a binary. If you're using AI (building something, working through a strategy, whatever), just use the best model you can get your hands on. The frontier labs bake the smart stuff in now; six months ago we were reminding Claude to spin up a subagent, and today it just does. Unless you've got some strange budget limit, reach for the top of the shelf. But if you're building agentic AI into software, model choice is suddenly everything. You're paying on the meter, you've got a cost to serve, and you're balancing quality against speed against price. There's also the upgrade path, because dropping in a new model is like handing the same instructions to a different person. Test before you swap.

Which raises the cost question. Kellan's take: unit cost means almost nothing. Price per token is close to irrelevant. What matters is what it costs to actually get the objective done, the same reason we said no to hourly billing in consulting. The question isn't "how much per token," it's "how long to achieve the task." Benchmarks are already moving that way.

Then there's the new vocabulary: loops, goals, workflows. A loop is a schedule that keeps injecting a prompt. A goal is closer to test-driven development: set a condition and it runs until it's met. A workflow fans a job out across many agents in stages. They're powerful, and they burn tokens like nothing we've seen. The honest part: they're not always the move. Loops shine on the work you already understand: go check the site analytics, fix the known issue. For the work where you haven't found clarity yourself yet, staying hands-on is still the right call. Figure out which mode you're in first.

Lab Notes

■

Justin's note: Gave a model a short, vague "audit our codebase" task, the kind that usually returns a heaping pile of junk next to a couple useful things. This time it came back rock solid: dead code, contradictory copy between pages, even a line in our legal disclosures that said "Inc." when we're an LLC. Implemented almost all of it. First time I felt myself come over the confidence hurdle on longer, more ambiguous work.

■

Kellan's note: A single agent will cut corners on a big task, not because it's broken, but because it knows its own limits and quietly skims. That's the real reason workflows help: each agent gets a smaller job and leans on its partners. When you need context deep and wide, stop asking one agent to carry it all.

What Stopped Our Scroll

■

We got local models to triage the OpenClaw repo for FREE!. Open-weight models (Gemma, Qwen) run real-time PR triage at basically zero cost, a nice counterweight to the "just use the biggest model" reflex.

■

Prompt Injection as Role Confusion. Models judge instructions by formatting style, not by where they came from. Reformatting the same attack dropped its success rate from 61% to 10%.

■

Codex-maxxing for long-running work. OpenAI on pushing work past a single prompt: break goals into verifiable steps, and know when to hand off versus stay in the loop. Right on theme this week.

Listen to the full episode

TBA

TINY BOTTLE AI

“Concentrated AI.
Measurable Results.”

Minneapolis • Portland

Unsubscribe