The environment is the leverage.
Everyone's tuning prompts. The real gains came from two years of building the environment around the agent: the standards it reads, the gates it can't fake, the memory that means it doesn't relearn the codebase every morning. This piece is that architecture, with the actual scripts, hooks, and commands, and how to copy it for your stack, Flutter or not.
In the last few months the AI-coding world fell in love with a single idea: stop prompting, start looping. Hand the agent a goal, let it run until it's done. It's a real technique, and I'll be honest up front: it's the part of my own setup I lean on least. It also buries the lede. The thing that makes any of it work is everything underneath it.
I run a company this way. Two of us ship a production app across six sports and four TV platforms, behind a quality gate that does not bend. None of that comes from a clever prompt. It comes from two years of building the room the agent works in. This piece is that room, with the real scaffolding. It's framework-agnostic, but I'll show you the concrete Flutter version, down to the install commands, so you can lift it directly.
Claude Code installed, a git repo you can edit, and Python 3.9+ for the checker scripts. Everything below is generic scaffolding, not my app's source, so you can paste it into your own project and adapt. If you're new to Claude Code, start at section 01 and add one piece at a time; you do not need all of it on day one.
An agent is only as good as the room you put it in.
01The constitution
Your conventions can't live in your head. They live in a file the agent reads at the start of every session. In Claude Code that file is CLAUDE.md at the repo root. Keep it short and absolute: directives, boundaries, and a map of where things live. This one file pays back more than anything else you'll write.
# CLAUDE.md — the constitution every session reads first
## Core directives
0. Zero debt. Every line ships production-complete. No TODO, no "temporary".
1. Investigate first. Name the root cause in one sentence before editing.
2. Comply, then code. Read the rule that governs a file before you touch it.
3. "Done" means the gate is green — never "looks right to me".
## Boundaries (load-bearing — each one has a real incident behind it)
- NEVER bypass the use-case layer.
- NEVER import the raw HTTP client; use lib/net/api_client.dart.
- NEVER mark work complete if scripts/gate.sh fails.
## Where things live
- Layer rules ....... .claude/rules/*.md
- Canonical peers ... docs/canonical_peers.md
- The gate .......... scripts/gate.shThen split the detail into layered rule files under .claude/rules/, one per area, and point the agent at the right one by context. A rule the agent reads only when it's editing that layer stays relevant instead of bloating the constitution.
# .claude/rules/presentation.md
# Load this when editing anything under lib/presentation/.
## Async safety (non-negotiable)
Every async block that uses ref must guard its owner's lifetime:
- In a notifier: if (!ref.mounted) return; // after EVERY await
- In a widget: if (!context.mounted) return; // BuildContext is the gate
A guarded-looking-but-unguarded await is the #1 crash class we ship.
Canonical peer: lib/presentation/feature/home/home_notifier.dartDo this: write the ten rules you keep repeating in code review into CLAUDE.md. That file pays back on every session after.
02The gate is the whole bar
Slow down here. A goal an agent grades for itself is a goal it will eventually cheat. Not maliciously: "declare victory" is just the cheapest path. The fix is goals it cannot grade for itself: programs that exit non-zero when something is wrong. One script is the whole bar. If it's green, the work is done; if it's red, it isn't. That's the only definition of "done" we use.
#!/usr/bin/env bash
# scripts/gate.sh — the bar. Any red = exit non-zero = not done.
set -euo pipefail
echo "▸ analyzer (zero errors, warnings AND infos)"
dart analyze --fatal-infos
echo "▸ tests"
flutter test
echo "▸ riverpod async-safety scanner"
python3 -m riverpod_3_scanner ./lib
echo "▸ project-specific checks"
python3 scripts/check_forbidden_patterns.py
echo "✓ all gates green"Two of those lines are stock: dart analyze --fatal-infos (we treat warnings and infos as defects, not just errors) and flutter test. The other two lines are where the real protection lives.
The scanner: a checker for a bug class the stock tools miss
Our worst recurring crash was an async callback touching state after its owner was gone, the kind of thing the analyzer waves straight through. So we wrote a static analyzer that fails the build on it, and published it. It's on PyPI; you can drop it into your gate in two lines:
# A static analyzer for the Riverpod 3.x async-safety bug class:
# an async callback that touches state after its owner is disposed.
pip install riverpod-3-scanner # Python 3.9+
# Run it across your lib/. Exits non-zero on any unguarded async ref.
python3 -m riverpod_3_scanner ./lib
# -> "All checks passed" ...or a list of file:line violationsYour own checkers: one rule per bug you've shipped
This is the discipline that compounds. Every time we shipped a bug, we asked one question: what machine-checkable rule would have caught it? Then we wrote that rule as a small script and wired it into the gate. A bug class that bit us once can't bite again. Here's the whole pattern, a generic runnable starting point you can grow one row at a time:
#!/usr/bin/env python3
"""check_forbidden_patterns.py
One rule per bug class we've shipped. Each row was added the day a real
bug taught us it should exist. Exit non-zero so the gate fails on a hit."""
import re, sys, pathlib
LIB = pathlib.Path("lib")
# (id, pattern, why) — append a row every time a new bug class bites you.
RULES = [
("print-in-lib", re.compile(r"\bprint\("),
"Use the logger — print() ships debug spam to release builds."),
("hardcoded-snackbar", re.compile(r"SnackBar\([^)]*content:\s*Text\(\s*['\"]"),
"User-facing copy is hardcoded — route it through the localizer."),
]
def main() -> int:
hits = []
for path in LIB.rglob("*.dart"):
if path.name.endswith(".g.dart"):
continue
for i, line in enumerate(path.read_text("utf-8").splitlines(), 1):
for rid, pat, why in RULES:
if pat.search(line):
hits.append(f"{path}:{i} [{rid}] {why}")
print("\n".join(hits))
print(f"{'FAIL' if hits else 'PASS'} - {len(hits)} violation(s)")
return 1 if hits else 0
if __name__ == "__main__":
sys.exit(main())Three things make a checker carry its weight: it prints file:line so the fix is obvious; it exits non-zero so the gate actually fails; and it's born from a real incident, not a style preference. As you grow, graduate from line-scanning to the analyzer's AST when a rule needs real structure. But a regex that catches a true regression beats an elegant check that never ships. For pre-existing violations you can't fix today, write them to a baseline file the checker ignores, so the gate blocks new debt without forcing a stop-the-world cleanup.
Do this: take your last production incident, write the smallest script that would have caught it, and add a line to scripts/gate.sh. You've started the most valuable asset in the environment.
03Make the process physical with hooks
A rule you have to remember is a rule you'll eventually skip, and so will your agent. Claude Code hooks are commands the harness runs automatically at set moments, so the process runs itself instead of relying on anyone to remember it. We flag every code edit, then block the session from ending until a self-review has happened:
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write|MultiEdit",
"hooks": [
{ "type": "command", "command": "touch .claude/.needs-review" }
]
}
],
"Stop": [
{
"hooks": [
{ "type": "command", "command": "bash scripts/require_review.sh" }
]
}
]
}
}The Stop hook is the teeth. A hook that exits with code 2 blocks the stop and feeds its message back to the agent, so it literally cannot end the turn with the review undone:
#!/usr/bin/env bash
# scripts/require_review.sh — a Stop hook.
# Exit code 2 BLOCKS the session from ending and feeds the message
# back to the agent, so it can't quit without doing the review.
if [ -f .claude/.needs-review ]; then
echo "Self-review not declared. Check your diff against the gate, then:" >&2
echo " rm .claude/.needs-review" >&2
exit 2
fiDo this: find the step you skip when you're tired. Make a hook enforce it. The machine has no bad days.
04Name a canonical peer for every pattern
Left alone, an agent writes the average of everything it has ever seen. That average is mediocre, shaped like the internet rather than your codebase. So for every important pattern, name the one file that does it right. The instruction stops being "write a repository" and becomes "write a repository like this one." Keep a flat list and reference it from your rules, as you saw in the last line of the rule file in section 01:
Repository -> lib/data/repos/athlete_repository.dart
Notifier -> lib/presentation/feature/home/home_notifier.dart
DAO test -> test/data/dao/athlete_dao_test.dartDo this: pick your cleanest example of each major pattern, write its path down, and point the agent at it by name, every time.
05Package the workflows you repeat
Some work is the same shape every time: cutting a release, adding a language, running a migration. That shape should be a command, not a memory. A Claude Code skill is a folder with a SKILL.md: a name, a one-line "use when", and the steps. The agent invokes it by name and follows your checklist the same way every time:
---
name: ship-release
description: Use when cutting a production release — runs the gate, bumps version, tags.
---
# Ship a release
1. Run scripts/gate.sh. If anything is red, stop and fix it first.
2. Bump the version in pubspec.yaml.
3. Update CHANGELOG with the diff since the last tag.
4. Commit, tag vX.Y.Z, and report the build number.Do this: the next time you do a multi-step task twice, write it down as a skill on the second pass. You'll run it a hundred more.
06Run the work in stages
Real features aren't one prompt; they're a pipeline: investigate, spec, implement, then independently review. We run those as distinct stages, often with different agents, and route each to a model by one question: how expensive is it to verify the output? You don't route by how hard the task feels. You route by how hard it is to check.
| Work | Route to | Because |
|---|---|---|
| Search, inventory, mechanical edits | a fast, cheap model | output is grep-checkable |
| Well-specified implementation chunk | a mid model + the gate | the gate verifies it |
| Root cause, architecture, judgment | your strongest model | verifying means re-deriving it |
Do this: stop asking one agent to do everything in one shot. Split investigate from implement from review, even just into three prompts. The seam is where quality lives.
07Memory that compounds
The default agent has amnesia: every session it rediscovers your codebase, and every lesson evaporates when the window closes. Give it a memory that outlives the session: a searchable folder of decisions and lessons, plus a short notes file the agent reads before it starts. You don't need a database; a directory and grep go a long way:
# After a debugging session, write one file:
# docs/lessons/2026-06-11-async-ref-after-dispose.md
# what broke, why, the fix, and the gate rule you added.
# Find it again months later:
grep -rl "after dispose" docs/lessons/This is the part that actually took two years, and the part a competitor can't copy. It's the accumulated why behind every decision. Every incident becomes a written lesson; every lesson becomes a gate; every gate becomes a rule.
Do this: after your next bug, write one paragraph somewhere searchable. Do it ten times and you have an institution.
08Loops come last
Now the loop makes sense. And here's my confession: I barely loop at all. I tried the headless, run-it-overnight version and walked away from it; I keep a human in the room. But that's the point. The reason looping is even available to me, safely, the day I want it, is that "done" in my codebase is a green gate, not an opinion. Once that's true, the loop is a one-liner you bolt on top:
# In Claude Code, hand it a machine-checkable goal and walk away:
/loop drive scripts/gate.sh to green — run it, read the failures,
fix the ROOT CAUSE (never weaken a check), re-run. stop at exit 0.The discipline is all in the guardrails: only loop on deterministic goals, since an LLM grading itself declares victory early; cap the spend; sandbox it so a bad iteration can't corrupt your main branch; never let it do anything irreversible without you. But don't mistake the loop for the achievement. The loop is the cheap reward; the gate is the work. Most people chase the reward and skip the work. Build the work.
We didn't make the agent smarter. We made shipping a bug harder than shipping the fix.
09The Flutter cut
If you write Flutter, here's the concrete stack. Every item is a script like the ones above, and every one was born from a real production incident:
- The Riverpod async-safety scanner. pip install riverpod-3-scanner (section 02). It fails the build on an async ref used after its owner is disposed.
- A render-literal tracer. It follows user-facing strings to where they're drawn and fails on any hardcoded literal, because an un-translated string is a defect when you ship in multiple languages.
- A lifecycle checker. It forbids inherited-widget lookups too early in init, a bug that crashes only in release builds (you'll never click into it in debug).
- The analyzer at zero warnings and zero infos. dart analyze --fatal-infos, not just zero errors.
- More than three thousand tests, zero mocking libraries. Real values through real logic. A test propped up by five mocks tests the mocks.
None of these are exotic. Each is a small script that encodes one lesson. The exotic part is two years of them stacked into a wall where the common ways to ship a Flutter bug simply aren't available, to you or to your agent.
10Duplicate it: a Monday plan
You don't need our two years. You need the skeleton, and then one new gate every time a bug shows you where it goes. Here's the whole skeleton in one paste:
# 1 — the constitution
touch CLAUDE.md # paste the skeleton from section 01
# 2 — folders for rules + scripts
mkdir -p scripts .claude/rules
# 3 — the gate (start empty, add checks as you go)
printf '#!/usr/bin/env bash\nset -euo pipefail\n' > scripts/gate.sh
chmod +x scripts/gate.sh
# 4 — the scanner (Flutter / Riverpod projects)
pip install riverpod-3-scanner
echo 'python3 -m riverpod_3_scanner ./lib' >> scripts/gate.sh
# 5 — your first custom checker (drop in check_forbidden_patterns.py)
echo 'python3 scripts/check_forbidden_patterns.py' >> scripts/gate.sh
# 6 — make the gate impossible to skip
echo 'scripts/gate.sh' > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit- Fill in CLAUDE.md with your ten review rules.
- Add one custom checker for your last incident.
- Add the Stop hook so the self-review can't be skipped.
- Name a canonical peer for each major pattern.
- Split your next feature into investigate → implement → review.
- Point /loop at the gate the first time a job's finish line is fully machine-checkable.
That's the room. The model gets better on its own every few months; the environment only gets better if you build it. And every hour you put in pays back on every task after, forever.
Want this running in your team?
Daylight AI is the hands-on practice behind this paper. We set up the same environment with your engineers and operators, against your real stack, until it's yours to keep.