GoodLeads ships product through an organization of AI agents the same way a software company ships through an organization of people: with managers, peers, rituals, escalation paths, and a culture documented in a wiki. The system is not an agent. It is an agent org — vertical roles, a horizontal role, a cross-model peer review function, and three self-improving loops.
Each box below is an agent identity defined by its system prompt, its skill library, its tool permissions, and its memory scope, not by which LLM happens to back it. Three of the four agent roles run on Claude Sonnet 4.6; one (Code Judgment) runs on GPT-5 Codex. The backing model can be swapped by changing one config line; the roles stay stable.
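One way to picture the split between a role and its backing model is a role record that never mentions a model, plus a separate one-line-per-role mapping. This is a minimal sketch, not the actual config: the names AgentRole, MODEL_FOR_ROLE and backing_model, the role keys, and the model identifier strings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    """An agent identity: everything here is independent of the backing LLM."""
    name: str
    system_prompt: str
    skills: tuple
    tools: tuple
    memory_scope: str


# Hypothetical config: one line per role. The identifier strings are
# illustrative labels, not confirmed API model names.
MODEL_FOR_ROLE = {
    "state_gm": "claude-sonnet-4.6",
    "platform_em": "claude-sonnet-4.6",
    "code_judgment": "gpt-5-codex",
}


def backing_model(role: AgentRole, models: dict = MODEL_FOR_ROLE) -> str:
    """Look up which model currently backs a role."""
    return models[role.name]


code_judgment = AgentRole(
    name="code_judgment",
    system_prompt="You review pull requests against the engineering tenets.",
    skills=("code-review",),
    tools=("read_diff",),
    memory_scope="platform",
)
```

Swapping the model means editing one entry of the mapping; the AgentRole value is untouched, which is the sense in which the role stays stable.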
Platform EM is peer to the State GMs in role, but its scope crosses all of them — encoded by the bar literally spanning the canvas. State GMs file PRs upward into the EM unit, which operates three peer-review tools.
The agent organization is gated behind a single feature flag. Set ENABLE_AGENT_LAYER=true and /api/v1/gm/* mounts onto the FastAPI app. Set it to false (the Forsyth deploy configuration) and the entire agent runtime disappears while the data plane keeps running. The flag is the deliverable boundary, expressed in code.
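The flag behavior can be sketched without FastAPI as a function that decides which route prefixes get mounted. This is an illustration under assumptions: the helper names, the /api/v1/data prefix, and the rule that only the literal string "true" enables the layer are all hypothetical. In a real FastAPI app the conditional branch would typically be an app.include_router(...) call with prefix="/api/v1/gm".

```python
import os


def agent_layer_enabled(env=None) -> bool:
    """Read ENABLE_AGENT_LAYER; anything other than 'true' counts as off (assumed)."""
    env = os.environ if env is None else env
    return env.get("ENABLE_AGENT_LAYER", "false").strip().lower() == "true"


def mounted_prefixes(env=None) -> list:
    """Return the route prefixes the app would expose for this environment."""
    prefixes = ["/api/v1/data"]  # hypothetical data-plane prefix, always mounted
    if agent_layer_enabled(env):
        # Only in this branch would the agent package be imported and mounted,
        # so a disabled build never touches agent code at runtime.
        prefixes.append("/api/v1/gm")
    return prefixes
```

With the flag off, mounted_prefixes({"ENABLE_AGENT_LAYER": "false"}) contains only the data-plane prefix, which mirrors the Forsyth configuration described above.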
Four trigger cadences. One control plane. The agent runtime is feature-flagged off the data plane.
The org has three loops, each operating at a different timescale. The inner loop runs on every PR. The middle loop runs daily and weekly. The outer loop runs over multi-day experiments. They compound: every comment Gib makes can become a future scorer; every shipped experiment becomes a config row that future experiments ratchet from.
Webhook fires Platform EM. Mechanical DoD runs deterministic gates in parallel with Codex Code Judgment. Aggregator decides verdict, keyed to PR tier. Auto-fix authority lets EM repair mechanical gaps itself rather than block.
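The inner loop can be sketched as two reviewers run concurrently and an aggregator that maps their results and the PR tier to a verdict. Everything concrete here is an assumption: the Review shape, the stub gate checks, the verdict names, the BLOCK outcome for disagreement, and the choice to route Tier S to a doc regardless of review results.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    passed: bool
    fixable: bool = False  # True when a failure is a mechanical gap the EM can repair


async def mechanical_dod(pr: dict) -> Review:
    # Deterministic gates (hypothetical): tests present and lint clean.
    passed = pr["has_tests"] and pr["lint_clean"]
    return Review(passed=passed, fixable=not passed)


async def code_judgment(pr: dict) -> Review:
    # Stand-in for the second-model review; a real version would call an LLM.
    return Review(passed=pr["judge_approves"])


def aggregate(mech: Review, judge: Review, tier: str) -> str:
    """Decide the verdict, keyed to PR tier (ordering of checks is assumed)."""
    if tier == "S":
        return "ESCALATE_TO_DOC"
    if not mech.passed and mech.fixable:
        return "AUTO_FIX"
    if mech.passed and judge.passed:
        return "AUTO_MERGE"
    return "BLOCK"


async def review_pr(pr: dict) -> str:
    # Both reviewers run concurrently; neither sees the other's output.
    mech, judge = await asyncio.gather(mechanical_dod(pr), code_judgment(pr))
    return aggregate(mech, judge, pr["tier"])
```

For example, asyncio.run(review_pr({...})) with a lint failure on a Tier B PR yields AUTO_FIX in this sketch, so the gap is repaired rather than blocking the PR.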
Founder commentary on Google Docs is mined into structured episodic memory. Weekly clustering produces scorer candidates. Promoted candidates ship as judge scorers in the experiment harness. Manager intuition becomes ML training signal.
Agents propose experiments, gated by per-agent monthly budget envelope. Variants forward-replay through the production context-builder. Wins write append-only vendor_config rows. Daily verifier compares post-shipment metrics; auto-reverts on regression.
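The append-only write and the auto-revert can be sketched together: a revert is itself a new row that copies the previous parameters, so history is never rewritten. This is an in-memory illustration only. The class and function names, the row fields, and the tolerance rule are assumptions; the real vendor_config schema is not described in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfigRow:
    version: int
    params: dict
    reverted_from: Optional[int] = None  # set when this row undoes another row


class VendorConfigLog:
    """Append-only log: rows are added, never updated or deleted."""

    def __init__(self, initial_params: dict):
        self._rows = [ConfigRow(1, dict(initial_params))]

    @property
    def active(self) -> ConfigRow:
        return self._rows[-1]

    def history(self) -> tuple:
        return tuple(self._rows)

    def ship(self, params: dict) -> ConfigRow:
        row = ConfigRow(len(self._rows) + 1, dict(params))
        self._rows.append(row)
        return row

    def revert(self) -> ConfigRow:
        # Re-append the parameters of the row that the active row superseded.
        current, previous = self._rows[-1], self._rows[-2]
        row = ConfigRow(len(self._rows) + 1, dict(previous.params),
                        reverted_from=current.version)
        self._rows.append(row)
        return row


def verify_shipment(log: VendorConfigLog, baseline: float, observed: float,
                    tolerance: float = 0.0) -> bool:
    """Daily check (sketch): revert if the observed metric regressed past tolerance."""
    if len(log.history()) > 1 and observed < baseline - tolerance:
        log.revert()
        return True
    return False
```

Because a revert appends rather than deletes, the log still shows that the losing config shipped and when it was undone.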
Most PRs exit the wheel at AUTO-MERGE. A fixable mechanical gap loops back via the prominent AUTO-FIX edge. Tier S — product-level risk — peels off to a plain-English Google Doc for Gib.
Agent memory isn't one thing. It's four distinct layers, each with different persistence, update cadence, and access pattern. The split mirrors memory models in cognitive science, the way human organizations manage knowledge, and the Karpathy "LLM Knowledge Base" pattern. The bar below encodes persistence as length, borrowing the visual idiom CS textbooks use for the memory hierarchy (registers → L1 → L2 → RAM → disk).
Layers shown in the bar: Current task · What happened · What we know · How we do things
Tiers are computed mechanically from PR file paths. The merge verdict depends on tier — Tier C/B/A auto-merge when both reviewers agree; Tier S routes to a Google Doc summary in plain English. Two architectural decisions sit underneath this: ADR-005 removes the founder from the PR approval loop; ADR-006 forbids single-model review.
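Computing a tier mechanically from file paths can be sketched as an ordered rule table where the PR takes the highest tier among its files. The path patterns, the default tier for unmatched paths, and the ordering C < B < A < S are illustrative assumptions; the actual rules are not given here.

```python
from fnmatch import fnmatch

# Assumed risk ordering, lowest to highest.
TIER_ORDER = ["C", "B", "A", "S"]

# Hypothetical path rules; the first matching pattern wins for each file.
RULES = [
    ("docs/*", "C"),
    ("tests/*", "C"),
    ("agents/*", "B"),
    ("migrations/*", "A"),
    ("billing/*", "S"),
]

DEFAULT_TIER = "A"  # assumed conservative default for paths no rule covers


def tier_for_path(path: str) -> str:
    for pattern, tier in RULES:
        if fnmatch(path, pattern):
            return tier
    return DEFAULT_TIER


def tier_for_pr(paths: list) -> str:
    """A PR is as risky as its riskiest file."""
    if not paths:
        return DEFAULT_TIER
    return max((tier_for_path(p) for p in paths), key=TIER_ORDER.index)
```

Under these assumed rules, a PR touching docs/intro.md and billing/plan.py is Tier S, because one product-level file is enough to route the whole PR to the plain-English summary.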
Two pieces of the system are unusual relative to most agent stacks. The wiki collapses agent onboarding into the same surface humans already use. The experiment harness eliminates the drift between test and production by sharing the production context-builder.
Most agent harnesses bake "what we know" into the system prompt. Changing what the agent knows requires changing code, running tests, reviewing the diff, deploying. The cycle is hours-to-days.
Our wiki is a database table keyed by (state, domain). The Code Judgment agent's onboarding — engineering tenets, code-review checklist — is three wiki pages. Open the wiki UI, edit a tenet, hit save. The next PR review uses the new tenet.
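A minimal sketch of the idea, using SQLite from the standard library: the page table is keyed by (state, domain) and the review prompt is assembled from the table at review time, so an edit is visible to the next review with no deploy. The table name, column names, helper functions, and the "platform" state value are assumptions for illustration.

```python
import sqlite3


def open_wiki() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE wiki_pages (
               state  TEXT NOT NULL,
               domain TEXT NOT NULL,
               body   TEXT NOT NULL,
               PRIMARY KEY (state, domain)
           )"""
    )
    return conn


def save_page(conn, state: str, domain: str, body: str) -> None:
    # Saving an existing (state, domain) key replaces the page body.
    conn.execute(
        "INSERT OR REPLACE INTO wiki_pages (state, domain, body) VALUES (?, ?, ?)",
        (state, domain, body),
    )
    conn.commit()


def build_review_prompt(conn, state: str, domains: list) -> str:
    """Read onboarding pages fresh on every call; nothing is baked into code."""
    parts = []
    for domain in domains:
        row = conn.execute(
            "SELECT body FROM wiki_pages WHERE state = ? AND domain = ?",
            (state, domain),
        ).fetchone()
        if row is not None:
            parts.append(row[0])
    return "\n\n".join(parts)
```

The design choice this illustrates is that the prompt is a query result rather than a constant, which is what shortens the change cycle from a deploy to a save.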
Most A/B harnesses build a separate, simplified context for variants, so experiment results don't predict production behavior. We solved this by making the variant runner a thin wrapper around the live build_matrix_context(), with snapshot data passed as an override.
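The wrapper pattern can be sketched as follows. Only the name build_matrix_context comes from the text above; its signature, the data_override parameter, the config shape, and the sample data are assumptions, and the function body is a stand-in for the real production builder.

```python
# Hypothetical live data store standing in for production reads.
LIVE_DATA = {"lead-1": {"score": 0.7, "region": "north", "age_days": 3}}


def load_live_data(key: str) -> dict:
    return LIVE_DATA[key]


def build_matrix_context(key: str, config: dict, data_override=None) -> dict:
    """Stand-in for the production context-builder (real signature unknown)."""
    data = data_override if data_override is not None else load_live_data(key)
    fields = sorted(config["fields"])
    return {"key": key, "rows": [(name, data[name]) for name in fields]}


def run_variant(key: str, variant_config: dict, snapshot: dict) -> dict:
    # The variant runner adds no context logic of its own: it only swaps the
    # data source, so any change to the builder applies to both paths at once.
    return build_matrix_context(key, variant_config, data_override=snapshot)
```

If a replay with the production config and a snapshot equal to the live data returns the same context as the production call, the two paths have not drifted; in this sketch that holds by construction.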
Variants and production share one code path. There is no drift. Agents are first-class consumers of the harness through five custom tools. Per-agent monthly budget envelopes cap spend. Daily verifier auto-reverts shipped configs whose production metrics drop.
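A per-agent monthly budget envelope can be sketched as a ledger that refuses any experiment whose cost would push the agent past its limit for that month. The class name, the use of integer cents, the month key format, and the agent identifiers are illustrative assumptions.

```python
from collections import defaultdict


class BudgetEnvelope:
    """Tracks spend per (agent, month) against a per-agent monthly limit."""

    def __init__(self, limits_cents: dict):
        self._limits = dict(limits_cents)   # agent -> monthly limit in cents
        self._spent = defaultdict(int)      # (agent, month) -> cents spent

    def remaining(self, agent: str, month: str) -> int:
        return self._limits[agent] - self._spent[(agent, month)]

    def try_spend(self, agent: str, month: str, cost_cents: int) -> bool:
        """Reserve budget for an experiment; return False if it would overspend."""
        if cost_cents > self.remaining(agent, month):
            return False
        self._spent[(agent, month)] += cost_cents
        return True
```

Integer cents are used here to avoid floating-point rounding in the cap check; a new month starts with a full envelope because spend is keyed by month.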
The Forsyth Phase 1 Agreement perpetually licenses anything embedded in the deliverable. By keeping the agent organization in a sibling directory the deliverable never imports, the org stays GoodLeads-only retained IP. A CI test fails the build on any new import that crosses the line. Mike's portability methodology becomes automatable: a Forsyth-targeted build literally strips agents/ from the image.
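The import boundary check can be sketched with the standard ast module: parse each deliverable source file and report any absolute import of the agents package. The function name and the decision to ignore relative imports are assumptions; a CI test would run this over every file in the deliverable tree and fail if any hit is returned.

```python
import ast

FORBIDDEN_ROOT = "agents"  # the sibling package the deliverable must not import


def forbidden_imports(source: str, forbidden_root: str = FORBIDDEN_ROOT) -> list:
    """Return the imported module names in `source` that cross the boundary."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            if name == forbidden_root or name.startswith(forbidden_root + "."):
                hits.append(name)
    return hits
```

Matching on the exact root or a dotted prefix avoids false positives on unrelated packages whose names merely start with the same letters.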
Same source tree. Two states. The agent organization is a literal subtraction — not a wall, not a separate repo. Every customer after Forsyth is the same shape: data platform shipped, agent org retained.
All numbers below are 2026-05-01 production reality, not aspirations.
A competitor copying the surface ships in a sprint. Copying the architectural choices behind it takes a year. Each "high-difficulty" row below is only one or two commits in size, but it presupposes architectural commitments made months earlier, and it only works because those decisions were correct at the time.