BRIEF: System architecture · DATE: 2026-05-01 · STATUS: Live in production

Agent Organization, Not An Agent

A team of agents. Working today.

GoodLeads ships product through an organization of AI agents the same way a software company ships through an organization of people: with managers, peers, rituals, escalation paths, and a culture documented in a wiki. The system is not an agent. It is an agent org — vertical roles, a horizontal role, a cross-model peer review function, and three self-improving loops.

The Org Chart

Vertical agents. One horizontal agent. Peer review functions.

Each box below is an agent identity defined by its system prompt, its skill library, its tool permissions, and its memory scope, not by which LLM happens to back it. Three of the four agent roles run on Claude Sonnet 4.6; one (Code Judgment) runs on GPT-5 Codex. Models can be swapped by changing one config line. Roles are stable.
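A minimal sketch of that separation, assuming a role is held as a small immutable record. The class name, field names, and model identifier strings below are illustrative, not taken from the GoodLeads codebase.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentRole:
    """Prompt, skills, tools, and memory scope define the role; the model is one field."""
    name: str
    system_prompt: str
    skills: tuple[str, ...]
    tool_permissions: frozenset[str]
    memory_scope: str
    model: str  # the single value that changes when the backing LLM is swapped


# Hypothetical values for illustration only.
code_judgment = AgentRole(
    name="code_judgment",
    system_prompt="Review pull requests against the engineering tenets.",
    skills=("code-review-checklist",),
    tool_permissions=frozenset({"read_skill"}),
    memory_scope="platform",
    model="gpt-5-codex",
)


def swap_model(role: AgentRole, model: str) -> AgentRole:
    """Return a copy of the role backed by a different model; nothing else changes."""
    return replace(role, model=model)


swapped = swap_model(code_judgment, "claude-sonnet-4.6")
```

Everything that makes the role recognizable survives the swap, which is the sense in which roles are stable and models are not.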

Platform EM is peer to the State GMs in role, but its scope crosses all of them — encoded by the bar literally spanning the canvas. State GMs file PRs upward into the EM unit, which operates three peer-review tools.

System Architecture

Where the org actually lives.

The agent organization is gated behind a single feature flag. Set ENABLE_AGENT_LAYER=true and /api/v1/gm/* mounts onto the FastAPI app. Set it false (the Forsyth deploy configuration) and the entire agent runtime disappears — the data plane keeps running. This is the deliverable boundary in code.
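The flag check can be sketched as below. Only ENABLE_AGENT_LAYER and the /api/v1/gm prefix come from the text above; the function names and the data-plane prefix are assumptions. In a FastAPI app the mounting step would typically be a call such as app.include_router(router, prefix="/api/v1/gm"), which is stubbed here as a plain list append so the sketch stays dependency-free.

```python
import os
from typing import Callable, Mapping

AGENT_PREFIX = "/api/v1/gm"


def agent_layer_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Treat the layer as off unless the flag is explicitly 'true'."""
    return env.get("ENABLE_AGENT_LAYER", "false").strip().lower() == "true"


def mount_agent_routes(routes: list[str]) -> None:
    """Stand-in for mounting the agent router onto the app."""
    routes.append(AGENT_PREFIX)


def build_routes(env: Mapping[str, str],
                 mount_agents: Callable[[list[str]], None]) -> list[str]:
    """Data-plane routes always mount; agent routes mount only behind the flag."""
    routes = ["/api/v1/data"]  # hypothetical data-plane prefix
    if agent_layer_enabled(env):
        mount_agents(routes)
    return routes
```

Defaulting to off means a deploy that never sets the flag behaves like the Forsyth configuration.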

Four trigger cadences. One control plane. The agent runtime is feature-flagged off the data plane.

Three Self-Improvement Loops

Concentric loops. Different cadences. One direction.

The org has three loops, each operating at a different timescale. The inner loop runs on every PR. The middle loop runs daily and weekly. The outer loop runs over multi-day experiments. They compound: every comment Gib makes can become a future scorer; every shipped experiment becomes a config row that future experiments ratchet from.

Inner: every PR · Middle: daily / weekly · Outer: multi-day

Loop 01 — Inner

PR Review

Cadence: every PR · 3-5 min

Webhook fires Platform EM. Mechanical DoD runs deterministic gates in parallel with Codex Code Judgment. Aggregator decides verdict, keyed to PR tier. Auto-fix authority lets EM repair mechanical gaps itself rather than block.

Webhook → HMAC validation
Mechanical DoD ∥ Code Judgment
Aggregator → tiered verdict
Auto-merge · advisory · ack · block
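The first step, HMAC validation of the incoming webhook, might look like the sketch below. It assumes an HMAC-SHA256 signature sent as "sha256=<hex digest>" over the raw request body; the source does not state the header format or hash, so treat both as assumptions.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True only if the header carries a valid HMAC-SHA256 of the raw body."""
    prefix = "sha256="  # assumed header format
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    provided = signature_header[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), provided.encode())
```

The check has to run on the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change the digest.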

Loop 02 — Middle

Learning

Cadence: daily mining · weekly clusters

Founder commentary on Google Docs is mined into structured episodic memory. Weekly clustering produces scorer candidates. Promoted candidates ship as judge scorers in the experiment harness. Manager intuition becomes ML training signal.

Mine doc comments → corpus table
Cluster by recency + frequency
Cluster review artifact for founder
Promoted → judge scorers (14 shipped)
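One way to combine recency and frequency into a single cluster ranking is an exponentially decayed count, sketched below. The half-life, the function names, and the cluster names in the usage lines are assumptions; the source does not describe the actual clustering or scoring method.

```python
from datetime import date


def cluster_score(comment_dates: list[date], today: date,
                  half_life_days: float = 14.0) -> float:
    """Each comment counts as 1.0 when new and halves every half_life_days."""
    return sum(0.5 ** ((today - d).days / half_life_days) for d in comment_dates)


def rank_clusters(clusters: dict[str, list[date]], today: date,
                  half_life_days: float = 14.0) -> list[str]:
    """Order cluster names from strongest scorer candidate to weakest."""
    return sorted(
        clusters,
        key=lambda name: cluster_score(clusters[name], today, half_life_days),
        reverse=True,
    )


today = date(2026, 5, 1)
clusters = {
    "stale-theme": [date(2026, 1, 1)] * 3,                 # frequent but old
    "fresh-theme": [date(2026, 4, 29), date(2026, 4, 30)],  # fewer but recent
}
ranking = rank_clusters(clusters, today)
```

With these inputs the recent two-comment cluster outranks the older three-comment one, which is the behavior a recency-weighted review artifact would want.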

Loop 03 — Outer

Experiment

Cadence: multi-day · agent-driven

Agents propose experiments, gated by per-agent monthly budget envelope. Variants forward-replay through the production context-builder. Wins write append-only vendor_config rows. Daily verifier compares post-shipment metrics; auto-reverts on regression.

propose_experiment → budget gate
Forward-replay through live harness
Ship rules → vendor_config row
verify-recent-shipments → auto-revert
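The budget gate and the auto-revert step can be sketched together. The assumptions here: a revert is expressed as a new appended row that restores the previous rules (so the log stays append-only), a higher metric is better, and a 2% tolerance defines a regression. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetEnvelope:
    """Per-agent monthly spend cap."""
    monthly_limit_usd: float
    spent_usd: float = 0.0

    def approve(self, estimated_cost_usd: float) -> bool:
        """Reserve budget for an experiment, or refuse if it would exceed the cap."""
        if self.spent_usd + estimated_cost_usd > self.monthly_limit_usd:
            return False
        self.spent_usd += estimated_cost_usd
        return True


@dataclass
class ConfigLog:
    """Append-only stand-in for vendor_config: rows are added, never edited."""
    rows: list[dict] = field(default_factory=list)

    def ship(self, experiment_id: str, rules: dict) -> None:
        self.rows.append({"experiment_id": experiment_id, "rules": rules, "is_revert": False})

    def current(self) -> dict:
        return self.rows[-1]["rules"]


def verify_shipment(log: ConfigLog, baseline: float, observed: float,
                    tolerance: float = 0.02) -> bool:
    """Append a revert row if the post-shipment metric regressed; return True if reverted."""
    if observed >= baseline * (1 - tolerance) or len(log.rows) < 2:
        return False
    latest, previous = log.rows[-1], log.rows[-2]
    log.rows.append({"experiment_id": latest["experiment_id"],
                     "rules": previous["rules"], "is_revert": True})
    return True
```

Because the revert is itself a row carrying the experiment_id, the history of what shipped and what was rolled back stays in one table.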

Most PRs exit the wheel at AUTO-MERGE. A fixable mechanical gap loops back via the prominent AUTO-FIX edge. Tier S — product-level risk — peels off to a plain-English Google Doc for Gib.

Four-Layer Memory

What's stored. Where. How long.

Agent memory isn't one thing. It's four distinct layers, each with different persistence, update cadence, and access pattern. This maps to cognitive science, to how human organizations manage knowledge, and to the Karpathy "LLM Knowledge Base" pattern. The bar below encodes persistence as length — borrowing the visual idiom CS textbooks use for the memory hierarchy (registers → L1 → L2 → RAM → disk).

[Figure: persistence bar running from volatile (seconds) to persistent (indefinite), in the order Working, Episodic, Semantic, Procedural. Length = time-to-decay · Color saturation = update cadence]

Layer 01

Working

Current task

Storage: Injected briefing binder + session events
Updated: Per session
Used by: The agent in the moment

Layer 02

Episodic

What happened

Storage: agent_finding · agent_doc_comment · experiment_result
Updated: Every session
Used by: Briefing binder

Layer 03

Semantic

What we know

Storage: wiki_page (state × domain)
Updated: Weekly synthesis skill
Used by: GMs and Code Judgment

Layer 04

Procedural

How we do things

Storage: skills/*.md · JTBD-shaped
Updated: PR-level evolution
Used by: read_skill tool, on demand

The Trust Ladder

Autonomy is a gradient. Not a switch.

Tiers are computed mechanically from PR file paths. The merge verdict depends on tier — Tier C/B/A auto-merge when both reviewers agree; Tier S routes to a Google Doc summary in plain English. Two architectural decisions sit underneath this: ADR-005 removes the founder from the PR approval loop; ADR-006 forbids single-model review.
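A sketch of mechanical tiering and the tier-keyed verdict, under stated assumptions: the glob patterns are invented for illustration, a PR takes the highest tier of any file it touches, unmatched paths default to the most cautious tier, and any non-passing review is collapsed to "block" (the real aggregator also has advisory and ack outcomes).

```python
from fnmatch import fnmatch

# Hypothetical path rules, checked from highest risk to lowest.
TIER_RULES = [
    ("S", ["backend/scoring/formula*", "backend/vendors/cost*", "schemas/deliverable*"]),
    ("A", ["backend/scoring/*", "backend/data/*"]),
    ("B", ["backend/*"]),
    ("C", ["docs/*", "tests/*", "README*"]),
]
RISK_ORDER = "CBAS"  # lowest risk first


def file_tier(path: str) -> str:
    """Return the first matching tier; unknown paths get the most cautious tier."""
    for tier, patterns in TIER_RULES:
        if any(fnmatch(path, pattern) for pattern in patterns):
            return tier
    return "S"


def pr_tier(paths: list[str]) -> str:
    """A PR is as risky as the riskiest file it touches."""
    return max((file_tier(p) for p in paths), key=RISK_ORDER.index, default="S")


def verdict(tier: str, mechanical_ok: bool, judgment_ok: bool,
            codeowners_ok: bool = True) -> str:
    """Tier S always goes to the founder doc; lower tiers merge only on full agreement."""
    if tier == "S":
        return "founder-doc"
    if tier == "A" and not codeowners_ok:
        return "block"
    return "auto-merge" if mechanical_ok and judgment_ok else "block"
```

Keeping the rules as data means a tier change is a small diff to the policy file rather than a change to the aggregator.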

Low Risk

docs
tests
README

Verdict: Auto-merge if both reviewers green. Founder never sees these unless asked.

~50% of PRs

Internal

internal services
support utilities
non-hot-path

Verdict: Auto-merge if both reviewers green. Logged in daily sweep.

~27%

Hot Path

scoring helpers
data services
codeowners-gated

Verdict: Auto-merge if both reviewers green and CODEOWNERS satisfied.

~15%

Product Risk

scoring formula
vendor cost path
deliverable schema

Verdict: Plain-English Google Doc to founder. Comment to approve. Next session merges.

~8%

Column width = approximate PR volume · Sprints 10–12 · Tier S is rare by design

The Forward-Thinking Pieces

Wiki-onboarded agents. An experiment harness that runs on the harness.

Two pieces of the system are unusual relative to most agent stacks. The wiki collapses agent onboarding into the same surface humans already use. The experiment harness eliminates the drift between test and production by sharing the production context-builder.

Wiki Layer

Edit the agent in seconds.

Most agent harnesses bake "what we know" into the system prompt. Changing what the agent knows requires changing code, running tests, reviewing the diff, and deploying. The cycle takes hours to days.

Our wiki is a database table keyed by (state, domain). The Code Judgment agent's onboarding — engineering tenets, code-review checklist — is three wiki pages. Open the wiki UI, edit a tenet, hit save. The next PR review uses the new tenet.

Onboarding loop is editable in seconds, not days
Different agents read different views of the same table
Wiki edits are versioned and auditable
Semantic memory updates itself via weekly synthesis skill
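A minimal sketch of a versioned wiki table keyed by (state, domain), using an in-memory SQLite database. The wiki_page name and the (state, domain) key come from this brief; the version and body columns, the function names, and the example key values are assumptions.

```python
import sqlite3
from typing import Optional


def open_wiki() -> sqlite3.Connection:
    """Create an in-memory wiki table; every save adds a new version row."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE wiki_page (
               state   TEXT    NOT NULL,
               domain  TEXT    NOT NULL,
               version INTEGER NOT NULL,
               body    TEXT    NOT NULL,
               PRIMARY KEY (state, domain, version)
           )"""
    )
    return db


def save_page(db: sqlite3.Connection, state: str, domain: str, body: str) -> int:
    """Append a new version instead of overwriting, so edits stay auditable."""
    (latest,) = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM wiki_page WHERE state = ? AND domain = ?",
        (state, domain),
    ).fetchone()
    db.execute("INSERT INTO wiki_page VALUES (?, ?, ?, ?)",
               (state, domain, latest + 1, body))
    db.commit()
    return latest + 1


def read_page(db: sqlite3.Connection, state: str, domain: str) -> Optional[str]:
    """Return the newest body for a key; this is what the next session would load."""
    row = db.execute(
        "SELECT body FROM wiki_page WHERE state = ? AND domain = ? "
        "ORDER BY version DESC LIMIT 1",
        (state, domain),
    ).fetchone()
    return row[0] if row else None
```

Because the agent reads the latest row at session start, a saved edit reaches the next review without a deploy.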

Experiment Harness

Variants run on the live context.

Most A/B harnesses build a separate, simplified context for variants — and then experiments don't predict production. We solved this by making the variant runner a thin wrapper around the live build_matrix_context(), with snapshot data passed as override.

Variants and production share one code path, so there is no separate variant context to drift. Agents are first-class consumers of the harness through five custom tools. Per-agent monthly budget envelopes cap spend. A daily verifier auto-reverts shipped configs whose production metrics drop.

Forward-replay = no drift between test and production
Agents propose, run, ship within budget
Append-only vendor_config carries experiment_id
Auto-revert closes the boy-scout-rules loop
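The thin-wrapper idea can be sketched as follows. The real build_matrix_context() signature is not shown in this brief, so the version below is a stand-in with an assumed override parameter and invented context fields; run_variant and the scorer are likewise hypothetical.

```python
from typing import Any, Callable, Mapping, Optional


def build_matrix_context(lead_id: str,
                         override: Optional[Mapping[str, Any]] = None) -> dict:
    """Stand-in for the production context-builder (assumed signature and fields)."""
    live = {"lead_id": lead_id, "vendor": "live-vendor",
            "score_weights": {"recency": 0.5}}
    # Snapshot values replace live values key by key; everything else is unchanged.
    return {**live, **(override or {})}


def run_variant(lead_id: str, snapshot: Mapping[str, Any], variant_rules: dict,
                scorer: Callable[[dict, dict], float]) -> float:
    """Thin wrapper: the variant sees exactly what production code would build."""
    context = build_matrix_context(lead_id, override=snapshot)
    return scorer(context, variant_rules)


def example_scorer(context: dict, rules: dict) -> float:
    """Hypothetical scorer used only to exercise the wrapper."""
    return context["score_weights"]["recency"] * rules["multiplier"]


result = run_variant("lead-1", {"score_weights": {"recency": 0.8}},
                     {"multiplier": 2}, example_scorer)
```

The wrapper contains no context logic of its own, so any change to the production builder is picked up by experiments automatically.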

Commercial Architecture

The boundary that makes the IP licensable.

The Forsyth Phase 1 Agreement perpetually licenses anything embedded in the deliverable. By keeping the agent organization in a sibling directory the deliverable never imports, the org stays GoodLeads-only retained IP. A CI test fails the build on any new import that crosses the line. Mike's portability methodology becomes automatable: a Forsyth-targeted build literally strips agents/ from the image.
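An import-boundary check of this kind can be written with the standard ast module. The sketch below assumes the forbidden package roots are agents and backend.services.agents (derived from the directory names in this brief) and only inspects absolute imports; how the real CI test handles relative imports or an allowlist of existing imports is not stated.

```python
import ast

# Package roots the deliverable must never import (assumed from the directory layout).
FORBIDDEN_ROOTS = ("agents", "backend.services.agents")


def _is_forbidden(module: str) -> bool:
    return any(module == root or module.startswith(root + ".")
               for root in FORBIDDEN_ROOTS)


def forbidden_imports(source: str) -> list[str]:
    """Return one offending module name per import statement that crosses the line."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Check both the module and module.name, so that
            # "from backend.services import agents" is also caught.
            candidates = [node.module] + [f"{node.module}.{alias.name}"
                                          for alias in node.names]
        else:
            continue
        offender = next((c for c in candidates if _is_forbidden(c)), None)
        if offender is not None:
            found.append(offender)
    return found
```

A CI test would run this over every file in the data-plane directories and fail the build if any file returns a non-empty list.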

GoodLeads Build · Full

What we ship to ourselves

formation-gtm/
backend/ · data plane
pipeline/ · data plane
classification-service/ · data plane
agents/ · retained IP
backend/services/agents/ · retained IP

All directories present · agent layer LIVE

Forsyth Build · Stripped

What ships to a licensee

formation-gtm/
backend/ · data plane
pipeline/ · data plane
classification-service/ · data plane
agents/ · removed
backend/services/agents/ · removed

Agent layer stripped · CI test enforces import boundary

Same source tree. Two states. The agent organization is a literal subtraction — not a wall, not a separate repo. Every customer after Forsyth is the same shape: data platform shipped, agent org retained.

Operational Evidence

Today, on a normal Friday.

All numbers below are 2026-05-01 production reality, not aspirations.

Scheduled agent sessions fired between 7:53 and 9:31 CT today · all exit 0

14

Corpus-derived judge scorers in production · each tied to a founder comment

130+

Merged PRs across Sprints 10-12 · majority agent-authored, auto-merged

Pages or engineer interventions required this week · org runs itself

Today · 2026-05-01 · captured at 9:51 CT

[Timeline, 6 AM to 12 PM]
5 PR reviews · Platform EM · 7:53–8:54
CO standup · 8:31 → email 8:55
Platform EM × 2 · daily sweep + weekly · 9:01
FL standup · 9:31 → email 9:47

7 sessions · all exit 0 · 9 emails delivered · 0 engineer interventions · rest of day quiet by design

For The Technical Review Board

What's hard to copy. And why.

A competitor copying the surface ships in a sprint. Copying the architectural choices behind it takes a year. Each "high-difficulty" row below presupposes commitments made months earlier — they're each one or two commits in size, but they require having decided correctly back then.

Component | Difficulty | Why
Org chart (vertical + horizontal agents) | Medium | Pattern is now public; the discipline of not collapsing roles into one mega-agent is the hard part.
Cross-model review | Low–med | Two API integrations + an aggregator. Defensible only because we have it shipped.
Trust ladder + tier-keyed autonomy | Med–high | The policy file is small, but routing Tier S to plain-English Google Docs (not Slack DMs to engineers) is a cultural commitment most teams won't make.
Wiki-onboarded agents | Medium | Schema + 1 tool. Easy to copy, but only valuable if you also have a manager who actually edits the wiki.
Comment-corpus → judge scorer | High | Requires (a) a manager who comments in writing on a stable surface, (b) a clustering pipeline, (c) the trust-ladder gate. Competitors try to replace this with feedback forms; the manager corpus is the differentiator.
Forward-replay experiment harness | High | Variants share the production context-builder, not a separate one. Most teams compromise this; once compromised, experiments stop predicting production.
Boy-scout-rules auto-revert | Med–high | Per-vendor verifier registry + daily cron + finding system. Most teams stop at "agent ships"; we kept going to "agent ships, agent verifies, agent reverts."
Deliverable Boundary as commercial architecture | High | Legal and commercial design choice enforced by code. Most teams don't realize it's a choice until they try to license their work and find their IP fused with their deliverable.
