Personal project · Case study

Inter: running a multi-agent AI operation under real governance

Can one person direct a team of AI agents with the discipline an enterprise would demand of any production system, and have them do real work? This project is my working answer. The control set covers documented change management, separation of duties, independent review, incident response, and auditability.

Jonathan Franks, CISSP, CRISC · 25+ years in IT & cybersecurity leadership

What it is

Inter is a small fleet of AI agents with defined roles, running on separate platforms. Together they operate a production job-search automation pipeline end to end: multi-source aggregation, LLM-assisted scoring with grounding verification, deduplication, a database-backed coordination layer that can wake agents on demand, scheduled compliance record-keeping, and a web operations dashboard. I serve as Release Manager, the human authority that every consequential decision routes through.

The pipeline is the workload. The governance is what I built it to prove out.

~2 hrs

task dispatched > agent woken autonomously > built & tested > independently reviewed > deployed to production

452

tests passing, none skipped, on a release authored end-to-end by an autonomously woken agent, including integration proofs against a real database

<$0.01

marginal cost per production pipeline run, by deliberate model-tier and batch economics

~2 min

detection-to-revocation on a real P1 credential exposure, under a pre-written incident protocol

The governance, actually practiced

The control set is derived from a policy framework designed for alignment with ISO 27001, ISO 27701, ISO 42001, and SOC 2. It is scaled honestly to a single-operator project. Where cost or scale justified a deviation, it went into a maintained exceptions register.

Change control & release discipline

Versioned, frozen release packages with document-history requirements on every artifact. Shipped versions are never edited in place. Deviations from this rule have happened, and each one was caught, documented, and remediated on the record.
Separation of duties: the agent that authors a release never reviews it. A second seat performs code review and security architecture review on every release.
Every release package ships with evidence, not assertions: a full passing test suite, static security analysis, dependency vulnerability audit, and a secret scan, all archived with the release. Deployments end with smoke tests, and those tests got stricter after a live incident showed that a page can return 200 while everything behind it is broken.
Mandatory independent review gate on every release: a standing pair of reviewers from two other vendors, run as a deliberate cross-vendor check, with verdicts required to cite the evidence behind every pass or fail. The gate has caught real defects before deploy, including a credential-leak path in exception messages and a flaw that would have quietly defeated a tamper-evidence control. The detection record is tracked, and it is lopsided: one reviewer catches most of what gets caught, so an unreasoned "looks good" from the other is never leaned on. Disputed findings get verified against primary sources before acceptance, because the stronger reviewer has also been wrong, and that was caught too.
A third review vendor was trialed, then removed on the same day that one of its models fabricated an attribution footer impersonating another reviewer. Reviews only work if you can trust who said what, so the fabrication was flagged as a trust failure, and the decommission is on the record next to the original approval.
Model review is now paired with mechanical verification. Full review passes across multiple vendors and tiers all missed a one-line service-configuration defect that a standard linter caught in about a second at deployment. The lesson went into the process: linters and validators run alongside the reviewers at every code gate, because models and tools miss different things.
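The gate logic described in this list can be sketched in a few lines. This is an illustration only, not the project's actual gate: the class names, the `gate_decision` function, and the default of two required reviewers are assumptions made for the example. It shows the two rules stated above: a verdict that cites no evidence is not counted, and mechanical checks must pass alongside the reviewers.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewVerdict:
    """One reviewer's verdict (hypothetical shape, for illustration)."""
    reviewer: str
    passed: bool
    evidence: list[str] = field(default_factory=list)  # citations backing the verdict


@dataclass
class MechanicalCheck:
    """Result of a linter, validator, or scanner run (hypothetical shape)."""
    tool: str
    passed: bool


def gate_decision(verdicts, checks, required_reviewers=2):
    """Return (ok, reasons). The gate opens only when no reason is recorded."""
    reasons = []
    # A verdict with no cited evidence is reported and then ignored.
    for v in verdicts:
        if not v.evidence:
            reasons.append(f"{v.reviewer}: verdict cites no evidence")
    reasoned = [v for v in verdicts if v.evidence]
    if len({v.reviewer for v in reasoned}) < required_reviewers:
        reasons.append("not enough reasoned reviews")
    for v in reasoned:
        if not v.passed:
            reasons.append(f"{v.reviewer}: review failed")
    # Model review never substitutes for mechanical verification.
    if not checks:
        reasons.append("no mechanical checks ran")
    for c in checks:
        if not c.passed:
            reasons.append(f"{c.tool}: check failed")
    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps the outcome auditable: a blocked release carries the specific finding that blocked it, which can be archived with the release evidence.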

Security operations

P1–P5 incident classification with response timelines. Exercised on real incidents: a credential exposure went detection > revocation > rotation > clean audit > closure documentation inside the protocol's deadlines.
Pipeline failures now page a human. A monthly audit found the engine could die silently, so a failure-alerting path shipped: when the pipeline's service unit fails, a handler writes to the coordination record and posts to chat within seconds. The alerting package itself carried a defect that would have left the alert path dead; the activation checklist's verification step caught it, and the correction is documented in the same record as the finding it closed.
Recurring infrastructure audits mapped to framework control families, producing findings registers, a Plan of Action & Milestones, and a precondition gate that must pass before any internet exposure decision. One audit traced every human-facing alert on the cloud tenancy and found several pointed at mailboxes no human read; all of them now reach a person on two independent paths, and endpoint audience became a standing check.
Secrets discipline born of incident lessons: metadata-only verification of secret-bearing files, single-key extraction, no byte-level inspection. These rules are written into a non-negotiable floor that every agent session inherits at boot.
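As an illustration of the metadata-only rule, the sketch below checks a secret-bearing file without ever opening it. It assumes a POSIX system and an owner-only permission policy; the function name `verify_secret_file` and the specific checks are assumptions for the example, not the project's actual tooling.

```python
import os
import stat


def verify_secret_file(path):
    """Check a secret-bearing file using metadata only; the file is never opened.

    Returns a list of problems; an empty list means the checks passed.
    """
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return ["missing"]
    problems = []
    if not stat.S_ISREG(info.st_mode):
        problems.append("not a regular file")
    if info.st_size == 0:
        problems.append("empty")
    # Any group or other permission bit means access beyond the owner.
    if info.st_mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("permissions wider than owner-only")
    return problems
```

Because the function only calls `os.stat`, no secret bytes can end up in a session log or a model context, which is the failure mode the rule exists to prevent.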

AI-specific management (the ISO 42001 layer)

Every agent operates under an acknowledged role charter, and structural changes require re-acknowledgment. One gateway-based agent was classified as a tool, not a seat, with its own tasking rules and scope fences, and was later ruled out of current builds entirely; both the classification and the retirement are on the record.
Autonomous wake-on-dispatch (a database write can spawn a working agent session) shipped with its prompt-injection risk formally accepted and documented: sender allowlists, database-verified dispatch content, single-flight locks, cooldowns, session logging, and periodic log review as compensating controls. Each new wake surface re-opens the acceptance.
Honest capability reporting is engineered in. Ungrounded model output gets flagged as unverifiable at ingestion, surfaced at triage, and aggressively filtered, because some of it proved to be hallucinated listings. Pipeline runs report degraded status instead of an optimistic "ok".
The reviewers are themselves benchmarked, with exact replays of past review inputs as the standing protocol. An early result that flattered two models was withdrawn when the exact-replay condition showed it had been an artifact of hinted prompts, and the withdrawal is in the ledger next to the claim.
An anti-pattern ledger records the failure modes agents actually exhibited in practice (invented policy exceptions, misplaced log entries, overconfident timing claims). Each one was converted into a standing check that future sessions boot against.
The system is built to survive losing a live session. Working sessions have been wiped mid-task more than once; the durable coordination record held, and a documented boot procedure reconstructs a fresh session from files, never from a model's memory of what it was doing.
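Three of the wake-on-dispatch compensating controls named in this list (sender allowlist, single-flight lock, cooldown) can be sketched as one small guard. This is a simplified in-memory model under stated assumptions: the `WakeGuard` class and its method names are hypothetical, and a real deployment would hold this state in the database so that it survives a lost session.

```python
import time


class WakeGuard:
    """In-memory sketch of allowlist, single-flight, and cooldown checks."""

    def __init__(self, allowed_senders, cooldown_seconds, clock=time.monotonic):
        self.allowed_senders = set(allowed_senders)
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.running = set()        # agents with a session in flight
        self.last_finished = {}     # agent -> time its last session ended

    def try_wake(self, sender, agent):
        """Return (allowed, reason). On success the agent is marked in flight."""
        if sender not in self.allowed_senders:
            return False, "sender not on allowlist"
        if agent in self.running:
            return False, "session already in flight"
        last = self.last_finished.get(agent)
        if last is not None and self.clock() - last < self.cooldown_seconds:
            return False, "cooldown active"
        self.running.add(agent)
        return True, "ok"

    def finish(self, agent):
        """Release the single-flight lock and start the cooldown window."""
        self.running.discard(agent)
        self.last_finished[agent] = self.clock()
```

Every refusal returns a reason string, so a denied wake can be written to the session log and picked up in the periodic log review.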

Architecture, briefly

Three Claude-based seats (operations/review, engineering, infrastructure) run on separate platforms, with independent review performed by OpenAI- and Google-based models. Coordination runs on PostgreSQL with event-driven notification. When a dispatch row addressed to an agent appears, it wakes a headless session that boots against the governance record, does bounded work, reports back, and marks its own dispatch complete. The human-facing layer is a Django operations dashboard with a curated triage workflow. PostgreSQL became the system of record after a controlled, evidence-verified migration from the original datastore, which is retained as a frozen archive; the data plane runs over mutual TLS with per-task client certificates and per-task scoped database roles. Cost engineering is explicit: batch APIs for bulk work at half price, model tiers matched to task difficulty, and fleet-wide spend telemetry.

The coordination layer is currently being hardened from a shared record into a control plane: append-only history enforced by the database rather than by convention, hash-chained entries anchored by human-signed checkpoints with an independent verifier, one database identity per agent, and a single authenticated network path for all coordination traffic. Every piece rides the same gated release process, and none of it activates without express human authorization.
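The hash-chaining idea can be shown with a minimal sketch. It assumes SHA-256 over a canonical JSON encoding and models a checkpoint as a trusted copy of an entry's hash; the function names and entry layout are illustrative assumptions, and the human signature on each checkpoint is not modeled here.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain, payload):
    """Append an entry that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(chain, checkpoints=None):
    """Return the index of the first bad entry, or None if the chain is intact.

    checkpoints maps an entry index to the hash recorded for it out of band.
    """
    prev = GENESIS
    for i, entry in enumerate(chain):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return i
        if checkpoints and i in checkpoints and checkpoints[i] != entry["hash"]:
            return i
        prev = entry["hash"]
    return None
```

The chain alone detects an edit to any single entry. The checkpoint covers the remaining case: someone who rewrites history and recomputes every hash produces a chain that is internally consistent but no longer matches the hash recorded outside it.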

Multi-agent orchestration

PostgreSQL + event-driven wake

Django

LLM grounding verification

Batch API cost engineering

Hardened service configurations

Mutual TLS data plane

Cross-vendor review

A sanitized technical appendix covering the coordination-layer design, audit methodology, and incident handling is available on request.

Why it matters

Most AI-governance experience today is policy written for systems someone else runs. This project closes that loop. The person writing the control set has to live under it while agents do real work, fast, with real credentials and real data. The governance survived contact with autonomous agents, and the places it bent are documented, because keeping that record honest is the actual practice.

It also reflects how I approach the discipline professionally. Frameworks should be working tools, not shelf-ware. Risk acceptance should be a documented decision. And I don't believe you really own a control until you can explain the failure mode it addresses.