AI Production Kit · Edition #2

The controls cheap AI skips.

A starter that calls the model is a demo. Metered token accounting, a per-tenant circuit breaker, an eval gate in CI, and typed guardrails are the gap between a demo and a feature you can charge for. This kit is that gap, wired and tested.

Get the AI Production Kit · Read the docs

illustrative · ci / eval-gate · BLOCKED

$ caisson eval run --suite prompts/golden.yaml --ci
Running 24 cases against golden set...

  case  helpfulness     score 0.91  prev 0.93  Δ -0.02  ok
  case  accuracy        score 0.72  prev 0.91  Δ -0.19  FAIL
  case  refusal_rate    score 0.98  prev 0.97  Δ +0.01  ok

✗ eval gate failed: accuracy regressed 0.19 > tolerance 0.02
  deploy blocked. fix the prompt or update the golden set.
  report: .caisson/eval/2026-06-27T14-09-11Z.json

The failure mode

Cheap AI boilerplate ships the demo, not the controls.

Three ways an AI feature turns into an incident: an unmetered loop triples the invoice, an eval regression ships on Friday, a prompt nobody can audit breaks in production. This kit puts a named control in front of each one.

What ships in the box

Six controls. One kit.

Token metering

Usage writes in the same Postgres transaction as the result — one atomic increment. Concurrent calls never double-count a charge or drop one under load.
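A minimal sketch of that idea, not the kit's actual code. The `SqlClient` shape, the `completions` and `token_usage` tables, and the function name are all assumptions made for illustration; the client interface is chosen so that a node-postgres style client would fit it. The result row and the usage increment share one transaction, and the increment is a single `INSERT ... ON CONFLICT DO UPDATE` statement, so Postgres serializes concurrent writers on the row instead of the application doing a read-then-write.

```typescript
// Hypothetical minimal client shape; a node-postgres client would satisfy it.
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Store the model result and bump the tenant's usage in ONE transaction.
// Table and column names are assumptions for this sketch.
async function saveResultWithUsage(
  db: SqlClient,
  tenantId: string,
  completion: string,
  tokensUsed: number,
): Promise<void> {
  await db.query("BEGIN");
  try {
    await db.query(
      "INSERT INTO completions (tenant_id, body) VALUES ($1, $2)",
      [tenantId, completion],
    );
    // Single-statement upsert: the increment happens inside Postgres,
    // so two concurrent calls cannot overwrite each other's count.
    await db.query(
      `INSERT INTO token_usage (tenant_id, day, tokens)
       VALUES ($1, CURRENT_DATE, $2)
       ON CONFLICT (tenant_id, day)
       DO UPDATE SET tokens = token_usage.tokens + EXCLUDED.tokens`,
      [tenantId, tokensUsed],
    );
    await db.query("COMMIT");
  } catch (err) {
    // If either write fails, neither the result nor the charge is kept.
    await db.query("ROLLBACK");
    throw err;
  }
}
```

The upsert assumes a unique constraint on `(tenant_id, day)`. Because both writes commit or roll back together, a stored result always has a matching charge and a failed call leaves none.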

Spend caps + circuit breaker

Each tenant gets a hard cap. Cross it and the breaker opens: the next model call returns HTTP 402 instead of adding to a surprise invoice, and the breaker resets when the usage window rolls over.
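The breaker logic can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the kit's implementation: the `TenantBreaker` class, the `spend_cap_exceeded` error code, and the `resetsAt` field are hypothetical names, the window is fixed to one UTC day, and the breaker opens once usage reaches the cap (`>=` is a choice made for this sketch).

```typescript
const DAY_MS = 86_400_000; // one UTC day in milliseconds

type Decision =
  | { status: 200 }
  | { status: 402; error: "spend_cap_exceeded"; resetsAt: string };

class TenantBreaker {
  private readonly dailyTokens: number;
  private usage = new Map<string, { day: number; tokens: number }>();

  constructor(dailyTokens: number) {
    this.dailyTokens = dailyTokens;
  }

  // Return the tenant's counter for the current UTC day, starting a
  // fresh one when the day has rolled over (this is the window reset).
  private current(tenantId: string, nowMs: number) {
    const day = Math.floor(nowMs / DAY_MS);
    const entry = this.usage.get(tenantId);
    if (!entry || entry.day !== day) {
      const fresh = { day, tokens: 0 };
      this.usage.set(tenantId, fresh);
      return fresh;
    }
    return entry;
  }

  // Add tokens consumed by a completed call.
  record(tenantId: string, tokens: number, nowMs: number): void {
    this.current(tenantId, nowMs).tokens += tokens;
  }

  // Call before each model request: 200 means proceed, 402 means blocked.
  check(tenantId: string, nowMs: number): Decision {
    const entry = this.current(tenantId, nowMs);
    if (entry.tokens < this.dailyTokens) return { status: 200 };
    const resetsAt = new Date((entry.day + 1) * DAY_MS).toISOString();
    return { status: 402, error: "spend_cap_exceeded", resetsAt };
  }
}
```

Passing the clock in as `nowMs` keeps the sketch deterministic to test. A production version would read the counter from the same store the metering writes to rather than from process memory.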

Eval harness in CI

Prompts run against a golden set on every pull request. A score drop past tolerance fails the check — the regression never reaches a customer.
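The gate decision itself is a small comparison. The sketch below is an assumption about its shape, not the kit's runner: `CaseScore` and `evalGate` are hypothetical names, and a case fails only when its drop is strictly greater than the tolerance. The drop is rounded to four decimals before comparing because `0.93 - 0.91` evaluates to slightly more than `0.02` in floating point, which would otherwise fail a case that sits exactly on the tolerance.

```typescript
interface CaseScore {
  name: string;
  score: number; // score on this pull request
  prev: number;  // score recorded for the last accepted run
}

interface GateResult {
  passed: boolean;
  regressions: { name: string; drop: number }[];
}

// Fail the gate when any case drops by more than `tolerance`.
function evalGate(cases: CaseScore[], tolerance: number): GateResult {
  const regressions = cases
    // Round to 4 decimals so float noise cannot tip a borderline case.
    .map((c) => ({ name: c.name, drop: Math.round((c.prev - c.score) * 1e4) / 1e4 }))
    .filter((c) => c.drop > tolerance);
  return { passed: regressions.length === 0, regressions };
}

const result = evalGate(
  [
    { name: "helpfulness", score: 0.91, prev: 0.93 },
    { name: "accuracy", score: 0.72, prev: 0.91 },
    { name: "refusal_rate", score: 0.98, prev: 0.97 },
  ],
  0.02,
);
console.log(result.passed, result.regressions);
```

With the illustrative scores shown earlier on this page, only `accuracy` is reported as a regression. A CI step would exit non-zero when `passed` is false.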

Guardrails

Input and output cross a Zod-typed schema and a policy check on both sides of the model. Out-of-policy responses are rejected at the boundary, not forwarded.
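As a rough illustration of checking both sides of the model: the sketch below swaps the Zod schema for a hand-written parser so it stays dependency-free, and it uses a synchronous stand-in for the model call. `ChatInput`, `parseChatInput`, `noEmailAddresses`, `guarded`, and the 4000-character limit are all hypothetical, chosen only to show the boundary.

```typescript
interface ChatInput {
  tenantId: string;
  message: string;
}

// Input side: turn untrusted data into a typed value or throw.
function parseChatInput(raw: unknown): ChatInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("input must be an object");
  }
  const { tenantId, message } = raw as Record<string, unknown>;
  if (typeof tenantId !== "string" || tenantId.length === 0) {
    throw new Error("tenantId must be a non-empty string");
  }
  if (typeof message !== "string" || message.length === 0 || message.length > 4000) {
    throw new Error("message must be 1 to 4000 characters");
  }
  return { tenantId, message };
}

// Output side: a policy returns a violation reason, or null when clean.
type Policy = (output: string) => string | null;

const noEmailAddresses: Policy = (output) =>
  /[\w.+-]+@[\w-]+\.[\w.-]+/.test(output) ? "contains an email address" : null;

// Wrap a model function so nothing enters or leaves unchecked.
function guarded(model: (input: ChatInput) => string, policies: Policy[]) {
  return (raw: unknown): string => {
    const input = parseChatInput(raw);
    const output = model(input);
    for (const policy of policies) {
      const violation = policy(output);
      if (violation !== null) throw new Error(`output rejected: ${violation}`);
    }
    return output;
  };
}
```

The caller only ever sees a validated response or an error, which is the "rejected at the boundary, not forwarded" behavior described above.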

Prompt registry

Every prompt is versioned and addressable by id. A call references prompt@v7, not an inline string — diff it, roll it back, audit what the model was asked.
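One way to picture the `id@vN` addressing is an append-only list of templates per prompt id. This is an in-memory sketch under that assumption; `PromptRegistry`, `publish`, and `resolve` are hypothetical names, and the kit's storage and API may look different.

```typescript
class PromptRegistry {
  // id -> templates in publish order; index 0 holds v1.
  private versions = new Map<string, string[]>();

  // Append a new version and return its reference, e.g. "summarize@v2".
  publish(id: string, template: string): string {
    const list = this.versions.get(id) ?? [];
    list.push(template);
    this.versions.set(id, list);
    return `${id}@v${list.length}`;
  }

  // Look up the exact template a reference points at.
  resolve(ref: string): string {
    if (!/^[\w-]+@v[1-9]\d*$/.test(ref)) {
      throw new Error(`malformed prompt reference: ${ref}`);
    }
    const at = ref.lastIndexOf("@v");
    const id = ref.slice(0, at);
    const version = Number(ref.slice(at + 2));
    const template = (this.versions.get(id) ?? [])[version - 1];
    if (template === undefined) throw new Error(`unknown prompt: ${ref}`);
    return template;
  }
}

const registry = new PromptRegistry();
registry.publish("summarize", "Summarize the text.");
registry.publish("summarize", "Summarize the text in three bullet points.");
console.log(registry.resolve("summarize@v1")); // the original wording is still addressable
```

Because versions are never overwritten, rolling back is a matter of pointing the call at an earlier reference, and an audit can recover the exact text behind any reference.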

Typed agent setup

Typed agent and tool definitions with a per-tool allowlist. An agent calls only the tools its manifest declares: no implicit access, no surprise side effects.
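A per-tool allowlist can be sketched as a check that runs before any tool lookup. The `AgentManifest` shape, `createAgent`, and the example tool names below are assumptions for illustration, not the kit's definitions.

```typescript
type Tool = (args: Record<string, unknown>) => unknown;

interface AgentManifest {
  name: string;
  tools: readonly string[]; // the only tools this agent may call
}

// Build an agent whose tool calls are checked against its manifest.
function createAgent(manifest: AgentManifest, available: Map<string, Tool>) {
  const allowed = new Set(manifest.tools);
  return {
    call(toolName: string, args: Record<string, unknown>): unknown {
      // Deny first: a tool that exists but is not declared is still off-limits.
      if (!allowed.has(toolName)) {
        throw new Error(`${manifest.name} is not allowed to call ${toolName}`);
      }
      const tool = available.get(toolName);
      if (!tool) throw new Error(`tool not registered: ${toolName}`);
      return tool(args);
    },
  };
}

const tools = new Map<string, Tool>([
  ["search_docs", (args) => `results for ${String(args.query)}`],
  ["delete_account", () => "deleted"],
]);

const supportAgent = createAgent({ name: "support", tools: ["search_docs"] }, tools);
console.log(supportAgent.call("search_docs", { query: "refunds" }));
```

Here `delete_account` is registered in the process but absent from the manifest, so a call to it from `supportAgent` throws instead of running.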

Rigor, not theater

Every claim here is a control you can point at.

The caps, the breaker, and the eval gate are configuration checked into your repo and enforced at call time — not a dashboard you hope someone is watching.

caisson.ai.toml · enforced at call time

# caisson.ai.toml — checked into your repo, enforced at call time

[caps.default]
daily_tokens = 1_000_000
on_exceed    = "break"   # open the circuit, return 402

[evals]
gate      = "ci"   # block the PR on regression
tolerance = 0.02   # max score drop before the check fails

[guardrails]
input_schema  = "schemas/chat-input.json"
output_policy = "policies/content-policy.ts"

How it's sold

Own the code, or subscribe.

One-time license

from $599

Own the AI Production Kit source outright — the six controls, wired and tested, plus all future patch releases.

Per-module

Take metering, caps, or the eval harness on its own, from $49 per module.

Developer plan

Subscription — credits, framework updates, and private-registry pulls. Keeps the kit current as model APIs shift.

See the full lineup.

Common questions

What does token metering actually prevent?

A runaway loop, a misconfigured agent, or a single burst of traffic can multiply your API invoice by 10× before you see it. Caisson writes usage in the same Postgres transaction as the result — an atomic increment — so concurrent calls can never double-count or drop a charge. Crossing the cap opens the circuit breaker and returns HTTP 402 before the next model call fires.

What happens when a tenant hits their spend cap?

The breaker opens. The next model call returns HTTP 402 with a structured error body — the same as any other payment-required response in your API. The window resets on the configured interval (UTC midnight by default). No partial responses, no silent overages, no surprise invoice.

How does the eval gate work in CI?

You commit a golden set of prompt → expected-output pairs alongside your prompt definitions. On every pull request, the eval runner scores the current prompts against the golden set. A score drop past the configured tolerance fails the check — the regression never merges. The gate is a GitHub Actions step; it reads from the prompt registry and writes results to a structured report.

Is Caisson an AI platform or a library?

A library — a codebase you own. It ships as typed TypeScript packages you install and configure in your own repository. There is no hosted control plane, no SDK that phones home, no vendor lock-in beyond the Postgres database you already run.

Get started

Ship the feature with the brakes on.

Scaffold a new project with the AI Production Kit included, or go straight to pricing to add it to an existing Caisson base.

terminal · ready

$ npx create-caisson@latest