Reference architecture

Four topologies, one binary, identical policy.

Chainsaw is an install-path firewall between your developers, CI, and Kubernetes clusters and the sixteen package registries they pull from. The same Rego that fires in your pull request fires byte-identical at install, K8s admission, and runtime.

Component map

What runs, what it owns, what breaks if it dies

Chainsaw ships as one binary. The components below are processes and dependencies that binary either is or relies on. Every Chainsaw deployment runs the same set; topology decides where each one sits.

  • chainsaw-proxy

    Responsibility. Single Go binary. Terminates the registry-bound TLS connection from the dev / CI / K8s client, evaluates the Rego bundle against the requested artifact, and either forwards to the upstream registry, serves from cache, or refuses with a structured error. Workers are leader-elected via Postgres advisory locks; every replica can serve traffic, but only one runs each background job at a time.

    Failure mode. If every proxy replica is unreachable, clients fail according to CHAINSAW_FAIL_MODE (open or closed, per workspace). The proxy is stateless on the request path — restart cost is the cache warm-up only.

  • Billy — approval queue

    Responsibility. Out-of-band approval surface for exception requests. A developer who hits a refusal at install time follows the link to Billy and justifies the package; an owner approves or denies inside the SLA window. Approvals write a signed exception row that the proxy reads on its next bundle refresh.

    Failure mode. If Billy is unreachable, the exception API degrades to read-only on the proxy side. Pending requests queue locally; nothing fails open silently. mode: warn shifts to advisory; mode: block keeps refusing.

  • Postgres

    Responsibility. System of record. Holds signed audit rows, exception state, policy versions, RBAC, SCIM-synced groups, and the admission_decisions_shadow ledger. Logical replication supported; partitioned on event_time for the audit table.

    Failure mode. Streaming replica + automated failover. Proxy holds a small write-behind buffer for audit rows during primary cutover; on extended outage, mode: block proxies refuse rather than emit unrecorded decisions.

  • Blob store

    Responsibility. Holds SBOMs, scan artefacts, and signed intel bundle payloads. S3-compatible interface (S3, GCS, MinIO, R2). Reaper worker garbage-collects orphan blobs against the Postgres manifest.

    Failure mode. Reads cache locally on the proxy for the hot bundle. A blob-store outage degrades SBOM browsing and historical exports; the enforcement path keeps running off the cached bundle.

  • NATS — policy-bus

    Responsibility. Pub/sub fabric for policy and intel-bundle change notifications across replicas and across federation spokes. Subjects are namespaced per workspace; messages carry only the bundle digest, not the bundle itself.

    Failure mode. Replicas fall back to a 30s Postgres poll for bundle version. NATS clustering (3+ nodes) tolerates a single-node loss. Policy correctness is unaffected; freshness lag widens.

  • K8s admission webhook

    Responsibility. ValidatingAdmissionWebhook served by the same binary. Evaluates the same Rego bundle against pod / image specs, writes the verdict to admission_decisions_shadow, and either admits, warns, or rejects per workspace mode.

    Failure mode. failurePolicy: Ignore for warn mode, Fail for block mode. Webhook is horizontally scaled behind the cluster's service; loss of all replicas surfaces as an admission timeout the cluster handles per its own policy.

  • Intel-bundle store

    Responsibility. Signed, Sigstore-verified OPA bundle containing Rego policies plus the 25 supply-chain signals layered beyond CVE. Distributed as chainsaw-intel-bundle-YYYY-MM-DD.tar.gz, hot-swappable at runtime without restart.

    Failure mode. Verification is enforced — an unsigned or signature-invalid bundle is rejected and the previous bundle stays loaded. In air-gapped mode, sideloaded bundles use the same verification chain.
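The keep-the-previous-bundle behaviour described for the intel-bundle store can be sketched roughly as below. This is an illustration under stated assumptions, not Chainsaw's implementation: `BundleHolder` and the injected `verify` callable are hypothetical names, and `verify` stands in for the real Sigstore signature check.

```python
import hashlib
import threading
from typing import Callable, Optional


class BundleHolder:
    """Holds the live bundle and swaps it only when verification passes (hypothetical)."""

    def __init__(self, verify: Callable[[bytes], bool]):
        self._verify = verify            # stand-in for the Sigstore signature check
        self._lock = threading.Lock()
        self._bundle: Optional[bytes] = None

    def apply(self, candidate: bytes) -> bool:
        """Try to hot-swap. On verification failure the previous bundle stays loaded."""
        if not self._verify(candidate):
            return False
        with self._lock:
            self._bundle = candidate
        return True

    def digest(self) -> Optional[str]:
        """SHA-256 digest of the live bundle, or None if nothing is loaded yet."""
        with self._lock:
            if self._bundle is None:
                return None
            return hashlib.sha256(self._bundle).hexdigest()
```

The point of the sketch is the ordering: verification runs before the swap, so a rejected candidate never replaces the bundle that is already serving verdicts.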

Topologies

SaaS, VPC-peered, on-prem hub-and-spoke, air-gapped

The deployment model is a placement decision, not a product decision. Same binary, same Rego bundle, same audit-row schema across all four. Pick the posture your network and compliance teams can sign; migrate later without re-papering the contract.

  • SaaS — Chainsaw-hosted

    [Diagram: Dev / CI (developer + CI runners); Chainsaw SaaS (chainsaw-proxy, Postgres, blob store, intel bundle); upstream registries (npm / PyPI / Docker)]
    Figure 1 — SaaS. Chainsaw operates the control plane. Customer keeps identity-provider integration.

    Control plane and data plane both run in Chainsaw's environment. Customer dev tools point at a per-workspace hostname; TLS terminates at the proxy. Suited to teams with no residency or air-gap constraint who want the shortest deploy path.

  • VPC-peered — customer cloud, Chainsaw-managed control plane

    [Diagram: customer VPC (Dev / CI, chainsaw-proxy, Postgres, blob store); Chainsaw control (managed updates); upstream registries (npm / PyPI / Docker)]
    Figure 2 — VPC-peered. Data plane in customer VPC. Control plane operates as managed service (dashed: out-of-band).

    Data plane (proxy, Postgres, blob store) runs in the customer's cloud account. Control plane (policy authoring, signed bundle distribution, support tooling) runs on Chainsaw infrastructure. Connection is outbound-only from customer to Chainsaw for bundle pull and telemetry; no inbound vendor connection. Same Rego, same audit row schema as SaaS.

  • On-prem hub-and-spoke — customer owns everything

    [Diagram: hub (global policy floor, audit aggregator, Billy queue); spokes BU-A, BU-B, BU-C, BU-D (each a proxy + Rego overlay). Solid lines: signed policy bundle, hub down to spokes. Dashed lines: signed audit stream, spokes up to hub.]
    Figure 3 — On-prem hub-and-spoke. Hub: global policy floor + Billy approval queue. Spokes: BU-scoped Rego overlay + local exceptions.

    One control hub (Postgres + bundle store) runs inside the customer's network. Each business unit runs its own proxy spoke, which pulls signed bundles from the hub when the policy-bus announces a new digest. Common in multi-BU shops where central security publishes policy and each BU enforces locally. Hub failure does not stop enforcement at the spokes: they keep serving from their last-known-good bundle.

  • Air-gapped — signed intel bundle sideload, zero outbound

    [Diagram: customer perimeter, no outbound (Dev / CI, chainsaw-proxy, local cache, Postgres, sideloaded intel bundle); signed bundle tarball arrives by manual transfer]
    Figure 4 — Air-gapped. Zero outbound. Signed intel bundle sideloaded via chainsaw bundle apply.

    No outbound network from the proxy. Operator transfers chainsaw-intel-bundle-YYYY-MM-DD.tar.gz across the boundary on the cadence the diode allows, runs chainsaw bundle verify, then chainsaw bundle apply for hot-swap. CHAINSAW_OFFLINE=1 disables every phone-home path; CHAINSAW_OFFLINE_FAIL_MODE selects behaviour when the bundle is older than the freshness threshold.

Dataflow

Registry → proxy → cache → client

Figure 5: dataflow — see architecture.diagrams.md

A client (npm, pip, mvn, gradle, cargo, go, docker, kubelet pulling an image, etc.) resolves the registry hostname to the proxy. The proxy authenticates the request, identifies the workspace, and consults its in-memory cache keyed on (registry, package, version, bundle-digest).

Cache hit. Verdict returns from memory. The proxy serves the cached artefact bytes (or refusal) and writes a signed audit row to Postgres. No upstream call.

Cache miss. The proxy fetches from the upstream registry, runs the 25 supply-chain signals beyond CVE, evaluates the Rego bundle, then either streams bytes to the client or returns a structured refusal. Signed audit row written on the way out. For Docker / OCI artefacts, Trivy runs inline against the layer set before the verdict commits.
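The hit and miss paths above can be illustrated with a small cache keyed on (registry, package, version, bundle-digest). The key shape comes from the text; `VerdictCache`, `Verdict`, and the injected `evaluate` callable (standing in for upstream fetch, signal checks, and Rego evaluation) are hypothetical names for this sketch only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# (registry, package, version, bundle_digest)
CacheKey = Tuple[str, str, str, str]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


class VerdictCache:
    """In-memory verdict cache (hypothetical sketch, no eviction)."""

    def __init__(self, evaluate: Callable[[str, str, str], Verdict]):
        self._evaluate = evaluate        # stand-in for fetch + signals + Rego eval
        self._entries: Dict[CacheKey, Verdict] = {}
        self.misses = 0

    def lookup(self, registry: str, package: str, version: str,
               bundle_digest: str) -> Verdict:
        key = (registry, package, version, bundle_digest)
        verdict = self._entries.get(key)
        if verdict is None:              # cache miss: evaluate once, then remember
            self.misses += 1
            verdict = self._evaluate(registry, package, version)
            self._entries[key] = verdict
        return verdict
```

One consequence of putting the bundle digest in the key, as this sketch reads it: a new bundle produces new keys, so verdicts computed under the old policy are never served for the new one.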

K8s admission. The webhook receives the AdmissionReview, evaluates the same bundle against the pod spec and its image references, writes the verdict to admission_decisions_shadow, and returns admit / warn / deny per workspace mode.
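As a rough sketch of the admit / warn / deny outcomes, the function below builds an `admission.k8s.io/v1` AdmissionReview response. The response fields (`uid`, `allowed`, `status`, `warnings`) are the standard Kubernetes ones; how Chainsaw maps workspace mode onto them is an assumption here, and `admission_response` is a hypothetical name.

```python
from typing import Any, Dict


def admission_response(review: Dict[str, Any], violation: str, mode: str) -> Dict[str, Any]:
    """Build an AdmissionReview response for one verdict (hypothetical mode mapping)."""
    if mode not in ("warn", "block"):
        raise ValueError(f"unknown workspace mode: {mode!r}")

    # The response must echo the uid of the request it answers.
    response: Dict[str, Any] = {"uid": review["request"]["uid"], "allowed": True}

    if violation:
        if mode == "block":
            response["allowed"] = False
            response["status"] = {"code": 403, "message": violation}
        else:
            # warn mode: admit, but surface the finding to the API client
            response["warnings"] = [violation]

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

An empty `violation` string means the policy found nothing to report, so the pod is admitted in either mode.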

Identity flow

Okta / Entra → SCIM → Rego input

Figure 6: identity flow — see architecture.diagrams.md

SCIM provisions users and groups from Okta or Entra into Postgres on the standard push schedule. Group membership flows into the Rego input as input.actor.groups, so policies can predicate on the directory without re-querying the IdP at evaluation time.
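To make the `input.actor.groups` path concrete, here is one way the input document might be assembled from already-synced group data. Only `actor.groups` is taken from the text; every other field name (`actor.id`, `artifact.*`) and the `build_rego_input` helper are assumptions made for illustration.

```python
import json
from typing import Dict, List


def build_rego_input(user: str, groups_by_user: Dict[str, List[str]],
                     registry: str, package: str, version: str) -> dict:
    """Assemble a policy input document from locally stored directory data."""
    return {
        "actor": {
            "id": user,                                      # hypothetical field
            "groups": sorted(groups_by_user.get(user, [])),  # read as input.actor.groups
        },
        "artifact": {                                        # hypothetical field names
            "registry": registry,
            "package": package,
            "version": version,
        },
    }


# Groups come from the SCIM-synced store, not from a live IdP call.
synced = {"dana@example.com": ["platform-eng", "release-approvers"]}
doc = build_rego_input("dana@example.com", synced, "npm", "left-pad", "1.3.0")
print(json.dumps(doc, indent=2))
```

Because the groups are read from the local store, evaluating a policy that checks group membership needs no round trip to Okta or Entra.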

Browser sessions use HTTP-only, SameSite=Lax cookies bound to the workspace. Programmatic clients (CI runners, CLI) authenticate with bearer tokens via the Authorization header. Token secrets are bcrypt-hashed at rest and fronted by an in-process LRU+TTL cache so a hot-path token validation does not hit Postgres on every request.
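The LRU+TTL cache in front of token validation could look something like the sketch below. It is not Chainsaw's code: `TokenCache` is a hypothetical name, `slow_verify` stands in for the bcrypt comparison plus Postgres lookup, and keying the cache on a SHA-256 of the token rather than the raw token is a design choice made for this example.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class TokenCache:
    """LRU + TTL cache in front of a slow token check (hypothetical sketch)."""

    def __init__(self, slow_verify: Callable[[str], Optional[str]],
                 max_entries: int = 1024, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._slow_verify = slow_verify  # returns a workspace id, or None if invalid
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[str, Tuple[str, float]]" = OrderedDict()

    def validate(self, token: str) -> Optional[str]:
        key = hashlib.sha256(token.encode("utf-8")).hexdigest()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None:
            workspace, expires_at = hit
            if now < expires_at:
                self._entries.move_to_end(key)   # mark as most recently used
                return workspace
            del self._entries[key]               # expired: fall through to slow path
        workspace = self._slow_verify(token)
        if workspace is not None:                # only successful checks are cached
            self._entries[key] = (workspace, now + self._ttl)
            if len(self._entries) > self._max:
                self._entries.popitem(last=False)  # evict least recently used
        return workspace
```

The TTL bounds how long a revoked token could keep validating from cache, so in a design like this it would be tuned against the revocation latency the workspace can tolerate.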

Latency budget

Targets on the hot path

Targets, not measurements. Real numbers vary by upstream registry health, signal set enabled, and replica placement. Production deployments publish observed p50 / p95 to the customer's Prometheus.

| Path | Target | Note |
| --- | --- | --- |
| Cache hit, policy unchanged | p50 target < 8 ms | In-process bundle eval + memory-cached upstream response. No network egress on the hot path. |
| Cache miss, upstream fetch + 25 signals + Rego eval | p95 target < 450 ms | Dominated by upstream registry latency. Signal evaluation runs in parallel with the fetch where possible. |
| K8s admission decision | p99 target < 250 ms | Webhook returns within the cluster's default 10s timeout with substantial headroom; ledger write is async. |

HA and failure modes

What happens when each component falls over

Proxy. Single binary, multiple replicas behind a TCP / HTTPS load balancer. Stateless on the request path; cache rebuilds on warm-up. Background workers (orphan-blob reaper, SBOM snapshotter, exception-expiry reminder, feedback tuner, ownership SLA timer, bypass-history snapshotter, datacleanup retention, policy-bus subscriber, deploy-correlation) are leader-elected via Postgres advisory locks; only the leader runs each one, and leader loss triggers re-election within seconds.
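Leader election over Postgres advisory locks can be sketched as follows. `pg_try_advisory_lock(bigint)` is a real Postgres function that returns immediately with true or false; everything else is assumed for illustration, including the job names, the key derivation, and the injected `run_query` callable, which is expected to run the statement on one long-lived session and return the scalar result.

```python
import hashlib
import struct
from typing import Callable


def advisory_lock_key(job_name: str) -> int:
    """Derive a stable signed 64-bit key, since the lock function takes a bigint."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


def try_become_leader(run_query: Callable[[str, tuple], bool], job_name: str) -> bool:
    """Return True if this session now holds the job's lock. Never blocks."""
    sql = "SELECT pg_try_advisory_lock(%s)"
    return bool(run_query(sql, (advisory_lock_key(job_name),)))
```

Session-level advisory locks are released when the holding session ends, so in a scheme like this a crashed leader frees its lock with its connection and another replica wins the next attempt. That is consistent with the re-election behaviour described above, though the actual mechanism is not specified here.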

Postgres. Primary + streaming replica with automated failover. During cutover, the proxy buffers audit rows in a bounded write-behind queue. If the queue saturates and workspace mode is block, the proxy refuses rather than emit unrecorded decisions. In warn mode it logs and continues.
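The bounded write-behind queue and its saturation behaviour can be illustrated with a few lines. `AuditBuffer`, its capacity, and the `dropped` counter (standing in for the warn-mode log line) are hypothetical; the block-refuses, warn-continues split is taken from the paragraph above.

```python
from collections import deque
from typing import Deque, Dict, List


class AuditBuffer:
    """Bounded buffer for audit rows during a Postgres cutover (hypothetical sketch)."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._rows: Deque[Dict] = deque()
        self.dropped = 0                 # stand-in for "logs and continues"

    def record(self, row: Dict, mode: str) -> bool:
        """Buffer one audit row. Returns True if the request may proceed."""
        if mode not in ("warn", "block"):
            raise ValueError(f"unknown workspace mode: {mode!r}")
        if len(self._rows) < self._capacity:
            self._rows.append(row)
            return True
        if mode == "block":
            return False                 # refuse rather than emit an unrecorded decision
        self.dropped += 1                # warn mode: note the gap and keep serving
        return True

    def drain(self) -> List[Dict]:
        """Hand buffered rows to the writer once the primary is back."""
        rows = list(self._rows)
        self._rows.clear()
        return rows
```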

NATS. 3-node cluster minimum for production. Single-node loss is transparent; full-cluster loss falls back to a 30-second Postgres poll for bundle version, which widens freshness lag but does not change verdict correctness.
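The fallback from bus notifications to a 30-second poll might be structured like this. The interval is from the text; `BundleVersionWatcher`, the `bus_connected` flag, and the injected `poll_postgres` callable are hypothetical, and a real subscriber would detect disconnection itself rather than have the flag set by hand.

```python
import time
from typing import Callable, Optional


class BundleVersionWatcher:
    """Tracks the live bundle digest; polls only while the bus is down (hypothetical)."""

    POLL_INTERVAL_SECONDS = 30.0         # the fallback poll interval from the text

    def __init__(self, poll_postgres: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic):
        self._poll = poll_postgres
        self._clock = clock
        self._last_poll: Optional[float] = None
        self.bus_connected = True
        self.digest: Optional[str] = None

    def on_bus_message(self, digest: str) -> None:
        """Bus messages carry only the digest, so this is just a pointer update."""
        self.digest = digest

    def tick(self) -> None:
        """Call periodically. Polls Postgres at most once per interval when the bus is down."""
        if self.bus_connected:
            return
        now = self._clock()
        if self._last_poll is None or now - self._last_poll >= self.POLL_INTERVAL_SECONDS:
            self._last_poll = now
            self.digest = self._poll()
```

Either path ends with the same digest, which is why the fallback affects freshness rather than correctness: the worst case is learning about a new bundle up to one poll interval late.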

Billy. Approval surface decoupled from enforcement. If Billy is unreachable, the proxy keeps refusing on the policy that is loaded; pending requests do not auto-approve. Workspaces in mode: warn degrade to advisory; workspaces in mode: block remain fail-closed.

Telemetry surfaces

What you can scrape, trace, and ship to your SIEM

  • OpenTelemetry (OTLP)

    Traces and metrics export over OTLP/gRPC or OTLP/HTTP. Resource attributes tag workspace, replica, and bundle digest.

  • Prometheus scrape endpoint

    /metrics on the admin port. Histograms for request latency by verdict, counters for cache hit ratio, gauges for bundle age and policy-bus lag.

  • Golden-signal alerts (shipped)

    5xx rate > 1%, p99 latency > 1s, error-budget burn rate, bundle-age over threshold, NATS subscriber lag, leader-election flap.

  • Audit-row schema

    Stable JSON schema for every decision. Documented for Splunk, Sentinel, and QRadar feeds; replayable from Postgres or the WORM blob copy.

Air-gapped operational lifecycle

Sideload, verify, apply, audit

Bundle. Chainsaw publishes chainsaw-intel-bundle-YYYY-MM-DD.tar.gz signed via Sigstore. The bundle contains the Rego policies, the 25 supply-chain signal datasets, and a manifest. Operator transfers the tarball across the boundary on the cadence the diode allows.

Verify. chainsaw bundle verify ./chainsaw-intel-bundle-2026-05-25.tar.gz checks the Sigstore signature against the pinned trust root before the bundle is permitted to load. Verification failure is terminal — the previous bundle stays in place.

Apply. chainsaw bundle apply hot-swaps the live bundle without a restart and emits a policy.bundle.applied audit row carrying the new digest.

Doctor. chainsaw doctor --offline prints the per-provider matrix: which of the sixteen registries the current bundle can adjudicate, freshness per dataset, and Sigstore trust-root expiry.

Fail mode. CHAINSAW_OFFLINE_FAIL_MODE=open|closed selects behaviour when the loaded bundle exceeds the configured freshness threshold. closed refuses; open warns and continues.
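The freshness check behind CHAINSAW_OFFLINE_FAIL_MODE reduces to a small decision function. The open and closed semantics follow the text; the function name, the `max_age_days` parameter, measuring age in whole days, and the three return labels are assumptions for this sketch.

```python
from datetime import date


def offline_bundle_decision(bundle_date: date, today: date,
                            max_age_days: int, fail_mode: str) -> str:
    """Decide how to treat requests given the loaded bundle's age (hypothetical)."""
    if fail_mode not in ("open", "closed"):
        raise ValueError(f"unknown fail mode: {fail_mode!r}")
    age_days = (today - bundle_date).days
    if age_days <= max_age_days:
        return "evaluate"                # bundle is fresh enough: normal enforcement
    # Bundle exceeds the freshness threshold.
    return "refuse" if fail_mode == "closed" else "warn"


# A bundle dated 2026-05-25, checked 20 days later against a 14-day threshold.
print(offline_bundle_decision(date(2026, 5, 25), date(2026, 6, 14), 14, "closed"))
```

With these inputs the bundle is 20 days old against a 14-day threshold, so closed mode refuses; the same inputs under open mode would warn and continue.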

Security baseline

Defaults you don't have to negotiate

  • Container identity

    Runs as uid 10001, non-root, read-only root filesystem. No NET_ADMIN, no SYS_ADMIN. Distroless base.

  • Signed binary

    Release artefacts signed via Sigstore with a SLSA-aligned provenance attestation. Verifiable before install.

  • Signed policy bundle

    OPA bundle Sigstore-verified at load time. Verification is enforced by default; disabling it requires an explicit, audited workspace flag.

  • HTTP security headers

    CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy set on every admin response. Cookies HTTP-only and SameSite=Lax.

Scope boundary

What Chainsaw is not

Chainsaw refuses on the install path. It is not a replacement for the categories below — it composes with them.

Architect review

Walk an engineer through your topology

30 minutes with a Chainsaw engineer. Bring your network diagram; leave with a placement plan for SaaS, VPC, on-prem hub-and-spoke, or air-gapped.