🧠 AI Gateway

P0 · Foundation · Planned (P0 design phase) · Owner: CTO (vacant) → interim CEO

One LLM door for the whole platform — router, redactor, cacher, accountant. Every model call goes through here.

AI Gateway is the single integration point for every Large Language Model call CyberOS makes. Built on LiteLLM at its router core, it speaks OpenAI, Anthropic, AWS Bedrock, and Vertex AI behind one interface; fails over within 30 seconds when a primary provider degrades; redacts PII before bytes leave the cluster; stamps every call with a persona version so the audit trail captures which agent identity made the request; caches deterministic responses for replay-stable testing; and tracks per-tenant token cost down to the invoice line item. Two invariants matter: no client SDK is allowed in any other module (they all call AI Gateway over gRPC), and no cache may cross tenant boundaries ((NFR pending), hard rule). PRD §8.5 specifies the latency budget per route class; §9.7 specifies the FRs. This page documents the planned implementation at cyberos/services/ai-gateway/.

AI Gateway is the policy-enforcement and routing layer for every model call inside CyberOS. From the outside it is one gRPC service speaking chat.complete, embed, rerank, and image.generate. Inside, a LiteLLM-derived router consults per-tenant policy, applies persona-version system prompts, runs Presidio + custom-VN PII redaction, looks up the prompt-cache, picks a primary provider, retries twice with backoff, fails over to the secondary, accounts for tokens against the tenant's monthly cap, streams the response back over SSE, and emits one ai.invocation audit row per call. Zero retention — no provider sees cross-tenant prompts; no cache row crosses a tenant boundary; the only thing that leaves the cluster is the prompt and the response, and even those are PII-scrubbed.

Status: Planned · P0 · design phase · M+1
Est. LoC: ~6,500 · Python 3.13 (LiteLLM-derived) + Rust edge proxy
Providers (P0): Bedrock · Anthropic · OpenAI + self-hosted BGE embedder
Failover SLA: ≤ 30 s · (FR pending) · primary down → secondary live
PII redaction recall: ≥ 99% · (FR pending) · VN + EN test set
Cache hit rate: ≥ 30% (P0) · ≥ 60% (P2+) · (FR pending)
Depends on: AUTH · BRAIN · OBS (tenant resolution + audit + traces)
Used by: CUO · Skill · KB · CHAT · … (every module that calls an LLM)

1 · Why AI Gateway exists

Letting every module embed its own LLM SDK creates three problems at once. (a) Cost-tracking turns into per-module spreadsheets that never agree. (b) PII redaction becomes a decentralized policy that drifts. (c) Provider failover requires per-module change windows whenever Anthropic / OpenAI degrades. The AI Gateway pattern is the standard answer: pay the cost of one integration once, let every other module call a single typed RPC, and centralise the policy.

🎯
One door, many providers

LiteLLM router speaks OpenAI / Anthropic / Bedrock / Vertex behind a single API. Switching providers is a config change, not a code change.

🛡
PII never leaves un-scrubbed

Presidio + custom Vietnamese rules redact CCCD, MST, bank accounts, addresses before the request hits a provider — recall ≥ 99% measured.

💸
Cost is a property of the platform

Every call lands one ai.invocation row: actor · model · tokens · USD · cache-state. Per-tenant cap, 80% warning, 100% hard stop.

The bet is the same bet AUTH and BRAIN make: pay the cost once at the substrate. Without AI Gateway, each module re-implements PII redaction (and at least one of them gets it wrong), each module embeds its own SDK (and the OpenAI Python client and the Anthropic Python client disagree on streaming chunks), and the regulator's "show me every prompt that touched personal data" question becomes a forensic project. With AI Gateway, that question is a SQL query.
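
What "a config change, not a code change" looks like in practice: a minimal sketch using LiteLLM's Router with two providers behind one alias. The model ids and the chat-default alias are illustrative, not the shipped configuration.

# Sketch only: two providers share the "chat-default" alias, so swapping or
# reordering providers is an edit to model_list, never a caller change.
from litellm import Router

router = Router(
    model_list=[
        {   # primary: Bedrock-hosted Sonnet (model id illustrative)
            "model_name": "chat-default",
            "litellm_params": {"model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0"},
        },
        {   # secondary: Anthropic API; same alias, so failover is invisible to callers
            "model_name": "chat-default",
            "litellm_params": {"model": "anthropic/claude-3-5-sonnet-20240620"},
        },
    ],
    num_retries=2,  # matches the "2 retries with backoff" policy
)

resp = router.completion(
    model="chat-default",
    messages=[{"role": "user", "content": "ping"}],
)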

2 · What it does — 5W1H2C5M

PRD §8.5 + §9.7 + §11.2.1 give the full picture; this table is the working summary.

| Axis | Question | Answer |
|---|---|---|
| 5W · What | What is AI Gateway? | A gRPC service that wraps a LiteLLM-derived router. It selects providers, redacts PII, caches deterministic responses, accounts for tokens, and emits audit rows. Single binary today; horizontally scalable behind an L7 load balancer. |
| 5W · Who | Who calls it? | CUO router (routing decisions), Skill host (skill-invoked LLM steps), KB (semantic ingest + retrieval), CHAT (summarisation, smart-reply), Genie (interactive Q&A), Email composer, Project planner. Owner: CTO seat. |
| 5W · When | When does a call happen? | Synchronously per user request (chat completion); asynchronously for batch jobs (KB ingest, daily digest). Cache lookup happens first; only cache-miss requests hit a provider. |
| 5W · Where | Where does it run? | Fargate task in SG-1 (P0); read-only embedder/reranker GPU node (shared with KB). Multi-region active-active at P3+. |
| 5W · Why | Why a separate layer? | Because per-module SDK adoption creates cost untraceability, PII drift, and per-module failover change windows. One gateway eliminates all three. |
| 1H · How | How does it work? | Receive gRPC call → resolve tenant policy from BRAIN (cached) → inject persona-version system prompt → redact PII → look up cache → on miss, call primary provider with 2 retries → on continued failure within 30 s, fail over to secondary → stream response back via SSE → store redacted prompt + response in cache → emit ai.invocation audit row → return. (Order matches Flow 1 and the lifecycle diagram; see the sketch after this table.) |
| 2C · Cost | Cost? | P0 budget: ≤ $150 / month LLM (PRD (NFR pending)). 50-tenant budget: ≤ $4 / active user / month LLM. Cache hit rate is the dominant lever. |
| 2C · Constraints | Constraints? | (a) Zero cross-tenant cache. (b) PII recall ≥ 99% measured against a public VN+EN test set. (c) Provider must be on a ZDR (Zero-Data-Retention) attested endpoint for sensitive routes. (d) Per-tenant monthly USD cap hard-enforced. |
| 5M · Materials | Stack? | Python 3.13 · LiteLLM (vendored) · grpc-py · Presidio · regex-based VN PII rules · Redis (cache) · DuckDB (usage roll-up) · OpenTelemetry · self-hosted BGE-M3 embedder + BGE-rerank on a shared L4 GPU. |
| 5M · Methods | Method choices? | Streaming-first (SSE end-to-end). Circuit breaker per provider × model. Hash-keyed cache (SHA-256 of canonical prompt + model + parameters). Idempotency-Key header for replay safety. Per-route latency budgets (read ≤ 800 ms, write ≤ 2 s, ingest ≤ 5 s). |
| 5M · Machines | Deployment? | Fargate (CPU). One GPU pod for BGE-M3 + reranker (L4 24 GB, shared with KB ingest). |
| 5M · Manpower | Who maintains? | 0.5 FTE shared CTO + CDO at P0. CDO assumes primary at P1+. |
| 5M · Measurement | How measured? | (NFR pending) (AI request p95 ≤ 2 s), (FR pending) cache hit rate, (FR pending) PII recall, per-tenant cost dashboard, provider error-rate burndown. |
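
The How row above, as a runnable sketch. Every helper here is a stand-in for the real modules (tenant_policy.py, persona.py, redactor.py, cache.py, router.py); the point is the order of operations and the tenant-scoped cache key, not the implementation.

import hashlib
import json
import re

CACHE: dict[str, str] = {}  # stand-in for Redis

def redact(text: str) -> str:
    # stand-in for Presidio + VN rules: scrub a 12-digit CCCD
    return re.sub(r"\b\d{12}\b", "{{CCCD_0}}", text)

def cache_key(tenant_id: str, messages: list[dict], model: str) -> str:
    # SHA-256 over canonical(prompt) + model + tenant_id; tenant_id in the key
    # is what makes cross-tenant cache hits structurally impossible
    canonical = json.dumps({"messages": messages, "model": model, "tenant": tenant_id},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def chat_complete(tenant_id: str, user_text: str, model: str = "chat-default") -> str:
    messages = [
        {"role": "system", "content": "persona prompt vX.Y.Z"},  # persona injection
        {"role": "user", "content": redact(user_text)},          # PII scrubbed before egress
    ]
    key = cache_key(tenant_id, messages, model)
    if key in CACHE:
        return CACHE[key]   # hit: no provider call, $0 charged, audit row still written
    response = f"[provider response via {model}]"  # stand-in for router + retries + failover
    CACHE[key] = response
    return response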

3 · Architecture

Six internal stages sit between the gRPC ingress and the provider egress: tenant policy resolution, persona stamping, PII redaction, cache lookup, router selection, and accounting. The diagram below shows the full request path for a chat.complete call.

graph TB subgraph CLIENTS ["Callers"] CUO["🎯 CUO router"] SKILL["🛠 Skill host"] CHAT["💬 CHAT summarise"] KB["📚 KB ingest"] GENIE["✨ Genie Q&A"] end subgraph AI ["AI Gateway (gRPC + Rust edge proxy)"] ING["edge_proxy.rs
mTLS + auth"] TPR["tenant_policy.py
provider-pref · cap · ZDR-attestation"] PER["persona.py
inject system-prompt by persona_version"] RED["redactor.py
Presidio + VN rules · ≥ 99% recall"] CACHE["cache.py
Redis · key=SHA-256(canonical_prompt + model + params + tenant)"] ROUT["router.py (LiteLLM)
provider-pref · circuit-breaker · 2 retries"] STR["stream.py
SSE multiplexer"] ACC["accountant.py
tokens · USD · per-tenant cap"] EMB["embedder
BGE-M3 (self-hosted)"] RER["reranker
BGE-rerank-v2-m3 (self-hosted)"] end subgraph PROVIDERS ["External providers (ZDR-attested)"] BED["AWS Bedrock
Anthropic Sonnet · Haiku"] ANTH["Anthropic API
Sonnet · Opus"] OAI["OpenAI API
gpt-4o · o1"] VTX["Vertex AI (P1+)
Gemini · PaLM"] end subgraph STORES REDIS[("Redis 7
cache · TTL configurable")] DUCK[("DuckDB
usage roll-up · hourly")] end subgraph SINKS BRAIN["🧠 BRAIN
ai.invocation rows"] OBS["👁 OBS
traces + cost dashboard"] end CUO --> ING SKILL --> ING CHAT --> ING KB --> ING GENIE --> ING ING --> TPR TPR --> PER PER --> RED RED --> CACHE CACHE -->|miss| ROUT CACHE -->|hit| STR ROUT --> BED ROUT --> ANTH ROUT --> OAI ROUT -.P1+.-> VTX ROUT --> STR STR --> ACC ACC --> DUCK CACHE --> REDIS ROUT --> EMB ROUT --> RER ING --> BRAIN ING --> OBS ACC --> BRAIN classDef planned fill:#fef6e0,stroke:#9c750a classDef provider fill:#cba88a,stroke:#4338ca classDef store fill:#f5f3ff,stroke:#7c3aed classDef sink fill:#f5ede6,stroke:#45210e class ING,TPR,PER,RED,CACHE,ROUT,STR,ACC,EMB,RER planned class BED,ANTH,OAI,VTX provider class REDIS,DUCK store class BRAIN,OBS sink

Internal components

| Component | Path (planned) | Responsibility |
|---|---|---|
| edge_proxy.rs | services/ai-gateway/edge/ | Rust mTLS proxy. Verifies caller JWT (AUTH), unwraps tenant_id, forwards to Python core over Unix socket. |
| tenant_policy.py | services/ai-gateway/core/policy.py | Resolves per-tenant provider preference, cost cap, ZDR-attestation requirement. Caches from BRAIN reads. |
| persona.py | services/ai-gateway/core/persona.py | Injects the persona-version system prompt at the gateway ((FR pending)). Reads meta/persona///prompt.md from BRAIN. |
| redactor.py | services/ai-gateway/core/redactor.py | Presidio + custom VN rule pack. Detects CCCD, MST, bank account, address, phone, email, name. Replaces with token sentinels; un-redacts on response for trusted classes. |
| cache.py | services/ai-gateway/core/cache.py | Hash-keyed response cache. Key = SHA-256(canonical(prompt) + model + params + tenant_id). NEVER cross-tenant; tenant_id in the key is the load-bearing fact. |
| router.py | services/ai-gateway/core/router.py | LiteLLM-derived. Selects provider by tenant policy + route class. 2 retries with exponential backoff. Circuit breaker on per-provider error rate. Fails over to secondary within 30 s. |
| stream.py | services/ai-gateway/core/stream.py | SSE multiplexer. Forwards provider stream chunks as data: … events. Backpressure-aware. Handles cancellation mid-stream. |
| accountant.py | services/ai-gateway/core/accountant.py | Token + USD accounting per tenant. Emits hourly roll-up to DuckDB; per-tenant cap enforcement (80% warning, 100% hard stop). (FR pending). |
| circuit_breaker.py | services/ai-gateway/core/circuit_breaker.py | Per-provider × per-model breaker. Opens on error rate > 10% / 60 s window; half-opens after 30 s; closes on first success. |
| idempotency.py | services/ai-gateway/core/idempotency.py | Replay-safe via the Idempotency-Key header. Same key → same response, returned from a short-TTL Redis cache (see the sketch after this table). |
| vn_pii_rules.py | services/ai-gateway/core/vn_pii_rules.py | Custom VN PII detectors: CCCD (Decree 13 — 12 digits), MST (10/13 digits, validator), VietQR account, Vietnamese full names (rule-based). |
| embedder_client.py | services/ai-gateway/core/embedder_client.py | Calls the BGE-M3 GPU pod over gRPC. Batches up to 32 requests per call. |
| reranker_client.py | services/ai-gateway/core/reranker_client.py | Calls the BGE-rerank-v2-m3 GPU pod. Returns an ordered score list. |
| audit_bridge.py | services/ai-gateway/core/audit_bridge.py | Emits one ai.invocation row per call: actor · route · model · tokens-in · tokens-out · USD · cache_state · persona_version · redaction_applied. (FR pending). |
| cost_export.py | services/ai-gateway/tools/cost_export.py | Generates per-tenant monthly invoice line items from the DuckDB roll-up. |
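
A sketch of the idempotency.py scheme from the table, using real redis-py calls (SET with nx/ex, GET). The key layout, TTL, and helper name are illustrative; the shipped module presumably also handles concurrent first writers.

import json
import redis

r = redis.Redis()

def idempotent(tenant_id: str, idempotency_key: str, compute):
    slot = f"idem:{tenant_id}:{idempotency_key}"   # tenant-scoped, like the response cache
    cached = r.get(slot)
    if cached is not None:
        return json.loads(cached)                  # same key → same response, no recompute
    response = compute()
    # nx=True: first writer wins; ex: short replay window, not a long-lived cache row
    r.set(slot, json.dumps(response), nx=True, ex=15 * 60)
    return response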

4 · Data model

AI Gateway is mostly stateless — its source of truth is BRAIN (provider config, tenant policy, persona prompts) and Redis (cache, idempotency keys). DuckDB holds the cost-roll-up for fast dashboard queries. The entities below show the read + write surfaces.

erDiagram
    TENANT ||--o{ TENANT_POLICY : "has policy"
    TENANT ||--o{ AI_INVOCATION : "incurs"
    TENANT ||--o{ TENANT_BUDGET : "has cap"
    PROVIDER ||--o{ PROVIDER_MODEL : "offers"
    PROVIDER_MODEL ||--o{ AI_INVOCATION : "fulfils"
    PERSONA ||--o{ PERSONA_VERSION : "has versions"
    PERSONA_VERSION ||--o{ AI_INVOCATION : "stamps"
    CACHED_RESPONSE ||--o| AI_INVOCATION : "may serve"
    REDACTION_RULE ||--o{ REDACTION_HIT : "matches"
    AI_INVOCATION ||--o{ REDACTION_HIT : "produces"

    TENANT {
        uuid id PK
        string slug
        string country
    }
    TENANT_POLICY {
        uuid tenant_id FK
        string primary_provider "bedrock or anthropic or openai"
        string fallback_provider
        bool require_zdr_attestation
        obj per_route_overrides "route_class to provider"
    }
    TENANT_BUDGET {
        uuid tenant_id FK
        int monthly_usd_cents
        int spent_usd_cents_mtd
        timestamp window_start
        bool warning_sent_80pct
    }
    PROVIDER {
        string id PK "bedrock or anthropic or openai or vertex"
        string region
        bool zdr_attested
        string status "active or degraded or disabled"
    }
    PROVIDER_MODEL {
        string id PK "bedrock-anthropic-claude-3-5-sonnet"
        string provider FK
        string model_name
        int context_window
        decimal price_input_per_1k
        decimal price_output_per_1k
        bool streaming_supported
    }
    PERSONA {
        string id PK "cuo or genie or hr-assistant or other"
        string display_name
    }
    PERSONA_VERSION {
        string persona_id FK
        string version "v2.3.1"
        string system_prompt
        timestamp valid_from
        timestamp valid_to
    }
    CACHED_RESPONSE {
        string key PK "SHA-256 of prompt model params tenant"
        bytes response_body
        int tokens_in
        int tokens_out
        timestamp created_at
        timestamp expires_at
        string tenant_id "load-bearing - no cross-tenant"
    }
    AI_INVOCATION {
        uuid id PK
        uuid tenant_id FK
        string actor "subject_id"
        string route_class "chat or embed or rerank or image"
        string model_id FK
        string persona_version
        int tokens_in
        int tokens_out
        decimal usd_cost
        int latency_ms
        string cache_state "miss or hit or bypass"
        bool redaction_applied
        string failover_path "primary or secondary or primary-to-secondary"
        string brain_chain
        timestamp ts
    }
    REDACTION_RULE {
        string code PK "vn-cccd or vn-mst or en-email or other"
        string regex
        string sentinel "CCCD-sentinel"
        string locale
    }
    REDACTION_HIT {
        uuid id PK
        uuid invocation_id FK
        string rule_code FK
        int offset
        int length
    }

Provider + model matrix (P0)

| Provider | Model | Route class | ZDR | Pricing /1k tokens |
|---|---|---|---|---|
| AWS Bedrock | anthropic.claude-3.5-sonnet | chat (default) | | $0.003 in / $0.015 out |
| AWS Bedrock | anthropic.claude-3-haiku | chat (cheap) | | $0.00025 in / $0.00125 out |
| Anthropic API | claude-sonnet-4.5 | chat (high-quality) | ✓ (zero-retention) | $0.003 in / $0.015 out |
| OpenAI | gpt-4o | chat (alt) | ✓ (zero-data-retention) | $0.0025 in / $0.01 out |
| OpenAI | o1-mini | reasoning (alt) | | $0.003 in / $0.012 out |
| Self-hosted | BGE-M3 | embed | n/a (in-cluster) | free (amortised GPU) |
| Self-hosted | BGE-rerank-v2-m3 | rerank | n/a (in-cluster) | free (amortised GPU) |
| Vertex AI (P1+) | gemini-2.5-pro | chat (alt) | | $0.00125 in / $0.005 out |
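
The billing arithmetic implied by the matrix: price columns are USD per 1k tokens, split by direction. A worked sketch (the model id key and helper are illustrative):

PRICES_PER_1K = {  # (input, output), per the matrix above
    "bedrock:anthropic.claude-3.5-sonnet": (0.003, 0.015),
    "bedrock:anthropic.claude-3-haiku": (0.00025, 0.00125),
}

def usd_cost(model_id: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES_PER_1K[model_id]
    return tokens_in / 1000 * p_in + tokens_out / 1000 * p_out

# 120 tokens in, 450 tokens out on Sonnet:
# 0.120 x $0.003 + 0.450 x $0.015 = $0.00036 + $0.00675 = $0.00711
assert round(usd_cost("bedrock:anthropic.claude-3.5-sonnet", 120, 450), 5) == 0.00711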

5 · API surface

AI Gateway speaks gRPC internally and exposes a thin REST surface for non-internal callers. A federated GraphQL subgraph publishes the read-side (usage, model catalogue). No public MCP tools — the gateway is infrastructure, not directly agent-callable.

gRPC API (canonical)

syntax = "proto3";
package cyberos.ai.v1;

service AIGateway {
  // Streaming chat completion (SSE end-to-end).
  rpc ChatComplete(ChatRequest) returns (stream ChatChunk);
  // Non-streaming variant for batch jobs.
  rpc ChatCompleteSync(ChatRequest) returns (ChatResponse);
  // Embedding (single or batch).
  rpc Embed(EmbedRequest) returns (EmbedResponse);
  // Reranking.
  rpc Rerank(RerankRequest) returns (RerankResponse);
  // Cost lookup for the calling tenant.
  rpc UsageMTD(TenantRef) returns (UsageReport);
  // Model catalogue.
  rpc ListModels(Empty) returns (ModelList);
}

message ChatRequest {
  repeated Message messages = 1;
  string persona = 2;                 // "cuo" | "genie" | "hr-assistant"
  string route_class = 3;             // "chat" | "reasoning"
  string idempotency_key = 4;
  ModelHint hint = 5;                 // optional; tenant policy may override
  bool stream = 6;
  map<string, string> metadata = 7;   // free-form, recorded in audit
}

message Message {
  string role = 1;        // "user" | "assistant" | "system" (system reserved)
  string content = 2;
  repeated Tool tool_calls = 3;
}

message ChatChunk {
  string content_delta = 1;
  bool   done = 2;
  Usage  usage = 3;       // emitted on final chunk
}

message Usage {
  int32 tokens_in = 1;
  int32 tokens_out = 2;
  string model_id = 3;
  string cache_state = 4;
  double usd_cost = 5;
  bool   redaction_applied = 6;
  string persona_version = 7;
  string brain_chain = 8;
}

message EmbedRequest {
  repeated string inputs = 1;
  string model = 2;        // default: "bge-m3"
}

message EmbedResponse {
  repeated Embedding embeddings = 1;
  Usage usage = 2;
}

message Embedding { repeated float vector = 1; }
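
What a caller looks like against this contract: a sketch assuming Python stubs generated from the proto with grpcio-tools (the cyberos_ai_pb2* module names and the endpoint address are illustrative).

import grpc
import cyberos_ai_pb2 as pb          # assumed output of python -m grpc_tools.protoc
import cyberos_ai_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("ai-gateway:50051")  # real deployment: mTLS via edge_proxy.rs
stub = pb_grpc.AIGatewayStub(channel)

request = pb.ChatRequest(
    messages=[pb.Message(role="user", content="summarise Q1 OKRs")],
    persona="genie",
    route_class="chat",
    idempotency_key="demo-key-001",
    stream=True,
)
for chunk in stub.ChatComplete(request):   # server-streaming RPC
    if chunk.done:
        # Usage arrives on the final chunk
        print(f"\n[{chunk.usage.model_id}] ${chunk.usage.usd_cost:.4f} ({chunk.usage.cache_state})")
    else:
        print(chunk.content_delta, end="", flush=True)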

REST + SSE surface (planned, edge-only)

| Method | Path | Purpose |
|---|---|---|
| POST | /v1/chat/completions | OpenAI-compatible chat endpoint (SSE supported with stream: true). |
| POST | /v1/embeddings | OpenAI-compatible embeddings endpoint. |
| POST | /v1/rerank | Cohere-style rerank endpoint. |
| GET | /v1/models | List models available to the caller's tenant. |
| GET | /v1/usage | MTD usage report for the caller's tenant. |
| GET | /health | Liveness + per-provider circuit-breaker state. |
| GET | /metrics | Prometheus scrape endpoint. |

GraphQL subgraph (read-only)

extend schema
  @link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@requiresScopes"])

type AIInvocation @key(fields: "id") @requiresScopes(scopes: [["ai.usage_read"]]) {
  id: ID!
  tenantId: ID!
  actor: String!
  routeClass: RouteClass!
  modelId: String!
  personaVersion: String!
  tokensIn: Int!
  tokensOut: Int!
  usdCost: Float!
  latencyMs: Int!
  cacheState: CacheState!
  redactionApplied: Boolean!
  failoverPath: String!
  ts: DateTime!
}

type UsageReport @key(fields: "tenantId month") {
  tenantId: ID!
  month: String!                    # "2026-05"
  totalCalls: Int!
  totalTokens: Int!
  totalUsdCost: Float!
  capUsdCost: Float!
  percentUsed: Float!
  byModel: [ModelUsage!]!
}

type ModelUsage {
  modelId: String!
  calls: Int!
  tokensIn: Int!
  tokensOut: Int!
  usdCost: Float!
}

enum RouteClass { CHAT REASONING EMBED RERANK IMAGE }
enum CacheState { MISS HIT BYPASS }

type Query {
  aiUsageMTD(tenantId: ID): UsageReport!
  aiInvocations(since: DateTime, limit: Int = 50): [AIInvocation!]!
    @requiresScopes(scopes: [["ai.usage_read"]])
  aiModels: [Model!]!
}

6 · Key flows

Flow 1 — Streaming chat completion (cache miss)

sequenceDiagram autonumber participant CUO as CUO router participant AI as AI Gateway participant TP as tenant_policy participant PER as persona injector participant RED as redactor participant CACHE as Redis cache participant ROUT as router participant BED as AWS Bedrock participant STR as SSE stream participant ACC as accountant participant B as 🧠 BRAIN CUO->>AI: ChatComplete(messages, persona="cuo", stream=true) AI->>TP: get_policy(tenant_id) TP-->>AI: {primary:"bedrock", fallback:"anthropic", cap_usd_mtd:150} AI->>PER: inject system-prompt(persona="cuo", version="v2.3.1") PER-->>AI: messages + system AI->>RED: redact(messages) RED-->>AI: messages' + redaction_hits AI->>CACHE: GET sha256(canonical(messages')||model||params||tenant) CACHE-->>AI: miss AI->>ROUT: route(messages', model="claude-3.5-sonnet") ROUT->>BED: invoke streaming completion loop SSE chunks BED-->>ROUT: data: {delta:"..."} ROUT-->>STR: forward STR-->>CUO: data: {delta:"..."} end BED-->>ROUT: done {tokens_in:120, tokens_out:450} ROUT->>ACC: account(tokens, usd=0.0075) ACC->>ACC: check tenant cap (97/150 → OK) AI->>CACHE: SET key TTL=24h AI->>B: ai.invocation row {…} STR-->>CUO: done event with Usage

Cache-miss latency budget: ≤ 2 s p95 ((NFR pending)). Provider latency dominates; gateway overhead is < 50 ms typical.

Flow 2 — Cache hit (deterministic replay)

sequenceDiagram autonumber participant CHAT as CHAT summarise participant AI as AI Gateway participant TP as tenant_policy participant RED as redactor participant CACHE as Redis cache participant STR as SSE stream participant ACC as accountant participant B as 🧠 BRAIN CHAT->>AI: ChatCompleteSync(messages, persona="genie") AI->>TP: policy(tenant) AI->>RED: redact(messages) AI->>CACHE: GET key CACHE-->>AI: HIT {response, tokens, cached_at} AI->>ACC: account(cache=hit, tokens=0) ACC->>ACC: usage incremented; no USD charged AI->>B: ai.invocation row {cache_state:"hit"} AI-->>CHAT: response (≤ 50 ms p95)

Cache-hit budget: ≤ 50 ms p95. The audit row still records the call — cache hits are tracked separately for invoicing transparency.

Flow 3 — Provider failover (primary degraded)

sequenceDiagram autonumber participant K as KB ingest participant AI as AI Gateway participant ROUT as router participant CB as circuit_breaker participant BED as AWS Bedrock (primary) participant ANTH as Anthropic API (fallback) participant B as 🧠 BRAIN K->>AI: Embed(inputs, model="bge-m3") Note over AI,BED: bge-m3 is self-hosted; example uses chat AI->>ROUT: chat call, primary=bedrock ROUT->>BED: invoke BED-->>ROUT: 503 Service Unavailable ROUT->>ROUT: retry #1 (backoff 250 ms) ROUT->>BED: invoke BED-->>ROUT: 503 ROUT->>ROUT: retry #2 (backoff 1 s) ROUT->>BED: invoke BED-->>ROUT: 503 ROUT->>CB: record failure → trip breaker for bedrock:claude-3.5 Note over CB: error_rate > 10%/60s → OPEN ROUT->>ANTH: failover invoke ANTH-->>ROUT: response ROUT-->>AI: response with failover_path="primary→secondary" AI->>B: ai.invocation row {failover_path:"primary→secondary"} AI-->>K: response Note over CB: 30 s later HALF_OPEN; first success closes breaker

(FR pending): failover within 30 s of primary failure. The circuit breaker prevents pile-up against a degraded provider.
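
A minimal sketch of the breaker policy named above: trip past 10% errors in a 60 s window, half-open after 30 s, close on first success. The shipped circuit_breaker.py keys one of these per provider × model; this version is simplified (for instance, half-open admits every probe rather than one).

import time
from collections import deque

class CircuitBreaker:
    def __init__(self, window_s: float = 60, error_threshold: float = 0.10,
                 cooldown_s: float = 30):
        self.window_s = window_s
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.events: deque[tuple[float, bool]] = deque()   # (timestamp, ok)
        self.opened_at: float | None = None

    def _error_rate(self, now: float) -> float:
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()                          # drop stale samples
        if not self.events:
            return 0.0
        return sum(1 for _, ok in self.events if not ok) / len(self.events)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                    # CLOSED: route to this provider
        # OPEN: block until cooldown elapses, then HALF_OPEN (probe allowed)
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        now = time.monotonic()
        self.events.append((now, ok))
        if ok and self.opened_at is not None:
            self.opened_at = None                          # first success closes the breaker
        elif self._error_rate(now) > self.error_threshold:
            self.opened_at = now                           # trip: router picks the fallback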

Flow 4 — Per-tenant cost cap enforcement

sequenceDiagram autonumber participant U as Module participant AI as AI Gateway participant ACC as accountant participant TPR as tenant_policy participant ALERT as CHAT alert bot participant B as 🧠 BRAIN U->>AI: ChatComplete(…) AI->>ACC: pre-check cap alt spent_mtd < 80% cap ACC-->>AI: allow AI->>AI: …normal flow… else 80% ≤ spent_mtd < 100% ACC-->>AI: allow + warning flag AI->>ALERT: post "tenant X at 84% AI cap" ALERT->>B: budget_warning row AI->>AI: …normal flow… else spent_mtd ≥ 100% ACC-->>AI: hard-stop AI->>B: ai.invocation row {decision:"blocked_cap"} AI-->>U: 429 Quota Exceeded end

(FR pending): 80% warning, 100% hard stop. Warnings post to the tenant's #cyberos-alerts CHAT channel; hard-stop returns 429 with a structured Retry-After header pointing at the next billing cycle.
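
The pre-call check from this flow, as a sketch. Budget fields mirror TENANT_BUDGET from the data model; the alert hook is a stand-in for the CHAT bot, and the exception maps to the 429 the caller sees.

from dataclasses import dataclass

@dataclass
class TenantBudget:
    monthly_usd_cents: int
    spent_usd_cents_mtd: int
    warning_sent_80pct: bool = False

class QuotaExceeded(Exception):
    """Surfaces to the caller as HTTP 429 / gRPC RESOURCE_EXHAUSTED."""

def pre_check(budget: TenantBudget, post_warning) -> None:
    used = budget.spent_usd_cents_mtd / budget.monthly_usd_cents
    if used >= 1.0:
        # audit row records decision="blocked_cap" before the 429 goes out
        raise QuotaExceeded("monthly AI cap reached")
    if used >= 0.8 and not budget.warning_sent_80pct:
        post_warning(f"tenant at {used:.0%} of AI cap")    # posts to #cyberos-alerts
        budget.warning_sent_80pct = True                   # warn once per window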

Flow 5 — PII redaction (Vietnamese CCCD)

sequenceDiagram autonumber participant U as HR module participant AI as AI Gateway participant RED as redactor participant VPI as vn_pii_rules participant ROUT as router participant BED as Bedrock participant B as 🧠 BRAIN U->>AI: ChatComplete("Verify CCCD 037201234567 for Le Van A") AI->>RED: redact(messages) RED->>VPI: scan for VN PII VPI-->>RED: hits [{rule:"vn.cccd", offset:13, len:12}, {rule:"vn.name", offset:30, len:8}] RED->>RED: replace with sentinels - Verify CCCD CCCD_0 for NAME_0 RED-->>AI: redacted messages AI->>ROUT: send to provider (no real CCCD leaves cluster) ROUT->>BED: invoke BED-->>ROUT: response (refers to {{CCCD_0}}, {{NAME_0}}) ROUT-->>AI: response AI->>RED: un-redact in response (caller's tenant scope only) AI->>B: ai.invocation {redaction_applied:true, hits:2} AI-->>U: response with un-redacted CCCD/name

(FR pending): PII recall ≥ 99% on the VN + EN test set. Sentinels are caller-scoped — the same prompt from a different tenant produces different sentinels, so a cached row from tenant A never round-trips through tenant B.
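
A simplified sketch of the vn_pii_rules.py pass. The regexes here are deliberately naive (the shipped pack validates MST check digits and handles Vietnamese names with rule lists rather than regex alone), and offsets are recorded against the partially redacted text.

import re

RULES = {
    "vn.cccd": re.compile(r"\b\d{12}\b"),              # Decree 13 citizen ID: 12 digits
    "vn.mst":  re.compile(r"\b\d{10}(?:-?\d{3})?\b"),  # tax code: 10 or 13 digits
}

def redact(text: str) -> tuple[str, list[dict]]:
    hits: list[dict] = []
    for code, pattern in RULES.items():
        def repl(m, code=code):
            idx = sum(1 for h in hits if h["rule"] == code)   # per-rule counter
            hits.append({"rule": code, "offset": m.start(), "len": len(m.group())})
            prefix = code.split(".")[1].upper()               # "CCCD", "MST"
            return f"{{{{{prefix}_{idx}}}}}"                  # e.g. {{CCCD_0}}
        text = pattern.sub(repl, text)
    return text, hits

redacted, hits = redact("Verify CCCD 037201234567 for tax code 0312345678")
# redacted == "Verify CCCD {{CCCD_0}} for tax code {{MST_0}}"; 2 hits for the audit row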

7 · Request lifecycle

A single AI invocation traverses ten states on the happy path between caller and audit row. Most of the time is spent in Routing (provider RTT); cache-hit paths skip from CacheCheck straight to Streaming.

stateDiagram-v2 [*] --> Received: gRPC ingress Received --> AuthValidated: AUTH verifies JWT, resolves tenant AuthValidated --> PolicyResolved: tenant_policy + budget check PolicyResolved --> Blocked: cap exceeded (100%) PolicyResolved --> PersonaInjected: system prompt prepended PersonaInjected --> Redacted: PII scrubbed Redacted --> CacheCheck: hash key built CacheCheck --> Streaming: HIT (cached response) CacheCheck --> Routing: MISS Routing --> Streaming: provider produced chunk Routing --> Failover: primary errored 2× + 30s elapsed Failover --> Streaming: secondary producing chunks Streaming --> Accounted: tokens + USD totalled Accounted --> Audited: ai.invocation row to BRAIN Audited --> [*] Blocked --> [*]: 429 returned, audit row written

Latency budget per route class

| Route class | p95 target | p99 target | Source NFR |
|---|---|---|---|
| chat (default) | ≤ 2 s | ≤ 5 s | (NFR pending) |
| chat (cache hit) | ≤ 50 ms | ≤ 200 ms | internal |
| chat streaming TTFB | ≤ 500 ms | ≤ 1.2 s | internal · PERF-002 |
| reasoning (o1, claude-opus) | ≤ 10 s | ≤ 30 s | internal |
| embed (BGE-M3) | ≤ 120 ms / item | ≤ 250 ms | internal |
| rerank (BGE-rerank) | ≤ 80 ms / pair | ≤ 150 ms | internal |

8 · Functional Requirements

The CyberOS FR catalogue is being rebuilt one feature at a time via the open fr-author Agent Skill.

Previous FR enumerations were archived 2026-05-14 and are no longer reflected on this page. PRD/SRS narrative remains authoritative for the spec; specific FRs land here as they are re-authored.

9 · Non-Functional Requirements

Performance NFRs from PRD §11.2.1 + reliability from §11.2.2. Cross-referenced at nfr-catalog.html#ai.

| NFR ID | Concern | Target | Measurement |
|---|---|---|---|
| (NFR pending) | AI request p95 latency (chat, miss) | ≤ 2 s | k6 load test against gateway |
| (NFR pending) | Streaming TTFB p95 | ≤ 500 ms | k6 streaming test |
| (NFR pending) | Cache-hit response p95 | ≤ 50 ms | internal bench |
| (NFR pending) | Embed call p95 (per item) | ≤ 120 ms | BGE-M3 GPU bench |
| (NFR pending) | Rerank call p95 (per pair) | ≤ 80 ms | BGE-rerank bench |
| (NFR pending) | Cache hit rate at P0 | ≥ 30% | weekly roll-up |
| (NFR pending) | Cache hit rate at P2+ | ≥ 60% | weekly roll-up |
| (NFR pending) | Cost ceiling (internal P0) | ≤ $150/month LLM | DuckDB invoice export |
| (NFR pending) | Cost ceiling (50-tenant) | ≤ $4/active user/month LLM | per-tenant dashboard |
| (NFR pending) | PII redaction recall (VN+EN) | ≥ 99% | (FR pending) test set |
| (NFR pending) | PII redaction precision | ≥ 95% | measured false-positive rate |
| (NFR pending) | AI gateway provider failover | continuous on primary outage | chaos test |
| (NFR pending) | Gateway availability (28-day) | ≥ 99.9% | SLO monitor |
| (NFR pending) | Cross-tenant cache leakage | = 0 | property-based test in CI |
| (NFR pending) | Failover detection latency | ≤ 30 s | chaos test (kill primary) |

10 · Dependencies

AI Gateway depends on three internal services and four external providers (P0). It is depended on by every CyberOS module that calls an LLM.

graph LR subgraph upstream ["AI Gateway depends on"] AUTH["🔐 AUTH
tenant resolution"] BRAIN["🧠 BRAIN
tenant_policy + persona prompts
+ ai.invocation rows"] OBS["👁 OBS
traces + metrics"] REDIS["⚡ Redis
cache + idempotency"] BED["AWS Bedrock"] ANTH["Anthropic API"] OAI["OpenAI API"] GPU["BGE-M3 GPU pod"] end AI["🧠 AI Gateway"] subgraph downstream ["Used by"] CUO["🎯 CUO"] SKILL["🛠 Skill host"] CHAT["💬 CHAT"] KB["📚 KB"] GENIE["✨ Genie"] PROJ["📋 PROJ"] EMAIL["✉️ EMAIL"] OTH["…all LLM-using modules"] end AUTH --> AI BRAIN --> AI OBS --> AI REDIS --> AI BED --> AI ANTH --> AI OAI --> AI GPU --> AI AI --> CUO AI --> SKILL AI --> CHAT AI --> KB AI --> GENIE AI --> PROJ AI --> EMAIL AI --> OTH classDef planned fill:#fef6e0,stroke:#9c750a classDef shipped fill:#f5ede6,stroke:#45210e classDef ext fill:#cba88a,stroke:#4338ca class AI,AUTH,OBS,CUO,SKILL,CHAT,KB,GENIE,PROJ,EMAIL,OTH planned class BRAIN,REDIS shipped class BED,ANTH,OAI,GPU ext

11 · Compliance scope

AI Gateway is the chokepoint for "what did the AI see, and at what cost?" — making it the regulator's first call for AI Act + PDPL questions.

| Regulation / standard | Article / clause | AI Gateway feature |
|---|---|---|
| EU AI Act (Reg. 2024/1689) | Art. 12 — Logging | One ai.invocation row per call; full input + redaction + output hash. |
| EU AI Act | Art. 13 — Transparency | Per-call model id + persona version surfaced to the caller's audit trail. |
| EU AI Act | Art. 14 — Human oversight | Destructive tool calls require human confirmation; gateway annotation routed via MCP. |
| EU AI Act | Art. 15 — Accuracy, robustness, cybersecurity | Circuit breaker + failover + PII redaction. |
| EU AI Act | Art. 26 — Deployer obligations | Persona-version stamping pins the deployed agent version per call. |
| Vietnam PDPL (Law 91/2025) | Art. 4 — Lawful processing | PII redaction before extra-tenant transfer; per-tenant data residency in policy. |
| Vietnam Decree 13/2023 | Art. 16 — Cross-border transfer | ZDR-attested providers; per-tenant policy can pin EU-only / VN-only. |
| GDPR | Art. 25 — Data protection by design | Redaction is on by default; bypass requires explicit per-route tenant policy. |
| GDPR | Art. 28 — Processor obligations | ZDR contracts on file with Anthropic, OpenAI, AWS Bedrock. |
| ISO/IEC 42001 (AIMS) | § 8.3 — AI system lifecycle | Persona-version stamping + provider catalogue + cost tracking close the loop. |
| OWASP Gen AI Top-10 | LLM01: Prompt injection | System prompt injected at the gateway, not in caller-controlled message text. |
| OWASP Gen AI Top-10 | LLM06: Sensitive info disclosure | PII redaction recall ≥ 99%, measured per release. |
| OWASP Gen AI Top-10 | LLM10: Model theft | Self-hosted BGE models behind mTLS; no external embedding API used. |
| SOC 2 Type II | CC7.2 — Monitoring | Per-tenant cost + latency + cache-hit dashboards. |

12 · Risk entries

AI Gateway-specific risks tracked in the risk register.

| ID | Risk | Likelihood | Impact | Owner | Mitigation |
|---|---|---|---|---|---|
| R-AI-001 | Cross-tenant cache leakage | Low | Catastrophic | CTO | tenant_id baked into cache key; property-based CI test verifies no cross-tenant hits (see the test sketch after this table). |
| R-AI-002 | PII recall regression below 99% | Medium | High | CDO | Test-set CI gate; release blocked if recall < 99%; quarterly red-team adds adversarial samples. |
| R-AI-003 | Tenant cost overrun (cap-bypass bug) | Low | High | CFO | Pre-call check; post-call check; daily cost reconciliation against provider bill. |
| R-AI-004 | Primary provider extended outage (> 4 h) | Medium | Medium | CTO | 30 s failover + per-tenant fallback override; multi-provider posture documented in DR runbook. |
| R-AI-005 | Persona prompt drift between gateway + module | Medium | Medium | CDO | Single source of truth in BRAIN; gateway-only injection ((FR pending)); CI test on each persona-version change. |
| R-AI-006 | Prompt injection bypasses gateway redaction | Medium | High | CSO | CaMeL-style enforcement; sentinel scheme cannot be guessed by the upstream caller; quarterly red-team. |
| R-AI-007 | Provider rate-limit cascade (one tenant starves the rest) | Medium | Medium | CTO | Per-tenant token bucket on the gateway; global circuit breaker prevents pile-up. |
| R-AI-008 | Cache poisoning via adversarial canonical prompt | Low | High | CSO | Cache key includes tenant_id + idempotency key; provider response hash compared to in-flight verification on critical routes. |
| R-AI-009 | BGE-M3 GPU pod single point of failure | Medium | Medium | CTO | 2-replica deployment at P1+; CPU fallback (slow) on hot-path embed. |
| R-AI-010 | Vendor SDK CVE blocks release | Medium | Low | CTO | LiteLLM is vendored — patch in-tree; Renovate watches upstream weekly. |
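
The R-AI-001 mitigation as a test sketch, using hypothesis (assumed tooling) against a toy version of the tenant-scoped key; the shipped CI test presumably targets the real cache.py.

import hashlib
import json
from hypothesis import given, strategies as st

def cache_key(tenant_id: str, prompt: str, model: str) -> str:
    canonical = json.dumps({"p": prompt, "m": model, "t": tenant_id}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

@given(prompt=st.text(),
       model=st.sampled_from(["chat-default", "reasoning"]),
       t_a=st.uuids(), t_b=st.uuids())
def test_no_cross_tenant_key_collision(prompt, model, t_a, t_b):
    # identical prompt + model from two tenants must never share a cache row
    if t_a != t_b:
        assert cache_key(str(t_a), prompt, model) != cache_key(str(t_b), prompt, model)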

13 · KPIs

9 KPIs covering latency, cost, redaction quality, and reliability.

| KPI | Formula | Source | Target |
|---|---|---|---|
| Chat p95 latency (miss) | histogram | OBS · Prometheus | ≤ 2 s ((NFR pending)) |
| Streaming TTFB p95 | histogram | OBS | ≤ 500 ms |
| Cache hit rate | cache_hits / total_calls | DuckDB roll-up | ≥ 30% (P0) |
| PII redaction recall | TP / (TP + FN) on test set | CI gate | ≥ 99% |
| PII redaction precision | TP / (TP + FP) | CI gate | ≥ 95% |
| Provider failover events | count / 28 d | ai.invocation | tracked; alert on > 100/day |
| Tenant cost overrun events | count / 28 d | accountant | = 0 (hard stop ensures this) |
| Cross-tenant cache leakage | property-test count | CI | = 0 |
| USD spent vs. budget (MTD) | spent / cap per tenant | dashboard | < 100% (warn at 80%) |

14 · RACI matrix

| Activity | CEO | CTO | CDO | CFO | CSO | DPO |
|---|---|---|---|---|---|---|
| Service design | A | R | C | I | C | I |
| Implementation | I | A | R | I | I | I |
| Provider contracts (ZDR, DPA) | C | C | I | A | R | C |
| Cost tracking + invoicing | I | C | I | A/R | I | I |
| PII rule maintenance (VN+EN) | I | C | A/R | I | C | C |
| Persona-prompt curation | A | C | R | I | I | I |
| Provider failover drill | I | A/R | C | I | C | I |
| Compliance review (AI Act, PDPL) | I | C | C | I | C | A/R |

15 · Planned CLI surface

Two surfaces: the cyberos-ai operator CLI (tenant policy, cost reports, model catalogue) and the standard OpenAI-compatible REST path for ad-hoc curl testing.

1. Quick chat call

$ curl https://ai.cyberos.com/v1/chat/completions \
    -H "Authorization: Bearer $CYBEROS_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"model":"claude-3.5-sonnet","messages":[{"role":"user","content":"summarise Q1 OKRs"}],"stream":false}'

{
  "id": "ai_01HZJ8…XK",
  "model": "bedrock:anthropic.claude-3.5-sonnet",
  "persona_version": "genie-v1.0.2",
  "choices": [{"message":{"role":"assistant","content":"Q1 OKRs are …"}}],
  "usage": {"prompt_tokens":120,"completion_tokens":450,"usd_cost":0.0075,
            "cache_state":"miss","redaction_applied":false,"failover_path":"primary"}
}

2. Operator — view MTD usage

$ cyberos-ai usage mtd --tenant stephen-personal

Tenant:       stephen-personal
Month:        2026-05
─────────────────────────────
Total calls:  14,823
Tokens-in:    8.2 M
Tokens-out:   3.1 M
USD spent:    $97.42 / $150.00 (64.9%)

By model:
  bedrock:claude-3.5-sonnet    11,420 calls   $74.20  (76%)
  bedrock:claude-3-haiku        3,210 calls    $4.80   (5%)
  bge-m3 (embed)                  193 batches  free

By cache state:
  hit   4,612   (31.1%)  ← above target
  miss 10,211   (68.9%)

3. Operator — update tenant policy

$ cyberos-ai policy set --tenant acme-corp \
    --primary bedrock --fallback anthropic \
    --require-zdr true --cap-usd-monthly 500

[policy updated]
  tenant:    acme-corp
  primary:   bedrock
  fallback:  anthropic
  zdr:       required
  cap:       $500/month
[audit]    brain seq=14841

4. Operator — list models

$ cyberos-ai models list

ID                                                  ROUTE     ZDR   PRICE (in/out per 1k)
bedrock:anthropic.claude-3.5-sonnet                 chat      ✓     $0.003 / $0.015
bedrock:anthropic.claude-3-haiku                    chat      ✓     $0.00025 / $0.00125
anthropic:claude-sonnet-4.5                         chat      ✓     $0.003 / $0.015
openai:gpt-4o                                       chat      ✓     $0.0025 / $0.01
openai:o1-mini                                      reason    ✓     $0.003 / $0.012
self-hosted:bge-m3                                  embed     —     free
self-hosted:bge-rerank-v2-m3                        rerank    —     free

5. Operator — failover drill

$ cyberos-ai chaos failover --provider bedrock --duration 60s

[chaos]   injected 100% error rate on bedrock for 60 s
[detect]  primary failure recognised @ +6.2 s
[failover] secondary (anthropic) active @ +6.4 s
[recovery] bedrock errors cleared @ +60 s
[breaker] half-open @ +90 s ; closed @ +91 s
[result]   (FR pending) PASSED (failover ≤ 30 s)

6. Operator — export monthly invoice

$ cyberos-ai invoice export --tenant acme-corp --month 2026-05 --output invoice.csv

[invoice] tenant=acme-corp  month=2026-05  rows=14,823  written invoice.csv (1.2 MB)
[lines]   by_model · by_route · by_persona · by_date

16 · Phase status & estimates

Status: Planned · P0 · design phase · M+1
Est. LoC (Python + Rust): ~6,500 · Python core + Rust edge proxy
Planned tests: 90+ · unit + integration + chaos
P0 monthly LLM budget: $150 · (NFR pending)
Cache TTL (default): 24 h · per-tenant override
CLI commands: ~15 planned · cyberos-ai
| Capability | Status |
|---|---|
| LiteLLM-derived router (Bedrock + Anthropic + OpenAI) | planned · P0 |
| Streaming SSE end-to-end | planned · P0 |
| PII redaction (Presidio + VN rules) | planned · P0 |
| Persona-version system-prompt injection | planned · P0 |
| Response cache (Redis, tenant-keyed) | planned · P0 |
| Per-tenant cost cap + warning | planned · P0 |
| Circuit breaker + 30 s failover | planned · P0 |
| ai.invocation audit row per call | planned · P0 |
| Self-hosted BGE-M3 embedder | planned · P1 |
| Self-hosted BGE-rerank-v2-m3 | planned · P1 |
| Vertex AI (Gemini) provider | planned · P1+ |
| Image generation route (DALL-E / Stable Diffusion) | planned · P2+ |
| Multi-region active-active | planned · P3+ |

17 · References

  • PRD §8.5 — AI Gateway architecture, latency budgets, ZDR enforcement.
  • PRD §9.7 — (FR pending) through (FR pending) (PRD-tier).
  • PRD §11.2.1 + §11.2.2 — Performance + reliability NFRs.
  • SRS §4.7 — Formal (FR pending) through (FR pending) with verification methods.
  • EU AI Act (Reg. 2024/1689) — articles 12, 13, 14, 15, 26.
  • OWASP Gen AI Top-10 (2025) — LLM01, LLM06, LLM08, LLM10 mitigations.
  • ISO/IEC 42001 (AIMS) — § 8.3 lifecycle and persona-version stamping.
  • LiteLLM upstream — base router we vendor and extend.
  • Microsoft Presidio — PII detection library.
  • BGE-M3 / BGE-rerank-v2-m3 — self-hosted embedding + rerank models.
  • Architecture context: infrastructure.html#ai.