The six-pillar plane
PRD §8.1 locks the high-level shape: every module is an independently deployable Apollo Federation subgraph and a Module-Federation frontend remote, sitting on top of six cross-cutting pillars. The pillars are infrastructure modules — they are owned, versioned, and on-call exactly like the functional modules — but they are special in that every other module depends on them. Removing any one of the six breaks the platform.
[Architecture diagram] Members (browser / Tauri) enter through the Apollo Router; agents (Claude / Codex / Cursor via Skills + MCP) enter through the MCP Gateway. Both land on the cross-cutting infrastructure plane (P0): 🔐 AUTH (OAuth 2.1 + JWT RS256, per-tenant authz server), ⚡ AI Gateway (LiteLLM router over Bedrock · Anthropic · OpenAI), 🔌 MCP Gateway (2025-11-25 spec, per-module servers), 👁 OBS (LGTM stack: Loki · Grafana · Tempo · Mimir), 🌐 Apollo Router (Federation v2.5+, persisted queries), and 📬 NATS JetStream (`cyberos.{tenant}.{module}.{entity}.{verb}`). Beneath sit the 22 functional modules (each a subgraph + MCP server + UI remote; BRAIN, CHAT, PROJ, …19 more) and the per-module data layer: PostgreSQL 17 (per-module schema, RLS enforced) and S3 / R2 / MinIO object storage. Modules call the AI Gateway for LLM work, emit events to NATS, and send traces to OBS, as do the gateways and the Router themselves.
Why a separate plane, not per-module?
- Audit unity. Compliance regulators need one audit chain, not 22. NATS subjects, Merkle-chained audit rows, OAuth issuance all share one canonical surface.
- Cost shape. Per-tenant LLM cost is the largest variable expense at scale; a single gateway lets the CFO put a ceiling on it (PRD §8.5: ≤ $150/mo internal, ≤ $4/active user/mo at 50-tenant).
- Provider failover. One gateway means primary Bedrock → Anthropic ZDR → OpenAI ZDR happens once, not 22 times in 22 different ways.
- Agent parity. Strategic bet #1 (PRD §2.3). A human's request and an agent's request hit the same gateway, the same RBAC, the same audit row.
The contracts each pillar exposes
- AUTH → JWT validation header, audience binding, RBAC predicate evaluator
- AI Gateway → `POST /v1/chat/completions`, `/v1/embeddings`, `/v1/rerank` (OpenAI-shaped)
- MCP Gateway → Streamable HTTP + `/.well-known/mcp` + per-tenant OAuth-PRM
- OBS → OTel collector at `otel:4317` (gRPC) + `:4318` (HTTP)
- GraphQL → composed supergraph at `https://{tenant}.cyberos.world/graphql`
- NATS → subject `cyberos.{tenant}.{module}.{entity}.{verb}`
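The contract list above can be sketched as a small per-tenant endpoint map. A hedged sketch: the helper name is hypothetical, and the internal service hostnames (`ai-gateway` and the ports beyond those stated above) are assumptions for illustration only.

```typescript
// Hypothetical helper resolving the six pillar contracts for one tenant.
type Tenant = string;

const pillarEndpoints = (tenant: Tenant) => ({
  graphql: `https://${tenant}.cyberos.world/graphql`,
  mcpDiscovery: `https://${tenant}.cyberos.world/.well-known/mcp`,
  oauthPrm: `https://${tenant}.cyberos.world/.well-known/oauth-protected-resource`,
  otelGrpc: "otel:4317",  // OTel collector, gRPC
  otelHttp: "otel:4318",  // OTel collector, HTTP
  // Canonical NATS subject builder per PRD §8.10
  natsSubject: (module: string, entity: string, verb: string) =>
    `cyberos.${tenant}.${module}.${entity}.${verb}`,
});

const acme = pillarEndpoints("acme");
console.log(acme.graphql);                                // https://acme.cyberos.world/graphql
console.log(acme.natsSubject("proj", "task", "created")); // cyberos.acme.proj.task.created
```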
AUTH
P0 · planned · 🔐 The identity backbone. AUTH owns who you are; every other module trusts AUTH. The minimal interface is "give me an RS256-signed JWT for this Member" — the rest is implementation.
Why a separate layer
If each module re-implemented identity, the platform would suffer the classic distributed-monolith pathology: 22 places where a security advisory needs to be applied, 22 places where MFA enforcement can drift. AUTH centralises the surface that absolutely cannot drift.
More structural: PRD §8.6 mandates that AI agents authenticate as Members — there is no "service account with broad permissions" pattern. Agent parity (PRD §2.3 bet #1) requires that a Claude session run on the same identity contract as a human session. One AUTH module is the only sensible place to enforce that.
Tech stack
- Library · Hono + jose (TypeScript subgraph) or axum + jsonwebtoken (Rust subgraph)
- Token · JWT RS256 with rotating signing keys (90-day rotation, 30-day grace)
- Session refresh · opaque token, HttpOnly+SameSite=Lax cookie, 30-day max
- MFA · TOTP (RFC 6238) minimum + WebAuthn / passkey for elevated roles
- Magic link · onboarding only, single-use, 15-minute TTL
- OAuth 2.1 + PKCE · per-tenant authorisation server (S256 challenge only)
- RBAC store · Postgres with RLS enforced via the `app.tenant_id` session GUC
Why these picks
- RS256, not HS256 · only AUTH needs the private key; every subgraph validates with the public key without round-tripping AUTH.
- Per-tenant authz server · `acme.cyberos.world/.well-known/oauth-authorization-server` means a leaked Acme token cannot replay against Beta's gateway (audience binding from §8.4.1).
- Postgres RLS · not application-layer ACL. Even a bug in a subgraph cannot cross tenant boundaries because the database itself refuses to return another tenant's rows.
- WebAuthn for Founder/CEO · phishing-resistant by spec; mandatory per PRD §8.6.
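A minimal sketch of the RS256 contract described above: AUTH signs with the private key, any subgraph verifies with the public key and enforces audience binding. This uses only `node:crypto` to stay dependency-free; the real stack would use `jose`, and the function names here are hypothetical.

```typescript
import { generateKeyPairSync, createSign, createVerify } from "node:crypto";

const b64url = (s: string) => Buffer.from(s).toString("base64url");

// AUTH-side: sign a JWT with the rotating private key (hypothetical helper).
function signJwt(payload: object, privateKey: string): string {
  const header = b64url(JSON.stringify({ alg: "RS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const sig = createSign("RSA-SHA256")
    .update(`${header}.${body}`)
    .sign(privateKey, "base64url");
  return `${header}.${body}.${sig}`;
}

// Subgraph-side: verify with the public key (fetched via JWKS in production)
// and reject wrong-audience tokens (the §8.4.1 binding).
function verifyJwt(jwt: string, publicKey: string, expectedAud: string) {
  const [header, body, sig] = jwt.split(".");
  const ok = createVerify("RSA-SHA256")
    .update(`${header}.${body}`)
    .verify(publicKey, sig, "base64url");
  if (!ok) throw new Error("bad signature");
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  if (claims.aud !== expectedAud) throw new Error("audience mismatch");
  return claims;
}

const { privateKey, publicKey } = generateKeyPairSync("rsa", {
  modulusLength: 2048,
  privateKeyEncoding: { type: "pkcs8", format: "pem" },
  publicKeyEncoding: { type: "spki", format: "pem" },
});

const token = signJwt({ sub: "member:trinh", aud: "acme.cyberos.world" }, privateKey);
console.log(verifyJwt(token, publicKey, "acme.cyberos.world").sub); // member:trinh
```

Only AUTH ever holds `privateKey`; subgraphs cache the public key from the JWKS endpoint, which is what makes validation a local operation.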
[AUTH internals diagram] The Member's browser hits the per-tenant OAuth 2.1 authorisation server, which drives the MFA verifier (TOTP + WebAuthn) and then the JWT issuer (RS256 · 24 h expiry). The issuer reads the signing-key rotation store (JWKS endpoint) and writes to Postgres (sessions · refresh · MFA enrolment); refresh tokens are opaque and cookie-bound. Any subgraph fetches and caches JWKS, validates the JWT locally, and checks the RBAC predicate engine (role × resource × action).
Key contracts
# GraphQL contract — abbreviated
type Query {
me: Member!
myRoles: [Role!]!
jwks: JwksDocument! # public keys for downstream subgraphs
}
type Mutation {
exchangeCodeForToken(code: ID!, verifier: String!): TokenBundle!
refresh(token: String!): TokenBundle!
enrollMfa(method: MfaMethod!): MfaEnrollment!
stepUp(scope: String!, totp: String): TokenBundle!
revokeSession(sessionId: ID!): Boolean!
}
type TokenBundle {
jwt: String! # RS256, 24h expiry
expiresAt: DateTime!
refreshCookie: String # Set-Cookie header sent server-side
}
# MCP tool surface
cyberos.auth.whoami # readOnly=true
cyberos.auth.list_roles # readOnly=true
cyberos.auth.audit_login # readOnly=true (own sessions only)
Role catalogue (PRD §8.6.1)
| Role | Reads | Writes | Sign / transfer | MFA |
|---|---|---|---|---|
| Founder/CEO | all (own tenant) | all (own tenant) | all | passkey · mandatory |
| Engineering Lead | all (own tenant) | all except REW/ESOP/HR-comp | — | passkey · mandatory |
| HR/Ops Lead | HR / REW / LEARN | HR / REW / LEARN | HR docs | TOTP min |
| Account Manager | CRM / PROJ / TIME / INV / PORTAL | CRM / PROJ / TIME / INV / PORTAL | INV / DOC | TOTP min |
| Member | own + assigned + public | own + assigned | own time entries | recommended |
| Board Member | governance scope | limited sign-offs | SP valuation | passkey · mandatory |
| External Client (P4) | PORTAL scope | PORTAL scope | own docs only | recommended |
| Tenant Admin (P4) | tenant config + audit | tenant config | tenant agreements | passkey · mandatory |
| AI Agent (Member) | = Member | = Member | no auto-sign | inherited |
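The catalogue above boils down to a role × resource × action predicate. A sketch reduced to two rows of the table; the rule shapes, names, and the default-deny choice are illustrative assumptions, not the real RBAC engine:

```typescript
type Action = "read" | "write" | "sign";

type Rule = { resources: string[] | "all"; actions: Action[]; except?: string[] };

// Two rows from the catalogue, encoded as rules (illustrative only).
const rules: Record<string, Rule[]> = {
  "engineering-lead": [
    { resources: "all", actions: ["read"] },
    { resources: "all", actions: ["write"], except: ["REW", "ESOP", "HR-comp"] },
  ],
  "account-manager": [
    { resources: ["CRM", "PROJ", "TIME", "INV", "PORTAL"], actions: ["read", "write"] },
    { resources: ["INV", "DOC"], actions: ["sign"] },
  ],
};

// Default-deny: a role with no matching rule gets nothing.
function allowed(role: string, resource: string, action: Action): boolean {
  return (rules[role] ?? []).some((r) =>
    r.actions.includes(action) &&
    (r.resources === "all"
      ? !(r.except ?? []).includes(resource)
      : r.resources.includes(resource)));
}

console.log(allowed("engineering-lead", "PROJ", "write")); // true
console.log(allowed("engineering-lead", "REW", "write"));  // false (comp carve-out)
console.log(allowed("account-manager", "INV", "sign"));    // true
```

Note how the "AI Agent (Member)" row needs no special case: an agent authenticated as a Member simply evaluates the Member rules.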
Status
P0 planned · M0 → M+3 build window
- OAuth 2.1 PRM compatible with §8.4.3 MCP discovery
- JWKS endpoint live before any subgraph deploys
- Per-tenant authz server provisioning automated at TEN-module setup (P4)
References
- PRD §8.6 — Authentication & RBAC
- PRD §8.6.1 — Role catalogue (technical detail)
- PRD §8.4.1 — OAuth 2.1 for MCP (audience binding)
- RFC 7519 (JWT), RFC 7636 (PKCE), W3C WebAuthn L3
AI Gateway
P0 · planned · ⚡ One door for every LLM call. Routing, caching, redaction, persona stamping, cost accounting, residency enforcement, circuit breaking — all here, once.
Why a separate layer
The temptation in a 22-module platform is to let each module call the LLM SDK directly — "just import anthropic in the subgraph." That fails three ways. First, cost: without one place to set per-tenant budgets, the bill is unobservable until the credit-card statement arrives. Second, residency: a Vietnam-resident tenant must hit the Bedrock Singapore endpoint, never the US; the rule is too easy to bypass when 22 subgraphs each make their own choice. Third, safety: PII redaction, persona-version stamping, and the OWASP Gen-AI Top-10 mitigations cannot be 22-times-correct; they need one chokepoint.
PRD §8.5 makes the gateway non-optional: every LLM call from every module flows through it. The cost target — ≤ $150/mo internal, ≤ $4/active user/mo at 50-tenant scale — only works because we can see and cap every token at the gateway.
Tech stack
- Routing core · LiteLLM (MIT) — 100+ provider unified API
- Providers · primary AWS Bedrock (Claude Sonnet 4.6 / Haiku 4.5) → fallback Anthropic API ZDR → fallback OpenAI ZDR
- Embeddings · self-hosted BAAI/bge-m3 on a shared GPU node
- Rerank · self-hosted BAAI/bge-reranker-v2-m3
- Cache · Redis (semantic + exact-match prompt cache)
- Redaction · Microsoft Presidio + custom CyberSkill rules for VN identifiers (CCCD, MST)
- Cost ledger · Postgres + per-tenant rolling counter (resets daily UTC)
- Tracing · OTel spans for every model call; LangSmith for CUO sessions
Why these picks
- LiteLLM · MIT-licensed, single API, the entire CyberOS extension surface is middleware. Cheap to fork if needed.
- Bedrock primary · ZDR by default, residency by region, Anthropic models without contract overhead, regional Singapore endpoint for VN tenants.
- Self-hosted embeddings · embedding cost dominates at scale; BGE-M3 (one of the highest MIRACL scores) runs cheaply on one shared GPU.
- Presidio · open-source NER for PII, plus custom rules for Vietnamese-specific identifiers (MST, CCCD, bank account regex).
- Semantic cache · CUO answers many similar questions; 30-50% hit rate at internal scale per pilot.
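The cost ledger's behaviour (per-tenant ceiling, daily UTC reset, 429 on breach) can be sketched in a few lines. An in-memory stand-in for illustration; the real ledger lives in Postgres and the function names are hypothetical.

```typescript
type Ledger = Map<string, { day: string; cents: number }>;

// Daily UTC reset key, e.g. "2026-05-01".
const utcDay = (now: Date) => now.toISOString().slice(0, 10);

function chargeOrReject(
  ledger: Ledger,
  tenant: string,
  ceilingCents: number,
  estCents: number,
  now = new Date(),
): { status: 200 | 429; spentCents: number } {
  const day = utcDay(now);
  const row = ledger.get(tenant);
  // A row from a previous day counts as zero: the rolling counter has reset.
  const spent = row && row.day === day ? row.cents : 0;
  if (spent + estCents > ceilingCents) return { status: 429, spentCents: spent };
  ledger.set(tenant, { day, cents: spent + estCents });
  return { status: 200, spentCents: spent + estCents };
}

const ledger: Ledger = new Map();
console.log(chargeOrReject(ledger, "acme", 100, 60).status); // 200
console.log(chargeOrReject(ledger, "acme", 100, 60).status); // 429 (would exceed ceiling)
```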
[Request pipeline diagram] Ingress → per-tenant cost gate (over budget → 429 quota exceeded) → Presidio redaction + VN identifier rules → persona-version stamp (system prompt prepended) → semantic + exact cache (a hit returns the cached response with a fresh persona stamp) → LiteLLM router (model selection by capability) → residency switch: vn-shard → Bedrock ap-southeast-1, eu-shard → Bedrock eu-central-1, us-shard → Bedrock us-east-2, with failover Bedrock → Anthropic ZDR → OpenAI ZDR. Every response updates the cost ledger and emits an audit row to NATS before streaming back to the caller.
Key contracts
# OpenAI-shaped surface (LiteLLM convention)
POST /v1/chat/completions # streaming + non-streaming
POST /v1/embeddings # BGE-M3 self-hosted
POST /v1/rerank # BGE-reranker self-hosted
POST /v1/messages # Anthropic-shaped passthrough
GET /v1/usage?tenant=acme # rolling cost view
GET /v1/models # capability-classified list
# Required headers
Authorization: Bearer <subgraph JWT>
X-Cyberos-Tenant: acme
X-Cyberos-Persona: cfo-v3
X-Cyberos-Module: rew # for cost attribution
X-Cyberos-Trace-Id: 01HXY... # OTel trace propagation
# Response headers
X-Cyberos-Cost-Cents: 13
X-Cyberos-Cache: hit | miss | bypass
X-Cyberos-Provider: bedrock | anthropic | openai
X-Cyberos-Persona-Version: cfo-v3.2.1
Latency budgets (PRD §8.5.1)
| Path | p50 | p95 | Notes |
|---|---|---|---|
| Chat completion (Haiku) | < 600 ms | < 1.4 s | CHAT message-suggest; uses prompt cache |
| Chat completion (Sonnet) | < 1.5 s | < 3.0 s | CUO answers; complex reasoning |
| Embedding (BGE-M3) | < 30 ms | < 80 ms | self-hosted; batch of 32 |
| Reranker (BGE-rerank-v2-m3) | < 80 ms | < 200 ms | self-hosted; top-150 → top-20 |
| BRAIN search end-to-end | < 120 ms | < 250 ms | embed + retrieve + rerank |
| MCP tool call (read-only) | < 200 ms | < 500 ms | via Apollo Router; cached |
| MCP tool call (write) | < 400 ms | < 1.0 s | includes audit + NATS emit |
Status
P0 planned · M+1 → M+3
- LiteLLM forked at `cyberos/litellm-cyberos` with middleware overlay
- Presidio rule pack `cyberos-vn-rules` covers MST, CCCD, VietQR
- Bedrock allowlist · Sonnet 4.6, Haiku 4.5, Titan embed (fallback)
- Cost ledger primed with $0.003/$0.015 Sonnet rate, $0.0008/$0.004 Haiku
References
- PRD §8.5 — AI Gateway
- PRD §8.5.1 — Latency budgets
- OWASP Gen AI Top 10 (2025-04 revision)
- NIST AI 600-1 (GenAI Risk Profile)
MCP Gateway
P0 · planned · 🔌 The agent-operability surface. Every module owns its MCP server; the Gateway federates them into one discovery endpoint. CyberOS targets the 2025-11-25 spec — the production-stable line as of May 2026.
Why a separate layer
The MCP "gateway" is not a single monolithic server — it is a federation router. Each module owns its own MCP server, runs side-by-side with its subgraph, shares the database connection pool, and uses the same RBAC predicates. The gateway's job is two things: (1) one discovery endpoint at /.well-known/mcp so a Claude Desktop or Cursor session can auto-detect the catalog, and (2) cross-cutting policy enforcement — tool annotations (destructive / readOnly / idempotent / openWorld), OAuth audience binding, audit row emission.
Naming convention is the moat against tool-name collisions: cyberos.{module}.{verb}_{noun}. Examples: cyberos.proj.create_task, cyberos.brain.search, cyberos.rew.payslip_explain (read-only narrative; never compute). Collisions are rejected at registration time.
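The registration-time check can be sketched as a shape validation plus a collision set. The regex and function name are assumptions illustrating the described behaviour, not the Gateway's actual implementation:

```typescript
// cyberos.{module}.{verb}_{noun} — lowercase tokens, underscores within the
// tool segment; single-word tools like cyberos.brain.search also pass.
const TOOL_NAME = /^cyberos\.[a-z][a-z0-9]*\.[a-z][a-z0-9]*(?:_[a-z0-9]+)*$/;

const registry = new Set<string>();

function registerTool(name: string): void {
  if (!TOOL_NAME.test(name)) throw new Error(`invalid tool name: ${name}`);
  if (registry.has(name)) throw new Error(`collision: ${name} already registered`);
  registry.add(name);
}

registerTool("cyberos.proj.create_task");
registerTool("cyberos.brain.search");
registerTool("cyberos.rew.payslip_explain");
try {
  registerTool("cyberos.proj.create_task"); // second registration is rejected
} catch (e) {
  console.log((e as Error).message); // collision: cyberos.proj.create_task already registered
}
```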
2025-11-25 spec features adopted
| Spec change | What it gives us | Implementation |
|---|---|---|
| Tasks (long-running ops) | "draft the board pack" without blocking chat session | Tasks subgraph stores state; webhooks callback on complete |
| Sampling-with-Tools | Servers can delegate sub-tasks back to host LLM with tool access | Nested decomposition; rate-limited per session |
| SEP-986 well-known | Single .well-known/mcp replaces hand-coded server lists | Gateway publishes discovery; clients auto-detect |
| Tool annotations | destructive=true tools auto-route through human-in-the-loop | Validated at registration; runtime check on call |
| Streamable HTTP | Single endpoint, mid-stream resumability, HTTP/2 + HTTP/3 | Default transport; SSE deprecated for new servers |
| Elicitation | Server can ask user "which workspace?" mid-call | Implemented as prompt back-channel; Gateway proxies safely |
| Resource embedding | Tool returns can embed Resources for LLM to read | "Show me the policy doc" + tool-call combined patterns |
| Title fields | Human-friendly names separate from technical IDs | Genie panel display; technical name = audit identifier |
Tech stack
- Transport · Streamable HTTP (default) + WebSocket upgrade
- Per-module server · TypeScript SDK (`@modelcontextprotocol/sdk`) or Rust (`mcp-rs` · CyberSkill-published)
- Federation router · Hono + custom MCP-aware reverse proxy
- Discovery · `/.well-known/mcp` + `/.well-known/oauth-protected-resource`
- OAuth · OAuth 2.1 + PKCE (S256); audience-bound tokens
- Registry · Postgres table; tool annotations validated on registration
- HITL · LangGraph `interrupt()` gate for destructive tools
Tool annotations enforced
- `readOnlyHint=true` · executes without confirmation prompt
- `destructiveHint=true` · HITL confirm UI; LangGraph interrupt
- `idempotentHint=true` · safe to retry; no double-execution
- `openWorldHint=true` · may communicate externally; annotated in audit
- `title="..."` · human-friendly label in Genie panel
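How the gateway could route a call based on these annotations, as a sketch: the function, return shape, and flag strings are hypothetical stand-ins for the real registration-validated runtime check.

```typescript
interface ToolAnnotations {
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
  idempotentHint?: boolean;
  openWorldHint?: boolean;
  title?: string;
}

type Route = "dispatch" | "hitl-confirm";

function routeCall(a: ToolAnnotations): { route: Route; auditFlags: string[] } {
  const auditFlags: string[] = [];
  if (a.openWorldHint) auditFlags.push("open-world"); // annotated in the audit row
  if (a.idempotentHint) auditFlags.push("retry-safe"); // safe to re-dispatch on failure
  // Destructive tools detour through the LangGraph interrupt() confirm UI.
  const route: Route = a.destructiveHint ? "hitl-confirm" : "dispatch";
  return { route, auditFlags };
}

console.log(routeCall({ readOnlyHint: true }).route);                          // dispatch
console.log(routeCall({ destructiveHint: true, title: "Delete task" }).route); // hitl-confirm
```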
[MCP Gateway diagram] Any of 26+ AI clients discovers the catalog via /.well-known/mcp (SEP-986), resolves the per-tenant /.well-known/oauth-protected-resource, and authenticates against the AUTH module's OAuth authorisation server. Calls flow over Streamable HTTP into the Gateway router, which consults the annotation-validated tool registry; destructive tools detour through a LangGraph interrupt() user confirm in the Genie panel before dispatch to the module servers (cyberos.brain.*, cyberos.chat.*, cyberos.proj.*, cyberos.rew.* read-only, …19 more). Every tool call emits a NATS audit row.
OAuth-Protected Resource Metadata (PRD §8.4.3)
# GET /.well-known/oauth-protected-resource (per-tenant)
{
"resource": "https://acme-tenant.cyberos.world",
"authorization_servers": [
"https://acme-tenant.cyberos.world/oauth"
],
"scopes_supported": [
"cyberos.read", "cyberos.write",
"cyberos.brain.read", "cyberos.brain.write",
"cyberos.proj.read", "cyberos.proj.write",
"cyberos.chat.read", "cyberos.chat.write",
"cyberos.rew.read"
],
"bearer_methods_supported": ["header"]
}
# GET /.well-known/oauth-authorization-server
{
"issuer": "https://acme-tenant.cyberos.world/oauth",
"authorization_endpoint": ".../authorize",
"token_endpoint": ".../token",
"code_challenge_methods_supported": ["S256"],
"response_types_supported": ["code"],
"grant_types_supported": ["authorization_code", "refresh_token"]
}
Note: Resource-server-as-OAuth-client behaviour is forbidden — the prior pattern that the security community flagged in 2025 ("token shadow handoff" via X-Forwarded-Authorization) is rejected at gateway level. Audience binding ensures an Acme tenant's token cannot be replayed at the Beta tenant's gateway even if leaked.
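The `S256` method advertised in `code_challenge_methods_supported` is just `BASE64URL(SHA-256(verifier))` (RFC 7636). A sketch that checks itself against the RFC's Appendix B test vector:

```typescript
import { createHash } from "node:crypto";

// PKCE S256: challenge = BASE64URL(SHA-256(ASCII(verifier)))
const s256Challenge = (verifier: string): string =>
  createHash("sha256").update(verifier).digest("base64url");

// RFC 7636 Appendix B test vector
const verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk";
console.log(s256Challenge(verifier));
// → E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM
```

The authorization server stores the challenge at `/authorize` time and recomputes it from the verifier presented at `/token`, which is why intercepted authorization codes are useless without the verifier.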
Status
P0 planned · M+1 → M+3
- 2025-11-25 spec targeted (Tasks, Streamable HTTP, Elicitation, PRM)
- Per-module servers ship alongside subgraphs; reuse RBAC/DB pool
- Tool annotations validated at registration; runtime check on call
- 26+ AI client compatibility tested (Claude Desktop, Code, Cursor, Cline)
References
- PRD §8.4 — MCP Gateway & 2025-11-25 spec
- PRD §8.4.1 — Authentication & authorisation
- PRD §8.4.2 — Tool registry & per-module servers
- PRD §8.4.3 — OAuth-Protected Resource & PRM flow
- MCP 2025-11-25 spec
OBS — Observability
P0 · planned · 👁 Logs, metrics, traces — for every module, every gateway, every agent action. OBS is also the surface CUO's CTO skill reports against.
Why a separate layer
In a 22-module platform with agent traffic on top of human traffic, you cannot answer "why is REW slow?" without tying together: a Member's request through Apollo Router, the REW subgraph's DB query, a call out to the AI Gateway for a payslip narrative, an audit row to NATS, and a downstream LEARN subscription. OBS is the single trace tree that makes that legible.
OBS also feeds CUO's CTO skill: weekly OBS dashboard digests, security advisory pipelines, and model registry summaries all read from OBS (PRD §13 AI matrix).
Tech stack — LGTM (DEC-021)
- Loki · logs (open-source, self-hostable, S3-compatible storage)
- Grafana · dashboards (open-source visualisation)
- Tempo · distributed traces (OTel-native)
- Mimir · metrics (Prometheus long-term storage)
- Collector · OpenTelemetry Collector at `:4317` (gRPC) + `:4318` (HTTP)
- Alerting · Grafana Alertmanager + PagerDuty webhook
- Agent traces · LangSmith for CUO session timelines
- Synthetic monitoring · Grafana k6 scripts in CI
Why these picks
- LGTM, not Datadog · cost predictability is the founder's first constraint; Datadog ingest at 22-module scale makes the bill grow linearly with volume.
- OTel-native · every subgraph + gateway speaks one wire format; trade providers later without rewriting instrumentation.
- S3 backend · Loki/Tempo/Mimir all store on the same R2/MinIO bucket as BRAIN's archival layer; one cost model.
- LangSmith for agent runs · CUO's persona-version stamps, tool calls, and HITL gates need agent-aware UX; LangSmith is the lowest-friction tool.
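The per-subgraph RED view (rate, errors, duration) can be sketched as a rollup over raw spans. The field names and percentile method below are assumptions for illustration; the real pipeline computes this in Mimir/Tempo, not application code.

```typescript
interface Span {
  subgraph: string;
  durationMs: number;
  error: boolean;
}

// Rollup: requests/sec, error ratio, and p95 duration per subgraph.
function redRollup(spans: Span[], windowSec: number) {
  const by = new Map<string, { n: number; errs: number; durs: number[] }>();
  for (const s of spans) {
    const b = by.get(s.subgraph) ?? { n: 0, errs: 0, durs: [] };
    b.n++;
    b.errs += s.error ? 1 : 0;
    b.durs.push(s.durationMs);
    by.set(s.subgraph, b);
  }
  return [...by.entries()].map(([subgraph, b]) => {
    const sorted = [...b.durs].sort((a, z) => a - z);
    // Nearest-rank p95 (one of several common percentile conventions).
    const idx = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.95) - 1);
    return {
      subgraph,
      ratePerSec: b.n / windowSec,
      errorRatio: b.errs / b.n,
      p95Ms: sorted[idx],
    };
  });
}

const out = redRollup(
  [
    { subgraph: "rew", durationMs: 40, error: false },
    { subgraph: "rew", durationMs: 900, error: true },
  ],
  60,
);
console.log(out[0]); // { subgraph: "rew", ratePerSec: ≈0.033, errorRatio: 0.5, p95Ms: 900 }
```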
[OBS pipeline diagram] Every module and gateway ships telemetry to the OTel Collector (gRPC :4317 · HTTP :4318), which fans out to Loki (logs), Tempo (traces), and Mimir (metrics), all on the shared S3 backend. Grafana unifies the three into LGTM dashboards, drives Alertmanager (PagerDuty webhook), and feeds the CUO CTO-skill weekly digest.
Status
P0 planned · M+2 → M+3
- LGTM stack via Grafana Cloud free tier in P0, self-host at P1
- OTel instrumentation in module-template; every new subgraph wired by default
- RED dashboard per subgraph (rate, errors, duration)
- CUO CTO-skill digest job emits weekly summary to Founder
References
- PRD §8.7 — Audit & event log
- SRS DEC-021 — LGTM observability stack
- OpenTelemetry semantic conventions 1.27+
- NFR (pending) — p99 latency degradation budget (CI gate)
GraphQL Federation
P0 · planned · 🌐 Apollo Federation v2.5+. One supergraph, 22 subgraphs, one persisted-query budget. The agent surface and the human surface read the same schema.
Why a separate layer (and why Apollo Federation specifically)
CyberOS rejects three alternative API postures. REST per module means the front-end host shell does N round-trips per page. BFF (backend-for-frontend) per module means N+M maintenance burdens. Single monolithic GraphQL means schema-merge conflicts every time a module ships. Apollo Federation v2.5+ solves all three: each module owns its subgraph SDL, the Router composes the supergraph at deploy time, and one HTTP roundtrip can pull from 5 modules in parallel.
PRD §8.2 makes persisted queries mandatory for production traffic. Query hashes are pre-registered at deploy; any unregistered query is rejected with a 400. This caps abuse, enables CDN caching, and lets each subgraph publish a query budget.
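The persisted-query gate reduces to a set lookup at the Router. A sketch assuming the Apollo-convention SHA-256 hex hash and a registry populated at deploy time; the function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

const sha256 = (q: string) => createHash("sha256").update(q).digest("hex");

// Hashes pre-registered at deploy time (here: one illustrative query).
const registered = new Set<string>([
  sha256("query Me { me { id } }"),
]);

// Any unregistered hash is rejected with a 400 before query planning runs.
function persistedQueryGate(queryHash: string): { status: 200 | 400 } {
  return registered.has(queryHash) ? { status: 200 } : { status: 400 };
}

console.log(persistedQueryGate(sha256("query Me { me { id } }")).status); // 200
console.log(persistedQueryGate(sha256("query Evil { __schema { types { name } } }")).status); // 400
```

Because clients send only the hash, the gate doubles as a CDN cache key and removes the ad-hoc query injection surface entirely.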
Tech stack
- Router · Apollo Router (Rust, OSS, MIT) v1.50+ — Federation v2.5+ compliant
- Subgraph servers · GraphQL Yoga (TypeScript) or async-graphql (Rust)
- Composition · `rover supergraph compose` in CI
- Persisted query store · GCS / R2 bucket (CDN-cached)
- Directives used · `@key`, `@external`, `@requires`, `@provides`, `@shareable`, `@inaccessible`, `@tag`
- Auth context · JWT validated at Router; `tenant_id` + `actor` propagated to subgraphs as headers
- Caching · Apollo Router edge cache + Cloudflare CDN
Why these picks
- Apollo Router, not GraphQL Mesh · production-grade Rust runtime; query plan cache; Federation v2.5 reference.
- Persisted queries mandatory · zero query injection surface; CDN-cacheable; rate-shaped.
- Federation v2.5 · `@interfaceObject` (P3 module hierarchies), progressive `@override` (zero-downtime schema moves).
- Schema deprecation discipline · removal requires ≥ 1 phase notice (NFR pending); breaks no client mid-phase.
[Federation diagram] The host shell (Vite · React 19 · Tauri) talks to Apollo Router v1.50+ (query plan · auth context · cache). Every request passes the persisted-query hash lookup; a miss is rejected with 400 unregistered, a hit produces a query plan with parallel subgraph fanout to BRAIN, CHAT, PROJ, AUTH, and the …18 remaining subgraphs, each backed by its own RLS-enforced Postgres schema.
Status
P0 planned · M0 → M+1
- Apollo Router scaffold + composition CI live at M0
- BRAIN, AUTH subgraphs first to integrate at M+1
- Persisted query registration auto-bound to host-shell build
- Schema deprecation policy in CONTRIBUTING.md
References
- PRD §8.2 — GraphQL Federation
- PRD §8.3 — Module Federation (frontend)
- SRS DEC-002 — Apollo Federation v2.5+
- Apollo Federation v2 docs
NATS JetStream
P0 · planned · 📬 Every state-changing action emits an event. NATS JetStream is the spine. Durable consumers, tenant-scoped subjects, audit-grade retention.
Why a separate layer
A 22-module platform needs to decouple write paths from downstream effects. When REW publishes a payslip, six things need to happen: BRAIN ingests the narrative, LEARN updates the career-level snapshot, OBS emits a metric, CUO queues a "review your payslip" Notify, the Compliance audit row is hashed, and the Member's mobile gets a push. All six are events on the canonical subject cyberos.acme.rew.payslip.published.
PRD §8.10 locks the convention: cyberos.{tenant}.{module}.{entity}.{verb}. Subjects are tenant-scoped so subscribers cannot accidentally cross tenant boundaries. Streams retain 30 days by default, 90 days for compensation/ESOP. CUO's ambient-trigger consumers subscribe through durable consumers so a restart never loses pending nudges.
Tech stack
- Broker · NATS Server v2.10+ with JetStream
- Client libs · `nats.go`, `@nats-io/nats.js`, `async-nats` (Rust)
- Schemas · CloudEvents 1.0 envelope + JSON Schema body per subject
- Schema registry · self-hosted; refs in subgraph CI
- Durable consumers · CUO ambient-trigger, OBS rollups, BRAIN ingestion
- Replication · 3-node JetStream cluster per region
- DLQ · failed messages routed to `cyberos.{tenant}.dlq.{module}.{entity}.{verb}`
Why NATS, not Kafka / Redpanda
- Subject hierarchy native · Kafka has flat topics; NATS subjects (`cyberos.acme.proj.*`) match CyberOS conventions one-to-one.
- Latency · sub-millisecond pub/sub; Kafka adds tens of ms per consumer group.
- Footprint · single 50 MB binary; no Zookeeper/Kraft cluster to operate at 10-Member scale.
- JetStream · adds Kafka-style durability without giving up subject wildcards.
- Cost · runs on a single $20/mo VM in P0; clusters at P3.
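The subject convention and NATS wildcard semantics (`*` matches exactly one token, `>` matches the remainder) can be sketched in a few lines. `matches` here is an illustrative re-implementation for clarity; real consumers rely on the broker's own matching.

```typescript
// Canonical subject builder per PRD §8.10.
const subject = (tenant: string, module: string, entity: string, verb: string) =>
  `cyberos.${tenant}.${module}.${entity}.${verb}`;

// NATS-style subject matching: '*' = one token, '>' = all remaining tokens.
function matches(pattern: string, subj: string): boolean {
  const p = pattern.split(".");
  const s = subj.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") return true;          // tail wildcard swallows the rest
    if (i >= s.length) return false;        // pattern longer than subject
    if (p[i] !== "*" && p[i] !== s[i]) return false;
  }
  return p.length === s.length;             // no trailing unmatched tokens
}

const subj = subject("acme", "rew", "payslip", "published");
console.log(matches("cyberos.acme.rew.>", subj));              // true
console.log(matches("cyberos.*.rew.payslip.published", subj)); // true
console.log(matches("cyberos.beta.>", subj));                  // false — tenant-scoped
```

Tenant scoping falls out of the convention: a consumer bound to `cyberos.acme.>` structurally cannot see Beta's events, before any ACL is even consulted.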
Canonical subjects (PRD §8.10)
# Format
cyberos.{tenant}.{module}.{entity}.{verb}
# Examples
cyberos.acme.proj.task.created
cyberos.acme.proj.task.assigned
cyberos.acme.proj.task.completed
cyberos.acme.rew.payslip.published # 90-day retention
cyberos.acme.rew.bp_balance.updated
cyberos.acme.brain.fact.added
cyberos.acme.brain.fact.conflict_detected
cyberos.acme.crm.deal.stage_changed
cyberos.acme.chat.message.posted
cyberos.acme.audit.event.recorded # Merkle-chained; 7y retention
cyberos.acme.ai.invoke.completed # cost ledger
cyberos.acme.mcp.tool.invoked
# Durable consumers
- cuo-ambient → subscribes to *.task.* + *.deal.* + *.payslip.*
- brain-ingest → subscribes to all non-compensation events
- obs-rollup → subscribes to *.>
- compliance-audit → subscribes to *.audit.>
- learn-snapshot → subscribes to rew.payslip.* + hr.level.*
# DLQ replay
cyberos dlq replay
Status
P0 planned · M0 → M+1
- Single-node NATS at M0; 3-node cluster at M+6
- Module-template includes typed publisher/consumer helpers
- Schema registry CI-validated at subgraph PR-time
- Per-tenant subject ACLs at NATS-level (defense in depth alongside RLS)
References
- PRD §8.10 — NATS event subjects
- PRD §8.7 — Audit & event log
- SRS DEC-004 — NATS JetStream events
- NATS JetStream docs
End-to-end · "a module makes a request"
The six pillars are useful in isolation; they are load-bearing when composed. Below is a single end-to-end trace of one user action — Trinh, a Member, asks Genie "what should I work on today?" — passing through every pillar exactly once.
The six pillars, one trace
References
CyberOS source documents
- PRD §8.1 — The high-level system
- PRD §8.2 — GraphQL Federation
- PRD §8.3 — Module Federation (frontend)
- PRD §8.4 — MCP Gateway and the 2025-11-25 spec
- PRD §8.4.1 — Authentication and authorisation
- PRD §8.4.2 — Tool registry and per-module servers
- PRD §8.4.3 — OAuth-protected resource and PRM flow
- PRD §8.5 — AI Gateway
- PRD §8.5.1 — Latency budgets
- PRD §8.6 — Authentication & RBAC
- PRD §8.6.1 — Role catalogue (technical detail)
- PRD §8.7 — Audit & event log
- PRD §8.8 — Multi-tenancy and residency
- PRD §8.10 — NATS event subjects
- PRD §11.2.1 — Performance Efficiency NFRs
- SRS DEC-001..DEC-066 — locked decisions
External standards & specs
- MCP 2025-11-25 specification
- Apollo Federation v2 docs
- RFC 7519 — JSON Web Token (JWT)
- RFC 7636 — PKCE for OAuth 2.0
- RFC 6749 — OAuth 2.0 / draft OAuth 2.1
- RFC 6238 — TOTP
- W3C WebAuthn Level 3
- RFC 6532 — Internationalized email (UTF-8 throughout)
- NATS JetStream documentation
- OpenTelemetry semantic conventions 1.27+
- OWASP Generative AI Top 10 (2025-04)
- NIST AI 600-1 — Generative AI Risk Profile