OBS is the shared telemetry plane: logs, metrics, traces, and AI-trace observability for every CyberOS module. Operationally, OpenTelemetry SDKs in every service ship to a single OTel collector that fans out: logs to Loki, metrics to Prometheus, traces to Tempo. Grafana renders dashboards (per-module SLO, per-tenant cost, per-region health). LangSmith captures full LLM call traces independently of the operational pipeline, so AI debugging doesn't require correlating across three tools. Alert Manager fans critical alerts to PagerDuty, mid-severity alerts to #cyberos-alerts, and low signals into the CUO morning digest. The audit chain, owned by BRAIN, is exposed via a separate read-only OBS surface for regulators (PDPL Art. 14, EU AI Act Art. 12). Tenant scoping is enforced at the query proxy so a member of tenant A cannot see tenant B's logs.
Why OBS exists
Production observability is one of the line items that, if not centralised early, fragments quickly: one team picks Datadog, another picks Honeycomb, the AI team picks LangSmith, the compliance team asks for an audit-log dashboard that nobody owns. Centralise the platform, let every module emit OpenTelemetry, give compliance read-only audit views, and the question "is the platform healthy?" has one answer instead of five.
Loki + Grafana + Tempo + Prometheus = the full operational picture. Self-hosted; runs on Fargate + S3.
LangSmith captures full prompt + completion + tool-call chains. Operational tracing alone won't tell you why an agent made a bad decision.
EU AI Act Art. 12 and PDPL Art. 14 demand decision logging that regulators can inspect; OBS owns the read-only audit surface.
The bet: pay the LGTM operational cost once, plug LangSmith in beside it, and you get incident response, SLO tracking, AI debugging, and compliance evidence from one plane. The alternative (three different SaaS tools, each with its own auth and bill) is a money-and-context drain that compounds with every new module.
What it does: 5W1H2C5M
| Axis | Question | Answer |
|---|---|---|
| 5W · What | What is OBS? | A self-hosted LGTM stack (Loki, Grafana, Tempo, Prometheus) plus LangSmith for AI-trace observability, plus a small Rust query proxy that enforces tenant scoping on every read, plus Alert Manager for routing. |
| 5W · Who | Who reads it? | Operators: CTO + on-call engineers (dashboards, alerts). Module owners: their SLO dashboards. Tenant admins: their own tenant's cost + usage dashboards. Compliance: read-only audit surface. Auditors: per-engagement scope. |
| 5W · When | When does it run? | 24/7. The OTel collector receives spans/logs/metrics in real time; alert evaluation every 30 s; dashboards refresh on user request or 30 s auto-refresh. |
| 5W · Where | Where does it run? | Self-hosted on AWS in SG-1 (P0). LangSmith is a managed SaaS (zero-retention contract); the audit-log view is served from BRAIN reads via the query proxy. |
| 5W · Why | Why a separate plane? | So no module has to think about "where do my logs go?": they emit OTel; the plane handles fan-out, retention, query, and alerting. |
| 1H · How | How does it work? | Services emit OTel; the collector splits by signal type; Loki / Tempo / Prometheus ingest; Grafana queries via the tenant-aware proxy; LangSmith ingests AI traces over its own SDK; Alert Manager evaluates rules and routes; the audit-log surface reads the BRAIN binlog. |
| 2C · Cost | Cost? | P0: ~$130/month (S3 hot-tier storage + Fargate for the query proxy + LangSmith starter). 50-tenant: ~$700/month including S3 cold tier + Grafana Enterprise (optional). |
| 2C · Constraints | Constraints? | (a) PII redaction before log shipping (≥ 99.5% recall). (b) Tenant queries cannot bypass scope. (c) EU AI Act Art. 12 decision logs retained ≥ 6 months. (d) The audit-log surface is read-only for everyone. |
| 5M · Materials | Stack? | OpenTelemetry SDK (Rust + Python) · OTel Collector · Loki 3.x · Tempo 2.x · Prometheus 2.x · Grafana 11.x · LangSmith · Alert Manager · S3 (Loki / Tempo backing). |
| 5M · Methods | Method choices? | OTel for everything except AI traces (LangSmith). Trace-id propagation via W3C TraceContext. PII redaction at the collector. tenant_id injected as a label by the collector based on JWT inspection. |
| 5M · Machines | Deployment? | Loki + Tempo on S3-backed object storage; Prometheus on a single Fargate task (P0); Grafana on Fargate; query proxy on Fargate. |
| 5M · Manpower | Who maintains? | 0.3 FTE CTO at P0. P1+: dedicated SRE/on-call rotation. |
| 5M · Measurement | How measured? | N(FR pending) (platform availability ≥ 99.5%), N(FR pending) (SLO dashboard ≤ 60 s freshness), N(FR pending) (log PII recall ≥ 99.5%). |
Architecture
Every CyberOS service ships OTel SDK in-process. The collector receives all signals, applies PII redaction, tags with tenant_id, and fans out to Loki (logs), Tempo (traces), Prometheus (metrics). LangSmith receives AI-trace data directly from AI Gateway. Grafana renders dashboards via a Rust tenant-aware query proxy. Alert Manager evaluates Prometheus rules and routes.
- Collector pipeline: OTLP receiver (gRPC/HTTP) → redactor processor (PII scrub · ≥ 99.5% recall) → tenant_tag processor (JWT → tenant_id label) → tail-based sampler (traces) → exporters.
- LGTM backends (S3-backed): Loki (logs · 7 d hot · 90 d warm), Tempo (traces · 7 d hot · 30 d warm), Prometheus (metrics · 15 d local · 1 y in Mimir at P1+).
- Query path: Grafana 11.x dashboards → tenant_query_proxy.rs (Rust · enforces tenant scope on every query) → Loki / Tempo / Prometheus.
- Alert path: Prometheus → Alert Manager (alertmanager.yml · routes by severity) → PagerDuty (critical), CHAT bot (mid), CUO digest (low).
- AI trace plane: AI Gateway → LangSmith SaaS (zero-retention) via the LangSmith SDK.
- Audit surface: BRAIN binlog → audit_view.rs (read-only · auditor-scoped) → Grafana.
Internal components
| Component | Where | Responsibility |
|---|---|---|
| OTel Collector | services/obs/collector/ | Receives OTLP from every service. Applies PII redaction, tenant tagging, tail-based sampling. Fans out to Loki/Tempo/Prometheus. |
| redactor processor | collector/processors/redactor.go | Presidio-equivalent PII scrubber in Go. Recall ≥ 99.5%. Same rule set as the AI Gateway redactor. |
| tenant_tag processor | collector/processors/tenant_tag.go | Inspects span attributes for tenant_id (from JWT context); adds it as a standard label. Source of truth: the tenant.id attribute. |
| sampler | collector/processors/sampler.go | Tail-based: keeps 100% of error traces, samples 10% of successful ones. |
| Loki | backend | Log storage. S3-backed, gzip-compressed. 7 d hot · 90 d warm. |
| Tempo | backend | Trace storage. S3-backed. 7 d hot · 30 d warm. |
| Prometheus | backend | Metrics. 15 d local; Mimir for 1 y at P1+. |
| tenant_query_proxy.rs | services/obs/query-proxy/ | Rust axum service. Every query (from Grafana or the API) is intercepted; tenant_id from the JWT is injected as a label filter; cross-tenant queries are rejected with 403. |
| Grafana | frontend | 11.x. Per-module SLO dashboards + per-tenant cost dashboards + read-only audit-log view (datasource: BRAIN). |
| Alert Manager | backend | Routes alerts by severity. PagerDuty + CHAT + CUO digest integrations. |
| SLO engine | services/obs/slo/ | Sloth-based. SLO definitions in YAML committed to the repo. Burn-rate alerts generated automatically. |
| cost_pipeline.py | services/obs/cost/ | Daily cost roll-up from AWS Cost Explorer + AI Gateway DuckDB + storage metrics. Per-tenant breakdown. |
| audit_view.rs | services/obs/audit/ | Read-only audit-log API; consumes the BRAIN binlog; exposes a Grafana datasource so compliance can query in the same UI as operations. |
| LangSmith client | integrated in AI Gateway | Sends prompt/completion/tool-call traces directly to LangSmith. Zero-retention contract in place. |
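The sampler policy above (keep every error trace, sample 10% of successful ones, decide only after the whole trace has arrived) can be sketched in a few lines. This is an illustrative Python model, not the Go processor itself; `should_keep` and the deterministic hash-bucket trick are assumptions.

```python
import hashlib

ERROR_KEEP_RATE = 1.0   # keep 100% of traces containing an error span
OK_SAMPLE_RATE = 0.10   # keep 10% of fully successful traces

def should_keep(trace_id: str, span_statuses: list[str]) -> bool:
    """Tail-based decision: made only after all spans of the trace arrived."""
    if any(s == "ERROR" for s in span_statuses):
        return True
    # Deterministic hash bucketing so every collector replica makes the
    # same keep/drop decision for the same trace_id.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < OK_SAMPLE_RATE * 100
```

Hashing the trace_id rather than rolling a random number keeps the decision stable across collector restarts and replicas.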
Data model
OBS is mostly streaming; its "data model" is the schema of OTel signals plus the SLO and alert configuration. The canonical attribute schema follows.
Canonical OTel attribute schema
| Attribute | Type | Required | Purpose |
|---|---|---|---|
| tenant.id | string (UUID) | YES | Tenant scoping; load-bearing for all queries. |
| tenant.slug | string | SHOULD | Human-readable label. |
| actor.id | string | YES | Subject (user / agent / service). |
| actor.kind | "human" \| "agent" \| "service" | YES | Authentication shape. |
| persona.version | string | if agent | e.g. cuo-v2.3.1. |
| module | string | YES | e.g. brain, auth, chat. |
| service.name | string | YES | OTel standard. |
| service.version | string | YES | OTel standard. |
| deployment.environment | "dev" \| "staging" \| "prod" | YES | OTel standard. |
| cyberos.severity_class | "p0" \| "p1" \| "p2" \| "p3" | SHOULD | For alert routing. |
| cyberos.cost_usd | float | if applicable | For per-tenant cost dashboards. |
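A collector-side sanity check over this schema might look like the following sketch; `validate_attributes` and its error strings are illustrative, not a defined component.

```python
import uuid

REQUIRED = {"tenant.id", "actor.id", "actor.kind", "module",
            "service.name", "service.version", "deployment.environment"}
ACTOR_KINDS = {"human", "agent", "service"}
ENVIRONMENTS = {"dev", "staging", "prod"}

def validate_attributes(attrs: dict) -> list[str]:
    """Return a list of schema violations (empty list = valid signal)."""
    errors = [f"missing required attribute: {k}" for k in REQUIRED - attrs.keys()]
    if "tenant.id" in attrs:
        try:
            uuid.UUID(attrs["tenant.id"])
        except (ValueError, TypeError):
            errors.append("tenant.id must be a UUID")
    # Missing values are already reported above; {None} avoids double-reporting.
    if attrs.get("actor.kind") not in ACTOR_KINDS | {None}:
        errors.append("actor.kind must be human|agent|service")
    if attrs.get("deployment.environment") not in ENVIRONMENTS | {None}:
        errors.append("deployment.environment must be dev|staging|prod")
    if attrs.get("actor.kind") == "agent" and "persona.version" not in attrs:
        errors.append("persona.version required when actor.kind=agent")
    return errors
```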
API surface
Query API (Grafana-compatible, tenant-scoped)
All queries flow through tenant_query_proxy.rs, which extracts tenant_id from the caller's JWT and rewrites the query to inject a {tenant_id="…"} label filter. Cross-tenant queries return 403.
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/loki/query | LogQL query (Grafana datasource). |
| POST | /api/v1/loki/query_range | Range LogQL query. |
| POST | /api/v1/prom/query | PromQL query. |
| POST | /api/v1/prom/query_range | Range PromQL. |
| POST | /api/v1/tempo/api/search | Tempo trace search. |
| GET | /api/v1/tempo/api/traces/{id} | Get full trace by id. |
| POST | /api/v1/audit/query | BRAIN audit-log query (read-only). |
| GET | /api/v1/slo | List SLO targets for tenant. |
| GET | /api/v1/slo/{id}/burn | Burn-rate for a specific SLO. |
| GET | /api/v1/cost/mtd | MTD cost breakdown for tenant. |
| GET | /api/v1/alerts/active | Active alerts for tenant. |
| POST | /api/v1/alerts/{id}/silence | Silence an alert (operator scope). |
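The proxy's rewrite step can be sketched as follows. This is a regex toy in Python for illustration only; the real tenant_query_proxy.rs is Rust and would use a proper PromQL/LogQL parser, and `inject_tenant` is a hypothetical name.

```python
import re

LABEL = "tenant_id"

def inject_tenant(query: str, tenant_id: str) -> str:
    """Pin tenant_id (taken from the JWT, never from user input) into
    every label selector `{...}` in the query."""
    def _rewrite(m: re.Match) -> str:
        inner = m.group(1).strip()
        # Any caller-supplied tenant_id matcher is rejected (403 upstream).
        if LABEL in inner:
            raise PermissionError("cross-tenant selector rejected")
        pin = f'{LABEL}="{tenant_id}"'
        return "{" + (f"{inner},{pin}" if inner else pin) + "}"
    # Note: a query with no selector at all (e.g. bare `up`) would need a
    # selector added, which this regex sketch does not handle.
    return re.sub(r"\{([^}]*)\}", _rewrite, query)
```

Injecting the label server-side, rather than trusting a query parameter, is what makes cross-tenant leakage a property you can test rather than a convention you hope for.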
GraphQL subgraph (federated)
extend schema
@link(url: "https://specs.apollo.dev/federation/v2.5", import: ["@key", "@requiresScopes"])
type SLO @key(fields: "id") {
id: ID!
service: String!
indicator: SLOIndicator!
targetPct: Float!
window: String!
currentPct: Float!
budgetRemainingPct: Float!
burnRateShort: Float!
burnRateLong: Float!
}
type Alert @key(fields: "id") {
id: ID!
ruleName: String!
severity: Severity!
state: AlertState!
startedAt: DateTime!
resolvedAt: DateTime
labels: JSON!
}
type CostReport @key(fields: "tenantId month") {
tenantId: ID!
month: String!
totalUsdCost: Float!
infraUsdCost: Float!
aiUsdCost: Float!
storageUsdCost: Float!
byService: [ServiceCost!]!
}
type ServiceCost {
service: String!
usdCost: Float!
}
enum SLOIndicator { AVAILABILITY LATENCY ERROR_RATE THROUGHPUT }
enum Severity { CRITICAL WARNING INFO }
enum AlertState { PENDING FIRING RESOLVED }
type Query {
slos(service: String): [SLO!]! @requiresScopes(scopes: [["obs.read"]])
alertsActive: [Alert!]! @requiresScopes(scopes: [["obs.read"]])
costMTD: CostReport! @requiresScopes(scopes: [["obs.cost_read"]])
trace(id: String!): Trace @requiresScopes(scopes: [["obs.read"]])
}
OTel ingest endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /v1/logs | OTLP logs ingest (collector). |
| POST | /v1/metrics | OTLP metrics ingest. |
| POST | /v1/traces | OTLP traces ingest. |
| GET | /metrics | Prometheus scrape endpoint (collector self-telemetry). |
| GET | /health | Liveness + signal counts. |
Key flows
Flow 1: Log ingestion (PII-scrubbed, tenant-tagged)
(FR pending): PII recall ≥ 99.5%. Redaction at the collector is the last point at which PII can be stopped before it lands on S3.
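Reduced to a toy model, the redaction step looks like this. The production rule set is Presidio-grade and shared with the AI Gateway redactor; the two regex rules here are illustrative only and nowhere near the ≥ 99.5% recall target on their own.

```python
import re

# Illustrative subset; the real rule set covers many more entity types
# (names, national IDs, addresses, ...) plus NER-based detection.
RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("phone", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]

def redact(line: str) -> str:
    """Replace each PII match with a typed placeholder before shipping."""
    for name, pattern in RULES:
        line = pattern.sub(f"[REDACTED:{name}]", line)
    return line
```

The typed placeholder (`[REDACTED:email]`) matches the shape seen in the CLI log-tail example later in this page, so debugging stays possible without the raw value.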
Flow 2: Metric scrape + alert evaluation
Prometheus scrapes service counters such as ai_provider_error_total. Every 30 s it evaluates alert rules (e.g. ai_request_latency_p95_seconds > 2 sustained for 5 m). When a rule fires, the alert {severity=critical, service=ai-gateway} goes to Alert Manager, which routes by labels: PagerDuty pages on-call for critical severity, and a message is posted to CHAT.
Flow 3: Trace propagation across modules
The apollo-router emits the root span; AUTH, CHAT, AI Gateway, and BRAIN (two spans) each report theirs to Tempo, all sharing trace_id=t1.
(FR pending): end-to-end trace continuity verified. W3C TraceContext propagation through every internal call. One trace_id stitches the whole transaction.
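The W3C TraceContext mechanics behind this flow can be sketched as follows. `parse_traceparent` and `child_traceparent` are hypothetical helper names, but the header layout (version-traceid-spanid-flags) is the spec's.

```python
import re
import secrets

# traceparent, version 00: 32-hex trace-id, 16-hex parent span-id, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Extract (trace_id, parent_span_id, flags); None if malformed."""
    m = TRACEPARENT.match(header)
    if not m:
        return None  # malformed: the callee should start a fresh trace
    return m.group(1), m.group(2), m.group(3)

def child_traceparent(header: str) -> str:
    """Propagate: keep the trace_id, mint a new span_id for the outgoing call."""
    parsed = parse_traceparent(header)
    trace_id, _, flags = parsed if parsed else (secrets.token_hex(16), None, "01")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because every hop keeps the trace_id and only replaces the span_id, one trace_id stitches the whole transaction, which is exactly what the continuity check verifies.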
Flow 4: Alert escalation (severity-based routing)
(FR pending): PagerDuty for critical, CHAT for mid-severity, CUO digest for low-severity trends.
Flow 5: Audit-log query (compliance review)
EU AI Act Art. 12: decision logs retained ≥ 6 months; PDPL Art. 14 DSAR; auditors get read-only access scoped by engagement.
Alert lifecycle
Alerts traverse a five-state lifecycle. Every state transition emits a metric for SLO compliance tracking.
SLO catalogue (P0)
| Service | Indicator | Target | Window | Owner |
|---|---|---|---|---|
| Platform (aggregate) | availability | ≥ 99.5% | 28d rolling | CTO |
| CHAT | availability | ≥ 99.9% | 28d | CTO |
| BRAIN search | availability | ≥ 99.5% | 28d | CDO |
| AUTH | availability | ≥ 99.95% | 28d | CSO |
| AI Gateway | availability | ≥ 99.9% | 28d | CTO |
| AI Gateway | latency p95 | ≤ 2 s | 28d | CTO |
| MCP Gateway | availability | ≥ 99.95% | 28d | CTO |
| MCP Gateway | write tool p95 | ≤ 1 s | 28d | CTO |
| GraphQL Router | latency p95 | ≤ 400 ms | 28d | CTO |
| Backup RPO | recovery point | ≤ 1 h | continuous | CTO |
| Backup RTO | recovery time | ≤ 4 h | continuous | CTO |
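The availability targets above translate directly into error budgets; e.g. 99.5% over 28 days allows roughly 201.6 minutes of downtime. A small sketch of the arithmetic (function names illustrative):

```python
def error_budget_minutes(target_pct: float, window_days: int = 28) -> float:
    """Allowed downtime minutes for an availability SLO over the window."""
    return window_days * 24 * 60 * (1 - target_pct / 100)

def burn_rate(observed_error_rate: float, target_pct: float) -> float:
    """Ratio of observed error rate to the allowed rate.
    > 1.0 means the budget is being consumed faster than the SLO permits."""
    allowed = 1 - target_pct / 100
    return observed_error_rate / allowed
```

At 99.95% (the AUTH and MCP Gateway targets) the same window leaves only about 20 minutes, which is why those services get the strictest burn-rate alerting.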
Functional Requirements
The CyberOS FR catalogue is being rebuilt one feature at a time via the open fr-author Agent Skill.
Previous FR enumerations were archived 2026-05-14 and are no longer reflected on this page. PRD/SRS narrative remains authoritative for the spec; specific FRs land here as they are re-authored.
Non-Functional Requirements
| NFR ID | Concern | Target | Measurement |
|---|---|---|---|
| N(FR pending) | Platform availability (28-day rolling) | ≥ 99.5% | SLO target · burn-rate alerts |
| N(FR pending) | CHAT availability | ≥ 99.9% | SLO |
| N(FR pending) | BRAIN search availability | ≥ 99.5% | SLO |
| N(FR pending) | Backup RPO | ≤ 1 h | scheduled backup audit |
| N(FR pending) | Backup RTO | ≤ 4 h | quarterly restore drill |
| N(FR pending) | Cross-region failover (P3) | ≤ 24 h | annual DR drill |
| N(FR pending) | SLO dashboard refresh latency | ≤ 60 s | monitor synthetic SLO breach |
| N(FR pending) | Log ingest end-to-end latency | ≤ 30 s p95 | synthetic log → query |
| N(FR pending) | Trace ingest end-to-end | ≤ 60 s p95 | synthetic trace |
| N(FR pending) | Log PII redaction recall | ≥ 99.5% | test set |
| N(FR pending) | Log PII redaction precision | ≥ 95% | test set |
| N(FR pending) | OBS plane availability | ≥ 99.5% | SLO (recursive) |
| N(FR pending) | Decision-log retention | ≥ 180 d | config audit · S3 lifecycle |
| N(FR pending) | Cross-tenant query leakage | = 0 | property-based test |
| N(FR pending) | OBS plane infra cost (P0) | ≤ $130/month | cost dashboard |
Dependencies
- Upstream: AUTH (tenant + scope verification) · BRAIN (audit-log surface) · S3 (Loki/Tempo storage) · LangSmith SaaS (AI traces).
- Emitters: every CyberOS service emits OTel into OBS: AUTH, AI, MCP, CHAT, BRAIN, Skill, … all 22 modules.
- Consumers: on-call ops · compliance · external auditors · CEO morning digest.
Compliance scope
| Regulation / standard | Article / clause | OBS feature |
|---|---|---|
| EU AI Act | Art. 12 – Logging | Decision-log retention ≥ 6 months; LangSmith trace per AI decision. |
| EU AI Act | Art. 13 – Transparency | Audit-log surface available to deployers (tenant admins). |
| EU AI Act | Art. 14 – Human oversight | Per-tenant alerting flags anomalous agent behaviour. |
| Vietnam PDPL | Art. 14 – DSAR | Per-subject log + decision export via the audit-log surface. |
| Vietnam Decree 13/2023 | Art. 17 – Processing log | Audit-log surface materialises the processing log for the regulator. |
| GDPR | Art. 30 – Records of processing | BRAIN audit chain + OBS audit-view = records of processing. |
| GDPR | Art. 32 – Security of processing | PII redaction on logs; tenant-scoped queries; mTLS to collectors. |
| GDPR | Art. 33 – Breach notification | Alert routing surfaces breaches; OBS provides the forensic timeline. |
| ISO/IEC 27001:2022 | A.8.15 – Logging | Centralised structured logs; integrity via the BRAIN chain. |
| ISO/IEC 27001:2022 | A.8.16 – Monitoring activities | Per-module SLO + alert pipeline. |
| ISO/IEC 42001 (AIMS) | § 9.1 – Performance evaluation | LangSmith + AI Gateway metrics double as the AI-system performance KPIs. |
| SOC 2 Type II | CC7.2 – Monitoring controls | SLO dashboards · alert routing · audit-log retention. |
| SOC 2 Type II | CC7.3 – Detection | Alert Manager + on-call rotation. |
Risk entries
| ID | Risk | Likelihood | Impact | Owner | Mitigation |
|---|---|---|---|---|---|
| R-OBS-001 | PII leaks into Loki/Tempo via a missed redaction rule | Medium | High | CSO | Recall ≥ 99.5% gated in CI; quarterly red-team; opt-in encryption at rest for sensitive log streams. |
| R-OBS-002 | Cross-tenant log leakage via a crafted query | Low | Catastrophic | CSO | Query-proxy property-based test gate; tenant_id always injected from the JWT, never from user input. |
| R-OBS-003 | LangSmith outage blinds AI debugging | Medium | Medium | CTO | Local OTel trace mirror retained 7 d; LangSmith is for deep analysis, not the primary store. |
| R-OBS-004 | Alert fatigue (too many warnings) | High | Medium | CTO | Burn-rate alerting (Sloth) instead of static thresholds; quarterly alert review. |
| R-OBS-005 | S3 retention misconfig purges decision logs early | Low | High | CTO | Lifecycle policy declared in Terraform; CI gate verifies ≥ 180 d retention for the decision-log bucket. |
| R-OBS-006 | Grafana credential leak grants broad audit-log access | Low | High | CSO | Grafana auth via OIDC SSO; per-folder scope; auditors get time-bound access. |
| R-OBS-007 | Trace-id loss across async boundaries breaks the span tree | Medium | Low | CTO | OTel context propagation in every async runtime crate; CI test verifies multi-hop trace continuity. |
| R-OBS-008 | Prometheus disk full causes a metrics gap | Medium | Medium | CTO | 15 d retention with auto-eviction; alert on free disk < 30%; long-term storage in Mimir at P1+. |
| R-OBS-009 | OTel SDK version drift across modules | Medium | Low | CTO | Pin the SDK version in a shared crate/package; Renovate alerts on upstream releases. |
| R-OBS-010 | Cost pipeline mis-attributes spend to the wrong tenant | Medium | Medium | CFO | tenant_id required in every spend event; monthly reconciliation gate against the AWS bill. |
KPIs
| KPI | Formula | Source | Target |
|---|---|---|---|
| Platform availability (28d) | 1 − error_minutes / total_minutes | Prometheus | ≥ 99.5% |
| SLO dashboard freshness | last_scrape_age | Prometheus | ≤ 60 s |
| Log ingest p95 latency | histogram | collector | ≤ 30 s |
| PII redaction recall | TP / (TP + FN) | CI gate | ≥ 99.5% |
| Cross-tenant query rejections | count | query_proxy | tracked; 0 successful breaches |
| Alert false-positive rate | FP / (FP + TP) | weekly review | ≤ 20% |
| MTTR (critical) | resolved_at − fired_at | PagerDuty | ≤ 60 min |
| Error-budget remaining (per SLO) | 1 − burned / budget | SLO engine | > 0 throughout window |
| Decision-log retention compliance | days_retained | S3 lifecycle | ≥ 180 d |
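The redaction recall/precision KPIs feed a CI gate over a labelled test set. A minimal sketch of that computation (`redaction_metrics` and `gate` are illustrative names, not the actual gate implementation):

```python
def redaction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Recall = TP / (TP + FN); precision = TP / (TP + FP), per the KPI table."""
    return {
        "recall": tp / (tp + fn) if tp + fn else 1.0,
        "precision": tp / (tp + fp) if tp + fp else 1.0,
    }

def gate(tp: int, fp: int, fn: int) -> bool:
    """CI gate: recall >= 99.5% AND precision >= 95% (the NFR targets)."""
    m = redaction_metrics(tp, fp, fn)
    return m["recall"] >= 0.995 and m["precision"] >= 0.95
```

The asymmetric thresholds reflect the risk model: a missed entity (recall) is a PII leak, while an over-redaction (precision) only costs debuggability.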
RACI matrix
| Activity | CEO | CTO | CSO | CDO | CFO | DPO |
|---|---|---|---|---|---|---|
| Stack design + deployment | I | A/R | C | C | I | I |
| SLO definition | A | R | C | C | I | I |
| Alert rule maintenance | I | A/R | C | I | I | I |
| PII redaction rule maintenance | I | C | C | A/R | I | C |
| On-call rotation | I | A/R | C | I | I | I |
| Cost pipeline + reconciliation | I | C | I | I | A/R | I |
| Audit-log surface design | I | C | C | C | I | A/R |
| Compliance review (AI Act, PDPL) | I | C | C | C | I | A/R |
Planned CLI surface
Operator CLI cyberos-obs plus standard Grafana + Loki + Prom CLIs.
1. Tail logs for a tenant
$ cyberos-obs logs tail --tenant acme --service auth --since 5m
2026-05-14T07:19:02Z INFO auth login_attempt subject=[REDACTED:email] trace=t_3ab9
2026-05-14T07:19:02Z INFO auth login_success aal=aal3 trace=t_3ab9
2026-05-14T07:19:03Z INFO rbac check action=brain.put decision=allow trace=t_3ab9
…
2. SLO status
$ cyberos-obs slo status
SERVICE INDICATOR TARGET CURRENT BUDGET BURN
platform availability 99.5% 99.94% 99% 0.06× (28d)
chat availability 99.9% 99.97% 71% 1.2× (warning)
auth availability 99.95% 100% 100% 0×
ai-gateway latency p95 2 s 1.4 s ok ✓
brain-search availability 99.5% 99.99% 99% 0×
mcp-gateway write p95 1 s 0.42 s ok ✓
graphql-router latency p95 400 ms 280 ms ok ✓
3. Active alerts
$ cyberos-obs alerts active
ALERT SEVERITY STARTED STATUS
ChatErrorBudgetBurnFast warning 5m ago firing
AIProviderLatencyHigh info 12m ago firing
S3LifecycleStaleConfig (cost-bucket) info 2h ago silenced
4. Per-tenant cost MTD
$ cyberos-obs cost mtd --tenant acme
TENANT: acme
MONTH: 2026-05
─────────────────────────────────────
Infra: $182.40
Fargate (chat) $52.10
Fargate (auth) $48.20
RDS Postgres $42.10
S3 storage $24.00
Other $16.00
AI: $97.42 (cap $150 · 64.9%)
Storage: $24.00
─────────────────────────────────────
TOTAL: $303.82
5. Trace lookup by id
$ cyberos-obs trace get t_3ab9c8d4
trace_id: t_3ab9c8d4
duration: 412 ms
spans:
apollo-router sendMessage(graphql) 412 ms
├─ auth RBAC.Check 8 ms
├─ chat CreateMessage 286 ms
│  ├─ brain put_message 12 ms
│  └─ ai-gateway summariseSync 260 ms
│     ├─ tenant_policy 3 ms
│     ├─ redactor 2 ms
│     └─ bedrock invoke 254 ms
└─ chat FanoutMentions 14 ms
6. Audit-log query (compliance)
$ cyberos-obs audit query --since 2026-04-01 --action 'brain.delete' --format jsonl
{"seq":12031,"action":"brain.delete","actor":"stephen@…","mode":"tombstone","path":"memories/…","ts":"…"}
{"seq":12102,"action":"brain.delete","actor":"dpo@…","mode":"purge","reason":"DSAR-2026-014","path":"memories/…","ts":"…"}
…
[query] 47 rows · chain integrity verified
7. SLO definition (YAML)
# cyberos-obs/slo/ai-gateway-latency.yml
slo:
  id: ai-gateway-latency-p95
  service: ai-gateway
  indicator: latency_p95
  target: 2.0 # seconds
  window: 28d
  alerts:
    burn_rate_fast:
      severity: critical
      route: pagerduty
      threshold: 2.0 # 2x burn over 1h
    burn_rate_slow:
      severity: warning
      route: chat
      threshold: 1.0 # 1x burn over 6h
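SLO-as-code means a definition like the one above compiles into Prometheus recording and alerting rules. As a hedged sketch only (the rule names, recording-rule metric names, and the 0.5% error budget below are assumptions for illustration, not actual Sloth output), the generated multiwindow burn-rate alerts might look roughly like:

```yaml
# Illustrative only; real Sloth output uses its own naming scheme.
groups:
  - name: slo-ai-gateway-latency-p95-alerts
    rules:
      - alert: AIGatewayLatencyP95BurnRateFast
        # fast burn: 2x the allowed error rate over a short window
        expr: slo:sli_error:ratio_rate1h{slo="ai-gateway-latency-p95"} > (2.0 * 0.005)
        for: 5m
        labels: {severity: critical, route: pagerduty}
      - alert: AIGatewayLatencyP95BurnRateSlow
        # slow burn: 1x the allowed error rate over a long window
        expr: slo:sli_error:ratio_rate6h{slo="ai-gateway-latency-p95"} > (1.0 * 0.005)
        for: 30m
        labels: {severity: warning, route: chat}
```

Pairing a fast window (pages a human) with a slow window (posts to chat) is what keeps burn-rate alerting both sensitive and low-noise.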
Phase status & estimates
| Capability | Status |
|---|---|
| OTel Collector + LGTM backends | planned · P0 |
| PII redaction processor | planned · P0 |
| Tenant-tag processor | planned · P0 |
| tenant_query_proxy (Rust) | planned · P0 |
| Grafana dashboards (per-module SLO) | planned · P0 |
| Per-tenant cost dashboards | planned · P0 |
| Alert Manager + PagerDuty routing | planned · P0 |
| Audit-log surface (read-only) | planned · P0 |
| LangSmith integration | planned · P0 |
| SLO-as-code (Sloth-style) | planned · P0 |
| Auto-pause feature flags on burn | planned · P1 |
| Mimir for 1y metric retention | planned · P1+ |
| Multi-region active-active | planned · P3+ |
References
- PRD §8.7 – Observability plane architecture.
- PRD §9.9 – (FR pending) through (FR pending) (PRD-tier).
- PRD §11.2.2 – Reliability NFRs (REL-001 through REL-008).
- SRS §4.9 – Formal (FR pending) catalogue with verification methods.
- EU AI Act (Reg. 2024/1689) – Art. 12 logging, Art. 13 transparency, Art. 14 human oversight.
- ISO/IEC 27001:2022 – A.8.15 logging, A.8.16 monitoring activities.
- ISO/IEC 42001 (AIMS) – § 9.1 performance evaluation.
- OpenTelemetry – specification + Rust + Python SDKs.
- Grafana Loki + Tempo + Mimir – upstream stack.
- LangSmith – managed AI-trace observability.
- Sloth – SLO-as-code engine (Prometheus rule generator).
- W3C TraceContext – propagation spec.
- Architecture context: infrastructure.html#obs.