Reliability report

30 days of real RelayOrb control-plane behavior

RelayOrb has been running in production continuously since Feb 28, 2026, 2:15 AM. This page reports system behavior, as measured by internal instrumentation, from 2026-05-02T00:00:00Z through 2026-06-01T23:59:59Z. The load is synthetic monitoring and control-plane traffic, not public user adoption, so the numbers reflect how the system performed under sustained automated load.

/stats.json /cost_profile.json

Generated Jun 2, 2026, 3:22 AM

Gateway uptime

100.00%

Gateway requests served

86,450

Gateway p95 latency

31.57 ms

Gateway + registry uptime

99.996%

Component status

Gateway

Production-ready

100.00% uptime over the 30-day window, 86,450 requests served, and p95 latency of 31.57 ms.

Registry

Production-ready

99.99% uptime over the 30-day window with 120,978 internal control-plane requests served.

Rag worker

Resolved configuration regression

Most relayorb-rag-prod 5xx responses came from internal GET /metrics scrapes, not public invoke traffic. The worker repeatedly failed startup because capability registration returned 403 Forbidden for the default compute identity instead of the allowed relayorb-rag-sa service account.

The headline numbers on this page focus on gateway and registry because those are the production-grade control-plane components. The worker remained deployed, but this 30-day snapshot still includes the pre-fix synthetic monitoring period when a startup identity regression inflated worker errors without representing public invoke traffic.

What this is

These numbers come from Cloud Monitoring and Cloud Logging on the surviving prod services: gateway, registry, and worker. Availability is computed as non-5xx requests divided by total requests. Latency comes from Cloud Run request latency percentiles, and utilization comes from the container utilization distributions. Because billing export was not enabled, the cost profile is reconstructed from Cloud Run billable instance time plus public Cloud Billing SKU prices.
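The availability definition above can be sketched in a few lines. This is a minimal illustration, not the report's actual pipeline. The request totals come from the service table on this page, but the raw 5xx counts are not published, so the counts used in the usage note (9 for the registry, 69,823 for the worker) are assumptions back-derived from the published 5xx rates.

```python
def availability_pct(total_requests: int, server_errors: int) -> float:
    """Return the percentage of requests that did not end in a 5xx response."""
    if total_requests <= 0:
        raise ValueError("availability is undefined for an empty window")
    if not 0 <= server_errors <= total_requests:
        raise ValueError("server_errors must be between 0 and total_requests")
    return 100.0 * (total_requests - server_errors) / total_requests


def combined_availability_pct(services: list[tuple[int, int]]) -> float:
    """Pool (total_requests, server_errors) pairs and compute one availability."""
    total = sum(t for t, _ in services)
    errors = sum(e for _, e in services)
    return availability_pct(total, errors)


# Request totals are from the report; the 5xx counts are ASSUMED values
# chosen to match the published 5xx rates.
GATEWAY = (86_450, 0)
REGISTRY = (120_978, 9)
WORKER = (86_436, 69_823)
```

Under those assumed counts, `availability_pct(*REGISTRY)` rounds to 99.99, `combined_availability_pct([GATEWAY, REGISTRY])` rounds to 99.996, and `availability_pct(*WORKER)` rounds to 19.22, which agrees with the figures shown on this page.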

The signal is still useful even though the traffic was internal. It shows how the healthy components behaved under constant heartbeat, scrape, and health traffic, and it also surfaced a real worker configuration regression. The previous deployment burned money by keeping warm services alive for synthetic traffic. The current deployment keeps the same endpoints live while letting the control plane sleep at zero traffic.

Architecture

Gateway stays public at the edge. Registry and worker sit behind the control plane. Registry owns capability state and heartbeats, and the gateway records invocation state in SQL-backed stores configured through secret-backed DATABASE_URL values.

Public gateway - private registry - worker - SQL state

Service performance

Raw per-service numbers stay visible here. Gateway and registry are the components represented by the headline reliability cards above. Worker metrics remain included for honesty, with the failed /metrics startup loop called out separately instead of averaged into the front-door story.

Service	Uptime	Requests	p50 / p95 / p99	CPU avg	Memory avg	5xx rate	Error logs
Gateway relayorb-gateway-prod	100.00%	86,450	8.51 / 31.57 / 32.37 ms	0.323%	3.130%	0.000%	0
Registry relayorb-registry-prod	99.99%	120,978	7.00 / 12.85 / 13.97 ms	0.092%	3.589%	0.007%	9
Worker relayorb-rag-prod	19.22%	86,436	5.27 / 9.58 / 10.06 ms	3.628%	1.262%	80.780%	79,081

Rag worker diagnosis

Most relayorb-rag-prod 5xx responses came from internal GET /metrics scrapes, not public invoke traffic.

The worker repeatedly failed startup because capability registration returned 403 Forbidden for the default compute identity instead of the allowed relayorb-rag-sa service account.

Cloud Run startup logs show failed readiness probes and worker registration errors for capability rag.search@v1 before later scrape retries succeeded.

Resolved 2026-06-02: relayorb-rag-prod now runs as relayorb-rag-sa@relayorb-prod.iam.gserviceaccount.com. The 30-day window on this page still reflects the pre-fix synthetic-monitoring period.

Result: the worker issue is real, but the 30-day 5xx volume mostly measures synthetic monitoring noise on the worker path, not public control-plane reliability.
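One way to catch this class of regression early is a startup guard that compares the runtime identity against the expected service account before attempting capability registration. The sketch below is hypothetical and is not the fix the RelayOrb team applied. It assumes the standard Google Cloud metadata server endpoint for the default service account email, and the function names are illustrative.

```python
import urllib.request

# Standard GCP metadata endpoint for the email of the identity the
# container is running as. Requires the Metadata-Flavor header.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)

# Expected identity, taken from the resolution note in this report.
EXPECTED_IDENTITY = "relayorb-rag-sa@relayorb-prod.iam.gserviceaccount.com"


def fetch_runtime_identity(timeout: float = 2.0) -> str:
    """Ask the metadata server which service account this instance runs as."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def is_allowed_identity(email: str, expected: str = EXPECTED_IDENTITY) -> bool:
    """Return True only when the runtime identity matches the expected account."""
    return email.strip().lower() == expected.lower()


def check_identity_or_exit() -> None:
    """Hypothetical guard: fail fast with a clear message instead of a 403 loop."""
    email = fetch_runtime_identity()
    if not is_allowed_identity(email):
        raise SystemExit(f"unexpected runtime identity {email!r}; refusing to register")
```

Calling `check_identity_or_exit()` at the top of worker startup would turn a repeated 403 registration loop into a single explicit failure. It only works inside a Google Cloud runtime, where the metadata server is reachable.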

Worker uptime

19.22%

Worker 5xx rate

80.78%

Worker error logs

79,081

Reference deployment

$256.62/month

Modeled from the 30-day pre-shutdown topology: four always-warm demo services, the prod worker, the prod metrics scraper, and the request-driven prod edge services. This is the cost profile of keeping services warm for internal traffic, and most of it is fixed spend on always-on instances.

Scale-to-zero current

$0.00/month fixed

The surviving prod services now run with minScale=0. With no traffic, fixed Cloud Run spend falls to zero and variable cost only appears when a real caller hits the gateway.
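The effect of the minimum instance count on fixed spend can be sketched as follows. The per-second rate is an assumption inferred from the instance-based rows of the reference cost breakdown in this report, not a quoted SKU price, and the function name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# ASSUMPTION: effective USD per billable instance-second, inferred from the
# instance-based rows of the cost table (for example 2,592,224 s -> $49.25).
ASSUMED_INSTANCE_RATE = Decimal("0.000019")
SECONDS_PER_DAY = 86_400


def fixed_monthly_floor(
    min_instances: int,
    days: int = 30,
    rate: Decimal = ASSUMED_INSTANCE_RATE,
) -> Decimal:
    """Cost of keeping min_instances warm for the whole period with zero traffic."""
    if min_instances < 0:
        raise ValueError("min_instances cannot be negative")
    warm_seconds = min_instances * days * SECONDS_PER_DAY
    return (rate * warm_seconds).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

With one warm instance the floor is about $49.25 for a 30-day month, which is close to each always-warm row in the cost table. With `min_instances=0` the floor is zero, and only traffic-driven instance time is billed.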

Reference cost breakdown

Cost is modeled from Cloud Monitoring billable instance time plus public Cloud Billing Catalog SKU prices for us-central1. Billing export was not enabled, so this is a reconstructed Cloud Run cost profile rather than an invoice export.

Service	Billing mode	Min instances	Requests	Billable seconds	Estimated monthly cost
Demo gateway relayorb-demo - relayorb-gateway-demo	instance-based	1	13,677	2,599,229	$49.39
Demo metrics scraper relayorb-demo - relayorb-metrics-scraper-demo	instance-based	1	0	2,591,637	$49.24
Demo worker relayorb-demo - relayorb-rag-demo	instance-based	1	5	2,592,224	$49.25
Demo registry relayorb-demo - relayorb-registry-demo	instance-based	1	129,492	2,592,582	$49.26
Gateway relayorb-prod - relayorb-gateway-prod	request-based	0	86,450	8,654	$0.22
Registry relayorb-prod - relayorb-registry-prod	request-based	0	120,978	12,215	$0.31
Worker relayorb-prod - relayorb-rag-prod	instance-based	0	86,436	511,309	$9.71
Prod metrics scraper relayorb-prod - relayorb-metrics-scraper-prod	instance-based	1	0	2,591,816	$49.24
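The table above can be approximately reproduced from billable seconds alone. The sketch below is a reconstruction under stated assumptions, not the report's actual cost model. The two effective per-second rates are inferred from the rows themselves (they are not stated in the source), and per-request charges are assumed to be excluded.

```python
from decimal import Decimal, ROUND_HALF_UP

# ASSUMED effective rates in USD per billable second, inferred from the table.
ASSUMED_RATES = {
    "instance-based": Decimal("0.000019"),
    "request-based": Decimal("0.00002525"),
}

# (service, billing mode, billable seconds) copied from the table.
ROWS = [
    ("relayorb-gateway-demo", "instance-based", 2_599_229),
    ("relayorb-metrics-scraper-demo", "instance-based", 2_591_637),
    ("relayorb-rag-demo", "instance-based", 2_592_224),
    ("relayorb-registry-demo", "instance-based", 2_592_582),
    ("relayorb-gateway-prod", "request-based", 8_654),
    ("relayorb-registry-prod", "request-based", 12_215),
    ("relayorb-rag-prod", "instance-based", 511_309),
    ("relayorb-metrics-scraper-prod", "instance-based", 2_591_816),
]


def estimated_cost(billing_mode: str, billable_seconds: int) -> Decimal:
    """Multiply billable seconds by the assumed rate and round to cents."""
    raw = ASSUMED_RATES[billing_mode] * billable_seconds
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def total_cost(rows) -> Decimal:
    """Sum the per-service estimates after rounding each one to cents."""
    return sum(
        (estimated_cost(mode, seconds) for _, mode, seconds in rows),
        Decimal("0.00"),
    )


if __name__ == "__main__":
    for service, mode, seconds in ROWS:
        print(f"{service:32s} {mode:15s} ${estimated_cost(mode, seconds)}")
    print(f"{'total':48s} ${total_cost(ROWS)}")
```

Under these assumed rates the per-row estimates match the table to the cent and sum to the $256.62 reference figure. If the real model includes request charges or different CPU and memory allocations, the rates would need to be adjusted.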