Runway Observability
Runway supports observability for a service by integrating with the monitoring stack. This includes both service-level metrics and load balancer observability across AWS, GKE, and Cloud Run environments.
Load Balancer Observability
Runway provides unified load balancer observability across all cloud environments using normalized `runway_lb_*` metrics.
Architecture
The observability pipeline uses provider-native exporters with OpenTelemetry normalization:
GCP (GKE & Cloud Run):
- Stackdriver Exporter deployed to designated clusters via ArgoCD
- Collects metrics from Cloud Load Balancing API
- OTel Gateway normalizes to the `runway_lb_*` schema

AWS:
- CloudWatch Exporter deployed to EKS clusters via ArgoCD
- Collects metrics from CloudWatch API
- OTel Gateway normalizes to the `runway_lb_*` schema
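The normalization step can be sketched with an OpenTelemetry Collector `metricstransform` processor. This is an illustrative fragment only: the processor choice and the source metric name (`aws_applicationelb_request_count_sum`) are assumptions, not the actual Runway gateway configuration.

```yaml
# Hypothetical OTel Gateway pipeline fragment (not the real Runway config).
processors:
  metricstransform:
    transforms:
      # Rename a CloudWatch-exporter metric into the normalized schema
      - include: aws_applicationelb_request_count_sum
        action: update
        new_name: runway_lb_request_count
```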
All metrics are exported to Mimir with `X-Scope-OrgID: runway`.
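As a sketch of how that tenant header is used, the snippet below builds a request against Mimir's Prometheus-compatible HTTP API. The base URL is a placeholder, not the real endpoint; only the header value (`runway`) comes from the text above.

```python
# Sketch: querying normalized metrics from Mimir's Prometheus-compatible API.
from urllib.parse import urlencode
from urllib.request import Request

def build_mimir_query(base_url: str, promql: str) -> Request:
    """Build an instant-query request scoped to the Runway tenant."""
    url = f"{base_url}/prometheus/api/v1/query?{urlencode({'query': promql})}"
    # The tenant is selected via the X-Scope-OrgID header.
    return Request(url, headers={"X-Scope-OrgID": "runway"})

req = build_mimir_query(
    "https://mimir.example.internal",  # placeholder endpoint
    'sum by (runtime) (rate(runway_lb_request_count{env="gprd"}[5m]))',
)
```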
Available Metrics

Core Metrics (All Runtimes)

| Metric | Description | Labels |
|---|---|---|
| `runway_lb_request_count` | Total requests to the load balancer | `runtime`, `env`, `load_balancer` (AWS) / `forwarding_rule_name` (GCP) |
| `runway_lb_backend_latency_milliseconds` | Backend response time | `runtime`, `env`, `load_balancer` (AWS) / `forwarding_rule_name` (GCP), `statistic` (AWS only: average/minimum/maximum) |
GCP-Specific Metrics (GKE & Cloud Run)

| Metric | Description | Labels |
|---|---|---|
| `runway_lb_backend_request_count` | Requests reaching backends | `runtime`, `env`, `forwarding_rule_name` |
| `runway_lb_total_latency_milliseconds` | End-to-end latency (proxy to client) | `runtime`, `env`, `forwarding_rule_name` |
AWS-Specific Metrics

| Metric | Description | Labels |
|---|---|---|
| `runway_lb_response_code_count` | Requests by HTTP status class | `runtime`, `env`, `load_balancer`, `response_code_class` (2xx/4xx/5xx) |
| `runway_lb_backend_latency_milliseconds` | Backend latency with statistics | `runtime`, `env`, `load_balancer`, `statistic` (average/minimum/maximum) |
Label Reference

| Label | Values | Description |
|---|---|---|
| `runtime` | `aws`, `gke`, `cloudrun` | Cloud environment |
| `env` | `gprd`, `gstg` | Runway environment |
| `load_balancer` | string | AWS ALB/NLB name (AWS only) |
| `forwarding_rule_name` | string | GCP forwarding rule name (GKE/Cloud Run) |
| `statistic` | `average`, `minimum`, `maximum` | Latency statistic (AWS only) |
| `response_code_class` | `2xx`, `4xx`, `5xx` | HTTP status code class (AWS only) |
Query Examples

Cross-Cloud Queries

Request rate by runtime:

```promql
sum by (runtime) (rate(runway_lb_request_count{env="gprd"}[5m]))
```

Backend latency p99 across all clouds:

```promql
histogram_quantile(
  0.99,
  sum by (le, runtime) (
    rate(runway_lb_backend_latency_milliseconds_bucket{env="gprd"}[5m])
  )
)
```

Request drop rate by runtime (GCP only; AWS does not have `runway_lb_backend_request_count`):

```promql
sum by (runtime) (
  rate(runway_lb_request_count{env="gprd", runtime=~"gke|cloudrun"}[5m])
)
- sum by (runtime) (
  rate(runway_lb_backend_request_count{env="gprd", runtime=~"gke|cloudrun"}[5m])
)
```

AWS-Specific Queries
Request rate by HTTP status class:

```promql
sum by (response_code_class) (
  rate(runway_lb_response_code_count{runtime="aws", env="gprd"}[5m])
)
```

Backend latency statistics:

```promql
runway_lb_backend_latency_milliseconds{runtime="aws", env="gprd", statistic=~"average|maximum"}
```

GCP-Specific Queries
Total latency p99 (GKE):

```promql
histogram_quantile(
  0.99,
  sum by (le, forwarding_rule_name) (
    rate(runway_lb_total_latency_milliseconds_bucket{runtime="gke", env="gprd"}[5m])
  )
)
```

Backend request count (Cloud Run):

```promql
sum by (forwarding_rule_name) (
  rate(runway_lb_backend_request_count{runtime="cloudrun", env="gprd"}[5m])
)
```

Dashboards
Pre-built dashboards are available in Grafana:
- Runway Load Balancer Metrics - Main - Unified cross-cloud dashboard with runtime comparison
- Runway Load Balancer Metrics - EKS - AWS ALB/NLB specific metrics
- Runway Load Balancer Metrics - GKE - GKE load balancer metrics
- Runway Load Balancer Metrics - CloudRun - Cloud Run load balancer metrics
Service Observability (Cloud Run only)
Service-level observability via the metrics catalog is currently supported for Cloud Run services only. For GKE and EKS services, use the Load Balancer Observability metrics above.
A service catalog entry is a prerequisite for service observability. Follow these steps:
- Create a new entry in the service catalog in the expected format, e.g. `my_service`.
- Create a new entry in the metrics catalog, e.g. `metrics-catalog/services/my-service.jsonnet`:

```jsonnet
local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='my_service',
    team='my_team',
  )
)
```

- Run `make generate` and commit the autogenerated content.
After approval and merge, you can view the newly generated service overview dashboard.
Metrics
By default, metrics are reported for a service even if a service catalog entry does not exist yet. Optionally, you can report custom metrics for a service.
Default

Default metrics are reported under the `stackdriver_cloud_run_*` metric namespace in Mimir, e.g.:

```promql
stackdriver_cloud_run_revision_run_googleapis_com_request_count{job="runway-exporter", env="gprd", service_name="my_service"}
```

To learn more, refer to the documentation.
Custom Metrics

Cloud Run

Custom metrics can be reported using the Prometheus text-based exposition format. When scrape targets are present, Runway will deploy a sidecar OpenTelemetry Collector container preconfigured to automatically scrape the ingress container at your specified port(s). To enable, add the configuration:

```yaml
# omitted for brevity
spec:
  observability:
    scrape_targets:
      - "localhost:8082"
    metrics_path: "/foo"  # defaults to /metrics
```

These custom metrics will be available under the Mimir - Runway data source in Grafana.
To learn more, refer to the documentation.
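To illustrate what the sidecar would scrape, here is a minimal stdlib-only `/metrics` endpoint emitting the Prometheus text exposition format. The metric name, port, and handler are illustrative assumptions, not Runway requirements.

```python
# Sketch: a minimal /metrics endpoint in Prometheus text exposition format,
# scrapable at localhost:8082. Metric name and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # counter updated by your application code

def render_metrics() -> str:
    """Render counters in Prometheus text exposition format."""
    return (
        "# HELP my_service_requests_total Total requests handled.\n"
        "# TYPE my_service_requests_total counter\n"
        f"my_service_requests_total {REQUESTS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run locally:
# HTTPServer(("localhost", 8082), MetricsHandler).serve_forever()
```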
Dashboards
By default, a dashboard is generated for a service with the following service overview panels:
- Default SLIs (e.g. `runway_ingress`)
- Default Saturation Details (e.g. `runway_container_memory_utilization`)
The dashboard is checked into version control and can be extended with custom SLIs. Optionally, you can use the general Runway Service Metrics dashboard.
To learn more, refer to the documentation.
Alerts
By default, alerts are generated for a service with the following SLOs:
- Apdex SLO violation
- Error SLO violation
- Traffic absent SLO violation
To override the default configuration, set the following fields in the metrics catalog entry:

| Option | Description | Default |
|---|---|---|
| `apdexSatisfiedThreshold` | Alter the expected request latency of the Runway service | 1024 ms |
| `apdexScore` | Alter the apdex threshold for the Runway service | 0.999 |
| `errorScore` | Alter how many errors are tolerated for the Runway service | 0.999 |
For routing, you must specify a valid team in the metrics catalog entry.
To learn more, refer to the documentation.
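As a sketch, the overrides above could be passed alongside `type` and `team` in the metrics catalog entry. The exact field placement inside `runwayArchetype` is an assumption; check the archetype's signature before copying this.

```jsonnet
// Illustrative only: field placement inside runwayArchetype is assumed.
local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='my_service',
    team='my_team',
    // Tolerate slower requests and a slightly lower success ratio
    apdexSatisfiedThreshold='2048ms',
    apdexScore=0.995,
    errorScore=0.995,
  )
)
```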
Runway application logs are available in Grafana via ClickHouse. You can query logs by filtering on `ServiceName` (your `runway_service_id`). Please refer to the Logging documentation here.
How this works in practice - Guided Example
Secret Detection Service (Runway-managed) is in the service catalog and metrics catalog. Two example questions:
- Are the alerts listed here the ones that are monitored for SLO violations? If yes, then what are the alerts defined in https://alerts.gitlab.net?
- Since the service borrows Runway’s SLI defaults, does it mean that in the event of an SLO violation, an incident issue is raised with a severity S4? If not, what triggers an active incident issue and a pager duty alert?
- Alerts will be present for all SLO violations. To see which alert rules have been added, use the alerts page in Grafana and filter for the service in the search box.
- They will not show up in Alertmanager (alerts.gitlab.net) until an alert fires.
- Alertmanager then decides what to do with the alert: post it to Slack, page an SRE via PagerDuty, etc.
- Because the default SLIs have S4 severity, they will not page for an SLO violation; they are only reported to Slack for now. You can reroute those alerts to an alert channel of your choice: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md
- Incident issues are usually raised by the on-call SRE when they are paged. But feel free to raise one yourself if you need assistance. Read more about reporting an incident in the handbook.
Runway has all environments in one tenant; they are separated with the `environment` label, e.g. `environment="gprd"` or `environment="gstg"`. The `stage` label differentiates canary from the main stage, so it can be `stage="main"` or `stage="cny"`. However, Runway services do not have a canary stage, so in practice it will always be `stage="main"`.
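Putting those labels together, a query scoped to production, main-stage series would filter on both. The metric name here is illustrative, not a real Runway metric:

```promql
# Production, main stage only; Runway services have no canary stage.
rate(my_service_requests_total{environment="gprd", stage="main"}[5m])
```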
All alerts go to the #feed_alerts-general channel by default, which makes that channel very noisy. It is therefore highly recommended to route the alerts you're interested in to a channel that you will monitor. Only alerts marked as S1 or S2 will page the on-call SRE, so setting that severity on your service would enable paging. Before doing that, the service should go through a readiness review: https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/