Runway Observability
Runway supports observability for a service by integrating with the monitoring stack. This includes both service-level metrics and load balancer observability across AWS, GKE, and Cloud Run environments.
Load Balancer Observability
Runway provides unified load balancer observability across all cloud environments using normalized `runway_lb_*` metrics.
Architecture
The observability pipeline uses provider-native exporters with OpenTelemetry normalization:
GCP (GKE & Cloud Run):
- Stackdriver Exporter deployed to designated clusters via ArgoCD
- Collects metrics from Cloud Load Balancing API
- OTel Gateway normalizes to the `runway_lb_*` schema

AWS:
- CloudWatch Exporter deployed to EKS clusters via ArgoCD
- Collects metrics from CloudWatch API
- OTel Gateway normalizes to the `runway_lb_*` schema
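The normalization step can be sketched with an OpenTelemetry Collector `metricstransform` processor. This is an illustrative fragment only: the processor choice and the source metric name (`aws_applicationelb_request_count_sum`) are assumptions, not the actual Runway gateway configuration.

```yaml
# Hypothetical OTel Gateway pipeline fragment (not the real Runway config).
processors:
  metricstransform:
    transforms:
      # Rename a CloudWatch-exporter metric into the normalized schema
      - include: aws_applicationelb_request_count_sum
        action: update
        new_name: runway_lb_request_count
```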
All metrics are exported to Mimir with `X-Scope-OrgID: runway`.
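As a sketch of how that tenant header is used, the snippet below builds a request against Mimir's Prometheus-compatible HTTP API. The base URL is a placeholder, not the real endpoint; only the header value (`runway`) comes from the text above.

```python
# Sketch: querying normalized metrics from Mimir's Prometheus-compatible API.
from urllib.parse import urlencode
from urllib.request import Request

def build_mimir_query(base_url: str, promql: str) -> Request:
    """Build an instant-query request scoped to the Runway tenant."""
    url = f"{base_url}/prometheus/api/v1/query?{urlencode({'query': promql})}"
    # The tenant is selected via the X-Scope-OrgID header.
    return Request(url, headers={"X-Scope-OrgID": "runway"})

req = build_mimir_query(
    "https://mimir.example.internal",  # placeholder endpoint
    'sum by (runtime) (rate(runway_lb_request_count{env="gprd"}[5m]))',
)
```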
Available Metrics

Core Metrics (All Runtimes)

| Metric | Description | Labels |
|---|---|---|
| `runway_lb_request_count` | Total requests to the load balancer | `runtime`, `env`, `load_balancer` (AWS) / `forwarding_rule_name` (GCP) |
| `runway_lb_backend_latency_milliseconds` | Backend response time | `runtime`, `env`, `load_balancer` (AWS) / `forwarding_rule_name` (GCP), `statistic` (AWS only: average/minimum/maximum) |
GCP-Specific Metrics (GKE & Cloud Run)

| Metric | Description | Labels |
|---|---|---|
| `runway_lb_backend_request_count` | Requests reaching backends | `runtime`, `env`, `forwarding_rule_name` |
| `runway_lb_total_latency_milliseconds` | End-to-end latency (proxy to client) | `runtime`, `env`, `forwarding_rule_name` |
AWS-Specific Metrics

| Metric | Description | Labels |
|---|---|---|
| `runway_lb_response_code_count` | Requests by HTTP status class | `runtime`, `env`, `load_balancer`, `response_code_class` (2xx/4xx/5xx) |
| `runway_lb_backend_latency_milliseconds` | Backend latency with statistics | `runtime`, `env`, `load_balancer`, `statistic` (average/minimum/maximum) |
Label Reference

| Label | Values | Description |
|---|---|---|
| `runtime` | `aws`, `gke`, `cloudrun` | Cloud environment |
| `env` | `gprd`, `gstg` | Runway environment |
| `load_balancer` | string | AWS ALB/NLB name (AWS only) |
| `forwarding_rule_name` | string | GCP forwarding rule name (GKE/Cloud Run) |
| `statistic` | `average`, `minimum`, `maximum` | Latency statistic (AWS only) |
| `response_code_class` | `2xx`, `4xx`, `5xx` | HTTP status code class (AWS only) |
Query Examples

Cross-Cloud Queries

Request rate by runtime:

```promql
sum by (runtime) (rate(runway_lb_request_count{env="gprd"}[5m]))
```

Backend latency p99 across all clouds:

```promql
histogram_quantile(
  0.99,
  sum by (le, runtime) (
    rate(runway_lb_backend_latency_milliseconds_bucket{env="gprd"}[5m])
  )
)
```

Request drop rate by runtime (GCP only; AWS does not have `runway_lb_backend_request_count`):

```promql
sum by (runtime) (
  rate(runway_lb_request_count{env="gprd", runtime=~"gke|cloudrun"}[5m])
)
- sum by (runtime) (
  rate(runway_lb_backend_request_count{env="gprd", runtime=~"gke|cloudrun"}[5m])
)
```

AWS-Specific Queries
Request rate by HTTP status class:

```promql
sum by (response_code_class) (
  rate(runway_lb_response_code_count{runtime="aws", env="gprd"}[5m])
)
```

Backend latency statistics:

```promql
runway_lb_backend_latency_milliseconds{runtime="aws", env="gprd", statistic=~"average|maximum"}
```

GCP-Specific Queries
Total latency p99 (GKE):

```promql
histogram_quantile(
  0.99,
  sum by (le, forwarding_rule_name) (
    rate(runway_lb_total_latency_milliseconds_bucket{runtime="gke", env="gprd"}[5m])
  )
)
```

Backend request count (Cloud Run):

```promql
sum by (forwarding_rule_name) (
  rate(runway_lb_backend_request_count{runtime="cloudrun", env="gprd"}[5m])
)
```

Dashboards
Pre-built dashboards are available in Grafana:
- Runway Load Balancer Metrics - Main - Unified cross-cloud dashboard with runtime comparison
- Runway Load Balancer Metrics - EKS - AWS ALB/NLB specific metrics
- Runway Load Balancer Metrics - GKE - GKE load balancer metrics
- Runway Load Balancer Metrics - CloudRun - Cloud Run load balancer metrics
Service Observability (Cloud Run only)
Service-level observability via the metrics catalog is currently supported for Cloud Run services only. For GKE and EKS services, use the Load Balancer Observability metrics above.
A service catalog entry is a prerequisite for service observability. Follow these steps:
- Create a new entry in the service catalog in the expected format, e.g. `my_service`.
- Create a new entry in the metrics catalog, e.g. `metrics-catalog/services/my-service.jsonnet`:

```jsonnet
local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='my_service',
    team='my_team',
  )
)
```

- Run `make generate` and commit the autogenerated content.
After approval and merge, you can view the newly generated service overview dashboard.
Metrics
By default, metrics are reported for a service even if a service catalog entry does not exist yet. Optionally, you can report custom metrics for a service.
Default

Default metrics are reported under the `stackdriver_cloud_run_*` metric namespace in Mimir, e.g.:

```promql
stackdriver_cloud_run_revision_run_googleapis_com_request_count{job="runway-exporter", env="gprd", service_name="my_service"}
```

To learn more, refer to the documentation.
Custom Metrics

Cloud Run

Custom metrics can be reported using the Prometheus text-based exposition format. When scrape targets are present, Runway will deploy a sidecar OpenTelemetry Collector container preconfigured to automatically scrape the ingress container at your specified port(s). To enable, add the configuration:

```yaml
# omitted for brevity
spec:
  observability:
    scrape_targets:
      - "localhost:8082"
    metrics_path: "/foo"  # defaults to /metrics
```

These custom metrics will be available under the Mimir - Runway data source in Grafana.
To learn more, refer to the documentation.
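To illustrate what the sidecar would scrape, here is a minimal stdlib-only `/metrics` endpoint emitting the Prometheus text exposition format. The metric name, port, and handler are illustrative assumptions, not Runway requirements.

```python
# Sketch: a minimal /metrics endpoint in Prometheus text exposition format,
# scrapable at localhost:8082. Metric name and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # counter updated by your application code

def render_metrics() -> str:
    """Render counters in Prometheus text exposition format."""
    return (
        "# HELP my_service_requests_total Total requests handled.\n"
        "# TYPE my_service_requests_total counter\n"
        f"my_service_requests_total {REQUESTS_TOTAL}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run locally:
# HTTPServer(("localhost", 8082), MetricsHandler).serve_forever()
```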
Dashboards
By default, a dashboard is generated for a service with the following service overview panels:
- Default SLIs (e.g. `runway_ingress`)
- Default Saturation Details (e.g. `runway_container_memory_utilization`)
The dashboard is checked into version control and can be extended with custom SLIs. Optionally, you can use the general Runway Service Metrics dashboard.
To learn more, refer to the documentation.
Alerts
By default, alerts are generated for a service with the following SLOs:
- Apdex SLO violation
- Error SLO violation
- Traffic absent SLO violation
To override the default configuration, set the following fields in the metrics catalog entry:

| Option | Description | Default |
|---|---|---|
| `apdexSatisfiedThreshold` | Alter the expected request latency of the Runway service | 1024 ms |
| `apdexScore` | Alter the apdex threshold for the Runway service | 0.999 |
| `errorScore` | Alter how many errors are tolerated for the Runway service | 0.999 |
For routing, you must specify a valid team in the metrics catalog entry.
To learn more, refer to the documentation.
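As a sketch, the overrides above could be passed alongside `type` and `team` in the metrics catalog entry. The exact field placement inside `runwayArchetype` is an assumption; check the archetype's signature before copying this.

```jsonnet
// Illustrative only: field placement inside runwayArchetype is assumed.
local runwayArchetype = import 'service-archetypes/runway-archetype.libsonnet';
local metricsCatalog = import 'servicemetrics/metrics.libsonnet';

metricsCatalog.serviceDefinition(
  runwayArchetype(
    type='my_service',
    team='my_team',
    // Tolerate slower requests and a slightly lower success ratio
    apdexSatisfiedThreshold='2048ms',
    apdexScore=0.995,
    errorScore=0.995,
  )
)
```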
Runway application logs are available in Grafana via ClickHouse. You can query logs by filtering on `ServiceName` (your `runway_service_id`). Please refer to the Logging documentation here.
How this works in practice - Guided Example
Secret Detection Service (Runway-managed) is in the service catalog and metrics catalog. Two example questions:
- Are the alerts listed here the ones that are monitored for SLO violations? If yes, then what are the alerts defined in https://alerts.gitlab.net?
- Since the service borrows Runway’s SLI defaults, does it mean that in the event of an SLO violation, an incident issue is raised with a severity S4? If not, what triggers an active incident issue and a pager duty alert?
- Alerts will be present for all SLO violations. To see which alert rules have been added, use the alerts page in Grafana and filter for the service in the search box.
- They will not show up in Alertmanager (alerts.gitlab.net) until an alert fires.
- Alertmanager then decides what to do with the alert: post it to Slack, page an SRE via PagerDuty, etc.
- Because the default SLIs have S4 severity, they will not page for an SLO violation; they are only reported to Slack for now. You can reroute those alerts to an alert channel of your choice: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md
- Incident issues are usually raised by the on-call SRE when they are paged. But feel free to raise one yourself if you need assistance. Read more about reporting an incident in the handbook.
Runway has all environments in one tenant; they are separated with the `environment` label, e.g. `environment="gprd"` or `environment="gstg"`. The `stage` label differentiates canary from the main stage, so it can be `stage="main"` or `stage="cny"`. However, Runway services do not have a canary stage, so in practice it will always be `stage="main"`.
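Putting those labels together, a query scoped to production, main-stage series would filter on both. The metric name here is illustrative, not a real Runway metric:

```promql
# Production, main stage only; Runway services have no canary stage.
rate(my_service_requests_total{environment="gprd", stage="main"}[5m])
```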
All alerts go to the #feed_alerts-general channel by default, which makes that channel very noisy. It is therefore highly recommended to route the alerts you're interested in to a channel that you will monitor. Only alerts marked as S1 or S2 will page the on-call SRE, so setting that severity on your service would enable paging. Before doing that, the service should go through a readiness review: https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/