
PromQL Guide

This guide teaches you how to write PromQL queries from scratch. You’ll learn how to select metrics, filter data, calculate rates, and build useful queries for dashboards and alerts.

For a quick syntax lookup, see the PromQL Reference.

The simplest PromQL query is just a metric name:

cpu_used

This returns the current value of cpu_used for every server in your infrastructure. Each result includes the metric value along with its labels - key-value pairs that identify where the metric comes from.

A typical result might look like:

cpu_used{hostname="web-1", datacenter="paris"} 45.2
cpu_used{hostname="web-2", datacenter="paris"} 32.1
cpu_used{hostname="db-1", datacenter="london"} 78.5

Usually you don’t want all servers - you want specific ones. Use curly braces to filter by label values:

cpu_used{hostname="web-1"}

This returns only the CPU usage for web-1.

You can combine multiple filters. They work as AND conditions:

cpu_used{datacenter="paris", role="webserver"}

This returns CPU usage for all webservers in the Paris datacenter.

Sometimes you need more flexible filtering. PromQL supports four matching operators:

Exact match with =:

http_requests_total{status="200"}

Not equal with !=:

http_requests_total{status!="200"}

Regex match with =~:

# All 5xx errors
http_requests_total{status=~"5.."}
# Multiple values
cpu_used{hostname=~"web-1|web-2|web-3"}

Regex not match with !~:

# Exclude test servers
cpu_used{hostname!~"test-.*"}

Before going further, you need to understand the two main metric types:

Gauges measure a current value that can go up or down:

  • cpu_used - current CPU percentage
  • mem_used_bytes - current memory usage
  • temperature - current temperature

Counters count cumulative totals that only increase (or reset to zero):

  • http_requests_total - total requests since start
  • errors_total - total errors since start
  • bytes_sent_total - total bytes sent

This distinction matters because you query them differently.

Gauges are straightforward - the current value is meaningful on its own:

# Current CPU usage
cpu_used
# Current memory in use
mem_used_bytes
# Filter to high values
cpu_used > 80

To see how a gauge changed over time, use delta():

# Temperature change over the last hour
delta(temperature[1h])

Counter values alone aren’t useful - knowing you’ve served “5 million requests total” doesn’t tell you much. What matters is the rate of change.

Use rate() to calculate how fast a counter is increasing:

# Requests per second (averaged over 5 minutes)
rate(http_requests_total[5m])

The [5m] means “look at the last 5 minutes of data” to calculate the rate. This smooths out spikes and gives you a reliable per-second value.

Sometimes you want the total count over a period, not the per-second rate:

# Total requests in the last hour
increase(http_requests_total[1h])
# Total errors today
increase(errors_total[24h])

Often you have many time series and want to combine them.

# Total requests across all servers
sum(rate(http_requests_total[5m]))

This takes the request rate from each server and adds them together.

Use by to keep certain labels and aggregate everything else away:

# Total requests per endpoint (combine all servers)
sum by (endpoint) (rate(http_requests_total[5m]))
# Average CPU per datacenter
avg by (datacenter) (cpu_used)

Use without to do the opposite - aggregate away specific labels:

# Sum everything except hostname (effectively "per datacenter")
sum without (hostname) (rate(http_requests_total[5m]))

Find the highest or lowest values:

# Top 5 servers by CPU
topk(5, cpu_used)
# Busiest endpoints
topk(10, sum by (endpoint) (rate(http_requests_total[5m])))
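topk has a counterpart, bottomk, for the lowest values - handy for spotting servers running short of a resource:

# 5 servers with the least free disk space
bottomk(5, disk_free_bytes)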

You can perform arithmetic with metrics:

# Convert bytes to gigabytes
mem_used_bytes / 1024 / 1024 / 1024
# Calculate percentage
(mem_used_bytes / mem_total_bytes) * 100
# Difference between two metrics
disk_total_bytes - disk_free_bytes

Filter results based on conditions:

# Only show CPU above 80%
cpu_used > 80
# Servers with less than 10% disk free
(disk_free_bytes / disk_total_bytes) * 100 < 10

Use and, or, and unless to combine conditions:

# High CPU AND high memory (stressed servers)
(cpu_used > 80) and (mem_used_perc > 80)
# Either condition (any resource pressure)
(cpu_used > 90) or (mem_used_perc > 90)
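unless keeps results from the left side only when no matching series exists on the right. As a sketch, assuming a hypothetical maintenance_mode metric that is set to 1 on hosts under maintenance (the on (hostname) clause matches series by hostname only):

# High CPU, ignoring hosts in maintenance
(cpu_used > 90) unless on (hostname) (maintenance_mode == 1)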

See how metrics behaved over a time period:

# Average CPU over the last hour
avg_over_time(cpu_used[1h])
# Peak memory usage today
max_over_time(mem_used_perc[24h])
# Minimum disk space (to find low points)
min_over_time(disk_free_bytes[7d])

Use offset to look at historical data:

# CPU usage 1 hour ago
cpu_used offset 1h
# Compare current to yesterday
cpu_used - (cpu_used offset 24h)

Let’s build some real-world queries.

Calculate what percentage of requests are errors:

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

This divides error requests by total requests and multiplies by 100.

Calculate uptime percentage over the last day:

# Percentage of time service was up
avg_over_time(up[24h]) * 100

The up metric is 1 when a target is reachable, 0 when it’s not.
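That also makes up useful on its own for listing targets that are currently unreachable:

# Targets that are down right now
up == 0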

Predict when disk will be full based on current growth:

# Seconds until disk is full (if growth continues)
disk_free_bytes / rate(disk_used_bytes[1h])

Or use the built-in prediction function:

# Predicted disk usage in 7 days
predict_linear(disk_used_bytes[1h], 7 * 24 * 3600)
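predict_linear pairs naturally with a comparison when alerting - for example, flagging disks projected to exceed capacity within a week (assuming a disk_total_bytes metric, as used earlier):

# Disks predicted to be full within 7 days
predict_linear(disk_used_bytes[1h], 7 * 24 * 3600) > disk_total_bytes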

If you have histogram metrics, calculate percentiles:

# 95th percentile response time
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

The le label (less than or equal) is required for histogram calculations.
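To keep percentiles broken out by another label, include it alongside le in the by clause:

# 95th percentile response time per endpoint
histogram_quantile(0.95,
sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)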

Find servers under heavy load:

# Servers with high CPU for a sustained period
avg_over_time(cpu_used[15m]) > 85

A few common mistakes to avoid. Never graph or alert on a raw counter - always take its rate:

# Wrong: raw counter value is meaningless
http_requests_total
# Right: calculate the rate
rate(http_requests_total[5m])

And choose a range window long enough relative to your scrape interval:

# Too short: volatile, may miss data points
rate(http_requests_total[30s])
# Better: smooths out noise
rate(http_requests_total[5m])

Apply label filters before expensive operations:

# Efficient: filter first, then aggregate
sum(rate(http_requests_total{status="500"}[5m]))
# Less efficient: rates and sums every series, then filters by value
sum(rate(http_requests_total[5m])) > 100

Avoid queries that return too many series:

# Potentially dangerous: returns every series
http_requests_total
# Safer: aggregate or filter
sum by (endpoint) (rate(http_requests_total[5m]))

Now that you understand PromQL basics, head to the PromQL Reference for a quick syntax lookup.