
PromQL Guide

This guide teaches you how to write PromQL queries from scratch. You’ll learn how to select metrics, filter data, calculate rates, and build useful queries for dashboards and alerts.

For a quick syntax lookup, see the PromQL Reference.

The simplest PromQL query is just a metric name:

cpu_used

This returns the current value of cpu_used for every server in your infrastructure. Each result includes the metric value along with its labels - key-value pairs that identify where the metric comes from.

A typical result might look like:

cpu_used{hostname="web-1", datacenter="paris"} 45.2
cpu_used{hostname="web-2", datacenter="paris"} 32.1
cpu_used{hostname="db-1", datacenter="london"} 78.5

Usually you don’t want all servers - you want specific ones. Use curly braces to filter by label values:

cpu_used{hostname="web-1"}

This returns only the CPU usage for web-1.

You can combine multiple filters. They work as AND conditions:

cpu_used{datacenter="paris", role="webserver"}

This returns CPU usage for all webservers in the Paris datacenter.

Sometimes you need more flexible filtering. PromQL supports four matching operators:

Exact match with =:

http_requests_total{status="200"}

Not equal with !=:

http_requests_total{status!="200"}

Regex match with =~:

# All 5xx errors
http_requests_total{status=~"5.."}
# Multiple values
cpu_used{hostname=~"web-1|web-2|web-3"}

Regex not match with !~:

# Exclude test servers
cpu_used{hostname!~"test-.*"}

Before going further, you need to understand the two main metric types:

Gauges measure a current value that can go up or down:

  • cpu_used - current CPU percentage
  • mem_used_bytes - current memory usage
  • temperature - current temperature

Counters count cumulative totals that only increase (or reset to zero):

  • http_requests_total - total requests since start
  • errors_total - total errors since start
  • bytes_sent_total - total bytes sent

This distinction matters because you query them differently.

Gauges are straightforward - the current value is meaningful on its own:

# Current CPU usage
cpu_used
# Current memory in use
mem_used_bytes
# Filter to high values
cpu_used > 80

To see how a gauge changed over time, use delta():

# Temperature change over the last hour
delta(temperature[1h])

Counter values alone aren’t useful - knowing you’ve served “5 million requests total” doesn’t tell you much. What matters is the rate of change.

Use rate() to calculate how fast a counter is increasing:

# Requests per second (averaged over 5 minutes)
rate(http_requests_total[5m])

The [5m] means “look at the last 5 minutes of data” to calculate the rate. This smooths out spikes and gives you a reliable per-second value.

Sometimes you want the total count over a period, not the per-second rate:

# Total requests in the last hour
increase(http_requests_total[1h])
# Total errors today
increase(errors_total[24h])

Often you have many time series and want to combine them.

# Total requests across all servers
sum(rate(http_requests_total[5m]))

This takes the request rate from each server and adds them together.

Use by to keep certain labels and aggregate everything else away:

# Total requests per endpoint (combine all servers)
sum by (endpoint) (rate(http_requests_total[5m]))
# Average CPU per datacenter
avg by (datacenter) (cpu_used)

Use without to do the opposite - aggregate away specific labels:

# Sum everything except hostname (effectively "per datacenter")
sum without (hostname) (rate(http_requests_total[5m]))

Find the highest or lowest values:

# Top 5 servers by CPU
topk(5, cpu_used)
# Busiest endpoints
topk(10, sum by (endpoint) (rate(http_requests_total[5m])))
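topk has a counterpart, bottomk, for the lowest values - handy for spotting servers running short of a resource:

# 5 servers with the least free disk space
bottomk(5, disk_free_bytes)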

You can perform arithmetic with metrics:

# Convert bytes to gigabytes
mem_used_bytes / 1024 / 1024 / 1024
# Calculate percentage
(mem_used_bytes / mem_total_bytes) * 100
# Difference between two metrics
disk_total_bytes - disk_free_bytes

Filter results based on conditions:

# Only show CPU above 80%
cpu_used > 80
# Servers with less than 10% disk free
(disk_free_bytes / disk_total_bytes) * 100 < 10

Use and, or, and unless to combine conditions:

# High CPU AND high memory (stressed servers)
(cpu_used > 80) and (mem_used_perc > 80)
# Either condition (any resource pressure)
(cpu_used > 90) or (mem_used_perc > 90)
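unless keeps results from the left side only when no matching series exists on the right. As a sketch, assuming a hypothetical maintenance_mode metric that is set to 1 on hosts under maintenance (the on (hostname) clause matches series by hostname only):

# High CPU, ignoring hosts in maintenance
(cpu_used > 90) unless on (hostname) (maintenance_mode == 1)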

See how metrics behaved over a time period:

# Average CPU over the last hour
avg_over_time(cpu_used[1h])
# Peak memory usage today
max_over_time(mem_used_perc[24h])
# Minimum disk space (to find low points)
min_over_time(disk_free_bytes[7d])

Use offset to look at historical data:

# CPU usage 1 hour ago
cpu_used offset 1h
# Compare current to yesterday
cpu_used - (cpu_used offset 24h)

Let’s build some real-world queries.

Calculate what percentage of requests are errors:

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

This divides error requests by total requests and multiplies by 100.

Calculate uptime percentage over the last day:

# Percentage of time service was up
avg_over_time(up[24h]) * 100

The up metric is 1 when a target is reachable, 0 when it’s not.
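That also makes up useful on its own for listing targets that are currently unreachable:

# Targets that are down right now
up == 0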

Predict when disk will be full based on current growth:

# Seconds until disk is full (if growth continues)
disk_free_bytes / rate(disk_used_bytes[1h])

Or use the built-in prediction function:

# Predicted disk usage in 7 days
predict_linear(disk_used_bytes[1h], 7 * 24 * 3600)
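predict_linear pairs naturally with a comparison when alerting - for example, flagging disks projected to exceed capacity within a week (assuming a disk_total_bytes metric, as used earlier):

# Disks predicted to be full within 7 days
predict_linear(disk_used_bytes[1h], 7 * 24 * 3600) > disk_total_bytes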

If you have histogram metrics, calculate percentiles:

# 95th percentile response time
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

The le label (less than or equal) is required for histogram calculations.
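To keep percentiles broken out by another label, include it alongside le in the by clause:

# 95th percentile response time per endpoint
histogram_quantile(0.95,
sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)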

Find servers under heavy load:

# Servers with high CPU for a sustained period
avg_over_time(cpu_used[15m]) > 85

A few common mistakes to avoid. Never graph or alert on a raw counter - always take its rate:

# Wrong: raw counter value is meaningless
http_requests_total
# Right: calculate the rate
rate(http_requests_total[5m])

And choose a range window long enough relative to your scrape interval:

# Too short: volatile, may miss data points
rate(http_requests_total[30s])
# Better: smooths out noise
rate(http_requests_total[5m])

Apply label filters before expensive operations:

# Efficient: filter first, then aggregate
sum(rate(http_requests_total{status="500"}[5m]))
# Less efficient: rates and sums every series, then filters by value
sum(rate(http_requests_total[5m])) > 100

Avoid queries that return too many series:

# Potentially dangerous: returns every series
http_requests_total
# Safer: aggregate or filter
sum by (endpoint) (rate(http_requests_total[5m]))

Now that you understand PromQL basics, head to the PromQL Reference for a quick syntax lookup.