PromQL Guide
This guide teaches you how to write PromQL queries from scratch. You’ll learn how to select metrics, filter data, calculate rates, and build useful queries for dashboards and alerts.
For a quick syntax lookup, see the PromQL Reference.
Your First Query
The simplest PromQL query is just a metric name:
cpu_used
This returns the current value of cpu_used for every server in your infrastructure.
Each result includes the metric value along with its labels: key-value pairs that identify where the metric comes from.
A typical result might look like:
cpu_used{hostname="web-1", datacenter="paris"} 45.2
cpu_used{hostname="web-2", datacenter="paris"} 32.1
cpu_used{hostname="db-1", datacenter="london"} 78.5
Filtering with Labels
Usually you don’t want all servers, only specific ones. Use curly braces to filter by label values:
cpu_used{hostname="web-1"}
This returns only the CPU usage for web-1.
You can combine multiple filters. They work as AND conditions:
cpu_used{datacenter="paris", role="webserver"}
This returns CPU usage for all webservers in the Paris datacenter.
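The matching rules can be made concrete with a minimal Python model of how a selector picks series. This is a simplification, not Prometheus's implementation. Each series is a hypothetical `(labels, value)` pair, and the `role` label in the sample data is assumed, since the example output above does not show it. Python's `re` stands in for the RE2 engine Prometheus uses. The model also covers the `!=`, `=~`, and `!~` operators introduced in the next section.

```python
import re

# Hypothetical sample data: each series is (labels, current value).
SERIES = [
    ({"__name__": "cpu_used", "hostname": "web-1", "datacenter": "paris", "role": "webserver"}, 45.2),
    ({"__name__": "cpu_used", "hostname": "web-2", "datacenter": "paris", "role": "webserver"}, 32.1),
    ({"__name__": "cpu_used", "hostname": "db-1", "datacenter": "london", "role": "database"}, 78.5),
]


def matches(labels, name, op, pattern):
    """Check one label matcher against a series' labels."""
    value = labels.get(name, "")  # a missing label compares as an empty string
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # PromQL regexes must match the whole label value, so use fullmatch
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {op}")


def select(series, matchers):
    """Keep series for which every matcher holds (commas in {...} act as AND)."""
    return [
        (labels, value)
        for labels, value in series
        if all(matches(labels, name, op, pattern) for name, op, pattern in matchers)
    ]


# Models cpu_used{datacenter="paris", role="webserver"}
paris_web = select(SERIES, [("__name__", "=", "cpu_used"),
                            ("datacenter", "=", "paris"),
                            ("role", "=", "webserver")])
print([labels["hostname"] for labels, _ in paris_web])  # ['web-1', 'web-2']
```

Because regexes are fully anchored in this model, a pattern like `web` matches no hostname here. You would need `web-.*` instead.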
Beyond Exact Matches
Sometimes you need more flexible filtering. PromQL supports four matching operators:
Exact match with =:
http_requests_total{status="200"}
Not equal with !=:
http_requests_total{status!="200"}
Regex match with =~ (the regular expression must match the entire label value):
# All 5xx responses
http_requests_total{status=~"5.."}

# Multiple values
cpu_used{hostname=~"web-1|web-2|web-3"}
Regex not match with !~:
# Exclude test servers
cpu_used{hostname!~"test-.*"}
Understanding Counters vs Gauges
Before going further, you need to understand the two most common metric types:
Gauges measure a current value that can go up or down:
- cpu_used: current CPU percentage
- mem_used_bytes: current memory usage
- temperature: current temperature
Counters count cumulative totals that only increase (or reset to zero, for example when the process restarts):
- http_requests_total: total requests since start
- errors_total: total errors since start
- bytes_sent_total: total bytes sent since start
This distinction matters because you query them differently.
Working with Gauges
Gauges are straightforward because the current value is meaningful on its own:
# Current CPU usage
cpu_used

# Current memory in use
mem_used_bytes

# Filter to high values
cpu_used > 80
To see how a gauge changed over time, use delta():
# Temperature change over the last hour
delta(temperature[1h])
Working with Counters
Counter values alone aren’t useful: knowing you’ve served “5 million requests total” doesn’t tell you much. What matters is the rate of change.
Calculating Rates
Use rate() to calculate how fast a counter is increasing:
# Requests per second (averaged over 5 minutes)
rate(http_requests_total[5m])
The [5m] means “look at the last 5 minutes of data” to calculate the rate.
Averaging over the window smooths out short spikes, and rate() accounts for counter resets, so you get a reliable per-second value.
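The core of rate() can be sketched in Python. This is a simplified model under stated assumptions. The samples are hypothetical `(timestamp_seconds, value)` pairs. The sketch also omits the extrapolation to the window edges that Prometheus performs, so real results will differ slightly.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_s, value) samples, oldest first.

    Simplified: no extrapolation to the window boundaries, unlike Prometheus.
    """
    if len(samples) < 2:
        return None  # with fewer than two samples there is no rate to compute
    total_increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # A drop means the counter reset (e.g. a restart) and counted up from 0,
            # so the whole current value is new increase.
            total_increase += cur
        else:
            total_increase += cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return total_increase / elapsed


# Hypothetical 15s scrapes of http_requests_total, with a restart between t=30 and t=45
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))  # 100 requests over 60 s, about 1.67 per second
```

Subtracting the first sample from the last would give -60 here. The reset-aware sum gives the real increase of 100. Multiplying a rate by the window length in seconds gives the total increase, which is essentially what increase() returns.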
Total Increase
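Sometimes you want the total count over a period, not the per-second rate:
# Total requests in the last hour
increase(http_requests_total[1h])

# Total errors in the last 24 hours
increase(errors_total[24h])
Because Prometheus extrapolates to the edges of the window, increase() can return non-integer values even for counters that only count whole events.
Aggregating Data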
Often you have many time series and want to combine them.
Summing Across Series
# Total requests across all servers
sum(rate(http_requests_total[5m]))
This takes the request rate from each server and adds them together.
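The grouping behind sum can be modeled in a few lines of Python, along with the by and without clauses described next. This is a hypothetical sketch over made-up per-series rates, not Prometheus code. Real aggregation also drops the metric name, which the sample data leaves out.

```python
from collections import defaultdict

# Hypothetical per-series request rates, as rate(http_requests_total[5m]) might return
RATES = [
    ({"hostname": "web-1", "datacenter": "paris", "endpoint": "/api"}, 12.0),
    ({"hostname": "web-2", "datacenter": "paris", "endpoint": "/api"}, 8.0),
    ({"hostname": "web-1", "datacenter": "paris", "endpoint": "/login"}, 3.0),
    ({"hostname": "db-1", "datacenter": "london", "endpoint": "/api"}, 5.0),
]


def sum_agg(series, by=None, without=None):
    """Sum values per output label set, like sum, sum by (...), or sum without (...)."""
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}  # keep only listed labels
        elif without is not None:
            kept = {k: v for k, v in labels.items() if k not in without}  # drop listed labels
        else:
            kept = {}  # plain sum(): every series falls into one group
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


print(sum_agg(RATES))                   # {(): 28.0}
print(sum_agg(RATES, by=["endpoint"]))  # {(('endpoint', '/api'),): 25.0, (('endpoint', '/login'),): 3.0}
```

In this model, sum_agg(RATES, without=["hostname"]) keeps both datacenter and endpoint, so it returns three groups rather than one per datacenter.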
Aggregating by Label
Use by to keep certain labels and aggregate the rest:
# Total requests per endpoint (combine all servers)
sum by (endpoint) (rate(http_requests_total[5m]))

# Average CPU per datacenter
avg by (datacenter) (cpu_used)
Use without to do the opposite: aggregate away the listed labels and keep all the others:
# Sum across hostnames, keeping every other label (such as datacenter, endpoint, and status)
sum without (hostname) (rate(http_requests_total[5m]))
Finding Extremes
Use topk() to find the highest values and bottomk() to find the lowest:
# Top 5 servers by CPU
topk(5, cpu_used)

# 5 least busy servers by CPU
bottomk(5, cpu_used)

# 10 busiest endpoints
topk(10, sum by (endpoint) (rate(http_requests_total[5m])))
Doing Math
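You can perform arithmetic with metrics. Between a metric and a number, the operation applies to every series. Between two metrics, PromQL pairs up series whose labels match exactly, ignoring the metric name:
# Convert bytes to gibibytes (GiB)
mem_used_bytes / 1024 / 1024 / 1024

# Calculate percentage
(mem_used_bytes / mem_total_bytes) * 100

# Difference between two metrics
disk_total_bytes - disk_free_bytes
Comparing Values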
Filter results based on conditions:
# Only show CPU above 80%
cpu_used > 80

# Servers with less than 10% disk free
(disk_free_bytes / disk_total_bytes) * 100 < 10
Combining Conditions
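Use and, or, and unless to combine conditions. These operators pair series by their labels:
- and returns left-hand series that have a match on the right.
- unless returns left-hand series that have no match on the right.
- or returns all left-hand series plus any right-hand series without a match.
# High CPU AND high memory (stressed servers)
(cpu_used > 80) and (mem_used_perc > 80)

# Either condition (any resource pressure)
(cpu_used > 90) or (mem_used_perc > 90)

# High CPU while memory is fine
(cpu_used > 80) unless (mem_used_perc > 80)
Looking at History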
Aggregating Over Time
See how metrics behaved over a time period:
# Average CPU over the last hour
avg_over_time(cpu_used[1h])

# Peak memory usage in the last 24 hours
max_over_time(mem_used_perc[24h])

# Minimum free disk space over the last 7 days (to find low points)
min_over_time(disk_free_bytes[7d])
Querying the Past
Use offset to look at historical data:
# CPU usage 1 hour ago
cpu_used offset 1h

# Compare current usage to the same time yesterday
cpu_used - (cpu_used offset 24h)
Practical Examples
Let’s build some real-world queries.
Error Rate Percentage
Calculate what percentage of requests are errors:
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
  * 100
This divides the rate of 5xx responses by the rate of all requests and multiplies by 100.
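The arithmetic can be checked with a few made-up numbers. This Python sketch uses hypothetical per-server rates. It shows that summing the rates before dividing weights each server by its traffic, which averaging per-server percentages would not do.

```python
# Hypothetical per-server rates (requests per second), as rate(...[5m]) might return
errors = {"web-1": 0.5, "web-2": 4.0}    # status=~"5.."
totals = {"web-1": 100.0, "web-2": 20.0}  # all statuses

# What the query computes: sum both sides first, then divide
error_pct = sum(errors.values()) / sum(totals.values()) * 100
print(round(error_pct, 2))  # 3.75

# Averaging per-server percentages instead over-weights the quiet server
per_server = [errors[h] / totals[h] * 100 for h in totals]
print(round(sum(per_server) / len(per_server), 2))  # 10.25
```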
Service Availability
Calculate uptime percentage over the last day:
# Percentage of time the service was up
avg_over_time(up[24h]) * 100
The up metric is 1 when Prometheus’s last scrape of a target succeeded and 0 when it failed.
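Since up only takes the values 0 and 1, its average over a window is the fraction of scrapes that succeeded. The Python sketch below treats the *_over_time functions as plain reductions over hypothetical samples. It assumes evenly spaced scrapes, which is what lets a sample average stand in for a fraction of time.

```python
# Hypothetical `up` samples for one target, one per evenly spaced scrape
# (1 = scrape succeeded, 0 = scrape failed)
up_samples = [1, 1, 1, 0, 0, 1, 1, 1]

# avg_over_time(up[...]) averages the samples in the window
availability = sum(up_samples) / len(up_samples) * 100
print(availability)  # 75.0

# max_over_time and min_over_time reduce the same samples differently
print(max(up_samples), min(up_samples))  # 1 0
```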
Disk Space Prediction
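Predict when the disk will be full based on current growth:
# Seconds until the disk is full (if growth continues)
disk_free_bytes / deriv(disk_used_bytes[1h])
disk_used_bytes is a gauge, so use deriv(), which estimates its per-second slope by linear regression. Avoid rate() here: it is meant for counters and treats any decrease as a counter reset.
Or use the built-in prediction function:
# Predicted disk usage in 7 days, based on the trend over the last hour
predict_linear(disk_used_bytes[1h], 7 * 24 * 3600)
Request Latency Percentiles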
If you have histogram metrics, calculate percentiles:
# 95th percentile response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
The le label (“less than or equal”) holds each bucket’s upper bound. histogram_quantile() needs it to interpolate, so keep it in the by clause when aggregating.
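To see why le matters, here is a simplified Python model of the estimate histogram_quantile() performs. It finds the bucket that contains the target rank and interpolates linearly inside it. The bucket values are hypothetical. The sketch also assumes non-negative observations and skips edge cases that Prometheus handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets [(upper_bound, count), ...].

    Simplified model: buckets are sorted by upper bound, the last one is +Inf,
    and observations are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # can't interpolate into +Inf; use the last finite bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


# Hypothetical per-second rates for each le bucket of http_request_duration_seconds_bucket
buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

Here the 95th-ranked observation falls halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 seconds. The result can only be as precise as the bucket boundaries allow.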
Saturation Detection
Find servers under heavy load:
# Servers with high CPU for a sustained period
avg_over_time(cpu_used[15m]) > 85
Common Mistakes to Avoid
Don’t Query Raw Counters
# Wrong: the raw cumulative value says little on its own
http_requests_total

# Right: calculate the rate
rate(http_requests_total[5m])
Choose Appropriate Time Windows
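rate() needs at least two samples in the window. A common rule of thumb is a window of at least four times the scrape interval:
# Too short: with a 15s or 30s scrape interval the window may hold fewer than two samples, and rate() returns nothing
rate(http_requests_total[30s])

# Better: smooths out noise and spans several scrapes at typical intervals
rate(http_requests_total[5m])
Filter Early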
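Put label matchers in the selector so fewer series are loaded and processed. A comparison such as > 100 only filters after the whole computation has run:
# Efficient: the label matcher limits which series are read before aggregating
sum(rate(http_requests_total{status="500"}[5m]))

# Less efficient: every series is read and aggregated before the comparison filters the result
sum(rate(http_requests_total[5m])) > 100
Watch Cardinality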
Avoid queries that return too many series:
# Potentially dangerous: returns every series
http_requests_total

# Safer: aggregate or filter
sum by (endpoint) (rate(http_requests_total[5m]))
Next Steps
Now that you understand PromQL basics:
- Use the PromQL Reference for syntax lookup
- Build custom dashboards with your queries
- Create recording rules for complex calculations
- Set up alerts based on PromQL conditions