Chapter 9: Alerting Rules (rules.yml)
In the previous chapter, Prometheus Configuration (`prometheus.yml`), we learned how to configure Prometheus to collect metrics from our applications. But collecting metrics is only half the battle! We also need to define what constitutes a problem and when to be alerted. That's where the `rules.yml` file comes in.
Think of `rules.yml` as the brain that tells Prometheus when to sound the alarm. It contains the alerting rules that define when Prometheus should send an alert to Alertmanager (which we configured in Alertmanager).
Imagine you want to be alerted if your website's response time exceeds 1 second. The `rules.yml` file lets you define this condition as an alerting rule. Let's see how!
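To make this concrete, here is a minimal sketch of such a rule. It assumes a hypothetical gauge-style metric named `http_request_duration_seconds` exposed by the application; your real metric name and type will likely differ (a histogram would need a different expression):

```yaml
groups:
  - name: Latency
    rules:
      - alert: SlowResponses
        # Hypothetical metric name -- substitute whatever your application exposes.
        expr: http_request_duration_seconds > 1
        for: 2m                   # must stay slow for 2 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: Website responses are slower than 1 second
```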
What is `rules.yml`?
`rules.yml` is a YAML file that contains a list of alerting rules. Each rule defines a condition that, when met, triggers an alert. Let's break down the key components of an alerting rule (an annotated skeleton follows this list):
- `alert`: The name of the alert. This is a short, descriptive name (e.g., `HighCPUUsage`, `ServiceDown`).
- `expr`: The Prometheus Query Language (PromQL) expression that defines the condition. This is the "if" part of the "if-then" statement. We'll explore PromQL in more detail in the next chapter, PromQL (Prometheus Query Language).
- `for`: How long the condition must be true before the alert is triggered. This prevents transient spikes from triggering alerts.
- `labels`: Labels to add to the alert. These labels can be used for routing alerts in Alertmanager (as we saw in Alertmanager).
- `annotations`: Descriptive information about the alert, such as a summary and a description. This information is included in the alert notification.
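Putting those five fields together, an alerting rule has the following shape. This skeleton uses placeholder names rather than anything from this project's configuration:

```yaml
groups:
  - name: ExampleGroup            # rules are organized into named groups
    rules:
      - alert: ExampleAlert       # alert: the alert's name
        expr: up == 0             # expr: the PromQL condition to evaluate
        for: 1m                   # for: how long the condition must hold before firing
        labels:
          severity: warning       # labels: attached to the alert, used for routing
        annotations:
          summary: A scrape target is down         # annotations: human-readable context
          description: The target has been unreachable for 1 minute.
```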
Let’s look at an example!
A Simple `rules.yml` Example
Here's a snippet from our `prom_config/rules.yml` file:
```yaml
groups:
  - name: Application_Status
    rules:
      - alert: BackendDown
        annotations:
          description: Backend pod HTTP traffic to port 5000 is down for more than 1 minute.
          summary: Backend pod HTTP 5000 is down (instance )
        expr: probe_success{instance="backend-service:5000"} == 0
        for: 10s
        labels:
          severity: critical
```
Let’s break down what this rule means:
- `groups`: Alerting rules are organized into groups for logical organization.
- `name: Application_Status`: Name for the alert group.
- `rules`: List of alerts in this group.
- `alert: BackendDown`: The name of this alert is `BackendDown`.
- `annotations`: Adds metadata for the alert.
  - `description`: Description of the alert.
  - `summary`: Short summary of the alert.
- `expr: probe_success{instance="backend-service:5000"} == 0`: This is the PromQL expression. It checks whether the `probe_success` metric for the `backend-service:5000` instance is equal to 0. Remember from Blackbox Exporter, the `probe_success` metric is 1 if the service is up and 0 if it's down.
- `for: 10s`: The condition must be true for 10 seconds before the alert is triggered.
- `labels`: Adds the label `severity: critical` to the alert. This label can be used to route the alert to a specific receiver in Alertmanager (as we saw in Alertmanager).
What will happen?
If the `probe_success` metric for `backend-service:5000` is 0 for 10 seconds, Prometheus will trigger an alert named `BackendDown`. This alert will carry the `severity: critical` label and will be sent to Alertmanager. Alertmanager will then route the alert to the appropriate receiver based on its configuration (e.g., sending a notification to a Slack channel).
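As a reminder of how that label is used on the Alertmanager side, here is a minimal routing sketch for an `alertmanager.yml`. The receiver name, webhook URL, and channel are placeholders, not values from this project's configuration:

```yaml
route:
  receiver: default               # fallback receiver
  routes:
    - match:
        severity: critical        # matches the label set by our BackendDown rule
      receiver: slack-critical    # placeholder receiver name

receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook URL
        channel: '#alerts'
```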
Here’s another example, for the frontend:
```yaml
- alert: FrontendDown
  annotations:
    description: Frontend pod HTTP traffic to port 5000 is down for more than 1 minute.
    summary: Frontend pod HTTP 5000 is down (instance )
  expr: probe_success{instance="frontend:80"} == 0
  for: 10s
  labels:
    severity: critical
```
The only difference is that it monitors `probe_success{instance="frontend:80"}`.
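As a side note, you don't have to copy the rule for every target. A single rule without the `instance` matcher produces one alert per failing series, with that series' labels (including `instance`) attached. The variant below is only an illustration, not part of the project's actual `rules.yml`:

```yaml
- alert: ProbeTargetDown
  # Without an instance matcher, each failing probe target fires its own alert,
  # labeled with that target's "instance" label.
  expr: probe_success == 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: A probed endpoint is down
```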
How Prometheus Uses `rules.yml`
When Prometheus starts, it reads the `rules.yml` file (as specified in the `rule_files` section of `prometheus.yml`). It then continuously evaluates the expressions defined in the alerting rules.
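For reference, the wiring in `prometheus.yml` looks roughly like this; the interval and file path below are illustrative and should match whatever you set up in the previous chapter:

```yaml
global:
  evaluation_interval: 15s        # how often alerting rules are evaluated (illustrative)

rule_files:
  - rules.yml                     # illustrative path to the rules file
```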
```mermaid
sequenceDiagram
    participant Prometheus
    participant rules.yml
    participant Metrics
    participant Alertmanager

    Prometheus->>rules.yml: Read Alerting Rules
    rules.yml-->>Prometheus: Alerting Rules Data
    loop Evaluate Rules Every Evaluation Interval
        Prometheus->>Metrics: Query Metrics (PromQL)
        Metrics-->>Prometheus: Metrics Data
        alt Rule Condition Met
            Prometheus->>Alertmanager: Send Alert
        end
    end
```
- Prometheus reads the `rules.yml` file and parses the alerting rules.
- At each `evaluation_interval` (defined in `prometheus.yml`), Prometheus queries the metrics specified in the `expr` of each rule.
- If the expression evaluates to true for the duration specified in the `for` field, Prometheus sends an alert to Alertmanager. (While the condition holds but the `for` duration has not yet elapsed, the alert is in a "pending" state; once the duration passes, it becomes "firing" and is sent.)
Inside the Prometheus codebase, the `rules` package contains the logic for loading and evaluating rule files, while the `notifier` package handles sending the resulting alerts to Alertmanager; `main.go` wires these components together.
Here’s a highly simplified hypothetical code example:
```go
// This is not actual Prometheus code
package main

import "fmt"

// Simplified alert rule struct
type AlertRule struct {
	Name string
	Expr string
	For  int
	// In real Prometheus, Expr would be parsed into an executable form
}

// Simplified function to evaluate an alert rule (this is where PromQL would be evaluated)
func evaluateRule(rule AlertRule) bool {
	// This is a placeholder - in reality, this would query the metrics database
	// and evaluate the PromQL expression
	if rule.Name == "BackendDown" {
		// Simulate backend being down
		return true
	}
	return false
}

func main() {
	// Example rule
	rule := AlertRule{
		Name: "BackendDown",
		Expr: "probe_success{instance=\"backend-service:5000\"} == 0",
		For:  10,
	}

	// Evaluate the rule
	isFiring := evaluateRule(rule)
	if isFiring {
		fmt.Println("Alert! BackendDown is firing!")
		// In real Prometheus, this is where the alert would be sent to Alertmanager
	} else {
		fmt.Println("Backend is up")
	}
}
```
Explanation:
This simplified program demonstrates how Prometheus might evaluate an alerting rule. It uses a placeholder instead of querying the metrics database and evaluating the PromQL expression. In a real implementation, `Expr` would be a parsed PromQL query, and the `evaluateRule` function would fetch data and run the query against it.
Connecting the Dots: Prometheus, rules.yml, and Alertmanager
Let’s recap how all these pieces work together:
- Prometheus Configuration (`prometheus.yml`): Tells Prometheus where to find metrics (`scrape_configs`) and where to send alerts (`alerting`). It also tells Prometheus where to find the alerting rules (`rule_files`).
- Alerting Rules (`rules.yml`): Defines the conditions that trigger alerts (using PromQL expressions).
- Alertmanager Configuration (`alertmanager.yml`): Defines how to route and handle alerts (grouping, deduplication, silencing, and sending to receivers like Slack or email).
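Seen side by side, the chain of references between these files is short. The excerpt below is a condensed, illustrative sketch; the file path, host, and port are placeholders rather than this project's exact values:

```yaml
# prometheus.yml (excerpt)
rule_files:
  - rules.yml                                # the alerting rules from this chapter
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]     # placeholder Alertmanager address
```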
Conclusion
The `rules.yml` file is essential for defining what constitutes a problem and when to be alerted. By combining PromQL expressions with thresholds and labels, you can create powerful alerting rules that ensure you're notified of critical issues in your applications and infrastructure. Now that you understand how to define alerting rules, let's dive deeper into PromQL, the language used to query metrics in Prometheus, in the next chapter, PromQL (Prometheus Query Language).