How to Implement Unit Testing for Prometheus Configuration

Jack Zhai
7 min read · May 12, 2023

Background

A Prometheus deployment has two main components: Prometheus itself and Alertmanager.

Their duties are clearly separated, as the architecture shows:

  • Prometheus regularly pulls metrics from targets, stores the metric data, and sends an alert to Alertmanager when the data matches an alerting rule.
  • Alertmanager receives alerts from Prometheus and sends each one to a specific receiver according to its routing rules.

Given these duties, the core configuration[1] of Prometheus is as follows:

# ... 

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# ...

And the core configuration[2] of Alertmanager is as follows:

# ...
# The root node of the routing tree.
route: <route>
# A list of notification receivers.
receivers:
  - <receiver> ...
# ...

In practice, the configurations of Prometheus and Alertmanager can grow very large.

Rigorous software engineering requires us to check the validity and correctness of these configurations before deploying them. Otherwise, SRE/DevOps efficiency suffers badly, because huge configurations have to be debugged by hand.

Therefore, we need an efficient way to ensure the validity and correctness of the configuration.

Ensure the validity and correctness of Prometheus’s configuration

promtool

Prometheus provides a command-line program called promtool. After decompressing the Prometheus package, you will find it in the same folder as the prometheus binary.

promtool provides some subcommands to ensure the validity and correctness of Prometheus configuration:

# Verify the validity of the Prometheus configuration, 
# which supports the --lint="duplicate-rules"
# parameter for checking duplicate rule configurations
check config [<flags>] <config-files>...
# Verify the validity of the rule configuration
check rules [<flags>] <rule-files>...
# Execute the rules unit test case
test rules <test-rule-file>...

The validity checks need little explanation: just run the corresponding check subcommand.

The test rules subcommand of promtool runs unit tests against the rule configuration. The command is as follows:

./promtool test rules test.yml

test.yml is a unit test description file. promtool supports specifying multiple unit test files at the same time, such as: ./promtool test rules test.yml test1.yml test2.yml

The format of the unit test description file content is as follows:

# Prometheus rule configuration file path
rule_files:
  - rule1.yml
evaluation_interval: 1m

# unit test list
tests:
  - interval: 1m
    input_series:
    alert_rule_test:
    promql_expr_test:

Each test case under tests consists of 4 fields:

  1. input_series: the test data of the test case, that is, the time series data of the metrics;
  2. interval: the time between consecutive values in each time series;
  3. alert_rule_test: the test cases for alerting rules;
  4. promql_expr_test: the test cases for PromQL expressions. We can use it to debug our PromQL.

Each is described in detail below.

Test Case Data

The definition format of the test case data is as follows:

input_series:
  - series: 'up{job="prometheus", instance="localhost:9090"}'
    values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
  - series: 'up{job="node_exporter", instance="localhost:9100"}'
    values: '1+0x6 0 0 0 0 0 0 0 0'
  - series: 'go_goroutines{job="prometheus", instance="localhost:9090"}'
    values: '10+10x2 30+20x5'
  - series: 'go_goroutines{job="node_exporter", instance="localhost:9100"}'
    values: '10+10x7 10+30x4'

Each entry in input_series consists of two fields:

  • series: the metric name and labels that identify the time series
  • values: the values of the series. The time between consecutive values is given by interval

To keep the values field short, you can use an expanding notation. The syntax is as follows:

  • a+bxc becomes a a+b a+(2*b) a+(3*b) … a+(c*b). Read this as: the series starts at a, followed by c further samples incrementing by b.
  • a-bxc becomes a a-b a-(2*b) a-(3*b) … a-(c*b). Read this as: the series starts at a, followed by c further samples decrementing by b (or incrementing by negative b).

There are special values to indicate missing and stale samples:

  • _ represents a missing sample from a scrape
  • stale indicates a stale sample

Here are some examples. The first four come from the official documentation, and the last two from the test data above:

  • -2+4x3 means: -2 2 6 10
  • 1-2x4 means: 1 -1 -3 -5 -7
  • 1x4 means: 1 1 1 1 1
  • 1 _x3 stale means: 1 _ _ _ stale
  • 1+0x6 0 0 0 0 0 0 0 0 means: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
  • 10+10x2 30+20x5 means: 10 20 30 30 50 70 90 110 130
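To make the notation concrete, here is a minimal Python sketch that expands a values string the way the examples above describe. It is an illustration only, not promtool's actual parser; the function name expand_values and the regular expression are assumptions made for this example.

```python
import re
from typing import List, Union

# A sample is a number, or one of the markers "_" (missing) and "stale".
Sample = Union[float, str]

# Matches tokens such as "-2+4x3", "1-2x4", "1x4" and "_x3".
_EXPANDING = re.compile(
    r"^(?P<start>-?\d+(?:\.\d+)?|_)"
    r"(?:(?P<sign>[+-])(?P<step>\d+(?:\.\d+)?))?"
    r"x(?P<count>\d+)$"
)


def expand_values(values: str) -> List[Sample]:
    """Expand a promtool-style values string into individual samples."""
    samples: List[Sample] = []
    for token in values.split():
        match = _EXPANDING.match(token)
        if match is None:
            # Plain token: a marker or a single number.
            samples.append(token if token in ("_", "stale") else float(token))
            continue
        count = int(match["count"])
        if match["start"] == "_":
            # "_x3" yields exactly three missing samples.
            samples.extend(["_"] * count)
            continue
        start = float(match["start"])
        step = float(match["step"] or 0)
        if match["sign"] == "-":
            step = -step
        # The start value plus `count` further samples.
        samples.extend(start + i * step for i in range(count + 1))
    return samples
```

For instance, expand_values('10+10x2 30+20x5') yields the nine values listed in the last example above (as floats).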

Test PromQL Expressions

A big pain point when writing Prometheus alerting rules is that there is no simple way to verify the correctness of the PromQL you wrote. With unit tests, we can verify an expression first and only then add it to the real rule file.

A PromQL test case is written as follows:

promql_expr_test:
  # The PromQL expression to test; it is evaluated against the test data
  - expr: go_goroutines > 5
    # Evaluation time, counted from the 0th second of the test data.
    # If the interval is 1m, then 4m corresponds to the fifth value
    # (index 4) of each series, because the first value is at 0m
    eval_time: 4m
    # The samples expected after evaluating the PromQL expression
    exp_samples:
      - labels: 'go_goroutines{job="prometheus",instance="localhost:9090"}'
        value: 50
      - labels: 'go_goroutines{job="node_exporter",instance="localhost:9100"}'
        value: 50
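The expected value of 50 can be checked by hand. The short Python sketch below does the arithmetic for the two go_goroutines series defined in the test data; the helper name sample_index is made up for this example, and the sketch deliberately ignores Prometheus details such as staleness handling.

```python
def sample_index(eval_time_s: int, interval_s: int) -> int:
    """Index of the newest input sample at or before eval_time.

    The first sample of every input series sits at 0s.
    """
    return eval_time_s // interval_s


# '10+10x2 30+20x5' expanded by hand
prometheus_goroutines = [10, 20, 30, 30, 50, 70, 90, 110, 130]
# '10+10x7 10+30x4' expanded by hand
node_exporter_goroutines = [10, 20, 30, 40, 50, 60, 70, 80, 10, 40, 70, 100, 130]

# eval_time: 4m with interval: 1m
index = sample_index(eval_time_s=4 * 60, interval_s=60)
expected = [prometheus_goroutines[index], node_exporter_goroutines[index]]
```

Both series hold 50 at that index, and 50 > 5, so both samples pass the filter in the expression.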

Unit testing of alert rules

Alongside the PromQL test cases, the unit test description file can also contain test cases for alerting rules. The code sample is as follows:

alert_rule_test:
  - eval_time: 10m
    alertname: InstanceDown
    exp_alerts:
      - exp_labels:
          severity: page
          instance: localhost:9090
          job: prometheus
        exp_annotations:
          summary: "Instance localhost:9090 down"
          description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

  • eval_time: the time at which the rule is evaluated.
  • alertname: the alert name, which must match the alert name in Prometheus's alerting rules.
  • exp_labels: the labels expected on the alert;
  • exp_annotations: the annotations expected on the alert.
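To see why the test expects InstanceDown to fire for localhost:9090 at 10m, the Python sketch below simulates a simplified alert lifecycle. The rule file itself is not shown in this article, so the sketch assumes a rule of the form up == 0 with for: 5m (implied by the "more than 5 minutes" annotation), and assumes that rules are evaluated at the same 1m interval as the test data. The function alert_state is hypothetical and is not part of any Prometheus API.

```python
from typing import List, Optional


def alert_state(values: List[float], interval_s: int, for_s: int, eval_time_s: int) -> str:
    """Return 'inactive', 'pending' or 'firing' at eval_time.

    Simplified model: the condition is `value == 0`, and the alert fires
    once the condition has held continuously for at least `for_s` seconds.
    """
    active_since: Optional[int] = None
    state = "inactive"
    for i, value in enumerate(values):
        t = i * interval_s
        if t > eval_time_s:
            break
        if value == 0:
            if active_since is None:
                active_since = t  # condition just became true
            state = "firing" if t - active_since >= for_s else "pending"
        else:
            active_since = None  # condition broken, reset the timer
            state = "inactive"
    return state


up_prometheus = [0] * 15               # '0 0 0 ... 0' (15 values)
up_node_exporter = [1] * 7 + [0] * 8   # '1+0x6 0 0 0 0 0 0 0 0'
```

Under these assumptions, alert_state(up_prometheus, 60, 300, 600) returns "firing", which matches the single expected alert. The node_exporter instance only goes down at 7m, so at 10m it would still be "pending" and is correctly absent from exp_alerts.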

Ensure the validity and correctness of the Alertmanager’s configuration

amtool

Similar to Prometheus, Alertmanager provides a command-line program called amtool[3].

We focus on two of its subcommands:

  • config routes test: verifies the correctness of the routing configuration
  • check-config <config.yaml>: verifies the validity of the configuration

Config routes test Subcommand Introduction

Unlike promtool, amtool does not support defining test cases in YAML files. The following is a sample command:

amtool config routes test --config.file=config.yaml --verify.receivers=team-X-pager service=database owner=team-X

The --config.file parameter specifies the path to the configuration file.

Instead of a configuration file path, you can also use the --alertmanager.url parameter to test against the configuration of a running Alertmanager.

--verify.receivers takes a comma-separated list of the receivers that the label set is expected to be routed to.

The label set comes at the end of the command, written as key=value pairs separated by spaces. In the example, service=database owner=team-X represents {service="database",owner="team-X"}

For better visualization, you can also add the --tree parameter. The effect is as follows:

% amtool config routes test --config.file=config.yaml --tree --verify.receivers=team-X-pager service=database owner=team-X
Matching routes:
.
└── default-route
    └── {service="database"}
        └── {owner="team-X"} receiver: team-X-pager

If verification fails, the command exits with a non-zero status.
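The check that config routes test performs can be pictured as a walk down the routing tree. The Python sketch below is a deliberately simplified model, not Alertmanager's implementation: it supports only equality matchers, ignores the continue option and regular expressions, and lets the first matching child win. The Route class, the resolve_receiver function, and the receiver names other than team-X-pager are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)  # equality matchers only
    routes: List["Route"] = field(default_factory=list)  # child routes


def resolve_receiver(route: Route, labels: Dict[str, str]) -> str:
    """Descend into the first child whose matchers all match the labels.

    If no child matches, the current node's receiver handles the alert.
    """
    for child in route.routes:
        if all(labels.get(key) == value for key, value in child.match.items()):
            return resolve_receiver(child, labels)
    return route.receiver


# A hypothetical tree shaped like the --tree output above.
root = Route(
    receiver="default-receiver",
    routes=[
        Route(
            receiver="team-DB-pager",
            match={"service": "database"},
            routes=[Route(receiver="team-X-pager", match={"owner": "team-X"})],
        )
    ],
)

resolved = resolve_receiver(root, {"service": "database", "owner": "team-X"})
```

In this model, the label set from the sample command resolves to team-X-pager, which is the receiver that --verify.receivers compares against.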

Check-config Subcommand Introduction

It works as follows:

% amtool check-config config.yaml
Checking 'config.yaml' SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 5 receivers
- 1 templates
SUCCESS

If validation fails, the command exits with a non-zero status.

Visual Alarm Notification Route

The official Prometheus website provides an online visual editor[4] for alert notification routing.

Paste the configuration into the edit text box, then enter the labels of the alert in "Match Label Set", and the routing path of the notification will be displayed below. In the resulting diagram, the solid red dot marks the receiver that matches the labels.

The visual tool is very useful while debugging a routing configuration, because it makes the routing tree much easier to reason about. However, be careful: do not upload any sensitive configuration to the public network.

How to Integrate into CI/CD Pipeline

Everything above is the most primitive way to use the two commands, that is, running them manually. For real engineering use, we need to integrate them into the CI/CD pipeline.

There are generally two integration methods:

  1. Add a stage to the pipeline that executes promtool and amtool.
  2. Integrate them into the build tool, such as Bazel.

Of course, if a self-hosted DevOps platform needs deeper integration, you can bring the implementation code of promtool and amtool into the code of your own DevOps platform.
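As a rough sketch of the first method, a pipeline stage could run a small Python script that invokes the subcommands described in this article and fails the stage when any of them fails. The file names prometheus.yml and alertmanager.yml and the helper names build_commands and run_all are assumptions for illustration; promtool and amtool are expected to be on the PATH of the CI runner.

```python
import subprocess
from typing import List, Sequence


def build_commands(
    prom_config: str, rule_tests: Sequence[str], am_config: str
) -> List[List[str]]:
    """Assemble the validity checks and unit tests to run, in order."""
    return [
        ["promtool", "check", "config", prom_config],
        ["promtool", "test", "rules", *rule_tests],
        ["amtool", "check-config", am_config],
    ]


def run_all(commands: Sequence[Sequence[str]]) -> int:
    """Run every command and return how many exited with a non-zero status."""
    failures = 0
    for command in commands:
        print("+", " ".join(command))
        if subprocess.run(list(command)).returncode != 0:
            failures += 1
    return failures
```

A stage script would then call run_all(build_commands("prometheus.yml", ["test.yml"], "alertmanager.yml")) and exit with a non-zero status when the result is greater than zero. Running every command before failing, rather than stopping at the first error, lets one pipeline run report all problems at once.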

Another Engineering Challenge

Another engineering challenge is that the configuration files above reference each other. For example, the value of the expr field in a Prometheus rule file is effectively referenced by the unit test file (prometheus-unitesting.yml in this case).

In my experience, if this reference relationship is not governed, the maintenance cost of these configurations becomes very high.

Because YAML has no built-in way to define variables, configuration languages that support programming, such as Jsonnet and CUE, can be used instead of YAML.

This topic is relatively large and is beyond the scope of this article.

References

[1] core configuration: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration
[2] core configuration: https://prometheus.io/docs/alerting/latest/configuration/#configuration
[3] amtool: https://github.com/prometheus/alertmanager#amtool
[4] online visual editor: https://prometheus.io/webtools/alerting/routing-tree-editor/?_gl=1

Written by Jack Zhai

DevOps,SRE,Bazel The Author of 《Jenkins2.x In Practice》, https://showme.codes
