Getting started with Prometheus
What is Monitoring?
As applications grow more complex, they need to be managed at scale to keep your infrastructure operational. You need a way of knowing how your applications are running, how resources are being utilized, and how usage grows over time. Typically you have, let's say, multiple servers running containers. As user traffic grows, it makes sense to split these services up and distribute them individually, which leads us to a microservice infrastructure. And once services need to talk to each other, they have to be interconnected in some way.
The Problem
Let's say your application stops working. You don't know what went wrong or which component of your application caused the failure. Or let's say your application is responding very slowly because all the traffic is being directed to just a few servers. That is a place no one wants to be in, because debugging this manually is very time consuming.
The Solution
So how do you ensure that your application is being maintained properly and is running with no downtime? We need some sort of automated tool that constantly monitors our application and alerts us when something goes wrong (or right, depending on the use case). In our previous example, we would be notified as soon as a service starts failing, so we can act before the whole application goes down.
What is Prometheus?
Prometheus is an open-source monitoring & alerting tool. It was originally built at SoundCloud and is now a fully open-source, graduated project of the Cloud Native Computing Foundation. It has become highly popular for monitoring container & microservice environments.
Prometheus Architecture
Some Terminologies
Target - It is what Prometheus monitors. It can be your applications, servers, etc.
Metric - The particular thing we want to monitor on a target. For example, if we have a server (target), we might want to monitor the number of errors on the HTTP endpoints it exposes (metric).
The main component of Prometheus is the server. It consists of three parts:
Time Series Database (TSDB) - Stores the metrics data. It ingests the data (append-only), compacts it, and allows efficient querying.
Scrape Engine - Pulls the metrics (described above) from our target resources and sends them to the TSDB. (Prometheus pulls are called scrapes.)
Server - Used to query the data stored in the TSDB. It is also used to display the metrics in a dashboard such as Grafana or the Prometheus UI.
More about Metrics
The metrics are defined with TYPE & HELP attributes to increase readability.
HELP - It provides a description of the metric.
TYPE - Even though Prometheus offers only 4 core metric types to keep things simple, it allows us to attach labels to those metrics for more specific use cases. The 4 core metric types are:
Counter - As the name suggests, it maintains a running count, for example the number of requests or errors. Note: Do not use this type if the value of your metric can decrease.
Gauge - It is best suited for metrics that can go up & down, like CPU usage.
Histogram - A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
Summary - Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
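To make the histogram type and the HELP/TYPE attributes more concrete, here is a minimal Python sketch of how observations could be counted into cumulative buckets and written out in a Prometheus-style text layout. The class name SimpleHistogram, the metric name http_request_duration_seconds, and the bucket bounds are assumptions made up for illustration; this is a hand-rolled toy, not the official client library.

```python
import math


class SimpleHistogram:
    """Toy histogram: cumulative buckets plus a running sum and count."""

    def __init__(self, name, help_text, bounds):
        self.name = name
        self.help_text = help_text
        # Upper bounds ("le" = less than or equal); +Inf catches everything.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        """Record one observation, e.g. a request duration in seconds."""
        self.total += value
        self.count += 1
        # Buckets are cumulative: a value counts toward every bound >= value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def render(self):
        """Return the metric as text with HELP and TYPE lines on top."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} histogram",
        ]
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if bound == math.inf else str(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {n}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.count}")
        return "\n".join(lines) + "\n"


hist = SimpleHistogram(
    "http_request_duration_seconds",  # hypothetical metric name
    "Duration of HTTP requests in seconds.",
    bounds=[0.1, 0.5, 1.0],           # hypothetical bucket bounds
)
for duration in (0.05, 0.3, 0.3, 2.0):
    hist.observe(duration)
print(hist.render())
```

Note that the buckets are cumulative: the 0.3 second observations are counted in both the 0.5 and the 1.0 bucket, and every observation lands in +Inf.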
How does it work?
How does it get the data from Targets?
The Data Retrieval Worker pulls the data from the HTTP endpoints of the targets on the path /metrics. Here we notice 2 things:
The endpoints should expose the path /metrics.
The data provided by the endpoint should be in a format that Prometheus understands.
Q. How do we make sure that the target services expose /metrics & that the data is in the correct format? A. Some of them expose the endpoint by default. The ones that do not need a component to do so. This component is known as an Exporter. An Exporter does the following:
Fetch data from the target
Convert data into a format that Prometheus understands
Expose the /metrics endpoint (this can now be retrieved by the Data Retrieval Worker)
For different types of services, like APIs, databases, storage, HTTP, etc., Prometheus has a list of Exporters you can use.
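The three exporter steps above can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption for illustration: fetch_target_stats returns hard-coded numbers in place of a real target, and the demo_ metric prefix is made up. Real exporters are considerably more careful, so treat this as a rough outline rather than a reference implementation.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_target_stats():
    """Step 1: fetch raw data from the target (hard-coded stand-in here)."""
    return {"connections": 12, "queue_length": 3}


def to_prometheus_text(stats):
    """Step 2: convert the raw data into text-format gauge samples."""
    lines = []
    for key, value in sorted(stats.items()):
        name = f"demo_{key}"  # hypothetical metric prefix
        lines.append(f"# HELP {name} Current {key} reported by the target.")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Step 3: expose the converted data on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = to_prometheus_text(fetch_target_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def start_exporter(port=0):
    """Start the exporter in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_exporter()
# This request plays the role of a Prometheus scrape.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/metrics")
print(conn.getresponse().read().decode("utf-8"))
conn.close()
server.shutdown()
server.server_close()
```

The data is fetched on every request, so each scrape sees the target's current state rather than a cached snapshot.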
Monitoring Personal Services
Let's say you want to monitor an application you have written in Java. You can use Client Libraries for that. They let you expose application metrics via an HTTP endpoint /metrics on your application's instance, which Prometheus can then scrape. The official documentation lists the available libraries, along with information on how to create your own.
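To give a feel for what a client library does inside your application, here is a rough Python sketch of a labeled counter that application code increments as it handles requests. The class LabeledCounter, the function handle_request, and the metric name app_http_requests_total are all hypothetical stand-ins; an official client library provides this kind of object for you, with many more safeguards.

```python
import threading


class LabeledCounter:
    """Toy counter keyed by label values; it can only go up."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()  # handlers may run on many threads

    def inc(self, amount=1, **labels):
        """Increase the series identified by the given label values."""
        if amount < 0:
            raise ValueError("counters must not decrease")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(str(labels[n]) for n in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Return all series as text, one line per label combination."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                pairs = ",".join(
                    f'{n}="{v}"' for n, v in zip(self.label_names, key)
                )
                lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


REQUESTS = LabeledCounter(
    "app_http_requests_total",  # hypothetical metric name
    "Total HTTP requests handled by the application.",
    ["method", "status"],
)


def handle_request(method, path):
    """Pretend request handler that records every call in the counter."""
    status = 200 if path == "/" else 404
    REQUESTS.inc(method=method, status=status)
    return status


handle_request("GET", "/")
handle_request("GET", "/")
handle_request("GET", "/missing")
print(REQUESTS.render())
```

The text returned by render is what the application would serve on /metrics; each distinct combination of label values becomes its own time series.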
How is it different?
As mentioned above, Prometheus uses a pull mechanism to get data from targets. Most other monitoring systems use a push mechanism (we'll see what that is in a bit). How is this different and what makes Prometheus so special?
Q. What do you mean by push mechanism? A. Instead of the monitoring tool's server making requests to get the data, the application's servers push the data to a database.
Q. Why is Prometheus better? A. Multiple Prometheus instances can pull data from the same target endpoint. Also note that this way Prometheus can monitor whether an application is responsive or not, rather than waiting for the target to push data. (Check out the official comparison documentation)
NOTE: But what happens if a target doesn't live long enough to be scraped, such as a short-lived batch job? For this, Prometheus offers the Pushgateway. These services push their data to the Pushgateway, and the Data Retrieval Worker then pulls from the Pushgateway like it does from any other target. This way, you get the best of both approaches!
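As a rough illustration of the push side, the Python sketch below builds (but does not send) the HTTP request a short-lived job could use to hand its metrics to a Pushgateway. The path /metrics/job/<job name> follows the Pushgateway's HTTP interface; the gateway address, the job name nightly_backup, and the metric name are assumptions for the example.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway, job, metrics_text):
    """Build, without sending, a request that pushes metrics for one job.

    Only simple job names are handled here; the name is percent-encoded
    and placed in the URL path.
    """
    job_part = urllib.parse.quote(job, safe="")
    url = f"{gateway.rstrip('/')}/metrics/job/{job_part}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",  # PUT replaces what was previously pushed for this job
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


payload = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1700000000\n"
)
# Hypothetical gateway address and job name.
req = build_push_request("http://localhost:9091", "nightly_backup", payload)
print(req.get_method(), req.full_url)

# Sending is left commented out because it needs a running Pushgateway:
# urllib.request.urlopen(req, timeout=5)
```

Prometheus itself still pulls: the Pushgateway simply holds the last pushed values and exposes them for scraping.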
How to use it?
Now that we know how Prometheus works, let's take a look at how we actually use it. We mentioned targets, metrics, and all sorts of things. Where do we define those? In a config (YAML) file.
Q. When you define the targets you want to collect data from in the file, how does Prometheus find them? A. Using Service Discovery. It can also discover services automatically based on the applications running.
Default Configuration File
(Check the official documentation for configuration)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
global - scrape_interval defines how often Prometheus collects data from the targets mentioned in the file. This can of course be overridden for individual jobs.
rule_files - This allows us to set rules for metrics & alerts. These files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The evaluation_interval defines how often these rules are evaluated. Prometheus supports 2 types of rules:
Recording Rules - If you perform some operations frequently, their results can be precomputed and saved as a new set of time series. This makes the monitoring system a bit faster.
Alerting Rules - These let you define conditions under which alerts are sent to external services.
scrape_configs - Here we define the services/targets that we need Prometheus to monitor. In this example file, the job_name is prometheus, meaning that the target being monitored is the Prometheus server itself. In short, it will get data from the /metrics endpoint exposed by the Prometheus server. Here, the target is localhost:9090, so Prometheus will expect the metrics to be at localhost:9090/metrics.
How does Alerting work?
Prometheus has an Alertmanager that can be used to send alerts to you via email, mailing lists, etc. As mentioned above, the Prometheus server evaluates the Alerting Rules and hands the resulting alerts to the Alertmanager, which takes care of delivering them.
Where is the metrics data stored?
Prometheus stores it on disk. This can be local storage or a remote storage system. The data is stored in a custom time-series format, so it is not meant to be written to directly.
How to get the data?
Prometheus lets us get the metrics data using the PromQL query language. You can use a Web UI to request data from the Prometheus server via PromQL.
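Besides the Web UI, PromQL queries can also be sent to the server's HTTP API. The Python sketch below builds an instant-query URL for /api/v1/query and parses a response. The server address and the query up{job="prometheus"} are assumptions that match the example config in this post, and the JSON shown is a hand-written sample shaped like a successful response, not output captured from a real server.

```python
import json
import urllib.parse


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(response_text):
    """Turn an instant-vector JSON response into (labels, value) pairs."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    results = []
    for series in data["result"]:
        _timestamp, raw_value = series["value"]  # value arrives as a string
        results.append((series["metric"], float(raw_value)))
    return results


url = build_query_url("http://localhost:9090", 'up{job="prometheus"}')
print(url)

# Hand-written sample shaped like a successful instant-query response.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "up",
                    "job": "prometheus",
                    "instance": "localhost:9090",
                },
                "value": [1700000000.0, "1"],
            }
        ],
    },
})
for labels, value in parse_instant_vector(sample):
    print(labels["instance"], value)
```

Fetching the URL against a running server (for example with urllib.request.urlopen) would return JSON of the same shape, which parse_instant_vector can then unpack.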
Running it Locally
Let's take the example of the configuration file (config.yml) above that monitors the Prometheus server running on our machine. (Check out the README.md file for more information)
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --config.file=config.yml
Once it is running, open localhost:9090 and you'll get the Prometheus UI dashboard, which you can now configure.
Thanks for reading!
In the next blog we'll be looking at a few more examples of using Prometheus to monitor your Kubernetes resources, and Thanos!
Resources
https://prometheus.io/docs/introduction/overview/
https://github.com/roaldnefs/awesome-prometheus
https://www.youtube.com/watch?v=QgJbxCWRZ1s&feature=youtu.be
https://www.youtube.com/watch?v=mC6Zt5Ga9UQ
https://www.youtube.com/watch?v=Me-kZi4xkEs&t=1397s