In a typical datacenter environment where you run hundreds of nodes equipped with NVIDIA GPUs, it becomes important to monitor those systems to gain insight into performance metrics, memory usage, temperature and utilization. Tools like Ganglia & Nagios are very popular due to their scalable, distributed monitoring architecture for high-performance computing systems such as clusters and grids. Ganglia, for example, leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. But with the advent of container technology, there is a need for modern monitoring tools and solutions that work well with Docker & microservices.
It’s the modern world of the Prometheus Stack…
Prometheus is a 100% open-source service monitoring system and time series database written in Go. It is a full monitoring and trending system that includes built-in and active scraping, storing, querying, graphing, and alerting based on time series data. It has knowledge about what the world should look like (which endpoints should exist, what time series patterns mean trouble, etc.), and actively tries to find faults.
How is it different from Nagios?
Though both serve the purpose of monitoring, Prometheus wins this debate on the following major points:
- Nagios is host-based. Each host can have one or more services, each of which has one check. There is no notion of labels or a query language. Prometheus, in contrast, comes with its robust query language called “PromQL”, a functional expression language that lets the user select and aggregate time series data in real time. The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems via the HTTP API.
- Nagios is suitable for basic monitoring of small and/or static systems where blackbox probing is sufficient. But if you want to do whitebox monitoring, or have a dynamic or cloud-based environment, then Prometheus is a good choice.
- Nagios is primarily just about alerting based on the exit codes of scripts. These are called “checks”. There is silencing of individual alerts, but no grouping, routing or deduplication. Prometheus handles grouping, routing and deduplication of alerts through its Alertmanager component.
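To make the point about PromQL and the HTTP API a little more concrete, here is a minimal Python sketch that builds and sends an instant query to Prometheus's `/api/v1/query` endpoint. The server address `http://localhost:9090` and the metric name `gpu_temperature` are assumptions made for illustration; substitute whatever your setup actually exposes.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Send the query and return the list of result samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


# Hypothetical expression: average of a gauge named gpu_temperature,
# grouped by the instance label. The metric name is an assumption.
EXAMPLE_QUERY = "avg by (instance) (gpu_temperature)"

# Only builds the URL; call run_query(...) against a live server to execute it.
print(build_query_url("http://localhost:9090", EXAMPLE_QUERY))
```

Labels such as `instance` are what make this kind of aggregation possible, which is exactly what a host-and-check model lacks.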
Let’s talk about Prometheus Pushgateway..
Occasionally you will need to monitor components which cannot be scraped. They might live behind a firewall, or they might be too short-lived to expose data reliably via the pull model. The Prometheus Pushgateway allows you to push time series from these components to an intermediary job which Prometheus can scrape. Combined with Prometheus’s simple text-based exposition format, this makes it easy to instrument even shell scripts without a client library.
The Prometheus Pushgateway allows ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. It is important to understand that the Pushgateway is explicitly not an aggregator or distributed counter but rather a metrics cache. It does not have statsd-like semantics. The metrics pushed are exactly the same as you would present for scraping in a permanently running program. For machine-level metrics, the textfile collector of the Node exporter is usually more appropriate. The Pushgateway is intended for service-level metrics. It is not an event store.
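As a rough illustration of how little is needed to push metrics, the following Python sketch renders a few gauges in the text exposition format and sends them to a Pushgateway using only the standard library. The gateway address (port 9091 is the Pushgateway default), the job and instance names, and the metric names are all assumptions for the example, and the sketch assumes the job and instance names contain no `/` characters.

```python
import urllib.parse
import urllib.request


def to_exposition(metrics: dict) -> str:
    """Render {name: value} gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The format requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


def push_url(gateway: str, job: str, instance: str) -> str:
    """Build the grouping-key URL: /metrics/job/<job>/instance/<instance>."""
    job_part = urllib.parse.quote(job, safe="")
    instance_part = urllib.parse.quote(instance, safe="")
    return f"{gateway.rstrip('/')}/metrics/job/{job_part}/instance/{instance_part}"


def push(gateway: str, job: str, instance: str, metrics: dict,
         timeout: float = 5.0) -> int:
    """PUT the metrics to the Pushgateway and return the HTTP status code."""
    request = urllib.request.Request(
        push_url(gateway, job, instance),
        data=to_exposition(metrics).encode("utf-8"),
        # PUT replaces every metric in this job/instance group;
        # POST would replace only metrics with the same names.
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical metric names and values, shown without contacting a server.
print(to_exposition({"gpu_temperature": 61, "gpu_utilization": 37}), end="")
```

With a Pushgateway running, a call such as `push("http://localhost:9091", "gpu_stats", "node1", {"gpu_temperature": 61})` would make the value available for Prometheus to scrape. Because the Pushgateway is a cache, the last pushed value stays visible until it is overwritten or deleted.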
In this blog post, I will showcase how NVIDIA Docker, Prometheus & Pushgateway come together to push NVIDIA GPU metrics to the Prometheus Stack. My test environment:
- Docker Version: 20.10.10
- OS: Ubuntu 18.04 LTS
- Environment: Managed Server Instance with GPU
- GPU: GeForce GTX 1080 Graphics card
Cloning the GitHub Repository
Run the command below to clone the repository to your GPU-equipped Ubuntu system:
Script to bring up the Prometheus Stack (includes Grafana)
Change to the nvidia-prometheus-stats directory, give the ‘start_containers.sh’ script execute permission, and then run it as shown below:
$ sudo chmod +x start_containers.sh
$ sudo sh start_containers.sh
This script will bring up 3 containers in sequence: Pushgateway, Prometheus & Grafana.
Executing GPU Metrics Script:
NVIDIA provides a Python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are under the BSD license and allow simplified access to GPU metrics like temperature, memory usage, and utilization.
Next, under the same directory, you will find a Python script called “test.py”.
Execute the script (after changing the IP address on line 124 to match your host machine) as shown below:
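For a feel of what the NVML bindings offer, here is a minimal sketch of reading temperature, memory usage and utilization for every GPU. It is not the repository's script, only an illustration; it assumes the bindings are installed and importable as `pynvml`, and the dictionary keys are names chosen for this example. The bindings module is passed in as a parameter so the function can also be exercised without a GPU.

```python
def collect_gpu_metrics(nvml) -> list:
    """Return one dict of readings per GPU, using the given NVML bindings module."""
    nvml.nvmlInit()
    try:
        readings = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            memory = nvml.nvmlDeviceGetMemoryInfo(handle)            # bytes
            utilization = nvml.nvmlDeviceGetUtilizationRates(handle)  # percent
            readings.append({
                "gpu": index,
                "temperature_celsius": nvml.nvmlDeviceGetTemperature(
                    handle, nvml.NVML_TEMPERATURE_GPU),
                "memory_used_bytes": memory.used,
                "memory_total_bytes": memory.total,
                "utilization_percent": utilization.gpu,
            })
        return readings
    finally:
        # Always release NVML, even if a query above raised an error.
        nvml.nvmlShutdown()


# Usage on a machine with an NVIDIA driver and the bindings installed:
#   import pynvml
#   for reading in collect_gpu_metrics(pynvml):
#       print(reading)
```

Each reading can then be turned into one labelled time series per GPU before it is pushed.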
$ sudo python test.py
That’s it. It is time to open up the Prometheus UI at http://<IP-address>:9090
Just type gpu in the Expression field and the list of available GPU metrics will show up automatically:
Accessing the targets
Go to Status > Targets to see which targets are accessible. Each target's State should show UP.
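The same check can be scripted against Prometheus's `/api/v1/targets` endpoint. The sketch below is a hedged illustration rather than part of the original setup; the base URL is an assumption, and the sample payload at the bottom is made-up data in the shape the endpoint returns.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> dict:
    """Map each active target's scrape URL to its health ("up", "down" or "unknown")."""
    return {
        target["scrapeUrl"]: target["health"]
        for target in payload["data"]["activeTargets"]
    }


def fetch_targets(base_url: str, timeout: float = 5.0) -> dict:
    """Query a running Prometheus server and summarize its active targets."""
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_targets(json.load(response))


# Made-up sample response, trimmed to the fields used above.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://pushgateway:9091/metrics", "health": "up"},
        ],
    },
}
print(summarize_targets(SAMPLE))
```

Against a live server you would call `fetch_targets("http://<IP-address>:9090")` with your own host address.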
Click on the Pushgateway endpoint to view the GPU metrics in detail:
You can access Grafana through the link below:
Did you find this blog helpful? Feel free to share your experience. Get in touch @ajeetsraina.
If you are looking out for contribution/discussion, join me at Collabnix Slack Channel to meet 6000+ DevOps Folks.