Brenton Baker

Metrics, Monitoring & Alerting

Peter Drucker summed it up many years ago: 'You can’t manage what you don’t measure.' Discover how I capture metrics to monitor and alert on everything that's important to me.

With most things I work on, I find it useful to know two key pieces of information: the current state and the expected state. Understanding these states lets you determine whether you're on track or, if not, how far off track you are.

In technology, and especially with IoT devices, the ability to monitor, analyse, and react to changes in a system's state is critical to effective operations.

In this blog I'll touch on a few systems I monitor in the context of my home environment and provide some insights to help get you started monitoring the stuff that you care about.

What I monitor

  • Home network
  • Tesla charging
  • Air quality
  • brentonbaker.com / cloud infrastructure
  • A few other services (vault, security systems, etc.)

The monitoring stack

To perform all the magic, I use Prometheus and Grafana. Prometheus is an open-source monitoring and alerting toolkit designed to collect, store, query, and alert on time-series data. Prometheus is part of a broader ecosystem that includes several components that provide comprehensive monitoring and alerting capabilities:

  1. Prometheus Server: The heart of the system. The Prometheus server scrapes, stores, and queries time-series data, providing real-time insights.
  2. Prometheus Alertmanager: The Alertmanager component handles alert notifications. Alerting rules themselves are defined in the Prometheus server as PromQL expressions (metric thresholds, predefined conditions, or complex queries); when a rule fires, Alertmanager groups, deduplicates, and routes the alerts to receivers such as email for rapid resolution.
  3. Prometheus Pushgateway: Sometimes, systems generate metrics that don't naturally fit the pull-based model used by Prometheus. The Pushgateway is an optional component that accepts pushed metrics from short-lived jobs and batch processes and holds them for Prometheus to scrape, filling the gap in the Prometheus ecosystem. You'll see an example of this with my Tesla monitoring; my AQ monitoring uses a similar push pattern with a custom endpoint.
  4. PromQL: Prometheus Query Language (PromQL) is a specialised query language for time-series data. It allows you to perform complex queries, aggregations, and mathematical operations on your metrics, giving you deep insights into system performance.
  5. Grafana: To visualise and explore metrics, you can pair Prometheus with Grafana. Grafana allows you to create visualisations and dashboards that make it easy to understand your data. You'll see some examples coming up...
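
To make the pull model a little more concrete: a scrape target is simply an HTTP endpoint that returns plain text in the Prometheus exposition format. Here is a minimal sketch in Python, standard library only, of what that text looks like for a single gauge. The helper `render_gauge` and the metric `office_temperature_celsius` are hypothetical names used for illustration; in practice you would normally let an official client library generate this output.

```python
def render_gauge(name: str, help_text: str, value: float) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


if __name__ == "__main__":
    # Hypothetical metric, for illustration only
    print(render_gauge("office_temperature_celsius", "Office temperature", 21.5), end="")
```

Note that the metric name on the HELP and TYPE lines matches the name on the sample line; that is how Prometheus ties the metadata to the value.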

Defining the metrics that matter

Now we have the tools and we know the systems/services we want to monitor. Next up is defining the metrics that matter. Be intentional about what you're collecting and how you will use these metrics to inform decisions or trigger an action.

I want to satisfy the following objectives:

  1. I must know when there are sustained periods of high upload and/or download on my home WAN connection
  2. I must know when there are resource constraints on my cloud infrastructure
  3. I must know as soon as my website is down for more than 5 minutes (yes, I set personal SLAs).
  4. I must know the cost per charge when I charge my Tesla at home
  5. I must know the total (lifetime) cost of Tesla charging at home
  6. I would like to be alerted when my Tesla has finished charging at home
  7. I would like to know how many new visitors access my website within a 24-hour period
  8. I must know when someone accesses (or attempts to access) the vault
  9. I must know when the vault disconnects from power
  10. I would like to know the current air quality in my office and a view of the last 24 hours of AQ metrics

Get building

To achieve objective 1, we can use the Prometheus SNMP Exporter (note: SNMPv3 supports encryption). The SNMP Exporter reads a config file snmp.yml that contains the OIDs to walk.
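
Interface traffic is typically exposed over SNMP as ever-increasing octet counters, so "high upload or download" has to be derived from the difference between two samples. The sketch below shows that arithmetic in Python, similar in spirit to what PromQL's `rate()` does. The function name and sample values are hypothetical, and it assumes that a counter going backwards means it was reset to zero (for example after a router reboot).

```python
def throughput_bps(prev_octets: int, curr_octets: int, interval_s: float) -> float:
    """Average bits per second between two samples of an octet counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr_octets >= prev_octets:
        delta = curr_octets - prev_octets
    else:
        # Counter went backwards: assume it restarted from zero
        delta = curr_octets
    # Octets are bytes, so multiply by 8 to get bits
    return delta * 8 / interval_s


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart
    print(throughput_bps(1_000_000, 76_000_000, 60))
```

A sustained period of high usage is then a matter of this value staying above a chosen threshold across several consecutive samples.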

To achieve objective 2, we can use the Prometheus Node Exporter.

To achieve objective 3, we can use the Prometheus Blackbox Exporter.

For all other objectives, we need to build our own tooling to extract the metrics and either expose them via an HTTP endpoint for Prometheus to scrape or push them to the Pushgateway. I've provided two different examples below of how to achieve this.

Tesla charging metrics

The Tesla home charger exposes metrics via an API at http://charger-IP/api/1/vitals and http://charger-IP/api/1/lifetime. This is great, but these endpoints return JSON rather than the Prometheus exposition format, so Prometheus can't scrape them directly. Below is my Python script that queries the vitals endpoint and pushes the metrics to the Prometheus Pushgateway; it is run at a defined frequency.

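#!/usr/bin/env python3
"""Fetch vitals from the Tesla home charger and push them to the Pushgateway."""

import ssl

import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# WARNING: this disables TLS certificate verification for HTTPS connections
# made through urllib, which prometheus_client uses for the push. Only do this
# on a network you trust; a valid certificate on the gateway is the better fix.
ssl._create_default_https_context = ssl._create_unverified_context

endpoint_url = 'http://<brenton-home>:8018/api/1/vitals'
pushgateway_url = 'https://monitoring.brentonbaker.com:9091'

# A dedicated registry so only the charger metrics are pushed
registry = CollectorRegistry()

# The timeout stops the script hanging if the charger is unreachable
response = requests.get(endpoint_url, timeout=10)

if response.status_code == 200:
    # Extract metrics from the JSON response
    data = response.json()
    for key, value in data.items():
        metric_name = key.replace('-', '_') + '_metric'
        metric_description = key.replace('_', ' ').title()

        # Only numeric values can be stored in a gauge (booleans count as 0/1)
        if isinstance(value, (int, float)):
            metric = Gauge(metric_name, metric_description, registry=registry)
            metric.set(float(value))

    push_to_gateway(pushgateway_url, job='brenton_tesla_model3', registry=registry)
    print('Metrics pushed to Prometheus Pushgateway successfully.')
else:
    print('Error:', response.status_code, response.text)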
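
Objectives 4 and 5 (cost per charge and lifetime cost) come down to multiplying an energy reading by a tariff. The sketch below shows that arithmetic in Python. It assumes the energy figure is available in watt-hours and that a single flat price per kWh applies; both are assumptions for illustration, as is the function name `charging_cost`, and a time-of-use tariff would need a more involved calculation.

```python
def charging_cost(energy_wh: float, price_per_kwh: float) -> float:
    """Cost of delivering energy_wh watt-hours at a flat price per kWh."""
    if energy_wh < 0 or price_per_kwh < 0:
        raise ValueError("energy and price must be non-negative")
    # Convert Wh to kWh, apply the tariff, round to cents
    return round(energy_wh / 1000 * price_per_kwh, 2)


if __name__ == "__main__":
    # Hypothetical figures: a 10 kWh session at 0.25 per kWh
    print(charging_cost(10_000, 0.25))
```

The same function covers both objectives: pass the energy for a single session to get the cost per charge, or the lifetime energy total to get the lifetime cost.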

Air Quality metrics

Since my AQ sensor has a PHP library for extracting data from the serial interface, I decided to continue using PHP. The PHP script updates values of the AQ metrics through POST requests made by another service (extract shown in the second script below) and provides the metrics in a format that Prometheus can scrape.

<?php

// File path to store the values
$dataFile = __DIR__ . '/data.json';

// Load existing values from file if available, defaulting to 0
$v1 = 0;
$v2 = 0;
if (file_exists($dataFile)) {
    $data = json_decode(file_get_contents($dataFile), true);
    $v1 = $data['v1'] ?? 0;
    $v2 = $data['v2'] ?? 0;
}

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // Handle POST request to update v1 and v2 values
    if (isset($_POST['v1'], $_POST['v2'])) {
        // Cast to float so only numeric values reach the metrics output
        $v1 = (float) $_POST['v1'];
        $v2 = (float) $_POST['v2'];

        // Save the updated values to file
        $data = ['v1' => $v1, 'v2' => $v2];
        file_put_contents($dataFile, json_encode($data));
    }
}

// Generate the metrics; the name on each HELP line must match the metric it describes
$metrics = "# HELP v1_small_particles Small air particles\n";
$metrics .= "# TYPE v1_small_particles gauge\n";
$metrics .= "v1_small_particles $v1\n";
$metrics .= "# HELP v2_large_particles Large air particles\n";
$metrics .= "# TYPE v2_large_particles gauge\n";
$metrics .= "v2_large_particles $v2\n";

// Serve the output as plain text in the Prometheus exposition format
header('Content-Type: text/plain; version=0.0.4');
echo $metrics;

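// Extract from the sensor service (a class method) that POSTs readings to the script above
function upload($v1, $v2)
{
    $this->log("sending $v1 and $v2 data");

    $url = 'http://<ip>:8016/metrics';
    $data = [
        'v1' => $v1,
        'v2' => $v2
    ];

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    // Return the response instead of printing it to the service's output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    $this->listen();
}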

Alerting

Alerting rules are defined in Prometheus and evaluated against the collected metrics; when a rule fires, Alertmanager handles the notification. Below is a rule that triggers an email when my Tesla has finished charging.

groups:
- name: brenton_tesla_charging
  rules:
  - alert: EVSEStateChanged
    expr: evse_state_metric{job="brenton_tesla_model3"} == 4
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Charging for Brenton's Tesla is complete"
      description: "Charging for Brenton's Tesla is complete"
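
The `for: 1m` clause in the rule means the expression has to stay true for a full minute before the alert fires; until then the alert is only pending, and a single false evaluation resets the timer. The following Python sketch illustrates that behaviour. The function `alert_state` and its sample format are hypothetical, and this is a simplification of how Prometheus actually evaluates rules.

```python
from typing import Iterable, Tuple


def alert_state(samples: Iterable[Tuple[float, bool]], for_seconds: float) -> str:
    """Return 'inactive', 'pending' or 'firing' after the last sample.

    samples are (timestamp_seconds, condition_is_true) pairs in time order.
    """
    active_since = None
    state = "inactive"
    for timestamp, condition in samples:
        if not condition:
            # Any false evaluation resets the timer
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        state = "firing" if held_for >= for_seconds else "pending"
    return state


if __name__ == "__main__":
    # Condition true at 0s, 30s and 60s with a 60 second "for" window
    print(alert_state([(0, True), (30, True), (60, True)], 60))
```

This is why a brief blip in a metric does not generate an email: the condition has to hold across consecutive evaluations for the whole window.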

To receive emails, you'll need to add SMTP configuration to alertmanager.yml. A sample for SendGrid is below.

receivers:
- name: email
  email_configs:
  - send_resolved: false
    to: <your-email-address>
    from: <your-service-email-address>
    hello: localhost
    smarthost: smtp.sendgrid.net:587
    auth_username: apikey
    auth_password: <secret>
    headers:
      From: <your-service-email-address>
      Subject: '{{ template "email.default.subject" . }}'
      To: <your-email-address>
    html: '{{ template "email.default.html" . }}'
    require_tls: true
templates: []

Display the data

Now we have the monitoring stack configured, it's time to build and deploy dashboards with Grafana. Below are some dashboards I display in my home office.

Brenton Baker is a highly experienced and articulate problem solver with a track record of building and operating successful tech companies across the globe.