How to perform cluster monitoring with the pgEdge platform

This guide walks through setting up cluster monitoring for pgEdge with Prometheus and Grafana.

  1. First, let’s set up a Postgres metrics exporter that can export to a Prometheus instance.
  2. Download the latest release of the postgres exporter from here. Specify the DSN of the database instance it needs to scrape metrics from and run:

export DATA_SOURCE_USER=<database_user>
export DATA_SOURCE_PASS_FILE=/home/test/.prompgexp_pass
export DATA_SOURCE_URI="localhost:5432/demo?sslmode=disable"

./postgres_exporter
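
The exporter reads the database password from the file named in DATA_SOURCE_PASS_FILE, so that file must exist before you start it. Below is a minimal sketch, assuming the path used above (the password value is a placeholder); the curl check assumes the exporter is already running locally on its default port:

# Store the database password in the file referenced by DATA_SOURCE_PASS_FILE
echo '<database_password>' > /home/test/.prompgexp_pass
chmod 600 /home/test/.prompgexp_pass

# Once the exporter is running, confirm it is serving metrics
curl -s http://localhost:9187/metrics | grep -m 5 '^pg_'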


  3. Replace <database_user> with the appropriate username, and adjust DATA_SOURCE_URI and DATA_SOURCE_PASS_FILE as needed.
  4. Starting the postgres exporter creates an HTTP server that listens on port 9187 by default; metrics can later be scraped by Prometheus from <host>:9187/metrics.
  5. Repeat steps 1-3 for each node in the pgEdge cluster.
  6. Now, let's set up sql_exporter to export the results of custom queries. We primarily use this to export spock-related metrics into the dashboard.
  7. Download sql_exporter from here, and use this for the config file:

jobs:
  - job_name: pgedge_targets
    collectors: [pgedge_spock]
    enable_ping: true # Optional, true by default. Set to `false` in case you connect to pgbouncer or a data warehouse
    static_configs:
      - targets:
          n1: '<n1_dsn>'
          n2: '<n2_dsn>'
          n3: '<n3_dsn>'
        labels:  # Optional, arbitrary key/value pair for all targets
          cluster: pgedge_cluster

collectors:
- collector_name: pgedge_spock

  metrics:
    - metric_name: pg_spock_replication_lag
      type: gauge
      help: 'Replication lag in seconds'
      key_labels:
        - slot_name
      values: [seconds]
      query: |
        SELECT slot_name, extract(epoch FROM replication_lag::interval) AS seconds FROM spock.lag_tracker

    - metric_name: pg_spock_replication_bytes
      type: gauge
      help: 'Replication lag in bytes'
      key_labels:
        - slot_name
      values: [replication_lag_bytes]
      query: |
        SELECT slot_name, replication_lag_bytes FROM spock.lag_tracker AS lag_bytes

    - metric_name: pg_spock_resolutions
      type: gauge
      help: 'Number of resolutions'
      values: [count]
      query: |
        SELECT count(*) FROM spock.resolutions AS res_count

    - metric_name: pg_postmaster_start_time_seconds
      type: gauge
      help: 'Postgres uptime in seconds'
      values: [uptime_seconds]
      query: |
        SELECT
          EXTRACT(EPOCH FROM now() - pg_postmaster_start_time()) AS uptime_seconds

    - metric_name: pg_spock_sub_status
      type: gauge
      help: 'Spock subscription status'
      key_labels:
        - slot_name
      values: [status]
      query: |
        SELECT sub_status->>'slot_name' AS slot_name,
        (CASE WHEN sub_status->>'status' = 'replicating' THEN 0 ELSE 1 END) AS status
        FROM (SELECT to_json(spock.sub_show_status()) AS sub_status) AS sub

    - metric_name: pg_spock_exception_occurrences
      type: gauge
      help: 'Entries in the exception log table'
      values: [count]
      query: |
        SELECT count(*) FROM spock.exception_log;
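
Before wiring these queries into sql_exporter, it can be worth running one of them by hand to confirm the spock catalog views are available on a node; a quick sketch using psql with a placeholder DSN:

# Run one of the collector queries manually against a node (the DSN is a placeholder)
psql 'postgresql://admin:password@n1:5432/demo' \
  -c "SELECT slot_name, extract(epoch from replication_lag::interval) AS seconds FROM spock.lag_tracker;"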


  8. Replace <n1_dsn> with the actual DSN needed to connect to n1; similarly, replace the DSNs for n2, n3, and so on. Naming the targets is important here, since the dashboard relies on the target names to display metrics. For example, the DSNs could look something like this:

      - targets:
          n1: 'postgresql://admin:password@n1:5432/demo?sslmode=disable'
          n2: 'postgresql://admin:password@n2:5432/demo?sslmode=disable'
          n3: 'postgresql://admin:password@n3:5432/demo?sslmode=disable'


  9. Once the config file is updated, run:

./sql_exporter


  10. Starting sql_exporter creates an HTTP server listening on port 9399 by default.

Note: You will need only one instance of sql_exporter for all of your nodes; unlike the postgres exporter, it does not need to be installed on each pgEdge node.
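
To confirm that sql_exporter can reach its targets and is exposing the custom spock metrics, you can hit its metrics endpoint; a quick check assuming it is running locally on the default port:

curl -s http://localhost:9399/metrics | grep pg_spock_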

  11. Next, we need an instance of Prometheus. Download it from here, and use this as the config file:

# my global config
global:
  scrape_interval: 5s # Set the scrape interval to every 5 seconds. The default is every 1 minute.
  evaluation_interval: 5s # Evaluate rules every 5 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:
  - job_name: 'postgres_exporter'
    static_configs:
      - targets: ['n1:9187', 'n2:9187', 'n3:9187', 'n1:9399']
    relabel_configs:
      - source_labels: [__address__]
        regex: (n1:9187)
        target_label: instance
        replacement: n1
      - source_labels: [__address__]
        regex: (n2:9187)
        target_label: instance
        replacement: n2
      - source_labels: [__address__]
        regex: (n3:9187)
        target_label: instance
        replacement: n3
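
Prometheus ships with promtool, which can validate the configuration before you start the server; a quick sanity check, assuming the file is saved as prometheus.yml in the current directory:

# Validate the Prometheus configuration file
./promtool check config prometheus.yml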


  12. The targets specify the nodes to connect to and scrape metrics from. In our example, the nodes are n1, n2, and n3, and their respective postgres exporter instances are running on n1:9187, n2:9187, and n3:9187. The sql_exporter is running on n1:9399.
  13. After making the necessary changes to the config file, start Prometheus with:

./prometheus --config.file=prometheus.yml
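
Once Prometheus is up, you can confirm that it discovered all four targets (the three postgres exporters plus sql_exporter) through its HTTP API; a quick check assuming Prometheus is running locally on its default port:

# List the health of every scrape target as seen by Prometheus
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'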


  14. This will start the Prometheus server on <host>:9090.
  15. Note: We strongly recommend using friendly node names, such as n1, n2, and so on. You can modify the /etc/hosts file on each of the nodes so that the friendly node names point to the correct IP addresses or domain names (example entries are shown below, after this list).
  16. Finally, we need an instance of Grafana to visualize the metrics scraped by our Prometheus instance.
  17. Download and install Grafana from here. Starting grafana-server should make the dashboard available at <host>:3000. Go through the setup, then add Prometheus as the data source; you can follow this tutorial for setting up the data source.
  18. Make sure that the Prometheus server URL correctly points to the instance we created earlier. It should look something like this:

http://n1:9090
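
If you prefer scripting over clicking through the UI, Grafana's HTTP API can create the data source as well; a sketch assuming default admin credentials and Grafana running on localhost:3000 (adjust both as needed):

# Add Prometheus as a Grafana data source via the HTTP API
curl -s -X POST -u admin:admin http://localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://n1:9090","access":"proxy"}'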

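For the friendly node names recommended in step 15, the /etc/hosts entries on each node could look like the sketch below; the IP addresses are placeholders and must be replaced with your nodes' real addresses:

# Hypothetical /etc/hosts entries for friendly node names
echo '10.0.0.11  n1' | sudo tee -a /etc/hosts
echo '10.0.0.12  n2' | sudo tee -a /etc/hosts
echo '10.0.0.13  n3' | sudo tee -a /etc/hosts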

  19. Finally, let's import our dashboard into Grafana. Download the dashboard template from here.
  20. Click the left panel -> Dashboards -> New -> Import.

  21. Upload or paste the dashboard JSON downloaded previously. You should now have the dashboard ready and populated with metrics.



  22. Cluster-wide metrics are displayed in the first section of the dashboard, and instance-specific metrics are displayed in the sections that follow. You can switch the instance you are looking at using the ‘Instance’ dropdown at the top of the dashboard.
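
If a dashboard panel ever shows no data, it can help to query the underlying metric straight from the Prometheus API; a quick check assuming Prometheus is reachable at n1:9090 as configured above:

# Instant query for one of the spock metrics; each node's replication slot should appear
curl -s 'http://n1:9090/api/v1/query?query=pg_spock_replication_lag'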