2019-01-07

Monitoring Postgres with Prometheus

I'm glad people found my presentation in Lisbon last October on monitoring Postgres using Prometheus interesting. The slides are now uploaded to the conference web site at https://www.postgresql.eu/events/pgconfeu2018/schedule/session/2166/. Sorry for the delay. Now that it's a new year, it's time to work on improving monitoring in Postgres.

As an aside, I experimented a bit with the build process for this slide deck. It's a GitLab project, and I set up a GitLab CI pipeline to run LaTeX to build the Beamer presentation and serve it from GitLab Pages. So you can see the most recent version of the slide deck at https://_stark.gitlab.io/monitoring-postgres-pgconf.eu-2018/monitoring.pdf any time, and it's automatically rebuilt each time I do a git push.

You can also see the source code at https://gitlab.com/_stark/monitoring-postgres-pgconf.eu-2018. Feel free to submit issues if you spot any errors in the slides, or even just suggestions about things that were unclear. **

But now the real question. I want to improve the monitoring situation in Postgres. I have all kinds of grand plans, but I would be interested to hear what people think the top priorities are and which practical changes they most want to see in Postgres.

Personally I think the most important first step is to implement native Prometheus support in Postgres -- probably a background worker that would start up and expose all of pg_stats directly from shared memory to Prometheus without having to start an SQL session with all its transaction overhead. That would make things more efficient but also more reliable during outages. It would also make it possible to export data for all databases instead of having the agent have to reconnect for each database!
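To make the target concrete, here is a rough sketch of what such an endpoint would serve, written in Python purely for illustration. It is not the proposed background worker, and nothing here reads from Postgres: the sample values, the metric naming scheme, the `render_metrics` and `serve` helpers, and the port are all assumptions of mine. The only thing it tries to show accurately is the Prometheus text exposition format, with one sample per database distinguished by a `datname` label.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample values. A real exporter would read these from the
# Postgres statistics, not from a hard-coded dict.
SAMPLE_STATS = {
    "postgres": {"xact_commit": 1200, "xact_rollback": 3},
    "app": {"xact_commit": 98000, "xact_rollback": 41},
}


def escape_label(value):
    """Escape a label value as the text exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(stats):
    """Render per-database counters in the Prometheus text exposition format."""
    lines = []
    counters = sorted({name for db in stats.values() for name in db})
    for counter in counters:
        # Assumed naming scheme, not an existing Postgres convention.
        metric = f"pg_stat_database_{counter}_total"
        lines.append(f"# TYPE {metric} counter")
        for datname in sorted(stats):
            if counter in stats[datname]:
                value = stats[datname][counter]
                lines.append(f'{metric}{{datname="{escape_label(datname)}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLE_STATS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=9187):
    """Serve /metrics until interrupted. The port is an arbitrary choice."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at `/metrics` on that address would collect all databases in a single scrape, which is the property I care about here.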

I have further thoughts about distributed tracing, structured logging, and pg_stats changes to support application profiling, but they are subjects for future blog posts. I have started organizing my ideas as issues at https://gitlab.com/_stark/postgresql/issues. Feel free to comment on them or create new issues if you think of something in these areas!

** (These URLs may have to change as the underscore is actually not legal, cf. https://gitlab.com/gitlab-org/gitlab-ce/issues/40241)

4 comments:

  1. It would be awesome to improve pg_stat_replication (or the metrics that report that replication is working). Currently, if streaming replication is working there are rows there, but if replication stops, the rows just disappear, and disappearing metrics are very hard to deal with in Prometheus.

    It would also be cool if the built-in Postgres metrics supported adding custom metrics, so that we can add metrics for table and index bloat, etc. Or just include them in the standard metrics :-)

  2. Hi,

    first of all, thanks. You gave me some good ideas about how we can replace some pganalyze.com reports with Prometheus and Grafana.

    About that, are all the dashboards that you shared in your slides available on the Grafana website?

    If not, can you share them?

    Thanks

  3. Hi,
    very interesting presentation, thanks for sharing.

    How about the Grafana dashboard, will you share it as well?
    Thanks.

  4. Hi,

    I realise this is old, but have you looked at OpenTelemetry? Supporting it in PostgreSQL could provide incredibly useful trace and metrics data :)
