Monitoring Postgres with Prometheus

I'm glad people found my presentation in Lisbon last October on monitoring Postgres with Prometheus interesting. The slides are now uploaded to the conference web site at https://www.postgresql.eu/events/pgconfeu2018/schedule/session/2166/. Sorry for the delay. Now that it's a new year, it's time to work on improving monitoring in Postgres.

As an aside, I experimented a bit with the build process for this slide deck. It's a Gitlab project, and I set up a Gitlab CI pipeline that runs LaTeX to build the Beamer presentation and serves it from Gitlab Pages. So you can see the most recent version of the slide deck at https://_stark.gitlab.io/monitoring-postgres-pgconf.eu-2018/monitoring.pdf at any time, and it's automatically rebuilt each time I do a git push.

You can also see the source code at https://gitlab.com/_stark/monitoring-postgres-pgconf.eu-2018, and feel free to submit issues if you spot any errors in the slides, or even just suggestions about things that were unclear. **

But now the real question. I want to improve the monitoring situation in Postgres. I have all kinds of grand plans, but I would be interested to hear what people feel are the top priorities and the most practical changes they want to see in Postgres.

Personally I think the most important first step is to implement native Prometheus support in Postgres -- probably a background worker that starts up and exposes all of pg_stats directly from shared memory to Prometheus, without having to start an SQL session with all its transaction overhead. That would make things more efficient, but also more reliable during outages. It would also make it possible to export data for all databases at once, instead of having the agent reconnect for each database!

I have further thoughts about distributed tracing, structured logging, and pg_stats changes to support application profiling, but those are subjects for future blog posts. I have started organizing my ideas as issues at https://gitlab.com/_stark/postgresql/issues -- feel free to comment on them, or create new issues if you have thoughts in these areas!

** (These URLs may have to change, as the underscore is actually not legal; cf. https://gitlab.com/gitlab-org/gitlab-ce/issues/40241)


FOSDEM Blogging II

Well to continue my octennial series on PostgreSQL at FOSDEM...

This year I presented for the first time at FOSDEM. Not in the PostgreSQL room -- which I think can be a bit like "preaching to the choir" -- but in two other rooms. I hoped to bring a bit of exposure to Postgres outside the usual crowd.

I presented the crowd favorite, PostgreSQL on a VAX, in the main lightning talk hall. Slonik came to visit to help out:

Also, one of the audience members turned out to have a real VAX and to be interested in repeating the experiments on real hardware. That's the kind of connection that is the strength of conferences like FOSDEM that aren't narrowly targeted.

Secondly, I presented my Postgres fuzzer extension in the LLVM room. This is a Postgres module that uses libFuzzer from LLVM to fuzz Postgres functions. libFuzzer is a lot like AFL, but it runs entirely in-process, so it's much faster and easier to integrate into a client-server architecture like Postgres.

The main developer of libFuzzer is Kostya Serebryany, and I regret that I didn't remember his name during the talk when I tried to credit him.


FOSDEM Blogging

I haven't done much work-related blogging but here goes:

Various blogs seem to consider FOSDEM to be a great success. Here's another. It brought together developers from different places and different fields in a way that can make a real difference.

I met up with Tatsuo Ishii-san from Japan to discuss a piece of Postgres development we're both looking at: recursive queries. I think we made more progress together than either of us had been able to make alone.

I also found a great tool called IOGrind, a Valgrind module that analyzes the I/O pattern across a disk. It will give us a much better idea of how well file systems are laying out our files and how much seeking is actually going on for apparently-sequential I/O.

Lastly, Heikki was able to meet with BSD developers and discuss the kinds of APIs Postgres needs to be able to control the timing of filesystem writes. fsync() is too coarse an API for our needs and leads to I/O spikes and unnecessarily high latency.

I've come away with new-found respect for events like FOSDEM or Barcamp, which can bring together developers from disparate but related fields. In the future I would consider checking the schedules of developers working on features we really need and seeking out events where the opportunity to work with them may come up.

One example is Postgres's needs for i18n. strcoll() is too awkward for Postgres to handle per-column collations, as it requires switching the global locale for every call. On glibc that would be tolerable -- though not great -- but on most OSes it would be unusably slow. Glibc has an internal strcoll_l(), but it's neither standard nor documented. Perhaps if we sit down with the right people we could get a standard interface which meets our needs.