We have a machine running some stuff on Docker, and little by little it has started to become important to keep an eye on it. However, looking for information on monitoring a Docker server it always seem to assume you’re running it in Swarm mode, which is not and WILL NOT be the case of this machine, Swarm adds a layer of complexity unneeded in this case.
What do you recommend for this case? I for one would love if the thing didn’t just give you a view of the things running on it but also gave you notifications if something went wrong (like if a container had to be restarted, or if one suddenly started eating all the CPU or something unusual).
I will be keeping an eye on this thread to see what other people do, but what I have done in the past is to have a couple different health checking strategies.
- For web-accessible services I am running, I usually run something like Uptime Kuma or Gatus on a different box checking to make sure those web endpoints are available and performant. I lately have been really digging how Gatus can check more than just the response header, but also latency and certificate validity.
- For the host machine, you can set up custom alerts within netdata for stuff like cpu utilization and memory with custom thresholds. The only other solution I have used for this in the past is setting up alerts through my VPS provider (if it is a VPS that is).
- On really low-spec machines I have had trouble with netdata though, so I don’t have a good solution in those cases. Interested to see if there are less demanding options. Instead, I have resorted to just using dashdot as a PWA so that I can check it easily on my phone if I am on the go.
- For some custom services in the past that run on set schedules, I have used healthchecks.io (which you can selfhost) to send alerts in the case that they don’t run for some reason.
- As for the containers being restarted, I actually don’t have experience with that, so I am interested to see what others have done.
Gatus sounds pretty cool, I’ll definitely give it a closer look later. Maybe it’s the push I needed to go ahead and look into proper observability as a whole, log ingestion and whatnot. My homelab setup is sorely lacking on that department if I’m being honest lol
Have started experimenting with OpenTelemetry (https://opentelemetry.io/docs/what-is-opentelemetry/) to add observability to different parts of the stack running inside a Docker container.
Not gotten far enough to recommend anything specific, but there is big ecosystem of open source collectors and analytics tools out there.
@jherazob If you use Podman instead of Docker Cockpit gives a great web dashboard covering both the containers and the host machine.
Uptime Kuma for web monitoring.
I’m experimenting with both Zabbix and Netdata to see which one I want to keep for monitoring resources on my hosts.
I use healthchecks.io to monitor backup scripts and cronjobs.
I’m using Autoheal to restart containers that are in an unhealthy state. For some containers this means I need to write my own health check. I mostly did this to resolve a rare issue where Plex would lock up but it’s helped in other scenarios too.