This document describes, for each alert, what you should do to diagnose and fix it.
This alert is fired by Prometheus.
This means that the Proxmox Mail Gateway deferred mail queue contains many messages.
This usually happens because of bad email addresses to which mail can't be delivered.
- Connect to the Proxmox Mail Gateway interface,
- go to Administration / Queues and check the deferred mails. You can view them by target domain.
It might be that:
- we have email addresses of users whose inbox is full or which do not exist anymore. In this case the best option is to remove the users (remove them also from the newsletter, in Brevo)
- if you have emails to `root@<server>.openfoodfacts.org`, it might be that either the server relay is not configured correctly, or the `mailx` program is not installed, or the `bsd-mailx` package is missing. See mail configuration for servers.
Finally, when you are done, you can empty the queues.
Tip: to extract the email addresses from the detail of deferred mails, you can use the "Console" tab of the developer toolbar with this expression:

```js
$x('//div[contains(@id, "pmgPostfixMailQueue")]//div[@class="x-grid-item-container"]//table//td[4]//text()').map(x => x.nodeValue).join("\n")
```
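If you have shell access on the gateway, a similar extraction can be done from the output of Postfix's `postqueue -p`. This is a minimal sketch on a fabricated sample that assumes the classic `mailq` layout (queue-id line, indented "(reason)" line, then indented recipient lines); on the real host you would pipe `postqueue -p` into the same `awk` filter:

```shell
# Fabricated sample of `postqueue -p` output (assumption: classic Postfix mailq layout)
sample='A1B2C3D4E5     1234 Mon May  1 00:00:00  sender@example.org
          (connect to mx.example.com[198.51.100.7]:25: Connection timed out)
                                         user1@example.com

F6G7H8I9J0     5678 Mon May  1 00:01:00  sender@example.org
          (host mx.example.net[198.51.100.8] said: 550 mailbox unavailable)
                                         user2@example.net'

# Indented lines whose first token contains "@" are recipient addresses
printf '%s\n' "$sample" | awk '/^[[:space:]]+[^[:space:](]+@/ {print $1}' | sort -u
```

This prints each deferred recipient address once, which is handy for cleaning user lists in bulk.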
This alert is fired by the `sanoid_check.sh` script, which is regularly run by systemd on every host using ZFS.
There are two possible alerts for each dataset.
This alert can fire for different reasons:
- if the dataset is a replica of a dataset on another host, it is not synchronized anymore. This might be transient, because of a large volume of data to transfer, until syncoid catches up.
To diagnose, go on the server and:
- list snapshots with `zfs list -t snapshot path/of/dataset` and check the last one
- check the syncoid logs with `journalctl -xe -u syncoid`, searching for your dataset.

Sometimes the synchronization is not working because the last snapshot has been removed on the source. See How to resync ZFS replication.
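The "check the last one" step can be sketched without a live pool. Assuming sanoid's `autosnap_YYYY-MM-DD_HH:MM:SS_period` naming scheme, snapshot names sort chronologically, so the newest snapshot is simply the last one in sorted order (fabricated sample; on the server you would use the real `zfs list -t snapshot -o name -H path/of/dataset` output):

```shell
# Fabricated `zfs list -t snapshot -o name -H` output
# (assumption: sanoid's autosnap_YYYY-MM-DD_HH:MM:SS_period naming)
sample='rpool/data/subvol-100@autosnap_2024-05-01_00:00:02_daily
rpool/data/subvol-100@autosnap_2024-05-02_00:00:01_daily
rpool/data/subvol-100@autosnap_2024-05-03_00:00:03_daily'

# These names sort chronologically, so the newest snapshot sorts last
printf '%s\n' "$sample" | sort | tail -n 1
```

If the date in the last snapshot name is older than the expected replication interval, the dataset is behind.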
- if the dataset is local, it might be that you didn't configure sanoid (`sanoid.conf`) correctly for this dataset: either you forgot to add a snapshot specification for it, or you should add it to the `no_sanoid_checks` directive.

To diagnose, go on the server and:
- list snapshots with `zfs list -t snapshot path/of/dataset` and check the last one
- check the sanoid logs with `journalctl -xe -u sanoid`, searching for your dataset.
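For the missing-snapshot-specification case, a minimal `sanoid.conf` entry for a local dataset might look like the fragment below. The dataset path and retention values are illustrative, not taken from our configuration; adapt them to the host's actual policy:

```ini
# Hypothetical dataset path; replace with the real one
[rpool/data/mydataset]
        use_template = production

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes
```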
This is fired because snapshots are accumulating on a dataset (which can lead to increased disk usage).
On the server, list snapshots with `zfs list -t snapshot path/of/dataset`.
Accumulation can be due to two main reasons:
- you forgot to configure the retention policy in `sanoid.conf` for this dataset, and sanoid is not removing old snapshots. In this case you will see a lot of hourly / daily snapshots.
- there is another source adding snapshots to this dataset, which are not removed by sanoid.

It may also be that sanoid is unable to clean up snapshots; use `journalctl -xe -u sanoid` to see why.
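To spot which dataset is accumulating, you can count snapshots per dataset. This sketch runs on a fabricated sample; on the server you would feed it the real `zfs list -t snapshot -o name -H` output instead:

```shell
# Fabricated `zfs list -t snapshot -o name -H` output
sample='rpool/data/subvol-100@autosnap_2024-05-01_00:00:02_daily
rpool/data/subvol-100@autosnap_2024-05-02_00:00:01_daily
rpool/backups/off@autosnap_2024-05-02_00:00:01_daily'

# Snapshot names are dataset@snapname: count per dataset, busiest first
printf '%s\n' "$sample" | awk -F@ '{count[$1]++} END {for (d in count) print count[d], d}' | sort -rn
```

The dataset at the top of the list is the one whose retention policy (or external snapshot source) deserves a look first.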