How do you automate checking whether the backup worked correctly, in the face of saf bugs, rsync bugs/misconfiguration, or bit rot?
My solution is to pick a few random files (plus whatever is new), and compute their hashes on both local and remote versions. But it's slow and probabilistic. ZFS also helps, but I feel it's too transparent to rely on (what if the remote storage changes filesystem).
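A rough sketch of that kind of spot check, assuming both copies are reachable as directory trees (for instance the remote is mounted locally). The names `spot_check`, `sample_size` and `always_check` are made up for illustration and are not part of saf or rsync:

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check(local_root, remote_root, sample_size=5, always_check=(), seed=None):
    """Return relative paths whose remote copy is missing or hashes differently."""
    local_root, remote_root = Path(local_root), Path(remote_root)
    candidates = sorted(
        p.relative_to(local_root) for p in local_root.rglob("*") if p.is_file()
    )
    rng = random.Random(seed)
    sample = set(rng.sample(candidates, min(sample_size, len(candidates))))
    # Hypothetical hook: files known to be new since the last run are always checked.
    sample.update(Path(p) for p in always_check)
    mismatches = []
    for rel in sorted(sample):
        remote_file = remote_root / rel
        if not remote_file.is_file() or sha256_of(local_root / rel) != sha256_of(remote_file):
            mismatches.append(rel.as_posix())
    return mismatches
```

Something like `spot_check("/data", "/mnt/backup", sample_size=20)` would then return an empty list when the sampled files agree. It stays probabilistic: a clean run only says the sampled files match.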
Those same questions always bug me, and I have tried everything from very smart to very brute-force solutions. I love ZFS, but then we can question ZFS and OS bugs in the same manner as saf or rsync. That rabbit hole is deep and quickly becomes expensive, since ZFS may need ECC RAM and other more expensive components.
Over the last few years I have been leaning towards using many cheap backups instead of clever and more expensive ones, with the idea that they can't all break at the same time. Yes, occasional checks are good, but safety in numbers seems like a good strategy.
It is not an accident that saf's tagline says "one backup is saf, two are safe, three are safer" ;)
On top of many cheap backups, I am also trying not to rely on any single piece of technology (I know, it is not ideal that the hardware and OS remain the same on any given computer no matter which backup tool is used). If I use saf as my preferred rsync-based solution, I will also use Borg or duply/duplicity as an additional backup to avoid rsync bugs.
Having two or more rsync-based backups, so that they all go through the same rsync pipe, makes much less sense than mixing completely different backup solutions, right?
1. Generate a list of files on both sides with their sizes & dates, and compare the lists, ignoring any files that have changed/appeared since the last backup cycle started. Unless your backups are truly massive in terms of number of files this is practical to automate and run at least as often as your backup cycle, and this catches many system errors or simple failures of the backups to run at all.
2. Occasionally checksum the whole damn lot in your latest snapshot and the originals. This can take a lot of time (and expense if you are using cloud storage with read access charges) so you want to do it less often, but it catches bit rot and similar issues. Again you have to skip files that have been touched since the start of the last backup cycle.
3. If you keep a checksum (or list of files with checksum) of each snapshot, occasionally pick one and verify it from scratch. As with hashing the latest snapshot this can be quite resource intensive for massive backups but is fine for mine. You can also just compare meta-data (files, sizes, dates) to a stored list which will catch some types of filesystem corruption affecting your older snapshots.
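For what it's worth, step 1 above could be sketched roughly like this. It assumes both trees are readable as local paths and that the backup tool preserves modification times (as `rsync -a` does); `build_manifest`, `compare_manifests` and `cycle_start` are hypothetical names, not taken from anyone's actual scripts:

```python
from pathlib import Path


def build_manifest(root):
    """Map relative path -> (size in bytes, mtime in whole seconds)."""
    root = Path(root)
    manifest = {}
    for path in root.rglob("*"):
        if path.is_file():
            info = path.stat()
            manifest[path.relative_to(root).as_posix()] = (info.st_size, int(info.st_mtime))
    return manifest


def compare_manifests(source, backup, cycle_start):
    """List discrepancies, skipping source files touched at or after cycle_start.

    cycle_start is a Unix timestamp marking when the last backup cycle began.
    """
    problems = []
    for rel, (size, mtime) in sorted(source.items()):
        if mtime >= cycle_start:
            continue  # changed or appeared after the backup began; not expected yet
        if rel not in backup:
            problems.append(("missing", rel))
        elif backup[rel] != (size, mtime):
            problems.append(("differs", rel))
    return problems
```

The same `build_manifest` output could be written to disk per snapshot and compared later, which is the meta-data variant of step 3. Swapping the `(size, mtime)` tuple for a content hash would give the slower full check from step 2.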
One of these days I might get around to tidying+documenting+publishing my scripts that run all this…
That's close to what I do[1]. The size and date comparison is done by rsync, and I keep a text file with all expected file hashes, so if there's any disagreement between copies I know which one to trust.
These hashes are also ordered so that the files at the top are the ones that have gone unchecked the longest; part of the script is to take the top N files, checksum them, and move them to the bottom of the list. This guarantees every file gets checksummed at a regular interval (once every total/N runs).
I also download a random file in every run, to make sure the connection is not broken.
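A minimal sketch of that rotation, assuming the hash list is held in memory as `(expected_hash, path)` pairs with the least recently checked entry first. `rotate_and_verify` and `batch_size` are illustrative names, and reading or writing the text file itself is left out:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rotate_and_verify(entries, batch_size, hash_file=sha256_of):
    """Check the first batch_size entries and move them to the end of the list.

    Returns (reordered entries, paths whose current hash is not the expected one).
    """
    batch, rest = entries[:batch_size], entries[batch_size:]
    failures = []
    for expected, path in batch:
        try:
            actual = hash_file(path)
        except OSError:
            actual = None  # a missing or unreadable file counts as a failure
        if actual != expected:
            failures.append(path)
    return rest + batch, failures
```

With a list of T files and a batch of N per run, each file comes back to the top after roughly T/N runs, so that ratio is what sets how stale the oldest check can get.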
My use case is personal photos and videos, so I also make sure that my local files are never changed.
And finally, I highly recommend Hetzner Storage Boxes. Not only are they dirt cheap while still giving you ZFS and Samba access, you can actually SSH into the box and run simple commands on the files locally, like sha256sum, without paying for network transfers.
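As an illustration of hashing on the remote side, here is a sketch that runs `sha256sum` over SSH and parses its output. It assumes key-based SSH login and that the remote shell accepts a plain `sha256sum <paths>` command; the host string and the function names are placeholders, and whether a given storage box accepts this exact invocation is not something I have verified:

```python
import shlex
import subprocess


def parse_sha256sum(output):
    """Parse sha256sum lines of the form '<hex digest>  <path>' into {path: digest}."""
    hashes = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        digest, _, rest = line.partition(" ")
        # rest starts with a one-character mode marker: ' ' (text) or '*' (binary)
        hashes[rest[1:]] = digest
    return hashes


def remote_sha256(host, paths):
    """Hash files on the remote machine so only the digests cross the network."""
    command = "sha256sum " + " ".join(shlex.quote(p) for p in paths)
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, check=True
    )
    return parse_sha256sum(result.stdout)
```

The returned digests can then be compared against locally computed ones or against a stored hash list, e.g. `remote_sha256("user@backup-host.example", ["photos/2020/img_0001.jpg"])` with a made-up host and path.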