Steps to reproduce
- Configure a scheduled backup (e.g. hourly) so that its start minute coincides with apply-updates.timer (default 03:12)
- Wait for both timers to fire at the same time on a node where the backup of a large module (e.g. mail) is still running when the update starts
- Observe the following backup runs for that module
Expected behavior
Module updates and backup runs do not interfere: either the update waits for the running backup to complete, or the backup is stopped gracefully so that Restic releases its repository lock. Subsequent backup runs succeed.
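As an illustration of the first option (the update waits for the running backup), here is a minimal sketch. It is not taken from the core code: the function names, the timeout values and the idea of polling the unit's ActiveState are assumptions for discussion. The real fix may well use a different mechanism, such as systemd ordering or a lock shared between the two jobs.

```python
import subprocess
import time

# Unit states in which the backup should be treated as still running.
BUSY_STATES = {"active", "activating", "deactivating", "reloading"}


def unit_state(unit, user=False):
    """Return the systemd ActiveState of a unit (e.g. 'inactive', 'activating')."""
    cmd = ["systemctl"]
    if user:
        # Assumption: the backup unit may live in the module's user manager.
        cmd.append("--user")
    cmd += ["show", "--property=ActiveState", "--value", unit]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip()


def wait_until_idle(unit, probe=unit_state, timeout=3600.0, interval=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the unit is no longer busy.

    Returns True if the unit became idle, False if the timeout expired first.
    """
    deadline = clock() + timeout
    while probe(unit) in BUSY_STATES:
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

A hypothetical update hook could call wait_until_idle("backup1.service") before restarting the module's containers, and skip or postpone the update when it returns False.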
Actual behavior
The module update restarts the module's containers while backup1.service is mid-operation. The restic-<module>-<pid> container is killed with SIGTERM (backup1.service: Main process exited, code=killed, status=15/TERM) and Restic cannot release its lock. A stale lock file remains in the repository locks/ directory, and every subsequent backup run for that module fails during forget --prune with:
[ERROR] restic backup failed. Command '[...restic...] forget --prune --keep-last=360' returned non-zero exit status 11.
The failure chain persists until the lock is removed manually (restic unlock). Modules with short backup times are rarely affected; large modules (e.g. mail with ~400GB of data) are hit systematically.
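Until the race itself is fixed, the failure chain could be detected and broken automatically. The sketch below is only a suggestion and makes several assumptions: the bundled Restic is 0.17 or later, where exit status 11 means that locking the repository failed; base_cmd stands for the elided Restic invocation shown in the log above; and the helper names are hypothetical. Note that restic unlock removes only locks that Restic considers stale, so a lock held by a backup that is genuinely still running should be left in place.

```python
import subprocess

# Restic >= 0.17 exits with status 11 when it fails to lock the repository.
RESTIC_EXIT_LOCKED = 11


def run_restic(cmd, runner=subprocess.run):
    """Run a Restic command and classify the result as 'ok', 'locked' or 'failed'."""
    proc = runner(cmd, capture_output=True, text=True, check=False)
    if proc.returncode == 0:
        return "ok"
    if proc.returncode == RESTIC_EXIT_LOCKED:
        return "locked"
    return "failed"


def forget_with_stale_lock_recovery(base_cmd, runner=subprocess.run):
    """Run 'forget --prune'; on a lock failure, unlock once and retry once.

    base_cmd is the (hypothetical) list of arguments that starts Restic with
    the right repository and credentials.
    """
    forget = base_cmd + ["forget", "--prune", "--keep-last=360"]
    status = run_restic(forget, runner)
    if status != "locked":
        return status
    # Remove stale locks only; no --remove-all here.
    if run_restic(base_cmd + ["unlock"], runner) != "ok":
        return "failed"
    return run_restic(forget, runner)
```

If the retry still returns "locked", another process most likely holds a live lock, and the run should be reported as failed without forcing the lock away.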
Components
- core 3.19.2 (ghcr.io/nethserver/restic:3.19.2)