Summary
Introduce a transactional “failsafe update” procedure that installs a new system image in trial mode, boots it once, runs health checks, and only promotes it to the persistent default after explicit confirmation (or successful auto-confirm). If the trial boot fails or isn’t confirmed in time, the device automatically reverts to the previously working image on the next reboot/power cycle.
Use case
Devices are often deployed where console access is impossible or costly (remote POPs, towers/rooftops, cages without OOB). A bad update can break connectivity and require physically relocating the device to recover. A failsafe, confirm-to-promote update flow minimizes downtime and truck rolls, similar in spirit to commit-confirm, but for image updates.
Additional information
Possible implementation outline
- Use GRUB environment variables for the update flow:
  - vyos_update_trial_image (image ID for next boot)
  - vyos_update_first_boot (exported to OS for checks/watchdog)
  - vyos_update_status (populated post-boot for audit)
- State machine:
  - 1) Install image (failsafe)
  - 2) Mark trial
  - 3) Boot trial image (trial flag auto-cleared)
  - 4a) Checks pass → wait for confirm/auto-confirm → promote
  - 4b) Checks fail/timeout → reboot → previous image
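The state machine above could be sketched as a small transition table. This is only an illustration: the `State`, `Event`, and `step` names are hypothetical and not part of any existing VyOS code, and the "checks pass, then wait for confirmation" path is collapsed into a single `CONFIRM` event for brevity.

```python
from enum import Enum, auto


class State(Enum):
    INSTALLED = auto()      # 1) image installed, persistent default unchanged
    TRIAL_MARKED = auto()   # 2) next boot marked as a trial
    TRIAL_BOOTED = auto()   # 3) trial image running, trial flag already cleared
    PROMOTED = auto()       # 4a) trial image is the new persistent default
    REVERTED = auto()       # 4b) previous image is running again


class Event(Enum):
    MARK_TRIAL = auto()
    BOOT = auto()
    CONFIRM = auto()        # manual confirmation or auto-confirm after checks pass
    CHECKS_FAILED = auto()
    TIMEOUT = auto()


# Hypothetical transition table; anything not listed is rejected.
TRANSITIONS = {
    (State.INSTALLED, Event.MARK_TRIAL): State.TRIAL_MARKED,
    (State.TRIAL_MARKED, Event.BOOT): State.TRIAL_BOOTED,
    (State.TRIAL_BOOTED, Event.CONFIRM): State.PROMOTED,
    (State.TRIAL_BOOTED, Event.CHECKS_FAILED): State.REVERTED,
    (State.TRIAL_BOOTED, Event.TIMEOUT): State.REVERTED,
    # An unplanned reboot or power cycle during the trial also reverts,
    # because the trial flag was cleared when the trial image booted.
    (State.TRIAL_BOOTED, Event.BOOT): State.REVERTED,
}


def step(state: State, event: Event) -> State:
    """Return the next state, or raise ValueError for an invalid transition."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"{event.name} is not valid in state {state.name}")
    return TRANSITIONS[key]
```

Keeping the transitions in one table makes it easy to see that every path out of the trial boot ends in either promotion or the previous image, with no state in which the device stays on an unconfirmed image.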
Proposed behavior
- Install the new image without changing the persistent default boot image ID in GRUB configuration.
- Let the user select the first-boot mode for the new image: auto-confirm (promote if no checks fail) or manual-confirm (wait a limited time for user confirmation, and fail if none is received).
- Mark the next boot as a trial boot of the new image by setting the vyos_update_trial_image variable to the GRUB menu entry ID of the new image. Also set the vyos_update_first_boot variable so the new system image knows that this is its first boot and that failsafe boot mode with checks must be triggered. Optionally, set vyos_update_status to something like in_progress.
- GRUB clears the vyos_update_trial_image variable during boot, so any subsequent reboot naturally returns to the previous image unless promotion occurs.
- First boot after update:
  - The system checks for the vyos_update_first_boot GRUB variable so that services and health checks know a fallback is available.
  - Built-in health checks run (at minimum: configuration load success; critical daemons up). Optional checks can include reachability tests or even user hooks.
- Confirmation/promotion:
  - If health checks pass, the administrator confirms promotion (explicit command) or an optional timer auto-confirms.
  - On confirmation, the running image becomes the persistent default for all future boots.
  - The vyos_update_status variable is set to success.
- Failure/timeout:
  - If health checks fail, a watchdog or boot hook triggers a reboot; since the trial flag was cleared at boot, the system comes back on the previous stable image.
  - If the admin does not confirm within the configurable timer, the device reboots and reverts automatically.
  - vyos_update_status is set to failed, so after the reboot the operator can see at which stage the update failed (in_progress vs. failed).
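To make the proposed behavior concrete, here is a rough simulation in which the GRUB environment block is modelled as a plain dictionary. It does not call GRUB or any VyOS API; the function names, the `"1"` value for vyos_update_first_boot, and the decision to clear that variable at the end of the trial are all assumptions made for illustration. Only the three variable names and the in_progress/success/failed status values come from the proposal.

```python
from typing import Callable, Dict, List, Tuple

GrubEnv = Dict[str, str]                 # stand-in for the GRUB environment block
Check = Tuple[str, Callable[[], bool]]   # (check name, function returning True on success)


def mark_trial(env: GrubEnv, trial_entry: str) -> None:
    """Arm a one-time trial boot without touching the persistent default."""
    env["vyos_update_trial_image"] = trial_entry
    env["vyos_update_first_boot"] = "1"  # assumed value; the proposal only names the variable
    env["vyos_update_status"] = "in_progress"


def select_boot_entry(env: GrubEnv, default_entry: str) -> str:
    """Model of the boot-time choice: use the trial entry once, then forget it."""
    return env.pop("vyos_update_trial_image", default_entry)


def failed_checks(checks: List[Check]) -> List[str]:
    """Run every health check and return the names of those that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failed.append(name)
    return failed


def finish_trial(env: GrubEnv, default_entry: str, running_entry: str,
                 failed: List[str], confirmed: bool) -> str:
    """Record the outcome and return the persistent default for future boots."""
    env.pop("vyos_update_first_boot", None)  # assumption: cleared once the trial ends
    if not failed and confirmed:
        env["vyos_update_status"] = "success"
        return running_entry
    env["vyos_update_status"] = "failed"
    return default_entry
```

In this model, `confirmed` would be true either when the administrator runs the explicit confirm command or when auto-confirm is enabled and the timer expires with no failed checks. Because `select_boot_entry` removes the trial variable as it reads it, a second call returns the old default, which mirrors the automatic revert on any unplanned reboot.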