Failsafe update procedure with trial boot, auto-revert, and explicit confirmation
Open, Normal | Public | FEATURE REQUEST

Description

Summary

Introduce a transactional “failsafe update” procedure that installs a new system image in trial mode, boots it once, runs health checks, and only promotes it to the persistent default after explicit confirmation (or successful auto-confirm). If the trial boot fails or isn’t confirmed in time, the device automatically reverts to the previously working image on the next reboot/power cycle.

Use case

Devices are often deployed where console access is impossible or costly (remote POPs, towers/rooftops, cages without OOB). A bad update can break connectivity and require physically relocating the device to recover. A failsafe, confirm-to-promote update flow minimizes downtime and truck rolls, similar in spirit to commit-confirm, but for image updates.

Additional information

Possible implementation outline

  • Use GRUB environment variables for the update flow:
    • vyos_update_trial_image (image ID for next boot)
    • vyos_update_first_boot (exported to OS for checks/watchdog)
    • vyos_update_status (populated post-boot for audit)
  • State machine:
    • 1) Install image (failsafe)
    • 2) Mark trial
    • 3) Boot trial image (flag auto-cleared)
    • 4a) Checks pass → wait for confirm/auto-confirm → promote
    • 4b) Checks fail/timeout → reboot → previous image
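
The state machine above can be sketched as a small transition table. This is only an illustration of the outlined flow, not a proposed implementation: the state names, the event names ("mark_trial", "checks_passed", and so on), and the step/run helpers are hypothetical and do not exist in VyOS.

```python
from enum import Enum


class State(Enum):
    INSTALLED = "installed"                # 1) image installed, default unchanged
    TRIAL_MARKED = "trial_marked"          # 2) next boot marked as trial
    TRIAL_BOOTED = "trial_booted"          # 3) trial image running, flag cleared
    AWAITING_CONFIRM = "awaiting_confirm"  # 4a) checks passed, not yet promoted
    PROMOTED = "promoted"                  # trial image is the persistent default
    REVERTED = "reverted"                  # 4b) back on the previous image


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.INSTALLED, "mark_trial"): State.TRIAL_MARKED,
    (State.TRIAL_MARKED, "reboot"): State.TRIAL_BOOTED,
    (State.TRIAL_BOOTED, "checks_passed"): State.AWAITING_CONFIRM,
    (State.TRIAL_BOOTED, "checks_failed"): State.REVERTED,
    # Any reboot before promotion lands on the previous image,
    # because the trial flag was already cleared at boot.
    (State.TRIAL_BOOTED, "reboot"): State.REVERTED,
    (State.AWAITING_CONFIRM, "confirm"): State.PROMOTED,  # manual or auto-confirm
    (State.AWAITING_CONFIRM, "timeout"): State.REVERTED,
    (State.AWAITING_CONFIRM, "reboot"): State.REVERTED,
}


def step(state, event):
    """Return the next state, or raise ValueError for an invalid event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in state {state.name}"
        ) from None


def run(events, start=State.INSTALLED):
    """Apply a sequence of events starting from the given state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

For example, run(["mark_trial", "reboot", "checks_passed", "confirm"]) ends in State.PROMOTED, while replacing the last event with "timeout" ends in State.REVERTED. PROMOTED and REVERTED have no outgoing transitions, which mirrors the idea that each update attempt ends in exactly one of the two outcomes.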

Proposed behavior

  1. Install the new image without changing the persistent default boot image ID in GRUB configuration.
  2. Let the user select the first-boot mode for the new image: auto-confirm (promote if no checks fail) or manual-confirm (wait a limited time for user confirmation, and fail if none is received).
  3. Mark the next boot as a trial boot of the new image by setting the vyos_update_trial_image variable to the GRUB menu entry ID of the new image. Also set the vyos_update_first_boot variable so the new system image knows that this is its first boot and that failsafe boot mode with checks must be triggered. Optionally, set vyos_update_status to a value such as in_progress.
  4. GRUB clears the vyos_update_trial_image variable during boot, so any subsequent reboot naturally returns to the previous image unless promotion occurs.
  5. First boot after update:
    • The system checks for the vyos_update_first_boot GRUB variable so that services and health checks know a fallback is available.
    • Built-in health checks run (at minimum: configuration load success; critical daemons up). Optional checks can include reachability tests or even user hooks.
  6. Confirmation/promotion:
    • If health checks pass, the administrator confirms promotion (explicit command) or an optional timer auto-confirms.
    • On confirmation, the running image becomes the persistent default for all future boots.
    • The vyos_update_status variable is set to success.
  7. Failure/timeout:
    • If health checks fail, a watchdog or boot hook triggers a reboot; since the trial flag was cleared at boot, the system comes back on the previous stable image.
    • If the admin does not confirm within the configurable timer, the device reboots and reverts automatically.
    • vyos_update_status is set to failed, so after the reboot the operator can see at which stage the update failed (in_progress vs. failed).
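
Steps 5 to 7 can be sketched as a single decision function that the trial image could evaluate after boot. This is a minimal sketch under stated assumptions: the function names, the "auto"/"manual" mode strings, the action names, and the shape of the checks mapping are all hypothetical. It also assumes that the GRUB variables are read as name=value lines (the format printed by grub-editenv list), that any non-empty vyos_update_first_boot value marks a trial boot, and that auto-confirm promotes as soon as all checks pass.

```python
def parse_grubenv(listing):
    """Parse name=value lines, as printed by `grub-editenv list`."""
    env = {}
    for line in listing.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without "="
            env[name.strip()] = value.strip()
    return env


def first_boot_action(env, checks, mode, confirmed, elapsed, timeout):
    """Decide what the running image should do next.

    Returns (action, status): action is "none", "wait", "promote" or
    "reboot"; status is the new vyos_update_status value, or None to
    leave the variable unchanged.
    """
    if not env.get("vyos_update_first_boot"):
        return ("none", None)        # ordinary boot, no trial in progress
    if not all(checks.values()):
        return ("reboot", "failed")  # step 7: a health check failed
    if mode == "auto" or confirmed:
        return ("promote", "success")  # step 6: auto- or manual confirm
    if elapsed >= timeout:
        return ("reboot", "failed")  # step 7: confirmation timer expired
    return ("wait", None)            # manual mode, still inside the window
```

A caller would then act on the result: on "promote" it would make the running image the persistent default, and on "reboot" it would restart the device, which returns to the previous image because the trial variable was cleared at boot. Writing the status back could use grub-editenv with its set command, although where the environment block lives on a VyOS system is not specified here.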

Details

Version: -
Is it a breaking change? Stricter validation
Issue type: Feature (new functionality)

Event Timeline

Unknown Object (User) triaged this task as Normal priority. Sep 9 2025, 9:40 AM