Page MenuHomeVyOS Platform

Upgrade bug
Needs reporter action, HighPublicBUG

Description

I just upgraded from 1.4-rolling-202308050917 to 1.4-rolling-202308140557 and got the bellow trace during first boot.

[   26.123600] vyos-router[1158]: Waiting for NICs to settle down: settled in 0sec..
[   34.056949] vyos-router[1158]: Mounting VyOS Config...done.
[   34.359348] vyos-router[1158]: Starting VyOS router: migrate
[   34.360902] vyos-router[1668]: Traceback (most recent call last):
[   34.362330] vyos-router[1668]:   File "/usr/libexec/vyos/run-config-migration.py", line 86, in <module>
[   34.364295] vyos-router[1668]:     main()
[   34.365204] vyos-router[1668]:   File "/usr/libexec/vyos/run-config-migration.py", line 69, in main
[   34.367113] vyos-router[1668]:     virtual_migration.run()
[   34.368374] vyos-router[1668]:   File "/usr/lib/python3/dist-packages/vyos/migrator.py", line 213, in run
[   34.370314] vyos-router[1668]:     cfg_versions = self.read_config_file_versions()
[   34.371879] vyos-router[1668]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   34.373496] vyos-router[1668]:   File "/usr/lib/python3/dist-packages/vyos/migrator.py", line 65, in read_config_file_versions
[   34.375834] vyos-router[1668]:     cfg_versions = component_version.from_file(cfg_file, vintage='vyos')
[   34.377857] vyos-router[1668]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^
[   34.379886] vyos-router[1668]:   File "/usr/lib/python3/dist-packages/vyos/component_version.py", line 80, in from_file
[   34.382112] vyos-router[1668]:     version_dict = from_string(line_in_config, vintage=vintage)
[   34.384288] vyos-router[1668]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   34.386269] vyos-router[1668]:   File "/usr/lib/python3/dist-packages/vyos/component_version.py", line 57, in from_string
[   34.388781] vyos-router[1668]:     raise ValueError(f"malformed configuration string: {string_line}")
[   34.390966] vyos-router[1668]: ValueError: malformed configuration string: // vyos-config-version: "bgp@4:broadc
[   64.862720] vyos-router[1158]:  configure.
[   65.370559] vyos-config[1397]: Configuration success

I backed up my config.boot right before the reboot. That file's last three rows looked like this;

// Warning: Do not remove the following line.
// vyos-config-version: "bgp@4:broadcast-relay@1:cluster@1:config-management@1:conntrack@3:conntrack-sync@2:container@1:dhcp-relay@2:dhcp-server@6:dhcpv6-server@1:dns-dynamic@1:dns-forwarding@4:firewall@10:flow-accounting@1:https@4:ids@1:interfaces@30:ipoe-server@1:ipsec@12:isis@3:l2tp@4:lldp@1:mdns@1:monitoring@1:nat@5:nat66@1:ntp@3:openconnect@2:ospf@2:policy@5:pppoe-server@6:pptp@2:qos@2:quagga@11:rip@1:rpki@1:salt@1:snmp@3:ssh@2:sstp@4:system@26:vrf@3:vrrp@4:vyos-accel-ppp@2:wanloadbalance@3:webproxy@2"
// Release version: 1.4-rolling-202308050917

And after reboot the last three lines of the file config.boot.2023-08-15-184048.pre-migration looks like;

// Warning: Do not remove the following line.
// vyos-config-version: "bgp@4:broadc

config.boot seem to have been corrected after the first save.

Details

Difficulty level
Unknown (require assessment)
Version
1.4-rolling-202308140557
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Could you share the full configuration ? so we can analyze what is the source of this problem .

thanks

I have attached both files.

Another update. I noticed that all firewall configuration was gone (apart from the groups) after a reboot.

I copied my backup in place over the faulty config.boot.

sudo cp -a /config/config.boot-pre-fw /config/config.boot

After another reboot it seem to have converted the firewall-rules correctly (at least they exist), and without any python-tracebacks.

I couldn't open those files, but it can be related our firewall refactor :

https://vyos.dev/T5160

try to upgrade the latest rolling-release , the migration-scripts should automatically migrate to the new configuration.

Both attached files seem to be downloadable (but now viewable in browser) from the download-link in the upper right corner after clicking a file.

I do have another identically configured (hardware) router that I haven't tried to upgrade yet (rtr1). We'll see how that upgrade will work. The one that got the python traceback was running in a KVM.

While I started writing this reply, I began preparing the other router (rtr1) for a upgrade by making a copy of /config/config.boot, and noticed a very disturbing thing;

twan@rtr1:/config$ ls -l config.boot*
-rwxrwxr-x 1 root vyattacfg 15402 Aug 19 14:58 config.boot
-rwxrwxr-x 1 root vyattacfg 13213 Aug  5 15:45 config.boot.2023-08-05-151728.pre-migration
-rwxrwxr-x 1 twan vyattacfg 13218 Aug 19 14:58 config.boot-pre-fw-upgrade
twan@rtr1:/config$ cp -a config.boot config.boot-pre-fw-upgrade
twan@rtr1:/config$ ls -l config.boot*
-rwxrwxr-x 1 root vyattacfg 15402 Aug 19 14:58 config.boot
-rwxrwxr-x 1 root vyattacfg 13213 Aug  5 15:45 config.boot.2023-08-05-151728.pre-migration
-rwxrwxr-x 1 twan vyattacfg 13218 Aug 19 14:58 config.boot-pre-fw-upgrade

Notice the file-size on the copied file compared with the original.

Note; Copying the file to /tmp does not truncate the file.

The copied file is chopped of in the middle of the configuration;

twan@rtr1:/config$ diff -u config.boot config.boot-pre-fw-upgrade
--- config.boot 2023-08-19 14:58:14.491396453 +0200
+++ config.boot-pre-fw-upgrade  2023-08-19 14:58:14.491396453 +0200
@@ -534,47 +534,4 @@
 }
 system {
     config-management {
-        commit-revisions 100
-    }
-    conntrack {
-    }
-    console {
-        device usb0b4p1.0 {
-            speed 115200
-        }
-    }
-    domain-name home.xxxxx.yy
-    host-name rtr1
-    login {
-        user twan {
-            authentication {
-                encrypted-password $6$rounds=656000$XXXXXXXXXX
-                public-keys twan@host {
-                    key XXXX
-                    type ssh-rsa
-                }
-            }
-        }
-    }
-    name-server 127.0.0.1
-    option {
-        ctrl-alt-delete ignore
-        keyboard-layout se-latin1
-    }
-    syslog {
-        global {
-            facility all {
-                level info
-            }
-            facility local7 {
-                level debug
-            }
-        }
-    }
-    time-zone Europe/Stockholm
-}
-
-
-// Warning: Do not remove the following line.
-// vyos-config-version: "bgp@4:broadcast-relay@1:cluster@1:config-management@1:conntrack@3:conntrack-sync@2:container@1:dhcp-relay@2:dhcp-server@6:dhcpv6-server@1:dns-dynamic@1:dns-forwarding@4:firewall@10:flow-accounting@1:https@4:ids@1:interfaces@30:ipoe-server@1:ipsec@12:isis@3:l2tp@4:lldp@1:mdns@1:monitoring@1:nat@5:nat66@1:ntp@3:openconnect@2:ospf@2:policy@5:pppoe-server@6:pptp@2:qos@2:quagga@11:rip@1:rpki@1:salt@1:snmp@3:ssh@2:sstp@4:system@26:vrf@3:vrrp@4:vyos-accel-ppp@2:wanloadbalance@3:webproxy@2"
-// Release version: 1.4-rolling-202308050917
+        commit-revisi
\ No newline at end of file

The filesystem is far from full.

twan@rtr1:/config$ lsblk -f
NAME        FSTYPE   FSVER LABEL       UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
loop0       squashfs 4.0                                                          0   100% /usr/lib/live/mount/rootfs/1.4-rolling-202308050917.squashfs
nvme0n1
├─nvme0n1p1
├─nvme0n1p2 vfat     FAT32 EFI         9278-E04D
└─nvme0n1p3 ext4     1.0   persistence f8ee07b9-4090-4221-b2f4-d811ec50875b  107.7G     2% /usr/lib/live/mount/persistence/boot/1.4-rolling-202308050917/grub
                                                                                           /boot/grub
                                                                                           /boot
                                                                                           /opt/vyatta/etc/config
                                                                                           /usr/lib/live/mount/persistence

fsck tells me something is funky with the fs, and I'll try to fix it by rebooting with fsck.mode=force fsck.repair=yes added to grub.

twan@rtr1:/config$ sudo fsck.ext4 -fn /dev/nvme0n1p3
e2fsck 1.47.0 (5-Feb-2023)
Warning!  /dev/nvme0n1p3 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (29610574, counted=29808972).
Fix? no

Free inodes count wrong (7768478, counted=7768432).
Fix? no

persistence: 30306/7798784 files (0.3% non-contiguous), 1582339/31192913 blocks

I noticed similar fsck-funkyness on the upgraded router (rtr2);

twan@rtr2:~$ sudo fsck.ext4 -nf /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
Warning!  /dev/sda3 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (1878186, counted=1878091).
Fix? no

Free inodes count wrong (632137, counted=632136).
Fix? no

persistence: 6839/638976 files (1.5% non-contiguous), 677457/2555643 blocks

However, I fail to get the disk checked & repaired on boot.
I tried adding fsck.mode=force fsck.repair=yes to grub, and also setting the check-interval of the disk with tune2fs -i 1w /dev/sda3. But none of these measures have forced the disk to be checked/repaired during boot.

twan@rtr2:~$ tune2fs -l /dev/sda3
tune2fs 1.47.0 (5-Feb-2023)
...
Filesystem created:       Mon Jul 31 18:07:01 2023
Last mount time:          Sat Aug 19 16:25:52 2023
Last write time:          Sat Aug 19 16:24:54 2023
Mount count:              33
Maximum mount count:      -1
Last checked:             Mon Jul 31 18:07:01 2023
Check interval:           604800 (1 week)
Next check after:         Mon Aug  7 18:07:01 2023
...

Any suggestions to how I can perform a repair on the disk?

I have never seen such weird behavior on a (ext4) filesystem before - even if I have tortured a lot of them.

To me the weird/scary thing is that both the original issue and now the copy-issue is similar in the way that the config.boot file got truncated in weird ways. And on completely different machines; first on a VM (rtr2) and then now the other one running natively on it's own hardware (rtr1).

I have created this task regarding the fsck issues (fsck does not run during boot): https://vyos.dev/T5498

Regarding your current issue you need to have fsck runned on your storage to fix whatever is broken.

Workaround to run it would be to download something like current ubuntu iso or such, put that onto a usb-drive using Etcher and then boot from that.

This way you can unmount the fixed storage (since its booted on the iso as ramdrive) and run something like:

fsck.ext4 -f -y /dev/sdaX

What kind of storage is it (vendor/model)?

I managed to enter initramfs on the machine running in a VM by appending break to grub. From there I manually ran a fs check, which didn't show any issues.

(initramfs) fsck.ext4 -pf /dev/sda3
persistence: 6840/638976 files (1.5% non-contiguous), 682208/2555643 blocks
(initramfs) fsck.ext4 -ff /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
persistence: 6840/638976 files (1.5% non-contiguous), 682208/2555643 blocks
(initramfs) fsck.ext4 -nf /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
persistence: 6840/638976 files (1.5% non-contiguous), 682208/2555643 blocks

And after a normal boot, fsck still complains about block & inodes count.

twan@rtr2:~$ fsck.ext4 -nf /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
Warning!  /dev/sda3 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (1873273, counted=1873199).
Fix? no

Free inodes count wrong (632135, counted=632134).
Fix? no

persistence: 6841/638976 files (1.5% non-contiguous), 682370/2555643 blocks

I haven't tried doing the same thing on my physical machine (rtr2) yet (need to attach screen & kbd), but I assume it will have similar outcome. I'll see if I can do a proper fsck on that machine later today.

I really don't think this is a fs (ext4) issue. I have seen many ext4 filesystems having much worse issues than this without showing this bad/weird/scary behavior.
I don't have enough knowledge about overlayfs and if that may be at fault here.

This issue really doesn't give me the warm fuzzy feelings. :/

Looks like you would need some more extensive checking of that partition.

Something like:

fsck.ext4 -ycf /dev/sda3

I ran what you suggested, but it still shows wrong block/inode count right after boot.

(initramfs) fsck.ext4 -ycf /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
sh: badblocks: not found
persistence: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

persistence: ***** FILE SYSTEM WAS MODIFIED *****
persistence: 6842/638976 files (1.5% non-contiguous), 684682/2555643 blocks
(initramfs) fsck.ext4 -nf /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
persistence: 6842/638976 files (1.5% non-contiguous), 684682/2555643 blocks
(initramfs) exit

...

twan@rtr2:~$ fsck.ext4 -nf /dev/sda3
e2fsck 1.47.0 (5-Feb-2023)
Warning!  /dev/sda3 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (1870961, counted=1870888).
Fix? no

Free inodes count wrong (632134, counted=632133).
Fix? no

persistence: 6842/638976 files (1.5% non-contiguous), 684682/2555643 blocks

Edit:
And to clarify, this last fsck was run on a mounted file system, which one would expect to return inode/block count errors.

Viacheslav changed the task status from Open to Needs reporter action.Tue, Feb 13, 3:45 PM
Viacheslav added a subscriber: Viacheslav.

@twan Could you try with the 1.4-rc3 image?