Page MenuHomeVyOS Platform

Modifying configuration in the "interfaces" section from VRRP transition scripts causes configuration lockup and high CPU utilization
Closed, ResolvedPublicBUG

Description

  1. Bug. After executing transition scripts by VRRP, the configuration is locked and I can't change any value.
  2. Bug. Timeouts on MASTER (show vrrp) clear every ~3 sec and high cpu utilization on the master. 80-100 %

To reproduce

Router01:

set interfaces ethernet eth1 vif 99 address '10.0.0.254/24'

set high-availability vrrp group lan hello-source-address '10.0.0.254'
set high-availability vrrp group lan interface 'eth1.99'
set high-availability vrrp group lan no-preempt
set high-availability vrrp group lan peer-address '10.0.0.253'
set high-availability vrrp group lan priority '250'
set high-availability vrrp group lan transition-script backup '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script fault '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script master '/config/scripts/vrrp-trans-master.sh'
set high-availability vrrp group lan virtual-address '10.0.0.1/24'
set high-availability vrrp group lan vrid '99'

Router02:

set interfaces ethernet eth1 vif 99 address '10.0.0.253/24'

set high-availability vrrp group lan hello-source-address '10.0.0.253'
set high-availability vrrp group lan interface 'eth1.99'
set high-availability vrrp group lan no-preempt
set high-availability vrrp group lan peer-address '10.0.0.254'
set high-availability vrrp group lan transition-script backup '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script fault '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script master '/config/scripts/vrrp-trans-master.sh'
set high-availability vrrp group lan virtual-address '10.0.0.1/24'
set high-availability vrrp group lan vrid '99'

Transition script fail

vyos@r1-roll# sudo cat /config/scripts/vrrp-trans-fail.sh 
#!/bin/vbash

source /opt/vyatta/etc/functions/script-template

configure
set interfaces ethernet eth1 vif 99 description BACKUP_by_script

commit

exit

Transition master

#!/bin/vbash

source /opt/vyatta/etc/functions/script-template

configure
set interfaces ethernet eth1 vif 99 description MASTER_by_script
commit

exit

Router 1 in master state.

vyos@r1-roll:~$ show int
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth0             192.168.122.11/24                 u/u  
eth1             -                                 u/u  
eth1.99          10.0.0.254/24                     u/u  MASTER_by_script
                 10.0.0.1/24                            
lo               127.0.0.1/8                       u/u  
                 ::1/128                                
vyos@r1-roll:~$ 
vyos@r1-roll:~$ show vrrp
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   2s

Trying to add or del configuration

vyos@r1-roll# delete interfaces ethernet eth1 vif 99 description 

  Failed to delete specified config path
  Delete failed

[edit]
vyos@r1-roll# 

vyos@r1-roll# delete interfaces ethernet eth1 vif 99 

  Failed to delete specified config path
  Delete failed

[edit]
vyos@r1-roll#

vyos@r1-roll# set interfaces ethernet eth0 description FOO

  Set failed

[edit]
vyos@r1-roll#

Permissions

vyos@r1-roll# sudo ls -la /config/scripts/
total 20
drwxrwsr-x 2 root vyattacfg 4096 Jul 28 16:05 .
drwxrwsr-x 8 root vyattacfg 4096 Jul 28 15:55 ..
-rwxr-xr-x 1 root vyattacfg  154 Jul 28 16:05 vrrp-trans-fail.sh
-rwxr-xr-x 1 root vyattacfg  153 Jul 28 16:05 vrrp-trans-master.sh
-rwxr-xr-x 1 root vyattacfg  230 Jul 28 01:24 vyos-postconfig-bootup.script

Second bug with timeouts

vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   1s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   6s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   2s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   3s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   2s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   1s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   3s
vyos@vyos:~$

Details

Difficulty level
Hard (possibly days)
Version
VyOS 1.3-rolling-202007280117
Why the issue appeared?
Design mistake
Is it a breaking change?
Perfectly compatible
Issue type
Bug (incorrect behavior)

Event Timeline

Viacheslav changed the task status from Open to Confirmed.Jul 28 2020, 4:17 PM
Viacheslav created this task.
Viacheslav updated the task description. (Show Details)
Viacheslav renamed this task from VRRP lock configuration to VRRP lock configuration and High CPU utilization..Jul 29 2020, 7:34 AM

The master change state every few seconds.

Jul 29 07:52:39 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-master.sh
Jul 29 07:52:40 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:40 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:40 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:41 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-fail.sh
Jul 29 07:52:42 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:42 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:42 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:43 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-master.sh
Jul 29 07:52:44 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:44 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:44 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:45 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-fail.sh
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: Warning: Failed to connect to the agentx master agent ([NIL]):
Jul 29 07:52:47 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-master.sh
Jul 29 07:52:48 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:48 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:48 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:49 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-fail.sh
Jul 29 07:52:50 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE

It seems that the problem is serious and under attention

I do not have a lab to reproduce this ATM.

My guess is that one of "configure", "set" or "commit" function is failing and causing the lock.
The return code of configure and commit should be checked to see if commit failed.

if commit fails in particular, this should be checked with "$?" and if not zero, then rollback should be run.
I have like the felling that after too many failed commit, the number of running commit is over what the backend allows causing a lock until one of them closes ... which they will never do.

Removing this line from the master prevents erroneous changes master/backup. And CPU displays normal values.

set high-availability vrrp group lan transition-script backup '/config/scripts/vrrp-trans-fail.sh'

Last Transition

$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   5m19s
thomas-mangin added a subscriber: thomas-mangin.

Ideally which interface is master/slave should be recorded and handled by VyOS so that users do not have to put some workaround like this one to know.

Changing description in a master transition script will lead to an endless loop, because of:

  1. Description change (or any other interface update) in a script trigger EthernetIf.update().
  2. EthernetIf.update() trigger a lot of interface changes:
Jul 29 14:05:36 vyos sudo[3097]:     root : TTY=ttyS0 ; PWD=/home/vyos ; USER=root ; COMMAND=/usr/bin/sh -c VYOS_TAGNODE_VALUE='eth1' /usr/libexec/vyos/conf_mode/interfaces-ethernet.py
Jul 29 14:05:36 vyos sudo[3097]: pam_unix(sudo:session): session opened for user root by vyos(uid=0)
Jul 29 14:05:36 vyos control.py[3098]: set_interface: alias,
Jul 29 14:05:36 vyos control.py[3098]: set_interface: link_detect, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: vrf,
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_cache_tmo, 30
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_filter, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_accept, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_announce, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_ignore, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp_pvlan, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_forwarding, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_accept_ra, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_autoconf, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_dad_transmits, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: mtu, 1500
Jul 29 14:05:36 vyos control.py[3098]: set_interface: alias, MASTER_by_script
Jul 29 14:05:36 vyos control.py[3098]: set_interface: link_detect, 1
Jul 29 14:05:36 vyos Keepalived_vrrp[1302]: (lan) Entering BACKUP STATE
Jul 29 14:05:36 vyos Keepalived_vrrp[1302]: (lan) sent 0 priority
Jul 29 14:05:36 vyos control.py[3098]: set_interface: vrf,
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_cache_tmo, 30
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_filter, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_accept, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_announce, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_ignore, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp_pvlan, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_forwarding, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_accept_ra, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_autoconf, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_dad_transmits, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gro, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gso, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: sg, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: tso, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ufo, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: admin_state, up
Jul 29 14:05:36 vyos Keepalived_vrrp[1302]: (lan) Entering MASTER STATE
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gro, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gso, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: sg, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: tso, off
Jul 29 14:05:37 vyos control.py[3098]: set_interface: ufo, off
Jul 29 14:05:37 vyos control.py[3098]: set_interface: admin_state, up
  1. Something from this all trigger keepalived interface reinitialization.
  2. Keepalived change VRRP state to BACKUP and then MASTER, and run transition scripts.
  3. GOTO 1.

I think we do not need to touch those interface options that have not been changed in config. This should solve the issue.

Yes, it would solve the issue ... but ... currently, we re-apply the whole interface setting, so there is no change to have the vyos and live configuration not sync'ed.
This would be lost. It is a trade-off, but it could be done. It would be however the only sub-system working that way.

(I will try to work a PR so we can discuss the details)

erkin set Issue type to Bug (incorrect behavior).Aug 29 2021, 1:41 PM
erkin removed a subscriber: Active contributors.
dmbaturin renamed this task from VRRP lock configuration and High CPU utilization. to Modifying configuration in the "interfaces" section from VRRP transition scripts causes configuration lockup and high CPU utilization.Sep 9 2021, 2:11 PM
dmbaturin changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.

Enabling debugging gives me:

Sep 18 11:34:41 LR1 vyos-configd[12958]: VyOS had an issue completing a command.
Sep 18 11:34:41 LR1 vyos-configd[12958]: We are sorry that you encountered a problem while using VyOS.
Sep 18 11:34:41 LR1 vyos-configd[12958]: There are a few things you can do to help us (and yourself):
Sep 18 11:34:41 LR1 vyos-configd[12958]: - Contact us using the online help desk if you have a subscription:
Sep 18 11:34:41 LR1 vyos-configd[12958]:   https://support.vyos.io/
Sep 18 11:34:41 LR1 vyos-configd[12958]: - Make sure you are running the latest version of VyOS available at:
Sep 18 11:34:41 LR1 vyos-configd[12958]:   https://vyos.net/get/
Sep 18 11:34:41 LR1 vyos-configd[12958]: - Consult the community forum to see how to handle this issue:
Sep 18 11:34:41 LR1 vyos-configd[12958]:   https://forum.vyos.io
Sep 18 11:34:41 LR1 vyos-configd[12958]: - Join us on Slack where our users exchange help and advice:
Sep 18 11:34:41 LR1 vyos-configd[12958]:   https://vyos.slack.com
Sep 18 11:34:41 LR1 vyos-configd[12958]: When reporting problems, please include as much information as possible:
Sep 18 11:34:41 LR1 vyos-configd[12958]: - do not obfuscate any data (feel free to contact us privately if your
Sep 18 11:34:41 LR1 vyos-configd[12958]:   business policy requires it)
Sep 18 11:34:41 LR1 vyos-configd[12958]: - and include all the information presented below
Sep 18 11:34:41 LR1 vyos-configd[12958]: Report time:      2021-09-18 11:34:41
Sep 18 11:34:41 LR1 vyos-configd[12958]: Image version:    VyOS 1.4-foo-202109161147
Sep 18 11:34:41 LR1 vyos-configd[12958]: Release train:    sagitta
Sep 18 11:34:41 LR1 vyos-configd[12958]: Built by:         [email protected]
Sep 18 11:34:41 LR1 vyos-configd[12958]: Built on:         Thu 16 Sep 2021 11:47 UTC
Sep 18 11:34:41 LR1 vyos-configd[12958]: Build UUID:       41c8d2e4-2274-4e4c-85cb-a2169000c72b
Sep 18 11:34:41 LR1 vyos-configd[12958]: Build commit ID:  06f91212e574c0
Sep 18 11:34:41 LR1 vyos-configd[12958]: Architecture:     x86_64
Sep 18 11:34:41 LR1 vyos-configd[12958]: Boot via:         installed image
Sep 18 11:34:41 LR1 vyos-configd[12958]: System type:      VMware guest
Sep 18 11:34:41 LR1 vyos-configd[12958]: Hardware vendor:  VMware, Inc.
Sep 18 11:34:41 LR1 vyos-configd[12958]: Hardware model:   VMware Virtual Platform
Sep 18 11:34:41 LR1 vyos-configd[12958]: Hardware S/N:     VMware-42 3f 67 73 77 df c4 80-42 c9 42 af ff 15 de 0b
Sep 18 11:34:41 LR1 vyos-configd[12958]: Hardware UUID:    73673f42-df77-80c4-42c9-42afff15de0b
Sep 18 11:34:41 LR1 vyos-configd[12958]: Traceback (most recent call last):
Sep 18 11:34:41 LR1 vyos-configd[12958]:   File "/usr/libexec/vyos/services/vyos-configd", line 277, in <module>
Sep 18 11:34:41 LR1 vyos-configd[12958]:     message = json.loads(msg)
Sep 18 11:34:41 LR1 vyos-configd[12958]:   File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
Sep 18 11:34:41 LR1 vyos-configd[12958]:     return _default_decoder.decode(s)
Sep 18 11:34:41 LR1 vyos-configd[12958]:   File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
Sep 18 11:34:41 LR1 vyos-configd[12958]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Sep 18 11:34:41 LR1 vyos-configd[12958]:   File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
Sep 18 11:34:41 LR1 vyos-configd[12958]:     raise JSONDecodeError("Expecting value", s, err.value) from None
Sep 18 11:34:41 LR1 vyos-configd[12958]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Sep 18 11:34:41 LR1 Report time:      2021-09-18 11:34:41
Sep 18 11:34:41 LR1 Image version:    VyOS 1.4-foo-202109161147
Sep 18 11:34:41 LR1 Release train:    sagitta
Sep 18 11:34:41 LR1 Built by:         [email protected]
Sep 18 11:34:41 LR1 Built on:         Thu 16 Sep 2021 11:47 UTC
Sep 18 11:34:41 LR1 Build UUID:       41c8d2e4-2274-4e4c-85cb-a2169000c72b
Sep 18 11:34:41 LR1 Build commit ID:  06f91212e574c0
Sep 18 11:34:41 LR1 Architecture:     x86_64
Sep 18 11:34:41 LR1 Boot via:         installed image
Sep 18 11:34:41 LR1 System type:      VMware guest
Sep 18 11:34:41 LR1 Hardware vendor:  VMware, Inc.
Sep 18 11:34:41 LR1 Hardware model:   VMware Virtual Platform
Sep 18 11:34:41 LR1 Hardware S/N:     VMware-42 3f 67 73 77 df c4 80-42 c9 42 af ff 15 de 0b
Sep 18 11:34:41 LR1 Hardware UUID:    73673f42-df77-80c4-42c9-42afff15de0b
Sep 18 11:34:41 LR1 Traceback (most recent call last):
Sep 18 11:34:41 LR1   File "/usr/libexec/vyos/services/vyos-configd", line 277, in <module>
Sep 18 11:34:41 LR1     message = json.loads(msg)
Sep 18 11:34:41 LR1   File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
Sep 18 11:34:41 LR1     return _default_decoder.decode(s)
Sep 18 11:34:41 LR1   File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
Sep 18 11:34:41 LR1     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Sep 18 11:34:41 LR1   File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
Sep 18 11:34:41 LR1     raise JSONDecodeError("Expecting value", s, err.value) from None
Sep 18 11:34:41 LR1 json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Sep 18 11:34:41 LR1 systemd[1]: vyos-configd.service: Main process exited, code=exited, status=1/FAILURE
Sep 18 11:34:41 LR1 systemd[1]: vyos-configd.service: Failed with result 'exit-code'.
Sep 18 11:34:41 LR1 systemd[1]: vyos-configd.service: Consumed 1.002s CPU time.
Sep 18 11:34:41 LR1 systemd[1]: vyos-configd.service: Scheduled restart job, restart counter is at 2.
Sep 18 11:34:41 LR1 systemd[1]: Stopped VyOS configuration daemon.
Sep 18 11:34:41 LR1 systemd[1]: vyos-configd.service: Consumed 1.002s CPU time.
Sep 18 11:34:41 LR1 systemd[1]: Started VyOS configuration daemon.
Sep 18 11:34:54 LR1 Keepalived_vrrp[12659]: Warning: Failed to connect to the agentx master agent ([NIL]):
c-po changed Difficulty level from Unknown (require assessment) to Hard (possibly days).
c-po changed Why the issue appeared? from Will be filled on close to Design mistake.

The issue is caused by:

# determine IP addresses which are assigned to the interface and build a
# list of addresses which are no longer in the dict so they can be removed
cur_addr = self.get_addr()
for addr in list_diff(cur_addr, new_addr):

When the script calls into the library - we will drop all IP addresses set on the adapter but not available in the config dict.
As keepalived/vrrp manages this virtual address - we nuke it as it's unknown to the config - this results in a system going crazy
https://github.com/vyos/vyos-1x/blob/e83a113360ba18043edcf7f70689c7042dee2b37/python/vyos/ifconfig/interface.py#L1264-L1267

c-po moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus (1.3.0-epa1) board.
c-po moved this task from Need Triage to Finished on the VyOS 1.4 Sagitta board.