
RPKI doesn't boot properly
Closed, Resolved | Public | BUG

Description

I have a config like:

rpki {
     cache routinator {
         address 192.168.100.90
         port 3323
     }
 }

and a route-map like:

route-map ebgp-transit-rpki {

    rule 10 {
        action deny
        match {
            rpki invalid
        }
    }
    rule 20 {
        action permit
        match {
            rpki notfound
        }
        set {
            local-preference 20
        }
    }
    rule 30 {
        action permit
        match {
            rpki valid
        }
        set {
            local-preference 100
        }
    }
}

After a reboot this setup doesn't work: the BGP session that has this route-map set for import comes up with 0 prefixes. It stays that way until I enter vtysh and execute rpki stop, rpki start and clear bgp *
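A minimal sketch of how the route-map above classifies a route, which may help explain the 0 prefixes. It assumes (this is not confirmed in the report) that with no cache connection none of the `match rpki` clauses match, so every route falls through to the implicit deny at the end of the route-map. The names `RULES` and `evaluate` are hypothetical and only for illustration.

```python
from typing import Optional, Tuple

# Rules mirror the ebgp-transit-rpki route-map from the report:
# (rule number, RPKI state to match, action, local-preference to set)
RULES = [
    (10, "invalid", "deny", None),
    (20, "notfound", "permit", 20),
    (30, "valid", "permit", 100),
]


def evaluate(rpki_state: Optional[str]) -> Tuple[str, Optional[int]]:
    """Return (action, local_pref) for a route with the given RPKI state.

    rpki_state is None when no validation result is available
    (assumption: this models a router with no RPKI cache connection).
    """
    for _rule, state, action, local_pref in RULES:
        if rpki_state == state:
            return action, local_pref
    # No rule matched: a route-map ends with an implicit deny.
    return "deny", None


for state in ("valid", "notfound", "invalid", None):
    print(state, evaluate(state))
```

Under that assumption, a router without a cache connection would drop every received route, which matches the symptom described here.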

Details

Version
1.3-rolling-202002161917 (but this has been going on for quite some time)
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Related Objects

Event Timeline

I can also confirm this bug with VyOS 1.2.4.

bgp-rtor-01 should receive the prefixes.
bgp-rtor-02 advertises the prefixes.

After rebooting "bgp-rtor-01" there is no connection to the RPKI server:

vyos@bgp-rtor-01:~$ show ip bgp sum
...
Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
192.168.33.2    4     203115       4       3        0    0    0 00:00:14            0

vyos@bgp-rtor-01:~$ sudo vtysh -c "show rpki cache-connection"
No connection to RPKI cache server.
vyos@bgp-rtor-01:~$

We check that bgp-rtor-02 is exporting prefixes:

vyos@bgp-rtor-02:~$ show ip bgp neighbors 192.168.33.1 advertised-routes 
BGP table version is 8, local router ID is 192.168.33.2, vrf id 0
Default local pref 100, local AS 203115
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 100.64.0.0/24    0.0.0.0                  0         32768 i
*> 100.64.1.0/24    0.0.0.0                  0         32768 i
*> 100.64.2.0/24    0.0.0.0                  0         32768 i
*> 100.64.3.0/24    0.0.0.0                  0         32768 i
...

Total number of prefixes 8
vyos@bgp-rtor-02:~$

After rebooting bgp-rtor-01, a packet capture on the routinator server does not show any attempts to connect to it:

root@ponctrl:/home/sever# tcpdump -ntti ens20 port 3323
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens20, link-type EN10MB (Ethernet), capture size 262144 bytes

This looks to me like an FRR bug. Has anyone contacted upstream?

We saw something similar to this, but it seems like FRR eventually connected to RTRR. I think it has a timeout parameter. Is that how often (how slowly) it tries to re-establish the connection?

While testing T1874 the procedure we followed was:

  1. install 1.2.4
  2. configure
  3. observe bgpd crash after ~5 minutes
  4. upgrade to 1.2.5
  5. reboot
  6. check RPKI RTRR connection had established
  7. check BGP session had established
  8. observe no crash in bgpd
  9. celebrate

And I can confirm that after boot-up, FRR was indeed connected to an RTRR server:

vyos@test.faelix.net:~$ sudo vtysh -c "show rpki cache-connection"
Connected to group 1
rpki tcp cache 46.227.201.12 3323 pref 1

Checking from the other end, on the RTRR server:

root@slm:~# ss -an | grep 46.227.201.78
tcp                ESTAB               0                    0                                                                   46.227.201.12:3323                                                      46.227.201.78:44676

I would say that this bug might be fixed?

I tried this today with 1.3-rolling-202004180117 ...

after reboot:

$ show rpki cache-connection
No connection to RPKI cache server.

again:
vRouter1# rpki stop
vRouter1# rpki start
vRouter1# show rpki cache-connection
Connected to group 1
rpki tcp cache 192.168.100.90 3323 pref 1
vRouter1# clear bgp *

solves everything.

@primoz, I have exactly the same issue with "1.4-rolling-202103011828 (sagitta)"

erkin set Issue type to Bug (incorrect behavior). Aug 31 2021, 5:39 PM

Still reproducible on VyOS 1.3-beta-202111150443 after a reboot.

No imported routes:

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.122.11  4     203115         6         5        0    0    0 00:01:52            0        0

Restart RPKI:

r4-epa2# 
r4-epa2# show rpki cache-connection 
No connection to RPKI cache server.
r4-epa2# 

r4-epa2# rpki stop 
r4-epa2# rpki start
r4-epa2# exit

Reset the BGP peer:

vyos@r4-epa2:~$ reset ip bgp all


vyos@r4-epa2:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 192.168.122.14, local AS number 65001 vrf-id 0
BGP table version 8
RIB entries 15, using 2880 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.122.11  4     203115        10        10        0    0    0 00:00:02            8        0

I'm able to reproduce this with 1.4, using the new config structure:

rpki {
    cache 10.3.96.4 {
        port 8082
        preference 1
    }
}

The same procedure restores operation:

$ vtysh
dal-1# rpki stop
dal-1# rpki start
dal-1# clear bgp *
dal-1# exit

Has any progress on this been made? I am still having this issue on 1.4-rolling-202205250217.

Currently the only workaround I have found is to run the following commands:

vtysh -c "rpki stop"
vtysh -c "rpki start"

Hi,

Same issue on VyOS 1.4-rolling-202208240217

Also, when you set the RPKI cache address, the completion help shows the wrong description: it says "NTP server" instead of "RPKI server".

router# set protocols rpki cache ?
Possible completions:
> <x.x.x.x> IP address of NTP server
> <h:h:h:h:h:h:h:h> IPv6 address of NTP server
> <hostname> Fully qualified domain name of NTP server

I created a separate task for the descriptions: T4654

Hi,
same issue on VyOS 1.4-rolling-202212090319

After each reboot I get this:

show rpki cache-connection
No connection to RPKI cache server.

To regain a working connection I have to “touch” the rpki configuration (e.g. by changing the polling period to a random number). After committing that change everything starts to work as expected:

show rpki cache-connection
Connected to group 1
rpki tcp cache 10.42.0.3 8082 pref 1 (connected)

Chiming in here as a 'me too', on vyos-1.4-rolling-202305300317

Running into this as well on: 1.4-rolling-202307260317

Our workaround for the moment is just kicking RPKI with: vtysh -c 'rpki reset'
This could potentially be added to /config/scripts/vyos-postconfig-bootup.script but we haven't validated that yet.

[update]
Tested adding rpki reset to the bootup script as detailed above. This is a viable workaround. However, if you are using route-maps to classify routes based on RPKI status, the routes will need to be re-evaluated once RPKI is established. We found that (while ugly) putting a sleep before a route refresh works well:

vtysh -c "rpki reset" && sleep 5 && vtysh -c "clear bgp * soft in"
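As a possible refinement of the fixed `sleep 5`, the sketch below polls the cache connection and only triggers the soft refresh once FRR reports it is connected. It is untested on VyOS. The function names, the attempt count and the delay are assumptions for illustration; the vtysh commands and the "Connected to group" output string are the ones quoted in this thread.

```python
import subprocess
import time

# Marker taken from the "show rpki cache-connection" output quoted in this thread.
CONNECTED_MARKER = "Connected to group"


def vtysh(command: str) -> str:
    """Run a single vtysh command and return its standard output."""
    result = subprocess.run(
        ["vtysh", "-c", command],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def kick_rpki(run=vtysh, attempts: int = 30, delay: float = 1.0,
              sleep=time.sleep) -> bool:
    """Reset RPKI, wait for a cache connection, then refresh inbound routes.

    Returns True if the connection came up and the refresh was issued,
    False if it did not come up within attempts * delay seconds.
    """
    run("rpki reset")
    for _ in range(attempts):
        if CONNECTED_MARKER in run("show rpki cache-connection"):
            # Re-evaluate received routes against the RPKI route-map.
            run("clear bgp * soft in")
            return True
        sleep(delay)
    return False
```

Calling `kick_rpki()` from a script run at boot would take the place of the one-liner above. The `run` and `sleep` parameters exist only so the logic can be exercised without a router.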

The latest rolling release uses FRR 9.0. Could you re-test it, please?

@c-po I tried the latest rolling release, 1.4-rolling-202308060317. RPKI doesn't start automatically; one must do:

$ vtysh
$ rpki start

Then rpki starts validating prefixes.

c-po changed the task status from Open to In progress. Aug 7 2023, 9:09 PM
c-po claimed this task.

@aalmenar could you test this patch?

diff i/usr/libexec/vyos/conf_mode/protocols_rpki.py w/usr/libexec/vyos/conf_mode/protocols_rpki.py
index 035b7db05..e05103aab 100755
--- i/usr/libexec/vyos/conf_mode/protocols_rpki.py
+++ w/usr/libexec/vyos/conf_mode/protocols_rpki.py
@@ -22,6 +22,7 @@ from vyos.config import Config
 from vyos.configdict import dict_merge
 from vyos.template import render_to_string
 from vyos.utils.dict import dict_search
+from vyos.utils.process import cmd
 from vyos.xml import defaults
 from vyos import ConfigError
 from vyos import frr
@@ -95,6 +96,11 @@ def apply(rpki):
         frr_cfg.add_before(frr.default_add_before, rpki['new_frr_config'])

     frr_cfg.commit_configuration(bgp_daemon)
+
+    start_stop_cmd = 'start'
+    if not rpki: start_stop_cmd = 'stop'
+    cmd(f'vtysh -c "rpki {start_stop_cmd}"')
+
     return None

 if __name__ == '__main__':

This comment was removed by aalmenar.

@c-po

Nope, now I had to do

vtysh
rpki stop
rpki start

for it to work again....

Hi,

I was able to fix it by adding the following code to /config/scripts/vyos-postconfig-bootup.script. You can edit and save the file by running:

sudo nano /config/scripts/vyos-postconfig-bootup.script

and add:

#!/bin/vbash
vtysh -c "rpki start"

exit

But I still hope that it will be fixed in the official release.

Regards

syncer triaged this task as Normal priority. Aug 12 2023, 10:09 PM

@egoistdream

Interesting, as the above diff actually does the same thing, just a bit earlier in the boot process.

Could the error in the latest nightly be because the rpki module isn't loaded for FRR's bgpd?

It seems that the commit from the other day, which removed a duplicated configs.chroot for FRR from vyos-build, perhaps wasn't properly synced with the remaining daemons file in vyos-1x?

https://github.com/vyos/vyos-build/commit/a9a1ca3cbb0951a37de286fffb2554103b561846

Removed config in vyos-build (data/live-build-config/hooks/live/30-frr-configs.chroot):

zebra=yes
bgpd=yes
ospfd=yes
ospf6d=yes
ripd=yes
ripngd=yes
isisd=yes
pimd=no
pim6d=yes
ldpd=yes
nhrpd=no
eigrpd=yes
babeld=yes
sharpd=no
pbrd=no
bfdd=yes
staticd=yes

vtysh_enable=yes

zebra_options="-s 90000000 --daemon -A 127.0.0.1 -M snmp"
bgpd_options="--daemon -A 127.0.0.1 -M snmp -M rpki -M bmp"
ospfd_options="--daemon -A 127.0.0.1 -M snmp"
ospf6d_options="--daemon -A ::1 -M snmp"
ripd_options="--daemon -A 127.0.0.1 -M snmp"
ripngd_options="--daemon -A ::1"
isisd_options="--daemon -A 127.0.0.1 -M snmp"
pimd_options="--daemon -A 127.0.0.1"
pim6d_options="--daemon -A ::1"
ldpd_options="--daemon -A 127.0.0.1"
nhrpd_options="--daemon -A 127.0.0.1"
mgmtd_options=" --daemon -A 127.0.0.1"
eigrpd_options="--daemon -A 127.0.0.1"
babeld_options="--daemon -A 127.0.0.1"
sharpd_options="--daemon -A 127.0.0.1"
pbrd_options="--daemon -A 127.0.0.1"
staticd_options="--daemon -A 127.0.0.1"
bfdd_options="--daemon -A 127.0.0.1"

watchfrr_enable=no
valgrind_enable=no

Remaining config in vyos-1x (data/templates/frr/daemons.frr.tmpl):

zebra=yes
bgpd=yes
ospfd=yes
ospf6d=yes
ripd=yes
ripngd=yes
isisd=yes
pimd=no
pim6d=yes
ldpd=yes
nhrpd=no
eigrpd=yes
babeld=yes
sharpd=no
pbrd=no
bfdd=yes
staticd=yes

vtysh_enable=yes
zebra_options="  -s 90000000 --daemon -A 127.0.0.1
{%- if irdp is defined %} -M irdp{% endif -%}
{%- if snmp is defined and snmp.zebra is defined %} -M snmp{% endif -%}
"
bgpd_options="   --daemon -A 127.0.0.1
{%- if bmp is defined %} -M bmp{% endif -%}
{%- if snmp is defined and snmp.bgpd is defined %} -M snmp{% endif -%}
"
ospfd_options="  --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.ospfd is defined %} -M snmp{% endif -%}
"
ospf6d_options=" --daemon -A ::1
{%- if snmp is defined and snmp.ospf6d is defined %} -M snmp{% endif -%}
"
ripd_options="   --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.ripd is defined %} -M snmp{% endif -%}
"
ripngd_options=" --daemon -A ::1"
isisd_options="  --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.isisd is defined %} -M snmp{% endif -%}
"
pimd_options="  --daemon -A 127.0.0.1"
pim6d_options=" --daemon -A ::1"
ldpd_options="  --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.ldpd is defined %} -M snmp{% endif -%}
"
mgmtd_options=" --daemon -A 127.0.0.1"
nhrpd_options="  --daemon -A 127.0.0.1"
eigrpd_options="  --daemon -A 127.0.0.1"
babeld_options="  --daemon -A 127.0.0.1"
sharpd_options="  --daemon -A 127.0.0.1"
pbrd_options="  --daemon -A 127.0.0.1"
staticd_options="  --daemon -A 127.0.0.1"
bfdd_options="  --daemon -A 127.0.0.1"

watchfrr_enable=no
valgrind_enable=no

Proposed fix for vyos-1x (data/templates/frr/daemons.frr.tmpl):

zebra=yes
bgpd=yes
ospfd=yes
ospf6d=yes
ripd=yes
ripngd=yes
isisd=yes
pimd=no
pim6d=yes
ldpd=yes
nhrpd=no
eigrpd=yes
babeld=yes
sharpd=no
pbrd=no
bfdd=yes
staticd=yes

vtysh_enable=yes
zebra_options="   --daemon -A 127.0.0.1 -s 90000000
{%- if irdp is defined %} -M irdp{% endif -%}
{%- if snmp is defined and snmp.zebra is defined %} -M snmp{% endif -%}
"
bgpd_options="    --daemon -A 127.0.0.1
{%- if bmp is defined %} -M bmp{% endif -%}
{%- if rpki is defined %} -M rpki{% endif -%}
{%- if snmp is defined and snmp.bgpd is defined %} -M snmp{% endif -%}
"
ospfd_options="   --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.ospfd is defined %} -M snmp{% endif -%}
"
ospf6d_options="  --daemon -A ::1
{%- if snmp is defined and snmp.ospf6d is defined %} -M snmp{% endif -%}
"
ripd_options="    --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.ripd is defined %} -M snmp{% endif -%}
"
ripngd_options="  --daemon -A ::1"
isisd_options="   --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.isisd is defined %} -M snmp{% endif -%}
"
pimd_options="    --daemon -A 127.0.0.1"
pim6d_options="   --daemon -A ::1"
ldpd_options="    --daemon -A 127.0.0.1
{%- if snmp is defined and snmp.ldpd is defined %} -M snmp{% endif -%}
"
mgmtd_options="   --daemon -A 127.0.0.1"
nhrpd_options="   --daemon -A 127.0.0.1"
eigrpd_options="  --daemon -A 127.0.0.1"
babeld_options="  --daemon -A 127.0.0.1"
sharpd_options="  --daemon -A 127.0.0.1"
pbrd_options="    --daemon -A 127.0.0.1"
staticd_options=" --daemon -A 127.0.0.1"
bfdd_options="    --daemon -A 127.0.0.1"

watchfrr_enable=no
valgrind_enable=no

We should probably add "-M rpki" permanently to the bgpd options in FRR.
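To illustrate the difference between the conditional in the proposed template and loading the module permanently, here is a small sketch in plain Python. It is not VyOS code: `bgpd_options`, its `config` dict and the `always_rpki` flag are hypothetical stand-ins for the template variables and only mirror the `is defined` checks shown above.

```python
def bgpd_options(config: dict, always_rpki: bool = False) -> str:
    """Build the bgpd option string the way the proposed template would.

    config is a hypothetical stand-in for the template variables
    (bmp, rpki, snmp). always_rpki models loading the module
    permanently, regardless of configuration.
    """
    options = ["--daemon", "-A", "127.0.0.1"]
    if "bmp" in config:
        options += ["-M", "bmp"]
    if always_rpki or "rpki" in config:
        options += ["-M", "rpki"]
    if "bgpd" in config.get("snmp", {}):
        options += ["-M", "snmp"]
    return " ".join(options)


# With the conditional, the module is only loaded when rpki is configured.
print(bgpd_options({}))
print(bgpd_options({"rpki": {}}))
# Loading it permanently removes the dependency on the configuration.
print(bgpd_options({}, always_rpki=True))
```

With the conditional form the module is only loaded if rpki is already configured when the daemons file is rendered, whereas the permanent form does not depend on that.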