Today i stambled upon an issue on the latest rolling (Seems to be present since a bit of time) :
When using a lot of route maps in route maps, FRR seems to have issues when reloading the configuration, as it takes a few minutes, and it hangs the BGPd process, watchfrr is restarting the bgpd process, breaking the BGP.
vyos@core01.par1.infra.netaris.fr# show policy route-map | grep route-map | wc -l 26 vyos@core01.par1.infra.netaris.fr# show policy route-map | grep call | wc -l 30 vyos@core01.par1.infra.netaris.fr# show policy route-map | grep rule | wc -l 114
It seems that is somewhat linked to an existing issue on FRR that was resolved, but as VyOS generates the frr.conf itself, it wasn't fixed on VyOS side :
https://github.com/FRRouting/frr/issues/15706
example with frr.conf :
vyos@core01.par1.infra.netaris.fr:~$ cat /etc/frr/frr.conf | grep -A 8 -e '^route-map' | head -n 20 route-map HANDLE-LOCAL-PREF-COMMUNITIES permit 2 match community LOCAL-PREF-0 on-match next set comm-list LOCAL-PREF-0 delete set local-preference 0 exit ! route-map HANDLE-LOCAL-PREF-COMMUNITIES permit 3 match large-community LARGE-LOCAL-PREF-0 on-match next set large-comm-list LARGE-LOCAL-PREF-0 delete set local-preference 0 exit ! route-map HANDLE-LOCAL-PREF-COMMUNITIES permit 10 match community LOCAL-PREF-10 on-match next set comm-list LOCAL-PREF-10 delete set local-preference 10 exit
I attached the result of show policy route-map :
Here the logs when the bgpd becomes unresponsive (and then gets killed) :
May 13 15:47:59 core01.par1.infra.netaris.fr ldpd[1576]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ldpd[1576]: [G6NKK-8C6DV] end_config: VTY:0x55c72dc8ca40, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr ripngd[1552]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ripngd[1552]: [G6NKK-8C6DV] end_config: VTY:0x55fc60bbfce0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr ripd[1548]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ripd[1548]: [G6NKK-8C6DV] end_config: VTY:0x55c728954b90, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr nhrpd[1583]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr nhrpd[1583]: [G6NKK-8C6DV] end_config: VTY:0x556f9db8ca70, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr staticd[1589]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr staticd[1589]: [G6NKK-8C6DV] end_config: VTY:0x555fa9bb2710, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr zebra[1534]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr zebra[1534]: [G6NKK-8C6DV] end_config: VTY:0x55d0170cc330, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr pim6d[1567]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr pim6d[1567]: [G6NKK-8C6DV] end_config: VTY:0x55bb0a2abea0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr babeld[1564]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr babeld[1564]: [G6NKK-8C6DV] end_config: VTY:0x555ea001c7b0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr isisd[1561]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr isisd[1561]: [G6NKK-8C6DV] end_config: VTY:0x562d393775a0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr fabricd[1596]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr fabricd[1596]: [G6NKK-8C6DV] end_config: VTY:0x559117a16690, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr watchfrr[1513]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ospf6d[1558]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ospf6d[1558]: [G6NKK-8C6DV] end_config: VTY:0x55e536885960, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr mgmtd[1539]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr mgmtd[1539]: [G6NKK-8C6DV] end_config: VTY:0x556550f5f640, pending SET-CFG: 12
May 13 15:47:59 core01.par1.infra.netaris.fr ospfd[1555]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ospfd[1555]: [G6NKK-8C6DV] end_config: VTY:0x55d90feb2b60, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr bfdd[1592]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr bfdd[1592]: [G6NKK-8C6DV] end_config: VTY:0x557e0c4646d0, pending SET-CFG: 0
May 13 15:48:27 core01.par1.infra.netaris.fr bgpd[1541]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7f8e2c087500 arg=0x558dffffd970 timer r=-9.634 (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
May 13 15:48:50 core01.par1.infra.netaris.fr bgpd[1541]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:45
May 13 15:48:50 core01.par1.infra.netaris.fr bgpd[1541]: [G6NKK-8C6DV] end_config: VTY:0x558e45bfe840, pending SET-CFG: 0
May 13 15:50:22 core01.par1.infra.netaris.fr watchfrr[1513]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
May 13 15:50:22 core01.par1.infra.netaris.fr watchfrr[1513]: [YFT0P-5Q5YX] Forked background command [pid 37874]: /usr/lib/frr/watchfrr.sh restart bgpd
May 13 15:50:23 core01.par1.infra.netaris.fr bgpd[1541]: [ZW1GY-R46JE] Terminating on signal
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906426): RTR_MGR: rtr_mgr_stop()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906488): RTR Socket: rtr_stop()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906514): RTR_MGR: Calling rtr_mgr_cb with RTR_SHUTDOWN
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906539): RTR_MGR: Received RTR_SHUTDOWN callback
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906556): RTR_MGR: Group(1) status changed to: RTR_MGR_CLOSED
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906572): RTR Socket: pthread_cancel()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:907340): RTR Socket: pthread_join()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:908072): TCP Transport(45.152.70.25:3323): Socket closed
May 13 15:50:24 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:24:680282): RTR Socket: Socket shut down
May 13 15:50:24 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:24:680333): RTR_MGR: rtr_mgr_free()
May 13 15:50:24 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:24:680351): TCP Transport(45.152.70.25:3323): Freeing socket
May 13 15:50:25 core01.par1.infra.netaris.fr bgpd[1541]: [QYZDQ-4PHG5][EC 100663316] Attempting to process an I/O event but for fd: 36(8) no thread to handle this!
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [N5M5Y-J5BPG][EC 4043309121] Client 'vnc' (session id 0) encountered an error and is shutting down.
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 90: Connection reset by peer
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [N5M5Y-J5BPG][EC 4043309121] Client 'bgp' (session id 1) encountered an error and is shutting down.
May 13 15:50:42 core01.par1.infra.netaris.fr watchfrr[1513]: [HD38Q-0HBRT][EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 30: Connection reset by peer
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [N5M5Y-J5BPG][EC 4043309121] Client 'bgp' (session id 0) encountered an error and is shutting down.For now i tried to disable the watchfrr on bgpd, I'll try to see how long bgpd seems to hang (Below my setup) :
-------------------------------- VyOS Version and Package Changes -------------------------------- Version: VyOS 2025.05.13-0019-rolling Release train: current Release flavor: generic Built by: autobuild@vyos.net Built on: Tue 13 May 2025 00:19 UTC Build UUID: bcc09b0b-ac04-4a1c-9e57-be8420c579fd Build commit ID: 5891a5bd25bf65 Architecture: x86_64 Boot via: installed image System type: KVM guest Secure Boot: n/a (BIOS) Hardware vendor: OpenStack Foundation Hardware model: OpenStack Nova Hardware S/N: 89c14151-ee8a-4ba2-9604-0cf791412cb4 Hardware UUID: 89c14151-ee8a-4ba2-9604-0cf791412cb4 Copyright: VyOS maintainers and contributors vyos@core01.par1.infra.netaris.fr:~$ show bgp summary IPv4 Unicast Summary: BGP router identifier 185.171.202.73, local AS number 211615 VRF default vrf-id 0 BGP table version 1018928 RIB entries 1838875, using 224 MiB of memory Peers 4, using 94 KiB of memory Peer groups 7, using 448 bytes of memory Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc 45.152.70.4 4 211615 22 23 1018928 0 0 00:18:51 1 2 edge02 Tunnel-IP 45.152.70.7 4 211615 24 26 1018928 0 0 00:20:14 1 3 edge01 Tunnel-IP 10.255.255.1 4 211615 23 25 1018928 0 0 00:20:14 0 2 Tunnel Amsterdam01 100.64.72.28 4 212815 168952 26 1018928 0 0 00:20:14 1005691 2 Dyjix Transit Total number of neighbors 4 IPv6 Unicast Summary: BGP router identifier 185.171.202.73, local AS number 211615 VRF default vrf-id 0 BGP table version 368261 RIB entries 397107, using 48 MiB of memory Peers 2, using 47 KiB of memory Peer groups 7, using 448 bytes of memory Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc fd42:cafe:cafe:cafe::1 4 211615 61184 56101 368261 0 0 00:20:14 209229 208895 Tunnel Amsterdam01 2a10:4640:2::19 4 212815 56191 25 368261 0 0 00:20:13 208894 3 Dyjix Transit Total number of neighbors 2 vyos@core01.par1.infra.netaris.fr:~$ show system cpu CPU socket: 0 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 1 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 2 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 3 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 4 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 5 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 6 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 CPU socket: 7 CPU Vendor: GenuineIntel Model: Intel Core Processor (Broadwell, IBRS) Cores: 1 Current MHz: 1997.662 vyos@core01.par1.infra.netaris.fr:~$ show system memory Total: 31.36 GB Free: 27.35 GB Used: 4.01 GB
Custom FRR configuration :
bgp input-queue-limit 100000 (Tried to fix the crashing issue as the buffer seemed too small for my RAM)
Side notes :
I am using RPKI and VRFs
