Page MenuHomeVyOS Platform

BGPD Getting killed in nested route-map usage scenario
Open, NormalPublic

Description

Today i stambled upon an issue on the latest rolling (Seems to be present since a bit of time) :
When using a lot of route maps in route maps, FRR seems to have issues when reloading the configuration, as it takes a few minutes, and it hangs the BGPd process, watchfrr is restarting the bgpd process, breaking the BGP.

vyos@core01.par1.infra.netaris.fr# show policy route-map | grep route-map | wc -l
26
vyos@core01.par1.infra.netaris.fr# show policy route-map | grep call | wc -l
30
vyos@core01.par1.infra.netaris.fr# show policy route-map | grep rule | wc -l
114

It seems that is somewhat linked to an existing issue on FRR that was resolved, but as VyOS generates the frr.conf itself, it wasn't fixed on VyOS side :
https://github.com/FRRouting/frr/issues/15706

example with frr.conf :

vyos@core01.par1.infra.netaris.fr:~$ cat /etc/frr/frr.conf | grep -A 8 -e '^route-map' | head -n 20
route-map HANDLE-LOCAL-PREF-COMMUNITIES permit 2
 match community LOCAL-PREF-0
 on-match next
 set comm-list LOCAL-PREF-0 delete
 set local-preference 0
exit
!
route-map HANDLE-LOCAL-PREF-COMMUNITIES permit 3
 match large-community LARGE-LOCAL-PREF-0
 on-match next
 set large-comm-list LARGE-LOCAL-PREF-0 delete
 set local-preference 0
exit
!
route-map HANDLE-LOCAL-PREF-COMMUNITIES permit 10
 match community LOCAL-PREF-10
 on-match next
 set comm-list LOCAL-PREF-10 delete
 set local-preference 10
exit

I attached the result of show policy route-map :

Here the logs when the bgpd becomes unresponsive (and then gets killed) :

May 13 15:47:59 core01.par1.infra.netaris.fr ldpd[1576]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ldpd[1576]: [G6NKK-8C6DV] end_config: VTY:0x55c72dc8ca40, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr ripngd[1552]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ripngd[1552]: [G6NKK-8C6DV] end_config: VTY:0x55fc60bbfce0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr ripd[1548]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ripd[1548]: [G6NKK-8C6DV] end_config: VTY:0x55c728954b90, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr nhrpd[1583]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr nhrpd[1583]: [G6NKK-8C6DV] end_config: VTY:0x556f9db8ca70, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr staticd[1589]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr staticd[1589]: [G6NKK-8C6DV] end_config: VTY:0x555fa9bb2710, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr zebra[1534]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr zebra[1534]: [G6NKK-8C6DV] end_config: VTY:0x55d0170cc330, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr pim6d[1567]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr pim6d[1567]: [G6NKK-8C6DV] end_config: VTY:0x55bb0a2abea0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr babeld[1564]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr babeld[1564]: [G6NKK-8C6DV] end_config: VTY:0x555ea001c7b0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr isisd[1561]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr isisd[1561]: [G6NKK-8C6DV] end_config: VTY:0x562d393775a0, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr fabricd[1596]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr fabricd[1596]: [G6NKK-8C6DV] end_config: VTY:0x559117a16690, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr watchfrr[1513]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ospf6d[1558]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ospf6d[1558]: [G6NKK-8C6DV] end_config: VTY:0x55e536885960, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr mgmtd[1539]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr mgmtd[1539]: [G6NKK-8C6DV] end_config: VTY:0x556550f5f640, pending SET-CFG: 12
May 13 15:47:59 core01.par1.infra.netaris.fr ospfd[1555]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr ospfd[1555]: [G6NKK-8C6DV] end_config: VTY:0x55d90feb2b60, pending SET-CFG: 0
May 13 15:47:59 core01.par1.infra.netaris.fr bfdd[1592]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 15:47:59 core01.par1.infra.netaris.fr bfdd[1592]: [G6NKK-8C6DV] end_config: VTY:0x557e0c4646d0, pending SET-CFG: 0
May 13 15:48:27 core01.par1.infra.netaris.fr bgpd[1541]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7f8e2c087500 arg=0x558dffffd970 timer  r=-9.634     (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_io.c:152} was scheduled to pop greater than 4s ago
May 13 15:48:50 core01.par1.infra.netaris.fr bgpd[1541]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:45
May 13 15:48:50 core01.par1.infra.netaris.fr bgpd[1541]: [G6NKK-8C6DV] end_config: VTY:0x558e45bfe840, pending SET-CFG: 0
May 13 15:50:22 core01.par1.infra.netaris.fr watchfrr[1513]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
May 13 15:50:22 core01.par1.infra.netaris.fr watchfrr[1513]: [YFT0P-5Q5YX] Forked background command [pid 37874]: /usr/lib/frr/watchfrr.sh restart bgpd
May 13 15:50:23 core01.par1.infra.netaris.fr bgpd[1541]: [ZW1GY-R46JE] Terminating on signal
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906426): RTR_MGR: rtr_mgr_stop()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906488): RTR Socket: rtr_stop()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906514): RTR_MGR: Calling rtr_mgr_cb with RTR_SHUTDOWN
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906539): RTR_MGR: Received RTR_SHUTDOWN callback
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906556): RTR_MGR: Group(1) status changed to: RTR_MGR_CLOSED
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:906572): RTR Socket: pthread_cancel()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:907340): RTR Socket: pthread_join()
May 13 15:50:23 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:23:908072): TCP Transport(45.152.70.25:3323): Socket closed
May 13 15:50:24 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:24:680282): RTR Socket: Socket shut down
May 13 15:50:24 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:24:680333): RTR_MGR: rtr_mgr_free()
May 13 15:50:24 core01.par1.infra.netaris.fr frrinit.sh[1541]: (2025/05/13 15:50:24:680351): TCP Transport(45.152.70.25:3323): Freeing socket
May 13 15:50:25 core01.par1.infra.netaris.fr bgpd[1541]: [QYZDQ-4PHG5][EC 100663316] Attempting to process an I/O event but for fd: 36(8) no thread to handle this!
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [N5M5Y-J5BPG][EC 4043309121] Client 'vnc' (session id 0) encountered an error and is shutting down.
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 90: Connection reset by peer
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [N5M5Y-J5BPG][EC 4043309121] Client 'bgp' (session id 1) encountered an error and is shutting down.
May 13 15:50:42 core01.par1.infra.netaris.fr watchfrr[1513]: [HD38Q-0HBRT][EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 30: Connection reset by peer
May 13 15:50:42 core01.par1.infra.netaris.fr zebra[1534]: [N5M5Y-J5BPG][EC 4043309121] Client 'bgp' (session id 0) encountered an error and is shutting down.

For now i tried to disable the watchfrr on bgpd, I'll try to see how long bgpd seems to hang (Below my setup) :

--------------------------------
VyOS Version and Package Changes
--------------------------------
Version:          VyOS 2025.05.13-0019-rolling
Release train:    current
Release flavor:   generic

Built by:         autobuild@vyos.net
Built on:         Tue 13 May 2025 00:19 UTC
Build UUID:       bcc09b0b-ac04-4a1c-9e57-be8420c579fd
Build commit ID:  5891a5bd25bf65

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest
Secure Boot:      n/a (BIOS)

Hardware vendor:  OpenStack Foundation
Hardware model:   OpenStack Nova
Hardware S/N:     89c14151-ee8a-4ba2-9604-0cf791412cb4
Hardware UUID:    89c14151-ee8a-4ba2-9604-0cf791412cb4

Copyright:        VyOS maintainers and contributors
vyos@core01.par1.infra.netaris.fr:~$ show bgp summary 

IPv4 Unicast Summary:
BGP router identifier 185.171.202.73, local AS number 211615 VRF default vrf-id 0
BGP table version 1018928
RIB entries 1838875, using 224 MiB of memory
Peers 4, using 94 KiB of memory
Peer groups 7, using 448 bytes of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
45.152.70.4     4     211615        22        23  1018928    0    0 00:18:51            1        2 edge02 Tunnel-IP
45.152.70.7     4     211615        24        26  1018928    0    0 00:20:14            1        3 edge01 Tunnel-IP
10.255.255.1    4     211615        23        25  1018928    0    0 00:20:14            0        2 Tunnel Amsterdam01
100.64.72.28    4     212815    168952        26  1018928    0    0 00:20:14      1005691        2 Dyjix Transit

Total number of neighbors 4

IPv6 Unicast Summary:
BGP router identifier 185.171.202.73, local AS number 211615 VRF default vrf-id 0
BGP table version 368261
RIB entries 397107, using 48 MiB of memory
Peers 2, using 47 KiB of memory
Peer groups 7, using 448 bytes of memory

Neighbor               V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
fd42:cafe:cafe:cafe::1 4     211615     61184     56101   368261    0    0 00:20:14       209229   208895 Tunnel Amsterdam01
2a10:4640:2::19        4     212815     56191        25   368261    0    0 00:20:13       208894        3 Dyjix Transit

Total number of neighbors 2
vyos@core01.par1.infra.netaris.fr:~$ show system cpu 
CPU socket: 0
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 1
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 2
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 3
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 4
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 5
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 6
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662

CPU socket: 7
CPU Vendor:       GenuineIntel
Model:            Intel Core Processor (Broadwell, IBRS)
Cores:            1
Current MHz:      1997.662
vyos@core01.par1.infra.netaris.fr:~$ show system memory 
Total: 31.36 GB
Free:  27.35 GB
Used:  4.01 GB

Custom FRR configuration :
bgp input-queue-limit 100000 (Tried to fix the crashing issue as the buffer seemed too small for my RAM)

Side notes :
I am using RPKI and VRFs

Details

Version
2025.05.13-0019-rolling
Is it a breaking change?
Perfectly compatible
Issue type
Bug (incorrect behavior)

Event Timeline

Hanarion triaged this task as Normal priority.
Hanarion created this object in space S1 VyOS Public.

Screenshot_20250513_195714.jpg (766×1 px, 125 KB)

After waiting more than 1 hour, no config reload happened so i trigged one by removing a unused peer group, it triggered, and as i expected, after almost 3 minutes, it finished without flapping the sessions

Below the logs :

May 13 17:52:43 core01.par1.infra.netaris.fr ospfd[1544]: [G6NKK-8C6DV] end_config: VTY:0x561429f72d40, pending SET-CFG: 0
May 13 17:52:43 core01.par1.infra.netaris.fr watchfrr[1501]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 17:52:43 core01.par1.infra.netaris.fr isisd[1550]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 17:52:43 core01.par1.infra.netaris.fr isisd[1550]: [G6NKK-8C6DV] end_config: VTY:0x55f598a0b5a0, pending SET-CFG: 0
May 13 17:52:43 core01.par1.infra.netaris.fr fabricd[1584]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 17:52:43 core01.par1.infra.netaris.fr fabricd[1584]: [G6NKK-8C6DV] end_config: VTY:0x55f5a6f1c7a0, pending SET-CFG: 0
May 13 17:52:43 core01.par1.infra.netaris.fr bfdd[1581]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
May 13 17:52:43 core01.par1.infra.netaris.fr bfdd[1581]: [G6NKK-8C6DV] end_config: VTY:0x55a2bf0dd600, pending SET-CFG: 0
May 13 17:52:50 core01.par1.infra.netaris.fr bgpd[1530]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x555fb8a4f160 arg=0x555f59fef650 timer  r=-7.274     bgp_announce_route_timer_expired() &paf->t_announce_route from ../bgpd/bgp_route.c:5678} was scheduled to pop greater than 4s ago
May 13 17:53:19 core01.par1.infra.netaris.fr bgpd[1530]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x555fb8a4f160 arg=0x555f59fc5600 timer  r=-5.956     (bgp_generate_updgrp_packets)() (&connection->t_generate_updgrp_packets) from ../bgpd/bgp_fsm.c:977} was scheduled to pop greater than 4s ago
May 13 17:53:41 core01.par1.infra.netaris.fr bgpd[1530]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:43
May 13 17:53:41 core01.par1.infra.netaris.fr bgpd[1530]: [G6NKK-8C6DV] end_config: VTY:0x555f5a0e9680, pending SET-CFG: 0
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:064405): RTR Socket: Serial Notify received (2229)
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:064473): RTR Socket: sending serial query, SN: 2228
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:064546): RTR_MGR: Group(1) status changed to: RTR_MGR_ESTABLISHED
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:064560): RTR Socket: State: RTR_SYNC
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:082213): RTR Socket: Cache Response PDU received
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:122922): RTR Socket: EOD PDU received.
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:122958): RTR Socket: New interval values: expire_interval:7200, refresh_interval:975, retry_interval:600
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:123056): RTR Socket: v4 prefixes added
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:123106): RTR Socket: v6 prefixes added
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:123117): RTR Socket: spki data added
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:123127): RTR Socket: Sync successful, received 13 Prefix PDUs, 0 Router Key PDUs, session_id: 14206, SN: 2229
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:123176): RTR Socket: State: RTR_ESTABLISHED
May 13 17:55:13 core01.par1.infra.netaris.fr frrinit.sh[1530]: (2025/05/13 17:55:13:123188): RTR Socket: waiting 975 sec. till next sync
May 13 17:56:06 core01.par1.infra.netaris.fr bgpd[1530]: [HKQ2F-8D0MY][EC 100663315] Thread Starvation: {(thread *)0x7f25b270ee80 arg=0x555f59fef1a0 timer  r=-145.240   bgp_announce_route_timer_expired() &paf->t_announce_route from ../bgpd/bgp_route.c:5678} was scheduled to pop greater than 4s ago
May 13 17:56:13 core01.par1.infra.netaris.fr watchfrr[1501]: [S35NY-PNJ2Y] bgpd: slow echo response finally received after 148.524919 seconds
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 41: Broken pipe
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 41, closing
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 41: Broken pipe
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 41, closing
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 41: Broken pipe
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 41, closing
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 41: Broken pipe
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 41, closing
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 41: Broken pipe
May 13 17:56:13 core01.par1.infra.netaris.fr bgpd[1530]: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 41, closing
May 13 17:56:13 core01.par1.infra.netaris.fr watchfrr[1501]: [WFP93-1D146] configuration write completed with exit code 0
May 13 17:56:18 core01.par1.infra.netaris.fr watchfrr[1501]: [JWX4V-MVV6K] bgpd state -> up : echo response received after 0.000282 seconds

A workaround may be to allow the user to set the watchfrr_options in /etc/frr/daemons to add a -T 500 for example for watchfrr to wait 500s before trying to restart bgpd