When using FRR with ECMP, one can end up in a situation where routes are suddenly dropped by FRR (Zebra). The cause is a race condition between FRR and the Linux kernel in how nexthop groups are handled when one of the ECMP paths, for whatever reason, is no longer available.
The situation can look like this in the log:
2023/07/26 10:26:30 ZEBRA: [HSYZM-HV7HF] Extended Error: Can not replace a nexthop with a nexthop group.
2023/07/26 10:26:30 ZEBRA: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Invalid argument, type=RTM_NEWNEXTHOP(104), seq=100352827, pid=2475708348
2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (1769049[19431932/19431933]) into the kernel
2023/07/26 10:26:30 ZEBRA: [HSYZM-HV7HF] Extended Error: Can not replace a nexthop with a nexthop group.
2023/07/26 10:26:30 ZEBRA: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Invalid argument, type=RTM_NEWNEXTHOP(104), seq=100352851, pid=2475708348
2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (1769049[19431932/19431933]) into the kernel
2023/07/26 10:26:34 ZEBRA: [SWQK6-6JY63][EC 4043309074] 132:1263:10.59.14.183/32: Failed to enqueue dataplane install
2023/07/26 10:27:10 BGP: [KTE2S-GTBDA][EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 492212 in VRF 196
2023/07/26 10:27:32 ZEBRA: [SWQK6-6JY63][EC 4043309074] 19:1468:10.115.5.246/32: Failed to enqueue dataplane install
2023/07/26 10:27:34 ZEBRA: [SWQK6-6JY63][EC 4043309074] 188:1411:10.11.4.9/32: Failed to enqueue dataplane install
2023/07/26 10:27:38 BGP: [KTE2S-GTBDA][EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 488260 in VRF 133
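To check whether a router is hitting this condition, the log can be scanned for the Zebra messages shown in the excerpt. The following is a minimal Python sketch, not an official FRR tool: the message fragments are copied from the log above, while the function name, the short labels and the sample list are hypothetical names chosen for illustration.

```python
from collections import Counter

# Message fragments copied from the log excerpt above.
# The short labels on the left are hypothetical and exist only in this sketch.
SIGNATURES = {
    "nhg_replace_rejected": "Can not replace a nexthop with a nexthop group",
    "nhg_install_failed": "Failed to install Nexthop",
    "route_install_failed": "Failed to enqueue dataplane install",
}


def count_signatures(lines):
    """Count how often each known Zebra error fragment appears in the given lines."""
    counts = Counter()
    for line in lines:
        # Only Zebra messages are relevant here; BGP lines are skipped.
        if "ZEBRA:" not in line:
            continue
        for label, fragment in SIGNATURES.items():
            if fragment in line:
                counts[label] += 1
    return counts


# A few lines taken from the excerpt above, used as sample input.
SAMPLE = [
    "2023/07/26 10:26:30 ZEBRA: [HSYZM-HV7HF] Extended Error: Can not replace a nexthop with a nexthop group.",
    "2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (1769049[19431932/19431933]) into the kernel",
    "2023/07/26 10:26:34 ZEBRA: [SWQK6-6JY63][EC 4043309074] 132:1263:10.59.14.183/32: Failed to enqueue dataplane install",
    "2023/07/26 10:27:10 BGP: [KTE2S-GTBDA][EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 492212 in VRF 196",
]

sample_counts = count_signatures(SAMPLE)
```

Any iterable of lines works as input, so an open log file can be passed directly to count_signatures. Where FRR writes its log depends on the local logging configuration.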
The fix is to add the following to frr.conf:
zebra nexthop-group keep 1
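With the above in place, Zebra removes an unused nexthop group after 1 second instead of the default 180 seconds, and the error condition no longer surfaces.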
From the Zebra documentation at https://docs.frrouting.org/en/latest/zebra.html#clicmd-zebra-nexthop-group-keep-1-3600:
zebra nexthop-group keep (1-3600)
Set the time that zebra will keep a created and installed nexthop group before removing it from the system if the nexthop group is no longer being used. The default time is 180 seconds.
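To confirm that the setting is actually present in a configuration, the text of frr.conf can be checked for the keep line. The sketch below assumes the line appears verbatim as "zebra nexthop-group keep N" and that the last occurrence wins if it appears more than once; both are assumptions of this example rather than documented FRR behaviour. The function and constant names are hypothetical, and the 180 second default and the 1-3600 range are taken from the documentation quoted above.

```python
import re

# Default keep time as stated in the Zebra documentation quoted above.
DEFAULT_KEEP_SECONDS = 180

# Matches a config line such as "zebra nexthop-group keep 1".
_KEEP_RE = re.compile(r"^\s*zebra nexthop-group keep (\d+)\s*$", re.MULTILINE)


def nexthop_group_keep(config_text):
    """Return the configured keep time in seconds, or the documented default if unset."""
    matches = _KEEP_RE.findall(config_text)
    if not matches:
        return DEFAULT_KEEP_SECONDS
    # Assumption of this sketch: if the line is repeated, the last one applies.
    value = int(matches[-1])
    # The documented range for the command is 1-3600.
    if not 1 <= value <= 3600:
        raise ValueError(f"keep time {value} is outside the documented range 1-3600")
    return value


with_fix = nexthop_group_keep("!\nzebra nexthop-group keep 1\n!\n")
without_fix = nexthop_group_keep("!\nhostname router1\n!\n")
```

Here with_fix reflects a configuration that contains the fix, and without_fix falls back to the documented default because the line is absent.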
The above has been discussed at:
https://forum.vyos.io/t/frr-loses-routing-info-after-5-12k-l2tp-subs-connected/10422/16
Solution provided by: