
BFD is not starting after upgrade to 1.4-rolling-202302150317
Closed, Resolved · Public · BUG

Assigned To
Authored By
aserkin
Mar 2 2023, 12:20 AM
Referenced Files
F3689591: lns-3-1-commands-clean.cfg
Mar 7 2023, 11:32 AM
F3689592: image.png
Mar 7 2023, 11:32 AM
F3689589: lns-3-1-commands-clean.cfg
Mar 7 2023, 11:01 AM
F3689060: vyos-boot.log
Mar 6 2023, 8:00 PM

Description

The following errors appear while trying to configure BFD:

bfd {
    peer 10.5.72.1 {
        multihop
        source {
            address 10.228.134.1
        }
    }
    peer 10.5.72.2 {
        multihop
        source {
            address 10.228.134.1
        }
    }
    profile BBR {
        interval {
            multiplier 3
            receive 350
            transmit 350
        }
    }
}

Mar 01 23:28:39 python3[58356]: Report time: 2023-03-01 23:28:39
Mar 01 23:28:39 python3[58356]: Image version: VyOS 1.4-rolling-202302150317
Mar 01 23:28:39 python3[58356]: Release train: current
Mar 01 23:28:39 python3[58356]: Built by: [email protected]
Mar 01 23:28:39 python3[58356]: Built on: Wed 15 Feb 2023 03:17 UTC
Mar 01 23:28:39 python3[58356]: Build UUID: e62b2d4d-c09c-4dd6-a722-884b782e4d13
Mar 01 23:28:39 python3[58356]: Build commit ID: 5207b6f510d677
Mar 01 23:28:39 python3[58356]: Architecture: x86_64
Mar 01 23:28:39 python3[58356]: Boot via: installed image
Mar 01 23:28:39 python3[58356]: System type: KVM guest
Mar 01 23:28:39 python3[58356]: Hardware vendor: Red Hat
Mar 01 23:28:39 python3[58356]: Hardware model: KVM
Mar 01 23:28:39 python3[58356]: Hardware S/N:
Mar 01 23:28:39 python3[58356]: Hardware UUID: 109949e4-96b7-44ee-8c96-a111bb36bd23
Mar 01 23:28:39 python3[58356]: Traceback (most recent call last):
Mar 01 23:28:39 python3[58356]: File "/usr/libexec/vyos/conf_mode/protocols_bfd.py", line 121, in <module>
Mar 01 23:28:39 python3[58356]: apply(c)
Mar 01 23:28:39 python3[58356]: File "/usr/libexec/vyos/conf_mode/protocols_bfd.py", line 108, in apply
Mar 01 23:28:39 python3[58356]: frr_cfg.load_configuration(bfd_daemon)
Mar 01 23:28:39 python3[58356]: File "/usr/lib/python3/dist-packages/vyos/frr.py", line 435, in load_configuration
Mar 01 23:28:39 python3[58356]: self.imported_config = get_configuration(daemon=daemon)
Mar 01 23:28:39 python3[58356]: File "/usr/lib/python3/dist-packages/vyos/frr.py", line 149, in get_configuration
Mar 01 23:28:39 python3[58356]: raise OSError(code, output)
Mar 01 23:28:39 python3[58356]: PermissionError: [Errno 1] Exiting: failed to connect to any daemons.

frr.log:
2023/03/01 23:20:20 BFD: [XCQPZ-40HX4] echov6-socket: socket: Too many open files
2023/03/01 23:20:20 BFD: bfdd/bfd_packet.c:1685: bp_echov6_socket(): assertion (!"echov6-socket: socket: %s") failed
BFD: Received signal 6 at 1677702020 (si_addr 0x760000a53d, PC 0x7feac57a0ce1); aborting...
BFD: in thread zclient_read scheduled from lib/zclient.c:4083 zclient_event()

Details

Difficulty level
Unknown (require assessment)
Version
1.4-rolling-202302150317
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Cannot reproduce it with this configuration (VyOS 1.4-rolling-202302280651; my test system doesn't have a lot of file descriptors in use):

set protocols bfd peer 192.0.2.5 multihop
set protocols bfd peer 192.0.2.5 source address '192.0.2.1'
set protocols bfd peer 192.0.2.6 multihop
set protocols bfd peer 192.0.2.6 source address '192.0.2.1'
set protocols bfd profile BBR interval multiplier '3'
set protocols bfd profile BBR interval receive '350'
set protocols bfd profile BBR interval transmit '350'

commit

vyos@r14# vtysh -c "show run bfd"
Building configuration...

Current configuration:
!
frr version 8.4.2
frr defaults traditional
hostname debian
log syslog
log facility local7
hostname r14
service integrated-vtysh-config
!
bfd
 profile BBR
  transmit-interval 350
  receive-interval 350
 exit
 !
 peer 192.0.2.5 multihop local-address 192.0.2.1
 exit
 !
 peer 192.0.2.6 multihop local-address 192.0.2.1
 exit
 !
exit
!
end
[edit]
vyos@r14#

From the FRR documentation on --limit-fds: "Limit the number of file descriptors that will be used internally by the FRR daemons. By default, the daemons use the system ulimit value."
https://docs.frrouting.org/en/stable-8.4/basic.html?#cmdoption-limit-fds
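For completeness: on a stock FRR install that option is normally passed per daemon via /etc/frr/daemons, roughly as below. This is an illustration only - VyOS generates its own FRR startup configuration, so this file may not be the right knob here, and the 1024 value is just an example:

# /etc/frr/daemons (excerpt, hypothetical)
bfdd=yes
bfdd_options="  -A 127.0.0.1 --limit-fds 1024"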

To see the hard and soft limits:

ulimit -Hn
ulimit -Sn

sysctl fs.file-max

The limits look standard:
root@nn-vlns-3-1:~# ulimit -Hn
1048576
root@nn-vlns-3-1:~# ulimit -Sn
1024
root@nn-vlns-3-1:~# sysctl fs.file-max
fs.file-max = 9223372036854775807

But the BFD process does not start after a reboot.
Attaching the journalctl -b --no-pager output.

The bfdd process did not start until I changed LimitNOFILE=1024 to LimitNOFILE=2048 in /lib/systemd/system/frr.service.
That did the trick, but I'm not sure it's a good solution.
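If a raised limit turns out to be the right fix, a less intrusive route than editing the packaged unit file would be a systemd drop-in override - a minimal sketch, assuming the stock frr.service name and using 4096 purely as an example value (whether such a change survives a VyOS image upgrade is a separate question):

mkdir -p /etc/systemd/system/frr.service.d
printf '[Service]\nLimitNOFILE=4096\n' > /etc/systemd/system/frr.service.d/override.conf
systemctl daemon-reload
systemctl restart frr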
What do you think, @Viacheslav?

Well, there should be no harm in lifting the limit of open file descriptors for FRR, as it's a huge process tree.
Can you share your entire protocols configuration tree so we can see what else is configured?

So I think it might be something else, or you really do have a lot of e.g. BGP peerings configured.

Thank you for the hint, @c-po
Attached the entire config we have on the node.


There aren't many BGP peers, but there are quite a number of VRFs which terminate remote-access L2TP subscribers.
I'd really appreciate any advice on system optimization for that particular task - ideally I'd like this node to terminate up to 20k L2TP subscribers with very low traffic (not exceeding 0.5 Gbps, I guess).

Thank you for the hint, @c-po
Attached the entire config we have on the node.

I don't see attached config.

Attached it again. It says "download complete", and I can get it from the message:

image.png (163×451 px, 8 KB)

@aserkin WOW, that is a huge VRF config. With that amount you definitely reach the max FD limit.

In this use case we should probably lift LimitNOFILE from 1024 to 4096.
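Whatever value we settle on, it's worth verifying that the raised limit actually reaches the daemons - a quick sketch, assuming a single bfdd process:

systemctl show frr -p LimitNOFILE               # limit systemd applies to the frr unit
grep 'open files' /proc/$(pidof bfdd)/limits    # limit seen by the running bfdd
ls /proc/$(pidof bfdd)/fd | wc -l               # descriptors bfdd currently holds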

Could you tell us more about this config? It might be worth adding it to our smoketest infrastructure.

Thank you, @c-po. I will try raising the limit to 4096.
In this project we're trying to implement an L2TP network server with MPLS-PE functionality together with our partner mobile operator. This is for B2B projects in which a number of customers connect their mobile devices to corporate resources.
So the config has three groups of BGP peers (a rough sketch of this layout follows below): four IPv4-unicast peers (10.228.134.34, 10.228.134.36, 10.228.134.38, 10.228.134.40) for connectivity to the L2TP LACs (which are actually mobile gateways - GGSN/PGW) and the AAA servers; a pair of IPv4-VPN multihop peers (10.5.72.1, 10.5.72.2) where the customers' L3VPN connections are terminated; and one more peer connecting to a third-party carrier-grade NAT solution for the customers who need Internet access.
The LNS and NAT nodes are implemented on a single server as KVM virtual machines, interconnected with each other and with the outside world via Open vSwitch/DPDK.
The VRF names are assigned by the AAA server for each subscriber via the Accel-VRF-Name attribute.
This is also where the defect https://github.com/FRRouting/frr/issues/12919 comes from - just to point it out.
Let me know if you need additional info.
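To make the layout above a bit more concrete, here is a rough, hypothetical VyOS 1.4 sketch showing one peer from each group - the AS numbers, the update-source, the CGNAT peer address, and enabling BFD on the IPv4-VPN neighbor are my assumptions for illustration, not taken from the actual config:

# hypothetical local AS
set protocols bgp system-as 65000
# one of the four IPv4-unicast peers towards the LACs / AAA side
set protocols bgp neighbor 10.228.134.34 remote-as 65001
set protocols bgp neighbor 10.228.134.34 address-family ipv4-unicast
# one of the two multihop IPv4-VPN peers terminating customer L3VPNs
set protocols bgp neighbor 10.5.72.1 remote-as 65000
set protocols bgp neighbor 10.5.72.1 update-source 10.228.134.1
set protocols bgp neighbor 10.5.72.1 address-family ipv4-vpn
set protocols bgp neighbor 10.5.72.1 bfd
# peer towards the third-party CGNAT solution (placeholder address)
set protocols bgp neighbor 192.0.2.100 remote-as 65002
set protocols bgp neighbor 192.0.2.100 address-family ipv4-unicast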

BTW this configuration takes almost 20 minutes to load. I wonder if there's a way to speed up this process?

Great project! As I understand it, you're using BGP label-unicast to transport labels, and I'm curious about the operating systems your PEs/Ps are running on - are they Cisco, Juniper, or other vendors? I'm particularly interested in learning about the interoperability between different vendors so that I can incorporate it into my testing. @aserkin

As you can see, the LNS/MPLS-PE is being built on VyOS 1.4. The MPLS-P boxes are NSN (aka Alcatel-Lucent) as far as I know.

dmbaturin claimed this task.