Jan 10 2024
Does anybody know if that's going to be fixed in FRR?
Nov 14 2023
Hi @v.huti
This is probably obsolete by now. I've upgraded a few times since then and am now on version 8.5, which does not seem to suffer from this. Thank you.
And we had to stop work on the project due to another issue, described in
https://vyos.dev/T5424
Aug 7 2023
If that were PPPoE I'd have thought of ARP, but here, with a fixed number of L2TP tunnels (22 tunnels from the LACs), I don't think the ARP cache overflows the table.
Some more information that I can't yet tie to the failure, but it looks strange: just before the issue we see the LAC drop an L2TP tunnel for some reason and start sending SCCRQ with tid=0 as if it had just started up. After a while the accel-ppp daemon drops the old tunnels and starts new ones for a few LACs. I guess this definitely causes massive (thousands of) route updates between zebra and the kernel. Sometimes the system can withstand this, sometimes it can't.
I checked the FRR version in the recent rolling release - it is still a release candidate. Is it worth upgrading from 8.5.2? As for the possibility - yes, sure, we can build the latest image.
Adding what was available this time. Will try to turn on debugs next time if we get another chance. Yes, the behavior was identical to the previous one.
After 19 hours of production run since yesterday, the failure occurred again despite the workaround being applied. Routes are cleared from the kernel for some reason. During the run we observed a few L2TP tunnel drops followed by 600 to 6000 session drops. The reason is not clear for now, but I'm not sure this should kill zebra functionality this way.
Aug 3 2023
Yes, I did that as option A yesterday, and rebooted. Then I removed "zebra nexthop-group keep 1" and played a bit with interfaces up/down until the kernel routes vanished. Then I put "zebra nexthop-group keep 1" back and rebooted again.
Will try option B then.
Meanwhile it turned out to be possible to fix the "Route install failed" errors. I turned on "debug zebra kernel", found the nhg_id which caused the route install error, and created it manually using the nh1/nh2 provided by vtysh -c "show nexthop-group rib <nhg_id>" - just as described in the original thread regarding IPv6 routes.
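For the record, the sequence was roughly the following; the IDs and next hops below are placeholders, not the real ones from this box:
vtysh -c "debug zebra kernel"
# the log then shows which nhg_id fails to install, e.g. 102
vtysh -c "show nexthop-group rib 102"
# suppose it lists nh1 via 10.0.0.1 dev eth1 and nh2 via 10.0.0.2 dev eth2
ip nexthop add id 100 via 10.0.0.1 dev eth1    # create the member next hops if they are missing
ip nexthop add id 101 via 10.0.0.2 dev eth2
ip nexthop add id 102 group 100/101            # then the group zebra failed to install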
There is still some problem with the proposed workaround. It does not seem to work fully when applied on a running system with active BGP sessions. At least I still see next-hop groups in the kernel that have only one next hop after our last tests:
Aug 2 2023
From last night's tests it seems to be solved, though I'd prefer to test the node in production for a few weeks to be sure.
May 10 2023
Apr 25 2023
Two cents from the field: it would be nice to see a VRF-aware CGNAT solution, where subscribers from a number of "inside" VRFs are NAT'ed into one outside VRF - if that's possible, of course.
Apr 19 2023
Mar 13 2023
Actually, only the multihop BGP peers go down. The others are up, but the routes received from them do not make it into the kernel, so connectivity drops.
Latest techsupport: https://oc.cpm.ru/index.php/s/Fg9FfoOatihBOrQ
The system stayed alive for more than 12 hours, but crashed the same way as before.
Mar 10 2023
Mar 8 2023
As you can see, the LNS/MPLS-PE is being built on VyOS 1.4. The MPLS-P boxes are NSN (aka Alcatel-Lucent) as far as I know.
BTW, this configuration takes almost 20 minutes to load. I wonder if there's a way to speed up this process?
Thank you, @c-po. Will try raising limits to 4096.
Well, in this project we're trying to implement an L2TP network server with MPLS-PE functionality together with our partner mobile operator. This is for B2B projects, with a number of customers connecting their mobile devices to corporate resources for various reasons.
So the config has three groups of BGP peers: four ipv4-unicast peers (10.228.134.34, 10.228.134.36, 10.228.134.38, 10.228.134.40) for connecting to the L2TP LACs (actually they are mobile gateways - GGSN/PGW) and the AAA servers; a pair of ipv4-vpn multihop peers (10.5.72.1, 10.5.72.2) where the customers' L3VPN connections are terminated; and one more peer connecting to a 3rd-party carrier-grade NAT solution for the customers who need Internet access.
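In FRR terms the neighbor layout boils down to roughly this (only one peer of each group shown, AS numbers are placeholders - the attached config is the authoritative version):
router bgp 65000
 neighbor 10.228.134.34 remote-as 65010
 neighbor 10.5.72.1 remote-as 65020
 neighbor 10.5.72.1 ebgp-multihop 10
 address-family ipv4 unicast
  neighbor 10.228.134.34 activate
 exit-address-family
 address-family ipv4 vpn
  neighbor 10.5.72.1 activate
 exit-address-family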
The LNS and NAT nodes are implemented on a single server as KVM virtual machines, interconnected with each other and with the external world by Open vSwitch/DPDK.
The VRF names are assigned by the AAA server for each subscriber via the Accel-VRF-Name attribute (see the sketch at the end of this comment).
This is also where the defect https://github.com/FRRouting/frr/issues/12919 comes from - just to point it out.
Let me know if you need additional info.
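To illustrate the per-subscriber VRF assignment mentioned above, a sketch of what the AAA side returns - assuming a FreeRADIUS users-file style entry and that the accel-ppp RADIUS dictionary is loaded; the user name and VRF name are made up:
b2b-user01  Cleartext-Password := "example"
            Framed-Protocol = PPP,
            Accel-VRF-Name = "customer-a"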
Mar 7 2023
Tried it again. It says "download complete", and I can get it from the message:
Thank you for the hint, @c-po
Attached the entire config we have on the node.
There aren't many BGP peers, but there are quite a number of VRFs which terminate remote-access L2TP subscribers.
I'd really appreciate any advice on optimizing the system for that particular task - ideally I'd like this node to terminate up to 20k L2TP subscribers with very low traffic (not exceeding 0.5 Gbps, I guess).
Mar 6 2023
The bfdd process did not start until I changed LimitNOFILE=1024 to LimitNOFILE=2048 in /lib/systemd/system/frr.service.
That did the trick, but I'm not sure it's a good solution.
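Probably a systemd drop-in would be cleaner than editing the packaged unit; a sketch (the drop-in file name is arbitrary, and 2048 is just the value that worked here):
mkdir -p /etc/systemd/system/frr.service.d
printf '[Service]\nLimitNOFILE=2048\n' > /etc/systemd/system/frr.service.d/limits.conf
systemctl daemon-reload
systemctl restart frr.service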
What do you think, @Viacheslav ?
The limits look standard:
root@nn-vlns-3-1:~# ulimit -Hn
1048576
root@nn-vlns-3-1:~# ulimit -Sn
1024
root@nn-vlns-3-1:~# sysctl fs.file-max
fs.file-max = 9223372036854775807
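Note that the shell ulimit values above do not necessarily reflect what the FRR daemons actually get - the LimitNOFILE= value from the unit is what applies to them. A way to check the effective limit on the running daemon (assuming bfdd is running):
cat /proc/$(pgrep -o bfdd)/limits | grep 'open files'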
Mar 2 2023
Dec 7 2022
Yes, they are. 192.168.101.10 is the IP of a VPN remote-access subscriber. It is connected via interface l2tp0 (accel-ppp), and I'm just trying to open a TCP connection to port 80 on the client from the peer node.
The firewall settings do not seem to catch the traffic going out of the l2tp* interfaces.
admin@vyos-lns-1:~$ show config commands | grep firewall
set firewall interface l2tp* out name 'nodefw'
set firewall log-martians 'disable'
set firewall name nodefw rule 100 action 'accept'
set firewall name nodefw rule 100 protocol 'tcp'
set firewall name nodefw rule 100 tcp flags syn
set firewall name nodefw rule 100 tcp mss '1300'
Oops. Thank you, Nicolas.
Suddenly found myself far behind the current rolling release. Will upgrade first.
Dec 6 2022
There's no
set firewall interface
option here:
admin@vyos-lns-1:~$ show version
Version: VyOS 1.4-rolling-202209131208
Oct 17 2022
Added more bgpd/ospfd events to the log. The VRF id seems to be correct, but the events look curious: after session start the interface is first created in the default VRF (vrf default, id:0), followed by bgpd/ospfd events; then the accel-ppp process moves it to the destination VRF (vrf client, id:5), which is followed by the bgpd/ospfd errors.
Finally, at around 5000 sessions bgpd suddenly becomes unresponsive and utilizes 200% CPU (8 cores are assigned to the VM). The accel-pppd process, with all network destinations unreachable, also becomes unresponsive a bit later.
After that we have to reboot.
Oct 12 2022
That does not change the behavior. I get five messages on session start from the bfdd, bgpd, and ospfd processes, and 16 messages from all FRR daemons on session stop.
The only way to get rid of them is 'log syslog emergencies', but that filters out important events as well.
Any suggestions on the problem, guys?
I see a lot of reports in the FRR community about these messages appearing in various scenarios since 2017 or even earlier, but did not actually find any solution.
Oct 6 2022
This is a project for mobile access to enterprise networks. VyOS acts as an MPLS-PE router as well as an L2TP network server. Every subscriber coming in via L2TP is directed into the customer's VRF (other than default) via a RADIUS attribute.
Sep 7 2022
I'd suggest adding
**Restart=always RestartSec=10**
to /usr/share/vyos/templates/telegraf/override.conf.j2 as it is done for ntp.service.
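i.e. the rendered override would end up containing something like this (a sketch; exact placement within the template depends on what is already there):
[Service]
Restart=always
RestartSec=10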
Otherwise the telegraf service does not start - it makes 5 start attempts in quick succession during boot with the error:
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Failed with result 'exit-code'.
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 5.
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Start request repeated too quickly.
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Failed with result 'exit-code'.
and stays in a failed state.
See the boot log attached.
Sep 1 2022
Need some advice, guys, on how we can reproduce the problem. I tried peering with bird and announcing 100k prefixes to the VyOS box, but this simple config did not cause a memory leak in bgpd. Still trying.
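Roughly the kind of setup I mean - a bird2-style sketch with placeholder addresses and AS numbers, where the 100k prefixes are generated by a script (only two shown):
protocol static export_test {
    ipv4;
    route 10.10.0.0/24 blackhole;
    route 10.10.1.0/24 blackhole;
    # ...generated up to ~100k prefixes
}
protocol bgp to_vyos {
    local 192.0.2.2 as 65001;
    neighbor 192.0.2.1 as 65000;
    ipv4 {
        export where source = RTS_STATIC;
    };
}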
Aug 19 2022
Nothing helps
Aug 18 2022
The only way I found to start telegraf with ip vrf exec is to comment out
#User=telegraf
in /etc/systemd/system/vyos-telegraf.service and
chown root:root /run/telegraf
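In other words, roughly this sequence (the daemon-reload/restart steps are what I'd expect to be needed after editing the unit):
sudo sed -i 's/^User=telegraf/#User=telegraf/' /etc/systemd/system/vyos-telegraf.service
sudo chown root:root /run/telegraf
sudo systemctl daemon-reload
sudo systemctl restart vyos-telegraf.service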
Aug 16 2022
Manual start of telegraf works for me
Aug 15 2022
Aug 10 2022
Hi Viacheslav
Sorry, I probably misspelled the config option. Actually, it's available in the [radius] section of accel-ppp.conf.
Below is the [radius] section from my /run/accel-pppd/l2tp.conf after I changed /usr/libexec/vyos/conf_mode/vpn_l2tp.py:
Jul 29 2022
Jul 28 2022
Is there any chance to fix that?