
Firewall with 20K entries cannot load after reboot
Closed, Resolved · Public · BUG

Description

There is a report from the forum

I configured 20,000 firewall rules on my VyOS (48 cores and 100 GB RAM, KVM), but committing the configuration takes at least 3 hours, and after I reboot the machine it does not come up and hangs at “Mounting VyOS Config”.

If nftables is used natively, as in:

sudo nft -s list ruleset > /tmp/rules.nft
sudo nft flush ruleset
sudo time nft -f /tmp/rules.nft

It takes almost 10 hours


Details

Difficulty level
Hard (possibly days)
Version
1.4-rolling-202206161834
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Any config available to test against?

I ran my own internal tests and can't reproduce it.
20K entries are applied in 0.20 sec:

root@r14:/home/vyos# cat tmp.nft | wc -l
20029
root@r14:/home/vyos# 
root@r14:/home/vyos# sudo time nft -f tmp.nft
real	0m 0.20s
user	0m 0.13s
sys	0m 0.06s
root@r14:/home/vyos#

200K entries in 2 sec

root@r14:/home/vyos# cat tmp.nft | wc -l
200029
root@r14:/home/vyos# 
root@r14:/home/vyos# sudo nft flush ruleset
root@r14:/home/vyos# 
root@r14:/home/vyos# sudo time nft -f tmp.nft
real	0m 1.91s
user	0m 1.20s
sys	0m 0.70s
root@r14:/home/vyos#
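
As a rough check on the two runs above (not part of the original test), the load rate can be worked out from the line counts and the "real" times. The sketch below only redoes that arithmetic; the dictionary and function names are made up for illustration.

```python
# Figures copied from the two timing runs above: (line count, "real" seconds).
runs = {
    "20K": (20029, 0.20),
    "200K": (200029, 1.91),
}


def lines_per_second(line_count: int, seconds: float) -> float:
    """Average number of ruleset lines loaded per second."""
    return line_count / seconds


rates = {name: lines_per_second(n, t) for name, (n, t) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} lines/s")
```

Both runs come out at roughly 100,000 lines per second, so on this machine the plain nft load time grows about linearly with the ruleset size.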

Here is a simple script to generate the rules:

#!/usr/bin/env python3

import random


temp = '''table ip filter {
	chain VYOS_FW_FORWARD {
		type filter hook forward priority filter; policy accept;
		jump VYOS_POST_FW
	}

	chain VYOS_FW_LOCAL {
		type filter hook input priority filter; policy accept;
		jump VYOS_POST_FW
	}

	chain VYOS_FW_OUTPUT {
		type filter hook output priority filter; policy accept;
		jump VYOS_POST_FW
	}

	chain VYOS_POST_FW {
		return
	}

	chain VYOS_FRAG_MARK {
		type filter hook prerouting priority -450; policy accept;
		ip frag-off & 16383 != 0 meta mark set 0x000ffff1 return
	}

	chain NAME_FOO {'''


print(temp)

# Emit 20,000 rules, each matching one random IPv4 source address.
for i in range(20000):
    ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
    print(f'\t\tip saddr {ip} counter return comment "FOO-{i+1}"')
print('\t\tcounter return comment "FOO default-action accept"')
# Close the NAME_FOO chain and the table.
print("\t}\n}")

# Usage: ./generate.py | tee tmp.nft

Hi,

Today I wanted to test how fast firewall rules are loaded and changed in VyOS. I took vyos-1.4-rolling-202308180646-amd64.iso and booted it as a KVM guest.
Then I added some rules with:

for I in $(seq 100 2542); do set firewall ipv6 name Test rule $I action accept ; set firewall ipv6 name Test rule $I destination port $I; set firewall ipv6 name Test rule $I protocol tcp ; done
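
For a sense of scale, here is a small sketch (my own, not from the report) that builds the same command list in Python and counts it. The function name `build_commands` is hypothetical; the command strings are the ones from the loop above.

```python
def build_commands(first: int, last: int) -> list[str]:
    """Build the three `set` commands per rule used in the loop above."""
    commands = []
    for rule in range(first, last + 1):
        base = f"set firewall ipv6 name Test rule {rule}"
        commands.append(f"{base} action accept")
        commands.append(f"{base} destination port {rule}")
        commands.append(f"{base} protocol tcp")
    return commands


commands = build_commands(100, 2542)
print(len(commands) // 3, "rules,", len(commands), "set commands")
# prints: 2443 rules, 7329 set commands
```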

After that I ran a commit, which took a long time.
I then added one more rule and committed again:

vyos@vyos# time commit

real 2m16.482s
user 0m36.947s
sys 0m14.218s
[edit]

I rebooted the box, and it also took around 12 minutes to start.
After the start I made a config change in the system area, not in the firewall. That commit took:

vyos@vyos# set system time-zone Europe/Berlin
[edit]
vyos@vyos# time commit

real 2m17.489s
user 0m36.723s
sys 0m15.184s
[edit]
vyos@vyos#

This makes working with large rule sets nearly impossible.
Pure nft runs fast:

vbash-4.1# nft -s list ruleset > /tmp/rules
vbash-4.1# nft flush ruleset
vbash-4.1# time nft -f /tmp/rules

real 0m0.075s
user 0m0.047s
sys 0m0.030s
vbash-4.1#

This delay is not limited to the latest version. A huge firewall config (and not only firewall) means more processing when committing changes.
Bear in mind that for every firewall config command, Python scripts are invoked for sanity checks and for config generation.
If direct nft commands are used, none of these scripts are called.
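
To illustrate why per-command script invocation adds up, here is a minimal sketch that compares starting a fresh Python interpreter many times against doing the same (empty) work inside one process. It is not how the VyOS config backend is implemented; it only shows the fixed cost of each interpreter start, and the function names are made up.

```python
import subprocess
import sys
import time


def time_spawns(count: int) -> float:
    """Seconds spent starting `count` separate interpreters that do nothing."""
    start = time.perf_counter()
    for _ in range(count):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process(count: int) -> float:
    """Seconds spent on `count` empty iterations inside this interpreter."""
    start = time.perf_counter()
    for _ in range(count):
        pass
    return time.perf_counter() - start


n = 10
print(f"{n} interpreter starts:     {time_spawns(n):.3f}s")
print(f"{n} in-process iterations: {time_in_process(n):.6f}s")
```

The absolute numbers depend on the machine, but the per-start cost is paid again for every invocation, which is the overhead direct nft commands avoid.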

The same example on an older version shows similar delays:

vyos@vyos# run show ver | grep Ver
Version:          VyOS 1.4-rolling-202306080317
[edit]
vyos@vyos# set int eth eth3 description FOO
[edit]
vyos@vyos# time commit

real    1m4.074s
user    0m19.961s
sys     0m7.620s
[edit]
vyos@vyos# run show config comm | grep -c firewall
7330
[edit]
vyos@vyos#

Related: https://vyos.dev/T5388 (Something is fishy with commit and boot times when more than a few hundred static routes are being used).

The main issue seems to be that every single config line invokes the Python interpreter on its own, at multiple levels, so the overhead becomes HUGE compared to invoking it once, having it parse the config and dump it into an nft file, and then importing that file with, let's say, "nft -o -f /path/file.nft". The same goes for routing information: it could be dumped to frr.conf and the FRR process reloaded, or injected as a batch file instead of line by line (the batch-file method would be preferred, especially when dynamic routing protocols are in use).
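
A minimal sketch of the "render once, load once" idea described above, assuming the rules are already available as plain nft rule strings. `render_chain` is a hypothetical helper, not an existing VyOS function, and the example addresses are placeholders from the 192.0.2.0/24 documentation range.

```python
def render_chain(table: str, chain: str, rules: list[str]) -> str:
    """Render one table with one chain as nft ruleset text, in a single pass."""
    lines = [f"table ip {table} {{", f"\tchain {chain} {{"]
    lines += [f"\t\t{rule}" for rule in rules]
    lines += ["\t}", "}"]
    return "\n".join(lines) + "\n"


# Placeholder rules; a real generator would derive these from the config tree.
rules = [
    f'ip saddr 192.0.2.{i} counter return comment "FOO-{i}"'
    for i in range(1, 4)
]
ruleset = render_chain("filter", "NAME_FOO", rules)
print(ruleset, end="")
```

The resulting text would be written to a file and loaded with a single `nft -f` call, so the interpreter starts once no matter how many rules there are.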

The question is whether the config engine can be partly rewritten or tweaked to implement, for example, a section-by-section cache, so that a section that has already been parsed once does not incur this massive overhead again just because someone altered, say, the language settings in a different section.

To keep admins from screwing things up, each cache could have some kind of file hash attached to it. If the hash doesn't match, the section is parsed line by line as today (until the code path is changed to batch mode). If the hash matches, the config lines don't have to go through all the checks again, since they were already validated when the hash was attached to that section.
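
The hash idea could look roughly like the sketch below. Everything here is an assumption for illustration: the `SectionCache` class, the choice of SHA-256, and the `validator` callback stand in for whatever checks the real config engine runs per section.

```python
import hashlib


def section_hash(lines: list[str]) -> str:
    """SHA-256 over the section's config lines, joined by newlines."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


class SectionCache:
    """Remembers the hash of each section as of its last successful validation."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}
        self.validations = 0  # how often the expensive path actually ran

    def validate(self, name: str, lines: list[str], validator) -> bool:
        """Return True on a cache hit, False if the section had to be re-validated."""
        digest = section_hash(lines)
        if self._hashes.get(name) == digest:
            return True  # unchanged since it was last validated
        self.validations += 1
        validator(lines)  # expected to raise if the section is invalid
        self._hashes[name] = digest
        return False


cache = SectionCache()
firewall = ["set firewall ipv6 name Test rule 100 action accept"]
first = cache.validate("firewall", firewall, lambda lines: None)
second = cache.validate("firewall", firewall, lambda lines: None)
firewall.append("set firewall ipv6 name Test rule 100 protocol tcp")
third = cache.validate("firewall", firewall, lambda lines: None)
print(first, second, third, cache.validations)
```

With this shape, a change in the system section leaves the firewall hash untouched, so the firewall checks are skipped on that commit.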

Commit and boot times are a major issue for larger deployments, that is, once static routes number more than a few hundred, and likewise once firewall rules number more than a few hundred in total (the worst case is having both more than a handful of static routes AND more than a handful of firewall rules).

If I could vote for something, it would be to refactor how the VyOS config is translated into the conf file that each service then uses (nft, frr, dhcp, etc.). It could be done in two stages: first put some kind of caching in place, then rewrite the code so that changes are batched into nft and frr instead of being injected line by line.

dmbaturin triaged this task as High priority.

Quick test done on a VM with 1 CPU and 1 GB RAM:

[email protected]# for I in  {1..2542}; do set firewall ipv6 name Test rule $I action accept ; set firewall ipv6 name Test rule $I destination port $I; set firewall ipv6 name Test rule $I protocol tcp ; done
[email protected]# time commit

real    3m20.143s
user    2m4.437s
sys     0m39.453s
[edit]
[email protected]# run show config comm | grep -c fire
7628
[edit]
[email protected]# set int eth eth0 description WAN
[edit]
[email protected]# time commit

real    1m37.000s
user    0m41.601s
sys     0m8.831s
[edit]
[email protected]# set firewall ipv4 forward filter rule 1 action accept 
[edit]
[email protected]# time commit

real    1m37.297s
user    0m42.046s
sys     0m8.681s
[edit]
[email protected]#