VyOS Platform

Static Route Path Monitoring, failover
Needs testing, Requires assessment, Public, FEATURE REQUEST

Description

Hello all,
Sometimes it's not possible to use dynamic routing because not all peers support it.
As a fallback, static routes are used.
I would like to be able to monitor static routes with some kind of health check, such as ping
(like Palo Alto does).

This is not the same as WAN load balancing, because PBR would add extra complexity.

Regards
Markus

Details

Difficulty level
Unknown (require assessment)
Version
-
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Unspecified (please specify)

Event Timeline

Hi adestis, what you describe is possible today with the help of a shell script and crontab. If you are interested, I could help you create a script that does this for you. The drawbacks are that the failover time is in the range of minutes and the routes are not present in the configuration. Also, cron fills the log with a message every time it is executed.

Hello runar,
I know that it's possible to do it manually.
But I really would like to see a more integrated solution where you can add a check for the next hop inside the configuration.

A cron-based solution is not ideal because cron's minimum interval is one minute.
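Instead of cron, a small long-running loop can re-check each next-hop at a sub-minute interval. The sketch below is a hypothetical illustration, not part of VyOS: it assumes the Linux iputils ping (-c 1 sends one echo request, -W sets the reply timeout in seconds), and the names icmp_alive and poll are made up for this example. Deciding which route to install based on the results is left out.

```python
import subprocess
import time
from typing import Callable, Dict, List


def icmp_alive(target: str, timeout: int) -> bool:
    """Send one ICMP echo request with iputils ping; True if a reply arrived in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout), target],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # No ping binary on this system: report the target as unreachable.
        return False
    return result.returncode == 0


def poll(targets: List[str], timeout: int, interval: float, rounds: int,
         check: Callable[[str, int], bool] = icmp_alive) -> List[Dict[str, bool]]:
    """Check every target once per round, sleeping `interval` seconds between rounds.

    Unlike cron, `interval` can be well below one minute. `check` is injectable
    so the loop can be exercised without sending real packets.
    """
    history: List[Dict[str, bool]] = []
    for round_no in range(rounds):
        history.append({target: check(target, timeout) for target in targets})
        if round_no < rounds - 1:
            time.sleep(interval)
    return history
```

A real checker would loop indefinitely (for example with a 5-second interval) and act on each snapshot instead of collecting a fixed number of rounds.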

MikroTik RouterOS supports something like this:

/ip route add gateway=192.0.2.1,192.0.2.2 check-gateway=ping

or check-gateway=arp for devices that don't respond reliably to ping.

It would be really nifty to find a way to add this to VyOS, but it would also have to interact well with FRR to ensure these "semi-static" routes propagate through to IGP/EGP where there is a redistribute static in effect.

Would it be reasonable to use BFD for this? Since BFD is already implemented, we might be able to use it here as well.

@Cheeze_It BFD for static routes would be nice as well but sometimes the target you test against is not under your control and/or does not support BFD.

@adestis yes, that is true... but that can be worked around. Any of the options could be used (BFD, ARP, or ICMP). I just wanted to offer more ideas so that we can hopefully get a working implementation for all three.

As far as I can see, BFD for static routes is currently under development in FRR:
https://github.com/FRRouting/frr/issues/3369

(It seems only the tests are missing.)

But so far I have not seen anything like the MikroTik feature @maznu mentioned.
That really would be nice.

What I had in mind is described on this Juniper page.

If you'd like, I can mock it up in my lab, test it, and share the configuration I used. Then we could see whether we can build something similar, or at least something with similar functionality.

Viacheslav renamed this task from Static Route Path Monitoring to Static Route Path Monitoring, failover. Jun 5 2022, 12:04 PM
Viacheslav added a project: VyOS 1.4 Sagitta.
Viacheslav set Is it a breaking change? to Unspecified (possibly destroys the router).
Viacheslav set Issue type to Unspecified (please specify).

PR https://github.com/vyos/vyos-1x/pull/1358

set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check target '192.168.100.1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check timeout '10'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check type 'icmp'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 interface 'eth1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 metric '2'
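All of these commands share the same path prefix and differ only in the leaf option. As a hypothetical illustration (failover_commands is not part of VyOS), the commands for one monitored next-hop can be rendered from a few parameters; metric is optional, as in the second example in this thread, which omits it.

```python
from typing import List, Optional


def failover_commands(prefix: str, next_hop: str, interface: str, target: str,
                      timeout: int, check_type: str = "icmp",
                      metric: Optional[int] = None) -> List[str]:
    """Render the 'set protocols failover route' commands for one next-hop.

    Hypothetical helper: the option names (check target/timeout/type,
    interface, metric) are taken from the example in this thread.
    """
    base = f"set protocols failover route {prefix} next-hop {next_hop}"
    commands = [
        f"{base} check target '{target}'",
        f"{base} check timeout '{timeout}'",
        f"{base} check type '{check_type}'",
        f"{base} interface '{interface}'",
    ]
    if metric is not None:
        commands.append(f"{base} metric '{metric}'")
    return commands


for line in failover_commands("203.0.113.1/32", "192.168.100.1", "eth1",
                              target="192.168.100.1", timeout=10, metric=2):
    print(line)
```

With these arguments the loop prints the same five commands as the PR example.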
Viacheslav changed the task status from Open to Needs testing. Dec 20 2022, 9:16 AM

At first glance it works, but it needs more testing and improvements.

set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check target '192.168.122.1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check timeout '5'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check type 'icmp'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 interface 'eth0'

show

$ show ip route 203.0.113.1
Routing entry for 203.0.113.1/32
  Known via "kernel", distance 0, metric 1, best
  Last update 00:04:42 ago
  * 192.168.122.1, via eth0

$ sudo ip route show proto failover
203.0.113.1 via 192.168.122.1 dev eth0 metric 1

Hello everyone,

It works, but there is a small problem: if you set two routes to the same destination with different metrics and the main link dies, the route switches to the backup link, but it does not switch back to the main link when the main link comes back up. You have to "kill" the backup link to make the main route active again.

Here is my test setup:
IP 10.100.1.1/32 - MikroTik router loopback
eth2 - main link - 172.25.30.9/30 (Vyos) - 172.25.30.10/30 (MK)
eth3 - backup link - 172.25.40.9/30 (Vyos) - 172.25.40.10/30 (MK)

 route 10.100.1.1/32 {
     next-hop 172.25.30.10 {
         check {
             target 172.25.30.10
             timeout 1
             type icmp
         }
         interface eth2
     }
     next-hop 172.25.40.10 {
         check {
             target 172.25.40.10
             timeout 1
             type icmp
         }
         interface eth3
         metric 100
     }
 }

$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h01m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h01m
K>* 10.100.1.1/32 [0/1] via 172.25.30.10, eth2, 00:00:29
C>* 10.100.1.255/32 is directly connected, lo, 1d23h01m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h01m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h01m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h01m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h01m

Now disabling IP 172.25.30.10 on the MikroTik:

$ ping 172.25.30.10
PING 172.25.30.10 (172.25.30.10) 56(84) bytes of data.
From 172.25.30.9 icmp_seq=1 Destination Host Unreachable
From 172.25.30.9 icmp_seq=2 Destination Host Unreachable
From 172.25.30.9 icmp_seq=3 Destination Host Unreachable
From 172.25.30.9 icmp_seq=4 Destination Host Unreachable
From 172.25.30.9 icmp_seq=5 Destination Host Unreachable
^C
--- 172.25.30.10 ping statistics ---
7 packets transmitted, 0 received, +5 errors, 100% packet loss, time 6134ms
pipe 3
$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h03m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h03m
K>* 10.100.1.1/32 [0/100] via 172.25.40.10, eth3, 00:00:28
C>* 10.100.1.255/32 is directly connected, lo, 1d23h03m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h03m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h03m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h03m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h03m

Now re-enabling IP 172.25.30.10 on the MikroTik:

$ ping 172.25.30.10
PING 172.25.30.10 (172.25.30.10) 56(84) bytes of data.
64 bytes from 172.25.30.10: icmp_seq=1 ttl=64 time=0.331 ms
64 bytes from 172.25.30.10: icmp_seq=2 ttl=64 time=0.328 ms
64 bytes from 172.25.30.10: icmp_seq=3 ttl=64 time=0.343 ms
64 bytes from 172.25.30.10: icmp_seq=4 ttl=64 time=0.315 ms
^C
--- 172.25.30.10 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3081ms
rtt min/avg/max/mdev = 0.315/0.329/0.343/0.010 ms
$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h07m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h07m
K>* 10.100.1.1/32 [0/100] via 172.25.40.10, eth3, 00:04:15
C>* 10.100.1.255/32 is directly connected, lo, 1d23h07m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h07m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h07m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h07m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h07m

Let's disable the backup link's next-hop on the MikroTik (MK) router:

$ ping 172.25.40.10
PING 172.25.40.10 (172.25.40.10) 56(84) bytes of data.
From 172.25.40.9 icmp_seq=1 Destination Host Unreachable
From 172.25.40.9 icmp_seq=2 Destination Host Unreachable
From 172.25.40.9 icmp_seq=3 Destination Host Unreachable
From 172.25.40.9 icmp_seq=4 Destination Host Unreachable
From 172.25.40.9 icmp_seq=5 Destination Host Unreachable
^C
--- 172.25.40.10 ping statistics ---
6 packets transmitted, 0 received, +5 errors, 100% packet loss, time 5129ms
pipe 4
$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h08m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h08m
K>* 10.100.1.1/32 [0/1] via 172.25.30.10, eth2, 00:00:24
C>* 10.100.1.255/32 is directly connected, lo, 1d23h08m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h08m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h08m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h08m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h08m

Only now does the route switch back to the main link. This is the only problem I found while testing.
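The behaviour expected in this test is that the route is re-evaluated on every check cycle, so the lowest-metric next-hop whose check passes is always the active one, and the main link is preferred again as soon as it recovers. A minimal sketch of that selection rule, under stated assumptions: this is not the vyos-1x code, select_next_hop is a hypothetical name, and the main next-hop's metric is taken as 1 because the outputs above show [0/1] when no metric is configured.

```python
from typing import Dict, Optional


def select_next_hop(metrics: Dict[str, int], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy next-hop with the lowest metric, or None if none is healthy."""
    candidates = [(metric, hop) for hop, metric in metrics.items() if healthy.get(hop, False)]
    return min(candidates)[1] if candidates else None


# Next-hops and metrics from the test above (main link: metric 1, backup: metric 100).
METRICS = {"172.25.30.10": 1, "172.25.40.10": 100}

# The four states walked through in the test, each evaluated from scratch.
STEPS = [
    ("both up", {"172.25.30.10": True, "172.25.40.10": True}),
    ("main down", {"172.25.30.10": False, "172.25.40.10": True}),
    ("main back up", {"172.25.30.10": True, "172.25.40.10": True}),
    ("backup down", {"172.25.30.10": True, "172.25.40.10": False}),
]

for label, health in STEPS:
    print(f"{label}: {select_next_hop(METRICS, health)}")
```

Under this rule the third step ("main back up") already selects 172.25.30.10; in the test above, the route stayed on 172.25.40.10 at that point until the backup check failed.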

will be fixed in the next rolling release

Nice! Gonna test later :D