
Static Route Path Monitoring, failover
Closed, Resolved | Public | FEATURE REQUEST

Description

Hello all,
Sometimes dynamic routing is not possible because not all peers support it.
In that case static routes are used as a fallback.
I would like to be able to monitor static routes with some kind of health check, such as ping
(like Palo Alto does).

This is not the same as WAN load balancing, because PBR would add extra complexity.

Regards
Markus

Details

Difficulty level
Unknown (require assessment)
Version
-
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Unspecified (please specify)

Event Timeline

Hi adestis, what you describe is possible today with a shell script and crontab. If you are interested, I could help you create a script that does this for you. The drawbacks are that the failover time is in the ballpark of minutes and the routes are not present in the configuration. Also, cron writes a message to the log every time the script is executed.

Hello runar,
I know that it's possible to do it manually.
But I would really like to see a more integrated solution where you can add a check for the next hop inside the configuration.

A cron-based solution is not ideal because cron's minimum interval is 1 minute.
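
For illustration, a check loop that is not tied to cron's one-minute granularity could look roughly like the sketch below. It is only a sketch under assumptions: the function names, the 5-second interval and the structure are made up for this comment and are not taken from any VyOS code. It shells out to the standard iputils `ping` and iproute2 `ip route` commands, and it would have to run as root to change routes.

```python
import subprocess
import time


def ping_cmd(target, timeout=1, interface=None):
    """Build a one-shot ping command (iputils syntax)."""
    cmd = ["ping", "-c", "1", "-W", str(timeout)]
    if interface:
        cmd += ["-I", interface]
    return cmd + [target]


def route_cmd(action, prefix, next_hop, interface, metric=1):
    """Build an `ip route` command; action is 'replace' or 'del'."""
    return ["ip", "route", action, prefix, "via", next_hop,
            "dev", interface, "metric", str(metric)]


def is_alive(target, timeout=1, interface=None):
    """Return True if a single ICMP echo request is answered."""
    result = subprocess.run(ping_cmd(target, timeout, interface),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def monitor(prefix, next_hop, interface, target, interval=5):
    """Install the route while the target answers, remove it otherwise."""
    installed = None  # state is unknown until the first check
    while True:
        alive = is_alive(target, interface=interface)
        if alive != installed:
            # Only touch the routing table when the state changes.
            action = "replace" if alive else "del"
            subprocess.run(route_cmd(action, prefix, next_hop, interface))
            installed = alive
        time.sleep(interval)
```

Calling `monitor("203.0.113.1/32", "192.168.100.1", "eth1", "192.168.100.1")` would then re-check every 5 seconds. Routes managed this way are still not visible in the configuration, which is the reason for asking for an integrated solution.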

MikroTik RouterOS supports something like this:

/ip route add gateway=192.0.2.1,192.0.2.2 check-gateway=ping

or check-gateway=arp for boxes that don't ping very well.

It would be really nifty to find a way to add this to VyOS, but it would also have to interact well with FRR to ensure these "semi-static" routes propagate through to IGP/EGP where there is a redistribute static in effect.

Would it be reasonable to use BFD for this? Since BFD is already implemented we might be able to use that as well?

@Cheeze_It BFD for static routes would be nice as well but sometimes the target you test against is not under your control and/or does not support BFD.

@adestis yes, that is true, but it can be worked around. Any option can be used (BFD, ARP, or ICMP). I just wanted to give more ideas so that we can hopefully get a working implementation for all three.

So far I have seen that BFD for static routes in FRR is currently under development:
https://github.com/FRRouting/frr/issues/3369

(It seems that only the tests are missing.)

But so far I have not seen anything like the MikroTik feature @maznu mentioned.
That would be really nice.

The way I was thinking is on this Juniper page here.

If you guys would like, I can mock it up in my lab, test it, and show you the configuration I used and maybe it would be possible for us to see if we can make something similar or at least with similar functionality.

Viacheslav renamed this task from Static Route Path Monitoring to Static Route Path Monitoring, failover. Jun 5 2022, 12:04 PM
Viacheslav added a project: VyOS 1.4 Sagitta.
Viacheslav set Is it a breaking change? to Unspecified (possibly destroys the router).
Viacheslav set Issue type to Unspecified (please specify).

PR https://github.com/vyos/vyos-1x/pull/1358

set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check target '192.168.100.1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check timeout '10'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check type 'icmp'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 interface 'eth1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 metric '2'
Viacheslav changed the task status from Open to Needs testing. Dec 20 2022, 9:16 AM

At first glance it works, but it needs more testing and improvements.

set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check target '192.168.122.1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check timeout '5'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check type 'icmp'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 interface 'eth0'

show

vyos@r14:~$ show ip route 203.0.113.1
Routing entry for 203.0.113.1/32
  Known via "kernel", distance 0, metric 1, best
  Last update 00:04:42 ago
  * 192.168.122.1, via eth0

vyos@r14:~$ 
vyos@r14:~$ 
vyos@r14:~$ sudo ip route show proto failover
203.0.113.1 via 192.168.122.1 dev eth0 metric 1 
vyos@r14:~$ 


Hello everyone,

It works but has a small problem: if you set two routes to the same destination with different metrics and the main link dies, it switches to the backup link, but it won't switch back to the main link when that link comes back up. You have to "kill" the backup link to make the main route active again.

Here are my tests:
IP 10.100.1.1/32 - MikroTik router loopback
eth2 - main link - 172.25.30.9/30 (VyOS) - 172.25.30.10/30 (MK)
eth3 - backup link - 172.25.40.9/30 (VyOS) - 172.25.40.10/30 (MK)

 route 10.100.1.1/32 {
     next-hop 172.25.30.10 {
         check {
             target 172.25.30.10
             timeout 1
             type icmp
         }
         interface eth2
     }
     next-hop 172.25.40.10 {
         check {
             target 172.25.40.10
             timeout 1
             type icmp
         }
         interface eth3
         metric 100
     }
 }

vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h01m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h01m
K>* 10.100.1.1/32 [0/1] via 172.25.30.10, eth2, 00:00:29
C>* 10.100.1.255/32 is directly connected, lo, 1d23h01m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h01m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h01m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h01m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h01m

Now disabling the Mikrotik IP 172.25.30.10:

vyos@vyos:~$ ping 172.25.30.10
PING 172.25.30.10 (172.25.30.10) 56(84) bytes of data.
From 172.25.30.9 icmp_seq=1 Destination Host Unreachable
From 172.25.30.9 icmp_seq=2 Destination Host Unreachable
From 172.25.30.9 icmp_seq=3 Destination Host Unreachable
From 172.25.30.9 icmp_seq=4 Destination Host Unreachable
From 172.25.30.9 icmp_seq=5 Destination Host Unreachable
^C
--- 172.25.30.10 ping statistics ---
7 packets transmitted, 0 received, +5 errors, 100% packet loss, time 6134ms
pipe 3
vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h03m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h03m
K>* 10.100.1.1/32 [0/100] via 172.25.40.10, eth3, 00:00:28
C>* 10.100.1.255/32 is directly connected, lo, 1d23h03m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h03m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h03m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h03m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h03m

Now enabling the IP 172.25.30.10 again on the MikroTik:

vyos@vyos:~$ ping 172.25.30.10
PING 172.25.30.10 (172.25.30.10) 56(84) bytes of data.
64 bytes from 172.25.30.10: icmp_seq=1 ttl=64 time=0.331 ms
64 bytes from 172.25.30.10: icmp_seq=2 ttl=64 time=0.328 ms
64 bytes from 172.25.30.10: icmp_seq=3 ttl=64 time=0.343 ms
64 bytes from 172.25.30.10: icmp_seq=4 ttl=64 time=0.315 ms
^C
--- 172.25.30.10 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3081ms
rtt min/avg/max/mdev = 0.315/0.329/0.343/0.010 ms
vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h07m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h07m
K>* 10.100.1.1/32 [0/100] via 172.25.40.10, eth3, 00:04:15
C>* 10.100.1.255/32 is directly connected, lo, 1d23h07m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h07m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h07m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h07m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h07m

Let's disable the backup link next hop on the MK router:

vyos@vyos:~$ ping 172.25.40.10
PING 172.25.40.10 (172.25.40.10) 56(84) bytes of data.
From 172.25.40.9 icmp_seq=1 Destination Host Unreachable
From 172.25.40.9 icmp_seq=2 Destination Host Unreachable
From 172.25.40.9 icmp_seq=3 Destination Host Unreachable
From 172.25.40.9 icmp_seq=4 Destination Host Unreachable
From 172.25.40.9 icmp_seq=5 Destination Host Unreachable
^C
--- 172.25.40.10 ping statistics ---
6 packets transmitted, 0 received, +5 errors, 100% packet loss, time 5129ms
pipe 4
vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h08m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h08m
K>* 10.100.1.1/32 [0/1] via 172.25.30.10, eth2, 00:00:24
C>* 10.100.1.255/32 is directly connected, lo, 1d23h08m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h08m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h08m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h08m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h08m

Now the route changes back to the main link. This is the only problem I found while testing.
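
To make the expected behavior explicit, the selection rule being tested here can be written as a small sketch: among the next hops whose check currently succeeds, the one with the lowest metric should be active. The function name and data layout below are hypothetical and only illustrate that rule; they are not the code from the PR. The metric of 1 for the main link matches the `[0/1]` shown in the route output above.

```python
def select_next_hop(next_hops, alive):
    """Return the healthy next hop with the lowest metric, or None."""
    healthy = [nh for nh in next_hops if alive.get(nh["address"], False)]
    if not healthy:
        return None
    return min(healthy, key=lambda nh: nh["metric"])


next_hops = [
    {"address": "172.25.30.10", "interface": "eth2", "metric": 1},
    {"address": "172.25.40.10", "interface": "eth3", "metric": 100},
]

# Main link down: the backup is chosen.
backup = select_next_hop(next_hops, {"172.25.30.10": False, "172.25.40.10": True})
# Main link back up: the selection returns to the main link.
main = select_next_hop(next_hops, {"172.25.30.10": True, "172.25.40.10": True})
print(backup["interface"], main["interface"])  # eth3 eth2
```

In the test above the second case is where the behavior differs: the backup route via eth3 stayed active even though the check for 172.25.30.10 succeeded again.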

will be fixed in the next rolling release

Nice! Gonna test later :D

Nice feature. I'm testing it now.

@Viacheslav, where is the best place to discuss the feature (ask a question or report a bug)?

@Harliff It is better to write in this task if you find bugs or want to propose new features,
so that anyone can claim/fix them.
Thanks.

@Viacheslav Ok!

Minor bug: because "sudo ip route show" is used instead of plain "ip route show", the system journal is flooded with messages from sudo.

Apr 04 14:31:29 R1001-new sudo[16102]:     root : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ip --json route show protocol failover 10.0.0.3/32 via 10.0.2.1 dev eth2 metric 20
Apr 04 14:31:29 R1001-new sudo[16102]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Apr 04 14:31:29 R1001-new sudo[16102]: pam_unix(sudo:session): session closed for user root

I recommend using plain "ip route show" (without sudo).
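
As a rough sketch of what that could look like, the query from the journal entry above can be run directly, since reading the routing table does not need root privileges. The function names are made up for illustration and this is not the actual VyOS helper code. It assumes, as the journal line suggests, that the `failover` protocol name is known to iproute2 on the system.

```python
import json
import subprocess


def failover_show_cmd(prefix, next_hop, interface, metric):
    """Build the route query seen in the journal, minus the sudo prefix."""
    return ["ip", "--json", "route", "show", "protocol", "failover",
            prefix, "via", next_hop, "dev", interface, "metric", str(metric)]


def route_present(ip_json_output):
    """Return True if `ip --json route show` printed at least one route."""
    text = ip_json_output.strip()
    return bool(text) and len(json.loads(text)) > 0


def is_installed(prefix, next_hop, interface, metric):
    """Query the kernel directly, without going through sudo."""
    out = subprocess.run(failover_show_cmd(prefix, next_hop, interface, metric),
                         capture_output=True, text=True).stdout
    return route_present(out)
```

With this, `is_installed("10.0.0.3/32", "10.0.2.1", "eth2", 20)` performs the same check as the logged command without opening a sudo session each time.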

Tested on VyOS 1.4-rolling-202304020811

Bug: unable to rename a failover route:

[edit protocols failover]
vyos@R1001-new# rename route 10.0.0.3/32  to route 10.0.0.4/32

  Rename failed

Is it possible to implement multiple test targets instead of just one?

Arguments:

  1. If we can define only one host, it is reasonable to choose the ISP gateway as this host. But we may face situations where the ISP gateway is alive while the internet is not reachable through it.
  2. If we use a host on the internet, then the host itself may go down.

We may choose a well-known reliable host (e.g. 8.8.8.8) as the ping target. But that host would become unavailable when the primary ISP goes down (because there would be a static route to e.g. 8.8.8.8/32 via ISP1).

Say we have check targets 203.0.113.1 and 192.0.2.1, and if any of these targets is unreachable, we delete the route.
Is that correct?

Targets and logs will be fixed in the next rolling release

Sorry, missed some messages.

Say we have check targets 203.0.113.1 and 192.0.2.1, and if any of these targets is unreachable, we delete the route.
Is that correct?

It is not correct. I think it would be better to remove the route only if ALL of the corresponding targets are unreachable.

A target may become unreachable due to a problem of its own rather than an uplink failure. This is the reason why I asked to add multiple targets per uplink.

@Harliff Could you re-check?

Yes! I'm going to do it now.

Maybe adding a check policy (any|all) would cover all cases,
something like:

set protocols failover route 192.0.2.55/32 next-hop 203.0.113.1 check policy any-available|all-available|any-fail|all-fail

and any-fail as the default behavior

PR https://github.com/vyos/vyos-1x/pull/1966

set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check policy 'any-available'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check target '192.168.122.1'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check target '192.168.122.11'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check timeout '3'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 interface 'eth0'

@Harliff Could you check it? Available in the latest rolling release

vyos@r14# set  protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check policy 
Possible completions:
   all-available        All targets must be alive
   any-available        Any target must be alive (default)
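
The difference between the two policies in the completions above can be shown with a small sketch. The function name and the dictionary layout are hypothetical and only illustrate the semantics described by the help text; this is not the code from the PR.

```python
def route_should_stay(results, policy="any-available"):
    """Decide whether the route stays installed.

    results maps each check target to True (alive) or False (unreachable).
    """
    if policy == "any-available":
        # One answering target is enough to keep the route.
        return any(results.values())
    if policy == "all-available":
        # Every target has to answer, otherwise the route is removed.
        return all(results.values())
    raise ValueError(f"unknown policy: {policy}")


checks = {"192.168.122.1": True, "192.168.122.11": False}
print(route_should_stay(checks, "any-available"))  # True
print(route_should_stay(checks, "all-available"))  # False
```

With any-available, a single target that fails for reasons of its own does not take the route down, which is the case multiple targets were requested for.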
Viacheslav claimed this task.
Viacheslav moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus (1.3.3) board.

Could you check it? Available in the latest rolling release

Sorry for the late answer. I've tested the feature and it is working fine. Thanks for your work!

It would be great to have an op-mode command to show the current failover status.

BTW, do you have any plans to use this mechanism for WAN failover?

Is it OK to discuss such topics here?

BTW, do you have any plans to use this mechanism for WAN failover?

@Harliff Could you explain or send some examples? Or create a new "feature request"

BTW, do you have any plans to use this mechanism for WAN failover?
Could you explain or send some examples?

Currently WAN load balancing / WAN failover is built on the wan_lb program.

I've spent some time recently trying to get it working as documented (on 1.4-rc3) and achieved only partial success:

  • failover is working
  • the failover status display is not working (the show wan failover command shows all uplinks as down while one of them is actually working)
  • I've failed to get the failover switch time below a minute

Also, I suspect that the disable-source-nat option does nothing.
I haven't written a bug report yet.

So I thought: what if VyOS used iproute2 instead of wan_lb for WAN failover?

I found a task named "wan load balance issues with 3 or more WANs".

Although I found nothing related to the number of WANs in the task description and comments, I want to note that I'm using 4 WANs in my test environment.

I made a mistake while configuring my test setup, so the WLB in my GNS3 was affected by conntrack on the host machine.
It looks like WLB works as it should.
Sorry to bother you!

WLB is a separate feature and is not related to protocols failover route.