Static Route Path Monitoring, failover
Closed, Resolved | Public | FEATURE REQUEST

Authored By

	adestis
	Feb 9 2019, 6:04 AM

Description

Hello all,
sometimes it's not possible to do dynamic routing because not all peers support it.
As a fallback, static routes are used.
I would like to see the possibility to monitor static routes by some kind of health checks like ping.
(Like Palo Alto does)

It's not the same as WAN load balancing, because PBR would add additional complexity.

Regards
Markus

Details

Version: -
Is it a breaking change?: Unspecified (possibly destroys the router)
Issue type: Unspecified (please specify)

Related Objects

Mentioned In: rVYOSONEX4ab192c7c9d4: T1237: Failover route add policy for targets checking
rVYOSONEX6b81f048a0cd: Merge pull request #1966 from sever-sever/T1237
rVYOSONEX1a402dd93974: T1237: Failover route add checks for multiple targets
rVYOSONEX3593ecfa51a6: Merge pull request #1941 from sever-sever/T1237
rVYOSONEXb1004dcd24ba: T1237: Fix failover route install route with diff metrics
rVYOSONEXbf790ab67d62: Merge pull request #1737 from sever-sever/T1237
rVYOSONEX932af7f09880: routing: T1237: Add new feature failover route
rVYOSONEXc44cd46619ba: Merge pull request #1358 from sever-sever/T1237
Mentioned Here: T4443: Wan Load Balancing Multiple Regressions

Event Timeline

adestis created this task. Feb 9 2019, 6:04 AM

Hi adestis, what you describe is possible today with the help of a shell script and the crontab. If you are interested, I could help you create a script that does this for you. The drawbacks are that the failover time is in the ballpark of minutes and that the routes are not present in the configuration. Also, cron fills the log with messages every time it executes.

pasik subscribed. Mar 12 2019, 6:06 PM

Hello runar,
I know that it's possible to do it manually.
But I really would like to see a more integrated solution where you can add a check for the next hop inside the configuration.

A solution based on cron is not ideal because cron's minimum interval is 1 minute.

MikroTik RouterOS supports something like this:

/ip route add gateway=192.0.2.1,192.0.2.2 check-gateway=ping

or check-gateway=arp for boxes that don't ping very well.

It would be really nifty to find a way to add this to VyOS, but it would also have to interact well with FRR to ensure these "semi-static" routes propagate through to IGP/EGP where there is a redistribute static in effect.

Would it be reasonable to use BFD for this? Since BFD is already implemented we might be able to use that as well?

@Cheeze_It BFD for static routes would be nice as well but sometimes the target you test against is not under your control and/or does not support BFD.

@adestis yes, that is true, but that can be worked around. Any option can be used (BFD, ARP, or ICMP). I just wanted to give more ideas so that hopefully we can get a working implementation for all 3.

jack9603301 subscribed. Aug 31 2020, 4:52 PM

So far I have seen that BFD for static routes in FRR is currently under development:
https://github.com/FRRouting/frr/issues/3369

(It seems that only the tests are missing.)

But so far I have not seen anything like what @maznu mentioned MikroTik has.
That really would be nice.

The way I was thinking is on this Juniper page here.

If you guys would like, I can mock it up in my lab, test it, and show you the configuration I used and maybe it would be possible for us to see if we can make something similar or at least with similar functionality.

etedor subscribed. Apr 28 2021, 7:20 AM

etedor unsubscribed.

eronlloyd subscribed. Aug 30 2021, 10:52 AM

syncer edited projects, added VyOS 1.3 Equuleus (1.3.0); removed VyOS 1.3 Equuleus. Nov 6 2021, 11:25 AM

mpueschel subscribed. May 11 2022, 11:06 PM

Viacheslav renamed this task from Static Route Path Monitoring to Static Route Path Monitoring, failover. Jun 5 2022, 12:04 PM

Viacheslav added a project: VyOS 1.4 Sagitta.

Viacheslav set Is it a breaking change? to Unspecified (possibly destroys the router).

Viacheslav set Issue type to Unspecified (please specify).

PR https://github.com/vyos/vyos-1x/pull/1358

set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check target '192.168.100.1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check timeout '10'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 check type 'icmp'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 interface 'eth1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.100.1 metric '2'
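
One plausible reading of the check options above, sketched in Python: a single ICMP echo sent out of the configured interface, with the timeout used as the reply wait in seconds. The mapping to ping flags is an assumption for illustration and is not taken from the PR; build_ping_command and target_is_alive are hypothetical names.

```python
import subprocess


def build_ping_command(target: str, timeout: int, interface: str) -> list[str]:
    """Return the argv for one ICMP echo request (iputils ping flags)."""
    # -c 1: send a single probe
    # -W:   seconds to wait for a reply
    # -I:   send the probe out of the given interface
    return ["ping", "-c", "1", "-W", str(timeout), "-I", interface, target]


def target_is_alive(target: str, timeout: int, interface: str) -> bool:
    """Run the probe; ping exits with status 0 when a reply was received."""
    result = subprocess.run(
        build_ping_command(target, timeout, interface),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

With the values from the configuration above, the probe would be built as build_ping_command("192.168.100.1", 10, "eth1").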

syncer edited projects, added VyOS 1.3 Equuleus (1.3.3); removed VyOS 1.3 Equuleus (1.3.0). Aug 29 2022, 7:05 AM

danhusan awarded a token. Oct 13 2022, 1:02 PM

danhusan subscribed. Oct 13 2022, 1:06 PM

Restricted Repository Identity mentioned this in rVYOSONEXc44cd46619ba: Merge pull request #1358 from sever-sever/T1237. Dec 17 2022, 7:12 AM

Viacheslav mentioned this in rVYOSONEX932af7f09880: routing: T1237: Add new feature failover route. Dec 17 2022, 7:12 AM

Viacheslav changed the task status from Open to Needs testing. Dec 20 2022, 9:16 AM

At first look it works, but it requires more tests and improvements.

set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check target '192.168.122.1'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check timeout '5'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 check type 'icmp'
set protocols failover route 203.0.113.1/32 next-hop 192.168.122.1 interface 'eth0'

show

vyos@r14:~$ show ip route 203.0.113.1
Routing entry for 203.0.113.1/32
  Known via "kernel", distance 0, metric 1, best
  Last update 00:04:42 ago
  * 192.168.122.1, via eth0

vyos@r14:~$ sudo ip route show proto failover
203.0.113.1 via 192.168.122.1 dev eth0 metric 1

Hello everyone,

It works but has one small problem: if you set two routes to the same destination with different metrics and the main link dies, it switches to the backup link, but it won't switch back to the main link when that link comes alive again. You have to "kill" the backup link to make the main route active again.

Here are my tests:
IP 10.100.1.1/32 - MikroTik router loopback
eth2 - main link - 172.25.30.9/30 (VyOS) - 172.25.30.10/30 (MK)
eth3 - backup link - 172.25.40.9/30 (VyOS) - 172.25.40.10/30 (MK)

 route 10.100.1.1/32 {
     next-hop 172.25.30.10 {
         check {
             target 172.25.30.10
             timeout 1
             type icmp
         }
         interface eth2
     }
     next-hop 172.25.40.10 {
         check {
             target 172.25.40.10
             timeout 1
             type icmp
         }
         interface eth3
         metric 100
     }
 }

vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h01m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h01m
K>* 10.100.1.1/32 [0/1] via 172.25.30.10, eth2, 00:00:29
C>* 10.100.1.255/32 is directly connected, lo, 1d23h01m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h01m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h01m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h01m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h01m

Now disabling the MikroTik IP 172.25.30.10:

vyos@vyos:~$ ping 172.25.30.10
PING 172.25.30.10 (172.25.30.10) 56(84) bytes of data.
From 172.25.30.9 icmp_seq=1 Destination Host Unreachable
From 172.25.30.9 icmp_seq=2 Destination Host Unreachable
From 172.25.30.9 icmp_seq=3 Destination Host Unreachable
From 172.25.30.9 icmp_seq=4 Destination Host Unreachable
From 172.25.30.9 icmp_seq=5 Destination Host Unreachable
^C
--- 172.25.30.10 ping statistics ---
7 packets transmitted, 0 received, +5 errors, 100% packet loss, time 6134ms
pipe 3
vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h03m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h03m
K>* 10.100.1.1/32 [0/100] via 172.25.40.10, eth3, 00:00:28
C>* 10.100.1.255/32 is directly connected, lo, 1d23h03m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h03m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h03m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h03m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h03m

Now enabling the IP 172.25.30.10 again on the MikroTik:

vyos@vyos:~$ ping 172.25.30.10
PING 172.25.30.10 (172.25.30.10) 56(84) bytes of data.
64 bytes from 172.25.30.10: icmp_seq=1 ttl=64 time=0.331 ms
64 bytes from 172.25.30.10: icmp_seq=2 ttl=64 time=0.328 ms
64 bytes from 172.25.30.10: icmp_seq=3 ttl=64 time=0.343 ms
64 bytes from 172.25.30.10: icmp_seq=4 ttl=64 time=0.315 ms
^C
--- 172.25.30.10 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3081ms
rtt min/avg/max/mdev = 0.315/0.329/0.343/0.010 ms
vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h07m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h07m
K>* 10.100.1.1/32 [0/100] via 172.25.40.10, eth3, 00:04:15
C>* 10.100.1.255/32 is directly connected, lo, 1d23h07m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h07m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h07m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h07m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h07m

Let's disable the backup link next hop on the MK router:

vyos@vyos:~$ ping 172.25.40.10
PING 172.25.40.10 (172.25.40.10) 56(84) bytes of data.
From 172.25.40.9 icmp_seq=1 Destination Host Unreachable
From 172.25.40.9 icmp_seq=2 Destination Host Unreachable
From 172.25.40.9 icmp_seq=3 Destination Host Unreachable
From 172.25.40.9 icmp_seq=4 Destination Host Unreachable
From 172.25.40.9 icmp_seq=5 Destination Host Unreachable
^C
--- 172.25.40.10 ping statistics ---
6 packets transmitted, 0 received, +5 errors, 100% packet loss, time 5129ms
pipe 4
vyos@vyos:~$ show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [210/0] via 10.10.1.254, eth0, weight 1, 1d23h08m
C>* 10.10.1.0/24 is directly connected, eth0, 1d23h08m
K>* 10.100.1.1/32 [0/1] via 172.25.30.10, eth2, 00:00:24
C>* 10.100.1.255/32 is directly connected, lo, 1d23h08m
C>* 10.250.250.0/30 is directly connected, vti10, 1d23h08m
C>* 172.20.30.0/24 is directly connected, eth1, 1d23h08m
C>* 172.25.30.8/30 is directly connected, eth2, 1d23h08m
C>* 172.25.40.8/30 is directly connected, eth3, 1d23h08m

Now the route changes back to the main link. This is the only problem I found while testing.
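
The behaviour expected in the test above (fall back to the backup, then return to the main link once its check passes again) can be sketched as a pure selection function. This is an illustration only and not the code from the PR; NextHop and select_active are hypothetical names, and the default metric of 1 is taken from the route output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NextHop:
    gateway: str
    interface: str
    metric: int = 1  # assumed default, matching the [0/1] route shown above


def select_active(next_hops, alive):
    """Return the lowest-metric next hop whose check passes, or None.

    `alive` maps a gateway address to the result of its latest check.
    Re-evaluating this on every check cycle is what lets the main link
    take over again as soon as it recovers.
    """
    candidates = [nh for nh in next_hops if alive.get(nh.gateway, False)]
    return min(candidates, key=lambda nh: nh.metric, default=None)


main = NextHop("172.25.30.10", "eth2")
backup = NextHop("172.25.40.10", "eth3", metric=100)
hops = [main, backup]

# Main link down: the backup is selected.
after_failure = select_active(hops, {"172.25.30.10": False, "172.25.40.10": True})
# Main link back up: the main link is selected again, backup still alive.
after_recovery = select_active(hops, {"172.25.30.10": True, "172.25.40.10": True})
```

The point of the sketch is that the decision depends only on the current check results and metrics, not on which route happens to be installed already.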

PR https://github.com/vyos/vyos-1x/pull/1737

Restricted Repository Identity mentioned this in rVYOSONEXbf790ab67d62: Merge pull request #1737 from sever-sever/T1237. Jan 5 2023, 6:23 AM

Viacheslav mentioned this in rVYOSONEXb1004dcd24ba: T1237: Fix failover route install route with diff metrics. Jan 5 2023, 6:23 AM

will be fixed in the next rolling release

In T1237#140040, @Viacheslav wrote:

will be fixed in the next rolling release

Nice! Gonna test later :D

Nice feature. I'm testing it now.

@Viacheslav, where is best place to discuss the feature (ask a question or report a bug)?

@Harliff It is better to write to this task if you find bugs or propose new features.
So anyone could claim/fix it.
Thanks.

Harliff awarded a token. Apr 4 2023, 11:28 AM

Harliff rescinded a token.

Harliff awarded a token.

@Viacheslav Ok!

Minor bug: because "sudo ip route show" is used instead of just "ip route show", the system journal is flooded with messages from the sudo subsystem.

Apr 04 14:31:29 R1001-new sudo[16102]:     root : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ip --json route show protocol failover 10.0.0.3/32 via 10.0.2.1 dev eth2 metric 20
Apr 04 14:31:29 R1001-new sudo[16102]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Apr 04 14:31:29 R1001-new sudo[16102]: pam_unix(sudo:session): session closed for user root

I recommend using just "ip route show" (without sudo).
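
A minimal sketch of the suggested change, assuming the checking process already runs as root (the journal lines above show USER=root invoked by uid=0). The command mirrors the one in the log, minus sudo. The function names are hypothetical and this is not the code from vyos-1x.

```python
import json
import subprocess


def show_route_command(prefix: str, gateway: str, interface: str, metric: int) -> list[str]:
    """Build the query seen in the journal, without the sudo prefix."""
    return [
        "ip", "--json", "route", "show", "protocol", "failover",
        prefix, "via", gateway, "dev", interface, "metric", str(metric),
    ]


def route_is_installed(ip_json: str) -> bool:
    """True if the JSON output of the query lists at least one route."""
    text = ip_json.strip()
    return bool(json.loads(text)) if text else False


def query_route(prefix: str, gateway: str, interface: str, metric: int) -> bool:
    """Run the query directly; no sudo session is opened, so nothing is logged by PAM."""
    result = subprocess.run(
        show_route_command(prefix, gateway, interface, metric),
        capture_output=True,
        text=True,
        check=False,
    )
    return route_is_installed(result.stdout)
```

For the route in the log, the call would be query_route("10.0.0.3/32", "10.0.2.1", "eth2", 20).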

Tested on VyOS 1.4-rolling-202304020811

Bug: unable to rename a failover route:

[edit protocols failover]
vyos@R1001-new# rename route 10.0.0.3/32  to route 10.0.0.4/32

  Rename failed

Is it possible to implement multiple test targets instead of just one?

Arguments:

If we can define only one host, it is reasonable to choose the ISP gateway as this host. But we may face situations where the ISP gateway is alive while the internet is not reachable through it.

If we use a host on the internet, then the host itself may go down.

We may choose a well-known reliable host (e.g. 8.8.8.8) as the ping target. But that host would become unavailable when the primary ISP goes down (because there would be a static route to e.g. 8.8.8.8/32 via ISP1).

We have targets-checks 203.0.113.1, 192.0.2.1, and if any of these targets are unreachable, we delete this route.
Is it correct?

PR https://github.com/vyos/vyos-1x/pull/1941

Restricted Repository Identity mentioned this in rVYOSONEX3593ecfa51a6: Merge pull request #1941 from sever-sever/T1237. Apr 10 2023, 2:26 PM

Viacheslav mentioned this in rVYOSONEX1a402dd93974: T1237: Failover route add checks for multiple targets. Apr 10 2023, 2:26 PM

Targets and logs will be fixed in the next rolling release

@Harliff Could you re-check?

Sorry, missed some messages.

In T1237#146586, @Viacheslav wrote:

We have targets-checks 203.0.113.1, 192.0.2.1, and if any of these targets are unreachable, we delete this route.
Is it correct?

It is not correct. I think it would be better to remove the route if ALL of corresponding targets are unreachable.

A target may become unreachable due to a problem of its own rather than an uplink failure. This is the reason why I asked to add multiple targets per uplink.

In T1237#146839, @Viacheslav wrote:

@Harliff Could you re-check?

Yes! I'm going to do it now.

In T1237#147125, @Harliff wrote:

Sorry, missed some messages.

In T1237#146586, @Viacheslav wrote:

We have targets-checks 203.0.113.1, 192.0.2.1, and if any of these targets are unreachable, we delete this route.
Is it correct?

It is not correct. I think it would be better to remove the route if ALL of corresponding targets are unreachable.

A target may become unreachable due to a problem of its own rather than an uplink failure. This is the reason why I asked to add multiple targets per uplink.

Maybe adding check policy any|all will be suitable for all cases
something like

set protocols failover route 192.0.2.55/32 next-hop 203.0.113.1 check policy any-available|all-available|any-fail|all-fail

and any-fail as the default behavior

That would be great!

PR https://github.com/vyos/vyos-1x/pull/1966

set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check policy 'any-available'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check target '192.168.122.1'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check target '192.168.122.11'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check timeout '3'
set protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 interface 'eth0'

Restricted Repository Identity mentioned this in rVYOSONEX6b81f048a0cd: Merge pull request #1966 from sever-sever/T1237. Apr 22 2023, 7:27 PM

Viacheslav mentioned this in rVYOSONEX4ab192c7c9d4: T1237: Failover route add policy for targets checking. Apr 22 2023, 7:27 PM

@Harliff Could you check it? Available in the latest rolling release

vyos@r14# set  protocols failover route 192.0.2.55/32 next-hop 192.168.122.1 check policy 
Possible completions:
   all-available        All targets must be alive
   any-available        Any target must be alive (default)
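
The two policies in the completions above can be read as any() and all() over the per-target check results. A minimal sketch under that assumption; it is not the implementation from the PR, route_should_be_installed is a hypothetical name, and treating an empty target list as "not alive" is a choice made here for illustration.

```python
def route_should_be_installed(policy: str, results: dict[str, bool]) -> bool:
    """Decide whether to keep the route, given target -> alive results."""
    if not results:
        # Assumption: with no targets checked, do not claim the path is healthy.
        return False
    if policy == "any-available":
        # At least one target answered, so the uplink is considered usable.
        return any(results.values())
    if policy == "all-available":
        # Every target must answer for the route to stay installed.
        return all(results.values())
    raise ValueError(f"unknown check policy: {policy}")


# One of the two targets from the example configuration is down.
checks = {"192.168.122.1": True, "192.168.122.11": False}
keep_with_any = route_should_be_installed("any-available", checks)
keep_with_all = route_should_be_installed("all-available", checks)
```

With any-available, a single target that fails for reasons of its own does not remove the route, which is the case raised earlier in the thread.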

Viacheslav closed this task as Resolved. Jun 28 2023, 8:22 AM

Viacheslav claimed this task.

Viacheslav moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus (1.3.3) board.

JeffWDH subscribed. Jul 20 2023, 12:34 PM

This comment was removed by JeffWDH.

Viacheslav moved this task from Open to Finished on the VyOS 1.4 Sagitta board. Oct 12 2023, 6:31 AM

Viacheslav removed a project: VyOS 1.3 Equuleus (1.3.3).

Could you check it? Available in the latest rolling release

Sorry for the late answer. I've tested the feature and it is working fine. Thanks for your work!

It would be great to have an op-mode command that shows the current failover status.

BTW, do you have any plans to use this mechanism for WAN failover?

Is it OK to discuss such topics here?

In T1237#174506, @Harliff wrote:

BTW, do you have any plans to use this mechanism for WAN failover?

@Harliff Could you explain or send some examples? Or create a new "feature request"

BTW, do you have any plans to use this mechanism for WAN failover?
Could you explain or send some examples?

Currently, WAN load balancing / WAN failover is built on the wan_lb program.

I've spent some time recently trying to get it working as documented (on 1.4-rc3) and achieved only partial success:

failover is working
failover status display is not working (the show wan failover command shows that all uplinks are down while one of them is actually working)
I've failed to get the failover switch time below a minute

Also, I suspect that the disable-source-nat option does nothing.
I haven't written a bug report yet.

So I thought: what if VyOS used iproute2 instead of wan_lb for WAN failover?

I found a task named wan load balance issues with 3 or more WANs.

Although I've found nothing related to the number of WANs in the task description and comments, I want to note that I'm using 4 WANs in my test environment.

I made a mistake while configuring my test setup, so WLB in my GNS3 lab was affected by conntrack on the host machine.
Looks like the WLB works as it should.
Sorry to bother you!

I've spent some time recently trying to get [WLB] working as documented (on 1.4-rc3) and achieved only partial success:

failover is working

failover status display is not working (the show wan failover command shows that all uplinks are down while one of them is actually working)

I've failed to get the failover switch time below a minute

Also, I suspect that the disable-source-nat option does nothing.
I haven't written a bug report yet.

WLB is a separate feature and is not related to protocols failover route.
