
WAN failover, not to balance the load
On hold, Low, Public, BUG

Description

SCENARIO:

This is a textbook implementation of WAN failover. Two internet circuits from different providers. The primary circuit is on eth1. The backup circuit is on eth2. LAN is eth0. The router provides a NAT and does masquerading from the LAN to the internet.

When the primary circuit is in good health and up, only the primary circuit should be used for traffic between the internet and the LAN. The backup circuit should be like a hot spare, otherwise not in use.

When a health check determines that the primary interface has failed, the router should fail over to the backup circuit. When on the backup circuit, only the backup circuit should be used for traffic between the internet and the LAN.
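
The behaviour expected above amounts to a strict-priority choice between the two circuits. The following is a minimal sketch of that expectation only, not of how VyOS implements it; the function name `select_wan` and the health map are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def select_wan(priority: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first healthy interface in priority order, or None if all are down."""
    for iface in priority:
        if healthy.get(iface, False):
            return iface
    return None


# eth1 is the primary circuit, eth2 is the hot spare.
order = ["eth1", "eth2"]
print(select_wan(order, {"eth1": True, "eth2": True}))   # primary healthy: eth1 only
print(select_wan(order, {"eth1": False, "eth2": True}))  # primary failed: eth2 only
```

With this model the backup is never chosen while the primary is healthy, which is the difference between failover and weighted load balancing.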

PROBLEMS:

  1. The router's primary address on eth1 is found in traffic on eth2. The router's backup interface's IP address is sometimes found in traffic on eth1.
  2. The router's own WAN IP address is replacing the IP address of remote hosts accessing services on a LAN host. This makes it impossible to know where the packet actually came from. Applications such as SIP and my VPN need to know the real peer IP address, not the address of my primary WAN interface.
  3. Traffic destined for a LAN host going through the router via the address on the primary interface (eth1) will reach the LAN host, and the LAN host will respond correctly, but the router may try to send the replies out the backup interface (eth2) instead of the interface that the traffic arrived on (eth1).
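
The last problem can be stated as a rule: a reply belongs on the interface its flow first arrived on. A rough sketch of that rule follows, assuming a simple per-flow table; the class `InboundFlowTable` and the remote address 203.0.113.7 are hypothetical and this is not the router's actual connection tracking.

```python
from typing import Dict, Tuple

# (protocol, remote address, remote port, LAN address, LAN port)
FlowKey = Tuple[str, str, int, str, int]


class InboundFlowTable:
    """Remembers which WAN interface each inbound flow first arrived on."""

    def __init__(self) -> None:
        self._arrival_iface: Dict[FlowKey, str] = {}

    def record_inbound(self, key: FlowKey, iface: str) -> None:
        # Only the first packet of a flow decides the interface.
        self._arrival_iface.setdefault(key, iface)

    def reply_iface(self, key: FlowKey, default_iface: str) -> str:
        # Known flows go back the way they came; everything else uses the default.
        return self._arrival_iface.get(key, default_iface)


flows = InboundFlowTable()
sip = ("udp", "203.0.113.7", 5060, "192.168.192.242", 5060)
flows.record_inbound(sip, "eth1")
print(flows.reply_iface(sip, default_iface="eth2"))  # eth1, not the default eth2
```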

CONFIGURATION:

interfaces {
    ethernet eth0 {
        address 192.168.192.1/24
        description LAN_LAN_LAN
        duplex auto
        hw-id 8c:89:a5:99:4a:8a
        smp-affinity auto
        speed auto
    }
    ethernet eth1 {
        address dhcp
        description Cable_Primary_WAN
        duplex auto
        hw-id 8c:89:a5:99:4a:8b
        smp-affinity auto
        speed auto
    }
    ethernet eth2 {
        address dhcp
        description DSL_Backup_WAN
        duplex auto
        hw-id 8c:89:a5:99:4a:8c
        smp-affinity auto
        speed auto
    }
}

load-balancing {
    wan {
        enable-local-traffic
        flush-connections
        interface-health eth1 {
            failure-count 5
            nexthop dhcp
            success-count 1
            test 10 {
                resp-time 5
                target 8.8.8.8
                ttl-limit 1
                type ping
            }
            test 20 {
                resp-time 5
                target 4.2.2.1
                ttl-limit 1
                type ping
            }
        }
        interface-health eth2 {
            failure-count 4
            nexthop dhcp
            success-count 1
            test 10 {
                resp-time 5
                target 8.8.4.4
                ttl-limit 1
                type ping
            }
            test 20 {
                resp-time 5
                target 4.2.2.2
                ttl-limit 1
                type ping
            }
        }
        rule 10 {
            failover
            inbound-interface eth0
            interface eth1 {
                weight 10
            }
            interface eth2 {
                weight 1
            }
            protocol all
        }
        sticky-connections {
            inbound
        }
    }
}

nat {
    destination {
        rule 40 {
            description "Preroute IAX2"
            destination {
                port iax
            }
            inbound-interface eth1
            protocol udp
            translation {
                address 192.168.192.242
            }
        }
        rule 44 {
            description "Preroute SIP"
            destination {
                port 5060-5061
            }
            inbound-interface eth1
            protocol tcp_udp
            translation {
                address 192.168.192.242
            }
        }
        rule 49 {
            description "Preroute RTP"
            destination {
                port 10000-20000
            }
            inbound-interface eth1
            protocol udp
            translation {
                address 192.168.192.242
            }
        }
        rule 2655 {
            description "Preroute tinc VPN to HP"
            destination {
                port 2655
            }
            inbound-interface eth1
            protocol tcp_udp
            translation {
                address 192.168.192.58
                port tinc
            }
        }
        rule 7040 {
            description "Preroute IAX2"
            destination {
                port iax
            }
            inbound-interface eth2
            protocol udp
            translation {
                address 192.168.192.242
            }
        }
        rule 7044 {
            description "Preroute SIP"
            destination {
                port 5060-5061
            }
            inbound-interface eth2
            protocol tcp_udp
            translation {
                address 192.168.192.242
            }
        }
        rule 7049 {
            description "Preroute RTP"
            destination {
                port 10000-20000
            }
            inbound-interface eth2
            protocol udp
            translation {
                address 192.168.192.242
            }
        }
        rule 7655 {
            description "Preroute tinc VPN to HP"
            destination {
                port 2655
            }
            inbound-interface eth2
            protocol tcp_udp
            translation {
                address 192.168.192.58
                port tinc
            }
        }
    }
    source {
        rule 110 {
            description "Hairpin NAT"
            destination {
                address 192.168.192.0/24
            }
            outbound-interface eth0
            source {
                address 192.168.192.0/24
            }
            translation {
                address masquerade
            }
        }
        rule 192 {
            description NAT
            outbound-interface eth1
            source {
                address 192.168.192.0/24
            }
            translation {
                address masquerade
            }
        }
        rule 193 {
            description NAT
            outbound-interface eth2
            source {
                address 192.168.192.0/24
            }
            translation {
                address masquerade
            }
        }
    }
}

EVIDENCE:

rob@Rt-9877:~$ sh int e
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface  IP Address          S/L  Description
eth0       192.168.192.1/24    u/u  LAN_LAN_LAN
eth1       24.217.88.23/21     u/u  Cable_Primary_WAN
eth2       162.205.147.147/22  u/u  DSL_Backup_WAN
eth3       -                   u/D

rob@Rt-9877:~$ sh wan
Interface: eth1

Status:  active
Last Status Change:  Thu Aug 31 05:26:42 2017
+Test:  ping  Target: 8.8.8.8
 Test:  ping  Target: 4.2.2.1
  Last Interface Success:  0s 
  Last Interface Failure:  n/a                
  # Interface Failure(s):  0

Interface: eth2

Status:  active
Last Status Change:  Thu Aug 31 05:26:42 2017
+Test:  ping  Target: 8.8.4.4
 Test:  ping  Target: 4.2.2.2
  Last Interface Success:  0s 
  Last Interface Failure:  n/a                
  # Interface Failure(s):  0
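
For reference, here is a rough sketch of how I read the failure-count 5 / success-count 1 settings on eth1: the interface goes inactive after that many consecutive failed probes and comes back after that many consecutive successes. The consecutive-count semantics are an assumption on my part, and the class `InterfaceHealth` is hypothetical, not VyOS code.

```python
class InterfaceHealth:
    """Tracks active/inactive state from a stream of probe results."""

    def __init__(self, failure_count: int, success_count: int) -> None:
        self.failure_count = failure_count
        self.success_count = success_count
        self.active = True
        self._fails = 0  # consecutive failed probes
        self._oks = 0    # consecutive successful probes

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting active state."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.active and self._oks >= self.success_count:
                self.active = True
        else:
            self._fails += 1
            self._oks = 0
            if self.active and self._fails >= self.failure_count:
                self.active = False
        return self.active


eth1 = InterfaceHealth(failure_count=5, success_count=1)
states = [eth1.record(False) for _ in range(5)]
print(states)             # [True, True, True, True, False]
print(eth1.record(True))  # True
```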

Here you can see the primary interface's IP address talking on the backup interface:

rob@Rt-9877:~$ tshark -i eth2 -f "host 24.217.88.23"
Capturing on 'eth2'

1   0.000000 24.217.88.23 -> 8.8.4.4      ICMP 146 Destination unreachable (Port unreachable)
2   0.020403 24.217.88.23 -> 208.67.222.220 ICMP 146 Destination unreachable (Port unreachable)
3  25.137963 24.217.88.23 -> 208.67.222.222 ICMP 127 Destination unreachable (Port unreachable)
4  25.139023 24.217.88.23 -> 8.8.4.4      ICMP 127 Destination unreachable (Port unreachable)
5  25.151657 24.217.88.23 -> 208.67.222.220 ICMP 127 Destination unreachable (Port unreachable)
6  69.506506 24.217.88.23 -> 208.67.222.220 ICMP 164 Destination unreachable (Port unreachable)
7  69.512728 24.217.88.23 -> 208.67.220.220 ICMP 164 Destination unreachable (Port unreachable)
8  69.518716 24.217.88.23 -> 208.67.220.222 ICMP 164 Destination unreachable (Port unreachable)
9  91.828302 24.217.88.23 -> 208.67.220.220 ICMP 294 Destination unreachable (Port unreachable)
10 91.855232 24.217.88.23 -> 208.67.222.220 ICMP 294 Destination unreachable (Port unreachable)
11 124.915055 24.217.88.23 -> 8.8.4.4 ICMP 127 Destination unreachable (Port unreachable)
12 124.929002 24.217.88.23 -> 208.67.222.220 ICMP 127 Destination unreachable (Port unreachable)
13 124.953739 24.217.88.23 -> 208.67.220.220 ICMP 127 Destination unreachable (Port unreachable)
14 150.098830 24.217.88.23 -> 208.67.220.220 ICMP 121 Destination unreachable (Port unreachable)
15 150.106534 24.217.88.23 -> 208.67.222.220 ICMP 121 Destination unreachable (Port unreachable)
16 150.106603 24.217.88.23 -> 208.67.220.222 ICMP 121 Destination unreachable (Port unreachable)
17 182.941209 24.217.88.23 -> 71.10.216.2 ICMP 147 Destination unreachable (Port unreachable)
18 182.957277 24.217.88.23 -> 8.8.4.4 ICMP 147 Destination unreachable (Port unreachable)
19 182.984679 24.217.88.23 -> 208.67.222.220 ICMP 147 Destination unreachable (Port unreachable)
20 204.003832 24.217.88.23 -> 208.67.222.220 ICMP 143 Destination unreachable (Port unreachable)
21 204.003906 24.217.88.23 -> 208.67.220.222 ICMP 143 Destination unreachable (Port unreachable)
22 204.022680 24.217.88.23 -> 208.67.222.222 ICMP 143 Destination unreachable (Port unreachable)
23 229.996563 24.217.88.23 -> 8.8.4.4 ICMP 127 Destination unreachable (Port unreachable)
24 230.012135 24.217.88.23 -> 208.67.222.220 ICMP 127 Destination unreachable (Port unreachable)
^C24 packets captured
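
A capture like the one above can be checked mechanically. The sketch below assumes the tshark one-line summary format shown here and flags packets whose source is a router address belonging to a different interface than the one being captured; the helper name `foreign_sources` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches: <frame no.> <time> <source> -> <destination> ...
LINE = re.compile(r"^\s*\d+\s+[\d.]+\s+(\S+)\s+->\s+(\S+)")


def foreign_sources(lines: List[str], capture_iface: str,
                    iface_addr: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (src, dst, owning_iface) for packets sourced from another interface's address."""
    owner = {addr: iface for iface, addr in iface_addr.items()}
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip banners such as "Capturing on 'eth2'"
        src, dst = m.group(1), m.group(2)
        src_iface = owner.get(src)
        if src_iface is not None and src_iface != capture_iface:
            hits.append((src, dst, src_iface))
    return hits


addrs = {"eth1": "24.217.88.23", "eth2": "162.205.147.147"}
sample = [
    "Capturing on 'eth2'",
    "1   0.000000 24.217.88.23 -> 8.8.4.4      ICMP 146 Destination unreachable (Port unreachable)",
    "2   0.020403 24.217.88.23 -> 208.67.222.220 ICMP 146 Destination unreachable (Port unreachable)",
]
for hit in foreign_sources(sample, "eth2", addrs):
    print(hit)
```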

Here is an example of how the real peer IP address has been improperly replaced with my own primary WAN address.

[Aug 31 11:58:20] NOTICE[14058][C-000005f8]: chan_sip.c:25653 handle_request_invite: Call from '' (24.217.88.23:5071) to extension '+48914472532' rejected because extension not found in context 'unauthenticated'.
[Aug 31 12:10:23] NOTICE[21430][C-000005f9]: Ext. 01148914472532:2 @ unauthenticated: "1" <[email protected]> is attempting to make unauthorized calls
[Aug 31 13:28:40] NOTICE[22717][C-000005fa]: Ext. 0048914472532:2 @ unauthenticated: "1" <[email protected]> is attempting to make unauthorized calls

Here is an example of how those log entries SHOULD look. Notice that the IP address in these is not my own.

[Aug 31 14:29:38] NOTICE[24035][C-00000634]: Ext. 1807706966:2 @ unauthenticated: "1807706966" <[email protected]> is attempting to make unauthorized calls
[Aug 31 14:29:38] NOTICE[24036][C-00000635]: Ext. 3182044188:2 @ unauthenticated: "3182044188" <[email protected]> is attempting to make unauthorized calls
[Aug 31 14:29:38] NOTICE[24037][C-00000636]: Ext. 100:2 @ unauthenticated: "100" <[email protected]> is attempting to make unauthorized calls
[Aug 31 14:29:38] NOTICE[24038][C-00000637]: Ext. 101:2 @ unauthenticated: "101" <[email protected]> is attempting to make unauthorized calls

Additional tests can be performed on request.

Details

Difficulty level
Hard (possibly days)
Version
999.201708272137
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Unspecified (please specify)

Event Timeline

Hey Rob,

Might be handy to post:
sh nat
sh system conntrack modules sip

source rule 
outbound-interface eth1
 source {
     address 192.168.1.0/24
 }
 translation {
     address masquerade
 }
 
 
outbound-interface eth2
 source {
     address 192.168.1.0/24
 }
 translation {
     address masquerade
 }

sip
disable/enable-indirect-media/enable-indirect-signalling

In T375#7874, @p10003 wrote:

Hey Rob,

Might be handy to post:
sh nat
sh system conntrack modules sip

sip
disable/enable-indirect-media/enable-indirect-signalling

Updated task with nat rules.

sh system conntrack modules sip
Configuration under specified path is empty

In my opinion, the problem may come from an intentional rewriting of the external IP address within the packets. I did, however, set disable-source-nat, but I did not see any difference.
https://wiki.vyos.net/wiki/WAN_load_balancing#Source_NAT_rules

So one other tiny thing you might try since you have 192.168.0.1/24 (eth0) but different 192.168.192.0/24 (nats).

While capable (no sh protocols handy) usually that implies some other L3 router/switch, which wasn't mentioned. (Or an un-posted eth3 interface.)

If you need both networks, may want to temporarily add another address on eth0 or better yet create a separate vlan eth0.192 (vif) on this vyos box.

If all hosts are really in .192. and no other routes or L3 device exists just update that eth0 entry.

You do correctly have dual rules for both interfaces, source & destination. (Maybe we'll get a shortcut bidirectional config option down the road.)

Yes, you should be covered on that LB whether configured:

  • default auto SNAT
  • flipping disable-source-nat & manual dual rules
  • both auto SNAT & manual dual rules

Plus you're doing failover + flush-connections so shouldn't see quirks for very long.
.192. traffic has dual rules & .0. the interface is auto

However you're telling the box to track and masquerade out when the source comes from 192.168.192 - but the link between your 192.168.192 (posted lan) & 192.168.0 (posted eth0 lan interface) is kinda non-existent from our viewpoint.

Making the next step: Confirm .0. is not typo and draw us the bigger picture, or stick an interface in .192. on the vyos box.

In T375#7897, @p10003 wrote:

So one other tiny thing you might try since you have 192.168.0.1/24 (eth0) but different 192.168.192.0/24 (nats).

If all hosts are really in .192. and no other routes or L3 device exists just update that eth0 entry.

Making the next step: Confirm .0. is not typo and draw us the bigger picture, or stick an interface in .192. on the vyos box.

This was an error on my part in this post on here. That's what I get for posting parts at different times. The subnet is 192.168.192.0/24 with no other L3 devices and no VLANs at all. Sorry about the mistake and the confusion.

Run tcpdump on your WAN with filter ICMP to confirm probing goes haywire; should be pretty easy to spot as you employed four different targets.

Two ways to fix this: create a route for testing-target-ip-wan1 over your wan1 gateway, and testing-target-ip-wan2 over your wan2 gateway. Otherwise your probing will not use the correct interface.
Alternatively, use your immediate gateways as the test targets or, if you are using DHCP, don't fill out a test target at all and it will use the gateway assigned by your DHCP lease.
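
To illustrate the first option, here is a small sketch that prints one static host route per test target, pinned to the gateway of the uplink that probes it. The targets are taken from the posted configuration; the gateway addresses are hypothetical placeholders (the real ones come from DHCP and are not shown in this task), and the helper name `probe_route_commands` is made up for the example.

```python
from typing import Dict, List


def probe_route_commands(targets: Dict[str, List[str]],
                         gateways: Dict[str, str]) -> List[str]:
    """Build VyOS 'set' commands routing each probe target via its own uplink gateway."""
    cmds = []
    for iface, ips in targets.items():
        for ip in ips:
            cmds.append(
                f"set protocols static route {ip}/32 next-hop {gateways[iface]}"
            )
    return cmds


targets = {"eth1": ["8.8.8.8", "4.2.2.1"], "eth2": ["8.8.4.4", "4.2.2.2"]}
gateways = {"eth1": "198.51.100.1", "eth2": "203.0.113.1"}  # hypothetical gateways
for cmd in probe_route_commands(targets, gateways):
    print(cmd)
```

Since both uplinks use DHCP here, a gateway hard-coded like this would need updating whenever the lease hands out a different one.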

syncer subscribed.

@EwaldvanGeffen can I mark this as solved?

@syncer if there is an actual issue we need more input from the user to continue. ONHOLD for me (or ARCHIVE).

syncer changed the task status from Open to On hold. Oct 13 2018, 4:32 PM

@Penguin your input is required

Since WAN load balancing/failover is due for a complete rewrite, perhaps it's better to move this to 1.3.0

@Penguin
I can't say anything definite about point 1.

  1. It will be fixed with additional functions/pkgs UPnP/miniupnp, STUN/coturn. Need to test it.
  2. It will be fixed with local PBR (ip rules) T439 T2196 and work-around in T2747

    If you have time and test environment, ping me.
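
For anyone wanting to experiment with the local PBR idea in the meantime, the general shape with plain iproute2 is one routing table per uplink plus a rule that selects it by source address. The sketch below only prints the commands; the WAN addresses are the ones shown in the task, while the gateways and table numbers are hypothetical, and this is not the workaround from T2747 itself.

```python
from typing import List


def source_routing_commands(src_addr: str, gateway: str,
                            dev: str, table: int) -> List[str]:
    """Build iproute2 commands so traffic sourced from src_addr leaves via dev."""
    return [
        # Per-uplink table holding only a default route through that uplink.
        f"ip route add default via {gateway} dev {dev} table {table}",
        # Packets carrying this source address consult that table.
        f"ip rule add from {src_addr} lookup {table}",
    ]


uplinks = [
    ("24.217.88.23", "198.51.100.1", "eth1", 101),    # gateway and table are hypothetical
    ("162.205.147.147", "203.0.113.1", "eth2", 102),  # gateway and table are hypothetical
]
for src, gw, dev, table in uplinks:
    for cmd in source_routing_commands(src, gw, dev, table):
        print(cmd)
```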
dmbaturin changed Difficulty level from Unknown (require assessment) to Hard (possibly days).
dmbaturin set Is it a breaking change? to Unspecified (possibly destroys the router).

Can I please check if this may have any resolution in the near future?
I am hoping to rely on this, but at the moment it just load balances instead of failing over, and that gets very, very expensive when the backup is an Ethernet-port-based 4G modem.

Much appreciated. Thanks!

dmbaturin set Issue type to Unspecified (please specify).
dmbaturin added a subscriber: zsdc.