vrf bind-to-all not working for TCP
Confirmed, Normal, Public, BUG

Assigned To

None

Authored By

	exp
	Jun 10 2023, 6:24 AM

Description

This is a follow-up to my forum discussion about what I believe is a bug: https://forum.vyos.io/t/why-do-my-outgoing-tcp-connections-fail-when-icmp-and-incoming-connections-are-ok/11185

The following is a minimal reproducible example (MRE), the complete config on 1.3-rolling-202305190616:

interfaces {
    ethernet eth0 {
        vif 2 {
            address dhcp
            description sonic
            vrf vrf_sonic
        }
        vif 3 {
            address 10.227.79.2/24
        }
    }
    loopback lo {
    }
}
policy {
    local-route {
        rule 101 {
            destination 0.0.0.0/0
            set {
                table local
            }
        }
        rule 102 {
            destination 0.0.0.0/0
            set {
                table main
            }
        }
        rule 104 {
            destination 0.0.0.0/0
            set {
                table 170
            }
        }
    }
}
system {
    config-management {
        commit-revisions 100
    }
    conntrack {
        modules {
            ftp
            h323
            nfs
            pptp
            sip
            sqlnet
            tftp
        }
    }
    console {
        device ttyS0 {
            speed 115200
        }
    }
    host-name Test1
    login {
        user vyos {
            authentication { xxx
            }
        }
    }
    name-server eth0.2
    ntp {
        server time1.vyos.net {
        }
        server time2.vyos.net {
        }
        server time3.vyos.net {
        }
    }
    syslog {
        global {
            facility all {
                level info
            }
            facility protocols {
                level debug
            }
        }
    }
    time-zone America/Los_Angeles
}
vrf {
    bind-to-all
    name vrf_sonic {
        table 170
    }
}

Setup

eth0.2 gets its IP address and default route via DHCP. It is enslaved to VRF vrf_sonic so that the default route lands in table 170 (and not in the main table).
A local-route policy is created that takes precedence over the l3mdev (and other) rules: for every packet, the local table is consulted first, then the main table, and finally table 170 (which contains the default route).
vrf bind-to-all is set so that responses to locally generated packets from processes that are NOT bound to the VRF device are still accepted, even though they arrive through the VRF-enslaved device (eth0.2).
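
To illustrate the intended lookup order, here is a minimal Python sketch of first-match policy routing. It is only a model, not how the kernel implements it: the rule priorities come from the config above, while the table contents are simplified assumptions (prefixes only, no metrics, scopes, or route types), and the names RULES, TABLES, and lookup are hypothetical.

```python
import ipaddress

# Rule priorities and tables, mirroring the local-route policy in the config.
RULES = [(101, "local"), (102, "main"), (104, "170")]

# Assumed, simplified table contents (prefixes only).
TABLES = {
    "local": ["127.0.0.0/8", "10.227.79.2/32"],
    "main": ["10.227.79.0/24"],
    "170": ["0.0.0.0/0", "135.180.56.0/21"],
}


def lookup(dst, rules=RULES, tables=TABLES):
    """Return (priority, table, prefix) for the first rule whose table routes dst."""
    addr = ipaddress.ip_address(dst)
    for priority, table in sorted(rules):
        nets = [ipaddress.ip_network(p) for p in tables.get(table, [])]
        matches = [n for n in nets if addr in n]
        if matches:
            # Within one table, the longest matching prefix wins.
            best = max(matches, key=lambda n: n.prefixlen)
            return priority, table, str(best)
    return None


print(lookup("8.8.8.8"))  # (104, '170', '0.0.0.0/0')
```

In this model, traffic to 8.8.8.8 misses the local and main tables and falls through to the default route in table 170, which is the behavior the setup is meant to produce.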

Desired outcome

The desired outcome of the config above is behavior identical to a setup that uses no VRF and no routing table 170 at all.
In that case, the default route would land directly in the main table.

What fails

First, confirm config is as expected:

$ show vrf

VRF name          state     mac address        flags                     interfaces
--------          -----     -----------        -----                     ----------

vrf_sonic         up        12:9e:9b:30:bf:9a  noarp,master,up,lower_up  eth0.2

$ show interfaces 
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth0             -                                 u/u  
eth0.2           135.180.59.5/21                   u/u  sonic
eth0.3           10.227.79.2/24                    u/u  
lo               127.0.0.1/8                       u/u  
                 ::1/128                                
$ show ip route table 170
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup

VRF default table 170:
S>* 0.0.0.0/0 [210/0] via 135.180.56.1, eth0.2, weight 1, 00:14:26
C>* 135.180.56.0/21 is directly connected, eth0.2, 00:14:27
$ ip rule 
101:    from all lookup local
102:    from all lookup main
104:    from all lookup vrf_sonic
1000:   from all lookup [l3mdev-table]
2000:   from all lookup [l3mdev-table] unreachable
32765:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

ICMP (ping) works:

$ ping 8.8.8.8 count 2
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=113 time=4.24 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=113 time=5.12 ms

--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 4.239/4.680/5.122/0.446 ms

UDP (DNS) works as well:

$ dig @8.8.8.8 www.google.com. +short
142.251.32.36

However, TCP fails:

$ curl www.google.com
curl: (7) Failed to connect to www.google.com port 80: Connection timed out

In the context of the VRF, though, it works:

$ sudo ip vrf exec vrf_sonic curl www.google.com
[...] call(this);</script></body></html>

Starting tcpdump in parallel with "curl www.google.com" reveals:

$ tcpdump -n -i eth0.2 'host 142.250.191.36'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0.2, link-type EN10MB (Ethernet), capture size 262144 bytes
23:20:03.556659 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [S], seq 1275972659, win 64240, options [mss 1460,sackOK,TS val 3546963390 ecr 0,nop,wscale 7], length 0
23:20:03.561245 IP 142.250.191.36.80 > 135.180.59.5.53818: Flags [S.], seq 4072092939, ack 1275972660, win 65535, options [mss 1412,sackOK,TS val 3812757636 ecr 3546963390,nop,wscale 8], length 0
23:20:03.561328 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [R], seq 1275972660, win 0, length 0
23:20:04.581079 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [S], seq 1275972659, win 64240, options [mss 1460,sackOK,TS val 3546964414 ecr 0,nop,wscale 7], length 0
23:20:04.584924 IP 142.250.191.36.80 > 135.180.59.5.53818: Flags [S.], seq 4088086403, ack 1275972660, win 65535, options [mss 1412,sackOK,TS val 3812758659 ecr 3546964414,nop,wscale 8], length 0
23:20:04.585031 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [R], seq 1275972660, win 0, length 0

The SYN packet clearly goes out and reaches the Google server, which responds with a SYN-ACK that the VyOS box receives.
However, VyOS then answers with a RST (!) packet instead of an ACK.
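
The SYN, SYN-ACK, RST pattern can also be detected mechanically. The following Python sketch scans tcpdump text output for it; the regular expression covers only the line shape seen in this capture, the sample lines are trimmed copies of the capture (TCP options removed), and the names LINE_RE, CAPTURE, and find_reset_handshakes are hypothetical.

```python
import re

# Matches lines such as "23:20:03.556659 IP a.b.c.d.port > e.f.g.h.port: Flags [S], ..."
LINE_RE = re.compile(r"^\S+ IP (\S+) > (\S+): Flags \[([^\]]+)\]")

# Trimmed copy of the capture from the report (options omitted).
CAPTURE = [
    "23:20:03.556659 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [S], seq 1275972659, length 0",
    "23:20:03.561245 IP 142.250.191.36.80 > 135.180.59.5.53818: Flags [S.], seq 4072092939, length 0",
    "23:20:03.561328 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [R], seq 1275972660, length 0",
    "23:20:04.581079 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [S], seq 1275972659, length 0",
    "23:20:04.584924 IP 142.250.191.36.80 > 135.180.59.5.53818: Flags [S.], seq 4088086403, length 0",
    "23:20:04.585031 IP 135.180.59.5.53818 > 142.250.191.36.80: Flags [R], seq 1275972660, length 0",
]


def find_reset_handshakes(lines):
    """Return (client, server) once per SYN -> SYN-ACK -> RST sequence sent by the client."""
    state = {}  # (client, server) -> last handshake step seen
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        src, dst, flags = m.groups()
        if flags == "S":
            state[(src, dst)] = "syn"
        elif flags == "S." and state.get((dst, src)) == "syn":
            state[(dst, src)] = "synack"
        elif flags.startswith("R") and state.get((src, dst)) == "synack":
            hits.append((src, dst))
            state.pop((src, dst))
    return hits


for client, server in find_reset_handshakes(CAPTURE):
    print(f"{client} reset its own handshake with {server}")
```

Run against the capture, this reports both connection attempts as handshakes that the client itself reset.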

This is likely because the original request came from a socket in the default VRF that was not bound to the VRF interface, while the response was received on the VRF-enslaved interface (eth0.2). However, this is exactly the scenario that vrf bind-to-all should account for.
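
One way to check what bind-to-all has actually configured is to read the kernel's l3mdev sysctls. The sketch below assumes that vrf bind-to-all is implemented through net.ipv4.tcp_l3mdev_accept and net.ipv4.udp_l3mdev_accept; that mapping is my assumption and is not confirmed anywhere in this report. raw_l3mdev_accept is included only for comparison, and the function name read_l3mdev_accept is hypothetical.

```python
from pathlib import Path

# Assumption: `vrf bind-to-all` toggles the tcp/udp sysctls below.
L3MDEV_SYSCTLS = ("tcp_l3mdev_accept", "udp_l3mdev_accept", "raw_l3mdev_accept")


def read_l3mdev_accept(base="/proc/sys/net/ipv4"):
    """Return {sysctl name: int value, or None if the file cannot be read}."""
    values = {}
    for name in L3MDEV_SYSCTLS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except OSError:
            # Missing on older kernels or non-Linux systems.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_l3mdev_accept().items():
        print(f"net.ipv4.{name} = {value}")
```

The kernel documentation describes tcp_l3mdev_accept in terms of listening sockets and the connections derived from them, which may be relevant to incoming connections working while outgoing ones fail; treat that as a lead to investigate rather than a diagnosis.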

And indeed, it works for ICMP and UDP, but it fails for TCP.

Hence this seems to be a clear bug that breaks source-routing scenarios and should be fixed.

Details

Difficulty level: Easy (less than an hour)
Version: 1.3-rolling-202305190616
Why the issue appeared?: Will be filled on close
Is it a breaking change?: Perfectly compatible
Issue type: Bug (incorrect behavior)

Event Timeline

exp created this task. Jun 10 2023, 6:24 AM

pasik subscribed. Jun 10 2023, 7:01 AM

… I am fairly new to VyOS and the process, so please excuse my basic question. I just want to clarify whether I did anything wrong with the bug report, whether this just takes time, or whether it won't be fixed at all.

I have put my VyOS investigations on hold for now because of this bug, as it completely breaks my intended setup. I'd be curious whether there is hope for a fix at some point or whether it's better to move on.

SrividyaA changed the task status from Open to Confirmed. Jun 22 2023, 7:31 AM

SrividyaA subscribed. Jun 23 2023, 7:42 AM

Could you try this command to see whether the curl output is successful?

$ sudo ip vrf exec vrf_sonic curl www.google.com

Yes, that works; the output is successful (see the command in the bug report).

syncer edited projects, added VyOS 1.3 Equuleus (1.3.5); removed VyOS 1.3 Equuleus. Aug 27 2023, 12:09 AM

syncer edited projects, added VyOS 1.3 Equuleus (1.3.6); removed VyOS 1.3 Equuleus (1.3.5). Dec 17 2023, 11:36 PM

dmbaturin triaged this task as High priority. Jan 11 2024, 11:21 AM

dmbaturin added projects: VyOS 1.4 Sagitta, VyOS 1.5 Circinus.

dmbaturin changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.

This issue seems to appear only with 1.3, not with 1.4.

syncer edited projects, added VyOS 1.3 Equuleus (1.3.7); removed VyOS 1.3 Equuleus (1.3.6). Feb 10 2024, 9:17 AM

set policy local-route rule 101 destination '0.0.0.0/0'
set policy local-route rule 101 set table 'local'
set policy local-route rule 102 destination '0.0.0.0/0'
set policy local-route rule 102 set table 'main'
set policy local-route rule 104 destination '0.0.0.0/0'
set policy local-route rule 104 set table '170'

set interfaces ethernet eth1 vrf 'red'
set vrf bind-to-all
set vrf name red table '170'

The same behavior occurs with VyOS 1.3-stable-202404080454.
It works in VyOS 1.4-stable-202404080309.

Probably something changed in the newer kernel version, which is why it works for 1.4+.

1.3

vyos@r1-right:~$ mtr -c 2 -n  --report --udp google.com
Start: 2024-04-10T19:07:22+0300
HOST: r1-right                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 100.64.0.14                0.0%     2    0.4   0.4   0.4   0.4   0.0
  2.|-- 192.168.122.1              0.0%     2    0.5   0.5   0.4   0.5   0.0
  3.|-- 192.168.0.1                0.0%     2    0.6   0.7   0.6   0.7   0.1
  4.|-- ???                       100.0     2    0.0   0.0   0.0   0.0   0.0
  5.|-- 10.161.252.202             0.0%     2    6.7   6.8   6.7   6.9   0.1
  6.|-- ???                       100.0     2    0.0   0.0   0.0   0.0   0.0
vyos@r1-right:~$ 
vyos@r1-right:~$ 
vyos@r1-right:~$ 
vyos@r1-right:~$ mtr -c 2 -n  --report --tcp google.com
Start: 2024-04-10T19:07:38+0300
HOST: r1-right                    Loss%   Snt   Last   Avg  Best  Wrst StDev
mtr: Address not available
vyos@r1-right:~$ 
vyos@r1-right:~$

Candidate for closing as wontfix for 1.3.x

Viacheslav lowered the priority of this task from High to Normal. Apr 10 2024, 4:08 PM

syncer edited projects, added VyOS 1.3 Equuleus (1.3.8); removed VyOS 1.3 Equuleus (1.3.7). May 13 2024, 7:31 PM

dmbaturin edited projects, added VyOS 1.3 Equuleus (1.3.9); removed VyOS 1.3 Equuleus (1.3.8). Jun 20 2024, 10:18 AM
