
GRE over IPsec tunnel with the source interface in a VRF: OSPF breaks the GRE tunnel
Open, Low | Public | BUG

Description

Refer to T4031. My ASBR has two VRFs:

  1. cm_up to the Internet
  2. default for my backbone

My backbone has to use a GRE over IPsec tunnel across the Internet to reach my other routers. With the default settings I can't set up the tunnel, as described in T4031.
So I modified the strongSwan systemd service like this:

ExecStart=/usr/sbin/ip vrf exec cm_up /usr/sbin/charon-systemd

And it worked. When I set up OSPF over the GRE tunnel, it works fine. But when I reboot the instance, the IPsec tunnel comes up correctly while the GRE tunnel is broken: it can't send or receive packets.

I tried restarting the IPsec process, which didn't help. I also tried deleting the tunnel and recreating it, but that didn't work either.

However, when I delete the tunnel's OSPF announcement, then disable and re-enable the tunnel, it works. When I then restore the tunnel's OSPF announcement, everything works smoothly.

I don't know what causes this bug, but I'd love to fix the IPsec over VRF problem. I have no idea why OSPF breaks the GRE tunnel.
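
For reference, the ExecStart change described above could also be kept in a systemd drop-in rather than in an edited unit file. The following is only a sketch under assumptions: the unit name strongswan.service is inferred from the `systemctl stop strongswan` call used later in this task, the file name vrf.conf is arbitrary, and whether a drop-in survives a VyOS image upgrade has not been checked.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that starts charon inside a VRF.
# Assumptions (not from the report): the unit is strongswan.service and
# the drop-in file name vrf.conf is an arbitrary choice.

write_vrf_dropin() {
    dropin_dir=$1   # e.g. /etc/systemd/system/strongswan.service.d
    vrf=$2          # e.g. cm_up
    mkdir -p "$dropin_dir" || return 1
    # The empty ExecStart= clears the packaged command first; systemd only
    # accepts more than one ExecStart= line for Type=oneshot services.
    cat > "$dropin_dir/vrf.conf" <<EOF
[Service]
ExecStart=
ExecStart=/usr/sbin/ip vrf exec $vrf /usr/sbin/charon-systemd
EOF
}
```

It would be used as `write_vrf_dropin /etc/systemd/system/strongswan.service.d cm_up`, followed by `systemctl daemon-reload` and `systemctl restart strongswan`.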

Here's my configuration:

vyos@bsp-asbr2-cm:~$ show conf
interfaces {
    dummy dum0 {
        address 192.168.127.32/32
        description "GRE over IPSec originate loopback"
        vrf cm_up
    }
    dummy dum1 {
        address 192.168.127.34/32
    }
    ethernet eth0 {
        address XXX.XXX.XX.100/25
        description "To China Mobile static access"
        hw-id 00:0c:29:33:09:da
        vrf cm_up
    }
    ethernet eth1 {
        address 192.168.124.1/28
        description "Downstream to vSRX"
        hw-id 00:0c:29:33:09:e4
    }
    ethernet eth2 {
        address 192.168.124.66/28
        description "MPLS BB between 2 HV"
        disable
        hw-id 00:0c:29:33:09:ee
    }
    ethernet eth3 {
        address 192.168.124.33/28
        description "MPLS BB originate from CM"
        hw-id 00:0c:29:33:09:f8
        vrf cm_up
    }
    loopback lo {
    }
    tunnel tun0 {
        address 10.96.255.9/30
        description "S2S VPN 1"
        encapsulation gre
        ip {
            adjust-mss clamp-mss-to-pmtu
        }
        mtu 1428
        remote 192.168.63.32
        source-address 192.168.127.32
        source-interface dum0
    }
}
nat {
    destination {
        rule 10 {
            destination {
                port 10000-64000
            }
            inbound-interface eth0
            protocol tcp_udp
            translation {
                address 192.168.124.34
            }
        }
    }
    source {
        rule 10 {
            outbound-interface eth0
            protocol all
            translation {
                address masquerade
            }
        }
    }
}
pki {
    key-pair ipsec-CDSLCM {
        private {
            key ****************
        }
        public {
            key ****************
        }
    }
    key-pair ipsec-CDSLCU {
        public {
            key ****************
        }
    }
    key-pair ipsec-JXNCCT {
        public {
            key ****************
        }
    }
}
protocols {
    ospf {
        area 0.0.0.0 {
            network 192.168.0.0/15
            network 10.96.0.0/16
        }
        parameters {
            router-id 192.168.127.32
        }
    }
}
qos {
    policy {
        shaper test {
            bandwidth 330mbit
            default {
                bandwidth 300mbit
                queue-type fair-queue
            }
        }
    }
}
service {
    ntp {
        allow-client {
            address 0.0.0.0/0
            address ::/0
        }
        server time1.vyos.net {
        }
        server time2.vyos.net {
        }
        server time3.vyos.net {
        }
    }
    ssh {
        listen-address 192.168.124.1
    }
}
system {
    config-management {
        commit-revisions 100
    }
    conntrack {
        modules {
            ftp
            h323
            nfs
            pptp
            sip
            sqlnet
            tftp
        }
    }
    console {
        device ttyS0 {
            speed 115200
        }
    }
    host-name bsp-asbr2-cm
    login {
        user vyos {
            authentication {
                encrypted-password ****************
            }
        }
    }
    name-server 114.114.114.114
    syslog {
        global {
            facility all {
                level info
            }
            facility protocols {
                level debug
            }
        }
    }
    time-zone Asia/Shanghai
}
vpn {
    ipsec {
        esp-group MyESPGroup {
            proposal 1 {
                encryption aes128
                hash aes128gmac
            }
        }
        ike-group MyIKEGroup {
            proposal 1 {
                dh-group 2
                encryption aes128
                hash sha1
            }
        }
        interface eth0
        site-to-site {
            peer JXNCCT {
                authentication {
                    local-id cdslcm.ras.meit.su
                    mode rsa
                    remote-id zion.lv2.pw
                    rsa {
                        local-key ****************
                        remote-key ****************
                    }
                }
                connection-type respond
                default-esp-group MyESPGroup
                ike-group MyIKEGroup
                local-address XXX.XXX.XX.100
                remote-address any
                tunnel 1 {
                    local {
                        prefix 192.168.127.32/32
                    }
                    remote {
                        prefix 192.168.63.32/32
                    }
                }
            }
        }
    }
}
vrf {
    name cm_up {
        protocols {
            static {
                route 0.0.0.0/0 {
                    next-hop XXX.XXX.XX.1 {
                    }
                }
            }
        }
        table 101
    }
}
vyos@bsp-asbr2-cm:~$

Btw, could the mitigations=off kernel parameter be enabled by default on older hardware (such as Haswell/Broadwell) during installation?

Without it, the system's routing and IPsec performance drops to an unbearable level. For example, without mitigations=off the IPsec throughput on a D1521 is around 300 Mbps, with ksoftirqd consuming one CPU core entirely.

Details

Version
1.4-rolling-202302150317
Is it a breaking change?
Perfectly compatible
Issue type
Bug (incorrect behavior)

Event Timeline

Btw, in this rolling release, OSPF BFD on the tunnel doesn't work correctly either.

When I set BFD on the tunnel interface, routes are not calculated or installed in the routing table, regardless of whether the BFD state is up or down. I have to delete the BFD configuration, commit, save and reboot to bring it back up.

Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: authentication of 'domain1' with RSA_EMSA_PKCS1_SHA2_256 successful
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[IKE] <JXNCCT|2> peer supports MOBIKE
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: peer supports MOBIKE
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[IKE] <JXNCCT|2> authentication of 'domain2' (myself) with RSA_EMSA_PKCS1_SHA2_256 successful
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: authentication of 'domain2' (myself) with RSA_EMSA_PKCS1_SHA2_256 successful
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[IKE] <JXNCCT|2> IKE_SA JXNCCT[2] established between <pubIP2>[domain2]...<pubIP1>[domain1]
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: IKE_SA JXNCCT[2] established between <pubIP2>[domain2]...<pubIP1>[domain1]
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[IKE] <JXNCCT|2> scheduling rekeying in 28200s
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: scheduling rekeying in 28200s
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[IKE] <JXNCCT|2> maximum IKE_SA lifetime 31080s
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: maximum IKE_SA lifetime 31080s
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[CFG] <JXNCCT|2> selected proposal: ESP:AES_CBC_128/HMAC_SHA1_96/NO_EXT_SEQ
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: selected proposal: ESP:AES_CBC_128/HMAC_SHA1_96/NO_EXT_SEQ
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[KNL] <JXNCCT|2> received netlink error: Invalid argument (22)
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: received netlink error: Invalid argument (22)
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[KNL] <JXNCCT|2> unable to install source route for 192.168.127.32
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: unable to install source route for 192.168.127.32
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[IKE] <JXNCCT|2> CHILD_SA JXNCCT-tunnel-1{2} established with SPIs c4ba20f9_i c3ba4340_o and TS 192.168.127.32/32 === 192.168.63.32/32
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: CHILD_SA JXNCCT-tunnel-1{2} established with SPIs c4ba20f9_i c3ba4340_o and TS 192.168.127.32/32 === 192.168.63.32/32
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[ENC] <JXNCCT|2> generating IKE_AUTH response 1 [ IDr AUTH SA TSi TSr N(MOBIKE_SUP) N(NO_ADD_ADDR) ]
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: generating IKE_AUTH response 1 [ IDr AUTH SA TSi TSr N(MOBIKE_SUP) N(NO_ADD_ADDR) ]
Mar 16 12:47:29 bsp-asbr2-cm charon[45036]: 14[NET] <JXNCCT|2> sending packet: from <pubIP2>[4500] to <pubIP1>[4500] (476 bytes)
Mar 16 12:47:29 bsp-asbr2-cm charon-systemd[45036]: sending packet: from <pubIP2>[4500] to <pubIP1>[4500] (476 bytes)
Mar 16 12:47:59 bsp-asbr2-cm charon[45036]: 06[NET] <JXNCCT|2> received packet: from <pubIP1>[4500] to <pubIP2>[4500] (76 bytes)
Mar 16 12:47:59 bsp-asbr2-cm charon-systemd[45036]: received packet: from <pubIP1>[4500] to <pubIP2>[4500] (76 bytes)
Mar 16 12:47:59 bsp-asbr2-cm charon[45036]: 06[ENC] <JXNCCT|2> parsed INFORMATIONAL request 2 [ ]
Mar 16 12:47:59 bsp-asbr2-cm charon-systemd[45036]: parsed INFORMATIONAL request 2 [ ]
Mar 16 12:47:59 bsp-asbr2-cm charon[45036]: 06[ENC] <JXNCCT|2> generating INFORMATIONAL response 2 [ ]
Mar 16 12:47:59 bsp-asbr2-cm charon-systemd[45036]: generating INFORMATIONAL response 2 [ ]
Mar 16 12:47:59 bsp-asbr2-cm charon[45036]: 06[NET] <JXNCCT|2> sending packet: from <pubIP2>[4500] to <pubIP1>[4500] (76 bytes)
Mar 16 12:47:59 bsp-asbr2-cm charon-systemd[45036]: sending packet: from <pubIP2>[4500] to <pubIP1>[4500] (76 bytes)

When the IPsec connection is established, strongSwan reports that it is unable to install the source route.

Workaround: put these lines in /config/scripts/vyos-postconfig-bootup.script

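# Keep OSPF quiet on the tunnel while IPsec and GRE are brought up in order
vtysh -c "conf t" -c "int tun0" -c "ip ospf passive"
sleep 10

# Restart strongSwan while the GRE tunnel is down
systemctl stop strongswan
ip l set tun0 down

sleep 5
systemctl start strongswan

# Poll until "swanctl -l" prints a line containing "in"
a=""

# Use "=" rather than "==" so the test also works when the script runs under /bin/sh
while [ "$a" = "" ]
do
    a=$(swanctl -l 2>/dev/null | grep in)
    sleep 1
    echo "Wait for Tunnel to be ready"
done

echo "Tunnel up"
sleep 20

# Bring GRE back up, then let OSPF run on the tunnel again
ip l set tun0 up
echo "Hopefully done"

sleep 10
echo "Enabling OSPF"
vtysh -c "conf t" -c "int tun0" -c "no ip ospf passive"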
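
One weakness of the polling loop in this workaround is that it never gives up, so a peer that stays down could stall the boot script indefinitely. A bounded wait might look like the sketch below. The helper names wait_for and sa_listed are hypothetical, and the 120-second limit in the usage note is an arbitrary choice, not something taken from this task.

```shell
#!/bin/sh
# wait_for TIMEOUT CMD [ARGS...]
# Re-run CMD about once per second until it succeeds or TIMEOUT seconds
# have been slept. Returns 0 on success, 1 on timeout.
wait_for() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Hypothetical check mirroring the loop above: succeeds once "swanctl -l"
# prints a line containing "in".
sa_listed() {
    swanctl -l 2>/dev/null | grep -q in
}
```

In the boot script the loop could then become `wait_for 120 sa_listed || echo "IPsec SA not listed after 120s, continuing anyway"`.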

@c-po @Viacheslav

It seems this problem is caused not by IPsec but by the GRE implementation.

If I set the underlay interface of a GRE tunnel to an interface in a VRF, the GRE tunnel is not able to send packets; but once it has received a packet from the remote side, the GRE tunnel works normally.
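
To narrow down a symptom like the one just described, a few read-only iproute2 checks can show which underlay device the tunnel is bound to and how the remote endpoint is routed in each VRF. This is only a sketch: the function name gre_vrf_checks and the RUN switch are hypothetical, and the values in the usage note come from the configuration in the description.

```shell
#!/bin/sh
# gre_vrf_checks TUNNEL REMOTE VRF
# Prints the check commands by default; set RUN=1 to execute them as well.
gre_vrf_checks() {
    tun=$1
    remote=$2
    vrf=$3
    for cmd in \
        "ip -s -d link show $tun" \
        "ip route get $remote vrf $vrf" \
        "ip route get $remote"
    do
        if [ "${RUN:-0}" = 1 ]; then
            echo "+ $cmd"
            $cmd    # intentionally unquoted so the string splits into words
        else
            echo "$cmd"
        fi
    done
}
```

For example, `RUN=1 gre_vrf_checks tun0 192.168.63.32 cm_up` prints the tunnel's details and counters, then the route lookup for the remote endpoint in cm_up and in the default VRF. Comparing the TX counters before and after the first packet arrives from the remote side may help confirm whether locally originated packets leave the tunnel at all.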

Btw, I still don't know how to properly configure IPsec across multiple VRFs.

I've just been picking at this one tonight because it's close to some areas of interest (DMVPNs in VRFs), so hopefully this input is useful and appropriate:

In 1.5-rolling-202405240020, I've got a roughly equivalent config to @diodep's original working between 2 instances, without hacking up any systemd units or scripts.

The main changes required were:

  • Charon lives in the default (global) VRF; statically leak the default route from the Internet VRF into global so it can contact the peer. This could of course be more specific if global has its own default route from OSPF. The final comments on T4031 pointed to this fix.
  • Source NAT exclusions between the dum0 source IPs so that encapsulation works properly
  • set vrf bind-to-all

IPsec SAs, GRE and OSPF came up as expected.
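
The three changes listed above were not posted as configuration, so the following is only an assumed reconstruction, using the addresses and names from the original report. Rule number 5 is arbitrary (it only has to precede the masquerade rule 10), and the exact NAT syntax differs between releases: older images take `outbound-interface eth0` without the `name` keyword.

```
# 1. Leak the Internet VRF's default route into the default VRF so charon can reach the peer
set protocols static route 0.0.0.0/0 next-hop XXX.XXX.XX.1 vrf cm_up

# 2. Exempt traffic between the tunnel endpoints from source NAT
set nat source rule 5 outbound-interface name eth0
set nat source rule 5 source address 192.168.127.32/32
set nat source rule 5 destination address 192.168.63.32/32
set nat source rule 5 exclude

# 3. Let services bound in the default VRF work across all VRFs
set vrf bind-to-all
```

With the default route leaked this way, charon can stay in the default VRF and no systemd unit needs to be modified.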

Rebooting the passive IPsec responder instance does result in OSPF and the IPsec SAs staying down for a while, but a forced restart on the initiator brought them back immediately. Rebooting the initiator doesn't seem to be a problem at all.

dmbaturin changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.
dmbaturin changed Issue type from Unspecified (please specify) to Bug (incorrect behavior).