
BGP L3VPN connectivity is broken after re-enabling VRF
Open, High | Public | BUG

Description

BGP L3VPN connectivity breaks after a VRF is disabled and re-enabled, and the Route Distinguisher shown for the local EVPN route differs from the configured one afterwards.
Topology:

l3vpn-map.png (532×853 px, 32 KB)

Initial configuration from the routers r-left and r-right

All of the following steps are performed on the router r-right

  1. Ping the client from r-right to confirm that everything works
vyos@r-right# run ping 10.100.1.11 vrf red 
PING 10.100.1.11 (10.100.1.11) 56(84) bytes of data.
64 bytes from 10.100.1.11: icmp_seq=1 ttl=63 time=0.771 ms
64 bytes from 10.100.1.11: icmp_seq=2 ttl=63 time=0.952 ms

Check the routing table for VRF red and the BGP l2vpn evpn table.
Note the Route Distinguisher values so they can be compared later.

vyos@r-right# run show ip route vrf red
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF red:
B>* 10.100.1.0/24 [20/0] via 192.168.100.102, br75001 onlink, weight 1, 00:00:48
C>* 10.105.10.0/24 is directly connected, br75001, 00:00:59
[edit]
vyos@r-right# 
vyos@r-right# 
[edit]
vyos@r-right# 
[edit]
vyos@r-right# run show bgp l2vpn evpn 
BGP table version is 1, local router ID is 10.221.12.201
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[EthTag]:[ESI]:[IPlen]:[VTEP-IP]:[Frag-id]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 65000:1
 *> [5]:[0]:[24]:[10.100.1.0]
                    192.168.100.102          0             0 65000 ?
                    RT:65000:1 ET:8 Rmac:56:33:e8:21:35:71
Route Distinguisher: 65001:75001
 *> [5]:[0]:[24]:[10.105.10.0]
                    192.168.100.100          0         32768 ?
                    ET:8 RT:65001:75001 Rmac:c6:8e:b8:c3:16:c9

Displayed 2 out of 2 total prefixes
[edit]
vyos@r-right#
  1. Disable and enable VRF red
vyos@r-right# set vrf name red disable 
[edit]
vyos@r-right# commit
[edit]
vyos@r-right# del vrf name red disable 
[edit]
vyos@r-right# commit
[edit]
vyos@r-right#
  1. Check connectivity with the client again (it's broken)
vyos@r-right# run ping 10.100.1.11 vrf red 
/bin/ping: connect: Network is unreachable
[edit]
vyos@r-right#

Check the routing table for VRF red and the BGP l2vpn evpn table again.
The BGP route in VRF red is gone.
Note the Route Distinguisher values: 65001:75001 has been replaced by 10.221.12.201:2.

vyos@r-right# run show ip route vrf red
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF red:
C>* 10.105.10.0/24 is directly connected, br75001, 00:05:25
[edit]
vyos@r-right# 

vyos@r-right# 
[edit]
vyos@r-right# run show bgp l2vpn evpn 
BGP table version is 1, local router ID is 10.221.12.201
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[EthTag]:[ESI]:[IPlen]:[VTEP-IP]:[Frag-id]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 65000:1
 *> [5]:[0]:[24]:[10.100.1.0]
                    192.168.100.102          0             0 65000 ?
                    RT:65000:1 ET:8 Rmac:56:33:e8:21:35:71
Route Distinguisher: 10.221.12.201:2
 *> [3]:[0]:[32]:[192.168.100.100]
                    192.168.100.100                    32768 i
                    ET:8 RT:65001:75001

Displayed 2 out of 2 total prefixes
[edit]
vyos@r-right#
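
The before/after comparison of Route Distinguisher values can also be done mechanically. The following is a minimal Python sketch, assuming the output of "show bgp l2vpn evpn" has been captured as text. The function name rd_prefixes and the trimmed sample strings are illustrative only and are not part of VyOS.

```python
import re

def rd_prefixes(show_output):
    """Map each Route Distinguisher to the EVPN prefixes listed under it."""
    table = {}
    current = None
    for raw in show_output.splitlines():
        line = raw.strip()
        m = re.match(r"Route Distinguisher:\s*(\S+)", line)
        if m:
            current = m.group(1)
            table[current] = []
            continue
        # Prefix lines look like "*> [5]:[0]:[24]:[10.100.1.0]"
        m = re.match(r"\*>?\s+(\[\d+\]:\S+)", line)
        if m and current is not None:
            table[current].append(m.group(1))
    return table

# Trimmed copies of the two outputs shown in this report.
BEFORE = """\
Route Distinguisher: 65000:1
 *> [5]:[0]:[24]:[10.100.1.0]
Route Distinguisher: 65001:75001
 *> [5]:[0]:[24]:[10.105.10.0]
"""
AFTER = """\
Route Distinguisher: 65000:1
 *> [5]:[0]:[24]:[10.100.1.0]
Route Distinguisher: 10.221.12.201:2
 *> [3]:[0]:[32]:[192.168.100.100]
"""

before = rd_prefixes(BEFORE)
after = rd_prefixes(AFTER)
missing = sorted(set(before) - set(after))      # RDs that disappeared
unexpected = sorted(set(after) - set(before))   # RDs that appeared
print("missing:", missing)
print("unexpected:", unexpected)
```

With the outputs from this report, the configured RD 65001:75001 shows up as missing and 10.221.12.201:2 as unexpected.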
  1. To work around this, delete the vni and protocols bgp configuration of vrf name red, commit, and add it again:
delete vrf name red vni 
delete vrf name red protocols bgp 
commit

set vrf name red protocols bgp address-family ipv4-unicast redistribute connected
set vrf name red protocols bgp address-family l2vpn-evpn advertise ipv4 unicast
set vrf name red protocols bgp address-family l2vpn-evpn rd '65001:75001'
set vrf name red protocols bgp address-family l2vpn-evpn route-target export '65001:75001'
set vrf name red protocols bgp address-family l2vpn-evpn route-target import '65000:1'
set vrf name red protocols bgp parameters log-neighbor-changes
set vrf name red protocols bgp parameters router-id '10.221.12.201'
set vrf name red protocols bgp system-as '65001'
set vrf name red vni '75001'
commit

Check connectivity after re-adding

vyos@r-right# run ping 10.100.1.11 vrf red 
PING 10.100.1.11 (10.100.1.11) 56(84) bytes of data.
64 bytes from 10.100.1.11: icmp_seq=1 ttl=63 time=0.910 ms
64 bytes from 10.100.1.11: icmp_seq=2 ttl=63 time=0.808 ms


vyos@r-right# run show ip route vrf red
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF red:
B>* 10.100.1.0/24 [20/0] via 192.168.100.102, br75001 onlink, weight 1, 00:01:15
C>* 10.105.10.0/24 is directly connected, br75001, 00:12:23
[edit]
vyos@r-right#

We should probably add dependencies that run when a VRF is re-enabled, because otherwise the routes are not installed in the routing table.

Details

Difficulty level
Normal (likely a few hours)
Version
VyOS 1.4.0-epa2
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Viacheslav created this task.

Probably VNI is applied after BGP

vyos@r4:~$ /usr/libexec/vyos/priority.py | match "vrf|bri|vxlan"
        11  vrf.py                              ['vrf']
       310  interfaces_bridge.py                ['interfaces', 'bridge']
       460  interfaces_vxlan.py                 ['interfaces', 'vxlan']
       481  protocols_static.py                 ['vrf', 'name', 'protocols', 'static']
       611  protocols_isis.py                   ['vrf', 'name', 'protocols', 'isis']
       621  protocols_ospf.py                   ['vrf', 'name', 'protocols', 'ospf']
       621  protocols_ospfv3.py                 ['vrf', 'name', 'protocols', 'ospfv3']
       821  protocols_bgp.py                    ['vrf', 'name', 'protocols', 'bgp']
       821  protocols_eigrp.py                  ['vrf', 'name', 'protocols', 'eigrp']
       822  vrf_vni.py                          ['vrf', 'name', 'vni']
vyos@r4:~$
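
To make the ordering explicit, the priority listing can be parsed and sorted. This is a small Python sketch that assumes lower priority numbers are applied first, which is what the comment above implies. The helper name script_priorities is made up for illustration, and the sample text is a subset of the listing above.

```python
# Subset of the priority.py listing quoted above.
PRIORITY_OUTPUT = """\
        11  vrf.py                              ['vrf']
       460  interfaces_vxlan.py                 ['interfaces', 'vxlan']
       821  protocols_bgp.py                    ['vrf', 'name', 'protocols', 'bgp']
       822  vrf_vni.py                          ['vrf', 'name', 'vni']
"""

def script_priorities(text):
    """Return {script_name: priority} parsed from the listing."""
    priorities = {}
    for line in text.splitlines():
        parts = line.split(None, 2)  # priority, script name, config path
        if len(parts) >= 2 and parts[0].isdigit():
            priorities[parts[1]] = int(parts[0])
    return priorities

prio = script_priorities(PRIORITY_OUTPUT)
order = sorted(prio, key=prio.get)  # assumed execution order, lowest first
print(" -> ".join(order))
print("vrf_vni.py after protocols_bgp.py:",
      prio["vrf_vni.py"] > prio["protocols_bgp.py"])
```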

I think I might've found the cause of this issue: the vni is unset from all VRFs when making changes. I posted a message on Slack about this (and about another, fairly similar, issue).

We found it when we made NAT changes, but it's also the case when you disable/enable a VRF.

After a reboot FRR config shows:

vrf test
 vni 40
 ip route 1.1.1.1/32 wan nexthop-vrf wan
exit-vrf
!
vrf wan
 vni 14
exit-vrf

And when making any changes (for example, disabling / enabling the vrf) it becomes:

vrf test
 ip route 1.1.1.1/32 wan nexthop-vrf wan
exit-vrf

Here the wan VRF is gone entirely and the vni is unset from the test VRF.
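
The same comparison can be scripted against two snapshots of the FRR configuration. This is a hedged Python sketch that handles only the "vrf ... exit-vrf" blocks quoted above. The function name vrf_vnis is hypothetical, and the snippet does not talk to FRR itself.

```python
def vrf_vnis(frr_config):
    """Return {vrf_name: vni or None} for each 'vrf ... exit-vrf' block."""
    result = {}
    current = None
    for raw in frr_config.splitlines():
        line = raw.strip()
        if line.startswith("vrf "):
            current = line.split()[1]
            result[current] = None
        elif line == "exit-vrf":
            current = None
        elif current is not None and line.startswith("vni "):
            result[current] = int(line.split()[1])
    return result

# The two FRR snippets quoted in this comment.
AFTER_REBOOT = """\
vrf test
 vni 40
 ip route 1.1.1.1/32 wan nexthop-vrf wan
exit-vrf
!
vrf wan
 vni 14
exit-vrf
"""
AFTER_CHANGE = """\
vrf test
 ip route 1.1.1.1/32 wan nexthop-vrf wan
exit-vrf
"""

booted = vrf_vnis(AFTER_REBOOT)
changed = vrf_vnis(AFTER_CHANGE)
# VRFs whose VNI is no longer present after the change.
lost = {name: vni for name, vni in booted.items() if changed.get(name) != vni}
print("VNIs lost:", lost)
```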

This problem is only present on VyOS 1.4.0-epa2, and not 1.5-rolling-202310120020 (an older rolling release).

So the root cause here is that vrf.py runs before vrf_vni.py, and vrf.py removes all vni configuration within FRR.
The main reason for this weird logic is T5492.

When we delete BGP from a VRF this can only be done if the VNI portion inside FRR is dropped first b/c of https://github.com/FRRouting/frr/blob/064c3494527b9e84260410006768ed38e57e1de7/bgpd/bgp_vty.c#L1646-L1650

We have tried to work around this and have already failed multiple times.

I will test getting rid of vrf_vni.py and moving the l3vni deletion logic into protocols_bgp.py - stay tuned.
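
The ordering constraint described above (BGP can only be removed from a VRF once its VNI has been dropped) can be illustrated with a toy model. This is only a sketch of the constraint in Python. The class ToyVrf and its methods are invented for illustration and are neither FRR's data model nor the actual VyOS implementation.

```python
class ToyVrf:
    """Toy stand-in for one VRF's state; not FRR's real data model."""

    def __init__(self, name, vni=None, has_bgp=False):
        self.name = name
        self.vni = vni
        self.has_bgp = has_bgp

    def remove_vni(self):
        self.vni = None

    def remove_bgp(self):
        # Mirrors the constraint described above: BGP removal is refused
        # while an L3VNI is still configured for the VRF.
        if self.vni is not None:
            raise RuntimeError(f"unconfigure l3vni {self.vni} first")
        self.has_bgp = False

vrf = ToyVrf("red", vni=75001, has_bgp=True)

# Wrong order: removing BGP first is refused.
try:
    vrf.remove_bgp()
    refused = False
except RuntimeError as err:
    refused = True
    print("refused:", err)

# Right order: drop the VNI, then remove BGP.
vrf.remove_vni()
vrf.remove_bgp()
print("bgp removed:", not vrf.has_bgp)
```

Keeping both deletions in one place, as proposed above for protocols_bgp.py, would let a single script enforce this order.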