
VxLAN not working properly after upgrading to latest October build and with a new installation
Closed, Resolved | Public | BUG

Description

After upgrading one of our routers to VyOS 1.3-rolling-202010110146, we are seeing issues with VxLAN, although the configuration is the same as before the upgrade.

The destination receives ARP packets from the source side (the upgraded VyOS router), but when the destination replies to them, the source does not receive the replies.
Rolling back to the May build solves the issue. A fresh installation with a new configuration does not help.

We are using the following October build:

Version:          VyOS 1.3-rolling-202010110146
Release Train:    equuleus

Built by:         [email protected]
Built on:         Sun 11 Oct 2020 01:46 UTC
Build UUID:       94bc3836-a078-407e-8b66-7b4760a64233
Build Commit ID:

and the following May build (before upgrading):

Version:          VyOS 1.3-rolling-202005260117
Release Train:    equuleus

Built by:         [email protected]
Built on:         Tue 26 May 2020 01:17 UTC
Build UUID:       c9832ae0-9cab-4287-bb2d-5d9bdfa02312
Build Commit ID:  a29347ca9dd260

This is one of our VxLAN configurations on the upgraded router:

set interfaces vxlan vxlan122 address '10.0.122.1/30'
set interfaces vxlan vxlan122 address 'fd01:122::1/127'
set interfaces vxlan vxlan122 description 'VNI 122'
set interfaces vxlan vxlan122 port '4789'
set interfaces vxlan vxlan122 remote '116.202.x.xxx'
set interfaces vxlan vxlan122 source-address '45.xx.xx.x'
set interfaces vxlan vxlan122 vni '122'

A bit of investigation showed that the local parameter is missing from the interface in the May build, but present in the October build:

## ip -d l sh output on the source side

# May
ip -d l sh vxlan122
6: vxlan122: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 86:a1:c1:5f:ba:2a brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 
    vxlan id 122 remote 116.202.x.xxx srcport 0 0 dstport 4789 ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode none numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    alias VNI 122

# October
12: vxlan122: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 1e:0f:e8:39:16:29 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 
    vxlan id 122 remote 116.202.x.xxx local 45.xx.xx.x srcport 0 0 dstport 4789 ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode none numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    alias VNI 122
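
The difference between the two builds is easier to spot when only the local field is pulled out of the output. A minimal sketch, assuming the text output of ip -d link show is piped in on stdin; the function name vxlan_local is made up for illustration:

```shell
# Print the value that follows the "local" keyword on the "vxlan id ..." line,
# or "(none)" when the keyword is absent.
# Input: text output of `ip -d link show <interface>` on stdin.
vxlan_local() {
    awk '
        $1 == "vxlan" && $2 == "id" {
            for (i = 3; i < NF; i++)
                if ($i == "local") { print $(i + 1); found = 1 }
        }
        END { if (!found) print "(none)" }
    '
}
```

Possible usage on the router: ip -d link show vxlan122 | vxlan_local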

October build tcpdumps:

tcpdump on source side:

tcpdump: listening on vxlan122, link-type EN10MB (Ethernet), capture size 262144 bytes
15:11:30.346066 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.122.2 tell 10.0.122.1, length 28
15:11:31.356431 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.122.2 tell 10.0.122.1, length 28

tcpdump on destination side:

tcpdump: listening on vxlan122, link-type EN10MB (Ethernet), capture size 262144 bytes
17:10:30.135055 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.122.2 tell 10.0.122.1, length 28
17:10:30.135131 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.0.122.2 is-at 42:b5:43:36:a8:4d (oui Unknown), length 28
17:10:31.158696 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.0.122.2 tell 10.0.122.1, length 28
17:10:31.158752 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.0.122.2 is-at 42:b5:43:36:a8:4d (oui Unknown), length 28
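
The asymmetry in the two captures (replies visible on the destination, none on the source) can be summarised per capture. A rough sketch, assuming tcpdump text output on stdin; arp_summary is a hypothetical helper name, not part of any tool:

```shell
# Count ARP requests and ARP replies in tcpdump text output read from stdin.
arp_summary() {
    awk '
        / ARP, / && / Request who-has / { req++ }
        / ARP, / && / Reply /           { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}
```

Possible usage: sudo tcpdump -ni vxlan122 -c 20 arp | arp_summary. For the captures above, this would report no replies on the source side and one reply per request on the destination side.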

Details

Difficulty level
Unknown (require assessment)
Version
1.3-rolling-202010110146
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Behavior change
Issue type
Bug (incorrect behavior)

Event Timeline

@tom.siewert
What happens if you delete the source-address on the "October" node?

And add "source-interface ethX"?

@tom.siewert
What happens if you delete the source-address on the "October" node?

Same issue.


source-interface cannot be used, as the routers are not in the same multicast group and cannot communicate via multicast.

I can't reproduce it with VyOS 1.3-rolling-202010170146 or other October releases.

R1

set interfaces vxlan vxlan122 address '10.0.122.1/30'
set interfaces vxlan vxlan122 description 'VNI 122'
set interfaces vxlan vxlan122 port '4789'
set interfaces vxlan vxlan122 remote '100.64.0.2'
set interfaces vxlan vxlan122 vni '122'

R2

set interfaces vxlan vxlan122 address '10.0.122.2/30'
set interfaces vxlan vxlan122 description 'VNI 122'
set interfaces vxlan vxlan122 port '4789'
set interfaces vxlan vxlan122 remote '100.64.0.1'
set interfaces vxlan vxlan122 vni '122'

R1 dump

vyos@r4-roll# sudo tcpdump -ni vxlan122
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vxlan122, link-type EN10MB (Ethernet), capture size 262144 bytes
19:25:06.956126 ARP, Request who-has 10.0.122.1 tell 10.0.122.2, length 28
19:25:06.956206 ARP, Reply 10.0.122.1 is-at ca:75:7c:2b:2c:9a, length 28
19:25:06.956461 IP 10.0.122.2 > 10.0.122.1: ICMP echo request, id 2804, seq 1, length 64
19:25:06.956487 IP 10.0.122.1 > 10.0.122.2: ICMP echo reply, id 2804, seq 1, length 64
19:25:07.978047 IP 10.0.122.2 > 10.0.122.1: ICMP echo request, id 2804, seq 2, length 64
19:25:07.978127 IP 10.0.122.1 > 10.0.122.2: ICMP echo reply, id 2804, seq 2, length 64

Try checking the MACs and destination VTEP IPs on both sides:

bridge fdb show dev vxlan122
This comment was removed by tom.siewert.

My last comment was wrong; here are the outputs of bridge fdb show dev vxlan122:

Source side (October router):

bridge fdb show dev vxlan122
00:00:00:00:00:00 dst 116.202.x.xxx self permanent

Destination side:

bridge fdb show dev vxlan122
00:00:00:00:00:00 dst 45.xx.xx.x self permanent
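
In these outputs, the all-zero MAC entry is the default destination the VxLAN device uses for frames whose destination MAC it has not learned, so its dst should match the configured remote on each side. A small sketch for extracting it, assuming bridge fdb show output on stdin; fdb_default_dst is an illustrative name only:

```shell
# Print the "dst" address of the all-zero MAC (default destination) entry.
# Input: text output of `bridge fdb show dev <interface>` on stdin.
fdb_default_dst() {
    awk '$1 == "00:00:00:00:00:00" {
        for (i = 2; i < NF; i++)
            if ($i == "dst") print $(i + 1)
    }'
}
```

Possible usage: bridge fdb show dev vxlan122 | fdb_default_dst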

Try running the October version on both sides.

I have now investigated a bit deeper and found that this router was migrated to VRF automatically (our deployment stack automatically migrates upgraded and newly deployed routers to VRF usage for OOB/VxLAN communication).

Why this also happens without the VRF is not really clear to me. Anyway, after adding vrf bind-to-all, the traffic is routed correctly.
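
For anyone hitting the same symptom: as far as I understand it (an assumption, not confirmed in this task), the VyOS option vrf bind-to-all maps to the kernel's l3mdev accept sysctls, which let sockets bound in the default VRF accept packets arriving through any VRF. A hedged sketch for checking the UDP one; l3mdev_accept_state is a hypothetical helper, and the optional path argument exists only so the check can be tried against a sample file:

```shell
# Report whether net.ipv4.udp_l3mdev_accept is set.
# Prints "enabled", "disabled", or "unknown" (file missing, e.g. kernel without VRF support).
l3mdev_accept_state() {
    file="${1:-/proc/sys/net/ipv4/udp_l3mdev_accept}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 0
    fi
    if [ "$(cat "$file")" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

If this prints "disabled" on a router whose underlay interface sits in a VRF, that would be consistent with VxLAN replies reaching the box but never showing up on the vxlan interface.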

The task can be closed.

Viacheslav claimed this task.

Thank you.
Reopen the task or create a new one if you find some problems.

erkin renamed this task from VxLAN not working properly after upgrading to latest October build (also with newinstallation) to VxLAN not working properly after upgrading to latest October build and with a new installation. Aug 29 2021, 12:35 PM
erkin set Issue type to Bug (incorrect behavior).
erkin removed a subscriber: Active contributors.