Today we tried to put VyOS with a 40GbE Mellanox card into production, but it crashed and rebooted under load (~6-8 Gbit/s of traffic). The IPMI console showed the error "CPU: 5 PID: 36 Comm: ksoftirqd/5 Not tainted 4.19.0-amd64-vyos". We then rolled back to the old network scheme and tried to find the cause of the VyOS crash. We suspect the Mellanox driver and its interrupts, which are not listed for the mlnx cards:
vyos@xx-gw2:~$ cat /proc/interrupts | grep eth
 25:      0      0     46      0      0      0      6      0   PCI-MSI 409600-edge    eth1
 26:      0      1      0      0      0      0      0      0   PCI-MSI 2097152-edge   eth0
 27:  22515      0      3      0      0      0      0      0   PCI-MSI 2097153-edge   eth0-TxRx-0
 28:      0      0    163      3      0      0      0      0   PCI-MSI 2097154-edge   eth0-TxRx-1
 29:      0      0      0      0     51      0      0      0   PCI-MSI 2097155-edge   eth0-TxRx-2
 30:      0      0      0      0      0      3    233      0   PCI-MSI 2097156-edge   eth0-TxRx-3
vyos@xx-gw2:~$
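One thing worth ruling out is that the `grep eth` filter itself hides the Mellanox lines: a driver can register its IRQs under a driver-based name rather than the interface name. The sketch below is only an illustration, not a confirmed diagnosis. It sums the per-CPU counters of every `/proc/interrupts` row whose name matches a pattern. The function name `irq_totals` and the default pattern `eth|mlx` are assumptions made for this example.

```python
import os
import re


def irq_totals(text, pattern=r"eth|mlx"):
    """Return {irq_name: total_count} for rows whose name matches pattern.

    `text` is the full content of /proc/interrupts, including the header
    line with the CPU columns. The pattern is an assumption for this
    example; adjust it to whatever names the driver actually registers.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header looks like: CPU0 CPU1 ...
    rx = re.compile(pattern)
    totals = {}
    for line in lines[1:]:
        fields = line.split()
        # Need "<irq>:" + one counter per CPU + at least one trailing field.
        if len(fields) < ncpu + 2 or not fields[0].endswith(":"):
            continue
        counts = fields[1:1 + ncpu]
        if not all(c.isdigit() for c in counts):
            continue
        name = fields[-1]  # last column is the registered IRQ name
        if rx.search(name):
            totals[name] = sum(int(c) for c in counts)
    return totals


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as f:
        for name, total in sorted(irq_totals(f.read()).items()):
            print(f"{name:40s} {total}")
```

If rows with an `mlx` name show up here, the card does have interrupts and only the filter was too narrow; how those counters are spread across CPUs would then be the next thing to look at.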
vyos@xx-gw2:~$ show interfaces
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth0             xx.xx.xx.xx/20                    u/u  MGMT
eth1             -                                 u/D
eth2             -                                 u/u
eth2.700         xx.xx.xx.xx/22                    u/u
                 xxxx:xxxx::x:xx/48
eth2.703         xx.xx.xx.xx/30                    u/u
                 xxxx:xxxx::x:xx/126
eth2.704         xx.xx.xx.xx/30                    u/u
                 xxxx:xxxx::x:xx/126
eth2.712         xx.xx.xx.xx/30                    u/u
                 xxxx:xxxx:xxxx:1::a/126
eth3             -                                 u/u
eth3.100         xx.xx.xx.xx/30                    u/u
eth3.704         -                                 u/u
lo               127.0.0.1/8                       u/u
                 ::1/128
vyos@xx-gw2:~$
root@xx-gw2:~# lsmod | grep mlx
mlx5_core             557056  0
mlxfw                  20480  1 mlx5_core
ipv6                  417792  78 ip6table_mangle,mlx5_core
ptp                    20480  3 igb,e1000e,mlx5_core
root@xx-gw2:~#
01:00.0 Ethernet controller: Mellanox Technologies MT27620 Family
01:00.1 Ethernet controller: Mellanox Technologies MT27620 Family