
Mellanox cards, problem with interrupts
In progress, Normal, Public

Description

Today we tried to put VyOS with a 40GbE Mellanox card into production, but it crashed and rebooted under load (~6-8 Gbit/s of traffic). The IPMI console showed the error "CPU: 5 PID: 36 Comm: ksoftirqd/5 Not tainted 4.19.0-amd64-vyos". We then rolled back to the old network scheme and tried to find the cause of the crash. We suspect the Mellanox driver and its interrupts, because no interrupts are listed for the mlnx cards:

vyos@xx-gw2:~$ cat /proc/interrupts | grep eth
 25:          0          0         46          0          0          0          6          0   PCI-MSI 409600-edge      eth1
 26:          0          1          0          0          0          0          0          0   PCI-MSI 2097152-edge      eth0
 27:      22515          0          3          0          0          0          0          0   PCI-MSI 2097153-edge      eth0-TxRx-0
 28:          0          0        163          3          0          0          0          0   PCI-MSI 2097154-edge      eth0-TxRx-1
 29:          0          0          0          0         51          0          0          0   PCI-MSI 2097155-edge      eth0-TxRx-2
 30:          0          0          0          0          0          3        233          0   PCI-MSI 2097156-edge      eth0-TxRx-3
vyos@xx-gw2:~$
vyos@xx-gw2:~$ show interfaces 
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth0             xx.xx.xx.xx/20                  u/u  MGMT 
eth1             -                                 u/D  
eth2             -                                 u/u  
eth2.700         xx.xx.xx.xx/22                    u/u  
                 xxxx:xxxx::x:xx/48
eth2.703         xx.xx.xx.xx/30                  u/u   
                 xxxx:xxxx::x:xx/126
eth2.704         xx.xx.xx.xx/30                  u/u   
                 xxxx:xxxx::x:xx/126
eth2.712         xx.xx.xx.xx/30                  u/u   
                 xxxx:xxxx:xxxx:1::a/126
eth3             -                                 u/u  
eth3.100         xx.xx.xx.xx/30                     u/u  
eth3.704         -                                 u/u   
lo               127.0.0.1/8                       u/u  
                 ::1/128
vyos@xx-gw2:~$
root@xx-gw2:~# lsmod | grep mlx
mlx5_core             557056  0 
mlxfw                  20480  1 mlx5_core
ipv6                  417792  78 ip6table_mangle,mlx5_core
ptp                    20480  3 igb,e1000e,mlx5_core
root@xx-gw2:~#
01:00.0 Ethernet controller: Mellanox Technologies MT27620 Family
01:00.1 Ethernet controller: Mellanox Technologies MT27620 Family
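
For anyone reproducing this check, here is a minimal Python sketch that sums the per-CPU counters in /proc/interrupts for each matching IRQ label. The function name `irq_totals` and the default pattern are illustrative, not part of VyOS. The `mlx` part of the pattern rests on an assumption that the report does not confirm: mlx5 IRQs may be registered under driver-style names (for example `mlx5_comp0@pci:...`) rather than `ethN`, in which case `grep eth` would not show them.

```python
import re
from collections import defaultdict

# Two rows copied from the output above; the live file has one row per IRQ.
SAMPLE = """\
 25:          0          0         46          0          0          0          6          0   PCI-MSI 409600-edge      eth1
 27:      22515          0          3          0          0          0          0          0   PCI-MSI 2097153-edge      eth0-TxRx-0
"""


def irq_totals(text, pattern=r"eth|mlx"):
    """Return {label: total interrupt count} for labels matching `pattern`.

    The default pattern is an assumption: it also matches driver-style
    labels containing "mlx", which `grep eth` would miss.
    """
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with an IRQ id ending in a colon, e.g. "27:".
        if not fields or not fields[0].endswith(":"):
            continue
        counts = []
        for field in fields[1:]:
            # Per-CPU counters are the leading numeric columns.
            if not field.isdigit():
                break
            counts.append(int(field))
        label = fields[-1]
        if counts and re.search(pattern, label):
            totals[label] += sum(counts)
    return dict(totals)


if __name__ == "__main__":
    for label, total in irq_totals(SAMPLE).items():
        print(label, total)
```

On the router itself, the same function can be fed the live file, e.g. `irq_totals(open("/proc/interrupts").read())`. If labels for eth2/eth3 or the mlx5 driver show up only with the wider pattern, the interrupts exist and the original grep was simply too narrow.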

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.2.0-rc7
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

oliko triaged this task as High priority. Nov 15 2018, 10:14 AM
oliko created this task.
oliko created this object in space S1 VyOS Public.
oliko updated the task description.

Can you provide info about firmware level?
Thanks

syncer lowered the priority of this task from High to Normal. Nov 21 2018, 2:31 AM

interrupts.png (423×1 px, 213 KB)

There is a problem with how the interface name is displayed. Not critical.

dmbaturin changed the task status from Open to In progress. Nov 28 2018, 11:23 PM
dmbaturin added a subscriber: dmbaturin.

@oliko Could you retest it with rc9, which uses a 4.19.4 kernel?

@dmbaturin Yes. We'll try tomorrow morning and give you feedback.

@dmbaturin Hello, sorry for the delay. We tested rc10 today. It did not crash, but it still writes a lot of errors to the logs (see the attachment).

Multiple fixes have since gone into the 4.19 kernel series. Could you please try upgrading to VyOS 1.2.5 or 1.2.6-epa1?

dmbaturin set Is it a breaking change? to Unspecified (possibly destroys the router).
erkin set Issue type to Bug (incorrect behavior). Aug 31 2021, 7:22 PM