
Recurring bugs in Intel NIC drivers
Closed, Resolved · Public

Assigned To
c-po
Authored By
drac
Dec 30 2020, 9:06 PM

Description

We've been having an issue similar to the one reported here:
https://sourceforge.net/p/e1000/mailman/e1000-devel/?viewmonth=202012

VyOS 1.3-rolling is still using the out-of-tree drivers for Intel network cards.
Now that the kernel has moved to the 5.4 series, is it worth switching to the in-kernel drivers?
Is there anything the out-of-tree driver supports that kernel 5.4 doesn't?

The latest 5.4 kernel looks like it might have a just-released fix for our problem.
The OP at SourceForge indicated that the in-kernel driver has resolved it for them.
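For anyone wanting to check their own box: the driver and version an interface is currently bound to can be read with ethtool (eth0 here is a placeholder):

# driver name and version string for the interface
ethtool -i eth0
# or inspect the module itself; the out-of-tree i40e carries its own version string
modinfo i40e | head -n 5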

Details

Difficulty level
Unknown (require assessment)
Version
1.3 rolling
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

drac created this object in space S1 VyOS Public.

The frequency of this issue seems to have increased; we're now getting panics daily (previously it was every 4 days).

drac triaged this task as High priority. Jan 2 2021, 7:17 AM

@drac: @maznu called this Intel driver stuff a "tire fire". I have a 5.10.4 kernel ISO which uses the built-in kernel drivers. Could you give it a test drive?

https://downloads.vyos.io/tmp/vyos-1.3-kernel-5.10.4-202012311317-amd64.iso

I'm also no big fan of those out-of-tree drivers, but there are rumors that "they perform better", which has yet to be evaluated.

c-po renamed this task from Intel Driver Bug to Recurring bugs in Intel NIC drivers. Jan 2 2021, 11:25 AM
c-po changed the task status from Open to In progress.
c-po claimed this task.

@drac are you seeing Slab in /proc/meminfo gradually increasing before the panic? If so, the sourceforge post at the top recommends disabling TUPLE "acceleration". It seems that the more traffic you have, the quicker the crash. We were getting them every ~6 hours.
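Something like the following is enough to watch for the leak and to turn the offload off (a sketch: I'm assuming the "TUPLE acceleration" from that post means the ntuple receive offload, and eth0 is a placeholder):

# sample Slab from /proc/meminfo once a minute
while true; do grep '^Slab:' /proc/meminfo; sleep 60; done
# disable the ntuple receive offload, then confirm
ethtool -K eth0 ntuple off
ethtool -k eth0 | grep ntuple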

We've been running the in-kernel X710 drivers in the build from @c-po for about 5 hours now. So far, slab usage is looking healthy, and doesn't seem to be massively increasing:

Slab:             272928 kB

Also, having seen that R3:5bd60a745de2 has been added, we've bodged together some Telegraf .conf files and can confirm we're not seeing endless Slab growth:

image.png (462×1 px, 48 KB)

I have my fingers crossed, hard, that this is a fix for us.
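If anyone wants to reproduce the graphing, a minimal Telegraf stanza along these lines should be enough (a sketch assuming the stock mem input plugin, which exposes a slab field):

[[inputs.mem]]
  # no options needed; this input reports "slab" from /proc/meminfo

Running telegraf --test --input-filter mem shows whether the field is being collected.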

The odd thing about this is that I don't see the issue consistently across systems.
I have two identical systems (hardware): one acts as a PPPoE concentrator with OSPF, the other as an L2TP session concentrator with OSPF and BGP.
I only see the issue on the L2TP system, which is currently only doing around 50 Mbps of UDP on average.
The PPPoE system does at least twice that on average.

https://sourceforge.net/p/e1000/bugs/671/ mentions a lot of UDP traffic, and L2TP is mainly UDP as well; maybe it is UDP traffic that triggers the issue?
@maznu do you have a lot of UDP on your system?
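A crude way to gauge it without capturing: the kernel keeps cumulative UDP counters, so two samples give a rate (a sketch; the 10-second interval is arbitrary):

# each grep prints the header row and the counters row
grep '^Udp:' /proc/net/snmp; sleep 10; grep '^Udp:' /proc/net/snmp
# InDatagrams is the first counter; (second sample - first) / 10 = datagrams per second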

The systems are:
SuperMicro AS-5019D-FTN4 with X722-DA4 (i40e) cards fitted.

The system seems fine for a while, then I start getting errors about memory page refcount problems logged in /var/log/messages. Then all of a sudden it panics and reboots.
It doesn't store a core dump anywhere that I can see, though.
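One way to capture the panic text despite the reboot would be netconsole (a sketch with placeholder addresses; the target MAC is that of the collector or the next-hop gateway):

# stream kernel messages, including the panic trace, to another machine
modprobe netconsole netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/00:11:22:33:44:55
# on the collector:
nc -u -l 6666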

I've been looking at getting a build working with KASAN enabled, but I haven't been successful following the VyOS build instructions (I'm running a Fedora 32 machine). I'm going to switch to Debian or Ubuntu shortly, specifically so I can build VyOS; that should make it a bit easier.
@c-po, do you think a daily KASAN-enabled (i.e. debug) build of VyOS would be a good idea?
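For reference, the kernel options I had in mind are roughly these (my assumption; KASAN adds substantial memory and CPU overhead, so it would have to be a separate debug image rather than the default):

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y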

As I'm sure you've noticed, the mailing list at SourceForge isn't very helpful for troubleshooting if:
a) you're running an AMD CPU;
b) you aren't running the latest stable kernel release;
c) you built your own system and fitted your own card;
d) you're running any distribution of Linux.
I've never seen anything like it.

I did find the following, though: http://patchwork.ozlabs.org/project/intel-wired-lan/list/
which tracks patches for the various Intel drivers.
There are a few XDP performance-related patches, but I couldn't care less about those at present. I just want a stable system!

In a couple of places Todd from Intel (on SourceForge) mentioned that a new version of the out-of-tree drivers is imminent.

@c-po I've loaded the ISO; it will become active on the next panic, and I'll let you know how I get on.

@drac enabling such debug features is not easily possible, as we cannot install two kernels in parallel.

Building a kernel is relatively easy by following these instructions https://docs.vyos.io/en/latest/contributing/build-vyos.html#linux-kernel using the VyOS-provided Docker container: https://docs.vyos.io/en/latest/contributing/build-vyos.html#build-iso
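Roughly, following those docs (a sketch; the image tag and branch may have moved on since):

git clone -b current --single-branch https://github.com/vyos/vyos-build
cd vyos-build
docker run --rm -it --privileged -v "$(pwd)":/vyos -w /vyos vyos/vyos-build:current bash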

I've switched to the new image overnight.
Everything seems to work OK, except I've noticed the following entries being logged: 104 so far, and it's only been up 20 minutes.
Jan 3 01:40:26 vyos kernel: [ 968.052496] l2tp_core: tunl 18294: recv short packet (len=12)
Jan 3 01:40:34 vyos kernel: [ 975.367315] l2tp_core: tunl 19358: recv short packet (len=12)
Jan 3 01:40:34 vyos kernel: [ 975.431273] l2tp_core: tunl 56700: recv short packet (len=12)

The particular interface dealing with l2tp is coming in on an Intel igb interface.

@drac we're a typical ISP/NSP with a fair amount of eyeball traffic behind us, so we expect to see a fairly high amount of UDP for QUIC (though it's not the bulk of the traffic on our VyOS boxes, which are BGP peering/transit edge). Each of our six VyOS boxes pushes around 300-500 Mbit/sec; two have XL710 NICs, and the rest are a mix of ixgbe and qlcnic.

Like you, I found the thread at sourceforge discussing the issue to be enlightening — in a bad way. Every possible attempt was made to disavow responsibility, in spite of a detailed and pretty conclusive bug report, before finally admitting "this is a known issue." I'll be sure to make reference to this "support experience" in our RFO, because it has led me to seriously consider swapping out our XL710 NICs for Mellanox. Even though that is going to require two significant maintenance windows on our part, Intel's attitude bothers me. These NICs and drivers have had a long history of bugs, so immediately trying to blame the customer is a surprising position to start from — we've been working with deployments of them in customer networks for over 3 years now, so are only too aware of how finicky they can be!

As for the performance, here is our comparison of CPU usage (black line) on VyOS 1.2.5 vs 1.3-Intel vs 1.3-stock. CPU % is the right y-axis, which I've shifted to make it more closely follow the shape of the traffic pattern.

image.png (302×1 px, 196 KB)

And a slightly longer-term traffic graph, showing CPU usage vs traffic levels across VyOS 1.2.5 to 1.3-rolling on the same XL710 box:

image.png (303×1 px, 233 KB)

From the looks of those graphs, I'd say the build of VyOS 1.3 with the in-kernel driver is more efficient than VyOS 1.3 with Intel's driver (which in turn is a significant improvement over VyOS 1.2).

As for memory usage on the same box, over the last 24 hours we've seen a slow increase in usage (the bgpd process appears to be the culprit, weighing in at 2.6GB). To me this appears to be a fairly typical asymptotic curve as route churn fragments bgpd's heap:

image.png (473×1 px, 58 KB)

Other routers' bgpd processes are currently at 3.7GB, 4.1GB, 3.3GB, 3.6GB, and 2.8GB. We'll be able to keep an eye on this in more detail via Telegraf+Prometheus now.
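For anyone following along, a couple of ways to sample bgpd's memory by hand (a sketch assuming FRR's bgpd and vtysh):

# resident set size straight from procfs
grep VmRSS /proc/$(pidof bgpd)/status
# FRR can also break its allocations down by type
vtysh -c 'show memory'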

@maznu I've also been looking at switching to Mellanox cards after my experience with Intel. It's not as if this is a mass consumer product with end users who don't really know what they're doing; it's a product most likely supported by IT/network staff who get to influence purchasing decisions for equipment like this.

Your graphs look pretty convincing to me that the in-kernel driver is the way to go.
However, the short-packet notifications I'm getting are still bothering me.
They appear to be logged at exactly one-minute intervals per tunnel (I currently have 5).
While everything appears to be working OK, I can't see these small packets arriving at the interface using tcpdump.
This makes me think they might be part of a larger packet that is somehow being scrambled by the driver.
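The only other thing I can think of is filtering on the UDP length field directly (a sketch: I'm assuming the logged len is the L2TP payload length, i.e. UDP length 20 once the 8-byte UDP header is added, and that the tunnels are on the standard port 1701):

# catch unusually short L2TP/UDP datagrams on the ingress interface (eth0 is a placeholder)
tcpdump -ni eth0 'udp port 1701 and udp[4:2] <= 20'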

Looks like I'm going to need to put together a custom ISO and kernel build to troubleshoot. :(

You might want to take a look at the patches in T228 - it's a 5.4 build with a bunch of C fixups, but using the Intel proprietary drivers in an in-tree build (which permits signing of all modules at kernel build time).
We have this running on a host with a dual-port 740 (not doing all that much: some routing, NAT, ACLs, and a couple of OpenVPN and IPsec tunnels), and it seems to be fairly happy in that low-intensity environment.
I can try to beat up on it and see how it fares; probably worth a try.

As of T3218 we no longer use the out-of-tree drivers, as there is no performance gain. Both the current and equuleus branches (1.4 and 1.3) now use the stock Linux kernel drivers.

Currently 1.3 is running the 5.4 kernel series and 1.4 is running the 5.10 kernel series.

c-po changed the task status from In progress to Needs testing. Jan 15 2021, 4:14 PM
erkin set Issue type to Bug (incorrect behavior). Aug 29 2021, 11:41 AM