My hardware is based on the i211AT and a slow (by today's standards: 4 cores @ 1 GHz), low-power CPU (PC Engines APU4), but the issues could affect other Intel NICs as well.
- The default ring buffer sizes (see "ethtool -g eth0") of 256 are sub-optimal for a fairly busy router, causing significant packet loss. It works much better after "ethtool -G eth0 rx 4096 tx 4096", which are the maximum allowed values (note the device name comes right after -G). Real Internet traffic (in my case, a BGP router for a small local ISP with a few hundred customers) tends to be quite bursty, so it's not immediately obvious from averaged traffic or CPU load statistics, or even iperf tests, that there are brief moments when the NIC can't keep up. If the small defaults are needed due to some old hardware limitation, it would at least be good to document this clearly, as it cost me some grey hair to find the cause of the packet loss (a wired router is the last place to expect it in a mostly wireless network). Anyway, igb (even ixgbe defaults to just 512; its maximum is 4096 too) is for fairly recent hardware, e1000e for older (PCIe) and e1000 for much older (parallel PCI, PCI-X) hardware, right? For comparison, I checked the driver source for cheap Realtek NICs (r8169): they use 2560 RX descriptors - not tunable, but 10x more. Larger buffers mean higher latency, but I'm not seeing much of an increase, and the packet loss before increasing them had a much worse effect on customer experience (about 1% may not seem like a lot, but it matters a lot at today's high speeds - see the formula at https://en.wikipedia.org/wiki/TCP_tuning#Packet_loss: 10x more TCP speed needs 100x lower loss, because the loss term is under a square root).
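The square-root relation from that Wikipedia page (the Mathis approximation, rate <= C * MSS / (RTT * sqrt(p)), C ~ 1.22) is easy to sanity-check numerically. A quick awk sketch - the MSS, RTT, and loss values below are illustrative, not measurements from my router:

```shell
# Mathis approximation: throughput ceiling in Mbit/s for a single TCP flow.
# Args: MSS in bytes, RTT in seconds, packet loss probability.
mathis() {
  awk -v mss="$1" -v rtt="$2" -v p="$3" \
      'BEGIN { printf "%.1f\n", 1.22 * mss * 8 / (rtt * sqrt(p)) / 1e6 }'
}

mathis 1460 0.020 0.01     # 1% loss, 20 ms RTT: ~7.1 Mbit/s ceiling
mathis 1460 0.020 0.0001   # 100x lower loss: ~71.2 Mbit/s, i.e. 10x the speed
```

So even a "small" 1% loss caps a single flow at a few Mbit/s on a typical RTT, which matches what my customers were seeing.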
- On the same busy router I'm seeing lots of "igb ...: partial checksum but l4 proto=29!" messages in the dmesg log, and also some "mixed HW and IP checksum settings". This probably has something to do with the offload settings; I'm not sure which one yet. Again, the defaults are probably tuned for a not-very-busy host, not for a busy router.
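To narrow down which offload is responsible, I plan to list the features and toggle the checksum-related candidates one at a time while watching the log. A bisection sketch (eth0 is assumed; which feature actually triggers the messages is exactly what's still unknown here):

```shell
# Show all offload features and which ones are fixed for this NIC.
ethtool -k eth0

# Toggle suspects one at a time, re-checking dmesg after each change.
ethtool -K eth0 rx off    # RX checksum offload
ethtool -K eth0 tx off    # TX checksum offload
ethtool -K eth0 gso off   # generic segmentation offload
ethtool -K eth0 gro off   # generic receive offload

# Watch for the warnings in real time.
dmesg -w | grep -i checksum
```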
- This may be a quirk of my specific hardware or BIOS (coreboot), but of the 4 i211AT interfaces, the first two get two IRQs/queues each and the next two get only one IRQ/queue each. RPS needs to be enabled manually to distribute the load evenly between the 4 CPU cores; otherwise just one core sat at 100%, especially on the box running as a PPPoE server, which seems much more CPU intensive than plain routing (my guess: copying each packet, which, being fresh data from the network, means lots of cache misses and stressed memory bandwidth).
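For reference, this is roughly how RPS gets enabled by hand - the mask is a hex bitmap of the CPUs allowed to process each RX queue (0xf = cores 0-3 on my 4-core box; eth0 and the queue count are specific to my setup):

```shell
# Spread receive processing of every RX queue across all 4 cores.
# Each NIC queue has its own rps_cpus file under /sys.
for q in /sys/class/net/eth0/queues/rx-*; do
    echo f > "$q/rps_cpus"
done
```

This needs to be reapplied on boot (e.g. from a udev rule or init script), since the sysfs values reset to 0 (RPS disabled) on reload.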