I had a stall/deadlock seemingly take down my system this morning. I’m on a custom build forked off of 1.5 rolling from somewhere around March 10th. It seems to be something around xfrm_input (reached via xfrm4_esp_rcv in the trace below), so it may be related to IPsec VPN/VTI interfaces.
10:02
I had been testing some code around reloading the IPsec daemon when DHCP IPs change. That action itself doesn’t seem to be related (as far as I can tell, there was no DHCP renew around the time the CPU started spinning out of control), but it could be some internal bug in strongSwan where repeated reload actions don’t clean up correctly.
Notably, there are a lot of log entries about sending dead-peer detection to one particular peer right before the stall detection starts kicking in.
Mar 18 16:44:56 lcn-router kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Mar 18 16:44:56 lcn-router kernel: rcu: 4-....: (17 GPs behind) idle=59b4/1/0x4000000000000000 softirq=1266>
Mar 18 16:44:56 lcn-router kernel: rcu: (t=588507 jiffies g=7635657 q=24585 ncpus=12)
Mar 18 16:44:56 lcn-router kernel: CPU: 4 PID: 145 Comm: kworker/4:1H Tainted: G W O L 6.6.21-amd64-vyo>
Mar 18 16:44:56 lcn-router kernel: Hardware name: Supermicro Super Server/A2SDi-TP8F, BIOS 1.4 01/29/2021
Mar 18 16:44:56 lcn-router kernel: Workqueue: adf_pf_resp_wq_0 adf_response_handler_wq [intel_qat]
Mar 18 16:44:56 lcn-router kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x65/0x2b0
Mar 18 16:44:56 lcn-router kernel: Code: 77 77 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00>
Mar 18 16:44:56 lcn-router kernel: RSP: 0018:ffffa932801dcb00 EFLAGS: 00000202
Mar 18 16:44:56 lcn-router kernel: RAX: 0000000000000001 RBX: ffff9851c789f04c RCX: ffff9851c789f048
Mar 18 16:44:56 lcn-router kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9851c789f04c
Mar 18 16:44:56 lcn-router kernel: RBP: ffffa932801dcb98 R08: 00000000f51784c0 R09: 0000000000000002
Mar 18 16:44:56 lcn-router kernel: R10: 0000000000000005 R11: ffff9851b4213908 R12: ffff985200187700
Mar 18 16:44:56 lcn-router kernel: R13: 0000000000000002 R14: ffff9851c789f000 R15: ffff9851c789f04c
Mar 18 16:44:56 lcn-router kernel: FS: 0000000000000000(0000) GS:ffff9854efb00000(0000) knlGS:0000000000000000
Mar 18 16:44:56 lcn-router kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 18 16:44:56 lcn-router kernel: CR2: 00007fa140001828 CR3: 0000000103e34000 CR4: 00000000003506e0
Mar 18 16:44:56 lcn-router kernel: Call Trace:
Mar 18 16:44:56 lcn-router kernel: <IRQ>
Mar 18 16:44:56 lcn-router kernel: ? rcu_dump_cpu_stacks+0xbf/0x100
Mar 18 16:44:56 lcn-router kernel: ? rcu_sched_clock_irq+0x652/0x1160
Mar 18 16:44:56 lcn-router kernel: ? nohz_balance_exit_idle+0x11/0xc0
Mar 18 16:44:56 lcn-router kernel: ? account_process_tick+0x26/0x140
Mar 18 16:44:56 lcn-router kernel: ? update_process_times+0x5d/0x90
Mar 18 16:44:56 lcn-router kernel: ? tick_sched_timer+0x7a/0xb0
Mar 18 16:44:56 lcn-router kernel: ? __pfx_tick_sched_timer+0x10/0x10
Mar 18 16:44:56 lcn-router kernel: ? __hrtimer_run_queues+0x10d/0x2a0
Mar 18 16:44:56 lcn-router kernel: ? hrtimer_interrupt+0xf9/0x230
Mar 18 16:44:56 lcn-router kernel: ? __sysvec_apic_timer_interrupt+0x69/0x170
Mar 18 16:44:56 lcn-router kernel: ? sysvec_apic_timer_interrupt+0x39/0xb0
Mar 18 16:44:56 lcn-router kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Mar 18 16:44:56 lcn-router kernel: ? native_queued_spin_lock_slowpath+0x65/0x2b0
Mar 18 16:44:56 lcn-router kernel: _raw_spin_lock+0x2b/0x40
Mar 18 16:44:56 lcn-router kernel: xfrm_input+0x1ef/0x1210
Mar 18 16:44:56 lcn-router kernel: xfrm4_esp_rcv+0x2f/0x70
Mar 18 16:44:56 lcn-router kernel: ip_protocol_deliver_rcu+0x187/0x190
Mar 18 16:44:56 lcn-router kernel: ip_local_deliver_finish+0x6d/0x90
Mar 18 16:44:56 lcn-router kernel: ip_sublist_rcv_finish+0x79/0x90
Mar 18 16:44:56 lcn-router kernel: ip_sublist_rcv+0x190/0x230
Mar 18 16:44:56 lcn-router kernel: ? __pfx_ip_rcv_finish+0x10/0x10
Mar 18 16:44:56 lcn-router kernel: ip_list_rcv+0x134/0x160
Mar 18 16:44:56 lcn-router kernel: __netif_receive_skb_list_core+0x299/0x2c0
Mar 18 16:44:56 lcn-router kernel: netif_receive_skb_list_internal+0x1ac/0x2e0
Mar 18 16:44:56 lcn-router kernel: napi_complete_done+0x69/0x1a0
Mar 18 16:44:56 lcn-router kernel: igc_poll+0x62f/0x1790 [igc]
Mar 18 16:44:56 lcn-router kernel: __napi_poll+0x26/0x1b0
Mar 18 16:44:56 lcn-router kernel: net_rx_action+0x147/0x2c0
Mar 18 16:44:56 lcn-router kernel: __do_softirq+0xeb/0x2ef
Mar 18 16:44:56 lcn-router kernel: __irq_exit_rcu+0x71/0xc0
Mar 18 16:44:56 lcn-router kernel: common_interrupt+0xa5/0xc0
Mar 18 16:44:56 lcn-router kernel: </IRQ>
Mar 18 16:44:56 lcn-router kernel: <TASK>
Mar 18 16:44:56 lcn-router kernel: asm_common_interrupt+0x22/0x40
Mar 18 16:44:56 lcn-router kernel: RIP: 0010:xfrm_replay_recheck+0x0/0x90
Mar 18 16:44:56 lcn-router kernel: Code: 83 f8 01 74 0a 83 f8 02 74 0a e9 9b f8 ff ff e9 e6 f6 ff ff e9 a1 f7 ff ff>
Mar 18 16:44:56 lcn-router kernel: RSP: 0018:ffffa93280acfd88 EFLAGS: 00000202
Mar 18 16:44:56 lcn-router kernel: RAX: 0000000000000004 RBX: 00000000b1000000 RCX: ffff98518cba0000
Mar 18 16:44:56 lcn-router kernel: RDX: 00000000b1000000 RSI: ffff985200187d00 RDI: ffff9851c789f000
Mar 18 16:44:56 lcn-router kernel: RBP: ffffa93280acfdf0 R08: 0000000000000004 R09: 0000000000000004
Mar 18 16:44:56 lcn-router kernel: R10: ffffffffaf6060e0 R11: ffffffffafe0ecc0 R12: ffff985200187d00
Mar 18 16:44:56 lcn-router kernel: R13: 0000000000000002 R14: ffff9851c789f000 R15: ffff9851c789f04c
Mar 18 16:44:56 lcn-router kernel: xfrm_input+0x4ca/0x1210
Mar 18 16:44:56 lcn-router kernel: qat_alg_callback+0x18/0x30 [intel_qat]
Mar 18 16:44:56 lcn-router kernel: adf_handle_response+0x40/0xc0 [intel_qat]
Mar 18 16:44:56 lcn-router kernel: adf_response_handler_wq+0x6c/0xc0 [intel_qat]
Mar 18 16:44:56 lcn-router kernel: process_one_work+0x16f/0x340
Mar 18 16:44:56 lcn-router kernel: worker_thread+0x272/0x390
Mar 18 16:44:56 lcn-router kernel: ? preempt_count_add+0x65/0xa0
Mar 18 16:44:56 lcn-router kernel: ? __pfx_worker_thread+0x10/0x10
Mar 18 16:44:56 lcn-router kernel: kthread+0xee/0x120
Mar 18 16:44:56 lcn-router kernel: ? __pfx_kthread+0x10/0x10
Mar 18 16:44:56 lcn-router kernel: ret_from_fork+0x2b/0x40
Mar 18 16:44:56 lcn-router kernel: ? __pfx_kthread+0x10/0x10
Mar 18 16:44:56 lcn-router kernel: ret_from_fork_asm+0x1b/0x30
Mar 18 16:44:56 lcn-router kernel: </TASK>