
Intel QAT causes CPU runaway/stall with ipsec VPN
Open, Normal | Public | BUG

Description

I had a stall/deadlock seemingly take down my system this morning. I’m on a custom build forked off of 1.5 rolling from around March 10th. The stall seems to involve xfrm4_input, so it may be related to IPsec VPN/VTI interfaces.
I had been testing some code that reloads the IPsec daemon when DHCP IPs change. That action itself doesn’t seem to be related (as far as I can tell, there was no DHCP renewal around the time the CPU started spinning out of control), but there could be an internal bug in strongSwan where repeated reloads don’t clean up correctly.
Notably, there are a lot of log entries about sending dead-peer detection to one particular peer right before the stall detection kicks in.

Mar 18 16:44:56 lcn-router kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Mar 18 16:44:56 lcn-router kernel: rcu:         4-....: (17 GPs behind) idle=59b4/1/0x4000000000000000 softirq=1266>
Mar 18 16:44:56 lcn-router kernel: rcu:         (t=588507 jiffies g=7635657 q=24585 ncpus=12)
Mar 18 16:44:56 lcn-router kernel: CPU: 4 PID: 145 Comm: kworker/4:1H Tainted: G        W  O L     6.6.21-amd64-vyo>
Mar 18 16:44:56 lcn-router kernel: Hardware name: Supermicro Super Server/A2SDi-TP8F, BIOS 1.4 01/29/2021
Mar 18 16:44:56 lcn-router kernel: Workqueue: adf_pf_resp_wq_0 adf_response_handler_wq [intel_qat]
Mar 18 16:44:56 lcn-router kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x65/0x2b0
Mar 18 16:44:56 lcn-router kernel: Code: 77 77 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00>
Mar 18 16:44:56 lcn-router kernel: RSP: 0018:ffffa932801dcb00 EFLAGS: 00000202
Mar 18 16:44:56 lcn-router kernel: RAX: 0000000000000001 RBX: ffff9851c789f04c RCX: ffff9851c789f048
Mar 18 16:44:56 lcn-router kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9851c789f04c
Mar 18 16:44:56 lcn-router kernel: RBP: ffffa932801dcb98 R08: 00000000f51784c0 R09: 0000000000000002
Mar 18 16:44:56 lcn-router kernel: R10: 0000000000000005 R11: ffff9851b4213908 R12: ffff985200187700
Mar 18 16:44:56 lcn-router kernel: R13: 0000000000000002 R14: ffff9851c789f000 R15: ffff9851c789f04c
Mar 18 16:44:56 lcn-router kernel: FS:  0000000000000000(0000) GS:ffff9854efb00000(0000) knlGS:0000000000000000
Mar 18 16:44:56 lcn-router kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 18 16:44:56 lcn-router kernel: CR2: 00007fa140001828 CR3: 0000000103e34000 CR4: 00000000003506e0
Mar 18 16:44:56 lcn-router kernel: Call Trace:
Mar 18 16:44:56 lcn-router kernel:  <IRQ>
Mar 18 16:44:56 lcn-router kernel:  ? rcu_dump_cpu_stacks+0xbf/0x100
Mar 18 16:44:56 lcn-router kernel:  ? rcu_sched_clock_irq+0x652/0x1160
Mar 18 16:44:56 lcn-router kernel:  ? nohz_balance_exit_idle+0x11/0xc0
Mar 18 16:44:56 lcn-router kernel:  ? account_process_tick+0x26/0x140
Mar 18 16:44:56 lcn-router kernel:  ? update_process_times+0x5d/0x90
Mar 18 16:44:56 lcn-router kernel:  ? tick_sched_timer+0x7a/0xb0
Mar 18 16:44:56 lcn-router kernel:  ? __pfx_tick_sched_timer+0x10/0x10
Mar 18 16:44:56 lcn-router kernel:  ? __hrtimer_run_queues+0x10d/0x2a0
Mar 18 16:44:56 lcn-router kernel:  ? hrtimer_interrupt+0xf9/0x230
Mar 18 16:44:56 lcn-router kernel:  ? __sysvec_apic_timer_interrupt+0x69/0x170
Mar 18 16:44:56 lcn-router kernel:  ? sysvec_apic_timer_interrupt+0x39/0xb0
Mar 18 16:44:56 lcn-router kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Mar 18 16:44:56 lcn-router kernel:  ? native_queued_spin_lock_slowpath+0x65/0x2b0
Mar 18 16:44:56 lcn-router kernel:  _raw_spin_lock+0x2b/0x40
Mar 18 16:44:56 lcn-router kernel:  xfrm_input+0x1ef/0x1210
Mar 18 16:44:56 lcn-router kernel:  xfrm4_esp_rcv+0x2f/0x70
Mar 18 16:44:56 lcn-router kernel:  ip_protocol_deliver_rcu+0x187/0x190
Mar 18 16:44:56 lcn-router kernel:  ip_local_deliver_finish+0x6d/0x90
Mar 18 16:44:56 lcn-router kernel:  ip_sublist_rcv_finish+0x79/0x90
Mar 18 16:44:56 lcn-router kernel:  ip_sublist_rcv+0x190/0x230
Mar 18 16:44:56 lcn-router kernel:  ? __pfx_ip_rcv_finish+0x10/0x10
Mar 18 16:44:56 lcn-router kernel:  ip_list_rcv+0x134/0x160
Mar 18 16:44:56 lcn-router kernel:  __netif_receive_skb_list_core+0x299/0x2c0
Mar 18 16:44:56 lcn-router kernel:  netif_receive_skb_list_internal+0x1ac/0x2e0
Mar 18 16:44:56 lcn-router kernel:  napi_complete_done+0x69/0x1a0
Mar 18 16:44:56 lcn-router kernel:  igc_poll+0x62f/0x1790 [igc]
Mar 18 16:44:56 lcn-router kernel:  __napi_poll+0x26/0x1b0
Mar 18 16:44:56 lcn-router kernel:  net_rx_action+0x147/0x2c0
Mar 18 16:44:56 lcn-router kernel:  __do_softirq+0xeb/0x2ef
Mar 18 16:44:56 lcn-router kernel:  __irq_exit_rcu+0x71/0xc0
Mar 18 16:44:56 lcn-router kernel:  common_interrupt+0xa5/0xc0
Mar 18 16:44:56 lcn-router kernel:  </IRQ>
Mar 18 16:44:56 lcn-router kernel:  <TASK>
Mar 18 16:44:56 lcn-router kernel:  asm_common_interrupt+0x22/0x40
Mar 18 16:44:56 lcn-router kernel: RIP: 0010:xfrm_replay_recheck+0x0/0x90
Mar 18 16:44:56 lcn-router kernel: Code: 83 f8 01 74 0a 83 f8 02 74 0a e9 9b f8 ff ff e9 e6 f6 ff ff e9 a1 f7 ff ff>
Mar 18 16:44:56 lcn-router kernel: RSP: 0018:ffffa93280acfd88 EFLAGS: 00000202
Mar 18 16:44:56 lcn-router kernel: RAX: 0000000000000004 RBX: 00000000b1000000 RCX: ffff98518cba0000
Mar 18 16:44:56 lcn-router kernel: RDX: 00000000b1000000 RSI: ffff985200187d00 RDI: ffff9851c789f000
Mar 18 16:44:56 lcn-router kernel: RBP: ffffa93280acfdf0 R08: 0000000000000004 R09: 0000000000000004
Mar 18 16:44:56 lcn-router kernel: R10: ffffffffaf6060e0 R11: ffffffffafe0ecc0 R12: ffff985200187d00
Mar 18 16:44:56 lcn-router kernel: R13: 0000000000000002 R14: ffff9851c789f000 R15: ffff9851c789f04c
Mar 18 16:44:56 lcn-router kernel:  xfrm_input+0x4ca/0x1210
Mar 18 16:44:56 lcn-router kernel:  qat_alg_callback+0x18/0x30 [intel_qat]
Mar 18 16:44:56 lcn-router kernel:  adf_handle_response+0x40/0xc0 [intel_qat]
Mar 18 16:44:56 lcn-router kernel:  adf_response_handler_wq+0x6c/0xc0 [intel_qat]
Mar 18 16:44:56 lcn-router kernel:  process_one_work+0x16f/0x340
Mar 18 16:44:56 lcn-router kernel:  worker_thread+0x272/0x390
Mar 18 16:44:56 lcn-router kernel:  ? preempt_count_add+0x65/0xa0
Mar 18 16:44:56 lcn-router kernel:  ? __pfx_worker_thread+0x10/0x10
Mar 18 16:44:56 lcn-router kernel:  kthread+0xee/0x120
Mar 18 16:44:56 lcn-router kernel:  ? __pfx_kthread+0x10/0x10
Mar 18 16:44:56 lcn-router kernel:  ret_from_fork+0x2b/0x40
Mar 18 16:44:56 lcn-router kernel:  ? __pfx_kthread+0x10/0x10
Mar 18 16:44:56 lcn-router kernel:  ret_from_fork_asm+0x1b/0x30
Mar 18 16:44:56 lcn-router kernel:  </TASK>

Details

Difficulty level
Unknown (require assessment)
Version
Fork of 1.5-rolling-202403100025
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Viacheslav triaged this task as Normal priority. Mar 27 2024, 5:05 PM
Viacheslav added a project: VyOS 1.5 Circinus.

Offending driver is intel_qat - try disabling QAT first.
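For anyone triaging a similar trace: the kernel appends a "[module]" tag to call-trace frames that belong to a loadable module, which is how intel_qat stands out in the log above. The following is only a rough sketch for counting those tags. The function name and regex are my own, not part of any VyOS or kernel tooling, and the sample lines are copied from the trace in the description.

```python
import re
from collections import Counter

# Matches frames such as "igc_poll+0x62f/0x1790 [igc]": a symbol,
# an offset/size pair, and the trailing "[module]" tag.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s+\[(?P<module>\w+)\]"
)

def modules_in_trace(lines):
    """Count call-trace frames per loadable kernel module (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            counts[match.group("module")] += 1
    return counts

# Lines copied from the trace in this report.
sample = [
    "Mar 18 16:44:56 lcn-router kernel:  xfrm_input+0x4ca/0x1210",
    "Mar 18 16:44:56 lcn-router kernel:  qat_alg_callback+0x18/0x30 [intel_qat]",
    "Mar 18 16:44:56 lcn-router kernel:  adf_handle_response+0x40/0xc0 [intel_qat]",
    "Mar 18 16:44:56 lcn-router kernel:  adf_response_handler_wq+0x6c/0xc0 [intel_qat]",
    "Mar 18 16:44:56 lcn-router kernel:  igc_poll+0x62f/0x1790 [igc]",
]

print(modules_in_trace(sample).most_common())
```

Frames without a tag (xfrm_input, for example) are built into the kernel image and are skipped, so the count only hints at which out-of-core driver is involved. It does not prove where the bug is.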

My system finally crashed again today. I found a workload that generates enough traffic over the VPN to reliably reproduce it.

It does appear to be QAT. After I disabled QAT on the running system it crashed again, but disabling QAT and then rebooting, so that QAT was never enabled, seems to have made it stable again.
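Since the system only became stable after a reboot with QAT never enabled, it may help to confirm that no QAT module is loaded after boot. Below is a minimal sketch that scans the text of /proc/modules for module names containing "qat". The helper name is hypothetical, the sample input is illustrative rather than captured from my router, and matching on the substring "qat" is an assumption about how the driver modules are named.

```python
def loaded_qat_modules(proc_modules_text):
    """Return names of loaded modules whose name contains 'qat'.

    Expects the contents of /proc/modules, where each line starts
    with the module name followed by size, refcount, and so on.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "qat" in fields[0]:
            names.append(fields[0])
    return names

# Illustrative input only; sizes and addresses are made up.
sample = """\
igc 192512 0 - Live 0x0000000000000000
intel_qat 286720 0 - Live 0x0000000000000000
nf_tables 311296 0 - Live 0x0000000000000000
"""

print(loaded_qat_modules(sample))
```

On a live system the input would come from reading /proc/modules, for example `loaded_qat_modules(open("/proc/modules").read())`. An empty list would suggest the driver never loaded on that boot.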

lucasec renamed this task from CPU runaway/stall possibly related to Strongswan to Intel QAT causes CPU runaway/stall with ipsec VPN. Sun, Apr 14, 11:36 PM