
Commit-archive via scp causes 100% CPU on boot
Closed, Resolved | Public | BUG

Authored By: FileGo, May 17 2021, 8:52 AM
Referenced Files
F1477640: image.png (Jun 23 2021, 10:06 AM)
F1477661: image.png (Jun 23 2021, 10:06 AM)
F1382716: image.png (May 17 2021, 8:52 AM)

Description

System configured with commit-archive via scp:

# show system
 config-management {
     commit-archive {
         location scp://user:[email protected]/data
     }
     commit-revisions 100
 }
...

After upgrading to the latest 1.4 rolling release (the system previously ran 1.4-rolling-202105091233 normally), the boot stops at "Mounting VyOS config ... done". The router responds to ping and can eventually be logged into via SSH.

ps aux shows:

root     12460 75.6  1.8 182620 27384 ?        Dl   08:23   0:02 python3 -c from vyos.remote import upload; upload("/tmp/config.boot.12456", "scp://user:[email protected]/data/config.boot-vyos-backup.20210517_082334", source=None)

It appears that VyOS tries to push the latest config via SCP, but in the process uses 100% of one (v)CPU. The system then stops responding for a while as memory is exhausted, until the message below is printed on the console.

image.png (349×1 px, 109 KB)

Details

Difficulty level: Unknown (require assessment)
Version: 1.4-rolling-202105160417
Why the issue appeared?: Will be filled on close
Is it a breaking change?: Unspecified (possibly destroys the router)

Related Objects

Status        Subtype            Assigned    Task
In progress   FEATURE REQUEST    None
Resolved      FEATURE REQUEST    erkin
Resolved      BUG                erkin

Event Timeline

I also hit this error while upgrading a server from 1.2.7 to the latest rolling release.

These are the boot message logs:

May 17 11:07:59 vyos kernel: [  548.897105] systemd-logind invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
May 17 11:07:59 vyos kernel: [  548.897155] CPU: 0 PID: 711 Comm: systemd-logind Not tainted 5.10.37-amd64-vyos #1
May 17 11:07:59 vyos kernel: [  548.897156] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
May 17 11:07:59 vyos kernel: [  548.897163] Call Trace:
May 17 11:07:59 vyos kernel: [  548.897307]  dump_stack+0x6d/0x88
May 17 11:07:59 vyos kernel: [  548.897342]  dump_header+0x45/0x1df
May 17 11:07:59 vyos kernel: [  548.897357]  oom_kill_process.cold.33+0xb/0x10
May 17 11:07:59 vyos kernel: [  548.897377]  out_of_memory+0x199/0x4d0
May 17 11:07:59 vyos kernel: [  548.897405]  __alloc_pages_slowpath.constprop.118+0xa1f/0xaf0
May 17 11:07:59 vyos kernel: [  548.897422]  __alloc_pages_nodemask+0x253/0x290
May 17 11:07:59 vyos kernel: [  548.897426]  pagecache_get_page+0xc6/0x230
May 17 11:07:59 vyos kernel: [  548.897431]  filemap_fault+0x4a3/0x920
May 17 11:07:59 vyos kernel: [  548.897472]  ? ep_send_events_proc+0x169/0x220
May 17 11:07:59 vyos kernel: [  548.897502]  ? xas_load+0x8/0x80
May 17 11:07:59 vyos kernel: [  548.897514]  __do_fault+0x33/0xe0
May 17 11:07:59 vyos kernel: [  548.897528]  handle_mm_fault+0x136a/0x1a10
May 17 11:07:59 vyos kernel: [  548.897550]  exc_page_fault+0x222/0x440
May 17 11:07:59 vyos kernel: [  548.897578]  ? asm_exc_page_fault+0x8/0x30
May 17 11:07:59 vyos kernel: [  548.897580]  asm_exc_page_fault+0x1e/0x30
May 17 11:07:59 vyos kernel: [  548.897600] RIP: 0033:0x7fe899ce9fe4
May 17 11:07:59 vyos kernel: [  548.897734] 2232 total pagecache pages
May 17 11:07:59 vyos kernel: [  548.897735] 257916 pages RAM
May 17 11:07:59 vyos kernel: [  548.897757] 0 pages HighMem/MovableOnly
May 17 11:07:59 vyos kernel: [  548.897758] 9086 pages reserved
May 17 11:07:59 vyos kernel: [  548.897758] 0 pages hwpoisoned
May 17 11:07:59 vyos kernel: [  548.898066] Out of memory: Killed process 1129 (python3) total-vm:742744kB, anon-rss:725752kB, file-rss:1472kB, shmem-rss:0kB, UID:0 pgtables:1492kB oom_score_adj:0
May 17 11:08:00 vyos zebra[878]: [EC 100663313] SLOW THREAD: task vtysh_read (7f4c2a3a49b0) ran for 5203ms (cpu time 72ms)
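
To put the log above in proportion, the "Out of memory: Killed process" line can be compared against the "pages RAM" figure. The following is a minimal sketch, not part of VyOS: the function names are hypothetical, and the 4 KiB page size is an assumption (typical for amd64) that the log itself does not state.

```python
import re

# Matches the kernel's OOM summary line as it appears in the log above.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)

PAGE_SIZE_KB = 4  # assumption: 4 KiB pages; not stated in the log


def parse_oom_kill(line):
    """Return pid, command name and memory figures from an OOM kill line, or None."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "comm": m["comm"],
        "total_vm_kb": int(m["total_vm"]),
        "anon_rss_kb": int(m["anon_rss"]),
    }


def rss_share_of_ram(anon_rss_kb, ram_pages, page_size_kb=PAGE_SIZE_KB):
    """Fraction of physical RAM held by the killed process's anonymous memory."""
    return anon_rss_kb / (ram_pages * page_size_kb)


line = (
    "May 17 11:07:59 vyos kernel: [  548.898066] Out of memory: Killed process 1129 "
    "(python3) total-vm:742744kB, anon-rss:725752kB, file-rss:1472kB, shmem-rss:0kB, "
    "UID:0 pgtables:1492kB oom_score_adj:0"
)
info = parse_oom_kill(line)
share = rss_share_of_ram(info["anon_rss_kb"], ram_pages=257916)
print(info["comm"], info["pid"], f"{share:.0%}")
```

Under that page-size assumption, the killed python3 process held roughly 70% of the machine's RAM when the OOM killer fired.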

So, as you said, the device is accessible via SSH, but you cannot commit any further changes because the configuration has not finished loading.

vyos@vyos# commit
Configuration system temporarily locked due to another commit in progress
[edit]
vyos@vyos#

As a workaround, you can kill the process and continue working.

root      1662 95.8  2.0 182400 20348 ?        R    10:59 124:31 python3 -c from vyos.remote import upload; upload("/tmp/config.boot.1658", "scp://vyos:[email protected]/home/vyos/config.boot-v"

$ sudo kill -9 1662
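
The manual workaround (find the PID in ps aux, then kill -9 it) could be scripted roughly as follows. This is a hedged sketch, not an official VyOS tool: the helper names are hypothetical, and the marker string is simply the command text visible in the ps output above. It must run as root, and it kills any matching process without checking how long it has been running.

```python
import os
import signal
import subprocess

# Command text seen in the ps output of the stuck commit-archive upload.
MARKER = "from vyos.remote import upload"


def find_upload_pids(ps_output, marker=MARKER):
    """Return PIDs of processes in `ps aux` output whose command contains marker."""
    pids = []
    for line in ps_output.splitlines():
        if marker not in line:
            continue
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        pids.append(int(fields[1]))
    return pids


def kill_stuck_uploads():
    """Send SIGKILL to every matching upload process and return the PIDs killed."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    # Skip our own PID in case this code was itself started via `python3 -c`.
    pids = [pid for pid in find_upload_pids(out) if pid != os.getpid()]
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
    return pids


sample = (
    "root      1662 95.8  2.0 182400 20348 ?        R    10:59 124:31 "
    'python3 -c from vyos.remote import upload; upload("/tmp/config.boot.1658", ...)\n'
    "root       878  0.1  0.5  40000  5000 ?        S    10:50   0:01 /usr/lib/frr/zebra\n"
)
print(find_upload_pids(sample))
```

Only find_upload_pids is exercised here; kill_stuck_uploads is left uncalled because it terminates real processes.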

This is related to the parent task: https://phabricator.vyos.net/T3356

erkin changed the task status from Open to Needs testing. Jun 2 2021, 9:57 AM

Now that the Paramiko and Cryptography versions have been updated, does this problem persist with the newer nightlies? @SrividyaA @FileGo

Actually, scratch that. I run an HA pair of VyOS routers via VRRP with a transition script on master/backup. When a router transitions from backup to master, the commit at the end of the script still locks in an endless cycle, combined with what looks like a memory leak in keepalived-fifo.py (which does not occur if commit-archive via scp is not set up).

image.png (91×1 px, 30 KB)

image.png (354×1 px, 197 KB)

erkin changed the task status from Open to In progress. Jul 1 2021, 2:48 PM

Does this problem persist with the latest rolling version?

I'm assuming it does not (I can't replicate it), since the responsible code was rewritten in November to properly use low-level sockets. Let me know if it still persists and I'll try to poke around Paramiko for performance bottlenecks.