Page MenuHomeVyOS Platform

Http api segfault with concurrent requests
Closed, ResolvedPublicBUG

Description

I'm experimenting with Vyos before swapping over from a Unifi USG and in doing so I am wanting to use terraform to codify changes. By default terraform will perform several things in parallel and as I have been adding more code blocks in the terraform provider, i've started to notice frequently I am getting 502 Bad Gateway responses from the http api. To try to reproduce this outside of terraform I am using a node program called "autocannon" to allow making concurrent requests.

When running a single connection it all seems to be fine:

$ npx autocannon --duration 10 --connections 1 --renderStatusCodes --method POST --form '{"data": {"type":"text", "value": "{\"op\": \"showConfig\", \"path\": []}"}, "key": {"type":"text","value":"my-secret-key"}}' 'https://vyos:8443/retrieve'
Running 10s test @ https://vyos:8443/retrieve
1 connections


┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max    │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼────────┤
│ Latency │ 545 ms │ 564 ms │ 603 ms │ 603 ms │ 563.48 ms │ 12.98 ms │ 603 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴────────┘
┌───────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬──────┐
│ Stat      │ 1%   │ 2.5% │ 50%   │ 97.5% │ Avg     │ Stdev   │ Min  │
├───────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼──────┤
│ Req/Sec   │ 1    │ 1    │ 2     │ 2     │ 1.7     │ 0.46    │ 1    │
├───────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼──────┤
│ Bytes/Sec │ 8 kB │ 8 kB │ 16 kB │ 16 kB │ 13.6 kB │ 3.67 kB │ 8 kB │
└───────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴──────┘
┌──────┬───────┐
│ Code │ Count │
├──────┼───────┤
│ 200  │ 17    │
└──────┴───────┘

Req/Bytes counts sampled once per second.
# of samples: 10

18 requests in 10.03s, 136 kB read

Bumping that up to just 2 concurrent connections I start getting errors.

$ npx autocannon --duration 10 --connections 2 --renderStatusCodes --method POST --form '{"data": {"type":"text", "value": "{\"op\": \"showConfig\", \"path\": []}"}, "key": {"type":"text","value":"my-secret-key"}}' 'https://vyos:8443/retrieve'
Running 10s test @ https://vyos:8443/retrieve
2 connections


┌─────────┬──────┬───────┬───────┬────────┬──────────┬──────────┬────────┐
│ Stat    │ 2.5% │ 50%   │ 97.5% │ 99%    │ Avg      │ Stdev    │ Max    │
├─────────┼──────┼───────┼───────┼────────┼──────────┼──────────┼────────┤
│ Latency │ 4 ms │ 47 ms │ 98 ms │ 104 ms │ 48.08 ms │ 27.71 ms │ 216 ms │
└─────────┴──────┴───────┴───────┴────────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%    │ 97.5%  │ Avg    │ Stdev   │ Min    │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec   │ 333    │ 333    │ 572    │ 638    │ 557.6  │ 79.96   │ 333    │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 109 kB │ 109 kB │ 180 kB │ 200 kB │ 176 kB │ 23.7 kB │ 109 kB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
┌──────┬───────┐
│ Code │ Count │
├──────┼───────┤
│ 200  │ 2     │
├──────┼───────┤
│ 502  │ 5574  │
└──────┴───────┘

Req/Bytes counts sampled once per second.
# of samples: 10

2 2xx responses, 5574 non 2xx responses
6k requests in 10.02s, 1.76 MB read

Sometimes I'll get a handful of successful ones and other times I'll get all failures when using this tool.

In the monitor log I also see the http api is crashing/segfaulting:

Feb 12 22:40:15 kernel: vyos-http-api-s[3490]: segfault at f1f47738 ip 00007fccf226dde3 sp 00007fccf1c89d60 error 6 in libvyosconfig.so.0[7fccf21aa000+10f000] likely on CPU 3 (core 3, socket 0)
Feb 12 22:40:15 kernel: Code: 8b 5f 18 48 c7 03 03 00 00 00 48 8b 5f 10 48 83 fb 01 74 4f 49 83 ef 38 4d 3b 7e 08 0f 82 8c 0e 00 00 49 8d 5f 08 48 83 c3 28 <48> c7 43 f8 0b 04 00 00 48 89 03 48 8d 7b e8 48 c7 47 f8 17 08 00
Feb 12 22:40:15 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=11/SEGV

As I run autocannon more, it is odd because its not always the same failure :\ As a developer, I know these are the worst kind of bugs :(

Here are some of the various errors I see when running the concurrent requests (each block was from one run of autocannon).

Feb 13 22:40:31 vyos-http-api[59419]: double free or corruption (out)
Feb 13 22:40:31 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=6/ABRT
Feb 13 22:43:09 kernel: traps: vyos-http-api-s[60009] general protection fault ip:7fcd2df94061 sp:7fcd2f1154f8 error:0 in libvyosconfig.so.0[7fcd2ded7000+10f000]
Feb 13 22:43:09 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=11/SEGV
Feb 13 22:43:43 kernel: vyos-http-api-s[60088]: segfault at 7f41c01bed56 ip 00007f41c02b3ef4 sp 00007f41c0bed4a8 error 7 in libvyosconfig.so.0[7f41c01b0000+10f000] likely on CPU 1 (core 1, socket 0)
Feb 13 22:43:43 kernel: Code: 00 00 00 0f 1f 00 48 8d 47 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f0 48 d1 f8 89 c6 85 c0 0f 88 f0 00 00 00 48 8b 42 28 <48> c7 42 38 ff ff ff ff 48 89 42 20 48 89 42 30 48 8b 0f 48 63 c6
Feb 13 22:43:43 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=11/SEGV
Feb 13 22:44:20 kernel: vyos-http-api-s[60112]: segfault at bf8 ip 00007f56f129a4dd sp 00007f56f1c23d30 error 4
Feb 13 22:44:20 kernel: vyos-http-api-s[60111]: segfault at 4d ip 00007f56f123c339 sp 00007f56f2424b80 error 4
Feb 13 22:44:20 kernel:  in libvyosconfig.so.0[7f56f11e6000+10f000]
Feb 13 22:44:20 kernel:  in libvyosconfig.so.0[7f56f11e6000+10f000] likely on CPU 2 (core 2, socket 0)
Feb 13 22:44:20 kernel:  likely on CPU 3 (core 3, socket 0)
Feb 13 22:44:20 kernel:
Feb 13 22:44:20 kernel: Code: b8 01 00 00 00 48 83 c4 28 c3 0f 1f 00 e9 3b f1 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 48 83 ec 28 48 89 04 24 48 89 5c 24 08 <48> 8b 7b f8 48 c1 ef 0a 48 8d 3c fd ff ff ff ff 48 0f b6 1c 3b 48
Feb 13 22:44:20 kernel: Code: 18 48 89 78 20 48 83 c4 08 c3 e8 92 d1 fa ff eb c6 48 83 ec 48 48 8b 40 10 48 89 44 24 18 48 8b 40 20 48 8b 58 20 48 8b 5b 20 <48> 8b 7b 20 48 8b 77 20 48 89 74 24 10 48 8b 56 20 48 89 54 24 20
Feb 13 22:44:20 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=11/SEGV
Feb 13 22:45:05 kernel: vyos-http-api-s[60174]: segfault at 0 ip 00007f26b2ecc8dd sp 0000000000000000 error 4 in libvyosconfig.so.0[7f26b2dbf000+10f000] likely on CPU 1 (core 1, socket 0)
Feb 13 22:45:05 kernel: Code: 83 c4 08 48 83 c4 08 41 5f 41 5e 41 5d 41 5c 5d 5b c3 48 83 c8 02 eb cf 90 49 f7 86 e8 00 00 00 01 00 00 00 75 09 49 8b 66 10 <41> 8f 46 10 c3 49 89 c4 48 89 c7 5e 48 89 e2 49 8b 4e 10 e8 3b 78
Feb 13 22:45:05 kernel: vyos-http-api-s[60173]: segfault at f ip 00007f26b2e5d7b6 sp 00007f26b3ffde20 error 4 in libvyosconfig.so.0[7f26b2dbf000+10f000] likely on CPU 0 (core 0, socket 0)
Feb 13 22:45:05 kernel: Code: b8 01 00 00 00 48 83 c4 08 c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 18 48 89 c7 48 83 fb 01 74 57 48 89 1c 24 48 89 7c 24 08 <48> 8b 03 48 8b 37 48 89 fb ff d6 48 89 44 24 10 48 8b 04 24 48 8b
Feb 13 22:45:05 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=11/SEGV

Previously I was running an image from June last year when I initially was starting to test out Vyos before getting too busy. I tried upgrading to the latest version and still getting the errors.

1: 1.4-rolling-202302130317 (default boot) (running image)
2: 1.4-rolling-202206250934

Hardware wise, I'm running on an inexpensive Intel N5105 from AliExpress running 16GB RAM, which should be more than sufficient for this, so I doubt it's a hardware/resources issue.

I've started to strip away pieces of the config to see if anything I had added was the cause for it, but so far no luck. Here is my config:

container {
    network services {
        prefix 10.1.0.0/24
    }
}
firewall {
    all-ping enable
}
interfaces {
    ethernet eth0 {
        address dhcp
        description WAN
        hw-id 60:be:b4:02:92:5f
    }
    ethernet eth1 {
        address 10.10.1.1/24
        description MANAGEMENT
        hw-id 60:be:b4:02:29:db
        vif 10 {
            address 192.168.32.1/21
            description SERVERS
        }
        vif 20 {
            address 10.10.21.1/24
            description TRUSTED
        }
        vif 30 {
            address 192.168.2.1/24
            description IOT
        }
        vif 40 {
            address 10.10.40.1/24
            description VIDEO
        }
        vif 50 {
            address 192.168.50.1/24
            description GUEST
        }
    }
    ethernet eth2 {
        address dhcp
        description WAN2
        hw-id 60:be:b4:02:29:dc
    }
    ethernet eth3 {
        address dhcp
        hw-id 60:be:b4:02:29:dd
    }
    loopback lo {
    }
}
nat {
    source {
        rule 1 {
            destination {
                address 0.0.0.0/0
            }
            outbound-interface eth0
            translation {
                address masquerade
            }
        }
    }
}
protocols {
}
service {
    https {
        api {
            debug
            keys {
                id terraform {
                    key ****************
                }
            }
        }
        virtual-host default {
            listen-address 0.0.0.0
            listen-port 8443
        }
    }
    ntp {
        allow-client {
            address 0.0.0.0/0
            address ::/0
        }
        server time1.vyos.net {
        }
        server time2.vyos.net {
        }
        server time3.vyos.net {
        }
    }
    ssh {
        port 22
    }
}
system {
    config-management {
        commit-revisions 100
    }
    conntrack {
        modules {
            ftp
            h323
            nfs
            pptp
            sip
            sqlnet
            tftp
        }
    }
    console {
        device ttyS0 {
            speed 115200
        }
    }
    domain-name local
    host-name gateway
    login {
        user vyos {
            authentication {
                encrypted-password ****************
                plaintext-password ****************
                public-keys brandencash@parchedmac {
                    key ****************
                    type ssh-rsa
                }
                public-keys personal {
                    key ****************
                    type ssh-ed25519
                }
            }
        }
    }
    name-server 1.1.1.1
    syslog {
        global {
            facility all {
                level info
            }
            facility protocols {
                level debug
            }
        }
    }
    time-zone America/Phoenix
}

Any idea what would be going on here?

Details

Version
1.4-rolling-202302130317
Is it a breaking change?
Perfectly compatible
Issue type
Bug (incorrect behavior)

Event Timeline

ammmze updated the task description. (Show Details)
jestabro subscribed.

@ammmze Thanks for the detailed report; allow me to investigate.

Again, thanks for the detailed reproducer; that made investigation straightforward. This appears to be simply an 'async' issue for FastAPI, the underlying web framework for vyos-http-api. FastAPI is very good at managing red/blue issues automatically, but in this case we need to explicitly annotate the endpoint method with async: an explicit lock does not appear necessary, though I will need to confirm. I'll provide a PR shortly. Thanks again !

jestabro changed the task status from Open to In progress.Feb 14 2023, 10:05 PM
jestabro triaged this task as High priority.
jestabro edited a custom field.
jestabro changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.

Awesome, I am glad to hear the provided info was helpful! Thank you for the quick investigation and explanation.

@ammmze That PR is merged, so will be in the next rolling. Kindly let me know of any remaining or other issues you see. I'll add autocannon to my common tests; thanks again for that tip.

jestabro changed the task status from In progress to Needs testing.Feb 15 2023, 5:09 PM

Excellent! I'll check it out and report back. Thank you!

Looks like a new nightly rolling update was finally released (vyos-1.4-rolling-202302231931-amd64) 🎉 I just installed it and so far it looks good. I'm no longer getting the 502 errors / segfault. Thanks again!

jestabro moved this task from Open to Finished on the VyOS 1.4 Sagitta board.
jestabro changed the task status from Resolved to Unknown Status.Jul 12 2023, 1:07 PM

Bug introduced in 1.3.x with upgrade of FastAPI framework in:
https://vyos.dev/T5176

jestabro changed the task status from Unknown Status to Resolved.Aug 24 2023, 6:36 PM
jestabro moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus (1.3.4) board.
jestabro changed the status of subtask T5305: REST API configure operation should not be defined as async from Unknown Status to Resolved.