Page MenuHomeVyOS Platform

HTTP API segfault during concurrent configuration requests
Closed, ResolvedPublicBUG

Description

When two or more configure requests to the API are sent at the same time the server has a chance to segfault.
Reported for /retrieve and fixed in: T5006 and then was removed for /configure in: T5305

this introduces a bug when a configure operation wants to itself make a call to the http-api.

I am do not think that the config itself calling the API is more important than externally well behaved API requests as such features sounds like it is a hack from the get-go.
I am unable to easily find the case T5305 talks about, so I can't test the effect changing this back to async would have on it.

This is very problematic for me as I am currently developing a terraform provider for vyos that I hope to release as alpha in a few weeks for public testing. If this will not be changed I will need to investigate the viability of my provider.
Result of monitor log | grep -i api below

Feb 26 14:01:09 vyos-http-api[74986]: processing form data
Feb 26 14:01:09 vyos-http-api[74986]: processing form data
Feb 26 14:01:11 kernel: traps: vyos-http-api-s[74996] general protection fault ip:7f18cd20a2fe sp:7f18ce3c29d8 error:0 in libvyosconfig.so.0[7f18cd186000+10e000]
Feb 26 14:01:11 systemd[1]: vyos-http-api.service: Main process exited, code=killed, status=11/SEGV
Feb 26 14:01:11 systemd[1]: vyos-http-api.service: Failed with result 'signal'.
Feb 26 14:01:11 systemd[1]: vyos-http-api.service: Consumed 2.481s CPU time.
Feb 26 14:01:11 systemd[1]: vyos-http-api.service: Scheduled restart job, restart counter is at 2.
Feb 26 14:01:11 systemd[1]: Stopped vyos-http-api.service - VyOS HTTP API service.
Feb 26 14:01:11 systemd[1]: vyos-http-api.service: Consumed 2.481s CPU time.
Feb 26 14:01:11 systemd[1]: Started vyos-http-api.service - VyOS HTTP API service.
Feb 26 14:01:11 vyos-http-api[75007]: /usr/libexec/vyos/services/vyos-http-api-server:104: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
Feb 26 14:01:11 vyos-http-api[75007]:   @validator("path")
Feb 26 14:01:11 vyos-http-api[75007]: /usr/share/vyos-http-api-tools/lib/python3.11/site-packages/pydantic/_internal/_config.py:322: UserWarning: Valid config keys have changed in V2:
Feb 26 14:01:11 vyos-http-api[75007]: * 'schema_extra' has been renamed to 'json_schema_extra'
Feb 26 14:01:11 vyos-http-api[75007]:   warnings.warn(message, UserWarning)
Feb 26 14:01:11 vyos-http-api[75007]: Enter main loop...
Feb 26 14:01:11 vyos-http-api[75007]: INFO:     Started server process [75007]
Feb 26 14:01:11 vyos-http-api[75007]: INFO:     Waiting for application startup.
Feb 26 14:01:11 vyos-http-api[75007]: INFO:     Application startup complete.
Feb 26 14:01:11 vyos-http-api[75007]: INFO:     Uvicorn running on unix socket /run/api.sock (Press CTRL+C to quit)

I made a hacky helper script to reproduce the issue, since this seems to be a race-condition type issue the test retries 200 times and stops if the API errors out.

#!/bin/bash

[ "$VYOS_HOST" == "" ] && echo "VYOS_HOST must be set to the dns or ip of your box"
[ "$VYOS_KEY" == "" ] && echo "VYOS_KEY must be set to your api key"

for i in {0..200}; do
    echo -n "$i/200 ";

    curl -k --fail --location --request POST "https://$VYOS_HOST/configure" --form key="$VYOS_KEY" --form data='[
        {"op":"delete","path":["firewall","ipv4","name","example1","default-action","accept"]},
        {"op":"set","path":["firewall","ipv4","name","example1","default-action","accept"]}
    ]' &
    pid1=$!

    curl -k --fail --location --request POST "https://$VYOS_HOST/configure" --form key="$VYOS_KEY" --form data='[
        {"op":"delete","path":["firewall","ipv4","name","example2","default-action","accept"]},
        {"op":"set","path":["firewall","ipv4","name","example2","default-action","accept"]}
    ]' &
    pid2=$!

    echo -n "Result: "

    wait $pid1
    ret1=$?
    wait $pid2
    ret2=$?

    echo

    [ $(($ret1 + $ret2)) != 0 ] && exit $(($ret1 + $ret2))
done

Details

Difficulty level
Normal (likely a few hours)
Version
1.5-rolling-202402240021
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Perfectly compatible
Issue type
Bug (incorrect behavior)

Event Timeline

I have create a pr with the required change to fix this issue if the problem described in T5305 is no longer a blocker.

pr: https://github.com/vyos/vyos-1x/pull/3051

(I wanted to add a dedicated test to ensure this bug is not introduced again as a regression at a later point, but was unable to run the tests in the repo due to some code requireing python 3.10 which some code only worked with eariler versions.)

Cf. the details leading to having changed this for config_op:
https://vyos.dev/T5305

I'll need to look at the competing requirements here before confirming. Thanks for the reproducer and details.

I hope a proper solution can be found as the current behavior of segfaulting the api server is a very unpleasant place to be while working on tooling for vyos.

The only testing I have been able to do is a live change of the server on my local instance, but it seemed to work just fine. Even inverted the reproducer script exit condition and it was able to run 200 times in a row without issue.

sudo vi /usr/libexec/vyos/services/vyos-http-api-server
sudo systemctl restart vyos-http-api
dmbaturin triaged this task as Normal priority.Feb 26 2024, 9:49 PM
jestabro changed Difficulty level from Unknown (require assessment) to Normal (likely a few hours).
jestabro changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.
jestabro changed the task status from Open to Needs reporter action.Mar 7 2024, 2:19 PM

Resolved in my tests using above reproducer; waiting for confirmation from submitter.

@penetal please confirm that this is resolved.

@jestabro I have tested my usecase now and it seems the problem is fixed and the API no longer segfaults. Thank you so much for the fix and the fantastic turn around on this.