Right now, our CI pipeline will not upload nightly builds unless they pass both the "unit" smoke tests and the "config load" (integration) smoke tests. In light of the intermittent StrongSWAN issue, that approach has proved problematic: we have been without nightly builds for two weeks, we found that the issue is not reproducible in a clean state, and people are already complaining: https://forum.vyos.io/t/where-are-the-nightlies/7538
There is no question that we should identify and fix that failure. The question is whether this kind of failure should prevent nightly builds from being uploaded. I believe the answer is no.
Here's why:
- Nightly builds have no stability guarantees; in fact, their purpose is to allow testing. The only completely useless nightly ISO is one that doesn't boot or where every commit fails.
- If there's a failure in the config load tests, the only way anyone can try reproducing the failure is to build an ISO from scratch. This is wasteful, and it also makes it impossible to ask the community for help with testing those failures.
- Finally, as we see from this StrongSWAN issue, those tests are neither idempotent nor granular enough to help us find root causes.
I believe there should be two different jobs:
- ISO build + unit tests.
- Config load tests.
This way we will still be alerted if the latter job fails, but it will not make us build images twice and will let community members help us reproduce those failures.
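
To make the intended gating explicit, here is a minimal sketch in Python rather than in any real CI configuration. Every name in it (`run_nightly`, `StageResult`, the job labels, the callables) is hypothetical and only models the decision logic. It also assumes the upload happens as soon as the first job passes, which is one possible ordering, not a settled design.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    name: str      # hypothetical job label, not a real pipeline name
    passed: bool


def run_nightly(
    build_and_unit_test: Callable[[], bool],
    config_load_tests: Callable[[], bool],
    upload: Callable[[], None],
    alert: Callable[[str], None],
) -> List[StageResult]:
    """Run the two proposed jobs; only the first one gates the upload."""
    results: List[StageResult] = []

    # Job 1: ISO build + unit smoke tests. A failure here blocks the upload.
    job1_ok = build_and_unit_test()
    results.append(StageResult("iso-build-and-unit-tests", job1_ok))
    if not job1_ok:
        alert("iso-build-and-unit-tests failed; nightly not uploaded")
        return results

    # Assumption: the image is published before the config load tests run.
    upload()

    # Job 2: config load tests. A failure alerts us but does not undo the upload.
    job2_ok = config_load_tests()
    results.append(StageResult("config-load-tests", job2_ok))
    if not job2_ok:
        alert("config-load-tests failed; nightly was still uploaded")
    return results
```

Calling `run_nightly` with a passing first job and a failing second job performs the upload and raises exactly one alert, so anyone can download the same image and try to reproduce the failure.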