End-to-end (E2E) test flakiness frequently causes the vets-website continuous integration (CI) workflow to fail. In turn, CI failures waste Veteran-facing Services engineers' time and disrupt daily deploys. The most common response when a test flakes is to rerun it until it passes. This is slow and costly, and it only defers the problem until the test fails again. The Platform website contains documentation that outlines proper ways to test for flakiness, such as running a test in a loop, but there is no way to enforce that this documentation is used unless a support request is filed. Platform assumes that most incidents of E2E flakiness do not result in a support request and are instead silently rerun and never reported.
Motivation
Solving flakiness will remove roadblocks across many aspects of Platform operation. Without flaky tests in the main branch, engineers can sometimes save hours per task, because they no longer have to rerun a workflow that can take 45 minutes or more after a test fails unexpectedly. It would also prevent unexpected disruptions both when branches are merged into main and during our production deployments. Removing flakiness from the test suite also ensures that the quality of our product is verified by stable tests, and it eliminates support requests about flakiness-driven test failures.
Design
In order to view tables in BigQuery you will need to be granted access to the vsp_testing_tools dataset.
Our solution centers around a new BigQuery table called vets_website_e2e_allow_list (viewing this link requires special permissions) which holds attributes of each E2E spec file including:
its relative path (spec_path, string)
whether the spec is allowed or not (allowed, bool)
the titles of the failing tests in the spec (titles, array of strings)
the date the spec file was disallowed (disallowed_at, string)
the date the spec resulted in a warning (warned_at, string)
the workflow run where the spec was enabled/disabled (associated_workflow, string)
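As a rough illustration of the attributes above, one row of the table could be modeled as shown below. This is a sketch only: the class name AllowListRow, the defaults, and the sample spec path are hypothetical, and the actual BigQuery schema may handle nullability and date formats differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AllowListRow:
    """Illustrative model of one vets_website_e2e_allow_list row (not the real schema definition)."""

    spec_path: str                                    # relative path of the spec file
    allowed: bool                                     # whether the spec may run in CI
    titles: List[str] = field(default_factory=list)   # titles of the failing tests in the spec
    disallowed_at: Optional[str] = None               # date the spec was disallowed (string)
    warned_at: Optional[str] = None                   # date the spec resulted in a warning (string)
    associated_workflow: Optional[str] = None         # workflow run where the spec was enabled/disabled


# Hypothetical row for a spec that is currently allowed and has no recorded failures.
example = AllowListRow(
    spec_path="src/applications/example/tests/example.cypress.spec.js",
    allowed=True,
)
```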
Our solution has two parts:
The E2E Test Stability Review GitHub Actions Workflow in vets-website that runs once a day
New jobs (and new steps in existing jobs) in the Continuous Integration GitHub Actions Workflow in vets-website
The E2E Test Stability Review GitHub Actions Workflow
The purpose of this workflow is to discover flaky E2E tests and disallow them from running in CI. This workflow runs once a day and runs all allowed E2E test specs in a loop.
Allowed specs are specs in the vets_website_e2e_allow_list table with an allowed field value of true. If a test fails while running in the loop, the test is deemed flaky and three things happen: the spec file's allowed field in the vets_website_e2e_allow_list table is set to false, the title of the failed test is added to the titles array, and the disallowed_at date field is set to the current datetime. This ensures that the spec is skipped in CI until the test is fixed.
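The row update described above can be sketched in a few lines. This is a minimal illustration, not the workflow's actual script: the function name disallow_spec is hypothetical, the row is represented as a plain dict, and it assumes that the date being set is disallowed_at and that it is stored as an ISO 8601 string.

```python
from datetime import datetime, timezone
from typing import Optional


def disallow_spec(row: dict, failed_title: str, now: Optional[datetime] = None) -> dict:
    """Return a copy of an allow-list row updated after a test failed in the loop."""
    now = now or datetime.now(timezone.utc)
    updated = dict(row)
    updated["allowed"] = False                     # the spec is skipped in CI from now on
    titles = list(row.get("titles") or [])
    if failed_title not in titles:                 # avoid recording the same title twice
        titles.append(failed_title)
    updated["titles"] = titles
    updated["disallowed_at"] = now.isoformat()     # assumed field and format
    return updated
```

Returning a copy rather than mutating the input keeps the original row available, for example to write a before/after entry to a change log.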
The workflow runs once a day because:
The more times tests are run, the greater the opportunity they have to fail if they are flaky
If infrastructure or any dependencies change (e.g. versions of Node, Cypress, or other packages), we want to verify that the changes don't introduce flakiness.
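The first reason can be made concrete with a little arithmetic. Assuming, purely for illustration, that a flaky test fails with a fixed probability on each run and that runs are independent, the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The 2% failure rate below is a hypothetical figure, not a measurement.

```python
def chance_of_catching_flake(p_fail: float, runs: int) -> float:
    """Probability that a test fails at least once in `runs` independent runs."""
    return 1.0 - (1.0 - p_fail) ** runs


# A hypothetical test that fails 2% of the time:
single = chance_of_catching_flake(0.02, 1)    # one run
twenty = chance_of_catching_flake(0.02, 20)   # twenty runs
print(f"{single:.2f} {twenty:.2f}")           # prints "0.02 0.33"
```

Under these assumptions a single run almost never exposes the flake, while repeated runs make it much more likely to surface.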
Key jobs and steps in the E2E Test Stability Review GitHub Actions Workflow
The fetch-e2e-allow-list job grabs the contents of the vets_website_e2e_allow_list table and writes it to a file e2e_allow_list.json
The cypress-tests-prep job downloads the e2e-allow-list.json file and sets the required environment variables
The cypress-tests-prep job sets an environment variable IS_STRESS_TEST to true.
The stress-test-cypress-tests job sets up 20 parallel threads that each process the entire e2e test suite.
The update-e2e-allow-list job checks out the qa-standards-dashboard-data repo and
Updates the vets_website_e2e_allow_list_change_log table (viewing this link requires special permissions), which contains a historical record of when specs are
added/enabled*
added/disabled*
enabled
disabled
Creates an E2E Test Stability Review Summary page that
provides a link to the Mochawesome Report
lists the tests that normally would have run because they're not skipped, but didn't because they're disallowed
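The core rule of the review, that a spec failing in any of the parallel runs is treated as flaky, could be sketched as follows. The function name collect_flaky_specs and the input shape (one mapping of spec path to failed test titles per run) are assumptions made for illustration; the real job works from the Cypress and Mochawesome output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def collect_flaky_specs(runs: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-run failures into {spec_path: sorted unique failed titles}.

    Each element of `runs` maps a spec path to the titles of the tests that
    failed in that run (hypothetical shape). A spec is reported if it failed
    in at least one run.
    """
    flaky = defaultdict(set)
    for run in runs:
        for spec_path, failed_titles in run.items():
            if failed_titles:
                flaky[spec_path].update(failed_titles)
    return {spec: sorted(titles) for spec, titles in flaky.items()}
```

The result has the same shape as the spec_path and titles fields of the allow list, so each entry corresponds to one spec to disallow.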
The Continuous Integration GitHub Actions Workflow
The purpose of the updates to this workflow is to require new and updated E2E tests to be stress-tested before they are merged into master. Additionally, the updates allow engineers to fix failing tests and have them automatically allowed again.
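The automatic re-allow behavior can be sketched as below. This is an assumption-laden illustration: the function name reallow_fixed_specs and the dict-based rows are hypothetical, and the sketch only flips the allowed field because the source does not say whether titles or dates are reset when a spec is allowed again.

```python
from typing import Iterable, List


def reallow_fixed_specs(
    allow_list: List[dict],
    stress_tested: Iterable[str],
    failed: Iterable[str],
) -> List[dict]:
    """Return new rows with `allowed` set back to True for specs that passed the stress test."""
    passed = set(stress_tested) - set(failed)
    updated = []
    for row in allow_list:
        new_row = dict(row)
        # Only previously disallowed specs that were stress-tested and passed are re-allowed.
        if not row["allowed"] and row["spec_path"] in passed:
            new_row["allowed"] = True
        updated.append(new_row)
    return updated
```

A disallowed spec that was not part of the stress test stays disallowed, which matches the requirement that a fix must be verified before the spec runs in CI again.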
Key jobs and steps in the Continuous Integration GitHub Actions Workflow
The fetch-allow-lists job uses a script in the qa-standards-dashboard-data repository to grab the contents of the vets_website_e2e_allow_list table, write it to a file, and pass the file to GitHub
The cypress-tests-prep job retrieves the e2e-allow-list file from GitHub artifact storage
Test Selection sets the following environment variables, which are then set as outputs
TESTS, the output is tests
TESTS_TO_STRESS_TEST, the output is tests-to-stress-test (the selected tests that are new or updated)
The cypress-tests job runs tests normally (i.e. the tests are split up amongst a number of Cypress runners and are only run once)
The stress-test-cypress-tests job runs all tests-to-stress-test in ten parallel threads. This job runs in parallel with the cypress-tests job, so it doesn't add any time to the workflow.
The update-e2e-allow-list job checks out the qa-standards-dashboard-data repo and
Updates any specs that were previously set to allowed=false to allowed=true if they pass the stress test
Updates the vets_website_e2e_allow_list_change_log table (viewing this link requires special permissions), which contains a historical record of when specs are
added/enabled*
added/disabled*
enabled
disabled
Creates an E2E Test Stability Review Summary page that
provides a link to the Mochawesome Report
lists the tests that would normally have run because they're not skipped, but didn't because they're disallowed
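To make the relationship between the allow list and the two outputs concrete, here is a simplified sketch of test selection. It is not the actual vets-website selection logic, which is not described here. The function name select_tests and its inputs are hypothetical, and it assumes that specs missing from the allow list are treated as runnable and that changed specs are stress-tested even when currently disallowed, so that a fix can be verified.

```python
import json
from typing import Dict, Iterable, List


def select_tests(
    all_specs: Iterable[str],
    changed_specs: Iterable[str],
    allow_list: List[dict],
) -> Dict[str, str]:
    """Return JSON strings for the `tests` and `tests-to-stress-test` outputs."""
    disallowed = {row["spec_path"] for row in allow_list if not row["allowed"]}
    changed = set(changed_specs)
    specs = sorted(all_specs)
    # Normal run: every spec that is not currently disallowed.
    tests = [s for s in specs if s not in disallowed]
    # Stress test: new or updated specs, including disallowed ones that may have been fixed.
    tests_to_stress_test = [s for s in specs if s in changed]
    return {
        "tests": json.dumps(tests),
        "tests-to-stress-test": json.dumps(tests_to_stress_test),
    }
```

Serializing the lists as JSON strings is one common way to pass them between GitHub Actions jobs, since job outputs are plain strings.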
Additional Notes
The allow-list-notify.yml and allow-list-publish.yml files in the qa-standards-dashboard-data repo contain workflows that automate two features, which run at 9am Eastern Time, M-F.
Confluence/Platform Website page
The first workflow is located here. It creates a page in Confluence, which is then published to Platform Docs during our platform’s regularly scheduled publishing. The page lists the test specs that are currently disallowed, how many days each has been disallowed, and, where it can be determined, the team designated as the owner of the spec. This workflow runs once daily, M-F.
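The contents of such a page could be derived from the allow list roughly as follows. This is a sketch under stated assumptions: the function name disallowed_report and the owners mapping are hypothetical, disallowed_at is assumed to be an ISO 8601 string, and how the real workflow determines a spec's owning team is not covered here.

```python
from datetime import date, datetime
from typing import Dict, List, Optional


def disallowed_report(
    rows: List[dict],
    today: date,
    owners: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build one summary line per disallowed spec: path, days disallowed, owner."""
    owners = owners or {}
    lines = []
    for row in rows:
        if row["allowed"]:
            continue  # only disallowed specs appear in the report
        disallowed_on = datetime.fromisoformat(row["disallowed_at"]).date()
        days = (today - disallowed_on).days
        owner = owners.get(row["spec_path"], "unknown")  # owner may not be determinable
        lines.append(f'{row["spec_path"]}: disallowed for {days} days (owner: {owner})')
    return lines
```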
Slack Notification
A Slack notification also fires in the #vfs-all-teams channel, delivering this information directly to a place where QA issues can be followed up on if necessary. This notification fires every Tuesday and summarizes the number of unit tests and E2E tests that are disabled.