E2E Stress Test and Allow List

Background

End-to-end (E2E) test flakiness frequently causes the vets-website continuous integration (CI) workflow to fail. In turn, CI failures waste Veteran-facing Services engineers' time and disrupt daily deploys. The most common response to a flaky test is to rerun it until it passes. This is slow and costly, and it only defers the problem until the test fails again. The Platform website documents appropriate ways to test for flakiness, such as running a test in a loop, but there is no way to enforce that this documentation is used unless a support request is filed. Platform assumes that most incidents of E2E flakiness do not result in a support request and that the tests are instead silently rerun without being reported.

Motivation

Solving flakiness will remove roadblocks across many aspects of Platform operation. Without flaky tests in the main branch, engineers can save hours per task by not having to rerun a workflow that can take 45 minutes or more because a test failed unexpectedly. It would also prevent unexpected delays both when merging branches into main and in production deployments. Removing flakiness from the test suite also ensures that product quality is verified by stable tests and eliminates support requests about flakiness-driven test failures.

Design

In order to view tables in BigQuery you will need to be granted access to the vsp_testing_tools dataset.

Our solution centers around a new BigQuery table called vets_website_e2e_allow_list (viewing this link requires special permissions) which holds attributes of each E2E spec file including:

its relative path (spec_path, string)
whether the spec is allowed or not (allowed, bool)
the titles of the failing tests in the spec (titles, array of strings)
the date the spec file was disallowed (disallowed_at, string)
the date the spec resulted in a warning (warned_at, string)
the workflow run where the spec was enabled/disabled (associated_workflow, string)
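
As a rough illustration of the attributes listed above, one row of the table could be modeled as follows. This is only a sketch: the field names and types come from the list, but the class name `AllowListRecord`, the default values, and the idea that the date fields may be empty are assumptions, and the example path is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AllowListRecord:
    """One row of vets_website_e2e_allow_list (hypothetical in-memory model)."""

    spec_path: str                                    # relative path of the spec file
    allowed: bool                                     # True if the spec may run in CI
    titles: List[str] = field(default_factory=list)   # titles of the failing tests
    disallowed_at: Optional[str] = None               # date the spec was disallowed
    warned_at: Optional[str] = None                   # date the spec resulted in a warning
    associated_workflow: Optional[str] = None         # run where it was enabled/disabled


# Hypothetical example row; the path is invented for illustration.
record = AllowListRecord(
    spec_path="src/applications/example/tests/example.cypress.spec.js",
    allowed=True,
)
print(record.allowed, record.titles)
```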

There are two parts of our solution:

The E2E Test Stability Review GitHub Actions Workflow in vets-website that runs once a day
New jobs (and new steps in existing jobs) in the Continuous Integration GitHub Actions Workflow in vets-website

The E2E Test Stability Review GitHub Actions Workflow

Example E2E Test Stability Review workflow

The purpose of this workflow is to discover flaky E2E tests and disallow them from running in CI. This workflow runs once a day and runs all allowed E2E test specs in a loop.

Allowed specs are specs in the vets_website_e2e_allow_list table with an allowed field value of true. If a test fails while running in the loop, the test is deemed flaky and the spec's record in the vets_website_e2e_allow_list table is updated: the allowed field is set to false, the title of the failed test is added to the titles array, and the disallowed_at field is set to the current datetime. This ensures that the spec is skipped in CI until it is fixed.
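
The record update described above can be sketched as a small function. This is not the workflow's actual code: the function name `disallow_spec` is hypothetical, and it assumes that the date written is `disallowed_at`, that dates are stored as ISO 8601 strings, and that a title is not added twice.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def disallow_spec(record: Dict, failed_title: str, now: Optional[datetime] = None) -> Dict:
    """Return a copy of an allow-list row marked as disallowed after a failure."""
    now = now or datetime.now(timezone.utc)
    updated = dict(record)
    updated["allowed"] = False
    # Copy the titles so the original row is left untouched.
    titles = list(record.get("titles", []))
    if failed_title not in titles:
        titles.append(failed_title)
    updated["titles"] = titles
    # Assumption: the timestamp is stored as an ISO 8601 string.
    updated["disallowed_at"] = now.isoformat()
    return updated


row = {"spec_path": "example.cypress.spec.js", "allowed": True, "titles": []}
flagged = disallow_spec(row, "loads the page", datetime(2024, 1, 2, tzinfo=timezone.utc))
print(flagged)
```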

The workflow runs once a day because:

The more times tests are run, the greater the opportunity they have to fail if they are flaky
If infrastructure changes, or if any dependencies change (e.g. versions of node, packages, including Cypress, etc.) we want to verify that changes don't introduce flakiness.
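
The first reason can be made concrete with a little arithmetic. Assuming, purely for illustration, that a flaky test fails independently with a fixed probability on each run, the chance of catching it at least once grows quickly with the number of runs. The 5% failure rate and the run count of 20 below are example values, not measurements.

```python
def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that a test with the given per-run failure rate fails at least once."""
    # P(at least one failure) = 1 - P(every run passes)
    return 1.0 - (1.0 - failure_rate) ** runs


single = detection_probability(0.05, 1)
looped = detection_probability(0.05, 20)
print(f"1 run: {single:.2f}, 20 runs: {looped:.2f}")  # prints "1 run: 0.05, 20 runs: 0.64"
```

Under these assumptions, a test that fails 1 time in 20 would slip through a single run 95% of the time but would be caught by a 20-run loop roughly 64% of the time, and repeating the loop daily raises the odds further.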

Key jobs and steps in the `E2E Test Stability Review` GitHub Actions Workflow

The fetch-e2e-allow-list job grabs the contents of the vets_website_e2e_allow_list table and writes it to a file, e2e-allow-list.json
The cypress-tests-prep job downloads the e2e-allow-list.json file and sets the required environment variables
The cypress-tests-prep job sets an environment variable IS_STRESS_TEST to true.
The stress-test-cypress-tests job sets up 20 parallel threads that each process the entire e2e test suite.
The update-e2e-allow-list job checks out the qa-standards-dashboard-data repo and
- Creates and publishes a Mochawesome Report
- Updates the vets_website_e2e_allow_list table, disallowing any tests that failed
- Updates the vets_website_e2e_allow_list_change_log table (viewing this link requires special permissions), which contains a historical record of when specs are
  - added/enabled*
  - added/disabled*
  - enabled
  - disabled
- Creates an E2E Test Stability Review Summary page that
  - provides a link to the Mochawesome Report
  - lists the tests that normally would have run because they're not skipped, but didn't because they're disallowed
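
One way to picture how the results of the parallel runs feed the table update is to merge them into a map of spec paths to failed test titles. The sketch below is an assumption about the shape of that step, not the repository's implementation; `SpecResult` and `collect_failures` are hypothetical names, and the real job works from Mochawesome report data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class SpecResult(NamedTuple):
    """Outcome of one test in one stress-test run (hypothetical shape)."""

    spec_path: str
    title: str
    passed: bool


def collect_failures(runs: Iterable[Iterable[SpecResult]]) -> Dict[str, List[str]]:
    """Map each spec path to the titles that failed in at least one run."""
    failures: Dict[str, List[str]] = defaultdict(list)
    for run in runs:
        for result in run:
            if not result.passed and result.title not in failures[result.spec_path]:
                failures[result.spec_path].append(result.title)
    return dict(failures)


# Two example runs: spec "b" passes once and fails once, so it counts as flaky.
runs = [
    [SpecResult("a.cypress.spec.js", "loads", True), SpecResult("b.cypress.spec.js", "submits", True)],
    [SpecResult("a.cypress.spec.js", "loads", True), SpecResult("b.cypress.spec.js", "submits", False)],
]
to_disallow = collect_failures(runs)
print(to_disallow)
```

A single failure in any run is enough to flag the spec here, which matches the rule that a test failing in the loop is deemed flaky.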

The Continuous Integration GitHub Actions Workflow

Example of E2E test stability review running in CI workflow

The purpose of the updates to this workflow is to require new and updated E2E tests to be stress-tested before they are merged into main. Additionally, the updates allow engineers to fix failing tests and have them automatically allowed again.
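
The "automatically allowed again" behavior can be sketched as follows. This is a minimal illustration under stated assumptions: `reallow_if_stable` is a hypothetical name, a spec is re-allowed only when every stress-test run of it passed, and the other fields of the record are left untouched because the source does not say what happens to them.

```python
from typing import Dict, List


def reallow_if_stable(record: Dict, stress_results: List[bool]) -> Dict:
    """Re-allow a disallowed spec only if every stress-test run of it passed."""
    updated = dict(record)
    # An empty result list means the spec was not stress-tested, so nothing changes.
    if not record["allowed"] and stress_results and all(stress_results):
        updated["allowed"] = True
    return updated


disallowed = {"spec_path": "example.cypress.spec.js", "allowed": False}
fixed = reallow_if_stable(disallowed, [True] * 10)
still_flaky = reallow_if_stable(disallowed, [True] * 9 + [False])
print(fixed["allowed"], still_flaky["allowed"])  # prints "True False"
```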

Key jobs and steps in the `Continuous Integration` GitHub Actions Workflow

The fetch-allow-lists job uses a script in the qa-standards-dashboard-data repository to grab the contents of the vets_website_e2e_allow_list table, write it to a file, and pass the file to GitHub
The cypress-tests-prep job retrieves the e2e-allow-list file from GitHub artifact storage
Test Selection sets the following environment variables, which are then exposed as outputs
- TESTS, the output is tests
- TESTS_TO_STRESS_TEST, the output is tests-to-stress-test (the selected tests that are new or updated)
The cypress-tests job runs tests normally (i.e. the tests are split up amongst a number of Cypress runners and are only run once)
The stress-test-cypress-tests job runs all tests-to-stress-test in ten parallel threads. This job runs in parallel to the cypress-tests job so it doesn't add any time to the workflow.
The update-e2e-allow-list job checks out the qa-standards-dashboard-data repo and
- Creates and publishes a Mochawesome Report
- Updates the vets_website_e2e_allow_list table
  - Adds records for new spec paths
  - Updates any specs that were previously set to allowed=false to allowed=true if they pass the stress test
- Updates the vets_website_e2e_allow_list_change_log table (viewing this link requires special permissions), which contains a historical record of when specs are
  - added/enabled*
  - added/disabled*
  - enabled
  - disabled
- Creates an E2E Test Stability Review Summary page that
  - provides a link to the Mochawesome Report
  - lists the tests that would normally have run because they're not skipped, but didn't because they're disallowed
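
The test selection described in the list above can be approximated with a short function. The actual selection logic in vets-website is not shown here, so treat this as a sketch: `select_tests` is a hypothetical name, specs missing from the allow list are assumed to be new and runnable, and a changed spec is assumed to be stress-tested even if it is currently disallowed so that a fix can re-allow it.

```python
from typing import Dict, Iterable, List, Tuple


def select_tests(
    all_specs: Iterable[str],
    allow_list: Dict[str, bool],
    changed_specs: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return (tests, tests_to_stress_test) for one CI run."""
    changed = set(changed_specs)
    specs = sorted(all_specs)
    # Assumption: a spec that is not in the allow list yet is new and may run.
    tests = [s for s in specs if allow_list.get(s, True)]
    # New or updated specs are stress-tested, including edited disallowed ones.
    tests_to_stress_test = [s for s in specs if s in changed or s not in allow_list]
    return tests, tests_to_stress_test


all_specs = ["a.cypress.spec.js", "b.cypress.spec.js", "c.cypress.spec.js"]
allow_list = {"a.cypress.spec.js": True, "b.cypress.spec.js": False}
tests, tests_to_stress_test = select_tests(all_specs, allow_list, ["b.cypress.spec.js"])
print(tests)
print(tests_to_stress_test)
```

In this example, spec `b` is disallowed, so it is left out of the normal run, but because it was edited it is stress-tested alongside the new spec `c`.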

Additional Notes

The allow-list-notify.yml and allow-list-publish.yml files in the qa-standards-dashboard-data repo contain workflows that automate two features, which run at 9am Eastern time, Monday through Friday.

Confluence/Platform Website page

The first workflow creates a page in Confluence, which is published to Platform Docs during the platform's regularly scheduled publishing. The page lists the test specs that are currently disallowed, how many days each has been disallowed, and, where it can be determined, the team designated as the owner of the spec. This workflow runs once daily, Monday through Friday.

Example screen shot of the list of disallowed e2e tests
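
The "days disallowed" figure on that page can be derived from the disallowed_at field. The sketch below assumes the field holds an ISO 8601 date or datetime string, which the source does not confirm; the function names are hypothetical, and the owning-team lookup is omitted because its source is not described.

```python
from datetime import date, datetime
from typing import Dict, List


def days_disallowed(disallowed_at: str, today: date) -> int:
    """Whole days between the disallowed_at date and today."""
    return (today - datetime.fromisoformat(disallowed_at).date()).days


def disallowed_report(rows: List[Dict], today: date) -> List[Dict]:
    """List disallowed specs with their age in days, longest-disallowed first."""
    report = [
        {"spec_path": r["spec_path"], "days_disallowed": days_disallowed(r["disallowed_at"], today)}
        for r in rows
        if not r["allowed"]
    ]
    return sorted(report, key=lambda item: item["days_disallowed"], reverse=True)


rows = [
    {"spec_path": "a.cypress.spec.js", "allowed": False, "disallowed_at": "2024-01-01T00:00:00+00:00"},
    {"spec_path": "b.cypress.spec.js", "allowed": True, "disallowed_at": None},
    {"spec_path": "c.cypress.spec.js", "allowed": False, "disallowed_at": "2024-01-08T12:00:00+00:00"},
]
report = disallowed_report(rows, date(2024, 1, 10))
print(report)
```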

Slack Notification

A Slack notification is also posted in the #vfs-all-teams channel, delivering this information to a place where QA issues can be followed up on if necessary. The notification is sent every Tuesday and summarizes the number of unit tests and E2E tests that are disabled.

Example notification to the #vfs-all-teams Slack channel
