Background

End-to-end (E2E) test flakiness frequently causes the vets-website continuous integration (CI) workflow to fail. In turn, CI failures waste VFS engineer time and disrupt daily deploys. The most common response to a flaky test is to rerun it until it passes. This is slow and costly, and it only defers the problem until the test fails again. Platform Website contains documentation that outlines proper ways to test for flakiness, such as running a test in a loop, but there is no way to enforce that this documentation is used unless a support request is filed. Most incidents of E2E flakiness do not result in a support request; the tests are silently rerun and never reported. This is supported by comparing our flakiness tracking in Domo with the number of flakiness-related support requests we receive.


Motivation

Solving flakiness will remove roadblocks across many aspects of platform operation. Without flaky tests, engineers could sometimes save hours per task by not having to rerun a workflow that can take up to 45 minutes because a test failed unexpectedly. It would also prevent unexpected hiccups both when branches merge into main and during production deployments. Finally, removing flakiness would ensure that the quality of our product is verified by a stable test suite, while eliminating support requests caused by flakiness-driven test failures.


Design

Our solution centers on a new BigQuery table called vets_website_e2e_allow_list, which holds the following attributes of each E2E spec file:

  • its relative path (spec_path, string)

  • whether the spec is allowed or not (allowed, bool)

  • the titles of the failing tests in the spec (titles, array of strings)

  • the date the spec file was disallowed (disallowed_at, string)
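
To make the record shape concrete, a single row might look like the sketch below once it has been read into JavaScript. The four field names come from the list above. The example path, the test title, the ISO 8601 timestamp format, and the use of null for a spec that has never been disallowed are assumptions for illustration, not the actual table contents.

```javascript
// Hypothetical row from vets_website_e2e_allow_list (all values are illustrative).
const exampleRow = {
  spec_path: 'src/applications/example-app/tests/example.cypress.spec.js', // assumed path
  allowed: false,
  titles: ['submits the form successfully'], // titles of the failing tests
  disallowed_at: '2023-01-15T14:00:00.000Z', // assumed ISO 8601 string
};

// Checks that a row has the four documented fields with the documented types.
// Accepting null for disallowed_at is an assumption made for this sketch.
function isAllowListRow(row) {
  return (
    typeof row.spec_path === 'string' &&
    typeof row.allowed === 'boolean' &&
    Array.isArray(row.titles) &&
    row.titles.every(title => typeof title === 'string') &&
    (row.disallowed_at === null || typeof row.disallowed_at === 'string')
  );
}
```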

There are two parts of our solution:

  1. The E2E Stress Test GitHub Actions Workflow in vets-website that runs once a day

  2. New jobs (and new steps in existing jobs) in the Continuous Integration GitHub Actions Workflow in vets-website

The E2E Stress Test GitHub Actions Workflow

The purpose of this workflow is to discover flaky E2E tests and disallow them from running in CI. This workflow runs once a day and runs all allowed E2E test specs in a loop.

Allowed specs are specs in the vets_website_e2e_allow_list table whose allowed field is true. If a test fails while running in the loop, the test is deemed flaky and the spec file's record in the table is updated: allowed is set to false, the title of the failed test is added to the titles array, and disallowed_at is set to the current datetime. This ensures that the test is skipped in CI until it is fixed.
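
The disallow rule described above can be sketched as a small pure function. This is an illustrative sketch, not the workflow's actual code: the function name, the shape of the failures argument ({ specPath, title }), and the ISO 8601 timestamp format are all assumptions.

```javascript
// Marks specs as disallowed based on failures seen during the stress test.
// `failures` is a hypothetical shape: [{ specPath, title }, ...].
function disallowFlakySpecs(allowList, failures, now = new Date()) {
  return allowList.map(row => {
    const failedTitles = failures
      .filter(failure => failure.specPath === row.spec_path)
      .map(failure => failure.title);

    // No failures for this spec: leave the row untouched.
    if (failedTitles.length === 0) return row;

    return {
      ...row,
      allowed: false,
      // Add the failed titles to any already recorded, without duplicates.
      titles: [...new Set([...row.titles, ...failedTitles])],
      disallowed_at: now.toISOString(), // assumed timestamp format
    };
  });
}
```

Returning new rows instead of mutating the input keeps the function easy to test in isolation from BigQuery.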

The workflow runs once a day because:

  1. The more times tests are run, the more opportunities flaky tests have to fail.

  2. If infrastructure or dependencies change (e.g. versions of Node, Cypress, or other packages), we want to verify that the changes don't introduce flakiness.

Key jobs and steps in the E2E Stress Test GitHub Actions Workflow
  • The fetch-e2e-allow-list job grabs the contents of the vets_website_e2e_allow_list table and sets it as an output called allow_list

  • The cypress-tests-prep job passes two new env vars into Test Selection (script/github-actions/select-cypress-tests.js): ALLOW_LIST, which is set to the allow_list output from fetch-e2e-allow-list, and IS_STRESS_TEST, which is set to true

  • Test Selection filters the allow list by allowed=true and sets the new env var TESTS_TO_STRESS_TEST to the filtered list. This value is assigned to the output called tests.

  • The stress-test-cypress-tests job passes a new env var called IS_STRESS_TEST to the run-cypress-tests.js script. If IS_STRESS_TEST is set to true, it runs each batch of tests in a loop.

  • The update-e2e-allow-list job checks out the qa-standards-dashboard-data repo and

    • Creates and publishes a Mochawesome Report

    • Updates the vets_website_e2e_allow_list table, disallowing any tests that failed

    • Updates the vets_website_e2e_allow_list_change_log table, which contains a historical record of when specs are

      • added/enabled*

      • added/disabled*

      • enabled

      • disabled

    • Creates a new GitHub Issue on the va.gov-team repo for each spec file that has a newly detected flaky test

      • Using the Product Directory, we identify which product the test belongs to and retrieve the GitHub team label for the product, if the label exists

      • When the new issue is created, it's given the e2e-flaky-test label and the product's team label, if it exists

      • The issue will stand out due to the croissants in the title, denoting the flakiness of it all.

    • Creates an E2E Stress Test Summary page that

      • provides a link to the Mochawesome Report

      • lists the tests that normally would have run because they're not skipped, but didn't because they're disallowed
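
The selection and looping steps in this workflow can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the contents of select-cypress-tests.js or run-cypress-tests.js: the function names, the runBatch callback (a stand-in for whatever actually invokes Cypress), and the { specPath, title } failure shape are hypothetical.

```javascript
// Returns the spec paths the stress test should run (rows with allowed === true).
function selectAllowedSpecs(allowList) {
  return allowList.filter(row => row.allowed === true).map(row => row.spec_path);
}

// Runs the batch of specs `iterations` times and collects every failure.
// `runBatch` is a hypothetical async function that runs the specs once and
// resolves to an array of { specPath, title } failures for that pass.
async function stressTest(specs, runBatch, iterations) {
  const failures = [];
  for (let i = 0; i < iterations; i += 1) {
    // Passes run one after another so that each pass sees a fresh run.
    const passFailures = await runBatch(specs);
    failures.push(...passFailures);
  }
  return failures;
}
```

In the workflow itself, the allow list would presumably be parsed from the ALLOW_LIST env var (for example with JSON.parse, assuming it is passed as a JSON string) before being filtered.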


The Continuous Integration GitHub Actions Workflow

The purpose of the updates to this workflow is to require new and updated E2E tests to be stress-tested before they are merged into main. Additionally, the updates allow engineers to fix failing tests and have them automatically allowed again.

Key jobs and steps in the Continuous Integration GitHub Actions Workflow
  • The fetch-e2e-allow-list job grabs the contents of the vets_website_e2e_allow_list table and sets it as an output called allow_list

  • The cypress-tests-prep job passes a new env var called ALLOW_LIST into Test Selection (script/github-actions/select-cypress-tests.js). It is set to the allow_list output from fetch-e2e-allow-list

  • Test Selection sets the following env vars which are then set as outputs

    • TESTS, the output is tests

    • TESTS_TO_STRESS_TEST, the output is tests-to-stress-test (these are the selected tests that are new or updated)

    • TEST_SELECTION_DISALLOWED_TESTS, the output is test_selection_disallowed_tests (these are the selected tests that are currently set to allowed=false)

  • The cypress-tests job runs tests normally (i.e. the tests are split up amongst a number of Cypress runners and are only run once)

  • The stress-test-cypress-tests job runs tests-to-stress-test in a single Cypress instance, in a loop. This job runs in parallel to the cypress-tests job so it doesn't add any time to the workflow.

  • The update-e2e-allow-list job checks out the qa-standards-dashboard-data repo and

    • Creates and publishes a Mochawesome Report

    • Updates the vets_website_e2e_allow_list table

      • Adds records for new spec paths

      • Updates any tests that were previously set to allowed=false to allowed=true if they pass the stress test

    • Updates the vets_website_e2e_allow_list_change_log table, which contains a historical record of when specs are

      • added/enabled*

      • added/disabled*

      • enabled

      • disabled

    • Creates an E2E Stress Test Summary page that

      • provides a link to the Mochawesome Report

      • lists the tests that would normally have run because they're not skipped, but didn't because they're disallowed
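
The grouping and allow-list update described in this list can be sketched as below. This is an illustration under several assumptions rather than the real Test Selection logic: the function names and the changedSpecs input are hypothetical, the three groups are treated as disjoint for simplicity, and clearing titles and disallowed_at when a spec is re-allowed is a guess at the intended behavior.

```javascript
// Splits the selected specs into the three groups that become workflow outputs.
// `changedSpecs` (new or updated spec paths) is a hypothetical input.
function partitionSelectedTests(selectedSpecs, changedSpecs, allowList) {
  const disallowed = new Set(
    allowList.filter(row => row.allowed === false).map(row => row.spec_path),
  );
  const changed = new Set(changedSpecs);
  return {
    // New or updated specs are stress-tested, so a fixed spec can be re-allowed.
    testsToStressTest: selectedSpecs.filter(spec => changed.has(spec)),
    // Untouched disallowed specs are skipped and listed on the summary page.
    disallowedTests: selectedSpecs.filter(
      spec => disallowed.has(spec) && !changed.has(spec),
    ),
    // Everything else runs once in the normal cypress-tests job.
    tests: selectedSpecs.filter(
      spec => !disallowed.has(spec) && !changed.has(spec),
    ),
  };
}

// Applies stress-test results: adds rows for new spec paths, re-allows specs
// that passed, and disallows specs that failed.
// `failures` is a hypothetical shape: [{ specPath, title }, ...].
function updateAllowList(allowList, stressTestedSpecs, failures, now = new Date()) {
  const known = new Set(allowList.map(row => row.spec_path));
  const tested = new Set(stressTestedSpecs);
  const newRows = stressTestedSpecs
    .filter(spec => !known.has(spec))
    .map(spec => ({ spec_path: spec, allowed: true, titles: [], disallowed_at: null }));

  return [...allowList, ...newRows].map(row => {
    if (!tested.has(row.spec_path)) return row; // not stress-tested: unchanged
    const failedTitles = failures
      .filter(failure => failure.specPath === row.spec_path)
      .map(failure => failure.title);
    return failedTitles.length === 0
      ? { ...row, allowed: true, titles: [], disallowed_at: null }
      : { ...row, allowed: false, titles: failedTitles, disallowed_at: now.toISOString() };
  });
}
```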


Additional Notes

The allow-list.yml file in the qa-standards-dashboard-data repo contains a workflow that automates two features, which run at 9 a.m. Eastern Time, Monday through Friday.

Confluence/Platform Website page

This feature creates a page in Confluence, which is then published to Platform Docs during our platform’s regularly scheduled publishing. The page lists the test specs that are currently disallowed, along with the number of days each has been disallowed.
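
The "days disallowed" figure can be derived from the disallowed_at field. The sketch below assumes that disallowed_at holds a string that JavaScript's Date can parse (such as ISO 8601) and that partial days are rounded down; the function name and the output shape are hypothetical.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Lists currently disallowed specs with the number of whole days since
// each was disallowed. Assumes disallowed_at is a Date-parseable string.
function listDisallowedSpecs(allowList, now = new Date()) {
  return allowList
    .filter(row => row.allowed === false)
    .map(row => ({
      spec_path: row.spec_path,
      days_disallowed: Math.floor(
        (now.getTime() - new Date(row.disallowed_at).getTime()) / MS_PER_DAY,
      ),
    }));
}
```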

Slack Notification

A Slack notification is also sent to the #qas-notifications channel, delivering this information directly to a place where QA issues can be followed up on if necessary:

[Image: example of a team notification in #qas-notifications]