
Incident management

Last Updated: April 2, 2025

This document helps team members working on VA.gov understand how to report incidents to the Platform and what information to provide so that reported issues can be routed to the appropriate teams. Please read it carefully to determine the appropriate path for your incident.

Is this a Severity 1 Issue? - Notify On-Call Immediately

  1. Determine if the issue is a major outage of VA.gov or a limited impact incident, if possible.

    1. Examples:

      • VA.gov is unresponsive

      • Platform errors causing extreme delays for users

      • Login to VA.gov is unavailable

  2. Alert the Incident Commander. (NOTE: If you do not have a PagerDuty license, skip to step 3.)

    1. In the #vfs-platform-support Slack channel, trigger an alert with the command:
      /pd trigger

      PagerDuty Trigger Screen


An incident is considered Severity 1 when, for example (this list is not exhaustive):

  • The issue must be resolved within 24 hours

  • Part or all of VA.gov is severely disrupted

In these cases, escalate to #oncall immediately using the steps above.

  3. If you do not have a PagerDuty license, create a Support Ticket in the #vfs-platform-support Slack channel using the “Incident” Request Topic. If it is outside normal business hours, also send an email to incidentcommander@dsva.pagerduty.com to page the IC on duty. Include all pertinent details and links.

Support Ticket with Request topic as Incident
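For reporters without a PagerDuty license who want to script the after-hours email described above, the following is a minimal Python sketch using only the standard library. It builds the message but does not send it. The function name, parameters, subject format, and body layout are hypothetical illustrations, not a format the Platform prescribes; only the recipient address comes from this document.

```python
from email.message import EmailMessage

# Address taken from this document; everything else below is illustrative.
IC_PAGER_ADDRESS = "incidentcommander@dsva.pagerduty.com"


def build_ic_page_email(sender, summary, details, links=()):
    """Build (but do not send) an email intended to page the IC on duty.

    The parameter names and the subject/body layout are assumptions.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = IC_PAGER_ADDRESS
    msg["Subject"] = f"Incident: {summary}"

    # Put the description first, then any supporting links.
    body_lines = [details]
    if links:
        body_lines.append("")
        body_lines.append("Links:")
        body_lines.extend(f"- {link}" for link in links)
    msg.set_content("\n".join(body_lines))
    return msg


# Example with placeholder values.
message = build_ic_page_email(
    sender="reporter@example.com",
    summary="Login to VA.gov is unavailable",
    details="Sign-in attempts have been failing since 01:15 Eastern.",
    links=["https://example.com/support-ticket"],
)
print(message["To"])
print(message["Subject"])
```

Sending the message is deliberately left out, since the mail setup differs between teams.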


When to Report an Issue to be Triaged

General Rule: VFS teams are responsible for finding and fixing bugs in the products within their jurisdiction (see the sections below). However, report an issue for triage if it meets any of the following criteria:

  • Seems systemic

  • Seems related to a product or service provided by the Platform

  • Seems related to a product managed by another VFS team

  • Has an unknown source and is causing problems for VA.gov users

Examples:

  • An internal load testing tool is broken

  • Mock data is not working or is out of date

  • Metrics are being reported incorrectly or not reported at all

  • You are on the Global UX team and learn in research sessions that many Veterans are having trouble accessing their education benefits

How to Report the Issue to be Triaged:

Choose one of the following:

  • If you know which team the issue should be routed to, reach out to their point of contact to confirm and directly assign the issue to that team.

  • If you aren't sure which team owns the issue and would like to send it directly to them without the assistance of Triage, the Product Directory can help you identify the owner.

  • If you aren't sure which team owns the issue and want to submit it to the Platform for triage, submit a Platform Support Ticket with the “Incident” Request Topic, following the current support process.

    • NOTE: If a GitHub Issue already exists, mention it when creating the new ticket and add the following labels to the new ticket for visibility:

      • triage

      • triage-incident

When NOT to Report an Issue to be Triaged

  • Bugs in products under your team's jurisdiction, including endpoints and integrations with APIs

  • Feature improvements that belong to your own team
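The reporting guidance in the sections above can be summarized as a small decision function. The Python sketch below is only an illustration: the `Issue` fields, the `Route` names, and the order of the checks are assumptions made for this example, not an official Platform tool. It also simplifies one case: when the owner is unknown, the text allows either looking it up in the Product Directory or submitting to the Platform for triage, and the sketch only models the latter.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE_ON_CALL = "Escalate to on-call immediately"
    FIX_IN_TEAM = "Handle within your own team; do not send to Triage"
    CONTACT_OWNER = "Contact the owning team's point of contact"
    PLATFORM_TRIAGE = "Submit a Platform Support Ticket for triage"


@dataclass
class Issue:
    # Hypothetical field names that mirror the criteria in the text.
    is_severity_1: bool = False
    owned_by_my_team: bool = False
    owning_team_known: bool = False


def route(issue: Issue) -> Route:
    """Return a suggested reporting path for an issue."""
    # Severity 1 always takes priority over ownership questions.
    if issue.is_severity_1:
        return Route.PAGE_ON_CALL
    # Bugs and feature work in your own products stay with your team.
    if issue.owned_by_my_team:
        return Route.FIX_IN_TEAM
    # A known owner can be contacted directly.
    if issue.owning_team_known:
        return Route.CONTACT_OWNER
    # Unknown owner: hand the issue to Platform triage.
    return Route.PLATFORM_TRIAGE


print(route(Issue(owning_team_known=True)).value)
```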

Still not sure?

If you are still unsure where to report your incident, reach out in the #vfs-platform-support Slack channel and we will be happy to assist you.

How Reported Issues will be Triaged

The Platform

We will resolve issues with products/systems that fall under Platform ownership. See the Product Directory to learn which products and systems the Platform owns.

VFS teams

VFS teams will resolve all issues with Veteran-facing Services (including endpoints and integrations with APIs). Triage will assign each issue to the POC of the VFS team whose product is affected, per the ownership indicated in the Product Directory.

If the Product Directory does not indicate a VFS owner for a service, Triage will assign the issue to Chris Johnston.

Monthly Maintenance Window

To improve coordination and visibility around platform updates and maintenance, the Platform team has established a Monthly Maintenance Window, scheduled for the 2nd Wednesday of each month. This designated maintenance window will help teams plan and minimize disruptions, while ensuring that updates and fixes are completed during a dedicated time slot.

Key Details:

  • Day & Time: The maintenance window will occur on the 2nd Wednesday of each month, from 11:00 PM to 2:00 AM Eastern.

  • Purpose & Timing:

    • The time slot was chosen to avoid weekend scheduling, allowing a larger team to be available if needed.

    • The window falls in the middle of the week, which helps prevent issues from lingering into the weekend.

    • Traffic during this time is generally lighter compared to weekends, ensuring minimal disruption to users.

    • The maintenance will conclude before the daily traffic spike around 2:30 AM Eastern, when many scheduled jobs are executed.

  • Usage:

    • This scheduled maintenance window does not rule out additional Platform Maintenance Windows. If the time block is insufficient for the tasks at hand, or urgent maintenance arises, another window can be scheduled.

    • If no maintenance is required for a given month, the window will be canceled.

    • This window may also be leveraged for Off-Hours Deploys (OHD), allowing deployments to be scheduled outside of daily deploy hours without impacting regular platform operations.

  • Request & Awareness Process:

    • To request use of the maintenance window, create a Support Ticket in the #vfs-platform-support Slack Channel, using the “PagerDuty Maintenance Window” Request Topic.

    • The request should include a description of:

      • The desired date.

      • The work to be done.

      • The rationale behind it.

    • The Support team will align the request with the scheduled maintenance window, determine whether Tier 2 resources are required, and communicate updates as needed.

    • The On-Call team will support the maintenance during this window. Additional resources will be engaged as necessary based on the tasks identified for the maintenance.

    • A filtered board will be used to track and manage all maintenance window requests.
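To plan around the schedule described under "Day & Time", the date of the window can be computed for any month. The Python sketch below uses only the standard library; the function names are hypothetical. It makes one assumption loudly: the window is taken to start at 11:00 PM on the 2nd Wednesday and end at 2:00 AM the following morning. If the Platform instead counts the window as ending on the Wednesday, shift the start back by one day. Times are naive wall-clock values meant to be read as Eastern time.

```python
import calendar
from datetime import date, datetime, time, timedelta


def second_wednesday(year, month):
    """Return the date of the 2nd Wednesday of the given month."""
    first = date(year, month, 1)
    # Days from the 1st of the month to the first Wednesday (0 to 6).
    offset = (calendar.WEDNESDAY - first.weekday()) % 7
    return first + timedelta(days=offset + 7)


def maintenance_window(year, month):
    """Return (start, end) of the monthly window as Eastern wall-clock times.

    Assumption: the window starts at 11:00 PM on the 2nd Wednesday and
    runs for three hours, ending at 2:00 AM the next calendar day.
    """
    start = datetime.combine(second_wednesday(year, month), time(23, 0))
    end = start + timedelta(hours=3)
    return start, end


start, end = maintenance_window(2025, 4)
print(start.isoformat(), "to", end.isoformat())
```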

Off Hours Deploy (OHD)

There are instances where teams require a deploy outside of the daily deployment window, but the request does not qualify for an Out Of Band (OOB) deploy. These requests often require more attention and carry a higher risk, making them unsuitable for the daily deploy process.

To streamline the handling of these requests, we have defined the Off Hours Deploy (OHD) process. These requests should be planned well in advance whenever possible. This process ensures that such deploys are properly planned and coordinated to reduce risks and ensure the right support is available.

Steps to Submit an Off Hours Deploy Request:

  1. Initiate the Request

    • Create a Support Ticket in the #vfs-platform-support Slack Channel, using the “Off Hours Deployment” Request Topic.

  2. Fill Out the Ticket Template

    • Open an OHD Request Issue Ticket and complete each of the following fields:

      • Description and Expectations: Outline what is being deployed and any specific goals.

      • Requesting Team: Identify the person and team making the request.

      • Date and Time: Specify the requested deploy date and time.

      • Platform Maintenance Window: Indicate if the deploy should be tied to a Platform Maintenance Window. If so, note the specific window.

      • Justification: Explain why this deploy cannot be handled within the daily deploy window or as an OOB deploy.

      • Potential Support Needed: List any necessary support (e.g., Backend, Frontend, DevOps).

  3. Coordination and Awareness

    • Once the OHD Request Ticket is complete, Tier 1 will engage by:

      • Ensuring team members are aligned with the proposed date and time.

      • Confirming the timing does not overlap with an existing Platform Maintenance Window (if applicable).

      • Assigning the request to the correct Support team as well as the specific T1 and T2 on-call person.
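Teams that prepare OHD requests programmatically may find it useful to check that every field from the ticket template in step 2 is filled in before submitting. The Python sketch below is a hypothetical helper, not part of the Platform's tooling: the class name, attribute names, and the choice of plain strings for every field are assumptions made for illustration. The Platform Maintenance Window field is treated as optional because the template only asks for it when the deploy is tied to a window.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class OHDRequest:
    # Attribute names are illustrative; they follow the template fields.
    description_and_expectations: str
    requesting_team: str
    date_and_time: str
    justification: str
    potential_support_needed: str
    # Only needed when the deploy is tied to a Platform Maintenance Window.
    platform_maintenance_window: Optional[str] = None


def missing_fields(request):
    """Return the names of required fields that are empty or blank."""
    missing = []
    for f in fields(request):
        if f.name == "platform_maintenance_window":
            continue  # optional field, never reported as missing
        value = getattr(request, f.name)
        if not value or not value.strip():
            missing.append(f.name)
    return missing


# Example with placeholder values and one field left blank.
request = OHDRequest(
    description_and_expectations="Deploy the example service migration.",
    requesting_team="Jane Doe, Example Team",
    date_and_time="2025-04-09 23:00 Eastern",
    justification="",
    potential_support_needed="Backend, DevOps",
)
print(missing_fields(request))
```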

