
VA Platform Incident Playbook


Overview

This document provides guidelines for handling, communicating about, and responding to incidents on the VA Platform.

Note: An overview of the process can be found below, under “Incident Flow Chart”

Security Incidents

All security incidents must be reported to the ISSO (Griselda.Gallegos@va.gov or 512.326.6037) or the Privacy Office (oitprivacy@va.gov) immediately. Be sure to include the following information:

  • Who was involved in the issue (including your contact information)

  • Exactly when the issue occurred (a timeframe, if available)

  • What systems were affected

  • When the issue was discovered, and who discovered it.

  • How many individuals are affected

  • Types of information leaked, if applicable (SSNs, login credentials, names, addresses, etc.)

The Privacy Officer or the ISSO creates a ticket in PSET (Privacy and Security Events Tracking System, https://www.psets.va.gov/). This ticket is not created by anyone on the Platform.

All incidents will be reported to VACSOCIncidentHandling@va.gov by the ISSO or the Privacy Officer within 1 hour of occurrence.

Please include the following people on the email chain: Lindsey.Hattamer@va.gov, brandon.dech@va.gov, lindsay.insco@va.gov, clint.little@va.gov, kenneth.mayo1@va.gov, steve.albers@va.gov, christopher.johnston2@va.gov

You can track incidents in the DSVA #vfs-incident-tracking channel

VA Platform Process

These steps are an overview for reporting a PHI/PII spill to Platform.

  1. Create a support request with the “Incident” topic using the /support bot in the #vfs-platform-support channel.

  2. Platform Support will acknowledge the support request, and provide next steps.

  3. The responsible party for the spill will contact the OIT VA Privacy Office to report this security incident. The following questions will need to be answered; you can find that information here.

  4. Platform Support will trigger a PagerDuty incident and post communications in the DSVA #vfs-incident-tracking Slack channel.

  5. Platform Support will work to identify remediation steps, such as scrubbing Datadog logs and expediting code review changes, if needed.

  6. Platform Support will follow up with OIT on any questions that we can help answer, as well as gather the PSET number for the incident.

  7. Platform Support will update the incident thread in #vfs-incident-tracking. This includes details such as:
    - What system is affected?
    - When was this discovered?
    - How many Veterans are impacted?
    - What PII identifiers were found?
    - How many total logs are affected, if the spill is within Datadog?
    - Confirmation from Support that Datadog logs have been filtered
    - When OIT resolves the PSET incident
    - A link to a postmortem on the incident

  8. Platform Support will provide directions on how to create a postmortem. If the user is not technical, Support can create the postmortem for the individual.

  9. Platform will help contribute to the postmortem, but the responsible team is expected to fill in most of the details.

  10. We will offer a retrospective meeting with OCTO so that Platform and the responsible team can discuss the spill.

Incident-handling steps

Note: These steps are an overview. See further down the page for more details if needed.

  1. Determine whether the issue is a major outage of VA.gov or a limited-impact incident, if possible. See the Classifying the Incident section below.

Minor incidents don’t need to follow the reporting outlined on this page.

  2. If you are NOT the Incident Commander (IC) and need to page the IC, create a support ticket in the DSVA Slack channel #vfs-platform-support by using the /support command with the “Incident” topic.

Reference: how to use the /support command, selecting the “Incident” topic, and where to add a description in a support ticket.

  3. The Incident Commander will confirm the incident.

    1. For major incidents, the IC must notify @channel on significant status change or every three hours in the following channels:

      1. #vsp-leadership-incident-tracking

      2. #vfs-incident-tracking

      3. #vfs-platform-support

      4. #vfs-all-teams

      5. #vsp-contact-center-support

    2. For security incidents, the IC must escalate the incident to Tier 3 (Lindsey Hattamer and Ken Mayo). Follow the steps in the “Security Incidents” section.

  4. The Incident Commander and subject matter experts work together.

    1. The engineer will focus on resolving the incident (see Triaging an Incident section below).

      1. A technical lead will act as the SME for issues related to the incident.

      2. The Incident Commander will focus on the communication steps outlined in this document. Please review the difference between a Swarm Room and a MIM Bridge.

    2. Classify the incident and escalate if needed.

    3. Start a Team Swarm Room if required.

  5. Provide updates in the incident threads in the channels above as more details emerge.

  6. Once the Incident is resolved, publish a postmortem.

Triaging an Incident

Triaging an incident in the VA Platform can be difficult due to the complexity of the systems, teams, and infrastructure involved. The following suggestions are steps that may be taken by the IC or Tier 2 on-call engineers when troubleshooting.

Take notes and document the troubleshooting steps taken in the Slack thread or channel where this is being discussed.

Note: ✔️ Denotes an action to take when triaging an incident.

Triaging steps

Check from the end-user perspective

These are the user-facing aspects of the VA Platform. If either of these is non-functional, there is a critical outage.

✔️ Check the Vets API status endpoint (https://api.va.gov/v0/status). Success looks like an object with a git_revision key and a SHA value.

✔️ Check the VA.gov frontend (https://www.va.gov should return a 200 with the “VA.gov Home | Veterans Affairs” page).
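If you want to script these two checks, here is a minimal Python sketch. It assumes the requests library is installed and that both URLs are reachable from where you run it; this is a convenience, not a replacement for checking in a browser.

    import requests

    # Minimal end-user smoke test for the two checks above. Assumes both
    # endpoints are reachable from your network; run from CAG/GFE if not.
    status = requests.get("https://api.va.gov/v0/status", timeout=10)
    status.raise_for_status()
    print("vets-api git_revision:", status.json().get("git_revision"))  # expect a SHA value

    home = requests.get("https://www.va.gov/", timeout=10)
    print("VA.gov homepage status:", home.status_code)  # expect 200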

Check #oncall and #devops-alerts

Look for any suspicious alerts at the time of the outage.

Check recent commits

Look for anything that could be suspicious in the following repositories:

✔️ https://github.com/department-of-veterans-affairs/vets-api/commits/master

✔️ https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/commits/main

✔️ https://github.com/department-of-veterans-affairs/devops/commits/master

✔️ https://github.com/department-of-veterans-affairs/vsp-platform-revproxy/commits/main/

✔️ https://github.com/department-of-veterans-affairs/vsp-platform-fwdproxy/commits/main/

Review the General Monitoring Dashboards

More detailed troubleshooting might be required to determine the cause of the incident. The following resources may be helpful:

✔️ Check the Platform Infrastructure Dashboard

✔️ Check the Vets API Dashboard

✔️ Check the Datadog Log Explorer

Check Incoming Requests

Incoming traffic to the platform first flows through the Trusted Internet Connection (TIC).

✔️ Upstream - Check to ensure that the Internet gateway is healthy

Check for AWS Issues

Platform infrastructure runs on AWS in US-Gov-West, so check for issues in that region:

✔️ Check the AWS GovCloud health status page.

Additionally, because so many services and applications outside of the VA Platform are run on AWS, it may help to check the general AWS status page as well:

✔️ Check the General AWS health status page.

Note: AWS issues are likely out of our control. However, realizing that the problem lies with AWS and not the Platform infrastructure may save significant time when triaging.

Review Recent Deployments

Recently deployed code may have introduced new errors to the platform.

Some deployments in the VA platform are automated and run regularly while others may be initiated manually by engineers.

✔️ Check #devops-deploys in DSVA Slack for recent deployments that might relate to the current issue.

Note: Deployments are triggered for multiple apps from multiple tools:

Check Downstream Services

Multiple services downstream and/or upstream of the VA Platform can cause issues. In general, these services are outside of the control of the VA Platform Team and are usually NOT critical, but they may still cause alarms to trigger. These services include:

  • BEP/BGS

  • Lighthouse

  • …and many more

The forward proxy regularly makes status-check calls to these services.

✔️ Check if forward proxy backend connections are healthy

Check for maintenance windows for Downstream Services

Downstream services may be undergoing scheduled maintenance at any given time. For example, EVSS frequently schedules weekend maintenance.

The Platform Support Team is responsible for scheduling corresponding maintenance windows in PagerDuty to prevent unnecessary pages from being sent to the team.

However, you should still:

✔️ Check the EVSS Slack channel for scheduled maintenance.

✔️ Check DSVA Slack for any other mentions of maintenance or scheduled downtime

Check for Platform Service Dependencies

These services are tightly-coupled with Vets-API, meaning any failures in these services will cause significant issues or possibly a critical outage.

✔️ Check the Vets-API job scheduler queue (Sidekiq)

✔️ Check Redis (AWS Elasticache)

✔️ Check Database (AWS RDS - Postgres)
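If you need a quick connectivity check for the Redis and Postgres dependencies above, here is a minimal Python sketch. The hostnames and credentials are placeholders, not real values; pull the actual endpoints from the Platform's usual secret store. Sidekiq is typically checked through its dashboard rather than code.

    import redis          # pip install redis
    import psycopg2       # pip install psycopg2-binary

    # Placeholder endpoints/credentials -- substitute the real values from the
    # Platform secret store; these are NOT the actual hostnames.
    REDIS_HOST = "vets-api-redis.example.internal"
    PG_DSN = ("host=vets-api-db.example.internal dbname=vets_api "
              "user=readonly password=CHANGEME connect_timeout=5")

    # Redis (ElastiCache): a successful PING means the cache is reachable.
    print("Redis PING:", redis.Redis(host=REDIS_HOST, port=6379, socket_timeout=5).ping())

    # Postgres (RDS): SELECT 1 confirms the database accepts connections.
    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT 1")
        print("Postgres SELECT 1:", cur.fetchone()[0])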

Check for Platform Service Infrastructure Issues

Platform uses a combination of AWS hosting solutions to run VA services. Typically, instances of these services are replicated, and AWS routes portions of traffic to each instance. Issues in these services may cause instances to die and reboot continuously.

✔️ Check the following application monitor(s) for cycling:

Incident Areas of Concern

Areas where things could go wrong

Classifying the Incident

The first step to take when an incident occurs is to classify it based on impact and severity. Incidents are handled differently, depending on how many users are affected and whether significant functionality is lost on va.gov.

This rubric was provided by the VA MIM team. Each severity level below lists a description and the urgency with which to respond:

Critical

Description:

  • A major service (i.e., login) is down for all customers and the platform has recourse.

    • Note: As an example, if DSLogon or id.me is down, that would be considered an “External Incident”; see below.

  • Confidentiality or privacy is breached

  • Veteran data is at risk

Urgency: Drop everything and work around the clock to fix the issue. The Incident Commander should immediately open a Swarm Room and consider contacting the MIM team for an HPI declaration.

Major (High)

Description:

  • One or more services is unavailable for a significant number of Veterans or caregivers

  • Core functionality such as submitting a claim is significantly impacted

Urgency: Drop everything and try to get the issue fixed ASAP. The Incident Commander should immediately open a Swarm Room and consider contacting the MIM team for an HPI declaration. Clarify the need for MIM with OCTODE leadership.

Medium (Moderate)

Description:

  • One or more services is unavailable to 10% or less of Veterans or caregivers

Urgency: Try to fix the problem ASAP. The Incident Commander should offer to open a Swarm Room, but it may not be necessary.

Minor (Low)

Description:

  • Some functionality is unavailable, but there’s a workaround

  • The site does not look the way it should, but it doesn’t affect functionality

  • Degradation in service to customers who are not Veterans

    • Example: Editors cannot publish new content

Urgency: Try to fix the problem ASAP, but not necessarily evenings and weekends.

External

Description:

  • An external service such as DSLogon, id.me, or Salesforce is down

Not an Incident

Description:

  • Some “Incidents” don’t qualify as Incidents. For example, a degradation in service is small enough that the next build will fix the problem, or some other corrective action will fully restore quality of service.

Designating an event as an Incident, including ones designated as Minor, indicates the seriousness of the issue. The Platform has a sense of urgency around resolving all Incidents.

High Priority Incidents (HPI) and Major Incident Management (MIM)

If you believe this is a high-priority incident (HPI) based on the above criteria, or by using the incident flowchart, please follow the steps in the Major Incident Management Playbook.

External Incidents handling

External Incidents are treated differently than internal ones. Some examples might include Datadog or GitHub being down. You can post a message in #vfs-platform-support as you see fit.

Note: The rest of this document focuses on internal Incidents.

Roles

All Incidents must have:

  • Incident Commander

  • Support Team TL: Brandon Dech (backup: Lindsay Insco)

  • TL from the team who owns the broken service: Curt Bonade, Steven Venner, Ken Mayo (Or designated backup)

  • Program Manager: Andrea Townsend (backup: Em Allan)

  • A senior team member (TL or senior) who is not hands-on resolving the issue but can provide clear, timely updates. They must be able to answer questions from the MIM team about the incident and status of resolution. This person should bridge communication between the Swarm Room and the MIM Bridge.

    1. This should be a TL or a Senior member of the team (For example, Kyle for IST, or Curt for SRE.)

  • Engineering Lead: Lindsey Hattamer (backup: Clint Little)

  • VA Technical Leadership: Steve Albers (backup: Andrew Mo?)

  • VA Product Leadership: Erika Washburn (backup: Marni)

If you are named as required and you cannot attend, you must designate your backup in the Platform Leadership channel in Slack.

Communication

Communication is the main job of the incident commander.

As a rule, no escalation is needed for minor incidents, but the IC can choose to escalate at their discretion.

Slack

After notifying the channels listed above in the Incident-handling steps section, communication should happen at least hourly in the #vfs-incident-tracking channel.

The IC’s job is to keep Leadership updated with the current status of an active incident at least every hour with the following:

  • The current state of the service

  • Remediation steps taken

  • Any new findings since the last update

  • Theory eliminations (i.e. ‘What have we determined is not the cause?’)

  • Anticipated next steps

  • ETA for the next update (if possible)

If OCTODE Platform Leadership is not reachable on Slack, text them one at a time, beginning with Steve Albers, using the phone number from their Slack profile.

Swarm Room and MIM Bridge

Please follow the guide on Swarm Rooms vs MIM Bridges for more information on rules and etiquette.

Leaked API Keys

The Vets API and Vets Website repositories are public, VA-owned repos hosted on GitHub. Many of the external services we interact with require API keys for authentication. During local development, it’s common to temporarily embed these keys in the code. However, this practice can lead to accidental exposure of sensitive keys if they are committed to source control. This section explains the Incident Commander’s role in handling these types of incidents.

Types of API Keys

API keys can include private keys, OAuth tokens, Bearer tokens, and more. These credentials are often used for authentication and authorization when interacting with external services on the VA network or adjacent APIs (such as Lighthouse).

Response Plan

Leaked API keys are often discovered during PR reviews, where a reviewer (Platform or VFS) may notice sensitive information committed in code. Upon identifying a leaked key, follow the Incident Commander process:

  1. Trigger an incident in PagerDuty and escalate to the Backend on-call. Or, if during business hours, utilize Backend support.

  2. Notify OCTO leadership in the #vfs-incident-tracking channel.

  3. After the key is rotated, Backend will need to roll pods in EKS to ensure they pick up the new key (a sketch of this follows below).
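For reference, here is a hedged sketch of the pod roll in step 3, using the Kubernetes Python client to set the same restart annotation that kubectl rollout restart applies. The deployment and namespace names are assumptions; confirm the real ones with the Backend or SRE on-call before running anything.

    from datetime import datetime, timezone
    from kubernetes import client, config  # pip install kubernetes

    # Hypothetical names -- confirm the real deployment/namespace with Backend on-call.
    DEPLOYMENT = "vets-api"
    NAMESPACE = "vets-api"

    config.load_kube_config()  # assumes a kubeconfig with access to the EKS cluster
    apps = client.AppsV1Api()

    # Setting this annotation triggers a rolling restart -- the same mechanism
    # `kubectl rollout restart deployment/<name>` uses under the hood.
    patch = {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
    }}}}}
    apps.patch_namespaced_deployment(name=DEPLOYMENT, namespace=NAMESPACE, body=patch)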

Important note on downtime: Revoking the compromised key may cause downtime or disruptions to services that rely on it, until the new key is in place. To minimize impact, ensure any configuration files, environment variables, or secret management values are updated as soon as possible with the new key.

  • For external service integrations using the Breakers middleware pattern, an outage can be force-triggered to manage disruptions. This mitigation strategy additionally requires scheduling a maintenance window in PagerDuty and may also involve placing a temporary banner on the frontend to notify users about the service impact.

  • If rotation is delayed, please see the next section.

  • Again, this may result in downtime until a new key can be integrated.

  1. If you feel comfortable, you can follow triggering Breaker outages for EKS Environments. If not, you can tap the Backend engineer to help with this. This follows the first bullet point in the warning panel above.

  2. Using the list from bundle exec rake breakers:list_services, cross-reference it with the Service Directory in PagerDuty. We will want to set up maintenance windows for any affected services (see the sketch after this list).

  3. Reach out to the Content and Information Architecture team in #content-ia-centralized-team to facilitate setting up a banner on the frontend informing users about the service impact.

  4. After the incident has been resolved, create a postmortem to document the incident.
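As referenced in step 2 above, here is a minimal sketch of creating a PagerDuty maintenance window through the REST API. The API token, requester email, service IDs, and four-hour window are placeholders or assumptions; take the real service IDs from the PagerDuty Service Directory.

    from datetime import datetime, timedelta, timezone
    import requests

    # Placeholders -- use a real REST API token, your PagerDuty user email, and
    # the service IDs of the affected breakers services.
    PD_TOKEN = "REPLACE_ME"
    REQUESTER_EMAIL = "your.name@va.gov"
    SERVICE_IDS = ["PXXXXXX"]

    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=4)  # assumed length; size it to the expected disruption

    resp = requests.post(
        "https://api.pagerduty.com/maintenance_windows",
        headers={
            "Authorization": f"Token token={PD_TOKEN}",
            "Content-Type": "application/json",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "From": REQUESTER_EMAIL,
        },
        json={"maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": "Forced Breakers outage during API key rotation",
            "services": [{"id": sid, "type": "service_reference"} for sid in SERVICE_IDS],
        }},
        timeout=10,
    )
    resp.raise_for_status()
    print("Created maintenance window:", resp.json()["maintenance_window"]["id"])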

External VA Service Outages/Incidents

If an issue with an external VA service is detected (TIC, MPI, BEP, VBMS, etc.), you will need to file a ServiceNow (SNOW) ticket. The Incident Commander should assist in filing this.

How to file a SNOW ticket in yourit.va.gov for users without ITIL access
  1. From behind the VA network, go to http://yourit.va.gov/va.

  2. Select “Report an issue”.

  3. Select “Not sure? Submit your issue here”.

  4. Fill out your general information (name, phone number, email, etc.) and select “Next page”.

  5. Brief description: one sentence max.

  6. Is this happening at a VA location? No.

  7. VA location: Anywhere.

  8. Category: Software.

  9. Subcategory: Web (IMPORTANT: check the box that says “This device I am looking for is not on the list”).

  10. Name: http://va.gov

  11. URL: whichever one is non-responsive.

  12. Click “Further Details”.

  13. Select the impact.

  14. Submit the issue. You will get a ticket number, something like INC123456. Look out for communication via Teams, email, and your phone.

How to file a SNOW ticket in yourit.va.gov for users with ITIL access
  1. When you open http://yourit.va.gov, you will be sent to a different dashboard.

  2. Select “All”, type “incident” into the filter box (or scroll all the way down), and select “Create New”.

  3. You will get a ticket number immediately, even before you submit. Copy this number.

  4. Location doesn’t matter; building and room are N/A. Default to DC or your nearest VAMC.

  5. Category: Always “Affected Service”.

  6. Affected Service: If you don’t know, use http://VA.gov.

  7. Portfolio: Veteran Experience Services (usually works).

  8. Assignment group:

    • If you do NOT know, go with “ESD Tier 1” - they will help you get to the right person.

  9. Affected System Name/EE Number/Hostname: VA.gov

If an issue with the TIC gateway is identified, the team has been encouraged to reach out directly on Teams to Ty Allinger, a Network Edge Operations engineer, to attempt to troubleshoot live. If Ty Allinger is out of office, Lawrence Tomczyk is the backup manager to contact after you file an Incident Ticket and assign it to NETWORK.NOC.NEO.

Out-of-band deploy approval

Fixing a problem related to an incident might require an emergency deployment outside of the daily deploy schedule. As a rule, out-of-band deployments requested by VFS teams require approval from OCTODE Platform leadership.

When you are requested to complete an out-of-band deployment for an urgent issue, ask the requester to complete an OOB deploy request ticket (here). Once the ticket is created, create an incident in PagerDuty under “Out of Band,” link the Slack request thread, and escalate it immediately. It might be helpful to link the OCTODE member to the existing thread as well. Tag Lindsey Hattamer and Brandon Dech in out-of-band threads.
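If scripting the PagerDuty trigger is faster than clicking through the UI, a hedged sketch using the PagerDuty Events API v2 follows; the routing key for the “Out of Band” service and the Slack link are placeholders you would need to fill in.

    import requests

    # Placeholder values -- use the Events API v2 integration key for the
    # "Out of Band" PagerDuty service and the real Slack request thread URL.
    ROUTING_KEY = "REPLACE_ME"
    SLACK_THREAD = "https://dsva.slack.com/archives/REPLACE_ME"

    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": "Out-of-band deploy request (urgent)",
                "source": "vfs-platform-support",
                "severity": "critical",
            },
            "links": [{"href": SLACK_THREAD, "text": "Slack request thread"}],
        },
        timeout=10,
    )
    resp.raise_for_status()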

The on-call OCTODE member will respond and either approve or deny. From there, reach out to the appropriate Tier 2 team member to review, approve/give feedback if necessary, and merge.

The team will need to create a post-mortem for the issue.

Off-cycle deployments are not synonymous with out-of-band deployments. Off-cycle deploy requests are for VFS teams to coordinate with the Platform crew to deploy non-urgent, planned changes off-hours to minimize impact to Veterans.

After the Incident

Once the Incident is resolved, follow the instructions to create a postmortem document. Get a draft up within 24 hours.

Incident Retrospective Process

OCTODE leadership may request that the Incident Commander schedule a retrospective meeting to bring all relevant parties together and discuss the PM while it is still in draft form. The intent of the meeting is typically to review and go through the details of the PM.

The postmortem document should be as complete as possible prior to the discussion. This should be treated as a call to cover the final draft.

Ensure that your meeting invite description includes:

  • A link to the post-mortem document

  • A short message explaining the reason for the meeting (e.g., “A retrospective to cover this post-mortem. Please feel free to reach out about any other discussion items or if you would like anyone else invited.”)

Be sure to send the meeting invite to Lindsey Hattamer, Brandon Dech, Andrea Townsend, Erika Washburn, Chris Johnston, Steve Albers, and any additional stakeholders involved or interested. Use VA.gov email addresses only. You may forward the invite to your company email address for visibility.

Julia Gray and Erica Robbins work closely with Chris Johnston and can assist you with finding time in his schedule for incident postmortems he requests (HPIs for example). They are available in Slack.

To keep the meeting efficient, consider including an agenda based on the titled topics from the PM. This can serve as a guideline to help facilitate the discussion.

During the call, the Incident Commander should share a link to the postmortem document in the chat and share their screen with the postmortem open for review. The Incident Commander will walk the attendees through the postmortem section by section and open the floor for discussion. The IC or another member of the Platform Support team should take detailed notes so that the document can be updated.

Additionally, any action items that require issue tickets should be listed in the table in the postmortem and created ASAP following the meeting.

After the notes from the discussion are added to the postmortem document, it can be taken out of draft mode, and you can request a review from Steve Albers.

Incident Flow Chart

Resources

Incident Call Rules: Swarm Room vs. MIM Bridge

MIM SOP (Only accessible behind VA network [CAG, AVD, GFE])

YourIT Helpdesk Article (Only accessible behind VA network [CAG, AVD, GFE])

Platform

External

PagerDuty incident response

Atlassian Incident Response
