Incident response documentation for application teams
So, you just got an incident notification or reliability alert. Now what? Stay calm. You got this. This page will help you respond to incident notifications and reliability alerts. All you need to do is follow the process below. Take a deep breath and begin.
STEP ONE - Follow this checklist:
Determine the incident priority using this matrix:
Service tier | > 1000 users affected/day | 101 to 1000 users affected/day | 1 to 100 users affected/day |
---|---|---|---|
Tier 1 | P1 | P2 | P3 |
Tier 2 | P2 | P3 | P4 |
Tier 3 | P3 | P4 | P4 |
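If your team wants to automate this lookup (for example, in a triage script), the matrix above can be sketched as a small function. This is a minimal illustration, not an official Platform tool: the names `PRIORITY_MATRIX` and `incident_priority` are hypothetical, and returning `None` for zero affected users is an assumption, since the matrix does not cover that case.

```python
from typing import Optional

# Rows are service tiers; columns run from the highest user-impact
# threshold (> 1000/day) down to the lowest (> 0/day), as in the matrix.
PRIORITY_MATRIX = {
    1: ("P1", "P2", "P3"),
    2: ("P2", "P3", "P4"),
    3: ("P3", "P4", "P4"),
}


def incident_priority(tier: int, users_affected_per_day: int) -> Optional[str]:
    """Return the priority for a service tier and a daily affected-user count.

    Returns None when no users are affected (assumption: the matrix does
    not define a priority for that case).
    """
    if tier not in PRIORITY_MATRIX:
        raise ValueError(f"Unknown service tier: {tier}")
    if users_affected_per_day < 0:
        raise ValueError("users_affected_per_day cannot be negative")

    if users_affected_per_day > 1000:
        column = 0
    elif users_affected_per_day > 100:
        column = 1
    elif users_affected_per_day > 0:
        column = 2
    else:
        return None
    return PRIORITY_MATRIX[tier][column]
```

For example, `incident_priority(1, 5000)` gives `"P1"`, while `incident_priority(2, 150)` gives `"P3"`. The thresholds are strict, so exactly 1000 affected users on a Tier 1 service falls in the second column.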
For incidents rated P2, P3, and P4:
Please refer to the Resolution timeline below and then move on to STEP TWO.
For incidents rated P1:
FIRST: Determine whether the incident priority is correct. If the priority should be downgraded, adjust the incident priority level and document your reasoning on the incident.
Common reasons the priority might be downgraded:
The alert is a false positive
The number of affected users does not reach the P1 severity level
SECOND: Declare an incident in Datadog if one has not already been created. Then rate the incident with severity level 1.
This step will do the following:
Communicate to your team that the issue needs to be resolved ASAP.
Show anyone who might otherwise reach out to report the incident that you are already aware of it and actively working to resolve it.
Create a record that your team can use later, for example as part of a retrospective.
Don’t forget to:
Identify the person who will serve as the Incident Commander and ensure that person is listed on the Datadog incident. Note: the Incident Commander is not the person who does all the work to resolve the incident, but the person who coordinates communications with stakeholders and reports progress.
Identify the Lead Engineer for resolving the incident and ensure that person is listed as a responder of type SME on the Datadog incident. Note: the Lead Engineer is the person who will lead the investigation into the root cause of the issue.
Resolution timeline
Use this matrix as a guideline to prioritize your team’s time and actions:
Incident Priority | Resolution Timeline | Updates to reporter and VA Product Owner | Temporary monitoring adjustments |
---|---|---|---|
P1 | ASAP. Please dedicate the time of any team member who can work productively on the issue, including off-hours | Every 3 hours until resolved | N/A |
P2 | Within the current week/sprint | Every business day until resolved | N/A |
P3/P4 | Within the next 2 sprints | Upon resolution | If the triggered monitor can be adjusted to ignore only the specific issue, do so and add an issue or acceptance criteria to restore the monitoring before the issue is considered resolved |
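To make the update cadence in the matrix above concrete, here is a minimal sketch that computes when the next update to the reporter and VA Product Owner is due. The function name `next_update_due` is hypothetical, and treating "business day" as Monday through Friday (ignoring federal holidays) is an assumption you may need to adjust.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_update_due(priority: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next stakeholder update is due, or None if updates
    are only required upon resolution (P3/P4)."""
    if priority == "P1":
        # P1: every 3 hours until resolved.
        return last_update + timedelta(hours=3)
    if priority == "P2":
        # P2: every business day until resolved.
        # Assumption: business days are Monday-Friday; holidays are ignored.
        due = last_update + timedelta(days=1)
        while due.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            due += timedelta(days=1)
        return due
    if priority in ("P3", "P4"):
        # P3/P4: update upon resolution only.
        return None
    raise ValueError(f"Unknown priority: {priority}")
```

For a P2 incident last updated on a Friday, this returns the following Monday at the same time of day.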
STEP TWO - Communicate
You must keep people informed. Below you’ll find a list of people you will need to contact.
Veterans
Use appropriate tooling and communication channels to ensure Veterans are aware of the issue as necessary and do not spend time doing work that will be lost. This may include:
Adding a downtime notification
Disabling a given app or feature
Direct communications such as emails, other alerting, etc.
Stakeholders
Ensure that your VA points of contact are informed and aware of the issue, its impact, and the expected resolution timeline. Please include a link to issues and/or Slack conversations so that they may keep up to date on progress.
STEP THREE - Work the problem
Now it's time to deal with the problem. Follow the process below and you’ll be fine.
Determine the root cause
The Incident Commander should create a document to capture notes, discussions, and other items (screenshots, log messages, etc.). This document becomes part of the record of the incident for later reporting to stakeholders and assists the team when a retrospective is conducted.
Don’t be afraid to enlist help from Platform support and/or OCTO engineering points of contact.
Once you’ve determined a cause, proceed with resolution according to the general category of cause:
Resolve the issue
Cause category | This might look like | How to resolve |
---|---|---|
External service bug/outage | | For P1 incidents: work with your OCTO point of contact to ensure that the Major Incident Management process has been triggered. In general, communicate with the service owners and work together to ensure that the root cause is addressed or has a plausible plan & timeline to be addressed. |
App bug or design issue | | Darn, can’t blame it on something else! Work to resolve the root cause. Ensure that all changes are tested with unit and end-to-end tests that will detect a re-occurrence. Be sure to consider whether UI changes are necessary or could aid in preventing the issue in the future. |
Internal service outage | | Create a support request with category “VA.gov incident” in #vfs-platform-support – trigger PagerDuty if necessary. |
Improperly configured monitor | | Sometimes an incident just ain’t an incident. Adjust the triggered monitor in a manner that will prevent re-notification for only this specific issue. Ensure that another engineer reviews the changes and confirms their specificity. |
Cosmic rays / Karmic justice / other incidents without clear causes or resolution | | Ensure that any affected Veterans are made whole. This might include manual re-submission of applications or reaching out via approved communications channels to explain the incident and how they can re-start the task they were trying to complete. |
STEP FOUR - Wrap things up
Good job. Now it is time for the post-mortem. To wrap things up, please follow the process below.
For P1 & P2 incidents:
Ensure that a formal post-mortem is created and shared with all relevant stakeholders.
For P3/P4 incidents:
An informal post-mortem shared via email or Slack is sufficient. Include the following information:
What happened
Why it happened
What we’re doing to prevent it from happening again
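As one way to keep informal post-mortems consistent, the three items above can be assembled into a short plain-text message ready to paste into email or Slack. This is only a sketch: the function name `informal_post_mortem` and the layout of the message are assumptions, not a Platform-mandated template.

```python
def informal_post_mortem(what_happened: str, why_it_happened: str, prevention: str) -> str:
    """Build a plain-text informal post-mortem from the three required items.

    Raises ValueError if any item is blank, so no section is silently skipped.
    """
    sections = [
        ("What happened", what_happened),
        ("Why it happened", why_it_happened),
        ("What we're doing to prevent it from happening again", prevention),
    ]
    parts = []
    for title, body in sections:
        text = body.strip()
        if not text:
            raise ValueError(f"Missing content for section: {title}")
        parts.append(f"{title}\n{text}")
    # Separate sections with a blank line for readability.
    return "\n\n".join(parts)
```

Calling it with one sentence per item yields three titled sections separated by blank lines.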
For Everyone:
Ensure that the reporting watch officer and your team’s VA points of contact are aware of the incident resolution and are given a chance to review the post-mortem.
CONGRATULATIONS
You have successfully handled an incident notification or reliability alert. Thank you for following the steps in this document. You can relax now.