PLATFORM-SUPPORT INCIDENT/OOB TICKETS RESEARCH REPORT - COMPREHENSIVE
Repository: va.ghe.com/software/va.gov-team
Report Period: November 15, 2024 - May 15, 2026 (18 months)
Report Generated: May 15, 2026
EXECUTIVE SUMMARY
| Metric | Value |
|---|---|
| Total Incidents | 100 |
| Open Incidents | 18 |
| Closed Incidents | 82 |
| Closure Rate | 82% |
| Avg Resolution (Closed) | 15.7 days |
| Oldest Open | 100.9 days (#131972) |
| Most Recent | 1.0 day (#142367) |
OPEN INCIDENTS TABLE (18 Total)
| # | Issue | Status | Title | Created | Days Open | Team | Slack |
|---|---|---|---|---|---|---|---|
| 1 | | 🟢 NEW | Application onboarding workflow failed | 2026-05-14 | 1.0d | Tier 1 | |
| 2 | | 🟢 NEW | Revert PR needed for prod deploy | 2026-05-14 | 1.1d | Tier 1 | |
| 3 | | 🟢 NEW | MHV Medical Records error spike | 2026-05-13 | 2.0d | Tier 1 | |
| 4 | | 🟢 NEW | Hosted runners Terraform error | 2026-05-12 | 3.1d | Tier 1 | |
| 5 | | 🔵 RECENT | MHV Tier 3 support ticket issue | 2026-05-08 | 7.2d | Tier 1 | |
| 6 | | 🔵 RECENT | Vets-api local bundle install error | 2026-05-08 | 7.2d | Tier 1 | |
| 7 | | 🔵 RECENT | MEB sign-in with test users | 2026-05-07 | 7.9d | Tier 1 | |
| 8 | | 🟡 ACTIVE | Hosted runner cert issue | 2026-05-01 | 14.1d | Tier 1 | |
| 9 | | 🟡 ACTIVE | Pipeline check failing on PR | 2026-05-01 | 14.1d | DevOps | |
| 10 | | 🟡 ACTIVE | PR ESLint check failure post-GHE | 2026-05-01 | 14.2d | Tier 1 | |
| 11 | | 🟡 ACTIVE | Staging rake task repo access | 2026-05-01 | 14.2d | Tier 1 | |
| 12 | | 🟡 ACTIVE | Production Rails console access | 2026-05-01 | 14.2d | Frontend | |
| 13 | | 🟡 ACTIVE | EventBus build failure AWS ECR denied | 2026-04-28 | 17.1d | Tier 1 | |
| 14 | | 🟡 ACTIVE | All va.gov-team PRs link validation fail | 2026-04-28 | 17.2d | Tier 1 | |
| 15 | | 🟠 URGENT | Alert noise - Synthetic & PGS alerts | 2026-04-07 | 38.0d | Tier 1 | |
| 16 | | 🔴 CRITICAL | Flipper sandbox redirect_uri error | 2026-03-24 | 51.8d | Backend | |
| 17 | | 🔴 CRITICAL | PingWind BIO staging performance | 2026-02-26 | 78.0d | Tier 1 | |
| 18 | | 🔴 CRITICAL | Facility Locator traffic spike | 2026-02-03 | 100.9d | Tier 1 | |
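The Status column appears to track how long an incident has been open. The report does not state the cut-offs, so the sketch below uses hypothetical thresholds (7, 14, 30, and 45 days) chosen only because they reproduce every label in the table above; the real boundaries may differ.

```python
# Upper bounds (in days open) for each status label.
# ASSUMPTION: these thresholds are not given in the report. They are
# illustrative values that happen to match all 18 rows of the table.
STATUS_THRESHOLDS = [
    (7.0, "NEW"),
    (14.0, "RECENT"),
    (30.0, "ACTIVE"),
    (45.0, "URGENT"),
]


def classify(days_open: float) -> str:
    """Return the status label for an incident open for `days_open` days."""
    for upper_bound, label in STATUS_THRESHOLDS:
        if days_open < upper_bound:
            return label
    # Anything at or beyond the last threshold is treated as critical.
    return "CRITICAL"


if __name__ == "__main__":
    # Days-open values taken from the table above.
    for days in (1.0, 7.2, 14.1, 38.0, 51.8, 100.9):
        print(f"{days:6.1f}d -> {classify(days)}")
```

Keeping the thresholds in a single list makes it easy to adjust them once the actual bucket definitions are confirmed.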
CLOSED INCIDENTS TABLE (82 Total, 30 Detailed Rows)
| # | Issue | Title | Created | Closed | Days | Team |
|---|---|---|---|---|---|---|
| 1 | | PII spill to Datadog - 401 errors | 2026-01-15 | 2026-05-12 | 116.9d | Backend |
| 2 | | MAP integrations error rates | 2026-05-02 | 2026-05-03 | 0.1d | Tier 1 |
| 3 | | Vets-api down - api.va.gov unresponsive | 2026-04-07 | 2026-04-28 | 20.7d | Backend |
| 4 | | OOB request - vets-website revert | 2026-03-10 | 2026-04-28 | 48.7d | Frontend |
| 5 | | (Archived) Historic incident tracking | 2025-12-20 | 2026-04-28 | 128.8d | Tier 1 |
| 6 | | Eventbus-gateway service errors | 2026-01-20 | 2026-03-21 | 61.2d | Backend |
| 7 | | Homepage returning 404 errors | 2026-01-21 | 2026-03-25 | 63.7d | Frontend |
| 8 | | Brief vets-api outage | 2025-12-10 | 2026-02-16 | 68.9d | Backend |
| 9 | | Vets-api errors spike | 2026-03-30 | 2026-04-01 | 1.8d | Backend |
| 10 | | PII incident in Datadog RUM action | 2025-09-22 | 2025-10-08 | 16.1d | Security |
| 11 | | Vets-website prod CD deploy issue | 2026-01-29 | 2026-02-05 | 7.1d | Frontend |
| 12 | | Flipper 500 error | 2026-02-09 | 2026-02-10 | 1.2d | Backend |
| 13 | | External service request decrease | 2026-02-09 | 2026-02-18 | 8.8d | Backend |
| 14 | | CCD/DICOM downloads failing | 2026-01-30 | 2026-02-05 | 6.2d | Backend |
| 15 | | PagerDuty license request | 2026-03-05 | 2026-03-05 | 0.2d | DevOps |
| 16 | | Allergies Model API calls failing | 2026-01-16 | 2026-02-13 | 28.0d | Backend |
| 17 | | Lighthouse change undo request | 2026-01-27 | 2026-02-03 | 7.3d | Ops |
| 18 | | Veteran feedback issue | 2025-11-20 | 2025-11-21 | 1.0d | Tier 1 |
| 19 | | Shai-Hulud service account incident | 2025-12-16 | 2025-12-29 | 13.0d | Backend |
| 20 | | Incident in progress tracking | 2025-05-23 | 2025-05-29 | 6.1d | Tier 1 |
| 21 | | SiS success down to zero | 2025-07-30 | 2025-07-31 | 1.4d | Backend |
| 22 | | Bad representative persistence issue | 2025-05-08 | 2025-10-09 | 155.2d | Backend |
| 23 | | Possible production incident | 2025-05-09 | 2025-05-12 | 3.2d | Tier 1 |
| 24 | | Incident reporting access | 2025-04-16 | 2025-04-16 | 0.0d | Tier 1 |
| 25 | | PII incident resolution info | 2025-01-24 | 2025-02-03 | 10.2d | Backend |
| 26 | | Historic incident info request | 2025-01-06 | 2025-01-09 | 3.2d | Backend |
| 27 | | Not really an incident | 2025-02-14 | 2025-02-20 | 6.4d | Tier 1 |
| 28 | | Search service incident | 2024-09-23 | 2024-09-25 | 1.8d | Backend |
| 29 | | Related to recent incident | 2025-03-11 | 2025-03-14 | 3.1d | Backend |
| 30 | | Service issue spike | 2025-03-11 | 2025-03-11 | 0.0d | Tier 1 |
| 31-82 | (Additional) | (52 more closed incidents) | (Various) | (Various) | (1-90d) | |
KEY FINDINGS
Critical Open Issues (Action Required)
🔴 #131972 - 100.9 days open
Facility Locator API receiving traffic from fake bot accounts driving 404 spike
🔴 #134545 - 78.0 days open
PingWind BIO staging performance issues (intermittent, hard to reproduce)
🔴 #137391 - 51.8 days open
Flipper sandbox redirect_uri GitHub OAuth error
Production Impact
🔴 #142234 - MHV Medical Records endpoints DOWN (2 days)
🔴 #142338 - Production deploy BLOCKED by required revert (1 day)
Post-GHE Migration Cluster (April 28 - May 14)
7 incidents concentrated around the GHEC migration (6 are itemized below):
- #140367: Link validation failures
- #140394: AWS ECR build denial
- #140841: Repository access issues
- #140842: ESLint CI failures
- #140877: Pipeline check failures
- #140878: Certificate on hosted runners
Resolution Metrics
- Fastest: 0.0d (#104917, #107733)
- Slowest: 155.2d (#109387)
- Average: 15.7d
- Closure Rate: 82%
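As a rough illustration of how these figures can be derived, the sketch below computes resolution time and closure rate from created/closed dates. The three sample rows are copied from the closed incidents table; the function names are hypothetical. Because the table lists dates rather than timestamps, the results here are whole days, whereas the report's fractional values (for example 1.8d) presumably come from full timestamps.

```python
from datetime import date

# (title, created, closed) copied from the closed incidents table.
SAMPLE = [
    ("Vets-api errors spike", "2026-03-30", "2026-04-01"),
    ("External service request decrease", "2026-02-09", "2026-02-18"),
    ("Shai-Hulud service account incident", "2025-12-16", "2025-12-29"),
]


def resolution_days(created: str, closed: str) -> int:
    """Whole days between two ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(closed) - date.fromisoformat(created)).days


def summarize(rows):
    """Fastest, slowest, and average resolution time over `rows`."""
    durations = [resolution_days(created, closed) for _, created, closed in rows]
    return {
        "fastest": min(durations),
        "slowest": max(durations),
        "average": round(sum(durations) / len(durations), 1),
    }


def closure_rate(closed_count: int, total_count: int) -> float:
    """Closed incidents as a percentage of all incidents."""
    return 100.0 * closed_count / total_count


if __name__ == "__main__":
    print(summarize(SAMPLE))
    # 82 closed out of 100 total, as reported in the executive summary.
    print(f"Closure rate: {closure_rate(82, 100):.0f}%")
```

Running the same calculation over all 82 closed incidents, with full timestamps, would be needed to reproduce the 15.7d average.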
RECOMMENDATIONS
IMMEDIATE (24 Hours)
- Escalate #131972, #134545, #137391 to leadership
- MHV incident response for #142234
- Unblock production deploy for #142338
SHORT-TERM (Week)
- RCA for all incidents >30 days
- Post-migration remediation (GHE issues)
- Access/permission audit
MEDIUM-TERM (Month)
- SLA implementation (target: 15.7d, the current average resolution time)
- Escalation process (7, 14, 30 day triggers)
- Incident dashboard & automation
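The proposed escalation process could be automated along the following lines. This is a minimal sketch: the 7, 14, and 30 day triggers come from the recommendation above, while the level numbering and function names are assumptions made for illustration.

```python
# Trigger points in days, taken from the escalation recommendation.
ESCALATION_TRIGGERS = (7, 14, 30)


def triggers_crossed(days_open: float) -> list:
    """Return every trigger (in days) that an open incident has reached."""
    return [t for t in ESCALATION_TRIGGERS if days_open >= t]


def escalation_level(days_open: float) -> int:
    """ASSUMPTION: level 0 means no escalation, 3 means all triggers reached."""
    return len(triggers_crossed(days_open))


if __name__ == "__main__":
    # Issue numbers and ages as reported in this document.
    open_incidents = {"#142367": 1.0, "#137391": 51.8, "#131972": 100.9}
    for issue, days in open_incidents.items():
        print(f"{issue}: {days}d open, escalation level {escalation_level(days)}")
```

A scheduled job could run a check like this daily and notify the owning team whenever an incident's level increases.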
Data Source: GitHub API (va.ghe.com/software/va.gov-team)
Total Incidents: 100 (18 open, 82 closed)
Report Period: Nov 15, 2024 - May 15, 2026
Last Updated: May 15, 2026