Triaging API Errors in EKS Environments
Introduction
When triaging errors in vets-api, there are several tools available:
Sentry for error discovery/tracking
Datadog for logs, error tracking, and metrics
Sentry
Sentry is our primary source for API errors. There are Sentry 'projects' for each of our environments: platform-api-development, platform-api-staging, and platform-api-production. Selecting a project brings up a list of 'Unresolved Issues' sorted by 'Last Seen' date. Other sorting options are 'Priority', 'First Seen', and 'Frequency'. 'Priority' is one of the most useful: it uses a time-decay algorithm over total frequency, so it surfaces both consistently noisy issues and newly appearing ones.
Once you've found a Sentry issue you're interested in, you can click it to view the details. The official Sentry docs cover the issue details page, but there are areas of interest in how we use each section:
Tags: The auto-generated Rails tags, in concert with our custom tags, provide extra issue details. controller_name and transaction let you know the source of an issue. sign_in_method indicates whether the user signed in via ID.me, DSLogon, MHV, or Login.gov. The team tag marks an issue as belonging to an app team (see the sketch after this list for how a custom tag might be attached).
Message: This section maps to the original exception's message. An identical message will appear in the AWS CloudWatch logs in the message field.
User: Provides the authn_context (authentication context), the user's LOA level, and their UUID.
Additional Data: Unless filtered, the request body and any extra errors details are here. request_uuid is a valuable field for correlation with AWS CloudWatch logs.
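As an illustration of where a custom tag like team might come from, here is a minimal sketch using the sentry-ruby SDK. The SDK choice, tag values, and exception are assumptions for the example, not the actual vets-api wiring:
# Illustrative sketch only, assuming the sentry-ruby SDK; tag values are examples.
Sentry.set_tags(team: 'benefits', sign_in_method: 'idme')

begin
  raise StandardError, 'upstream request failed'
rescue => e
  # The resulting Sentry issue carries the tags set above.
  Sentry.capture_exception(e)
end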
Datadog Logs
Detailed logs for all EKS environments are stored in Datadog.
Datadog access is required to view the logs.
Logs can be filtered by container, pod, URL, IP address, and other attributes.
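For example, a search along these lines narrows results to production vets-api pods handling a single request; the facet and attribute names (service, env, pod_name, @request_uuid) are assumptions and may need to be adjusted to match the facets actually indexed in our Datadog org:
service:vets-api env:eks-prod pod_name:vets-api-web* @request_uuid:"<request_uuid>"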
Datadog Error Tracking
Datadog provides an error tracking framework similar to Sentry.
Datadog access is required to view errors.
ArgoCD Logs
ArgoCD container logs can be viewed in the Logs tab inside each pod, where they can be copied, downloaded, and followed.
ArgoCD access is required to view ArgoCD logs.
Datadog Metrics
We use StatsD to track call totals from the API:
StatsD.increment("service.method.total")
For failures we can add tags to differentiate error types within failure totals:
StatsD.increment("service.method.fail", tags: ["error:#{error.class}"])
The above pattern is common enough in service classes that it's been abstracted out to a concern, Common::Client::Monitoring, which can be mixed into a service:
module EVSS
  class Service < Common::Client::Base
    include Common::Client::Monitoring
    # ...
  end
end
Service calls can then be wrapped in a block that automatically records totals and failures:
def get_appeals(user, additional_headers = {})
  with_monitoring do
    response = perform(:get, '/api/v2/appeals', {}, request_headers(user, additional_headers))
    Appeals::Responses::Appeals.new(response.body, response.status)
  end
end
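Conceptually, with_monitoring wraps the block so that a .total metric is always incremented and a .fail metric, tagged with the error class, is incremented when the block raises. The sketch below is a simplified approximation rather than the exact vets-api concern; STATSD_KEY_PREFIX is assumed to be a constant defined by the including service (e.g. 'api.appeals'), and a key like api.appeals.get_appeals.total appears to correspond to the vets_api.statsd.api_appeals_get_appeals_total metric queried below.
# Simplified sketch of a monitoring concern; not the exact vets-api implementation.
module Monitoring
  def with_monitoring
    # Name of the wrapped method, e.g. "get_appeals".
    caller_name = caller_locations(1, 1).first.label
    yield
  rescue => e
    # Tag failures with the error class so they can be broken out in Datadog.
    StatsD.increment("#{self.class::STATSD_KEY_PREFIX}.#{caller_name}.fail",
                     tags: ["error:#{e.class}"])
    raise
  ensure
    # Count every attempt, whether it succeeded or failed.
    StatsD.increment("#{self.class::STATSD_KEY_PREFIX}.#{caller_name}.total")
  end
end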
With those calls in place, we can query Datadog for the average call total across deployed server instances:
avg:vets_api.statsd.api_appeals_get_appeals_total{env:eks-prod}.as_count()
The failure query can be broken down by the error tag:
sum:vets_api.statsd.api_appeals_get_appeals_fail{env:eks-prod} by {error}.as_count()
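To watch a failure rate rather than raw counts, the two metrics can also be combined arithmetically in a single query. This is a sketch assuming the same metric names and tags as above:
sum:vets_api.statsd.api_appeals_get_appeals_fail{env:eks-prod}.as_count() / sum:vets_api.statsd.api_appeals_get_appeals_total{env:eks-prod}.as_count()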
Help and feedback
Get help from the Platform Support Team in Slack.
Submit a feature idea to the Platform.