Triaging API Errors in EKS Environments

Introduction

When triaging errors in vets-api there are several tools available.

  • Sentry for error discovery/tracking

  • Datadog for logs, error tracking, and metrics

Sentry

Sentry is our primary source for API errors. There is a Sentry 'project' for each of our environments: platform-api-development, platform-api-staging, and platform-api-production. Selecting a project brings up a list of 'Unresolved Issues' sorted by 'Last Seen' date. The other sorting options are 'Priority', 'First Seen', and 'Frequency'. 'Priority' is one of the most useful: it is a time-decay algorithm that uses total frequency to surface both consistently noisy issues and new ones.

Once you've found a Sentry issue you're interested in, click it to view the details. The official Sentry docs cover issue details, but a few sections are of particular interest because of how we use them:

  • Tags: The auto-generated Rails tags, in concert with our custom tags, provide extra issue details. controller_name and transaction let you know the source of an issue. sign_in_method marks if the user signed in via ID.me, DSLogon, MHV, or Login.gov. The team tag marks an issue as belonging to an app team.

  • Message: This section maps to the original exception's message. An identical message will appear in the AWS CloudWatch logs in the message field.

Example of common client error in Sentry

  • User: Provides the authn_context (authentication context), the user's LOA level, and their UUID.

  • Additional Data: Unless filtered, the request body and extra error details are here. request_uuid is a valuable field for correlation with AWS CloudWatch logs.

Example of additional data for an error showing code, detail, source, status, and title.
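
The request_uuid correlation described above can be sketched in Ruby. This is a minimal illustration, assuming log entries have been exported as one JSON object per line and carry request_uuid and message fields. The helper name entries_for_request, the sample UUIDs, and the sample paths are hypothetical and not taken from vets-api.

```ruby
require 'json'

# Returns the parsed log entries whose request_uuid matches the value
# copied from a Sentry issue's Additional Data section.
# Assumption: one JSON object per line; lines that are not JSON are skipped.
def entries_for_request(log_lines, request_uuid)
  log_lines.filter_map do |line|
    entry = begin
      JSON.parse(line)
    rescue JSON::ParserError
      nil
    end
    entry if entry.is_a?(Hash) && entry['request_uuid'] == request_uuid
  end
end

# Made-up sample data for illustration only.
SAMPLE_LOGS = [
  '{"request_uuid":"aaa-111","message":"Started GET /v0/example"}',
  'plain text line that is not JSON',
  '{"request_uuid":"bbb-222","message":"Started GET /v0/other"}',
  '{"request_uuid":"aaa-111","message":"Completed 422"}'
].freeze

# Print the messages that belong to a single request.
entries_for_request(SAMPLE_LOGS, 'aaa-111').each { |entry| puts entry['message'] }
```

In practice the same filter is usually applied directly in the log viewer's search box; the sketch only shows the idea of joining a Sentry issue to its log lines by request_uuid.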

Datadog Logs

Detailed logs for all EKS environments are stored in Datadog.

Datadog access is required to view the logs.

Logs can be filtered by container, pod, URL, IP address, and other attributes.

Datadog Error Tracking

Datadog access is required to view errors.

Datadog provides an error tracking framework similar to Sentry.

ArgoCD Logs

ArgoCD container logs can be viewed in the Logs tab inside each pod. Logs can be viewed, copied, downloaded, and followed.

ArgoCD access is required to view ArgoCD logs.

Copy logs button

To switch between container logs in a multi-container pod, click the Containers button.

Logs can be downloaded by clicking the Download logs button.

You can follow any log in a pod by clicking the Follow button.

Metrics are sent from the API with StatsD. To track totals:

CODE
StatsD.increment("service.method.total")

For failures we can add tags to differentiate error types within failure totals:

CODE
StatsD.increment("service.method.fail", tags: ["error:#{error.class}"])

The above pattern is common enough in service classes that it's been abstracted out to a concern, Common::Client::Monitoring, which can be mixed in to a service.

CODE
module EVSS
  class Service < Common::Client::Base
    include Common::Client::Monitoring
    # ... service methods ...
  end
end

Service calls can then be wrapped in a block that automatically records totals and failures:

CODE
def get_appeals(user, additional_headers = {})
  with_monitoring do
    response = perform(:get, '/api/v2/appeals', {}, request_headers(user, additional_headers))
    Appeals::Responses::Appeals.new(response.body, response.status)
  end
end
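
To make the wrapping behavior concrete, here is a minimal sketch of how a with_monitoring-style helper could record totals and tagged failures. It is not the actual Common::Client::Monitoring source: FakeStatsD is a stand-in recorder so the example runs on its own, and MonitoringSketch, ExampleService, and the explicit key_prefix argument are hypothetical names chosen for illustration.

```ruby
# Stand-in for the real StatsD client so this sketch runs on its own.
# It only records the calls it receives.
module FakeStatsD
  @calls = []

  class << self
    attr_reader :calls

    def increment(key, tags: [])
      @calls << { key: key, tags: tags }
    end
  end
end

# Hypothetical mixin; NOT the actual Common::Client::Monitoring source.
module MonitoringSketch
  # Counts every call, and counts failures tagged with the error class.
  # The error is re-raised so callers still see it.
  def with_monitoring(key_prefix)
    yield
  rescue StandardError => e
    FakeStatsD.increment("#{key_prefix}.fail", tags: ["error:#{e.class}"])
    raise
  ensure
    FakeStatsD.increment("#{key_prefix}.total")
  end
end

# Hypothetical service used only to exercise the mixin.
class ExampleService
  include MonitoringSketch

  def fetch(should_fail: false)
    with_monitoring('service.method') do
      raise ArgumentError, 'boom' if should_fail

      :ok
    end
  end
end
```

Under these assumptions, a successful call increments only service.method.total, while a failing call increments service.method.fail with an error tag and then service.method.total, so failures can be read as a share of the total.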

With those calls in place we can query for the average across deployed server instances in Datadog:

CODE
avg:vets_api.statsd.api_appeals_get_appeals_total{env:eks-prod}.as_count()

Example of Datadog chart showing avg:vets_api.statsd.api_appeals_get_appeals_total{env:eks-prod}.as_count()

The error query can filter by error tag:

CODE
sum:vets_api.statsd.api_appeals_get_appeals_fail{env:eks-prod} by {error}.as_count()

Example of Datadog chart showing sum:vets_api.statsd.api_appeals_get_appeals_fail{env:eks-prod} by {error}.as_count()
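
The two queries above follow the same shape, so they can be generated from a StatsD key. The sketch below assumes a naming convention inferred only from those examples: the key's dots become underscores and the result is prefixed with vets_api.statsd. The helper names datadog_metric and count_query and the StatsD keys passed to them are hypothetical; check the actual metric names in Datadog before relying on this.

```ruby
# Assumed naming convention, inferred from the example queries: the StatsD
# key's dots become underscores and the result gets this prefix.
METRIC_PREFIX = 'vets_api.statsd'

# Converts a StatsD key into the assumed Datadog metric name.
def datadog_metric(statsd_key)
  "#{METRIC_PREFIX}.#{statsd_key.tr('.', '_')}"
end

# Builds a count query, optionally grouped by a tag such as "error".
def count_query(aggregator, statsd_key, env:, group_by: nil)
  query = "#{aggregator}:#{datadog_metric(statsd_key)}{env:#{env}}"
  query += " by {#{group_by}}" if group_by
  "#{query}.as_count()"
end

# Reproduces the two queries shown above (the StatsD keys are assumed).
puts count_query('avg', 'api.appeals.get_appeals.total', env: 'eks-prod')
puts count_query('sum', 'api.appeals.get_appeals.fail', env: 'eks-prod', group_by: 'error')
```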

