
Triaging API Errors in EKS Environments

Last Updated: July 7, 2025

When troubleshooting errors in vets-api, several tools are available:

  • Sentry for error discovery/tracking

  • Datadog for logs, error tracking, and metrics

  • ArgoCD for deployment errors

This page explains how to investigate errors in those places.

Datadog logs

Detailed logs for Vets API for all EKS environments are stored in Datadog.

Datadog access is required to view the logs.

Logs can be filtered by container, pod, URL, IP address, and other attributes.

Datadog error tracking

Datadog access is required to view errors.

Datadog provides an error tracking framework. This is similar to Sentry.

Sentry

Datadog is preferred over Sentry for error tracking.

Sentry tracks API errors and was the primary error tracking tool before Datadog. There is a Sentry 'project' for each of our environments: platform-api-development, platform-api-staging, and platform-api-production. Selecting a project brings up a list of 'Unresolved Issues' sorted by 'Last Seen' date. The other sorting options are 'Priority', 'First Seen', and 'Frequency'. 'Priority' is one of the most useful because it uses a time decay algorithm combined with total frequency, so it surfaces both consistently noisy issues and new ones.

Once you've found a Sentry issue you're interested in, click it to view the details. The official Sentry docs cover issue details in general; the sections below are the ones most relevant to how we use Sentry:

  • Tags: The auto-generated Rails tags, in concert with our custom tags, provide extra issue details. controller_name and transaction let you know the source of an issue. sign_in_method marks if the user signed in via ID.me, DSLogon, MHV, or Login.gov. The team tag marks an issue as belonging to an app team.

  • Message: This section maps to the original exception's message. An identical message will appear in the AWS CloudWatch logs in the message field.

Example of common client error in Sentry

  • User: Provides the authn_context (authentication context), the user's LOA level, and their UUID.

  • Additional Data: Unless filtered, the request body and extra error details are here. request_uuid is a valuable field for correlation with AWS CloudWatch logs.

Example of additional data for an error showing code, detail, source, status, and title.

ArgoCD logs

ArgoCD container logs can be viewed in the Logs tab inside each pod. Logs can be viewed, copied, downloaded, and followed.

ArgoCD access is required to view ArgoCD logs.

Copy logs

In a multi-container pod, you can switch between container logs by clicking the Containers button.

Logs can be downloaded by clicking the Download logs button.

Download Logs

You can follow any log in a pod by clicking the Follow button.

Follow logs

Long term error tracking

See also the StatsD Metrics page.

Tracking totals from the API:

RUBY
StatsD.increment("service.method.total")

For failures we can add tags to differentiate error types within failure totals:

RUBY
StatsD.increment("service.method.fail", tags: ["error:#{error.class}"])
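
The two counters above are usually emitted together around a single call. The following is a minimal sketch of that pattern, not code from vets-api: FakeStatsD is a hypothetical in-memory recorder standing in for the real StatsD client, and fetch_record is a made-up method. The metric names reuse the service.method placeholders from the snippets above.

```ruby
# Hypothetical stand-in for the StatsD client so the sketch runs on its own.
# It only records what would have been sent.
module FakeStatsD
  def self.calls
    @calls ||= []
  end

  def self.increment(key, tags: [])
    calls << { key: key, tags: tags }
  end
end

# Hypothetical service call: counts every attempt, and counts failures
# separately with the error class as a tag before re-raising.
def fetch_record(should_fail: false)
  FakeStatsD.increment('service.method.total')
  raise ArgumentError, 'bad input' if should_fail

  :ok
rescue StandardError => e
  FakeStatsD.increment('service.method.fail', tags: ["error:#{e.class}"])
  raise
end
```

Tagging the failure counter with the error class keeps a single fail metric while still allowing a breakdown by error type later.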

The above pattern is common enough in service classes that it's been abstracted out to a concern, Common::Client::Monitoring, which can be mixed in to a service.

RUBY
module EVSS
  class Service < Common::Client::Base
    include Common::Client::Monitoring
    # ... service methods ...
  end
end

Service calls can then be wrapped in a block that automatically records totals and failures:

RUBY
def get_appeals(user, additional_headers = {})
  with_monitoring do
    response = perform(:get, '/api/v2/appeals', {}, request_headers(user, additional_headers))
    Appeals::Responses::Appeals.new(response.body, response.status)
  end
end
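
To make the wrapper's behavior concrete, here is a simplified sketch of what a with_monitoring helper could look like. It is not the actual Common::Client::Monitoring source: the SketchMonitoring module, the STATSD_KEY_PREFIX constant, the FakeStatsD recorder, and the DemoService class are all assumptions made for illustration.

```ruby
# Hypothetical recorder standing in for the real StatsD client.
module FakeStatsD
  def self.calls
    @calls ||= []
  end

  def self.increment(key, tags: [])
    calls << { key: key, tags: tags }
  end
end

# Simplified, assumed version of a monitoring concern.
module SketchMonitoring
  def with_monitoring
    # Name of the method that called with_monitoring, e.g. "get_thing".
    caller_name = caller_locations(1, 1).first.base_label
    key = "#{self.class::STATSD_KEY_PREFIX}.#{caller_name}"
    yield
  rescue StandardError => e
    # Record the failure, tagged by error class, then let it propagate.
    FakeStatsD.increment("#{key}.fail", tags: ["error:#{e.class}"])
    raise
  ensure
    # The total is recorded whether the call succeeded or failed.
    FakeStatsD.increment("#{key}.total")
  end
end

# Hypothetical service using the sketch above.
class DemoService
  include SketchMonitoring

  STATSD_KEY_PREFIX = 'api.demo'

  def get_thing(broken: false)
    with_monitoring do
      raise IOError, 'upstream unavailable' if broken

      { status: 200 }
    end
  end
end
```

In this sketch a successful call emits only api.demo.get_thing.total, while a failing call emits api.demo.get_thing.fail followed by api.demo.get_thing.total and then re-raises the original error.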

With those calls in place we can query for the average across deployed server instances in Datadog:

RUBY
avg:vets_api.statsd.api_appeals_get_appeals_total{env:eks-prod}.as_count()

Example of Datadog chart showing avg:vets_api.statsd.api_appeals_get_appeals_total{env:eks-prod}.as_count()
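
Judging from the query above, the Datadog metric name appears to be the StatsD key with dots replaced by underscores and a vets_api.statsd prefix. The sketch below encodes that inferred convention. The datadog_metric_name and avg_count_query helpers and the api.appeals.get_appeals.total key are assumptions for illustration, so confirm the exact metric name in Datadog before relying on it.

```ruby
# Hypothetical helper: derives the Datadog metric name from a StatsD key,
# based only on the naming pattern seen in the example query.
def datadog_metric_name(statsd_key, prefix: 'vets_api.statsd')
  "#{prefix}.#{statsd_key.tr('.', '_')}"
end

# Builds the average-count query for an environment tag such as eks-prod.
def avg_count_query(statsd_key, env:)
  "avg:#{datadog_metric_name(statsd_key)}{env:#{env}}.as_count()"
end

puts avg_count_query('api.appeals.get_appeals.total', env: 'eks-prod')
```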

The failure query can be grouped by the error tag to break the total down by error type:

RUBY
sum:vets_api.statsd.api_appeals_get_appeals_fail{env:eks-prod} by {error}.as_count()

Example of Datadog chart showing sum:vets_api.statsd.api_appeals_get_appeals_fail{env:eks-prod} by {error}.as_count()

