Vets API on EKS

Intro

The vets-api development, staging, sandbox, and production environments currently reside in EKS. This document provides background information on how EKS works and how to work with Vets API on the EKS infrastructure.

How does it work?

EKS

EKS is a managed service that you can use to run Kubernetes on AWS. It removes the need to install, operate, and maintain your own Kubernetes control plane, and it handles container orchestration by managing and unifying the underlying infrastructure components.

ECR

ECR is an AWS managed container image registry service. A registry exists in ECR for Vets API-specific images. When a change is pushed to the k8s branch, the Docker image is built and then pushed to the Vets API ECR registry.

ArgoCD

ArgoCD provides a UI for developers to view the application, and it manages the deployment by reconciling the desired state against the deployed state.
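
As a rough illustration, an ArgoCD Application resource ties a git source (a path in the manifest repo) to a destination cluster and namespace, and an automated sync policy keeps the deployed state reconciled with the desired state. The block below is a minimal sketch with placeholder values, not the actual vets-api Application definition.

YAML
# Illustrative sketch only; repo URL, path, and namespace are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vets-api-dev                  # placeholder name
spec:
  project: default
  source:
    repoURL: <manifest-repo-url>      # placeholder
    path: <path-to-vets-api-dev-manifests>
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: vets-api
  syncPolicy:
    automated:                        # autosync: Argo applies changes it detects in git
      prune: true
      selfHeal: true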

GitHub Actions

Vets API in EKS utilizes GitHub Actions to build and update the docker image when a change is pushed to the k8s (EKS master) branch.

The vets-api deployments currently use the k8s branch to deploy to EKS. k8s is considered the "master" branch.

Additionally, any change pushed to the vets-api master branch is also auto merged into the k8s branch via this GHA.
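
For orientation, the snippet below is a minimal sketch of what a build-and-push workflow triggered by the k8s branch could look like. The workflow name, AWS region, role secret, and image names are assumptions; the actual workflow lives in the vets-api repository.

YAML
# Hypothetical sketch only; names, region, and secrets are assumptions.
name: Build and push vets-api image
on:
  push:
    branches: [k8s]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}   # hypothetical secret name
          aws-region: us-gov-west-1                     # assumed region
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/vets-api:${{ github.sha }}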

Helm charts

The vets-api EKS deployment utilizes a custom helm chart that resides in the vets-api repository. The vets-api manifest files then reference the helm chart package and custom values are passed to the chart.

Utilizing a helm chart simplifies the deployment, improves maintainability, and reduces repeated code. vets-api-server (puma) and vets-api-worker (sidekiq) are bundled into the same parent helm chart.
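
As a hedged sketch, an environment manifest might declare the packaged vets-api chart as a dependency and pass overrides through its values.yaml. The chart repository URL, version, and value keys below are illustrative assumptions, not the actual chart contents.

YAML
# Chart.yaml in an environment manifest (illustrative)
apiVersion: v2
name: vets-api-dev
version: 0.1.0
dependencies:
  - name: vets-api
    version: "0.x.x"                   # placeholder version
    repository: <vets-api-chart-repo>  # placeholder repository URL
---
# values.yaml overrides passed to the parent chart (keys are assumptions)
vets-api:
  web:
    replicas: 3      # puma (vets-api-server)
  worker:
    replicas: 2      # sidekiq (vets-api-worker)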

More on helm charts here.

Access

Access to the vets-api EKS applications is managed via GitHub teams (linked below). To obtain access, fill out a Vets-api ArgoCD terminal access request form. Note: prod access requires OCTO-DE approval and will take longer to obtain than access to the lower environments.

Vets API GitHub teams

Terminal access

Links to Vets API in ArgoCD (requires SOCKS)
  1. vets-api-dev

  2. vets-api-staging

  3. vets-api-sandbox

  4. vets-api-prod

Access the terminal via Argo:
  1. Navigate to http://argocd.vfs.va.gov/applications (requires SOCKS)

  2. Search for "vets-api-{env-here}" in the search bar

  3. Click on a vets-api-web-* pod (far right)

    • Note: Look for the pod icon

  4. A Terminal tab will appear on the far right

    • Note: If you get an error or don't see the tab, log out/in of ArgoCD. If that doesn’t work, double check that you are a member of the GitHub team for the environment you’re in.

Rails console access
  1. Follow the steps above

  2. Run bundle exec rails c

How to access the Rails Console via ArgoCD

Rails Console Access in ArgoCD

Vets API settings and secrets

With EKS, settings and secrets are configured via EKS resources and definitions.

Secret values

The vets-api deployment utilizes secret references via a combination of the ExternalSecret resource and ENV vars in the values.yaml. The ENV vars can then be referenced via Ruby ERB in the settings.local.yml values in the configMap definition. If your setting does not need to be secret, it can just be added to the settings.local.yml configMap definition in the values.yaml (see the "Creating or updating a non-sensitive value" section). Details on all of this below.

Take care and attention to detail when adding secrets to vets-api: a misconfigured secret in Parameter Store or in the code will cause a vets-api pod to fail.

Creating or updating a non-sensitive value:

A non-sensitive value is something that doesn’t need to be stored in AWS Parameter Store (for example, mock_debts: false or service_name: VBS). In this case, you can add the value to the settings.local.yml configMap section of values.yaml. This section is shown in the screenshot.

The appropriate section to add your non-sensitive value to is under configMap.data.settings.local.yml

settings.local.yml section of configMap in values.yaml
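
As a hedged illustration, a non-sensitive entry in that section might look like the sketch below. The surrounding keys and the settings namespace are assumptions; the real block lives under vets-api.common.configMaps in values.yaml.

YAML
# Illustrative sketch only; nesting keys are assumptions.
configMaps:
  configMap:
    data:
      settings.local.yml: |
        dmc:                     # hypothetical settings namespace
          mock_debts: false
          service_name: VBS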

Adding a cert

For adding certs, see Add certs as secrets to vets-api. It’s uncommon, but these instructions are for adding a cert or other item that needs to end up at a very specific mount path in the pod.

Steps to create a new secret value:

A secret value is a sensitive value that needs to be stored in AWS Parameter Store. Most items belong in settings-local-secrets, and you can follow the steps below to get your secret into vets-api (a hedged sketch of the resulting pieces follows the list). Steps for adding a settings-local-secret:

  1. Add your secret to Parameter Store:

CODE
aws ssm put-parameter --name /dsva-vagov/vets-api/dev/your_param_name_here --value your_secret_value_here --type SecureString --overwrite
  2. In the settings-local-secrets section of secrets.yaml, add an entry (key and name).

    1. The key and name can be added to the spec.data section of settings-local-secrets.

      settings-local-secrets in secrets.yaml

  3. Add a new entry to the settings-local-secrets definition in values.yaml. The name and path need to match the key and name added in the previous step. Include an env_var definition.
    Note: Be sure that the path and the name match exactly what you have placed in the ExternalSecret resource in the step above.

    1. vets-api.common.secrets.settings-local-secrets section of values.yaml

      settings-local-secrets in values.yaml

  4. In that same file, values.yaml, add your setting to the settings-configmap configMap definition and reference the ENV var you just created. (The settings.local.yml section uses ERB syntax.)

    1. This section is under vets-api.common.configMaps.configMap.data

      settings-configmap configMap definition in values.yaml
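
The block below is a hedged sketch of how those three pieces might fit together for a single secret, using placeholder names. The field layout is based on the descriptions above (key/name in spec.data, name/path/env_var in values.yaml, ERB in the configMap); check the actual secrets.yaml and values.yaml for the exact structure.

YAML
# 1. ExternalSecret entry in secrets.yaml (spec.data of settings-local-secrets)
data:
  - key: /dsva-vagov/vets-api/dev/your_param_name_here   # Parameter Store path
    name: your_param_name_here                           # key inside the resulting k8s Secret

# 2. Matching entry in values.yaml (settings-local-secrets)
secrets:
  settings-local-secrets:
    - name: your_param_name_here
      path: /dsva-vagov/vets-api/dev/your_param_name_here
      env_var: YOUR_PARAM_NAME_HERE

# 3. Reference the ENV var from settings.local.yml via ERB (configMap definition)
configMaps:
  configMap:
    data:
      settings.local.yml: |
        some_service:                                     # hypothetical settings key
          api_key: <%= ENV['YOUR_PARAM_NAME_HERE'] %>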

Steps to update/rename an existing value:

  1. If the parameter store secret path hasn't changed, just update the value in parameter store.

  2. If the parameter store secret path HAS changed:

    1. Update the path name in secrets.yaml.

    2. Update the corresponding env_var definition (path and/or env_var) in the vets-api-secrets definition in values.yaml.

How do the secrets work with the parent helm charts?

  1. An ExternalSecret Custom Resource Definition (CRD) was created here to pull in secrets from parameter store.

  2. ENV vars are created on the deployment resource by looping through the secrets definition in the values.yaml.

    1. The deployment resource:

      YAML
                {{- range $keys, $key := $root.Values.common.secrets }}
                {{- range $secrets, $secret := $key }}
                  - name: {{ $secret.env_var }}
                    valueFrom:
                      secretKeyRef:
                        name: {{ $keys }}
                        key:  {{ $secret.name }}
                {{- end }}
                {{- end }}
    2. The start of the secrets definition in values.yaml:

      YAML
          secrets:
            vets-api-secrets:
              - name: sidekiq_license
                path: /dsva-vagov/vets-api/common/sidekiq_license
                env_var: BUNDLE_ENTERPRISE__CONTRIBSYS__COM
            settings-local-secrets:
              - name: kms_key_id
                path: /dsva-vagov/vets-api/dev/kms_key_id
                env_var: KMS_KEY_ID
  3. This configMap definition references the configMaps definition in the values.yaml and renders a configMap based on it

Parameter Store updates will not trigger pods to be replaced to reload Secrets. Any changes made in Parameter Store will not be deployed until the next ArgoCD sync is applied.

A version can be added to the end of a parameter path to ensure the correct value is deployed to Vets-API.

Examples:

If the version is not defined, the latest version in Parameter Store will be used:

CODE
settings-local-secrets:
  - name: tt1_ssm_testing
    path: /dsva-vagov/vets-api/dev/tt1/testing
    env_var: TT1_SSM_TESTING

Even though there are 3 versions in the Parameter Store, Vets-API will use the 2nd version because it’s defined after the parameter path:

CODE
settings-local-secrets:
  - name: tt1_ssm_testing
    path: /dsva-vagov/vets-api/dev/tt1/testing:2
    env_var: TT1_SSM_TESTING

Vets API EKS deploy process

How it works

Vets API in EKS deploys from the k8s branch. Eventually, k8s will be merged into master, but the merge will occur AFTER all environments have been released. The deploy process consists of a combination of GitHub Actions, ECR, yaml manifests, and ArgoCD.

Deploy Process Overview

The following steps detail how changes are deployed to EKS:
  1. A change is committed to the k8s branch

    1. Note: A merge action currently exists, so anything merged to master automatically syncs into the k8s branch via this GitHub Action.

  2. This automatically kicks off a GHA to

    1. Build and push an image to ECR

    2. Update the image tag in the manifests repo - example here (see the sketch after this list)

  3. Argo is configured to autosync the vets-api application upon a change to the manifest file. (autosync_enabled defaults to true)

  4. Argo auto syncs the vets-api dev application (ArgoCD requires socks)

  5. Changes are deployed
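
For reference, the image tag bump mentioned in step 2 is just a YAML field update committed to the manifest repo. The sketch below uses assumed key names and a placeholder SHA rather than the real manifest structure.

YAML
# Illustrative only; key names and the SHA are placeholders.
vets-api:
  image:
    tag: "<commit-sha-from-k8s-branch>"   # updated automatically by the VA VSP BOT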

Again, Vets API utilizes the custom helm chart.

Deploy Process Details

After committing a change to the k8s (master) branch, you should be able to see when your change was deployed. Once you merge a change and the image is pushed to ECR, the manifest repo image_tag will be updated with the commit SHA of your change via the VA VSP BOT. Watch the autosync for the manifest commit message and SHA.

Example:

Commit SHA for Vets API merge to k8s branch

Vets API commit SHA

Vets API Manifest Tag updates via VA VSP BOT

Manifest Tag Updates

The Rolling Update

Vets API on EKS utilizes a rolling update pattern to ensure zero downtime and no disruption to service. This will incrementally replace pods with new ones, while gracefully draining connections on pods rolling out of service. See more on rolling updates here.
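
As a minimal sketch (the numbers here are assumptions, not the chart's actual settings), a Kubernetes Deployment opts into this behavior via its update strategy:

YAML
# Sketch of a Deployment update strategy; values are assumptions.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # extra pods allowed above the desired count during rollout
    maxUnavailable: 0    # never drop below the desired count, preserving zero downtime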

Bulkhead Deployment Pattern

The Bulkhead deployment pattern, utilized in our production environment, acts as a safeguard mechanism, compartmentalizing sections of Vets API through defined ingress routes. This provides fault tolerance: even if one set of pods has an issue, the overall application remains undisturbed and performance characteristics such as latency stay consistent. Currently, several latency-prone and high-traffic routes are directed to their own dedicated bulkheads.

Metrics related to the current bulkhead deployments can be viewed on this Datadog dashboard. We manage these bulkheads through ingress routes, service classes, and distinct pod deployments managed by ReplicaSet resources. Ultimately, we aim to have each distinct logical code grouping or product (think of the modules in Vets API) served by its own bulkhead, which would approximate fault-tolerant microservices. Currently, a number of routes benefit from the bulkhead deployment pattern, which provides benefits such as log segregation, increased resiliency, and simplified debugging. All “common” routes funnel to the vets-api-web pods. Detailed definitions of our existing bulkheads can be found here, defined under the webServices key in the manifest repo.
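
As a hedged illustration of the idea only (the actual schema lives in the manifest repo under webServices), a bulkhead definition might route a specific path prefix to its own set of pods:

YAML
# Hypothetical shape; keys, paths, and replica counts are assumptions.
webServices:
  web:                        # the default bulkhead for "common" routes (vets-api-web)
    replicas: 5
  feature-toggles:            # dedicated bulkhead for a high-traffic route
    replicas: 2
    paths:
      - /v0/feature_toggles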

See bulkhead image below:

The image below showcases our current bulkhead deployments, focusing particularly on the feature-toggles bulkhead. This structure uses a ReplicaSet to guarantee a consistent number of running pods within each bulkhead; the ReplicaSet actively preserves the desired replica count, ensuring resilience and constant availability. Every bulkhead scales autonomously based on custom Datadog metrics related to available puma threads. Alongside feature-toggles, the image also displays other operational bulkheads, as evident in their respective ReplicaSets and pods.

Bulkhead example showcasing the Feature Toggles Bulkhead

Bulkhead examples in ArgoCD UI

Resource Hook & Deployment Flow

For further details on the deployment rollout process and details around hook configuration and pre-sync ordering, see the “EKS Deployment Resource Hook Configuration & Deployment Flow” document.

Vets API EKS Architecture Diagram

Vets API EKS Deploy Process

NONE
    graph TD
    A[Vets API] -->|Commit to K8s branch| B{GitHub Actions}
    B -->|One| D[Build Image & Push to ECR]
    B -->|Two| E[Deploy]
    B -->|Three| H[Code Checks & Linting]
    E -->|Parse & Update yaml| F[Update Manifest File Image Tag]
    F -->|Commit| G[Argo Detects Change]
    G -->|Argo Sync| I[Changes Deployed]

ClamAV

Prior to EKS, ClamAV (the virus scanner) was deployed in the same process as Vets API. With EKS, ClamAV has been broken out into a sidecar deployment that lives on the vets-api server and worker pods. See the ClamAV repo for further details. Essentially, this new pattern extracts the ClamAV service out of Vets API, following the single responsibility principle.
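
Conceptually, the sidecar pattern looks like the sketch below: the ClamAV container runs alongside the vets-api container in the same pod and is reached over localhost. Image names and the puma port are placeholders; 3310 is clamd's default TCP port.

YAML
# Sketch of the sidecar pattern; image names are placeholders.
spec:
  containers:
    - name: vets-api-web
      image: <ecr-registry>/vets-api:<tag>
      ports:
        - containerPort: 3000        # puma (assumed port)
    - name: clamav
      image: <ecr-registry>/clamav:<tag>
      ports:
        - containerPort: 3310        # clamd; vets-api scans uploads over localhost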

ClamAV is updated on the hour, every hour, to ensure that the signature database is up to date, via the mirror-images.yml and ci.yml GitHub Actions. It follows the same deployment pattern as Vets API: images are pushed to ECR and the VA VSP BOT updates the manifest with the new image tag.

Containers on the Vets API web pod, including ClamAV

vets-api-web pod containers


If this problem persists, please contact our support.