
ClamAV Architecture, Behavior, and Failures (vets-api)


ClamAV scans uploaded files for viruses before they are stored or processed. This page explains how the scanning architecture works, how vets-api integrates with ClamAV, and how to diagnose and remediate common failures that can impact vets-api pod health or upload flows.


Overview

  • Goal: Ensure all file uploads are scanned for viruses without destabilizing vets-api pods.

  • Pattern: ClamAV runs as a sidecar container in each vets-api pod; Rails talks to it via a client library.

  • Risk: Kubernetes evaluates health per pod, so a failing ClamAV container can mark the whole pod unhealthy, causing restarts and reduced availability. See ClamAV error logs (requires Datadog access).

High-level architecture

Components

  • Vets-API Rails app

    • Uses Common::VirusScan for scanning.

    • Uses UploaderVirusScan (CarrierWave concern) and the Shrine plugin #validate_virus_free to hook scanning into upload flows.

  • ClamAV sidecar container

    • Runs alongside vets-api in the same pod.

    • Exposes the ClamAV daemon on a TCP port inside the pod.

    • Pod health is impacted by both the Rails and ClamAV containers.

  • vsp-infra-clamav repo (https://va.ghe.com/software/vsp-infra-clamav)

    • Houses the ClamAV Docker image and Kubernetes configuration used by vets-api.

    • Notable GitHub Actions:

      • mirror-images.yml: builds and pushes ClamAV images to ECR.

      • s3_sync.yml: builds the ClamAV image, extracts DBs, syncs them to S3.

vets-api application integration

Common::VirusScan

Location: lib/common/virus_scan.rb

  • API

    • Common::VirusScan.scan(file_path, upload_context: nil) -> true/false

    • Raises on hard failures (e.g., temp file missing, ClamAV unreachable)

  • Behavior

    • Verifies the temp file exists; raises "Failed to create temp file" if not.

    • Mock mode:

      • If Settings.clamav.mock is true, returns true immediately (used for non-prod/testing).

    • Collects file metadata for audit:

      • Hashed basename (SHA-256), file size, and content type (Marcel).

    • Measures scan duration using a monotonic clock.

    • Calls perform_scan(file_path) and expects a hash:

      • { safe: true/false, virus_name: '...' }

    • Emits a scan audit log:

      • Message: "ClamAV Virus Scan Audit".

      • Fields:

        • event: 'virus_scan'

        • user_uuid, ip_address from RequestStore.store['additional_request_attributes'] (currently set to nil due to PII concerns, pending further guidance)

        • file_name (hashed), file_size, content_type

        • scan_result: "clean" or "infected"

        • virus_name, scan_duration_ms, upload_context

    • On any exception:

      • Emits an error audit log (scan_result: 'error').

      • Re-raises the error to the caller.
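
The behavior listed above can be sketched in plain Ruby. This is a minimal illustration, not the contents of lib/common/virus_scan.rb: the method name audited_scan, the scanner: lambda (standing in for perform_scan), and the log: lambda (standing in for the Rails logger) are hypothetical. Content type detection via Marcel is omitted so the sketch runs on the standard library alone.

```ruby
require 'digest'

# Hypothetical stand-in for Common::VirusScan.scan. `scanner` replaces
# perform_scan and `log` replaces the Rails logger, so no Rails or ClamAV
# is needed to run this.
def audited_scan(file_path, scanner:, log:, upload_context: nil)
  raise 'Failed to create temp file' unless File.exist?(file_path)

  audit = {
    event: 'virus_scan',
    file_name: Digest::SHA256.hexdigest(File.basename(file_path)), # hashed, never raw
    file_size: File.size(file_path),
    upload_context: upload_context
  }
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  begin
    result = scanner.call(file_path) # expected shape: { safe:, virus_name: }
    audit[:scan_result] = result[:safe] ? 'clean' : 'infected'
    audit[:virus_name] = result[:virus_name]
    result[:safe]
  rescue StandardError
    audit[:scan_result] = 'error'
    raise # re-raise to the caller; the ensure block still logs the audit entry
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    audit[:scan_duration_ms] = (elapsed * 1000).round(2)
    log.call('ClamAV Virus Scan Audit', audit)
  end
end
```

A monotonic clock is used for the duration because wall-clock time can jump (NTP adjustments), which would distort scan_duration_ms.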

perform_scan

  • If file_path starts with clamav_tmp/:

    • Treats it as already in the ClamAV temp directory.

    • Sets mode 0640.

    • Calls ClamAV::PatchClient.new.scan_with_result(file_path).

  • Otherwise, behavior depends on the feature flag :clamav_scan_file_from_other_location:

    • Enabled:

      • Logs that it is creating a ClamAV tmp file.

      • Calls #scan_file_from_other_location(original_path):

        • Ensures Rails.root.join('clamav_tmp') exists.

        • Sets original file mode 0640.

        • Builds a unique temp path under clamav_tmp/.

        • Copies the original file to that path; verifies the copy exists.

        • Sets the temp file mode 0640.

        • Calls ClamAV::PatchClient.new.scan_with_result(temp_path).

        • Always attempts to delete the temp file (and logs success/failure).

    • Disabled:

      • Logs a warning: ClamAV scan from other locations is disabled.

      • Returns { safe: false, virus_name: nil }, which callers treat as unsafe.
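
The copy-then-scan path for files outside clamav_tmp/ can be sketched as follows. This is an illustration under assumptions, not the actual #scan_file_from_other_location: the method name scan_via_temp_copy, the tmp_dir: argument (standing in for Rails.root.join('clamav_tmp')), the scanner: lambda (standing in for ClamAV::PatchClient#scan_with_result), and the temp-file naming scheme are all hypothetical.

```ruby
require 'fileutils'
require 'securerandom'

# Hypothetical sketch: copy the upload into the ClamAV temp directory,
# scan the copy, and always attempt to clean the copy up afterwards.
def scan_via_temp_copy(original_path, tmp_dir:, scanner:)
  FileUtils.mkdir_p(tmp_dir)
  File.chmod(0o640, original_path)

  # A random prefix keeps concurrent uploads with the same name apart.
  temp_path = File.join(tmp_dir, "#{SecureRandom.hex(8)}-#{File.basename(original_path)}")
  FileUtils.cp(original_path, temp_path)
  raise "Copy to #{temp_path} failed" unless File.exist?(temp_path)

  File.chmod(0o640, temp_path)
  scanner.call(temp_path)
ensure
  # Runs on success and on failure; temp_path is nil if we failed before the copy.
  File.delete(temp_path) if temp_path && File.exist?(temp_path)
end
```

Putting the cleanup in ensure means a scan that raises (for example, because the daemon is unreachable) does not leave files behind in clamav_tmp/.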

UploaderVirusScan (CarrierWave integration)

Location: app/uploaders/uploader_virus_scan.rb

  • Inclusion

    • Included into CarrierWave uploaders that require virus scanning.

    • Registers a callback:

      • before(:store, :validate_virus_free)

  • Runtime behavior

    • Only active in production:

      • Immediately returns unless Rails.env.production?

    • Uses Common::FileHelpers.generate_clamav_temp_file(file.read) to write the upload to a ClamAV-readable temp file.

    • Calls Common::VirusScan.scan(temp_file_path).

    • Deletes the temp file after the scan.

    • If the scan result is false (infected or treated as unsafe):

      • Calls file.delete on the uploader file object.

      • Raises UploaderVirusScan::VirusFoundError, "Virus Found + #{temp_file_path}".
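
The three outcomes of the callback (skipped outside production, clean, infected) can be sketched without CarrierWave. Everything here is a stand-in for illustration: FakeUpload mimics the uploader's file object, the top-level VirusFoundError mimics UploaderVirusScan::VirusFoundError, the production: flag replaces Rails.env.production?, the scan: lambda replaces Common::VirusScan.scan, and the return symbols are invented for readability.

```ruby
require 'tempfile'

# Stand-in for UploaderVirusScan::VirusFoundError.
class VirusFoundError < StandardError; end

# Stand-in for the uploader's file object.
FakeUpload = Struct.new(:body, :deleted) do
  def read
    body
  end

  def delete
    self.deleted = true
  end
end

def validate_virus_free(file, production:, scan:)
  return :skipped unless production # mirrors the production-only guard

  # Write the upload to a temp file the scanner can read.
  temp = Tempfile.create('clamav_upload')
  temp.binmode
  temp.write(file.read)
  temp.close

  return :clean if scan.call(temp.path)

  # A false result covers both real infections and "treated as unsafe".
  file.delete
  raise VirusFoundError, "Virus Found + #{temp.path}"
ensure
  File.delete(temp.path) if temp && File.exist?(temp.path)
end
```

Note that a false result and a raised scan error are different paths: only the former deletes the uploaded file and raises VirusFoundError.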

Shrine validate_virus_free plugin

Location: lib/shrine/plugins/validate_virus_free.rb

  • Purpose

    • Shrine-based uploads (e.g., some form submissions) use this plugin to validate uploads are virus-free before persistence. It is the Shrine counterpart to CarrierWave’s UploaderVirusScan.

  • Behavior

    • Attachers call validate_virus_free(message: nil) (e.g., from a Shrine validation block).

    • Wraps the scan in a Datadog trace "Scan Upload for Viruses".

    • Downloads the Shrine file, writes it to a ClamAV temp path via Common::FileHelpers.generate_clamav_temp_file, then calls Common::VirusScan.scan(temp_file_path) (and, when implemented, can pass upload_context: for audit logging).

    • Deletes the temp file after the scan.

    • If the scan returns false: logs a virus-detected warning (with hashed file name and optional upload context from record.class.name), adds a validation error, and returns false. In development, a special message prompts starting clamd.

    • If the scan returns true: returns true (validation passes).

  • Audit logging

    • Common::VirusScan emits the same "ClamAV Virus Scan Audit" log for Shrine scans as for other callers.

ClamAV sidecar behavior

[Figure: screenshot of the ClamAV sidecar container terminal, highlighted in a vets-api web pod in ArgoCD]

Healthy behavior

  • Startup

    • ClamAV container starts alongside vets-api.

    • Loads virus databases (from the image or mounted data/S3-synced volume).

    • Binds to its configured TCP port and logs that it is ready.

  • Steady state

    • Occasional log lines for:

      • Definition updates (depending on configuration).

      • Internal housekeeping.

    • For each scan:

      • A short-lived log entry with request/response context.

    • Resource profile:

      • Spike in memory/CPU during DB load.

      • Relatively stable usage in steady state, with periodic spikes during scans.

  • From vets-api’s point of view

    • ClamAV::PatchClient calls return within a few hundred ms under normal load.

    • Common::VirusScan.scan returns:

      • true for clean files.

      • false for confirmed infections or when configured to treat non-scannable cases as unsafe.

    • Audit logs show scan_result: 'clean' with reasonable scan_duration_ms.

Common error patterns

Expected / benign

  • Detection of real or test malware (e.g., EICAR)

    • Result: { safe: false, virus_name: 'Eicar-Test-Signature' } or similar.

    • vets-api behavior:

      • Common::VirusScan.scan returns false.

      • UploaderVirusScan raises VirusFoundError; file is deleted.

    • This is expected behavior and not a ClamAV failure.

Unexpected / problematic

  • Daemon unreachable

    • Symptoms:

      • Connection errors (ECONNREFUSED, timeouts) from ClamAV::PatchClient.

      • Errors logged in Common::VirusScan and error audit events.

    • Impact:

      • Upload flows that rely on scanning fail.

      • If health checks are tied to ClamAV readiness, pods may be marked Unready or restart.

  • Database load failures

    • Symptoms:

      • ClamAV logs show DB load errors or repeated restarts.

    • Impact:

      • Scans may fail outright.

      • Downstream, vets-api sees exceptions or very slow responses.

  • High memory usage / OOM

    • Symptoms:

      • ClamAV container is OOMKilled by Kubernetes.

      • Frequent pod restarts.

    • Impact:

      • Reduced capacity during churn.

      • Possible spikes in 5xx errors for upload endpoints.
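
To tell "daemon unreachable" apart from the other patterns, one option is a direct probe using clamd's PING command, which a healthy daemon answers with PONG. The sketch below assumes it runs somewhere that can reach the sidecar's TCP port (for example, a Rails console in the same pod). The method name is hypothetical, and the host and port are assumptions: 3310 is clamd's default, but the sidecar may be configured differently.

```ruby
require 'socket'
require 'timeout'

# Returns true when a clamd daemon answers PING with PONG.
# Host and port defaults are assumptions; check the sidecar configuration.
def clamd_reachable?(host: '127.0.0.1', port: 3310, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) do |sock|
    sock.write("nPING\n") # the "n" prefix selects newline-delimited commands
    reply = Timeout.timeout(timeout) { sock.gets }
    reply.to_s.strip == 'PONG'
  end
rescue SystemCallError, SocketError, IOError, Timeout::Error
  # Connection refused, timeouts, and resets all count as unreachable.
  false
end
```

A false result here, combined with ECONNREFUSED errors in the vets-api logs, points at the sidecar rather than at the Rails integration.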

Debugging procedures (Kubernetes + vets-api)

1. Identify pods and containers in trouble

  • Check vets-api pods for:

    • Unready status, repeated restarts, or CrashLoopBackOff.

  • Inspect container-level status:

    • Confirm whether the ClamAV container is failing (CrashLoop, OOMKilled, failing probes) while Rails appears healthy.

2. Inspect ClamAV logs

  • View logs for the ClamAV container in an affected pod:

    • Look for:

      • DB load success/failure.

      • Port binding issues.

      • Repeated crashes/restarts.

      • Timeouts or resource exhaustion.

  • Correlate with vets-api logs:

    • Error audit logs from Common::VirusScan (scan_result: 'error').

    • Exceptions originating from ClamAV::PatchClient or Common::VirusScan.scan.

    • Warnings such as "Clamav scan from other location disabled".

3. Validate vets-api configuration

  • Feature flags

    • Flipper.enabled?(:clamav_scan_file_from_other_location):

      • If disabled, only files already under clamav_tmp/ are scanned.

      • Any other file path returns { safe: false }, which uploaders treat as infected.

    • Settings.clamav.mock:

      • If true, scans always pass (returns true) without talking to ClamAV.

      • Acceptable for local/dev/test, not for production.

  • Temp file handling

    • Confirm Common::FileHelpers.generate_clamav_temp_file writes to a path that:

      • Is reachable and readable by ClamAV.

      • Resides in a filesystem with enough space.

    • Ensure clamav_tmp/ exists and has correct owner/mode.

  • Audit logs

    • Use "ClamAV Virus Scan Audit" entries to:

      • Confirm scans are being triggered for specific upload endpoints.

      • Examine scan_duration_ms for latency issues.

      • Spot patterns (e.g., errors only for certain file sizes or types).
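
As a rough aid for the audit-log checks above, the following sketch tallies scan_result values and collects slow scans from exported log lines. It assumes the export is one JSON object per line with the audit fields at the top level. That format, the method names, and the 1000 ms threshold are assumptions, so adjust them to match your actual Datadog export.

```ruby
require 'json'

# Returns the parsed entry for a virus_scan audit line, or nil for anything else.
def parse_audit_line(line)
  entry = JSON.parse(line)
  entry.is_a?(Hash) && entry['event'] == 'virus_scan' ? entry : nil
rescue JSON::ParserError
  nil # skip lines that are not JSON
end

# Counts entries per scan_result and lists scans slower than slow_ms.
def summarize_scan_audits(lines, slow_ms: 1000)
  entries = lines.filter_map { |line| parse_audit_line(line) }
  {
    counts: entries.map { |e| e['scan_result'] }.tally,
    slow: entries.select { |e| e['scan_duration_ms'].to_f > slow_ms }
  }
end
```

Grouping the slow entries by file_size or content_type is a natural next step when errors seem to cluster around certain uploads.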

4. Common remediation steps

  • ClamAV container repeatedly failing

    • Check:

      • Image version changes.

      • ClamAV configuration.

      • Resource limits/requests.

    • Mitigations:

      • Increase memory/CPU.

      • Adjust DB loading or update behavior if too heavy.

      • Roll back to a previous known-good image if a new release is faulty.

  • Connection errors from Rails

    • Validate:

      • ClamAV daemon is listening on expected host/port.

      • No network policy changes blocking traffic inside the pod.

    • Consider:

      • Restarting affected pods.

      • Temporarily enabling Settings.clamav.mock only if acceptable from a risk standpoint (and documenting the window).

  • Slow scans

    • Look for:

      • Large or numerous concurrent uploads.

      • High CPU contention on the ClamAV sidecar container(s).

    • Options:

      • Increase resources.

      • Rate-limit or size-limit uploads upstream.

      • Add retry behavior or backpressure in upload flows.
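
If retry behavior is added, one possible shape is a small wrapper with exponential backoff that retries only on connection-level errors. This is a sketch, not existing vets-api code: the method name, the attempt count, the delays, and the list of retryable errors are all assumptions. A false scan result is deliberately not retried, since it is a valid answer rather than a transient failure.

```ruby
require 'timeout'

# Errors treated as transient in this sketch; the list is an assumption.
RETRYABLE_SCAN_ERRORS = [Errno::ECONNREFUSED, Errno::ECONNRESET, Timeout::Error].freeze

# Runs the block, retrying on transient errors with exponential backoff.
# `sleeper` is injectable so callers (and tests) can avoid real sleeps.
def scan_with_retry(attempts: 3, base_delay: 0.2, sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    yield
  rescue *RETRYABLE_SCAN_ERRORS
    raise if tries >= attempts # out of attempts: surface the original error

    sleeper.call(base_delay * (2**(tries - 1))) # 0.2s, 0.4s, ...
    retry
  end
end
```

Because requests block until the scan completes, every retry adds to request latency, so the attempt count and delays should stay small.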

Impact on vets-api pods and request handling

  • Pod health

    • Any ClamAV sidecar failure can:

      • Fail readiness/liveness probes.

      • Trigger pod restarts and churn.

    • Systemic ClamAV issues (bad image, DB problems) can reduce overall cluster capacity.

  • Request behavior

    • For endpoints using UploaderVirusScan:

      • Requests block until Common::VirusScan.scan completes.

      • If:

        • Scan returns true: upload proceeds and file is stored.

        • Scan returns false: upload is rejected with VirusFoundError; file deleted.

        • Scan raises error: request typically fails with a 5xx (depending on controller handling).

  • Observability

    • Combine:

      • Application logs ("ClamAV Virus Scan Audit", Rails exceptions).

      • Sidecar logs (daemon startup, DB load, errors).

      • GitHub Actions workflows for image/DB pipeline health.

ClamAV image and database pipelines (vsp-infra-clamav)

Repo: vsp-infra-clamav

Image mirroring to ECR (mirror-images.yml)

  • Trigger

    • Daily cron around 12:30 PM Eastern (with separate EST/EDT entries) plus on-demand workflow_dispatch.

    • time-gated-job ensures that scheduled runs proceed only when the Eastern hour is 12; manually triggered runs always proceed.

  • Behavior

    • prepare-build:

      • Checks out the repo.

      • Reads versions.json and exports .components as a JSON array (config).

    • mirror:

      • Matrix over config; each entry has version and repo.

      • Sets NOW (e.g., YYYY-MM-DD-HH) for tagging.

      • Configures AWS credentials and logs into ECR in us-gov-west-1.

      • Builds the ClamAV Docker image from ./Dockerfile with:

        • APP_VERSION=${{ matrix.versions.version }}

        • REPO=${{ matrix.versions.repo }}

      • Pushes the image to:

        • ${registry}/dsva/clamav:${GITHUB_SHA}-${NOW}

  • Failure handling

    • notify-on-failure sends a Slack alert to channel #platform-cop-be-notifications when the mirror job fails:

      • Explains that vets-api cannot receive updated ClamAV images and that downstream “Release and Update Manifests” workflows are blocked.

      • Suggests checking:

        • Docker build errors.

        • freshclam issues (e.g., CDN/network).

        • ECR login/push permissions.

        • Base image pulls (e.g., clamav/clamav:1.4 from Docker Hub).
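
When matching a running sidecar image to a workflow run, it can help to reconstruct the tag format described above. The sketch below only illustrates the ${registry}/dsva/clamav:${GITHUB_SHA}-${NOW} pattern; the method name is hypothetical, and using UTC for NOW is an assumption, since the time zone the workflow uses is not stated here.

```ruby
# Builds an image reference in the documented tag format.
# The UTC default for `now` is an assumption, not confirmed workflow behavior.
def clamav_image_tag(registry:, git_sha:, now: Time.now.utc)
  "#{registry}/dsva/clamav:#{git_sha}-#{now.strftime('%Y-%m-%d-%H')}"
end

# Example with placeholder values:
clamav_image_tag(registry: 'registry.example', git_sha: 'abc123', now: Time.utc(2024, 1, 2, 3))
# => "registry.example/dsva/clamav:abc123-2024-01-02-03"
```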

Virus database sync to S3 (s3_sync.yml)

  • Trigger

    • Twice daily:

      • Cron for 12:01 AM / 12:01 PM Eastern (EST + EDT variants).

    • Also supports manual workflow_dispatch.

    • time-gated-job:

      • If manually triggered: always enables the job.

      • If scheduled: only continues when current Eastern hour is 0 or 12.

  • Behavior

    • upload_to_s3 (runs only when gated output is true):

      • Checks out the s3-upload branch.

      • Assumes an AWS role via OIDC in us-gov-west-1.

      • Logs into Docker Hub.

      • Builds the ClamAV Docker image from ./Dockerfile and loads it locally as clamav-image:latest.

      • Creates a container: clamav-container.

      • Copies database files out of the container into a local database/ directory:

        • data/bytecode.cvd

        • data/main.cvd

        • data/daily.cvd

      • Uploads database/ to AWS S3.

  • Failure handling

    • notify-on-failure sends a Slack alert to channel #platform-cop-be-notifications when the S3 sync fails:

      • Explains that ClamAV DBs (main.cvd, daily.cvd, bytecode.cvd) were not updated in S3.

      • Suggests checking:

        • Docker build/container creation.

        • docker cp for DB extraction.

        • AWS credentials, bucket permissions, and network to us-gov-west-1.

      • Calls out the risk of stale virus definitions for any infrastructure pulling DBs from dsva-vetsgov-utility-clamav.
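
The time-gated-job logic used by both workflows can be illustrated in Ruby. GitHub cron schedules run in UTC, which is why separate EST and EDT entries exist: on any given day only one of them lands on the intended Eastern hour, and the gate drops the other. This sketch is not the workflow's implementation. The method names are hypothetical, and daylight saving is passed in explicitly rather than looked up from a time zone database.

```ruby
# Converts a UTC time to the Eastern hour using fixed offsets
# (EDT = UTC-4, EST = UTC-5). The caller states whether DST is in effect.
def eastern_hour(utc_time, dst:)
  utc_time.getlocal(dst ? '-04:00' : '-05:00').hour
end

# Manual runs always pass; scheduled runs pass only in an allowed Eastern hour.
def gate_open?(event_name:, eastern_hour:, allowed_hours:)
  return true if event_name == 'workflow_dispatch'

  allowed_hours.include?(eastern_hour)
end

# s3_sync.yml example in summer: the 16:01 UTC entry is 12:01 PM EDT and runs,
# while the 17:01 UTC entry (meant for EST) is 1:01 PM EDT and is gated out.
summer_edt_entry = Time.utc(2024, 7, 1, 16, 1)
summer_est_entry = Time.utc(2024, 7, 1, 17, 1)
gate_open?(event_name: 'schedule', eastern_hour: eastern_hour(summer_edt_entry, dst: true), allowed_hours: [0, 12]) # => true
gate_open?(event_name: 'schedule', eastern_hour: eastern_hour(summer_est_entry, dst: true), allowed_hours: [0, 12]) # => false
```

For mirror-images.yml the same gate applies with allowed_hours set to [12].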

Relationship to vets-api reliability

  • If image mirroring fails:

    • New ClamAV image versions are not pushed to ECR.

    • vets-api environments may:

      • Continue using older images with outdated ClamAV or OS components.

      • Experience deployment failures (ErrImagePull) if no valid image exists.

  • If DB sync fails:

    • S3 bucket may contain stale DB files.

    • ClamAV sidecars that depend on S3 for DBs will run with old virus signatures.

    • Virus scanning remains functional but is less effective against new threats.

Monitoring

All monitoring links for ClamAV point to Datadog and require Datadog access.

