
Rate Limiting Guide in Vets API

A complete guide: when to add rate limiting and how to implement it.

Part 1: When Should Your Team Add Rate Limiting?

Rate limiting in vets-api is opt-in. Not every endpoint needs it. Use the guidance below to decide whether to add a Rack::Attack rule for your endpoint.

First: Understand the Two Types of 429s


Before adding rate limiting, distinguish between two different sources of 429 errors in vets-api:

| Type | Description |
| --- | --- |
| Rack::Attack throttles (inbound) | vets-api itself rejects requests before they reach your endpoint. Protects vets-api from abusive or excessive inbound traffic. |
| Upstream 429s (outbound) | An external service (e.g. Lighthouse) returns a 429 to vets-api because your service is calling it too frequently. These are NOT solved by Rack::Attack. They require retry logic, caching, or coordination with the upstream service. |

Real Example (May 2026): benefits_documents/service generated 376 HTTP 429 errors over a 3-week period. Investigation showed the referrers were almost entirely va.gov/track-claims/your-claim-letters: real veterans checking their claim letters, many immediately after login. This was Lighthouse rate limiting vets-api’s outbound calls, not inbound abuse. Adding a Rack::Attack rule here would have blocked legitimate users. The correct fix is caching, retry logic, or working with the Lighthouse team to increase their rate limit.
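As a rough illustration of the retry logic mentioned above for upstream 429s, here is a minimal sketch of retrying with exponential backoff. The UpstreamRateLimited error class, the with_upstream_retries helper, and the delay values are all hypothetical and are not part of vets-api; a real client would raise its own error type when the upstream returns a 429.

```ruby
# Hypothetical error class; a real client would raise its own error on HTTP 429.
class UpstreamRateLimited < StandardError; end

# Retries the block when the upstream signals a 429, waiting longer each time.
# `sleeper` is injectable so the example can run without real delays.
def with_upstream_retries(max_attempts: 3, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue UpstreamRateLimited
    raise if attempts >= max_attempts # give up and surface the error

    sleeper.call(base_delay * (2**(attempts - 1))) # 0.5s, 1s, 2s, ...
    retry
  end
end

# Usage: the first two calls are rate limited, the third succeeds.
delays = []
result = with_upstream_retries(sleeper: ->(s) { delays << s }) do |attempt|
  raise UpstreamRateLimited if attempt < 3

  'claim letters'
end
```

Retries only help with short bursts. If the upstream limit is exceeded continuously, caching or a higher upstream limit is still needed.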

How to tell the difference: If you’re seeing 429s in your logs but your endpoint isn’t in rack_attack.rb, check the referrer and controller in Datadog. User-facing referrers (e.g. va.gov/track-claims/*) with real controller names point to an upstream issue, not inbound abuse.

Should You Add a Rack::Attack Rule?


Ask yourself the following questions:

1. Is your endpoint unauthenticated or lightly authenticated?

Unauthenticated endpoints are the highest priority for rate limiting. Without authentication, there’s no barrier to abuse. See representation_management/next_steps_email as an example — without throttling it functioned as an open email relay.

2. Does your endpoint trigger expensive downstream calls?

If a single request fans out to multiple upstream services (e.g. Lighthouse FHIR APIs), a high request rate can cascade into upstream rate limit exhaustion. Consider rate limiting to protect both vets-api and your upstream dependencies.

3. Does your endpoint accept file uploads or send external communications?

File upload endpoints and anything that sends emails, notifications, or triggers external actions should be rate limited to prevent abuse and resource exhaustion.

4. Has your endpoint experienced a traffic spike or near-DoS incident?

Most existing Rack::Attack rules were added reactively after incidents. Don’t wait for an incident — if your endpoint is publicly accessible and handles sensitive operations, add a rule proactively.

5. Is your endpoint part of a form submission flow?

Form submission endpoints (POST) are good candidates for rate limiting. A legitimate user submitting a form rarely needs more than 15–30 submissions per minute.

You Probably Don’t Need Rack::Attack If…


  • Your endpoint is fully authenticated and only accessible to credentialed users

  • Your endpoint is read-only with low computational cost and no upstream fan-out

  • Traffic to your endpoint is low and stable with no history of abuse

  • You’re seeing 429s that trace back to upstream services rather than inbound request volume

Quick Decision Reference


| Scenario | Priority | Suggested Limit |
| --- | --- | --- |
| Unauthenticated POST (email, form) | High | 5–15/min |
| File upload | High | 8/5min |
| Form submission | Medium | 15–30/min |
| Read endpoint with upstream calls | Medium | 20–30/min |
| High-volume lookup (e.g. facility search) | Medium | 30/min |
| Authenticated, read-only, low traffic | Low/None | Probably no rule needed |

When in doubt, reach out to the Platform SRE team in #vfs-platform-support on Slack or open a support request. They can help review your endpoint’s traffic patterns in Datadog and recommend an appropriate limit.

Part 2: How to Implement Rate Limiting


Rate limiting is configured in config/initializers/rack_attack.rb using the Rack::Attack gem. There is no global rate limiting — it is added per-endpoint as needed.

When to Add Rate Limiting (Checklist)

Rate limiting should be considered when:

  • Your endpoint is publicly accessible

  • The endpoint calls expensive upstream services

  • The endpoint could be abused to cause denial of service

  • A Staging Review or Security Review requires it

Reference: The Security Review checklist includes “Rate limits defined” as a required item.

How to Add Rate Limiting

Add a throttle block to config/initializers/rack_attack.rb:

throttle('your_endpoint_name/ip', limit: 10, period: 1.minute) do |req|
  req.remote_ip if req.path.starts_with?('/your/endpoint/path')
end
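To make the role of the block's return value concrete, here is a small in-memory stand-in for a throttle rule. It is a sketch and not Rack::Attack itself: the FixedWindowThrottle class and the Request struct are hypothetical. It models the behavior as we understand it: a truthy return value becomes the counting key, nil means the rule does not apply, and counts reset with each time window. Plain Ruby's start_with? is used here in place of the ActiveSupport alias starts_with?.

```ruby
# Minimal in-memory stand-in for a throttle rule (not Rack::Attack itself).
class FixedWindowThrottle
  def initialize(limit:, period:, &discriminator)
    @limit = limit
    @period = period
    @discriminator = discriminator
    @counts = Hash.new(0)
  end

  # Returns true when the request should be rejected with a 429.
  def throttled?(request, now:)
    key = @discriminator.call(request)
    return false unless key # nil/false: the rule does not apply to this request

    window = now.to_i / @period # requests in the same window share a counter
    (@counts[[key, window]] += 1) > @limit
  end
end

# Hypothetical request object exposing only what the rule needs.
Request = Struct.new(:remote_ip, :path)

rule = FixedWindowThrottle.new(limit: 2, period: 60) do |req|
  req.remote_ip if req.path.start_with?('/your/endpoint/path')
end

hit = Request.new('203.0.113.7', '/your/endpoint/path/search')
other = Request.new('203.0.113.7', '/v0/status')

# Three requests in the same minute: the third exceeds the limit of 2.
results = 3.times.map { rule.throttled?(hit, now: 120) }
```

Requests to other paths from the same IP are never counted, and the counter starts over once the next window begins.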

Configuration Options

| Parameter | Description | Notes |
| --- | --- | --- |
| limit | Maximum requests allowed in the period | |
| period | Time window | e.g. 1.minute, 5.minutes |
| req.remote_ip | Use this (not req.ip) since we’re behind a load balancer | Preferred over req.ip |
| req.path | Can use == for exact match or .starts_with? for prefix | |
| req.get? / req.post? | Optional; filters by HTTP method | |
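The difference between prefix and exact path matching is easy to overlook, so here is a short sketch of both styles combined with a method filter. The Request struct and the /v0/example paths are hypothetical, and start_with? stands in for the ActiveSupport alias starts_with? so the snippet runs in plain Ruby.

```ruby
# Hypothetical request object exposing only what the rules need.
Request = Struct.new(:remote_ip, :path, :request_method) do
  def post?
    request_method == 'POST'
  end
end

# Prefix match: covers nested routes, but also any sibling path sharing the prefix.
prefix_rule = ->(req) { req.remote_ip if req.path.start_with?('/v0/example') }

# Exact match plus method filter: only POSTs to this one path are counted.
exact_post_rule = ->(req) { req.remote_ip if req.path == '/v0/example' && req.post? }

nested = Request.new('198.51.100.4', '/v0/example/123', 'GET')
sibling = Request.new('198.51.100.4', '/v0/example_export', 'GET')
post = Request.new('198.51.100.4', '/v0/example', 'POST')

prefix_rule.call(nested)      # counted under '198.51.100.4'
prefix_rule.call(sibling)     # also counted, which may be unintended
exact_post_rule.call(nested)  # nil, so not counted
exact_post_rule.call(post)    # counted under '198.51.100.4'
```

If a prefix rule should not catch sibling routes, matching on the prefix with a trailing slash or using an exact match is one way to narrow it.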

What Happens When Rate Limited

  • Returns HTTP 429 Too Many Requests

  • Includes headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset

  • Response body: “throttled”
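A client can use these headers to decide how long to back off. The sketch below assumes that X-RateLimit-Reset holds a Unix timestamp in seconds; that format is an assumption, so confirm it against a real throttled response before relying on it. The seconds_until_reset helper is hypothetical.

```ruby
# Seconds a client should wait before retrying, or 0 if no wait is needed.
# Assumes X-RateLimit-Reset is a Unix timestamp in seconds (unverified).
def seconds_until_reset(status, headers, now: Time.now)
  return 0 unless status == 429

  reset_at = Integer(headers.fetch('X-RateLimit-Reset'))
  [reset_at - now.to_i, 0].max # never return a negative wait
end

throttled_headers = {
  'X-RateLimit-Limit' => '10',
  'X-RateLimit-Remaining' => '0',
  'X-RateLimit-Reset' => '1060'
}

wait = seconds_until_reset(429, throttled_headers, now: Time.at(1000))
```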

Part 3: Determining the Right Rate Limit


For New Endpoints (No Existing Traffic Data)

When launching a new endpoint, you won’t have production traffic data to analyze. Here’s how to approach rate limiting without historical data:

Step 1: Model the User Journey

Map out realistic usage scenarios. Your rate limit should accommodate the power user scenario with headroom.

| Scenario | Calculation | Requests/min |
| --- | --- | --- |
| Normal user | 2 API calls per page load × 3 pages/min | 6 |
| Power user | Rapid searching/filtering: 10 actions/min × 2 API calls per action | 20 |
| Automated refresh | Page polls every 30 seconds | 2 |
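The arithmetic behind this table can be written down so the assumptions are explicit. This is a sketch only: the requests_per_minute helper is hypothetical, and the headroom multiplier of 2 is an assumed value chosen for illustration.

```ruby
# Requests per minute for one usage scenario.
def requests_per_minute(actions_per_minute:, calls_per_action:)
  actions_per_minute * calls_per_action
end

scenarios = {
  normal_user: requests_per_minute(actions_per_minute: 3, calls_per_action: 2),
  power_user: requests_per_minute(actions_per_minute: 10, calls_per_action: 2),
  automated_refresh: requests_per_minute(actions_per_minute: 2, calls_per_action: 1)
}

peak = scenarios.values.max              # the power user drives the limit
headroom = 2.0                           # assumed multiplier, adjust per endpoint
suggested_limit = (peak * headroom).ceil # starting point, to be revisited with data
```

Several users can share one IP (for example behind a corporate or hospital network), so treat the result as a floor rather than a precise target.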

Step 2: Find a Similar Endpoint for Reference

| Your Endpoint Type | Similar Existing Endpoint | Their Limit |
| --- | --- | --- |
| Search/lookup | facilities_api/v2/va | 30/min |
| Form submission | education_benefits_claims | 15/min |
| File operations | vic/profile_photo_attachments | 8/5min |
| Status/polling | medical_copays | 20/min |

Step 3: Check Upstream Service Constraints

If your endpoint calls external services, their limits set your ceiling:

  • Your rate limit ≤ Upstream service limit / Expected concurrent users

  • Contact the upstream service team to understand their constraints.
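As a worked example of the ceiling formula above, the sketch below divides an upstream limit across the expected number of concurrent users. The per_client_ceiling helper and the numbers (600 requests/min upstream, 20 concurrent users) are hypothetical.

```ruby
# Highest per-client limit that keeps worst-case combined traffic within the upstream limit.
def per_client_ceiling(upstream_limit_per_min:, expected_concurrent_users:)
  raise ArgumentError, 'expected_concurrent_users must be positive' unless expected_concurrent_users.positive?

  upstream_limit_per_min / expected_concurrent_users # integer division rounds down
end

ceiling = per_client_ceiling(upstream_limit_per_min: 600, expected_concurrent_users: 20)

# Worst case: every concurrent user sends requests at the ceiling simultaneously.
worst_case_total = ceiling * 20
```

This assumes one upstream call per inbound request. If a request fans out to several upstream calls, divide the ceiling further by that number.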

Step 4: Start High and Plan to Adjust

Recommended approach for new endpoints:

# Phase 1: Launch with permissive limit (2-3x expected peak usage)
throttle('new_endpoint/ip', limit: 60, period: 1.minute) do |req|
  req.remote_ip if req.path.starts_with?('/v0/new_endpoint')
end

Then follow this timeline:

| Week | Action |
| --- | --- |
| 1–2 | Monitor traffic patterns in Datadog, no changes |
| 3 | Analyze P95 usage, identify whether the limit is too high or too low |
| 4+ | Adjust limit based on actual data |

Step 5: Add Monitoring From Day One

Deploy with Datadog monitoring so you can adjust quickly:

# In your controller or service
# Note: a per-IP tag is high-cardinality; drop it if metric volume becomes a concern
StatsD.increment('api.new_endpoint.request', tags: ["ip:#{request.remote_ip}"])

Step 6: Document Your Assumptions

In your PR, document:

  • Expected user behavior and request patterns

  • Similar endpoints used as reference

  • Upstream service constraints (if any)

  • Plan for adjusting limits post-launch

Example PR description:

Rate limit set to 30/min based on:

• Similar to facilities_api endpoint (30/min)

• Expected max 10 requests/min for power users

• Upstream service X has 100/min limit

• Will review after 2 weeks of production traffic

For Existing Endpoints (With Traffic Data)

If your endpoint already exists and has traffic, you can use Datadog to make data-driven decisions.

Step 1: Analyze Expected User Behavior

Think through the user journey: How many times would a legitimate user hit this endpoint in a session? Is it called once per page load? Multiple times during form submission? Are there any frontend polling patterns?

Example: If a user searches for facilities and might refine their search 5–6 times, and each search makes 2 API calls, that’s roughly 10–12 requests in a few minutes for an active user.

Step 2: Check Existing Traffic in Datadog

Before adding rate limiting, query Datadog for current traffic patterns:

# Requests per IP per minute

sum:vets_api.requests{path:/your/endpoint/*} by {client_ip}.rollup(count, 60)

Look for:

  • P95/P99 requests per IP per minute — what do normal heavy users look like?

  • Max requests per IP — what do potential abusers look like?

  • Distribution — is there a clear gap between normal and abnormal traffic?
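Once per-IP request counts have been exported, the percentiles in this list can be computed directly. The sketch below uses the nearest-rank method; the percentile helper and the sample data are hypothetical and stand in for values taken from a Datadog query.

```ruby
# Nearest-rank percentile: the smallest value with at least pct% of samples at or below it.
def percentile(values, pct)
  raise ArgumentError, 'values must not be empty' if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length).fdiv(100).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical peak requests/min per IP over the observation period.
peak_per_ip = [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 14, 18, 95]

p95 = percentile(peak_per_ip, 95)
p99 = percentile(peak_per_ip, 99)
max = peak_per_ip.max
```

In this made-up data set the heaviest normal clients peak at 18 requests/min and a single outlier reaches 95, so a limit somewhere in that gap would separate the two groups.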

Step 3: Start Permissive, Then Tighten

| Phase | Limit | Purpose |
| --- | --- | --- |
| 1. Monitor only | None | Add logging/metrics to track what would be rate limited |
| 2. High limit | 100/min | Catch only obvious abuse |
| 3. Tighten | 30–50/min | Based on observed normal traffic |
| 4. Final | 10–20/min | If needed, based on upstream limits |
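Data gathered in the monitor-only phase can be replayed against candidate limits to see who would have been affected before anything is enforced. This is a sketch under assumed data: the would_throttle helper, the IP addresses, and the counts are hypothetical.

```ruby
# Hypothetical peak requests/min per IP observed during the monitor-only phase.
peak_per_ip = {
  '198.51.100.1' => 4,
  '198.51.100.2' => 12,
  '198.51.100.3' => 27,
  '198.51.100.4' => 45,
  '203.0.113.9' => 400
}

# IPs whose busiest minute would have exceeded the given limit.
def would_throttle(peak_per_ip, limit)
  peak_per_ip.select { |_ip, peak| peak > limit }.keys
end

# Number of affected IPs for each candidate limit from the phases above.
impact = [100, 50, 30, 20].to_h { |limit| [limit, would_throttle(peak_per_ip, limit).size] }
```

A sharp jump in affected IPs between two candidate limits is a sign that the tighter one starts to reach legitimate users.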

Step 4: Consider Upstream Service Limits

If your endpoint calls an external service (PPMS, Lighthouse, etc.):

  • What are their rate limits?

  • Your limit should be lower than theirs to protect the upstream service. A per-IP limit does not cap combined traffic across all clients, so leave a comfortable margin

Step 5: Environment-Specific Limits

You can exclude non-production environments from rate limiting:

throttle('your_endpoint/ip', limit: 10, period: 1.minute) do |req|
  req.remote_ip if req.path.starts_with?('/your/endpoint') &&
                   !Settings.vsp_environment.match?(/local|development|staging/)
end
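The environment check in the rule above can be pulled into a small predicate to see which environment names end up exempt. This is a sketch: the enforce_rate_limit? helper and the constant are hypothetical, and the environment name is passed in as a plain string rather than read from Settings.vsp_environment.

```ruby
# Environments where the rule is skipped, mirroring the pattern in the rule above.
EXEMPT_ENVIRONMENTS = /local|development|staging/

# True when the throttle should apply in the given environment.
def enforce_rate_limit?(environment)
  !environment.match?(EXEMPT_ENVIRONMENTS)
end

enforce_rate_limit?('production') # enforced
enforce_rate_limit?('staging')    # exempt
enforce_rate_limit?('sandbox')    # enforced, since it is not in the pattern
```

Because the pattern is unanchored, any environment name that merely contains one of these words is also exempt. Anchoring it with \A and \z would restrict the match to exact names.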

Part 4: Reference


Safe Starting Points

| Endpoint Type | Safe Starting Limit | Rationale |
| --- | --- | --- |
| Read-only GET | 30–60/min | Users may browse/search repeatedly |
| Form submission POST | 15–20/min | Deliberate actions, allow for retries |
| File upload | 10/5min | Heavy operations, natural user throttling |
| Shared with other apps | Coordinate with teams first | Avoid breaking partner integrations |

Existing Rate Limits in rack_attack.rb

| Endpoint | Limit | Period | Notes |
| --- | --- | --- | --- |
| facilities_api/v2/va | 30 | 1 min | Added after DoS incident |
| facilities_api/v2/ccp/provider | 8 | 1 min | PPMS protection |
| vic/profile_photo_attachments (GET) | 8 | 5 min | Download limit |
| vic/profile_photo_attachments (POST) | 8 | 5 min | Upload limit |
| vic/supporting_documentation_attachments | 8 | 5 min | Upload limit |
| vic/vic_submissions | 10 | 1 min | Form submission |
| check_in | 10 | 1 min | Excludes local/dev/staging |
| medical_copays (GET) | 20 | 1 min | Read operations |
| education_benefits_claims (POST) | 15 | 1 min | Form submission |
| form214192 (POST) | 30 | 1 min | Form submission |
| form21p530a (POST) | 30 | 1 min | Form submission |
| form210779 (POST) | 30 | 1 min | Form submission |
| form212680 (POST) | 30 | 1 min | Form submission |
| vaos/v2/appointments (GET/POST/PUT) | 30 | 1 min | VAOS appointments |
| vaos/v2/providers (GET) | 30 | 1 min | VAOS providers |
| vaos/v2/locations (GET) | 30 | 1 min | VAOS clinics |
| vaos/v2/community_care/eligibility (GET) | 30 | 1 min | VAOS CC eligibility |
| vaos/v2/eligibility (GET) | 30 | 1 min | VAOS patient eligibility |
| vaos/v2/scheduling/configurations (GET) | 30 | 1 min | VAOS scheduling |
| vaos/v2/facilities (GET) | 30 | 1 min | VAOS facilities |
| vaos/v2/relationships (GET) | 30 | 1 min | VAOS relationships |
| ask_va_api/v0/zip_state_validation (POST) | 60 | 1 min | Production only |
| ask_va_api/v0/diagnostics (GET) | 30 | 1 min | |
| representation_management/v0/next_steps_email (POST) | 5 | 1 min | Per IP; prevents open relay |
| representation_management/v0/next_steps_email (POST) | 3 | 1 hour | Per destination email address |
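The last two rows show one endpoint protected by two rules with different keys. The sketch below illustrates how such discriminators might differ. It is not the actual vets-api code: the Request struct, the leading slash on the path, and the email_address parameter name are assumptions made for illustration.

```ruby
# Hypothetical request object exposing only what the rules need.
Request = Struct.new(:remote_ip, :path, :request_method, :params) do
  def post?
    request_method == 'POST'
  end
end

# Path taken from the table above; the leading slash is assumed.
NEXT_STEPS_PATH = '/representation_management/v0/next_steps_email'

# Key by client IP: limits how fast one client can send requests.
by_ip = ->(req) { req.remote_ip if req.post? && req.path == NEXT_STEPS_PATH }

# Key by destination address: limits how often one mailbox can be targeted,
# even when requests arrive from many IPs. The param name is an assumption.
by_email = lambda do |req|
  next unless req.post? && req.path == NEXT_STEPS_PATH

  email = req.params['email_address'].to_s.strip.downcase
  email unless email.empty? # a missing address yields nil, so nothing is counted
end

request = Request.new('203.0.113.7', NEXT_STEPS_PATH, 'POST', { 'email_address' => ' Vet@Example.com ' })
ip_key = by_ip.call(request)
email_key = by_email.call(request)
```

Normalizing the address before using it as a key keeps trivial variations in case or whitespace from bypassing the per-address limit.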

Monitoring After Deployment

Set up a Datadog dashboard to track:

  1. 429 responses — How often is the limit being hit?

  2. Unique IPs hitting limits — Is it one bad actor or many users?

  3. Requests just below limit — Are legitimate users getting close?

Datadog query examples:

# Count of 429 responses

sum:vets_api.response{status:429,path:/your/endpoint/*}.as_count()

# Unique IPs hitting rate limits

count_distinct:vets_api.requests{status:429,path:/your/endpoint/*} by {client_ip}
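The third item on the dashboard list, requests just below the limit, can also be checked offline from exported per-IP counts. This is a sketch with made-up data: the limit_report helper and the 80% threshold for "near the limit" are assumptions.

```ruby
# Splits IPs into those over the limit and those close to it.
def limit_report(peak_per_ip, limit:, near_ratio: 0.8)
  throttled = peak_per_ip.select { |_ip, peak| peak > limit }.keys
  near = peak_per_ip.select { |_ip, peak| peak <= limit && peak >= limit * near_ratio }.keys
  { throttled: throttled, near_limit: near }
end

# Hypothetical peak requests/min per IP after the rule was deployed.
peak_per_ip = {
  '198.51.100.1' => 5,
  '198.51.100.2' => 26,
  '198.51.100.3' => 30,
  '203.0.113.9' => 31
}

report = limit_report(peak_per_ip, limit: 30)
```

Many IPs in the near_limit group suggest that legitimate users are close to being blocked and the limit may be too tight.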

Safe Rollout Strategy

  1. Start with a limit of 2–3x the peak request rate you expect from a heavy user (e.g., if you expect at most 10 requests/min, set 30)

  2. Deploy to production with monitoring enabled

  3. Watch Datadog for 1–2 weeks to observe actual traffic patterns

  4. Tighten the limit based on observed data

  5. Document your rationale in the PR for future reference

Additional Resources

Questions?

Reach out in #vfs-platform-support on Slack.
