Managing Signal to Noise Ratio

Signal to noise ratio (SNR) refers to the proportion of alerts that reflect real problems needing resolution (signal) compared to alerts that do not reflect a real problem (noise). When many alerts are false positives or not actionable, it becomes harder to notice and respond to the alerts that indicate real problems. It’s important to balance reducing noise against the need to ensure that real problems with the application are not missed.

What does success look like?

Success requires the following three things to be true:

  • Alerts are actionable by the team.

    • Escalation (e.g. to owners of an upstream service) is a valid form of action.

  • Alerts reflect the health of the application.

  • Alert priority reflects the user impact.

How to tell if your monitors are too noisy?

Your monitors are too noisy if:

  • Alerts are not actionable

  • Alerts do not reflect significant user impact

How to improve signal to noise ratio?

To improve signal to noise ratio, perform the following actions (a configuration sketch follows this list):

  • Review monitor priority levels and align them with user impact.

    • Start with the highest priority: how would you know if the application is completely non-functional?

  • Start with a less stringent threshold, and adjust over time.

  • Remove re-notification for known issues, or increase the re-notification interval.

  • Add retry attempts to the monitor before alerting, so that transient problems do not trigger alerts.

  • Use an Anomaly monitor for traffic volume instead of hard-coded thresholds. Anomaly monitors are well suited to metrics with predictable patterns, and their alert thresholds can be tuned to suit a given application.
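For teams that manage monitors through the Datadog API, the sketch below illustrates several of the steps above: a lenient starting threshold with a warning level, a longer re-notification interval, and an anomaly-based query for traffic volume. It is a minimal sketch assuming the datadog Python client (datadogpy); the metric names, tags, notification handles, and threshold values are placeholders, not part of this guidance.

# Minimal sketch using the datadog Python client (datadogpy).
# Metric names, tags, the @slack handle, and threshold values below are
# placeholders; tune them for your own application.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Threshold monitor: start with a lenient threshold plus a warning level,
# and re-notify at most once per hour to reduce repeat noise on known issues.
api.Monitor.create(
    type="query alert",
    name="[my-app] Error count is elevated",
    query="sum(last_10m):sum:app.requests.errors{service:my-app}.as_count() > 50",
    message="Error count is above threshold. @slack-my-team-alerts",
    tags=["service:my-app", "team:my-team"],
    options={
        "thresholds": {"critical": 50, "warning": 25},
        "renotify_interval": 60,  # minutes between re-notifications for unresolved alerts
        "notify_no_data": False,
    },
)

# Anomaly monitor: alert on unusual traffic volume instead of a hard-coded value.
api.Monitor.create(
    type="query alert",
    name="[my-app] Traffic volume is anomalous",
    query=(
        "avg(last_4h):anomalies(sum:app.requests.hits{service:my-app}"
        ".as_count(), 'agile', 2) >= 1"
    ),
    message="Traffic volume deviates from its usual pattern. @slack-my-team-alerts",
    tags=["service:my-app", "team:my-team"],
    options={
        "thresholds": {"critical": 1.0},
        "threshold_windows": {
            "trigger_window": "last_15m",
            "recovery_window": "last_15m",
        },
    },
)

The warning threshold gives the team a chance to act before the critical level pages anyone; both values can be tightened over time as described above.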

Alert Priority Levels

P1: Critical issue that warrants public notification and liaison with executive teams.

  • The system is in a critical state which is actively impacting a large number of customers.

  • Functionality has been severely impaired for an extended period, breaking the SLA or SLO.

  • A security vulnerability that exposes Veteran data has come to our attention.

P2: Critical system issue actively impacting many users' ability to use the product.

  • The application is unavailable or experiencing severe performance degradation for most or all users.

  • Any other event for which the team deems an incident response necessary.

  • Unrecoverable failures that require manual intervention to resolve.

P3: Stability or minor user-impacting issues that require immediate attention from service owners.

  • Partial loss of functionality that does not affect the majority of users.

  • Something that is likely to become a P2 if no action is taken.

P4: Minor issues requiring action, but not affecting users' ability to use the product.

  • Performance issues (excessive latency, etc.).

  • Individual host failure (i.e. one node out of a cluster).

  • Delayed job failure (not impacting the event & notification pipeline).

  • Scheduled job failure (not impacting the event & notification pipeline).

P5: Cosmetic issues or bugs, not affecting users' ability to use the product.

  • Bugs not impacting the immediate ability to use the system.
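Datadog monitors also accept a priority field from 1 (highest) to 5 (lowest), which can be kept in line with the P1 to P5 levels above so that paging behavior matches user impact. The snippet below is a minimal sketch assuming the datadog Python client (datadogpy); the query, names, and notification handle are placeholders.

# Minimal sketch assuming the datadog Python client (datadogpy).
# The query, names, and @pagerduty handle are placeholders; the priority
# field (1 = highest, 5 = lowest) is aligned with the P1-P5 levels above.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# An availability monitor set to priority 2 (P2): the application is
# unavailable for most or all users, so responders are paged immediately.
api.Monitor.create(
    type="query alert",
    name="[my-app] Application is not serving traffic",
    query="sum(last_5m):sum:app.requests.hits{service:my-app}.as_count() < 1",
    message="No requests have been served in the last 5 minutes. @pagerduty-my-team",
    tags=["service:my-app", "team:my-team"],
    priority=2,
)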

Additional Resources:

For more on severity levels, reference: https://response.pagerduty.com/before/severity_levels/

If you have any questions or you would like to sign up for the Datadog support team’s weekly office hours (Mondays at 11am ET), please post in the #public_datadog Slack channel.

