Managing Signal to Noise Ratio
Signal to noise ratio (SNR) compares the number of alerts that reflect real errors needing resolution (signal) with the number of alerts that do not reflect a real problem (noise). When many alerts are false positives or not actionable, it becomes harder to notice and respond to the alerts that indicate real problems. Reducing noise must be balanced against the need to ensure that real problems with the application are not missed.
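As a rough illustration of the definition above, the share of alerts that were signal can be computed from a reviewed alert history. This is a minimal sketch under stated assumptions: the `Alert` record, its `actionable` flag, and the monitor names are hypothetical, and someone on the team is assumed to have labeled each past alert by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One past alert, labeled during a review (hypothetical record)."""
    monitor: str
    actionable: bool  # True if the alert reflected a real problem needing resolution


def signal_fraction(alerts: list[Alert]) -> float:
    """Return the share of alerts that were signal, between 0.0 and 1.0."""
    if not alerts:
        raise ValueError("no alerts to evaluate")
    signal = sum(1 for alert in alerts if alert.actionable)
    return signal / len(alerts)


# Made-up sample data: 1 signal alert and 3 noise alerts.
sample = [
    Alert("checkout-latency", actionable=True),
    Alert("checkout-latency", actionable=False),
    Alert("disk-usage", actionable=False),
    Alert("disk-usage", actionable=False),
]
fraction = signal_fraction(sample)
```

A low fraction suggests that most alerts are noise. Grouping the same calculation by `monitor` would show which monitors contribute most of it.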
What does success look like?
Success requires the following three things to be true:
Alerts are actionable by the team.
Escalation (e.g. to owners of an upstream service) is a valid form of action.
Alerts reflect the health of the application.
Alert priority reflects the user impact.
How to tell if your monitors are too noisy?
The monitors are too noisy if:
Alerts are not actionable.
Alerts do not reflect significant user impact.
How to improve signal to noise ratio?
To improve signal to noise ratio, perform the following actions:
Review monitor priority levels and align based on user impact.
Start with the highest priority: how would you know if the application is completely non-functional?
Start with a less stringent threshold, and adjust over time.
Remove re-notifications for known issues, or increase the interval between them.
Add retry attempts to the monitor before alerting to help with transient problems.
Use an Anomaly monitor for traffic volume instead of hard-coded values. Anomaly monitors are well suited to metrics that have predictable patterns, and their alert thresholds can be tuned for a given application.
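Two of the actions above, retrying before alerting and spacing out re-notifications, can be sketched in plain Python to show the intended behavior. This is an illustrative sketch, not Datadog's implementation or API: `AlertGate`, `failures_required`, and `renotify_interval` are hypothetical names, and the values used are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertGate:
    """Decides when a failing check should notify (hypothetical helper)."""

    failures_required: int = 3         # consecutive failures before the first alert
    renotify_interval: float = 3600.0  # seconds between repeat notifications

    consecutive_failures: int = 0
    last_notified_at: Optional[float] = None

    def record(self, check_passed: bool, now: float) -> bool:
        """Record one check result; return True if a notification should be sent."""
        if check_passed:
            # Recovery resets both the failure streak and the re-notification timer.
            self.consecutive_failures = 0
            self.last_notified_at = None
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_required:
            return False  # still within the retry budget; treat as transient

        if (
            self.last_notified_at is None
            or now - self.last_notified_at >= self.renotify_interval
        ):
            self.last_notified_at = now
            return True
        return False  # known issue that was already notified recently


# One check per minute: a transient failure, a recovery, then a sustained failure.
gate = AlertGate(failures_required=3, renotify_interval=3600.0)
results = [False, True, False, False, False, False]
decisions = [gate.record(passed, now=60.0 * i) for i, passed in enumerate(results)]
```

In this run only the fifth check notifies: the first failure is absorbed as transient, and the sixth is suppressed because a notification was sent a minute earlier. A longer `renotify_interval` trades fewer repeat pages for a slower reminder about an unresolved issue.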
Alert Priority Levels
Priority | Description | Examples |
---|---|---|
P1 | Critical issue that warrants public notification and liaison with executive teams. | |
P2 | Critical system issue actively impacting many users' ability to use the product. | |
P3 | Stability or minor user-impacting issues that require immediate attention from service owners. | |
P4 | Minor issues requiring action, but not affecting user ability to use the product. | |
P5 | Cosmetic issues or bugs, not affecting user ability to use the product. | |
Additional Resources:
For more on severity levels, see https://response.pagerduty.com/before/severity_levels/
If you have any questions or you would like to sign up for the Datadog support team’s weekly office hours (Mondays at 11am ET), please post in the #public_datadog Slack channel.