Datadog Concepts Monitors and Alerting
Latest revision as of 23:34, 18 April 2024
External
Internal
Overview
When something goes wrong, a computer tells you about it. This is what a monitor is: a Datadog feature that actively checks metrics, integration availability, network endpoints, and so on, and sends a notification when an alerting condition occurs. Each monitor has a query and alert conditions. Several monitor types are available.
Structure of a Monitor
{
  "id": 99999999,
  "name": "MyApp: Latency Too High",
  "type": "query alert",
  "query": "min(last_30m):avg:myapp.health_check.latency{env:prod} > 60000",
  "message": "something\n\n something else\n\n @some-group@group.mycompany.com",
  "priority": 2,
  "restricted_roles": null,
  "tags": [
    "env:prod"
  ],
  "options": {
    "notify_audit": false,
    "timeout_h": 0,
    "silenced": {},
    "include_tags": true,
    "thresholds": {
      "critical": 60000
    },
    "new_host_delay": 300,
    "require_full_window": false,
    "notify_no_data": false,
    "renotify_interval": 0,
    "escalation_message": "",
    "no_data_timeframe": null
  },
  "created_at": 1629504639000,
  "created": "2021-08-21 00:10:39.507701+00:00",
  "creator": {
    "email": "me@mycompany.com",
    "handle": "me@mycompany.com",
    "id": 9999999,
    "name": null
  },
  "modified": "2022-04-18 23:52:11.280131+00:00",
  "overall_state_modified": "2021-08-21T21:43:23+00:00",
  "overall_state": "OK",
  "multi": false,
  "org_id": 999999,
  "deleted": null
}
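The "query" field packs several settings into one string. As a reading aid, the sketch below splits the simple form used in the example into its parts. The pattern and the function name parse_monitor_query are hypothetical helpers written for this page, not part of any Datadog library, and they only cover the exact shape shown above; real monitor queries can be more complex.

```python
import re

# Hypothetical pattern for the simple "query alert" form shown above:
#   <time_aggr>(<window>):<space_aggr>:<metric>{<scope>} <comparator> <threshold>
QUERY_PATTERN = re.compile(
    r"^(?P<time_aggr>\w+)\((?P<window>\w+)\):"
    r"(?P<space_aggr>\w+):(?P<metric>[\w.]+)"
    r"\{(?P<scope>[^}]*)\}\s*"
    r"(?P<comparator>>=|<=|>|<)\s*"
    r"(?P<threshold>-?\d+(?:\.\d+)?)$"
)


def parse_monitor_query(query: str) -> dict:
    """Split a simple monitor query string into named parts."""
    match = QUERY_PATTERN.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported query form: {query!r}")
    parts = match.groupdict()
    # Convert the threshold so it can be compared with options.thresholds.critical.
    parts["threshold"] = float(parts["threshold"])
    return parts


if __name__ == "__main__":
    example = "min(last_30m):avg:myapp.health_check.latency{env:prod} > 60000"
    for key, value in parse_monitor_query(example).items():
        print(f"{key}: {value}")
```

In the example monitor, the threshold in the query (60000) matches options.thresholds.critical, so a parser like this could be used to check that the two stay in sync.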
Monitor Types
Metric Monitor
Metric monitors watch a continuous stream of data. The metrics are collected via the Datadog Agent or the API and can trigger an alert when, for example, they cross a threshold over a given period of time. Other alert detection methods are available.
Any metric currently reporting to Datadog is available for monitors.
Alert Detection Method
Threshold
A threshold alert compares metric values to a static threshold. This is the standard alert case. On each alert evaluation, Datadog calculates the average/min/max/sum over the selected period and checks whether it is above or below the threshold. The distribution metric type additionally allows the threshold to be applied to percentiles calculated over the selected period.
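The evaluation described above can be sketched in a few lines. This is an illustration of the idea only, not Datadog's implementation: the function name threshold_breached, the in-memory list of points, and the treatment of an empty window are all assumptions made for the example.

```python
import operator
from statistics import mean

# Aggregations and comparisons named in the text above.
AGGREGATORS = {"avg": mean, "min": min, "max": max, "sum": sum}
COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


def threshold_breached(points, aggregator, comparator, threshold):
    """Aggregate the points of the evaluation window and compare to a threshold."""
    if not points:
        # Assumption: an empty window does not alert (no-data handling is separate).
        return False
    value = AGGREGATORS[aggregator](points)
    return COMPARATORS[comparator](value, threshold)


if __name__ == "__main__":
    window = [61000, 70000, 65000]  # latency samples from the last 30 minutes
    print(threshold_breached(window, "min", ">", 60000))
```

With the "min" aggregation, as in the example monitor, every point in the window has to be above the threshold for the alert to fire, which makes the monitor less sensitive to short spikes than "avg" or "max".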
Change
A change alert computes the absolute or relative (%) change in value between N minutes ago and now, and compares it against a given threshold. The compared data points are not single points but are computed using the parameters in the alert conditions section. On each alert evaluation, Datadog calculates the raw difference (a positive or negative value) between the series now and N minutes ago, then computes the average/minimum/maximum/sum over the selected period. An alert is triggered when this computed series crosses the threshold. This type of alert is useful to track spikes, drops, or slow changes in a metric when there is no expected threshold.
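A minimal sketch of the change computation, under the assumption that the series "now" and "N minutes ago" are available as two aligned lists of equal length. The names change_series and change_breached are hypothetical, and skipping pairs whose earlier value is zero in relative mode is a simplification chosen for the example.

```python
import operator
from statistics import mean

AGGREGATORS = {"avg": mean, "min": min, "max": max, "sum": sum}
COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


def change_series(now, earlier, relative=False):
    """Point-by-point difference between the current and the earlier series."""
    deltas = []
    for current, previous in zip(now, earlier):
        diff = current - previous  # raw difference, positive or negative
        if relative:
            if previous == 0:
                continue  # assumption: skip points where a % change is undefined
            diff = diff * 100.0 / previous
        deltas.append(diff)
    return deltas


def change_breached(now, earlier, aggregator, comparator, threshold, relative=False):
    """Aggregate the difference series over the window and compare to a threshold."""
    deltas = change_series(now, earlier, relative)
    if not deltas:
        return False
    return COMPARATORS[comparator](AGGREGATORS[aggregator](deltas), threshold)


if __name__ == "__main__":
    now = [120, 130, 140]
    earlier = [100, 100, 100]  # the same series, N minutes ago
    print(change_breached(now, earlier, "avg", ">", 25))
```

A drop is expressed the same way with a negative threshold and the "<" comparator, for example a relative change below -40.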
Anomaly
An anomaly detection alert uses past behavior to detect when a metric is behaving abnormally. For more details see Anomaly Monitor.
Outliers
An outlier alert notifies when a member of a group (host, availability zone, partition, etc.) is behaving unusually compared to the rest. For more details see Outlier Monitor.
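To make "behaving unusually compared to the rest" concrete, the sketch below flags group members using the median absolute deviation (MAD), one common robust way to measure distance from the group. This is a swapped-in illustration: this page does not say which algorithm a given outlier monitor uses, and the function name find_outliers and the tolerance value are assumptions.

```python
from statistics import median


def find_outliers(group_values, tolerance=3.0):
    """Return the names of members whose value is far from the group median.

    group_values maps a member name (host, availability zone, ...) to its
    current value. Distance is measured in multiples of the MAD.
    """
    values = list(group_values.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half of the members sit exactly on the median; this sketch
        # gives up here rather than divide by zero.
        return []
    return sorted(
        name for name, value in group_values.items()
        if abs(value - center) / mad > tolerance
    )


if __name__ == "__main__":
    cpu_by_host = {"host-a": 10, "host-b": 11, "host-c": 9,
                   "host-d": 10, "host-e": 50}
    print(find_outliers(cpu_by_host))
```

The median is used instead of the mean so that a single extreme member does not shift the reference point it is being compared against.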
Forecast
A forecast alert predicts the future behavior of a metric and compares it to a static threshold. It is well-suited for metrics with strong trends or recurring patterns. On each alert evaluation, a forecast alert predicts the future values of the metric along with the expected deviation bounds. An alert is triggered when any part of the bounds crosses the configured threshold. For more details see forecast monitors.
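The idea of "predicted values plus deviation bounds crossing a threshold" can be sketched with a plain least-squares line. This is a deliberately simple stand-in and not the forecasting algorithm Datadog uses: the function names, the linear model, and the bounds (a fixed multiple of the residual standard deviation) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def linear_forecast(values, steps_ahead, deviations=2.0):
    """Fit a straight line to evenly spaced values and extrapolate it.

    Returns a list of (lower, predicted, upper) tuples, one per future step.
    Requires at least two values.
    """
    n = len(values)
    xs = range(n)
    x_mean, y_mean = mean(xs), mean(values)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    intercept = y_mean - slope * x_mean
    # Use the spread of the fit errors as a crude deviation bound.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, values)]
    spread = deviations * pstdev(residuals)
    forecast = []
    for x in range(n, n + steps_ahead):
        predicted = intercept + slope * x
        forecast.append((predicted - spread, predicted, predicted + spread))
    return forecast


def forecast_breached(values, steps_ahead, threshold):
    """True when any part of the upper bound rises above the threshold."""
    return any(upper > threshold
               for _, _, upper in linear_forecast(values, steps_ahead))


if __name__ == "__main__":
    disk_used_percent = [10, 20, 30, 40, 50]
    print(forecast_breached(disk_used_percent, steps_ahead=3, threshold=75))
```

The sketch only handles an "above" threshold; a "below" condition would test the lower bound instead. A typical use is a steadily growing metric such as disk usage, where the alert should fire before the threshold is actually reached.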
Host Monitor
A host monitor listens to the Datadog Agent heartbeats and notifies on the status of the heartbeat. This can indicate whether the hosts the Agents run on are responsive. Every Datadog Agent reports a service check called datadog.agent.up with the status OK. The Host monitor has two kinds of alert conditions: Check Alert and Cluster Alert.
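The two alert conditions are not described on this page. The sketch below rests on an assumption that should be checked against the Datadog documentation: that a check alert looks at consecutive failing statuses for a single host, while a cluster alert looks at the percentage of hosts currently failing. The function names and the plain-string statuses are hypothetical.

```python
def check_alert(statuses, consecutive_failures):
    """True when the most recent statuses of one host are all non-OK.

    statuses is the history of datadog.agent.up results for a single host,
    oldest first.
    """
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    recent = statuses[-consecutive_failures:]
    return len(recent) == consecutive_failures and all(s != "OK" for s in recent)


def cluster_alert(latest_by_host, critical_percent):
    """True when the share of hosts with a non-OK latest status exceeds a percentage."""
    if not latest_by_host:
        return False
    failing = sum(1 for status in latest_by_host.values() if status != "OK")
    return failing * 100.0 / len(latest_by_host) > critical_percent


if __name__ == "__main__":
    print(check_alert(["OK", "CRITICAL", "CRITICAL"], consecutive_failures=2))
    print(cluster_alert({"a": "OK", "b": "CRITICAL", "c": "CRITICAL", "d": "OK"},
                        critical_percent=40))
```

Under this reading, a check alert suits cases where every individual host matters, and a cluster alert suits fleets where a few unresponsive hosts are tolerable but a large share is not.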
Anomaly Monitor
Outlier Monitor
Forecast Monitor
APM Monitor
APM stands for Application Performance Monitoring.
Audit Logs Monitor
CI Pipelines Monitor
Composite Monitor
Custom Check Monitor
Error Tracking Monitor
Event Monitor
Integration Monitor
Live Process Monitor
Logs Monitor
Network Monitor
Process Check Monitor
Real User Monitoring
Watchdog Monitor
Alert
Alert Conditions
Check Alert
Cluster Alert
Notification
Notifications are a key component of monitors. They keep the team informed of issues and support troubleshooting. Notifications are configured when the monitor is created. For details see https://docs.datadoghq.com/monitors/notify/.
Say What's Happening
Title (Name)
Message
Tags
Renotify
Priority
Notify your Team
Notifications
Integrations
Jira, PagerDuty, Slack, Webhooks
Modifications
Permissions
Test Notifications
Triggered Monitor
See https://docs.datadoghq.com/monitors/manage/#triggered-monitors.
Downtime
Incident
See https://docs.datadoghq.com/monitors/incident_management.
SLO
See https://docs.datadoghq.com/monitors/service_level_objectives/.
Also see: Service Level Objectives (SLO)