Latest revision as of 20:47, 25 May 2022

Internal

Datadog

Overview

Datadog is an observability platform that includes products for monitoring, alerting, metrics, dashboard, big logs, synthetics, user monitoring, CI/CD (how?). Datadog is API driven.

Organization

Route of a Metric from Application to Dashboard

Application (specialized library) → metric → DogStatsD → Datadog Backend → metric → Dashboard

The metrics are generated by an application-level library, such as Micrometer. For more details, see Metric Lifecycle below. The Datadog agent annotates the metric with additional tags (cluster name, pod name, etc.)

Metrics

https://docs.datadoghq.com/metrics/

Metrics are numerical values that can track anything about your environment over time. Example: latency, error rates, user signups. Metric data is ingested and stored as a datapoint with a value and a timestamp. The timestamp is rounded to the nearest second. If there is more than one value with the same timestamp, the latest received value overwrites the previous one. A sequence of metrics is stored as a timeseries. There are standard metrics, such as CPU, memory, etc, but metrics specific to business can be defined. Those are custom metrics. Metrics can be visualized in dashboards, Metrics Explorer and Notebooks

Metric Name

Valid characters?

Metric Tags

Metric Lifecycle

Metrics are created by application-level specialized libraries, such as Micrometer. For example, Micrometer creates measurements , which are semantically equivalent. The metric has a name, and it can optionally have one or more tags.

Metrics can be sent to Datadog from:

Datadog-supported integrations. More: https://docs.datadoghq.com/integrations/.
Directly from the Datadog platform, for example counting errors showing up in the logs and publishing that as a new metric. More: https://docs.datadoghq.com/logs/logs_to_metrics/.
Custom metrics generators.
Datadog Agent, which automatically sends standard metrics as CPU and disk usage.

Metric Types

https://docs.datadoghq.com/metrics/types/

Metric types determine which graphs and functions are available to use with the metric.

Count

A count metric adds up all the submitted values in a time interval. This would be suitable for a metric tracking the number of website hits, for example.

Rate

The rate metric takes the count and divides it by the length of the type interval (example: hits per second).

Gauge

A gauge metric takes the last value reported during the interval. This could be used to track values such as CPU or memory, where taking the last value provides a representative picture of the host's behavior during the time interval.

Histogram

A histogram reports five different values summarizing the submitted values: the average, count, median, 95th percentile, and max. This produces five different timeseries. This type of metric is suitable for things like latency, for which were is not enough to know the average value. Histograms allow you to understand how your data was spread out without recording every single data point.

Distribution

https://docs.datadoghq.com/metrics/distributions/

A distribution is similar to a histogram but it summarizes values submitted during a time interval across all hosts in your environment.

Set

Custom Metrics

https://docs.datadoghq.com/metrics/custom_metrics/

Metric Query

https://docs.datadoghq.com/metrics/#querying-metrics

 space-aggregation:metric.name{filter/scope} by {space_aggregation_grouping_by_tag}.time-aggregation

 avg:system.io.r_s{app:myapp} by {host}.rollup(avg, 3600)

TO PROCESS:

https://docs.datadoghq.com/metrics/advanced-filtering

Metric Query Elements

Metric Name in Metric Query

system.io.r_s See Metric Name above.

Filter/Scope

The query metric values can be filtered based on tags. The {...} section contains a comma-separated list of tag-name:tag-value pairs. It is said that the list of tag-name:tag-value pairs, also known as the query filter, scope the query. Example:

{app:myapp, something:somethingelse}

{status:true,!condition:ready,cluster_name:my-cluster}

{table_name:*event_f_incomplete AND time_hours:0 AND time_interval:0 AND time_unit:hour AND target_hour:-2 AND environment:myenv AND component:mycomp AND NOT (table_name:*tier1_event_f_incomplete)}

The curly braces {...} must always be present in the metric query definition {...}. If there are no particular tags to filter by, use the {*} syntax. For more details on tags, see Tags.

Space Aggregation

"Space" refers to the way metrics are distributed over hosts and tags. There are two different aspects of "space" that can be controlled when aggregating metrics: grouping and the space aggregator.

Grouping or Space Aggregator Tags

Grouping defines what constitutes a line on the graph. Grouping splits a single metric into multiple timeseries by tags such as host, container, and region. For example, if you have hundreds of hosts spread across four regions, grouping by region allows you to graph one line for every region. This would reduce the number of timeseries to four.

... by {host} ...

When more that one value is available for a group such defined, we need to instruct Datadog how to combine such values, with the space aggregator.

Space Aggregator

The space aggregator defines how the metrics in each group are combined. There are four main aggregations types available: sum, min, max, and avg. There are also count, p50, p75, p95, p99, 0-sum, 1-avg, 100-avg.

Time Aggregation

Datadog stores a large volume of points, and in most cases it’s not possible to display all of them on a graph. There would be more datapoints than pixels. Datadog uses time aggregation to solve this problem by combining data points into time buckets. For example, when examining four hours, data points are combined into two-minute buckets. This is called a rollup. As the time interval you’ve defined for your query increases, the granularity of your data becomes coarser.

There are five aggregations you can apply to combine your data in each time bucket: sum, min, max, avg, and count. By default, avg is applied, in which case rollup(...) does not show up in the metric query.

 space_agg:metric{...}.rollup(max, 60)

Time aggregation is always applied in every query you make.

Operations

Functions

Graph values can be modified with mathematical functions. This can mean performing arithmetic between an integer and a metric (for example, multiplying a metric by 2). Or performing arithmetic between two metrics (for example, creating a new timeseries for the memory utilization rate like this: jvm.heap_memory / jvm.heap_memory_max). Functions are optional.

Rollup

rollup(avg, 3600)

Other Examples

 avg:myapp.smoketest.run_time{$cluster_name}/1000

Events

https://docs.datadoghq.com/events/

Events are records of notable changes relevant for managing and troubleshooting IT operations, such as code deployments, service health, configuration changes or monitoring alerts. TO PROCESS:

Agent

The Datadog agent has a built-in StatsD server, exposed over a configurable port. It's written in Go.

Agent and Kubernetes

TO CONTINUE: https://docs.datadoghq.com/developers/dogstatsd/?tab=kubernetes#

DogStatsD

https://docs.datadoghq.com/developers/dogstatsd/

DogStatsD is a metrics aggregation service bundled with the Datadog agent. DogStatsD implements the StatsD protocol and a few extensions (histogram metric type, service checks, events and tagging).

DogStatsD accepts custom metrics,events and service checks over UDP and periodically aggregates them and forwards them to Datadog.

Monitors and Alerting

Monitors and Alerting

Unified Service Tagging

There are three reserved tags: "env", "service", "version".

Dashboard

Dashboard

Metrics Explorer

Tool to browse arbitrary metrics, by selecting their name.

Notebook

Security

User

Service Account

API Key

https://docs.datadoghq.com/account_management/api-app-keys/#api-keys

An API key is required by the Datadog Agent to submit metrics and events to Datadog. The API keys are also used by other third-party clients, such as, for example, the Pulumi Datadog resource provider, which provisions infrastructure on the Datadog backend. API keys are unique to an organization. To see API Keys: Console → Hover over the user name at the bottom of the left side menu → Organization Settings → API Keys.

To invoke into the API, the client expects the environment variable DATADOG_API_KEY to be set.

An API key is unique to an organization.

Application Key

https://docs.datadoghq.com/account_management/api-app-keys/#application-keys

Application keys, in conjunction with the organization’s API key, give users access to Datadog’s programmatic API. Application keys are associated with the user account that created them and by default have the permissions and scopes of the user who created them. To see or create Application Keys: Console → Hover over the user name at the bottom of the left side menu → Organization Settings → Application Keys.

API

Datadog API

Kubernetes Support

Understand this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations
        ad.datadoghq.com/myapp.check_names: '["myapp"]'
        ad.datadoghq.com/myapp.init_configs: '[{"is_jmx": true, "collect_default_metrics": true}]'
        ad.datadoghq.com/myapp.instances: '[{"host": "%%host%%","port":"19081"}]'
    spec:
    [...]

@@ Line 3: / Line 3: @@
 =Overview=
 Datadog is an observability platform that includes products for monitoring, alerting, metrics, dashboard, big logs, <font color=darkkhaki>synthetics</font>, user monitoring, <font color=darkkhaki>CI/CD (how?)</font>. Datadog is API driven.
+=Organization=
 =Route of a Metric from Application to Dashboard=
-Application (specialized library) → [[#DogStatsD|DogStatsD]] → [[#Event|events]] → [[Datadog Dashboard#Overview|Dashboard]]
+Application (specialized library) → [[#Metric|metric]] → [[#DogStatsD|DogStatsD]] → Datadog Backend → [[#Metric|metric]] → [[Datadog Dashboard#Overview|Dashboard]]
+::::[[File:Datadog_Metric_Propagation.png|754px]]
-The [[#Metric|metrics]] are generated by an application-level library, such as [[Micrometer]]. For more details, see [[#Metric_Lifecycle|Metric Lifecycle]] below.
+The [[#Metric|metrics]] are generated by an application-level library, such as [[Micrometer]]. For more details, see [[#Metric_Lifecycle|Metric Lifecycle]] below. The Datadog agent annotates the metric with additional tags (cluster name, pod name, etc.)
+=Metrics=
+{{External|https://docs.datadoghq.com/metrics/}}
+Metrics are numerical values that can track anything about your environment over time. Example: latency, error rates, user signups. Metric data is ingested and stored as a datapoint with a value and a timestamp. The timestamp is rounded to the nearest second. If there is more than one value with the same timestamp, the latest received value overwrites the previous one. A sequence of metrics is stored as a timeseries. There are standard metrics, such as CPU, memory, etc, but metrics specific to business can be defined. Those are [[#Custom_Metrics|custom metrics]]. Metrics can be visualized in [[Datadog Dashboard#Overview|dashboards]], [[#Metrics_Explorer|Metrics Explorer]] and [[#Notebook|Notebooks]]
+==Metric Name==
+<font color=darkkhaki>Valid characters?</font>
+==Metric Tags==
+==Metric Lifecycle==
+Metrics are created by application-level specialized libraries, such as [[Micrometer#Overview|Micrometer]]. For example, Micrometer creates [[Micrometer#Measurement|measurements]] <font color=darkkhaki>, which are semantically equivalent</font>. The metric has a [[#Metric_Name|name]], and it can optionally have one or more [[#Metric_Tags|tags]].
+Metrics can be sent to Datadog from:
+* Datadog-supported integrations. <font color=darkkhaki>More: https://docs.datadoghq.com/integrations/</font>.
+* Directly from the Datadog platform, for example counting errors showing up in the logs and publishing that as a new metric. <font color=darkkhaki>More: https://docs.datadoghq.com/logs/logs_to_metrics/</font>.
+* [[#Custom_Metrics|Custom metrics]] generators.
+* [[#Agent|Datadog Agent]], which automatically sends standard metrics as CPU and disk usage.
+==Metric Types==
+{{External|https://docs.datadoghq.com/metrics/types/}}
+Metric types determine which graphs and functions are available to use with the metric.
+===Count===
+A count metric adds up all the submitted values in a time interval. This would be suitable for a metric tracking the number of website hits, for example.
+===Rate===
+The rate metric takes the count and divides it by the length of the type interval (example: hits per second).
+===Gauge===
+A gauge metric takes the last value reported during the interval. This could be used to track values such as CPU or memory, where taking the last value provides a representative picture of the host's behavior during the time interval.
+===Histogram===
+A histogram reports five different values summarizing the submitted values: the average, count, median, 95th percentile, and max. This produces five different timeseries. This type of metric is suitable for things like latency, for which were is not enough to know the average value. Histograms allow you to understand how your data was spread out without recording every single data point.
+===Distribution===
+{{External|https://docs.datadoghq.com/metrics/distributions/}}
+A distribution is similar to a [[#Histogram|histogram]] but it summarizes values submitted during a time interval across all hosts in your environment.
+===Set===
-Name, tags to metric. Statsd. Datadog adds a datadog tag (cluster-name, pod-name)
+==<span id='Custom_Metric'></span>Custom Metrics==
+{{External|https://docs.datadoghq.com/metrics/custom_metrics/}}
-Metric: eks.smoketest.create_pod.succ from cluster_name. Use the tag as a filter.
+=Metric Query=
+{{External|https://docs.datadoghq.com/metrics/#querying-metrics}}
+<font size=-1>
+  <font color=burgundy>space-aggregation</font>:<font color=skyblue>metric.name</font>{filter/scope} <font color=burgundy>by {space_aggregation_grouping_by_tag}</font>.time-aggregation
+</font>
+<font size=-1>
+  <font color=burgundy>avg</font>:<font color=skyblue>system.io.r_s</font>{app:myapp} <font color=burgundy>by {host}</font>.rollup(avg, 3600)
+</font>
+<font color=darkkhaki>TO PROCESS:
+* https://docs.datadoghq.com/metrics/advanced-filtering
+</font>
+==Metric Query Elements==
+===Metric Name in Metric Query===
+<code>system.io.r_s</code>
+See [[#Metric_Name|Metric Name]] above.
+===Filter/Scope===
+The query  metric values can be filtered based on tags. The <code>{...}</code> section contains a comma-separated list of <code>tag-name:tag-value</code> pairs. It is said that the list of <code>tag-name:tag-value</code> pairs, also known as the '''query filter''', scope the query. Example:
+<syntaxhighlight lang='text'>
+{app:myapp, something:somethingelse}
+</syntaxhighlight>
+<syntaxhighlight lang='text'>
+{status:true,!condition:ready,cluster_name:my-cluster}
+</syntaxhighlight>
+<syntaxhighlight lang='text'>
+{table_name:*event_f_incomplete AND time_hours:0 AND time_interval:0 AND time_unit:hour AND target_hour:-2 AND environment:myenv AND component:mycomp AND NOT (table_name:*tier1_event_f_incomplete)}
+</syntaxhighlight>
+The curly braces <code>{...}</code> must always be present in the metric query definition <code>{...}</code>. If there are no particular tags to filter by, use the <code>{*}</code> syntax.
+For more details on tags, see [[#Tag|Tags]].
+===Space Aggregation===
+"Space" refers to the way metrics are distributed over hosts and tags. There are two different aspects of "space" that can be controlled when aggregating metrics:  [[#Grouping|grouping]] and the [[#Space_Aggregator|space aggregator]].
+====<span id='Grouping'></span><span id='Space_Aggregator_Tags'></span>Grouping or Space Aggregator Tags====
+Grouping defines what constitutes a line on the graph. Grouping splits a single metric into multiple timeseries by tags such as host, container, and region. For example, if you have hundreds of hosts spread across four regions, grouping by region allows you to graph one line for every region. This would reduce the number of timeseries to four.
+<font size=-1>
+ ... by {host} ...
+</font>
+When more that one value is available for a group such defined, we need to instruct Datadog how to combine such values, with the [[#Space_Aggregator|space aggregator]].
-=Metrics=
+====Space Aggregator====
-{{External|https://docs.datadoghq.com/metrics/}}
+The space aggregator defines how the metrics in each group are combined. There are four main aggregations types available: <code>sum</code>, <code>min</code>, <code>max</code>, and <code>avg</code>. There are also <code>count</code>, <code>p50</code>, <code>p75</code>, <code>p95</code>, <code>p99</code>, <code>0-sum</code>, <code>1-avg</code>, <code>100-avg</code>.
-Metrics are numerical values that can track anything about your environment over time. Example: latency, error rates, user signups. Metric data is ingested and stored as a datapoint with a value and a timestamp. The timestamp is rounded to the nearest second. If there is more than one value with the same timestamp, the latest received value overwrites the previous one. A sequence of metrics is stored as a timeseries. There are standard metrics, such as CPU, memory, etc, but metrics specific to business can be defined. Those are [[#Custom_Metrics|custom metrics]]. Metrics can be visualized in [[Datadog Dashboard#Overview|dashboards]], [[#Metrics_Explorer|Metrics Explorer]] and [[#Notebook|Notebooks]]
+===Time Aggregation===
+Datadog stores a large volume of points, and in most cases it’s not possible to display all of them on a graph. There would be more datapoints than pixels. Datadog uses time aggregation to solve this problem by combining data points into time buckets. For example, when examining four hours, data points are combined into two-minute buckets. This is called a '''rollup'''. As the time interval you’ve defined for your query increases, the granularity of your data becomes coarser.
+There are five aggregations you can apply to combine your data in each time bucket: <code>sum</code>, <code>min</code>, <code>max</code>, <code>avg</code>, and <code>count</code>. By default, <code>avg</code> is applied, in which case <code>rollup(...)</code> does not show up in the metric query.
+<font size=-1>
+  space_agg:metric{...}.rollup(max, 60)
+</font>
+Time aggregation is always applied in every query you make.
+===Operations===
+===Functions===
+Graph values can be modified with mathematical functions. This can mean performing arithmetic between an integer and a metric (for example, multiplying a metric by 2). Or performing arithmetic between two metrics (for example, creating a new timeseries for the memory utilization rate like this: jvm.heap_memory / jvm.heap_memory_max). Functions are optional.
+===Rollup===
+<code>rollup(avg, 3600)</code>
+==Other Examples==
+<font size=-1>
+  avg:myapp.smoketest.run_time{$cluster_name}/1000
+</font>
-==Metric Lifecycle==
+=<span id='Tag'>Tags</span>=
-Metrics are created by application-level specialized libraries, such as [[Micrometer#Overview|Micrometer]]. For example, Micrometer creates [[Micrometer#Measurement|measurements]]. The metric has a name, and it can optionally have one or more tags.
+{{External|https://docs.datadoghq.com/getting_started/tagging/using_tags/}}
-==<span id='Custom_Metric'></span>Custom Metrics==
+=<span id='Event'></span>Events=
-{{External|https://docs.datadoghq.com/metrics/custom_metrics/}}
+{{External|https://docs.datadoghq.com/events/}}
+Events are records of notable changes relevant for managing and troubleshooting IT operations, such as code deployments, service health, configuration changes or monitoring alerts.
+<font color=darkkhaki>
+TO PROCESS:
+* https://docs.datadoghq.com/events/
+* https://docs.datadoghq.com/events/guides/
+</font>
 =Agent=
@@ Line 37: / Line 131: @@
 DogStatsD accepts [[#Custom_Metric|custom metrics]],[[#Event|events]] and service checks over UDP and periodically aggregates them and forwards them to Datadog.
-=<span id='Event'></span>Events=
-{{External|https://docs.datadoghq.com/events/}}
+=<span id='Monitor'></span>Monitors and Alerting=
-Events are records of notable changes relevant for managing and troubleshooting IT operations, such as code deployments, service health, configuration changes or monitoring alerts.
+{{Internal|Datadog Concepts Monitors and Alerting#Overivew|Monitors and Alerting}}
-=Monitor=
-When something goes wrong, a computer tells you about it. This is what a monitor is. There re different monitor types. The monitor has a query. The monitor has alert conditions.
 =Unified Service Tagging=
 There are three reserved tags: "env", "service", "version".
@@ Line 47: / Line 140: @@
 {{Internal|Datadog Dashboard#Overview|Dashboard}}
 =Metrics Explorer=
+Tool to browse arbitrary metrics, by selecting their [[#Metric_Name|name]].
 =Notebook=
@@ Line 57: / Line 152: @@
 An API key is required by the [[#Agent|Datadog Agent]] to submit metrics and events to Datadog. The API keys are also used by other third-party clients, such as, for example, the [[Managing_Datadog_with_Pulumi#Datadog_Resource_Provider|Pulumi Datadog resource provider]], which provisions infrastructure on the Datadog backend. API keys are unique to an organization. To see API Keys: Console → Hover over the user name at the bottom of the left side menu → Organization Settings → API Keys.
+To invoke into the API, the client expects the environment variable <code>DATADOG_API_KEY</code> to be set.
+An API key is unique to an [[#Organization|organization]].
 ==Application Key==
 {{External|https://docs.datadoghq.com/account_management/api-app-keys/#application-keys}}
 Application keys, in conjunction with the organization’s [[#API_Key|API key]], give users access to Datadog’s programmatic API. Application keys are associated with the user account that created them and by default have the permissions and scopes of the user who created them.  To see or create Application Keys: Console → Hover over the user name at the bottom of the left side menu → Organization Settings → Application Keys.
+=API=
+{{Internal|Datadog API|Datadog API}}
+=Kubernetes Support=
+<font color=darkkhaki>Understand this:</font>
+<syntaxhighlight lang='yaml'>
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: myapp
+spec:
+  template:
+    metadata:
+      annotations
+        ad.datadoghq.com/myapp.check_names: '["myapp"]'
+        ad.datadoghq.com/myapp.init_configs: '[{"is_jmx": true, "collect_default_metrics": true}]'
+        ad.datadoghq.com/myapp.instances: '[{"host": "%%host%%","port":"19081"}]'
+    spec:
+    [...]
+</syntaxhighlight>

Datadog Concepts: Difference between revisions

Latest revision as of 20:47, 25 May 2022

Internal

Overview

Organization

Route of a Metric from Application to Dashboard

Metrics

Metric Name

Metric Tags

Metric Lifecycle

Metric Types

Count

Rate

Gauge

Histogram

Distribution

Set

Custom Metrics

Metric Query

Metric Query Elements

Metric Name in Metric Query

Filter/Scope

Space Aggregation

Grouping or Space Aggregator Tags

Space Aggregator

Time Aggregation

Operations

Functions

Rollup

Other Examples

Tags

Events

Agent

Agent and Kubernetes

DogStatsD

Monitors and Alerting

Unified Service Tagging

Dashboard

Metrics Explorer

Notebook

Security

User

Service Account

API Key

Application Key

API

Kubernetes Support

Navigation menu

Search