Datadog Concepts

From NovaOrdis Knowledge Base

Internal

Overview

Datadog is an observability platform that includes products for monitoring, alerting, metrics, dashboards, log management, synthetics, real user monitoring, and CI/CD (how?). Datadog is API-driven.

Organization

Route of a Metric from Application to Dashboard

Application (specialized library) → metric → DogStatsD → Datadog Backend → metric → Dashboard

Datadog Metric Propagation.png

The metrics are generated by an application-level library, such as Micrometer. For more details, see Metric Lifecycle below. The Datadog agent annotates the metric with additional tags (cluster name, pod name, etc.).

Metrics

https://docs.datadoghq.com/metrics/

Metrics are numerical values that track anything about your environment over time, for example latency, error rates, or user signups. Metric data is ingested and stored as a datapoint with a value and a timestamp. The timestamp is rounded to the nearest second. If more than one value arrives with the same timestamp, the latest received value overwrites the previous one. A sequence of datapoints is stored as a timeseries. There are standard metrics, such as CPU and memory, but business-specific metrics can also be defined. Those are custom metrics. Metrics can be visualized in dashboards, the Metrics Explorer and Notebooks.
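
The storage rule described above (round the timestamp to the nearest second, last value wins) can be sketched as follows. This is only an illustration of the rule, not Datadog's implementation; the function name store_point and the handling of exact half-second ties are assumptions.

```python
import math

def store_point(series: dict, timestamp: float, value: float) -> None:
    """Store a datapoint keyed by its timestamp rounded to the nearest second.

    A later value with the same rounded timestamp overwrites the earlier one.
    Rounding exact .5 ties upward is an assumption made for this sketch.
    """
    rounded = math.floor(timestamp + 0.5)
    series[rounded] = value

series = {}
store_point(series, 1700000000.2, 10.0)
store_point(series, 1700000000.4, 12.0)  # same second: overwrites 10.0
store_point(series, 1700000001.1, 7.0)
print(sorted(series.items()))
```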

Metric Name

Valid characters?

Metric Tags

Metric Lifecycle

Metrics are created by application-level specialized libraries, such as Micrometer. For example, Micrometer creates measurements, which are semantically equivalent to metrics. The metric has a name, and it can optionally have one or more tags.

Metrics can be sent to Datadog from:

Metric Types

https://docs.datadoghq.com/metrics/types/

Metric types determine which graphs and functions are available to use with the metric.

Count

A count metric adds up all the submitted values in a time interval. This would be suitable for a metric tracking the number of website hits, for example.

Rate

The rate metric takes the count and divides it by the length of the time interval (example: hits per second).
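
As a rough illustration of how count and rate relate, the sketch below aggregates the values submitted during one interval. The function names and the 10-second interval are assumptions chosen for the example.

```python
def count_metric(values: list) -> float:
    """Count: add up all values submitted during the interval."""
    return sum(values)

def rate_metric(values: list, interval_seconds: float) -> float:
    """Rate: the count divided by the length of the interval."""
    return sum(values) / interval_seconds

# Website hits submitted during a hypothetical 10-second interval.
hits = [1, 1, 1, 1, 1, 5]
print(count_metric(hits))     # total hits in the interval
print(rate_metric(hits, 10))  # hits per second
```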

Gauge

A gauge metric takes the last value reported during the interval. This could be used to track values such as CPU or memory, where taking the last value provides a representative picture of the host's behavior during the time interval.
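
In contrast with a count, a gauge keeps only the last value of the interval. A minimal sketch, with a hypothetical function name and sample values:

```python
def gauge_metric(values: list):
    """Gauge: the last value reported during the interval, or None if empty."""
    return values[-1] if values else None

# CPU utilization samples reported during one interval.
cpu_samples = [0.42, 0.55, 0.61]
print(gauge_metric(cpu_samples))
```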

Histogram

A histogram reports five different values summarizing the submitted values: the average, count, median, 95th percentile, and max. This produces five different timeseries. This type of metric is suitable for things like latency, for which it is not enough to know the average value. Histograms allow you to understand how your data was spread out without recording every single data point.
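
The five summary values can be sketched as below. The percentile uses the nearest-rank method, which is an assumption: the exact percentile calculation used by Datadog may differ. The function names and dictionary keys are illustrative only.

```python
import statistics

def nearest_rank_percentile(sorted_values: list, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]

def summarize_histogram(values) -> dict:
    """Reduce one interval's values to the five histogram summaries."""
    ordered = sorted(values)
    return {
        "avg": statistics.fmean(ordered),
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": nearest_rank_percentile(ordered, 95),
        "max": ordered[-1],
    }

# Latencies of 1..100 ms submitted during one interval.
summary = summarize_histogram(range(1, 101))
print(summary)
```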

Distribution

https://docs.datadoghq.com/metrics/distributions/

A distribution is similar to a histogram but it summarizes values submitted during a time interval across all hosts in your environment.
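
To see why summarizing across all hosts matters, the sketch below compares a percentile computed per host with one computed over the combined values. The host data is made up, and the nearest-rank percentile is an assumption rather than Datadog's actual method.

```python
def nearest_rank_percentile(values: list, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies (ms) submitted by two hosts in the same interval.
host_a = [10] * 10
host_b = [100] * 10

# Histogram-like view: one p95 per host.
per_host_p95 = [nearest_rank_percentile(h, 95) for h in (host_a, host_b)]

# Distribution-like view: one p95 over the values from all hosts.
global_p95 = nearest_rank_percentile(host_a + host_b, 95)

print(per_host_p95, global_p95)
```

Averaging the two per-host percentiles would give 55, while the percentile over the combined values is 100, so the two views can disagree noticeably.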

Set

Custom Metrics

https://docs.datadoghq.com/metrics/custom_metrics/

Metric Query

https://docs.datadoghq.com/metrics/#querying-metrics

 space-aggregation:metric.name{filter/scope} by {space_aggregation_grouping_by_tag}.time-aggregation

 avg:system.io.r_s{app:myapp} by {host}.rollup(avg, 3600)
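
The query layout above can be assembled programmatically. The helper below is a hypothetical sketch (build_query is not part of any Datadog library) that only concatenates the pieces in the order shown in the template.

```python
def build_query(space_agg: str, metric: str, scope: str = "*",
                group_by=(), rollup=None) -> str:
    """Assemble space-aggregation:metric{scope} by {tags}.rollup(method, seconds)."""
    query = f"{space_agg}:{metric}{{{scope}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    if rollup is not None:
        method, seconds = rollup
        query += f".rollup({method}, {seconds})"
    return query

query = build_query("avg", "system.io.r_s", "app:myapp", ["host"], ("avg", 3600))
print(query)
```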

TO PROCESS:

Metric Query Elements

Metric Name in Metric Query

system.io.r_s See Metric Name above.

Filter/Scope

The query metric values can be filtered based on tags. The {...} section contains a comma-separated list of tag-name:tag-value pairs. The list of tag-name:tag-value pairs, also known as the query filter, is said to scope the query. Example:

{app:myapp, something:somethingelse}
{status:true,!condition:ready,cluster_name:my-cluster}
{table_name:*event_f_incomplete AND time_hours:0 AND time_interval:0 AND time_unit:hour AND target_hour:-2 AND environment:myenv AND component:mycomp AND NOT (table_name:*tier1_event_f_incomplete)}

The curly braces {...} must always be present in the metric query definition. If there are no particular tags to filter by, use the {*} syntax. For more details on tags, see Tags.
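
The effect of a simple comma-separated scope can be sketched as a predicate over the tags of a timeseries. This is a simplified model with a hypothetical function name: it treats commas as AND and a leading ! as negation, and it does not handle wildcards inside values or the boolean AND / NOT syntax shown in the last example.

```python
def matches_scope(scope: str, tags: set) -> bool:
    """Return True if a timeseries with the given tags passes the scope."""
    scope = scope.strip()
    if scope == "*":
        return True  # {*} means no filtering
    for term in scope.split(","):
        term = term.strip()
        if term.startswith("!"):
            if term[1:] in tags:  # negated tag must be absent
                return False
        elif term not in tags:    # plain tag must be present
            return False
    return True

scope = "status:true,!condition:ready,cluster_name:my-cluster"
pending = {"status:true", "condition:pending", "cluster_name:my-cluster"}
ready = {"status:true", "condition:ready", "cluster_name:my-cluster"}
print(matches_scope(scope, pending), matches_scope(scope, ready))
```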

Space Aggregation

"Space" refers to the way metrics are distributed over hosts and tags. There are two different aspects of "space" that can be controlled when aggregating metrics: grouping and the space aggregator.

Grouping or Space Aggregator Tags

Grouping defines what constitutes a line on the graph. Grouping splits a single metric into multiple timeseries by tags such as host, container, and region. For example, if you have hundreds of hosts spread across four regions, grouping by region allows you to graph one line for every region. This would reduce the number of timeseries to four.

... by {host} ...

When more than one value is available for a group defined this way, we need to tell Datadog how to combine those values, using the space aggregator.

Space Aggregator

The space aggregator defines how the metrics in each group are combined. There are four main aggregation types available: sum, min, max, and avg. There are also count, p50, p75, p95, p99, 0-sum, 1-avg, 100-avg.
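
Grouping and the space aggregator can be illustrated together: values reported at the same moment are first grouped by a tag, then each group is reduced to a single value. The data and the function name below are made up for the sketch, and only the four main aggregators are modeled.

```python
SPACE_AGGREGATORS = {
    "sum": sum,
    "min": min,
    "max": max,
    "avg": lambda values: sum(values) / len(values),
}

def space_aggregate(points: list, group_by: str, aggregator: str) -> dict:
    """Group (tags, value) pairs by one tag and combine each group."""
    groups = {}
    for tags, value in points:
        groups.setdefault(tags[group_by], []).append(value)
    combine = SPACE_AGGREGATORS[aggregator]
    return {group: combine(values) for group, values in groups.items()}

# One value per host at a single timestamp.
points = [
    ({"host": "a", "region": "us"}, 10.0),
    ({"host": "b", "region": "us"}, 30.0),
    ({"host": "c", "region": "eu"}, 5.0),
]
print(space_aggregate(points, "region", "avg"))
print(space_aggregate(points, "region", "sum"))
```

Grouping by region turns three host-level values into two values, one per line on the graph.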

Time Aggregation

Datadog stores a large volume of points, and in most cases it’s not possible to display all of them on a graph. There would be more datapoints than pixels. Datadog uses time aggregation to solve this problem by combining data points into time buckets. For example, when examining four hours, data points are combined into two-minute buckets. This is called a rollup. As the time interval you’ve defined for your query increases, the granularity of your data becomes coarser.

There are five aggregations you can apply to combine your data in each time bucket: sum, min, max, avg, and count. By default, avg is applied, in which case rollup(...) does not show up in the metric query.

 space_agg:metric{...}.rollup(max, 60)

Time aggregation is always applied in every query you make.
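
A rollup can be pictured as bucketing datapoints by time and reducing each bucket with one of the five methods. The sketch below is an illustration with hypothetical names and data; aligning buckets to multiples of the interval is an assumption.

```python
ROLLUP_METHODS = {
    "sum": sum,
    "min": min,
    "max": max,
    "avg": lambda values: sum(values) / len(values),
    "count": len,
}

def rollup(points: list, method: str, interval_seconds: int) -> dict:
    """Combine (timestamp, value) pairs into fixed-size time buckets."""
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % interval_seconds
        buckets.setdefault(bucket_start, []).append(value)
    reduce_bucket = ROLLUP_METHODS[method]
    return {start: reduce_bucket(values) for start, values in sorted(buckets.items())}

# Datapoints every 30 seconds, rolled up into 60-second buckets.
points = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
print(rollup(points, "max", 60))
print(rollup(points, "avg", 60))
```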

Operations

Functions

Graph values can be modified with mathematical functions. This can mean performing arithmetic between an integer and a metric (for example, multiplying a metric by 2). Or performing arithmetic between two metrics (for example, creating a new timeseries for the memory utilization rate like this: jvm.heap_memory / jvm.heap_memory_max). Functions are optional.
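
Arithmetic between two metrics amounts to combining their timeseries point by point. The sketch below reproduces the memory utilization example on made-up values; the function name is hypothetical, and matching points only on timestamps present in both series is an assumption.

```python
def divide_series(numerator: dict, denominator: dict) -> dict:
    """Divide two timeseries on the timestamps they have in common."""
    return {
        ts: numerator[ts] / denominator[ts]
        for ts in sorted(numerator.keys() & denominator.keys())
        if denominator[ts] != 0  # skip points that would divide by zero
    }

# Hypothetical values for jvm.heap_memory and jvm.heap_memory_max (MB).
heap_memory = {0: 256.0, 60: 384.0}
heap_memory_max = {0: 512.0, 60: 512.0}
utilization = divide_series(heap_memory, heap_memory_max)
print(utilization)
```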

Rollup

rollup(avg, 3600)

Other Examples

 avg:myapp.smoketest.run_time{$cluster_name}/1000

Tags

https://docs.datadoghq.com/getting_started/tagging/using_tags/

Events

https://docs.datadoghq.com/events/

Events are records of notable changes relevant for managing and troubleshooting IT operations, such as code deployments, service health, configuration changes or monitoring alerts. TO PROCESS:

Agent

The Datadog agent has a built-in StatsD server, exposed over a configurable port. It's written in Go.

Agent and Kubernetes

TO CONTINUE: https://docs.datadoghq.com/developers/dogstatsd/?tab=kubernetes#

DogStatsD

https://docs.datadoghq.com/developers/dogstatsd/

DogStatsD is a metrics aggregation service bundled with the Datadog agent. DogStatsD implements the StatsD protocol and a few extensions (histogram metric type, service checks, events and tagging).

DogStatsD accepts custom metrics, events and service checks over UDP, periodically aggregates them, and forwards them to Datadog.
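
A metric reaches DogStatsD as a small text datagram of the form name:value|type, optionally followed by |#tag1,tag2 for the tagging extension. The sketch below formats such a datagram and defines a helper that would send it over UDP. The metric name and tags are made up, and 127.0.0.1:8125 is assumed to be where the agent listens (8125 is the usual default; the port is configurable). In practice, a client library would normally do this work.

```python
import socket

def format_datagram(name: str, value, metric_type: str, tags=()) -> str:
    """Build a DogStatsD datagram; metric_type is e.g. 'c', 'g', 'h', 'd' or 's'."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_datagram(datagram: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send over UDP; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))

example = format_datagram("myapp.page.views", 1, "c", ["env:dev", "service:myapp"])
print(example)
# send_datagram(example)  # uncomment when an agent is listening
```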

Monitors and Alerting

Monitors and Alerting

Unified Service Tagging

There are three reserved tags: "env", "service", "version".

Dashboard

Dashboard

Metrics Explorer

Tool to browse arbitrary metrics, by selecting their name.

Notebook

Security

User

Service Account

API Key

https://docs.datadoghq.com/account_management/api-app-keys/#api-keys

An API key is required by the Datadog Agent to submit metrics and events to Datadog. API keys are also used by third-party clients, such as the Pulumi Datadog resource provider, which provisions infrastructure on the Datadog backend. API keys are unique to an organization. To see API Keys: Console → Hover over the user name at the bottom of the left side menu → Organization Settings → API Keys.

To call the API, the client expects the environment variable DATADOG_API_KEY to be set.

Application Key

https://docs.datadoghq.com/account_management/api-app-keys/#application-keys

Application keys, in conjunction with the organization’s API key, give users access to Datadog’s programmatic API. Application keys are associated with the user account that created them and by default have the permissions and scopes of the user who created them. To see or create Application Keys: Console → Hover over the user name at the bottom of the left side menu → Organization Settings → Application Keys.
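
When calling the HTTP API directly, the two keys travel as the request headers DD-API-KEY and DD-APPLICATION-KEY. The sketch below builds those headers from the environment. DATADOG_API_KEY is the variable mentioned above; DATADOG_APP_KEY and the function name are assumptions made for this example, and no request is actually sent.

```python
import os

def datadog_auth_headers(environ=os.environ) -> dict:
    """Build authentication headers from environment variables."""
    api_key = environ.get("DATADOG_API_KEY")
    if not api_key:
        raise RuntimeError("DATADOG_API_KEY is not set")
    headers = {"DD-API-KEY": api_key}
    # The application key variable name is an assumption for this sketch.
    app_key = environ.get("DATADOG_APP_KEY")
    if app_key:
        headers["DD-APPLICATION-KEY"] = app_key
    return headers

# Demonstration with placeholder values instead of the real environment.
fake_env = {"DATADOG_API_KEY": "api-key-placeholder",
            "DATADOG_APP_KEY": "app-key-placeholder"}
headers = datadog_auth_headers(fake_env)
print(sorted(headers))
```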

API

Datadog API

Kubernetes Support

Understand this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/myapp.check_names: '["myapp"]'
        ad.datadoghq.com/myapp.init_configs: '[{"is_jmx": true, "collect_default_metrics": true}]'
        ad.datadoghq.com/myapp.instances: '[{"host": "%%host%%","port":"19081"}]'
    spec:
    [...]
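
One thing worth noting about the annotations above is that each value is a JSON document embedded in a string, and the key is prefixed with ad.datadoghq.com/ followed by a name (myapp here, presumably the container name). The sketch below generates the same three annotations from plain data structures; the function name is hypothetical.

```python
import json

def autodiscovery_annotations(container: str, check_names: list,
                              init_configs: list, instances: list) -> dict:
    """Serialize check configuration into pod annotation strings."""
    prefix = f"ad.datadoghq.com/{container}"
    return {
        f"{prefix}.check_names": json.dumps(check_names),
        f"{prefix}.init_configs": json.dumps(init_configs),
        f"{prefix}.instances": json.dumps(instances),
    }

annotations = autodiscovery_annotations(
    "myapp",
    check_names=["myapp"],
    init_configs=[{"is_jmx": True, "collect_default_metrics": True}],
    instances=[{"host": "%%host%%", "port": "19081"}],
)
for key, value in annotations.items():
    print(f"{key}: '{value}'")
```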