Automatically create and manage Kubernetes alerts with Datadog
Kubernetes enables teams to deploy and manage their own services, but this can lead to gaps in visibility as different teams create systems with varying configurations and resources. Without an established method for provisioning infrastructure, keeping track of these services becomes more challenging. Implementing infrastructure as code solves this problem by standardizing and automating the process for provisioning and updating production-ready resources.
Now, you can go one step further by easily incorporating monitoring as code into your existing Kubernetes infrastructure with the Datadog Operator. We’ve extended the Operator to include a DatadogMonitor custom resource definition (CRD). Much like Prometheus alerting rules, which allow you to configure alert conditions based on Kubernetes metrics, Datadog CRDs enable you to automatically create and manage monitors for Kubernetes resources via your Kubernetes deployment manifests and tools like kubectl.
We’ll show how to get started with the Datadog Operator and look at a few examples of Datadog monitors you can create to proactively track and alert on the performance of your Kubernetes objects.
Get started with the DatadogMonitor custom resource definition
To start creating monitors through the Kubernetes API, you will first need to install the Datadog Operator via Helm (or update it to the latest version) and create a new file containing your DatadogMonitor deployment specification. You can then deploy the new monitor using the following kubectl command:
kubectl apply -f sample-datadog-monitor.yaml
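If you have not installed the Operator yet, a minimal Helm setup looks like the following sketch. The release name datadog-operator is illustrative, and depending on your configuration you may also need to supply your Datadog API and application keys through the chart’s values so the Operator can create monitors on your behalf:

# Add the official Datadog Helm repository and install (or upgrade) the Operator
helm repo add datadog https://helm.datadoghq.com
helm repo update
helm install datadog-operator datadog/datadog-operator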
You can also add any new monitor to an existing manifest, enabling you to deploy it alongside other Kubernetes objects. Once deployed, you will be able to view your monitor in your Datadog account alongside all of your other monitors. You can also view the state of a specific monitor deployed via the DatadogMonitor custom resource definition directly in your Kubernetes environment using kubectl, as seen below:
$ kubectl get datadogmonitor sample-datadog-monitor
NAME                      ID     MONITOR STATE   LAST TRANSITION        LAST SYNC              SYNC STATUS   AGE
sample-datadog-monitor    1234   Alert           2021-03-29T17:32:47Z   2021-03-30T12:52:47Z   OK            1d
In environments that rely on a large number of alerts for multiple services, using the kubectl get command to search by specific identifiers (such as the name of an application, Helm chart, or namespace) can help you review the status of the alerts you care about most.
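For example, if you label your DatadogMonitor objects by application (the app=k8s-app-1 label below is a hypothetical convention you would define in the monitor’s metadata), you can filter them with a standard label selector:

# List DatadogMonitor objects in the datadog namespace for one application
kubectl get datadogmonitor -n datadog -l app=k8s-app-1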
Monitor Kubernetes resources as soon as they are deployed
The Datadog Operator enables you to create a repeatable process for deploying, managing, and sharing monitors that are customized for your services, so you can implement automatic monitoring across your entire Kubernetes environment. This ensures that every team within your organization can easily create a comprehensive suite of monitors as part of the deployment process for their Kubernetes applications, so they do not have to dedicate time to building their own alerting services. Next, we’ll look at how you can use the DatadogMonitor CRD to create a customized workflow for deploying alerts with your Kubernetes applications.
Alert on issues with individual applications
Pods are a core component of your Kubernetes services, so it’s important to know when a deployment fails to launch new ones. For instance, you can add the following DatadogMonitor CRD to a deployment manifest for a specific application (k8s-app-1 in the example below) to be notified when one or more of the application’s pods are in a CrashLoopBackOff state. This state means that a container is repeatedly crashing after it starts, which is often the result of not allocating enough resources for the pod in its deployment manifest.
k8s-app-1-deployment.yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  name: pods-crashloopbackoff
  namespace: datadog
spec:
  query: "max(last_10m):max:kubernetes_state.container.status_report.count.waiting{reason:crashloopbackoff, app:k8s-app-1} >= 1"
  type: "query alert"
  name: "[kubernetes] Pod {{pod_name.name}} is CrashLoopBackOff in k8s-app-1 app"
  message: "pod {{pod_name.name}} is in CrashLoopBackOff in k8s-app-1 app. \n Adjust the thresholds of the query to suit your infrastructure."
  tags:
    - "integration:kubernetes"
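Because a DatadogMonitor is an ordinary Kubernetes object, you can also append it to the application’s existing Deployment manifest as a separate YAML document, so the alert ships with the workload it watches. A minimal sketch follows; the Deployment shown is a placeholder for your own k8s-app-1 spec, and the image path is illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-app-1
  labels:
    app: k8s-app-1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: k8s-app-1
  template:
    metadata:
      labels:
        app: k8s-app-1
    spec:
      containers:
        - name: k8s-app-1
          image: registry.example.com/k8s-app-1:1.0.0  # placeholder image for illustration
---
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  name: pods-crashloopbackoff
  namespace: datadog
spec:
  query: "max(last_10m):max:kubernetes_state.container.status_report.count.waiting{reason:crashloopbackoff, app:k8s-app-1} >= 1"
  type: "query alert"
  name: "[kubernetes] Pod {{pod_name.name}} is CrashLoopBackOff in k8s-app-1 app"
  message: "pod {{pod_name.name}} is in CrashLoopBackOff in k8s-app-1 app."
  tags:
    - "integration:kubernetes"

A single kubectl apply on this file then creates or updates both the workload and its alert together.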
Track updates to cluster pods for critical services
If one of your teams maintains a backend service, they will need to know whether image updates are deployed and pulled successfully. For example, when one of the service’s pods is unable to pull a container image, it will generate an ImagePullBackOff error. This can happen when the pod references a bad image path or tag, or when its image pull credentials are not configured properly. You can monitor when this happens in any Kubernetes namespace, as seen in the snippet below:
imagepullbackoff-monitor.yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  name: pods-imagepullbackoff
  namespace: datadog
spec:
  query: "max(last_10m):max:kubernetes_state.container.status_report.count.waiting{reason:imagepullbackoff} by {kube_namespace,pod_name} >= 1"
  type: "query alert"
  name: "[kubernetes] Pod {{pod_name.name}} is ImagePullBackOff on namespace {{kube_namespace.name}}"
  message: "pod {{pod_name.name}} is ImagePullBackOff on {{kube_namespace.name}} \n This could happen for several reasons, for example a bad image path or tag or if the credentials for pulling images are not configured properly."
  tags:
    - "integration:kubernetes"
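When this monitor triggers, you can inspect the failing pod directly; the Events section of the output typically shows the underlying pull error (the pod and namespace names below are placeholders):

kubectl describe pod <pod-name> -n <namespace>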
Notify teams on the state of cluster nodes
You can also create a monitor to notify teams when a certain percentage of a cluster’s nodes are in an unschedulable state, which could mean that the cluster does not have adequate capacity to schedule new pods.
unavailable-nodes-monitor.yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  name: nodes-unavailable
  namespace: datadog
spec:
  query: "max(last_15m):sum:kubernetes_state.node.status{status:schedulable} by {kube_cluster_name} * 100 / sum:kubernetes_state.node.status{*} by {kube_cluster_name} < 80"
  type: "query alert"
  name: "[kubernetes] Monitor Unschedulable Kubernetes Nodes"
  message: "More than 20% of nodes are unschedulable on ({{kube_cluster_name.name}} cluster). \n Keep in mind that this might be expected based on your infrastructure."
  tags:
    - "integration:kubernetes"
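To route this alert to the team that owns the cluster, you can add notification handles to the monitor’s message field, just as you would in the Datadog UI. For example, assuming you have configured Datadog’s Slack integration with a channel named #k8s-ops (an illustrative name), the message line in the spec above could become:

  message: "More than 20% of nodes are unschedulable on ({{kube_cluster_name.name}} cluster). @slack-k8s-ops"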
From any triggered monitor, you can pivot to view details about the affected pod, node, or container in order to resolve the issue. Datadog offers deeper insights into the state of your Kubernetes resources with the Live Container view, so you can pinpoint the root cause of the alert. Datadog can also automatically connect your monitors to instrumented services in Datadog APM using the service tag. This enables you to view all triggered alerts for your critical Kubernetes services and dive into a specific alert for more details.
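For example, tagging the CrashLoopBackOff monitor above with the name of the instrumented application (k8s-app-1 here, assuming it matches your APM service name) links the alert to that service:

  tags:
    - "integration:kubernetes"
    - "service:k8s-app-1"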