Implications of custom WebhookConfigurations

Custom WebhookConfigurations can be used to validate, or even modify, admission requests that reach the API server.

Metakube does not impose any artificial limitations on which webhooks you can configure in your Metakube cluster.
Yet, installing a custom webhook can have drastic implications you should carefully consider.

Use cases

Kubernetes offers two kinds of custom webhook configurations: ValidatingWebhookConfiguration and MutatingWebhookConfiguration.

The former is often employed by custom Kubernetes controllers that bring their own CRDs.
The webhook then checks whether the values in an object are valid and thus protects the controller from objects it cannot handle.

The MutatingWebhookConfiguration is mostly used to modify an object "on the fly" when users should not have to make that modification themselves.
Such modifications include:

injecting sidecar containers (e.g. service mesh proxies, logging)
setting labels/annotations (e.g. canarying updates)
modifying resource requests and limits (e.g. vertical pod autoscaler)
replacing image names (e.g. registries or replacing tags with a digest)

Implications & potential risks

Single point of failure

Often a controller essentially requires the webhook to work.
This adds the webhook as a single point of failure for everything that it manages.
E.g. a service mesh requires that all pods that are part of the mesh get a sidecar proxy container injected.
If the webhook isn't available for any reason, this breaks any rollout functionality of Deployments, StatefulSets or DaemonSets immediately.
This may also result in a deadlock if the webhook itself relies on components it manages.

Strain on OpenVPN

When a request reaches the API server, the admission controller calls every configured webhook whose rules match the object.
The webhook most probably runs inside the cluster, exposed through a Service.
This means the admission request needs to be routed to a pod, and hence to a node in the cluster.
Since the API server and the cluster nodes are not in the same network, the request goes through a VPN tunnel.
Under normal circumstances this isn't an issue for performance.
But it does mean that the OpenVPN client pod in the kube-system namespace also becomes a single point of failure for things that normally wouldn't be affected by a VPN outage.

Strain on kube-apiserver

Every custom webhook adds another outgoing request (kube-apiserver is the client here) to the lifecycle of each API request it matches.
An increased request duration isn't the only consequence, though.
kube-apiserver limits the number of requests it handles concurrently.
We have seen that even if it's acceptable for the webhook request to fail (failurePolicy: Ignore), slow or failing webhook calls can cascade, congest kube-apiserver and cause unrelated requests (e.g. the attempt to fix the problem) to time out, even with multiple replicas and plenty of resources.

Considerations

Exclude the kube-system namespace

All of Metakube's managed resources reside in the kube-system namespace.
You must exclude the kube-system namespace from the scope of a custom WebhookConfiguration in order for Metakube to function properly:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system"]
  rules:
    ...

Some controllers have their own annotations to include or exclude objects from the webhook's scope, e.g. linkerd.io/inject: disabled.

These do not prevent the admission controller from sending an admission request to the webhook.
The webhook itself will make a decision based on the annotation.
The risks of relying on the webhook and the network working still apply!
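If objects should be excluded before any admission request is sent, the selection has to happen in the WebhookConfiguration itself. The following is a minimal sketch, assuming a hypothetical label `my-webhook.example.com/skip` that you set on the objects to exclude. `objectSelector` is evaluated by kube-apiserver, so objects it filters out never reach the webhook:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
# ... (metadata omitted)
webhooks:
- name: my-webhook.example.com
  # Evaluated by kube-apiserver: objects labeled skip=true are not sent to the webhook.
  objectSelector:
    matchExpressions:
    - key: my-webhook.example.com/skip   # hypothetical label, not a Kubernetes default
      operator: NotIn
      values: ["true"]
  # ... (clientConfig, rules and other fields omitted)
```

Keep in mind that anyone who may set this label on an object can bypass the webhook, so this approach only suits webhooks that are not used to enforce policy.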

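Note that namespaceSelector matches on the namespace's labels, not on its name.
The following does not work, because namespaces carry no `name` label by default and NotIn also selects namespaces that lack the key: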

- name: my-webhook.example.com
  namespaceSelector:
    matchExpressions:
    - key: name
      operator: NotIn
      values: ["kube-system"]
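To see why a label key has to be used, it helps to look at the labels a namespace actually carries. The excerpt below sketches what the kube-system namespace typically looks like on Kubernetes 1.21 or later, where the `kubernetes.io/metadata.name` label is set automatically. It is an illustration, and your cluster may show additional labels:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system                           # metadata.name is a field, not a label
  labels:
    kubernetes.io/metadata.name: kube-system  # set automatically by the control plane
```

A namespaceSelector can only match keys that appear under `labels`, which is why the working example above uses `kubernetes.io/metadata.name`.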

Reduce the timeout with `failurePolicy: Ignore`

You might set failurePolicy: Ignore to prevent a deadlock in case the webhook is unreachable or broken.
The default timeout of the client-go library for Kubernetes (the most widely used client) is 30s.
The default timeout for admission requests is also 30s with admissionregistration.k8s.io/v1beta1 (with v1 the default is 10s and 30s is the maximum).
So if the admission request only times out after 30s, the failure policy of Ignore might ignore the failure, but by then the client has already canceled the request.

Consider decreasing the timeout (timeoutSeconds) of the webhook so that the failure policy can take effect before the client gives up.
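As a minimal sketch, the two settings could be combined as follows. The value of 5 seconds is an assumption chosen for illustration, not a recommendation. Pick a value well below your clients' timeout that the webhook can still reliably meet:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
# ... (metadata omitted)
webhooks:
- name: my-webhook.example.com
  failurePolicy: Ignore   # admit the request if the webhook errors or times out
  timeoutSeconds: 5       # illustrative value; must be between 1 and 30
  # ... (clientConfig, rules and other fields omitted)
```

With these settings, an unreachable webhook delays a matching request by at most about 5 seconds per webhook call before kube-apiserver continues without it.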
