Custom WebhookConfigurations can be used to validate admission requests that reach the API-Server or even change them.
Metakube does not impose any artificial limitations on which webhooks you can configure in your Metakube cluster.
Yet, installing a custom webhook can have drastic implications you should carefully consider.
Kubernetes offers two kinds of custom webhook configurations: ValidatingWebhookConfiguration and MutatingWebhookConfiguration.
The former is often employed by custom Kubernetes controllers that bring their own CRDs.
The webhook then checks whether the values in an object are valid, protecting the controller from objects it cannot handle.
The MutatingWebhookConfiguration is mostly used to modify an object "on the fly" when it wouldn't be appropriate to expect the user to make that modification themselves.
Typical examples are injecting sidecar containers into pods or setting default values.
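As a rough sketch, a controller that ships its own CRD might register such a validation webhook as follows (the names, Service and path are placeholders, not taken from a real controller):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: my-controller-validation             # hypothetical name
webhooks:
  - name: validate.my-controller.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: my-controller-webhook          # Service in front of the webhook pods
        namespace: my-controller
        path: /validate
    rules:
      - apiGroups: ["example.com"]           # the controller's own CRD group
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["widgets"]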
Often a controller essentially requires its webhook in order to function.
This adds the webhook as a single point of failure for everything that it manages.
E.g. a service mesh requires that all pods that are part of the mesh get a sidecar proxy container injected.
If the webhook isn't available for any reason, this breaks any rollout functionality of Deployments, StatefulSets or DaemonSets immediately.
This may also result in a deadlock, if the webhook itself relies on components it manages.
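One common mitigation, sketched here with placeholder names, is to exclude the namespace the webhook itself runs in from its own scope, so that the webhook's pods can still be (re)created while the webhook is down:
webhooks:
  - name: inject.my-mesh.example.com         # hypothetical mutating webhook
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["my-mesh-system"]         # the namespace the webhook itself runs in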
When a request reaches the API-Server, the admission controller calls every webhook configured that matches the object.
The webhook itself is most likely running in the cluster, exposed as a Service.
This means the admission request needs to be routed to a pod, and thus to a node in the cluster.
Since the kube-apiserver and the cluster nodes are not in the same network, the request goes through a VPN tunnel.
Under normal circumstances this isn't an issue for performance.
But it does mean that the OpenVPN client pod in the kube-system namespace also becomes a single point of failure for things that normally wouldn't be affected by a VPN outage.
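This routing follows from the webhook's clientConfig; a typical in-cluster setup (names are placeholders) references a Service, so every admission request has to reach a pod behind that Service:
webhooks:
  - name: my-webhook.example.com
    clientConfig:
      caBundle: <base64-encoded CA bundle>   # used by kube-apiserver to verify the webhook's TLS certificate
      service:
        name: my-webhook                     # Service inside the cluster
        namespace: my-namespace
        port: 443
        path: /mutate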
Every custom webhook adds another client-side request (with kube-apiserver acting as the client) to the lifecycle of each matching API request.
An increased request duration isn't the only consequence, though.
kube-apiserver limits the number of requests it handles concurrently.
We have seen that even when it's acceptable for requests to fail because of the webhook (failurePolicy: Fail), the webhook calls waiting for their timeout can cascade, congest kube-apiserver and cause unrelated requests (e.g. the attempt to fix the webhook) to time out, even with multiple replicas and plenty of resources.
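One way to limit this overhead, sketched with illustrative values, is to scope the webhook as narrowly as possible so that fewer requests have to pass through it in the first place:
webhooks:
  - name: my-webhook.example.com
    objectSelector:                          # only objects carrying this label are sent to the webhook
      matchLabels:
        my-webhook.example.com/enabled: "true"
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]               # e.g. only creations, not updates or deletions
        resources: ["pods"]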
All of Metakube's managed resources reside in the kube-system namespace.
You must exclude the kube-system namespace from the scope of a custom WebhookConfiguration in order for Metakube to function properly:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
  - name: my-webhook.example.com
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]
    rules:
    ...
Some controllers have their own annotations to include or exclude objects from the webhook's scope, e.g. linkerd.io/inject: disabled.
These do not prevent the admission controller from sending an admission request to the webhook.
The webhook itself will make a decision based on the annotation.
The risks of relying on the webhook and the network working still apply!
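For illustration, such an annotation is typically set on the namespace or the pod template and only influences the webhook's decision, not whether it is called at all:
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  annotations:
    linkerd.io/inject: disabled              # the injector webhook still receives the admission request,
                                             # it just decides not to inject a sidecar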
Note that the selector matches on the namespace's labels.
The following does not work:
- name: my-webhook.example.com
  namespaceSelector:
    matchExpressions:
      - key: name
        operator: NotIn
        values: ["kube-system"]
failurePolicy: Ignore
The default timeout of the client-go library for Kubernetes (the most widely used client) is 30s.
The timeout for an admission webhook request can also be set to up to 30s (the admissionregistration.k8s.io/v1 API defaults to 10s).
So if the admission request only fails after a timeout close to 30s, the Ignore failure policy might ignore the failure, but the client will already have cancelled its request.
You might configure this failure policy to prevent a deadlock in case the webhook isn't reachable or broken.
Consider decreasing the timeoutSeconds of the WebhookConfiguration so that the failure policy can take its desired effect.
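As an illustrative sketch, combining the Ignore policy with a short timeout lets the webhook call give up, and the failure be ignored, well before the client's own 30s timeout expires:
webhooks:
  - name: my-webhook.example.com
    failurePolicy: Ignore                    # admit the request if the webhook call fails
    timeoutSeconds: 5                        # give up on the webhook long before the client's 30s timeout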