Citadel Health Checking

You can enable Citadel’s health checking feature to detect failures of the Citadel CSR (Certificate Signing Request) service. When a failure is detected, Kubelet automatically restarts the Citadel container.

When health checking is enabled, the prober client module in Citadel periodically checks the health of Citadel’s CSR gRPC server by sending CSRs to it and verifying the responses. If Citadel is healthy, the prober client updates the modification time of the health status file; otherwise, it leaves the file unchanged. Citadel relies on Kubernetes liveness and readiness probes that run a command in the pod to check the modification time of the health status file. If the file is not updated within a configured period, Kubelet restarts the Citadel container.

Before you begin

Follow the Istio installation guide to install Istio with mutual TLS enabled.

Deploying Citadel with health checking

To enable health checking, redeploy Citadel:

$ istioctl manifest generate --set values.global.mtls.enabled=true,values.security.citadelHealthCheck=true > citadel-health-check.yaml
$ kubectl apply -f citadel-health-check.yaml

Verify that health checking works

Citadel logs the health checking results. Run the following command:

$ kubectl logs $(kubectl get po -n istio-system | grep istio-citadel | awk '{print $1}') -n istio-system | grep "CSR signing service"

You should see output similar to:

... CSR signing service is healthy (logged every 100 times).

This log line indicates that periodic health checking is working. The default health checking interval is 15 seconds, and the result is logged once every 100 checks.

(Optional) Configuring the health checking

This section describes how to modify the health checking configuration. Open citadel-health-check.yaml and locate the following lines:

...
  - --liveness-probe-path=/tmp/ca.liveness # path to the liveness health checking status file
  - --liveness-probe-interval=60s # interval for health checking file update
  - --probe-check-interval=15s    # interval for health status check
livenessProbe:
  exec:
    command:
    - /usr/local/bin/istio_ca
    - probe
    - --probe-path=/tmp/ca.liveness # path to the liveness health checking status file
    - --interval=125s               # the maximum time gap allowed between the file mtime and the current sys clock.
  initialDelaySeconds: 60
  periodSeconds: 60
...

The settings work as follows:

  • liveness-probe-path (in the Citadel arguments) and probe-path (in the livenessProbe) are the paths to the health status file. They must refer to the same file, so update both at the same time.
  • liveness-probe-interval determines how often Citadel updates the health status file while it is healthy.
  • probe-check-interval determines how often the Citadel health checking controller calls the Citadel CSR service.
  • interval is the maximum time allowed to elapse since the last update of the health status file for the prober to consider Citadel healthy.
  • initialDelaySeconds and periodSeconds determine the initial delay before the first livenessProbe run and the interval between subsequent runs.

Increasing probe-check-interval reduces the health checking overhead, but the prober takes longer to detect an unhealthy status. To avoid restarting Citadel because of temporary unavailability, you can set interval on the prober to more than N times liveness-probe-interval. This allows the prober to tolerate N-1 consecutive failed health checks.

Cleanup

  • To disable health checking on Citadel:

    $ istioctl manifest apply --set values.global.mtls.enabled=true