Network Resilience and Testing
Istio provides opt-in failure recovery features that you can configure dynamically at runtime through the Istio traffic management rules. With these features, the service mesh can tolerate failing nodes and Istio can prevent localized failures from cascading to other nodes:
Timeouts and retries
A timeout is the amount of time that Istio waits for a response to a request. A retry is an attempt to complete an operation multiple times if it fails. You can set defaults and specify request-level overrides for both timeouts and retries or for one or the other.
Circuit breakers prevent your application from stalling as it waits for an upstream service to respond. You can configure a circuit breaker based on a number of conditions, such as connection and request limits.
Fault injection is a testing method that introduces errors into a system to ensure that it can withstand and recover from error conditions. You can inject faults at the application layer, rather than the network layer, to get more relevant results.
You can use Istio failure recovery features to complement application-level fault tolerance libraries in situations where their behaviors don’t conflict.
Timeouts and retries
You can use Istio’s traffic management resources to set defaults for timeouts and retries per service and subset that apply to all callers.
Override default timeout setting
The default timeout for HTTP requests is 15 seconds. You can configure a virtual service with a routing rule to override the default, for example:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
        subset: v1
    timeout: 10s
Set number and timeouts for retries
You can specify the maximum number of retries for an HTTP request in a virtual service, and you can provide specific timeouts for the retries to ensure that the calling service gets a response, either success or failure, within a predictable time frame.
Envoy proxies automatically add variable jitter between your retries to minimize the potential impact of retries on an overloaded upstream service.
The following virtual service configures three attempts with a 2-second timeout:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
        subset: v1
    retries:
      attempts: 3
      perTryTimeout: 2s
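Timeouts and retries can also be set on the same route. The following is a minimal sketch, not a recommended configuration: the 10-second overall timeout and the `retryOn` conditions are assumptions chosen for illustration. The overall `timeout` bounds the whole request, including every retry, while `perTryTimeout` bounds each individual attempt.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
        subset: v1
    # Overall deadline for the request, covering all retries (assumed value).
    timeout: 10s
    retries:
      attempts: 3
      # Deadline for each individual attempt.
      perTryTimeout: 2s
      # Assumed retry conditions; the names are Envoy retry policies.
      retryOn: gateway-error,connect-failure,refused-stream
```

With these values, the caller receives a success or a failure within roughly 10 seconds, regardless of how the individual attempts play out.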
Consumers of a service can also override timeout and retry defaults with request-level overrides through special HTTP headers. The Envoy proxy implementation makes the following headers available:
Timeouts: x-envoy-upstream-rq-timeout-ms
Retries: x-envoy-max-retries
Circuit breakers
As with timeouts and retries, you can configure a circuit breaker pattern without changing your services. While retries let your application recover from transient errors, a circuit breaker pattern prevents your application from stalling as it waits for an upstream service to respond. By configuring a circuit breaker pattern, you allow your application to fail fast and handle the error appropriately, for example, by triggering an alert. You can configure a simple circuit breaker pattern based on a number of conditions such as connection and request limits.
Limit connections to 100
The following destination rule sets a limit of 100 connections for the reviews service workloads of the v1 subset:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 100
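Connection limits are only one of the conditions a circuit breaker can use. As a hedged sketch, the same destination rule could also limit pending HTTP requests and temporarily eject hosts that keep returning errors. All thresholds below are illustrative assumptions rather than recommended values, and older Istio releases name the error threshold `consecutiveErrors` instead of `consecutive5xxErrors`.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
    trafficPolicy:
      connectionPool:
        tcp:
          maxConnections: 100
        http:
          # Assumed limit on requests queued while waiting for a connection.
          http1MaxPendingRequests: 10
          # Assumed limit; 1 disables connection reuse (keep-alive).
          maxRequestsPerConnection: 1
      outlierDetection:
        # Eject a host after 5 consecutive 5xx responses (assumed threshold).
        consecutive5xxErrors: 5
        # How often hosts are scanned for ejection.
        interval: 10s
        # Minimum time an ejected host stays out of the load balancing pool.
        baseEjectionTime: 30s
        # Never eject more than half of the hosts at once.
        maxEjectionPercent: 50
```

Requests that exceed the connection pool limits fail fast instead of queuing indefinitely, which is the behavior the circuit breaker pattern is meant to provide.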
See the circuit-breaking task for detailed instructions on how to configure a circuit breaker pattern.
Fault injection
You can use fault injection to test the end-to-end failure recovery capability of the application as a whole. An incorrect configuration of the failure recovery policies could result in unavailability of critical services. Examples of incorrect configurations include incompatible or restrictive timeouts across service calls.
With Istio, you can use application-layer fault injection instead of killing pods, delaying packets, or corrupting packets at the TCP layer. You can inject more relevant failures at the application layer, such as HTTP error codes, to test the resilience of an application.
You can inject faults into requests that match specific conditions, and you can restrict the percentage of requests Istio subjects to faults.
You can inject two types of faults:
Delays: Delays are timing failures. They mimic increased network latency or an overloaded upstream service.
Aborts: Aborts are crash failures. They mimic failures in upstream services. Aborts usually manifest in the form of HTTP error codes or TCP connection failures.
You can configure a virtual service to inject one or more faults while forwarding HTTP requests to the rule’s corresponding request destination. The faults can be either delays or aborts.
Introduce a 5 second delay in 10% of requests
You can configure a virtual service to introduce a 5-second delay for 10% of the requests to the ratings service. The percentage value is expressed on a scale of 0.0 to 100.0, so 10% is written as 10:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        percentage:
          value: 10
        fixedDelay: 5s
    route:
    - destination:
        host: ratings
        subset: v1
Return an HTTP 400 error code for 10% of requests
You can configure an abort instead to terminate a request and simulate a failure.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        percentage:
          value: 10
        httpStatus: 400
    route:
    - destination:
        host: ratings
        subset: v1
Combine delay and abort faults
You can use delay and abort faults together. The following configuration introduces a delay of 5 seconds for all requests from the v2 subset of the reviews service to the v1 subset of the ratings service and an abort for 10% of them:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - match:
    - sourceLabels:
        app: reviews
        version: v2
    fault:
      delay:
        fixedDelay: 5s
      abort:
        percentage:
          value: 10
        httpStatus: 400
    route:
    - destination:
        host: ratings
        subset: v1
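Source labels are one way to restrict which requests receive a fault. Another option is to match on a request header, so that only deliberately marked test traffic is affected. The sketch below assumes a hypothetical `x-fault-test` header and an illustrative 50% abort rate. Neither comes from the examples above.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  # Only requests carrying the hypothetical test header are subject to the fault.
  - match:
    - headers:
        x-fault-test:
          exact: "true"
    fault:
      abort:
        percentage:
          # Values range from 0.0 to 100.0, so 50 means 50% of matching requests.
          value: 50
        httpStatus: 503
    route:
    - destination:
        host: ratings
        subset: v1
  # All other requests are routed normally, without faults.
  - route:
    - destination:
        host: ratings
        subset: v1
```

Because routing rules are evaluated in order, the unconditional route must come last. Otherwise it would match every request before the fault rule is considered.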
For detailed instructions on how to configure delays and aborts, visit our fault injection task.
Compatibility with fault tolerance libraries
Istio failure recovery features are completely transparent to the application. Applications cannot distinguish between the Envoy proxy’s failure response and the failure response of the called upstream service, so fault tolerance libraries such as Hystrix are compatible with Istio.
When you use application-level fault tolerance libraries and Envoy proxy failure recovery policies at the same time, the more restrictive of the two triggers first when failures occur.
For example, suppose you have two timeouts, one configured in a virtual service and another in an application’s library. The application sets a 5-second timeout for an API call to a service, but you configured a 10-second timeout in your virtual service. In this case, the application’s timeout kicks in first.
Similarly, if you configure a circuit breaker using Istio and it triggers before the application’s circuit breaker, the API calls to the service get a 503 error code from Istio’s Envoy proxy.