Enabling Defense-in-Depth for Enterprise Applications

Metrics and traceability are standardized across services, which increases reliability rather than paging fatigue or the volume of unprocessed data.

SRE builds on service observability. Successful SRE needs data that is winnowed down and clearly actionable:

  • Key signals as alerts for short-term availability
  • Historical analysis to design for long-term availability

The same golden signals of latency, traffic, errors, and saturation need to be collected and viewed for all services and potentially all pods.
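
As a rough illustration, the four golden signals could be expressed as one PromQL query each. This is a minimal sketch, assuming Istio's standard metrics (`istio_requests_total`, `istio_request_duration_milliseconds_bucket`) are scraped by Prometheus and that cAdvisor's `container_cpu_usage_seconds_total` is available for saturation. The function name, the `reviews` service, and the pod naming pattern are hypothetical.

```python
def golden_signal_queries(service, namespace, window="5m"):
    """Return one PromQL expression per golden signal for a single service."""
    # reporter="destination" avoids counting each request twice
    # (once by the client sidecar and once by the server sidecar).
    selector = (
        'reporter="destination",'
        f'destination_service_name="{service}",'
        f'destination_service_namespace="{namespace}"'
    )
    return {
        # Latency: 99th percentile request duration in milliseconds.
        "latency_p99_ms": (
            "histogram_quantile(0.99, sum by (le) ("
            f"rate(istio_request_duration_milliseconds_bucket{{{selector}}}[{window}])))"
        ),
        # Traffic: requests per second.
        "traffic_rps": f"sum(rate(istio_requests_total{{{selector}}}[{window}]))",
        # Errors: share of requests answered with a 5xx response code.
        "error_ratio": (
            f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
            f" / sum(rate(istio_requests_total{{{selector}}}[{window}]))"
        ),
        # Saturation: CPU cores in use. Assumes pod names start with the
        # service name, which is a hypothetical naming convention.
        "saturation_cpu_cores": (
            "sum(rate(container_cpu_usage_seconds_total"
            f'{{namespace="{namespace}",pod=~"{service}-.*"}}[{window}]))'
        ),
    }


queries = golden_signal_queries("reviews", "default")
for name, expr in queries.items():
    print(f"{name}: {expr}")
```

Generating the queries from one function is one way to keep the same four signals identical across every service dashboard.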

Who is involved

SRE Team

Structures best practices for achieving service level objectives through short-term remediation and long-term service improvement.

DevOps Team

Developers tasked with the building, deployment, and operation of a subset of the services in the organization.

Additional Stakeholders

Platform owner

(if separate from the SRE team)

Business owner

and related service level agreement

Preconditions

  • A microservices architecture, such as a Kubernetes deployment or a VM-based implementation.
  • DevOps practices in place.

Workflow

Istio proxy-level and service-level metrics are instituted, with Envoy statistics collected and passed to Prometheus. Standardized Grafana dashboards are made available to teams. Distributed tracing is also implemented.

Consider implementing Kiali to visualize the service mesh.

If metric cardinality is creating excess data and traffic, implement federated Prometheus servers with recording rules that roll up the metrics.
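
To make the roll-up idea concrete, the sketch below sums time series after dropping a high-cardinality label, which is roughly what a PromQL `sum by (...)` aggregation in a recording rule does before a federated server pulls the result. It is plain Python for illustration only; the `roll_up` helper, the sample values, and the `pod` label are hypothetical.

```python
from collections import defaultdict


def roll_up(samples, keep_labels):
    """Sum sample values after dropping every label not in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The series identity is the sorted set of labels that survive.
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep_labels))
        totals[key] += value
    return dict(totals)


# Hypothetical per-pod request rates (requests per second).
samples = [
    ({"destination_service_name": "reviews", "pod": "reviews-a", "response_code": "200"}, 12.0),
    ({"destination_service_name": "reviews", "pod": "reviews-b", "response_code": "200"}, 8.0),
    ({"destination_service_name": "reviews", "pod": "reviews-b", "response_code": "503"}, 1.0),
    ({"destination_service_name": "ratings", "pod": "ratings-a", "response_code": "200"}, 5.0),
]

# Dropping the "pod" label collapses four series into three.
rolled = roll_up(samples, keep_labels={"destination_service_name", "response_code"})
for key, total in rolled.items():
    print(dict(key), total)
```

The fewer labels that are kept, the fewer series the upstream server has to store, at the cost of losing per-pod detail at that level.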

Proxy Level

Proxy-level, service-level, and tracing metrics are available in a standardized way. Alerting and paging are actionable and do not bog down engineers' forward-looking work.
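
One common way to keep paging actionable is to alert on how fast the error budget is being spent rather than on every error spike. The sketch below is an assumption, not something prescribed here: it uses a burn-rate check over a long and a short window, with a hypothetical 99.9% objective and a threshold of 14.4, and the function names are illustrative.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than allowed the error budget is spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors,
                slo_target=0.999, threshold=14.4):
    """Page only if both windows burn the budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so resolved incidents stop paging.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


# Hypothetical error ratios: 2% over the last hour and the last 5 minutes.
print(should_page(0.02, 0.02))
# Same hour, but the last 5 minutes have recovered to 0.05% errors.
print(should_page(0.02, 0.0005))
```

The error ratios would come from a query such as the share of 5xx responses per service; the thresholds and windows should be tuned to each service level objective.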
