Enabling Defense-in-Depth for Enterprise Applications
Metrics and traceability are standardized across services so that they increase reliability rather than paging fatigue or the volume of unprocessed data.
SRE builds on the observability of services. Successful SRE needs winnowed-down, clearly actionable data:
- Key signals as alerts for short-term availability
- Historical analysis to design for long-term availability
The same golden signals of latency, traffic, errors, and saturation need to be collected and viewed for all services and potentially all pods.
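As a rough illustration of collecting the same four signals for every service, the sketch below builds one PromQL expression per golden signal. It assumes Istio's standard metrics (`istio_requests_total`, `istio_request_duration_milliseconds_bucket`) and the cAdvisor metric `container_cpu_usage_seconds_total` for saturation, since Istio itself does not report saturation. The function name, the `pod_regex` parameter, and the choice of the 99th percentile are illustrative assumptions, and label names can differ between installations.

```python
# Sketch only: metric and label names follow Istio and cAdvisor defaults
# and should be checked against the actual mesh before use.

def golden_signal_queries(service: str, pod_regex: str, window: str = "5m") -> dict:
    """Return one PromQL expression per golden signal.

    `service` is the destination service host name as seen by Istio.
    `pod_regex` (hypothetical parameter) matches the pods backing the service.
    """
    sel = f'reporter="destination",destination_service="{service}"'
    return {
        # Latency: 99th percentile request duration in milliseconds.
        "latency": (
            "histogram_quantile(0.99, sum by (le) ("
            f"rate(istio_request_duration_milliseconds_bucket{{{sel}}}[{window}])))"
        ),
        # Traffic: requests per second received by the service.
        "traffic": f"sum(rate(istio_requests_total{{{sel}}}[{window}]))",
        # Errors: fraction of requests answered with a 5xx response code.
        "errors": (
            f'sum(rate(istio_requests_total{{{sel},response_code=~"5.."}}[{window}]))'
            f" / sum(rate(istio_requests_total{{{sel}}}[{window}]))"
        ),
        # Saturation: CPU cores consumed by the matching pods (cAdvisor metric).
        "saturation": (
            f'sum(rate(container_cpu_usage_seconds_total{{pod=~"{pod_regex}"}}[{window}]))'
        ),
    }


if __name__ == "__main__":
    queries = golden_signal_queries("reviews.default.svc.cluster.local", "reviews-.*")
    for signal, expr in queries.items():
        print(f"{signal}: {expr}")
```

Generating the expressions from one function keeps dashboards and alerts consistent across services: only the service name and pod pattern change.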
Who is involved
SRE team: structures best practices for achieving service level objectives through short-term remediation and long-term service improvement.
Developers tasked with the building, deployment, and operation of a subset of the services in the organization.
DevOps team (if separate from the SRE team)
Business owner and related service level agreement
- A microservices architecture, such as a Kubernetes deployment or a VM-based implementation.
- DevOps practices in place.
Istio proxy-level and service-level metrics are in place: Envoy statistics are collected and passed to Prometheus. Standardized Grafana dashboards are made available to teams. Distributed tracing is also implemented.
Consider implementing Kiali for a visual view of the service mesh.
If metric cardinality is creating excess data and traffic, implement federated Prometheus servers and use recording rules to roll up the metrics.
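To make the roll-up idea concrete, the sketch below emulates in plain Python what a `sum without (...)` recording rule does: it drops high-cardinality labels and sums the remaining series. It also builds a URL for Prometheus's `/federate` endpoint, which takes `match[]` selectors, so that a higher-level server pulls only the rolled-up series. The host name, the `pod` label, the sample values, and the `workload:` rule-name prefix are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlencode


def roll_up(samples, drop_labels):
    """Sum sample values after removing the labels in `drop_labels`.

    `samples` is a list of (labels_dict, value) pairs. The result maps a
    sorted tuple of the remaining (label, value) pairs to the summed value,
    which mirrors a PromQL `sum without (...)` aggregation.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


def federate_url(base_url, selectors):
    """Build a /federate URL that requests only the series matching `selectors`."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical per-pod request counts for two services.
    samples = [
        ({"destination_service": "reviews", "pod": "reviews-v1-a"}, 3.0),
        ({"destination_service": "reviews", "pod": "reviews-v1-b"}, 2.0),
        ({"destination_service": "ratings", "pod": "ratings-v1-a"}, 1.0),
    ]
    # Three per-pod series collapse into two per-service series.
    print(roll_up(samples, drop_labels={"pod"}))

    # Assumed naming convention: recording rules are prefixed with "workload:".
    print(federate_url("http://prometheus.example:9090", ['{__name__=~"workload:.*"}']))
```

The local Prometheus server keeps the detailed series for short-term debugging, while the federating server stores only the aggregated ones, which reduces both storage and scrape traffic.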
Proxy-level, service-level, and tracing metrics are available in a standardized way. Alerting and paging are actionable and do not bog down engineers' forward-looking work.