The pace of digital innovation is faster than ever. Organizations are shifting workloads to the cloud, modernizing applications, and building new cloud-native applications to increase reliability, improve efficiency, and deliver better customer experiences. But as IT and DevOps teams adopt hybrid and multi-cloud environments and technologies such as containers, Kubernetes, serverless functions, and microservices, they face greater operational complexity and increasingly complicated failure conditions.
To counter this, observability is the answer: it lets you truly understand what complex systems are doing, so you can quickly identify and solve unknown issues in production. In short, observability allows for seamless monitoring, troubleshooting, and resolution across any stack, taking you from an alert to problem resolution.
What is Observability?
With the significant increase in complexity of cloud-native environments and more dynamic architectures, system failures (known or unknown) are certain to happen, and finding their root cause poses an enormous challenge for IT support teams. To counter this, you need to make your system observable: by collecting and working with observability data such as events, metrics, and traces, you can understand what is happening inside the system, anticipate anomalies, and find and resolve the root cause quickly.
Observability is especially suited to distributed systems, mainly because it allows you to answer the question "Why is my system not working?"
Thanks to observability, IT operations teams have full visibility of the end-to-end (E2E) architecture, allowing them to easily trace any anomaly back to its cause.
The Benefits of Observability
- Demystifying systems complexity
- Real-time view on system’s internal state
- Faster Root cause finding
- Rapid incident resolution
- Meeting service level agreements (SLAs) and service level objectives (SLOs)
- Better Net Promoter Score (NPS) and customer satisfaction
Pillars of Observability
As stated before, observability is the ability to measure a system's health based on its data outputs such as events, metrics, and traces (the more relevant data you collect, the better). The following are the three pillars of observability:
The first pillar of observability is logs: data outputs that include, among other things, a timestamp, severity, component, and a descriptive message. Logs come from sources such as:
- Business applications
- Infrastructure applications
- Control Plane services
- Additional resources (cache, database, message pipelines)
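To make these fields concrete, here is a minimal sketch of emitting structured logs with Python's standard `logging` module. The JSON field names and the `checkout-service` component name are illustrative, not prescribed by the article.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as a JSON object with timestamp, severity,
    component, and a descriptive message -- the fields named above."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")  # hypothetical component
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line that a log backend can index and query.
logger.warning("payment gateway responded slowly")
```

Structured (JSON) output is a common choice because it lets a backend filter on `severity` or `component` without regex parsing.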
Metrics, the second pillar, are time series data that represent measurable values of different resources:
- Gauge metrics: the value measures a specific instant in time (e.g., CPU utilization)
- Delta metrics: the value measures the change since it was last recorded (e.g., request counts)
- Cumulative metrics: the value constantly increases over time (e.g., bytes sent)
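The difference between the three metric kinds can be sketched with a few toy Python classes. The class names are illustrative, not any particular monitoring library's API:

```python
class Gauge:
    """Gauge: the value reflects a specific instant in time (e.g., CPU utilization)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class DeltaCounter:
    """Delta: each collection reports the change since the last recording, then resets."""
    def __init__(self):
        self._count = 0
    def increment(self, n=1):
        self._count += n
    def collect(self):
        delta, self._count = self._count, 0
        return delta

class CumulativeCounter:
    """Cumulative: the value only ever increases over time (e.g., bytes sent)."""
    def __init__(self):
        self.total = 0
    def add(self, n):
        self.total += n

cpu = Gauge()
cpu.set(0.42)                 # snapshot of this instant

requests = DeltaCounter()
requests.increment()
requests.increment()          # two requests since the last collection

sent = CumulativeCounter()
sent.add(1024)
sent.add(512)                 # running total keeps growing
```

The practical consequence: a gauge can go down, a delta resets on every collection, and a cumulative value never decreases (except on process restart).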
Traces, the third pillar, are a significant sample of the behavior of the components in a distributed architecture. This pillar consists of measuring the order and latencies of the intercommunicating service chains to detect performance problems, poor parallelization of tasks, and poor communication design.
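The idea of measuring order and latency across a service chain can be sketched with a minimal span abstraction. Real systems would use a tracing library such as OpenTelemetry; this toy version (with hypothetical service names) only shows the core mechanics of shared trace IDs, parent/child links, and timed durations:

```python
import time
import uuid

class Span:
    """A minimal trace span: a named, timed unit of work linked to a trace."""
    def __init__(self, name, trace_id=None, parent=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared across the chain
        self.parent = parent                          # who called us
        self.start = None
        self.duration = None

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.duration = time.perf_counter() - self.start
        return False

    def child(self, name):
        return Span(name, trace_id=self.trace_id, parent=self.name)

# One request fans out to two downstream calls. Because all spans share a
# trace id, a backend can reconstruct call order and compare latencies.
with Span("api-gateway") as root:
    with root.child("auth-service") as auth:
        time.sleep(0.01)
    with root.child("orders-service") as orders:
        time.sleep(0.01)
```

Comparing `root.duration` against the sum of child durations is exactly how a trace view reveals poor parallelization: here the two downstream calls run sequentially, so the root span is roughly their sum.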
Observability in Action – Google Cloud Platform Use Case
We built alert policies to automatically detect an increase in 5xx errors. When Google Cloud Platform detects one of these cases, it sends a notification to our ticketing system using the built-in webhooks, so we can investigate it.
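On the receiving side, such a webhook boils down to parsing the alert payload and filing a ticket. The sketch below assumes a Cloud Monitoring-style payload with an `incident` object; the payload shape and the `open_ticket` helper are illustrative stand-ins, not the actual ticketing integration:

```python
import json

tickets = []

def open_ticket(summary, policy):
    """Stand-in for a call to the ticketing system's API."""
    tickets.append({"summary": summary, "policy": policy})

def handle_alert(body):
    """Parse a monitoring webhook payload and file a ticket.

    The payload shape (an "incident" object carrying a summary and the
    name of the alert policy that fired) is an assumption modeled on
    Cloud Monitoring-style notifications.
    """
    payload = json.loads(body)
    incident = payload.get("incident", {})
    open_ticket(
        summary=incident.get("summary", "unknown alert"),
        policy=incident.get("policy_name", "unknown policy"),
    )

# Simulate an incoming notification from the 5xx alert policy.
handle_alert(json.dumps({
    "incident": {
        "summary": "5xx error rate above threshold",
        "policy_name": "http-5xx-spike",
    }
}))
```

Keeping the policy name on the ticket is what lets the team jump straight from the ticket back to the firing alert.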
With these alarms we have been able to solve numerous bugs in the source code that were overlooked in the testing phase of development. Since every error is logged on GCP, we can detect common errors and easily identify the root cause. These errors range from bugs in the source code to misconfigurations in the infrastructure.
For example, we were able to detect the failure of an API when a subset of customers with specific parameters was consuming the resource. When the alarm was received, we identified a specific request, and we filtered the logs for all requests of the same type, leading us to discover the pattern.
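That filtering step can be sketched as follows. The log entries and field names are hypothetical, shaped loosely like HTTP request logs; the point is the workflow of isolating 5xx responses and counting which request parameters the failures share:

```python
from collections import Counter

# Hypothetical log entries, loosely shaped like HTTP request logs.
logs = [
    {"status": 200, "path": "/api/orders", "params": {"region": "eu"}},
    {"status": 500, "path": "/api/orders", "params": {"region": "us", "currency": "XYZ"}},
    {"status": 500, "path": "/api/orders", "params": {"region": "eu", "currency": "XYZ"}},
    {"status": 200, "path": "/api/orders", "params": {"region": "us"}},
    {"status": 500, "path": "/api/orders", "params": {"region": "us", "currency": "XYZ"}},
]

# Step 1: isolate the failing requests, since the alert fired on 5xx responses.
errors = [entry for entry in logs if 500 <= entry["status"] < 600]

# Step 2: count which (parameter, value) pairs the failing requests share.
param_counts = Counter(
    (key, value)
    for entry in errors
    for key, value in entry["params"].items()
)

# The parameter common to all failures reveals the pattern.
pattern = param_counts.most_common(1)[0]
```

Here every failing request carries `currency=XYZ` while the successful ones do not, which is the kind of pattern that points straight at the root cause.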
Once the root cause was fixed, we created a specific alarm for this scenario, facilitating future investigations.