Wednesday, 26 April 2023

OpenTelemetry - what is it & why should you care?

Per the project's about page (https://opentelemetry.io/about/):


OpenTelemetry provides the libraries, agents, and other components that you need to capture telemetry from your services so that you can better observe, manage, and debug them

But what does that actually mean?


Let's first take a look at the domain - telemetry. It's composed of logs, metrics and traces. The first two have been present in (enterprise) Java for a long time. The third component, traces, has gained popularity in recent years with the rise of distributed architectures.

In monolithic architectures, logs usually contain all the necessary data, i.e. the complete business operation. In the microservices world it's not that simple. Engineers need a way to correlate execution happening in multiple places, often asynchronously (messaging systems, anyone?). This is where distributed tracing comes into play. Each instrumented (more on this later) component involved in processing a business operation generates a data point (aka a span). Instrumentation also makes sure that all spans of a business operation (together called a trace) are correlated (by a shared traceId) and can be put in chronological order.
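To make the correlation idea concrete, here is a minimal sketch in plain Java. It does not use the OpenTelemetry API; the `Span` record, its fields and the sample names are hypothetical simplifications chosen only for illustration. It groups collected spans by their traceId and sorts each group by start time:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpanCorrelation {

    // Hypothetical, simplified span; real OpenTelemetry spans carry many more fields.
    record Span(String traceId, String name, long startEpochNanos) {}

    // Groups spans by traceId and sorts every group by start time.
    static Map<String, List<Span>> groupByTrace(List<Span> spans) {
        Map<String, List<Span>> traces = new LinkedHashMap<>();
        for (Span span : spans) {
            traces.computeIfAbsent(span.traceId(), id -> new ArrayList<>()).add(span);
        }
        for (List<Span> trace : traces.values()) {
            trace.sort(Comparator.comparingLong(Span::startEpochNanos));
        }
        return traces;
    }

    public static void main(String[] args) {
        // Spans arrive interleaved and out of order, as they would from several services.
        List<Span> collected = List.of(
            new Span("trace-A", "charge-card", 300),
            new Span("trace-B", "list-products", 50),
            new Span("trace-A", "place-order", 100),
            new Span("trace-A", "reserve-stock", 200));

        groupByTrace(collected).forEach((traceId, trace) -> {
            System.out.println(traceId);
            trace.forEach(span -> System.out.println("  " + span.name()));
        });
    }
}
```

In a real deployment this work is done by the telemetry backend, not by application code; the sketch only shows why a shared traceId is enough to stitch the pieces back together.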

Why is OpenTelemetry so important?


Multiple companies have been using some kind of telemetry (sometimes even distributed traces!) for a long time now. Why should they switch to OpenTelemetry? The answer is simple - standardisation across industry leaders. I've seen multiple in-house attempts at telemetry, usually as a way to help developers hunt down and fix issues - a process that is crucial for revenue streams ;-) Such solutions are often very limited due to custom protocols, lack of resources to keep them in good shape (hackathon / side project) or limited scope / functionality. Moreover, switching the telemetry backend, used to visualise and explore telemetry, is almost impossible. That's why industry leaders (Splunk, New Relic, Microsoft, Amazon and others) have decided to join forces and create a standard. Everything is open source, under the Cloud Native Computing Foundation (CNCF, https://www.cncf.io/).

OK, I get the idea, but WHY should I use it?


Here are some compelling reasons:

Debugging made easier

When an issue arises in a distributed system, it can be challenging to pinpoint the root cause. With distributed tracing, you can get a holistic view of the entire request flow across different services and identify the problematic component quickly. You can trace the request path and see the exact timing, order and success/error state of each operation across different services. This can significantly reduce the time spent on debugging and resolving issues.

Performance optimization

Telemetry allows you to capture performance metrics and analyze them to optimize your system's performance. Using traces, you can identify the slowest components in the request path and optimize them to reduce latency.

Understanding system behavior

Tracing provides insights into how your distributed system behaves in production. You can observe the actual request flow, identify any bottlenecks or anomalies, and gain a better understanding of how different components interact with each other. This can help you make informed decisions about system design, resource allocation, and capacity planning. Not an easy thing in a complex distributed architecture!

System monitoring

Telemetry is a valuable tool for monitoring the health and performance of your Java applications in production. You can set up alerts and notifications based on trace data to proactively detect and resolve any issues before they impact your users. Tracing data can also be used for generating reports, dashboards, and visualizations to gain real-time insights into the state of your system.

Vendor-agnostic instrumentation

OpenTelemetry is supported by many backend vendors, including Splunk, Datadog, New Relic, and others. This means that you can easily switch between different backends for visualizing and analyzing your telemetry data without having to change your instrumentation code. This flexibility allows you to choose the backend that best fits your requirements and budget, without being locked into a specific vendor.

OK, fine, but HOW can I use it?


The OpenTelemetry project provides a bunch of different components, but since this blog is mainly about Java, let's focus on... Java ;-) Earlier I mentioned "instrumentation". It is, in short, a way to add tracing capabilities to your Java applications without much hassle. OpenTelemetry provides libraries and a JVM agent that can be easily integrated into any codebase to automatically capture and propagate trace information.
Adding OTEL to an existing deployment is as easy as attaching the JVM agent (via the standard -javaagent JVM flag) and setting a few configuration properties. The latest quick start guide is always available here: https://github.com/open-telemetry/opentelemetry-java-instrumentation#getting-started

Perfect! Now WHAT can I expect as telemetry data?


In a microservices architecture, a single business operation often spans multiple services. A typical trace contains multiple spans, each representing a different operation in the overall business process. Each span includes a unique identifier (the spanId), a reference to its parent span (the parentSpanId) and a trace identifier (the traceId). This allows all the spans to be correlated and reconstructed into the full trace.

Here's an example trace with three spans:

spans:
  - Span:
      name: "Frontend service"
      spanId: 0123456789abcdef
      traceId: 098k0k0
      parentSpanId: null
  - Span:
      name: "Backend service"
      spanId: 23456789abcdef01
      traceId: 098k0k0
      parentSpanId: 0123456789abcdef
  - Span:
      name: "Database service"
      spanId: 3456789abcdef012
      traceId: 098k0k0
      parentSpanId: 23456789abcdef01
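As a rough illustration of how a backend could reconstruct the call hierarchy from such data, the sketch below rebuilds the tree from the parentSpanId links of the three example spans. It is plain Java, not the OpenTelemetry API; the `Span` record and the `render` helper are hypothetical names introduced for this example, and spans whose parent is missing from the input are simply skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TraceTree {

    // Simplified span mirroring the example fields; parentSpanId is null for the root span.
    record Span(String name, String spanId, String traceId, String parentSpanId) {}

    // Renders the spans as an indented tree (one line per span) by following parentSpanId links.
    static List<String> render(List<Span> spans) {
        Map<String, List<Span>> childrenByParent = new LinkedHashMap<>();
        List<Span> roots = new ArrayList<>();
        for (Span span : spans) {
            if (span.parentSpanId() == null) {
                roots.add(span);
            } else {
                childrenByParent
                    .computeIfAbsent(span.parentSpanId(), id -> new ArrayList<>())
                    .add(span);
            }
        }
        List<String> lines = new ArrayList<>();
        for (Span root : roots) {
            append(root, 0, childrenByParent, lines);
        }
        return lines;
    }

    // Depth-first walk: add the span itself, then its children one level deeper.
    private static void append(Span span, int depth,
                               Map<String, List<Span>> childrenByParent, List<String> lines) {
        lines.add("  ".repeat(depth) + span.name());
        for (Span child : childrenByParent.getOrDefault(span.spanId(), List.of())) {
            append(child, depth + 1, childrenByParent, lines);
        }
    }

    public static void main(String[] args) {
        // Deliberately out of order: the links, not the arrival order, define the structure.
        List<Span> spans = List.of(
            new Span("Database service", "3456789abcdef012", "098k0k0", "23456789abcdef01"),
            new Span("Frontend service", "0123456789abcdef", "098k0k0", null),
            new Span("Backend service", "23456789abcdef01", "098k0k0", "0123456789abcdef"));
        render(spans).forEach(System.out::println);
    }
}
```

Running it prints the frontend span first, with the backend and database spans nested one level deeper each, regardless of the order in which the spans were collected.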


Postface

I hope that all of the above has convinced you, my dear reader, to go and try OpenTelemetry. Please bear in mind that the project delivers much, much more than I have described here - ranging from a great number of automatically instrumented libraries to seamlessly passing context between various services (like AWS SQS for example ;-) ).
