Distributed Tracing, Past and Future
In 2010, Google published a paper, “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”. As with many other Google publications on distributed systems, it proved influential: in the couple of years that followed, many open source projects started working on distributed tracing, all inspired by Dapper.
In June 2012, two years after the Dapper paper came out, Twitter open sourced Zipkin, the first distributed tracing open source project inspired by Dapper. Zipkin was created for application performance tuning. It provides both tools to collect tracing data and a UI for visualization and trace lookup. As the first open source end-to-end solution in distributed tracing, Zipkin was widely adopted.
The community stayed relatively quiet for more than two years after Zipkin was released, until Kubernetes was started in 2014. Even then, the big advancements did not arrive until 2015:
- In February 2015, the lead author of the Dapper paper, Ben Sigelman, left Google and started LightStep.
- Also in February, Uber started an internal tracing project called Jaeger.
- In June, Kubernetes 1.0 was released, together with the announcement of CNCF.
- In November, the OpenTracing project was started by LightStep.
OpenTracing was created to solve the problem of standardization. To collect a complete distributed trace, trace context has to pass through all the components involved, including application code, dependent libraries, standalone open source services (NGINX, MySQL), and other vendor-specific libraries and services. Without a standard API defining how trace context is collected and passed along, it is nearly impossible to collect a full trace unless every component binds to a single tracing vendor. OpenTracing solved that problem by defining a standard API: components from different vendors can implement the same API, making it possible to collect end-to-end tracing data as requests pass through different systems.
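The core idea is easier to see in code. The sketch below is a toy illustration of a vendor-neutral tracing API with context injection and extraction; the class and header names are illustrative, not the actual OpenTracing interfaces.

```python
# Toy sketch of a vendor-neutral tracing API (illustrative names only,
# not the real OpenTracing interfaces).
import uuid

class SpanContext:
    """Identifies a span so it can be propagated across process boundaries."""
    def __init__(self, trace_id, span_id):
        self.trace_id = trace_id
        self.span_id = span_id

class Tracer:
    """The standard API surface; each vendor supplies its own implementation."""
    def start_span(self, operation, parent=None):
        # A child span keeps its parent's trace id; a root span mints a new one.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        return SpanContext(trace_id, uuid.uuid4().hex[:16])

    def inject(self, ctx, carrier):
        # Write the context into an outgoing carrier (e.g. HTTP headers).
        carrier["x-trace-id"] = ctx.trace_id
        carrier["x-span-id"] = ctx.span_id

    def extract(self, carrier):
        # Rebuild the context on the receiving side.
        return SpanContext(carrier["x-trace-id"], carrier["x-span-id"])

# Service A starts a trace and injects the context into request headers;
# service B extracts it and continues the same trace.
tracer = Tracer()
root = tracer.start_span("GET /checkout")
headers = {}
tracer.inject(root, headers)

downstream = tracer.extract(headers)
child = tracer.start_span("SELECT orders", parent=downstream)
assert child.trace_id == root.trace_id  # same trace across both services
```

Because both services program against the same `Tracer` API, either side could swap in a different vendor's implementation without touching application code; that is the standardization OpenTracing was after.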
By 2016, the projects were maturing. In October, Uber open sourced Jaeger. Inspired by Dapper and Zipkin, Jaeger natively supports the OpenTracing standard and is backward compatible with Zipkin: if your code is already instrumented with Zipkin libraries, you can route data from those libraries directly to Jaeger backends. That same month, OpenTracing was accepted by the CNCF as its third hosted project. Two months later, OpenTracing 1.0 was released.
2017 started with a new project from Google, OpenCensus. OpenCensus arrived quite late, but with a broader scope: traces plus metrics, a single standard for collecting both. Unlike OpenTracing, which only defines a standard API and relies on third-party vendors for concrete implementations, OpenCensus ships native libraries that support third-party vendor backends. This gives the project more control over the available solutions, and also makes it easier for developers to understand how the pieces fit together.
In January 2018, OpenCensus 1.0 was released. Yet even with standard API definitions like OpenTracing and OpenCensus, different tracing systems still used their own sets of headers to propagate tracing context. This is a problem when tracing requests that pass through components from different vendors, especially components that developers do not directly control, such as managed services and load balancers. In July 2018, Google and other vendors started work under the W3C to define a standard HTTP context propagation header, so that tracing context can pass easily through components from different vendors.
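The header that came out of that W3C effort is `traceparent`, with the shape `version-traceid-parentid-flags` in lowercase hex. A simplified parser shows the layout; a real implementation also validates the hex characters and rejects all-zero IDs.

```python
# Simplified parser for the W3C `traceparent` header:
#   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)

def parse_traceparent(header):
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) & 0x01 == 0x01,  # bit 0 is the sampled flag
    }

# Example header value from the W3C Trace Context specification.
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Because any proxy, load balancer, or managed service can read and forward this one header, the trace survives hops through components the developer does not control.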
2019 started with the merger of OpenTracing and OpenCensus into OpenTelemetry. As mentioned above, OpenTracing offers a lot of flexibility by defining only an API layer without an implementation, while OpenCensus offers an out-of-the-box implementation but lacks the flexibility to support vendor-specific features. OpenTelemetry takes the best of both: it provides default implementations for all the tracing backends and vendors, while allowing users to swap in a different implementation for vendor-specific features.
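This split between a stable API and a swappable implementation can be sketched in a few lines. The names below are illustrative, not the real OpenTelemetry interfaces: library code calls only the API-level `get_tracer()`, and which implementation answers is configured once at startup.

```python
# Toy sketch of the API/implementation split (illustrative names only,
# not the real OpenTelemetry interfaces).

class NoopTracer:
    """Out-of-the-box default: safe to call, records nothing."""
    def start_span(self, name):
        return None

class RecordingTracer:
    """Stand-in for a vendor implementation that records spans."""
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        self.spans.append(name)
        return name

_provider = NoopTracer()  # default implementation

def set_tracer_provider(provider):
    global _provider
    _provider = provider

def get_tracer():
    return _provider

# Library and application code depend only on the API:
def handle_request():
    get_tracer().start_span("handle_request")

handle_request()               # no-op with the default implementation
sdk = RecordingTracer()
set_tracer_provider(sdk)       # swap in a concrete vendor implementation
handle_request()
assert sdk.spans == ["handle_request"]
```

Instrumented libraries never need to know which vendor is behind `get_tracer()`, which is what lets users switch backends without re-instrumenting.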
(Image from the OpenTelemetry project)
Unlike ordinary competition between open source projects, the existence of two tracing standards brought a lot of confusion to the community. The merger created a single standard data source: vendors can make the most of the rich tracing data to build their own tracing systems and analysis tools, and developers integrate once and keep full flexibility to switch between third-party vendor tracing systems without worrying about lock-in.
In November 2019, when the W3C Trace Context specification entered Proposed Recommendation status, the standardization of distributed tracing reached another level.
Over ten years of development, distributed tracing has evolved from a single paper to an active community with standardization at every layer: from tracing alone to overall observability, from latency optimization to root cause analysis and application performance management, from a single backend system to an end-to-end solution that crosses system and vendor boundaries. In 2020, with OpenTelemetry starting to serve as the de facto standard for tracing, metrics, and logging, here is what we see coming to distributed tracing systems:
- More adoption of distributed tracing, merging with monitoring and logging systems,
- More correlations between tracing, metrics and logging, and
- More analytic tools with AI/ML to make more sense of telemetry data.
With Kubernetes as the underlying orchestration engine for microservices and applications, a tracing/observability system needs to cover not only the application stack but also the virtual and container infrastructure. As converged tracing/monitoring/logging systems gain adoption, enterprise IT needs a platform that helps them consistently deploy and manage these merging observability systems across multiple environments, based on their specific requirements, in a unified way.