This year has been exceptionally eventful for the Logging operator. In this post, we summarize the project's achievements and take a sneak peek into the future. Once in a while, it's worth recapping the reasons and goals behind a project and re-evaluating the direction of development. That's what we'll do here, to give you a more complete understanding of the project. We are still in the early stages of planning, but we wanted to share the comments, experiences, and principles that shape the roadmap for a truly flexible tool for handling logs on Kubernetes. Or, to look one step further: not just logs, but metrics and traces as well.
What is a Telemetry Pipeline?
First of all, let's look at the title. If you already know what a telemetry pipeline is, feel free to skip this paragraph.
According to a Gartner report:
“Telemetry pipelines support collection, enrichment, transformation, and routing of operational data from sources to destinations.”
Of course, this is not a novel concept; rather, it classifies existing functionality by giving it a name. Tools like syslog-ng and Fluentd have been the building blocks of these pipelines for a long time. However, there is an emerging need for the functionality that "telemetry pipelines" represent. You can pick your reason from the exponentially growing costs, the rising complexity of manually configured systems, or simply the desire to extract value from data in motion. We don't have to dig deep into this world (although we will in a follow-up blog post) to accept that the purpose of the Logging operator is exactly that: to provide an easy-to-use abstraction layer for creating these pipelines on Kubernetes.
Brief History of the Logging Operator
To understand the current state of the Logging operator, let's walk through the history of the project. Again, if you are familiar with the story, feel free to skip ahead to the vision part.
A long time ago in a startup far, far away
Actually, it was 2018 when we started the Logging operator at Banzai Cloud. We founded the company to help others adopt Kubernetes and cloud native technologies, and providing solutions for logging and monitoring was a crucial part of that. At the time, the industry-standard approach was to use Fluentd or Fluent Bit (a fairly young project back then) to collect the logs and send them to a store like Elasticsearch.
We did have experience operating such solutions in static environments, but adapting to Kubernetes came with its own challenges as well as advantages. For example, resource labels were a lifesaver for distinguishing different workloads. (I gave a presentation focusing on labels and how much flexibility they add to the whole process.)
The first version of the Logging operator (August 2018) simply helped you manage a basic log collection system using Fluent Bit and Fluentd. It had minimal routing capabilities based on the hardcoded app label, and the heavy lifting was done by Fluentd, using the built-in tags to isolate flows.
One year, and a lot of experience later
September 2019, Logging operator v2
The second version was much more like the Logging operator you know today. We introduced flexible, label-based routing, the namespaced Flow and Output resources, as well as the cluster-wide ClusterFlow and ClusterOutput resources. I consider this the first true success of the project.
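To give a sense of what this looks like in practice, here is a minimal sketch using the current API (the resource names, labels, and the Loki output are illustrative examples, not taken from the original post): a namespaced Flow selects logs by pod labels and routes them to an Output in the same namespace.

```yaml
# Minimal sketch; names and the output plugin are illustrative.
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: my-app-flow
  namespace: my-app
spec:
  match:
    - select:
        labels:
          app: my-app          # label-based routing: only logs from pods with this label
  localOutputRefs:
    - my-app-output            # send the selected logs to the Output below
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: my-app-output
  namespace: my-app
spec:
  loki:
    url: http://loki.logging:3100   # hypothetical Loki endpoint inside the cluster
```

ClusterFlow and ClusterOutput follow the same pattern, but live in the control namespace and can match logs from any namespace.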
Make it more stable, make it production-grade
March 2020, Logging operator v3
A fine-tuned version of v2: more granular selectors, handling of edge cases like left-over logs, and a Windows agent. Later, we introduced the host file tailer and sidecar webhooks. V3 shipped a lot of new features requested by the community. One of the most exciting pieces of news was that Rancher integrated the Logging operator into their offering.
October 6, 2020: Rancher 2.5 with the Logging operator
The Road to the CNCF Sandbox
March 2023, Logging operator v4
After Cisco acquired Banzai Cloud, we worked on version 4, solving new challenges on the scalability and performance side. This led to the introduction of a new aggregator, syslog-ng. (For details, see the release announcement on the Cisco Tech Blog.)
Although Cisco was a great place to nurture the project, we believed that with the proper focus we could do so much more. Personally, this was one of the driving factors behind founding Axoflow.
Together with Cisco, we came to the conclusion that the Logging operator would be best off as part of the CNCF. So Axoflow and Cisco moved the Logging operator to a neutral organization called Kube Logging and started to prepare the project for donation. I'm happy to announce that a few weeks ago the vote passed, and we are currently onboarding the project. From now on, the CNCF guarantees the independence of the project.
Let me pause here to personally thank everyone who put work into the Logging operator, or who helped by answering other users' questions on the various channels.
What does this mean for me as an end user?
Practically speaking, you can keep using the Logging operator the same way you have so far.
What is the connection between Logging operator and Axoflow?
The core maintainers of the Logging operator work for Axoflow. Axoflow builds solutions based on the Logging operator and also provides support for it, including commercial support.
Are you still developing Logging operator? What can we expect from you?
If you are following the GitHub repository, you can see that the Logging operator has gained momentum in the last couple of months. We are fixing bugs and enabling new use cases, and we are at the planning board, working hard to draft the future vision.
The latest release, v4.4
The latest release is out! We’ve been working hard to provide better solutions for multi-tenant use cases. The main problem is that every namespace shares the same aggregator instance, where the routing logic is executed. With Logging operator 4.4, you can separate the logs based on namespaces, so you can eliminate the noisy neighbor effect on the cluster. For more details check the Logging operator 4.4 release announcement blog post!
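As a rough illustration of the idea (a simplified sketch, not necessarily the exact mechanism introduced in 4.4; the release post above is the authoritative source, and all names below are made up): one Logging resource per tenant, each watching only its own namespaces, gives every tenant a dedicated aggregator.

```yaml
# Simplified sketch, assuming the watchNamespaces field of the Logging resource.
# See the Logging operator 4.4 release post for the recommended multi-tenant setup,
# including how to share the node-level collector between tenants.
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: tenant-a
spec:
  controlNamespace: tenant-a-logging   # namespace where this tenant's aggregator runs
  watchNamespaces:
    - tenant-a                         # only Flows/Outputs from this namespace apply
  fluentd: {}                          # dedicated aggregator, isolated from other tenants
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: tenant-b
spec:
  controlNamespace: tenant-b-logging
  watchNamespaces:
    - tenant-b
  fluentd: {}
```

With a separate aggregator per tenant, one tenant's log volume no longer affects the routing workload of another.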
What is on the horizon?
From time to time, we sit down and evaluate the good and bad solutions implemented in the Logging operator: what worked well, what problems require workarounds, and what makes users' lives harder. We collect these and start planning the new phase. We have identified several "new" requirements from the community, including:
Multi-tenancy / resource handling
In version 4.4, this was one of the main driving topics. More and more issues were raised about the usage of shared resources. As we tried to make the current solution handle these scenarios, we realized that with the current architecture they are complicated to configure, use, and debug. The next major release should make tenants first-class citizens, without sacrificing the simplicity of the configuration.
Edge-routing
During the initial planning of the Logging operator, the main idea was to collect the logs as quickly as possible and keep the routing logic at the aggregator level. The main reasons behind that:
- We wanted to keep the node/edge level logic as simple as possible.
- There was no option to embed such logic in those lightweight components (like Fluent Bit).
As time passed, these features became available in those agents as well, and we realized that alternative tools could solve the latter problem in a clean and safe way, provided we choose the right abstractions.
In the future, we want to provide edge-level routing without sacrificing the stability of the nodes. Remember, if we push the routing logic down to the node level, the same noisy neighbor effect could occur that we saw with the aggregators.
Multi-agent
The idea behind the Logging operator has always been to abstract away the engine-specific configuration layer. We have received an increasing number of suggestions about different agents people want to use to collect logs, not only Fluent Bit. We agree that you should be able to choose the underlying engine, but that makes the configuration complex: either you support only a small common subset of the functionality, or you provide specialized configuration for each engine. Our goal is to provide a common abstraction for routing logs, while also supporting the customized parsing and filtering features of the selected engine.
Furthermore, we want closer collaboration with OpenTelemetry (another benefit of being part of the ecosystem). We believe that the OTel protocol and the OTel Collector are both valuable tools to build on, and they will help the Logging operator become an observability operator.
Simplifying the API, aiming for Observability
Currently, the Logging operator has 14 different CRDs, with tons of configuration options for these resources. Some of them exist for backward compatibility, while others were added because extending the current architecture was easier than refactoring the logic. In the next version, we are going to review the resources and match them against the goals described above. We hope to create a new structure that will:
- eliminate unnecessary resources,
- make it easy to recognize which resource to use for each use case,
- extend the functionality of Logging operator to the whole observability stack.
Summary
As you can see, things are not slowing down in the development of the Logging operator, and we are excited to see our plans become reality and to make the Logging operator a major player in the CNCF landscape.
On-demand webinar: Parsing sucks! What can you do about it? (56 minutes) With Balázs Scheidler (founder, syslog-ng™), Mark Bonsack (co-creator, SC4S), and Sándor Guba (founder, Logging operator); moderated by Neil Boyd.
Follow Our Progress!
We are excited to be realizing our vision above with a full Axoflow product suite.