Log Isolation on Shared Kubernetes Infrastructure

Compliance and efficient operations on shared infrastructure

Shared infrastructure is a great way to operate services efficiently. Services benefit from using the same resources, but that sharing makes complying with certain standards and regulations rather challenging.

For example, HIPAA doesn’t set requirements on how to operate multi-tenant services, but it does have requirements about security and access control. Collecting all the logs securely in one place and keeping them for as long as the retention requirements dictate might seem enough to comply. But how do you keep operations efficient when engineering teams, who often operate their own services, need access to their logs to diagnose an issue or help in incident response? PCI-DSS is more opinionated about multi-tenancy. It explicitly states that audit logs should be made available for review, but only to the owning customer.

If you happen to operate multi-tenant services in one way or another, then the practices outlined below could be a lifesaver, even if you are thinking about isolation after the fact to comply with these or other industry standards.

Log isolation in multi-tenant Kubernetes environments

The Logging operator was envisioned with multi-tenancy in mind from the beginning. It was designed to leverage Kubernetes namespaces to isolate log flows so that developer teams can define their log forwarding rules themselves, without having to worry about how the log infrastructure is set up.

Why do you need log isolation in the first place? You might need it for different reasons at different points in the stack.

Compliance: Compliance requirements might dictate separating the logs for the individual tenants as soon as possible.
Security and availability: You want to minimize the impact of a security incident on the other tenants or the other parts of the system.
Robustness and stability: You want to minimize the effect that the tenants’ individual configurations and load characteristics have on each other.
Flexibility and delegation of authority: You want to isolate tenant-specific configurations so that you can manage and validate components individually. This also allows the delegation of ownership to tenants with accurate access control.

Limitation of log access control in Kubernetes

Kubernetes has an API endpoint to query or tail logs, and access to it is controlled by standard RBAC. Unfortunately, by design, it is not appropriate for automated log shipping, only for relatively infrequent on-demand log queries.
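For reference, an on-demand query goes through the `log` subresource of a pod in the core API, and reading it requires the `get` verb on the `pods/log` resource in that namespace. The sketch below only builds the request path and sends nothing. The helper name `pod_log_path` and the example namespace and pod names are hypothetical; the path layout and the `container`, `follow`, and `tailLines` query parameters are those of the Kubernetes API.

```python
from urllib.parse import urlencode


def pod_log_path(namespace, pod, container=None, follow=False, tail_lines=None):
    """Build the API server path for reading one pod's logs.

    Hypothetical helper for illustration; it performs no request.
    """
    params = {}
    if container is not None:
        params["container"] = container
    if follow:
        params["follow"] = "true"
    if tail_lines is not None:
        params["tailLines"] = str(tail_lines)
    path = f"/api/v1/namespaces/{namespace}/pods/{pod}/log"
    return f"{path}?{urlencode(params)}" if params else path


if __name__ == "__main__":
    # Example names are made up.
    print(pod_log_path("team-a", "web-0", container="app", tail_lines=100))
```

Each such request covers a single pod and passes through the API server, which fits occasional debugging better than continuous shipping of every log line in the cluster.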

For a comprehensive solution, you have to get to the log source, which is the container runtime on the nodes. The typical solution is to run a log agent as a DaemonSet and mount the host folder that contains all the logs. Mounting a host path, even in read-only mode, crosses a security boundary, and once the agent is past that boundary there are no further permission checks that would limit which application logs it can process. This can be a real problem in multi-tenant environments, where multiple teams or customers are sharing the same cluster.
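To illustrate why the host path mount removes the namespace boundary: on a typical node, container logs are exposed under `/var/log/containers` with file names of the form `<pod>_<namespace>_<container>-<container id>.log`. Assuming that layout (it depends on the container runtime and node setup), the sketch below shows that the agent sees every namespace in one directory listing, so any tenant filtering has to be done by the agent itself. The function name and the sample file names are made up for illustration.

```python
import re

# Assumed file name layout: <pod>_<namespace>_<container>-<64 hex chars>.log
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-(?P<id>[0-9a-f]{64})\.log$"
)


def namespaces_visible(file_names):
    """Return the set of namespaces whose logs appear in a directory listing."""
    found = set()
    for name in file_names:
        match = LOG_NAME.match(name)
        if match:
            found.add(match.group("namespace"))
    return found


if __name__ == "__main__":
    # Hypothetical listing of the mounted host directory.
    listing = [
        "web-0_team-a_app-" + "a" * 64 + ".log",
        "db-0_team-b_postgres-" + "b" * 64 + ".log",
    ]
    print(sorted(namespaces_visible(listing)))
```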

Soft and hard multi-tenancy

There are several ways to think about isolation in Kubernetes, especially when it comes to logs. Soft multi-tenancy means that tenants trust each other to a certain degree, while hard multi-tenancy converges toward sharing as few resources as possible. In the soft form, tenants typically share resources for cost effectiveness and to keep operational complexity low, at the price of being less secure and less robust against failures. Running the workloads of multiple teams or customers in the same Kubernetes cluster is already a type of soft multi-tenancy when you look at it from the cluster perspective. The goal in our case is to support the spectrum of log isolation levels within the bounds of a single cluster.

Hard multi-tenancy unlocked

Traditionally, the Logging operator supported only soft forms of multi-tenancy, but with the latest updates, it is capable of much more. You can finally separate the log path starting from the log collector agent (which is typically managed by the infrastructure team), and forward logs to standalone aggregators fully managed by tenant owners – all through the same convenient APIs provided by the Logging operator and controlled by standard Kubernetes RBAC. Tenants receive only the logs that belong to them and nothing more, while infrastructure operations can still process all the logs in the system completely separately.

Tenancy, the first-class citizen in Telemetry Controller

While the Logging operator can handle all the different multi-tenant scenarios described below, we found that the API is difficult to comprehend. As we started redesigning the resource definitions, we immediately found that fundamental changes were required in the underlying logic. We concluded that we would not refactor the Logging operator, but instead introduce a new project called Telemetry Controller. It solves the complexity issue by extracting the isolation functionality into a separate collection layer. This new controller is based on the OpenTelemetry Collector and drastically simplifies configuration, while working seamlessly with the Logging operator.

Use cases

The Logging operator supports many different configurations for multi-tenancy scenarios. Let’s see the most notable ones and their most important attributes.

Centrally managed customer soft tenants per namespace with a single, high performance syslog-ng aggregator

A managed web hosting provider can use the Logging operator to separate the responsibilities of multiple teams, and compose a centralized configuration based on the individual demands of the different tenants, each hosting a specific customer. A single cluster is shared with multiple tenants, and the isolation is based on Kubernetes namespaces.

With the help of the Logging operator:

The infrastructure team controls the Logging operator deployment and the runtime configuration of the collector and the aggregator
The platform team controls the centrally managed destination and flow configurations
The developer team controls the individual log flow and destination configuration of the tenants based on the tenants’ needs and capabilities

Benefits

The infrastructure team configures logging once; after that, they can mostly forget about it and just apply upgrades. The platform team can configure cluster-level log flows and define shared destinations, so that tenants can use a destination without having access to its credentials. Engineers are free to create whatever flows they need, and to define additional destinations in the tenants’ scope.
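The scoping rule behind this split can be sketched as follows: a tenant flow may reference destinations defined in its own namespace or destinations shared cluster-wide by the platform team, but never those of another tenant. This is a simplified model and not the Logging operator's implementation; the `Output` and `Flow` classes, their fields, and `resolve_outputs` are hypothetical names, and preferring a tenant's own output over a shared one with the same name is a modeling choice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Output:
    name: str
    namespace: Optional[str]  # None models a cluster-wide, platform-owned output


@dataclass
class Flow:
    name: str
    namespace: str
    output_refs: list = field(default_factory=list)


def resolve_outputs(flow, outputs):
    """Return the outputs a flow may use; raise on references outside its scope."""
    resolved = []
    for ref in flow.output_refs:
        candidates = [
            o for o in outputs
            if o.name == ref and o.namespace in (flow.namespace, None)
        ]
        if not candidates:
            raise LookupError(
                f"{flow.namespace}/{flow.name}: output {ref!r} is not visible"
            )
        # Modeling choice: the tenant's own output wins over a shared one.
        candidates.sort(key=lambda o: o.namespace is None)
        resolved.append(candidates[0])
    return resolved


if __name__ == "__main__":
    outputs = [Output("archive", None), Output("debug", "team-a")]
    flow = Flow("app-logs", "team-a", ["archive", "debug"])
    print([o.name for o in resolve_outputs(flow, outputs)])
```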

Multi-namespace hard tenants with a centrally managed log collector and tenant managed aggregators

A managed container service provider uses the Logging operator to provide their tenants with a fully self-managed aggregator that receives logs only from the tenant’s namespaces. A single cluster is shared with multiple tenants, where isolation is based on an additional third-party controller that allows multiple namespaces to form a tenant.

With the help of the Logging operator:

The infrastructure teams control the Logging operator deployment and the collector’s runtime configuration that runs on all nodes and routes logs to individual tenants automatically
The tenant owner (the customer) can fully self-manage their log aggregator, including the runtime configuration (resource requests, number of replicas, etc.) and the log flow definitions and destinations
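The routing step can be pictured as a lookup from namespace to tenant, followed by dispatch to that tenant's aggregator. The sketch below assumes that tenant membership is visible as a namespace label (the `tenant` key is hypothetical); the actual mechanism depends on the third-party tenant controller in use. All function and field names are illustrative.

```python
from collections import defaultdict


def tenant_index(namespace_labels, label_key="tenant"):
    """Map namespace name -> tenant, using an assumed namespace label."""
    return {
        ns: labels[label_key]
        for ns, labels in namespace_labels.items()
        if label_key in labels
    }


def route(records, index):
    """Group log records per tenant.

    Records from namespaces that belong to no tenant reach only the
    infrastructure stream.
    """
    per_tenant = defaultdict(list)
    infra = []
    for record in records:
        infra.append(record)  # infrastructure operations still see everything
        tenant = index.get(record["namespace"])
        if tenant is not None:
            per_tenant[tenant].append(record)
    return dict(per_tenant), infra


if __name__ == "__main__":
    index = tenant_index({
        "shop-web": {"tenant": "acme"},
        "shop-db": {"tenant": "acme"},
        "kube-system": {},
    })
    records = [
        {"namespace": "shop-web", "message": "GET /"},
        {"namespace": "kube-system", "message": "lease renewed"},
    ]
    per_tenant, infra = route(records, index)
    print(sorted(per_tenant), len(infra))
```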

Benefits

Log collection and cluster-level policies are controlled centrally by the infrastructure team. Tenant aggregators, however, are completely isolated, so they don’t interfere with each other in any way.

Node group based hard tenants with isolated log collectors and aggregators

A large enterprise isolates its services and development teams into separate node groups within a single cluster, enforcing workloads on the tenants’ respective nodes.

With the help of the Logging operator:

The infrastructure teams control the Logging operator deployment only
The platform teams control the collector and aggregator runtime configuration for every tenant in an automated fashion
Tenants control their log flow definitions and destinations, but don’t have to worry about operating the aggregator or the collector
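Because each tenant's workloads are pinned to its own nodes, a per-tenant collector can be restricted to those nodes with a node selector. A minimal sketch of deriving such a plan, assuming nodes carry a hypothetical `nodegroup` label whose value identifies the tenant; the function name and the shape of the result are illustrative only.

```python
def collector_plan(node_labels, label_key="nodegroup"):
    """Return, per tenant, a node selector and the nodes its collector would cover."""
    plan = {}
    for node, labels in sorted(node_labels.items()):
        group = labels.get(label_key)
        if group is None:
            continue  # unlabeled nodes belong to no tenant in this model
        entry = plan.setdefault(
            group, {"nodeSelector": {label_key: group}, "nodes": []}
        )
        entry["nodes"].append(node)
    return plan


if __name__ == "__main__":
    nodes = {
        "node-1": {"nodegroup": "team-a"},
        "node-2": {"nodegroup": "team-b"},
        "node-3": {"nodegroup": "team-a"},
    }
    for tenant, entry in collector_plan(nodes).items():
        print(tenant, entry["nodeSelector"], entry["nodes"])
```

Since every node falls into at most one group, no collector in this model can read another tenant's host log directory.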

Benefits

Tenants are fully isolated, almost as if their workloads were running on separate clusters, while the operational burden can be completely delegated to the platform team (although this is optional). A separate infrastructure team is optional in this case as well, because the platform team can easily take over its responsibilities.

Conclusions

As you can see from the use cases shown above, there are several ways to implement log isolation in a shared Kubernetes environment. There is no “one size fits all” solution, but with a little experience, you can find the matching Logging operator configuration for almost any multi-tenant setup. The new Telemetry Controller increases the flexibility even further, while decreasing the complexity at the same time.

If you need help or advice to implement your logging solution, contact us!