In a previous article, we discussed the challenges of ingesting large amounts of data over WAN connections and highlighted the advantages of the OTLP protocol over syslog in this use case. Now we go deeper and investigate the different factors that affect data transport over the network. Moving large amounts of security and observability events has become a common problem in the last few years. Many protocols and agents that have been in use for years couldn't keep up with this growing performance demand, so new solutions have emerged. One popular example is OpenTelemetry, which was tailored for this use case. In this post, we show you how we tested the performance of the OpenTelemetry protocol (OTLP) and tuned it to maximize its data throughput for different network profiles.

Transportation Metrics

Performance tuning is usually an optimization task: balancing the parameters you can change against the properties of the environment that you can't. We will walk you through the tradeoffs and options you have when passing data (in this case, telemetry data) through the network.

[Figure: OpenTelemetry (OTLP) performance measurement definitions]

Network properties

A network has several properties, but the two most important ones for us are bandwidth and latency.

  • Bandwidth is the maximum rate of data transfer across a given path; it defines the maximum capacity of a network link. On a 10 Gbit/s link, you can move at most 1250 MB of data per second.
  • Latency is the delay between the moment the sender sends out the data and the moment the destination receives it. In a TCP connection, latency determines when we get an ACK response for the packet we sent. So if the latency is 10 ms, we need at least 20 ms to send the data and receive an ACK. Therefore, high latency combined with serial packet processing can have a big impact on the data transfer rate (see the quick calculation below).
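
To make these numbers tangible, here is a quick back-of-the-envelope calculation in Python. The 10 Gbit/s link and 10 ms latency are the example values from above; the ~1500-byte packet size is an assumption for illustration:

```python
# Back-of-the-envelope: what bandwidth and latency mean for a serial sender.

LINK_GBIT_S = 10       # link bandwidth from the example above
LATENCY_S = 0.010      # one-way latency: 10 ms
PACKET_BYTES = 1500    # assumed ~MTU-sized packet sent per round trip

# Maximum capacity of the link, independent of latency:
link_bytes_s = LINK_GBIT_S * 1e9 / 8
print(f"link capacity: {link_bytes_s / 1e6:.0f} MB/s")        # 1250 MB/s

# A strictly serial sender waits a full round trip (send + ACK) per packet:
serial_bytes_s = PACKET_BYTES / (2 * LATENCY_S)
print(f"serial throughput: {serial_bytes_s / 1e3:.0f} kB/s")  # 75 kB/s

# The gap between these two numbers is what parallelization and batching
# must close.
```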

Data processing techniques

We have two common techniques to deal with bandwidth and latency.

  • Parallelization: When the connection has high latency, a single channel can't transfer all the data that the available bandwidth would allow. A straightforward solution is to send the data over multiple channels: while one channel is waiting for an acknowledgment, the others can still transfer data.
  • Compression: When we hit the capacity limit of a connection, we can compress the data. Compression produces smaller packets in exchange for the CPU time spent encoding them. The compression ratio varies with the algorithm, the amount of processor time spent, and the content of the data. For machine data, the compression ratio is generally between 5-10x. The simplified model after this list shows how the two techniques interact.
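
Here is a simplified model of how these two techniques combine on the 10 Gbit/s, 10 ms link from above. The 4 MB batch size and the 7x compression ratio are illustrative assumptions, and the model deliberately ignores the CPU cost of compression:

```python
# Simplified model: N parallel channels, each shipping one size-capped batch
# per round trip. Compression packs `ratio` times more raw log data into the
# same batch on the wire, so raw throughput scales with the compression ratio.

LINK_MB_S = 1250       # 10 Gbit/s link capacity
LATENCY_S = 0.010      # 10 ms one-way latency
BATCH_MB = 4           # assumed batch size on the wire

def raw_throughput_mb_s(channels: int, ratio: float) -> float:
    # Wire throughput is limited by either the channels or the link itself.
    wire_mb_s = min(channels * BATCH_MB / (2 * LATENCY_S), LINK_MB_S)
    return wire_mb_s * ratio   # raw (pre-compression) data moved per second

for channels in (1, 4, 16):
    plain = raw_throughput_mb_s(channels, ratio=1)
    packed = raw_throughput_mb_s(channels, ratio=7)
    print(f"{channels:2} channels: {plain:6.0f} MB/s plain, "
          f"{packed:6.0f} MB/s raw with 7x compression")
```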

Let’s see this in practice

We’ve used virtual machines (on Google Cloud) for the measurements:

  • loggen: a c3-standard-8 (8 core) machine used to generate logs
  • client: a c3-standard-22 (22 core) machine to relay between an edge node (log generator) and the server (which is in our case just another relay)
  • server: a c3d-standard-16 (16 core) machine to receive messages from the client relay, and to send the received data to SIEMs or data lakes (which will be skipped for these measurements, as we are interested in the relay-to-relay log transport performance)

Network

All three machines are connected to the same network with a 10 Gbps uplink and less than 1 ms latency. For some of the tests, we artificially add latency to emulate slower links.

Payload

We used artificially generated logs of identical length for most of the tests. For the compression comparisons, we used logs extracted from Kubernetes clusters, with an average size of 1137 bytes per message.

[Figure: OpenTelemetry (OTLP) performance measurement layout]

Configuration

Without going into further details, our example scenario does the following:

  • The loggen machine sends data to the client machine via the syslog protocol.
  • The client machine receives the syslog data and converts it to OTLP.
  • During this conversion, the client applies parsing (regex) and attribute mapping to simulate a real-life scenario (sketched in the snippet after this list).
  • The client sends the data to the server via the OTLP/gRPC protocol.
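
To give you an idea of what this per-message work looks like, here is a rough Python sketch of the parsing and attribute mapping step. The regex, field names, and severity mapping are made up for illustration; in the actual measurement this is done by AxoSyslog's built-in parsers, not Python:

```python
# Hypothetical sketch: parse a raw syslog line with a regex, then map the
# extracted fields to OTLP-style log record attributes.
import re

SYSLOG_RE = re.compile(
    r"<(?P<pri>\d+)>(?P<ts>\S+) (?P<host>\S+) (?P<app>[\w.-]+): (?P<msg>.*)"
)

def to_otlp_record(line: str) -> dict:
    m = SYSLOG_RE.match(line)
    if m is None:
        return {"body": line}          # pass unparsable lines through as-is
    f = m.groupdict()
    return {
        "body": f["msg"],
        "severity_number": int(f["pri"]) % 8,  # simplified severity mapping
        "attributes": {                        # the attribute mapping step
            "host.name": f["host"],
            "service.name": f["app"],
            "timestamp": f["ts"],
        },
    }

print(to_otlp_record("<13>2024-05-01T10:00:00Z web-1 nginx: GET /healthz 200"))
```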

Numbers

We’ve performed the measurements with different numbers of client and server workers to find the optimal resource requirements. We’ve optimized the parameters for three different use cases:

  • maximizing events per second (EPS),
  • minimizing CPU usage (percent), and
  • maximizing throughput (MB/s).

We tested each scenario with and without compression. Last but not least, we investigated the effect of network latency on the transport behavior.

Parallel processing & compression

This measurement has four main parameters: the number of worker threads on the client side, the number of worker threads on the server side, whether compression is enabled, and the network latency between the client and the server.

[Figure: AxoSyslog OpenTelemetry gRPC performance measurements]

From these two charts, you can see that with a single worker on both sides, the throughput is low. Raising the number of workers increases the EPS significantly, but it quickly reaches a plateau: we did not see any significant EPS increase beyond 8 server workers and 64 client workers.

The Cost of Compression

Enabling compression nearly halves the throughput. The following two charts show you why:

[Figure: AxoSyslog OpenTelemetry gRPC performance measurements]

Sending logs is already a CPU-intensive task: it can take 50-80% of the available CPU. Compression needs additional CPU time, so with compression enabled, we quickly consume all the CPU resources on the machine.

The Benefit of Compression

Although compression reduces the EPS and consumes more CPU, it is not for nothing. One of its benefits is straightforward: it reduces the network load. These two charts plot the network usage with compression disabled and enabled. It is hard to read exact values from them, so here they are: the peak network usage was 1012 MB/s without compression, but only 60 MB/s with compression. This is not a fair comparison though, as we transported nearly twice as many logs per second without compression. If we divide the network usage by the EPS at the measured points, we get 1500 bytes per message without compression and 159 bytes per message with compression. That is a ~90% reduction in network load per message.

[Figure: AxoSyslog OpenTelemetry gRPC performance measurements]
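
A quick sanity check of these numbers, using only the values quoted above:

```python
# Per-message wire cost, recomputed from the measured values above.
bytes_per_msg_plain = 1500     # without compression
bytes_per_msg_comp = 159       # with compression
reduction = 1 - bytes_per_msg_comp / bytes_per_msg_plain
print(f"per-message reduction: {reduction:.0%}")             # ~89%

# The EPS implied by peak usage / per-message cost also matches the
# "nearly twice as many logs per second without compression" observation:
eps_plain = 1012e6 / bytes_per_msg_plain   # ~675k events/sec
eps_comp = 60e6 / bytes_per_msg_comp       # ~377k events/sec
print(f"implied EPS ratio: {eps_plain / eps_comp:.2f}x")     # ~1.79x
```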

The other benefit of enabling compression is less obvious: compressed transport is less sensitive to network latency.

[Figure: AxoSyslog OpenTelemetry performance measurement]

On a network with an added 100 ms latency, the EPS drops significantly when compression is disabled, but stays nearly the same with compression. There is a simple explanation: more messages fit into a batch.

What is a batch? Batching means that we don’t send messages one at a time; instead, we collect several messages into a batch and send that packet to the server in one go. In a high-latency environment this matters a lot: with 10 ms latency (as in our earlier example), we have to wait 20 ms before sending the next message. If we build a batch from 100 messages, we only have to wait 20 ms for every 100 messages.
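
The effect on throughput is easy to quantify for a strictly serial sender on the 10 ms latency link from our example:

```python
# A serial sender pays one round trip per send, whether the send carries
# a single message or a whole batch.
ROUND_TRIP_S = 2 * 0.010   # 10 ms one-way latency -> 20 ms per send

for batch_size in (1, 100, 10_000):
    print(f"batch of {batch_size:>5}: {batch_size / ROUND_TRIP_S:>9,.0f} events/sec")

# batch of     1:        50 events/sec
# batch of   100:     5,000 events/sec
# batch of 10000:   500,000 events/sec
```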

Why does compression help in this case? The AxoSyslog OTLP/gRPC implementation lets you limit batches both by event count and by size in bytes. When we compress the data, we can put 5-10x more events into a batch for the same size limit.
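
For example, assuming a 4 MiB byte limit on a batch (an illustrative value, not the driver's default) and the 1137-byte average message size from our test data:

```python
# How compression raises the number of events that fit under a byte-capped
# batch limit. The 4 MiB cap is an assumed example value.
BATCH_BYTES_LIMIT = 4 * 2**20
AVG_MSG_BYTES = 1137           # average Kubernetes log size from above

plain = BATCH_BYTES_LIMIT // AVG_MSG_BYTES
print(f"uncompressed: ~{plain:,} events per batch")           # ~3,689

for ratio in (5, 10):
    print(f"{ratio}x compression: ~{plain * ratio:,} events per batch")
```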

The other really important thing these charts show is how the number of workers affects the EPS, which is best seen on the no-compression charts. With our previously calculated ideal 8 server and 64 client workers, the EPS drops from 700k to 357k. If we double the number of client workers, we can reach 470k; with 32 server workers, we can reach 550k. We could further increase the number of client workers, but be aware that if there are significantly more workers than available CPU cores, you introduce a lot of context switches, which puts unnecessary load on the CPU and ends up reducing the throughput.

The advantages of using compression and parallel channels stand out even more with larger network latencies. Let’s see how the EPS changes with 32 server workers, 128 client workers, and 0, 100, 500, 1000, and 2000 ms latencies:

[Figure: AxoSyslog OpenTelemetry gRPC performance measurements]

Without compression (blue line), the throughput plummets pretty quickly, while with compression (red line) we have a bit more leeway. If the network has more than 200 ms latency, compression produces better EPS numbers.

Conclusion

These measurements helped us find good defaults for the AxoSyslog OTLP drivers. The first implementation already provided a remarkable 700k events per second throughput (with an average event size of 1137 bytes). That’s roughly 800 MB/s (~6.4 Gbit/s) of raw data. Since then, we have further improved the engine and already exceeded 1M events per second. But be aware that log processing performance is largely affected by the content of the data and the transform operations that you apply to it. When you measure the performance of your telemetry pipeline, we recommend that you don’t use /dev/null destinations, and that you add at least a couple of common operations on the data to simulate a real-life situation!

Note: Don’t be fooled by an agent benchmark with a null/blackhole destination and no actual data processing. Real-life scenarios do not work that way.

Rules of thumb:

  • If you are not bound by the capacity of your link and the latency is below 100 ms, don’t use compression, and apply around 8 server workers and 32-64 client workers.
  • If you need to save on network load, or there is significant latency between the two nodes, it is worth enabling compression, and raising the number of workers can also help. Just remember that compression consumes significant CPU resources.
