This article shows you how to parallelize syslog processing to reach message rates in the millions per second. In our recent articles, we discussed why organizations still use UDP-based transports for syslog, and the message loss problem that inherently affects these solutions. The fixes outlined in the last article only addressed the cases where you lose messages even though the CPU is not overloaded; with some tuning, those situations can be fully mitigated.

The next bottleneck that forms as you scale up your ingest rate is related to the per-core performance of your CPU. This article focuses on what you can do when your CPU core is maxed out and you need to scale syslog reception further, while staying within the constraints of the same compute node. This is called scaling up to all CPUs in the system, in contrast to scaling out, which happens as you spread the load across a cluster of compute nodes.

Before going into the details, let’s clarify a couple of concepts first.

Tradeoff between Ordering and Parallelism

To resolve a situation where a single CPU core is the bottleneck, you have to introduce parallelism into the processing chain. This means that instead of using a single CPU core to do all processing, you start using more CPUs in parallel. Sounds simple enough, but there’s an inherent tradeoff between parallelism and message ordering. 

Let’s assume for a moment that the system in question writes the messages into a simple text file. To keep the proper order in that file, the system needs to write the messages sequentially, in the order they arrived.

Now let’s increase the number of CPUs doing the processing and use two of them:

In this system, the first CPU core receives and processes every odd-numbered line, and the second CPU core receives and processes every even-numbered line. The issue in this parallel system is that the two CPUs race to finish their work and append their specific message to the destination file:

  1. If the first one wins this race, CPU1 and CPU2 write the messages to the file in the right order.
  2. If the second one wins, the order of lines in the file will be incorrect.

In the second case, you can still achieve the right order if you synchronize the writing process. However, this comes at a cost: when the wrong CPU wins, that CPU has to wait (that is, stall) until the right one has written its message. These stalls eventually build up and limit the throughput of the system as a whole. In other words, if you want to keep strict ordering, your scalability is limited.

The importance of message ordering

In the original syslog architecture and in other logging systems, message ordering was an important requirement. Imagine an event that produces many related messages in the stream of log messages, such as this backtrace that shows an error from a NodeJS application:

Apr 24 12:10:35 bzorp nodejs[54141]: ReferenceError: myArray is not defined
Apr 24 12:10:35 bzorp nodejs[54141]:   at next (/app/node_modules/express/lib/router/index.js:256:14)
Apr 24 12:10:35 bzorp nodejs[54141]:   at /app/node_modules/express/lib/router/index.js:615:15
Apr 24 12:10:35 bzorp nodejs[54141]:   at next (/app/node_modules/express/lib/router/index.js:271:10)
Apr 24 12:10:35 bzorp nodejs[54141]:   at Function.process_params (/app/node_modules/express/lib/router/index.js:330:12)
Apr 24 12:10:35 bzorp nodejs[54141]:   at /app/node_modules/express/lib/router/index.js:277:22
Apr 24 12:10:35 bzorp nodejs[54141]:   at Layer.handle [as handle_request] (/app/node_modules/express/lib/router/layer.js:95:5)
Apr 24 12:10:35 bzorp nodejs[54141]:   at Route.dispatch (/app/node_modules/express/lib/router/route.js:112:3)
Apr 24 12:10:35 bzorp nodejs[54141]:   at next (/app/node_modules/express/lib/router/route.js:131:13)
Apr 24 12:10:35 bzorp nodejs[54141]:   at Layer.handle [as handle_request] (/app/node_modules/express/lib/router/layer.js:95:5)
Apr 24 12:10:35 bzorp nodejs[54141]:   at /app/app.js:52:3

If you look at this series of messages, it is clear that they only make sense in this specific order. If your logging subsystem mangles their order in the final destination, the usefulness of the logs about this issue deteriorates, making the event a lot more difficult, if not impossible, to analyze later.

Choosing between Parallelism and Ordering

Log infrastructures are globally distributed systems. Every aggregation point in a log infrastructure can become a bottleneck that needs parallel processing to be resolved. An aggregation point here refers to a location in the infrastructure that aggregates multiple input streams (for example, messages from different hosts, containers, or applications) for further processing.

Some examples of these aggregation points are:

  • the ingest endpoint of your cloud-based SIEM,
  • handing over messages to a Kafka cluster,
  • a local collector like OTel, Beats, or syslog-ng running on a host,
  • a central log server,
  • a syslog-style log relay.

To see how you can resolve such bottlenecks, let’s look at a specific example: a central log server that receives syslog messages over UDP at a high rate and gets overloaded.

To parallelize this example, we need to clarify where ordering is important and where we can safely ignore it:

  1. Ordering is critical for logs originating from a single process or thread of an application. 
  2. Ordering is less important for logs originating from multiple processes or threads that already run in parallel. These messages are not generated in an ordered fashion in the first place.

The distinction between these two helps to scale the ingestion: 

  1. Guarantee ordering for point #1 above, that is, retain ordering for messages originating from a single thread of execution.
  2. Allow reordering among the rest of the messages, which may be part of different streams (for example, produced by different threads of execution), and process them in parallel. This is the key opportunity for parallelism and the key to increasing throughput.

Note that even though some messages will be reordered by this scheme, the timestamp associated with each message remains available. You can use this timestamp to sort messages based on their time of occurrence. Depending on the granularity of the timestamp, this might be less precise than strict ordering, but it is enough for most practical use cases.

With this idea, we have identified a significant subset of incoming messages that doesn’t need strict ordering, making it possible to dramatically increase message throughput.

Parallelizing UDP-based syslog

With the principles covered, you can use your new knowledge to parallelize syslog over UDP, as processed by a syslog server somewhere in your infrastructure.

My advice is to tune single-threaded performance first, as described in my last article on receiving syslog over UDP. Parallel processing requires more consideration and results in a more complex system that is harder to build and operate, as parallelism is usually a joint effort between network gear (routers and load balancers) and properly configured log processing.

With that in mind, let’s assume the following single-threaded scenario:

In this scenario, the kernel’s UDP reception path ensures message ordering. As a new datagram comes in from any source IP address, it is placed into a single first-in, first-out queue. This queue is the socket buffer, as discussed in our recent article. The log receiver then takes messages from the socket buffer one at a time, processes them, and delivers them to the right destination.
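
For reference, a minimal (simplified) syslog-ng configuration sketch of this single-socket baseline could look like the following; the port and file path simply mirror the examples shown later in this article:

source s_network {
    # a single UDP socket: every incoming datagram goes through one FIFO socket buffer
    udp(port(514));
};

log {
    source(s_network);
    destination { file("/logs/syslog"); };
};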

Parallel processing of streams

As you reach the single-threaded limit, you may notice that the setup outlined in the previous section starts to drop messages. The only option to resolve this situation is to speed up post-reception message processing. You can’t slow down the reception of UDP datagrams, as that would only cause them to be dropped earlier in transit (e.g. at switches or routers).

If your single CPU core is already maxed out, which is the assumption we started with, you need to apply some form of parallelization. The first and simplest parallelization technique to try is enabling multiple receive sockets, using a kernel mechanism called SO_REUSEPORT. This is explained in the socket(7) manual page.

If you are using syslog-ng, you can easily enable SO_REUSEPORT by using something like the following (simplified) syslog-ng configuration:

source s_network {
    udp(so-reuseport(1) persist-name("udp1"));
    udp(so-reuseport(1) persist-name("udp2"));
    udp(so-reuseport(1) persist-name("udp3"));
    udp(so-reuseport(1) persist-name("udp4"));
};

log {
    source(s_network); 
    destination { file("/logs/syslog"); };
};

This opens 4 receiving sockets instead of 1, all bound to the same UDP port 514. If you didn’t specify so-reuseport(1) in the config above, syslog-ng would fail to start up, as the kernel normally does not allow multiple sockets to bind to the same port.

With this configuration, each of the four sockets receives messages in parallel, with roughly a quarter of the packets delivered to each socket.

By default, messages that come from the same sender (same IP/port) end up in the same socket, thereby ensuring that traffic from the same IP/port is delivered in order.

This configuration improves performance: it processes messages in parallel, using up to 4 CPU cores instead of just one. This holds as long as the number of syslog senders exceeds the number of configured UDP receive sockets. As explained above, the ordering of messages originating from a single sender is retained. In a way, this is the best of both worlds: we gain parallelism, but message ordering is kept where it matters most.

The downside of this scenario is that load balancing across the sockets isn’t perfect: senders are assigned to sockets more or less randomly. Since different senders have different message rates, it may happen that one of the sockets gets overloaded while the others are idle. You can resolve some of these situations simply by increasing the number of sockets further. Track the following metrics:

  • The length of the receive queue of each socket (the netstat -anu output and the recently added syslog-ng metric named syslogng_socket_receive_buffer_used_bytes)
  • The message counters of the individual sockets and their potential imbalance (see the syslog-ng metric named syslogng_input_events_total)
  • The number of messages dropped per socket (see the syslog-ng metric named syslogng_socket_receive_dropped_packets_total)

The specifics of these metrics and how to best track them are outside the scope of this article, but you can read about them in the syslog-ng 4.2 release announcement or in this blog post.

Parallelize within streams

You might reach a point where the previous parallelization technique becomes insufficient, for example, if: 

  • a single chatty source emits messages at a high rate (e.g. 1M EPS in a single UDP stream)
  • a bunch of chatty clients all end up in the same socket, overloading one of your receive sockets

Once you have increased the number of receive sockets to match your CPU cores and you still see drops, you might need to allow reordering of messages even within streams originating from the same sender.

Remember that it is the kernel that chooses which receive socket a specific UDP datagram is placed into. In the previous section, I explained that this socket is chosen randomly, based on the source IP/port of the sender.

The good news is that this algorithm can be customized using a kernel facility called eBPF, which stands for Extended Berkeley Packet Filter.

Here’s the previous syslog-ng configuration snippet, with a single addition this time. Notice the ebpf() option for the first udp() source.

source s_network {
    udp(so-reuseport(1) persist-name("udp1")
        ebpf(reuseport(sockets(4)))
    );
    udp(so-reuseport(1) persist-name("udp2"));
    udp(so-reuseport(1) persist-name("udp3"));
    udp(so-reuseport(1) persist-name("udp4"));
};

log {
    source(s_network); 
    destination { file("/logs/syslog"); };
};

The ebpf() option changes the kernel’s SO_REUSEPORT algorithm so that all messages are randomly placed into one of the UDP sockets. Note that the ebpf() option requires syslog-ng version 4.2 or newer.

This configuration fixes the imbalance issues outlined in the previous section, basically by ignoring the source IP/port. The decision about which UDP socket buffer a datagram is enqueued into is made for every datagram, not once per stream. This means that messages are evenly load-balanced across your set of UDP sockets. While this resolves the imbalance between the sockets, you lose ordering between messages from the same sender, which is the price to pay for the increased throughput.

Retaining the message order in parallelized processing

As described earlier in this article, the ordering of messages can be important. You have seen how to scale UDP-based syslog:

  • first by keeping message ordering, and then 
  • by letting go of that requirement, resulting in perfectly load-balanced processing across all your CPUs.

However, in the second case, you might still need the original order in your analytics backend. In that case, you can use accurate timestamps or sequence numbers.

Accurate timestamps: Associate a timestamp with all log messages, one that’s accurate enough to ensure that you can restore their original order by sorting them based on their timestamp. With accurate timestamps, you can fully parallelize at any point in your collection infrastructure. Make sure to add the timestamp as early in the data supply chain as possible, ideally at the log collector. Note the following points about the accuracy of the timestamps:

  • If your log data only has one-second granularity, timestamps can only reconstruct the original ordering if you receive at most 1 message per second. The problem is that log streams regularly have temporary peaks, during which multiple messages would share the same timestamp, which you want to avoid.
  • Increasing the granularity of the timestamp helps, but you need at least microsecond or even nanosecond granularity (allowing peaks of 1 million msg/s or 1 billion msg/s, respectively).
  • Duplicate timestamps could occur even with nanosecond resolution.
  • You need synchronized clocks at the ingestion points, but perfect synchronization isn’t required.
  • While traditional syslog didn’t have granular timestamps, syslog-ng has supported them for decades, and can even restamp messages as they’re received. It can even use the kernel’s timestamping mechanism for this purpose (see the configuration sketch after this list).
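
As a rough sketch of that last point, syslog-ng’s keep-timestamp(), ts-format(), and frac-digits() options can be combined to stamp messages at reception time with sub-second precision. The exact values below are only illustrative, and the same options can also be set per source or per destination:

options {
    keep-timestamp(no); # stamp messages with the time of reception instead of trusting the sender's timestamp
    ts-format(iso);     # use ISO 8601 timestamps in the output
    frac-digits(6);     # keep 6 fractional digits (microsecond granularity)
};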

Sequence numbers: Apart from using timestamps, a simple sequence number associated with messages at their source also allows you to restore the order of messages.

  • Sequence numbers are unique, so duplicate timestamps don’t cause issues.
  • For example, on Cisco routers you can enable sequence numbers with the "service sequence-numbers" command. Syslog-ng can similarly generate sequence numbers using the $RCPTID macro. In RFC5424-formatted syslog messages, the structured data has a field called .meta.sequenceId (see the template sketch below).
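
As an illustrative sketch, you can include the $RCPTID macro in a destination template so that every stored message carries its reception ID; the file path and template are only examples, and depending on your syslog-ng version the macro may need to be enabled with the use-rcptid() global option:

destination d_file {
    # prefix each stored line with its reception ID so the order can be reconstructed downstream
    file("/logs/syslog" template("${RCPTID} ${ISODATE} ${HOST} ${MSGHDR}${MESSAGE}\n"));
};

Downstream components can then sort on this field to restore the order in which this syslog-ng instance received the messages.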

Conclusion

From this article, you have learned how to scale the processing of syslog messages up to 1M events per second or more. We also discussed the inherent tradeoffs involved when parallelizing message processing. Some may consider this kind of tuning advanced and something few will ever need; however, as volumes grow year over year, options like this come in handy.

The Axoflow team is building a product that abstracts away these complexities while also retaining their power: making it much simpler to run global log infrastructures.

 
