February 2, 2024

and

No items found.

syslog-ng

syslog

telemetry

troubleshooting

packet drop

UDP packet loss

How to detect TCP and UDP packet drops in syslog and telemetry pipelines

This blog post enumerates the most important causes of excessive buffering, packet drops, and message loss, and gives you easy-to-use recipes to identify the specific problem.

One of the key ingredients of a Security Operations Center is the collection of machine data from all across the enterprise into the SIEM for analysis. The amount of data can be enormous, a daily ingestion of 50-100TB of data is not unheard of. To put these numbers in perspective, 100TB of data daily is 1103MB of data every second. This requires about 10GBit/s network bandwidth to transfer over a single link.

At these rates, issues build up quickly. A one-second outage (like restarting a service) means we have to store over 1GB of data somewhere. Even if we break this down into a cluster of computers, an understanding of where we might be buffering or losing data is important to mitigate or resolve issues.

Transports using TCP

There are many different transport mechanisms to take log data from A to B; a lot of them use TCP (Transmission Control Protocol) underneath (e.g. syslog over TCP, or OpenTelemetry’s OTLP). TCP ensures that no messages are lost while the connection is established, but the use of TCP doesn't in and of itself prevent excessive buffering or message loss. The following sections help you identify and diagnose various types of bottlenecks that can arise when using a TCP-based transport.

Receiver side bottlenecks

Imagine that you are receiving messages from various devices to an aggregation point (like a syslog server or OpenTelemetry collector). You encounter message drops or excessive buffering on the sending side and you want to understand which component is at fault:

the server,
the application processing the logs, or
the network.

We will use Linux Kernel tracing features to find congested TCP connections. As an example, I am going to use a simple TCP-based syslog connection and syslog-ng to create a bottleneck and see how this can be diagnosed using tracing. Here’s a simple syslog-ng configuration to create a bottleneck:

@version: current

log {
    source { tcp(port(1514)); };
    destination { file("/dev/null" throttle(1k)); };
    flags(flow-control);
};

With the config above, syslog-ng takes any messages on TCP port 1514 and writes them to /dev/null to discard them. The throttle() option creates the bottleneck: it limits the message flow to 1000 messages per second.

Start syslog-ng using this command:

syslog-ng -Fef /path/to/syslog-ng.conf

On another terminal, generate messages at a pace much faster than 1k messages per second using the loggen utility from syslog-ng:

loggen --stream  --interval 60 --rate 10000 localhost 1514

Once the various buffers in the chain fill up, syslog-ng limits reading on port 1514, meaning that loggen won’t be able to maintain the requested rate and will slow down. In a production scenario, you’d have an actual workload that produces the logs. When the workload cannot send the messages at the required rate, it will:

Start buffering messages,
Start dropping messages once the buffer becomes full, and
Slow down if internal back pressure is implemented.

When something like this happens in production, you need to find out what the limiting factor is:

Is it a hardware/capacity limitation on the log server?
Is it a limitation in the application that processes the logs (in the case above, in syslog-ng)?
Is it a network bandwidth limitation?

Since the mitigation is very different, finding the root cause is important to remediate the problem. Although checking syslog-ng metrics would allow us to diagnose this issue, we are going to use the kernel’s tracing functionality because it’s easier to use for other logging implementations.

Find logging bottlenecks with kernel tracing

Tracing is a lightweight debugging tool for Linux kernels that allows you to do tracing in production. The overhead is minimal, especially compared to other debugging mechanisms, like packet captures.

First, install trace-cmd on the host running syslog-ng.

Debian or Ubuntu: apt-get install trace-cmd
CentOS: yum install trace-cmd
Fedora: dnf install trace-cmd

Once you have trace-cmd, you can start recording traces using trace-cmd record. You need root permissions for recording. This command starts recording cases where an incoming TCP packet hits a “zero window” condition on port 1514 (your port number might be different).

sudo trace-cmd record -e 'tcp:tcp_probe' -f "rcv_wnd == 0 && sport == 1514" sleep 60

The recording runs as long as its command parameter at the end of the command line (or indefinitely if no command is supplied). The example above runs for 60 seconds (see the sleep 60 command at the end of the trace-cmd command line). You can terminate tracing at any time by hitting Ctrl+C.

You can get trace events by running trace-cmd report, that shows something like this:

<idle>-0	[007] 126826.552534: tcp_probe:        	family=AF_INET src=127.0.0.1:1514 dest=127.0.0.1:53252 mark=0 data_len=0 snd_nxt=0xbe454ac1 snd_una=0xbe454ac1 snd_cwnd=10 ssthresh=2147483647 snd_wnd=65536 srtt=35 rcv_wnd=0 sock_cookie=d00b

What you see there is the TCP-level details of the connection that gets stalled because the window is zero (rcv_wnd=0). If the trace does not have any events, don’t worry – that just means that no connection was stalled, so there was no bottleneck.

With the information in the trace, you can conclude what is at fault:

If the TCP window is regularly hitting zero, that means that the receiving side has a bottleneck
1. If the server or CPU is maxed out, you might have a hardware capacity issue (check with vmstat or mpstat)
2. If the server is not maxed out, you might have an application-level problem or scaling issue.
If the TCP window is not hitting zero, then the application IS able to process its input.
1. If the sender is working fine, then you don’t have a bottleneck
2. If the sender is buffering/dropping on its own, then the bottleneck is the network bandwidth

Note that stalled connections would only exchange packets every once in a while, depending on TCP keepalive settings and other parameters, so it’s better to let the trace run for a while (30-60 seconds), as only actual packets trigger trace events.

Sender side bottlenecks

Similarly to the receiver side, the sender side can also be at fault. If the egress from your log server is slower than the ingress or the amount of logs locally generated, then the excess will start to fill your buffers. When those buffers fill up, messages need to be dropped.

The question on the sender side becomes:

Is it a limitation of the sender?
Is it a limitation of the downstream consumer?
Is it a limitation of the network bandwidth?

This time, instead of trying to find stalled connections on the receiver side, we use tracing on the sending side:

sudo trace-cmd record -e 'tcp:tcp_probe' -f "snd_wnd == 0 && dport == 1514" sleep 60

Notice that we changed the filter condition from rcv_wnd to snd_wnd, and also used dport to indicate the remote port to filter for the right connection.

The interpretation of the result is pretty similar:

If the outbound TCP window is regularly hitting zero, that means that the receiving party has a bottleneck
1. If on the receiving end the server or CPU is maxed out, you might have a hardware capacity issue (again, check with vmstat or mpstat)
2. If the server is not maxed out (CPU or disk), you might have an application-level problem or scaling issue (e.g. the application is unable to use all resources available to it).
3. If you have neither, then the bottleneck is the network bandwidth
If the TCP window is not hitting zero, then the downstream application IS able to process its input.
1. If the client is buffering/dropping on its own, then it might be an application issue on the sending side.
2. If the client is fine then you don’t have a problem.

Transports using UDP

UDP is different from TCP, as it guarantees neither delivery nor ordering. You don’t have visibility into congestion, and any sign of congestion will result in packet drops. The primary cause for packet drops is one of the following:

Upon receiving the datagram, the UDP stack finds that the socket buffer of the receiving application is full. The datagram is dropped and the UDP receive error counter is incremented.
On the network path between the sender and the receiver, any active element (switch, router, etc.) finds that a network link is congested. The datagram is dropped without notification.

As stated earlier, processing or network bottlenecks in UDP-based message delivery always cause message drops, and the only solution to this problem is to overprovision your infrastructure to its peak load. There are multiple ways for finding evidence of UDP packet loss:

Using netstat -nsu and look for “packet receive errors”
Tracing the udp:udp_fail_queue_rcv_skb tracepoint
Using the syslog-ng metric syslogng_socket_receive_dropped_packets_total

The solution using tracing is the following:

sudo trace-cmd record -e udp:udp_fail_queue_rcv_skb -f 'lport == 1514'

The trace-cmd report would show something like this:

socket-plugin-537339 [000] 144094.163543: udp_fail_queue_rcv_skb: rc=-12 port=1514

Unfortunately, the udp tracepoint does not give too much information about the source of the UDP datagram, even though that would be a trivial addition to the kernel. If all you need is the total number of UDP packet drops since boot, then netstat -nsu gives you that answer, without root privileges, like the screenshow below. Note the "packet receive errors" line in the Udp section, that shows 357k receive errors.

IcmpMsg:
	InType3: 2451
	OutType3: 2033
Udp:
	11162848 packets received
	3033 packets to unknown port received
	357243 packet receive errors
	19690297 packets sent
	357243 receive buffer errors
	356 send buffer errors
	IgnoredMulti: 75246
UdpLite:
IpExt:
	InNoRoutes: 26
	InMcastPkts: 27727
	OutMcastPkts: 5588
	InBcastPkts: 108355
	OutBcastPkts: 29086
	InOctets: 7252032521
	OutOctets: 19466635008
	InMcastOctets: 5424830
	OutMcastOctets: 816367
	InBcastOctets: 21937641
	OutBcastOctets: 14055753
	InNoECTPkts: 14760961
	InECT0Pkts: 781
MPTcpExt:

To reliably recognize and identify where UDP packet loss happens, you’d need a sequence number assigned to each message, something that syslog-ng can do with the flags(seqnum-all) destination flag (available in version 4.6 and later) and the proper use of the $SEQNUM macro in the destination template.

Additional ideas

So far, we've been trying to identify bottlenecks based on the network traffic itself. But here is a list of additional ideas that you could check independently of the transport. Some of these ideas are syslog-ng specific:

In a message forwarding scenario, chart your incoming and outgoing message rates and compare the curves. If the input and output curves follow each other with no or a little delay, then you can be sure that the forwarding mechanism can cope with the input and can send to the output, there is no bottleneck.
If your logging agent exposes metrics about message delays, check those out. Message delays should be in the order of seconds and not minutes or hours. The metric of interest in syslog-ng is syslogng_output_event_delay_sample_age_seconds.
Look at the metrics of your log agent implementation; syslog-ng has detailed metrics on buffer consumption and message drops.
If you have a receive bottleneck, and even though CPU is still available, your logging agent does not seem to make use of it, then it makes sense to look at the general latency of your logging agent and check resource limits. The latency measures how much time it takes for your logging agent to respond to an incoming message. You can check this with the syslog-ng-ctl healthcheck command, which would give you two metrics, the more important being: syslogng_mainloop_io_worker_roundtrip_latency_seconds. The normal value is around 0.1msec
If disk buffering is in use, check disk utilization and the amount of time your CPU spends waiting for a disk operation (vmstat, mpstat, etc).

Conclusion

Metrics and their trends are important to detect and diagnose bottlenecks or data reliability issues in large, distributed telemetry systems. Telemetry pipelines interact with everything in an enterprise infrastructure, collecting logs, metrics, and traces from all kinds of devices, and ensuring they all make it to their respective consumers. In such a system, bottlenecks can form at any moment and any location.

To quickly resolve these bottlenecks, you need an observability solution tailored for telemetry pipelines. A management plane that automatically monitors all elements of the pipeline and raises an alert if a problem occurs. The Axoflow Platform is such a management plane that collects, exposes, and visualizes the metrics that can help you identify a problem, wherever it forms. (See our video on detecting data loss!)

For an overview on how our platform enhances syslog-ng based log collection with metrics - including metrics for dropped packets - see the Metrics for syslog-ng based log management infrastructures blog post.

If you want to learn more about how we manage telemetry pipelines, check out our product page.