Why syslog over UDP loses messages and how to avoid that

syslog is the most important data source of your SIEM. The ratio between data coming from syslog and everything else varies between installations, but 50% is a good guess.

With data amounts increasing 25% YoY and enterprises regularly buying 1-10TB/day licenses from their favorite SIEM product, this means that you are pumping 500GB-5TB of data daily that originates from syslog. This includes data from your firewalls, routers, and the many other network and security appliances in your enterprise.

Message loss has always been a problem with syslog over UDP. Users who start looking at these metrics generally report a 30-40% loss of messages for syslog over UDP, but drops of up to 90% are not unheard of.

Let me get to the bottom of the reasons why messages are dropped and how these problems can be solved, or at least mitigated.

syslog transports

Syslog is traditionally transported from producers to consumers via UDP packets on port 514. The specifics of this are documented in RFC3164. Implementations of syslog (such as syslog-ng) do offer improvements over the original protocol, using TCP instead of UDP and offering encryption via SSL/TLS. There’s even a revised protocol (RFC5424) that promotes more reliable transports than the one originally described in RFC3164.
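
To make the traditional transport concrete, here is a minimal Python sketch of a producer that formats a BSD-style (RFC3164) message and sends it as a single UDP datagram. The helper names, the host name fw01, the tag and the destination address are illustrative assumptions, not part of any syslog implementation.

```python
import socket
from datetime import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def format_rfc3164(facility, severity, when, hostname, tag, text):
    """Build a BSD-syslog style line: <PRI>TIMESTAMP HOSTNAME TAG: MSG."""
    pri = facility * 8 + severity          # PRI encodes facility and severity
    # RFC3164 timestamps look like "Mmm dd hh:mm:ss" with a space-padded day.
    stamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{pri}>{stamp} {hostname} {tag}: {text}"

def send_udp(message, host="127.0.0.1", port=514):
    """Fire and forget: one message, one datagram, no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (host, port))

if __name__ == "__main__":
    # Facility 16 (local0) and severity 6 (info) are example values.
    line = format_rfc3164(16, 6, datetime.now(), "fw01", "demo", "link up")
    send_udp(line)   # returns immediately; the sender never learns of a drop
```

Note that send_udp has nothing to check afterwards: there is no reply from the receiver, which is exactly why loss goes unnoticed on the sending side.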

That said, syslog over UDP is still used in enterprise environments, even though more reliable alternatives have been available for decades. The reason for this is not all laziness and inertia. For carrier-grade firewalls and routers, the only protocol that can scale and manage to get all of the data off the appliance is UDP, due to its simplicity and efficiency (discussed below). This is why the use of UDP will continue, especially in large enterprises, for some time.

One way to minimize drops is to use the more reliable alternatives, but that may not be an option due to the scale requirements noted above.

Introduction to UDP - and the problem with UDP at scale

Before delving into how UDP can be scaled, let’s briefly look at how it works.

UDP is a very simple protocol: it does not do much more than send network packets on the LAN, with a header added to each packet to make it routable on the Internet. This is why it is still a favored protocol in very large appliances, which value efficiency above just about everything else.

This also means that UDP inherits most of its properties from LAN equipment, such as switches and routers. UDP packets:

Can be dropped in case of congestion
Can be reordered by a switch as it interconnects nodes on a network

The only things that stand between dropping and delivering UDP packets are:

The speed of the network: if the wire speed is a lot higher than what you generally use, congestion is a rare event.
Buffering built into clients/servers and network equipment: meaning that dropping packets due to congestion can be avoided as long as these buffers are able to keep data until the congestion clears up.

The problem you face as you scale the ingestion of UDP-based traffic should now be apparent:

As soon as your buffers get full at any point in the network, packets are dropped without notification.
This becomes worse as the number of hops (in terms of L2/L3 network equipment) between the client and the server increases, simply because the odds of the packet meeting a full buffer or a congested link increase. It might be an overloaded WAN link or an overloaded router; in either case the packet is dropped.
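
The effect of a full buffer can be illustrated with a toy model. The sketch below is a simplified simulation, not a model of any real device: it assumes time is divided into slots, that a hop can forward a fixed number of packets per slot, and that anything arriving while the buffer is full is discarded. All names and numbers are made up for illustration.

```python
def simulate_buffer(arrivals, capacity, drain_per_tick):
    """Return (delivered, dropped) for a bounded FIFO buffer.

    arrivals:       packets arriving in each time slot
    capacity:       how many packets the buffer can hold
    drain_per_tick: how many packets the hop forwards per slot
    """
    queued = delivered = dropped = 0
    for incoming in arrivals:
        accepted = min(incoming, capacity - queued)
        dropped += incoming - accepted      # no room left: silently discarded
        queued += accepted
        sent = min(queued, drain_per_tick)  # forward what the link allows
        queued -= sent
        delivered += sent
    return delivered, dropped

# Average load is 3.9 packets per slot against a drain rate of 5,
# yet a single burst overflows the small buffer.
traffic = [1, 1, 30, 1, 1, 1, 1, 1, 1, 1]
print(simulate_buffer(traffic, capacity=10, drain_per_tick=5))   # (19, 20)
print(simulate_buffer(traffic, capacity=40, drain_per_tick=5))   # (39, 0)
```

The same traffic is delivered in full once the buffer is large enough to absorb the burst, which is why average throughput figures alone tell you little about UDP loss.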

Scaling UDP: minimizing packet loss

It’s worth mentioning that the only thing you can do when scaling UDP-based traffic is to minimize loss. Loss remains a possibility whatever you do. To avoid loss, you’d have to use a different transport (one that uses TCP is a good start).

If you are stuck using UDP and you are already encountering message loss (which is probably why you are reading this article), take a structured approach to find the root cause(s).

Determine the cause of the loss: Are packets dropped by network equipment during their transit or are they being dropped on the endpoints (that is, the client or the server)?
If the packets do arrive at the destination server, determine if the application meant to receive them (syslog-ng for instance) is able to take over all packets from the kernel in time, so the buffer does not overflow.

Determine if packet loss occurs on the network

You need to confirm that packets make it to the receiver without loss. To do that, keep in mind that each log message is a single UDP packet (IP-level fragmentation is unusual due to the limited size of these messages). With this in mind, use either network monitoring tools (such as tcpdump or Wireshark) or packet metrics in firewalls to check whether what is being sent is actually being received. You can also use kernel tracing on the endpoints; see our blog post “How to detect TCP and UDP packet drops in syslog and telemetry pipelines” for details.
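
One simple way to compare what was sent with what was received is to send test messages that carry a sequence number and then look for gaps on the receiving side. The sketch below assumes you control a test sender that numbers its messages from 0 and that you have already extracted the sequence numbers in arrival order (from a capture, for example); the function name and report fields are hypothetical.

```python
def summarize_loss(sent_count, received_seqs):
    """Compare what was sent (0..sent_count-1) with what actually arrived."""
    seen = set(received_seqs)
    missing = sorted(set(range(sent_count)) - seen)
    # A packet counts as reordered if it arrives after a higher sequence number.
    reordered = 0
    highest = -1
    for seq in received_seqs:
        if seq < highest:
            reordered += 1
        else:
            highest = seq
    return {
        "received": len(seen),
        "missing": missing,
        "loss_pct": 100.0 * len(missing) / sent_count,
        "reordered": reordered,
    }

# Ten messages sent; 5 and 6 never arrived, and 3 arrived after 4.
print(summarize_loss(10, [0, 1, 2, 4, 3, 7, 8, 9]))
```

Running the same comparison at several points along the path (sender, intermediate capture, receiver) narrows down where the gaps first appear.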

It’s worth mentioning that usually the network does a pretty good job at delivering UDP traffic, so you might be fine at this level, but your mileage may vary. Since solutions at this layer are generally difficult or expensive (or both), I would personally start with problems between the kernel and the application (which is the next section in this article), but if that fails to resolve the problem you would have to come back here.

You cannot process or recover messages that the network drops. If you conclude that the network drops your packets, there are three ways to improve the situation:

Move the receiver closer to the sender: This typically means deploying the receiver in the same LAN segment. This receiver could be a relay/log router which can receive UDP locally and then use a more reliable protocol to forward the messages for further aggregation. It is best to avoid passing UDP through network infrastructure such as routers and firewalls unless protocol translation cannot be achieved.
Use Quality of Service: If routers or other network infrastructure must be used, IP networks can prioritize traffic in different ways and provide different quality of service for specific data flows. Clients or relays can associate log traffic to a high priority band, which routers prioritize differently than regular traffic.
Increase bandwidth along the path: Check for bottlenecks in the network topology between the client and the server and increase bandwidth if that’s possible. This is probably the most difficult, if not impossible, to implement, but if the above steps fail that is your only option.
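
As an illustration of the Quality of Service option, a sender or relay can mark its outgoing log packets with a DSCP value by setting the IP TOS byte on the socket. This is a minimal sketch for IPv4 on a Unix-like system; the DSCP value 16 is only an example, and marking helps only if the routers along the path are configured to honor it, so treat the value and the helper names as assumptions to be agreed with your network team.

```python
import socket

def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2

def open_marked_udp_socket(dscp):
    """Create a UDP socket whose outgoing IPv4 packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock

if __name__ == "__main__":
    # DSCP 16 is an example value only; use what your network assigns.
    with open_marked_udp_socket(16) as sock:
        print("TOS byte:", sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
```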

Determine if packet loss occurs between the kernel and the application

The packets we are focusing on now are dropped after they are delivered to the destination host (for example, the syslog server). This is where most of the drops happen in my experience, so let’s zoom into how this can be diagnosed.

First of all, it’s important to understand how the kernel handles incoming packets and how it hands them over to the application which is responsible for receiving them.

UDP packet handling on the syslog-ng host

There is - surprise, surprise - a buffer between kernel- and user-space, called the socket buffer (or receive buffer), which is associated with the socket the application uses to receive those packets. The size of this buffer is important, but let’s assume for a moment that it is a few hundred kilobytes by default.

Whenever a packet is received from the network, the IP stack in the kernel space:

Identifies the associated socket, as opened by the application, e.g. the socket bound to “0.0.0.0:514”
If the receive buffer has sufficient free space, the kernel:
- Moves the packet payload into the receive buffer of this socket,
- Notifies the application that a new packet has been received
If the receive buffer is full, the kernel:
- Drops the packet,
- Increases the UDP receive errors counter.
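
On Linux, the system-wide UDP counters are exposed in /proc/net/snmp as a header line followed by a line of values. The sketch below reads them so you can watch whether the error counters grow while traffic is flowing. The function names are my own, and the counters are per host, not per socket, so other UDP applications on the same machine contribute to them as well.

```python
import os

def parse_udp_counters(snmp_text):
    """Extract the Udp: counters from the text of /proc/net/snmp."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp: header/value pair found")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}

def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())

if __name__ == "__main__" and os.path.exists("/proc/net/snmp"):
    counters = read_udp_counters()
    # RcvbufErrors counts datagrams dropped because a receive buffer was full.
    print("InErrors:", counters.get("InErrors"),
          "RcvbufErrors:", counters.get("RcvbufErrors"))
```

Sampling these values twice, a few seconds apart, and comparing the difference tells you whether drops are happening right now rather than at some point since boot.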

When the syslog application receives a notification from the kernel that a new packet is available, it:

Identifies the socket the notification is associated with (sockets are identified by a file descriptor or fd)
Issues a recv(fd) system call to receive the packet payload,
Performs parsing, processing and delivery to one of its outputs.
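
The application side of these steps can be sketched in a few lines of Python. This is not how syslog-ng is implemented; it is a minimal illustration of the same pattern using the standard selectors module, with a hypothetical handle_message function standing in for parsing, processing and delivery.

```python
import selectors
import socket

def handle_message(payload, sender):
    """Placeholder for parsing, processing and delivery (hypothetical)."""
    return payload.decode("utf-8", errors="replace").rstrip("\n")

def receive_loop(sock, max_messages, timeout=1.0):
    """Wait for kernel notifications and read one datagram per receive call."""
    sock.setblocking(False)
    received = []
    with selectors.DefaultSelector() as selector:
        selector.register(sock, selectors.EVENT_READ)
        while len(received) < max_messages:
            events = selector.select(timeout)   # the kernel reports readiness
            if not events:
                break                           # nothing arrived in time
            for key, _mask in events:
                ready = key.fileobj             # the socket behind key.fd
                try:
                    payload, sender = ready.recvfrom(65535)
                except BlockingIOError:
                    continue                    # spurious wakeup, try again
                received.append(handle_message(payload, sender))
    return received

if __name__ == "__main__":
    # Port 5514 is an arbitrary choice that avoids needing root for port 514.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server:
        server.bind(("0.0.0.0", 5514))
        for line in receive_loop(server, max_messages=10, timeout=5.0):
            print(line)
```

Everything that happens inside handle_message takes time during which the socket is not being read, so slow processing directly translates into a fuller receive buffer.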

The connection between the kernel and the application is the receive buffer. As long as it has enough space allocated, no message is dropped. Once it is full, packets are dropped until space becomes available.
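
You can reproduce this behavior on a single Linux machine. The sketch below requests a deliberately small receive buffer, sends a burst over the loopback interface while nobody is reading, and only then starts receiving. The packet count, packet size and buffer size are arbitrary values chosen to make the overflow obvious, and the kernel is free to adjust the requested buffer size, so the code reads back what was actually granted.

```python
import socket

def overflow_demo(packets=2000, size=512, rcvbuf=4096):
    """Send a burst to a socket nobody reads from, then count the survivors."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask for a small receive buffer; the kernel may round or adjust it.
    receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    granted = receiver.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    receiver.bind(("127.0.0.1", 0))

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for _ in range(packets):
        # The send succeeds even if the packet is dropped at the receiver.
        sender.sendto(b"x" * size, receiver.getsockname())
    sender.close()

    # Only now does the "application" start reading.
    receiver.settimeout(0.2)
    survived = 0
    try:
        while True:
            receiver.recv(size)
            survived += 1
    except socket.timeout:
        pass
    receiver.close()
    return granted, packets, survived

if __name__ == "__main__":
    granted, sent, survived = overflow_demo()
    print(f"buffer={granted} bytes, sent={sent}, received={survived}")
```

The sender sees no error at any point; the only evidence of the loss is the difference between the sent and received counts and the kernel's UDP error counters.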

Summary

This concludes our overview of the common causes of message loss when using syslog over UDP, and of how you can determine whether, and where, your messages are lost.