
Ways to break data ingestion of your SIEM
In this post we’ll give an end-to-end overview of how security data, for example a log message, gets from your appliance to the SIEM, and highlight the bumps in the road and the problems that can occur. These problems can prevent the data from being ingested into your SIEM at all, or leave your SOC teams and alerts blind to it even when it is ingested. We’ll give you tips on how to avoid these problems, and also show you how Axoflow can help. To illustrate the various issues, we’ll use examples from Palo Alto Networks firewall logs.
In the examples we’ll focus on problems with syslog data, mainly because most appliances and services don’t support newer transport protocols like OpenTelemetry. (Axoflow supports a wide range of syslog protocols and formats, as well as OpenTelemetry.)
Configure the appliance
The first step in the pipeline is to configure the endpoint appliance to forward its logs somewhere: either directly to the SIEM, or to a data router. (Since we’re discussing collecting data from appliances, we won’t cover scenarios where you can install a collector agent on the endpoint; that’s a topic for a later post.)
This step usually involves the following parts:
- Setting how your appliance transports its logs (and where)
- Configuring what to send
- Selecting message format
Configure log transport
Most appliances support one or more versions of the syslog protocol (RFC3164 and RFC5424), with the older RFC3164 version being more widespread, even though RFC5424 is over 15 years old.
Other parameters of the transport must also align with what the destination SIEM or router expects: the transport protocol (UDP or TCP), the IP address or hostname, the port number, and, if you use TLS/SSL encryption and authentication (which you should), the keys, certificates, and validation settings. If any of these don’t match, none of the data will reach the destination. For a Palo Alto firewall, here are the basic things you have to configure (based on the official documentation):

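The firewall side is configured in the PAN-OS UI, so there’s no code to show there, but the “everything must align” requirement is easy to test from the outside. Below is a minimal, hedged sketch of a test sender that mimics what the appliance does: it pushes one RFC5424-formatted message over TLS to the collector. The hostname, port, and CA file are placeholders (assumptions), and they must match your receiver’s actual settings; if any of them or the certificates don’t match, nothing arrives.

```python
# Minimal sketch: send one RFC5424 syslog message over TCP with TLS.
# COLLECTOR_HOST, COLLECTOR_PORT, and CA_FILE are placeholders (assumptions);
# they must match what your SIEM or data router is configured to accept.
import socket
import ssl
from datetime import datetime, timezone

COLLECTOR_HOST = "siem.example.com"   # your receiver's address
COLLECTOR_PORT = 6514                 # common port for syslog over TLS
CA_FILE = "/etc/pki/ca-bundle.pem"    # CA that signed the receiver's certificate

def send_test_message() -> None:
    timestamp = datetime.now(timezone.utc).isoformat()
    # RFC5424 layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG
    message = f"<165>1 {timestamp} test-host test-app - - - transport check\n"

    context = ssl.create_default_context(cafile=CA_FILE)
    with socket.create_connection((COLLECTOR_HOST, COLLECTOR_PORT)) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=COLLECTOR_HOST) as tls:
            tls.sendall(message.encode())

if __name__ == "__main__":
    send_test_message()
```

This sketch terminates the message with a newline; some receivers expect octet-counted framing instead (more on framing below).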
Log granularity: selecting what to send
Palo Alto firewalls have 17 different log types, such as those that describe traffic, audit events, or configuration changes. Certain appliances send every type of data by default; others have a more layered configuration where you can select which log types you are interested in, for example, whether you want to forward logs about configuration changes. Obviously, if you don’t select the logs that your SIEM analytics and alerts build on, you’ll miss important events, and possibly a security incident or breach.
Configure message format
Apart from log transport and granularity, devices can sometimes represent the same log message in multiple different formats. Here we mean the actual payload of the message: what is included about the event, and in what format.
- Is it a simple human-readable text message?
- A list of key-value pairs?
- Or a structured JSON object that includes everything important (and possibly a lot more that isn’t important) about the event?
Palo Alto firewalls send comma-separated values in the body of the log messages by default, like this:
<165>Mar 26 18:41:06 us-east-1-dc1-b-edge-fw 1,2025/03/26 18:41:06,007200001056,TRAFFIC,end,1,2025/03/26 18:41:06,192.168.41.30,192.168.41.255,10.193.16.193,192.168.41.255,allow-all,,,netbios-ns,vsys1,Trust,Untrust,ethernet1/1,ethernet1/2,To-Panorama,2025/03/26 18:41:06,8720,1,137,137,11637,137,0x400000,udp,allow,276,276,0,3,2025/03/26 18:41:06,2,any,0,2345136,0x0,192.168.0.0-192.168.255.255,192.168.0.0-192.168.255.255,0,3,0
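To give a feel for what a downstream parser has to do with that payload, here is a hedged sketch that strips the syslog header and picks a few fields out of the CSV body. The field positions are read off the example above and should be treated as an assumption: the exact field order differs between PAN-OS versions and log types.

```python
import csv

# The TRAFFIC example from above, reproduced verbatim.
RAW = (
    "<165>Mar 26 18:41:06 us-east-1-dc1-b-edge-fw "
    "1,2025/03/26 18:41:06,007200001056,TRAFFIC,end,1,2025/03/26 18:41:06,"
    "192.168.41.30,192.168.41.255,10.193.16.193,192.168.41.255,allow-all,,,"
    "netbios-ns,vsys1,Trust,Untrust,ethernet1/1,ethernet1/2,To-Panorama,"
    "2025/03/26 18:41:06,8720,1,137,137,11637,137,0x400000,udp,allow,276,276,0,3,"
    "2025/03/26 18:41:06,2,any,0,2345136,0x0,192.168.0.0-192.168.255.255,"
    "192.168.0.0-192.168.255.255,0,3,0"
)

def parse_pan_traffic(raw: str) -> dict:
    # Split off the RFC3164-style header: "<PRI>Mmm dd hh:mm:ss hostname "
    # (assumption: the header looks exactly like the example; a space-padded
    # day such as "Mar  6" would need extra care).
    header_and_body = raw[raw.index(">") + 1:]
    month, day, time, hostname, body = header_and_body.split(" ", 4)

    # The body is one CSV record; fields may contain quoted commas,
    # so use a real CSV reader rather than str.split(",").
    fields = next(csv.reader([body]))

    # Field positions follow the TRAFFIC example above and are an
    # assumption -- they vary by PAN-OS version and log type.
    return {
        "host": hostname,
        "timestamp": f"{month} {day} {time}",
        "receive_time": fields[1],
        "serial": fields[2],
        "log_type": fields[3],    # "TRAFFIC"
        "subtype": fields[4],     # "end"
        "src_ip": fields[7],
        "dst_ip": fields[8],
        "rule": fields[11],
        "app": fields[14],
        "protocol": fields[29],
        "action": fields[30],
    }

print(parse_pan_traffic(RAW))
```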
Actually, even the different syslog protocols have somewhat different formats, so you have to make sure that the destination expects the format you’re sending, or that it handles both versions automatically. Axoflow detects the incoming syslog version and handles both transparently, including framing (more on framing below).
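If you’re building the receiving side yourself rather than using a pipeline that does this for you, the version check is conceptually simple. A hedged sketch (the patterns only cover well-formed headers; real traffic is messier):

```python
import re

# Classify a raw syslog line as RFC5424, RFC3164, or unknown, based only on
# the header shape. Malformed messages are common, so a production parser
# needs fallbacks beyond these two patterns.
RFC5424_RE = re.compile(r"^<\d{1,3}>1 ")  # "<PRI>1 " -- the VERSION field is literally "1"
RFC3164_RE = re.compile(r"^<\d{1,3}>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

def syslog_flavor(line: str) -> str:
    if RFC5424_RE.match(line):
        return "rfc5424"
    if RFC3164_RE.match(line):
        return "rfc3164"
    return "unknown"

print(syslog_flavor("<165>1 2025-03-26T18:41:06Z host app - - - hello"))    # rfc5424
print(syslog_flavor("<165>Mar 26 18:41:06 us-east-1-dc1-b-edge-fw 1,..."))  # rfc3164
```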
Palo Alto firewalls support both RFC3164 and RFC5424 message formats. With manual configuration, they even support custom CEF/LEEF message formats. However, there are some caveats.
Using CEF/LEEF requires you to manually configure the log format for each Palo Alto service, for example:

This is difficult to understand, modify, and maintain, especially if you have multiple firewalls.
Such customizations don’t scale to other kinds of data sources (devices that are not Palo Alto firewalls). If you try to standardize the data collected from all of your firewalls, for example, you’ll quickly see that different vendors provide different levels of customization and different tools to do so, so you probably won’t be able to get the same data in the same format out of different types of firewalls.
Also, the above CEF example (taken from the official PAN-OS 10.0 CEF Configuration Guide) is about 2600 characters long. But the Palo Alto user interface only accepts 2048 characters…
In general, such formatting customizations and standardizations are best handled in your security data pipeline, especially since doing so has other reliability and cost benefits as well.
Message format and multiple destinations
Another problem regarding message formats can come up if you want to send data to multiple destinations. Multi-SIEM environments are on the rise, but not every SIEM or other destination can handle the same message formats. Staying with the Palo Alto firewall example, the Splunk Technology Add-ons (TAs) that parse and process the data received from Palo Alto expect the data to be in the default comma-separated value format. But Microsoft Sentinel expects you to send Palo Alto logs in CEF format to an intermediate AMA agent, which forwards the data to Sentinel…
Again, such requirements and problems can be handled uniformly in a security data pipeline, which can receive data from all your sources and forward it to one or more destinations, using the format best suited for each specific destination.
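Conceptually, the pipeline keeps one parsed representation of the event and renders it per destination. The sketch below is illustrative only: the CEF header values, extension keys, and route names are placeholders, not the exact mapping the Splunk TAs or the Sentinel AMA connector require.

```python
# Hedged sketch: one parsed event, rendered differently per destination.
# The "fields" dict could come from a parser like the CSV sketch earlier;
# the CEF header values and extension keys below are illustrative placeholders.
def to_splunk(fields: dict) -> str:
    # The Splunk TAs expect the original comma-separated body, so the safest
    # "conversion" for Splunk is to forward the raw body untouched.
    return fields["raw_body"]

def to_cef(fields: dict) -> str:
    # CEF header: CEF:Version|Vendor|Product|DeviceVersion|SignatureID|Name|Severity|Extensions
    ext = f"src={fields['src_ip']} dst={fields['dst_ip']} act={fields['action']}"
    return (f"CEF:0|Palo Alto Networks|PAN-OS|0.0|{fields['subtype']}"
            f"|{fields['log_type']}|3|{ext}")

# One renderer per destination; a real pipeline would also cover transport
# differences (HTTP Event Collector tokens, syslog to the AMA agent, and so on).
ROUTES = {"splunk": to_splunk, "sentinel": to_cef}

def route(fields: dict):
    for destination, render in ROUTES.items():
        yield destination, render(fields)
```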
Receive the data
Receiving the syslog data on the SIEM or an intermediate syslog server or router can be tricky for a number of reasons. As we’ve discussed in the previous sections, the configuration of the receiver (port, protocol, format, and so on) must match the configuration of the sender. However, that’s not enough, because:
- The receiver must be able to keep up with the amount of data that the sources are sending in aggregate. In addition, the receiver must be able to cope with the amount of data a single source is sending using a single connection. Most high volume devices use a single TCP/UDP connection to send their logs, potentially triggering per-connection bottlenecks in the receiver.
- Load balancers may lose data on their own, and do not balance syslog data properly, as they are inherently connection-based (as opposed to message-based). A high-volume data source may end up overloading one of the servers in your cluster (thus causing message loss), while all the others sit idle.
- You should be able to monitor the data flow and detect on-the-wire data loss, and preferably get alerts when it occurs. The UDP protocol is notoriously unreliable, but you can lose data even while using TCP. (Axoflow provides detailed metrics on the status of your data flow, and alerts you in case of data loss.)
- Timestamp and timezone settings must be handled properly, especially if the sender is in a different timezone. (Tip: use UTC everywhere, or at least try to auto-detect the timezone, as timezone settings are often wrong or misleading.)
- Framing issues: in certain scenarios, the sender uses RFC5424-style octet counting (also called framing) in the protocol. Both the sender and the receiver must properly implement and handle framing (see the sketch at the end of this section), otherwise:
- messages may end up truncated,
- multiple messages may be concatenated into a single one, or
- messages may simply go unrecognized.
- Sourcetyping: By sourcetyping we mean ensuring that the received data is properly attributed to the right source device type. This is crucial, as most queries against the SIEM’s database rely on the data type being set properly. The concept of sourcetyping exists in all SIEM products, sometimes under product-specific terms: it is called “sourcetype” in Splunk, while Sentinel, for instance, uses tables to represent different types of logs.
Identifying the proper device type is often done by sending logs of specific appliances or services to a port number dedicated to that appliance/service. This means that if some data is accidentally sent to a different port number, it’s either dropped or misclassified.
The way around that problem is to have a robust classification system in place that:
- actually checks the data it receives, and
- can handle and alert on unexpected data.
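A naive but concrete version of that idea, as a hedged sketch: map dedicated listener ports to expected sourcetypes, but verify that the payload actually looks like what the port is supposed to receive, and flag anything unexpected instead of misclassifying it. The port numbers, sourcetype names, and validation patterns below are made up for illustration.

```python
import re

# Hedged sketch: naive port-based sourcetyping with a content sanity check.
# Ports, sourcetype names, and patterns are assumptions for illustration only.
PORT_MAP = {
    5514: ("pan:traffic", re.compile(r",(TRAFFIC|THREAT|SYSTEM|CONFIG),")),
    5515: ("cisco:asa",   re.compile(r"%ASA-\d-\d+")),
}

def classify(port: int, message: str) -> str:
    sourcetype, pattern = PORT_MAP.get(port, (None, None))
    if sourcetype is None:
        # Unknown port: don't silently drop the data -- tag it so someone notices.
        return "unclassified"
    if not pattern.search(message):
        # Data arrived on a dedicated port but doesn't look like that device:
        # a robust system would alert here instead of misclassifying it.
        return "unexpected"
    return sourcetype
```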
We’ll discuss such a system in our next blog post.
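As promised in the framing bullet above, here is a hedged sketch of the receiving side of octet-counted framing: splitting a TCP byte stream into individual messages. It ignores the newline-delimited (non-transparent) variant and all the error handling a production receiver needs; its only purpose is to show why a sender/receiver mismatch produces exactly the truncated, concatenated, or unrecognized messages listed above.

```python
# Hedged sketch: split a TCP byte stream into octet-counted syslog frames
# of the form "<LEN> <MESSAGE>". If the sender uses framing and the receiver
# does not (or vice versa), you get truncation and concatenation problems.
def split_frames(buffer: bytes):
    """Return (complete_messages, unconsumed_remainder)."""
    messages = []
    while True:
        space = buffer.find(b" ")
        if space == -1 or not buffer[:space].isdigit():
            # Not octet-counted, or the length prefix isn't complete yet;
            # a real receiver would fall back to newline-delimited parsing here.
            break
        length = int(buffer[:space])
        start, end = space + 1, space + 1 + length
        if len(buffer) < end:
            break                  # partial frame -- wait for more bytes
        messages.append(buffer[start:end])
        buffer = buffer[end:]
    return messages, buffer

# Two frames arriving back-to-back in one TCP segment:
msgs, rest = split_frames(b"9 <165>1 hi10 <165>1 bye")
print(msgs)   # [b'<165>1 hi', b'<165>1 bye']
print(rest)   # b''
```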
Summary
As you can see, many things can go wrong in data ingestion, so chances are high that at least some data is ingested incorrectly, causing security operations and forensic investigations to miss those events. All of us have war stories about data sources going dark for one of these reasons.
Most organizations don’t have monitoring and controls in place to notice if something changes (for example, the format of the messages sent by an appliance after an upgrade) or goes wrong (because of a configuration change).
Axoflow helps you avoid these problems by automatically detecting and handling multiple protocols and message formats, and by alerting on problems that it cannot handle automatically.