Classify security data in transit: improve data quality and reduce costs

In the recent post Ways to break data ingestion of your SIEM, I covered how security data (for example, a log message) gets from your appliance to the SIEM, highlighting the problems that can occur along the way. Such problems can prevent the data from being ingested into your SIEM, or leave your SOC teams and alerts blind to the data even if it does get ingested.

This post covers what it takes to build a robust classification system that actually verifies the data it receives instead of relying on using dedicated ports (spoiler: lots of work), its benefits (including automatic data labeling, high-quality data, and volume reduction), and how Axoflow provides such a classification system for you - out of the box, automated, without coding.

Classify the incoming data

The usual way to classify incoming data is to send the logs of a specific appliance or service to a port number dedicated to that appliance or service. Obviously, this method does not check or validate the ingested data against any rules or requirements, and as Murphy’s law states: “Anything that can go wrong will go wrong.” In this context, that means data violating either of the following two assumptions gets dropped or misclassified.

  • The incoming data belongs to the specific type of device that should send data to this port
  • No other devices are sending data to this port

In an enterprise setting, maintaining a strict regime of ports dedicated to device types is difficult, and proper controls to notice changes are usually lacking. What should happen when a configuration or network/IP change causes devices to send their logs to the wrong port? The way around this problem is a robust classification system that verifies the incoming data and handles unexpected data.

Verifying which device or service a certain message belongs to is difficult: even a single data source (like an appliance) can emit different kinds of messages, and you have to be able to recognize each one uniquely. This requires a deep, device- and vendor-specific understanding of the data, and also of the syslog data formats and protocols, because oftentimes the data sources send invalid messages that you have to fix as part of the classification process.

For example, here is a sample log message from three top brand firewalls. Can you find something unique in each that allows you to identify the logs from each device?

FortiGate:

<165> us-east-1-dc1-a-dmz-fw date=2025-03-26 time=18:41:07Z devname=us-east-1-dc1-a-dmz-fw devid=FGT60D4614044725 logid=0100040704 type=event subtype=system level=notice vd=root logdesc="System performance statistics" action="perf-stats" cpu=2 mem=35 totalsession=61 disk=2 bandwidth=158/138 setuprate=2 disklograte=0 fazlograte=0 msg="Performance statistics: average CPU: 2, memory: 35, concurrent sessions: 61, setup-rate: 2"

Palo Alto:

<165>Mar 26 18:41:06 us-east-1-dc1-b-edge-fw 1,2025/03/26 18:41:06,007200001056,TRAFFIC,end,1,2025/03/26 18:41:06,192.168.41.30,192.168.41.255,10.193.16.193,192.168.41.255,allow-all,,,netbios-ns,vsys1,Trust,Untrust,ethernet1/1,ethernet1/2,To-Panorama,2025/03/26 18:41:06,8720,1,137,137,11637,137,0x400000,udp,allow,276,276,0,3,2025/03/26 18:41:06,2,any,0,2345136,0x0,192.168.0.0-192.168.255.255,192.168.0.0-192.168.255.255,0,3,0

SonicWall:

<165> id=us-west-1-dc1-a-dmz-fw sn=C0EFE3336C80 time="2025-03-26 18:41:01" fw=192.168.1.239 pri=6 c=1024 gcat=6 m=537 msg="Connection Closed" srcMac=00:50:56:f5:50:27 src=10.237.228.74:54406:X20 srcZone=Trusted natSrc=192.168.1.239:38377 dstMac=00:1a:f0:8b:e0:18 dst=44.190.129.212:123:X2 dstZone=Untrusted natDst=44.190.129.212:123 proto=udp/ntp sent=152 rcvd=152 spkt=2 rpkt=2 cdur=30250 rule="22 (LAN->WAN)" n=490872197 fw_action="NA" dpi=0

NOTE: A well-formed syslog message consists of a header and the message body, like this:

<priority>timestamp hostname application: message body with info
<-----------------header----------------><----message body----->
  • FortiGate: Begins with an incomplete syslog header (only <priority> hostname), followed by space-separated key=value pairs. The hostname is repeated in the devname field, and there is a unique devid field.
  • Palo Alto: Our Palo Alto firewall log gets most of the header right (<priority>timestamp hostname), <165>Mar 26 18:41:06 us-east-1-dc1-b-edge-fw, omits the name of the application, then puts a long list of comma-separated values into the message body. The body begins with a version number (1), followed by a timestamp and a serial number.
  • SonicWall: These logs begin with a <priority> field, followed by 1 or 2 spaces, and a long list of space-separated key=value pairs. The first such field is the id field, which contains the hostname.
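To make the idea concrete, here is a minimal sketch of vendor classification based on the distinguishing markers described above. This is purely illustrative: real log streams need many more patterns per vendor, and the heuristics below are assumptions derived from the three sample messages, not a complete ruleset.

```python
import re

def classify_firewall_log(msg: str) -> str:
    """Rough vendor classification using the markers discussed above.
    Illustrative only; production classification needs many more
    patterns per vendor and per message type."""
    # Strip the syslog <priority> prefix, if present.
    body = re.sub(r"^<\d+>\s*", "", msg)

    # FortiGate: key=value pairs with vendor-specific devid=/devname= keys.
    if "devid=" in body and "devname=" in body:
        return "fortigate"

    # SonicWall: starts with id=<hostname> and carries an sn= serial field.
    if body.startswith("id=") and " sn=" in body:
        return "sonicwall"

    # Palo Alto: syslog timestamp + hostname, then a CSV body that
    # begins with a version number.
    if re.match(r"^\w{3} +\d+ [\d:]+ \S+ \d+,", body):
        return "paloalto"

    return "unknown"
```

Note that even this toy version has to deal with the broken headers first (the priority prefix), which is exactly the kind of format knowledge a real classification database encodes per device.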

Classification needs to be both reliable and performant. A naive implementation using regexps is neither. Still, that’s the solution you will find at the core of today’s ingestion pipelines. We at Axoflow understand that creating and maintaining such a classification database is difficult: many organizations have failed or abandoned in-house projects that shared this very goal. This is why we decided to make classification a core functionality of the Axoflow Platform, so you never need to write another parsing regexp. At the moment, Axoflow supports over 90 data sources of well-known vendors.

Reduce the data volume

Classification and the ability to process your security data in the pipeline also allows you to:

  • Fix the incoming data (like the malformed firewall messages shown above) to add missing information, like the hostname or timestamp.
  • Use classification to identify the source.
  • Parse the log to access the information it contains.
  • Redact sensitive information, like PII, before it is sent to a SIEM or storage.
  • Enrich the data with contextual information, like adding labels based on the source or content of the data.
  • Use all of the above to route the data to the appropriate destinations, and finally
  • Transform the data into an optimized format that the destination can reliably and effortlessly consume. This includes mapping your data to multiple different schemas if you use multiple analytics tools.
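For a key=value style message, the parse, redact, and enrich steps above can be sketched as follows. The field names, label values, and destination names are illustrative assumptions for this example, not Axoflow APIs.

```python
import re

def process(msg: str) -> dict:
    """Sketch of the processing steps above for a key=value style
    message. Field names, labels, and destinations are illustrative."""
    # Parse: extract key=value pairs (quoted values may contain spaces).
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', msg))
    # Redact: mask PII-like fields before they reach the SIEM.
    for pii in ("srcMac", "dstMac"):
        if pii in fields:
            fields[pii] = "REDACTED"
    # Enrich: add contextual labels based on the source and content.
    fields["_labels"] = {"vendor": "sonicwall", "zone": fields.get("srcZone")}
    # Route: pick destinations based on the enriched record.
    fields["_destinations"] = ["siem", "object-storage"]
    return fields
```

Running this on the SonicWall sample from earlier would mask the MAC addresses and attach a vendor label before the record is routed onward.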

It also allows you to remove data that’s not needed, for example, by:

  • dropping entire messages if they are redundant or not relevant from a security perspective, or
  • removing parts of individual messages, like fields that are technically non-empty but convey no information, containing only values such as "N/A" or "0".
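For key=value formats, removing such no-information fields can be as simple as the following sketch. The set of "no information" values here is an assumption for illustration; in practice it is curated per device type and per field.

```python
import re

# Values that carry no information; an illustrative assumption,
# curated per device type and per field in a real pipeline.
NO_INFO = {"N/A", '"N/A"', "0", "0x0"}

def shrink_kv_message(msg: str) -> str:
    """Drop key=value pairs whose value carries no information,
    keeping quoted values (which may contain spaces) intact."""
    def keep(match):
        return "" if match.group(2) in NO_INFO else match.group(0)
    return re.sub(r'(\w+)=("[^"]*"|\S+) ?', keep, msg).strip()
```

For example, `a=1 b=N/A c=0 d="ok"` shrinks to `a=1 d="ok"`.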

As this data reduction happens in the pipeline, before the data arrives in the SIEM or storage, it can save you significant costs and also improve the quality of the data your detection engineers work with.

Palo Alto log reduction example

To give you an example of how data reduction works in practice, let’s get back to the Palo Alto firewall log message from the previous section and see which parts you don’t need:

<165>Mar 26 18:41:06 us-east-1-dc1-b-edge-fw 1,2025/03/26 18:41:06,007200001056,TRAFFIC,end,1,2025/03/26 18:41:06,192.168.41.30,192.168.41.255,10.193.16.193,192.168.41.255,allow-all,,,netbios-ns,vsys1,Trust,Untrust,ethernet1/1,ethernet1/2,To-Panorama,2025/03/26 18:41:06,8720,1,137,137,11637,137,0x400000,udp,allow,276,276,0,3,2025/03/26 18:41:06,2,any,0,2345136,0x0,192.168.0.0-192.168.255.255,192.168.0.0-192.168.255.255,0,3,0

Let’s see what we can drop from this particular message:

  • Redundant timestamps: Palo Alto log messages contain up to five practically identical timestamps, for example:
    • the syslog timestamp in the header (Mar 26 18:41:06), 
    • the time Panorama (the management plane of Palo Alto firewalls) collected the message (2025/03/26 18:41:06), and 
    • the time when the event was generated (2025/03/26 18:41:06).
      (See the Receive time, Generated time, and High resolution timestamp fields in the Traffic Log Fields documentation.)

The sample log message has five timestamps. Keeping only one of them can reduce the message size by up to 15%.

  • The priority field (<165>) is identical in every message and has no information value. While it takes up only about 1% of the message size, on high-traffic firewalls even this small change adds up to significant data savings.
  • Several fields contain default or empty values that provide no information, for example, default internal IP ranges like 192.168.0.0-192.168.255.255. Removing such fields yields an over 10% size reduction.

Note that when removing fields, we can delete only the value of the field, because the message format (CSV) relies on a fixed order of columns for each message type. This also means that we have to individually check what can be removed from each of the 17 Palo Alto log types.
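A minimal sketch of this position-preserving removal might look like the following, assuming the droppable column indices have already been determined for the given log type. A real implementation would also need a proper CSV parser to handle quoted fields.

```python
def blank_csv_fields(body: str, droppable: set) -> str:
    """Blank the values of droppable columns in a CSV message body
    while keeping the column positions intact, since consumers rely
    on a fixed field order per log type. The droppable indices are
    determined separately for each log type. Naive comma splitting
    is used here for brevity; quoted fields need a real CSV parser."""
    cols = body.split(",")
    return ",".join("" if i in droppable else v for i, v in enumerate(cols))
```

For example, blanking column 1 of `1,2025/03/26 18:41:06,0072,TRAFFIC,end` yields `1,,0072,TRAFFIC,end`: the value is gone, but every other field keeps its position.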

Palo Alto firewalls send this specific message when a connection is closed. They also send a message when a new connection is started, but that doesn’t contain any information that’s not available in the ending message, so it’s completely redundant and can be dropped. As every connection has a beginning and an end, this alone almost halves the size of the data stored per connection. For example:

Connection start message:

<113>Apr 11 10:58:18 us-east-1-dc1-b-edge-fw 1,10:58:18.421048,007200001056,TRAFFIC,start,1210,10:58:18.421048,192.168.41.30,192.168.41.255,10.193.16.193,192.168.41.255,allow-all,,,ssl,vsys1,trust-users,untrust,ethernet1/2.30,ethernet1/1,To-Panorama,2020/10/09 17:43:54,36459,1,39681,443,32326,443,0x400053,tcp,allow,43135,24629,18506,189,2020/10/09 16:53:27,3012,laptops,0,1353226782,0x8000000000000000,10.0.0.0-10.255.255.255,United States,0,90,99,tcp-fin,16,0,0,0,,testhost,from-policy,,,0,,0,,N/A,0,0,0,0,ace432fe-a9f2-5a1e-327a-91fdce0077da,0

Connection end message:

<113>Apr 11 10:58:18 us-east-1-dc1-b-edge-fw 1,10:58:18.421048,007200001056,TRAFFIC,end,1210,10:58:18.421048,192.168.41.30,192.168.41.255,10.193.16.193,192.168.41.255,allow-all,,,ssl,vsys1,trust-users,untrust,ethernet1/2.30,ethernet1/1,To-Panorama,2020/10/09 17:43:54,36459,1,39681,443,32326,443,0x400053,tcp,allow,43135,24629,18506,189,2020/10/09 16:53:27,3012,laptops,0,1353226782,0x8000000000000000,10.0.0.0-10.255.255.255,United States,0,90,99,tcp-fin,16,0,0,0,,testhost,from-policy,,,0,,0,,N/A,0,0,0,0,ace432fe-a9f2-5a1e-327a-91fdce0077da,0
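Assuming the CSV body follows the PAN-OS traffic log layout shown above (log type in the fourth column, subtype in the fifth, counting from the leading version field), dropping the redundant start records can be sketched as:

```python
def is_droppable_start(body: str) -> bool:
    """Return True for Palo Alto TRAFFIC 'start' records, which repeat
    the information of the matching 'end' record. Column positions are
    an assumption based on the PAN-OS traffic log layout: index 3 is
    the log type and index 4 the subtype (0-based)."""
    cols = body.split(",")
    return len(cols) > 4 and cols[3] == "TRAFFIC" and cols[4] == "start"
```

A pipeline would apply such a filter after classification, so that only messages already identified as Palo Alto traffic logs are checked against it.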

Conclusion

Classifying security data is essential for extracting relevant information from the collected data, improving the quality of security data, and also for bringing the volume of the data to manageable levels. Building and maintaining a robust classification system for security data is a complex and difficult task that requires a lot of work and know-how. Axoflow provides a robust security data management pipeline that handles classification automatically for over 90 well-known data sources.
